CutSet.trim_to_supervisions changes the start of all Supervisions to 0
I am trying my luck on the CSJ corpus. It has a structure of multiple supervision segments corresponding to one long recording. I used the CutSet.trim_to_supervisions function so that each cut has exactly one supervision and one recording.
So far, I have managed to compute fbanks for this cut set, but I noticed that the start of every supervision has been changed to 0, while the start of the cut object itself is inherited correctly from the supervision. This caused a problem when I ran validate_manifest.py on my cut_set.
Am I doing something wrong? I tried to find the implementation of trim_to_supervisions, but somehow in cut.py the _trim_to_supervisions_single and CutSet.trim_to_supervisions seem to point to each other for implementation.
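Code:
from lhotse import CutSet

# `manifest` is a dict holding the "recordings" and "supervisions" manifests prepared earlier.
cut_set = CutSet.from_manifests(
    recordings=manifest["recordings"],
    supervisions=manifest["supervisions"],
)
cut_set = cut_set.trim_to_supervisions(keep_overlapping=False)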
cut_set before trimming:
{
"id": "S00M0079-0-0",
"start": 0,
"duration": 592.936375,
"channel": 0,
"supervisions": [
{
"id": "S00M0079_0001",
"recording_id": "S00M0079",
"start": 0.621,
"duration": 5.763,
"channel": 0,
"text": "テーマ は です ね あまり 考える 時間 が なく 私 の 私 生活 と いう こと で",
"language": "Japanese",
"speaker": "S00M0079",
"custom": {
},
{
"id": "S00M0079_0002",
"recording_id": "S00M0079",
"start": 7.597,
"duration": 2.665,
"channel": 0,
"text": "普段 何 を し てる か と いう の を 話そ う と 思い ます",
"language": "Japanese",
"speaker": "S00M0079",
},
...
}
cut_set after trimming:
{
"id": "c9c8596d-032b-4618-8e7e-67785a561d4a",
"start": 0.621,
"duration": 5.763,
"channel": 0,
"supervisions": [
{
"id": "S00M0079_0001",
"recording_id": "S00M0079",
"start": 0.0,
"duration": 5.763,
"channel": 0,
"text": "テーマ は です ね あまり 考える 時間 が なく 私 の 私 生活 と いう こと で",
"language": "Japanese",
"speaker": "S00M0079",
}
],
"recording": {
"id": "S00M0079",
"sources": [
{
"type": "file",
"channels": [
0
],
"source": "/mnt/minami_data_server/xx/corpus/csj/WAV/noncore/S00M0079.wav"
}
],
"sampling_rate": 16000,
"num_samples": 9486982,
"duration": 592.936375
},
"type": "MonoCut"
},
{
"id": "498e385b-15c0-4138-9eda-44d2db26f6d7",
"start": 7.597,
"duration": 2.665,
"channel": 0,
"supervisions": [
{
"id": "S00M0079_0002",
"recording_id": "S00M0079",
"start": 0.0,
"duration": 2.665,
"channel": 0,
"text": "普段 何 を し てる か と いう の を 話そ う と 思い ます",
"language": "Japanese",
"speaker": "S00M0079",
}
],
"recording": {
"id": "S00M0079",
"sources": [
{
"type": "file",
"channels": [
0
],
"source": "/mnt/minami_data_server/xx/corpus/csj/WAV/noncore/S00M0079.wav"
}
],
"sampling_rate": 16000,
"num_samples": 9486982,
"duration": 592.936375
},
"type": "MonoCut"
},
Top GitHub Comments
For ASR, it is probably worth keeping the check on supervision end times, and you also don't want negative supervision start times. Lhotse allows both because they indicate that a cut covers only part of a supervision. It makes sense to have data like this for some non-ASR tasks, but not for ASR.
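To make the convention concrete, here is a minimal sketch in plain Python. Inside a cut, supervision times are expressed relative to the cut's start, which is why the trimmed supervisions in the question start at 0.0 while the cut itself carries the 0.621 offset. The `Segment` class and the two helper functions are hypothetical stand-ins written for illustration, not Lhotse's API, and the tolerance value is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for a cut or a supervision: start and duration in seconds."""
    start: float
    duration: float

    @property
    def end(self) -> float:
        return self.start + self.duration


def absolute_start(cut: Segment, supervision: Segment) -> float:
    """Recover the supervision's position in the recording from its cut-relative start."""
    return cut.start + supervision.start


def fits_inside_cut(cut: Segment, supervision: Segment, tol: float = 1e-3) -> bool:
    """True if the supervision neither starts before the cut nor ends after it.

    Times on `supervision` are relative to the cut, so the valid range is
    [0, cut.duration]. `tol` (an assumed value) absorbs rounding noise.
    """
    return supervision.start >= -tol and supervision.end <= cut.duration + tol


# Values taken from the first trimmed cut in the question.
cut = Segment(start=0.621, duration=5.763)
sup = Segment(start=0.0, duration=5.763)

# Made-up examples of a supervision that only partly overlaps the cut.
starts_before_cut = Segment(start=-1.0, duration=5.763)
ends_after_cut = Segment(start=2.0, duration=5.763)
```

With these values, `absolute_start(cut, sup)` gives back the original 0.621 from the supervision manifest, and only `sup` passes `fits_inside_cut`; the two made-up segments are the kind of data an ASR pipeline would want to reject.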
For feature extraction, the supervision information is ignored. You will need it to be sane at a later stage in a torch Dataset object that converts metadata into supervision tensors.
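As a rough illustration of why the supervision times matter at that later stage, the sketch below converts a cut-relative supervision into frame indices, the kind of value a Dataset would place in a supervision tensor. The function name `seconds_to_frames` is hypothetical, and the 10 ms frame shift is an assumption, not something stated in this thread; the real conversion in a given pipeline may differ.

```python
from typing import Tuple


def seconds_to_frames(
    start: float, duration: float, frame_shift: float = 0.01
) -> Tuple[int, int]:
    """Map a cut-relative (start, duration) in seconds to (start_frame, num_frames).

    `frame_shift` is the assumed hop between feature frames, in seconds.
    """
    start_frame = round(start / frame_shift)
    num_frames = round(duration / frame_shift)
    return start_frame, num_frames


# First trimmed supervision from the question: starts at the cut's first frame.
first = seconds_to_frames(0.0, 5.763)

# A made-up supervision with a negative start yields a negative frame index,
# which cannot be used to slice a feature matrix in the intended way.
negative = seconds_to_frames(-1.0, 2.0)
```

A negative `start_frame` like the one in `negative` is exactly what the sanity checks on supervision boundaries are meant to catch before training.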
Great! Thanks for your confirmations. I will close this issue now.