CutSet.trim_to_supervisions changes the start of all Supervisions to 0

I am working with the CSJ corpus, in which each long recording has multiple supervision segments. I used the CutSet.trim_to_supervisions function so that each cut has exactly one supervision and one recording.

So far, I have managed to compute fbanks for this cut set, but I noticed that the start of every supervision has been changed to 0, while the start of the cut object itself is inherited correctly from the supervision. This caused a problem when I ran validate_manifest.py on my cut_set.

Am I doing something wrong? I tried to find the implementation of trim_to_supervisions, but somehow in cut.py the _trim_to_supervisions_single and CutSet.trim_to_supervisions seem to point to each other for implementation.

code:

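from lhotse import CutSet

# `manifest` is a dict holding the recording and supervision manifests, loaded elsewhere
cut_set = CutSet.from_manifests(
    recordings=manifest["recordings"],
    supervisions=manifest["supervisions"],
)

# Create one cut per supervision; overlapping supervisions are not kept in each cut
cut_set = cut_set.trim_to_supervisions(keep_overlapping=False)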

cut_set before trimming:

  {
    "id": "S00M0079-0-0",
    "start": 0,
    "duration": 592.936375,
    "channel": 0,
    "supervisions": [
      {
        "id": "S00M0079_0001",
        "recording_id": "S00M0079",
        "start": 0.621,
        "duration": 5.763,
        "channel": 0,
        "text": "テーマ は です ね あまり 考える 時間 が なく 私 の 私 生活 と いう こと で",
        "language": "Japanese",
        "speaker": "S00M0079",
        "custom": {
      },
      {
        "id": "S00M0079_0002",
        "recording_id": "S00M0079",
        "start": 7.597,
        "duration": 2.665,
        "channel": 0,
        "text": "普段 何 を し てる か と いう の を 話そ う と 思い ます",
        "language": "Japanese",
        "speaker": "S00M0079",
      },
      ...
  }

cut_set after trimming:

  {
    "id": "c9c8596d-032b-4618-8e7e-67785a561d4a",
    "start": 0.621,
    "duration": 5.763,
    "channel": 0,
    "supervisions": [
      {
        "id": "S00M0079_0001",
        "recording_id": "S00M0079",
        "start": 0.0,
        "duration": 5.763,
        "channel": 0,
        "text": "テーマ は です ね あまり 考える 時間 が なく 私 の 私 生活 と いう こと で",
        "language": "Japanese",
        "speaker": "S00M0079",
      }
    ],
    "recording": {
      "id": "S00M0079",
      "sources": [
        {
          "type": "file",
          "channels": [
            0
          ],
          "source": "/mnt/minami_data_server/xx/corpus/csj/WAV/noncore/S00M0079.wav"
        }
      ],
      "sampling_rate": 16000,
      "num_samples": 9486982,
      "duration": 592.936375
    },
    "type": "MonoCut"
  },
  {
    "id": "498e385b-15c0-4138-9eda-44d2db26f6d7",
    "start": 7.597,
    "duration": 2.665,
    "channel": 0,
    "supervisions": [
      {
        "id": "S00M0079_0002",
        "recording_id": "S00M0079",
        "start": 0.0,
        "duration": 2.665,
        "channel": 0,
        "text": "普段 何 を し てる か と いう の を 話そ う と 思い ます",
        "language": "Japanese",
        "speaker": "S00M0079",
      }
    ],
    "recording": {
      "id": "S00M0079",
      "sources": [
        {
          "type": "file",
          "channels": [
            0
          ],
          "source": "/mnt/minami_data_server/xx/corpus/csj/WAV/noncore/S00M0079.wav"
        }
      ],
      "sampling_rate": 16000,
      "num_samples": 9486982,
      "duration": 592.936375
    },
    "type": "MonoCut"
  },
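The two dumps above are consistent if supervision times inside a cut are read as offsets from the cut's own start, which is the convention this sketch assumes. It uses plain dicts copied from the trimmed output rather than Lhotse objects, and `absolute_supervision_starts` is a hypothetical helper, not part of the Lhotse API. Under that assumption, adding the cut's start to the supervision's start recovers the original position in the recording.

```python
# Plain dicts mirroring the trimmed cuts shown above (only the fields needed here).
trimmed_cuts = [
    {
        "start": 0.621,
        "duration": 5.763,
        "supervisions": [{"id": "S00M0079_0001", "start": 0.0, "duration": 5.763}],
    },
    {
        "start": 7.597,
        "duration": 2.665,
        "supervisions": [{"id": "S00M0079_0002", "start": 0.0, "duration": 2.665}],
    },
]


def absolute_supervision_starts(cut):
    """Hypothetical helper: map supervision id -> start time within the recording.

    Assumes supervision start times are stored relative to the cut start.
    """
    return {sup["id"]: cut["start"] + sup["start"] for sup in cut["supervisions"]}


absolute = {}
for cut in trimmed_cuts:
    absolute.update(absolute_supervision_starts(cut))

print(absolute)
```

The recovered values match the supervision starts in the cut set before trimming (0.621 and 7.597), so no timing information is lost by the trim.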

pzelasko commented, Sep 20, 2022

For ASR it might be good to keep the check for ends, and also you don’t want to have negative supervision start time – these things are allowed in Lhotse because they indicate that a cut is smaller than a supervision. It makes sense to have data like this for some non-ASR tasks but not for ASR.

For feature extraction, the supervision information is ignored. You will need it to be sane at a later stage in a torch Dataset object that converts metadata into supervision tensors.
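The kind of sanity check described above can be sketched in a few lines. This is only an illustration on plain dicts, assuming supervision times are relative to the cut start; `supervision_problems` and its `tolerance` parameter are hypothetical names and do not reproduce Lhotse's own validation code.

```python
def supervision_problems(cut, tolerance=1e-3):
    """Hypothetical check: list supervisions that extend outside their cut.

    Assumes supervision start times are relative to the cut start, so a
    negative start or an end past the cut duration means the cut is smaller
    than the supervision.
    """
    problems = []
    for sup in cut["supervisions"]:
        end = sup["start"] + sup["duration"]
        if sup["start"] < -tolerance:
            problems.append((sup["id"], "starts before the cut"))
        if end > cut["duration"] + tolerance:
            problems.append((sup["id"], "ends after the cut"))
    return problems


# A cut like the trimmed ones in the question: the supervision fits exactly.
good_cut = {
    "duration": 5.763,
    "supervisions": [{"id": "S00M0079_0001", "start": 0.0, "duration": 5.763}],
}

# A made-up cut that is shorter than its supervision on both sides.
bad_cut = {
    "duration": 2.0,
    "supervisions": [{"id": "example", "start": -1.0, "duration": 5.0}],
}

good_report = supervision_problems(good_cut)
bad_report = supervision_problems(bad_cut)
print(good_report)
print(bad_report)
```

For ASR data, a non-empty report would be a reason to inspect the cut before training; for tasks where partial supervisions are acceptable, the same result may be harmless.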

teowenshen commented, Sep 22, 2022

Great! Thanks for your confirmations. I will close this issue now.