CutSet.trim_to_supervisions changes the start of all Supervisions to 0


I am trying my luck on the CSJ corpus. Its structure is one long recording with multiple supervision segments. I used the CutSet.trim_to_supervisions function so that each cut has exactly one supervision and one recording.

So far, I have managed to compute fbanks for this cutset, but I noticed that the start of every supervision has been changed to 0, while the start of the cut object itself is inherited correctly from the supervision. This became a problem when I ran validate_manifest.py on my cut_set.

Am I doing something wrong? I tried to find the implementation of trim_to_supervisions, but in cut.py, _trim_to_supervisions_single and CutSet.trim_to_supervisions seem to defer to each other for the implementation.

code:

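from lhotse import CutSet

# Build cuts that span whole recordings; each carries all of its supervisions.
cut_set = CutSet.from_manifests(
    recordings=manifest["recordings"],
    supervisions=manifest["supervisions"]
)

# Split into cuts that each hold a single supervision segment.
cut_set = cut_set.trim_to_supervisions(keep_overlapping=False)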

cut_set before trimming:

  {
    "id": "S00M0079-0-0",
    "start": 0,
    "duration": 592.936375,
    "channel": 0,
    "supervisions": [
      {
        "id": "S00M0079_0001",
        "recording_id": "S00M0079",
        "start": 0.621,
        "duration": 5.763,
        "channel": 0,
        "text": "テーマ は です ね あまり 考える 時間 が なく 私 の 私 生活 と いう こと で",
        "language": "Japanese",
        "speaker": "S00M0079",
        "custom": {}
      },
      {
        "id": "S00M0079_0002",
        "recording_id": "S00M0079",
        "start": 7.597,
        "duration": 2.665,
        "channel": 0,
        "text": "普段 何 を し てる か と いう の を 話そ う と 思い ます",
        "language": "Japanese",
        "speaker": "S00M0079",
      },
      ...
  }

cut_set after trimming:

  {
    "id": "c9c8596d-032b-4618-8e7e-67785a561d4a",
    "start": 0.621,
    "duration": 5.763,
    "channel": 0,
    "supervisions": [
      {
        "id": "S00M0079_0001",
        "recording_id": "S00M0079",
        "start": 0.0,
        "duration": 5.763,
        "channel": 0,
        "text": "テーマ は です ね あまり 考える 時間 が なく 私 の 私 生活 と いう こと で",
        "language": "Japanese",
        "speaker": "S00M0079",
      }
    ],
    "recording": {
      "id": "S00M0079",
      "sources": [
        {
          "type": "file",
          "channels": [
            0
          ],
          "source": "/mnt/minami_data_server/xx/corpus/csj/WAV/noncore/S00M0079.wav"
        }
      ],
      "sampling_rate": 16000,
      "num_samples": 9486982,
      "duration": 592.936375
    },
    "type": "MonoCut"
  },
  {
    "id": "498e385b-15c0-4138-9eda-44d2db26f6d7",
    "start": 7.597,
    "duration": 2.665,
    "channel": 0,
    "supervisions": [
      {
        "id": "S00M0079_0002",
        "recording_id": "S00M0079",
        "start": 0.0,
        "duration": 2.665,
        "channel": 0,
        "text": "普段 何 を し てる か と いう の を 話そ う と 思い ます",
        "language": "Japanese",
        "speaker": "S00M0079",
      }
    ],
    "recording": {
      "id": "S00M0079",
      "sources": [
        {
          "type": "file",
          "channels": [
            0
          ],
          "source": "/mnt/minami_data_server/xx/corpus/csj/WAV/noncore/S00M0079.wav"
        }
      ],
      "sampling_rate": 16000,
      "num_samples": 9486982,
      "duration": 592.936375
    },
    "type": "MonoCut"
  },
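
The two dumps above are consistent if a supervision's start inside a cut is read as an offset from the cut's own start rather than from the beginning of the recording. Below is a minimal sketch of that arithmetic in plain Python. `Segment` and `absolute_start` are hypothetical names for illustration, not part of Lhotse's API, and the relative-offset interpretation is an assumption based on the values shown.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for a cut or a supervision: start and duration in seconds."""
    start: float
    duration: float


def absolute_start(cut: Segment, supervision: Segment) -> float:
    # Assumption: supervision.start is measured from the cut's start,
    # so the position in the recording is the sum of the two.
    return round(cut.start + supervision.start, 6)


# Before trimming: the cut starts at 0, the supervision at 0.621.
before = absolute_start(Segment(0.0, 592.936375), Segment(0.621, 5.763))

# After trimming: the cut starts at 0.621, the supervision at 0.0.
after = absolute_start(Segment(0.621, 5.763), Segment(0.0, 5.763))

print(before, after)  # 0.621 0.621
```

Under that reading, the supervision still points at the same place in the recording before and after trimming.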


Top GitHub Comments

pzelasko commented, Sep 20, 2022

For ASR it might be good to keep the check for ends, and you also don’t want a negative supervision start time. Lhotse allows both because they indicate that a cut is smaller than a supervision. Data like this makes sense for some non-ASR tasks, but not for ASR.

For feature extraction, the supervision information is ignored. You will need it to be sane at a later stage in a torch Dataset object that converts metadata into supervision tensors.
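
The check described above (no negative supervision start, and no supervision end past the end of the cut) could be sketched as follows. This is an illustrative plain-Python helper, not Lhotse's validation code. The function name and the `tolerance` parameter are assumptions, and the supervision times are taken to be relative to the cut.

```python
def supervision_fits_in_cut(
    sup_start: float,
    sup_duration: float,
    cut_duration: float,
    tolerance: float = 1e-3,  # hypothetical slack for floating-point rounding
) -> bool:
    """Return True if the supervision lies within [0, cut_duration] of the cut."""
    # A negative start means the supervision begins before the cut does.
    starts_inside = sup_start >= -tolerance
    # An end beyond cut_duration means the supervision outlasts the cut.
    ends_inside = sup_start + sup_duration <= cut_duration + tolerance
    return starts_inside and ends_inside


# The first trimmed cut from the issue: supervision at 0.0 for 5.763 s in a 5.763 s cut.
print(supervision_fits_in_cut(0.0, 5.763, 5.763))   # True
# A supervision that starts half a second before the cut.
print(supervision_fits_in_cut(-0.5, 5.763, 5.763))  # False
```

A filter like this would belong in the ASR data preparation step, since feature extraction does not look at the supervisions.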

teowenshen commented, Sep 22, 2022

Great! Thanks for your confirmations. I will close this issue now.
