Real-time onsets/chroma with pyaudio and librosa
I haven’t really found anything regarding this, besides the PCEN Streaming example, which uses librosa.stream.
I’d like to extract the onsets and chroma information from an audio stream, but I’m getting confused about whether I need to implement an overlap strategy on my own or whether this is already handled. I’m also not sure what frame hop size to choose, since each call only sees the current chunk of audio.
From what I’ve gathered so far:
import pyaudio
import librosa
import time
import numpy as np
CHUNK = 2048
FORMAT = pyaudio.paInt16
CHANNELS = 1
SHORT_NORMALIZE = (1.0 / 32768.0)
DVC_IDX = 2
N_FFT = 1024
HOP_LENGTH = N_FFT // 2
p = pyaudio.PyAudio()
d_info = p.get_device_info_by_index(DVC_IDX)
SAMPLE_RATE = int(d_info['defaultSampleRate'])
def callback(input_data, frame_count, time_info, flags):
    buffer = np.frombuffer(input_data, dtype=np.int16)
    buffer = buffer * SHORT_NORMALIZE
    onsets = librosa.onset.onset_strength(y=buffer, sr=SAMPLE_RATE, lag=1, center=False)
    chroma = librosa.feature.chroma_stft(y=buffer, sr=SAMPLE_RATE, center=False, n_fft=N_FFT, hop_length=HOP_LENGTH)
    return input_data, pyaudio.paContinue
stream = p.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=SAMPLE_RATE,
    input=True,
    frames_per_buffer=CHUNK,
    input_device_index=DVC_IDX,
    stream_callback=callback,
)
stream.start_stream()
# keep main thread alive
while stream.is_active():
    time.sleep(0.1)
stream.stop_stream()
stream.close()
This returns onsets of shape (5,) and chroma of shape (12, 3) for every chunk of audio.
When looking at the onsets:
[0. 1.24526488 0.99135636 0.70424905 0.75233263]
[0. 2.03855642 3.30533546 1.6733954 0.57786365]
the first value is always zero, probably because there is no previous frame to compare against. How would I solve this?
Also, are three chroma frames enough to estimate the pitch correctly?
I’d appreciate any input on this.
Issue Analytics
- Created 2 years ago
- Comments: 7 (4 by maintainers)
Top GitHub Comments
This is probably a better question for the discussion forum, but I’ll try to answer briefly here.
Since you’re not using librosa.stream (which only works on soundfile objects and not pyaudio, at least for now), you’ll have to manage buffer overlap yourself. Really, all that stream does on top of the soundfile blocks interface is provide another layer of buffering up from frames so that you can coherently handle frame overlap across blocks. This is explained in our blog post on the topic: https://librosa.org/blog/2019/07/29/stream-processing/#stream-processing
If you’re doing onset detection, you’ll need at least two frames’ worth of data to work with. You might also want a rolling buffer, so that the last buffer can be used to detect onsets in the current buffer. Hop length is up to you; it won’t matter for chroma, but it definitely will for onsets. Note: your example forgot to include hop_length in the call to onset detection, which is why you have 5 frames there instead of 3.
That’s a tough call. I’d guess probably not: chroma-stft is not terribly accurate to begin with, and it’s particularly noisy when there are transients/discontinuities involved. It’s impossible to say for sure without knowing the sampling rate here (which isn’t included in your example), but think about the time extent that 3 frames of audio covers for your configuration, and how that might relate to expected note duration. (Also, number of periods for the frequencies involved!)
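For a rough sense of scale, assuming a 44.1 kHz device rate (again, not stated in the issue), the three chroma frames cover well under a tenth of a second:

```python
# Time span of 3 STFT frames: n_fft for the first frame, one hop for each
# additional frame. All parameter values assumed from the question's setup.
SR = 44100
N_FFT = 1024
HOP = 512
n_frames = 3

span = (N_FFT + (n_frames - 1) * HOP) / SR  # seconds
print(f"{span * 1000:.1f} ms")  # about 46 ms
```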
You might also want to do some kind of moving average to smooth the chroma over time, eg:
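One way to do it, sketched here with arbitrary 0.9/0.1 weights (a placeholder, not the maintainers’ exact choice):

```python
import numpy as np

# Running estimate, one bin per pitch class.
chroma_smooth = np.zeros(12)

def smooth(chroma_frame, alpha=0.9):
    """Exponential moving average over successive chroma frames."""
    global chroma_smooth
    chroma_smooth = alpha * chroma_smooth + (1 - alpha) * chroma_frame
    return chroma_smooth
```

Apply it to each incoming chroma frame (or to the per-chunk mean over frames).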
(or pick your favorite balance between current and previous). This will induce a bit of latency, but ought to stabilize things over time without much overhead.
Sorry that I haven’t gotten back to you, didn’t have too much time to look into this again. But your detailed answers certainly helped me to understand how librosa can be used for real-time application and what the pitfalls are that need to be taken into account! 😄