Try messing with buffer size?
@csukuangfj this is related to the slowness of reading ffmpeg data reported in https://github.com/k2-fsa/icefall/pull/312. I am trying to debug it by running strace on your ffmpeg commands, like this (as you):
```
for p in $(ps -u kuangfangjun | grep ffmpeg | tail -n 1 | awk '{print $1}'); do strace -T -p $p; done >& foo
```
To see which system calls are slow, we can do:
```
awk '{print $NF, $0}' < foo | sort -r | less
# output is:
<0.014715> write(1, "\355\223H<X2^<\303\n\330<.VH=\353Z\231=\203H\305=\216j\341=\366f\347="..., 1280) = 1280 <0.014715>
<0.007298> write(1, "\332\r)=p9,=>\337\t=\200\335\336<L\331\352<&\222\"=`hh=(\360\233="..., 1280) = 1280 <0.007298>
<0.000146> fstat(3, {st_mode=S_IFREG|0664, st_size=9641836, ...}) = 0 <0.000146>
<0.000099> write(2, "ffmpeg version 3.4.8-0ubuntu0.2", 31) = 31 <0.000099>
<0.000086> openat(AT_FDCWD, "/ceph-fj/fangjun/open-source-2/icefall-pruned-multi-datasets/egs/librispeech/ASR/download/GigaSpeech/audio/podcast/P0036/POD0000003582.opus", O_RDONLY) = 3 <0.000086>
<0.000055> write(1, "lW\20\272\202\372\33\272\236\t\4\272\372H\221\271\36 \313\271\200r\374\271\35#\17\272n\30\371\271"..., 1280) = 1280 <0.000055>
<0.000055> getdents(3, /* 72 entries */, 32768) = 2208 <0.000055>
<0.000052> write(2, " Copyright (c) 2000-2020 the FFm"..., 46) = 46 <0.000052>
<0.000047> write(1, "*\3314>\313\4\21>\253W\221=\270\350\205\273\244\372\256\275\334 \31\2760\27I\276^\26m\276"..., 1280) = 1280 <0.000047>
<0.000046> write(1, "0\322i:X\t\2639\246\320\220\271/uz\272\n\244\325\272\226\360\344\272p~\264\272\222G\374\271"..., 1280) = 1280 <0.000046>
<0.000045> write(1, "t\203\354=;\250\320= z\334=\337\354\262=V\304\216=\336\222\233=Ia}=q\354o="..., 1280) = 1280 <0.000045>
<0.000044> write(1, "\362\1F:\252\225\3669NL\227:Zz\3;\300\314\332:\270e\3629{dK\271\276p5\272"..., 1280) = 1280 <0.000044>
<0.000043> write(1, "n\371\23\274\232\v\n<^\30\316<R\242\3\274\36\341\344\274`\304\254<,X\263<\324D\300\274"..., 1280) = 1280 <0.000043>
<0.000043> write(1, "\355\336\323\274\27\333s=\370\302\343\273\316p\240\275\350\204\205=\205\257\305\274\326\277C\273\266G\331\274"..., 1280) = 1280 <0.000043>
<0.000042> write(2, " configuration: --prefix=/usr -"..., 1098) = 1098 <0.000042>
<0.000040> write(1, "\230^<\276\236cV\276\264ys\276\314<\213\276\326I\232\276bM\250\276\7\212\266\276\17\211\302\276"..., 1280) = 1280 <0.000040>
<0.000040> write(1, "\202n\244;\34\370\217\274vF\31\275\246\32\30\275\323\31f\274\234\231\212<\242{\345<\342\270\303<"..., 1280) = 1280 <0.000040>
<0.000039> write(1, "|W\0<\"f6\273\273Tb\274b\2550\274\4\371\364\272\32K\265<\214$\3=\neD="..., 1280) = 1280 <0.000039>
<0.000039> write(1, "z\266\357=O\365\361=\36O\351=2\241\312=+\347\204=.\221\203<\262\3554\275\322\313\266\275"..., 1280) = 1280 <0.000039>
<0.000039> write(1, "\230f\350\275\303\216\17\276\10\312&\276\232\350\26\276R\243\375\275\250\264\5\276\276\2\364\275b\350\10\276"..., 1280) = 1280 <0.000039>
<0.000038> write(2, " libavcodec 57.107.100 / 57"..., 41) = 41 <0.000038>
```
… anyway, it seems to be the case that writing takes longer than reading, i.e. ffmpeg spends longer waiting to output data than to read data. "Slow writes" generally happen once or twice per ffmpeg process, and I expect they correspond to moments when a buffer fills up AND the Python program it is writing to happens to be busy with something it is hard to wake up from. Now, the bufsize arg to subprocess.run (which is one of the generic kwargs, not specifically listed in the docs) defaults to -1, which means io.DEFAULT_BUFFER_SIZE, which seems to be 8192. However, I don't see any obvious periodicity in how long the write syscall takes that would correspond to that buffer size.
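(For reference, that default is easy to check from Python:

```python
import io

# bufsize=-1 in subprocess means "use io.DEFAULT_BUFFER_SIZE";
# this prints 8192 on a typical CPython build.
print(io.DEFAULT_BUFFER_SIZE)
```
)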
This particular ffmpeg call seems to output about 500k bytes. One way to make this a little faster might be to just add bufsize=2000000 to the subprocess.run() call in read_opus_ffmpeg(). That would buffer all the output so that ffmpeg never has to wait on the Python program that is calling it.
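A minimal sketch of what that change could look like. The command line and function body here are assumptions for illustration only; the real read_opus_ffmpeg() in lhotse handles offsets, durations, channels, and error cases that are omitted here:

```python
import subprocess
import numpy as np

def read_opus_ffmpeg(path: str, sampling_rate: int = 16000) -> np.ndarray:
    # Decode an opus file to raw float32 PCM on stdout.
    # (Hypothetical command line; the real function builds it differently.)
    cmd = [
        "ffmpeg", "-i", path,
        "-f", "f32le", "-ar", str(sampling_rate),
        "pipe:1",
    ]
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
        # The proposed fix: a buffer larger than the ~500k bytes this call
        # produces, so ffmpeg's writes never block waiting on Python.
        bufsize=2000000,
    )
    return np.frombuffer(proc.stdout, dtype=np.float32)
```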
Top GitHub Comments
Here is the training log using pre-computed features on a machine with 20 CPUs and 2 dataloader workers.
Note that it takes only about 5 minutes to process 350 batches, which is very close to the training time of the reworked model from Dan.
Great! And such a simple fix!