Creating index with many recursive TARs inside an xz compressed TAR is 100x slower than bz2!

The lzmaffi module provides seeking support in multi-block xz files as created with pixz; see also #42. In small unit-like tests, it really does provide true seeking. However, I noticed that the test for tests/2k-recursive-tars.tar.xz is roughly 100x slower than the one for tests/2k-recursive-tars.tar.bz2! This is the only test where the difference is so glaring, because the archive contains recursive TARs, and after each recursive TAR a backward seek is needed to resume reading the outer TAR. For some reason, lzmaffi.seek seems to have performance problems. Even if it implements true seeking to a block under the hood, there might be some constant overhead. The file also has two problems:

  1. It is highly compressed: 20 MiB are compressed to 16 KiB, for a compression ratio of roughly 1000!
  2. The xz file only has 3 blocks while the bz2 file has 24 blocks. That might slow down “true” seeking to an arbitrary point by a factor of 8 compared to bz2. But that still leaves a factor of 12 of the observed slowdown unexplained! Also, plain decoding was found to be twice as fast as bz2, which means there is effectively even a factor of 24 that can’t be explained.
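The estimate in the second point can be written out as a small calculation. This is only a restatement of the rough figures quoted above (the text rounds 12.5 and 25 down to 12 and 24); the variable names are made up for illustration.

```python
# Rough figures quoted in the text above; the ~100x slowdown is approximate.
bz2_blocks = 24
xz_blocks = 3            # block count of the multi-block xz file
observed_slowdown = 100  # xz vs. bz2 index creation
decode_speedup = 2       # xz decodes roughly twice as fast as bz2

# Fewer blocks mean more data has to be decoded per seek on average.
block_factor = bz2_blocks / xz_blocks

# What remains of the slowdown after accounting for the block count.
unexplained = observed_slowdown / block_factor

# xz decodes faster, so the gap that needs explaining is even larger.
unexplained_with_decode = unexplained * decode_speedup

print(block_factor, unexplained, unexplained_with_decode)
```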

As an alternative to finding the problem in lzmaffi, I could try to reduce seeks for recursive indexing in ratarmount, e.g., by:

  1. Jumping to the next TAR block after analyzing the recursive TAR, effectively resulting in zero backward seeks. tarfile might not have an API that allows this, but I could use StenciledFile again to force it to support this.
  2. First analyzing the outer TAR and only then mounting the recursive TARs in order. This would reduce the number of backward seeks to the maximum recursion level. A nice side effect would be that this solution could avoid recursion in ratarmount itself.
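The second idea could be sketched roughly as follows, using only the standard tarfile module on an in-memory archive. This is not ratarmount's code: `build_nested_tar` and `index_two_pass` are hypothetical helpers, only one nesting level is handled, and each inner TAR is copied into memory for simplicity.

```python
import io
import tarfile


def build_nested_tar(count):
    """Create an in-memory TAR holding `count` small inner TARs (test data only)."""
    outer_buffer = io.BytesIO()
    with tarfile.open(fileobj=outer_buffer, mode="w") as outer:
        for i in range(1, count + 1):
            inner_buffer = io.BytesIO()
            with tarfile.open(fileobj=inner_buffer, mode="w") as inner:
                payload = str(i).encode()
                info = tarfile.TarInfo("foo")
                info.size = len(payload)
                inner.addfile(info, io.BytesIO(payload))
            data = inner_buffer.getvalue()
            info = tarfile.TarInfo("{:05d}.tar".format(i))
            info.size = len(data)
            outer.addfile(info, io.BytesIO(data))
    outer_buffer.seek(0)
    return outer_buffer


def index_two_pass(file_object):
    """Return (path, global data offset, size) for every file inside the inner TARs."""
    # Pass 1: walk the outer TAR once, front to back, and only record
    # where the inner TARs are located.
    with tarfile.open(fileobj=file_object, mode="r:") as outer:
        nested = [(member.offset_data, member.size, member.name)
                  for member in outer
                  if member.isfile() and member.name.endswith(".tar")]

    # Pass 2: visit the inner TARs in ascending offset order, so the
    # underlying file is only ever moved forward after the initial rewind.
    index = []
    for offset, size, name in sorted(nested):
        file_object.seek(offset)
        inner_bytes = io.BytesIO(file_object.read(size))
        with tarfile.open(fileobj=inner_bytes, mode="r:") as inner:
            for member in inner:
                index.append((name + "/" + member.name,
                              offset + member.offset_data,
                              member.size))
    return index


archive = build_nested_tar(3)
for path, offset, size in index_two_pass(archive):
    print(path, offset, size)
```

With this ordering, the underlying file sees a single rewind between the two passes instead of one backward seek per inner TAR.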

Here are some notes and benchmarks I made to try and find the problem:

Seemingly affected tests:

  • tests/gnu-sparse-files.tar
  • tests/2k-recursive-tars.tar.bz2

Reproduce problem:

bzip2 -kd tests/2k-recursive-tars.tar.bz2
xz -fk tests/2k-recursive-tars.tar
pixz -k tests/2k-recursive-tars.{tar,tpxz}

indexed_bzip2/tools/blockfinder tests/2k-recursive-tars.tar.bz2
    Block offsets  :
    4 B 0 b -> magic bytes: 0x314159265359
    590 B 0 b -> magic bytes: 0x314159265359
    1205 B 0 b -> magic bytes: 0x314159265359
    1796 B 0 b -> magic bytes: 0x314159265359
    2360 B 0 b -> magic bytes: 0x314159265359
    2897 B 0 b -> magic bytes: 0x314159265359
    3441 B 0 b -> magic bytes: 0x314159265359
    3997 B 0 b -> magic bytes: 0x314159265359
    4545 B 0 b -> magic bytes: 0x314159265359
    5169 B 0 b -> magic bytes: 0x314159265359
    5757 B 0 b -> magic bytes: 0x314159265359
    6313 B 0 b -> magic bytes: 0x314159265359
    6863 B 0 b -> magic bytes: 0x314159265359
    7441 B 0 b -> magic bytes: 0x314159265359
    8034 B 0 b -> magic bytes: 0x314159265359
    8584 B 0 b -> magic bytes: 0x314159265359
    9127 B 0 b -> magic bytes: 0x314159265359
    9688 B 0 b -> magic bytes: 0x314159265359
    10299 B 0 b -> magic bytes: 0x314159265359
    10834 B 0 b -> magic bytes: 0x314159265359
    11395 B 0 b -> magic bytes: 0x314159265359
    11963 B 0 b -> magic bytes: 0x314159265359
    12624 B 0 b -> magic bytes: 0x314159265359
    13174 B 0 b -> magic bytes: 0x314159265359
    Found 24 blocks

xz -l tests/2k-recursive-tars.*xz
    Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
        1       1     15.8 KiB     20.5 MiB  0.001  CRC64   tests/2k-recursive-tars.tar.xz
        1       3     20.2 KiB     20.6 MiB  0.001  CRC32   tests/2k-recursive-tars.tpxz

./ratarmount.py -cr tests/2k-recursive-tars.tar.bz2 bibi
    Creating offset dictionary for ratarmount/tests/2k-recursive-tars.tar.bz2 ...
    Creating new SQLite index database at ratarmount/tests/2k-recursive-tars.tar.bz2.index.sqlite
    Creating offset dictionary for mimi/00001.tar ...
    Creating offset dictionary for mimi/00001.tar took 0.00s
    [...]
    Creating offset dictionary for mimi/02000.tar ...
    Creating offset dictionary for mimi/02000.tar took 0.00s
    Creating offset dictionary for ratarmount/tests/2k-recursive-tars.tar.bz2 took 0.53s
    Writing out TAR index to ratarmount/tests/2k-recursive-tars.tar.bz2.index.sqlite took 0s and is sized 589824 B

./ratarmount.py -cr tests/2k-recursive-tars.tar.xz mimi
    [Warning] The specified file 'ratarmount/tests/2k-recursive-tars.tar.xz'
    [Warning] is compressed using xz but only contains one xz block. This makes it
    [Warning] impossible to use true seeking! Please (re)compress your TAR using pixz
    [Warning] (see https://github.com/vasi/pixz) in order for ratarmount to do be able
    [Warning] to do fast seeking to requested files.
    [Warning] As it is, each file access will decompress the whole TAR from the beginning!

    Creating offset dictionary for ratarmount/tests/2k-recursive-tars.tar.xz ...
    Creating new SQLite index database at ratarmount/tests/2k-recursive-tars.tar.xz.index.sqlite
    Creating offset dictionary for mimi/00001.tar ...
    Creating offset dictionary for mimi/00001.tar took 0.00s
    Creating offset dictionary for mimi/00002.tar ...
    Creating offset dictionary for mimi/00002.tar took 0.00s
    [...]
    Creating offset dictionary for mimi/01999.tar ...
    Creating offset dictionary for mimi/01999.tar took 0.00s
    Creating offset dictionary for mimi/02000.tar ...
    Creating offset dictionary for mimi/02000.tar took 0.00s
    Creating offset dictionary for ratarmount/tests/2k-recursive-tars.tar.xz took 104.80s
    Writing out TAR index to ratarmount/tests/2k-recursive-tars.tar.xz.index.sqlite took 0s and is sized 589824 B

./ratarmount.py -cr tests/2k-recursive-tars.tpxz pipi
    Creating offset dictionary for ratarmount/tests/2k-recursive-tars.tpxz ...
    Creating new SQLite index database at ratarmount/tests/2k-recursive-tars.tpxz.index.sqlite
    Creating offset dictionary for mimi/00001.tar ...
    Creating offset dictionary for mimi/00001.tar took 0.00s
    Creating offset dictionary for mimi/00002.tar ...
    Creating offset dictionary for mimi/00002.tar took 0.00s
    [...]
    Creating offset dictionary for mimi/02000.tar ...
    Creating offset dictionary for mimi/02000.tar took 0.00s
    Creating offset dictionary for ratarmount/tests/2k-recursive-tars.tpxz took 58.66s
    Writing out TAR index to ratarmount/tests/2k-recursive-tars.tpxz.index.sqlite took 0s and is sized 589824 B


time python3 -c 'import lzmaffi, sys; print( len( lzmaffi.open( sys.argv[1] ).read() ) );' tests/2k-recursive-tars.tar.xz
    21514240

    real	0m0.129s
    user	0m0.087s
    sys	0m0.038s

time python3 -c 'import lzmaffi, sys; print( len( lzmaffi.open( sys.argv[1] ).read() ) );' tests/2k-recursive-tars.tpxz
    21560288

    real	0m0.109s
    user	0m0.086s
    sys	0m0.020s

time python3 -c 'import indexed_bzip2, sys; print( len( indexed_bzip2.IndexedBzip2File( sys.argv[1] ).read() ) );' tests/2k-recursive-tars.tar.bz2
    21514240

    real	0m0.119s
    user	0m0.090s
    sys	0m0.028s

python3 -m timeit -s 'import lzmaffi' 'lzmaffi.open( "tests/2k-recursive-tars.tar.xz" ).read()'
    5 loops, best of 5: 41.5 msec per loop
python3 -m timeit -s 'import lzmaffi' 'lzmaffi.open( "tests/2k-recursive-tars.tpxz" ).read()'
    10 loops, best of 5: 32.4 msec per loop
python3 -m timeit -s 'import indexed_bzip2' 'indexed_bzip2.IndexedBzip2File( "tests/2k-recursive-tars.tar.bz2" ).read()'
    5 loops, best of 5: 98 msec per loop
  -> The xz decoder is actually 2-3x faster than the bz2 decoder!

time cat bibi/mimi/01333.tar/foo
    1333

    real	0m0.003s
    user	0m0.002s
    sys	0m0.000s

time cat mimi/mimi/01333.tar/foo
    1333

    real	0m0.042s
    user	0m0.002s
    sys	0m0.000s

time cat pipi/mimi/01333.tar/foo
    1333

    real	0m0.029s
    user	0m0.001s
    sys	0m0.000s

time cat pipi/mimi/01500.tar/foo
    1500

    real	0m0.012s
    user	0m0.001s
    sys	0m0.000s

python3 -m timeit -s 'import io, lzmaffi; f = lzmaffi.open( "tests/2k-recursive-tars.tar.xz" );' 'f.seek( -1, io.SEEK_END ); f.seek( 10*1024*1024 ); f.read( 1 )'
    10 loops, best of 5: 34.1 msec per loop
python3 -m timeit -s 'import io, lzmaffi; f = lzmaffi.open( "tests/2k-recursive-tars.tpxz" );' 'f.seek( -1, io.SEEK_END ); f.seek( 10*1024*1024 ); f.read( 1 )'
    20 loops, best of 5: 13.7 msec per loop
python3 -m timeit -s 'import indexed_bzip2, io; f = indexed_bzip2.IndexedBzip2File( "tests/2k-recursive-tars.tar.bz2" )' 'f.seek( -1, io.SEEK_END ); f.seek( 10*1024*1024 ); f.read( 1 )'
    20 loops, best of 5: 12.7 msec per loop
  • You can actually see the seeking and block boundaries by accessing the files and timing the access

  • Also, reading a file that lies later in the TAR than the last accessed one is many times faster (about 2 ms, i.e., roughly 10-20x faster) than reading that same file a second time, because the second read has to seek backward a bit!

  • Index Creation: BZ2 (24 Blocks): 0.52s, XZ (1 Block): 105s, XZ (3 Blocks): 58.7s

    • There seems to be a multitude of factors making the backend ~100x slower for mounting:
      • The recursive mounting requires one backwards seek per recursive TAR
      • The xz files have 8x and 24x fewer blocks (3 and 1 instead of 24), making seeking less efficient
      • Decoding is actually roughly twice as fast as bz2!
      • The pixz file is generally ~25% faster for some reason, maybe because of different default compression settings.

    => Decoding isn’t the problem. Seeking by itself also does not seem to be the problem. At this point, I’m not sure why it’s not working as fast as bz2.
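To illustrate why a backward seek in a compressed stream can be costly when no block index is available, here is a minimal sketch using Python's standard lzma module, not lzmaffi. The standard module emulates backward seeks by rewinding and decoding again from the start, which is comparable to the single-block xz case in the warning above. `CountingBytesIO` is a hypothetical helper that counts how many compressed bytes the decoder pulls from the file.

```python
import io
import lzma


class CountingBytesIO(io.BytesIO):
    """In-memory file that counts how many bytes were read from it."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


payload = bytes(range(256)) * 4096        # 1 MiB of easily compressible data
source = CountingBytesIO(lzma.compress(payload))

with lzma.open(source) as f:
    assert f.read() == payload             # one full forward pass
    consumed_first_pass = source.bytes_read

    middle = len(payload) // 2
    f.seek(middle)                         # backward seek from the end
    byte_at_middle = f.read(1)
    consumed_after_seek = source.bytes_read

# The compressed input had to be consumed again to serve the backward seek.
print(consumed_first_pass, consumed_after_seek)
```

A decoder with true block-wise seeking should only need to decode again from the start of the containing block, which is why the number of blocks matters for the seek cost.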

Try to find the critical code location with cProfile

diff --git a/ratarmount.py b/ratarmount.py
index b71005d..7a6b5bd 100755
--- a/ratarmount.py
+++ b/ratarmount.py
@@ -1346,6 +1346,9 @@ class SQLiteIndexedTar:
         assert False, ( "Could not load or store block offsets for {} probably because adding support was forgotten!"
                         .format( self.compression ) )

+import cProfile
+import pstats
+
 class TarMount( fuse.Operations ):
     """
     This class implements the fusepy interface in order to create a mounted file system view
@@ -1384,6 +1387,15 @@ class TarMount( fuse.Operations ):
             except:
                 pass

+        tarFile = pathToMount[0]
+        pfname =  'ratarmount-profile'
+        cProfile.runctx( 'SQLiteIndexedTar( tarFile, writeIndex = True, encoding = self.encoding, **sqliteIndexedTarOptions )',
+                         globals(), locals(), pfname )
+        p = pstats.Stats( pfname )
+        p.sort_stats( pstats.SortKey.CUMULATIVE )
+        p.print_stats()
+        sys.exit( 0 )
+
         self.mountSources: List[Any] = [
             SQLiteIndexedTar( tarFile,
                               writeIndex = True,

./ratarmount.py -cr tests/2k-recursive-tars.tar.bz2 bibi
    Sun Dec 13 14:18:28 2020    ratarmount-profile

             686148 function calls (684134 primitive calls) in 0.671 seconds

       Ordered by: cumulative time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000    0.671    0.671 {built-in method builtins.exec}
            1    0.001    0.001    0.671    0.671 <string>:1(<module>)
            1    0.000    0.000    0.670    0.670 ./ratarmount.py:287(__init__)
       2001/1    0.038    0.000    0.651    0.651 ./ratarmount.py:600(createIndex)
         8005    0.013    0.000    0.358    0.000 /usr/lib/python3.8/tarfile.py:2292(next)
         6004    0.007    0.000    0.234    0.000 /usr/lib/python3.8/tarfile.py:1097(fromtarfile)
         6003    0.004    0.000    0.203    0.000 /usr/lib/python3.8/tarfile.py:2407(__iter__)
        14005    0.007    0.000    0.195    0.000 /usr/lib/python3.8/tarfile.py:516(read)
        14005    0.005    0.000    0.187    0.000 /usr/lib/python3.8/tarfile.py:523(_read)
        14005    0.017    0.000    0.182    0.000 /usr/lib/python3.8/tarfile.py:550(__read)
         2002    0.004    0.000    0.172    0.000 /usr/lib/python3.8/tarfile.py:1552(open)
         2002    0.006    0.000    0.166    0.000 /usr/lib/python3.8/tarfile.py:1441(__init__)
         4105    0.164    0.000    0.164    0.000 {method 'read' of '_io.BufferedReader' objects}
         6004    0.022    0.000    0.116    0.000 /usr/lib/python3.8/tarfile.py:1034(frombuf)
         6003    0.106    0.000    0.106    0.000 {method 'seek' of '_io.BufferedReader' objects}
         4001    0.004    0.000    0.101    0.000 /usr/lib/python3.8/tarfile.py:503(seek)
         2000    0.004    0.000    0.077    0.000 ./ratarmount.py:214(read)
         4002    0.006    0.000    0.059    0.000 ./ratarmount.py:957(_setFileInfo)
        32024    0.017    0.000    0.039    0.000 /usr/lib/python3.8/tarfile.py:172(nti)
         4003    0.010    0.000    0.036    0.000 /usr/lib/python3.8/tarfile.py:221(calc_chksums)
         6009    0.032    0.000    0.032    0.000 {method 'execute' of 'sqlite3.Connection' objects}
        52039    0.019    0.000    0.032    0.000 /usr/lib/python3.8/tarfile.py:164(nts)
         4002    0.008    0.000    0.025    0.000 ./ratarmount.py:931(_tryAddParentFolders)
         4004    0.018    0.000    0.018    0.000 {built-in method builtins.print}
         8006    0.015    0.000    0.015    0.000 {built-in method builtins.sum}
         4003    0.002    0.000    0.014    0.000 /usr/lib/python3.8/tarfile.py:1118(_proc_member)
         4003    0.005    0.000    0.012    0.000 /usr/lib/python3.8/tarfile.py:1131(_proc_builtin)
         8006    0.011    0.000    0.011    0.000 {built-in method _struct.unpack_from}
            1    0.000    0.000    0.011    0.011 ./ratarmount.py:1199(_openCompressedFile)
         4004    0.007    0.000    0.011    0.000 /usr/lib/python3.8/posixpath.py:334(normpath)
         2000    0.003    0.000    0.009    0.000 ./ratarmount.py:148(__init__)
            7    0.009    0.001    0.009    0.001 {method 'executescript' of 'sqlite3.Connection' objects}
         2004    0.002    0.000    0.008    0.000 ./ratarmount.py:1018(indexIsLoaded)
         4002    0.004    0.000    0.008    0.000 ./ratarmount.py:937(<listcomp>)
         4002    0.004    0.000    0.008    0.000 ./ratarmount.py:584(_updateProgressBar)
        52039    0.007    0.000    0.007    0.000 {method 'find' of 'bytes' objects}
         2251    0.007    0.000    0.007    0.000 {method 'executemany' of 'sqlite3.Connection' objects}
            1    0.000    0.000    0.006    0.006 ./ratarmount.py:529(_pathIsWritable)
        52041    0.006    0.000    0.006    0.000 {method 'decode' of 'bytes' objects}
            1    0.006    0.006    0.006    0.006 {method 'write' of '_io.BufferedWriter' objects}
            1    0.000    0.000    0.005    0.005 ./ratarmount.py:1183(_detectTar)
            1    0.000    0.000    0.005    0.005 ./ratarmount.py:1153(_detectCompression)
            1    0.000    0.000    0.005    0.005 /usr/lib/python3.8/tarfile.py:1643(taropen)
         2000    0.003    0.000    0.005    0.000 ./ratarmount.py:242(seek)
    [...]

./ratarmount.py -cr tests/2k-recursive-tars.tpxz bibi
    Sun Dec 13 14:20:01 2020    ratarmount-profile

             4455897 function calls (4453893 primitive calls) in 52.952 seconds

       Ordered by: cumulative time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000   52.952   52.952 {built-in method builtins.exec}
            1    0.000    0.000   52.952   52.952 <string>:1(<module>)
            1    0.000    0.000   52.952   52.952 ./ratarmount.py:287(__init__)
       2001/1    0.126    0.000   52.913   52.913 ./ratarmount.py:600(createIndex)
 !!! ->  6001    0.025    0.000   51.823    0.009 ~/.local/lib/python3.8/site-packages/lzmaffi/__init__.py:482(seek)
        10104    1.746    0.000   49.294    0.005 ~/.local/lib/python3.8/site-packages/lzmaffi/__init__.py:399(_read_block)
         9091    0.024    0.000   47.518    0.005 ~/.local/lib/python3.8/site-packages/lzmaffi/__init__.py:453(_fill_buffer)
         4991    0.027    0.000   47.430    0.010 ~/.local/lib/python3.8/site-packages/lzmaffi/_lzmamodule2.py:711(decompress)
         4991   10.804    0.002   47.403    0.009 ~/.local/lib/python3.8/site-packages/lzmaffi/_lzmamodule2.py:727(_decompress)
       309720    0.121    0.000   31.368    0.000 ~/.local/lib/python3.8/site-packages/lzmaffi/_lzmamodule2.py:346(catch_lzma_error)
       297999   31.215    0.000   31.215    0.000 {built-in method _compiled_module.lzma_code}
       293008    4.770    0.000    4.770    0.000 {built-in method _compiled_module.realloc}
         3905    0.010    0.000    2.593    0.001 ~/.local/lib/python3.8/site-packages/lzmaffi/__init__.py:356(_move_to_block)
         3905    2.294    0.001    2.490    0.001 ~/.local/lib/python3.8/site-packages/lzmaffi/__init__.py:343(_init_decompressor)
       595998    0.263    0.000    0.461    0.000 ~/.local/lib/python3.8/site-packages/cffi/api.py:293(cast)
         8005    0.032    0.000    0.452    0.000 /usr/lib/python3.8/tarfile.py:2292(next)
         6004    0.013    0.000    0.338    0.000 /usr/lib/python3.8/tarfile.py:1097(fromtarfile)
         2002    0.014    0.000    0.257    0.000 /usr/lib/python3.8/tarfile.py:1552(open)
         6003    0.008    0.000    0.246    0.000 /usr/lib/python3.8/tarfile.py:2407(__iter__)
         2002    0.017    0.000    0.233    0.000 /usr/lib/python3.8/tarfile.py:1441(__init__)
         4002    0.023    0.000    0.222    0.000 ./ratarmount.py:957(_setFileInfo)
         6004    0.041    0.000    0.191    0.000 /usr/lib/python3.8/tarfile.py:1034(frombuf)
         6006    0.167    0.000    0.167    0.000 {method 'execute' of 'sqlite3.Connection' objects}
        14005    0.010    0.000    0.148    0.000 /usr/lib/python3.8/tarfile.py:516(read)
         3905    0.050    0.000    0.140    0.000 ~/.local/lib/python3.8/site-packages/lzmaffi/_lzmamodule2.py:656(__init__)
        14005    0.009    0.000    0.137    0.000 /usr/lib/python3.8/tarfile.py:523(_read)
        14005    0.026    0.000    0.127    0.000 /usr/lib/python3.8/tarfile.py:550(__read)
         4103    0.007    0.000    0.110    0.000 ~/.local/lib/python3.8/site-packages/lzmaffi/__init__.py:367(read)
       620531    0.096    0.000    0.096    0.000 ~/.local/lib/python3.8/site-packages/cffi/api.py:180(_typeof)
        12814    0.093    0.000    0.093    0.000 {method 'read' of '_io.BufferedReader' objects}
         3905    0.023    0.000    0.093    0.000 ~/.local/lib/python3.8/site-packages/lzmaffi/_lzmamodule2.py:549(find)
       595998    0.082    0.000    0.082    0.000 {built-in method _cffi_backend.cast}
         4001    0.008    0.000    0.068    0.000 /usr/lib/python3.8/tarfile.py:503(seek)
         4002    0.025    0.000    0.063    0.000 ./ratarmount.py:931(_tryAddParentFolders)
        32024    0.028    0.000    0.061    0.000 /usr/lib/python3.8/tarfile.py:172(nti)
        24533    0.021    0.000    0.060    0.000 ~/.local/lib/python3.8/site-packages/cffi/api.py:242(new)
         4007    0.060    0.000    0.060    0.000 {built-in method builtins.print}
         4003    0.016    0.000    0.052    0.000 /usr/lib/python3.8/tarfile.py:221(calc_chksums)
         2004    0.005    0.000    0.049    0.000 ./ratarmount.py:1018(indexIsLoaded)
         2000    0.014    0.000    0.046    0.000 ./ratarmount.py:148(__init__)
       639627    0.045    0.000    0.045    0.000 {built-in method builtins.isinstance}
         4002    0.017    0.000    0.044    0.000 ./ratarmount.py:584(_updateProgressBar)
        52039    0.022    0.000    0.043    0.000 /usr/lib/python3.8/tarfile.py:164(nts)
         2000    0.007    0.000    0.041    0.000 ./ratarmount.py:214(read)
         7816    0.018    0.000    0.040    0.000 ~/.local/lib/python3.8/site-packages/lzmaffi/_lzmamodule2.py:575(__init__)
         4003    0.006    0.000    0.039    0.000 /usr/lib/python3.8/tarfile.py:1118(_proc_member)
         3915    0.007    0.000    0.038    0.000 ~/.local/lib/python3.8/site-packages/lzmaffi/__init__.py:287(_peek)
         4003    0.010    0.000    0.033    0.000 /usr/lib/python3.8/tarfile.py:1131(_proc_builtin)
            1    0.000    0.000    0.030    0.030 ./ratarmount.py:1199(_openCompressedFile)
         2000    0.008    0.000    0.028    0.000 ./ratarmount.py:242(seek)
         3905    0.008    0.000    0.025    0.000 ~/.local/lib/python3.8/site-packages/lzmaffi/_lzmamodule2.py:296(_new_lzma_stream)
        24533    0.024    0.000    0.024    0.000 {built-in method _cffi_backend.newp}
         9091    0.008    0.000    0.023    0.000 ~/.local/lib/python3.8/site-packages/lzmaffi/__init__.py:41(memoryview_tobytes)
         4004    0.014    0.000    0.021    0.000 /usr/lib/python3.8/posixpath.py:334(normpath)
         8006    0.020    0.000    0.020    0.000 {built-in method _struct.unpack_from}
         3905    0.020    0.000    0.020    0.000 {built-in method _compiled_module.lzma_block_decoder}
         4002    0.011    0.000    0.017    0.000 ./ratarmount.py:937(<listcomp>)
         8006    0.017    0.000    0.017    0.000 {built-in method builtins.sum}
         4003    0.013    0.000    0.017    0.000 /usr/lib/python3.8/tarfile.py:1335(_apply_pax_info)
         2250    0.016    0.000    0.016    0.000 {method 'executemany' of 'sqlite3.Connection' objects}
         7838    0.016    0.000    0.016    0.000 {method 'seek' of '_io.BufferedReader' objects}
            1    0.000    0.000    0.015    0.015 ./ratarmount.py:1153(_detectCompression)
            1    0.000    0.000    0.015    0.015 ./ratarmount.py:1183(_detectTar)
         4003    0.014    0.000    0.014    0.000 /usr/lib/python3.8/tarfile.py:747(__init__)
        14009    0.014    0.000    0.014    0.000 {method 'join' of 'str' objects}
            1    0.000    0.000    0.014    0.014 /usr/lib/python3.8/tarfile.py:1643(taropen)
    [...]

=> Looks like the lzmaffi seek function is indeed problematic!
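The conclusion can be backed up with a few ratios computed from the second profile. The numbers below are copied from the tpxz run above; the variable names are only illustrative, and the last ratio is an observation about call counts, not a diagnosis of the cause.

```python
# Numbers copied from the cProfile output of the tpxz run above.
total_seconds = 52.952
seek_cumulative_seconds = 51.823   # lzmaffi/__init__.py:482(seek)
seek_calls = 6001
decompress_calls = 4991            # _lzmamodule2.py:711(decompress)
lzma_code_calls = 297999           # _compiled_module.lzma_code

seek_share = seek_cumulative_seconds / total_seconds
ms_per_seek = 1000 * seek_cumulative_seconds / seek_calls
lzma_code_per_decompress = lzma_code_calls / decompress_calls

print("share of runtime spent inside seek: {:.1%}".format(seek_share))
print("average cost per seek call: {:.1f} ms".format(ms_per_seek))
print("lzma_code calls per decompress call: {:.0f}".format(lzma_code_per_decompress))
```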

Issue Analytics

  • State:closed
  • Created 3 years ago
  • Reactions:1
  • Comments:7 (4 by maintainers)

Top GitHub Comments

mxmlnkn commented, Sep 19, 2021 (1 reaction)

I’ll close this for now for two reasons:

  • The original issue has been worked around and I don’t have any benchmark showing general slowdown.
  • On master, I made python-xz the default backend and dependency because lzmaffi’s installation from source is buggy and wheels are missing.

Rogdham commented, May 20, 2021 (1 reaction)

> Afaik, r| should open the tarfile for streaming, i.e., it should not seek at all (and therefore do fewer seeks)

I am not 100% sure, but I think the issue comes from the following:

  • with r|, reads are buffered, and so may read more than strictly necessary
  • ratarmount manually seeks on the following line; this introduces a backward seek if the previous read moved past it (which seems to be the case because the files are so small)
                    fileObject.seek(globalOffset)
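For reference, the streaming mode discussed here can be tried in isolation with the standard tarfile module. In the sketch below, `ForwardOnlyReader`, `make_tar`, and `list_streaming` are hypothetical helpers: the reader deliberately has no seek method, so the example only works because "r|" reads strictly sequentially. It does not reproduce ratarmount's extra `fileObject.seek(globalOffset)` call.

```python
import io
import tarfile


class ForwardOnlyReader:
    """File-like object that offers read() only, so it cannot be seeked."""

    def __init__(self, data):
        self._buffer = io.BytesIO(data)

    def read(self, size):
        return self._buffer.read(size)


def make_tar(files):
    """Build an uncompressed TAR in memory from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def list_streaming(fileobj):
    """Read all members in order using tarfile's streaming mode ("r|")."""
    entries = []
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            # In streaming mode, contents must be read while iterating, in order.
            content = tar.extractfile(member).read() if member.isfile() else b""
            entries.append((member.name, content))
    return entries


archive = make_tar({"a.txt": b"alpha", "b.txt": b"beta"})
print(list_streaming(ForwardOnlyReader(archive)))
```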

> Do you have internal information as to why there is such a stark contrast?

I really don’t know. One thing I can think of is that python-xz only updates the wanted position when you seek, and only performs the actual seek (and any potentially expensive operations) when you start reading. This way, if you seek multiple times before reading, there is no overhead. I’m not sure if that is the cause though.
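The deferred-seek idea described above could look roughly like the following wrapper. This is a hypothetical illustration and not python-xz's actual implementation: `LazySeekReader` and its `real_seeks` counter are made up for the sketch, and only SEEK_SET and SEEK_CUR are handled.

```python
import io


class LazySeekReader:
    """seek() only records the wanted position; the real seek happens on read()."""

    def __init__(self, raw):
        self._raw = raw
        self._wanted = raw.tell()
        self.real_seeks = 0          # number of seeks forwarded to the raw file

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._wanted = offset
        elif whence == io.SEEK_CUR:
            self._wanted += offset
        else:
            raise ValueError("only SEEK_SET and SEEK_CUR are supported in this sketch")
        return self._wanted

    def tell(self):
        return self._wanted

    def read(self, size=-1):
        # Pay for a seek only if the raw file is not already at the wanted position.
        if self._raw.tell() != self._wanted:
            self._raw.seek(self._wanted)
            self.real_seeks += 1
        data = self._raw.read(size)
        self._wanted = self._raw.tell()
        return data


reader = LazySeekReader(io.BytesIO(b"0123456789"))
reader.seek(8)
reader.seek(2)
reader.seek(5)                       # three seek calls, nothing forwarded yet
print(reader.read(2), reader.real_seeks)
```

With such a wrapper, a sequence of seeks that ends where the file already is costs nothing, and several seeks in a row cost at most one real seek.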

> I’m not sure how stable and tested your module is as of yet, though.

As far as tests go, I have 100% unittest coverage, plus integration coverage testing with xz files in as many different configurations as I could think of (number of streams, stream padding, number of blocks, size of blocks, etc.). Tests are run against all officially supported Python versions (plus PyPy). I’m not saying that there are no bugs of course, but at least this gives reasonable confidence that it should work as expected.

For the API stability, I’m mirroring the lzma module with xz.open and xz.XZFile, so it should not change at all. In the worst case, if anything is breaking backward compatibility, it will be thoroughly documented in the changelog.

The main change I’m planning to do on the library is to add write support which is completely missing as of now. This should not impact ratarmount’s usecase in any way.

With all of that being said, it is a very young library, and it is not battle-tested yet.

In any case, if you find any issues please report them! As you saw I’m not afraid to dive in to get a better understanding of problems and ultimately find possible solutions.
