Significantly lower throughput compared to `open`


I was iterating over a large CSV file using `open` vs. `smart_open` and noticed a significant performance drop when the only change was `open` -> `smart_open`.

# Iterate over csv

import time
import csv

from smart_open import smart_open

def report_time_iterate_rows(file_name, report_every=100000):
    start = time.time()
    last = start
    # Replace `open` with `smart_open` here to reproduce the slower run.
    with open(file_name, 'r') as f:
        reader = csv.reader(f)
        for i, line in enumerate(reader, start=1):
            if not (i % report_every):
                current = time.time()
                time_taken = current - last
                print('Time taken for %d rows: %.2f seconds, %.2f rows/s' % (
                    report_every, time_taken, report_every / time_taken))
                last = current
    total = time.time() - start
    print('Total: %d rows, %.2f seconds, %.2f rows/s' % (
        i, total, i / total))

report_time_iterate_rows('file.csv')
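
For a side-by-side comparison, the same loop can take the opener as a parameter so that nothing else differs between the two runs. This is a minimal sketch, not the code used for the numbers in this issue; `rows_per_second` and its `opener` parameter are hypothetical names introduced for illustration.

```python
import csv
import time


def rows_per_second(file_name, opener=open):
    """Iterate over every CSV row once and return (row_count, rows_per_second).

    `opener` is any callable that accepts (file_name, mode) and returns a
    text-mode context manager; the built-in `open` is the default.
    """
    start = time.perf_counter()
    count = 0
    with opener(file_name, 'r') as f:
        for _ in csv.reader(f):
            count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small files.
    rate = count / elapsed if elapsed > 0 else float('inf')
    return count, rate
```

Calling `rows_per_second('file.csv')` measures the built-in `open`. Passing `opener=smart_open` should measure the other path, assuming `smart_open` accepts the same `(file_name, 'r')` arguments as in the swap described above.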

Output with open:

Time taken for 100000 rows: 0.08 seconds, 1222907.59 rows/s
Time taken for 100000 rows: 0.08 seconds, 1217525.99 rows/s
Time taken for 100000 rows: 0.08 seconds, 1223503.33 rows/s
Time taken for 100000 rows: 0.08 seconds, 1247851.67 rows/s
Time taken for 100000 rows: 0.08 seconds, 1245898.25 rows/s
Time taken for 100000 rows: 0.08 seconds, 1238971.91 rows/s
...

Output with smart_open:

Time taken for 100000 rows: 0.37 seconds, 272099.79 rows/s
Time taken for 100000 rows: 0.37 seconds, 272198.68 rows/s
Time taken for 100000 rows: 0.37 seconds, 273532.88 rows/s
Time taken for 100000 rows: 0.37 seconds, 272889.00 rows/s
Time taken for 100000 rows: 0.37 seconds, 272412.42 rows/s
...
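
To put the two outputs on one scale, the slowdown can be computed directly from the reported rates. The sketch below uses only the first line of each output above; the variable names are chosen for illustration.

```python
# Rates copied from the first line of each output above.
open_rate = 1222907.59        # rows/s with the built-in open
smart_open_rate = 272099.79   # rows/s with smart_open

# Ratio of the two throughputs.
slowdown = open_rate / smart_open_rate

# Additional wall-clock time spent on each batch of 100000 rows.
extra_seconds_per_100k = 100000 / smart_open_rate - 100000 / open_rate

print('smart_open is about %.1fx slower' % slowdown)
print('extra time per 100000 rows: %.2f s' % extra_seconds_per_100k)
```

With these figures the ratio comes out at roughly 4.5x, or a little under 0.3 seconds of extra time per 100000 rows.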

Unfortunately, the file I’m using contains sensitive data, so I can’t share it, but I assume this is reproducible with any file that has a large number of lines.

Information about the file:

Number of lines: 25206601
File size: 2707135791 bytes (~2.7 GB)
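
Since the original file cannot be shared, one way to attempt a reproduction is to generate a synthetic CSV whose average line length matches the figures quoted above (about 107 bytes per line, newline included). This is a rough sketch under stated assumptions: the function name, the column count, and the all-digit field contents are arbitrary choices for illustration, and every row has the same length, which the real file very likely does not.

```python
import csv
import random

# Figures quoted in the issue.
NUM_LINES = 25206601
FILE_SIZE_BYTES = 2707135791
MEAN_LINE_BYTES = FILE_SIZE_BYTES / NUM_LINES  # roughly 107 bytes per line


def write_synthetic_csv(path, num_rows, num_columns=8, seed=0):
    """Write `num_rows` rows of zero-padded random integers.

    Each row is sized to come close to MEAN_LINE_BYTES. The layout is a
    hypothetical stand-in and says nothing about the private file.
    """
    rng = random.Random(seed)
    # Per-field budget after subtracting the commas and the line terminator.
    field_len = max(1, int((MEAN_LINE_BYTES - num_columns) / num_columns))
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f, lineterminator='\n')
        for _ in range(num_rows):
            row = ['%0*d' % (field_len, rng.randrange(10 ** field_len))
                   for _ in range(num_columns)]
            writer.writerow(row)
```

Calling `write_synthetic_csv('synthetic.csv', NUM_LINES)` would produce a file in the same size range as the original; a smaller `num_rows` is enough to check whether the per-row rate differs between the two openers.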

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 11 (3 by maintainers)

Top GitHub Comments

2 reactions
jayantj commented, Apr 6, 2018

Ah I see you beat me to it @mpenkov 😃

2 reactions
jayantj commented, Apr 6, 2018

Sure. Random sample of 1000 line lengths -

30, 161, 71, 162, 84, 19, 28, 100, 32, 253, 37, 39, 191, 119, 75, 26, 44, 64, 230, 71, 71, 71, 45, 22, 78, 155, 32, 38, 45, 64,
121, 51, 76, 22, 148, 76, 38, 53, 154, 51, 65, 50, 361, 31, 99, 75, 137, 45, 46, 62, 53, 37, 23, 63, 26, 276, 26, 44, 142, 64, 117, 76, 57, 647, 99, 52, 113, 114, 42, 271,
58, 26, 54, 26, 74, 52, 89, 51, 68, 51, 403, 51, 40, 72, 458, 43, 267, 148, 96, 38, 103, 83, 74, 23, 30, 332, 27, 30, 23, 106, 62, 61, 99, 43, 49, 482, 39, 179, 73, 443, 64,
58, 26, 74, 123, 152, 45, 376, 43, 331, 132, 34, 27, 57, 61, 29, 138, 42, 83, 60, 51, 21, 34, 57, 39, 28, 52, 54, 50, 236, 44, 37, 44, 54, 64, 22, 105, 20, 182, 110, 44, 65,
44, 46, 264, 76, 55, 39, 83, 36, 75, 121, 80, 63, 151, 71, 45, 38, 22, 130, 56, 57, 44, 78, 114, 66, 54, 85, 71, 26, 50, 40, 107, 62, 170, 35, 30, 57, 80, 30, 155, 112, 121,
117, 90, 277, 84, 217, 386, 24, 29, 100, 36, 105, 709, 22, 49, 307, 90, 51, 1493, 26, 45, 77, 30, 26, 154, 93, 31, 60, 85, 218, 54, 75, 24, 54, 40, 70, 37, 251, 38, 81, 55, 46,
44, 150, 49, 198, 248, 68, 48, 69, 67, 25, 32, 24, 230, 1532, 79, 44, 118, 56, 188, 120, 60, 131, 132, 39, 50, 56, 74, 50, 107, 134, 273, 46, 258, 120, 99, 27, 65, 39, 80,
74, 30, 44, 63, 91, 61, 21, 73, 267, 79, 26, 22, 85, 311, 17, 121, 53, 58, 44, 165, 319, 108, 83, 27, 82, 555, 50, 142, 54, 25, 183, 111, 51, 27, 66, 70, 775, 31, 29, 234,
876, 18, 55, 55, 74, 64, 52, 147, 18, 91, 1280, 90, 108, 259, 50, 33, 45, 112, 65, 66, 23, 120, 91, 196, 140, 390, 47, 72, 24, 41, 66, 62, 26, 87, 77, 56, 88, 28, 26, 147, 99,
50, 158, 44, 78, 59, 37, 31, 59, 84, 42, 142, 22, 150, 53, 25, 70, 257, 89, 38, 99, 213, 24, 99, 2926, 65, 58, 42, 40, 17, 54, 26, 41, 28, 49, 89, 60, 48, 39, 97, 58, 575,
102, 68, 100, 68, 101, 38, 38, 43, 78, 48, 93, 141, 39, 168, 96, 21, 26, 40, 93, 122, 48, 92, 291, 99, 35, 625, 44, 40, 64, 148, 308, 26, 51, 144, 26, 26, 40, 80, 34, 30, 99,
65, 289, 31, 36, 38, 108, 24, 38, 27, 87, 426, 67, 72, 112, 94, 44, 50, 68, 72, 23, 51, 68, 28, 264, 36, 167, 29, 70, 45, 57, 41, 69, 36, 35, 44, 58, 43, 216, 58, 57, 22, 55,
13, 65, 227, 36, 24, 121, 45, 49, 49, 87, 66, 26, 203, 32, 46, 32, 56, 179, 437, 70, 149, 44, 54, 123, 157, 21, 45, 65, 26, 146, 668, 29, 23, 31, 268, 100, 66, 1339, 73, 44,
92, 47, 47, 45, 48, 35, 377, 161, 43, 94, 97, 30, 63, 360, 44, 99, 344, 26, 115, 160, 99, 205, 64, 47, 290, 57, 104, 36, 158, 300, 20, 40, 200, 92, 57, 32, 42, 62, 34, 68,
180, 142, 182, 56, 44, 84, 225, 95, 72, 38, 132, 109, 82, 23, 94, 40, 389, 44, 31, 53, 80, 57, 116, 37, 51, 47, 25, 169, 44, 25, 99, 97, 174, 115, 44, 55, 110, 70, 68, 70, 26,
298, 307, 86, 74, 40, 122, 176, 50, 44, 22, 67, 44, 99, 99, 34, 20, 64, 340, 33, 47, 19, 150, 132, 158, 51, 296, 50, 310, 449, 201, 326, 75, 53, 66, 26, 119, 223, 96, 74, 38,
279, 31, 207, 44, 249, 99, 197, 240, 23, 59, 44, 38, 181, 111, 31, 26, 86, 97, 148, 106, 289, 37, 48, 23, 26, 45, 64, 46, 172, 28, 50, 270, 362, 104, 61, 64, 34, 174, 65, 87,
84, 249, 22, 44, 198, 24, 45, 68, 872, 70, 37, 32, 44, 38, 99, 24, 901, 272, 377, 40, 32, 68, 182, 26, 350, 888, 26, 174, 150, 69, 397, 38, 151, 37, 287, 49, 102, 14, 52,
44, 44, 121, 28, 178, 102, 131, 50, 57, 35, 113, 55, 29, 125, 75, 72, 162, 54, 45, 38, 40, 49, 15, 44, 83, 58, 20, 111, 140, 235, 63, 262, 50, 72, 44, 99, 35, 89, 44, 45, 162,
277, 268, 74, 50, 83, 71, 60, 26, 124, 40, 39, 42, 34, 37, 32, 1371, 32, 26, 24, 77, 85, 71, 163, 44, 130, 282, 40, 51, 103, 53, 91, 161, 40, 58, 172, 147, 63, 55, 25, 28,
241, 81, 44, 65, 203, 41, 50, 288, 86, 244, 51, 108, 45, 50, 81, 152, 56, 218, 99, 88, 196, 323, 180, 128, 41, 146, 67, 145, 57, 78, 62, 44, 43, 25, 82, 30, 42, 67, 115, 333,
78, 44, 85, 53, 19, 23, 28, 158, 51, 93, 330, 40, 14, 23, 51, 26, 63, 56, 46, 537, 48, 425, 119, 33, 170, 99, 167, 26, 29, 44, 1724, 494, 36, 58, 26, 152, 236, 44, 80, 50, 35,
127, 432, 249, 30, 116, 281, 57, 34, 54, 35, 50, 283, 174, 74, 64, 111, 30, 70, 62, 52, 104, 65, 71, 68, 67, 65, 103, 26, 26, 68, 26, 25, 97, 510, 96, 13, 39, 36, 59, 66,
166, 78, 80, 86, 192, 23, 74, 43, 65, 45, 23, 131, 71, 30, 61, 24, 73, 112, 15, 102, 54, 113, 188, 60, 55, 33, 57, 453, 27, 131, 61, 201, 44, 292, 22, 55, 109, 65, 46, 82, 47,
146, 345, 73, 98, 57, 65, 44, 143, 36, 92, 359, 157, 469, 244, 50, 180, 193, 392, 26, 42, 167, 21, 44, 105, 130, 223, 41, 45, 59, 107, 44, 97, 121, 74, 29, 99, 103, 70, 57