Significantly lower throughput compared to `open`
See original GitHub issueI was iterating over a large csv file using open
vs smart_open
and noticed a significant performance drop when nothing was changed but open
-> smart_open
# Iterate over csv
import time
import csv
from smart_open import smart_open
def report_time_iterate_rows(file_name, report_every=100000):
start = time.time()
last = start
with open(file_name, 'r') as f:
reader = csv.reader(f)
for i, line in enumerate(reader, start=1):
if not (i % report_every):
current = time.time()
time_taken = current - last
print('Time taken for %d rows: %.2f seconds, %.2f rows/s' % (
report_every, time_taken, report_every / time_taken))
last = current
total = time.time() - start
print('Total: %d rows, %.2f seconds, %.2f rows/s' % (
i, total, i / total))
report_time_iterate_rows('file.csv')
Output with open
:
Time taken for 100000 rows: 0.08 seconds, 1222907.59 rows/s
Time taken for 100000 rows: 0.08 seconds, 1217525.99 rows/s
Time taken for 100000 rows: 0.08 seconds, 1223503.33 rows/s
Time taken for 100000 rows: 0.08 seconds, 1247851.67 rows/s
Time taken for 100000 rows: 0.08 seconds, 1245898.25 rows/s
Time taken for 100000 rows: 0.08 seconds, 1238971.91 rows/s
...
Output with smart_open
:
Time taken for 100000 rows: 0.37 seconds, 272099.79 rows/s
Time taken for 100000 rows: 0.37 seconds, 272198.68 rows/s
Time taken for 100000 rows: 0.37 seconds, 273532.88 rows/s
Time taken for 100000 rows: 0.37 seconds, 272889.00 rows/s
Time taken for 100000 rows: 0.37 seconds, 272412.42 rows/s
...
Unfortunately, the file I’m using is sensitive data, so I can’t share it, but I assume this should be reproducible with any file with a large number of lines. Information about file - Number of lines: 25206601 File size: 2707135791 (~2.7 GB)
Issue Analytics
- State:
- Created 5 years ago
- Comments:11 (3 by maintainers)
Top Results From Across the Web
Bandwidth and Throughput in Networking: Guide and Tools
Just like throughput, poorly optimized bandwidth can dramatically slow down your network and give users a less-than-stellar experience on an app ...
Read more >What is throughput? | Definition from TechTarget
Throughput is necessarily lower than bandwidth because bandwidth represents the maximum capabilities of a network rather than the actual transfer rate. ...
Read more >More Throughput vs. Less Latency: Understand the Difference
When designing a system, "Speed" has two meanings. "How fast do the samples need to be acquired?" usually translates to throughput.
Read more >5 network performance factors that slow data transfers and ...
Network performance factors include latency, packet loss, etc. The network is not the sole driver of data transfer speed or end-user ...
Read more >Network throughput - Wikipedia
Network throughput refers to the rate of message delivery over a communication channel, such as Ethernet or packet radio, in a communication network....
Read more >
Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free
Top Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Ah I see you beat me to it @mpenkov 😃
Sure. Random sample of 1000 line lengths -