Unable to read in big csv file
This is most probably related to #7628.
Given a large, 19 GB CSV file, I'm unable to read it using astropy alone.
When using the defaults naively, I run into a segfault pretty quickly (within a minute):
    from astropy.table import Table

    testd = Table.read('test_set.csv')
Opting out of the fast reader and turning off guessing, etc. (the call is sketched after the process listing below), it's still running after 50 minutes and has already eaten up a substantial amount of memory, so I'm killing it:
       PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    119752 bsipocz   20   0  204.2g  197.6g 14628 R  99.3 39.2  49:00.38 ipython
    117282 bsipocz   20   0   33.9g   29.7g 19440 S   0.0  5.9   9:48.62 ipython
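The slow-path invocation was along these lines; this is a sketch, since the exact keywords from the original session aren't recorded, but guess, format, and fast_reader are the relevant astropy reader options:

    from astropy.table import Table

    # Force the pure-Python reader: no format guessing, fast C reader disabled.
    testd = Table.read('test_set.csv', format='ascii.csv',
                       guess=False, fast_reader=False)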
The second session above is the one where the file was read in using pandas; converting that DataFrame to a Table then works nicely, of course.
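A sketch of that workaround (assuming a plain pd.read_csv call; the original invocation isn't shown in the report):

    import pandas as pd
    from astropy.table import Table

    # pandas handles the 19 GB file, then the DataFrame converts cleanly.
    df = pd.read_csv('test_set.csv')
    testd = Table.from_pandas(df)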
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
So with a small fix I'm able to load the file (PR to come). Quite fast, actually! The issue is that the array storing the column lengths uses an int: this file has so many lines that the required array size no longer fits in a 32-bit integer.
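For scale (my own back-of-the-envelope illustration, not code from the fix): a signed 32-bit int tops out at 2**31 - 1, about 2.1e9, so any byte offset or cumulative length counter for a ~19 GB file silently wraps:

    import numpy as np

    INT32_MAX = np.iinfo(np.int32).max    # 2_147_483_647, i.e. ~2.1 GB
    file_size = np.int64(19) * 1024**3    # the ~19 GB file from the report

    print(file_size > INT32_MAX)          # True: beyond any 32-bit offset

    # Squeezing such a value into 32 bits keeps only the low 32 bits:
    print(file_size.astype(np.int32))     # -1073741824, a garbage (negative) length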
This is what trying to open test_set.csv on a MacBook Pro (2015, 16 GB, macOS 10.14) looks like: [malloc failure output from the original comment not reproduced here]. That is quite a large malloc size.
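One plausible way such a huge request arises (my guess at the mechanism, not traced through the actual C code): a wrapped, negative 32-bit length, reinterpreted as the unsigned size_t that malloc takes, becomes astronomically large:

    wrapped = -1073741824         # a truncated 32-bit offset, as in the sketch above
    as_size_t = wrapped % 2**64   # what a 64-bit malloc(size_t) would see
    print(as_size_t)              # 18446744072635809792 bytes, roughly 16 EiB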