Keep original filenames in dask.dataframe.read_csv
For the data I am reading, the path (directory name) is an important trait, and it would be useful to access it, possibly as an additional column (e.g. `path_as_column=True`), or at the very least via the collection of delayed objects:
```python
import dask.dataframe as dd
dd.read_csv('s3://bucket_name/*/*/*.csv', collection=True)
```
The collection version of the command comes closer:
```python
all_dfs = dd.read_csv('s3://bucket_name/*/*/*.csv', collection=True)
print('key:', all_dfs[0].key)
print('value:', all_dfs[0].compute())
```
but it returns an internal hash as the key and doesn't appear to expose the path (`s3://bucket_name/actual_folder/subfolder/fancyfile.csv`) anywhere:
```
key: pandas_read_text-35c2999796309c2c92e6438c0ebcbba4
value:    Unnamed: 0   T   N   M   count
0                  0  T1  N0  M0  0.4454
1                  5  T2  N0  M0  0.4076
2                  6  T2  N0  M1  0.0666
3                  1  T1  N0  M1  0.0612
4                 10  T3  N0  M0  0.0054
```
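One workaround is to build the frame from delayed pandas readers that tag each partition with its source path. Below is a minimal sketch, assuming the files share a schema; the listing step and the `path` column name are illustrative, not part of the issue. (For reference, later Dask releases added an `include_path_column` argument to `read_csv` for exactly this.)

```python
import pandas as pd
import dask.dataframe as dd
from dask import delayed

def read_with_path(path):
    # Read one CSV and record which file each row came from.
    df = pd.read_csv(path)  # s3:// paths work if s3fs is installed
    df['path'] = path
    return df

# Hypothetical file listing; e.g. an s3fs glob could produce it.
paths = ['s3://bucket_name/folder/sub/a.csv',
         's3://bucket_name/folder/sub/b.csv']
ddf = dd.from_delayed([delayed(read_with_path)(p) for p in paths])
```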
Top GitHub Comments
I have been thinking more about the `path_parser` idea. Pandas already accepts a `converters` kwarg, so we could just make a minor tweak: check that `converters` dict for the path column and, if we find an entry, do the transformation before returning the df. I think the value is in only needing to do the operation once for each path vs. lots of times. This is what it would look like:
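(The original snippet was not preserved on this page; what follows is a hypothetical reconstruction of the idea with illustrative names, not actual Dask source: pop the path column's converter and apply it once per file rather than once per row.)

```python
import pandas as pd

# Hypothetical sketch; 'read_one_file' and 'path_column' are
# illustrative names, not Dask's internal API.
def read_one_file(path, path_column, converters=None, **kwargs):
    converters = dict(converters or {})
    # Pull out the converter registered for the path column, if any,
    # so pandas does not look for that column in the CSV itself.
    path_converter = converters.pop(path_column, None)
    df = pd.read_csv(path, converters=converters, **kwargs)
    # Apply the transformation once per file, broadcast to all rows.
    df[path_column] = path_converter(path) if path_converter else path
    return df
```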
---

I'm inclined to skip the `path_parser` option suggested by @jlstevens. There is, I think, a clear path for users to handle this orthogonally on their own with the `map` or `apply` methods if they so choose. I don't think that we need to make this any easier for them. I think that this approach is simpler in the long run.