
Keep original filenames in dask.dataframe.read_csv


For the data I am reading, the path (directory name) is an important trait, and it would be useful to have access to it, possibly as an additional column (e.g. path_as_column=True), or at the very least through the collection of delayed objects:

import dask.dataframe as dd
dd.read_csv('s3://bucket_name/*/*/*.csv', collection=True)

(screenshot of the resulting DataFrame omitted)

The delayed version of the command (collection=False, which returns a list of delayed objects rather than a DataFrame) comes closer

all_dfs = dd.read_csv('s3://bucket_name/*/*/*.csv', collection=False)
print('key:', all_dfs[0].key)
print('value:', all_dfs[0].compute())

but it returns an internal task name as the key and doesn’t seem to store the path (s3://bucket_name/actual_folder/subfolder/fancyfile.csv) anywhere

key: pandas_read_text-35c2999796309c2c92e6438c0ebcbba4
value:    Unnamed: 0   T   N   M   count
0           0  T1  N0  M0  0.4454
1           5  T2  N0  M0  0.4076
2           6  T2  N0  M1  0.0666
3           1  T1  N0  M1  0.0612
4          10  T3  N0  M0  0.0054
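
A workaround available at the time was to skip read_csv entirely and build the frame from delayed pandas reads, tagging each partition with its source path. A minimal sketch, assuming s3fs is installed for the s3:// URLs and the example path stands in for a real glob:

import pandas as pd
import dask.dataframe as dd
from dask import delayed

@delayed
def read_with_path(path):
    # Read one file and record where each row came from.
    df = pd.read_csv(path)
    df['path'] = path
    return df

# Illustrative paths; in practice these could come from s3fs.S3FileSystem().glob(...)
paths = ['s3://bucket_name/actual_folder/subfolder/fancyfile.csv']
ddf = dd.from_delayed([read_with_path(p) for p in paths])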

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 17 (13 by maintainers)

Top GitHub Comments

7 reactions
jsignell commented, Aug 28, 2018

I have been thinking more about the path_parser idea. Pandas already accepts a converters kwarg, so we could make a minor tweak: check that converters dict for the path_col and, if we find it, do the transformation before returning the df. The value is in only needing to do the operation once per path instead of lots of times. This is what it would look like:

(screenshot of the proposed code omitted)
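
Reading between the lines, the proposal would look roughly like the sketch below. Hypothetical code, not dask internals; read_one_file and path_col are illustrative names:

import pandas as pd

def read_one_file(path, path_col, converters=None, **kwargs):
    converters = dict(converters or {})
    # If the user supplied a converter for the path column, pull it out so
    # pandas never sees it, and apply it once to the path itself.
    path_converter = converters.pop(path_col, None)
    df = pd.read_csv(path, converters=converters, **kwargs)
    df[path_col] = path_converter(path) if path_converter else path
    return df
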
4 reactions
mrocklin commented, Aug 27, 2018

I’m inclined to skip the path_parser option suggested by @jlstevens. There is, I think, a clear path for users to handle this orthogonally on their own with the map or apply methods if they so choose. I don’t think that we need to make this any easier for them.

df = dd.read_csv(..., include_filenames=True)
df['filename'] = df['filename'].map(parse)

I think that this approach is simpler in the long run.
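
For reference, the option dask eventually shipped in read_csv is include_path_column: passing True adds a path column recording each row's source file (pass a string to choose a different column name), which can then be parsed per row along the lines suggested above. A brief sketch; the lambda is an illustrative parser:

import os
import dask.dataframe as dd

df = dd.read_csv('s3://bucket_name/*/*/*.csv', include_path_column=True)
# Derive e.g. the subfolder name from each row's source path.
df['subfolder'] = df['path'].map(lambda p: os.path.basename(os.path.dirname(p)))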
