Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Performance degradation when dumping large dataframe with np.datetime to csv

See original GitHub issue

Large DataFrame with dates is slower in 0.24.0

import pandas as pd
import numpy as np

d = '2018-11-29'
dt = '2018-11-26 11:18:27.0'
N = 10000000
df = pd.DataFrame({'dt': [np.datetime64(dt)] * N, 'd': [np.datetime64(d)] * N, 'r': [np.random.uniform()] * N})

Problem description

Using pandas < 0.24.0 dumping a large dataset (10M rows, 40 columns) which contains text, dates and floats/ints takes about 12 minutes but using version 0.24.0 the same code takes about 2hours and 40 minutes.

We understand the code base to create the csv has change a bit. I didn’t notice the speed degradation when the columns are not dates. I’ve not checked with datetime objects.

We’ve tried to condense the problem to the few lines above. Below the results I get from %prun in each version.

`%prun` for pandas `0.24.0`:

        121342 function calls (120733 primitive calls) in 153.774 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  153.774  153.774 {built-in method builtins.exec}
        1    0.000    0.000  153.774  153.774 <string>:1(<module>)
        1    0.000    0.000  153.774  153.774 generic.py:2873(to_csv)
        1    0.000    0.000  153.773  153.773 csvs.py:130(save)
        1    0.002    0.002  153.684  153.684 csvs.py:272(_save)
      301    0.278    0.001  153.681    0.511 csvs.py:290(_save_chunk)
      301    0.009    0.000  119.472    0.397 blocks.py:2200(to_native_types)
      301    0.002    0.000  101.820    0.338 format.py:1300(_get_format_datetime64_from_values)
      301   96.018    0.319  101.818    0.338 format.py:1249(_is_dates_only)
      301   20.040    0.067   20.040    0.067 {pandas._libs.writers.write_csv_rows}
      301   17.525    0.058   17.580    0.058 {pandas._libs.tslib.format_array_from_datetime}
      602   13.766    0.023   13.766    0.023 {method 'astype' of 'numpy.ndarray' objects}
      301    0.005    0.000    8.095    0.027 blocks.py:1982(to_native_types)
     1204    5.798    0.005    5.798    0.005 {method 'reduce' of 'numpy.ufunc' objects}
      302    0.003    0.000    5.796    0.019 base.py:1023(to_native_types)
      302    0.005    0.000    5.777    0.019 base.py:1049(_format_native_types)
      301    0.002    0.000    5.749    0.019 {method 'sum' of 'numpy.ndarray' objects}
      301    0.001    0.000    5.747    0.019 _methods.py:34(_sum)
      603    0.001    0.000    0.089    0.000 missing.py:25(isna)
      603    0.003    0.000    0.088    0.000 missing.py:105(_isna_new)
      [snip]

cProfile of my script with 0.24.0

(I took the liberty of removing paths/username information)

stats:         3434284 function calls (3423766 primitive calls) in 8642.079 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000 8642.079 8642.079 ../lib/python3.7/site-packages/pandas/core/generic.py:2873(to_csv)
        1    0.000    0.000 8642.077 8642.077 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:130(save)
        1    0.042    0.042 8642.002 8642.002../lib/python3.7/site-packages/pandas/io/formats/csvs.py:272(_save)
     5242    3.588    0.001 8641.950    1.649../lib/python3.7/site-packages/pandas/io/formats/csvs.py:290(_save_chunk)
     5242    0.165    0.000 8294.922    1.582 ../lib/python3.7/site-packages/pandas/core/internals/blocks.py:2200(to_native_types)
     5242    0.044    0.000 8203.096    1.565 ../lib/python3.7/site-packages/pandas/io/formats/format.py:1300(_get_format_datetime64_from_values)
     5242 7757.167    1.480 8203.052    1.565 ../lib/python3.7/site-packages/pandas/io/formats/format.py:1249(_is_dates_only)
    20968  445.184    0.021  445.184    0.021 {method 'reduce' of 'numpy.ufunc' objects}
     5242    0.041    0.000  444.994    0.085 {method 'sum' of 'numpy.ndarray' objects}
     5242    0.023    0.000  444.953    0.085 ../lib/python3.7/site-packages/numpy/core/_methods.py:34(_sum)
     5242  212.677    0.041  212.677    0.041 {pandas._libs.writers.write_csv_rows}
    15726  122.216    0.008  122.216    0.008 {method 'astype' of 'numpy.ndarray' objects}
     5242   90.999    0.017   91.300    0.017 {pandas._libs.tslib.format_array_from_datetime}
     5242    0.570    0.000   79.782    0.015 ../lib/python3.7/site-packages/pandas/core/internals/blocks.py:1982(to_native_types)
    10484    0.200    0.000   43.947    0.004 ../lib/python3.7/site-packages/pandas/core/internals/blocks.py:730(to_native_types)
     5243    0.031    0.000    7.008    0.001 ../lib/python3.7/site-packages/pandas/core/indexes/base.py:1023(to_native_types)
    ...the rest was removed...

`%prun` for pandas `0.23.4`:

        71818 function calls (71517 primitive calls) in 52.371 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   52.371   52.371 {built-in method builtins.exec}
        1    0.000    0.000   52.371   52.371 <string>:1(<module>)
        1    0.000    0.000   52.371   52.371 frame.py:1653(to_csv)
        1    0.000    0.000   52.370   52.370 csvs.py:123(save)
        1    0.002    0.002   51.546   51.546 csvs.py:270(_save)
      301    0.245    0.001   51.543    0.171 csvs.py:288(_save_chunk)
      301   19.846    0.066   19.846    0.066 {pandas._libs.writers.write_csv_rows}
      301    0.007    0.000   18.031    0.060 internals.py:2743(to_native_types)
      301   17.654    0.059   17.698    0.059 {pandas._libs.tslib.format_array_from_datetime}
      602   13.309    0.022   13.309    0.022 {method 'astype' of 'numpy.ndarray' objects}
      301    0.006    0.000    7.480    0.025 internals.py:2052(to_native_types)
      302    0.002    0.000    5.940    0.020 base.py:2408(to_native_types)
      302    0.004    0.000    5.925    0.020 base.py:2434(_format_native_types)
        2    0.823    0.411    0.823    0.411 {method 'close' of '_io.TextIOWrapper' objects}
      301    0.001    0.000    0.293    0.001 format.py:1291(_get_format_datetime64_from_values)
      301    0.249    0.001    0.292    0.001 format.py:1243(_is_dates_only)
      603    0.001    0.000    0.081    0.000 missing.py:32(isna)
      603    0.003    0.000    0.080    0.000 missing.py:112(_isna_new)
      603    0.033    0.000    0.072    0.000 missing.py:189(_isna_ndarraylike)
     1204    0.067    0.000    0.067    0.000 {method 'reduce' of 'numpy.ufunc' objects}
      903    0.001    0.000    0.044    0.000 _methods.py:42(_any)
      302    0.026    0.000    0.026    0.000 {method 'ravel' of 'numpy.ndarray' objects}
      301    0.001    0.000    0.024    0.000 {method 'sum' of 'numpy.ndarray' objects}

cProfile of my script with 0.23.4

stats:         2170825 function calls (2165580 primitive calls) in 442.783 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  442.783  442.783 ../lib/python3.7/site-packages/pandas/core/frame.py:1653(to_csv)
        1    0.000    0.000  442.782  442.782 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:123(save)
        1    0.041    0.041  442.781  442.781 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:270(_save)
     5242    2.815    0.001  442.730    0.084 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:288(_save_chunk)
     5242  216.105    0.041  216.105    0.041 {pandas._libs.writers.write_csv_rows}
    15726  125.188    0.008  125.188    0.008 {method 'astype' of 'numpy.ndarray' objects}
     5242    0.121    0.000   90.872    0.017 ../lib/python3.7/site-packages/pandas/core/internals.py:2743(to_native_types)
     5242   88.329    0.017   88.590    0.017 {pandas._libs.tslib.format_array_from_datetime}
     5242    0.606    0.000   79.149    0.015 ../lib/python3.7/site-packages/pandas/core/internals.py:2052(to_native_types)
    10484    0.188    0.000   46.683    0.004 ../lib/python3.7/site-packages/pandas/core/internals.py:755(to_native_types)
     5243    0.026    0.000    7.077    0.001 ../lib/python3.7/site-packages/pandas/core/indexes/base.py:2408(to_native_types)
     5243    0.039    0.000    6.497    0.001 ../lib/python3.7/site-packages/pandas/core/indexes/base.py:2434(_format_native_types)

What is Expected

No degradation in speed.

Output of `pd.show_versions()` for 0.24.0

INSTALLED VERSIONS

commit: None python: 3.7.0.final.0 python-bits: 64 OS: Linux OS-release: 4.14.72-68.55.amzn1.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.24.0 pytest: None pip: 10.0.1 setuptools: 39.0.1 Cython: None numpy: 1.16.2 scipy: None pyarrow: None xarray: None IPython: 7.3.0 sphinx: 1.8.4 patsy: None dateutil: 2.8.0 pytz: 2018.9 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: None openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml.etree: None bs4: None html5lib: None sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None gcsfs: None

Output of `pd.show_versions()` for 0.23.4

INSTALLED VERSIONS

pandas: 0.23.4 pytest: 3.7.1 pip: 10.0.1 setuptools: 39.0.1 Cython: None numpy: 1.16.1 scipy: 1.2.1 pyarrow: None xarray: None IPython: 7.3.0 sphinx: 1.8.4 patsy: None dateutil: 2.8.0 pytz: 2018.9 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: None openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: None sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None

Issue Analytics

State:
Created 5 years ago
Reactions:1
Comments:5 (2 by maintainers)

Top GitHub Comments

1reaction

hksonngancommented, Mar 15, 2019

SnakeViz of cProfile in Windows

0.23.4
master

0reactions

alepcommented, Mar 13, 2019

Also I’ve observed this in OS X and Linux

Top Results From Across the Web

Python pandas DataFrame.to_csv append becomes gradually ...

From looking at your code: Firstly there's a lot of copying here and a lot ... axis=1, inplace=True) #Saving DataFrame in CSV: if...

Stop persisting pandas data frames in CSVs

Since pandas is using numpy arrays as its backend structures, the int s and float s can be differentiated into more memory efficient...

IO tools (text, CSV, HDF5, …) — pandas 1.5.2 documentation

Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of...

The Pandas DataFrame: Make Working With Data Delightful

Pandas DataFrame Labels as Sequences; Data as NumPy Arrays; Data Types ... Pandas DataFrames can sometimes be very large, making it impractical to...

Loading CSV data from Cloud Storage | BigQuery

CSV files do not support nested or repeated data. Remove byte order mark (BOM) characters. They might cause unexpected issues. If you use...