Results of profiling table creation

Since there has been some concern about data size limits, I thought I would do a little profiling. This isn’t a TODO, I just wanted these notes to be somewhere where they could easily be discussed.

For these tests I used the MOMA artworks dataset. The version I have locally is a 36.2 MB CSV. It has 126,118 rows and 14 columns. The TypeTester infers this dataset to have 1 Boolean column, 1 Number, 1 Date, and 11 Text.

With cProfile running, loading this dataset takes:

  • 90.0s with no args
  • 42.0s with limit=100
  • 39.8s with all columns manually typed (to the same types TypeTester would infer)
  • 26.5s with types=[agate.Boolean(), agate.Number(), agate.Text()] (no date/time types tested)
  • 6.6s with all columns manually typed to Text()

(These were all single tests, so useful as order-of-magnitude measures only.)
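For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of capturing a profile sorted by internal time, which is the ordering used in the dumps in this issue. It uses only the standard library. `load_rows` is a hypothetical stand-in for the table load being measured, and `profile_call` is a helper name made up for this example, not part of agate.

```python
import cProfile
import io
import pstats


def load_rows(n):
    """Hypothetical stand-in for the load being measured."""
    return [str(i).strip().lower() for i in range(n)]


def profile_call(func, *args, top=20):
    """Run func under cProfile; return (result, text report).

    The report is sorted by internal time ("tottime") and limited
    to the `top` most expensive entries.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("tottime").print_stats(top)
    return result, buffer.getvalue()


rows, report = profile_call(load_rows, 1000)
print(report)
```

Swapping `load_rows` for the real load call, once per configuration, should give reports in the same layout as the ones in this issue.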

What this tells me is that the single slowest part of the process is type testing—especially testing dates. Date/time testing is abominably slow, which isn’t surprising. Number casting (via babel) is also slower than I would like.
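To illustrate why inference can dominate, here is a toy sketch of type testing by elimination. This is not agate's TypeTester: the candidate checks, the `infer_type` name and the single date format are all assumptions made for the example. The point is only that a column which really is a date must have every sampled value run through the (expensive) date check, while a text column drops out after the first value, and that a row limit caps the number of checks.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def is_boolean(value):
    return value.strip().lower() in {"true", "false"}


def is_number(value):
    try:
        Decimal(value.strip())
    except InvalidOperation:
        return False
    return True


def is_date(value):
    # Assumption: one fixed format. Real date parsing is far more permissive
    # and correspondingly slower.
    try:
        datetime.strptime(value.strip(), "%Y-%m-%d")
    except ValueError:
        return False
    return True


CANDIDATES = [("Boolean", is_boolean), ("Number", is_number), ("Date", is_date)]


def infer_type(values, limit=None):
    """Return (type_name, number_of_checks) for one column of strings.

    Candidates are eliminated as soon as a value fails their check.
    If every candidate is eliminated, the column falls back to Text.
    """
    sample = values if limit is None else values[:limit]
    remaining = list(CANDIDATES)
    checks = 0
    for value in sample:
        if not value.strip():
            continue  # empty cells are skipped and rule nothing out
        survivors = []
        for name, check in remaining:
            checks += 1
            if check(value):
                survivors.append((name, check))
        remaining = survivors
        if not remaining:
            break
    return (remaining[0][0] if remaining else "Text"), checks


dates = ["2016-04-06"] * 1000
print(infer_type(dates))             # ('Date', 1002)
print(infer_type(dates, limit=100))  # ('Date', 102)
print(infer_type(["hello"] * 1000))  # ('Text', 3)
```

Note that a limit only reduces the inference work. Every row still has to be cast to the inferred type afterwards, which is consistent with the limit=100 run landing close to the manually typed run.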

Profile with no args

     80836723 function calls (80828678 primitive calls) in 89.947 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  2554619   11.056    0.000   11.056    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
   364775    5.899    0.000   15.530    0.000 __init__.py:394(parseDate)
   364727    5.110    0.000   49.711    0.000 __init__.py:1775(parse)
  1094311    4.161    0.000    5.456    0.000 {method 'sub' of '_sre.SRE_Pattern' objects}
  1459004    2.960    0.000    8.410    0.000 __init__.py:284(context)
   252259    2.733    0.000   18.084    0.000 number.py:32(cast)
  1765706    2.637    0.000    3.570    0.000 text.py:21(cast)
        1    1.871    1.871   50.775   50.775 type_tester.py:73(run)
   252282    1.726    0.000   39.993    0.000 date.py:44(cast)
   364727    1.720    0.000   55.101    0.000 __init__.py:1727(parseDT)
  1585136    1.679    0.000   10.092    0.000 {built-in method builtins.next}
  2918079    1.596    0.000    2.355    0.000 context.py:22(__stack)
  9004705    1.587    0.000    1.587    0.000 {built-in method builtins.isinstance}
   729502    1.554    0.000    1.796    0.000 contextlib.py:37(__init__)
   364727    1.495    0.000    1.495    0.000 {method 'timetuple' of 'datetime.datetime' objects}
   756780    1.432    0.000    2.300    0.000 core.py:963(get_locale_identifier)
  9153148    1.385    0.000    1.385    0.000 {method 'strip' of 'str' objects}
  1891770    1.276    0.000   36.112    0.000 __init__.py:159(<genexpr>)
   252259    1.236    0.000   14.368    0.000 numbers.py:418(parse_decimal)
   756777    1.172    0.000    6.341    0.000 core.py:205(parse)
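
One rough way to read the no-args profile is to express a few cumulative times as a share of the total run. The sketch below uses only figures copied from the table above. Keep in mind that cumtime includes time spent in callees, so these entries overlap (date casting is called from the type tester as well as from the final cast) and the shares do not sum to 100%.

```python
# Total and cumulative times copied from the "no args" profile.
TOTAL_SECONDS = 89.947

CUMULATIVE_SECONDS = {
    "type_tester.py:73(run)": 50.775,
    "date.py:44(cast)": 39.993,
    "number.py:32(cast)": 18.084,
    "text.py:21(cast)": 3.570,
}


def share_of_total(cumtime, total=TOTAL_SECONDS):
    """Cumulative time as a percentage of the whole run."""
    return 100 * cumtime / total


for name, cumtime in CUMULATIVE_SECONDS.items():
    print(f"{name:<24} {cumtime:>7.3f}s  {share_of_total(cumtime):>5.1f}%")
```

The type tester alone accounts for a bit over half of the run, which lines up with the roughly 50 second gap between the no-args run (90.0s) and the manually typed run (39.8s).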

Profile with limit

         34234329 function calls (34226284 primitive calls) in 42.001 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   854389    3.946    0.000    3.946    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
   121885    2.448    0.000    6.573    0.000 __init__.py:394(parseDate)
   121837    2.112    0.000   20.110    0.000 __init__.py:1775(parse)
  1387652    2.038    0.000    2.790    0.000 text.py:21(cast)
   365641    1.736    0.000    2.247    0.000 {method 'sub' of '_sre.SRE_Pattern' objects}
   487444    1.583    0.000    3.730    0.000 __init__.py:284(context)
   126241    1.493    0.000   10.031    0.000 number.py:32(cast)
  1891770    1.299    0.000   38.425    0.000 __init__.py:159(<genexpr>)
   613576    1.267    0.000    4.999    0.000 {built-in method builtins.next}
        1    1.191    1.191   40.132   40.132 __init__.py:88(__init__)
   126264    1.023    0.000   23.810    0.000 date.py:44(cast)
  4406063    0.800    0.000    0.800    0.000 {built-in method builtins.isinstance}
   378726    0.792    0.000    1.281    0.000 core.py:963(get_locale_identifier)
  4779102    0.735    0.000    0.735    0.000 {method 'strip' of 'str' objects}
   126241    0.717    0.000    8.034    0.000 numbers.py:418(parse_decimal)
   121837    0.703    0.000   22.305    0.000 __init__.py:1727(parseDT)
   243722    0.694    0.000    0.794    0.000 contextlib.py:37(__init__)
   974959    0.655    0.000    0.941    0.000 context.py:22(__stack)
   378723    0.637    0.000    3.533    0.000 core.py:205(parse)
   504964    0.590    0.000    0.806    0.000 localedata.py:191(__getitem__)

Profile with no date/time types

         36417971 function calls (36411828 primitive calls) in 26.463 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1765674    2.614    0.000    3.515    0.000 text.py:21(cast)
   252259    2.324    0.000   16.166    0.000 number.py:32(cast)
   756780    1.877    0.000    2.653    0.000 core.py:963(get_locale_identifier)
        1    1.175    1.175   10.849   10.849 type_tester.py:73(run)
  7315416    1.062    0.000    1.062    0.000 {method 'strip' of 'str' objects}
  1891770    1.032    0.000   12.961    0.000 __init__.py:159(<genexpr>)
   252259    0.998    0.000   12.911    0.000 numbers.py:418(parse_decimal)
  1009036    0.997    0.000    1.386    0.000 localedata.py:191(__getitem__)
   756777    0.989    0.000    6.026    0.000 core.py:205(parse)
  5817937    0.980    0.000    0.980    0.000 {built-in method builtins.isinstance}
        1    0.912    0.912   24.931   24.931 __init__.py:88(__init__)
   126132    0.910    0.000    0.913    0.000 {built-in method builtins.next}
   252262    0.861    0.000    1.586    0.000 core.py:888(parse_locale)
761689/761680    0.748    0.000    0.749    0.000 {method 'join' of 'str' objects}
   504518    0.718    0.000    2.503    0.000 core.py:345(_data)
   252249    0.716    0.000    0.965    0.000 boolean.py:38(cast)
   252259    0.612    0.000    2.224    0.000 core.py:124(__init__)
   504518    0.600    0.000    3.847    0.000 core.py:528(number_symbols)
  2524352    0.506    0.000    0.506    0.000 {method 'lower' of 'str' objects}
   252259    0.486    0.000    4.397    0.000 numbers.py:199(get_group_symbol)

Profile with manual column types

         34154521 function calls (34147606 primitive calls) in 39.783 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   850953    3.867    0.000    3.867    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
   121545    2.570    0.000    6.532    0.000 __init__.py:394(parseDate)
  1387298    2.067    0.000    2.786    0.000 text.py:21(cast)
   121545    1.969    0.000   18.787    0.000 __init__.py:1775(parse)
   364765    1.489    0.000    1.945    0.000 {method 'sub' of '_sre.SRE_Pattern' objects}
   486180    1.481    0.000    3.529    0.000 __init__.py:284(context)
   126118    1.382    0.000    9.580    0.000 number.py:32(cast)
   612312    1.222    0.000    4.759    0.000 {built-in method builtins.next}
  1891770    1.142    0.000   36.805    0.000 __init__.py:159(<genexpr>)
        1    1.082    1.082   38.096   38.096 __init__.py:88(__init__)
   126118    0.936    0.000   22.652    0.000 date.py:44(cast)
   378357    0.776    0.000    1.212    0.000 core.py:963(get_locale_identifier)
  4399429    0.750    0.000    0.750    0.000 {built-in method builtins.isinstance}
  4774246    0.690    0.000    0.690    0.000 {method 'strip' of 'str' objects}
   121545    0.686    0.000    0.686    0.000 {method 'timetuple' of 'datetime.datetime' objects}
   972360    0.652    0.000    0.905    0.000 context.py:22(__stack)
   126118    0.651    0.000    7.715    0.000 numbers.py:418(parse_decimal)
   121545    0.634    0.000   20.878    0.000 __init__.py:1727(parseDT)
   378354    0.598    0.000    3.278    0.000 core.py:205(parse)
   243090    0.563    0.000    0.644    0.000 contextlib.py:37(__init__)

Profile with all text columns

         9788420 function calls (9786604 primitive calls) in 6.619 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1765652    2.131    0.000    2.926    0.000 text.py:21(cast)
        1    1.155    1.155    5.041    5.041 __init__.py:88(__init__)
   126120    0.908    0.000    0.908    0.000 {built-in method builtins.next}
  1891770    0.781    0.000    3.707    0.000 __init__.py:159(<genexpr>)
  1765929    0.351    0.000    0.351    0.000 {method 'lower' of 'str' objects}
  1765653    0.223    0.000    0.223    0.000 {method 'strip' of 'str' objects}
  1767818    0.222    0.000    0.222    0.000 {built-in method builtins.isinstance}
        1    0.175    0.175    0.201    0.201 {method 'readlines' of '_io._IOBase' objects}
   126120    0.137    0.000    0.137    0.000 mapped_sequence.py:31(__init__)
        1    0.134    0.134    6.383    6.383 from_csv.py:10(from_csv)
        1    0.073    0.073    6.619    6.619 test.py:3(<module>)
     3300    0.057    0.000    0.057    0.000 {method 'join' of 'str' objects}
   126120    0.042    0.000    0.950    0.000 csv_py3.py:32(__next__)
      163    0.035    0.000    0.035    0.000 {method 'read' of '_io.FileIO' objects}
254939/254837    0.023    0.000    0.023    0.000 {built-in method builtins.len}
     4425    0.022    0.000    0.022    0.000 {built-in method _codecs.utf_8_decode}
   128700    0.020    0.000    0.020    0.000 {method 'append' of 'list' objects}
      163    0.014    0.000    0.014    0.000 {built-in method marshal.loads}
       12    0.009    0.001    0.014    0.001 {built-in method _imp.create_dynamic}
      961    0.008    0.000    0.008    0.000 {built-in method posix.stat}

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 22 (5 by maintainers)

Top GitHub Comments

onyxfish commented, Apr 6, 2016

Okay, I think I’ve done as much on this as I can sanely do:

Final results for artworks dataset:

  • 65.7s with no args (90.0s before)
  • 24.8s with limit=100 (42.0s before)
  • 25.0s with all columns manually typed (39.8s before)
  • 10.7s with types=[agate.Boolean(), agate.Number(), agate.Text()] (26.5s before)
  • 5.6s with all columns manually typed to Text() (6.6s before)
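
The before and after figures in this comment can be turned into relative reductions with a few lines of arithmetic. This is just a sketch over the numbers quoted here; the labels are shortened for display, and since the original timings were single runs the percentages are only indicative.

```python
# (label, seconds before, seconds after), copied from the comment.
TIMINGS = [
    ("no args", 90.0, 65.7),
    ("limit=100", 42.0, 24.8),
    ("all columns manually typed", 39.8, 25.0),
    ("types=[Boolean, Number, Text]", 26.5, 10.7),
    ("all columns typed to Text", 6.6, 5.6),
]


def reduction_percent(before, after):
    """Percentage of the original run time that was removed."""
    return 100 * (before - after) / before


for label, before, after in TIMINGS:
    print(f"{label:<32} {before:>5.1f}s -> {after:>5.1f}s "
          f"({reduction_percent(before, after):.1f}% faster)")
```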

onyxfish commented, Apr 6, 2016

@nbedi Doh, yes. Thanks for catching.
