Results of profiling table creation
Since there has been some concern about data size limits, I thought I would do a little profiling. This isn’t a TODO, I just wanted these notes to be somewhere they could easily be discussed.
For these tests I used the MOMA artworks dataset. The version I have locally is a 36.2 MB CSV. It has 126,118 rows and 14 columns. The TypeTester infers this dataset to have 1 Boolean column, 1 Number, 1 Date, and 11 Text.
With cProfile running, loading this dataset takes:
- 90.0s with no args
- 42.0s with limit=100
- 39.8s with all columns manually typed (to the same types TypeTester would infer)
- 26.5s with types=[agate.Boolean(), agate.Number(), agate.Text()]
- 6.6s with all columns manually typed to Text()
(These were all single tests, so useful as order-of-magnitude measures only.)
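For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a cProfile harness that sorts by internal time, like the listings in this issue. It is not the script I actually ran: `profile_call` and `stand_in_workload` are hypothetical names, and the workload is a placeholder you would swap for the real table load.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=20, **kwargs):
    """Run func under cProfile; return (result, report sorted by internal time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Capture the report as text instead of printing straight to stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("tottime").print_stats(top)
    return result, buffer.getvalue()


def stand_in_workload(n):
    # Hypothetical placeholder: replace with the call that loads the CSV.
    return sum(len(str(i).strip()) for i in range(n))


if __name__ == "__main__":
    _, report = profile_call(stand_in_workload, 100_000)
    print(report)
```

Keep in mind that the profiler itself adds overhead, so the absolute times are higher than an unprofiled run would be.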
What this tells me is that the single slowest part of the process is type testing—especially testing dates. Date/time testing is abominably slow, which isn’t surprising. Number casting (via babel) is also slower than I would like.
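To make that reasoning explicit, the totals above can be subtracted from one another. This is rough arithmetic on single runs, not a careful benchmark; the dictionary keys are just labels I made up for the five cases.

```python
# Wall-clock totals from the list above, in seconds (one run each).
timings = {
    "no_args": 90.0,
    "limit_100": 42.0,
    "manual_types": 39.8,
    "no_date_types": 26.5,
    "all_text": 6.6,
}

# Extra cost of inferring types from every row instead of supplying them.
full_inference = timings["no_args"] - timings["manual_types"]

# Same comparison when the tester only looks at the first 100 rows.
limited_inference = timings["limit_100"] - timings["manual_types"]

# Cost of casting to Boolean/Number/Date rather than leaving everything as Text.
typed_casting = timings["manual_types"] - timings["all_text"]

print(f"full inference:    {full_inference:.1f}s")
print(f"limited inference: {limited_inference:.1f}s")
print(f"typed casting:     {typed_casting:.1f}s")
```

The full-inference figure lines up with the cumulative time of type_tester.py:73(run) in the no-args profile (50.775s), and the typed-casting figure is close to the combined cumulative time of date.py:44(cast) and number.py:32(cast) in the manually typed profile (22.652s + 9.580s).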
Profile with no args
80836723 function calls (80828678 primitive calls) in 89.947 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
2554619 11.056 0.000 11.056 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
364775 5.899 0.000 15.530 0.000 __init__.py:394(parseDate)
364727 5.110 0.000 49.711 0.000 __init__.py:1775(parse)
1094311 4.161 0.000 5.456 0.000 {method 'sub' of '_sre.SRE_Pattern' objects}
1459004 2.960 0.000 8.410 0.000 __init__.py:284(context)
252259 2.733 0.000 18.084 0.000 number.py:32(cast)
1765706 2.637 0.000 3.570 0.000 text.py:21(cast)
1 1.871 1.871 50.775 50.775 type_tester.py:73(run)
252282 1.726 0.000 39.993 0.000 date.py:44(cast)
364727 1.720 0.000 55.101 0.000 __init__.py:1727(parseDT)
1585136 1.679 0.000 10.092 0.000 {built-in method builtins.next}
2918079 1.596 0.000 2.355 0.000 context.py:22(__stack)
9004705 1.587 0.000 1.587 0.000 {built-in method builtins.isinstance}
729502 1.554 0.000 1.796 0.000 contextlib.py:37(__init__)
364727 1.495 0.000 1.495 0.000 {method 'timetuple' of 'datetime.datetime' objects}
756780 1.432 0.000 2.300 0.000 core.py:963(get_locale_identifier)
9153148 1.385 0.000 1.385 0.000 {method 'strip' of 'str' objects}
1891770 1.276 0.000 36.112 0.000 __init__.py:159(<genexpr>)
252259 1.236 0.000 14.368 0.000 numbers.py:418(parse_decimal)
756777 1.172 0.000 6.341 0.000 core.py:205(parse)
Profile with limit
34234329 function calls (34226284 primitive calls) in 42.001 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
854389 3.946 0.000 3.946 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
121885 2.448 0.000 6.573 0.000 __init__.py:394(parseDate)
121837 2.112 0.000 20.110 0.000 __init__.py:1775(parse)
1387652 2.038 0.000 2.790 0.000 text.py:21(cast)
365641 1.736 0.000 2.247 0.000 {method 'sub' of '_sre.SRE_Pattern' objects}
487444 1.583 0.000 3.730 0.000 __init__.py:284(context)
126241 1.493 0.000 10.031 0.000 number.py:32(cast)
1891770 1.299 0.000 38.425 0.000 __init__.py:159(<genexpr>)
613576 1.267 0.000 4.999 0.000 {built-in method builtins.next}
1 1.191 1.191 40.132 40.132 __init__.py:88(__init__)
126264 1.023 0.000 23.810 0.000 date.py:44(cast)
4406063 0.800 0.000 0.800 0.000 {built-in method builtins.isinstance}
378726 0.792 0.000 1.281 0.000 core.py:963(get_locale_identifier)
4779102 0.735 0.000 0.735 0.000 {method 'strip' of 'str' objects}
126241 0.717 0.000 8.034 0.000 numbers.py:418(parse_decimal)
121837 0.703 0.000 22.305 0.000 __init__.py:1727(parseDT)
243722 0.694 0.000 0.794 0.000 contextlib.py:37(__init__)
974959 0.655 0.000 0.941 0.000 context.py:22(__stack)
378723 0.637 0.000 3.533 0.000 core.py:205(parse)
504964 0.590 0.000 0.806 0.000 localedata.py:191(__getitem__)
Profile with no date/time types
36417971 function calls (36411828 primitive calls) in 26.463 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1765674 2.614 0.000 3.515 0.000 text.py:21(cast)
252259 2.324 0.000 16.166 0.000 number.py:32(cast)
756780 1.877 0.000 2.653 0.000 core.py:963(get_locale_identifier)
1 1.175 1.175 10.849 10.849 type_tester.py:73(run)
7315416 1.062 0.000 1.062 0.000 {method 'strip' of 'str' objects}
1891770 1.032 0.000 12.961 0.000 __init__.py:159(<genexpr>)
252259 0.998 0.000 12.911 0.000 numbers.py:418(parse_decimal)
1009036 0.997 0.000 1.386 0.000 localedata.py:191(__getitem__)
756777 0.989 0.000 6.026 0.000 core.py:205(parse)
5817937 0.980 0.000 0.980 0.000 {built-in method builtins.isinstance}
1 0.912 0.912 24.931 24.931 __init__.py:88(__init__)
126132 0.910 0.000 0.913 0.000 {built-in method builtins.next}
252262 0.861 0.000 1.586 0.000 core.py:888(parse_locale)
761689/761680 0.748 0.000 0.749 0.000 {method 'join' of 'str' objects}
504518 0.718 0.000 2.503 0.000 core.py:345(_data)
252249 0.716 0.000 0.965 0.000 boolean.py:38(cast)
252259 0.612 0.000 2.224 0.000 core.py:124(__init__)
504518 0.600 0.000 3.847 0.000 core.py:528(number_symbols)
2524352 0.506 0.000 0.506 0.000 {method 'lower' of 'str' objects}
252259 0.486 0.000 4.397 0.000 numbers.py:199(get_group_symbol)
Profile with manual column types
34154521 function calls (34147606 primitive calls) in 39.783 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
850953 3.867 0.000 3.867 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
121545 2.570 0.000 6.532 0.000 __init__.py:394(parseDate)
1387298 2.067 0.000 2.786 0.000 text.py:21(cast)
121545 1.969 0.000 18.787 0.000 __init__.py:1775(parse)
364765 1.489 0.000 1.945 0.000 {method 'sub' of '_sre.SRE_Pattern' objects}
486180 1.481 0.000 3.529 0.000 __init__.py:284(context)
126118 1.382 0.000 9.580 0.000 number.py:32(cast)
612312 1.222 0.000 4.759 0.000 {built-in method builtins.next}
1891770 1.142 0.000 36.805 0.000 __init__.py:159(<genexpr>)
1 1.082 1.082 38.096 38.096 __init__.py:88(__init__)
126118 0.936 0.000 22.652 0.000 date.py:44(cast)
378357 0.776 0.000 1.212 0.000 core.py:963(get_locale_identifier)
4399429 0.750 0.000 0.750 0.000 {built-in method builtins.isinstance}
4774246 0.690 0.000 0.690 0.000 {method 'strip' of 'str' objects}
121545 0.686 0.000 0.686 0.000 {method 'timetuple' of 'datetime.datetime' objects}
972360 0.652 0.000 0.905 0.000 context.py:22(__stack)
126118 0.651 0.000 7.715 0.000 numbers.py:418(parse_decimal)
121545 0.634 0.000 20.878 0.000 __init__.py:1727(parseDT)
378354 0.598 0.000 3.278 0.000 core.py:205(parse)
243090 0.563 0.000 0.644 0.000 contextlib.py:37(__init__)
Profile with all text columns
9788420 function calls (9786604 primitive calls) in 6.619 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1765652 2.131 0.000 2.926 0.000 text.py:21(cast)
1 1.155 1.155 5.041 5.041 __init__.py:88(__init__)
126120 0.908 0.000 0.908 0.000 {built-in method builtins.next}
1891770 0.781 0.000 3.707 0.000 __init__.py:159(<genexpr>)
1765929 0.351 0.000 0.351 0.000 {method 'lower' of 'str' objects}
1765653 0.223 0.000 0.223 0.000 {method 'strip' of 'str' objects}
1767818 0.222 0.000 0.222 0.000 {built-in method builtins.isinstance}
1 0.175 0.175 0.201 0.201 {method 'readlines' of '_io._IOBase' objects}
126120 0.137 0.000 0.137 0.000 mapped_sequence.py:31(__init__)
1 0.134 0.134 6.383 6.383 from_csv.py:10(from_csv)
1 0.073 0.073 6.619 6.619 test.py:3(<module>)
3300 0.057 0.000 0.057 0.000 {method 'join' of 'str' objects}
126120 0.042 0.000 0.950 0.000 csv_py3.py:32(__next__)
163 0.035 0.000 0.035 0.000 {method 'read' of '_io.FileIO' objects}
254939/254837 0.023 0.000 0.023 0.000 {built-in method builtins.len}
4425 0.022 0.000 0.022 0.000 {built-in method _codecs.utf_8_decode}
128700 0.020 0.000 0.020 0.000 {method 'append' of 'list' objects}
163 0.014 0.000 0.014 0.000 {built-in method marshal.loads}
12 0.009 0.001 0.014 0.001 {built-in method _imp.create_dynamic}
961 0.008 0.000 0.008 0.000 {built-in method posix.stat}
Okay, I think I’ve done as much on this as I can sanely do:
Final results for artworks dataset:
- 65.7s with no args (90.0s before)
- 24.8s with limit=100 (42.0s before)
- 25.0s with all columns manually typed (39.8s before)
- 10.7s with types=[agate.Boolean(), agate.Number(), agate.Text()] (26.5s before)
- 5.6s with all columns manually typed to Text() (6.6s before)
@nbedi Doh, yes. Thanks for catching.
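For a quick read on the before/after timings listed above, the relative improvement can be computed directly. A small sketch; `speedup` and `percent_saved` are hypothetical helper names, and the figures are single runs, so treat the results as approximate.

```python
# (label, seconds before, seconds after), copied from the final results.
results = [
    ("no args", 90.0, 65.7),
    ("limit=100", 42.0, 24.8),
    ("all columns manually typed", 39.8, 25.0),
    ("Boolean/Number/Text only", 26.5, 10.7),
    ("all columns Text", 6.6, 5.6),
]


def speedup(before, after):
    """How many times faster the new run is."""
    return before / after


def percent_saved(before, after):
    """Share of the original runtime that was eliminated, in percent."""
    return 100.0 * (before - after) / before


for label, before, after in results:
    print(
        f"{label:28s} {speedup(before, after):.2f}x faster, "
        f"{percent_saved(before, after):.0f}% less time"
    )
```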