Slow validation of large tables. Validate columns rather than cells?
I was initially surprised by how slow content checks are for large tables, until I realized that checks are performed row by row and cell by cell, rather than by column. Have you considered taking advantage of the fast, vectorized operations already available in Python, R, etc. to speed up validation?
The example below features a single table with one integer field and one million rows. The type-or-format-error check takes ~19 seconds, whereas the equivalent (?) vectorized operations in Python and R below (which read the data as strings and parse them to integers) take a fraction of a second.
goodtables-py: 18.895 seconds
import goodtables

report = goodtables.validate(
    'datapackage.json',
    row_limit=1000000,
    checks=['type-or-format-error'])
report['time']
pandas: 0.629 seconds
import pandas
import time
start = time.time()
df = pandas.read_csv('resource.csv', dtype=str)
result = df.id.astype(int)
time.time() - start
readr: 0.207 seconds
start <- Sys.time()
df <- readr::read_csv('~/repos/temp/resource.csv', col_types = 'c')
result <- readr::parse_integer(df$id)
Sys.time() - start
data.table + readr: 0.135 seconds
start <- Sys.time()
df <- data.table::fread(
  '~/repos/temp/resource.csv',
  stringsAsFactors = FALSE, colClasses = list(character = 'id'))
result <- readr::parse_integer(df$id)
Sys.time() - start
Files
datapackage.json
{
  "name": "package",
  "profile": "tabular-data-package",
  "resources": [
    {
      "name": "resource",
      "profile": "tabular-data-resource",
      "path": "resource.csv",
      "schema": {
        "fields": [
          {
            "name": "id",
            "type": "integer"
          }
        ]
      }
    }
  ]
}
resource.csv (1 million rows)
id
1
2
3
...
Top GitHub Comments
@roll I’m not familiar enough with the stack requirements for reading, parsing, and streaming to say for sure whether pandas would provide much benefit to those steps. Where pandas clearly shines is where it can make use of numpy, namely casting field values and checking field and table constraints on numeric fields. There are many Table Schema features that cannot be handled by pandas.read_csv internally, hence why goodtables-pandas-py first reads a table as a DataFrame of string fields, then parses and casts field values as a second step.
In the example below, pandas is slower at extracting a regex group (for bareNumber: false) for casting to integer than a plain Python version, because it is just a wrapper that returns the results in a new pandas Series. However, casting the extracted strings to int is faster, and checking the resulting integers against minimum and maximum constraints is much faster.
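For a rough sense of that comparison, here is a minimal sketch (the data, column name, regex, and constraint values are assumptions for illustration, not the original example; timings will vary):

import re
import time
import pandas

# One million string values with a trailing unit, so the bare number has to
# be extracted before casting (as with bareNumber: false).
df = pandas.DataFrame({'id': [f'{i} units' for i in range(1, 1000001)]})

# Regex-group extraction: pandas .str.extract vs. a plain-Python loop.
start = time.time()
extracted = df['id'].str.extract(r'(\d+)', expand=False)
print('pandas str.extract:', time.time() - start)

start = time.time()
pattern = re.compile(r'(\d+)')
extracted_list = [pattern.search(x).group(1) for x in df['id']]
print('plain python re:', time.time() - start)

# Casting the extracted strings and checking minimum/maximum constraints is
# where the vectorized (numpy-backed) operations pay off.
start = time.time()
values = extracted.astype(int)
valid = values.between(1, 1000000)
print('cast + constraints:', time.time() - start)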
As you said, you’ll want to profile goodtables up the chain of command to see whether the slowdowns are in reading files, parsing strings, casting strings to values, or checking constraints. If improvements can’t be made upstream, having a faster subset of goodtables seems like a good plan. Perhaps the functionality for reading and casting Tabular Resources to a pandas DataFrame could exist in its own package (i.e. a faster version of tableschema-pandas-py).
Currently, goodtables-pandas-py reports one error with an array of the (unique) invalid values, but it would be easy enough to report (DataFrame) row numbers. With large datasets, I don’t think it makes sense to add an error for each invalid value, since the error list can become gigantic and completely unreadable, but I suppose that could be done to fit the existing report format.
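For what reporting row numbers could look like, a hedged sketch (the approach and names are illustrative, not the actual goodtables-pandas-py implementation; integer-specific checks are simplified):

import pandas

# Parse the 'id' column in one vectorized pass and collect the row numbers of
# cells that fail the numeric cast, rather than emitting one error per cell.
df = pandas.read_csv('resource.csv', dtype=str)
parsed = pandas.to_numeric(df['id'], errors='coerce')
invalid_rows = (df.index[parsed.isna()] + 2).tolist()  # +2: header row, 1-based
print(len(invalid_rows), 'invalid rows, e.g.', invalid_rows[:10])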
Hi @ezwelty,
I’m closing it as we can’t move to column-based validation in general, as the whole FD infrastructure is based on row-streams.
It doesn’t mean we can’t improve the situation:
- checks/cast-functions can be optimized (to get closer to pandas.read_csv speed), as in the sketch below
- goodtables-pandas-py can be added to the readme as a faster, Pandas-specific alternative
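As a rough illustration of the first point, a vectorized cast-function might look something like this (names and signature are assumed, not goodtables' actual API):

import pandas

def cast_integer_column(cells):
    # Cast a whole column of string cells at once with pandas/numpy instead of
    # calling a cast function once per cell.
    series = pandas.Series(cells, dtype=str)
    parsed = pandas.to_numeric(series, errors='coerce')
    errors = parsed.isna() | (parsed % 1 != 0)  # non-numeric or non-integer
    return parsed, errors

parsed, errors = cast_integer_column(['1', '2', 'x', '3.5'])
print(parsed[~errors].tolist())  # [1.0, 2.0]
print(errors.tolist())           # [False, False, True, True]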