Slow validation of large tables. Validate columns rather than cells?

I was initially surprised by how slow content checks are for large tables, until I realized that checks are performed row by row and cell by cell, rather than by column. Have you considered taking advantage of the fast, vectorized operations already available in Python, R, etc. to speed up validation?

The example below features a single table with one integer field and one million rows. type-or-format-error takes 19 seconds, whereas the equivalent (?) vectorized operations in Python and R below (which read the data as strings and parse them to integers) take a fraction of a second.

goodtables-py: 18.895 seconds

import goodtables
report = goodtables.validate(
  'datapackage.json',
  row_limit=1000000,
  checks=['type-or-format-error'])
report['time']

pandas: 0.629 seconds

import pandas
import time
start = time.time()
df = pandas.read_csv('resource.csv', dtype=str)
result = df.id.astype(int)
time.time() - start

readr (R): 0.207 seconds

start <- Sys.time()
df <- readr::read_csv('~/repos/temp/resource.csv', col_types = 'c')
result <- readr::parse_integer(df$id)
Sys.time() - start

data.table + readr (R): 0.135 seconds

start <- Sys.time()
df <- data.table::fread(
  '~/repos/temp/resource.csv',
  stringsAsFactors = FALSE, colClasses = list(character = 'id'))
result <- readr::parse_integer(df$id)
Sys.time() - start

Files

datapackage.json

{
  "name": "package",
  "profile": "tabular-data-package",
  "resources": [
    {
      "name": "resource",
      "profile": "tabular-data-resource",
      "path": "resource.csv",
      "schema": {
        "fields": [
          {
            "name": "id",
            "type": "integer"
          }
        ]
      }
    }
  ]
}

resource.csv (1 million rows)

id
1
2
3
...
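
For reference, the test file can be regenerated with a short script along these lines (the file name and row count come from the issue; using pandas here is just a convenience, not something goodtables requires):

import pandas

# Write a one-column CSV with ids 1..1,000,000, matching resource.csv above.
pandas.DataFrame({"id": range(1, 1000001)}).to_csv("resource.csv", index=False)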

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 2
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
ezwelty commented, Oct 21, 2019

@roll I’m not familiar enough with the stack requirements for reading, parsing, and streaming to say for sure whether pandas would provide much benefit to those steps. Where pandas clearly shines is where it can make use of numpy, namely casting field values and checking field and table constraints on numeric fields. There are many Table Schema field types and formats that cannot be handled by pandas.read_csv internally, which is why goodtables-pandas-py first reads a table as a DataFrame of string fields, then parses and casts field values as a second step.
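
A minimal sketch of that two-step approach (read everything as strings, then cast per the schema) might look like the following. The field name id and the integer type come from the example above; the helper function is hypothetical, not the actual goodtables-pandas-py API.

import pandas

def cast_integer_field(values):
    # Hypothetical second step: cast a string column to integers,
    # flagging values that fail to parse instead of raising.
    parsed = pandas.to_numeric(values, errors="coerce")  # NaN where invalid
    invalid = parsed.isna() & values.notna()
    if invalid.any():
        print(invalid.sum(), "invalid integer value(s) in column")
    return parsed.astype("Int64")  # nullable integer dtype

# Step 1: read the whole table as strings so nothing is coerced early.
df = pandas.read_csv("resource.csv", dtype=str)
# Step 2: parse/cast each field according to its Table Schema type.
df["id"] = cast_integer_field(df["id"])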

In the example below, pandas is slower than a plain Python loop at extracting a regex group (for bareNumber: false) prior to casting to integer, because str.extract is just a wrapper that returns the results in a new pandas Series. However, casting the extracted strings to int is faster, and checking the resulting integers against minimum and maximum constraints is much faster.

import pandas
import re
import timeit

x = pandas.Series(range(10000000)).astype(str)

# ---- Extract integer characters from string (bareNumber: false) ----

def extract_integer():
    pattern = re.compile(r"(-?[0-9]+)")
    result = []
    for i, xi in x.iteritems():
        result.append(pattern.findall(xi)[0])

def extract_integer_vectorized():
    pattern = re.compile(r"(-?[0-9]+)")
    result = x.str.extract(pattern, expand=False)

timeit.timeit(extract_integer, number=1)
# 9.80 s
timeit.timeit(extract_integer_vectorized, number=1)
# 20.09 s (slower!)

# ---- Cast string to integer ----

def parse_integer():
    result = []
    for i, xi in x.iteritems():
        result.append(int(xi))

def parse_integer_vectorized():
    result = x.astype(int)

timeit.timeit(parse_integer, number=1)
# 5.67 s
timeit.timeit(parse_integer_vectorized, number=1)
# 1.25 s (faster!)

# ---- Check integer constraints ----

x = pandas.Series(range(10000000))

def check_integer():
    result = []
    for i, xi in x.iteritems():
        result.append(xi > 0 and xi < 9999999)

def check_integer_vectorized():
    result = (x > 0) & (x < 9999999)

timeit.timeit(check_integer, number=1)
# 3.77 s
timeit.timeit(check_integer_vectorized, number=1)
# 0.04 s (much faster!)

As you said, you’ll want to profile goodtables up the call chain to see whether the slowdowns are in reading files, parsing strings, casting strings to values, or checking constraints. If improvements can’t be made upstream, having a faster subset of goodtables seems like a good plan. Perhaps the functionality for reading and casting Tabular Resources to a pandas DataFrame could exist in its own package (i.e. a faster version of tableschema-pandas-py).
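
As a starting point for that profiling, a sketch like the one below (cProfile from the Python standard library wrapped around the same validate call as at the top of the issue) would show where the time goes; the output file name and the number of printed rows are arbitrary.

import cProfile
import pstats

import goodtables

# Profile the whole validation to see whether time is spent reading,
# parsing, casting, or checking constraints.
cProfile.run(
    "goodtables.validate('datapackage.json', row_limit=1000000, "
    "checks=['type-or-format-error'])",
    "validate.prof",
)
pstats.Stats("validate.prof").sort_stats("cumulative").print_stats(20)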

Currently, goodtables-pandas-py reports one error with an array of the (unique) invalid values, but it would be easy enough to report (DataFrame) row numbers. With large datasets, I don’t think it makes sense to add an error for each invalid value, since the error list can become gigantic and completely unreadable, but I suppose that could be done to fit the existing report format.
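
A rough sketch of what that row-number reporting could look like, with all names hypothetical and the error shape simplified compared to the real report format:

import pandas

def integer_field_error(values, max_rows=10):
    # Hypothetical error entry: unique invalid values plus a capped list
    # of offending DataFrame row numbers.
    parsed = pandas.to_numeric(values, errors="coerce")
    invalid = parsed.isna() & values.notna()
    return {
        "invalid-values": values[invalid].unique().tolist(),
        "row-numbers": invalid[invalid].index.tolist()[:max_rows],
        "invalid-count": int(invalid.sum()),
    }

df = pandas.read_csv("resource.csv", dtype=str)
print(integer_field_error(df["id"]))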

0 reactions
roll commented, Apr 30, 2020

Hi @ezwelty

I’m closing this since we can’t move to column-based validation in general: the whole Frictionless Data (FD) infrastructure is based on row streams.

It doesn’t mean we can’t improve the situation:

  • in #341 I’m going to improve performance as much as possible. I ran some tests (for another project in JavaScript, but that doesn’t matter) which showed that even row-based validation can be extremely fast (comparable to pandas.read_csv speed) if the checks/cast functions are optimized (see the sketch after this list)
  • in #341 we will also rebase the implementation on multi-processing
  • I’ve added a link to goodtables-pandas-py to the readme as a faster pandas-based alternative
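
To illustrate the point about optimized row-based checks, here is a small sketch, not the actual goodtables internals: the cast function for each field is resolved once up front, so the per-cell work inside the row stream is a single function call.

import csv

CASTS = {"integer": int, "number": float, "string": str}

def validate_rows(path, field_types):
    # Resolve the cast function per field once, not per cell.
    casts = [CASTS[t] for t in field_types]
    errors = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row_number, row in enumerate(reader, start=2):
            for cast, cell in zip(casts, row):
                try:
                    cast(cell)
                except ValueError:
                    errors.append((row_number, cell))
    return errors

print(len(validate_rows("resource.csv", ["integer"])))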
