Slow validation of large tables. Validate columns rather than cells?

I was initially surprised by how slow content checks are for large tables, until I realized that checks are performed row by row and cell by cell, rather than by column. Have you considered taking advantage of the fast, vectorized operations already available in Python, R, etc. to speed up validation?

The example below features a single table with one integer field and one million rows. type-or-format-error takes 19 seconds, whereas the equivalent (?) vectorized operations in Python and R below (which read the data as strings and parse them to integers) take a fraction of a second.

goodtables-py: 18.895 seconds

import goodtables
report = goodtables.validate(
  'datapackage.json',
  row_limit=1000000,
  checks=['type-or-format-error'])
report['time']

pandas: 0.629 seconds

import pandas
import time
start = time.time()
df = pandas.read_csv('resource.csv', dtype=str)
result = df.id.astype(int)
time.time() - start

readr (R): 0.207 seconds

start <- Sys.time()
df <- readr::read_csv('~/repos/temp/resource.csv', col_types = 'c')
result <- readr::parse_integer(df$id)
Sys.time() - start

data.table + readr (R): 0.135 seconds

start <- Sys.time()
df <- data.table::fread(
  '~/repos/temp/resource.csv',
  stringsAsFactors = FALSE, colClasses = list(character = 'id'))
result <- readr::parse_integer(df$id)
Sys.time() - start

Files

datapackage.json

{
  "name": "package",
  "profile": "tabular-data-package",
  "resources": [
    {
      "name": "resource",
      "profile": "tabular-data-resource",
      "path": "resource.csv",
      "schema": {
        "fields": [
          {
            "name": "id",
            "type": "integer"
          }
        ]
      }
    }
  ]
}

resource.csv (1 million rows)

id
1
2
3
...
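
For reference, the test file can be regenerated with a short script along these lines (the file name and row count come from the issue; using pandas here is just a convenience, not something goodtables requires):

import pandas

# Write a one-column CSV with ids 1..1,000,000, matching resource.csv above.
pandas.DataFrame({"id": range(1, 1000001)}).to_csv("resource.csv", index=False)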

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 2
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
ezwelty commented, Oct 21, 2019

@roll I’m not familiar enough with the stack requirements for reading, parsing, and streaming to say for sure whether pandas would provide much benefit to those steps. Where pandas clearly shines is where it can make use of numpy, namely casting field values and checking field and table constraints on numeric fields. There are many Table Schema field types and formats that cannot be handled by pandas.read_csv internally, which is why goodtables-pandas-py first reads a table as a DataFrame of string fields, then parses and casts field values as a second step.
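
A minimal sketch of that two-step approach (read everything as strings, then cast per the schema) might look like the following. The field name id and the integer type come from the example above; the helper function is hypothetical, not the actual goodtables-pandas-py API.

import pandas

def cast_integer_field(values):
    # Hypothetical second step: cast a string column to integers,
    # flagging values that fail to parse instead of raising.
    parsed = pandas.to_numeric(values, errors="coerce")  # NaN where invalid
    invalid = parsed.isna() & values.notna()
    if invalid.any():
        print(invalid.sum(), "invalid integer value(s) in column")
    return parsed.astype("Int64")  # nullable integer dtype

# Step 1: read the whole table as strings so nothing is coerced early.
df = pandas.read_csv("resource.csv", dtype=str)
# Step 2: parse/cast each field according to its Table Schema type.
df["id"] = cast_integer_field(df["id"])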

In the example below, pandas is slower than a plain Python loop at extracting a regex group (for bareNumber: false) prior to casting to integer, because str.extract is just a wrapper that returns the results in a new pandas Series. However, casting the extracted strings to int is faster, and checking the resulting integers against minimum and maximum constraints is much faster.

import pandas
import re
import timeit

x = pandas.Series(range(10000000)).astype(str)

# ---- Extract integer characters from string (bareNumber: false) ----

def extract_integer():
    pattern = re.compile(r"(-?[0-9]+)")
    result = []
    for i, xi in x.iteritems():
        result.append(pattern.findall(xi)[0])

def extract_integer_vectorized():
    pattern = re.compile(r"(-?[0-9]+)")
    result = x.str.extract(pattern, expand=False)

timeit.timeit(extract_integer, number=1)
# 9.80 s
timeit.timeit(extract_integer_vectorized, number=1)
# 20.09 s (slower!)

# ---- Cast string to integer ----

def parse_integer():
    result = []
    for i, xi in x.iteritems():
        result.append(int(xi))

def parse_integer_vectorized():
    result = x.astype(int)

timeit.timeit(parse_integer, number=1)
# 5.67 s
timeit.timeit(parse_integer_vectorized, number=1)
# 1.25 s (faster!)

# ---- Check integer constraints ----

x = pandas.Series(range(10000000))

def check_integer():
    result = []
    for i, xi in x.iteritems():
        result.append(xi > 0 and xi < 9999999)

def check_integer_vectorized():
    result = (x > 0) & (x < 9999999)

timeit.timeit(check_integer, number=1)
# 3.77 s
timeit.timeit(check_integer_vectorized, number=1)
# 0.04 s (much faster!)

As you said, you’ll want to profile goodtables up the call chain to see whether the slowdowns are in reading files, parsing strings, casting strings to values, or checking constraints. If improvements can’t be made upstream, having a faster subset of goodtables seems like a good plan. Perhaps the functionality for reading and casting Tabular Resources to a pandas DataFrame could exist in its own package (i.e. a faster version of tableschema-pandas-py).
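
As a starting point for that profiling, a sketch like the one below (cProfile from the Python standard library wrapped around the same validate call as at the top of the issue) would show where the time goes; the output file name and the number of printed rows are arbitrary.

import cProfile
import pstats

import goodtables

# Profile the whole validation to see whether time is spent reading,
# parsing, casting, or checking constraints.
cProfile.run(
    "goodtables.validate('datapackage.json', row_limit=1000000, "
    "checks=['type-or-format-error'])",
    "validate.prof",
)
pstats.Stats("validate.prof").sort_stats("cumulative").print_stats(20)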

Currently, goodtables-pandas-py reports one error with an array of the (unique) invalid values, but it would be easy enough to report (DataFrame) row numbers. With large datasets, I don’t think it makes sense to add an error for each invalid value, since the error list can become gigantic and completely unreadable, but I suppose that could be done to fit the existing report format.
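
A rough sketch of what that row-number reporting could look like, with all names hypothetical and the error shape simplified compared to the real report format:

import pandas

def integer_field_error(values, max_rows=10):
    # Hypothetical error entry: unique invalid values plus a capped list
    # of offending DataFrame row numbers.
    parsed = pandas.to_numeric(values, errors="coerce")
    invalid = parsed.isna() & values.notna()
    return {
        "invalid-values": values[invalid].unique().tolist(),
        "row-numbers": invalid[invalid].index.tolist()[:max_rows],
        "invalid-count": int(invalid.sum()),
    }

df = pandas.read_csv("resource.csv", dtype=str)
print(integer_field_error(df["id"]))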

0 reactions
roll commented, Apr 30, 2020

Hi @ezwelty

I’m closing this since we can’t move to column-based validation in general: the whole Frictionless Data (FD) infrastructure is based on row streams.

It doesn’t mean we can’t improve the situation:

  • in #341 I’m going to improve performance as much as possible. I ran some tests (for another project in JavaScript, but that doesn’t matter) which showed that even row-based validation can be extremely fast (comparable to pandas.read_csv speed) if the checks/cast functions are optimized (see the sketch after this list)
  • in #341 we will also rebase the implementation on multi-processing
  • I’ve added a link to goodtables-pandas-py to the readme as a faster pandas-based alternative
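
To illustrate the point about optimized row-based checks, here is a small sketch, not the actual goodtables internals: the cast function for each field is resolved once up front, so the per-cell work inside the row stream is a single function call.

import csv

CASTS = {"integer": int, "number": float, "string": str}

def validate_rows(path, field_types):
    # Resolve the cast function per field once, not per cell.
    casts = [CASTS[t] for t in field_types]
    errors = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row_number, row in enumerate(reader, start=2):
            for cast, cell in zip(casts, row):
                try:
                    cast(cell)
                except ValueError:
                    errors.append((row_number, cell))
    return errors

print(len(validate_rows("resource.csv", ["integer"])))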
