QA points & polygons
We need to help the data team do QA of the datasets and also ensure that enrichment works well enough.
So, the objectives of this issue are to create a notebook that allows us to:
- validate a DO dataset (dataset, geography, variables, …)
- validate enrichment in the “worst case” scenario
Top GitHub Comments
Yes, just left a big comment here about it: https://github.com/CartoDB/data-observatory/issues/442#issuecomment-570647587
So far, the script runs and prints its results to a file. It gives each dataset a pass/fail, with messages for each of the tests explaining why it failed, if it fails.
After a talk with @cmongut and @xavipereztr, I’d like to clarify the role each team will play in QA.
Because of ☝️ I’m assigning @andy-esch here. The idea is that the data team will provide the scripts inside a Python notebook, and after that the backend team will turn these scripts into end-to-end tests. The notebook scripts will be temporary; the data team must use the end-to-end tests to validate each dataset in the DO.
Since we’ve already worked on some Python scripts, @oleurud, please share them with @andy-esch so he can reuse some of your work.
At the end of this task, we need to provide a Python notebook with the following features:
Validate the metadata (Catalog)
We need to create a test function to validate the metadata of a dataset or geography and their child entities (variables).
We must check that a dataset or geography has the minimum required fields filled in, and that its variables are well defined (descriptions are not null, aggregation functions are present, etc…).
These test functions must use the CARTOframes Catalog.
@oleurud has code around this that you should work on extending.
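As a minimal sketch, assuming the cartoframes 1.x Catalog API (`Dataset.get`, `dataset.variables`, and the variables’ `description`/`agg_method` attributes), such a check could look like this. The list of required fields is an illustrative assumption, not the definitive minimum:

```python
# Minimal sketch of a metadata test, assuming the cartoframes 1.x
# Catalog API. REQUIRED_DATASET_FIELDS is an illustrative assumption,
# not the definitive list of minimum fields.
from cartoframes.data.observatory import Dataset

REQUIRED_DATASET_FIELDS = ['name', 'description', 'country', 'geography']

def test_metadata(dataset_id):
    """Return (passed, messages) for a single DO dataset."""
    messages = []
    dataset = Dataset.get(dataset_id)

    # Check the dataset itself has the minimum fields filled in
    for field in REQUIRED_DATASET_FIELDS:
        if not getattr(dataset, field, None):
            messages.append('{}: missing required field "{}"'.format(dataset_id, field))

    # Check every child variable is well defined
    for variable in dataset.variables:
        if not variable.description:
            messages.append('{}: description is null'.format(variable.id))
        if not variable.agg_method:
            messages.append('{}: no aggregation function defined'.format(variable.id))

    return len(messages) == 0, messages
```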
Geographies validation
A Geography needs to be validated: we need to check that its geometries are OK.
In the past, we’ve experienced some performance issues caused by geodesic problems when the data was uploaded to BQ. @arredond has more info about this.
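A minimal sketch of such a check with GeoPandas and Shapely, assuming the geography has already been downloaded into a GeoDataFrame (how it is fetched is out of scope here):

```python
# Minimal sketch of a geometry sanity check over a GeoDataFrame.
import geopandas as gpd
from shapely.validation import explain_validity

def test_geometries(gdf):
    """Return (passed, messages) listing empty or invalid geometries."""
    messages = []
    for idx, geom in gdf.geometry.items():
        if geom is None or geom.is_empty:
            messages.append('row {}: empty geometry'.format(idx))
        elif not geom.is_valid:
            # explain_validity reports e.g. a self-intersection and its location
            messages.append('row {}: {}'.format(idx, explain_validity(geom)))
    return len(messages) == 0, messages
```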
Enrichment
Enrichment is today our main operation, and we don’t have an automated workflow to validate it. Let’s try to write these scripts client-side using GeoPandas; if these tests don’t depend on PostGIS, our life will be easier 😄.
Enrichment by points
To validate an enrichment by points, we could generate a dataset of N points inside the geography we want to enrich against:
`generate_points(geography_id, n_points)`
returns a GeoPandas DataFrame with a distribution of points inside the geography. Points should be spread across the whole geography; we need to avoid having points only in a small area.
Using this function, we need to write a test function
`test_enrichment_point(dataset_id, n_points)`
that takes the dataset, calls generate_points, and runs an enrichment using all the variables defined in the dataset. For the moment we’ll set the value of `n_points` by hand; in the future we’ll try to automate it through a quick analysis of the geography.
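A hypothetical sketch of both functions follows. `get_geography_gdf` is an assumed helper (not part of any existing API) that downloads the geography as a GeoDataFrame; the `Dataset`/`Enrichment` calls assume the cartoframes 1.x API and that CARTO credentials are already configured. Rejection sampling keeps the accepted points uniform over the whole geography, not just a small area:

```python
# Hypothetical sketches of generate_points and test_enrichment_point.
# get_geography_gdf is an assumed helper; the Dataset/Enrichment calls
# assume the cartoframes 1.x API and that credentials were set with
# cartoframes.auth.set_default_credentials.
import numpy as np
import geopandas as gpd
from shapely.geometry import Point
from cartoframes.data.observatory import Dataset, Enrichment

def generate_points(geography_id, n_points, seed=0):
    """Sample n_points uniformly distributed inside the geography."""
    gdf = get_geography_gdf(geography_id)  # assumed helper
    boundary = gdf.geometry.unary_union
    minx, miny, maxx, maxy = boundary.bounds
    rng = np.random.default_rng(seed)

    points = []
    while len(points) < n_points:
        # Sample uniformly in the bounding box; rejecting points that
        # fall outside keeps the accepted ones uniform over the geography.
        candidate = Point(rng.uniform(minx, maxx), rng.uniform(miny, maxy))
        if boundary.contains(candidate):
            points.append(candidate)

    return gpd.GeoDataFrame(geometry=points, crs=gdf.crs)

def test_enrichment_point(dataset_id, n_points):
    """Enrich the generated points with every variable of the dataset."""
    dataset = Dataset.get(dataset_id)
    points = generate_points(dataset.geography, n_points)
    enriched = Enrichment().enrich_points(points, dataset.variables)
    # Basic check (illustrative): every variable should come back as a column
    missing = [v.column_name for v in dataset.variables
               if v.column_name not in enriched.columns]
    return len(missing) == 0, missing
```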
Enrichment by polygons
Similar to points, but using polygons.
To validate an enrichment by polygons, we could generate a dataset of N polygons inside the geography we want to enrich against:
`generate_polygons(geography_id, n_polygons)`
returns a GeoPandas DataFrame with a distribution of polygons inside the geography. I think we can use generate_points and then build a Voronoi diagram from those points (see the sketch after this section).
Using this function, we need to write a test function
`test_enrichment_polygon(dataset_id, n_points)`
that takes the dataset and runs an enrichment using all the variables defined in the dataset. The aggregation functions should be fetched from the metadata. For the moment we’ll set the value of `n_polygons` by hand; in the future we’ll try to automate it through a quick analysis of the geography.
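A hypothetical sketch of the Voronoi approach, assuming Shapely >= 1.8 (which provides `shapely.ops.voronoi_diagram`) and reusing `generate_points` and the assumed `get_geography_gdf` helper from above:

```python
# Hypothetical sketch of generate_polygons: Voronoi cells built from
# generated points and clipped to the geography. Assumes Shapely >= 1.8
# and the assumed helpers above (generate_points, get_geography_gdf).
import geopandas as gpd
from shapely.geometry import MultiPoint
from shapely.ops import voronoi_diagram

def generate_polygons(geography_id, n_polygons):
    """Partition the geography into n_polygons Voronoi cells."""
    points_gdf = generate_points(geography_id, n_polygons)
    boundary = get_geography_gdf(geography_id).geometry.unary_union

    cells = voronoi_diagram(MultiPoint(points_gdf.geometry.tolist()))
    # Voronoi cells extend beyond the geography; clip each one to it
    # and drop any cell that ends up empty.
    clipped = [cell.intersection(boundary) for cell in cells.geoms]
    clipped = [cell for cell in clipped if not cell.is_empty]

    return gpd.GeoDataFrame(geometry=clipped, crs=points_gdf.crs)
```

test_enrichment_polygon would then mirror test_enrichment_point above, but call `Enrichment().enrich_polygons` (cartoframes 1.x) and take the aggregation from each variable’s `agg_method` in the metadata rather than hard-coding it.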