QA points & polygons

We need to help the data team QA the datasets and also ensure that enrichment works well enough.

So the objective of this issue is to create a notebook that allows us to:

  • validate DO dataset (dataset, geography, variables, …)
  • validate enrichment in “worst scenario” case

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

andy-esch commented, Jan 3, 2020

Yes, just left a big comment here about it: https://github.com/CartoDB/data-observatory/issues/442#issuecomment-570647587

So far, the script runs like so:

results = []  # collects one result per dataset
for t in dataset_ids:
    dataset = Dataset.get(t)
    qa = QADataset(dataset)  # QA wrapper defined in the script
    qa.execute()
    results.append(qa.result.result)
    write_results(results, provider)

It prints the results to a file. Each dataset gets a pass/fail, with a message for each failed test explaining why it failed.

alasarr commented, Nov 18, 2019

After a talk with @cmongut and @xavipereztr I’d like to clarify what’s going to be the role of each team in QA.

  • The data team is responsible for the QA of their data: geographies, datasets and catalog.
  • The backend team will provide the mechanism to run the tests.

Because of ☝️ I’m assigning this to @andy-esch. The idea is that the data team will provide the scripts inside a Python Notebook, and after that the backend team will provide the end2end tests built on these scripts. The notebook scripts will be temporary; the data team must use the end2end tests to validate each dataset in DO.

Since we’ve already worked on some Python scripts, @oleurud please share them with @andy-esch so he can reuse some of your work.

At the end of this task, we need to provide a python Notebook with the following features:

Validate the metadata (Catalog)

We need to create a test function to validate the metadata of a dataset or geography and its child entities (variables).

We must check that a dataset or geography has the minimum fields filled in, and that its variables are well defined (descriptions are not null, aggregators are there, etc.).

This test function must use the CARTOFrames Catalog.

@oleurud has code around this that you should extend.
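
As a rough illustration of the "minimum fields" check described above, here is a minimal sketch that works on plain dictionaries rather than on the Catalog itself. The field names in `REQUIRED_DATASET_FIELDS` and `REQUIRED_VARIABLE_FIELDS`, and the functions `missing_fields` and `validate_metadata`, are assumptions made up for the example; the real list of required fields would have to come from the data team.

```python
from typing import Any, Dict, List

# Hypothetical required fields; the real list should come from the data team.
REQUIRED_DATASET_FIELDS = ("id", "name", "description", "geography_id")
REQUIRED_VARIABLE_FIELDS = ("id", "name", "description", "agg_method")


def missing_fields(entity: Dict[str, Any], required) -> List[str]:
    """Return the required fields that are absent, None or blank."""
    missing = []
    for field in required:
        value = entity.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


def validate_metadata(dataset: Dict[str, Any],
                      variables: List[Dict[str, Any]]) -> List[str]:
    """Collect failure messages; an empty list means the dataset passes."""
    errors = [f"dataset {dataset.get('id')}: missing {f}"
              for f in missing_fields(dataset, REQUIRED_DATASET_FIELDS)]
    if not variables:
        errors.append(f"dataset {dataset.get('id')}: no variables")
    for var in variables:
        errors += [f"variable {var.get('id')}: missing {f}"
                   for f in missing_fields(var, REQUIRED_VARIABLE_FIELDS)]
    return errors
```

Returning a list of messages instead of a boolean keeps the pass/fail decision and the reason for a failure in one place.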

Geographies validation

A geography needs to be validated: we need to check that its geometries are OK.

In the past, we’ve experienced some performance issues because of geodesic issues when the data was uploaded to BQ. @arredond has more info about this.
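
To give an idea of what "geometries are OK" could mean at the most basic level, here is a small pure-Python sketch that sanity-checks one polygon ring. It is an assumption about which checks are useful, not the checks the team agreed on: it only looks at ring closure, coordinate bounds and zero area, and treats longitude/latitude as planar. It does not detect self-intersections or the geodesic issues mentioned above; a geometry library's validity check (for example Shapely's `is_valid`) would be the natural tool for that.

```python
from typing import List, Sequence, Tuple

Coord = Tuple[float, float]  # (longitude, latitude) in degrees


def signed_area(ring: Sequence[Coord]) -> float:
    """Planar shoelace area of a ring (positive if counter-clockwise)."""
    return 0.5 * sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(ring, ring[1:]))


def check_ring(ring: Sequence[Coord]) -> List[str]:
    """Return the problems found in one polygon ring; empty means OK."""
    problems = []
    if len(ring) < 4:
        # A closed ring needs at least a triangle plus the repeated first point.
        problems.append("ring has fewer than 4 coordinates")
        return problems
    if ring[0] != ring[-1]:
        problems.append("ring is not closed")
    if any(not (-180 <= x <= 180 and -90 <= y <= 90) for x, y in ring):
        problems.append("coordinate outside lon/lat bounds")
    if abs(signed_area(ring)) < 1e-12:
        problems.append("ring has zero area")
    return problems
```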

Enrichment

Enrichment is our main operation today and we don’t have an automatic workflow to test it. Let’s try to write these scripts client-side using GeoPandas; if we don’t have a PostGIS dependency for these tests, our life will be easier 😄.

Enrichment by points

To validate an enrichment by points we could generate a dataset of N points inside of the geography we want to do the enrichment against.

generate_points(geography_id, n_points) -> it returns a GeoPandas DataFrame with a distribution of points inside of the geography. Points should be spread across the whole geography; we need to avoid having points only in a small area.
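
One possible way to get points spread over the whole geography is rejection sampling: draw uniform points in the bounding box and keep those that fall inside the polygon. The sketch below is only an illustration under simplifying assumptions: it works on a single closed ring given as a list of coordinates (not on a geography_id or a GeoPandas DataFrame), treats coordinates as planar, and the names `point_in_polygon` and `sample_points` are made up for the example.

```python
import random
from typing import List, Sequence, Tuple

Coord = Tuple[float, float]


def point_in_polygon(point: Coord, ring: Sequence[Coord]) -> bool:
    """Ray-casting test: count edge crossings of a ray going in +x."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sample_points(ring: Sequence[Coord], n_points: int,
                  seed: int = 0) -> List[Coord]:
    """Draw n_points uniformly inside a closed ring by rejection sampling."""
    rng = random.Random(seed)  # seeded so a QA run can be reproduced
    xs, ys = zip(*ring)
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    points: List[Coord] = []
    attempts = 0
    while len(points) < n_points:
        attempts += 1
        if attempts > 1000 * max(n_points, 1):
            # Guards against degenerate rings that almost never accept a point.
            raise RuntimeError("too many rejected samples")
        candidate = (rng.uniform(min_x, max_x), rng.uniform(min_y, max_y))
        if point_in_polygon(candidate, ring):
            points.append(candidate)
    return points
```

Uniform sampling does not guarantee even coverage for small N; if that turns out to matter, sampling one point per cell of a grid laid over the bounding box would be a simple alternative.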

Using this function we need to create a test function test_enrichment_point(dataset_id, n_points) that takes the dataset, calls generate_points, and runs an enrichment using all the variables defined in the dataset.

For the moment, we’ll set the value of n_points by hand; in the future we’ll try to automate it through a quick analysis of the geography.
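
The test described above could be organized roughly as follows. This is a sketch under assumptions, not the actual test: the enrichment call is passed in as a callable named `enrich` (hypothetical, standing in for whatever enrichment API ends up being used), rows are plain dictionaries instead of a GeoPandas DataFrame, and the pass criteria (same number of rows, at least one non-null value per variable) are only a guess at a reasonable minimum.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


def check_enrichment(rows: Sequence[Row], enriched: Sequence[Row],
                     variable_ids: Sequence[str]) -> List[str]:
    """Compare input rows with enriched rows; return failure messages."""
    errors = []
    if len(enriched) != len(rows):
        errors.append(f"expected {len(rows)} rows, got {len(enriched)}")
    for var in variable_ids:
        values = [row.get(var) for row in enriched]
        if all(v is None for v in values):
            errors.append(f"variable {var}: no values returned")
    return errors


def run_enrichment_test(
        points: Sequence[Row], variable_ids: Sequence[str],
        enrich: Callable[[Sequence[Row], Sequence[str]], List[Row]],
) -> List[str]:
    """Enrich the points with every variable and validate the result."""
    enriched = enrich(points, variable_ids)  # hypothetical enrichment call
    return check_enrichment(points, enriched, variable_ids)
```

Passing the enrichment function in as an argument makes it possible to exercise the checks with a fake enrichment before wiring in the real one.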

Enrichment by polygons

Similar to points, but using polygons.

To validate an enrichment by polygon we could generate a dataset of N polygons inside of the geography we want to do the enrichment against.

generate_polygons(geography_id, n_polygons) -> it returns a GeoPandas DataFrame with a distribution of polygons inside of the geography. I think we can use generate_points and then build a Voronoi diagram from these points.

Using this function we need to create a test function test_enrichment_polygon(dataset_id, n_points) that takes the dataset and runs an enrichment using all the variables defined in the dataset. Aggregation functions should be fetched from the metadata.

For the moment, we’ll set the value of n_polygons by hand; in the future we’ll try to automate it through a quick analysis of the geography.
