DataHub is unable to search/list more than 10k datasets

Describe the bug
DataHub cannot search or list more than 10k datasets.

To Reproduce
Steps to reproduce the behavior:

  1. Have a DataHub instance with more than 10k datasets.
  2. Run a search query in the UI, e.g. * (or snowflake, if there are more than 10k Snowflake datasets). Only the first 10k results are displayed.
  3. The same happens with a GraphQL search query: the total returned is 10000 even though more than 10000 datasets exist.
query {
  search(input: {
    type: DATASET,
    query: "*",
    count: 500,
    start: 0,
  }) {
    total
    searchResults {
      entity {
        ... on Dataset {
          urn
        }
      }
    }
  }
}

Expected behavior
DataHub should be able to list or search more than 10k datasets.

Screenshots
Unable to post.

Desktop (please complete the following information):

  • OS: Mac
  • Browser: Chrome
  • Version: 99.0.4844.84 (Official Build) (x86_64)

Additional context
We use the search query to list all of our datasets, but this limitation prevents us from listing them all.

Use Case
We have a scheduled job that generates a report of all datasets. The job calls the GraphQL Search API to list every dataset along with the information we need. Our requirement is to catalog all of the datasets and some relevant information, such as field tags / terms.

Problem
Our DataHub instance has more than 10k datasets, so there is no way to pull all of them via an API.

We tried the Search API, but it does not work with more than 10k datasets. To clarify, we page through 500 datasets per call, but searchResults only lets you retrieve the first 10k results in total, even with pagination.

For example, if there are 25k datasets, the reported total is still 10000 and there is no way to get the remaining 15k datasets.
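The paging behaviour described above can be sketched as follows. This is an illustration only: `fetch_page` is a hypothetical callable standing in for the GraphQL search call, and `capped_backend` is a fake backend that imitates the reported 10k ceiling with placeholder URNs rather than talking to a real DataHub instance.

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical signature: fetch_page(start, count) -> (reported_total, urns).
FetchPage = Callable[[int, int], Tuple[int, List[str]]]


def iter_search_results(fetch_page: FetchPage, page_size: int = 500) -> Iterator[str]:
    """Yield URNs page by page until the reported total is exhausted."""
    start = 0
    while True:
        total, urns = fetch_page(start, page_size)
        yield from urns
        start += page_size
        # Stop on an empty page or once we have walked past the reported total.
        if not urns or start >= total:
            break


def capped_backend(actual: int, cap: int = 10_000) -> FetchPage:
    """Fake backend imitating the reported behaviour: total never exceeds cap."""
    def fetch_page(start: int, count: int) -> Tuple[int, List[str]]:
        visible = min(actual, cap)
        end = min(start + count, visible)
        # Placeholder URNs, not real DataHub identifiers.
        return visible, [f"urn:li:dataset:{i}" for i in range(start, end)]
    return fetch_page


retrieved = list(iter_search_results(capped_backend(actual=25_000)))
missing = 25_000 - len(retrieved)
```

With 25k datasets behind a 10k ceiling, the loop ends after 20 pages of 500, which matches the 15k unreachable datasets described in the report.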

Potential Solutions
DataHub could provide a dedicated API for getting all datasets, separate from the “search” API.

Alternatively, DataHub could provide an API that lists all URNs, so that we can then query each dataset individually (https://datahubproject.io/docs/graphql/queries#dataset).

Or, staying with the search API, we could segment the calls by applying different “filters” to the search term so that each call returns fewer than 10k results. This is less reliable because it:

  1. May not ensure coverage of all datasets
  2. Requires manual updates to the search filters as the number of datasets increases
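A rough sketch of the segmentation idea, and of why it is fragile, might look like this. `search_segment` is a hypothetical callable that runs one filtered search and returns the reported total together with the URNs it retrieved; the segment names and URNs are made up for illustration.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

SEARCH_CAP = 10_000  # the ceiling reported in this issue

# Hypothetical: search_segment(segment) -> (reported_total, urns) for one filter value.
SearchSegment = Callable[[str], Tuple[int, List[str]]]


def collect_by_segment(
    search_segment: SearchSegment, segments: Iterable[str]
) -> Tuple[Set[str], List[str]]:
    """Union the URNs of every segment and report segments that reached the cap."""
    seen: Set[str] = set()
    saturated: List[str] = []
    for segment in segments:
        total, urns = search_segment(segment)
        seen.update(urns)  # de-duplicate datasets matched by several segments
        if total >= SEARCH_CAP:
            saturated.append(segment)  # this segment may be truncated
    return seen, saturated


# Made-up data standing in for real search responses.
fake_index: Dict[str, Tuple[int, List[str]]] = {
    "snowflake": (10_000, ["urn:a", "urn:b"]),
    "bigquery": (2, ["urn:b", "urn:c"]),
}
urns, saturated = collect_by_segment(lambda s: fake_index[s], ["snowflake", "bigquery"])
```

Any segment reported in `saturated` would need to be split further by hand, and datasets that match none of the chosen segments are silently missed. Those are the two weaknesses listed above.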

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

2 reactions
dexter-mh-lee commented, Apr 27, 2022

Try using this REST endpoint, which goes through all entities in MySQL:

curl --location --request POST 'http://localhost:8080/entities?action=listUrns' \
--header 'X-RestLi-Protocol-Version: 2.0.0' \
--header 'Content-Type: application/json' \
--data-raw '{
    "entity": "dataset",
    "start": 0,
    "count": 10
}'

Note that we make no guarantees about the latency of this endpoint. We have seen it take on the order of minutes when there are more than a million datasets.
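For a scheduled job, the same request could be paged from Python roughly as follows. The URL, headers and request body are taken from the curl command above. The reply shape is an assumption (URNs under `value.entities`) that should be checked against your server, and the `post` parameter is a hypothetical hook for injecting a different transport, for example in tests.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterator

# Same endpoint and headers as the curl command above.
LIST_URNS_URL = "http://localhost:8080/entities?action=listUrns"
HEADERS = {"X-RestLi-Protocol-Version": "2.0.0", "Content-Type": "application/json"}

Post = Callable[[Dict], Dict]


def http_post(body: Dict) -> Dict:
    """POST the JSON body to the listUrns endpoint and decode the JSON reply."""
    request = urllib.request.Request(
        LIST_URNS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers=HEADERS,
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_urns(
    entity: str = "dataset", page_size: int = 1000, post: Post = http_post
) -> Iterator[str]:
    """Page through listUrns by advancing `start` until an empty page comes back."""
    start = 0
    while True:
        reply = post({"entity": entity, "start": start, "count": page_size})
        # ASSUMPTION: URNs live under reply["value"]["entities"]; verify on your server.
        page = reply.get("value", {}).get("entities", [])
        if not page:
            break
        yield from page
        start += len(page)
```

Calling `list(iter_urns())` against a running instance would collect every dataset URN. Given the latency caveat above, a large `page_size` and a generous timeout are probably sensible.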

0 reactions
github-actions[bot] commented, Oct 15, 2022

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io
