DataHub is unable to search/list more than 10k datasets
Describe the bug DataHub is unable to search or list more than 10k datasets.
To Reproduce Steps to reproduce the behavior:
- Have a DataHub instance with more than 10k datasets.
- Run a search query on the UI, e.g. * or snowflake. If there are more than 10k snowflake datasets, it only displays up to 10k results.
- This happens similarly for a GraphQL search query: the searchResult total returned is 10000, but there are more than 10000 datasets.
query {
  search(input: {
    type: DATASET,
    query: "*",
    count: 500,
    start: 0,
  }) {
    total
    searchResults {
      entity {
        ... on Dataset {
          urn
        }
      }
    }
  }
}
Expected behavior DataHub should be able to list / search for more than 10k datasets.
Screenshots Unable to post.
Desktop (please complete the following information):
- OS: Mac
- Browser: Chrome
- Version: 99.0.4844.84 (Official Build) (x86_64)
Additional context We are using the search query to list all the datasets we have, but due to this limitation, we are unable to list all the datasets.
Use Case We have a scheduled job to generate a report of all datasets. The scheduled job calls the GraphQL Search API to list all of the datasets and the relevant information we need. Our requirement is to catalog all of the datasets and some relevant information such as field tags / terms.
Problem Our DataHub instance has more than 10k datasets, so there is no way to pull all of the datasets via an API.
We tried using the Search API, but since we have more than 10k datasets, it does not work. To clarify, we are paging through 500 datasets per call, but searchResults only lets you retrieve the first 10k results in total, even with pagination.
So e.g. if there are 25k datasets, the searchResults total will still be 10000, and there’s no way to get the remaining 15k datasets.
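The paging behaviour described above can be reproduced with a small simulation. This is only a sketch: fake_search is a hypothetical stand-in for the GraphQL search call, the 10,000 window is the cap observed in this issue rather than a value read from DataHub, and the URN strings are placeholders, not real dataset URNs.

```python
from typing import Dict, List

RESULT_WINDOW = 10_000  # cap observed in this issue (assumed for the simulation)


def fake_search(urns: List[str], start: int, count: int) -> Dict:
    """Hypothetical stand-in for the search call: only the first
    RESULT_WINDOW results are visible, and `total` is capped too."""
    visible = urns[:RESULT_WINDOW]
    return {"total": len(visible), "urns": visible[start:start + count]}


def list_all(urns: List[str], page_size: int = 500) -> List[str]:
    """Page through the results the way the scheduled job does."""
    collected: List[str] = []
    start = 0
    while True:
        page = fake_search(urns, start, page_size)
        collected.extend(page["urns"])
        start += page_size
        if start >= page["total"]:  # the reported total says we are done
            break
    return collected


# Placeholder identifiers standing in for 25k dataset URNs.
catalog = [f"placeholder-dataset-{i}" for i in range(25_000)]
fetched = list_all(catalog)
print(len(fetched), len(catalog) - len(fetched))
```

With 500 results per call, the loop stops after 20 pages because the reported total never exceeds the window, so the remaining 15k datasets are never requested.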
Potential Solutions DataHub could provide another API to get all of the datasets, rather than a “search” API.
Or DataHub could provide an API to list all the URNs, so that we can query each dataset individually (https://datahubproject.io/docs/graphql/queries#dataset).
Or, if using the search API, we could segment the search API calls so that each call returns fewer than 10k results, by providing different “filters” for the search term. But this is not as reliable because:
- May not ensure coverage of all datasets
- Requires manual updates to the search filters as the number of datasets increases
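The segmentation workaround and its maintenance cost can be illustrated with a small check. This is a sketch under assumptions: the per-segment counts are made-up numbers, the segment names are hypothetical filter values, and segments_needing_split is an illustrative helper, not part of any DataHub API. Because the reported total is capped, a segment that reports exactly 10,000 cannot be told apart from a larger one, so it is treated as needing a further split.

```python
from typing import Dict, List

RESULT_WINDOW = 10_000  # cap observed in this issue (assumed)


def segments_needing_split(
    reported_totals: Dict[str, int], window: int = RESULT_WINDOW
) -> List[str]:
    """Return segments whose reported total has reached the window.

    A total equal to the window may be a truncated count, so those
    segments cannot be trusted to be complete.
    """
    return sorted(
        name for name, total in reported_totals.items() if total >= window
    )


# Hypothetical totals returned by one filtered search per segment.
reported = {"snowflake": 10_000, "bigquery": 8_200, "kafka": 6_800}
print(segments_needing_split(reported))
```

Every segment flagged here would need a finer filter, which is the manual upkeep mentioned in the list above. The check also says nothing about datasets that match none of the chosen filters, so coverage is still not guaranteed.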
Issue Analytics
- Created a year ago
- Comments: 10 (7 by maintainers)
Top GitHub Comments
Try using this REST endpoint, which goes through all entities in MySQL.
Note that we do not have any guarantees on the latency of this endpoint. We have seen that it can take on the order of minutes when there are more than a million datasets.