Pagination Support

Right now, pagination only works for selection queries with ORDER BY:

SELECT foo, bar FROM myTable
  WHERE baz > 20
  ORDER BY bar DESC
  LIMIT 50, 100

With LIMIT a, b, this paginates the selection results starting from the a-th result and returns at most b results (here a = 50 and b = 100).
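To make the window semantics concrete, here is a minimal Python sketch of what `LIMIT offset, count` does to an ordered result. It is an illustration only, not Pinot code: `paginate` and the toy rows are hypothetical, and it assumes a 0-based offset (the first `offset` rows are skipped), as in MySQL-style `LIMIT`.

```python
from operator import itemgetter


def paginate(rows, offset, limit, sort_key, descending=False):
    """Return at most `limit` rows, starting at position `offset` of the ordered rows."""
    # Order first, so that the window is well defined.
    ordered = sorted(rows, key=itemgetter(sort_key), reverse=descending)
    # Slice out the requested window; a window past the end simply comes back shorter.
    return ordered[offset:offset + limit]


# Toy rows standing in for rows that already passed the WHERE filter.
rows = [{"foo": i, "bar": i % 7} for i in range(200)]

# Equivalent in spirit to: ORDER BY bar DESC LIMIT 50, 100
page = paginate(rows, offset=50, limit=100, sort_key="bar", descending=True)
last_page = paginate(rows, offset=150, limit=100, sort_key="bar", descending=True)
```

The last window is shorter than the requested size because only 50 rows remain after position 150.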

The query below, which adds a GROUP BY, does not paginate the results:

SELECT count(*), foo, bar FROM myTable
  WHERE baz > 20
  GROUP BY foo
  ORDER BY bar DESC
  LIMIT 50, 100

Issue Analytics

State:
Created 3 years ago
Reactions: 16
Comments: 12 (8 by maintainers)

Top GitHub Comments

7 reactions

atris commented, Dec 3, 2021

@kishoreg As discussed offline, I will take this up.

3 reactions

siddharthteotia commented, Jul 14, 2022

At LinkedIn, we have started working on pagination as a priority, given the multiple requests we have received internally.

At a high level, our customers want to run a query in Pinot that can potentially return a large response, and they want the ability to paginate that response as multiple result sets (with the size of each result set dictated by the user app).

The current pagination implementation in Pinot (even though it only covers selection queries) is sub-optimal: it treats each request as a fresh query, executes the query again for every pagination window, discards the results outside the window, and returns only the results within the M, N window that the user asked for.
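The cost of that approach can be seen in a toy model. The sketch below is not Pinot's implementation; `ToyEngine` is a hypothetical stand-in that redoes the full scan and sort on every page request and counts how many rows it touches.

```python
class ToyEngine:
    """Hypothetical engine that re-executes the whole query for each page."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_processed = 0  # total rows touched across all page requests

    def run_page(self, offset, limit):
        # Every call repeats the full scan and sort from scratch ...
        ordered = sorted(self.rows, reverse=True)
        self.rows_processed += len(ordered)
        # ... and then discards everything outside the requested window.
        return ordered[offset:offset + limit]


engine = ToyEngine(list(range(1000)))

# Page through all 1,000 rows, 100 rows at a time.
pages = [engine.run_page(offset, 100) for offset in range(0, 1000, 100)]
rows_returned = sum(len(page) for page in pages)
```

In this toy model, returning 1,000 rows in 10 pages means processing 10,000 rows in total, because each of the 10 calls redoes the work for the full result.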

The main thing to note about pagination is that it has to be treated as a single query. Our customers don’t want to run a one-off pagination query (OFFSET M, FETCH N) where M and N are completely random: it is not possible to reason about the results of such a query, and it is hard for the user even to pick M as a one-off starting point. The result of a random pagination query doesn’t add any value to the user, since they want to look at the entire result as a continuous stream of results / pages / batches that they can stop at any time.

So, the semantics we want to provide are: “I want to fetch 10 million records from Pinot for a query, and I want to fetch 100K at a time”. The customer will typically start with M at 0, keep N fixed (say at 10K or 100K), and keep paging through the results with multiple calls from their app, changing only M on each call (and potentially refreshing the UI with the results Pinot returns on each call).
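The client-side pattern described above can be sketched as a simple loop. This is only an illustration under stated assumptions: `build_query` and `fetch_pages` are hypothetical helpers, `run_query` is a placeholder for whatever function the app uses to send SQL to Pinot and get rows back, and the query text reuses the selection example from the issue.

```python
from typing import Callable, Iterator, List


def build_query(offset: int, page_size: int) -> str:
    """Build one page of the selection query using the LIMIT M, N form."""
    return (
        "SELECT foo, bar FROM myTable "
        "WHERE baz > 20 ORDER BY bar DESC "
        f"LIMIT {offset}, {page_size}"
    )


def fetch_pages(
    run_query: Callable[[str], List[tuple]],  # hypothetical: sends SQL, returns rows
    page_size: int,
    max_rows: int,
) -> Iterator[List[tuple]]:
    """Yield pages of rows: M starts at 0, N stays fixed, and only M changes per call."""
    offset = 0
    while offset < max_rows:
        # Never ask for more than the overall cap the caller set.
        wanted = min(page_size, max_rows - offset)
        rows = run_query(build_query(offset, wanted))
        if not rows:
            break  # nothing left to read
        yield rows
        if len(rows) < wanted:
            break  # a short page means the result is exhausted
        offset += len(rows)
```

Because `fetch_pages` is a generator, the caller can stop consuming pages at any point, which matches the "stop anytime" requirement. Each call still goes through `run_query` as an independent query, so this sketch has the same re-execution cost as the current implementation.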

I think we should look at the pagination problem from this perspective, as opposed to treating it as a random one-off pagination query. We are trying to tackle the problem from this angle. A detailed design discussion is in progress.

Some more thoughts, loosely related to this:

One problem is that users who run such queries may tend to think that pagination support means they can run “any” query in Pinot, however long-running, and that Pinot is guaranteed to finish it and return results. This can easily cause an OOM (out of memory) error and bring down the cluster.

Pinot is unlikely to enter the territory of running very long-running queries that return a complete, 100% accurate result by spilling to disk and avoiding OOM at all times. Presto should be used for those cases.

However, for some of our users (who are OK with multi-second latency and prefer a slightly more accurate response for GROUP BY queries), we want to consider, as a follow-up / next phase, enhancing Pinot’s support for queries that return large responses and/or process or aggregate more than the usual amount of data. We want to do this by moving some of the memory-intensive query execution operations to off-heap (direct) memory. Together with the ability to paginate a large response, this largely fulfills the requirements we are seeing in production.