Translate queries where a row includes a collection to use arrays instead of being implemented through joins
Following discussion in https://github.com/npgsql/efcore.pg/issues/712.
Consider queries like:

```csharp
var blogs = context.Blogs
    .Include(blog => blog.Posts)
    .ToList();
```
or something like:

```csharp
var blogs = context.Blogs
    .Where(...)
    .Select(b => new
    {
        BlogTitle = b.Title,
        LargeColumn = b.LargeColumn,
        LastPosts = b.Posts.OrderByDescending(p => p.Id).Take(4).Select(p => p.Message).ToList()
    })
    .ToList();
```
This might either lead to “cartesian explosion” if implemented through joins, or result in duplicated work on the server if the split-query strategy is used (https://docs.microsoft.com/en-us/ef/core/querying/single-split-queries).
PostgreSQL is capable of returning a `record[]` in a column, which can contain the result set of a subquery. Other query engines seem to take advantage of this, for example in GraphQL (https://www.shanestillwell.com/2017/09/15/postgresql-json-and-graphql/), although there `json` is used instead of `record[]` to encode the data.
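For a concrete idea of what such a translation could look like, here is a hand-written sketch of the LastPosts projection above. This is not the output of any current implementation; the "Blogs"/"Posts" tables and the "BlogId" foreign key are assumptions, and the elided .Where filter is omitted:

```sql
-- Hypothetical array-based translation of the LastPosts projection above.
select b."Title" as "BlogTitle",
       b."LargeColumn",
       array(select p."Message"
             from "Posts" as p
             where p."BlogId" = b."Id"
             order by p."Id" desc
             limit 4) as "LastPosts"
from "Blogs" as b
```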
I’ve done some initial performance testing regarding PostgreSQL’s query plans.
There are basically two strategies: executing a subquery, or join + group by + aggregate.
Subquery:

```sql
select *, array(select row(tbl2.*) from tbl2 where tbl1.id = tbl2.tbl1id) from tbl1
```
Join + group by + aggregate:

```sql
select tbl1.*, array_agg(tbl2) from tbl1 left join tbl2 on tbl1.id = tbl2.tbl1id group by tbl1.id
```
The first strategy includes a subquery that is executed once per row. Without an index on the foreign key this means O(n*m), because a full sequential scan of tbl2 is performed for each row in tbl1, which is pretty bad. When an index on the foreign key exists, however, it can be used, resulting in optimal time complexity. In the case of a many-to-many relationship, an extra inner join inside the subquery simply uses the primary key index on the third table to follow the extra indirection, which is good.
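For illustration, a hand-written sketch of the many-to-many case (the join table tbl1_tbl2 and its columns are assumptions, following the naming of the examples above):

```sql
-- Hypothetical schema: tbl1(id), tbl2(id), join table tbl1_tbl2(tbl1id, tbl2id).
-- The inner join resolves each tbl2 row via its primary key index.
select *,
       array(select row(tbl2.*)
             from tbl1_tbl2
             join tbl2 on tbl2.id = tbl1_tbl2.tbl2id
             where tbl1_tbl2.tbl1id = tbl1.id)
from tbl1
```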
The second strategy results in a query plan similar to that of a traditional left join, plus a sort (if no index is available for the foreign key) plus an aggregation. So it basically trades network-traffic overhead (or cartesian explosion in the case of two or more joins) for some simple aggregation work on the server side.
If there are nested includes (i.e. tbl1 [has-many] tbl2 [has-many] tbl3), then the first strategy will result in a simple and efficient query plan (assuming indexes exist!), while the second one becomes more complex assuming aggregation is performed after every join.
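As a hand-written sketch, such nested includes could translate into nested array subqueries along these lines (table and column names are assumptions):

```sql
-- Hypothetical nested translation: each tbl2 row carries its own array of tbl3 rows.
select *,
       array(select row(tbl2.*,
                        array(select row(tbl3.*)
                              from tbl3
                              where tbl3.tbl2id = tbl2.id))
             from tbl2
             where tbl2.tbl1id = tbl1.id)
from tbl1
```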
Depending on the specific data and query, a normal join could of course sometimes be more performant, especially if the data that is being duplicated is small.
To me this still looks very promising, and I suspect it could improve performance over split-query by avoiding duplicate work on the backend. In my opinion it could really be a killer feature when the user wants to perform a complex query including many navigations, or when the returned data is shaped more like a complex json structure than a 2D table. It would probably require a lot of work in the efcore repo, though.
@Emill that’s pretty much what I was proposing above - RecordReader would be a thin view over a portion of the internal NpgsqlReadBuffer (conceptually similar to a `Memory<byte>`). Either NpgsqlDataReader causes a new array to be created on Read (when undisposed RecordReaders still exist), or the RecordReaders are informed and copy their relevant portions out; the latter may be more efficient, since Npgsql’s internal buffer is big (8K by default) and we’d need to allocate a new one for each row. But I guess that’s going a bit too deep into implementation details (in any case I’m sure @smitpatel doesn’t care much 🤣).

If we just ask Npgsql for an `object[][]` for a specific column (`reader.GetFieldValue<object[][]>(index)`), Npgsql won’t know what data types we want in the inner values. It currently uses the default type mapper for whatever data type is returned from the server and casts it to an `object`. If Npgsql instead returns a `DbDataReader` for a column, we can (recursively) use the GetFieldValue API to read a column using a specific CLR type, which is something EF Core needs. Currently such tests fail with my current implementation.

As @roji mentions, `object[][]` might also create boxing overhead (I don’t know how significant that is in this case, though). We could add ValueTuple<T, …> or Tuple<T, …> support to Npgsql’s data reader, to be able to ask for e.g. `reader.GetFieldValue<Tuple<int, string, Tuple<int, DateTime>[]>[]>(index)`. That’s another way of solving the problem, but it might of course create unnecessary objects that we throw away soon after materializing the final entities.

I just realized that the latter method naturally works well when we must use the BufferedDataReader. If we use a `DbDataReader` for every `record[]` column, we must make sure those objects keep working even after reading has completed in the outer reader.
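To make the Tuple-based option concrete, here is a consumer-side sketch. It assumes a hypothetical extension of `GetFieldValue<T>` that materializes a `record[]` column into Tuple types (no such support exists in Npgsql today), and the `BlogRow` type and column layout are made up:

```csharp
using System;
using System.Collections.Generic;
using Npgsql;

class BlogRow
{
    public string Title = "";
    public List<string> PostMessages = new();
}

static class RecordArrayDemo
{
    // Hypothetical: assumes GetFieldValue<T> learned to decode a record[] column
    // into Tuple<...>[] (this does not work with Npgsql today).
    public static List<BlogRow> ReadBlogs(NpgsqlDataReader reader)
    {
        var blogs = new List<BlogRow>();
        while (reader.Read())
        {
            var blog = new BlogRow { Title = reader.GetFieldValue<string>(0) };

            // Column 1 is assumed to be a record[] of (post id, message) pairs.
            foreach (var post in reader.GetFieldValue<Tuple<int, string>[]>(1))
                blog.PostMessages.Add(post.Item2); // Item2 = message

            blogs.Add(blog);
        }
        return blogs;
    }
}
```

As noted above, each row would allocate intermediate Tuple instances that are discarded right after materialization.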
Do you think it would be an issue to create a bunch of types at runtime, such as `Tuple<int, string, Tuple<int, DateTime>[]>[]`? From what I understand, types defined at runtime cannot be garbage collected, so that might lead to a memory leak if you generate LINQ queries dynamically.
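For reference, a minimal sketch of how such constructed types come into being at runtime (standard reflection; nothing Npgsql-specific is assumed):

```csharp
using System;

static class ConstructedTypeDemo
{
    static void Main()
    {
        // A closed generic type is constructed by the runtime on first use,
        // whether it appears in source or is built via reflection:
        Type tupleType = typeof(Tuple<,>).MakeGenericType(typeof(int), typeof(string));
        Type arrayType = tupleType.MakeArrayType(); // Tuple<int, string>[]

        Console.WriteLine(arrayType);
        // => System.Tuple`2[System.Int32,System.String][]

        // Constructed types are cached by the runtime and, outside collectible
        // AssemblyLoadContexts, are never unloaded - the basis of the leak concern above.
    }
}
```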