Translate queries where a row includes a collection to use arrays instead of being implemented through joins
Following discussion in https://github.com/npgsql/efcore.pg/issues/712.
Consider queries like:

```csharp
var blogs = context.Blogs
    .Include(blog => blog.Posts)
    .ToList();
```
or something like:

```csharp
var blogs = context.Blogs
    .Where(...)
    .Select(b => new
    {
        BlogTitle = b.Title,
        LargeColumn = b.LargeColumn,
        LastPosts = b.Posts.OrderByDescending(p => p.Id).Take(4).Select(p => p.Message).ToList()
    })
    .ToList();
```
This might either lead to “cartesian explosion” if implemented through joins, or result in duplicated work on the server if the split-query strategy is used (https://docs.microsoft.com/en-us/ef/core/querying/single-split-queries).
PostgreSQL is capable of returning a `record[]` in a column, which can contain the result set of a subquery. Other query engines seem to take advantage of this, for example in GraphQL (https://www.shanestillwell.com/2017/09/15/postgresql-json-and-graphql/), although there `json` is used instead of `record[]` to encode the data.
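For a concrete idea of what such a translation could look like, here is a hand-written sketch of the LastPosts projection above. This is not the output of any current implementation; the "Blogs"/"Posts" tables and the "BlogId" foreign key are assumptions, and the elided .Where filter is omitted:

```sql
-- Hypothetical array-based translation of the LastPosts projection above.
select b."Title" as "BlogTitle",
       b."LargeColumn",
       array(select p."Message"
             from "Posts" as p
             where p."BlogId" = b."Id"
             order by p."Id" desc
             limit 4) as "LastPosts"
from "Blogs" as b
```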
I’ve done some initial performance testing regarding PostgreSQL’s query plans.
There are basically two strategies: executing a subquery, or join + group by + aggregate.
Subquery:

```sql
select *, array(select row(tbl2.*) from tbl2 where tbl1.id = tbl2.tbl1id) from tbl1
```
Join + group by + aggregate:

```sql
select tbl1.*, array_agg(tbl2) from tbl1 left join tbl2 on tbl1.id = tbl2.tbl1id group by tbl1.id
```
The first strategy includes a subquery that is executed once per row. Without an index on the foreign key this means O(n*m), because a full sequential scan of tbl2 is performed for each row in tbl1, which is pretty bad. When an index on the foreign key exists, however, it can be used, resulting in optimal time complexity. In the case of a many-to-many relationship, an extra inner join inside the subquery simply uses the primary key index on the third table to follow the extra indirection, which is good.
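For illustration, a hand-written sketch of the many-to-many case (the join table tbl1_tbl2 and its columns are assumptions, following the naming of the examples above):

```sql
-- Hypothetical schema: tbl1(id), tbl2(id), join table tbl1_tbl2(tbl1id, tbl2id).
-- The inner join resolves each tbl2 row via its primary key index.
select *,
       array(select row(tbl2.*)
             from tbl1_tbl2
             join tbl2 on tbl2.id = tbl1_tbl2.tbl2id
             where tbl1_tbl2.tbl1id = tbl1.id)
from tbl1
```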
The second strategy results in a query plan similar to that of a traditional left join, plus a sort (if no index is available for the foreign key) plus an aggregation. So it basically trades network-traffic overhead (or cartesian explosion in the case of two or more joins) for some simple aggregation work on the server side.
If there are nested includes (i.e. tbl1 [has-many] tbl2 [has-many] tbl3), then the first strategy will result in a simple and efficient query plan (assuming indexes exist!), while the second one becomes more complex assuming aggregation is performed after every join.
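As a hand-written sketch, such nested includes could translate into nested array subqueries along these lines (table and column names are assumptions):

```sql
-- Hypothetical nested translation: each tbl2 row carries its own array of tbl3 rows.
select *,
       array(select row(tbl2.*,
                        array(select row(tbl3.*)
                              from tbl3
                              where tbl3.tbl2id = tbl2.id))
             from tbl2
             where tbl2.tbl1id = tbl1.id)
from tbl1
```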
Depending on the specific data and query, a normal join could of course sometimes be more performant, especially if the data that is being duplicated is small.
To me this still looks very promising, and I suspect it could improve performance over split-query by avoiding duplicate work on the backend. In my opinion it could really be a killer feature when the user wants to perform a complex query including many navigations, or when the returned data is shaped more like a complex json structure than a 2D table. It would probably require a lot of work in the efcore repo, though.
@Emill that’s pretty much what I was proposing above - RecordReader would be a thin view over a portion of the internal NpgsqlReadBuffer (conceptually similar to a `Memory<byte>`). Either NpgsqlDataReader causes a new array to be created on Read (when undisposed RecordReaders still exist), or the RecordReaders are informed and copy their relevant portions out; the latter may be more efficient, since Npgsql’s internal buffer is big (8K by default) and we’d need to allocate a new one for each row. But I guess that’s going a bit too deep into implementation details (in any case I’m sure @smitpatel doesn’t care much 🤣).

If we just ask Npgsql for an `object[][]` for a specific column (`reader.GetFieldValue<object[][]>(index)`), Npgsql won’t know what data types we want in the inner values. It currently uses the default type mapper for whatever data type is returned from the server and casts it to an `object`. If Npgsql instead returns a `DbDataReader` for a column, we can (recursively) use the GetFieldValue API to read a column using a specific CLR type, which is something EF Core needs. Currently such tests fail with my current implementation.

As @roji mentions, `object[][]` might also create boxing overhead (I don’t know how significant that is in this case, though). We could add ValueTuple<T, …> or Tuple<T, …> support to Npgsql’s data reader, to be able to ask for e.g. `reader.GetFieldValue<Tuple<int, string, Tuple<int, DateTime>[]>[]>(index)`. That’s another way of solving the problem, but it might of course create unnecessary objects that we throw away soon after materializing the final entities.

I just realized that the latter method naturally works well when we must use the BufferedDataReader. If we use a `DbDataReader` for every `record[]` column, we must make sure those objects keep working even after reading has completed in the outer reader.
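To make the Tuple-based option concrete, here is a consumer-side sketch. It assumes a hypothetical extension of `GetFieldValue<T>` that materializes a `record[]` column into Tuple types (no such support exists in Npgsql today), and the `BlogRow` type and column layout are made up:

```csharp
using System;
using System.Collections.Generic;
using Npgsql;

class BlogRow
{
    public string Title = "";
    public List<string> PostMessages = new();
}

static class RecordArrayDemo
{
    // Hypothetical: assumes GetFieldValue<T> learned to decode a record[] column
    // into Tuple<...>[] (this does not work with Npgsql today).
    public static List<BlogRow> ReadBlogs(NpgsqlDataReader reader)
    {
        var blogs = new List<BlogRow>();
        while (reader.Read())
        {
            var blog = new BlogRow { Title = reader.GetFieldValue<string>(0) };

            // Column 1 is assumed to be a record[] of (post id, message) pairs.
            foreach (var post in reader.GetFieldValue<Tuple<int, string>[]>(1))
                blog.PostMessages.Add(post.Item2); // Item2 = message

            blogs.Add(blog);
        }
        return blogs;
    }
}
```

As noted above, each row would allocate intermediate Tuple instances that are discarded right after materialization.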
Do you think it would be an issue to create a bunch of types at runtime, such as `Tuple<int, string, Tuple<int, DateTime>[]>[]`? From what I understand, types defined at runtime cannot be garbage collected, so that might lead to a memory leak if you generate LINQ queries dynamically.
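For reference, a minimal sketch of how such constructed types come into being at runtime (standard reflection; nothing Npgsql-specific is assumed):

```csharp
using System;

static class ConstructedTypeDemo
{
    static void Main()
    {
        // A closed generic type is constructed by the runtime on first use,
        // whether it appears in source or is built via reflection:
        Type tupleType = typeof(Tuple<,>).MakeGenericType(typeof(int), typeof(string));
        Type arrayType = tupleType.MakeArrayType(); // Tuple<int, string>[]

        Console.WriteLine(arrayType);
        // => System.Tuple`2[System.Int32,System.String][]

        // Constructed types are cached by the runtime and, outside collectible
        // AssemblyLoadContexts, are never unloaded - the basis of the leak concern above.
    }
}
```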