Translating linq select of a column in a foreign key table to ARRAY subselect

Hi,

We are trying to fetch rows from one table that references another table through an n:m relation table, and to select one column from the related table as well. So far we have found no way to make the query generator produce a SELECT containing an ARRAY() subselect. Ideally the generated query would look like:

select user.id, user.name,
ARRAY(SELECT role.Name FROM role INNER JOIN user_role ON user_role.role_id = role.id WHERE user_role.user_id = user.id)
FROM user
WHERE ...

when writing the following LINQ query:

DbContext
  .User
  .Where(...)
  .Select(user => new {
    user.Id,
    user.Name,
    RoleNames = user
      .UserRoles
      .Select(userRole => userRole.Role.Name)
      .ToArray()
  })

The good news is that the LINQ query above currently generates two queries, which perform reasonably well. The first selects the columns from the user table; the second selects the role names from the related table by joining against the first user query.

select user.id, user.name FROM user WHERE ...

SELECT t1.id, t3.name, t2.user_id
FROM user_role AS t2
INNER JOIN role AS t3 ON t2.role_id = t3.id
INNER JOIN (
    SELECT t0.id
    FROM ( SELECT * FROM user WHERE ... ) AS t0
) AS t1 ON t2.user_id = t1.id
ORDER BY t1.id
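
As a rough illustration of what happens after those two result sets arrive, the client has to group the role rows by user id and attach them to the matching user. The sketch below uses made-up rows and a hypothetical `stitch` helper; it is not EF Core's actual materialization code, only a minimal model of the idea.

```python
from collections import defaultdict

# Hypothetical rows standing in for the two result sets (not real data).
user_rows = [(1, "alice"), (2, "bob"), (3, "carol")]     # (id, name)
role_rows = [(1, "admin"), (1, "editor"), (2, "viewer")]  # (user_id, role_name)

def stitch(users, roles):
    """Attach role names from the second result set to users from the first."""
    names_by_user = defaultdict(list)
    for user_id, role_name in roles:
        names_by_user[user_id].append(role_name)
    # Users without any role get an empty list.
    return [
        {"Id": uid, "Name": name, "RoleNames": names_by_user.get(uid, [])}
        for uid, name in users
    ]

for row in stitch(user_rows, role_rows):
    print(row)
```

With an ARRAY() column this grouping step would happen on the server, and each user row would already carry its role names.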

However, this also means that the database has to go through all the returned users twice: once for the users and once for the role names of each user. The ARRAY() query shown at the top would fetch the role names in a single roundtrip. Each additional related table adds another roundtrip and more latency, even though the single column from that table could be fetched with just another ARRAY() column.

Another problem we noticed is that rewriting the LINQ query to use RoleNames = new string[] { ... } for the role names makes the situation worse, because a subquery is then executed for every returned user. To me this was an unexpected side effect that I consider a bug.

Issue Analytics

Created 5 years ago
Comments: 17 (9 by maintainers)

Top GitHub Comments

Emill commented, Feb 6, 2021

Ok so I’ve done some initial performance testing regarding PostgreSQL’s query plans.

There are basically two strategies: executing a subquery, and join + group by + aggregate.

Subquery:

select *, array(select row(tbl2.*) from tbl2 where tbl1.id = tbl2.tbl1id) from tbl1

Join + group by + aggregate:

select tbl1.*, array_agg(tbl2) from tbl1 left join tbl2 on tbl1.id = tbl2.tbl1id group by tbl1.id

The first strategy includes a subquery which will be executed once per row. This means O(n*m) if there is no foreign key index, because it performs a full sequential scan on tbl2 for each row in tbl1, which is pretty bad. However, when there is an index on the foreign key, it can be used, which results in optimal time complexity. In the case of a many-to-many relationship, the extra inner join inside the subquery just uses the primary key index on the third table to follow the extra indirection, which is good.
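
To make the difference concrete, here is a small Python model of the per-row subquery that counts row visits. The tables, sizes, and function names are made up for illustration, and a dictionary stands in for the foreign key index (a real B-tree index already exists and is not rebuilt per query), so treat the numbers as a sketch of the scaling rather than a measurement of PostgreSQL.

```python
# Hypothetical in-memory stand-ins for tbl1 and tbl2.
tbl1 = [{"id": i} for i in range(1, 5)]                       # n = 4 parent rows
tbl2 = [{"id": j, "tbl1id": (j % 4) + 1} for j in range(12)]  # m = 12 child rows

def subquery_without_index(parents, children):
    """Scan every child row for every parent row: n * m row visits."""
    visits = 0
    result = {}
    for p in parents:
        matches = []
        for c in children:
            visits += 1
            if c["tbl1id"] == p["id"]:
                matches.append(c["id"])
        result[p["id"]] = matches
    return result, visits

def subquery_with_index(parents, children):
    """Build a lookup on the foreign key once, then probe it once per parent."""
    visits = 0
    index = {}
    for c in children:
        visits += 1
        index.setdefault(c["tbl1id"], []).append(c["id"])
    result = {}
    for p in parents:
        visits += 1
        result[p["id"]] = index.get(p["id"], [])
    return result, visits

scan_result, scan_visits = subquery_without_index(tbl1, tbl2)
index_result, index_visits = subquery_with_index(tbl1, tbl2)
print(scan_visits, index_visits)
```

Both functions return the same arrays per parent; only the amount of work differs (n * m visits for the scan versus n + m for the indexed variant in this model).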

The second strategy results in a query plan similar to that of a traditional left join, plus a sort (if no index is available for the foreign key) and an aggregation. So this basically trades network traffic overhead (or cartesian explosion in the case of 2 or more joins) for some simple aggregation work on the server side.
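
The row-count effect can be reproduced with a toy schema. The sketch below uses Python's built-in sqlite3 module instead of PostgreSQL, so group_concat stands in for array_agg, and tbl3 is assumed here to be a second child table of tbl1 purely for illustration. It says nothing about PostgreSQL's query plans; it only shows how many rows each query shape returns.

```python
import sqlite3

# In-memory toy schema; table contents are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (id INTEGER PRIMARY KEY);
    CREATE TABLE tbl2 (id INTEGER PRIMARY KEY, tbl1id INTEGER REFERENCES tbl1(id));
    CREATE TABLE tbl3 (id INTEGER PRIMARY KEY, tbl1id INTEGER REFERENCES tbl1(id));
    INSERT INTO tbl1 (id) VALUES (1), (2);
    INSERT INTO tbl2 (id, tbl1id) VALUES (1, 1), (2, 1), (3, 1), (4, 2);
    INSERT INTO tbl3 (id, tbl1id) VALUES (1, 1), (2, 1), (3, 2), (4, 2);
""")

def rows(sql):
    """Run a query and return all result rows."""
    return conn.execute(sql).fetchall()

# Plain left join: parent columns are repeated once per child row.
plain_one_join = rows("""
    SELECT tbl1.id, tbl2.id
    FROM tbl1 LEFT JOIN tbl2 ON tbl2.tbl1id = tbl1.id
""")

# Two plain left joins: children of the same parent multiply each other.
plain_two_joins = rows("""
    SELECT tbl1.id, tbl2.id, tbl3.id
    FROM tbl1
    LEFT JOIN tbl2 ON tbl2.tbl1id = tbl1.id
    LEFT JOIN tbl3 ON tbl3.tbl1id = tbl1.id
""")

# Join + group by + aggregate: one row per parent.
grouped = rows("""
    SELECT tbl1.id, group_concat(tbl2.id)
    FROM tbl1 LEFT JOIN tbl2 ON tbl2.tbl1id = tbl1.id
    GROUP BY tbl1.id
    ORDER BY tbl1.id
""")

print(len(plain_one_join), len(plain_two_joins), len(grouped))
```

With this data the single join returns 4 rows, the double join 8, and the grouped query 2, one per row in tbl1.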

If there are nested includes (i.e. tbl1 [has-many] tbl2 [has-many] tbl3), then the first strategy will result in a simple and efficient query plan (assuming indexes exist!), while the second one becomes more complex assuming aggregation is performed after every join.

Depending on the specific data and query, a normal join could of course sometimes be more performant, especially if the data that is being duplicated is small.

To me this still looks very promising, and I guess it would improve performance over split-query, since the backend would no longer have to perform duplicate work. It could really be a killer feature in my opinion when the user wants to perform a complex query including many navigations, or when the returned data is more like a complex JSON structure than a 2D table. Not sure if this issue should be re-opened, or if it’s more of an efcore issue, even though it’s quite PostgreSQL-specific.

dpsenner commented, Nov 20, 2018

I’m giving my subconscious a chance to work on this problem. Read you soon.