Bug in handling select_related with limit and all() because of many2many relation
When a model has a many2many relationship and you fetch all records together with the related models using a limit, the .all() function returns the wrong number of records.
I debugged the issue up to the point where the instances are merged in modelproxy.py: merge_instances_list(result_rows).
When entering the merge_instances_list function, the result_rows includes all the records, but it seems that the query that it runs includes multiple rows of the same instance, e.g. for the many2many relation.
Now this is unexpected behavior, since I’d expect to get all the rows from the database, not the grouped instances of the record set.
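To illustrate the effect described above, here is a minimal sketch of how joined rows collapse into fewer instances once they are merged by primary key. The `merge_rows` helper and the row tuples are hypothetical and only stand in for what a merge step might do; this is not ormar's actual `merge_instances_list` implementation.

```python
# Hypothetical joined rows: (primary_id, primary_name, keyword_name).
# A JOIN returns one row per (parent, keyword) pair, so parents repeat.
joined_rows = [
    (1, "Primary 1", "Tag 2"),
    (1, "Primary 1", "Tag 4"),
    (2, "Primary 2", "Tag 1"),
    (2, "Primary 2", "Tag 3"),
    (2, "Primary 2", "Tag 5"),
]


def merge_rows(rows):
    """Collapse joined rows into one dict per parent, collecting its keywords."""
    merged = {}
    for primary_id, name, keyword in rows:
        parent = merged.setdefault(
            primary_id, {"id": primary_id, "name": name, "keywords": []}
        )
        if keyword is not None:
            parent["keywords"].append(keyword)
    # Dicts preserve insertion order, so parents stay in row order.
    return list(merged.values())


models = merge_rows(joined_rows)
print(len(joined_rows), len(models))
```

Five raw rows go in, but only two merged parents come out, which matches the count seen in the failing test.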
Here’s a test setup for proving the issue:
from typing import List, Optional

import ormar
import pytest

from app.db.database import db, engine, metadata


class Keyword(ormar.Model):
    class Meta:
        metadata = metadata
        database = db
        tablename = "keywords"

    id: int = ormar.Integer(primary_key=True)
    name: str = ormar.String(max_length=50)


class KeywordPrimaryModel(ormar.Model):
    class Meta:
        metadata = metadata
        database = db
        tablename = "primary_models_keywords"

    id: int = ormar.Integer(primary_key=True)


class PrimaryModel(ormar.Model):
    class Meta:
        metadata = metadata
        database = db
        tablename = "primary_models"

    id: int = ormar.Integer(primary_key=True)
    name: str = ormar.String(max_length=255, index=True)
    some_text: str = ormar.Text()
    some_other_text: Optional[str] = ormar.Text(nullable=True)
    keywords: Optional[List[Keyword]] = ormar.ManyToMany(
        Keyword, through=KeywordPrimaryModel
    )


class SecondaryModel(ormar.Model):
    class Meta:
        metadata = metadata
        database = db
        tablename = "secondary_models"

    id: int = ormar.Integer(primary_key=True)
    name: str = ormar.String(max_length=100)
    primary_model: PrimaryModel = ormar.ForeignKey(
        PrimaryModel,
        related_name="secondary_models",
    )


@pytest.mark.asyncio
@pytest.mark.parametrize("tag_id", [1, 2, 3, 4, 5])
async def test_create_keywords(tag_id):
    await Keyword.objects.create(name=f"Tag {tag_id}")


@pytest.mark.asyncio
@pytest.mark.parametrize(
    "name, some_text, some_other_text",
    [
        ("Primary 1", "Some text 1", "Some other text 1"),
        ("Primary 2", "Some text 2", "Some other text 2"),
        ("Primary 3", "Some text 3", "Some other text 3"),
        ("Primary 4", "Some text 4", "Some other text 4"),
        ("Primary 5", "Some text 5", "Some other text 5"),
        ("Primary 6", "Some text 6", "Some other text 6"),
        ("Primary 7", "Some text 7", "Some other text 7"),
        ("Primary 8", "Some text 8", "Some other text 8"),
        ("Primary 9", "Some text 9", "Some other text 9"),
        ("Primary 10", "Some text 10", "Some other text 10"),
    ],
)
async def test_create_primary_models(name, some_text, some_other_text):
    await PrimaryModel(
        name=name, some_text=some_text, some_other_text=some_other_text
    ).save()


@pytest.mark.asyncio
async def test_add_keywords():
    p1 = await PrimaryModel.objects.get(pk=1)
    p2 = await PrimaryModel.objects.get(pk=2)
    for i in range(1, 6):
        keyword = await Keyword.objects.get(pk=i)
        if i % 2 == 0:
            await p1.keywords.add(keyword)
        else:
            await p2.keywords.add(keyword)


@pytest.mark.asyncio
async def test_create_secondary_model():
    secondary = await SecondaryModel(name="Foo", primary_model=1).save()
    assert secondary.id == 1
    assert secondary.primary_model.id == 1


@pytest.mark.asyncio
async def test_list_primary_models_with_keywords_and_limit():
    models = await PrimaryModel.objects.select_related("keywords").limit(5).all()
    # This test fails, because of the keywords relation.
    assert len(models) == 5


@pytest.mark.asyncio
async def test_list_primary_models_without_keywords_and_limit():
    models = await PrimaryModel.objects.all()
    assert len(models) == 10


@pytest.mark.asyncio
async def test_list_primary_models_without_keywords_but_with_limit():
    models = await PrimaryModel.objects.limit(5).all()
    assert len(models) == 5


@pytest.mark.asyncio
async def test_update_secondary():
    secondary = await SecondaryModel.objects.get(id=1)
    assert secondary.name == "Foo"
    await secondary.update(name="Updated")
    assert secondary.name == "Updated"


@pytest.fixture(autouse=True, scope="module")
def create_test_database():
    metadata.create_all(engine)
    yield
    metadata.drop_all(engine)
Here the test fails with len(models) being 2, not 5 as it should.
The grouping should probably happen in the query so that all records are returned.
Top GitHub Comments
@collerek Now that I’m really awake again, I’m not entirely sure how that would be done after all. You are correct, there must be some processing also on the Python side.
OK, so this is one of the reasons why e.g. django doesn't allow select_related on M2M fields and reverse FKs.
If you issue a select_related query, one (potentially huge) joined query is constructed and all the typical SQL clauses are applied to the whole joined query, so offset, limit, where etc. are applied at the end. Since it's a join across multiple tables, the raw SQL response contains duplicated values for the parent. When you apply limit, it applies to the SQL rows (that's how SQL limit works). That's why you get the first 5 rows of data, meaning the first 2 PrimaryModels: together they have 5 children, so they consume the whole limit in raw SQL rows.
Now, in order to limit this to 5 rows of the primary model, I would have to know in advance either how many children the parents have (and children of children of children if it's a multiple-join query), or extract the ids of those parents first. Both are possible, but require an additional query against the database.
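The row-level behaviour of limit can be reproduced outside ormar. The following is a minimal sketch using the standard-library sqlite3 module; the table names follow the test setup, but the join-table column names (`primary_model`, `keyword`) and the exact SQL are assumptions for illustration, not the query ormar generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE primary_models (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE keywords (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE primary_models_keywords (
        id INTEGER PRIMARY KEY,
        primary_model INTEGER REFERENCES primary_models(id),
        keyword INTEGER REFERENCES keywords(id)
    );
    """
)
conn.executemany(
    "INSERT INTO primary_models (id, name) VALUES (?, ?)",
    [(i, f"Primary {i}") for i in range(1, 11)],
)
conn.executemany(
    "INSERT INTO keywords (id, name) VALUES (?, ?)",
    [(i, f"Tag {i}") for i in range(1, 6)],
)
# As in the test setup: even keywords go to parent 1, odd ones to parent 2.
conn.executemany(
    "INSERT INTO primary_models_keywords (primary_model, keyword) VALUES (?, ?)",
    [(1 if i % 2 == 0 else 2, i) for i in range(1, 6)],
)

# LIMIT is applied to the joined rows, not to distinct parents.
rows = conn.execute(
    """
    SELECT p.id, k.name
    FROM primary_models AS p
    LEFT OUTER JOIN primary_models_keywords AS pk ON pk.primary_model = p.id
    LEFT OUTER JOIN keywords AS k ON k.id = pk.keyword
    ORDER BY p.id, k.id
    LIMIT 5
    """
).fetchall()

distinct_parents = sorted({parent_id for parent_id, _ in rows})
print(len(rows), distinct_parents)
```

The query returns five rows, yet they belong to only two parents, because parents 1 and 2 together own all five keywords.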
I could do it in Python, but without knowing either of the two in advance I would always have to fetch all the data from the join and then limit the number of parent models in the result list (wasting the rest of the fetched data).
I don't know if I will implement it, because it might be a huge effort and/or slow everything down quite a lot with that additional query, and select_related is specifically designed to be a quick, single-db-call query.
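For reference, the "extract the ids of those parents first" option could look roughly like the sketch below, where the limit is applied to the parent table in a derived table before joining. It uses the standard-library sqlite3 module; the join-table column names and the SQL are assumptions for illustration and do not describe what ormar does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE primary_models (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE keywords (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE primary_models_keywords (
        id INTEGER PRIMARY KEY,
        primary_model INTEGER,
        keyword INTEGER
    );
    """
)
conn.executemany(
    "INSERT INTO primary_models (id, name) VALUES (?, ?)",
    [(i, f"Primary {i}") for i in range(1, 11)],
)
conn.executemany(
    "INSERT INTO keywords (id, name) VALUES (?, ?)",
    [(i, f"Tag {i}") for i in range(1, 6)],
)
conn.executemany(
    "INSERT INTO primary_models_keywords (primary_model, keyword) VALUES (?, ?)",
    [(1 if i % 2 == 0 else 2, i) for i in range(1, 6)],
)

# Limit the parents first, then join the children onto that limited set.
rows = conn.execute(
    """
    SELECT p.id, k.name
    FROM (SELECT id FROM primary_models ORDER BY id LIMIT 5) AS p
    LEFT OUTER JOIN primary_models_keywords AS pk ON pk.primary_model = p.id
    LEFT OUTER JOIN keywords AS k ON k.id = pk.keyword
    ORDER BY p.id, k.id
    """
).fetchall()

distinct_parents = sorted({parent_id for parent_id, _ in rows})
print(len(rows), distinct_parents)
```

Here the result covers five distinct parents, at the cost of the extra inner select that the database has to evaluate.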
BUT - worry not 😃
That's one of the reasons why prefetch_related was introduced. Your solution is as simple as changing select_related to prefetch_related in your query, and the test will pass.
The reason is that prefetch_related grabs the related models in consecutive queries after the initial one has completed, and limit/offset apply to the first query issued.
So it grabs 5 rows from the primary model and then fetches the child models for only those 5 already-fetched models. It should be better documented, that's for sure 😃
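In the failing test, the change would presumably read `PrimaryModel.objects.prefetch_related("keywords").limit(5).all()`. The two-step pattern behind it can be sketched with the standard-library sqlite3 module as below; the join-table column names and the SQL are assumptions for illustration, not the queries ormar actually issues.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE primary_models (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE keywords (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE primary_models_keywords (
        id INTEGER PRIMARY KEY,
        primary_model INTEGER,
        keyword INTEGER
    );
    """
)
conn.executemany(
    "INSERT INTO primary_models (id, name) VALUES (?, ?)",
    [(i, f"Primary {i}") for i in range(1, 11)],
)
conn.executemany(
    "INSERT INTO keywords (id, name) VALUES (?, ?)",
    [(i, f"Tag {i}") for i in range(1, 6)],
)
conn.executemany(
    "INSERT INTO primary_models_keywords (primary_model, keyword) VALUES (?, ?)",
    [(1 if i % 2 == 0 else 2, i) for i in range(1, 6)],
)

# Query 1: the limit applies to the parent table only.
parents = conn.execute(
    "SELECT id, name FROM primary_models ORDER BY id LIMIT 5"
).fetchall()
parent_ids = [parent_id for parent_id, _ in parents]

# Query 2: fetch children only for the parents already fetched.
placeholders = ", ".join("?" for _ in parent_ids)
children = conn.execute(
    f"""
    SELECT pk.primary_model, k.name
    FROM primary_models_keywords AS pk
    JOIN keywords AS k ON k.id = pk.keyword
    WHERE pk.primary_model IN ({placeholders})
    ORDER BY pk.primary_model, k.id
    """,
    parent_ids,
).fetchall()

# Attach the children to their parents in Python.
keywords_by_parent = {parent_id: [] for parent_id in parent_ids}
for parent_id, keyword_name in children:
    keywords_by_parent[parent_id].append(keyword_name)

models = [
    {"id": parent_id, "name": name, "keywords": keywords_by_parent[parent_id]}
    for parent_id, name in parents
]
print(len(models))
```

Because the limit never touches the joined rows, the result holds five parents, with keywords attached only to the two that have any.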
Let me know if that solves your issue.