Using Postgres arrays in relations for very significant performance improvements?
Having done a fair bit of investigation into many ORMs, it seems there is a ubiquitous design flaw when it comes to their Postgres implementations. This doesn’t just apply to Node ORMs but to all the ORMs I’ve investigated, which is by no means an exhaustive list.
This design flaw means that significantly more IO is required to support one-to-many and many-to-many relationships than would be needed using the “advanced” features of Postgres.
Currently all ORMs that support *-to-many relationships do so either by using a “join table” in the case of many-to-many or by referencing a foreign key on the “remote” table in the case of one-to-many.
As Postgres has the ability to store “array values”, this means it has the ability to store a “multi-valued” reference to the remote table directly from the local table thereby eliminating the need for a join table. Depending on the number of related objects, this approach has the potential to reduce the amount of logical IO required to manage relationships by a factor of 10 or even 100+.
Unfortunately, no ORMs seem to take advantage of this ability.
(Note: Needing a foreign key on the remote table in order to implement a “hasMany” relationship also violates something vaguely like the Law of Demeter / Principle of Least Knowledge. Why should the remote table have to “know” anything about the referencing table?)
Having done my investigation of ORMs, I would really like to use Bookshelf, as it seems the best by a significant margin and for a number of reasons. BUT I can’t bring myself to sacrifice the performance / scalability improvements available with Postgres, and I don’t want to end up having to maintain my own fork of Bookshelf just for what would probably end up being a “minor” code change.
So, the reasons for this post are:
a) If you agree with my reasoning, it might be relatively easy for you to make the required changes to support this functionality in Postgres (obviously it won’t work for other SQL databases).
b) There’s no point in me doing the work of making changes to Bookshelf if any pull request were to fall on stony ground.
If nothing else, even a “sorry, not interested” would be helpful as then I will know where I stand.
The original post, providing an example of my problem, is below.
I’m trying to set up relations that make use of Postgres’s ability to store array values in a single column. So rather than using a join table or whatever, table 1 would have a column with a list of references to table 2.
CREATE TABLE properties
(
    id serial NOT NULL,
    title character varying(100),
    created_at timestamp with time zone,
    updated_at timestamp with time zone,
    media_id integer[],  -- array of references to media.id
    CONSTRAINT property_id PRIMARY KEY (id)
);
CREATE TABLE media
(
    id serial NOT NULL,
    filepath character varying(250),
    created_at timestamp with time zone,
    updated_at timestamp with time zone,
    CONSTRAINT media_id PRIMARY KEY (id)
);
Each value in the media_id column stores multiple references to primary keys in the media table.
I had hoped I’d be able to find a configuration option for hasMany that would allow it to retrieve related media directly using properties.media_id rather than needing a property_id field on the media table, but it’s looking like this isn’t the case?
So assuming this isn’t possible directly with any of the standard Bookshelf relation functions, I’m wondering if there’s some “sneaky” way to have some kind of “custom relation” type where I can pass in a “join function”.
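To make the “join function” idea concrete, here is a minimal sketch of what such a function might do in plain application code once the rows are loaded. It is not Bookshelf API: `attachByArray` and its parameters are hypothetical names, and the sample rows are made up to match the tables above.

```javascript
// Hypothetical helper, not part of Bookshelf: resolves an array-valued
// reference column against rows already fetched from the remote table.
function attachByArray(parents, arrayColumn, remoteRows, relationName) {
  // Index the remote rows by primary key so each lookup is a Map access.
  const byId = new Map(remoteRows.map((row) => [row.id, row]));
  return parents.map((parent) => {
    const ids = parent[arrayColumn] || []; // a NULL array means no related rows
    const related = ids
      .map((id) => byId.get(id))
      .filter((row) => row !== undefined); // skip references with no matching row
    return { ...parent, [relationName]: related };
  });
}

// Made-up rows shaped like the properties and media tables.
const properties = [
  { id: 1, title: 'Flat', media_id: [10, 11] },
  { id: 2, title: 'House', media_id: null },
];
const media = [
  { id: 10, filepath: '/img/a.jpg' },
  { id: 11, filepath: '/img/b.jpg' },
];

const withMedia = attachByArray(properties, 'media_id', media, 'media');
console.log(withMedia[0].media.map((m) => m.filepath));
```

The related rows come back in the order the ids appear in the array, which is something a join table would need an extra ordering column to express.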
The query I need to execute is along the lines of
select * from media
where id = ANY ((select media_id from properties where id = $1)::INT[])
or
select * from media where id in ($media_id_from_property_model)
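The first form can be wrapped as a parameterized query object. This is only a sketch: it assumes a driver that accepts a `{ text, values }` config (node-postgres's `client.query` does), and `mediaForPropertyQuery` is a hypothetical name, not something Bookshelf provides.

```javascript
// Hypothetical helper: builds the array-reference lookup as a
// parameterized query config instead of interpolating the id into SQL.
function mediaForPropertyQuery(propertyId) {
  if (!Number.isInteger(propertyId)) {
    throw new TypeError('propertyId must be an integer');
  }
  return {
    // The ::int[] cast is what lets ANY treat the scalar subquery
    // result as an array value rather than as a set-returning subquery.
    text:
      'SELECT media.* FROM media ' +
      'WHERE media.id = ANY ((SELECT media_id FROM properties WHERE id = $1)::int[])',
    values: [propertyId],
  };
}

const query = mediaForPropertyQuery(42);
console.log(query.text, query.values);
```

With node-postgres this could be run as `client.query(mediaForPropertyQuery(42))`; the second form above would instead need the array fetched first and passed as its own parameter.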
Any suggestions would be appreciated.
Issue Analytics
- Created 9 years ago
- Reactions: 5
- Comments: 15 (3 by maintainers)
No benchmarks, but the performance benefits are obvious. With arrays, you avoid the redundant database reads of the join table.
I finally got around to running some benchmarks on Arrays vs Join Tables.
I created a Gist with my benchmark code:
https://gist.github.com/simg/2f28e9dcb6207dbaa11a285021935fe2
tl;dr: Array references were 5 times faster than join tables for retrieving “objects”. Arrays were up to twice as fast for inserting “objects”, but only once the number of relationships gets quite high (e.g. ~100 or so).