Writing resource data to sqlite DB raises "sqlite3.OperationalError: too many SQL variables"
Hi,
writing sufficiently “wide” (as in number of fields/columns) and “long” (as in number of rows) table data easily runs into “sqlite3.OperationalError: too many SQL variables” for SQLite.
The reason seems to be how the insert operation is done in `frictionless/plugins/sql/storage.py`.
For a given sample program

```python
# frictionless_sqlite_params_error.py
from frictionless import Resource
from frictionless.plugins.sql import SqlDialect

# sufficiently 'wide' and 'long' table data to provoke SQLite param
# restrictions
number_of_fields = 90
number_of_rows = 100

data = '\n'.join([
    ','.join(f'header{i}' for i in range(number_of_fields)),
    '\n'.join(
        ','.join(f'row{r}_col{c}' for c in range(number_of_fields))
        for r in range(number_of_rows)
    )
]).encode('ascii')

with Resource(data, format='csv') as resource:
    resource.write('sqlite:///app.db', dialect=SqlDialect(table='datatable'))
```
…this creates a huge “bulk insert” SQL statement:

```
[SQL: INSERT INTO datatable (header0, header1, header2, header3, header4, header5, header6, header7, header8, header9, header10, header11, header12, header13, header14, header15, header16, header17, header18, header19, header20, header21, header22, header23, header24, header25, header26, header27, header28, header29, header30, header31, header32, header33, header34, header35, header36, header37, header38, header39, header40, header41, header42, header43, header44, header45, header46, header47, header48, header49, header50, header51, header52, header53, header54, header55, header56, header57, header58, header59, header60, header61, header62, header63, header64, header65, header66, header67, header68, header69, header70, header71, header72, header73, header74, header75, header76, header77, header78, header79, header80, header81, header82, header83, header84, header85, header86, header87, header88, header89) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?), (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?), ...
```
I.e. this uses separate parameters for each field, for each row.
According to https://www.sqlite.org/limits.html the max. number of params in an SQL statement is “[…] SQLITE_MAX_VARIABLE_NUMBER, which defaults to 999 for SQLite versions prior to 3.32.0 (2020-05-22) or 32766 for SQLite versions after 3.32.0. […]”.
I’m running on 3.22.0, which is why I stumbled into the limit immediately. While the limit has obviously been lifted somewhat in newer versions, I still think this is not a viable approach: you’ll just run into the problem again for bigger (more rows) tables.
It does look like there’s some code that addresses this to a degree (the `buffer` and `buffer_size` parts), but IMHO this is not sufficient (think tables with 50 columns and 800 rows = 40,000 params > 32,766, again).
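To illustrate the limit itself, here’s a minimal standalone sketch using only the stdlib `sqlite3` module (not frictionless; table and column names are made up):

```python
import sqlite3

# 50 columns x 800 rows = 40000 bound parameters, as in the example above
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (' + ', '.join(f'c{i} TEXT' for i in range(50)) + ')')

rows = [tuple(f'row{r}_col{c}' for c in range(50)) for r in range(800)]
placeholders = '(' + ', '.join(['?'] * 50) + ')'

# multi-row VALUES insert: a single statement binding all 40000 parameters
flat_params = [value for row in rows for value in row]
try:
    conn.execute(
        'INSERT INTO t VALUES ' + ', '.join([placeholders] * 800),
        flat_params,
    )
except sqlite3.OperationalError as exc:
    print(exc)  # "too many SQL variables" on builds with the default limit

# executemany: one prepared statement, only 50 parameters bound per row
conn.executemany('INSERT INTO t VALUES ' + placeholders, rows)
```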
I’ve taken a look at SQLAlchemy’s possibilities and it seems to me one should rather use

```python
self.__connection.execute(sql_table.insert(), buffer)
```

instead of

```python
self.__connection.execute(sql_table.insert().values(buffer))
```

(see https://docs.sqlalchemy.org/en/14/tutorial/dbapi_transactions.html#tutorial-multiple-parameters) and make SQLAlchemy apply `cursor.executemany`.
In contrast, the former creates a single prepared, parameterized SQL statement:

```
INSERT INTO datatable (header0, header1, header2, header3, header4, header5, header6, header7, header8, header9, header10, header11, header12, header13, header14, header15, header16, header17, header18, header19, header20, header21, header22, header23, header24, header25, header26, header27, header28, header29, header30, header31, header32, header33, header34, header35, header36, header37, header38, header39, header40, header41, header42, header43, header44, header45, header46, header47, header48, header49, header50, header51, header52, header53, header54, header55, header56, header57, header58, header59, header60, header61, header62, header63, header64, header65, header66, header67, header68, header69, header70, header71, header72, header73, header74, header75, header76, header77, header78, header79, header80, header81, header82, header83, header84, header85, header86, header87, header88, header89) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
```
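For context, here’s a minimal standalone sketch of both call styles (a hypothetical two-column table and an in-memory DB, SQLAlchemy 1.4 style; not the actual frictionless code):

```python
from sqlalchemy import Column, MetaData, String, Table, create_engine

engine = create_engine('sqlite://')  # in-memory SQLite
metadata = MetaData()
sql_table = Table(
    'datatable', metadata,
    Column('header0', String),
    Column('header1', String),
)
metadata.create_all(engine)

buffer = [
    {'header0': 'row0_col0', 'header1': 'row0_col1'},
    {'header0': 'row1_col0', 'header1': 'row1_col1'},
]

with engine.begin() as connection:
    # executemany style: one prepared statement, parameters bound per row
    connection.execute(sql_table.insert(), buffer)

    # multi-row VALUES style: rows x columns parameters in one statement --
    # this is the variant that hits the SQLite limit for large buffers
    connection.execute(sql_table.insert().values(buffer))
```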
I’m aware this might have performance implications, depending on whether the SQL engine implements `executemany` efficiently; I haven’t measured. I’m not convinced the “params for each field for each row” approach is faster in the first place, but if it is, one could also use the `executemany` approach as a fallback in case of running into the exception(?).
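Such a fallback could look roughly like this hypothetical helper (not actual frictionless code; depending on the backend, a failed statement might also require a transaction rollback before retrying):

```python
from sqlalchemy.exc import OperationalError

def insert_buffer(connection, sql_table, buffer):
    """Try the multi-row VALUES insert, fall back to the executemany style."""
    try:
        connection.execute(sql_table.insert().values(buffer))
    except OperationalError:
        # e.g. sqlite3.OperationalError: too many SQL variables
        connection.execute(sql_table.insert(), buffer)
```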
So a naive fix to not run into the error would be

```
$ diff frictionless/plugins/sql//storage.py.BAK frictionless/plugins/sql//storage.py
333c333
< self.__connection.execute(sql_table.insert().values(buffer))
---
> self.__connection.execute(sql_table.insert(), buffer)
336c336
< self.__connection.execute(sql_table.insert().values(buffer))
---
> self.__connection.execute(sql_table.insert(), buffer)
```
(no changes to the `buffer` and `buffer_size` parts here since I didn’t really understand their full intention 😉).
If helpful I could come up with a PR.
Again, awesome library, best regards, Holger
Top GitHub Comments
Hi there @shashigharti @roll, PR #1255 is there (finally) - sorry for the long silence.
That’s the minimal code change discussed in the previous ticket comments, plus an accompanying test.
As a little extra, and as compensation for your patience, I’ve benchmarked the principal `executemany` approach against the one previously used in frictionless-framework’s `storage.py`. You can find it all here: https://github.com/hjoukl/bench-sqlalchemy-executemany 😃

The gist of it is that `executemany` is superior to the previous approach performance-wise, sometimes vastly (especially with SQLite). I’ve run the sample benchmark for the DB engines tested in your CI (SQLite, PostgreSQL, MySQL).
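For a rough idea of how such a comparison can be set up, here’s a simplified standalone sketch (not the code from the linked repo; sizes are kept below the parameter limits so both styles run, and a real benchmark would repeat and average the runs):

```python
import time
from sqlalchemy import Column, MetaData, String, Table, create_engine

engine = create_engine('sqlite://')  # in-memory SQLite
metadata = MetaData()
table = Table('t', metadata, *(Column(f'c{i}', String) for i in range(20)))
metadata.create_all(engine)

rows = [{f'c{i}': f'row{r}_col{i}' for i in range(20)} for r in range(40)]

with engine.begin() as connection:
    start = time.perf_counter()
    connection.execute(table.insert(), rows)         # executemany style
    executemany_secs = time.perf_counter() - start

    start = time.perf_counter()
    connection.execute(table.insert().values(rows))  # multi-row VALUES style
    values_secs = time.perf_counter() - start

print(f'executemany: {executemany_secs:.6f}s, multi-row VALUES: {values_secs:.6f}s')
```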
Btw, I’ve stumbled over some rough edges in the v5 docs (e.g. invalid examples using no-longer-existing `plugins` imports) and some missing bits that might be useful for the development/contribution docs (like “copy .env.example to .env before running `make test`”). Would you like some suggestions wrt this, in the form of tickets or otherwise?

Greetings, Holger
Hi @roll and @shashigharti,
great to see you considering this issue! Just a note wrt the classification change from bug to enhancement:
Since stock RHEL8 seems to ship with old sqlite3 defaults (sqlite version + compile options), the usability of frictionless-py is severely limited on this platform, which I think is in very widespread use in “enterprisey” Linux environments (probably predominant). The same goes for the older RHEL7. I.e. you’d basically need to be able to install a newer sqlite version, or compile sqlite yourself and use that instead of the system sqlite, which might mean you’d also need to recompile Python (at least its sqlite3 extension)… This is something most users won’t be able/allowed to do in a corporate environment, in my experience.
You run into those limits with a table of just 20 fields and 50 rows (20 × 50 = 1,000 bound params > 999), i.e. not only with tables that have an unrealistically high number of columns.
So I basically think a change like the one proposed above is both a bugfix (the 1000-row chunking is not enough to avoid exceptions in general) and an enhancement (it looks like it’s dramatically faster for SQLite, though of course other DB backends might behave completely differently - this needs proper benchmarking).
Hopefully I’m not getting on anybody’s nerves here - of course it’s entirely your call how you handle this. I just wanted to bring to your attention that frictionless-py might not be properly usable with an SQLite DB backend on the mentioned platforms at all, which would be a shame IMHO since it’s so cool. 😃

Best regards, Holger