Writing resource data to sqlite DB raises "sqlite3.OperationalError: too many SQL variables"
Hi,
writing sufficiently “wide” (as in number of fields/columns) and “long” (as in number of rows) table data easily runs into “sqlite3.OperationalError: too many SQL variables” for SQLite.
The reason seems to be how the insert operation is done in `frictionless/plugins/sql/storage.py`.
For a given sample program

```python
# frictionless_sqlite_params_error.py
from frictionless import Resource
from frictionless.plugins.sql import SqlDialect

# sufficiently 'wide' and 'long' table data to provoke SQLite param
# restrictions
number_of_fields = 90
number_of_rows = 100

data = '\n'.join([
    ','.join(f'header{i}' for i in range(number_of_fields)),
    '\n'.join(
        ','.join(f'row{r}_col{c}' for c in range(number_of_fields))
        for r in range(number_of_rows)
    )
]).encode('ascii')

with Resource(data, format='csv') as resource:
    resource.write('sqlite:///app.db', dialect=SqlDialect(table='datatable'))
```
…this creates a huge “bulk insert” SQL statement:

```
[SQL: INSERT INTO datatable (header0, header1, header2, header3, header4, header5, header6, header7, header8, header9, header10, header11, header12, header13, header14, header15, header16, header17, header18, header19, header20, header21, header22, header23, header24, header25, header26, header27, header28, header29, header30, header31, header32, header33, header34, header35, header36, header37, header38, header39, header40, header41, header42, header43, header44, header45, header46, header47, header48, header49, header50, header51, header52, header53, header54, header55, header56, header57, header58, header59, header60, header61, header62, header63, header64, header65, header66, header67, header68, header69, header70, header71, header72, header73, header74, header75, header76, header77, header78, header79, header80, header81, header82, header83, header84, header85, header86, header87, header88, header89) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?), (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?), ...
```
I.e. this uses separate parameters for each field, for each row.
According to https://www.sqlite.org/limits.html the max. number of params in an SQL statement is “[…] SQLITE_MAX_VARIABLE_NUMBER, which defaults to 999 for SQLite versions prior to 3.32.0 (2020-05-22) or 32766 for SQLite versions after 3.32.0. […]”.
I’m running on 3.22.0, which is why I stumbled into the limit immediately. While the limit has obviously been lifted somewhat in newer versions, I still think this is not a viable approach: you’ll just run into the problem again for bigger (more rows) tables.
It does look like there’s some code that addresses this to a degree (the `buffer` and `buffer_size` parts), but IMHO this is not sufficient (think tables with 50 columns and 800 rows = 40,000 params > 32,766, again).
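To illustrate the limit itself, here’s a minimal standalone sketch using only the stdlib `sqlite3` module (not frictionless; table and column names are made up):

```python
import sqlite3

# 50 columns x 800 rows = 40000 bound parameters, as in the example above
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (' + ', '.join(f'c{i} TEXT' for i in range(50)) + ')')

rows = [tuple(f'row{r}_col{c}' for c in range(50)) for r in range(800)]
placeholders = '(' + ', '.join(['?'] * 50) + ')'

# multi-row VALUES insert: a single statement binding all 40000 parameters
flat_params = [value for row in rows for value in row]
try:
    conn.execute(
        'INSERT INTO t VALUES ' + ', '.join([placeholders] * 800),
        flat_params,
    )
except sqlite3.OperationalError as exc:
    print(exc)  # "too many SQL variables" on builds with the default limit

# executemany: one prepared statement, only 50 parameters bound per row
conn.executemany('INSERT INTO t VALUES ' + placeholders, rows)
```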
I’ve taken a look at SQLAlchemy’s possibilities and it seems to me one should rather use

```python
self.__connection.execute(sql_table.insert(), buffer)
```

instead of

```python
self.__connection.execute(sql_table.insert().values(buffer))
```

(see https://docs.sqlalchemy.org/en/14/tutorial/dbapi_transactions.html#tutorial-multiple-parameters) and make SQLAlchemy apply `cursor.executemany`.
In contrast, the former creates a single prepared, parameterized SQL statement:

```
INSERT INTO datatable (header0, header1, header2, header3, header4, header5, header6, header7, header8, header9, header10, header11, header12, header13, header14, header15, header16, header17, header18, header19, header20, header21, header22, header23, header24, header25, header26, header27, header28, header29, header30, header31, header32, header33, header34, header35, header36, header37, header38, header39, header40, header41, header42, header43, header44, header45, header46, header47, header48, header49, header50, header51, header52, header53, header54, header55, header56, header57, header58, header59, header60, header61, header62, header63, header64, header65, header66, header67, header68, header69, header70, header71, header72, header73, header74, header75, header76, header77, header78, header79, header80, header81, header82, header83, header84, header85, header86, header87, header88, header89) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
```
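For context, here’s a minimal standalone sketch of both call styles (a hypothetical two-column table and an in-memory DB, SQLAlchemy 1.4 style; not the actual frictionless code):

```python
from sqlalchemy import Column, MetaData, String, Table, create_engine

engine = create_engine('sqlite://')  # in-memory SQLite
metadata = MetaData()
sql_table = Table(
    'datatable', metadata,
    Column('header0', String),
    Column('header1', String),
)
metadata.create_all(engine)

buffer = [
    {'header0': 'row0_col0', 'header1': 'row0_col1'},
    {'header0': 'row1_col0', 'header1': 'row1_col1'},
]

with engine.begin() as connection:
    # executemany style: one prepared statement, parameters bound per row
    connection.execute(sql_table.insert(), buffer)

    # multi-row VALUES style: rows x columns parameters in one statement --
    # this is the variant that hits the SQLite limit for large buffers
    connection.execute(sql_table.insert().values(buffer))
```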
I’m aware this might have performance implications, depending on whether the SQL engine implements `executemany` efficiently; I haven’t measured. I’m not convinced the “params for each field for each row” approach is faster in the first place, but if it is, one could also use the `executemany` approach as a fallback in case of running into the exception(?).
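Such a fallback could look roughly like this hypothetical helper (not actual frictionless code; depending on the backend, a failed statement might also require a transaction rollback before retrying):

```python
from sqlalchemy.exc import OperationalError

def insert_buffer(connection, sql_table, buffer):
    """Try the multi-row VALUES insert, fall back to the executemany style."""
    try:
        connection.execute(sql_table.insert().values(buffer))
    except OperationalError:
        # e.g. sqlite3.OperationalError: too many SQL variables
        connection.execute(sql_table.insert(), buffer)
```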
So a naive fix to not run into the error would be

```
$ diff frictionless/plugins/sql//storage.py.BAK frictionless/plugins/sql//storage.py
333c333
< self.__connection.execute(sql_table.insert().values(buffer))
---
> self.__connection.execute(sql_table.insert(), buffer)
336c336
< self.__connection.execute(sql_table.insert().values(buffer))
---
> self.__connection.execute(sql_table.insert(), buffer)
```
(no changes to the `buffer` and `buffer_size` parts here since I didn’t really understand their full intention 😉).
If helpful I could come up with a PR.
Again, awesome library, best regards, Holger
Top GitHub Comments
Hi there @shashigharti @roll, PR #1255 is there (finally) - sorry for the long silence.
That’s the minimal code change discussed in the previous ticket comments, plus an accompanying test.
As a little extra, and as compensation for your patience, I’ve benchmarked the principal `executemany` approach against the one previously used in frictionless-framework’s `storage.py`. You can find it all here: https://github.com/hjoukl/bench-sqlalchemy-executemany 😃

The gist of it is that `executemany` is superior to the previous approach performance-wise, sometimes vastly (especially with SQLite). I’ve run the sample benchmark for the DB engines tested in your CI (SQLite, PostgreSQL, MySQL).
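For a rough idea of how such a comparison can be set up, here’s a simplified standalone sketch (not the code from the linked repo; sizes are kept below the parameter limits so both styles run, and a real benchmark would repeat and average the runs):

```python
import time
from sqlalchemy import Column, MetaData, String, Table, create_engine

engine = create_engine('sqlite://')  # in-memory SQLite
metadata = MetaData()
table = Table('t', metadata, *(Column(f'c{i}', String) for i in range(20)))
metadata.create_all(engine)

rows = [{f'c{i}': f'row{r}_col{i}' for i in range(20)} for r in range(40)]

with engine.begin() as connection:
    start = time.perf_counter()
    connection.execute(table.insert(), rows)         # executemany style
    executemany_secs = time.perf_counter() - start

    start = time.perf_counter()
    connection.execute(table.insert().values(rows))  # multi-row VALUES style
    values_secs = time.perf_counter() - start

print(f'executemany: {executemany_secs:.6f}s, multi-row VALUES: {values_secs:.6f}s')
```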
Btw, I’ve stumbled over some rough edges in the v5 docs (e.g. invalid examples using no-longer-existing `plugins` imports) and some missing bits that might be useful for the development/contribution docs (like “copy .env.example to .env before running `make test`”). Would you like some suggestions wrt this, in the form of tickets or otherwise?

Greetings, Holger
Hi @roll and @shashigharti,
great to see you considering this issue! Just a note wrt the classification change from bug to enhancement:
Since stock RHEL8 seems to ship with old sqlite3 defaults (sqlite version + compile options), the usability of frictionless-py is severely limited on this platform, which I think is in very widespread use in “enterprisey” Linux environments (probably predominant). The same goes for the older RHEL7. I.e. you’d basically need to be able to install a newer sqlite version, or compile sqlite yourself and use that instead of the system sqlite, which might mean you’d also need to recompile Python (at least its sqlite3 extension)… This is something most users won’t be able/allowed to do in a corporate environment, in my experience.
You run into those limits with a table of just 20 fields and 50 rows (20 × 50 = 1,000 bound params > 999), i.e. not only with tables that have an unrealistically high number of columns.
So I basically think a change like the one proposed above is both a bugfix (the 1000-row chunking is not enough to avoid exceptions in general) and an enhancement (it looks like it’s dramatically faster for SQLite, though of course other DB backends might behave completely differently - this needs proper benchmarking).
Hopefully I’m not getting on anybody’s nerves here - of course it’s entirely your call how you handle this. I just wanted to bring to your attention that frictionless-py might not be properly usable with an SQLite DB backend on the mentioned platforms at all, which would be a shame IMHO since it’s so cool. 😃

Best regards, Holger