dashboard locked if `api/v1/viz` DELETE requests are sent on table involved in batch sql jobs

Context

I get 504 gateway timeouts (and a frozen dashboard) when a delete table request is applied on a table involved in a batch sql operation.

Steps to Reproduce

1. Have a table with ~1M rows (download one here)
2. Run a batch operation, then try to delete the table. Here are the operations in Python:

from carto.auth import APIKeyAuthClient
from carto.sql import BatchSQLClient

auth = APIKeyAuthClient('https://eschbacher.carto.com/',
                        'my api key')
# table with 950000 rows
table = 'batch_sql_viz_api_lock'

# update geometry from columns `lat` and `lng`
BatchSQLClient(auth).create([
    "UPDATE {table} SET the_geom = cdb_latlng(lat, lng)".format(table=table)
])
# while the batch job is still running, send the delete-table request
auth.send('api/v1/viz/{table}'.format(table=table),
          http_method='DELETE')

Current Result

Dashboard is frozen until the Batch SQL job completes. Map and dataset pages also cannot be loaded.

Expected result

Batch jobs and requests to delete tables should not freeze the user account.

Browser and version

Chrome 61.0.3163.91 (Official Build) (64-bit) macOS 10.12.5

.carto file

None, but you can get a dataset to test here: https://eschbacher.carto.com/api/v2/sql?q=select+*+from+batch_sql_viz_api_lock_copy&format=csv&filename=batch_sql_viz_api_lock

Additional info

Discovered while developing cartoframes

cc @juanignaciosl

Issue Analytics

Created 6 years ago
Comments: 22 (19 by maintainers)

Top GitHub Comments

javitonino commented, Jun 27, 2018

@rafatower We would not only need to apply lock_timeout to the delete, but also to other queries that might get locked, like other ALTERs, cartodbfication and some more. Still, it should be a reduced number, so it could be ok. Two comments:

1. We cannot do a straightforward SET lock_timeout because of pgbouncer (it might get applied to other sessions). However, TIL about SET LOCAL, which sets the value only for the current transaction and could be a good solution.
2. Applying lock_timeout at DB level could make sense in our case. There is a subtle trick you’re missing here. We use the postgres user to do DROP TABLE, which is overridden to have statement_timeout = 0 (and so it never gives up). However, it does not have an override for lock_timeout, so if we set it at a per-DB level, it would use that.
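To make the two scopes concrete, here is a small Python sketch that only builds the SQL text and does not connect to anything. The helper names and the 5000 ms value are hypothetical and not taken from the CARTO codebase. `SET LOCAL` and `ALTER DATABASE ... SET` are standard PostgreSQL statements, and a bare integer for lock_timeout is read as milliseconds.

```python
def transaction_with_lock_timeout(statement, timeout_ms):
    """Statements for one transaction whose lock waits are capped.

    SET LOCAL only lasts until COMMIT or ROLLBACK, so the setting does not
    stay behind on the session once the transaction ends.
    """
    return [
        "BEGIN",
        "SET LOCAL lock_timeout = {:d}".format(int(timeout_ms)),
        statement,
        "COMMIT",
    ]


def database_lock_timeout(database, timeout_ms):
    """Statement that sets a default lock_timeout for new sessions on a database.

    A role without its own lock_timeout override picks this value up.
    """
    # quote the identifier, doubling any embedded double quotes
    ident = '"' + database.replace('"', '""') + '"'
    return "ALTER DATABASE {} SET lock_timeout = {:d}".format(ident, int(timeout_ms))


if __name__ == "__main__":
    # hypothetical table and database names, for illustration only
    for sql in transaction_with_lock_timeout("DROP TABLE some_table", 5000):
        print(sql + ";")
    print(database_lock_timeout("some_user_db", 5000) + ";")
```

The database-level form only affects sessions opened after it is set, while the transaction-level form has to be repeated in every transaction that needs it.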

That’s more or less why I proposed setting it at database level, since it would help for all cases, including Rails, but also things like Batch SQL API and analyses (which also skip timeouts). We have had problems with those components in the past, with competing analyses or user deletion. In other words, this is not exclusive to Rails: you can trigger a similar situation just by not being careful when using the Batch API. Still, most cases come from Rails, since it’s the main user of direct postgres user connections.

I agree that setting a timeout in the most problematic Rails queries is a solution to this particular case, but I still think setting a global lock_timeout could be beneficial for other cases we have not yet pinpointed as clearly as this one.

Of course, the dashboard is still going to break if there is something locked at that point (long transaction with an exclusive lock), so we are only talking about mitigation by trying to avoid such long locks by limiting waiting queries.

In summary:

1. I still think a per-DB lock_timeout could be a good idea as a safety measure.
2. Even if we do that, we probably want to wrap some Rails code in a transaction with a SET LOCAL lock_timeout = 5 or something like that.
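The wrapping idea might look something like the sketch below. It is written in Python rather than Rails, purely for illustration, and assumes a DB-API connection (psycopg2, for example) that is not in autocommit mode, so the first statement opens a transaction implicitly. The `lock_timeout` helper name is hypothetical. Note that a bare `5` would mean 5 milliseconds in PostgreSQL, so the sketch uses 5000 for five seconds.

```python
from contextlib import contextmanager


@contextmanager
def lock_timeout(conn, timeout_ms=5000):
    """Yield a cursor whose statements stop waiting for locks after timeout_ms.

    The setting is applied with SET LOCAL, so it ends with the transaction,
    whether that transaction commits or rolls back.
    """
    cur = conn.cursor()
    try:
        cur.execute("SET LOCAL lock_timeout = {:d}".format(int(timeout_ms)))
        yield cur
        conn.commit()
    except Exception:
        # a lock timeout surfaces as a driver error; undo and re-raise it
        conn.rollback()
        raise
    finally:
        cur.close()
```

A caller would then write `with lock_timeout(conn) as cur: cur.execute("DROP TABLE some_table")` and handle the driver's error if the lock cannot be acquired in time, instead of waiting indefinitely.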

Bonus: we may want to consider setting the lock_timeout in the db size function in the extension. That would also have helped here (the table would be locked, but the dashboard would still work).

rafatower commented, Jul 4, 2018

And fixed in production:

https://gist.github.com/rafatower/e80b159d0fd66ccd6e7d573470c18604

$ pipenv run delete_lock_test https://rtorre.carto.com $API_KEY
2018-07-04 13:04:06,902 - INFO - Dropping table if it exists... 
2018-07-04 13:04:08,019 - INFO - dropped
2018-07-04 13:04:08,020 - INFO - Creating table...
2018-07-04 13:04:08,655 - INFO - table created
2018-07-04 13:04:08,655 - INFO - Populating table...
2018-07-04 13:04:34,970 - INFO - table populated
2018-07-04 13:04:34,976 - INFO - Cartodbfy'ing the table
2018-07-04 13:05:02,010 - INFO - Table cartodbfy'ed
2018-07-04 13:05:02,010 - INFO - Sending UPDATE to the Batch API...
2018-07-04 13:05:02,160 - INFO - Waiting for it to start...
2018-07-04 13:05:02,160 - INFO - UPDATE running
2018-07-04 13:05:02,160 - INFO - Making sure table is in the dashboard...
2018-07-04 13:05:07,943 - INFO - Table is now in the dashboard
2018-07-04 13:05:07,944 - INFO - Sending DELETE...
2018-07-04 13:05:12,430 - INFO - Could not delete canonical viz, status_code = 400
2018-07-04 13:05:12,431 - INFO - Cancelling UPDATE...
2018-07-04 13:05:12,589 - INFO - UPDATE canceled.

Obviously, the "Could not delete canonical viz" message is caused by the lock timeout, and it is logged as such in Rollbar: https://rollbar.com/carto/CartoDB/items/36087/
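The "Waiting for it to start..." step in the log above could be approximated with a small polling helper like the one below. This is not the code from the linked gist, only a sketch under assumptions: `read_job` is a hypothetical callable that returns a dict with a `status` key for a job id (something like the `read` method of a Batch SQL client would fit), and the function name, interval and poll count are illustrative.

```python
import time


def wait_for_status(read_job, job_id, wanted=("running",),
                    interval=1.0, max_polls=30, sleep=time.sleep):
    """Poll a batch job until its status is one of `wanted`.

    Raises TimeoutError if the status is not reached within max_polls checks.
    """
    for _ in range(max_polls):
        status = read_job(job_id).get("status")
        if status in wanted:
            return status
        sleep(interval)
    raise TimeoutError(
        "job {} did not reach {} after {} polls".format(job_id, wanted, max_polls)
    )
```

Once the UPDATE is reported as running, a test can send the DELETE and check that it returns an error promptly instead of hanging until the batch job finishes.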