dashboard locked if `api/v1/viz` DELETE requests are sent on table involved in batch sql jobs


Context

I get 504 gateway timeouts (and a frozen dashboard) when a delete table request is applied on a table involved in a batch sql operation.

Steps to Reproduce

  1. Have a table with ~1M rows (download one here)
  2. Run a batch operation, then try to delete the table. Here are the operations in Python:
from carto.auth import APIKeyAuthClient
from carto.sql import BatchSQLClient

auth = APIKeyAuthClient('https://eschbacher.carto.com/',
                        'my api key')
# table with 950000 rows
table = 'batch_sql_viz_api_lock'

# update geometry from columns `lat` and `lng`
BatchSQLClient(auth).create([
    "UPDATE {table} SET the_geom = cdb_latlng(lat, lng)".format(table=table)
])
auth.send('api/v1/viz/{table}'.format(table=table),
          http_method='DELETE')

Current Result

Dashboard is frozen until the Batch SQL job completes. Map and dataset pages also cannot be loaded.

Expected result

Batch jobs and requests to delete tables should not freeze the user's account.

Browser and version

Chrome 61.0.3163.91 (Official Build) (64-bit) macOS 10.12.5

.carto file

None, but you can get a dataset to test here: https://eschbacher.carto.com/api/v2/sql?q=select+*+from+batch_sql_viz_api_lock_copy&format=csv&filename=batch_sql_viz_api_lock

Additional info

Discovered while developing cartoframes

cc @juanignaciosl

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 22 (19 by maintainers)

Top GitHub Comments

1 reaction
javitonino commented, Jun 27, 2018

@rafatower We would not only need to apply lock_timeout to the delete, but also to other queries that might get locked, like other ALTERs, cartodbfication and some more. Still, it should be a reduced number, so it could be ok. Two comments:

  • We cannot do a straightforward SET lock_timeout because of pgbouncer (it might get applied to other sessions). However, TIL about SET LOCAL, which sets it only for the current transaction and could be a good solution.
  • Applying lock_timeout at DB level could make sense in our case. There is a subtle trick you’re missing here. We use the postgres user to do DROP TABLE, which is overridden to have statement_timeout = 0 (and so it never gives up). However, it does not have an override for lock_timeout, so if we set it at a per-DB level, it would use that.

That’s more or less why I proposed setting it at database level, since it would help in all cases, including Rails, but also things like the Batch SQL API and analyses (which also skip timeouts). We have had problems with those components in the past with competing analyses or user deletion. In other words, this is not exclusive to Rails: you can trigger a similar situation just by not being careful with the Batch API. Most cases do come from Rails, though, since it’s the main user of direct postgres user connections.

I agree that setting a timeout in the most problematic Rails queries is a solution to this particular case, but I still think setting a global lock_timeout could be beneficial for other cases we have not yet pinpointed as clearly as this one.
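A per-database default of the kind proposed here could be set with PostgreSQL's ALTER DATABASE ... SET, which applies to new sessions on that database unless a role-level override exists. The sketch below only builds the statement; the helper name, the database name, and the 5-second value are illustrative assumptions, not the settings that were actually deployed.

```python
import re

# Conservative pattern for unquoted identifiers. This is an assumption made to
# keep the sketch safe; real database names may need proper quoting instead.
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")
# Accept only simple values such as "500ms", "5s" or "1min".
_TIMEOUT = re.compile(r"\d+(ms|s|min)")


def per_db_lock_timeout_sql(database, timeout="5s"):
    """Build an ALTER DATABASE statement that sets a default lock_timeout.

    `database` and `timeout` are hypothetical inputs for illustration.
    """
    if not _IDENTIFIER.fullmatch(database):
        raise ValueError("unexpected database name: {!r}".format(database))
    if not _TIMEOUT.fullmatch(timeout):
        raise ValueError("unexpected timeout value: {!r}".format(timeout))
    return "ALTER DATABASE {db} SET lock_timeout = '{t}'".format(
        db=database, t=timeout
    )


if __name__ == "__main__":
    # "user_db" is a made-up database name.
    print(per_db_lock_timeout_sql("user_db"))
```

Because the postgres user only overrides statement_timeout, a database-level lock_timeout like this would still apply to its DROP TABLE queries, which is the point made above.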

Of course, the dashboard is still going to break if something is locked at that point (a long transaction holding an exclusive lock), so we are only talking about mitigation: trying to avoid such long locks by limiting waiting queries.

In summary:

  • I still think a per-DB lock_timeout could be a good idea as a safety measure.
  • Even if we do the previous, we probably want to wrap some Rails code in a transaction with a SET LOCAL lock_timeout = 5 or something like that.
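The transaction-scoped variant from the summary can be sketched as follows. This is a minimal illustration, not the Rails code that was eventually shipped: the function names are hypothetical, the timeout is given in milliseconds (the unit PostgreSQL assumes for a bare integer), and `cursor` is assumed to be any object with an `execute(sql)` method on a connection in autocommit mode.

```python
import re

# Conservative check for unquoted identifiers (an assumption of this sketch).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def drop_table_statements(table, lock_timeout_ms=5000):
    """Return the statements for a DROP TABLE that gives up on locks quickly."""
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError("unexpected table name: {!r}".format(table))
    return [
        "BEGIN",
        # SET LOCAL lasts only until the end of this transaction, so the
        # setting cannot leak to other sessions sharing a pooled connection.
        "SET LOCAL lock_timeout = {:d}".format(int(lock_timeout_ms)),
        "DROP TABLE {}".format(table),
        "COMMIT",
    ]


def run_transaction(cursor, statements):
    """Execute the statements in order; roll back if any of them fails."""
    try:
        for sql in statements:
            cursor.execute(sql)
    except Exception:
        cursor.execute("ROLLBACK")
        raise
```

If the table is locked by a long-running query, the DROP TABLE would fail once the timeout expires instead of queueing behind the lock and blocking every later query on that table.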

Bonus: we may want to consider setting the lock_timeout in the db size function in the extension. That would also have helped here (the table would be locked, but the dashboard would still work).

0 reactions
rafatower commented, Jul 4, 2018

And fixed in production:

https://gist.github.com/rafatower/e80b159d0fd66ccd6e7d573470c18604

$ pipenv run delete_lock_test https://rtorre.carto.com $API_KEY
2018-07-04 13:04:06,902 - INFO - Dropping table if it exists... 
2018-07-04 13:04:08,019 - INFO - dropped
2018-07-04 13:04:08,020 - INFO - Creating table...
2018-07-04 13:04:08,655 - INFO - table created
2018-07-04 13:04:08,655 - INFO - Populating table...
2018-07-04 13:04:34,970 - INFO - table populated
2018-07-04 13:04:34,976 - INFO - Cartodbfy'ing the table
2018-07-04 13:05:02,010 - INFO - Table cartodbfy'ed
2018-07-04 13:05:02,010 - INFO - Sending UPDATE to the Batch API...
2018-07-04 13:05:02,160 - INFO - Waiting for it to start...
2018-07-04 13:05:02,160 - INFO - UPDATE running
2018-07-04 13:05:02,160 - INFO - Making sure table is in the dashboard...
2018-07-04 13:05:07,943 - INFO - Table is now in the dashboard
2018-07-04 13:05:07,944 - INFO - Sending DELETE...
2018-07-04 13:05:12,430 - INFO - Could not delete canonical viz, status_code = 400
2018-07-04 13:05:12,431 - INFO - Cancelling UPDATE...
2018-07-04 13:05:12,589 - INFO - UPDATE canceled.

Obviously the "Could not delete canonical viz" message is caused by the lock timeout, and it is logged as such in Rollbar: https://rollbar.com/carto/CartoDB/items/36087/
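Since the DELETE now fails fast instead of hanging, a client could choose to retry it once the batch job has had time to finish. The helper below is a hypothetical sketch of that idea; `send_delete` is assumed to be a callable that performs the request and returns the HTTP status code, and the attempt count and wait time are arbitrary.

```python
import time


def delete_with_retry(send_delete, attempts=3, wait_seconds=10.0, sleep=time.sleep):
    """Call send_delete() until it returns a 2xx status or attempts run out.

    Returns the last status code seen. `sleep` is injectable for testing.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    status = None
    for attempt in range(attempts):
        status = send_delete()
        if 200 <= status < 300:
            break
        # Wait before the next try, but not after the final attempt.
        if attempt < attempts - 1:
            sleep(wait_seconds)
    return status
```

With the client from the issue, `send_delete` might wrap `auth.send('api/v1/viz/{table}'.format(table=table), http_method='DELETE')` and return its status code, assuming the call returns a requests-style response. A 400 can have other causes than a lock timeout, so blind retries are only a stopgap.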
