AntiStampede Cache leaves orphaned threading.Event object on 304 Not Modified response

The AntiStampedeCache leaves an orphaned threading.Event object behind on a 304 Not Modified response, which results in a 30-second timeout on the subsequent request.

We are not able to reproduce this problem reliably. We only know that it happens, and we have logs that let us scrutinize the TOOLS.CACHING code path followed when the bug is hit.

We’ve narrowed down the bug to

  • static files
  • an ‘If-Modified-Since’ request provoking a 304 Not Modified response, leaving an orphaned threading.Event, causing
  • a 30-second timeout on the subsequent (normal) request

The logs show that the request provoking the 304 follows the AntiStampedeCache.wait path. This sets the threading.Event and returns None to the variant object in MemoryCache.get(), which in turn returns None to cache_data in get(). As a result, get() follows its ‘request is not cached’ path and returns False, but MemoryCache.put() is never called, leaving the threading.Event object orphaned.

The subsequent request for the same static file also follows the AntiStampedeCache.wait path. It encounters the orphaned threading.Event object and waits fruitlessly for 30 seconds, after which it resolves the problem by populating the cache object with the static file it (eventually) serves.

After this, normal operation is resumed.
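
The sequence above can be sketched with a toy model. This is a simplified stand-in written for illustration, not CherryPy's actual code: the class name ToyAntiStampedeCache, the key and the shortened timeout are all assumptions.

```python
import threading
import time


class ToyAntiStampedeCache(dict):
    """Simplified stand-in for an anti-stampede cache (not CherryPy's code)."""

    def wait(self, key, timeout):
        value = self.get(key)
        if isinstance(value, threading.Event):
            # Another request claimed this key; wait for it to call put().
            if value.wait(timeout):
                return self.get(key)
            # Timed out: the claimant never delivered, so take over the claim.
            self[key] = threading.Event()
            return None
        if value is None:
            # Nothing cached: claim the key so concurrent requests wait on us.
            self[key] = threading.Event()
        return value

    def put(self, key, value):
        placeholder = self.get(key)
        self[key] = value
        if isinstance(placeholder, threading.Event):
            placeholder.set()  # wake up anyone waiting on the claim


cache = ToyAntiStampedeCache()
TIMEOUT = 0.2  # stands in for the 30 seconds seen in the logs

# Request 1 (the 304): claims the key but never calls put().
first = cache.wait("bootstrap.min.css", TIMEOUT)

# Request 2 (normal GET): finds the orphaned Event and waits out the timeout.
start = time.monotonic()
second = cache.wait("bootstrap.min.css", TIMEOUT)
waited = time.monotonic() - start

# Request 2 then stores the response, and later requests are served at once.
cache.put("bootstrap.min.css", b"body")
third = cache.wait("bootstrap.min.css", TIMEOUT)
```

In this model the second request pays the full timeout only because the first one claimed the key without ever calling put(), which matches the pattern in the logs.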

This happens roughly once a day, probably after the cache has expired and one of the clients was the last to request an ‘If-Modified-Since’ static file response. So far, however, we have not been able to come up with a clean, reproducible test case.

CherryPy is part of the Pyff daemon we deployed. https://github.com/leifj/pyFF/issues/116

  • CherryPy version: 11.1.0
  • Python version: 2.7 due to Pyff dependency
  • OS: Debian GNU/Linux 9
  • Browser: Chrome 63

The log output showing the problem, produced by extra debugging lines we inserted, looks as follows. I understand that interpreting these logs without knowing where those statements are located is awkward. Nevertheless, it clearly shows the timeout after the offending 304’d static file request.

Jan 19 08:43:26 proxy2 pyffd[4100]: [19/Jan/2018:08:43:26] TOOLS.CACHING get https:/***/static/bootstrap/css/bootstrap.min.css
Jan 19 08:43:26 proxy2 pyffd[4100]: [19/Jan/2018:08:43:26] TOOLS.CACHING Wait result for key: (), type(value) <type 'NoneType'>
Jan 19 08:43:26 proxy2 pyffd[4100]: [19/Jan/2018:08:43:26] TOOLS.CACHING Cache was None, set Event <threading._Event object at 0x7f7b6c623a50>
Jan 19 08:43:26 proxy2 pyffd[4100]: [19/Jan/2018:08:43:26] TOOLS.CACHING variant found
Jan 19 08:43:26 proxy2 pyffd[4100]: [19/Jan/2018:08:43:26] TOOLS.CACHING request is not cached
[...]
Jan 19 08:43:26 proxy2 pyffd[4100]: 145.97.144.156 - - [19/Jan/2018:08:43:26] "GET /static/bootstrap/css/bootstrap.min.css HTTP/1.1" 304 - "https://***/role/idp.ds?return=https%3A%2F%2F***%2FSaml2SP%2Fdisco&entityID=https%3A%2F%2F***%2FSaml2SP%2Fproxy_saml2_backend.xml" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"


Jan 19 08:47:52 proxy2 pyffd[4100]: [19/Jan/2018:08:47:52] TOOLS.CACHING get https://***/static/bootstrap/css/bootstrap.min.css
Jan 19 08:47:52 proxy2 pyffd[4100]: [19/Jan/2018:08:47:52] TOOLS.CACHING Wait result for key: (), type(value) <class 'threading._Event'>
Jan 19 08:47:52 proxy2 pyffd[4100]: [19/Jan/2018:08:47:52] TOOLS.CACHING Event <threading._Event object at 0x7f7b6c623a50>
Jan 19 08:47:52 proxy2 pyffd[4100]: [19/Jan/2018:08:47:52] TOOLS.CACHING Waiting up to 30 seconds
[...]
Jan 19 08:48:22 proxy2 pyffd[4100]: [19/Jan/2018:08:48:22] TOOLS.CACHING Timed out 30 seconds
Jan 19 08:48:22 proxy2 pyffd[4100]: [19/Jan/2018:08:48:22] TOOLS.CACHING variant found
Jan 19 08:48:22 proxy2 pyffd[4100]: [19/Jan/2018:08:48:22] TOOLS.CACHING request is not cached
Jan 19 08:48:22 proxy2 pyffd[4100]: [19/Jan/2018:08:48:22] TOOLS.CACHING get() status: None
Jan 19 08:48:22 proxy2 pyffd[4100]: [19/Jan/2018:08:48:22] TOOLS.CACHING Storing status: 200 OK, uri:https://***/static/bootstrap/css/bootstrap.min.css
Jan 19 08:48:22 proxy2 pyffd[4100]: 2001:610:514:172:31da:c3b9:12ad:1b76 - - [19/Jan/2018:08:48:22] "GET /static/bootstrap/css/bootstrap.min.css HTTP/1.1" 200 109518 "https://***/role/idp.ds?return=https%3A%2F%2F***%2FSaml2SP%2Fdisco&entityID=https%3A%2F%2F***%2FSaml2SP%2Fproxy_saml2_backend.xml" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

What we noticed is that the caching.py get() cherrypy.HTTPRedirect 304 exception code path is not touched on the 304 response. This must mean that the 304 is generated in static.py in serve_file() by cptools.validate_since().

Up to now, we have been unable to explain why this specific 304 response would leave the cache object unpopulated even though a threading.Event had been set.

What we did see was that MemoryCache.expire_cache() clears (del) the AntiStampedeCache variant object in the store dictionary, but keeps the store uri key. This means that after cache expiration a store[uri] key exists but has no cache object attached. This might explain why the 304 follows the AntiStampedeCache.wait path without replacing the threading.Event object, but we are not sure, and so far we have been unable to force faster cache expiration to produce a test case.
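
To make the expiry observation concrete, here is a minimal sketch using plain dictionaries. It only models the store[uri][variant] layout described above; the helper expire_variant is hypothetical and is not the body of MemoryCache.expire_cache().

```python
def expire_variant(store, uri, variant):
    """Drop one cached variant, leaving the per-URI container in place."""
    del store[uri][variant]


uri = "/static/bootstrap/css/bootstrap.min.css"
variant = ()  # the logs above show an empty tuple as the variant key

store = {uri: {variant: ("200 OK", b"...css...")}}
expire_variant(store, uri, variant)

uri_still_known = uri in store     # True: the URI key survives expiry
cached = store[uri].get(variant)   # None: nothing left to serve
```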

For now, we have decided to let the bug go and work around it by letting nginx serve Pyff’s static files, although this is of course a less than ideal solution.

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 20 (19 by maintainers)

Top GitHub Comments

2 reactions
mrvanes commented, Jan 26, 2018

Ok, so I now know what happens, how to reproduce it and have a proposal for a fix.

How to reproduce:

  • Configure the service with a static handler and cache support
  • Restart the CherryPy-based service (pyff in my case); this clears the cache
  • Do a normal GET request for a static resource; this fills the cache for store[uri][variant]
  • Note the Last-Modified date of the response
  • Wait for the cache to expire; this clears store[uri][variant] (but not store[uri]!)
  • Do an If-Modified-Since GET on the static resource, using the Last-Modified date noted above
  • Do a normal GET request for the static resource and experience the AntiStampedeCache time-out (set to 30 seconds in pyff) before the request is fulfilled.
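
The request sequence above could be scripted roughly as follows. This is a sketch in Python 3 using only the standard library, whereas the affected service runs on Python 2.7. The function names are made up for illustration, and the URL and the cache expiry delay have to be supplied for your own deployment.

```python
import time
import urllib.error
import urllib.request


def build_request(url, if_modified_since=None):
    """Return a GET request, optionally conditional on If-Modified-Since."""
    request = urllib.request.Request(url)
    if if_modified_since is not None:
        request.add_header("If-Modified-Since", if_modified_since)
    return request


def timed_fetch(request, timeout=60):
    """Send the request; return (status, Last-Modified header, seconds taken)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            response.read()
            status, headers = response.status, response.headers
    except urllib.error.HTTPError as error:
        # urllib reports 304 Not Modified as an HTTPError.
        status, headers = error.code, error.headers
    return status, headers.get("Last-Modified"), time.monotonic() - start


def reproduce(url, cache_delay):
    """Run the three requests from the steps above against a fresh service."""
    # 1. Normal GET: fills the cache and yields the Last-Modified date.
    _, last_modified, _ = timed_fetch(build_request(url))
    # 2. Wait until the cache entry has expired.
    time.sleep(cache_delay + 1)
    # 3. Conditional GET: expected to return 304.
    status, _, _ = timed_fetch(build_request(url, last_modified))
    # 4. Normal GET: expected to stall for the anti-stampede timeout.
    _, _, seconds = timed_fetch(build_request(url))
    return status, seconds
```

Calling reproduce() with the URL of a static resource and the configured cache delay in seconds should return 304 and a duration close to the anti-stampede timeout if the bug is present.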

What happens:

  • Because the expiry only cleared the variant of the cache, AntiStampedeCache is now activated and inserts a threading._Event for the request
  • static.serve_fileobj encounters the 304 exception raised by validate_since and propagates it
  • before_finalize is run and calls caching.tee_output, which replaces response.body with the tee(response.body) generator object (which is ultimately responsible for put’ing the response in the cache)
  • Response.finalize() encounters an elif code < 200 or code in (204, 205, 304): branch and consequently pops the Content-Length header and sets self.body to ntob(''), replacing the body generator object
  • Whoever is responsible for sending the response body will now iterate over the ntob('') object, and the tee(response.body) generator is never touched, leaving the AntiStampedeCache threading._Event for this uri/variant orphaned
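
The core of this can be shown with an ordinary generator: code placed after the loop in a generator only runs once the generator is exhausted. The tee() below is a simplified stand-in for caching.tee_output's inner generator, with a plain dict in place of the real cache, so treat it as a sketch only.

```python
def tee(body, cache, key):
    """Yield chunks unchanged and store the joined body once exhausted."""
    output = []
    for chunk in body:
        output.append(chunk)
        yield chunk
    # Only reached if someone iterates the generator to the end.
    cache[key] = b"".join(output)


cache = {}

# Normal response: the server iterates the tee'd body, so the cache is filled.
body = tee([b"hello ", b"world"], cache, "200")
sent = b"".join(body)

# 304 response: the tee'd body is swapped for an empty one before anyone
# iterates it, so the line that fills the cache never runs.
body = tee([], cache, "304")
body = b""
sent_304 = body
```

After this runs, cache holds an entry for "200" but none for "304", which is the orphaning described in the last bullet.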

My solution:

  • Even though 1xx, 204, 205 and 304 responses are forbidden to contain a body, it is imperative to iterate over self.body before assigning ntob('') to self.body so that the CachingTool can finish its job. So I inserted the following lines above the self.body = ntob('') line in Response.finalize():
for i in self.body:
    pass

This forces the tee(response.body) generator to execute and finish its caching job. But now the cache contains the cached 304 object, which is nonsense for a normal GET of the same resource. So I had to insert the following code in caching.tee_output.tee():

# save the cache data
body = ntob('').join(output)
if not body:
    cherrypy._cache.delete()
else:
    cherrypy._cache.put((response.status, response.headers or {},
                         body, response.time), len(body))

This prevents tee() from caching empty responses and solves the problem in the short term.
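
Both halves of this workaround can be tried out together on a toy model. The sketch below uses a plain dict as the cache and a hypothetical finalize_bodyless() helper in place of the relevant branch of Response.finalize(). It is not a patch against CherryPy.

```python
import threading


def tee(body, cache, key):
    """Yield chunks unchanged; on exhaustion store the body or drop the claim."""
    output = []
    for chunk in body:
        output.append(chunk)
        yield chunk
    joined = b"".join(output)
    if joined:
        cache[key] = joined
    else:
        # Empty body: release the placeholder and cache nothing.
        cache.pop(key, None)


def finalize_bodyless(body):
    """Drain the body so wrappers such as tee() can finish, then discard it."""
    for _ in body:
        pass
    return b""


key = "/static/app.css"
cache = {key: threading.Event()}  # placeholder left by the anti-stampede wait

body = finalize_bodyless(tee([], cache, key))
```

Draining costs little for responses that have no body anyway, and removing the placeholder means the next request finds no Event to wait on.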

However, I’m not convinced this is the best solution. It undermines the caching of any empty-body response, which may be expensive to produce even though it contains no body, and so it interferes with the intention of the cache.

1 reaction
baszoetekouw commented, Jan 20, 2018

Well, we haven’t deployed the workaround yet, so I can start a tcpdump and see if the problem occurs in the next couple of days.
