Requests randomly fail to receive data
Example
I’m having a very hard time reproducing this issue consistently, but I’ve exhausted all other avenues I could think of. I’ll do my best to describe the setup here, but unfortunately I couldn’t come up with a code sample that reproduces it reliably.
The gist is this:
import functools

from fastapi import FastAPI, Depends, HTTPException
from pydantic import BaseModel
from sqlalchemy import create_engine
from sqlalchemy.engine import Engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import Session, sessionmaker
import ddtrace

import crud  # local application module providing create_mapping (not shown in the issue)

app = FastAPI()


@functools.lru_cache(maxsize=1)
def get_database_engine() -> Engine:
    """Retrieve the database engine. There should only be one of these per application."""
    engine = create_engine("postgres://")
    return engine


@functools.lru_cache(maxsize=1)
def get_database_session(engine: Engine = Depends(get_database_engine)) -> sessionmaker:
    """Retrieve the database session maker. There should only be one of these per application."""
    return sessionmaker(autocommit=True, autoflush=False, bind=engine)


def get_db(session_local: sessionmaker = Depends(get_database_session)):
    """Get a database connection."""
    with ddtrace.tracer.trace("db_connection_acquire"):
        db = session_local()
    try:
        yield db
    finally:
        db.close()


class MappingRequest(BaseModel):
    """Request class to create new mappings."""

    original_id: str
    new_id: str


@app.post("/mapping")
def create_mapping(
    upload: MappingRequest, db: Session = Depends(get_db)
):
    """Create a new dicom deid mapping for a series."""
    with ddtrace.tracer.trace("mapping_start"):
        try:
            with db.begin():
                with ddtrace.tracer.trace("mapping_transaction"):
                    mapping = crud.create_mapping(
                        db, upload.original_id, upload.new_id
                    )
                return {"mapping_id": mapping.id}
        except IntegrityError:
            raise HTTPException(status_code=409, detail="Mapping exists")
We have a simple sync endpoint that inserts keys into the database. The DB operation is very quick. We’ve also added Datadog span traces to help debug the behavior.
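The `crud.create_mapping` helper referenced in the endpoint isn’t included above; as a rough sketch of the kind of operation involved (my own reconstruction, assuming a `Mapping` model with a unique `original_id` column, not the actual application code), it would look something like:

```python
# Hypothetical sketch of the crud helper used by the endpoint above -- not the
# actual application code. Assumes a Mapping model with a unique original_id.
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import Session

Base = declarative_base()


class Mapping(Base):
    __tablename__ = "mappings"

    id = Column(Integer, primary_key=True)
    original_id = Column(String, unique=True, nullable=False)
    new_id = Column(String, nullable=False)


def create_mapping(db: Session, original_id: str, new_id: str) -> Mapping:
    """Insert a mapping row; a duplicate original_id raises IntegrityError."""
    mapping = Mapping(original_id=original_id, new_id=new_id)
    db.add(mapping)
    db.flush()  # assigns mapping.id; the caller's `with db.begin()` handles commit
    return mapping
```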
Description
Under moderate load, about 10 requests per second, the endpoint mostly responds very quickly, in the 10 ms range. However, there are occasional extreme outliers in response time, where a request hangs for 3 minutes before the connection is killed by our ALB.
Traces during the long requests show that the route code is never hit, nor are the dependencies. This seems to indicate that something within FastAPI is failing to properly schedule or submit the sync jobs to the thread pool. [URL details obfuscated to remove company info]
Compared with normal requests: [trace screenshots not included]
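For context on the thread-pool point above: FastAPI/Starlette run plain `def` endpoints and dependencies in a worker thread while the event loop awaits the result. A simplified illustration of the mechanism (not the actual framework code) is:

```python
# Simplified illustration of how sync ("def") endpoints are dispatched; this is
# not the actual FastAPI/Starlette source, just the shape of the mechanism.
from fastapi.concurrency import run_in_threadpool


async def call_sync_endpoint(endpoint, **params):
    # The plain function is handed to a worker thread and the event loop awaits
    # the result. If the request never reaches this hand-off (e.g. the body is
    # still being awaited), the route code and dependencies are never entered.
    return await run_in_threadpool(endpoint, **params)
```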
This seems to be a similar issue to the one described here: https://stackoverflow.com/questions/61466243/uvicorn-not-processing-some-requests-randamly
And similar behavior here: https://github.com/tiangolo/fastapi/issues/1195, but we are not on Windows.
Environment
- OS: Docker, using the tiangolo/uvicorn-gunicorn-fastapi-docker python3.8 image; the host is Linux.
- FastAPI version: 0.63.0
- Python version: 3.8.2
- Gunicorn version: 20.1.0
- Uvicorn version: 0.13.4
Note that gunicorn and uvicorn were upgraded manually in the Dockerfile in an attempt to resolve the issue; however, the default versions shipped with the image exhibited the same behavior.
Additional context
We’ve tried a number of additional ways to reproduce this, but have been unsuccessful. When testing locally, even when introducing 10x the load, we can’t reproduce the issue.
We should also note that this is running on AWS ECS, behind an ALB. We have toyed with the timeout settings in gunicorn and uvicorn to try to address this as well, but none of those changes seem to solve it.
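For reference, the kinds of knobs we were adjusting look roughly like this in a gunicorn config file (illustrative values only, not our exact production settings):

```python
# gunicorn_conf.py -- illustrative values only, not the exact production settings.
worker_class = "uvicorn.workers.UvicornWorker"
workers = 4
timeout = 120          # kill a worker that is silent for this many seconds
graceful_timeout = 30  # grace period for workers to finish during restarts
keepalive = 5          # seconds an idle keep-alive connection is held open
```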
Top GitHub Comments
Thanks for the report and discussion! ☕
This was possibly solved in https://github.com/tiangolo/fastapi/pull/5122, released as part of FastAPI 0.82.0 🎉
Could you check it? If this solves it for you, you could close the issue. 🤓
I think I’ve figured this out, though I’m not sure whether this is really a problem with FastAPI or with httpx, which is the client I’m using. It appears that if an httpx client with keep-alive connections hits a timeout error, it doesn’t complete the request. As a result, FastAPI just waits patiently for more data to come in, even though the client request is dead, and it doesn’t notice until the connection itself is closed.
In short, the requests themselves aren’t actually taking this long; the client has bailed, and FastAPI just keeps waiting.
I have a (mostly) reproducible example now:
Dockerfile, /app/main.py, and test_client.py (the original file contents were not preserved in this copy of the issue; rough sketches follow below).
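Since the file contents aren’t reproduced above, here is a reconstruction of the kind of /app/main.py involved (a sketch consistent with the behavior described below, not necessarily the exact original):

```python
# Sketch of a minimal /app/main.py consistent with the behavior described below;
# a reconstruction, not necessarily the original file.
import time

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Payload(BaseModel):
    data: str


@app.post("/echo")
def echo(payload: Payload):
    """Sync endpoint that takes longer to respond than the client-side timeout."""
    print("endpoint entered", flush=True)
    time.sleep(1)  # longer than the client's timeout in test_client.py
    print("endpoint finished", flush=True)
    return {"length": len(payload.data)}
```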
Start the FastAPI server.
Run the test client.
You’ll observe that test_client.py immediately issues timeouts, but FastAPI doesn’t register any output until 10 seconds later, when the client closes the connections. If you extend the wait time to 3 minutes, the same behavior occurs.
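The client side looks roughly like this (again a sketch under my own assumptions: a single long-lived keep-alive httpx.Client, a timeout shorter than the endpoint takes, and a delay before the process closes its connections; URL and values are illustrative):

```python
# test_client.py -- sketch of the client behavior described above, not necessarily
# the original script.
import time

import httpx

client = httpx.Client(timeout=0.25)  # keep-alive connections, very short timeout

for i in range(5):
    try:
        client.post("http://localhost:8000/echo", json={"data": "x" * 100_000})
    except httpx.TimeoutException as exc:
        print(f"request {i} timed out: {exc!r}")

# Per the observation above, the server registers nothing until the connections
# actually close, which only happens here, 10 seconds later.
time.sleep(10)
client.close()
```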
I’m not sure what the expected behavior really should be. It does seem that httpx could potentially be better about notifying the server that nothing else is coming, but it also seems that something is awry with how keep-alives are handled by gunicorn/uvicorn/FastAPI. Ideally, the server should be configurable to time out more proactively on these dead connections.
A simple workaround, in my case at least, was to avoid keeping long-lived httpx clients and instead create a new client and connection for each request. This is probably worse from a performance perspective, but it does prevent these sorts of confusing timings.
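Concretely, the workaround amounts to something like this (sketch; the URL and timeout are illustrative):

```python
# Workaround sketch: open a fresh httpx client (and connection) per request rather
# than sharing a long-lived keep-alive client. The URL and timeout are illustrative.
import httpx


def post_mapping(original_id: str, new_id: str) -> dict:
    with httpx.Client(timeout=5.0) as client:  # connection is closed on block exit
        response = client.post(
            "http://localhost:8000/mapping",
            json={"original_id": original_id, "new_id": new_id},
        )
        response.raise_for_status()
        return response.json()
```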