Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Requests randomly fail to receive data

See original GitHub issue

Example

I’m having a very hard time reproducing this issue consistently, but have exhausted all other avenues that I could think of. I’ll do my best to describe the setup here, but unfortunately I couldn’t come up with a code sample that I was able to reproduce it consistently.

The gist is like this however:

import functools
from fastapi import FastAPI, Depends, HTTPException
from pydantic import BaseModel
from sqlalchemy import create_engine
from sqlalchemy.engine import Engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import Session, sessionmaker
import ddtrace

app = FastAPI()

@functools.lru_cache(maxsize=1)
def get_database_engine() -> Engine:
    """Retrieve the database engine.  There should only be one of these per application."""
    engine = create_engine("postgres://")
    return engine


@functools.lru_cache(maxsize=1)
def get_database_session(engine: Engine = Depends(get_database_engine)) -> sessionmaker:
    """Retrieve the database session maker.  There should only be one of these per application."""
    return sessionmaker(autocommit=True, autoflush=False, bind=engine)


def get_db(session_local: sessionmaker = Depends(get_database_session)):
    """Get a database connection."""
    with ddtrace.tracer.trace("db_connection_acquire"):
        db = session_local()
        try:
            yield db
        finally:
            db.close()

class MappingRequest(BaseModel):
    """Request class to create new mappings."""

    original_id: str
    new_id: str

@app.post("/mapping")
def create_mapping(
    upload: MappingRequest, db: Session = Depends(get_db)
):
    """Create a new dicom deid mapping for a series."""
    with ddtrace.tracer.trace("mapping_start"):
        try:
            with db.begin():
                with ddtrace.tracer.trace("mapping_transaction"):
                    mapping = crud.create_mapping(
                        db, upload.original_id, upload.new_id
                    )
                    return {"mapping_id": mapping.id}
        except IntegrityError:
            raise HTTPException(status_code=409, detail="Mapping exists")

We have a simple sync endpoint which inserts keys into the database. The DB operation is very quick. We’ve also added span traces to debug the behavior using datadog.

Description

Under moderate load, about 10 requests per second, the endpoint mostly responds very fast in the 10ms range. However, sometimes there are extreme outliers in response time, where the request goes to 3 minutes before the connection is killed by our ALB.

Traces during the long operation show that the route code is never hit, nor are the dependencies. This seems to indicate that something within FastAPI is failing to properly schedule or submit the sync jobs into the thread pool [URL details obfuscated to remove company info]

Compared with normal requests:

This seems to be a similar issue to the one described here: https://stackoverflow.com/questions/61466243/uvicorn-not-processing-some-requests-randamly

And similar behavior here: https://github.com/tiangolo/fastapi/issues/1195 but we are not on windows.

Environment

OS: [e.g. Linux / Windows / macOS]: docker with tiangolo/uvicorn-gunicorn-fastapi-docker:python:3.8, host is Linux.
FastAPI Version [e.g. 0.3.0]: 0.63.0
Python version: 3.8.2
Gunicorn version: 20.1.0
Uvicorn version: 0.13.4

Note that gunicorn and uvicorn were upgraded manually in the dockerfile in an attempt to resolve the issue. However, the default versions with the image exhibited the same behavior.

Additional context

We’ve tried a number of additional ways to reproduce this, but have been unsuccessful. When testing locally, even introducing 10x the load, we can’t reproduce this issue.

Also should note this is running in AWS in ECS, behind an ALB. We have toyed with the timeout settings within gunicorn and uvicorn to try and address this as well, but none of those seem to solve it.

Issue Analytics

State:
Created 2 years ago
Comments:8 (2 by maintainers)

Top GitHub Comments

2reactions

tiangolocommented, Sep 4, 2022

Thanks for the report and discussion! ☕

This was possibly solved in https://github.com/tiangolo/fastapi/pull/5122, released as part of FastAPI 0.82.0 🎉

Could you check it? If this solves it for you, you could close the issue. 🤓

2reactions

TimOrmecommented, May 6, 2021

I think I’ve figured this out, though not sure if this is really a problem with FastAPI or with httpx, which is the client I’m using. It appears that if a httpx client with keep-alive connections is running, and it has a timeout error, it doesn’t complete the request. As such, FastAPI just waits patiently for more data to come in, even though the client request is dead. FastAPI doesn’t notice that the client request is done until the connection itself is closed.

In short, the requests themselves aren’t actually taking this long, it’s just that the client has bailed, and FastAPI just keeps waiting.

I have a (mostly) reproducible example now:

DockerFile:

FROM tiangolo/uvicorn-gunicorn-fastapi:python3.7

COPY ./app /app

/app/main.py:

from fastapi import FastAPI
from pydantic import BaseModel
import time



class TimingMiddleware:
    """Middleware to track request time."""

    def __init__(self, app):
        """Initialize the middleware."""
        self.app = app

    async def __call__(self, scope, receive, send):
        """Invoke the middleware."""

        async def wrapped_receive():
            start = time.time()
            data = await receive()
            end = time.time()
            print(f"RECEIVED IN {end-start}")
            return data

        return await self.app(scope, wrapped_receive, send)


app = FastAPI()
app.add_middleware(TimingMiddleware)


class Item(BaseModel):
    data: str


@app.post("/endpoint")
def endpoint(item: Item):
    return {"Hello": "World"}

test_client.py:

import asyncio
import httpx
import time


async def make_request(client):
    # post a big chunk of data
    await client.post("http://localhost/endpoint", json={"data": "A" * 10000000})
    return "XXX"


async def main():
    async with httpx.AsyncClient(timeout=httpx.Timeout(read=5, connect=5, pool=5, write=0.00000000000000000000000000000000000000001)) as client:
        tasks = []
        for x in range(5):
            tasks.append(asyncio.create_task(make_request(client)))

        # Gather all tasks
        done = await asyncio.gather(*tasks, return_exceptions=True)
        print(done)  # Should always print WriteTimeout.

        # Just sleep for 10 seconds.  Connections wont be released until the end of this.
        time.sleep(10)

asyncio.run(main())

Start the FastAPI server.

docker build -t timeout .
docker run --rm --name timeout -p 80:80 timeout

Run the test client:

python test_client.py

You’ll observe that test_client.py immediately issues timeouts:

%python test_client.py
[WriteTimeout(''), WriteTimeout(''), WriteTimeout(''), WriteTimeout(''), WriteTimeout('')]

But that FastAPI doesn’t register any output until 10 seconds later, when the client closes the connections. If you extend the wait time to 3 minutes, the same behavior occurs.

I’m not sure what the expected behavior really should be. It does seem that httpx could potentially be better about notifying the server that nothing else is coming, but also seems that something is potentially awry with how keep-alives are working with gunicorn/uvicorn/fastapi. Ideally, the server should be configurable to time out more proactively on these dead connections.

A simple workaround, in my case at least, was to avoid keeping long-lived httpx clients, and instead create a new client and connection for each request. This is probably worse from a performance perspective, but it does prevent these sort of confusing timings.