Change the architecture

Currently, the API server, the cache/database, the assets, and the workers (which generate the data) all run on the same machine and share its resources. This is the source of several issues:

a worker that requires a lot of resources can block the server (https://grafana.huggingface.co/d/rYdddlPWk/node-exporter-full?orgId=2&refresh=1m&from=now-24h&to=now&var-DS_PROMETHEUS=HF Prometheus&var-job=node_exporter_metrics&var-node=datasets-preview-backend&var-diskdevices=[a-z]%2B|nvme[0-9]%2Bn[0-9]%2B)
to preserve resources for the API, we have to kill the warming process when memory usage gets too high, which requires manual supervision
also related to resource limits: we currently run the warming and refreshing tasks on one dataset at a time, although they are logically independent and could be launched in parallel on different workers, which would shorten these processes
also: the current implementation of the database/cache (diskcache) supports concurrent access, but I’m not sure I used it adequately in the code (see http://www.grantjenks.com/docs/diskcache/tutorial.html / cache.close())
having everything in the same application also means that everything is developed in Python (since the workers have to be in Python), while managing a queue and async processes could be easier in Node.js, for example
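
To illustrate the point about independent tasks, here is a minimal sketch of warming several datasets in parallel. `warm_dataset` and `warm_all` are hypothetical names, not functions from the project, and the work done per dataset is a placeholder. Threads are used only to keep the sketch self-contained; isolating memory usage from the API would need separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def warm_dataset(dataset: str) -> int:
    """Hypothetical stand-in for the real per-dataset warming task."""
    # Placeholder work: the real task would compute the splits and rows.
    return len(dataset)


def warm_all(datasets: list[str], max_workers: int = 3) -> tuple[dict[str, int], dict[str, str]]:
    """Run the independent per-dataset tasks in parallel.

    Returns (results, failures) so that one failing dataset does not
    prevent the others from being processed.
    """
    results: dict[str, int] = {}
    failures: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the dataset it was submitted for.
        futures = {pool.submit(warm_dataset, name): name for name in datasets}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as err:
                failures[name] = str(err)
    return results, failures


if __name__ == "__main__":
    print(warm_all(["glue", "squad", "imdb"]))
```

The total duration is then bounded by the slowest dataset per worker rather than by the sum of all datasets.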

The architecture I imagine would have these components:

API server
queue
database
file storage
workers

The API server would:

deliver the data (/rows, /splits, /valid, /cache-reports, /cache, /healthcheck) by querying the database directly; if the data is not in the database, return an error
serve the assets from the storage
command the queue (/webhook, /warm, /refresh) by sending new tasks to it (add authentication?)
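
The API server's role could be sketched roughly as follows. This is only an illustration under assumptions: the dict and `queue.Queue` stand in for the real database and queue, the handler names, status codes, and token check are hypothetical, and whether to authenticate at all is still the open question noted above.

```python
import hmac
import queue

# In-memory stand-ins (assumptions) for the real database and task queue.
database: dict[str, dict] = {}
tasks: "queue.Queue[dict]" = queue.Queue()


def get_splits(dataset: str) -> tuple[int, dict]:
    """Read-only endpoint: query the database, never compute anything."""
    entry = database.get(dataset)
    if entry is None:
        # Not in the database: return an error instead of computing on demand.
        return 404, {"error": f"dataset '{dataset}' is not in the database"}
    return 200, {"splits": entry["splits"]}


def post_refresh(dataset: str, token: str, expected_token: str) -> tuple[int, dict]:
    """Command endpoint: only send a new task to the queue."""
    # Hypothetical shared-secret check, compared in constant time.
    if not hmac.compare_digest(token, expected_token):
        return 401, {"error": "unauthorized"}
    tasks.put({"type": "refresh", "dataset": dataset})
    return 202, {"status": "queued"}
```

With this split, a slow or memory-hungry computation can no longer block a request, since the server only reads from the database and writes to the queue.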

The queue would:

manage the tasks sent by the API server
launch workers for these tasks
add/update/delete the data in the database and the assets in the storage

The database would:

store the datasets’ data

The storage would:

store the assets (image files for example)

The workers would:

compute the data for one dataset
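
Putting the queue, workers, database, and storage together, a minimal sketch could look like this. Everything here is an assumption made for illustration: the dicts stand in for the database and the file storage, `compute_dataset` is a hypothetical worker with placeholder output, and threads stand in for workers that would really run on separate machines.

```python
import queue
import threading

# In-memory stand-ins (assumptions) for the database and the file storage.
database: dict[str, dict] = {}
storage: dict[str, bytes] = {}


def compute_dataset(dataset: str) -> tuple[dict, dict[str, bytes]]:
    """Hypothetical worker: compute the data and assets for ONE dataset."""
    data = {"splits": ["train", "test"]}                  # placeholder data
    assets = {f"{dataset}/train/0.jpg": b"fake-image"}    # placeholder asset
    return data, assets


def run_queue(tasks: "queue.Queue[dict]", n_workers: int = 2) -> None:
    """Launch workers for the pending tasks and persist their results."""
    lock = threading.Lock()

    def worker_loop() -> None:
        while True:
            try:
                task = tasks.get_nowait()
            except queue.Empty:
                return  # no pending task left for this worker
            data, assets = compute_dataset(task["dataset"])
            # The queue side, not the worker, writes to database and storage.
            with lock:
                database[task["dataset"]] = data
                storage.update(assets)
            tasks.task_done()

    threads = [threading.Thread(target=worker_loop) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


if __name__ == "__main__":
    pending: "queue.Queue[dict]" = queue.Queue()
    for name in ["glue", "squad"]:
        pending.put({"dataset": name})
    run_queue(pending)
    print(sorted(database), sorted(storage))
```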

Issue Analytics

Created 2 years ago
Comments:6 (1 by maintainers)

Top GitHub Comments

severo commented, Oct 27, 2021

Partially done in https://github.com/huggingface/datasets-preview-backend/releases/tag/0.14.0:

the cache is now managed by a mongo database,
a queue (also in mongo) manages the pending jobs to refresh the cache, and multiple workers (3 at the moment in production) take care of processing them when resources are available
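
As a rough illustration of how a job queue stored in a database can feed a fixed number of workers, here is an in-memory sketch. It is not the project's actual schema or code: the `status` values, the function names, and the list of dicts are assumptions, and the lock plays the role that an atomic find-and-update operation (for example pymongo's `find_one_and_update`) would play against a real mongo collection.

```python
import threading
from typing import Optional

_lock = threading.Lock()


def claim_next_job(jobs: list[dict], max_started: int = 3) -> Optional[dict]:
    """Atomically move the oldest waiting job to "started".

    Returns None when nothing is waiting or when `max_started` jobs are
    already running, i.e. when no resources are available.
    """
    with _lock:
        started = sum(1 for job in jobs if job["status"] == "started")
        if started >= max_started:
            return None
        for job in jobs:
            if job["status"] == "waiting":
                job["status"] = "started"
                return job
    return None


def finish_job(job: dict) -> None:
    """Mark a job as finished, which frees a slot for the next one."""
    with _lock:
        job["status"] = "finished"


if __name__ == "__main__":
    jobs = [{"dataset": name, "status": "waiting"} for name in ["a", "b", "c", "d"]]
    running = [claim_next_job(jobs) for _ in range(4)]
    print(running)  # the fourth claim is refused until a job finishes
```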

severo commented, May 11, 2022

Done