[Bug] Offline API does not write into GCS
Search before asking
- I searched the issues and found no similar issues.
Ray Component
RLlib
What happened + What you expected to happen
Somehow the Offline API does not write output into a GCS bucket.
I am running my experiments in the cloud, and that works fine. I also use RLlib’s Offline API to write out variables from my custom environment and policy, which works fine as well.
What I want to do now is to store these outputs from the Offline API not on the head node but in a GCS bucket (which works, for example, for Tune syncing). I saw the following in the documentation:
# Specify where experiences should be saved:
# - None: don't save any experiences
# - "logdir" to save to the agent log dir
# - a path/URI to save to a custom output directory (e.g., "s3://bucket/")
# - a function that returns a rllib.offline.OutputWriter
"output": None,
# What sample batch columns to LZ4 compress in the output data.
"output_compress_columns": ["obs", "new_obs"],
# Max output file size before rolling over to a new file.
"output_max_file_size": 64 * 1024 * 1024,
where it also says “e.g., s3://bucket/”. This led me to think it would also work with GCS, but it does not. I do not even know where the output actually gets written to; at least it does not end up in the GCS bucket.
I checked the path to the bucket several times (I copied it directly) and made sure it is correct. When running tune.run() I get the following debug info:
DEBUG json_writer.py:77 -- Wrote 1477409 bytes to <_io.TextIOWrapper name='output/output-2021-10-28_02-22-24_worker-1_0.json' encoding='UTF-8'> in 0.08527207374572754s
My original path in the config was actually
"output": "gs://output-from-train/output/"
What happened here? I haven’t found a solution yet, as this might be intertwined with Tune. I also asked myself whether GCS support is even implemented for the Offline API at all, or am I tilting at windmills here?
Thanks for any guidance here.
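(Side note: the DEBUG line above suggests the file was written somewhere locally. One way to hunt it down — a sketch that assumes Tune's default ~/ray_results directory, which may differ on your setup — would be:)

# Run on the node that produced the DEBUG line; ~/ray_results is the Tune default.
find ~/ray_results -name "output-2021-10-28_02-22-24_worker-1_0.json"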
Versions / Dependencies
I use the example-full.yaml to set up a cluster on GCP. All installations are the ones driven by the .yaml:
ubuntu v20.04.3 LTS (Focal Fossa), python v3.7.7, ray v1.7.0
Output of pip freeze:
absl-py==0.14.1
accelerate==0.3.0
adal==1.2.7
aioboto3==8.3.0
aiobotocore==1.2.2
aiohttp==3.7.4.post0
aiohttp-cors==0.7.0
aioitertools==0.8.0
aiojobs==0.3.0
aioredis==1.3.1
alembic==1.4.1
applicationinsights==0.11.10
argcomplete==1.12.3
argon2-cffi==21.1.0
asgiref==3.4.1
astunparse==1.6.3
async-timeout==3.0.1
atari-py==0.2.9
attrs==21.2.0
autocfg==0.0.8
autogluon.core==0.1.0
autograd==1.3
autopage==0.4.0
ax-platform==0.2.1
azure-cli-core==2.22.0
azure-cli-telemetry==1.0.6
azure-common==1.1.27
azure-core==1.19.0
azure-mgmt-compute==14.0.0
azure-mgmt-core==1.3.0
azure-mgmt-msi==1.0.0
azure-mgmt-network==10.2.0
azure-mgmt-resource==13.0.0
backcall==0.2.0
backoff==1.10.0
bayesian-optimization==1.2.0
bcrypt==3.2.0
bleach==4.1.0
blessings==1.7
blist==1.3.6
bokeh==2.3.3
boto==2.49.0
boto3==1.18.53
botocore==1.21.53
botorch==0.5.0
brotlipy==0.7.0
cached-property==1.5.2
cachetools==4.2.4
catboost==1.0.0
certifi==2021.5.30
cffi @ file:///tmp/build/80754af9/cffi_1625814693446/work
chardet @ file:///tmp/build/80754af9/chardet_1607706768982/work
chex==0.0.8
click==8.0.1
cliff==3.9.0
cloudpickle==1.6.0
cma==2.7.0
cmaes==0.8.2
cmd2==2.2.0
colorama==0.4.4
colorful==0.5.4
colorlog==6.4.1
conda==4.10.3
conda-package-handling @ file:///tmp/build/80754af9/conda-package-handling_1618262151086/work
configparser==5.0.2
ConfigSpace==0.4.18
crcmod==1.7
cryptography==3.3.2
cycler==0.10.0
Cython==0.29.23
dask==2021.8.1
databricks-cli==0.15.0
datasets==1.11.0
debugpy==1.4.3
decorator==5.1.0
decord==0.6.0
defusedxml==0.7.1
dill==0.3.3
distributed==2021.8.1
dm-tree==0.1.6
docker==5.0.2
docker-pycreds==0.4.0
docutils==0.17.1
dopamine-rl==4.0.0
dragonfly-opt==0.1.6
entrypoints==0.3
fastapi==0.68.1
fasteners==0.16.3
filelock==3.2.0
FLAML==0.5.2
Flask==2.0.1
Flask-Cors==3.0.10
flatbuffers==1.12
flax==0.3.5
freezegun==1.1.0
fsspec==2021.10.1
future==0.18.2
gast==0.4.0
gcs-oauth2-boto-plugin==3.0
gcsfs==2021.10.1
gin-config==0.4.0
gitdb==4.0.7
GitPython==3.1.24
gluoncv==0.10.1.post0
google-api-core==1.31.3
google-api-python-client==1.7.8
google-apitools==0.5.32
google-auth==1.35.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.4.6
google-cloud-core==2.1.0
google-cloud-storage==1.42.3
google-crc32c==1.3.0
google-oauth==1.0.1
google-pasta==0.2.0
google-reauth==0.1.1
google-resumable-media==2.1.0
googleapis-common-protos==1.53.0
gpustat==0.6.0
GPy==1.10.0
gpytorch==1.5.1
graphviz==0.8.4
greenlet==1.1.2
grpcio==1.34.1
gsutil==5.4
gunicorn==20.1.0
gym==0.18.3
h11==0.12.0
h5py==3.1.0
HeapDict==1.0.1
HEBO==0.1.0
higher==0.2.1
hiredis==2.0.0
hpbandster==0.7.4
httplib2==0.19.1
huggingface-hub==0.0.12
humanfriendly==9.2
hyperopt==0.2.5
idna @ file:///home/linux1/recipes/ci/idna_1610986105248/work
imageio==2.9.0
importlib-metadata==4.8.1
iniconfig==1.1.1
ipykernel==6.4.1
ipython==7.28.0
ipython-genutils==0.2.0
ipywidgets==7.6.5
iso8601==0.1.16
isodate==0.6.0
itsdangerous==2.0.1
jax==0.2.21
jaxlib==0.1.71
jedi==0.18.0
Jinja2==3.0.1
jmespath==0.10.0
joblib==1.0.1
jsonschema==4.0.1
jupyter==1.0.0
jupyter-client==7.0.5
jupyter-console==6.4.0
jupyter-core==4.8.1
jupyterlab-pygments==0.1.2
jupyterlab-widgets==1.0.2
kaggle-environments==1.7.11
keras-nightly==2.5.0.dev2021032900
Keras-Preprocessing==1.1.2
kiwisolver==1.3.2
knack==0.8.2
kopf==1.34.0
kubernetes==17.17.0
lightgbm==3.2.1
lightning-bolts==0.4.0
locket==0.2.1
lz4==3.1.3
Mako==1.1.5
Markdown==3.3.4
MarkupSafe==2.0.1
matplotlib==3.4.3
matplotlib-inline==0.1.3
mistune==0.8.4
mlagents-envs==0.27.0
mlflow==1.19.0
modin==0.11.0
monotonic==1.6
more-itertools==8.10.0
moto==2.2.8
msal==1.15.0
msgpack==1.0.2
msrest==0.6.21
msrestazure==0.6.4
multidict==5.1.0
multiprocess==0.70.11.1
mxnet==1.8.0.post0
nbclient==0.5.4
nbconvert==6.2.0
nbformat==5.1.3
nest-asyncio==1.5.1
netifaces==0.11.0
networkx==2.6.3
nevergrad==0.4.3.post7
notebook==6.4.4
numpy==1.19.5
nvidia-ml-py3==7.352.0
oauth2client==4.1.3
oauthlib==3.1.1
onnx==1.9.0
onnxruntime==1.8.0
opencensus==0.7.13
opencensus-context==0.1.2
opencv-python==3.4.15.55
opentelemetry-api==1.1.0
opentelemetry-exporter-otlp==1.1.0
opentelemetry-exporter-otlp-proto-grpc==1.1.0
opentelemetry-proto==1.1.0
opentelemetry-sdk==1.1.0
opentelemetry-semantic-conventions==0.20b0
opt-einsum==3.3.0
optax==0.0.9
optuna==2.9.1
packaging==21.0
pandas==1.3.3
pandocfilters==1.5.0
paramiko==2.7.2
paramz==0.9.5
parso==0.8.2
partd==1.2.0
pathtools==0.1.2
patsy==0.5.2
pbr==5.6.0
PettingZoo==1.11.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==8.2.0
pkginfo==1.7.1
plotly==5.3.1
pluggy==1.0.0
portalocker==1.7.1
prettytable==2.2.1
prometheus-client==0.11.0
prometheus-flask-exporter==0.18.3
promise==2.3
prompt-toolkit==3.0.20
protobuf==3.17.3
psutil==5.8.0
ptyprocess==0.7.0
py==1.10.0
py-spy==0.3.10
py4j==0.10.9
pyaml==21.8.3
pyarrow==5.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybullet==3.1.7
pycosat==0.6.3
pycparser @ file:///tmp/build/80754af9/pycparser_1594388511720/work
pydantic==1.8.2
pyDeprecate==0.3.1
pygame==2.0.1
pyglet==1.5.0
Pygments==2.10.0
PyJWT==1.7.1
pymoo==0.4.2.2
pymunk==6.0.0
PyNaCl==1.4.0
pyOpenSSL @ file:///tmp/build/80754af9/pyopenssl_1608057966937/work
pyparsing==2.4.7
pyperclip==1.8.2
pypng==0.0.21
Pyro4==4.81
pyrsistent==0.18.0
PySocks @ file:///tmp/build/80754af9/pysocks_1594394576006/work
pyspark==3.1.2
pytest==6.2.5
pytest-remotedata==0.3.2
pytest-repeat==0.9.1
python-dateutil==2.8.2
python-editor==1.0.4
python-json-logger==2.0.2
pytorch-lightning==1.4.5
pytz==2021.1
pyu2f==0.1.5
PyWavelets==1.1.1
PyYAML==5.4.1
pyzmq==22.3.0
qtconsole==5.1.1
QtPy==1.11.2
querystring-parser==1.2.4
ray @ file:///home/ray/ray-1.7.0-cp37-cp37m-manylinux2014_x86_64.whl
ray-cpp==1.7.0
raydp-nightly==2021.9.17.dev0
recsim==0.2.4
redis==3.5.3
regex==2021.9.30
requests @ file:///tmp/build/80754af9/requests_1608241421344/work
requests-oauthlib==1.3.0
responses==0.14.0
retry-decorator==1.1.1
rsa==4.7.2
ruamel-yaml-conda @ file:///tmp/build/80754af9/ruamel_yaml_1616016701961/work
s3fs==2021.9.0
s3transfer==0.5.0
sacremoses==0.0.46
scikit-image==0.18.3
scikit-learn==0.24.2
scikit-optimize==0.8.1
scipy==1.5.4
Send2Trash==1.8.0
sentencepiece==0.1.96
sentry-sdk==1.4.3
serpent==1.40
shortuuid==1.0.1
sigopt==7.5.0
six==1.15.0
smart-open==5.1.0
smmap==4.0.0
sortedcontainers==2.4.0
SQLAlchemy==1.4.25
sqlparse==0.4.2
starlette==0.14.2
statsmodels==0.13.0
stevedore==3.4.0
subprocess32==3.5.4
SuperSuit==3.1.0
tabulate==0.8.9
tblib==1.7.0
tenacity==8.0.1
tensorboard==2.6.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorboardX==2.4
tensorflow==2.5.0
tensorflow-estimator==2.5.0
tensorflow-probability==0.13.0
termcolor==1.1.0
terminado==0.12.1
testpath==0.5.0
tf-slim==1.1.0
tf2onnx==1.8.5
threadpoolctl==3.0.0
tifffile==2021.8.30
timm==0.4.5
tokenizers==0.10.3
toml==0.10.2
toolz==0.11.1
torch==1.9.0+cu111
torchmetrics==0.5.1
torchvision==0.10.0+cu111
tornado==6.1
tqdm @ file:///tmp/build/80754af9/tqdm_1625563689033/work
traitlets==5.1.0
transformers==4.9.1
typeguard==2.12.1
typing-extensions==3.7.4.3
uritemplate==3.0.1
urllib3 @ file:///tmp/build/80754af9/urllib3_1625084269274/work
uvicorn==0.15.0
wandb==0.10.29
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.2.1
Werkzeug==2.0.1
widgetsnbextension==3.5.1
wrapt==1.12.1
xgboost==1.3.3
xmltodict==0.12.0
xxhash==2.0.2
yacs==0.1.8
yarl==1.6.3
zict==2.0.0
zipp==3.6.0
zoopt==0.4.1
Reproduction script
1. Set up the Ray cluster on GCP
As described above, I use the example-full.yaml prepared by the Ray team:
# An unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of workers nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:  # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536
    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs
    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: <PROJECT_ID> # Globally unique project id

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    # By default Ray creates a new private keypair, but you can also use your own.
    # If you do so, make sure to also set "KeyName" in the head and worker node
    # configurations below. This requires that you have added the key into the
    # project wide meta-data.
    # ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray_head_default:
        # The resources provided by this node type.
        resources: {"CPU": 2}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
            # Additional options can be found in in the compute docs at
            # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
            # If the network interface is specified as below in both head and worker
            # nodes, the manual network config is used. Otherwise an existing subnet is
            # used. To use a shared subnet, ask the subnet owner to grant permission
            # for 'compute.subnetworks.use' to the ray autoscaler account...
            # networkInterfaces:
            #   - kind: compute#networkInterface
            #     subnetwork: path/to/subnet
            #     aliasIpRanges: []
    ray_worker_small:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 2}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
            # Run workers on preemtible instance by default.
            # Comment this out to use on-demand.
            scheduling:
              - preemptible: true
            # Un-Comment this to launch workers with the Service Account of the Head Node
            serviceAccounts:
              - email: ray-autoscaler-sa-v1@<PROJECT_ID>.iam.gserviceaccount.com
                scopes:
                  - https://www.googleapis.com/auth/cloud-platform
        # Additional options can be found in in the compute docs at
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

# Specify the node type of the head node (as configured above).
head_node_type: ray_head_default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands:
    - pip install gsutil
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip install google-api-python-client==1.7.8

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

head_node: {}
worker_nodes: {}
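The cluster is then brought up with the usual launcher command:

ray up example-full.yaml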
2. Set up gsutil
We have to set up gsutil such that Ray can write into the GCS bucket:
ray attach example-full.yaml
# Then on the head node:
gsutil config
# Follow the instructions of gsutil
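Before submitting anything, it is probably worth a quick smoke test that the node can actually write to the bucket used below (file and object names here are arbitrary):

# Still on the head node: try writing a small test object to the bucket.
echo "test" > /tmp/gcs_write_test.txt
gsutil cp /tmp/gcs_write_test.txt gs://ray-results-29102021/output/
gsutil ls gs://ray-results-29102021/output/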
3. Submit the script
The following script (main.py) contains the code to be executed on the cluster via ray submit example-full.yaml main.py:
import ray
from ray import tune
import ray.rllib.agents.dqn as dqn

ray.init(address="auto")

config = dqn.SIMPLE_Q_DEFAULT_CONFIG.copy()
config["lr"] = .0000625

tune_config = {
    "env": "Breakout-v0",
    "model": config,
    "num_gpus": 0,
    "num_workers": 0,
    "log_level": "DEBUG",
    "output": "gs://ray-results-29102021/output"
}

tune.run(
    dqn.SimpleQTrainer,
    config=tune_config,
    sync_config=tune.SyncConfig(
        sync_to_driver=False,
        upload_dir="gs://ray-results-29102021/",
    ),
    stop={"training_iteration": 3}
)
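One variation I have not verified end-to-end (so please read it as a sketch, not a confirmed fix): since Tune's syncing to the bucket does work for me, writing the Offline API output into the trial logdir and letting upload_dir pick it up might be a stopgap:

tune_config = {
    "env": "Breakout-v0",
    "model": config,
    "num_gpus": 0,
    "num_workers": 0,
    "log_level": "DEBUG",
    # "logdir" is one of the documented options (see the config excerpt above);
    # the offline JSON files then land in the trial directory, which upload_dir
    # should sync to GCS. Whether that also covers files written on worker nodes
    # with sync_to_driver=False is exactly what I am unsure about.
    "output": "logdir",
}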
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
@simonsays1980 one workaround before #21907 gets merged:
Hi, instead of using worker_setup_commands to copy things over (which I believe was using the wrong syntax), could you just manually copy the file over and inspect it? Can you manually SSH onto the worker node and just run a sample gsutil command? You don’t necessarily need to run Tune or RLlib.
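For completeness, the config excerpt at the top also mentions passing “a function that returns a rllib.offline.OutputWriter”. Below is a rough, untested sketch of what a GCS-backed writer could look like; the class name, the gcsfs usage, the naive serialization, and the exact way the callable is hooked into "output" are my assumptions, not verified RLlib behavior.

import json

import gcsfs  # already in the environment above
import numpy as np

from ray.rllib.offline import OutputWriter


class GCSJsonWriter(OutputWriter):
    """Hypothetical writer that streams sample batches as JSON lines to GCS."""

    def __init__(self, ioctx, uri="gs://ray-results-29102021/output/rollouts.json"):
        self.ioctx = ioctx
        # gcsfs uses the node's default GCP credentials.
        self._file = gcsfs.GCSFileSystem().open(uri, "w")

    def write(self, sample_batch):
        # Naive serialization: turn every column into a plain list so that
        # json.dumps() works. Compression, file rollover, and multi-agent
        # batches are ignored in this sketch.
        row = {k: np.asarray(v).tolist() for k, v in sample_batch.items()}
        self._file.write(json.dumps(row) + "\n")


# In the config, per the docs excerpt above, instead of "output": "gs://...":
# "output": lambda ioctx: GCSJsonWriter(ioctx),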