[Bug] RayTaskError(RayOutOfMemoryError) although there's plenty of free SWAP left
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Core, Ray Clusters
What happened + What you expected to happen
Ray throws an exception when the node's RAM is full. I would expect it to continue using the available swap and finish the tasks. Also, setting os.environ["RAY_DISABLE_MEMORY_MONITOR"] = "1" (before importing ray) does nothing.
Versions / Dependencies
name: puma-lab
channels:
- pyviz
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=1_gnu
- abseil-cpp=20210324.2=h9c3ff4c_0
- alsa-lib=1.2.3=h516909a_0
- anyio=3.4.0=py37h89c1867_0
- aplus=0.11.0=py_1
- appdirs=1.4.4=pyh9f0ad1d_0
- argcomplete=1.12.3=pyhd8ed1ab_2
- argon2-cffi=21.1.0=py37h5e8e339_2
- arrow-cpp=6.0.1=py37h815fc2d_3_cpu
- astropy=4.3.1=py37hb1e94ed_2
- async_generator=1.10=py_0
- attrs=21.2.0=pyhd8ed1ab_0
- aws-c-auth=0.6.8=hfef2836_0
- aws-c-cal=0.5.12=h70efedd_7
- aws-c-common=0.6.17=h7f98852_0
- aws-c-compression=0.2.14=h7c7754b_7
- aws-c-event-stream=0.2.7=hb80ed28_31
- aws-c-http=0.6.10=h58a30cf_2
- aws-c-io=0.10.13=he836878_5
- aws-c-mqtt=0.7.9=h042a236_0
- aws-c-s3=0.1.27=h4f4cd48_12
- aws-c-sdkutils=0.1.1=h7c7754b_4
- aws-checksums=0.1.12=h7c7754b_6
- aws-crt-cpp=0.17.9=hc7d31a4_1
- aws-sdk-cpp=1.9.154=h77f1c7e_0
- babel=2.9.1=pyh44b312d_0
- backcall=0.2.0=pyh9f0ad1d_0
- backports=1.0=py_2
- backports.functools_lru_cache=1.6.4=pyhd8ed1ab_0
- backports.zoneinfo=0.2.1=py37h5e8e339_4
- blake3=0.2.1=py37hfd0a3e1_0
- bleach=4.1.0=pyhd8ed1ab_0
- blosc=1.21.0=h9c3ff4c_0
- bokeh=2.4.2=py37h89c1867_0
- bqplot=0.12.31=pyhd8ed1ab_0
- branca=0.4.2=pyhd8ed1ab_0
- brotli=1.0.9=h7f98852_6
- brotli-bin=1.0.9=h7f98852_6
- brotlipy=0.7.0=py37h5e8e339_1003
- brunsli=0.1=h9c3ff4c_0
- bzip2=1.0.8=h7f98852_4
- c-ares=1.18.1=h7f98852_0
- c-blosc2=2.0.4=h5f21a17_1
- ca-certificates=2021.10.8=ha878542_0
- cached-property=1.5.2=hd8ed1ab_1
- cached_property=1.5.2=pyha770c72_1
- cachetools=4.2.4=pyhd8ed1ab_0
- cffi=1.15.0=py37h036bc23_0
- cfitsio=4.0.0=h9a35b8e_0
- charls=2.2.0=h9c3ff4c_0
- click=8.0.3=py37h89c1867_1
- clickhouse-cityhash=1.0.2.3=py37hcd2ae1e_3
- clickhouse-driver=0.2.2=py37h5e8e339_1
- cloudpickle=2.0.0=pyhd8ed1ab_0
- colorama=0.4.4=pyh9f0ad1d_0
- colorcet=3.0.0=pyhd8ed1ab_0
- cramjam=2.3.1=py37h5e8e339_1
- cryptography=36.0.0=py37hf1a17b8_0
- cycler=0.11.0=pyhd8ed1ab_0
- cytoolz=0.11.2=py37h5e8e339_1
- dask=2021.11.2=pyhd8ed1ab_0
- dask-core=2021.11.2=pyhd8ed1ab_0
- datashader=0.13.0=pyh6c4a22f_0
- datashape=0.5.4=py_1
- dbus=1.13.6=h48d8840_2
- debugpy=1.5.1=py37hcd2ae1e_0
- decorator=5.1.0=pyhd8ed1ab_0
- defusedxml=0.7.1=pyhd8ed1ab_0
- distributed=2021.11.2=py37h89c1867_0
- entrypoints=0.3=py37hc8dfbb8_1002
- expat=2.4.1=h9c3ff4c_0
- fastapi=0.70.0=pyhd8ed1ab_0
- fastparquet=0.7.2=py37hb1e94ed_0
- filelock=3.4.0=pyhd8ed1ab_0
- fontconfig=2.13.1=hba837de_1005
- fonttools=4.28.3=py37h5e8e339_0
- freetype=2.10.4=h0708190_1
- frozendict=2.0.3=pyhd8ed1ab_0
- fsspec=2021.11.1=pyhd8ed1ab_0
- future=0.18.2=py37h89c1867_4
- geos=3.10.1=h9c3ff4c_1
- gettext=0.19.8.1=h73d1719_1008
- gflags=2.2.2=he1b5a44_1004
- giflib=5.2.1=h36c2ea0_2
- gitdb=4.0.9=pyhd8ed1ab_0
- gitpython=3.1.24=pyhd8ed1ab_0
- glib=2.70.1=h780b84a_0
- glib-tools=2.70.1=h780b84a_0
- glog=0.5.0=h48cff8f_0
- grpc-cpp=1.42.0=h7e358d9_0
- gst-plugins-base=1.18.5=hf529b03_2
- gstreamer=1.18.5=h9f60fe5_2
- h5py=3.6.0=nompi_py37hd308b1e_100
- hdf5=1.12.1=nompi_h2750804_103
- heapdict=1.0.1=py_0
- holoviews=1.14.6=py_0
- hvplot=0.7.3=py_0
- icu=68.2=h9c3ff4c_0
- imagecodecs=2021.11.20=py37h4167934_1
- imageio=2.13.1=pyhd8ed1ab_1
- importlib-metadata=4.8.2=py37h89c1867_0
- importlib_metadata=4.8.2=hd8ed1ab_0
- importlib_resources=5.4.0=pyhd8ed1ab_0
- ipydatawidgets=4.2.0=pyhd3deb0d_0
- ipykernel=6.6.0=py37h6531663_0
- ipyleaflet=0.15.0=pyhd8ed1ab_0
- ipympl=0.8.2=pyhd8ed1ab_0
- ipython=7.30.1=py37h89c1867_0
- ipython_genutils=0.2.0=py_1
- ipyvolume=0.6.0a8=pyhd8ed1ab_0
- ipyvue=1.7.0=pyhd8ed1ab_0
- ipyvuetify=1.8.1=pyhd8ed1ab_0
- ipywebrtc=0.6.0=pyhd8ed1ab_0
- ipywidgets=7.6.5=pyhd8ed1ab_0
- jbig=2.1=h7f98852_2003
- jedi=0.18.1=py37h89c1867_0
- jinja2=3.0.3=pyhd8ed1ab_0
- jpeg=9d=h36c2ea0_0
- json5=0.9.5=pyh9f0ad1d_0
- jsonschema=4.2.1=pyhd8ed1ab_0
- jupyter-server-mathjax=0.2.3=pyhd8ed1ab_0
- jupyter_client=7.1.0=pyhd8ed1ab_0
- jupyter_contrib_core=0.3.3=py_2
- jupyter_contrib_nbextensions=0.5.1=py37hc8dfbb8_1
- jupyter_core=4.9.1=py37h89c1867_1
- jupyter_highlight_selected_word=0.2.0=py37h89c1867_1005
- jupyter_latex_envs=1.4.6=py37h89c1867_1001
- jupyter_nbextensions_configurator=0.4.1=py37h89c1867_2
- jupyter_server=1.12.1=pyhd8ed1ab_0
- jupyterlab=3.2.4=pyhd8ed1ab_0
- jupyterlab-git=0.34.0=pyhd8ed1ab_0
- jupyterlab_pygments=0.1.2=pyh9f0ad1d_0
- jupyterlab_server=2.8.2=pyhd8ed1ab_0
- jupyterlab_widgets=1.0.2=pyhd8ed1ab_0
- jxrlib=1.1=h7f98852_2
- kiwisolver=1.3.2=py37h2527ec5_1
- krb5=1.19.2=hcc1bbae_3
- lcms2=2.12=hddcbb42_0
- ld_impl_linux-64=2.36.1=hea4e1c9_2
- lerc=3.0=h9c3ff4c_0
- libaec=1.0.6=h9c3ff4c_0
- libblas=3.9.0=12_linux64_openblas
- libbrotlicommon=1.0.9=h7f98852_6
- libbrotlidec=1.0.9=h7f98852_6
- libbrotlienc=1.0.9=h7f98852_6
- libcblas=3.9.0=12_linux64_openblas
- libclang=11.1.0=default_ha53f305_1
- libcurl=7.80.0=h2574ce0_0
- libdeflate=1.8=h7f98852_0
- libedit=3.1.20191231=he28a2e2_2
- libev=4.33=h516909a_1
- libevent=2.1.10=h9b69904_4
- libffi=3.4.2=h7f98852_5
- libgcc-ng=11.2.0=h1d223b6_11
- libgfortran-ng=11.2.0=h69a702a_11
- libgfortran5=11.2.0=h5c6108e_11
- libglib=2.70.1=h174f98d_0
- libgomp=11.2.0=h1d223b6_11
- libiconv=1.16=h516909a_0
- liblapack=3.9.0=12_linux64_openblas
- libllvm10=10.0.1=he513fc3_3
- libllvm11=11.1.0=hf817b99_2
- libnghttp2=1.43.0=h812cca2_1
- libnsl=2.0.0=h7f98852_0
- libogg=1.3.4=h7f98852_1
- libopenblas=0.3.18=pthreads_h8fe5266_0
- libopus=1.3.1=h7f98852_1
- libpng=1.6.37=h21135ba_2
- libpq=13.5=hd57d9b9_0
- libprotobuf=3.18.1=h780b84a_0
- libsodium=1.0.18=h36c2ea0_1
- libssh2=1.10.0=ha56f1ee_2
- libstdcxx-ng=11.2.0=he4da1e4_11
- libthrift=0.15.0=he6d91bd_1
- libtiff=4.3.0=h6f004c6_2
- libutf8proc=2.6.1=h7f98852_0
- libuuid=2.32.1=h7f98852_1000
- libvorbis=1.3.7=h9c3ff4c_0
- libwebp-base=1.2.1=h7f98852_0
- libxcb=1.13=h7f98852_1004
- libxkbcommon=1.0.3=he3ba5ed_0
- libxml2=2.9.12=h72842e0_0
- libxslt=1.1.33=h15afd5d_2
- libzlib=1.2.11=h36c2ea0_1013
- libzopfli=1.0.3=h9c3ff4c_0
- llvmlite=0.36.0=py37h9d7f4d0_0
- locket=0.2.0=py_2
- lxml=4.6.4=py37h77fd288_0
- lz4-c=1.9.3=h9c3ff4c_1
- markdown=3.3.6=pyhd8ed1ab_0
- markupsafe=2.0.1=py37h5e8e339_1
- matplotlib=3.5.0=py37h89c1867_0
- matplotlib-base=3.5.0=py37h1058ff1_0
- matplotlib-inline=0.1.3=pyhd8ed1ab_0
- mistune=0.8.4=py37h5e8e339_1005
- msgpack-python=1.0.3=py37h2527ec5_0
- multipledispatch=0.6.0=py_0
- munkres=1.1.4=pyh9f0ad1d_0
- mysql-common=8.0.27=ha770c72_1
- mysql-libs=8.0.27=hfa10184_1
- nb_conda_kernels=2.3.1=py37h89c1867_1
- nbclassic=0.3.4=pyhd8ed1ab_0
- nbclient=0.5.9=pyhd8ed1ab_0
- nbconvert=6.3.0=py37h89c1867_1
- nbdime=3.1.1=pyhd8ed1ab_0
- nbformat=5.1.3=pyhd8ed1ab_0
- ncurses=6.2=h58526e2_4
- nest-asyncio=1.5.4=pyhd8ed1ab_0
- networkx=2.6.3=pyhd8ed1ab_1
- notebook=6.4.6=pyha770c72_0
- nspr=4.32=h9c3ff4c_1
- nss=3.73=hb5efdd6_0
- numba=0.53.1=py37hb11d6e1_1
- numpy=1.21.4=py37h31617e3_0
- olefile=0.46=pyh9f0ad1d_1
- openjpeg=2.4.0=hb52868f_1
- openssl=1.1.1l=h7f98852_0
- orc=1.7.1=h68e2c4e_0
- packaging=21.3=pyhd8ed1ab_0
- pandas=1.3.4=py37he8f5f7f_1
- pandoc=2.16.2=h7f98852_0
- pandocfilters=1.5.0=pyhd8ed1ab_0
- panel=0.12.5=py_0
- param=1.12.0=pyh6c4a22f_0
- parquet-cpp=1.5.1=1
- parso=0.8.3=pyhd8ed1ab_0
- partd=1.2.0=pyhd8ed1ab_0
- pcre=8.45=h9c3ff4c_0
- pexpect=4.8.0=py37hc8dfbb8_1
- pickleshare=0.7.5=py37hc8dfbb8_1002
- pillow=8.4.0=py37h0f21c89_0
- pip=21.3.1=pyhd8ed1ab_0
- pooch=1.5.2=pyhd8ed1ab_0
- progressbar2=3.53.1=pyh9f0ad1d_0
- prometheus_client=0.12.0=pyhd8ed1ab_0
- prompt-toolkit=3.0.22=pyha770c72_0
- psutil=5.8.0=py37h5e8e339_2
- pthread-stubs=0.4=h36c2ea0_1001
- ptyprocess=0.7.0=pyhd3deb0d_0
- pyarrow=6.0.1=py37h20dbb2a_3_cpu
- pycparser=2.21=pyhd8ed1ab_0
- pyct=0.4.6=py_0
- pyct-core=0.4.6=py_0
- pydantic=1.8.2=py37h5e8e339_2
- pyerfa=2.0.0.1=py37hb1e94ed_1
- pygments=2.10.0=pyhd8ed1ab_0
- pykalman=0.9.5=py_1
- pyopenssl=21.0.0=pyhd8ed1ab_0
- pyparsing=3.0.6=pyhd8ed1ab_0
- pyqt=5.12.3=py37h89c1867_8
- pyqt-impl=5.12.3=py37hac37412_8
- pyqt5-sip=4.19.18=py37hcd2ae1e_8
- pyqtchart=5.12=py37he336c9b_8
- pyqtwebengine=5.12.1=py37he336c9b_8
- pyrsistent=0.18.0=py37h5e8e339_0
- pysocks=1.7.1=py37h89c1867_4
- python=3.7.12=hb7a2778_100_cpython
- python-dateutil=2.8.2=pyhd8ed1ab_0
- python-tzdata=2021.5=pyhd8ed1ab_0
- python-utils=2.5.6=pyh44b312d_0
- python_abi=3.7=2_cp37m
- pythreejs=2.3.0=pyhd8ed1ab_0
- pytz=2021.3=pyhd8ed1ab_0
- pytz-deprecation-shim=0.1.0.post0=py37h89c1867_1
- pyviz_comms=2.1.0=py_0
- pywavelets=1.2.0=py37hb1e94ed_1
- pyyaml=6.0=py37h5e8e339_3
- pyzmq=22.3.0=py37h336d617_1
- qt=5.12.9=hda022c4_4
- re2=2021.11.01=h9c3ff4c_0
- readline=8.1=h46c0cb4_0
- requests=2.26.0=pyhd8ed1ab_1
- s2n=1.3.0=h9b69904_0
- scikit-image=0.18.3=py37he8f5f7f_1
- scipy=1.7.3=py37hf2a6cf1_0
- send2trash=1.8.0=pyhd8ed1ab_0
- setuptools=59.4.0=py37h89c1867_0
- shapely=1.8.0=py37h9b0f7a3_4
- six=1.16.0=pyh6c4a22f_0
- smmap=3.0.5=pyh44b312d_0
- snappy=1.1.8=he1b5a44_3
- sniffio=1.2.0=py37h89c1867_2
- sortedcontainers=2.4.0=pyhd8ed1ab_0
- sqlite=3.37.0=h9cd32fc_0
- starlette=0.16.0=pyhd8ed1ab_0
- tabulate=0.8.9=pyhd8ed1ab_0
- tblib=1.7.0=pyhd8ed1ab_0
- terminado=0.12.1=py37h89c1867_1
- testpath=0.5.0=pyhd8ed1ab_0
- thrift=0.15.0=py37hcd2ae1e_1
- tifffile=2021.11.2=pyhd8ed1ab_0
- tk=8.6.11=h27826a3_1
- toolz=0.11.2=pyhd8ed1ab_0
- tornado=6.1=py37h5e8e339_2
- tqdm=4.62.3=pyhd8ed1ab_0
- traitlets=5.1.1=pyhd8ed1ab_0
- traittypes=0.2.1=pyh9f0ad1d_2
- typing-extensions=4.0.1=hd8ed1ab_0
- typing_extensions=4.0.1=pyha770c72_0
- tzdata=2021e=he74cb21_0
- tzlocal=4.1=py37h89c1867_1
- unicodedata2=13.0.0.post2=py37h5e8e339_4
- urllib3=1.26.7=pyhd8ed1ab_0
- vaex=4.6.0=pyhd8ed1ab_0
- vaex-astro=0.8.3=pyhd8ed1ab_0
- vaex-core=4.6.0=py37h092ef5d_0
- vaex-hdf5=0.11.0=pyhd8ed1ab_0
- vaex-jupyter=0.6.0=pyhd8ed1ab_0
- vaex-ml=0.15.0=pyhd8ed1ab_0
- vaex-server=0.7.0=pyhd8ed1ab_0
- vaex-viz=0.5.0=pyhd8ed1ab_0
- wcwidth=0.2.5=pyh9f0ad1d_2
- webencodings=0.5.1=py_1
- websocket-client=1.2.1=py37h89c1867_0
- wheel=0.37.0=pyhd8ed1ab_1
- widgetsnbextension=3.5.2=py37h89c1867_1
- xarray=0.20.1=pyhd8ed1ab_0
- xorg-libxau=1.0.9=h7f98852_0
- xorg-libxdmcp=1.1.3=h7f98852_0
- xz=5.2.5=h516909a_1
- yaml=0.2.5=h516909a_0
- zeromq=4.3.4=h9c3ff4c_1
- zfp=0.5.5=h9c3ff4c_8
- zict=2.0.0=py_0
- zipp=3.6.0=pyhd8ed1ab_0
- zlib=1.2.11=h36c2ea0_1013
- zstandard=0.16.0=py37h5e8e339_2
- zstd=1.5.0=ha95c52a_0
- pip:
- aiohttp==3.8.1
- aiohttp-cors==0.7.0
- aioredis==1.3.1
- aiosignal==1.2.0
- async-timeout==4.0.1
- asynctest==0.13.0
- blessed==1.19.0
- certifi==2021.10.8
- charset-normalizer==2.0.9
- colorful==0.5.4
- deprecated==1.2.13
- frozenlist==1.2.0
- google-api-core==2.2.2
- google-auth==2.3.3
- googleapis-common-protos==1.53.0
- gpustat==1.0.0b1
- grpcio==1.42.0
- hiredis==2.0.0
- idna==3.3
- multidict==5.2.0
- nvidia-ml-py3==7.352.0
- opencensus==0.8.0
- opencensus-context==0.1.2
- protobuf==3.19.1
- py-spy==0.3.11
- pyasn1==0.4.8
- pyasn1-modules==0.2.8
- ray==1.9.0
- redis==4.0.2
- rsa==4.8
- smart-open==5.2.1
- wrapt==1.13.3
- yarl==1.7.2
Reproduction script
---------------------------------------------------------------------------
RayTaskError(RayOutOfMemoryError) Traceback (most recent call last)
<timed exec> in <module>
/tmp/ipykernel_2046457/2054134056.py in get_hist(asks, labels, window_len_min, window_len_max, resolution_window, resolution_feature_threshold)
68 while refs_task:
69 refs_done, refs_task = ray.wait(refs_task)
---> 70 dfs = ray.get(refs_done)
71 res = pd.concat([res] + dfs, ignore_index=True)
72 return res
~/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
103 if func.__name__ != "init" or is_client_mode_enabled_by_default:
104 return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)
106
107 return wrapper
~/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/worker.py in get(object_refs, timeout)
1711 worker.core_worker.dump_object_store_memory_usage()
1712 if isinstance(value, RayTaskError):
-> 1713 raise value.as_instanceof_cause()
1714 else:
1715 raise value
RayTaskError(RayOutOfMemoryError): ray::fun() (pid=50618, ip=192.168.0.106)
ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node node1 is used (14.88 / 15.53 GB). The top 10 memory consumers are:
PID MEM COMMAND
50462 0.6GiB ray::fun()
50524 0.6GiB ray::fun()
50493 0.6GiB ray::fun()
50555 0.59GiB ray::fun()
50275 0.58GiB ray::fun()
50244 0.57GiB ray::fun()
50151 0.57GiB ray::fun()
50120 0.57GiB ray::fun()
50306 0.57GiB ray::fun()
50369 0.57GiB ray::fun()
In addition, up to 0.21 GiB of shared memory is currently being used by the Ray object store.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
--- To disable OOM exceptions, set RAY_DISABLE_MEMORY_MONITOR=1.
---
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Issue Analytics
- State:
- Created 2 years ago
- Comments:8 (6 by maintainers)
Good questions! Let me answer them one by one here.
Yes. If you are using cluster.yaml, you will have a ray start command somewhere. You can set the variable there.
The reason os.environ is not working in this case, I think, is that the workers are started from the process launched by ray start, so os.environ in your driver script (the Python script in which you ran ray.init(address='auto')) won't be propagated there.
Not every system has swap enabled by default. For example, I believe many EC2 instances don't have swap on by default.
Without swap, high memory usage is likely to cause unexpected behavior, so crashing early is the safer default. Also, swap is very slow compared to regular memory, so relying on it by default can cause many unexpected slowdowns.
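The propagation point above can be illustrated without Ray at all. The following is only a sketch of the general mechanism, not of Ray's internals: a child process receives the environment its launcher had when it was spawned, so changing os.environ later in a different process (here standing in for the driver script) has no effect on it. The helper name child_sees and the launcher_env snapshot are made up for this illustration.

```python
import os
import subprocess
import sys

VAR = "RAY_DISABLE_MEMORY_MONITOR"


def child_sees(env):
    """Spawn a child Python process with `env` and return the value it sees for VAR."""
    code = f"import os; print(os.environ.get({VAR!r}, 'unset'))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Snapshot of a launcher environment that does NOT contain the variable
# (a stand-in for the shell that ran `ray start`; this is an assumption).
launcher_env = {k: v for k, v in os.environ.items() if k != VAR}
before = child_sees(launcher_env)

# Setting the variable in this process afterwards (the "driver") does not
# change the environment the launcher hands to the processes it spawns.
os.environ[VAR] = "1"
after_driver_only = child_sees(launcher_env)

# Only when the launcher's own environment carries the variable does the
# spawned process see it.
after_launcher = child_sees({**launcher_env, VAR: "1"})

print(before, after_driver_only, after_launcher)  # unset unset 1
```

By analogy, the variable would need to be present in the environment of the ray start command itself, not just in the script that later connects to the cluster.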
We will remove RAY_DISABLE_MEMORY_MONITOR and replace it with the Ray OOM killer (https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html). So I think we can close this. I will just answer the user's questions here.
This will be replaced by https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html, which has the documentation!
RAY_DISABLE_MEMORY_MONITOR basically disables the memory monitor, which checks the memory usage of actors and kills them when the node's memory usage exceeds the threshold (95%).
Yes. This flag is unrelated to ray memory.
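To make the 95% threshold concrete, here is a small arithmetic sketch using the figures from the error message in the report (14.88 GB used out of 15.53 GB). It assumes the check is a simple ratio of used RAM to total RAM, which is what the error text suggests; the function name exceeds_threshold is hypothetical and is not part of Ray's API.

```python
def exceeds_threshold(used_gb, total_gb, threshold=0.95):
    """Return True when used RAM divided by total RAM is above the threshold.

    Swap does not appear in this ratio, which would explain why free swap
    does not prevent the error (an assumption based on the error message).
    """
    return used_gb / total_gb > threshold


# Figures taken from the RayOutOfMemoryError message in the report.
used_gb, total_gb = 14.88, 15.53
fraction = used_gb / total_gb

print(f"{fraction:.1%} of RAM used")  # 95.8% of RAM used
print(exceeds_threshold(used_gb, total_gb))  # True
```

With 15.53 GB of RAM, the 95% mark sits at roughly 14.75 GB, so the reported 14.88 GB is just past it regardless of how much swap is free.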