Communication timeout on DGX-2 with UCX+NVLink
We’ve been experiencing a communication timeout on a DGX-2 with UCX+NVLink for a particular problem. The code is below, but I can’t share the data due to its size (210GB):
import time
import sys
import dask_cudf
import cudf
import cupy
import numpy as np
import cupy as cp
from distributed import wait
sys.path.append("../../tools/")
from readers import build_reader
from dask.distributed import Client
from dask_cuda import DGX
from dask_cuda.initialize import initialize
from dask_cuda import LocalCUDACluster
# ON/OFF settings for various devices
enable_tcp_over_ucx = True
enable_nvlink = True
enable_infiniband = False
interface="enp134s0f1"
protocol="ucx"
if __name__ == "__main__":
    # initialize client with the same settings as workers
    initialize(create_cuda_context=True,
               enable_tcp_over_ucx=enable_tcp_over_ucx,
               enable_infiniband=enable_infiniband,
               enable_nvlink=enable_nvlink)

    if protocol == "tcp":
        cluster = LocalCUDACluster(protocol='tcp',
                                   silence_logs=False,
                                   interface=interface,
                                   CUDA_VISIBLE_DEVICES=list(range(16)))
    elif protocol == "ucx":
        cluster = LocalCUDACluster(protocol='ucx',
                                   silence_logs=False,
                                   enable_tcp_over_ucx=enable_tcp_over_ucx,
                                   enable_infiniband=enable_infiniband,
                                   enable_nvlink=enable_nvlink,
                                   interface=interface,
                                   ucx_net_devices=interface,
                                   CUDA_VISIBLE_DEVICES=list(range(16)))

    client = Client(cluster)
    client.run(cudf.set_allocator, "default", pool=True)

    data_dir = '/data/parquet/'
    q30_session_timeout_inSec = 3600
    q30_limit = 1000
    file_format = "parquet"

    from dask.distributed import performance_report

    with performance_report(filename="dask_report.html"):
        t = time.time()
        def vec_arange(end_sr):
            """
            Returns flattened arange output with start=0 and end=end_sr[i] for each value in the series
            """
            ar = cp.arange(end_sr.max())
            ### get flag matrix for values that are < end
            m = ar < end_sr.values[:, None]
            # ranges in a flattened 1d array
            ranges = (ar * m)[m]
            return ranges

        def get_session_id_from_session_boundry(session_change_df, last_session_len):
            """
            This function returns session starts given a session change df
            """
            user_val_counts = session_change_df.wcs_user_sk.value_counts(sort=False)
            user_val_counts = user_val_counts.reset_index(drop=False)
            user_val_counts = user_val_counts.rename(
                {"index": "wcs_user_sk", "wcs_user_sk": "user_count"}
            )
            ### sort again by user_sk because we want our starts to be aligned
            user_val_counts = user_val_counts.sort_values(by="wcs_user_sk").reset_index(
                drop=True
            )
            end_range = user_val_counts["user_count"]
            user_session_ids = vec_arange(end_range)

            ### up shift the session length df
            session_len = session_change_df["t_index"].diff().reset_index(drop=True)
            session_len = session_len.shift(-1)
            session_len.iloc[-1] = last_session_len

            session_id_final_series = (
                cudf.Series(user_session_ids).repeat(session_len).reset_index(drop=True)
            )
            return session_id_final_series

        def get_session_id(df, time_out):
            """
            This function creates a session id column for each click
            The session id increments for each user's subsequent session
            Session boundary is defined by the time_out
            """
            df["user_change_flag"] = df["wcs_user_sk"].diff(periods=1) != 0
            df["time_delta"] = df["tstamp_inSec"].diff(periods=1)
            df["session_timeout_flag"] = df["tstamp_inSec"].diff(periods=1) > time_out
            df["session_change_flag"] = df["session_timeout_flag"] | df["user_change_flag"]
            # print(f"Total session change = {df['session_change_flag'].sum():,}")

            cols_keep = ["wcs_user_sk", "i_category_id", "session_change_flag"]
            df = df[cols_keep]
            df = df.reset_index(drop=True)
            df["t_index"] = cudf.utils.cudautils.arange(start=0, stop=len(df), dtype=np.int32)

            session_change_df = df[df["session_change_flag"]]
            last_session_len = len(df) - session_change_df["t_index"].iloc[-1]

            session_ids = get_session_id_from_session_boundry(
                session_change_df, last_session_len
            )
            assert len(session_ids) == len(df)
            return session_ids
        table_reader = build_reader(file_format, basepath=data_dir)

        wcs_cols = ["wcs_user_sk", "wcs_item_sk", "wcs_click_date_sk", "wcs_click_time_sk"]
        wcs_df = table_reader.read("web_clickstreams", relevant_cols=wcs_cols)

        item_cols = ["i_category_id", "i_item_sk"]
        item_df = table_reader.read("item", relevant_cols=item_cols)

        f_wcs_df = wcs_df[wcs_df["wcs_user_sk"].notnull()]
        f_item_df = item_df[item_df["i_category_id"].notnull()]

        merged_df = f_wcs_df.merge(f_item_df, left_on=["wcs_item_sk"], right_on=["i_item_sk"])
        merged_df["tstamp_inSec"] = (
            merged_df["wcs_click_date_sk"] * 24 * 60 * 60 + merged_df["wcs_click_time_sk"]
        )
        cols_keep = ["wcs_user_sk", "tstamp_inSec", "i_category_id"]
        merged_df = merged_df[cols_keep]

        ### ensure that the clicks for each user end up in the same partition
        merged_df = merged_df.set_index("wcs_user_sk")
        merged_df = merged_df.reset_index(drop=False)

        def get_sessions(df):
            df = df.sort_values(by=["wcs_user_sk", "tstamp_inSec"]).reset_index(drop=True)
            df["session_id"] = get_session_id(df, q30_session_timeout_inSec)
            df = df[["wcs_user_sk", "i_category_id", "session_id"]]
            return df

        session_df = merged_df.map_partitions(get_sessions)
        del merged_df

        def get_distinct_sessions(df):
            df = df.drop_duplicates().reset_index(drop=True)
            return df

        distinct_session_df = session_df.map_partitions(get_distinct_sessions)

        ### get_pair_helper
        def get_pairs(
            df,
            merge_col=["session_id", "wcs_user_sk"],
            pair_col="i_category_id",
            output_col_1="category_id_1",
            output_col_2="category_id_2",
        ):
            """
            Gets pairs after doing an inner merge
            """
            pair_df = df.merge(df, on=merge_col, suffixes=["_t1", "_t2"], how="inner")
            pair_df = pair_df[[f"{pair_col}_t1", f"{pair_col}_t2"]]
            pair_df = pair_df[pair_df[f"{pair_col}_t1"] < pair_df[f"{pair_col}_t2"]]
            pair_df = pair_df.rename(
                columns={f"{pair_col}_t1": output_col_1, f"{pair_col}_t2": output_col_2}
            )
            return pair_df

        pair_df = distinct_session_df.map_partitions(get_pairs)
        del distinct_session_df

        print("Time:", time.time() - t)
The problem with this code is that communication times out. That can be circumvented by increasing distributed.comm.timeouts.connect, but I believe the timeout is just a consequence of a problem with how data transfer behaves. Watching the Dask task stream, I noticed that transfers take longer as time passes: the first transfers take hundreds of milliseconds, then gradually grow to a few seconds, tens of seconds, and eventually hundreds of seconds. I’ve uploaded a Dask performance report of the code above, collected on a DGX-2 running with 16 GPUs and UCX+NVLink. It’s worth noting this doesn’t happen with TCP communication.
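For reference, the connect timeout can be raised through the standard Dask configuration; a minimal sketch, where the 60s value is only illustrative:

import dask

# Raise the connection handshake timeout before the cluster and client are
# created; the same key can also be set via the environment variable
# DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT.
dask.config.set({"distributed.comm.timeouts.connect": "60s"})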
It seems this issue is created by a combination of factors. Small data transfers over NVLink are certainly not efficient without caching CUDA IPC handles, and that probably contributes. I also believe there’s some sort of blocking task somewhere that eventually causes transfer times to be reported as longer than they actually are, because the transfers are in fact waiting for that blocking task to finish.
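As a side note on the IPC handle caching mentioned above: UCX keeps a cache of opened CUDA IPC memory handles so that repeated transfers between the same pair of GPUs don’t pay the mapping cost every time. A sketch of toggling it, assuming the UCX_CUDA_IPC_CACHE environment variable (whose default has varied across UCX/UCX-Py versions) is the knob in question:

import os

# Enable ("y") or disable ("n") UCX's CUDA IPC memory handle cache; set this
# before the cluster is created so worker processes inherit it.
os.environ["UCX_CUDA_IPC_CACHE"] = "y"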
Any ideas here @mrocklin @quasiben @jakirkham @madsbk ?
Top GitHub Comments
Was going to add the option to fall back to CuPy in Distributed, as CuPy also has a memory pool and is likely already in use by users working with Python on a GPU.
In other words, the allocation order would be RMM -> CuPy -> Numba.
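A rough sketch of what that fallback could look like (a hypothetical helper, not the actual Distributed implementation):

def allocate_device_buffer(nbytes):
    # Hypothetical illustration of an RMM -> CuPy -> Numba fallback for
    # allocating a device buffer to receive data into.
    try:
        import rmm
        return rmm.DeviceBuffer(size=nbytes)
    except ImportError:
        pass
    try:
        import cupy
        return cupy.empty((nbytes,), dtype="u1")
    except ImportError:
        pass
    import numba.cuda
    return numba.cuda.device_array((nbytes,), dtype="u1")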
I’m somewhat convinced that this problem will remain whenever users can’t use an RMM pool or have to disable the IPC cache. In such situations, every transfer requires mapping the memory handle, which has a cost in the range of 100 ms. For that reason, I think we should encourage users to always use an RMM pool and keep the IPC cache on (in UCX-Py we’ve enabled it again by default). In principle things would work without those, but with very little benefit (if any), so I suggest we keep things as they are and tell users to rely on an RMM pool plus the IPC cache, as there isn’t much we’ll be able to do otherwise. Any objections to this proposal?
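For completeness, a minimal sketch of enabling a per-worker RMM pool with dask-cuda; the pool size is an arbitrary example value, and the rmm_pool_size argument assumes a reasonably recent dask-cuda release:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Each worker pre-allocates an RMM memory pool, avoiding repeated
# cudaMalloc/cudaFree calls and keeping device allocations within one
# large region that IPC handles can map once and reuse.
cluster = LocalCUDACluster(protocol="ucx",
                           enable_tcp_over_ucx=True,
                           enable_nvlink=True,
                           rmm_pool_size="25GB")
client = Client(cluster)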