serious AMD problems on some specific hardware

I’m trying my latest HF/transformers deepspeed tests on 4 different machines:

Speed	CPU	RAM	GPUs	CUDA
242.45s	Intel i9	128GB	2xRTX TITAN	11.1
259.78s	Intel i7	128GB	RTX3090+GTX1070	11.1
1h+	AMD Ryzen 9	64GB	2xRTX TITAN	10.2
crashes	AMD Ryzen 9	64GB	2xRTX TITAN	10.2
420s	AMD EPYC-Rome Processor	512GB	2xA100	11.0

The AMD Ryzen 9 machines either take forever or crash.

The machine of the last entry spews a ton of these:

Message from syslogd@badass at Apr  3 04:50:06 ...
 kernel:[2615390.401038] watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [pytest:3756011]

and then crashes.

OK, the other difference is the CUDA version.

The tests are very light: they run very tiny batches for just a few iterations. The GPUs are far from being stressed and are mostly idle. I don’t think the RAM difference matters either.

I’m using the release candidate branch that @jeffra made (https://github.com/microsoft/DeepSpeed/tree/multi-z3-prs), but it has been like this for a while now. I originally thought it was just some odd issue with this one machine, but now I’m seeing an identical problem on another identical machine.

OK, dmesg has a ton of these:

[Sat Apr  3 05:23:18 2021] [UFW BLOCK] IN=enp6s0 OUT= MAC=01:00:5e:00:00:01:cc:40:d0:0d:7c:cc:08:00 SRC=0.0.0.0 DST=224.0.0.1 LEN=32 TOS=0x00 PREC=0xC0 TTL=1 ID=0 DF PROTO=2 
[Sat Apr  3 05:23:26 2021] [UFW BLOCK] IN=enp6s0 OUT= MAC=01:00:5e:00:00:fb:24:4b:fe:de:96:71:08:00 SRC=192.168.1.14 DST=224.0.0.251 LEN=32 TOS=0x00 PREC=0xC0 TTL=1 ID=0 DF PROTO=2 
[Sat Apr  3 05:23:42 2021] watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [pytest:25222]
[Sat Apr  3 05:23:42 2021] Modules linked in: xt_recent bluetooth ecdh_generic ecc xt_nat veth xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat aufs binfmt_misc nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua edac_mce_amd snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec ucsi_ccg typec_ucsi snd_hda_core typec snd_hwdep kvm snd_pcm snd_timer snd irqbypass soundcore k10temp eeepc_wmi asus_wmi ccp sparse_keymap video mxm_wmi wmi_bmof mac_hid nvidia_uvm(OE) nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt nf_log_ipv4 nf_log_common ipt_REJECT nf_reject_ipv4 xt_LOG xt_limit xt_addrtype xt_tcpudp sch_fq_codel xt_conntrack nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat overlay nf_conntrack_ftp ip6table_filter ip6_tables nf_conntrack br_netfilter nf_defrag_ipv6 bridge nf_defrag_ipv4 stp llc iptable_filter arp_tables bpfilter ip_tables x_tables autofs4 btrfs zstd_compress
[Sat Apr  3 05:23:42 2021]  raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd drm_kms_helper syscopyarea sysfillrect cryptd sysimgblt glue_helper fb_sys_fops igb i2c_piix4 drm ahci dca i2c_nvidia_gpu i2c_algo_bit libahci gpio_amdpt wmi gpio_generic
[Sat Apr  3 05:23:42 2021] CPU: 7 PID: 25222 Comm: pytest Tainted: P        W  OEL    5.3.0-64-generic #58-Ubuntu
[Sat Apr  3 05:23:42 2021] Hardware name: System manufacturer System Product Name/PRIME X470-PRO, BIOS 5204 07/29/2019
[Sat Apr  3 05:23:42 2021] RIP: 0010:fetch_pte.isra.0+0x5c/0x160
[Sat Apr  3 05:23:42 2021] Code: 01 48 89 d0 44 8d 14 ff 41 8d 4a 0c 48 89 e5 48 d3 e8 53 48 8b 36 25 ff 01 00 00 4c 8d 04 c6 b8 01 00 00 00 48 d3 e0 49 89 01 <85> ff 0f 8e 87 00 00 00 41 8d 4a 03 48 63 ff 49 bb 00 f0 ff ff ff
[Sat Apr  3 05:23:42 2021] RSP: 0018:ffffa8a7c37b79a8 EFLAGS: 00000216 ORIG_RAX: ffffffffffffff13
[Sat Apr  3 05:23:42 2021] RAX: 0000008000000000 RBX: 0000000000001000 RCX: 0000000000000027
[Sat Apr  3 05:23:42 2021] RDX: 00008b1d31177000 RSI: ffff9b63e7311000 RDI: 0000000000000003
[Sat Apr  3 05:23:42 2021] RBP: ffffa8a7c37b79b0 R08: ffff9b63e73118b0 R09: ffffa8a7c37b79c0
[Sat Apr  3 05:23:42 2021] R10: 000000000000001b R11: 000ffffffffff000 R12: ffff9b7365c8f098
[Sat Apr  3 05:23:42 2021] R13: ffff9b7365c8f094 R14: 0000000000000000 R15: 00008b1d31177000
[Sat Apr  3 05:23:42 2021] FS:  00007f7e9462e740(0000) GS:ffff9b737e7c0000(0000) knlGS:0000000000000000
[Sat Apr  3 05:23:42 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sat Apr  3 05:23:42 2021] CR2: 00007fb1cf47e4c0 CR3: 0000000fe1c42000 CR4: 0000000000340ee0
[Sat Apr  3 05:23:42 2021] Call Trace:
[Sat Apr  3 05:23:42 2021]  iommu_unmap_page+0x78/0x100
[Sat Apr  3 05:23:42 2021]  __unmap_single.isra.0+0x5f/0x110
[Sat Apr  3 05:23:42 2021]  unmap_sg+0x5f/0x70
[Sat Apr  3 05:23:42 2021]  nv_unmap_dma_map_scatterlist+0x59/0xa0 [nvidia]
[Sat Apr  3 05:23:42 2021]  nv_dma_unmap_pages+0x56/0x130 [nvidia]
[Sat Apr  3 05:23:42 2021]  nv_dma_unmap_alloc+0x14/0x30 [nvidia]
[Sat Apr  3 05:23:42 2021]  _nv030381rm+0xd4/0x220 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? _nv025113rm+0xce/0x100 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? _nv007135rm+0x29/0x40 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? _nv026097rm+0x7f/0x2a0 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? _nv026186rm+0x2cb/0x990 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? _nv002863rm+0x9/0x20 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? _nv003272rm+0x1b/0x80 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? _nv010845rm+0x479/0x4e0 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? _nv034900rm+0x99/0x110 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? _nv034899rm+0x391/0x500 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? _nv033551rm+0xd7/0x1a0 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? _nv033552rm+0x42/0x70 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? _nv007214rm+0x4b/0x90 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? os_acquire_spinlock+0x12/0x20 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? _nv000743rm+0x539/0x970 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? rm_ioctl+0x54/0xb0 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? hrtimer_try_to_cancel+0x86/0x110
[Sat Apr  3 05:23:42 2021]  ? __check_object_size+0xf1/0x150
[Sat Apr  3 05:23:42 2021]  ? nvidia_ioctl+0x5b1/0x8a0 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? nvidia_frontend_unlocked_ioctl+0x3b/0x50 [nvidia]
[Sat Apr  3 05:23:42 2021]  ? do_vfs_ioctl+0x407/0x670
[Sat Apr  3 05:23:42 2021]  ? do_futex+0x160/0x1e0
[Sat Apr  3 05:23:42 2021]  ? ksys_ioctl+0x67/0x90
[Sat Apr  3 05:23:42 2021]  ? __x64_sys_ioctl+0x1a/0x20
[Sat Apr  3 05:23:42 2021]  ? do_syscall_64+0x5a/0x130
[Sat Apr  3 05:23:42 2021]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
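To see how often the watchdog fires and for which process, the log can be summarized with a few lines of Python. This is only a sketch: the regular expression is written against the two `soft lockup` line shapes quoted above, and the function name `summarize_soft_lockups` is made up for illustration.

```python
import re
from collections import Counter

# Matches kernel watchdog lines such as:
#   watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [pytest:25222]
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(log_text):
    """Count soft-lockup reports per (process name, pid) in a dmesg/syslog dump."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SOFT_LOCKUP.search(line)
        if match:
            counts[(match["comm"], int(match["pid"]))] += 1
    return counts


# Sample lines shortened from the output quoted in this issue.
sample = """\
[Sat Apr  3 05:23:26 2021] [UFW BLOCK] IN=enp6s0 OUT= SRC=192.168.1.14
[Sat Apr  3 05:23:42 2021] watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [pytest:25222]
 kernel:[2615390.401038] watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [pytest:3756011]
"""
lockups = summarize_soft_lockups(sample)
for (comm, pid), n in lockups.items():
    print(f"{comm} (pid {pid}): {n} soft lockup report(s)")
```

On a real machine the input would be the captured output of `dmesg -T` or the relevant syslog file rather than the inline sample.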

Here is one of the problematic machines:

$ ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch']
torch version .................... 1.8.1
torch cuda version ............... 10.2
nvcc version ..................... 10.2
deepspeed install path ........... ['/home/stas/hf/DeepSpeed/deepspeed']
deepspeed info ................... 0.3.13+74902d9, 74902d9, multi-z3-prs
deepspeed wheel compiled w. ...... torch 1.8, cuda 10.2
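Since the machines differ in CUDA version, it can help to pull the environment section of `ds_report` into a dictionary and compare it across hosts. The sketch below assumes the `key ..... value` layout shown above; `parse_env_info` is a hypothetical helper, not part of DeepSpeed.

```python
import re

# "key ........ value" lines as printed in the environment section of ds_report
ENV_LINE = re.compile(r"^(?P<key>.+?) \.{3,} (?P<value>.+)$")


def parse_env_info(report_text):
    """Return the 'general environment info' entries of a ds_report dump."""
    env = {}
    in_env_section = False
    for line in report_text.splitlines():
        if line.startswith("DeepSpeed general environment info"):
            in_env_section = True
            continue
        if in_env_section:
            match = ENV_LINE.match(line.strip())
            if match:
                env[match["key"]] = match["value"]
    return env


# Sample copied from the report above (shortened).
sample = """\
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch version .................... 1.8.1
torch cuda version ............... 10.2
nvcc version ..................... 10.2
deepspeed wheel compiled w. ...... torch 1.8, cuda 10.2
"""
env = parse_env_info(sample)
cuda_matches = env["torch cuda version"] == env["nvcc version"]
print("torch CUDA matches nvcc:", cuda_matches)
```

Running the same parser on the report from a healthy machine and diffing the two dictionaries shows at a glance which versions differ.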

The problematic CPU on both machines:

lscpu 
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          24
On-line CPU(s) list:             0-23
Thread(s) per core:              2
Core(s) per socket:              12
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           113
Model name:                      AMD Ryzen 9 3900X 12-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         2195.585
CPU max MHz:                     3800.0000
CPU min MHz:                     2200.0000
BogoMIPS:                        7585.86
Virtualization:                  AMD-V
L1d cache:                       384 KiB
L1i cache:                       384 KiB
L2 cache:                        6 MiB
L3 cache:                        64 MiB
NUMA node0 CPU(s):               0-23
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP always-on, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

Also, I used to run deepspeed just fine on this machine a few months ago with the same cuda-10.2. Could it be related to the changes introduced by https://github.com/microsoft/DeepSpeed/pull/735, since that PR was made for deepspeed segfaulting on AMD?

Edit: I reverted the changes from this PR and rebuilt, and the problem is the same. So that PR didn’t introduce this problem.

The problem happens both with master and also 0.3.13.

@jeffra, @RezaYazdaniAminabadi

Top GitHub Comments

RezaYazdaniAminabadi commented, Jun 9, 2021

Thanks @stas00

I will work on adding some configurable parameters for that. Hopefully, we can fix this for those low-RAM systems.

Reza

stas00 commented, Jun 9, 2021

So that last machine was upgraded to 32GB of CPU RAM and all is good. The crash was caused by the system not handling 0% free RAM well (even though swap was available).
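A quick way to spot this condition before launching the tests is to look at `MemAvailable` in `/proc/meminfo`. The following is a minimal sketch, assuming the standard Linux `/proc/meminfo` layout; the 10% threshold and the sample numbers are made up for illustration and are not measurements from the machines in this issue.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def available_fraction(meminfo):
    """Fraction of RAM the kernel estimates is available without swapping."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


# Hypothetical threshold: warn when less than 10% of RAM is available.
MIN_AVAILABLE = 0.10

# Illustrative sample; on a real host use open("/proc/meminfo").read().
sample = """\
MemTotal:       65786044 kB
MemFree:          412300 kB
MemAvailable:    1315720 kB
SwapTotal:      33554428 kB
SwapFree:       30000000 kB
"""
info = parse_meminfo(sample)
frac = available_fraction(info)
if frac < MIN_AVAILABLE:
    print(f"Only {frac:.1%} of RAM available, the test run may stall the system")
```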

I highly recommend configuring cgroups on machines with little RAM to protect the main system from memory-hungry processes, and of course adding a serious chunk of swap goes a long way.
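One way to apply such a limit without editing cgroup files by hand is to launch the tests through `systemd-run`, which places the command in its own scope with resource-control properties. This is a sketch only: it assumes a systemd host with cgroup v2 and the memory controller delegated to the user session, and the limits and the test path `tests/deepspeed` are placeholder values, not settings taken from this issue.

```python
import shlex


def capped_command(test_cmd, memory_max="48G", swap_max="16G"):
    """Build a systemd-run invocation that runs test_cmd in its own cgroup scope.

    MemoryMax= and MemorySwapMax= are systemd resource-control properties
    backed by cgroup v2; the default values here are illustrative assumptions.
    """
    return [
        "systemd-run", "--user", "--scope",
        "-p", f"MemoryMax={memory_max}",
        "-p", f"MemorySwapMax={swap_max}",
        *test_cmd,
    ]


# Hypothetical test invocation; substitute the real pytest arguments.
cmd = capped_command(["pytest", "tests/deepspeed"])
print(shlex.join(cmd))
```

The printed command can be pasted into a shell or passed to `subprocess.run(cmd)`. With the cap in place, a runaway test process is the one that gets constrained or killed, rather than the whole machine running out of free RAM.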