Unable to recreate ONNX speedups demonstrated in 04-onnx-export.ipynb on Mac or Linux

See original GitHub issue

Environment info

  • transformers version: 3.1.0
  • Platform: Mac OS Mojave + Ubuntu 18.04.4
  • Python version: 3.7.7
  • PyTorch version (GPU?): 1.6.0
  • Tensorflow version (GPU?): na
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

Information

Model I am using (Bert, XLNet …): bert-base-uncased

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

I am running the /notebooks/04-onnx-export.ipynb example

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

I am using the example data in the notebook

To reproduce

Steps to reproduce the behavior:

  1. Within the notebook add torch.set_num_threads(1)
  2. Replace environ["OMP_NUM_THREADS"] = str(cpu_count(logical=True)) with environ["OMP_NUM_THREADS"] = "1"
  3. Run the 04-onnx-export.ipynb example notebook
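
For reference, a minimal sketch of the two changes in steps 1 and 2, as they would be applied near the top of the notebook before any benchmarking cells are run:

  # Minimal sketch of the changes from steps 1 and 2 above.
  from os import environ

  import torch

  # Step 1: pin PyTorch to a single intra-op thread.
  torch.set_num_threads(1)

  # Step 2: pin OpenMP (used by the onnxruntime CPU package) to one thread as
  # well, instead of the notebook's str(cpu_count(logical=True)).
  environ["OMP_NUM_THREADS"] = "1"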

I am trying to recreate the speedups shown in this example notebook.

Note that without step 1 above I found PyTorch to be considerably faster than ONNX, presumably because it was using more threads than ONNX. Step 2 doesn't seem to affect the results, but I set it for completeness (ensuring everything runs on the same number of threads).
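
For reference, a rough sketch of how such a single-threaded comparison could be timed. This is not the notebook's exact benchmarking code: the ONNX file path is an assumed placeholder, and the sketch assumes the exported graph's input names match the tokenizer's keys, as they do when exporting with transformers' convert_graph_to_onnx.

  # Rough timing sketch (illustrative, not the notebook's benchmarking code).
  from time import perf_counter

  import torch
  from onnxruntime import InferenceSession
  from transformers import BertModel, BertTokenizerFast

  torch.set_num_threads(1)  # same pinning as step 1 above

  tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
  model = BertModel.from_pretrained("bert-base-uncased").eval()
  # Assumed path: wherever the notebook's export step wrote the ONNX graph.
  session = InferenceSession("onnx/bert-base-uncased.onnx")

  encoded = tokenizer("Lightweight benchmark sentence", return_tensors="pt")

  def mean_latency(fn, runs=100):
      fn()  # warm-up
      start = perf_counter()
      for _ in range(runs):
          fn()
      return (perf_counter() - start) / runs

  with torch.no_grad():
      pt_s = mean_latency(lambda: model(**encoded))

  ort_inputs = {name: tensor.numpy() for name, tensor in encoded.items()}
  ort_s = mean_latency(lambda: session.run(None, ort_inputs))

  print(f"PyTorch: {pt_s * 1e3:.1f} ms, ONNX Runtime: {ort_s * 1e3:.1f} ms")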

Actual results on a MacBook Pro: (benchmark screenshot not reproduced in this mirror)

with hardware:

machdep.cpu.max_basic: 22
machdep.cpu.max_ext: 2147483656
machdep.cpu.vendor: GenuineIntel
machdep.cpu.brand_string: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
machdep.cpu.family: 6
machdep.cpu.model: 94
machdep.cpu.extmodel: 5
machdep.cpu.extfamily: 0
machdep.cpu.stepping: 3
machdep.cpu.feature_bits: 9221959987971750911
machdep.cpu.leaf7_feature_bits: 43806655 0
machdep.cpu.leaf7_feature_bits_edx: 2617255424
machdep.cpu.extfeature_bits: 1241984796928
machdep.cpu.signature: 329443
machdep.cpu.brand: 0
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SGX BMI1 HLE AVX2 SMEP BMI2 ERMS INVPCID RTM FPU_CSDS MPX RDSEED ADX SMAP CLFSOPT IPT MDCLEAR TSXFA IBRS STIBP L1DF SSBD
machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI
machdep.cpu.logical_per_package: 16
machdep.cpu.cores_per_package: 8
machdep.cpu.microcode_version: 220
machdep.cpu.processor_flag: 5
machdep.cpu.mwait.linesize_min: 64
machdep.cpu.mwait.linesize_max: 64
machdep.cpu.mwait.extensions: 3
machdep.cpu.mwait.sub_Cstates: 286531872
machdep.cpu.thermal.sensor: 1
machdep.cpu.thermal.dynamic_acceleration: 1
machdep.cpu.thermal.invariant_APIC_timer: 1
machdep.cpu.thermal.thresholds: 2
machdep.cpu.thermal.ACNT_MCNT: 1
machdep.cpu.thermal.core_power_limits: 1
machdep.cpu.thermal.fine_grain_clock_mod: 1
machdep.cpu.thermal.package_thermal_intr: 1
machdep.cpu.thermal.hardware_feedback: 0
machdep.cpu.thermal.energy_policy: 1
machdep.cpu.xsave.extended_state: 31 832 1088 0
machdep.cpu.xsave.extended_state1: 15 832 256 0
machdep.cpu.arch_perf.version: 4
machdep.cpu.arch_perf.number: 4
machdep.cpu.arch_perf.width: 48
machdep.cpu.arch_perf.events_number: 7
machdep.cpu.arch_perf.events: 0
machdep.cpu.arch_perf.fixed_number: 3
machdep.cpu.arch_perf.fixed_width: 48
machdep.cpu.cache.linesize: 64
machdep.cpu.cache.L2_associativity: 4
machdep.cpu.cache.size: 256
machdep.cpu.tlb.inst.large: 8
machdep.cpu.tlb.data.small: 64
machdep.cpu.tlb.data.small_level1: 64
machdep.cpu.address_bits.physical: 39
machdep.cpu.address_bits.virtual: 48
machdep.cpu.core_count: 4
machdep.cpu.thread_count: 8
machdep.cpu.tsc_ccc.numerator: 216
machdep.cpu.tsc_ccc.denominator: 2

I obtained even worse results on a Linux machine: (benchmark screenshot not reproduced in this mirror)

with hardware:

processor       : 11
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
stepping        : 2
microcode       : 0x43
cpu MHz         : 1199.433
cache size      : 15360 KB
physical id     : 0
siblings        : 12
core id         : 5
cpu cores       : 6
apicid          : 11
initial apicid  : 11
fpu             : yes
fpu_exception   : yes
cpuid level     : 15
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips        : 6596.76
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

Expected behavior

Expected to see a speedup from using ONNX, as in the example: (screenshot from the notebook not reproduced in this mirror)

I know this is hardware-specific, but having tested it on two machines I wonder if there is some configuration not included in the example that I am missing, or some other issue?

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

5 reactions
tianleiwu commented, Sep 10, 2020

@erees1, your observation is correct.

It is recommended to use the default settings (do not set the option intra_op_num_threads) for general usage.

The onnxruntime-gpu package is not built with OpenMP, so OMP_NUM_THREADS has no effect. If the CPU has 16 or more cores, the user might try setting intra_op_num_threads = 16 explicitly.

For the onnxruntime package, options.intra_op_num_threads = 1 was advised for version 1.2.0, which was current when that notebook was created. The user can set the OMP_NUM_THREADS environment variable (and similar) before importing onnxruntime to control the intra-op thread count. For version >= 1.3.0, it is recommended to use the default intra_op_num_threads.

@mfuntowicz, could you help update the settings in the notebook as shown below?

Before:

  # Few properties that might have an impact on performances (provided by MS)
  options = SessionOptions()
  options.intra_op_num_threads = 1
  options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL

After:

  options = SessionOptions()
  # It is recommended to use the default settings.
  # The onnxruntime package uses the OMP_NUM_THREADS environment variable to control the intra-op threads.
  # For the onnxruntime 1.2.0 package, you need to set intra_op_num_threads = 1 to enable OpenMP. It is not needed for newer versions.
  # For the onnxruntime-gpu package, try the following when your CPU has many cores:
  # options.intra_op_num_threads = min(16, cpu_count(logical=True))
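
For readers on onnxruntime >= 1.3.0, a minimal sketch of what the recommendation above amounts to; the model path is illustrative and not from the notebook:

  # Keep the default SessionOptions and do not set intra_op_num_threads.
  # If you do want to control the intra-op thread count, set the
  # OMP_NUM_THREADS environment variable before the import below.
  from onnxruntime import GraphOptimizationLevel, InferenceSession, SessionOptions

  options = SessionOptions()
  options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL

  session = InferenceSession("onnx/bert-base-uncased.onnx", options)  # illustrative path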

0 reactions
erees1 commented, Sep 14, 2020

Thanks for the help, I think that clears things up! Closing the issue.
