Baseline and modified version show different benchmark results even though the codebase is the same
Hi @mikemccand,
I have cloned the Lucene 9 code into both the baseline and candidate folders (so the codebase is 100% the same), yet I see performance differences after running this command:
python3 src/python/localrun.py -source wikimedium10k
The output table shows:
```text
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Prefix3 407.18 (0.0%) 314.24 (0.0%) -22.8% ( -22% - -22%) 1.000
LowSpanNear 1073.76 (0.0%) 852.06 (0.0%) -20.6% ( -20% - -20%) 1.000
MedSloppyPhrase 1140.22 (0.0%) 927.42 (0.0%) -18.7% ( -18% - -18%) 1.000
MedPhrase 964.51 (0.0%) 848.50 (0.0%) -12.0% ( -12% - -12%) 1.000
HighIntervalsOrdered 1002.98 (0.0%) 884.65 (0.0%) -11.8% ( -11% - -11%) 1.000
HighTermMonthSort 4017.92 (0.0%) 3660.73 (0.0%) -8.9% ( -8% - -8%) 1.000
Respell 512.33 (0.0%) 467.72 (0.0%) -8.7% ( -8% - -8%) 1.000
HighSpanNear 893.76 (0.0%) 821.69 (0.0%) -8.1% ( -8% - -8%) 1.000
IntNRQ 1828.06 (0.0%) 1682.03 (0.0%) -8.0% ( -7% - -7%) 1.000
HighTerm 5614.10 (0.0%) 5200.05 (0.0%) -7.4% ( -7% - -7%) 1.000
BrowseMonthTaxoFacets 4142.06 (0.0%) 3870.82 (0.0%) -6.5% ( -6% - -6%) 1.000
HighTermDayOfYearSort 3782.61 (0.0%) 3538.93 (0.0%) -6.4% ( -6% - -6%) 1.000
BrowseDayOfYearSSDVFacets 2665.19 (0.0%) 2514.64 (0.0%) -5.6% ( -5% - -5%) 1.000
LowTerm 6806.33 (0.0%) 6460.07 (0.0%) -5.1% ( -5% - -5%) 1.000
HighSloppyPhrase 886.16 (0.0%) 845.10 (0.0%) -4.6% ( -4% - -4%) 1.000
OrHighMed 898.26 (0.0%) 858.97 (0.0%) -4.4% ( -4% - -4%) 1.000
LowPhrase 988.79 (0.0%) 947.64 (0.0%) -4.2% ( -4% - -4%) 1.000
OrHighLow 1171.10 (0.0%) 1124.50 (0.0%) -4.0% ( -3% - -3%) 1.000
BrowseDateTaxoFacets 3796.98 (0.0%) 3648.76 (0.0%) -3.9% ( -3% - -3%) 1.000
PKLookup 326.99 (0.0%) 315.53 (0.0%) -3.5% ( -3% - -3%) 1.000
BrowseMonthSSDVFacets 3212.18 (0.0%) 3110.22 (0.0%) -3.2% ( -3% - -3%) 1.000
AndHighLow 2763.74 (0.0%) 2691.30 (0.0%) -2.6% ( -2% - -2%) 1.000
MedSpanNear 634.86 (0.0%) 624.48 (0.0%) -1.6% ( -1% - -1%) 1.000
Wildcard 581.94 (0.0%) 572.55 (0.0%) -1.6% ( -1% - -1%) 1.000
HighPhrase 729.77 (0.0%) 720.61 (0.0%) -1.3% ( -1% - -1%) 1.000
BrowseDayOfYearTaxoFacets 3111.47 (0.0%) 3073.01 (0.0%) -1.2% ( -1% - -1%) 1.000
OrHighHigh 430.85 (0.0%) 426.77 (0.0%) -0.9% ( 0% - 0%) 1.000
AndHighHigh 1029.49 (0.0%) 1028.71 (0.0%) -0.1% ( 0% - 0%) 1.000
LowSloppyPhrase 1351.24 (0.0%) 1365.14 (0.0%) 1.0% ( 1% - 1%) 1.000
Fuzzy2 70.31 (0.0%) 71.83 (0.0%) 2.2% ( 2% - 2%) 1.000
Fuzzy1 324.58 (0.0%) 338.44 (0.0%) 4.3% ( 4% - 4%) 1.000
LowIntervalsOrdered 1721.13 (0.0%) 1807.65 (0.0%) 5.0% ( 5% - 5%) 1.000
MedTerm 5749.70 (0.0%) 6042.57 (0.0%) 5.1% ( 5% - 5%) 1.000
MedIntervalsOrdered 1291.17 (0.0%) 1382.36 (0.0%) 7.1% ( 7% - 7%) 1.000
AndHighMed 1322.11 (0.0%) 1575.31 (0.0%) 19.2% ( 19% - 19%) 1.000
```
My expectation was that the same code would perform the same way in both runs, but you can notice deviations. Can you please explain this? Is this the right way to run the benchmark?
Thanks!
Issue Analytics
- Created 2 years ago
- Comments: 5 (1 by maintainers)
Top GitHub Comments
There is no statistical difference here. The final column, the p-value, tells you the probability that the difference you are observing is due to random chance. Here it is 1.000.
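To illustrate the point (this is a standalone sketch, not luceneutil code; the QPS numbers are made up), two samples drawn from the very same performance distribution still show a nonzero mean difference, and a simple permutation test reports a p-value telling us how often pure chance produces a difference at least that large:

```python
# Hypothetical A/A illustration: "baseline" and "candidate" QPS samples
# drawn from the SAME distribution still differ in their means.
import random
import statistics

random.seed(42)

# Simulated per-iteration QPS for both sides: same mean, same spread.
baseline = [random.gauss(1000, 50) for _ in range(20)]
candidate = [random.gauss(1000, 50) for _ in range(20)]

observed = abs(statistics.mean(baseline) - statistics.mean(candidate))

# Permutation test: repeatedly shuffle the pooled measurements and
# recompute the mean difference; the p-value is the fraction of
# shuffles whose difference is at least as large as the observed one.
pooled = baseline + candidate
trials = 2000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = abs(statistics.mean(pooled[:20]) - statistics.mean(pooled[20:]))
    if diff >= observed:
        count += 1
p_value = count / trials

print(f"observed mean diff: {observed:.1f} QPS, p-value: {p_value:.2f}")
```

A high p-value here means the observed gap is indistinguishable from noise, which is exactly what an A/A comparison should produce.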
You can shrink the absolute values of these random differences by running larger test samples: more iterations, larger indexes, more queries.
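The effect of larger samples can be sketched numerically (the per-iteration standard deviation of 50 QPS is an assumed figure, not from the run above): the standard error of a mean falls as 1/sqrt(n), so quadrupling the number of benchmark iterations roughly halves the observed noise.

```python
# Hypothetical sketch: how the noise in a mean QPS estimate shrinks
# as the number of benchmark iterations grows.
import math

qps_stddev = 50.0  # assumed per-iteration QPS standard deviation

for n in (5, 20, 80, 320):
    stderr = qps_stddev / math.sqrt(n)  # standard error of the mean
    print(f"iterations={n:4d}  std error of mean QPS ~ {stderr:.1f}")
```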
What your A/A test shows you is the magnitude of noise on the system given your sample size.
On Mon, Oct 11, 2021, 5:17 AM praveennish @.***> wrote:
I am very sorry @mikemccand for the late reply!
I wanted to retest today, but after the latest pull I am getting a FileNotFoundException for the file enwiki-20120502-lines-1k-fixed-utf8-with-random-label.txt in the data folder.
Where can I download this file, please?