Re: Strange performance drop

  • From: Mike Pall <mikelj-1705@xxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Tue, 2 May 2017 21:57:51 +0200

Laurent Deniau wrote:

The small test case below shows the observable slow down with
cmp provided as-is (first timings), and no slow down with
less_than provided instead (second timings).

First, you're measuring non-monomorphic dispatch by passing
different closures in the same run. This is probably not a
representative benchmark for your application. Measure them in
isolation.

Second, any searching or sorting function that has an unbiased
branch at it's heart will exhibit potentially unstable
performance. Key to performance (and not just with LuaJIT) is to
use a branch-free comparisons and a branch-free step or swap.

The usual trick is to replace the comparison with
  sar(a - b, 31)
which returns either 0 (all zeros) or -1 (all ones). [Assuming
you've aliased bit.arshift to sar.]

Then use the result as a mask (or inverted mask) for the step
amount or swap index. There are more elegant formulations of
binary search that easily accomodate such a masked step amount.
If log2(N) is low enough, you can even unroll the steps.

[

Actually, branch-free binary search can be optimized further. And
there are many more algorithmic choices. Some references:

https://schani.wordpress.com/2010/04/30/linear-vs-binary-search/

https://www.pvk.ca/Blog/2012/07/03/binary-search-star-eliminates-star-branch-mispredictions/

https://www.researchgate.net/profile/Florian_Gross/publication/275971053_Index_Search_Algorithms_for_Databases_and_Modern_CPUs/links/554cffca0cf29f836c9cd539.pdf

http://www.researchgate.net/profile/Jatin_Chhugani/publication/221213860_FAST_fast_architecture_sensitive_tree_search_on_modern_CPUs_and_GPUs/links/0c96051f5d2990770d000000.pdf

http://erikdemaine.org/papers/BRICS2002/paper.pdf

]

--Mike

Other related posts: