Hi again,
Peter, I’ve taken a look at some of the core files and checked the faulting-rip
values. It’s different every time - e.g. 0x834c87b % 256 == 123, 0x14066216 % 256 ==
22, 0x1ff339bf % 256 == 191, 0x3e4a1d0 % 256 == 208, 0x9c66f99 % 256 == 153,
0x19cc4e1a % 256 == 26, … Otherwise no change: it’s always the first instruction
of the trace. Igor, no new info regarding faulting chips; all the new core dumps
were produced by machines with E5-2620 v2 or E5-1620 v2 CPUs :/
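In case anyone wants to repeat the check on their own dumps, here is a minimal Python sketch of what I did with the values quoted above. The names `faulting_rips` and `low_byte` are made up for illustration, and the list holds only the addresses from this mail, not every core dump.

```python
# Faulting RIP values quoted in this mail (not the full set of core dumps).
faulting_rips = [
    0x834c87b,
    0x14066216,
    0x1ff339bf,
    0x3e4a1d0,
    0x9c66f99,
    0x19cc4e1a,
]


def low_byte(addr):
    """Return addr modulo 256, i.e. the lowest byte of the address."""
    return addr % 256


low_bytes = [low_byte(rip) for rip in faulting_rips]

for rip, b in zip(faulting_rips, low_bytes):
    print(f"{rip:#x} % 256 == {b}")

# True when no two of the listed addresses share the same low byte.
all_distinct = len(set(low_bytes)) == len(low_bytes)
print("all distinct:", all_distinct)
```

Since `% 256` just keeps the lowest byte, the same numbers can be read straight off the last two hex digits of each address.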
As of now there are several machines running successfully with Peter’s patch
without crashing (so far). Same goes for unprotected mcode, but these cannot
stay with us forever for obvious reasons. To confirm that we actually found the
cure I want to deploy the patched version to way more machines and give LuaJIT
some more time. But suppose the patch is going to do its magic and prevent
these SIGSEGVs from happening. Are there any drawbacks of using that solution
permanently? What about the thread affinity Igor suggested: does that have any
advantages over Peter’s magic one-liner? :) And finally, should this eventually
make its way to the main branch and save other innocent people from this
suffering? :)
As always - thank you very much for your time.
Tomas