Re: small script to reproduce bogus trace stitch errors at line 0 with coroutines in latest 2.1

From: Mike Pall <mike-1504@xxxxxxxxxx>
To: luajit@xxxxxxxxxxxxx
Date: Tue, 28 Apr 2015 21:34:34 +0200

Elias Hogstvedt wrote:

> the order of these errors is consistent (at least on my computer) until
> you change the code around (like appending "+ 2" to out in the loop)

Ok, even with this test case at hand, this elusive bug took me
another 5 hours to track down ... and I finally found the issue.
*big sigh of relief*

Elias, thank you very, very much for coming up with this test case!

Sadly, it's a design mistake in trace stitching, which is not easy
to fix. So I had to disable the feature in the v2.1 git branch. :-(

The ugly details:

When a trace has to stop at an NYI function, it compiles an exit to
the interpreter with a continuation underneath it on the stack.
When the function later returns, this continuation either triggers
recording of a new trace that continues the control flow or it
directly jumps to that trace, if already compiled.

For this to work, the continuation needs to hold the number of the
first trace. This links ('stitches') the second trace to the first one.
The link to the second trace is stored as soon as that one is
compiled.

Without that link, the continuation wouldn't know where to jump
to, since there's no bytecode after the CALL that could hold a
trace number (which is how all other root traces are tied to the
bytecode).
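
To make the linkage concrete, here is a toy model in Python. It is only a
sketch of the description above, not LuaJIT's actual C code: the names
Trace, Continuation, new_trace and on_return are hypothetical, and the
real data structures look different.

```python
# Toy model of trace stitching; not LuaJIT code. All names are hypothetical.

class Trace:
    def __init__(self, number, name):
        self.number = number   # index into the trace table
        self.name = name
        self.link = 0          # number of the stitched follow-up trace, 0 = none yet

class Continuation:
    """Sits on the stack underneath the NYI function's frame."""
    def __init__(self, first_trace_number):
        self.first_trace_number = first_trace_number

traces = {}  # trace number -> Trace

def new_trace(name):
    number = len(traces) + 1
    traces[number] = Trace(number, name)
    return number

def on_return(cont):
    """Runs when the NYI function returns into the continuation."""
    first = traces[cont.first_trace_number]
    if first.link == 0:
        # No follow-up trace yet: record one and store the link in the first trace.
        first.link = new_trace("stitched after " + first.name)
        return "recorded trace %d" % first.link
    # Follow-up trace already compiled: jump straight to it.
    return "jumped to trace %d" % first.link

first = new_trace("loop")
cont = Continuation(first)
print(on_return(cont))  # recorded trace 2
print(on_return(cont))  # jumped to trace 2
```

The only thing the continuation carries is a number. Everything else is
looked up through the trace table at the time the function returns.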

Well, here's the problem: when a full trace flush occurs, e.g.
because the trace cache is full, there might still be a stitching
continuation somewhere on a stack. It can be left there by a function
that calls back into the VM or, in particular, by coroutine.yield().

Ok, you guessed it: the continuation now holds a stale trace
number! Which is used to look up the link to the stitched trace.
The trace number points to a stale data structure or (even worse)
to an unrelated trace created after the flush. The continuation
jumps to wherever this trace links to. Chaos ensues ...
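
The failure can be reproduced in a few lines with a toy trace table. This
is a Python sketch under the assumption that trace numbers are simply
handed out again after a flush. It is not LuaJIT code, and new_trace,
flush_all and resume are hypothetical names.

```python
# Toy model of the stale trace number; not LuaJIT code. All names are hypothetical.

traces = {}  # trace number -> {"name": ..., "link": ...}

def new_trace(name, link=0):
    number = len(traces) + 1
    traces[number] = {"name": name, "link": link}
    return number

def flush_all():
    # Full flush: every trace is dropped and numbering starts over.
    traces.clear()

def resume(continuation):
    # The continuation blindly trusts the trace number it captured earlier.
    first = traces[continuation["first_trace"]]
    return traces[first["link"]]["name"]

# Before the flush: trace 1 is stitched to trace 2.
first = new_trace("loop A", link=2)
new_trace("rest of loop A")
continuation = {"first_trace": first}  # parked on a yielded coroutine's stack
before = resume(continuation)

flush_all()

# After the flush, unrelated code gets compiled and reuses the same numbers.
new_trace("loop B", link=2)
new_trace("rest of loop B")
after = resume(continuation)

print(before)  # rest of loop A
print(after)   # rest of loop B, which is the wrong code for this continuation
```

In the toy model the wrong lookup at least returns something. If the
number had not been reused yet, resume() would fail with a KeyError,
which corresponds to the stale data structure case.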

Finding and cleaning up all of these continuations with stale data
would require a full GC scan, which is unfeasible. Invalidating
the data is tricky. Maybe some kind of generation number, which is
incremented after each flush (but it might wrap). Or some
completely different idea ...
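
One possible reading of the generation number idea, as a Python sketch. It
is purely illustrative and not a committed design: JitState, GEN_BITS and
resolve are hypothetical names, and the 8-bit counter width is an
assumption picked only to make the wrap easy to show.

```python
# Sketch of the generation-number idea; not LuaJIT code. All names are hypothetical.

GEN_BITS = 8                   # assumed counter width, chosen to show the wrap
GEN_MASK = (1 << GEN_BITS) - 1

class JitState:
    def __init__(self):
        self.generation = 0
        self.traces = {}       # trace number -> number of the stitched trace

    def flush_all(self):
        self.traces.clear()
        self.generation = (self.generation + 1) & GEN_MASK

    def make_continuation(self, trace_number):
        # Capture the current generation together with the trace number.
        return (self.generation, trace_number)

    def resolve(self, continuation):
        """Return the stitched trace to jump to, or None to stay in the interpreter."""
        generation, trace_number = continuation
        if generation != self.generation:
            return None        # captured before a flush: the number is stale
        return self.traces.get(trace_number)

jit = JitState()
jit.traces[1] = 2
cont = jit.make_continuation(1)
print(jit.resolve(cont))       # 2

jit.flush_all()
jit.traces[1] = 7              # unrelated trace reusing number 1
print(jit.resolve(cont))       # None: the stale continuation is detected

# The caveat: after 2**GEN_BITS flushes in total the counter wraps around
# and the old continuation looks valid again.
for _ in range(GEN_MASK):
    jit.flush_all()
jit.traces[1] = 7
print(jit.resolve(cont))       # 7: false match after the wrap
```

A wider counter makes the false match less likely but does not rule it
out, since a coroutine can stay suspended for an arbitrarily long time.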

Trace stitching is rather complex and the current implementation
is rather kludgy. The short dip to the interpreter isn't very
efficient, anyway. Code bases that have already been sieved for
NYI cases don't benefit that much. But it (reportedly) brings a
substantial speedup for others.

I'm not sure what to do, yet.

--Mike

Follow-Ups:
- Re: small script to reproduce bogus trace stitch errors at line 0 with coroutines in latest 2.1
  - From: Thomas Lindgren
- Re: small script to reproduce bogus trace stitch errors at line 0 with coroutines in latest 2.1
  - From: Claire Lewis

References:
- small script to reproduce bogus trace stitch errors at line 0 with coroutines in latest 2.1
  - From: Elias Hogstvedt
