Luke Gorrie wrote:
> What are the ordering properties of loads/stores to FFI objects? Are there
> situations where LuaJIT will reorder these?

There's only elimination, but no reordering at this time. But there
might be in the future.

> If so, is there a way to do a compiler load/store barrier in
> cases where ordering is important?

There's no explicit compiler optimization barrier. But any call to a
(dummy) C function via the FFI would effectively form such a barrier.

> Can LuaJIT also eliminate FFI load/store operations that it thinks are
> redundant?

Yes, the compiler tries to eliminate redundant loads (L2L FWD) or
stores (DSE) and forward stored values to loads (S2L FWD).

> If so, is there a way to prevent this in cases where every
> read/write must really be performed (e.g. when talking to hardware with
> MMIO and reading/writing for side effects)?

See above. Any C function call serves as an optimization barrier for
the compiler. But unless this one uses a low-level barrier
instruction, the CPU may still eliminate or reorder loads and stores.
MMIO regions are usually tagged to prevent that, but that may have a
significant impact on performance, even when you don't want or need
it.

> Generally speaking: what barriers exist in LuaJIT and is it important to
> think about them? (For example are there common operations that will create
> a barrier and prevent optimizations that the JIT could otherwise do?)

Ok, so the workaround with a C function is more like using a hammer,
because it stops all memory optimizations. Maybe more delicate tools
are needed.

I think it would be difficult to do this at the individual cdata
object level (hard to explain the constraints to developers, too).
That would leave us with stopping optimizations for either all loads
or all stores or both. Alas, at that granularity one might as well
disable both, since DSE is not that effective, anyway. There's
already an IR instruction that does that (XBAR). But so far there's
no cheap way to emit it, e.g. in a tight loop.

The issue with hardware barriers is similar in that most
architectures only offer a combined read/write barrier. And this is
what you really want most of the time, anyway.

That said, I'm willing to add an ffi.barrier() instruction that would
give you finer control and more efficiency in tight polling loops.
With an argument that can either be "l", "s", "m" (compiler barrier)
or "L", "S", "M" (hardware barrier). The latter would imply the
former, of course. Patches welcome!

But I'm wary of adding an extra API function without seeing a good
use case -- most of them use a C function call in the inner loop,
anyway.

--Mike
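
For reference, the C-function workaround discussed above could look
roughly like the sketch below. This is only an illustration under
stated assumptions: the file name, the function names (dummy_barrier,
hw_barrier, mmio_read32, mmio_write32) and the build line are all
hypothetical, and __sync_synchronize() is a GCC/Clang builtin, so
other compilers would need their own full-fence primitive.

```c
/* barrier.c -- hypothetical helper library for use via the LuaJIT FFI.
 * Assumed build line: cc -O2 -shared -fPIC -o libbarrier.so barrier.c
 * All names in this file are illustrative, not part of LuaJIT. */
#include <stdint.h>

/* Compiler-level barrier for the JIT. The body is empty on purpose:
 * the FFI call itself is what stops load/store elimination across it.
 * It does NOT stop the CPU from reordering memory accesses. */
void dummy_barrier(void)
{
}

/* Hardware barrier: a full (combined read/write) memory fence.
 * Called through the FFI, it acts as a compiler barrier as well. */
void hw_barrier(void)
{
    __sync_synchronize(); /* GCC/Clang builtin (assumption) */
}

/* Register access done inside a C call. The volatile qualifier keeps
 * the C compiler from merging or dropping the access. */
uint32_t mmio_read32(const volatile uint32_t *reg)
{
    return *reg;
}

void mmio_write32(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
}
```

On the Lua side these would be declared with ffi.cdef, loaded with
ffi.load, and called between the loads or stores that must not be
merged. As noted above, this is the hammer: every such call stops all
memory optimizations across it, not just the ones for one object.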