I think using the name LuaJIT Remake steps on the toes of the existing LuaJIT and I would advise using a more original name.
Anything distinctive would be better. Lua DeeJIT comes to mind, since the generator is called Deegen.
Wait what the hell
> For example, at LLVM IR level, it is trivial to make a function use GHC calling convention (a convention with no callee-saved registers)
This refers to the following section of 
“cc 10” - GHC convention This calling convention has been implemented specifically for use by the Glasgow Haskell Compiler (GHC). It passes everything in registers, going to extremes to achieve this by disabling callee save registers. This calling convention should not be used lightly but only for specific situations such as an alternative to the register pinning performance technique often used when implementing functional programming languages. At the moment only X86 supports this convention and it has the following limitations:
On X86-32 only supports up to 4 bit type parameters. No floating-point types are supported. On X86-64 only supports up to 10 bit type parameters and 6 floating-point parameters. This calling convention supports tail call optimization but requires both the caller and callee are using it.
Really huge, if true!
> The main blockade to the tail-call approach is the callee-saved registers. Since each bytecode function is still a function, it is required to abide to the calling convention, specifically, every callee-saved register must retain its old value at function exit.
This is correct, wasting of callee-saved registers is a shortcoming of the approach I published about protobuf parsing (linked from the first paragraph above). I have been experimenting with a new calling convention that uses no callee-saved registers to work around this, but the results so far are inconclusive. The new calling convention would use all registers for arguments, but allocate registers in the opposite order of normal functions, to reduce the chance of overlap. I have been calling this calling convention "reverse_cc".
I need to spend some time reading this article in more detail, to more fully understand this new work. I would like to know if a new calling convention in Clang would have the same performance benefits, or if Deegen is able to perform optimizations that go beyond this. Inline caching seems like a higher-level technique that operates above the level of individual opcode dispatch, and therefore somewhat orthogonal.
It uses && to combine the two doubles, I’d expect to see + instead.
Is this the case with Rust?
The core point I’ve identified is that existing compilers are pretty good at converting high level descriptions of operations into architecture-specific code (at least, better than we are given the amount of instructions we have to implement) but absolutely awful at doing register selection or dealing with open control flow that is important for an interpreter. Writing everything in assembly lets you do these two but you miss out on all the nice processor stuff that LLVM has encoded into Tablegen.
Anyways, the current plan is that we’re going to generate LLVM IR for each case and run it through a custom calling convention to take that load off the compiler, similar to what the author did here. There’s a lot more than I’m handwaving over that’s still going to be work, like whether we can automate the process of translating the semantics for each instruction into code, how we plan to pin registers, and how we plan to perform further optimizations on top of what the compiler spits out, but I think this is going to be the new way that people write interpreters. Nobody needs another bespoke macro assembler for every interpreter :)
Might try LuaToCee and then apply PGO iteratively.
My biggest pet peeve though is the inability to push unlikely stuff completely to the back of the program. I mean just put it away where nobody can see it. I also want to align code blocks to 1 bytes. Not 16 bytes. Not a single compiler lets me realign the instruction handlers in a way that compresses them down.
I also want to know what BOLT does that improves my interpreter by around 30%.
The lua code by itself did not ran long enough to actually profit much from optimizations much.
So, back then we did discuss possible solutions, and one idea was to basically notice a upcoming C-Call in bytecode ahead of execution, detect the stability of the arguments ahead of time. A background thread, extracts the values, perform the Call-processing of arguments and return pushes the values onto the stack, finally setting a "valid" bit to unblock the c-call (which by then actually is no longer a call). Both sides never have a complete cache eviction and live happily ever after. Unfortunatly i have a game dev addiction, so nothing ever came of it.
But similar minds might have pushed similar ideas.. so asking the hive for this jive. Anyone ever seen something similar in the wild?
Current measurement presents comparison of two wildly different implementation strategies: LuaJIT Remake uses IC and hidden classes while LuaJIT proper does not. It's a well known fact that ICs+hidden classes can lead to major performance improvements. That insight originated in Smalltalk world and was brought to dynamic language mainstream by V8† - but unfortunately it muddles the comparison here.
† V8 was open sourced on September 2, 2008... Coincidentally (though probably not :)) JSC landed their first IC prototype on September 1, 2008.
time luajit -j off array3d.lua real 0m1,968s user 0m1,915s sys 0m0,052s time luajit -j on array3d.lua real 0m0,128s user 0m0,068s sys 0m0,060s
I did peek at the deegen tool a bit, and it seems quite large? https://github.com/luajit-remake/luajit-remake/tree/master/d...
I would be interested in an overview of all the analysis it has to do, which as I understand is basically “automated Mike Pall”
FWIW I think this is the hand-written equivalent with LuaJIT’s dynasm tool: https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/vm_x64.dasc (just under 5000 lines)
Also there are several of these files with no apparent sharing, as you would get with deegen: