|Version 2 (modified by gb, 8 years ago) (diff)|
ARM porting notes and issues
These are very preliminary notes, mostly intended to give an idea of what might be hard and what might not be. However we might wind up doing things could turn out to be very different from the ideas explored here. It involves a certain amount of thinking out loud, and should be edited to remove things that seem like bad ideas after they've been given more thought than they have so far.
I wasn't too involved in the Dylan/ARM efforts in the early 1990s; if anyone who was remembers some Stupid ARM Tricks that might be if interest here, please chime in ...
There are a large number of ARM cores in current use (see the table at http://en.wikipedia.org/wiki/ARM_architecture for more information; that page also describes some of the optional features, including "vector floating point" (vfp)) If we were to undertake porting CCL to "the ARM", it's hard to know what set of baseline features are desirable.) To pick an arbitrary baseline system, I think that "armv5 or greater with vfp in hardware or available via emulation" sounds reasonable. I'm not entirely sure that vfp emulation is available in the kernel; I'd ordinarily say "assume hardware vfp support", but one of the ARM machines that I have access to doesn't seem to offer it. The N810 and iPhone/iPod Touch do seem to offer vfp in hardware.
ARM instructions are generally 32 bits wide; many ARM variants offer an alternate 16-bit instruction set ("THUMB") that offers improved code density, generally at the cost of dropping some ARM features (conditional execution, pre-shifted operands) and making it a bit awkward to access all 16 GPRs. (There's also something in newer ARMs which allows a mixture of 32/16 bit THUMB instructions.) We might be able to use THUMB-encoding in some very specialized cases, but it's not clear how attractive this is.
The ARM will fault on misaligned data accesses (e.g., loading a 32-bit word from an address whose low 2 bits aren't 0.) This can be exploited to make (at least) things like CAR and CDR safe without explicit typechecking; if the 3-bit CONS and NIL tags (assuming that NIL has its own tag, as on PPC32) are congruent modulo 4 (as they are on PPC32), a CAR or CDR instruction would trap with an alignment fault if its argument isn't tagged correctly (and the alignment fault seems to happen before address range/access checks do.) CAR or CDR might otherwise be on the order of 4 instructions, so it seems desirable to be able to exploit this.
The bad news is that (at least on Linux), alignment fault handling for all processes is controlled by values in the pseudofile "/proc/cpu/alignment"; most values that can be written there (including the default value 0) cause the OS kernel to silently return misaligned data. It's possible to write (as root) a value to that pseudofile that causes the offending application to receive a SIGBUS signal; it's unclear what the negative effect of this setting would be on other applications. It'd really be better if Linux offered each process control over how alignment faults are handled; I don't know what OSX does.
I have run both an NSLU2 (Slug) and an N810 for extended periods of time with the global "alignment faults raise signals" setting in effect and haven't yet noticed any other applications keeling over; it's possible that many applications that generate alignment faults have signal handlers already installed.
Having 16 general-purpose registers probably means that we have enough to be able to statically partition them into tagged/untagged sets. The number 16 is a little misleading, because:
- one of those "general purpose registers" is the PC and the other is used to hold the return address on most forms of call instruction.
- one of them needs to be used as a stack pointer and, if we want a runtime frame pointer, we'd need to burn a register for that.
- we almost certainly need to keep thread-local data (the TCR) in some GPR.
- we may need to keep the current function object in a register, unless we can find a cheap scheme for mapping from the PC/LR to the containing function object.
If we assume that we use a frame pointer, don't abuse LR too badly, and need to keep the current function in a register, we still have around 10 GPRs that're "general" and which we can partition into node/immediate subsets. A split involving 2-3 immediate registers and 8-7 node registers sounds reasonable (with "nargs" being overloaded with either some imm reg or some node temporary, depending on whether we want to keep it tagged or not.)
Function objects, return addresses, constants
If we want to jump to an address and execute ARM (not THUMB) code, that address has to have its low 2 bits clear. (The PPC explicitly ignored the low 2 bits in that case, and there were tricks to be played there. We can't use a displacement in a jump or call (we can't say "jump to/call Rx-2"; we can (in a variety of ways) say "jump to/call Rx", and the value in Rx has to have its low 2 bits clear. We either have to treat a function's entry point as a locative, or treat it as a tagged object with a tag of 0 or 4. (Unless we want to make fixnum arithmetic at least a little more expensive, that probably means that a tagged object that represents a function entry point - if such a thing exists - has a tag of 4 and we have 29-bit fixnums instead of 30-bit fixnums. Losing a fixnum bit isn't the end of the world; one consequence is that use of fixnums to represent word-aligned addresses might be a bit constrained.
A "function" in general is a collection of one or more sequences of machine instructions and one or more sets of constants (immediates/metadata). On other ports, we've had exactly one sequence of each type; on the x86, the sequences are concatenated together, while on the PPC, the instruction sequence (CODE-VECTOR) is the 0th element of the constants sequence. On ARM, an LDR ("load register") instruction with a constant displacement is limited to a displacement of +/- 4K bytes relative to a base register (possibly R15/PC), which might mean that some large functions or functions with lots of constants might have difficulty referencing a constant with a single LDR. I think that such functions are likely fairly rare (the ARM code density is probably similar to PPC32) and this case can be detected at compile time, so we can probably assume that an ARM function will contain a single constants vector and a single code vector, most likely concatenated together.
At any point in time when a lisp function is executing, R15/PC is a word-aligned address within that function's code. At many points, R14/LR is a word-aligned address within some function's code. (In other words, both R15 and R14 will generally look like fixnums or will have the same tag as I'm proposing for functions.) We can generally avoid exposing R15 values to lisp code, but R14 values (return addresses) generally need to be stored in stack frames and need to be distinguishable from FIXNUMs or "proposed function entrypoints". We can steal an idea from the x86 ports and say that all return addresses are tagged; since we have to be able to branch to them/copy them to the PC and they therefore have to have their low 2 bits clear, we can use the same tag (4) as I've proposed for functions: something with a tag of 4 is therefore either of type FUNCTION or of type CCL::TAGGED-RETURN-ADDRESS (aka TRA), and we have to look closer in order to distinguish between those cases when we need to.
There may be cases where a locative into a function's code might be the only (direct or indirect) reference to the containing function; cases where this is true are probably pretty rare, but they can happen. (Consider
(let* ((fn (compile nil ...))) (funcall fn)) ; What's this newfangled tail-call elimination thing the youngsters were talking about ?
). Identifying the containing function can be an issue if an interrupt or exception happens during its execution and can be especially important for the GC: we want to be sure that an executing function and its constants don't become garbage even if the PC is the only thing referencing that function.
On x86, we try to address this by saying that a GPR always points to the function object or, if at function entry or at a return address within the function, the function is executing an instruction which recognizably points that GPR at the function object. (This is an "lea" relative to the instruction pointer on x86-64, and a "mov immediate" self-reference instruction on x86-32.)
On the PPC, a function is called with itself (the "new function", or "nfn") as an implicit argument. A non-leaf function function saves the caller's "fn" register in a stack frame, copies "nfn" to "fn", and restores "fn" on exit; the PC is a word-aligned pointer in a code vector that's the 0th element of the function (gvector) object contained in "fn or "nfn". There may be (usually small) windows at function-return or tail-call time when neither the "fn" or "nfn" register references the function which contains the code-vector which contains the PC; in those cases, the GC finds the code-vector by backing up (looking at preceding 32-bit instructions) until it finds the code-vector header (something that couldn't be a valid PPC instruction.) Note that that approach is:
- something that presumably happens fairly rarely
- something that may be expensive when it does happen, if the search involves hundreds or thousands of intervening instructions
- something that's not viable on x86 (where instructions are variable-width), but could be on the ARM
- probably cheaper overall than other approaches - like maintaining a table of function addresses - would be.
At the moment, I lean towards using something like the PPC approach on the ARM, with the differences being that the ARM would have to do this 'back up" trick more often (on exceptions as well as in the GC.) We don't need a dedicated "function" register to address constants on the ARM, since we have PC-relative addressing. (We don't need it for this reason on x86-64 either, but the back-up-and-find-the-header trick isn't practical there.) Here are couple of sketches that show how a call to a globally named function might compile in the two schemes; we certainly might be able to mix these schemes a bit or come up with totally different ideas, but it might be worth comparing them. In both cases, we'll assume that the "temp0" register contains a symbol and that we're trying to do a non-tail call of that symbol's global definition, and that the (:ALIGN N) pseudo-op aligns the PC to 2N.
Call with a "fn" register example
(ldr temp1 (temp0 (:# symbol.value))) (:align 3) (blx temp1) ; jump to address in temp1, set lr to address of next instruction. ;;; This address has tag 4 @back (sub fn pc whatever) ; FN = pc - offset of this instruction
Call without "fn" register example
(add lr pc (:# @back)) ; setup return address (ldr pc (temp0 (:# symbol.value))) ; jump to function (:align 3) (:long @back-fn) ; word of displacement ;;; This address has tag 4 @back ; nothing to do here
Both of the sequences above are on average 3.5 words long. The first sequence involves executing 3 instructions and using an extra temporary register (as well as a conventional "fn" register); the second sequence involves executing 2 instructions and leaving a word which can be used to map between the return address and the function.
Calling through a symbol like this generally requires that the canonical register holding the symbol (temp0) be preserved; if the symbol's not FBOUNDP, whatever's in its function cell can generally complain that temp0 isn't defined.
In the case where some register contains a known function, the first sequence can be one word shorter (just "blx reg", rather than "ldr/blx"). The second sequence can't use BLX, and would be the same length (with the LDR replaced with a MOV to the PC.)
For a subprimitive call (see more below), we have about the same set of issues as in the "known function" case. ARM calls are either register-direct (via BLX), done as two instructions (setup LR, load/mov to the PC), or pc-relative. The second scheme (with the :LONG pseudo-op) restricts the use of pc-relative BL instructions for the same reason that BLX is disallowed, but if we want functions to be able to move around and don't want to have the GC patch up their pc-relative calls when they do, it's not clear that pc-relative calls are very viable. (We might be able to use them to call lexically defined functions.)
A third variant
If we can arrange that the GC usually processes other "direct" references to functions before processing return addresses, then we might be able to to say that a TRA needs neither a preceding :long or a following instruction in order to allow mapping between the TRA and the containing function: if the GC rarely encounters a TRA without having first encountered something that references the containing function, then only things like backtrace would have to scan backwards from a TRA to find the containing function. That would allow the use of BLX to call subprims (preceded by a NOP for alignment half the time) to call subprims and known functions in some cases.)
On a platform that provides separate address spaces for each process and which provide the process with some control over the layout of that address space (e.g., not something like Classic MacOS), it's generally possible to treat NIL as a constant. A fairly common operation (and one of the principle uses of NIL) involves comparing a register to NIL (to check a boolean result or to terminate list traversal), and this can be made inexpensive if NIL's value fits into a compact form of a "compare register to constant" instruction.
Constants (as operands in ALU instructions) on the ARM must be 8-bit values that can be shifted or rotated by an even number of bits. Depending on how NIL is tagged and on what ranges of the address space are readily available, it may or may not be easy or practical to find a value of NIL that allows it to be used in an ARM CMPS instruction.
One other (less common on other platforms) use of NIL is to address constants. That's probably not common enough to matter; we can afford to use a few instructions to establish or reference a constant, nil-relative address.
The ARM doesn't offer absolute addressing and PC-relative addressing for things like subprim calls is undesirable if we want functions to be easily moved by the GC. If NIL is in a register and a subprimitive jump table is "near" NIL, then a subprim call might look like:
(add lr pc @back) (:align 3) (add pc nil disp) ; assuming that "disp" is an appropriate ARM constant. @back
(ldr temp (nil (:# disp)) ; disp is +/- 4K byte offset, temp is any reg not an arg to subprim (:align 3) (blx temp) @back
if we generally use the "third alternative" handling of return addresses.
If we don't have some other good reason to keep NIL in a register, it might be desirable to either keep the base of the subprims jump table in a register or ensure that it can easily be loaded into a register.
A Software Interrupt (SWI) instruction is conditional and contains a 24-bit trap number. Linux traditionally used something like #x900000...+n to invoke system call N; newer ABIs use #0000000000 and pass the syscall index in r7. In Linux, an SWI with an immediate operand not in the current or legacy range raises a SIGILL, so the application can use large chunks of the SWI space for its own purposes. I don't yet know what Darwin does; if we can't use unused SWI encodings semi-portably, there are likely other parts of the instruction space (coprocessor instructions, etc.) that we can use for conditional traps.
It's been a little weird to develop (tiny test programs) on the N810; I had to move some stuff to flash memory just to have room to install the toolchain (and Emacs!) on the device. Nokia tends to advocate cross-development and running under emulation; a developer using their toolchain on an x86 Linux box might not even target the ARM until late in the process (Nokia provides x86 versions of all of the internet tablet GUI libraries, and why would a developer know or care what the processor is ?) I found this whole system to be overly complex and mostly in the way.
The N810 needs to be recharched fairly frequently (having the WiFI on all the time - so that it's possible to SSH into the box - tends to consume power.) Leaving the charger connected after the battery's charged is supposedly bad for battery life, so doing development (where the machine's on and in use for extended periods) involves cycles of "keeping an eye on the battery" followed by "keeping an eye on the charger." All of this sort of thing takes time to get used to.
The other ARM machine that I have access to is a Linksys NSLU2 (a network storage device intended to act as a dedicated fileserver.) It's possible to replace Linksys's Linux kernel (in flash) with a traditional Linux distro (that mostly resides on the external USB disk); the NSLU2 has only 32MB of RAM and a slow (133Mhz) processor, and it takes a while to get used to that. (I think that the NSLU2 also lacks FP hardware of any kind; there are other similar boxes with more memory and faster CPUs, but they also seem to lack FP hardware.)
Debian (the only Linux distribution that seems to be actively supporting the ARM) is in the process of transitioning to a newer ABI that better supports reasonable floating-point on the ARM. I think that the N810 has already made the transition.
Exception handling seems to work (at least under Linux.) The iPhone SDK headers show that the Apple ARM-based systems are Mach-based, and I assume that that means that exception handling needs to be done at the Mach level on those systems.