Index: /branches/arm/lisp-kernel/ARM-notes.txt
===================================================================
--- /branches/arm/lisp-kernel/ARM-notes.txt	(revision 13663)
+++ /branches/arm/lisp-kernel/ARM-notes.txt	(revision 13663)
@@ -0,0 +1,201 @@
+Some notes on a CCL ARM port.
+
+- floating point 
+
+Recent ARM architecture variants offer a mostly reasonable FPU (the
+"Vector Floating Point" unit); older variants either don't have
+hardware FP or have somewhat bizarre ones.  The current Linux ABI
+("gnueabi") tries to support all configurations by specifying that
+FP arithmetic be done via calls to support routines that take their
+arguments in and return results in GPRs.  For CCL, that'd amount to
+doing all FP arithmetic via the FFI, which doesn't sound too attractive.
+I think that we're better off assuming the presence of a VFP, using
+VFP FP instructions, and supporting FPU-less machines via emulation
+(either in the OS kernel or in CCL itself.)  ARM ABIs have moved away
+from this because emulation is apparently pretty slow; anyone who
+does enough floating-point in CCL for this to matter should probably
+get a machine with a real FPU.  (There are interesting FPU-less ARMs,
+so it's not quite the same as the situation with SSE2 on X86, but I
+don't see any reason to try to optimize for older variants.)
+
+
+- instruction set
+
+Some newer variants support Thumb2, which allows a mixture of 32-bit
+and 16-bit instructions (better code density) and some other new features.
+The original Thumb extensions offered 16-bit instructions, but basically
+didn't allow 32-bit and 16-bit instructions to be mixed.
+
+I don't think that we want to require Thumb2 (at least until it's more
+dominant than it is now), and I don't see any use for the original
+Thumb instructions.  That pretty much leaves us with the traditional ARM
+instruction set, where most instructions are conditional and a fairly
+traditional RISC architecture is exposed.  (IIRC, we can mix ARM and
+Thumb2 code with no performance or other penalty, so we can phase Thumb2
+in at some point in the future.)
+
+- cache issues
+
+It's necessary to execute a system call in order to "make data executable",
+e.g., to ensure that code is in the icache after it's been written to memory
+via the dcache.  This suggests that we need to keep code vectors
+disjoint from functions (as in the PPC ports and for similar reasons),
+and that we can't easily recover FN from the PC (and need to make it
+a more explicit argument on each call.)
+
+- tagging
+
+Use the PPC32 tagging scheme, unless/until there's a reason not to.
+One minor variation: I think that it's desirable to keep a code vector
+and its (fixnum-tagged) entry point separate (either by storing them both
+in the function object or by deriving the code vector from the entrypoint).
+(Either approach would likely require some care and some GC support.)
+On the PPC, it was possible to move a tagged pointer to the CTR; branching
+to the CTR was defined to ignore the low 2 bits (though Rosetta sometimes
+forgot this.)  On the ARM, the results are undefined (are they ever ...)
+when bit 1 is set in the PC, and setting bit 0 results in an ARM->Thumb
+transition.  To load a code vector from a function and jump to it, we'd
+have to do:
+
+   ldr temp,[nfn,#function.codevector]
+   bic pc,temp,#lisptagmask     ; bic (and-not), or add/sub/whatever
+
+instead of
+
+   ldr pc,[nfn,#function.entrypoint]
+
+The extra instruction isn't desirable, but likely wouldn't kill
+us.  The need for yet another temp register during function call
+is probably more of a problem, and possibly a significant problem.
+
+
+- register partitioning
+
+The ARM offers 16 GPRs, but the PC and link register are GPRs;
+we probably need to use the C stack as a control stack (to make
+it easier to recognize return addresses) and therefore use
+a separate VSP (with no explicit frame pointer, a la PPC CCL);
+we need to keep the TCR in a GPR, and probably need to keep
+ALLOCPTR in a GPR as well (unless we do consing in some totally
+different way), we need to keep FN in a GPR ... the short version
+is that we're less register-starved than on x8632 and can probably
+use a static partitioning scheme, but we're more register-starved
+than even the x8664.  The current register partitioning isn't
+carved in stone: we may find that it's better to have more/less
+args/temps/imms, but I don't think that we can justify having 
+callee-save NVRs.
+
+- Subprim calls
+
+All other factors being equal, the best way to call a subprim
+would be via a PC-relative call ("bl") instruction.  Those
+factors aren't equal; we'd have to find and adjust all of those
+instructions if we move the containing code vector.  "b" and "bl"
+instructions use a 26-bit displacement (24 bits, in words), so
+we'd have to do something to ensure that all code vectors were
+within +/- 32MB of the subprims jump table or a copy of it, and
+this seems like a lot of complicated overhead in order to use bl.
+
+The current plan seems to be almost as good (as good when code vectors
+are purified), but it's based on a few brutal hacks.
+
+
+ARM constant operands in MOV and ALU instructions are encoded as an
+8-bit value and a 4-bit rotate count, which allows the value to be
+rotated right by an even number of bits; the number of unique 32-bit
+values that can be expressed in this scheme is a little over
+3000.  (Some values can be expressed in more than one combination of
+8-bit value/4-bit rotate count.)  Obviously, all integers < 256 can
+be encoded this way; integers >= 256 and < 1024 can be encoded if they're
+multiples of 4.  There are 192 encodable values between #x4000 and 
+#x10000, and all of them are multiples of 256; we can plausibly
+reserve addresses in that range.  (Linux currently doesn't set
+the sysctl variable "vm.mmap_min_addr" on ARM but sets it to 4K in
+recent x86 distributions; Darwin wants a PAGEZERO region before mapped
+memory, but we can control its size with linker options.)
+
+Using 256 bytes for a jump table entry would be wasteful; losing the
+jump table and using 256 bytes per subprim would be less so.  (Some
+large subprims, such as those dealing with THROW and unwinding,
+might not fit in 256 bytes and would have to be split into a part that
+fits in the 256-byte fixed address range and a part that doesn't.)
+
+Actually jumping to a subprim at address N (where N is an address expressible
+as an ARM constant) is just:
+
+   mov pc, #n
+
+There are a few ways to do a call; those that aren't PC-relative (and I'm
+leaning away from doing PC-relative calls in impure code) are generally
+2 instructions long:
+
+   mov lr, pc  ; when used as a source operand, the PC is read as .+8
+   mov pc,#n
+
+   or
+
+   mov reg,#n
+   blx reg      ; reg can be lr, e.g., blx lr goes to and returns to the
+                ; right addresses
+
+   or
+
+   bl jn
+   ...
+jn:
+   mov pc, #n
+
+
+In the last of these schemes, we might have a jump table of "mov pc,#n"
+instructions at the end of the code vector.  If we can purify code
+vectors to somewhere within 32MB of the actual subprims, then purify()
+can change calls into that jump table into PC-relative calls to the
+actual subprim code.  That's likely a small savings, but it might add
+up and it wouldn't be available under the first two approaches above.
+
+I generally like the whole idea of using immediate addressing to reference
+subprims, but it does depend on:
+
+ a) the OS allowing an application to use low addresses in its address
+    space
+ b) the linker allowing us to build the application that way.  (Linkers
+    generally provide a mechanism for this, but GNU ld scripts are 
+    sometimes sensitive to C library/toolchain versions, and it'd be
+    good to avoid depending on them if possible.)
+
+I suppose that another negative is:
+
+ c) it's hard to create CCL shared libraries, because CCL wants more
+    control over address space layout than a shared library usually
+    has.  That's true for other reasons as well, but it's a PITA to
+    have to keep answering the shared-library question that way.
+
+If these are the only negatives of this scheme, at the moment I'd say
+that the positives outweigh them.
+
+- implementation parameters
+
+We can probably make NIL be a constant (#x10000005 or something similar);
+the only real issues are whether we can count on mapping that address and
+whether it'd wind up in the middle of an address range that we'd like to
+use freely.
+
+We probably have to limit CALL-ARGUMENTS-LIMIT and friends to 256, unless
+we're willing to load larger values from memory or synthesize them via
+a sequence of MOV/ORR in SET-NARGS, and we'd have to load or synthesize
+a similar value or two into a temp reg in CHECK-NARGS.  (I'm sure that
+both are doable; I'm not very certain that these things are worth doing.)
+
+Word-sized load and store instructions can use a constant displacement
+of +/- 4K bytes from a base register.  We will probably rarely approach
+these limits when referencing function constants and values in stack
+frames, but it'd be good to have a way to handle these situations besides
+saying "function/stack frame too large".
+
+ARM documentation basically says that when a constant value can't be
+represented as a rotated 8-bit value, it's best to load it from PC-relative
+memory; ARM C functions generally have these "constant pools" interspersed
+with executable code.  (Large functions might have them "interspersed";
+smaller functions likely have code "followed by" constant pools.)  We almost
+certainly have to deal with similar issues ... and they can be complicated.
+
+  [explain complexity here ...]
