Notes on an IA-32 port

A port of OpenMCL to IA-32 (Intel's name for the 32 bit x86 architecture) would be a win for several reasons.

The most obvious reason would be to support Apple hardware that isn't 64-bit capable. This includes the first-generation Intel-based iMac, MacBook, MacBook Pro, and all Intel-based Mac Minis. Of course, support for non-Apple IA-32 hardware would also be nice.

Another reason would be to support access to Cocoa and Carbon (and other frameworks) on Intel-based Macintoshes running Mac OS X Tiger.

Although Apple has announced that Cocoa will be 64 bit in Leopard, they have publicly confirmed that Carbon won't be. Therefore, a 32 bit lisp would still be needed to use Carbon, even on Leopard running on 64 bit hardware.

It would be interesting to support the AMD Geode LX (as used in the OLPC laptop) as the minimum processor. This processor supports the P6 family instructions, including MMX instructions. We can therefore use the conditional move instructions, and maybe some MMX instructions to help out with bignums.

On the other hand, the AMD Geode LX processor doesn't support any of the SSE/SSE2/SSE3 instructions; this means that we'd have to use the x87 FPU (which is sort of funky). This would require modifications to the compiler, which believes that every floating point register can be accessed independently.

It might be reasonable to target the Core Solo/Duo processor, at least to begin with. This would cover all Intel-based Macintosh systems ever shipped, and would allow us to avoid adding x87 FPU support for now.

Register usage and tagging

(See also

The tagging scheme can basically follow the PPC32 port. An important difference is that the three-bit tag #b101, which is used for NIL on PPC32, is used for tagged return addresses on IA-32. (More on this later.)
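As a rough illustration of what the three-bit tag means, here is a small sketch in portable Common Lisp that pulls the tag out of a word and checks for the tagged-return-address case. The constant and function names are made up for this note and are not the port's actual definitions; only the value #b101 comes from the text above.

```lisp
;;; Illustrative names only; these are not the port's actual definitions.
(defconstant +fulltag-mask+ #b111)  ; the low three bits of a word
(defconstant +fulltag-tra+  #b101)  ; tagged return address (NIL's tag on PPC32)

(defun fulltag (word)
  "Extract the three-bit tag from WORD."
  (logand word +fulltag-mask+))

(defun tagged-return-address-p (word)
  "True if WORD carries the tagged-return-address tag."
  (= (fulltag word) +fulltag-tra+))
```

For instance, a return address that has been aligned so that it ends in #b101 (such as #x1005) would satisfy tagged-return-address-p, while #x1004 would not.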

We want to keep the precise GC, but the limited number of registers that we have will probably make it impossible to statically partition the register file into immediate and tagged sets.

Anyway, we're looking at something like this:

eax 	 imm0
ecx 	 temp0
edx 	 temp1, nargs
ebx 	 arg_z
esp 	 stack pointer
ebp 	 frame pointer
esi 	 arg_y
edi 	 fn

We will augment this with a dynamic scheme: we will set or clear a bit in thread-private memory whenever a register transitions from one class to another. The GC will then look at these flag bits to decide how to treat the registers. (This may make the lispy register names confusing, since at times imm0 might actually contain a node, or arg_y an immediate.)
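One way to picture the dynamic scheme is as a small bit mask with one bit per register, where a set bit means "the GC should treat this register as a node." The sketch below models that in portable Common Lisp. The bit assignment, the function names, and the choice of baseline (everything except eax is a node, per the table above) are assumptions made for illustration; the real flag word would live in thread-private memory and be read by the kernel.

```lisp
;;; Hypothetical model of the per-thread "node register" mask.
(defparameter *gc-visible-registers* '(eax ecx edx ebx esi edi)
  "Registers the GC might scan; esp and ebp are left out of this sketch.")

(defun register-bit (reg)
  "The mask bit assigned to REG (assignment is arbitrary here)."
  (ash 1 (position reg *gc-visible-registers*)))

(defun baseline-node-mask ()
  "Mask for the static partitioning: only eax (imm0) is immediate."
  (reduce #'logior (mapcar #'register-bit '(ecx edx ebx esi edi))))

(defun mark-as-immediate (mask reg)
  "Clear REG's bit: the GC must now ignore its contents."
  (logandc2 mask (register-bit reg)))

(defun mark-as-node (mask reg)
  "Set REG's bit: the GC must now treat its contents as a root."
  (logior mask (register-bit reg)))

(defun node-register-p (mask reg)
  (logtest mask (register-bit reg)))

(defun gc-root-registers (mask)
  "The registers the GC would scan as roots under MASK."
  (remove-if-not (lambda (reg) (node-register-p mask reg))
                 *gc-visible-registers*))
```

Code that needs a second immediate register would call something like mark-as-immediate on ecx before loading an unboxed value into it, and mark-as-node afterwards.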

Another potential idea involves the direction flag (DF) in EFLAGS. The lisp can't use any of the string instructions, so it would be possible to use this bit for other purposes. When DF is clear, this could tell the GC that the normal register partitioning (as above) is in effect. When DF is set, this could indicate that an alternate register partitioning is in effect, e.g.,

eax 	 imm0
ecx 	 imm2
edx 	 imm1
ebx 	 arg_z
esp 	 stack pointer
ebp 	 frame pointer
esi 	 arg_y
edi 	 fn

We'd need to be careful to save the state of the flags and clear DF before calling foreign code.
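To make the DF idea concrete, here is a sketch of how a GC might pick a partitioning from a saved EFLAGS value. DF is bit 10 of EFLAGS; the two register lists follow the two tables above, and the names are invented for this note.

```lisp
;;; Sketch: choose the register partitioning from a saved EFLAGS value.
(defconstant +eflags-df+ (ash 1 10))   ; DF is bit 10 of EFLAGS

(defparameter *normal-immediate-registers* '(eax)
  "Immediate registers when DF is clear (first table).")

(defparameter *alternate-immediate-registers* '(eax ecx edx)
  "Immediate registers when DF is set (second table).")

(defun immediate-registers (eflags)
  "Registers the GC should skip, given the thread's saved EFLAGS."
  (if (logtest eflags +eflags-df+)
      *alternate-immediate-registers*
      *normal-immediate-registers*))
```

With this arrangement, switching between the two partitionings costs a single STD or CLD instruction, which is the appeal of the idea.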

(I got the idea of using a bit in EFLAGS to distinguish between two modes of register usage from

One further idea is to arrange that no lisp object can be located at an address below 2^24. Since no valid vector index can be as great as 2^24 (the value of ARRAY-TOTAL-SIZE-LIMIT), this means that we could store unboxed array indices in node registers and still have the GC reliably distinguish nodes from other values. (The main win here would be making AREF easy. It wouldn't be necessary to fool around marking registers as immediates.)
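The test the GC would apply under this idea is just a comparison against 2^24. A minimal sketch, with hypothetical names:

```lisp
;;; Sketch: any word below 2^24 cannot be a pointer to a lisp object,
;;; so the GC can treat it as an unboxed index and leave it alone.
(defconstant +lowest-object-address+ (expt 2 24)) ; assumed lower bound

(defun possible-pointer-p (word)
  "False for any value small enough to be an unboxed array index."
  (>= word +lowest-object-address+))
```

A real GC would still look at the tag bits of anything that passes this test; the point is only that small values are rejected before that.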

Callee-saved "non-volatile" registers are probably a non-starter.

TCR additions

Node spill area

Since we have so few registers, we have a static spill area of 4 words in each lisp thread's TCR. The GC treats these as roots, and the convention is that they're caller-saved. (At the moment, the compiler knows nothing of this spill area, and it's used only from LAP functions and subprimitives.)

We do have to be careful about clearing out this spill area so that the GC doesn't hang onto objects that would otherwise be garbage. PROCESS-INTERRUPT will need to save and restore these values, and we might need to say that any callback (including traps) does so as well. (There's enough complexity already in traps and callbacks that saving/restoring the spill area isn't likely to add substantial overhead.)
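The save/restore discipline could look something like the following sketch, which models the spill area as a 4-element vector. Everything here (the variable, the macro, the clearing function) is hypothetical; the real spill area is a set of slots in the TCR, and the real save/restore would happen in PROCESS-INTERRUPT and the callback machinery.

```lisp
;;; Hypothetical model of the 4-word node spill area.
(defparameter *spill-area* (make-array 4 :initial-element 0))

(defun clear-spill-area ()
  "Zero the spill area so the GC doesn't retain dead objects."
  (fill *spill-area* 0))

(defmacro with-saved-spill-area (&body body)
  "Run BODY, then restore the spill area to its contents on entry."
  (let ((saved (gensym "SAVED")))
    `(let ((,saved (copy-seq *spill-area*)))
       (unwind-protect (progn ,@body)
         (replace *spill-area* ,saved)))))
```

An interrupting function could then use the spill area freely inside with-saved-spill-area without disturbing the interrupted code.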

Unboxed words

There are also a couple of words in the TCR that are used for unboxed values. We could build a frame on the tsp, but that's rather expensive.

One situation that comes to mind involves dividing an n-digit bignum by a single digit: we need registers to contain the bignum dividend, the result (the bignum quotient), and an index register. The DIV instruction requires the use of the EDX:EAX pair. That means we're out of registers (fn, ebp, and esp are all in use), and have to store the single digit divisor somewhere else. We'd typically keep an unboxed value in an MMX register, but we can't use an MMX register as an operand to DIV. Therefore we have to use a memory operand: the unboxed word in the TCR.
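For reference, the loop in question looks like this when written in portable Common Lisp. Each iteration corresponds to one DIV: the running remainder plays the role of EDX, the current digit plays the role of EAX, and the divisor is the operand that has to live in memory. The digit order (most significant first) and the function name are choices made for this sketch, not a description of how the lisp stores bignums.

```lisp
;;; Sketch: divide a multi-digit number by a single 32-bit digit.
(defun divide-by-digit (digits divisor)
  "DIGITS is a vector of 32-bit digits, most significant first.
Returns the quotient digits (same order) and the remainder."
  (let ((quotient (make-array (length digits)))
        (remainder 0))
    (dotimes (i (length digits) (values quotient remainder))
      ;; One DIV step: the 64-bit dividend is REMAINDER:DIGIT.
      ;; Since REMAINDER < DIVISOR, the quotient fits in 32 bits.
      (multiple-value-bind (q r)
          (floor (+ (ash remainder 32) (aref digits i)) divisor)
        (setf (aref quotient i) q
              remainder r)))))
```

The loop keeps three things live besides EDX:EAX (dividend pointer, quotient pointer, index), which is what uses up the remaining registers.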


The CLOS implementation sometimes uses an invisible argument to pass context information for CALL-NEXT-METHOD. On other ports, this is a register that's not part of the normal calling sequence, but on IA-32, all the registers are spoken for: arg_y and arg_z contain the last two arguments, temp1 is used as nargs, and temp0 is the function about to be called. We therefore pass the next-method-context via a slot in the TCR. (No, it's not pretty.) The GC will have to treat this as a root, and we might want to arrange to clear it out somehow so that it doesn't hang onto something that's otherwise garbage.

Comment by gb on Wed Aug 1 22:16:05 2007

It's probably sanest to think of the dynamic register partitioning as being a set of (local, temporary) changes relative to a baseline scheme, where the baseline scheme is in effect any time a function is entered (and therefore at the time of a function call). At that time, we probably need more node regs and fewer imm regs than the scheme suggested above provides, and we can probably overload nargs and imm0.

If we pass two arguments in registers, then we probably need a node register to address the callee on a function call, something like:

(movl (@ 'foo (% fn)) (% temp0))
(:talign ia32::fulltag-tra)
(call (@ ia32::symbol.fcell (% temp0)))
(movl ($ :self) (% fn))

The CLOS implementation will sometimes funcall a method-function with an invisible argument (not counted against nargs) in a node register. (That's context information for CALL-NEXT-METHOD and it's done in a way that's not MOP-compliant.)

I think that in general, if we err on the side of "too many node regs" in the baseline partitioning, we can always (cheaply) save the values in those node regs if we need to temporarily make a register immediate for consing, shifts, multiply/divide, memory assignment, or whatever else needs more than one imm reg.

I think that having imm0=nargs would work fairly well, since we're usually either validating/defaulting based on nargs or doing tag/bounds checking, but rarely if ever need to do both at the same time.

Comment by rme

It turns out that there's a bit of a wrinkle in using imm0 as nargs, and that wrinkle is funcall. Funcall needs to use an imm reg for tag checking, and if our only imm reg contains nargs, well, that's going to be a problem. We'll use temp1 for nargs instead.

Another possibility might be a minor design change: make CALL-ARGUMENTS-LIMIT smaller, and use %ah for nargs, which would leave %al free for tag checking.
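The %ah/%al split amounts to treating bits 8-15 and bits 0-7 of %eax as two independent byte fields. A sketch with LDB and DPB (the function names are invented for this note):

```lisp
;;; Sketch: nargs in %ah (bits 8-15 of eax), tag scratch in %al (bits 0-7).
(defun set-ah (eax nargs)
  (dpb nargs (byte 8 8) eax))

(defun set-al (eax tag)
  (dpb tag (byte 8 0) eax))

(defun get-ah (eax)
  (ldb (byte 8 8) eax))

(defun get-al (eax)
  (ldb (byte 8 0) eax))
```

Writing a tag into %al leaves the argument count in %ah untouched, at the price of capping the count at what fits in eight bits (hence the smaller CALL-ARGUMENTS-LIMIT).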


We don't have the __thread storage class on Darwin, so we will need to use i386_set_ldt to install a segment descriptor for each thread into the LDT. When a thread's segment descriptor is loaded into the %fs segment register, %fs can be used to refer to thread-local storage.

(This implies an 8K limit to the number of threads, by the way. Probably not a big deal for a 32 bit lisp.)
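The 8K figure falls out of the x86 segment selector layout: a 16-bit selector holds a 13-bit descriptor index, a table-indicator bit (1 for the LDT), and a 2-bit requested privilege level. A small sketch of that encoding, with hypothetical function and constant names:

```lisp
;;; Sketch: building an LDT segment selector.
(defconstant +max-ldt-entries+ (expt 2 13))  ; 13-bit index field

(defun ldt-selector (index &optional (rpl 3))
  "Selector for LDT entry INDEX at requested privilege level RPL."
  (logior (ash index 3)   ; bits 3-15: descriptor index
          (ash 1 2)       ; bit 2: table indicator, 1 = LDT
          rpl))           ; bits 0-1: requested privilege level
```

With one LDT entry per thread, +max-ldt-entries+ is the ceiling on the number of threads.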

LAP notes

Quoted numbers are fixnums: e.g., on x8664, (add ($ '2) (% rsp)) adds 16 to %rsp, because the fixnum 2 is represented as 2 shifted left by 3 bits.

memory operand form: ([seg] [disp] [base] [index] [scale])
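The arithmetic behind the quoted-number note is a single shift. In the sketch below the shift of 3 for x8664 is what makes '2 come out as 16; the shift of 2 shown for the 32-bit case is an assumption here, by analogy with the PPC32 tagging scheme. The function name is made up.

```lisp
;;; Sketch: the raw machine integer that a quoted fixnum denotes in LAP.
(defun lap-quoted-value (n fixnum-shift)
  "Raw value of the fixnum N for a target with the given fixnum shift."
  (ash n fixnum-shift))

(lap-quoted-value 2 3)  ; x8664: 16
(lap-quoted-value 2 2)  ; 32-bit target, assuming a shift of 2: 8
```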

Bootstrapping notes

Follow these instructions to use a current Darwin/x86-64 lisp to build the ia32 branch. (I think that the 070722 snapshot should also work.)

Check out the ia32 branch from svn.

$ svn co svn+ssh://


$ svn co

Use the second form if you don't have write access to the repository.

Next, copy a kernel, image, and interfaces into the ia32 tree.

$ p=/path/to/trunk/ccl/
$ cp $p/{dx86cl64,dx86cl64.image} .
$ cd darwin-x86-headers64/libc
$ cp $p/darwin-x86-headers64/libc/*.cdb .

Build the ia32 branch sources: start the x86-64 lisp and evaluate the following forms.

;;; pick up operators i386-ff-call and i386-syscall
(load "compiler/nxenv.lisp")
(load "compiler/nx1.lisp")

;;; define some stuff from x86-asm.lisp
(in-package "X86")
(defparameter *opcode-flags*
  `((:jump . ,(ash 1 0))		;special case for jump insns
    (:CpuNo64 . ,(ash 1 16))		;not supported in 64 bit mode
    (:Cpu64 . ,(ash 1 17))		;64 bit mode required
    (:CpuSSE . ,(ash 1 18))		;SSE extensions required
    (:CpuSSE2 . ,(ash 1 19))		;SSE2 extensions required
    (:CpuSSE3 . ,(ash 1 20))))		;SSE3 extensions required

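(defun %encode-opcode-flags (flags &optional errorp)
  ;; Return the OR of the bits named by FLAGS (an atom or a list).
  ;; The fallback value 0, the OR, and the (RETURN) below restore lines
  ;; that were missing from this listing; check them against x86-asm.lisp.
  (flet ((encode-atomic-flag (f)
           (if f
             (cdr (assoc f *opcode-flags*))
             0)))
    (or
     (if (atom flags)
       (encode-atomic-flag flags)
       (let* ((k 0))
         (dolist (f flags k)
           (let* ((k0 (encode-atomic-flag f)))
             (if k0
               (setq k (logior k0 k))
               (return))))))
     (if errorp (error "Unknown x86 opcode flags: ~s" flags)))))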

(defmacro encode-opcode-flags (&rest flags)
  (%encode-opcode-flags flags t))

(in-package "CCL")

(load "compiler/X86/X8632/x8632-arch.lisp")
(load "lib/x8632env.lisp")
(compile-file "compiler/X86/X8664/x8664-backend.lisp")
(compile-file "compiler/X86/x86-asm.lisp")
(compile-file "compiler/X86/x86-lap.lisp")
(compile-file "compiler/X86/x862.lisp")

;;; Ignore the warnings about the undeclared x86::*x86...* free variables.

(load "compiler/X86/X8664/x8664-backend")
(load "compiler/X86/x86-asm")
(load "compiler/X86/x86-lap")
(load "compiler/X86/x862")

(compile-ccl t)

Exit, start up with the bootstrap image, and then save a new full image.

Darwin/IA-32 interface databases

To build interface files, get an ffigen binary from and install it.

$ cd darwin-x86-headers/libc/C
$ ./
+++ /Developer/SDKs/MacOSX10.4u.sdk/usr/include/ar.h
+++ /Developer/SDKs/MacOSX10.4u.sdk/usr/include/arpa/ftp.h
+++ /Developer/SDKs/MacOSX10.4u.sdk/usr/include/arpa/inet.h

There's one more step to making these interfaces available to the lisp, but we'll get to that in a minute.

Cross compiling

Given an image made as above, to set up for cross-compiling:

(in-package "CCL")
;;; may not actually need all this stuff
(load "compiler/X86/X8632/x8632-arch.lisp")
(require "X8632ENV")
(require "X8632-ARCH")
(defpackage "X86-DARWIN32")
(compile-file "compiler/X86/x86-lap.lisp")
(load "compiler/X86/x86-lap")
(load "compiler/X86/x86-backend.lisp")
(load "compiler/X86/x86-disassemble.lisp")
(load "compiler/X86/X8632/x8632-backend.lisp")

(let ((*target-backend* *x8632-backend*))
  (load "ccl:compiler;X86;X8632;x8632-vinsns.lisp")
  (load "ccl:compiler;X86;x86-lapmacros.lisp"))

(load "lib/ffi-darwinx8632.lisp")

(require-update-modules *x8632-xload-modules* t)

The first time you do this, you need to finish processing the interface databases.

(require "PARSE-FFI")
(parse-standard-ffi-files :libc :darwinx8632)
;;; lots of output...
;;; Do it once more...
(parse-standard-ffi-files :libc :darwinx8632)


(cross-xload-level-0 :darwinx8632 :force)

Evaluating the cross-xload-level-0 form above will create a boot image in x86-boot32.image.

To cross-compile the rest of the lisp, evaluate:

(cross-compile-ccl :darwinx8632 t)

Note that several files will not compile yet; just pick the "skip this file" restart and carry on for now.

To build an IA-32 lisp kernel:

$ cd lisp-kernel/darwinx8632
$ make

The current state of the port is that it will map in the boot image and start fasloading files. It'll die in l1-io.dx32fsl while trying to create the GRAY package.

To run the IA-32 lisp (from the main ccl directory):

$ gdb dx86cl
GNU gdb 6.3.50-20050815 (Apple version gdb-768) (Tue Oct  2 04:07:49 UTC 2007)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-apple-darwin"...Reading symbols for shared libraries .. done

(gdb) run x86-boot32.image 
Starting program: /Users/rme/ccl/dx86cl x86-boot32.image
Reading symbols for shared libraries +. done
;Loading level-1.dx32fsl
;Loading ./l1-fasls/l1-cl-package.dx32fsl
;Loading ./l1-fasls/l1-utils.dx32fsl
;Loading ./l1-fasls/l1-init.dx32fsl
;Loading ./l1-fasls/l1-symhash.dx32fsl
;Loading ./l1-fasls/l1-numbers.dx32fsl
;Loading ./l1-fasls/l1-aprims.dx32fsl
;Loading ./l1-fasls/x86-callback-support.dx32fsl
;Loading ./l1-fasls/l1-callbacks.dx32fsl
;Loading ./l1-fasls/l1-sort.dx32fsl
;Loading ./bin/lists.dx32fsl
;Loading ./bin/sequences.dx32fsl
;Loading ./l1-fasls/l1-dcode.dx32fsl
;Loading ./l1-fasls/l1-clos-boot.dx32fsl
;Loading ./bin/hash.dx32fsl
;Loading ./l1-fasls/l1-clos.dx32fsl
;Loading ./bin/defstruct.dx32fsl
;Loading ./bin/dll-node.dx32fsl
;Loading ./l1-fasls/l1-unicode.dx32fsl
;Loading ./l1-fasls/l1-streams.dx32fsl
;Loading ./l1-fasls/linux-files.dx32fsl
;Loading ./bin/chars.dx32fsl
;Loading ./l1-fasls/l1-files.dx32fsl
;Loading ./l1-fasls/l1-typesys.dx32fsl
;Loading ./l1-fasls/sysutils.dx32fsl
;Loading ./l1-fasls/x86-threads-utils.dx32fsl
;Loading ./l1-fasls/l1-lisp-threads.dx32fsl
;Loading ./l1-fasls/l1-application.dx32fsl
;Loading ./l1-fasls/l1-processes.dx32fsl
;Loading ./l1-fasls/l1-io.dx32fsl
Unhandled exception 10 at 0x80875d3, context->regs at #x342bbc
? for help
[1604] CCL kernel debugger: 

If you take a look and have questions or comments, please send mail to openmcl-devel@…, or to me directly (rme@…).