source: trunk/source/doc/src/13-implementation.xml @ 8567

Last change on this file since 8567 was 8567, checked in by rme, 13 years ago

Break up monolithic openmcl-documentation.xml file into chapters. Add file
top.xml, which includes all the chapters via xi:xinclude.

File size: 77.8 KB
1<?xml version="1.0" encoding="utf-8"?>
2<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
3  <chapter id="Implementation-Details-of-OpenMCL">
4    <title>Implementation Details of OpenMCL</title>
5    <para>This chapter describes many aspects of OpenMCL's
6    implementation as of (roughly) version 1.1.  Details vary a bit
7    between the three architectures (PPC32, PPC64, and X86-64)
8    currently supported and those details change over time, so the
9    definitive reference is the source code (especially some files in
10    the ccl/compiler/ directory whose names contain the string "arch"
11    and some files in the ccl/lisp-kernel/ directory whose names
12    contain the string "constants".)  Hopefully, this chapter will
13    make it easier for someone who's interested to read and understand
14    the contents of those files.</para>
15
16    <sect1 id="Threads-and-exceptions">
17      <title>Threads and exceptions</title>
18      <para>OpenMCL's threads are "native" (meaning that they're
19      scheduled and controlled by the operating system.)  Most of the
20      implications of this are discussed elsewhere; this section tries
21      to describe how threads look from the lisp kernel's perspective
22      (and especially from the GC's point of view.)</para>
23      <para>OpenMCL's runtime system tries to use machine-level
24      exception mechanisms (conditional traps when available, illegal
25      instructions, memory access protection in some cases) to detect
26      and handle ...  exceptional situations.  These situations
27      include some TYPE-ERRORs and PROGRAM-ERRORS (notably
28      wrong-number-of-args errors), and also include cases like "not
29      being able to allocate memory without GCing or obtaining more
30      memory from the OS."  The general idea is that it's usually
31      faster to pay (very occasional) exception-processing overhead
32      and figure out what's going on in an exception handler than it
33      is to maintain enough state and context to handle an exceptional
34      case via a lighter-weight mechanism when that exceptional case
35      (by definition) rarely occurs.</para>
36      <para>Some emulated execution environments (the Rosetta PPC
37      emulator on x86 versions of OSX) don't provide accurate
38      exception information to exception handling functions. OpenMCL
39      can't run in such environments.</para>
40
41      <sect2 id="The-Thread-Context-Record">
42        <title>The Thread Context Record</title>
43
44        <para>When a lisp thread is first created (or when a thread
45        created by foreign code first calls back to lisp), a data
46        structure called a Thread Context Record (or TCR) is allocated
47        and initialized.  On modern versions of Linux and FreeBSD, the
48        allocation actually happens via a set of thread-local-storage
49        ABI extensions, so a thread's TCR is created when the thread
50        is created and dies when the thread dies.  (The World's Most
51        Advanced Operating System - as Apple's marketing literature
52        refers to Darwin - is not very advanced in this regard, and I
53        know of no reason to assume that advances will be made in this
54        area anytime soon.)</para>
55        <para>A TCR contains a few dozen fields (and is therefore a
56        few hundred bytes in size.)  The fields are mostly
57        thread-specific information about the thread's stacks'
58        locations and sizes, information about the underlying (POSIX)
59        thread, and information about the thread's dynamic binding
60        history and pending CATCH/UNWIND-PROTECTs.  Some of this
61        information could be kept in individual machine registers
62        while the thread is running (and the PPC - which has more
63        registers available - keeps a few things in registers that the
64        X86-64 has to access via the TCR), but it's important to
65        remember that the information is thread-specific and can't
66        (for instance) be kept in a fixed global memory
67        location.</para>
68        <para>When lisp code is running, the current thread's TCR is
69        kept in a register.  On PPC platforms, a general purpose
70        register is used; on x86-64, an (otherwise nearly useless)
71        segment register works well (and avoids tying up a more
72        generally useful general-purpose register for this
73        purpose.)</para>
74        <para>The address of a TCR is aligned in memory in such a way
75        that a FIXNUM can be used to represent it.  The lisp function
76        CCL::%CURRENT-TCR returns the calling thread's TCR as a
77        fixnum; the actual value of the TCR's address is 4 or 8 times the
78        value of this fixnum.</para>
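        <para>The relationship between a TCR's address and its
        fixnum representation can be sketched as follows.  This is an
        illustrative model, not kernel code: the function names are
        hypothetical, and the shift amounts (2 bits on 32-bit
        platforms, 3 bits on 64-bit platforms) are an assumption
        inferred from the "4 or 8 times" factor above.</para>
        <programlisting><![CDATA[
```python
# Hypothetical model: an address aligned to 4 (or 8) bytes has zeros in
# its low bits, so the same bit pattern can be read as a fixnum.
FIXNUM_SHIFT = {32: 2, 64: 3}  # assumed: log2 of the "4 or 8" factor


def tcr_address_to_fixnum(address, word_bits):
    """Return the fixnum whose value is address / 4 (or address / 8)."""
    shift = FIXNUM_SHIFT[word_bits]
    if address % (1 << shift) != 0:
        raise ValueError("address is not aligned; no fixnum represents it")
    return address >> shift


def fixnum_to_tcr_address(fixnum, word_bits):
    """Recover the address: 4 or 8 times the fixnum's value."""
    return fixnum << FIXNUM_SHIFT[word_bits]
```
]]></programlisting>
        <para>In this model, an unaligned address is rejected
        because no fixnum could represent it; that is why the
        alignment of the TCR matters.</para>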
79        <para>When the lisp kernel initializes a new TCR, it's added
80        to a global list maintained by the kernel; when a thread
81        exits, its TCR is removed from this list.</para>
82        <para>When a thread calls foreign code, lisp stack pointers
83        are saved in its TCR, lisp registers (at least those whose
84        value should be preserved across the call) are saved on the
85        thread's value stack, and (on x86-64) RSP is switched to the
86        control stack.  A field in the TCR (tcr.valence) is then set
87        to indicate that the thread is running foreign code, foreign
88        argument registers are loaded from a frame on the foreign
89        stack, and the foreign function is called. (That's a little
90        oversimplified and possibly inaccurate, but the important
91        things to note are that the thread "stops following lisp stack
92        and register usage conventions" and that it advertises the
93        fact that it's done so.)  Similar transitions in a thread's
94        state ("valence") occur when it enters or exits an exception
95        handler (which is sort of an OS/hardware-mandated foreign
96        function call where the OS thoughtfully saves the thread's
97        register state for it beforehand.)</para>
98      </sect2>
99
100      <sect2 id="Exception-contexts-comma---and-exception-handling-in-general">
101        <title>Exception contexts, and exception-handling in general</title>
102        <para>Unix-like OSes tend to refer to exceptions as "signals";
103        the same general mechanism ("signal handling") is used to
104        process both asynchronous OS-level events (such as the result
105        of the keyboard driver noticing that ^C or ^Z has been
106        pressed) and synchronous hardware-level events (like trying to
107        execute an illegal instruction or access protected memory.)
108        It makes some sense to defer ("block") handling of
109        asynchronous signals so that some critical code sequences
110        complete without interruption; since it's generally not
111        possible for a thread to proceed after a synchronous exception
112        unless and until its state is modified by an exception
113        handler, it makes no sense to talk about blocking synchronous
114        signals (though some OSes will let you do so and doing so can
115        have mysterious effects.)</para>
116        <para>On OSX/Darwin, the POSIX signal handling facilities
117        coexist with lower-level Mach-based exception handling
118        facilities.  Unfortunately, the way that this is implemented
119        interacts poorly with debugging tools: GDB will generally stop
120        whenever the target program encounters a Mach-level exception
121        and offers no way to proceed from that point (and let the
122        program's POSIX signal handler try to handle the exception);
123        Apple's CrashReporter program has had a similar issue and,
124        depending on how it's configured, may bombard the user with
125        alert dialogs which falsely claim that an application has
126        crashed (when in fact the application in question has
127        routinely handled a routine exception.)  On Darwin/OSX,
128        OpenMCL uses Mach thread-level exception handling facilities
129        which run before GDB or CrashReporter get a chance to confuse
130        themselves; OpenMCL's Mach exception handling tries to force
131        the thread which received a synchronous exception to invoke a
132        signal handling function ("as if" signal handling worked more
133        usefully under Darwin.)  Mach exception handlers run in a
134        dedicated thread (which basically does nothing but wait for
135        exception messages from the lisp kernel, obtain and modify
136        information about the state of threads in which exceptions
137        have occurred, and reply to the exception messages with an
138        indication that the exception has been handled.)  The reply
139        from a thread-level exception handler keeps the exception from
140        being reported to GDB or CrashReporter and avoids the problems
141        related to those programs.  Since OpenMCL's Mach exception
142        handler doesn't claim to handle debugging-related exceptions
143        (from breakpoints or single-step operations), it's possible to
144        use GDB to debug OpenMCL.</para>
145        <para>On platforms where signal handling and debugging don't get in each
146other's way, a signal handler is entered with all signals blocked.
147(This behavior is specified in the call to the sigaction() function
148which established the signal handler.)  The signal handler receives
149three arguments from the OS kernel; the first is an integer which
150identifies the signal, the second is a pointer to an object of
151type "siginfo_t", which may or may not contain a few fields that
152would help to identify the cause of the exception, and the third
153argument is a pointer to a data structure (called a "ucontext"
154or something similar) which contains machine-dependent information
155about the state of the thread at the time that the exception/signal
156occurred.  While asynchronous signals are blocked, the signal handler
157stores the pointer to its third argument (the "signal context") in
158a field in the current thread's TCR, sets some bits in another TCR
159field to indicate that the thread is now waiting to handle an
160exception, unblocks asynchronous signals, and waits for a global
161exception lock which serializes exception processing.</para>
162        <para>On Darwin, the Mach exception thread creates a signal
163        context (and maybe a siginfo_t structure), stores the signal
164        context in the thread's TCR, sets the TCR field which describes
165        the thread's state, and arranges that the thread resume
166        execution at its signal handling function (with a signal
167        number, possibly NULL siginfo_t, and signal context as
168        arguments.)  When the thread resumes, it waits for the global
169        exception lock.</para>
170        <para>On x86-64 platforms where signal handling can be used to
171        handle synchronous exceptions, there's an additional
172        complication: the OS kernel ordinarily allocates the signal
173        context and siginfo structures on the stack of the thread
174        which received the signal; in practice, that means "wherever
175        RSP is pointing."  OpenMCL's conventions require that the thread's value
176        stack - where RSP is usually pointing while lisp code is
177        running - contain only "nodes" (properly tagged lisp objects),
178        and scribbling a signal context all over the value stack would
179        violate this requirement.  To maintain consistency, the
180        sigaltstack() mechanism is used to cause the signal to be
181        delivered on (and the signal context and siginfo to be
182        allocated on) a special stack area (the last few pages of the
183        thread's control stack, in practice.)  When the signal handler
184        runs, it (carefully) copies the signal context and siginfo to
185        the thread's control stack and makes RSP point into that stack
186        before invoking the "real" signal handler.  (The effect of
187        this hack is that the "real" signal handler always runs on the
188        thread's control stack.)</para>
189        <para>Once the exception handler has obtained the global
190        exception lock, it uses the values of the signal number,
191        siginfo_t, and signal context arguments to determine the
192        (logical) cause of the exception.  Some exceptions may be
193        caused by factors that should generate lisp errors or other
194        serious conditions (stack overflow); if this is the case, the
195        kernel code may release the global exception lock and call out
196        to lisp code.  (The lisp code in question may need to repeat
197        some of the exception decoding process; in particular, it
198        needs to be able to interpret register values in the signal
199        context that it receives as an argument.)</para>
200        <para>In some cases, the lisp kernel exception handler may not
201        be able to recover from the exception (this is currently true
202        of some types of memory-access fault and is also true of traps
203        or illegal instructions that occur during foreign code
204        execution.)  In such cases, the kernel exception handler
205        reports the exception as "unhandled", and the kernel debugger
206        is invoked.</para>
207        <para>If the kernel exception handler identifies the
208        exception's cause as being a transient out-of-memory condition
209        (indicating that the current thread needs more memory to cons
210        in), it tries to make that memory available.  In some cases,
211        doing so involves invoking the GC.</para>
212      </sect2>
213
214      <sect2 id="Threads-comma---exceptions-comma---and-the-GC">
215        <title>Threads, exceptions, and the GC</title>
216        <para>OpenMCL's GC is not concurrent: when the GC is invoked
217        in response to an exception in a particular thread, all other
218        lisp threads must stop until the GC's work is done.  The
219        thread that triggered the GC iterates over the global TCR
220        list, sending each other thread a distinguished "suspend"
221        signal, then iterates over the list again, waiting for a
222        per-thread semaphore that indicates that the thread has
223        received the "suspend" signal and responded appropriately.
224        Once all other threads have acknowledged the request to
225        suspend themselves, the GC thread can run the GC proper (after
226        doing any necessary PC-lusering.)  Once the GC's completed its work, the
227        thread that invoked the GC iterates over the global TCR list,
228        raising a per-thread "resume" semaphore for each other
229        thread.</para>
230        <para>The signal handler for the asynchronous "suspend" signal
231        is entered with all asynchronous signals blocked.  It saves
232        its signal-context argument in a TCR slot, raises the TCR's
233        "suspend" semaphore, then waits on the TCR's "resume"
234        semaphore.</para>
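        <para>The suspend/resume handshake described above can be
        modelled with ordinary semaphores.  This is only a sketch
        under stated assumptions: the <literal>TCR</literal> class
        and all function names are hypothetical, a
        <literal>threading.Event</literal> stands in for delivery of
        the "suspend" signal, and a string stands in for the saved
        signal context.</para>
        <programlisting><![CDATA[
```python
import threading


class TCR:
    """Hypothetical stand-in for a thread context record."""

    def __init__(self, name):
        self.name = name
        self.suspend = threading.Semaphore(0)  # raised by the suspended thread
        self.resume = threading.Semaphore(0)   # raised by the GC thread
        self.suspend_requested = threading.Event()  # models the signal
        self.saved_context = None


def worker(tcr, log):
    # Model of a lisp thread that runs until the "suspend" signal arrives.
    tcr.suspend_requested.wait()
    # Model of the suspend signal handler:
    tcr.saved_context = "context of " + tcr.name  # save context in the TCR
    tcr.suspend.release()                         # acknowledge the request
    tcr.resume.acquire()                          # wait to be resumed
    log.append(tcr.name + " resumed")


def run_gc(other_tcrs, log):
    for tcr in other_tcrs:        # first pass: signal every other thread
        tcr.suspend_requested.set()
    for tcr in other_tcrs:        # second pass: wait for acknowledgements
        tcr.suspend.acquire()
    # Every other thread is now stopped with its context saved.
    log.append("gc ran with " + str(len(other_tcrs)) + " threads suspended")
    for tcr in other_tcrs:        # raise each thread's "resume" semaphore
        tcr.resume.release()


def demo():
    log = []
    tcrs = [TCR("t" + str(i)) for i in range(3)]
    threads = [threading.Thread(target=worker, args=(t, log)) for t in tcrs]
    for th in threads:
        th.start()
    run_gc(tcrs, log)
    for th in threads:
        th.join()
    return tcrs, log
```
]]></programlisting>
        <para>In the model, no worker can log that it resumed
        before the GC entry is logged, because each worker blocks on
        its "resume" semaphore until the GC thread raises it.</para>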
235        <para>The GC thread has access to the signal contexts of all
236        TCRs (including its own) at the time when the thread received
237        an exception or acknowledged a request to suspend itself.
238        This information (and information about stack areas in the TCR
239        itself) allows the GC to identify the "stack locations and
240        register contents" that are elements of the GC's root
241        set.</para>
242      </sect2>
243
244      <sect2 id="PC-lusering">
245        <title>PC-lusering</title>
246        <para>It's not quite accurate to say that OpenMCL's compiler
247        and runtime follow precise stack and register usage
248        conventions at all times; there are a few exceptions:</para>
249
250        <itemizedlist>
251          <listitem>
252<para>On both PPC and x86-64 platforms, consing isn't fully atomic.  It takes at least a few instructions to allocate an object in memory (and slap a header on it if necessary); if a thread is interrupted in the middle of that instruction sequence, the new object may or may not have been created or fully initialized at the point in time that the interrupt occurred.  (There are actually a few different states of partial initialization.)</para>
253</listitem>
254          <listitem>
255<para>On the PPC, the common act of building a lisp control stack frame involves allocating a four-word frame and storing three register values into that frame.  (The fourth word - the back pointer to the previous frame - is automatically set when the frame is allocated.)  The previous contents of those three words are unknown (there might have been a foreign stack frame at the same address a few instructions earlier), so interrupting a thread that's in the process of initializing a PPC control stack frame isn't GC-safe.</para>
256</listitem>
257          <listitem>
258<para>There are similar problems with the initialization of temp stack frames on the PPC.  (Allocation and initialization don't happen atomically, and the newly allocated stack memory may have undefined contents.)</para>
259</listitem>
260          <listitem>
261<para>The GC's write barrier has to be implemented atomically (i.e., both an intergenerational store and the update of a corresponding reference bit have to happen without interruption, or neither of these events can happen.)</para>
262</listitem>
263          <listitem>
264<para>There are a few more similar cases.</para>
265</listitem>
266       
267        </itemizedlist>
268
269        <para>Fortunately, the number of these non-atomic instruction sequences is
270small, and fortunately it's fairly easy for the interrupting thread
271to recognize when the interrupted thread is in the middle of such
272a sequence.  When this is detected, the interrupting thread modifies
273the state of the interrupted thread (modifying its PC and other
274registers) so that it is no longer in the middle of such a sequence
275(it's either backed out of it or the remaining instructions are
276emulated.)</para>
277        <para>This works because (a) many of the troublesome instruction sequences
278are PPC-specific and it's relatively easy to partially disassemble the
279instructions surrounding the interrupted thread's PC on the PPC and
280(b) those instruction sequences are heavily stylized and intended to
281be easily recognized.</para>
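        <para>The two repair strategies (backing out, or emulating
        the remaining instructions) can be illustrated with a toy
        model.  Everything here is an assumption made for the
        example: the three-step "allocation" sequence, the step
        names, and the dictionary that represents a thread's state
        are hypothetical and do not correspond to real PPC or x86-64
        instruction sequences.</para>
        <programlisting><![CDATA[
```python
# Hypothetical non-atomic sequence; thread["pc"] counts completed steps.
ALLOC_SEQUENCE = ("reserve_space", "write_header", "publish_pointer")


def run_step(thread, step):
    """Execute one (hypothetical) instruction of the sequence."""
    if step == "reserve_space":
        thread["free"] -= thread["request"]
    elif step == "write_header":
        thread["heap"][thread["free"]] = "header"
    elif step == "publish_pointer":
        thread["result"] = thread["free"]
    thread["pc"] += 1


def pc_luser(thread, emulate):
    """Move an interrupted thread out of the sequence.

    Returns False if the thread was not in the middle of the sequence.
    """
    pc = thread["pc"]
    if not 0 < pc < len(ALLOC_SEQUENCE):
        return False
    if emulate:
        # Emulate the remaining instructions on the thread's behalf.
        for step in ALLOC_SEQUENCE[pc:]:
            run_step(thread, step)
    else:
        # Back out: undo the completed steps and restart from the top.
        if pc >= 2:
            del thread["heap"][thread["free"]]
        thread["free"] += thread["request"]
        thread["pc"] = 0
    return True
```
]]></programlisting>
        <para>Either way, the interrupted thread ends up at a
        point where its state is consistent: before the first step
        or after the last one.</para>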
282      </sect2>
283    </sect1>
284
285    <sect1 id="Register-usage-and-tagging">
286      <title>Register usage and tagging</title>
287
288      <sect2 id="Register-usage-and-tagging-overview">
289        <title>Overview</title>
290        <para>Regardless of other details of its implementation, a
291        garbage collector's job is to partition the set of all
292        heap-allocated lisp objects (CONSes, STRINGs, INSTANCEs, etc.)
293        into two subsets.  The first subset contains all objects that
294        are transitively referenced from a small set of "root" objects
295        (the contents of the stacks and registers of all active
296        threads at the time the GC occurs and the values of some
297        global variables.)  The second subset contains everything
298        else: those lisp objects that are not transitively reachable
299        from the roots are garbage, and the memory occupied by garbage
300        objects can be reclaimed (since the GC has just proven that
301        it's impossible to reference them.)</para>
302        <para>The set of live, reachable lisp objects basically forms
303        the nodes of a (usually large) graph, with edges from each
304        node A to any other objects (nodes) that object A
305        references.</para>
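        <para>The partition into reachable objects and garbage can
        be sketched as a graph traversal from the roots.  This is a
        minimal model, not a description of how OpenMCL's GC is
        implemented: the heap is assumed to be a dictionary mapping
        each object's name to the names it references, and the
        function name is hypothetical.</para>
        <programlisting><![CDATA[
```python
def partition_heap(heap, roots):
    """Split `heap` into (live, garbage) sets of object names.

    `heap` maps each object name to the names it references (its
    outgoing edges); `roots` lists the names referenced directly from
    stacks, registers, or global variables.
    """
    live = set()
    pending = list(roots)
    while pending:
        name = pending.pop()
        if name in live or name not in heap:
            continue  # already visited, or not a heap-allocated object
        live.add(name)
        pending.extend(heap[name])  # follow every outgoing edge
    return live, set(heap) - live
```
]]></programlisting>
        <para>Note that an object which references a live object is
        not thereby live: only edges leading away from the roots
        matter.  Cycles are handled by the visited check.</para>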
306        <para>Some nodes in this graph can never have outgoing edges:
307        an array with a specialized numeric or character type usually
308        represents its elements in some (possibly more compact)
309        specialized way.  Some nodes may refer to lisp objects that
310        are never allocated in memory (FIXNUMs, CHARACTERs,
311        SINGLE-FLOATs on 64-bit platforms, etc.)  This latter class of
312        objects is sometimes called "immediates", but that's a little
313        confusing because the term "immediate" is sometimes used to
314        refer to things that can never be part of the big connectivity
315        graph (e.g., the "raw" bits that make up a floating-point
316        value, foreign address, or numeric value that needs to be used
317        - at least fleetingly - in compiled code.)</para>
318        <para>For the GC to be able to build the connectivity graph
319        reliably, it's necessary for it to be able to reliably tell
320        (a) whether or not a "potential root" - the contents of a
321        machine register or stack location - is in fact a node and (b)
322        for any node, whether it may have components that refer to
323        other nodes.</para>
324        <para>There's no reliable way to answer the first question on
325        stock hardware.  (If everything was a node, as might be the
326        case on specially microcoded "lisp machine" hardware, it
327        wouldn't even need to be asked.)  Since there's no way to just
328        look at a machine word (the contents of a machine register or
329        stack location) and tell whether or not it's a node or just
330        some random non-node value, we have to either adopt and
331        enforce strict conventions on register and stack usage or
332        tolerate ambiguity.</para>
333        <para>"Tolerating ambiguity" is an approach taken by some
334        ("conservative") GC schemes; by contrast, OpenMCL's GC is
335        "precise", which in this case means that it believes that the
336        contents of certain machine registers and stack locations are
337        always nodes and that other registers and stack locations are
338        never nodes and that these conventions are never violated by
339        the compiler or runtime system.  The fact that threads are
340        preemptively scheduled means that a GC could occur (because of
341        activity in some other thread) on any instruction boundary,
342        which in turn means that the compiler and runtime system must
343        follow precise register and stack usage conventions at all times.</para>
344        <para>Once we've decided that a given machine word is a node,
345        a tagging scheme describes how the node's value and type are encoded in that
346        machine word.</para>
347        <para>Most of this - so far - has discussed things from the
348        GC's very low-level perspective.  From a much higher point of
349        view, lisp functions accept nodes as arguments, return nodes
350        as values, and (usually) perform some operations on those
351        arguments in order to produce those results.  (In many cases,
352        the operations in question involve raw non-node values.)
353        Higher-level parts of the lisp type system (functions like
354        TYPE-OF and CLASS-OF, etc.) depend on the tagging scheme.</para>
355      </sect2>
356
357      <sect2 id="pc-locatives-on-the-PPC">
358        <title>pc-locatives on the PPC</title>
359        <para>On the PPC, there's a third case (besides "node" and
360        "immediate" values).  As discussed below, a node that denotes
361        a memory-allocated lisp object is a biased (tagged) pointer
362        -to- that object; it's not generally possible to point -into-
363        some composite (multi-element) object (such a pointer would
364        not be a node, and the GC would have no way to update the
365        pointer if it were to move the underlying object.)</para>
366        <para>Such a pointer ("into" the interior of a heap-allocated
367        object) is often called a <emphasis>locative</emphasis>; the
368        cases where locatives are allowed in OpenMCL mostly involve
369        the behavior of function call and return instructions.  (To be
370        technically accurate, the other case also arises on x86-64, but
371        that case isn't as user-visible.)</para>
372        <para>On the PowerPC (both PPC32 and PPC64), all machine
373        instructions are 32 bits wide and all instruction words are
374        allocated on 32-bit boundaries.  In PPC OpenMCL, a CODE-VECTOR
375        is a specialized type of vector-like object; its elements are
376        32-bit PPC machine instructions.  A CODE-VECTOR is an
377        attribute of a FUNCTION object; a function call involves
378        accessing the function's code-vector and jumping to the
379        address of its first instruction.</para>
380        <para>As each instruction in the code vector sequentially
381        executes, the hardware program counter (PC) register advances
382        to the address of the next instruction (a locative into the
383        code vector); since PPC instructions are always 32 bits wide
384        and aligned on 32-bit boundaries, the low two bits of the PC
385        are always 0.  If the function executes a call (simple call
386        instructions have the mnemonic "bl" on the PPC, which stands
387        for "branch and link"), the address of the next instruction
388        (also a word-aligned locative into a code-vector) is copied
389        into the special-purpose PPC "link register" (lr); a function
390        returns to its caller via a "branch to link register" (blr)
391        instruction.  Some cases of function call and return might
392        also use the PPC's "count register" (ctr), and if either the
393        lr or ctr needs to be stored in memory it needs to first be
394        copied to a general-purpose register.</para>
395        <para>OpenMCL's GC understands that certain registers contain
396        these special "pc-locatives" (locatives that point into
397        CODE-VECTOR objects); it contains special support for finding
398        the containing CODE-VECTOR object and for adjusting all of
399        these "pc-locatives" if the containing object is moved in
400        memory.  The first part of that - finding the containing
401        object - is possible and practical on the PPC because of
402        architectural artifacts (fixed-width instructions and arcana
403        of instruction encoding.)  It's not possible on x86-64, but
404        fortunately not necessary either (though the second part -
405        adjusting the PC/RIP when the containing object moves - is both
406        necessary and simple.)</para>
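        <para>The second part (adjusting a pc-locative when its
        code vector moves) amounts to preserving the locative's
        offset from the start of the vector.  The sketch below is a
        model only: the <literal>CodeVector</literal> class and the
        function names are hypothetical, and the containing vector is
        passed in explicitly rather than being discovered from the
        instruction stream as the GC has to do on the PPC.</para>
        <programlisting><![CDATA[
```python
PPC_INSTRUCTION_BYTES = 4  # all PPC instructions are 32 bits wide


class CodeVector:
    """Hypothetical model of a code vector at a given base address."""

    def __init__(self, base, n_instructions):
        self.base = base
        self.size = n_instructions * PPC_INSTRUCTION_BYTES

    def contains(self, pc):
        return self.base <= pc < self.base + self.size


def is_word_aligned(pc):
    """The low two bits of a PPC instruction address are always 0."""
    return (pc & 0b11) == 0


def relocate_pc(pc, vector, new_base):
    """Return the pc-locative's new value after `vector` moves."""
    if not is_word_aligned(pc) or not vector.contains(pc):
        raise ValueError("not a pc-locative into this code vector")
    offset = pc - vector.base  # the offset is preserved by the move
    return new_base + offset
```
]]></programlisting>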
407      </sect2>
408
409      <sect2 id="Register-and-stack-usage-conventions">
410        <title>Register and stack usage conventions</title>
411
412        <sect3 id="Stack-conventions">
413          <title>Stack conventions</title>
414          <para>On both PPC and X86 platforms, each lisp thread uses 3
415          stacks; the ways in which these stacks are used differ
416          between the PPC and X86.</para>
417          <para>Each thread has:</para>
418          <itemizedlist>
419            <listitem>
420              <para>A "control stack".  On both platforms, this is
421              "the stack" used by foreign code.  On the PPC, it
422              consists of a linked list of frames where the first word
423              in each frame points to the first word in the previous
424              frame (and the outermost frame points to 0.)  Some
425              frames on a PPC control stack are lisp frames; lisp
426              frames are always 4 words in size and contain (in
427              addition to the back pointer to the previous frame) the
428              calling function (a node), the return address (a
429              "locative" into the calling function's code-vector), and
430              the value to which the value-stack pointer (see below)
431              should be restored on function exit.  On the PPC, the GC
432              has to look at control-stack frames, identify which of
433              those frames are lisp frames, and treat the contents
434              of the saved function slot as a node (and handle the
435              return address locative specially.)  On x86-64, the
436              control stack is used for dynamic-extent allocation of
437              immediate objects.  Since the control stack never
438              contains nodes on x86-64, the GC ignores it on that
439              platform.  Alignment of the control stack follows the
440              ABI conventions of the platform (at least at any point
441              in time where foreign code could run.)  On PPC, the r1
442              register always points to the top of the current
443              thread's control stack; on x86-64, the RSP register
444              points to the top of the current thread's control stack
445              when the thread is running foreign code and the address
446              of the top of the control stack is kept in the thread's
447              TCR when not running foreign code.  The control
448              stack "grows down."</para>
449            </listitem>
450            <listitem>
451              <para>A "value stack".  On both platforms, all values on
452              the value stack are nodes (including "tagged return
453              addresses" on x86-64.)  The value stack is always
454              aligned to the native word size; objects are always
455              pushed on the value stack using atomic instructions
456              ("stwu"/"stdu" on PPC, "push" on x86-64), so the
457              contents of the value stack between its bottom and top
458              are always unambiguously nodes; the compiler usually
459              tries to pop or discard nodes from the value stack as
460              soon as possible after their last use (as soon as they
461              may have become garbage.)  On x86-64, the RSP register
462              addresses the top of the value stack when running lisp
463              code; that address is saved in the TCR when running
464              foreign code.  On the PPC, a dedicated register (VSP,
465              currently r15) is used to address the top of the value
466              stack when running lisp code, and the VSP value is saved
467              in the TCR when running foreign code.  The value stack
468              grows down.</para>
469            </listitem>
470            <listitem>
471              <para>A "temp stack".  The temp stack consists of a
472              linked list of frames, each of which points to the
473              previous temp stack frame.  The number of native machine
474              words in each temp stack frame is always even, so the
475              temp stack is aligned on a two-word (64- or 128-bit)
476              boundary.  The temp stack is used for dynamic-extent
477              objects on both platforms; on the PPC, it's used for
478              essentially all such objects (regardless of whether or
479              not the objects contain nodes); on the x86-64, immediate
480              dynamic-extent objects (strings, foreign pointers, etc.)
481              are allocated on the control stack and only
482              node-containing dynamic-extent objects are allocated on
483              the temp stack.  Data structures used to implement CATCH
484              and UNWIND-PROTECT are stored on the temp stack on both
485              ppc and x86-64.  Temp stack frames are always doublenode
486              aligned and objects within a temp stack frame are
487              aligned on doublenode boundaries.  The first word in
488              each frame contains a back pointer to the previous
489              frame; on the PPC, the second word is used to indicate
490              to the GC whether the remaining objects are nodes (if
491              the second word is 0) or immediate (otherwise.)  On
492              x86-64, where temp stack frames always contain nodes,
493              the second word is always 0.  The temp stack grows down.
              It usually takes several instructions to allocate and
495              safely initialize a temp stack frame that's intended to
496              contain nodes, and the GC has to recognize the case
497              where a thread is in the process of allocating and
498              initializing a temp stack frame and take care not to
499              interpret any uninitialized words in the frame as nodes.
500              See (someplace).  The PPC keeps the current top of the
501              temp stack in a dedicated register (TSP, currently r12)
502              when running lisp code and saves this register's value
503              in the TCR when running foreign code.  The x86-64 keeps
504              the address of the top of each thread's temp stack in
505              the thread's TCR.</para>
506            </listitem>
507          </itemizedlist>
508        </sect3>
509
510        <sect3 id="Register-conventions">
511          <title>Register conventions</title>
512          <para>If there are a "reasonable" (for some value of
          "reasonable") number of general-purpose registers and the
514          instruction set is "reasonably" orthogonal (most
515          instructions that operate on GPRs can operate on any GPR),
516          then it's possible to statically partition the GPRs into at
517          least two sets: "immediate registers" never contain nodes,
518          and "node registers" always contain nodes.  (On the PPC, a
519          few registers are members of a third set of "PC locatives",
520          and on both platforms some registers may have dedicated
521          roles as stack or heap pointers; the latter class is treated
522          as immediates by the GC proper but may be used to help
523          determine the bounds of stack and heap memory areas.)</para>
524          <para>The ultimate definition of register partitioning is
525          hardwired into the GC in functions like "mark_xp()" and
526          "forward_xp()", which process the values of some of the
527          registers in an exception frame as nodes and may give some
528          sort of special treatment to other register values they
          encounter there.</para>
530          <para>On x86-64, the static register partitioning scheme involves:</para>
531          <itemizedlist>
532            <listitem>
              <para>(only) two "immediate" registers.  The RAX and RDX
              registers are used as the implicit operands and results
              of some extended-precision multiply and divide
              instructions which generally involve non-node values;
              since their use in these instructions means that they
              can't be guaranteed to contain node values at all times,
              it's natural to put these registers in the "immediate"
              set.  RAX is generally given the symbolic name
              "imm0", and RDX is given the symbolic name "imm1"; you
              may see these names in disassembled code, usually in
              operations involving type checking, array indexing, and
              foreign memory and function access.</para>
545            </listitem>
546            <listitem>
              <para>(only) two "dedicated" registers.  RSP and RBP have
              dedicated functionality dictated by the hardware and
              calling conventions.  (There are a few places where RBP
              is temporarily used as an extra immediate
              register.)</para>
552            </listitem>
553            <listitem>
              <para>12 "node" registers.  All other registers (RBX, RCX,
              RSI, RDI, and R8-R15) are asserted to contain node values
              at (almost) all times; legacy "string" operations that
              implicitly use RSI and/or RDI are not used.  Shift and
              rotate instructions which shift/rotate by a variable
              number of bits are required by the architecture to use
              the low byte of RCX (the traditional CL register) as the
              implicit shift count; when it's necessary to keep a
              non-node shift count in the low byte of RCX, the upper 7
              bytes of the register are zeroed, so that
              misinterpretation of the immediate value in RCX as a node
              will not have negative GC effects.  (The GC might briefly
              treat it as a node, but since it's not pointing
              anywhere near the lisp heap it'll soon lose interest in
              it.)  Legacy instructions that use RCX (or some portion
              of it) as a loop counter cannot be used (since such
              instructions might introduce non-node values into
              RCX.)</para>
572</listitem>
573          </itemizedlist>
574          <para>On the PPC, the static register partitioning scheme involves:</para>
575
576          <itemizedlist>
577            <listitem>
              <para>6 "immediate" registers.  Registers r3-r8 are given
              the symbolic names imm0-imm5.  As a RISC architecture
              with simpler addressing modes, the PPC probably
              uses immediate registers a bit more often than the CISC
              x86-64 does, but they're generally used for the same sort
              of things (type checking, array indexing, FFI,
              etc.)</para>
585            </listitem>
586            <listitem>
587              <para>9 dedicated registers
588              <itemizedlist>
589                <listitem>
                  <para>r0 (symbolic name rzero) always contains the
                  value 0 when running lisp code.  Its value is
                  sometimes read as 0 when it's used as the base
                  register in a memory address; keeping the value 0
                  there is sometimes convenient and avoids
                  asymmetry.</para>
596                </listitem>
597                <listitem>
598                  <para>r1 (symbolic name sp) is the control stack
599                  pointer, by PPC convention.</para>
600                </listitem>
601                <listitem>
                  <para>r2 is used to hold the current thread's TCR on
                  ppc64 systems; it's not used on ppc32.</para>
604                </listitem>
605                <listitem>
606                  <para>r9 and r10 (symbolic names allocptr and
607                  allocbase) are used to do per-thread memory
                  allocation.</para>
609                </listitem>
610                <listitem>
611                  <para>r11 (symbolic name nargs) contains the number
612                  of function arguments on entry and the number of
613                  return values in multiple-value returning
614                  constructs.  It's not used more generally as either
615                  a node or immediate register because of the way that
616                  certain trap instruction encodings are
617                  interpreted.</para>
618                </listitem>
619                <listitem>
620                  <para>r12 (symbolic name tsp) holds the top of the current thread's temp stack.</para>
621                </listitem>
622                <listitem>
                  <para>r13 is used to hold the TCR on PPC32 systems; it's not used on PPC64.</para>
624                </listitem>
625                <listitem>
626                  <para>r14 (symbolic name loc-pc) is used to copy
627                  "pc-locative" values between main memory and
628                  special-purpose PPC registers (LR and CTR) used in
629                  function-call and return instructions.</para>
630                </listitem>
631                <listitem>
632                  <para>r15 (symbolic name vsp) addresses the top of
633                  the current thread's value stack.</para>
634                </listitem>
635                <listitem>
636                  <para>lr and ctr are PPC branch-unit registers used
637                  in function call and return instructions; they're
638                  always treated as "pc-locatives", which precludes
639                  the use of the ctr in some PPC looping
640                  constructs.</para>
641                </listitem>
642             
643              </itemizedlist>
644              </para>
645            </listitem>
646            <listitem>
              <para>17 "node" registers.  r15-r31 are always treated as
              node registers.</para>
649            </listitem>
650           
651          </itemizedlist>
652        </sect3>
653      </sect2>
654
655      <sect2 id="Tagging-scheme">
656        <title>Tagging scheme</title>
657        <para>OpenMCL always allocates lisp objects on double-node
658        (64-bit for 32-bit platforms, 128-bit for 64-bit platforms)
        boundaries; this means that the low 3 bits (32-bit lisp) or 4
660        bits (64-bit lisp) are always 0 and are therefore redundant
661        (we only really need to know the upper 29 or 60 bits in order
662        to identify the aligned object address.)  The extra bits in a
663        lisp node can be used to encode at least some information
664        about the node's type, and the other 29/60 bits represent
665        either an immediate value or a doublenode-aligned memory
666        address.  The low 3 or 4 bits of a node are called the node's
667        "tag bits", and the conventions used to encode type
668        information in those tag bits are called a "tagging
669        scheme."</para>
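        <para>As a rough illustration of the split between tag bits
        and address bits, the following Python sketch separates a
        node into its low tag bits and the remaining doublenode-aligned
        bits.  The function name and the sample bit pattern are
        hypothetical and are not taken from the OpenMCL sources.</para>
        <programlisting><![CDATA[
```python
def split_node(node, tag_bits):
    """Return (tag, aligned_bits) for a node with the given number of tag bits."""
    tag_mask = (1 << tag_bits) - 1
    return node & tag_mask, node & ~tag_mask


# The same bit pattern viewed as a 32-bit-lisp node (3 tag bits)
# and as a 64-bit-lisp node (4 tag bits).
tag32, rest32 = split_node(0x1234D, 3)   # tag 5, remaining bits 0x12348
tag64, rest64 = split_node(0x1234D, 4)   # tag 13, remaining bits 0x12340
```
]]></programlisting>
        <para>In both cases the remaining bits form a multiple of 8 or
        16, which is what lets them stand for a doublenode-aligned
        address.</para>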
670        <para>It might be possible to use the same tagging scheme on
671        all platforms (at least on all platforms with the same word
672        size and/or the same number of available tag bits), but there
673        are often some strong reasons for not doing so.  These
674        arguments tend to be very machine-specific: sometimes, there
675        are fairly obvious machine-dependent tricks that can be
676        exploited to make common operations on some types of tagged
677        objects faster; other times, there are architectural
678        restrictions that make it impractical to use certain tags for
679        certain types.  (On PPC64, the "ld" (load doubleword) and
680        "std" (store doubleword) instructions - which load and store a
681        GPR operand at the effective address formed by adding the
682        value of another GPR operand and a 16-bit constant operand -
683        require that the low two bits of that constant operand be 0.
684        Since such instructions would typically be used to access the
685        fields of things like CONS cells and structures, it's
        desirable that the tags chosen for CONS cells and
        structures allow the use of these instructions as opposed to
688        more expensive alternatives.)</para>
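        <para>The PPC64 restriction can be explored with a small
        Python sketch.  It assumes that an object's fields are
        word-sized and start at the doublenode-aligned address, so
        that the displacement from a tagged pointer to field i is
        (i * 8) minus the tag.  It only enumerates which 4-bit tag
        values would satisfy the constraint; it makes no claim about
        which tags OpenMCL actually assigns.  All names are
        hypothetical.</para>
        <programlisting><![CDATA[
```python
WORD_BYTES = 8          # PPC64 native word size
TAG_BITS = 4            # 64-bit lisp: four tag bits


def ds_form_usable(tag, n_fields=4):
    """True if every word-sized field is reachable with ld/std for this tag."""
    # The displacement to field i from a tagged pointer is (i * 8) - tag;
    # ld/std require the low two bits of that displacement to be 0.
    return all((i * WORD_BYTES - tag) % 4 == 0 for i in range(n_fields))


usable_tags = [tag for tag in range(1 << TAG_BITS) if ds_form_usable(tag)]
```
]]></programlisting>
        <para>Under these assumptions only tags that are multiples of
        4 survive, which is why the choice of tags for frequently
        accessed object types is constrained on this
        platform.</para>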
        <para>One architecture-independent tagging trick that works well
        on all architectures is to use a tag of 0 for FIXNUMs: a
691        fixnum basically encodes its value shifted left a few bits and
692        keeps those low bits clear. FIXNUM addition, subtraction, and
693        binary logical operations can operate directly on the node
694        operands, addition and subtraction can exploit hardware-based
695        overflow detection, and (in the absence of overflow) the
696        hardware result of those operations is a node (fixnum).  Some
697        other slightly-less-common operations may require a few extra
698        instructions, but arithmetic operations on FIXNUMs should be
699        as cheap as possible and using a tag of zero for FIXNUMs helps
700        to ensure that it will be.</para> 
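        <para>A minimal Python sketch of why a zero tag makes fixnum
        arithmetic cheap, assuming a 32-bit lisp where the low two
        bits of a fixnum are 0.  The helper names are hypothetical, and
        Python integers do not overflow, so hardware overflow
        detection is not modelled here.</para>
        <programlisting><![CDATA[
```python
FIXNUM_SHIFT = 2    # assumption: 32-bit lisp, low two bits of a fixnum are 0


def box(n):
    """Encode an integer as a fixnum node (value shifted left, tag bits 0)."""
    return n << FIXNUM_SHIFT


def unbox(node):
    """Recover the integer from a fixnum node."""
    return node >> FIXNUM_SHIFT


a, b = box(6), box(3)
total = a + b            # already a fixnum node: no untagging needed
difference = a - b
conjunction = a & b
disjunction = a | b
product = unbox(a) * b   # multiplication needs one operand untagged first
```
]]></programlisting>
        <para>Addition, subtraction, and the logical operations work
        on the tagged words directly; multiplication is an example of
        an operation that needs an extra instruction to untag one
        operand.</para>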
701        <para>If we have N available tag bits (N = 3 for 32-bit
702        OpenMCL and N = 4 for 64-bit OpenMCL), this way of
703        representing fixnums with the low M bits forced to 0 works as
704        long as M &lt;= N.  The smaller we make M, the larger the
        magnitudes of MOST-POSITIVE-FIXNUM and MOST-NEGATIVE-FIXNUM
        become; the larger we make M, the more distinct non-FIXNUM tags
        become available.  A reasonable compromise is to choose M = N-1;
        this basically yields two distinct FIXNUM tags (one for even
        fixnums, one for odd fixnums), gives 30-bit fixnums on 32-bit
        platforms and 61-bit fixnums on 64-bit platforms, and leaves
        us with 6 or 14 tags to encode other types.</para>
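        <para>The arithmetic behind the M = N-1 compromise can be
        checked with a short Python sketch.  The function and key
        names are hypothetical; the numbers follow directly from the
        word size and the number of tag bits.</para>
        <programlisting><![CDATA[
```python
def fixnum_layout(word_bits, n):
    """Describe the fixnum encoding when M = N - 1 low bits are forced to 0."""
    m = n - 1
    fixnum_bits = word_bits - m
    fixnum_tags = 2 ** (n - m)          # tag values that denote a fixnum
    return {
        "fixnum_bits": fixnum_bits,
        "fixnum_tags": fixnum_tags,
        "other_tags": 2 ** n - fixnum_tags,
        "most_positive_fixnum": 2 ** (fixnum_bits - 1) - 1,
        "most_negative_fixnum": -(2 ** (fixnum_bits - 1)),
    }


layout32 = fixnum_layout(32, 3)   # 30-bit fixnums, 6 other tags
layout64 = fixnum_layout(64, 4)   # 61-bit fixnums, 14 other tags
```
]]></programlisting>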
712        <para>Once we get past the assignment of FIXNUM tags, things
713        quickly devolve into machine-dependencies.  We can fairly
        easily see that we can't directly tag all other primitive lisp
        object types with only 6 or 14 available tag values; the
716        details of how types are encoded vary between the ppc32,
717        ppc64, and x86-64 implementations, but there are some general
718        common principles:</para>
719
720        <itemizedlist>
721          <listitem>
            <para>CONS cells always contain exactly 2 elements and are
            usually fairly common.  It therefore makes sense to give
            CONS cells their own tag.  Unlike the fixnum case - where a
            tag value of 0 had positive implications - there doesn't
            seem to be any advantage to using any particular value.
            (A long time ago - in the case of 68K MCL - the CONS tag
            and the order of CAR and CDR in memory were chosen to allow
            smaller, cheaper addressing modes to be used to "cdr down a
            list."  That's not a factor on ppc or x86-64, but all
            versions of OpenMCL still store the CDR of a CONS cell
            first in memory.  It doesn't matter, but doing it the way
            that the host system did made bootstrapping to a new target
            system a little easier.)
            </para>
736          </listitem>
737          <listitem>
            <para>Any way you look at it, NIL is a bit ... unusual.  NIL
            is both a SYMBOL and a LIST (as well as being a canonical
            truth value and probably a few other things.)  Its role as
            a LIST is probably much more important to most programs
            than its role as a SYMBOL is: LISTP has to be true of NIL
            and primitives like CAR and CDR do LISTP implicitly when
            safe and want that operation to be fast.</para>
            <para>There are several possible approaches to this; OpenMCL
            uses two of them.  On PPC32 and X86-64, NIL is basically a
            weird CONS cell that straddles two doublenodes; the tag of
            NIL is unique and congruent modulo 4 (modulo 8 on 64-bit)
            with the tag used for CONS cells.  LISTP is therefore true
            of any node whose low 2 (or 3) bits contain the appropriate
            tag value (it's not otherwise necessary to special-case
            NIL.)  SYMBOL accessors (SYMBOL-NAME, SYMBOL-VALUE,
            SYMBOL-PLIST ...) -do- have to special-case NIL (and access
            the components of an internal proxy symbol.)</para>
            <para>On PPC64 (where architectural restrictions dictate the
            set of tags that can be used to access fixed components of
            an object), that approach wasn't practical.  NIL is just a
            distinguished SYMBOL, and it just happens to be the case
            that its pname slot and value slot are at the same offsets
            from a tagged pointer as a CONS cell's CDR and CAR would be.
            NIL's pname is set to NIL (SYMBOL-NAME checks for this and
            returns the string "NIL"), and LISTP (and therefore safe
            CAR and CDR) has to check for (OR NULL CONSP).  At least in
            the case of CAR and CDR, the fact that the PPC has multiple
            condition-code fields keeps that extra test from
            being prohibitively expensive.</para>
767          </listitem>
768          <listitem>
            <para>Some objects are immediate (but not FIXNUMs).  This is
            true of CHARACTERs and, on 64-bit platforms,
            SINGLE-FLOATs.  It's also true of some nodes used in the
            runtime system (special values used to indicate unbound
            variables and slots, for instance.)  On 64-bit platforms,
            SINGLE-FLOATs have their own unique tag (making them a
            little easier to recognize); on all platforms, CHARACTERs
            share a tag with other immediate objects (unbound markers)
            but are easy to recognize (by looking at several of their
            low bits.)  The GC treats any node with an immediate tag
            (and any node with a fixnum tag) as a leaf.</para>
780          </listitem>
781          <listitem>
            <para>There are some advantages to treating everything
            else - memory-allocated objects that aren't CONS cells -
            uniformly.  There are some disadvantages to that uniform
            treatment as well, and the treatment of "memory-allocated
            non-CONS objects" isn't entirely uniform across all
            OpenMCL implementations.  Let's first pretend that
            the treatment is uniform, then discuss the ways in which it
            isn't.</para>
            <para>The "uniform approach" is to treat all
            memory-allocated non-CONS objects as if they were vectors;
            this use of the term is a little looser than what's implied
            by the CL VECTOR type.  OpenMCL actually uses the
            term "uvector" to mean "a memory-allocated lisp object
            other than a CONS cell, whose first word is a header which
            describes the object's type and the number of elements that
            it contains."  In this view, a SYMBOL is a UVECTOR, as is a
            STRING, a STANDARD-INSTANCE, a CL array or vector, a
            FUNCTION, and even a DOUBLE-FLOAT.</para>
            <para>In the PPC implementations (where things are a little
            more ... uniform), a single tag value is used to denote any
            uvector; in order to determine something more specific
            about the type of the object in question, it's necessary to
            fetch the low byte of the header word from memory.  On
            the x86-64 platform, certain types of uvectors - SYMBOLs
            and FUNCTIONs - are given their own unique tags.  The good
            news about the x86-64 approach is that SYMBOLs and
            FUNCTIONs can be recognized without referencing memory; the
            slightly bad news is that primitive operations that work on
            UVECTOR-tagged objects - like the function CCL:UVREF -
            don't work on SYMBOLs or FUNCTIONs on x86-64 (but -do- work
            on those types of objects in the PPC ports.)</para>
            <para>The header word which precedes a UVECTOR's data in
            memory contains 8 bits of type information in the low byte
            and either 24 or 56 bits of "element-count" information in
            the rest of the word.  (This is where the sometimes-limiting
            value of 2^24 for ARRAY-TOTAL-SIZE-LIMIT on PPC32 platforms
            comes from.)  The low byte of the header - sometimes called
            the uvector's subtag - is itself tagged (which means that
            the header is tagged.)  The (3 or 4) tag bits in the subtag
            are used to determine whether the uvector's elements are
            nodes or immediates.  (A UVECTOR whose elements are nodes is
            called a GVECTOR; a UVECTOR whose elements are immediates is
            called an IVECTOR.  This terminology came from Spice Lisp,
            which was a predecessor of CMUCL.)</para>
            <para>Even though a uvector header is tagged, a header is
            not a node.  There's no (supported) way to get your hands on
            one in lisp and doing so could be dangerous.  (If the value
            of a header wound up in a lisp node register and that
            register wound up getting pushed on a thread's value stack,
            the GC might misinterpret that situation to mean that there
            was a stack-allocated UVECTOR on the value stack.)</para>
832          </listitem>
833       
834        </itemizedlist>
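        <para>Two of the principles above lend themselves to a small
        Python sketch: LISTP as a single comparison on the low tag
        bits, and the split of a uvector header into subtag and element
        count.  The CONS tags (1 on PPC32, 3 on x86-64) are the values
        quoted with the allocation examples elsewhere in this chapter;
        the NIL tags, the subtag value, and all function names are
        assumptions chosen only to illustrate the congruence and the
        header layout.</para>
        <programlisting><![CDATA[
```python
# CONS tags come from this chapter's allocation examples; the NIL tags are
# ASSUMED values that merely satisfy the congruence described above.
PPC32 = {"tag_bits": 3, "cons": 1, "nil": 5}
X8664 = {"tag_bits": 4, "cons": 3, "nil": 11}


def listp(node, port):
    """LISTP as one comparison on the low 2 (or 3) bits of the node."""
    low_mask = (1 << (port["tag_bits"] - 1)) - 1
    return (node & low_mask) == (port["cons"] & low_mask)


def make_header(subtag, element_count):
    """Pack an element count above an 8-bit subtag."""
    return (element_count << 8) | subtag


def header_subtag(header):
    return header & 0xFF


def header_element_count(header):
    return header >> 8


def max_element_count(word_bits):
    """Largest count that fits in the bits above the subtag."""
    return 2 ** (word_bits - 8) - 1
```
]]></programlisting>
        <para>With a 32-bit header word the count field holds at most
        2^24 - 1 elements, which matches the 2^24 limit mentioned for
        PPC32.</para>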
835      </sect2>
836    </sect1>
837
838    <sect1 id="Heap-Allocation">
839      <title>Heap Allocation</title> <para>When the OpenMCL kernel
840      first starts up, a large contiguous chunk of the process's
841      address space is mapped as "anonymous, no access"
842      memory. ("Large" means different things in different contexts;
843      on LinuxPPC32, it means "about 1 gigabyte", on DarwinPPC32, it
844      means "about 2 gigabytes", and on current 64-bit platforms it
845      ranges from 128 to 512 gigabytes, depending on OS. These values
846      are both defaults and upper limits; the --heap-reserve
847      argument can be used to try to reserve less than the
848      default.)</para>
849      <para>Reserving address space that can't (yet) be read or
850      written to doesn't cost much; in particular, it doesn't require
      that corresponding swap space or physical memory be available.
      Marking the address range as being "mapped" helps to ensure that
      other things (results from random calls to malloc(), dynamically
854      loaded shared libraries) won't be allocated in this region that
855      lisp has reserved for its own heap growth.</para>
856      <para>A small portion (around 1/32 on 32-bit platforms and 1/64
857      on 64-bit platforms) of that large chunk of address space is
      reserved for GC data structures.  Memory pages reserved for
      these data structures are mapped read-write as pages are made
      writable in the main portion of the heap.</para>
861      <para>The initial heap image is mapped into this reserved
862      address space and an additional (LISP-HEAP-GC-THRESHOLD) bytes
863      are mapped read-write.  GC data structures grow to match the
864      amount of GC-able memory in the initial image + the gc
865      threshold, and control is transferred to lisp code.  Inevitably,
866      that code spoils everything and starts consing; there are
867      basically three layers of memory allocation that can go
868      on.</para>
869
870      <sect2 id="Per-thread-object-allocation">
871        <title>Per-thread object allocation</title>
872        <para>Each lisp thread has a private "reserved memory
873        segment"; when a thread starts up, its reserved memory segment
        is empty.  PPC ports maintain the highest unallocated address
        and the lowest allocated address in the current segment in
        registers when running lisp code; on x86-64, these values are
        maintained in the current thread's TCR.  (An "empty" heap
        segment is one whose high pointer and low pointer are equal.)
        When a thread is not in the middle of allocating something, the
        low 3 or 4 bits of the high and low pointers are clear (the
        pointers are doublenode-aligned.)</para>
882        <para>A thread tries to allocate an object whose physical size
883        in bytes is X and whose tag is Y by:</para>
884        <orderedlist>
885          <listitem>
886            <para>decrementing the "high" pointer by (- X Y)</para>
887          </listitem>
888          <listitem>
889            <para>trapping if the high pointer is less than the low
890            pointer</para>
891          </listitem>
892          <listitem>
893            <para>using the (tagged) high pointer to initialize the
894            object, if necessary</para>
895          </listitem>
896          <listitem>
897            <para>clearing the low bits of the high pointer</para>
898          </listitem>
899        </orderedlist>
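        <para>The four steps above can be sketched in Python as
        follows.  This is a simulation under stated assumptions, not
        kernel code: the class and exception names are hypothetical,
        object initialization is left to the caller, and the "trap"
        is modelled by undoing the decrement and raising an
        exception, whereas the real trap handler completes the
        allocation on the thread's behalf.</para>
        <programlisting><![CDATA[
```python
class AllocationTrap(Exception):
    """Stands in for the hardware trap taken when the segment is exhausted."""


class Segment:
    def __init__(self, low, high, tag_bits):
        self.low = low
        self.high = high
        self.tag_mask = (1 << tag_bits) - 1

    def allocate(self, size, tag):
        """Return a tagged pointer to `size` fresh bytes, or raise AllocationTrap."""
        self.high -= size - tag               # 1. decrement by (- X Y)
        if self.high < self.low:              # 2. trap if high is below low
            self.high += size - tag           #    (sketch only: undo and report)
            raise AllocationTrap(size)
        obj = self.high                       # 3. tagged pointer, ready for init
        self.high &= ~self.tag_mask           # 4. clear the low (tag) bits
        return obj


# A 16-byte segment on a 32-bit lisp holds exactly two 8-byte CONS cells.
segment = Segment(low=0x1000, high=0x1010, tag_bits=3)
first = segment.allocate(8, 1)
second = segment.allocate(8, 1)
```
]]></programlisting>
        <para>Because the decrement already folds in the tag, the
        high pointer doubles as the tagged object pointer between
        steps 1 and 4, which is why its low bits are nonzero only
        while an allocation is in progress.</para>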
900        <para>On PPC32, where the size of a CONS cell is 8 bytes and
901        the tag of a CONS cell is 1, machine code which sets the arg_z
902        register to the result of doing (CONS arg_y arg_z) looks
903        like:</para>
        <programlisting>
  (SUBI ALLOCPTR ALLOCPTR 7)    ; decrement the high pointer by (- 8 1)
  (TWLLT ALLOCPTR ALLOCBASE)    ; trap if the high pointer is below the base
  (STW ARG_Z -1 ALLOCPTR)       ; set the CDR of the tagged high pointer
  (STW ARG_Y 3 ALLOCPTR)        ; set the CAR
  (MR ARG_Z ALLOCPTR)           ; arg_z is the new CONS cell
  (RLWINM ALLOCPTR ALLOCPTR 0 0 28)     ; clear tag bits
</programlisting>
912        <para>On x86-64, the idea's similar but the implementation is
913        different.  The high and low pointers to the current thread's
914        reserved segment are kept in the TCR, which is addressed by
915        the gs segment register. An x86-64 CONS cell is 16 bytes wide
916        and has a tag of 3; we canonically use the temp0 register to
        initialize the object:</para>
        <programlisting>
  (subq ($ 13) ((% gs) 216))    ; decrement allocptr
  (movq ((% gs) 216) (% temp0)) ; load allocptr into temp0
  (cmpq ((% gs) 224) (% temp0)) ; compare to allocbase
  (jg L1)                       ; skip trap
  (uuo-alloc)                   ; uh, don't skip trap
L1
  (andb ($ 240) ((% gs) 216))   ; untag allocptr in the tcr
  (movq (% arg_y) (5 (% temp0))) ; set the car
  (movq (% arg_z) (-3 (% temp0))); set the cdr
  (movq (% temp0) (% arg_z))    ; return the cons
        </programlisting>
930        <para>If we don't take the trap (if allocating 8-16 bytes
931        doesn't exhaust the thread's reserved memory segment), that's
932        a fairly short and simple instruction sequence.  If we do take
933        the trap, we'll have to do some additional work in order to
934        get a new segment for the current thread.</para>
935      </sect2>
936
937      <sect2 id="Allocation-of-reserved-heap-segments">
938        <title>Allocation of reserved heap segments</title>
939        <para>After the lisp image is first mapped into memory - and after
940        each full GC - the lisp kernel ensures that
        (LISP-HEAP-GC-THRESHOLD) additional bytes beyond the current
942        end of the heap are mapped read-write.</para>
943        <para>If a thread traps while trying to allocate memory, the
944        thread goes through the usual exception-handling protocol (to
        ensure that any other thread that GCs "sees" the state of the
        trapping thread and to serialize exception handling.)  When
        the exception handler runs, it determines the nature and size
        of the failed allocation and tries to complete the allocation
        on the thread's behalf (and leave it with a reasonably large
        thread-specific memory segment so that the next small
        allocation is unlikely to trap.)</para>
952        <para>Depending on the size of the requested segment
953        allocation, the number of segment allocations that have
954        occurred since the last GC, and the EGC and GC thresholds, the
955        segment allocation trap handler may invoke a full or ephemeral
956        GC before returning a new segment.  It's worth noting that the
957        [E]GC is triggered based on the number of and size of these
958        segments that've been allocated since the last GC; it doesn't
959        have much to do with how "full" each of those per-thread
        segments is.  It's possible for a large number of threads to
961        do fairly incidental memory allocation and trigger the GC as a
962        result; avoiding this involves tuning the per-thread
963        allocation quantum and the GC/EGC thresholds
964        appropriately.</para>
965      </sect2>
966
967      <sect2 id="Heap-growth">
968        <title>Heap growth</title>
969        <para>All OSes on which OpenMCL currently runs use an
970        "overcommit" memory allocation strategy by default (though
971        some of them provide ways of overriding that default.)  What
972        this means in general is that the OS doesn't necessarily
973        ensure that backing store is available when asked to map pages
974        as read-write; it'll often return a success indicator from the
975        mapping attempt (mapping the pages as "zero-fill,
976        copy-on-write"), and only try to allocate the backing store
977        (swap space and/or physical memory) when non-zero contents are
978        written to the pages.</para>
979        <para>It -sounds- like it'd be better to have the mmap() call
980        fail immediately, but it's actually a complicated issue.
981        (It's possible that other applications will stop using some
982        backing store before lisp code actually touches the pages that
983        need it, for instance.)  It's also not guaranteed that lisp
984        code would be able to "cleanly" signal an out-of-memory
        condition if lisp is ... out of memory.</para>
        <para>I don't know that I've ever seen an abrupt out-of-memory
        failure that wasn't preceded by several minutes of excessive
        paging activity.  The most expedient course in cases like this
        is to either (a) use less memory or (b) get more memory; it's
        generally hard to use memory that you don't have.</para>
991      </sect2>
992    </sect1>
993
994    <sect1 id="GC-details">
995      <title>GC details</title>
      <para>The GC uses a Mark/Compact algorithm; its
      execution time is essentially a function of the amount of live
      data in the heap. (The somewhat better-known Mark/Sweep
      algorithms don't compact the live data but instead traverse the
      garbage to rebuild free-lists; their execution time is therefore
      a function of the total heap size.)</para>
      <para>As mentioned earlier, two auxiliary data structures
      (proportional to the size of the lisp heap) are maintained. These are:</para>
1004      <orderedlist>
1005        <listitem>
          <para>the markbits bitvector, which contains a bit for
          every doublenode in the dynamic heap (plus a few extra words
          for alignment and so that sub-bitvectors can start on word
          boundaries.)</para>
1010        </listitem>
1011        <listitem>
1012          <para>the relocation table, which contains a native word for
1013          every 32 or 64 doublenodes in the dynamic heap, plus an
          extra word used to keep track of the end of the heap.</para>
1015        </listitem>
1016      </orderedlist>
1017      <para>The total GC space overhead is therefore on the order of
1018      3% (2/64 or 1/32).</para>
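      <para>The overhead figure can be reproduced with exact
      fractions.  The Python sketch below assumes that "every 32 or
      64 doublenodes" means 32 on 32-bit platforms and 64 on 64-bit
      platforms, and it ignores the few extra alignment words; the
      function name is hypothetical.</para>
      <programlisting><![CDATA[
```python
from fractions import Fraction


def gc_overhead(word_bits):
    """Fraction of dynamic-heap size used by the markbits and relocation table."""
    doublenode_bits = 2 * word_bits
    markbits = Fraction(1, doublenode_bits)       # one bit per doublenode
    doublenodes_per_entry = word_bits             # assumed: 32 or 64
    relocation = Fraction(word_bits, doublenodes_per_entry * doublenode_bits)
    return markbits + relocation


overhead32 = gc_overhead(32)   # Fraction(1, 32)
overhead64 = gc_overhead(64)   # Fraction(1, 64)
```
]]></programlisting>
      <para>Under these assumptions the two structures cost 1/64 of
      the heap each on a 32-bit platform (2/64 in total) and 1/128
      each on a 64-bit platform, in line with the 1/32 and 1/64
      reservations described under heap allocation.</para>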
1019      <para>The general algorithm proceeds as follows:</para>
1020
1021      <sect2 id="Mark-phase">
1022        <title>Mark phase</title>
1023        <para>Each doublenode in the dynamic heap has a corresponding
1024        bit in the markbits vector. (For any doublenode in the heap,
        the index of its mark bit is determined by subtracting the
        address of the start of the heap from the address of the
        object and dividing the result by 8 or 16.) The GC knows the
        markbit index of the free pointer, so determining that the
        markbit index of a doublenode address is between the start of
1030        the heap and the free pointer can be done with a single
1031        unsigned comparison.</para>
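        <para>A minimal sketch of that index computation and of the
        single-comparison bounds check follows. The
        <literal>heap_bounds</literal> structure and both function
        names are hypothetical; the real kernel keeps equivalent
        values in its own data structures.</para>
        <programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the dynamic heap; names are illustrative. */
typedef struct {
    uintptr_t base;         /* address of the first doublenode */
    uintptr_t free_ptr;     /* address just past the last allocated doublenode */
    unsigned  dnode_shift;  /* 3 for 8-byte doublenodes, 4 for 16-byte ones */
} heap_bounds;

/* Index of the mark bit for the doublenode that contains addr.
   Unsigned subtraction wraps around for addresses below the base. */
static size_t markbit_index(const heap_bounds *h, uintptr_t addr)
{
    return (size_t)((addr - h->base) >> h->dnode_shift);
}

/* One unsigned comparison covers both bounds: an address below the base
   wraps to a huge index and therefore fails the test as well. */
static int in_dynamic_heap(const heap_bounds *h, uintptr_t addr)
{
    size_t limit;

    assert(h->free_ptr >= h->base);
    limit = markbit_index(h, h->free_ptr);  /* precomputed once in practice */
    return markbit_index(h, addr) < limit;
}
```
]]></programlisting>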
1032        <para>The markbits of all doublenodes in the dynamic heap are
1033        zeroed before the mark phase begins. An object is
1034        <emphasis>marked</emphasis> if the markbits of all of its
1035        constituent doublewords are set and unmarked otherwise;
        setting an object's markbits involves setting the corresponding
1037        markbits of all constituent doublenodes in the object.</para>
1038        <para>The mark phase traverses each root. If the tag of the
1039        value of the root indicates that it's a non-immediate node
1040        whose address lies in the lisp heap, then:</para>
1041        <orderedlist>
1042          <listitem>
1043            <para>If the object is already marked, do nothing.</para>
1044          </listitem>
1045          <listitem>
1046            <para>Set the object's markbit(s).</para>
1047          </listitem>
1048          <listitem>
1049            <para>If the object is an ivector, do nothing further.</para>
1050          </listitem>
1051          <listitem>
1052            <para>If the object is a cons cell, recursively mark its
1053            car and cdr.</para>
1054          </listitem>
1055          <listitem>
            <para>Otherwise, the object is a gvector. Recursively mark
            its elements.</para>
1058          </listitem>
1059        </orderedlist>
1060        <para>Marking an object thus involves ensuring that its mark
1061        bits are set and then recursively marking any pointers
1062        contained within the object if the object was originally
        unmarked. If this recursive step were implemented in the
1064        obvious manner, marking an object would take stack space
1065        proportional to the length of the pointer chain from some root
1066        to that object. Rather than storing that pointer chain
1067        implicitly on the stack (in a series of recursive calls to the
        mark subroutine), the OpenMCL marker uses a mixture of recursion
1069        and a technique called <emphasis>link inversion</emphasis> to
1070        store the pointer chain in the objects themselves.  (Recursion
1071        tends to be simpler and faster; if a recursive step notes that
1072        stack space is becoming limited, the link-inversion technique
1073        is used.)</para>
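        <para>To illustrate the idea of link inversion, here is a
        generic Deutsch-Schorr-Waite style marker for a toy heap of
        cons cells addressed by index. It is a sketch of the general
        technique, not the OpenMCL marker: the toy heap, the
        <literal>marked</literal> and <literal>in_cdr</literal> side
        arrays, and all names are assumptions made for the
        example.</para>
        <programlisting><![CDATA[
```c
#include <assert.h>

#define NIL (-1)        /* stands in for any immediate or non-heap value */
#define HEAP_CELLS 8

/* A toy heap of cons cells addressed by index; all names are hypothetical. */
typedef struct { int car, cdr; } cell;

static cell heap[HEAP_CELLS];
static unsigned char marked[HEAP_CELLS];  /* one "markbit" per cell */
static unsigned char in_cdr[HEAP_CELLS];  /* 1 if the back link sits in the cdr */

/* Mark everything reachable from root without recursion. The path back
   to the root is stored in the cells themselves and undone on the way up. */
static void mark_by_link_inversion(int root)
{
    int prev = NIL, cur = root;

    assert(root == NIL || (root >= 0 && root < HEAP_CELLS));
    for (;;) {
        /* Descend through car fields, leaving a back link in each car. */
        while (cur != NIL && !marked[cur]) {
            int next = heap[cur].car;
            marked[cur] = 1;
            in_cdr[cur] = 0;
            heap[cur].car = prev;
            prev = cur;
            cur = next;
        }
        /* Retreat past cells whose cdr is finished, restoring each cdr. */
        while (prev != NIL && in_cdr[prev]) {
            int up = heap[prev].cdr;
            heap[prev].cdr = cur;
            cur = prev;
            prev = up;
        }
        if (prev == NIL)
            return;     /* back above the root: everything is marked */
        /* The car of prev is finished: restore it, then move the
           back link into the cdr and continue with the old cdr. */
        {
            int up = heap[prev].car;
            heap[prev].car = cur;
            cur = heap[prev].cdr;
            heap[prev].cdr = up;
            in_cdr[prev] = 1;
        }
    }
}
```
]]></programlisting>
        <para>The marker needs only a constant amount of stack space,
        at the price of temporarily modifying the cells it passes
        through; every modified field is restored before the function
        returns.</para>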
1074        <para>Certain types of objects are treated a little specially:</para>
1075        <orderedlist>
1076        <listitem>
          <para>To support a feature called <emphasis>GCTWA
              <footnote>
                <para>I believe that the acronym comes from MACLISP,
                where it stood for "Garbage Collection of Truly
                Worthless Atoms".</para>
              </footnote>
              , </emphasis>the vector which contains the
              internal symbols of the current package is marked on
              entry to the mark phase, but the symbols themselves are
              not marked at this time. Near the end of the mark phase,
              symbols referenced from this vector which are
              not otherwise marked are marked if and only if they're
              somehow distinguishable from newly created symbols (by
              virtue of their having function bindings, value bindings,
              plists, or other attributes).</para>
1092        </listitem>
1093        <listitem>
          <para>Pools have their first element set to NIL before any
          other elements are marked.</para>
1096        </listitem>
1097        <listitem>
1098          <para>All hash tables have certain fields (used to cache
1099          previous results) invalidated.</para>
1100        </listitem>
1101        <listitem>
1102          <para>Weak Hash Tables and other weak objects are put on a
          linked list as they're encountered; their contents are only
1104          retained if there are other (non-weak) references to
1105          them.</para>
1106        </listitem>
1107        </orderedlist>
1108        <para>At the end of the mark phase, the markbits of all objects which
1109        are transitively reachable from the roots are set and all other markbits
1110        are clear.</para>
1111      </sect2>
1112
1113      <sect2 id="Relocation-phase">
1114        <title>Relocation phase</title>
1115        <para>The <emphasis>forwarding address</emphasis> of a
1116        doublenode in the dynamic heap is (&lt;its current address> -
1117        (size_of_doublenode * &lt;the number of unmarked markbits that
        precede it>)) or alternatively (&lt;the base of the heap> +
        (size_of_doublenode * &lt;the number of marked markbits that
        precede it&gt;)). Rather than count the number of preceding
1121        markbits each time, the relocation table is used to precompute
1122        an approximation of the forwarding addresses for all
1123        doublewords. Given this approximate address and a pointer into
1124        the markbits vector, it's relatively easy to compute the exact
1125        forwarding address.</para>
1126        <para>The relocation table contains the forwarding addresses
1127        of each <emphasis>pagelet</emphasis>, where a pagelet is 256
1128        bytes (or 32 doublenodes). The forwarding address of the first
1129        pagelet is the base of the heap. The forwarding address of the
1130        second pagelet is the sum of the forwarding address of the
1131        first and 8 bytes for each mark bit set in the first 32-bit
1132        word in the markbits table. The last entry in the relocation
1133        table contains the forwarding address that the freepointer
        would have, i.e., the new value of the freepointer after
1135        compaction.</para>
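        <para>The construction of the relocation table can be sketched
        as a running sum over the markbits words. The sketch below
        assumes the 32-bit layout described above (8-byte doublenodes,
        one 32-bit markbits word per pagelet); the function names are
        hypothetical and the population count is a plain portable
        loop rather than whatever the kernel actually uses.</para>
        <programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DNODE_BYTES 8u  /* 32-bit case from the text */

/* Portable population count: number of set bits in a 32-bit word. */
static unsigned popcount32(uint32_t w)
{
    unsigned n = 0;
    while (w) {
        w &= w - 1;     /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Fill reloc[0..nwords]. reloc[i] is the forwarding address of pagelet i;
   the extra entry reloc[nwords] is the free pointer after compaction. */
static void build_relocation_table(uintptr_t heap_base,
                                   const uint32_t *markbits, size_t nwords,
                                   uintptr_t *reloc)
{
    uintptr_t fwd = heap_base;
    size_t i;

    assert(markbits != NULL && reloc != NULL);
    for (i = 0; i < nwords; i++) {
        reloc[i] = fwd;
        /* every marked doublenode in this pagelet survives compaction */
        fwd += (uintptr_t)DNODE_BYTES * popcount32(markbits[i]);
    }
    reloc[nwords] = fwd;
}
```
]]></programlisting>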
1136        <para>In many programs, old objects rarely become garbage and
1137        new objects often do. When building the relocation table, the
1138        relocation phase notes the address of the first unmarked
1139        object in the dynamic heap. Only the area of the heap between
1140        the first unmarked object and the freepointer needs to be
1141        compacted; only pointers to this area will need to be
1142        forwarded (the forwarding address of all other pointers to the
1143        dynamic heap is the address of that pointer.)  Often, the
1144        first unmarked object is much nearer the free pointer than it
1145        is to the base of the heap.</para>
1146      </sect2>
1147
1148      <sect2 id="Forwarding-phase">
1149        <title>Forwarding phase</title>
1150        <para>The forwarding phase traverses all roots and the "old"
1151        part of the dynamic heap (the part between the base of the
1152        heap and the first unmarked object.) All references to objects
1153        whose address is between the first unmarked object and the
1154        free pointer are updated to point to the address the object
1155        will have after compaction by using the relocation table and
1156        the markbits vector and interpolating.</para>
1157        <para>The relocation table entry for the pagelet nearest the
1158        object is found. If the pagelet's address is less than the
1159        object's address, the number of set markbits that precede the
1160        object on the pagelet is used to determine the object's
        address; otherwise, the number of set markbits that follow the
1162        object on the pagelet is used.</para>
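        <para>The interpolation can be sketched as follows, again for
        the 32-bit layout. Two details are assumptions of this sketch
        rather than facts about the kernel: bit k of a markbits word
        (counting from the least significant bit) is taken to describe
        the k-th doublenode of its pagelet, and "nearest pagelet" is
        decided by whether the object lies in the lower or upper half
        of its pagelet. When counting downward from the next pagelet's
        entry, the object's own mark bit is included in the
        count.</para>
        <programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DNODE_BYTES     8u   /* 32-bit case from the text */
#define DNODES_PER_WORD 32u  /* one 32-bit markbits word per pagelet */

/* Portable population count: number of set bits in a 32-bit word. */
static unsigned popcount32(uint32_t w)
{
    unsigned n = 0;
    while (w) {
        w &= w - 1;
        n++;
    }
    return n;
}

/* reloc[] must hold one entry per pagelet plus the final free-pointer entry. */
static uintptr_t forwarding_address(uintptr_t heap_base,
                                    const uint32_t *markbits,
                                    const uintptr_t *reloc,
                                    uintptr_t addr)
{
    size_t dnode, pagelet;
    unsigned bit;
    uint32_t word, below;

    assert(addr >= heap_base);
    dnode = (size_t)((addr - heap_base) / DNODE_BYTES);
    pagelet = dnode / DNODES_PER_WORD;
    bit = (unsigned)(dnode % DNODES_PER_WORD);
    word = markbits[pagelet];
    /* mark bits of the doublenodes that precede the object on its pagelet */
    below = (bit == 0) ? 0 : (word & (0xFFFFFFFFu >> (32u - bit)));

    if (bit < DNODES_PER_WORD / 2) {
        /* nearer this pagelet: count upward from its entry */
        return reloc[pagelet] + (uintptr_t)DNODE_BYTES * popcount32(below);
    }
    /* nearer the next pagelet: count downward from its entry */
    return reloc[pagelet + 1]
           - (uintptr_t)DNODE_BYTES * popcount32(word & ~below);
}
```
]]></programlisting>
        <para>Both branches compute the same address; choosing the
        nearer entry only shortens the span of mark bits that has to
        be counted.</para>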
1163        <para>Since forwarding views the heap as a set of doublewords,
1164        locatives are (mostly) treated like any other pointers. (The
1165        basic difference is that locatives may appear to be tagged as
1166        fixnums, in which case they're treated as word-aligned
1167        pointers into the object.)</para>
        <para>If the forwarding phase changes the address of any hash
1169        table key in a hash table that hashes by address (e.g., an EQ
1170        hash table), it sets a bit in the hash table's header. The
1171        hash table code will rehash the hash table's contents if it
1172        tries to do a lookup on a key in such a table.</para>
1173        <para>Profiling reveals that about half of the total time
1174        spent in the GC is spent in the subroutine which determines a
1175        pointer's forwarding address. Exploiting GCC-specific idioms,
1176        hand-coding the routine, and inlining calls to it could all be
1177        expected to improve GC performance.</para>
1178      </sect2>
1179
1180      <sect2 id="Compact-phase">
1181        <title>Compact phase</title>
1182        <para>The compact phase compacts the area between the first
1183        unmarked object and the freepointer so that it contains only
1184        marked objects.  While doing so, it forwards any pointers it
1185        finds in the objects it copies.</para>
1186        <para>When the compact phase is finished, so is the GC (more
1187        or less): the free pointer and some other data structures are
1188        updated and control returns to the exception handler that
1189        invoked the GC. If sufficient memory has been freed to satisfy
1190        any allocation request that may have triggered the GC, the
1191        exception handler returns; otherwise, a "seriously low on
1192        memory" condition is signalled, possibly after releasing a
1193        small emergency pool of memory.</para>
1194      </sect2>
1195    </sect1>
1196
1197    <sect1 id="The-ephemeral-GC">
1198      <title>The ephemeral GC</title>
1199      <para>In the OpenMCL memory management scheme, the relative age
1200      of two objects in the dynamic heap can be determined by their
1201      addresses: if addresses X and Y are both addresses in the
1202      dynamic heap, X is younger than Y (X was created more recently
1203      than Y) if it is nearer to the free pointer (and farther from
1204      the base of the heap) than Y.</para>
1205      <para>Ephemeral (or generational) garbage collectors attempt to
1206      exploit the following assumptions:</para>
1207      <itemizedlist>
1208        <listitem>
1209          <para>most newly created objects become garbage soon after
          they're created.</para>
1211        </listitem>
1212        <listitem>
1213          <para>most objects that have already survived several GCs
1214          are unlikely to ever become garbage.</para>
1215        </listitem>
1216        <listitem>
1217          <para>old objects can only point to newer objects as the
          result of a destructive modification (e.g., via
1219          SETF.)</para>
1220        </listitem>
1221      </itemizedlist>
1222
1223      <para>By concentrating its efforts on (frequently and quickly)
1224      reclaiming newly created garbage, an ephemeral collector hopes
1225      to postpone the more costly full GC as long as possible. It's
1226      important to note that most programs create some long-lived
1227      garbage, so an EGC can't typically eliminate the need for full
1228      GC.</para>
1229      <para>An EGC views each object in the heap as belonging to
1230      exactly one <emphasis>generation</emphasis>; generations are
1231      sets of objects that are related to each other by age: some
1232      generation is the youngest, some the oldest, and there's an age
1233      relationship between any intervening generations. Objects are
1234      typically assigned to the youngest generation when first
1235      allocated; any object that has survived some number of GCs in
1236      its current generation is promoted (or
1237      <emphasis>tenured</emphasis>) into an older generation.</para>
1238      <para>When a generation is GCed, the roots consist of the
1239      stacks, registers, and global variables as always and also of
1240      any pointers to objects in that generation from other
1241      generations. To avoid the need to scan those (often large) other
1242      generations looking for such intergenerational references, the
1243      runtime system must note all such intergenerational references
1244      at the point where they're created (via Setf).<footnote><para>This is
1245      sometimes called "The Write Barrier": all assignments which
1246      might result in intergenerational references must be noted, as
1247      if the other generations were write-protected.</para></footnote> The
1248      set of pointers that may contain intergenerational references is
1249      sometimes called <emphasis>the remembered set</emphasis>.</para>
1250      <para>In OpenMCL's EGC, the heap is organized exactly the same
1251      as otherwise; "generations" are merely structures which contain
1252      pointers to regions of the heap (which is already ordered by
1253      age.) When a generation needs to be GCed, any younger generation
1254      is incorporated into it; all objects which survive a GC of a
1255      given generation are promoted into the next older
1256      generation. The only intergenerational references that can exist
1257      are therefore those where an old object is modified to contain a
1258      pointer to a new object.</para>
1259      <para>The EGC uses exactly the same code as the full GC. When a
1260      given GC is "ephemeral",</para>
1261      <itemizedlist>
1262        <listitem>
          <para>the "base of the heap" used to determine an object's
1264          markbit address is the base of the generation
1265          being collected;</para>
1266        </listitem>
1267        <listitem>
1268          <para>the markbits vector is actually a pointer into the
1269          middle of the global markbits table; preceding entries in
1270          this table are used to note doubleword addresses in older
1271          generations that (may) contain intergenerational
1272          references;</para>
1273        </listitem>
1274        <listitem>
1275          <para>some steps (notably GCTWA and the handling of weak
1276          objects) are not performed;</para>
1277        </listitem>
1278        <listitem>
          <para>the intergenerational references table is used to
          find additional roots for the mark and forward phases. If a
          bit is set in the intergenerational references table, that
          means that the corresponding doubleword (in some "old"
          generation, in some "earlier" part of the heap) may have had
          a pointer to an object in a younger generation stored into
          it.</para>
1286        </listitem>
1287     
1288      </itemizedlist>
1289      <para>The intergenerational references table is maintained
1290      indirectly: whenever a setf operation that may introduce an
1291      intergenerational reference occurs, a pointer to the doubleword
1292      being stored into is pushed onto the <emphasis>memo
      buffer</emphasis>, which is a stack whose top is addressed by the
1294      memo register. Whenever the memo buffer overflows<tip><para>A
1295      guard page at the end of the memo buffer simplifies overflow
1296      detection.</para></tip> when the EGC is active, the handler
1297      scans the buffer and sets bits in the intergenerational
1298      references table for each doubleword address it finds in the
1299      buffer that belongs to some generation other than the youngest;
1300      the same scan is performed on entry to any ephemeral GC.  After
1301      (possibly) performing this scan, the handler resets the memo
1302      register to point to the bottom of the memo stack; this means
1303      that when the EGC is inactive, the memo buffer is constantly
1304      being filled and emptied for no apparent reason.</para>
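      <para>The interplay between the memo buffer and the
      intergenerational references table can be sketched as below.
      Everything here is illustrative: the <literal>egc_state</literal>
      structure, the tiny buffer capacity, and the explicit overflow
      check (which stands in for the guard page) are assumptions of
      the sketch, and generations are reduced to a single "youngest
      generation starts here" address.</para>
      <programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MEMO_CAPACITY 4u   /* deliberately tiny so overflow is easy to see */
#define DNODE_BYTES   8u
#define HEAP_DNODES   64u

/* Hypothetical collector state; field names are illustrative only. */
typedef struct {
    uintptr_t heap_base;
    uintptr_t youngest_base;            /* addresses at or above this are young */
    uintptr_t memo[MEMO_CAPACITY];      /* the memo buffer */
    size_t    memo_top;                 /* plays the role of the memo register */
    uint32_t  refbits[HEAP_DNODES / 32u]; /* intergenerational references table */
    int       egc_active;
} egc_state;

/* Scan the buffer (only when the EGC is active), then reset it. */
static void flush_memo_buffer(egc_state *s)
{
    size_t i;

    if (s->egc_active) {
        for (i = 0; i < s->memo_top; i++) {
            uintptr_t a = s->memo[i];
            if (a < s->youngest_base) {     /* stored into an older generation */
                size_t d = (size_t)((a - s->heap_base) / DNODE_BYTES);
                assert(d < HEAP_DNODES);
                s->refbits[d / 32u] |= (uint32_t)1 << (d % 32u);
            }
        }
    }
    s->memo_top = 0;    /* reset whether or not the scan happened */
}

/* Record the doubleword being stored into. */
static void memoize_store(egc_state *s, uintptr_t dnode_addr)
{
    assert(dnode_addr >= s->heap_base);
    if (s->memo_top == MEMO_CAPACITY)
        flush_memo_buffer(s);   /* explicit check in place of a guard page */
    s->memo[s->memo_top++] = dnode_addr;
}
```
]]></programlisting>
      <para>Note that repeated stores into the same doubleword simply
      set the same bit again, which is why the table stays sparse even
      when the buffer fills quickly.</para>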
1305      <para>With one exception (the implicit setfs that occur on entry
1306      to and exit from the binding of a special variable), all setfs
1307      that might introduce an intergenerational reference must be
1308      memoized.<tip><para>Note that the implicit setfs that occur when
1309      initializing an object - as in the case of a call to cons or
1310      vector - can't introduce intergenerational references, since the
1311      newly created object is always younger than the objects used to
1312      initialize it.</para></tip> It's always safe to push any cons
1313      cell or gvector locative onto the memo stack; it's never safe to
1314      push anything else.</para>
1315      <para>Typically, the intergenerational references bitvector is
1316      sparse: a relatively small number of old locations are stored
1317      into, although some of them may have been stored into many
1318      times. The routine that scans the memoization buffer does a lot
1319      of work and usually does it fairly often; it uses a simple,
1320      brute-force method but might run faster if it was smarter about
1321      recognizing addresses that it'd already seen.</para>
1322      <para>When the EGC mark and forward phases scan the
1323      intergenerational reference bits, they can clear any bits that
1324      denote doublewords that definitely do not contain
1325      intergenerational references.</para>
1326    </sect1>
1327
1328    <sect1 id="Fasl-files">
1329      <title>Fasl files</title>
1330      <para>The information in this section was current in November
1331      2004.  Saving and loading of Fasl files is implemented in
1332      xdump/faslenv.lisp, level-0/nfasload.lisp, and lib/nfcomp.lisp.
1333      The information here is only an overview, which might help when
1334      reading the source.</para>
1335      <para>The OpenMCL Fasl format is forked from the old MCL Fasl
1336      format; there are a few differences, but they are minor.  The
1337      name "nfasload" comes from the fact that this is the so-called
1338      "new" Fasl system, which was true in 1986 or so.  The format has
1339      held up well, although it would certainly need extensions to
1340      deal with 64-bit data, and some other modernization might be
1341      possible.</para>
1342      <para>A Fasl file begins with a "file header", which contains
1343      version information and a count of the following "blocks".
1344      There's typically only one "block" per Fasl file.  The blocks
1345      are part of a mechanism for combining multiple logical files
1346      into a single physical file, in order to simplify the
1347      distribution of precompiled programs.  (Nobody seems to be doing
1348      anything interesting with this feature, at the moment, probably
1349      because it isn't documented.)</para>
1350      <para>Each block begins with a header for itself, which just
1351      describes the size of the data that follows.</para>
1352      <para>The data in each block is treated as a simple stream of
1353      bytes, which define a bytecode program.  The actual bytecodes,
1354      "fasl operators", are defined in xdump/faslenv.lisp.  The
1355      descriptions in the source file are terse, but, according to
1356      Gary, "probably accurate".</para>
1357      <para>Some of the operators are used to create a per-block
1358      "object table", which is a vector used to keep track of
1359      previously-loaded objects and simplify references to them.  When
1360      the table is created, an index associated with it is set to
1361      zero; this is analogous to an array fill-pointer, and allows the
1362      table to be treated like a stack.</para>
1363      <para>The low seven bits of each bytecode are used to specify
1364      the fasl operator; currently, about fifty operators are defined.
      The high bit, when set, indicates that the result of the
1366      operation should be pushed onto the object table.</para>
1367      <para>Most bytecodes are followed by operands; the operand data
1368      is byte-aligned.  How many operands there are, and their type,
1369      depend on the bytecode.  Operands can be indices into the object
1370      table, immediate values, or some combination of these.</para>
1371      <para>An exception is the bytecode #xFF, which has the symbolic
1372      name ccl::$faslend; it is used to mark the end of the
1373      block.</para>
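      <para>The opcode byte layout described above can be sketched as
      follows. Only the bit layout and the #xFF end marker come from
      this section; the type and function names are hypothetical,
      loaded objects are represented by plain integers, and the
      actual operators and their operands (defined in
      xdump/faslenv.lisp) are not modelled.</para>
      <programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#define FASL_OP_MASK   0x7Fu  /* low seven bits select the operator */
#define FASL_PUSH_FLAG 0x80u  /* high bit: push the result on the object table */
#define FASL_END       0xFFu  /* ccl::$faslend, marks the end of the block */
#define TABLE_MAX      16

typedef struct {
    unsigned op;          /* operator number, 0..127 */
    int      push_result; /* nonzero if the result goes on the object table */
    int      is_end;      /* nonzero for the end-of-block marker */
} fasl_opcode;

/* Placeholder object table: a vector plus a fill-pointer-like index. */
typedef struct {
    int    objects[TABLE_MAX];
    size_t fill;
} object_table;

static fasl_opcode decode_fasl_opcode(unsigned char byte)
{
    fasl_opcode d;

    d.is_end = (byte == FASL_END);
    d.op = byte & FASL_OP_MASK;
    d.push_result = (byte & FASL_PUSH_FLAG) != 0;
    return d;
}

/* Push result onto the table only if the opcode byte asks for it. */
static void note_result(object_table *t, unsigned char opcode_byte, int result)
{
    if (decode_fasl_opcode(opcode_byte).push_result) {
        assert(t->fill < TABLE_MAX);
        t->objects[t->fill++] = result;
    }
}
```
]]></programlisting>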
1374    </sect1>
1375
1376
1377
1378    <sect1 id="The-Objective-C-Bridge--1-">
1379      <title>The Objective-C Bridge</title>
1380
1381      <sect2 id="How-OpenMCL-Recognizes-Objective-C-Objects">
1382        <title>How OpenMCL Recognizes Objective-C Objects</title>
1383        <para>In most cases, pointers to instances of Objective-C
1384        classes are recognized as such; the recognition is (and
1385        probably always will be) slightly heuristic. Basically, any
1386        pointer that passes basic sanity checks and whose first word
1387        is a pointer to a known ObjC class is considered to be an
1388        instance of that class; the Objective-C runtime system would
1389        reach the same conclusion.</para>
        <para>It's certainly possible that a random pointer to an
        arbitrary memory address could look enough like an ObjC
        instance to fool the lisp runtime system, and it's possible
        that the memory a pointer refers to could change so that
        something that had been a true ObjC instance (or had
        looked a lot like one) no longer is one (possibly by virtue of
        having been deallocated).</para>
1397        <para>In the first case, we can improve the heuristics
1398        substantially: we can make stronger assertions that a
1399        particular pointer is really "of type :ID" when it's a
1400        parameter to a function declared to take such a pointer as an
1401        argument or a similarly declared function result; we can be
1402        more confident of something we obtained via SLOT-VALUE of a
1403        slot defined to be of type :ID than if we just dug a pointer
1404        out of memory somewhere.</para>
1405        <para>The second case is a little more subtle: ObjC memory
1406        management is based on a reference-counting scheme, and it's
1407        possible for an object to ... cease to be an object while lisp
1408        is still referencing it.  If we don't want to deal with this
1409        possibility (and we don't), we'll basically have to ensure
1410        that the object is not deallocated while lisp is still
1411        thinking of it as a first-class object. There's some support
1412        for this in the case of objects created with MAKE-INSTANCE,
1413        but we may need to give similar treatment to foreign objects
1414        that are introduced to the lisp runtime in other ways (as
1415        function arguments, return values, SLOT-VALUE results, etc. as
1416        well as those instances that're created under lisp
1417        control.)</para>
1418        <para>This doesn't all work yet (in fact, not much of it works
1419        yet); in practice, this has not yet been as much of a problem
1420        as anticipated, but that may be because existing Cocoa code
1421        deals primarily with relatively long-lived objects such as
1422        windows, views, menus, etc.</para>
1423      </sect2>
1424
1425      <sect2>
1426        <title>Recommended Reading</title>
1427
1428        <variablelist>
1429          <varlistentry>
1430            <term>
1431              <ulink url="http://developer.apple.com/documentation/Cocoa/">Cocoa Documentation</ulink>
1432            </term>
1433           
1434           <listitem>
1435             <para>
1436               This is the top page for all of Apple's documentation on
1437               Cocoa.  If you are unfamiliar with Cocoa, it is a good
1438               place to start.
1439             </para>
1440           </listitem>
1441        </varlistentry>
1442        <varlistentry>
1443          <term>
1444            <ulink url="http://developer.apple.com/documentation/Cocoa/Reference/Foundation/ObjC_classic/index.html">Foundation Reference for Objective-C</ulink>
1445          </term>
1446
1447          <listitem>
1448            <para>
1449              This is one of the two most important Cocoa references; it
1450              covers all of the basics, except for GUI programming.  This is
1451              a reference, not a tutorial.
1452            </para>
1453          </listitem>
1454        </varlistentry>
1455      </variablelist>
1456      </sect2>
1457    </sect1>
1458  </chapter>