source: trunk/source/doc/src/implementation.xml @ 8981

Last change on this file since 8981 was 8981, checked in by mikel, 11 years ago

additions to ObjC and ffi docs; many mechanical edits; some standardization of XML elements and formatting

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"[
<!ENTITY rest "<varname>&amp;rest</varname>">
<!ENTITY key "<varname>&amp;key</varname>">
<!ENTITY optional "<varname>&amp;optional</varname>">
<!ENTITY body "<varname>&amp;body</varname>">
<!ENTITY aux "<varname>&amp;aux</varname>">
<!ENTITY allow-other-keys "<varname>&amp;allow-other-keys</varname>">
<!ENTITY CCL "Clozure CL">
]>
  <chapter id="Implementation-Details-of-CCL">
    <title>Implementation Details of &CCL;</title>
    <para>This chapter describes many aspects of OpenMCL's
    implementation as of (roughly) version 1.1.  Details vary a bit
    between the three architectures currently supported (PPC32,
    PPC64, and x86-64), and those details change over time, so the
    definitive reference is the source code (especially some files in
    the ccl/compiler/ directory whose names contain the string "arch"
    and some files in the ccl/lisp-kernel/ directory whose names
    contain the string "constants").  Hopefully, this chapter will
    make it easier for an interested reader to understand the
    contents of those files.</para>

    <sect1 id="Threads-and-exceptions">
      <title>Threads and exceptions</title>
      <para>&CCL;'s threads are "native" (meaning that they're
      scheduled and controlled by the operating system).  Most of the
      implications of this are discussed elsewhere; this section tries
      to describe how threads look from the lisp kernel's perspective
      (and especially from the GC's point of view).</para>
      <para>&CCL;'s runtime system tries to use machine-level
      exception mechanisms (conditional traps when available, illegal
      instructions, memory access protection in some cases) to detect
      and handle exceptional situations.  These situations include
      some TYPE-ERRORs and PROGRAM-ERRORs (notably
      wrong-number-of-args errors), and also include cases like "not
      being able to allocate memory without GCing or obtaining more
      memory from the OS."  The general idea is that it's usually
      faster to pay (very occasional) exception-processing overhead
      and figure out what's going on in an exception handler than it
      is to maintain enough state and context to handle an exceptional
      case via a lighter-weight mechanism when that exceptional case
      (by definition) rarely occurs.</para>
      <para>Some emulated execution environments (the Rosetta PPC
      emulator on x86 versions of Mac OS X) don't provide accurate
      exception information to exception handling functions.  &CCL;
      can't run in such environments.</para>

      <sect2 id="The-Thread-Context-Record">
        <title>The Thread Context Record</title>

        <para>When a lisp thread is first created (or when a thread
        created by foreign code first calls back to lisp), a data
        structure called a Thread Context Record (or TCR) is allocated
        and initialized.  On modern versions of Linux and FreeBSD, the
        allocation actually happens via a set of thread-local-storage
        ABI extensions, so a thread's TCR is created when the thread
        is created and dies when the thread dies.  (The World's Most
        Advanced Operating System - as Apple's marketing literature
        refers to Darwin - is not very advanced in this regard, and I
        know of no reason to assume that advances will be made in this
        area anytime soon.)</para>
        <para>A TCR contains a few dozen fields (and is therefore a
        few hundred bytes in size).  The fields are mostly
        thread-specific information about the locations and sizes of
        the thread's stacks, information about the underlying (POSIX)
        thread, and information about the thread's dynamic binding
        history and pending CATCH/UNWIND-PROTECTs.  Some of this
        information could be kept in individual machine registers
        while the thread is running (and the PPC - which has more
        registers available - keeps a few things in registers that the
        x86-64 has to access via the TCR), but it's important to
        remember that the information is thread-specific and can't
        (for instance) be kept in a fixed global memory
        location.</para>
        <para>When lisp code is running, the current thread's TCR is
        kept in a register.  On PPC platforms, a general-purpose
        register is used; on x86-64, an (otherwise nearly useless)
        segment register works well, and avoids spending a more
        generally useful general-purpose register on this
        purpose.</para>
        <para>The address of a TCR is aligned in memory in such a way
        that a FIXNUM can be used to represent it.  The lisp function
        CCL::%CURRENT-TCR returns the calling thread's TCR as a
        fixnum; the actual value of the TCR's address is 4 or 8 times
        the value of this fixnum (on 32-bit and 64-bit platforms,
        respectively).</para>
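        <para>The relationship between a TCR's address and the fixnum
        that represents it can be sketched in a few lines of Python.
        This is only an illustrative model: the function names are
        hypothetical (they are not part of &CCL;), and the word sizes
        of 4 and 8 bytes are simply read off the "4 or 8 times" figure
        above.</para>
        <programlisting language="python"><![CDATA[
```python
# Illustrative model only: these helpers are hypothetical, not CCL functions.
# Assumption: the TCR's address is a multiple of the word size (4 or 8 bytes).

def tcr_address_to_fixnum(address, word_size):
    """Return the fixnum value that stands for a word-aligned TCR address."""
    if word_size not in (4, 8):
        raise ValueError("word_size must be 4 or 8")
    if address % word_size != 0:
        raise ValueError("TCR address must be aligned to the word size")
    return address // word_size


def fixnum_to_tcr_address(fixnum, word_size):
    """Recover the address: 4 or 8 times the value of the fixnum."""
    return fixnum * word_size


example_fixnum = tcr_address_to_fixnum(0x1000, 8)            # 512
example_address = fixnum_to_tcr_address(example_fixnum, 8)   # 4096
```
]]></programlisting>
        <para>The alignment check is the important part of the sketch:
        an address that is not a multiple of the word size has no
        fixnum that represents it exactly.</para>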
        <para>When the lisp kernel initializes a new TCR, it's added
        to a global list maintained by the kernel; when a thread
        exits, its TCR is removed from this list.</para>
        <para>When a thread calls foreign code, lisp stack pointers
        are saved in its TCR, lisp registers (at least those whose
        value should be preserved across the call) are saved on the
        thread's value stack, and (on x86-64) RSP is switched to the
        control stack.  A field in the TCR (tcr.valence) is then set
        to indicate that the thread is running foreign code, foreign
        argument registers are loaded from a frame on the foreign
        stack, and the foreign function is called.  (That's a little
        oversimplified and possibly inaccurate, but the important
        things to note are that the thread "stops following lisp stack
        and register usage conventions" and that it advertises the
        fact that it's done so.)  Similar transitions in a thread's
        state ("valence") occur when it enters or exits an exception
        handler (which is sort of an OS/hardware-mandated foreign
        function call where the OS thoughtfully saves the thread's
        register state for it beforehand).</para>
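        <para>The "advertise the transition" idea can be modelled
        roughly as follows.  This is a sketch under stated assumptions,
        not the kernel's code: the TCR class, the two valence labels,
        and the register names in the example are all hypothetical, and
        only the ordering (save lisp state, then mark the thread as
        running foreign code, then restore) follows the description
        above.</para>
        <programlisting language="python"><![CDATA[
```python
from contextlib import contextmanager
from dataclasses import dataclass, field

# Hypothetical labels for the tcr.valence field.
LISP = "lisp"
FOREIGN = "foreign"


@dataclass
class TCR:
    """Toy stand-in for a Thread Context Record."""
    valence: str = LISP
    saved_stack_pointers: dict = field(default_factory=dict)


@contextmanager
def foreign_call(tcr, lisp_stack_pointers):
    """Mark the thread as running foreign code for the duration of a call."""
    tcr.saved_stack_pointers = dict(lisp_stack_pointers)  # save lisp state first
    tcr.valence = FOREIGN                                 # then advertise the change
    try:
        yield tcr
    finally:
        tcr.valence = LISP                                # lisp conventions apply again


tcr = TCR()
with foreign_call(tcr, {"vsp": 0x1000, "tsp": 0x2000}):
    valence_during_call = tcr.valence
valence_after_call = tcr.valence
```
]]></programlisting>
        <para>An observer such as the GC only needs to read the valence
        field to know whether the saved stack pointers, rather than the
        thread's live registers, describe its lisp state.</para>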
      </sect2>

      <sect2 id="Exception-contexts-comma---and-exception-handling-in-general">
        <title>Exception contexts, and exception-handling in general</title>
        <para>Unix-like OSes tend to refer to exceptions as "signals";
        the same general mechanism ("signal handling") is used to
        process both asynchronous OS-level events (such as the result
        of the keyboard driver noticing that ^C or ^Z has been
        pressed) and synchronous hardware-level events (like trying to
        execute an illegal instruction or access protected memory).
        It makes some sense to defer ("block") handling of
        asynchronous signals so that some critical code sequences
        complete without interruption.  Since it's generally not
        possible for a thread to proceed after a synchronous exception
        unless and until its state is modified by an exception
        handler, it makes no sense to talk about blocking synchronous
        signals (though some OSes will let you do so, and doing so can
        have mysterious effects).</para>
        <para>On OSX/Darwin, the POSIX signal handling facilities
        coexist with lower-level Mach-based exception handling
        facilities.  Unfortunately, the way that this is implemented
        interacts poorly with debugging tools: GDB will generally stop
        whenever the target program encounters a Mach-level exception
        and offers no way to proceed from that point (and let the
        program's POSIX signal handler try to handle the exception);
        Apple's CrashReporter program has had a similar issue and,
        depending on how it's configured, may bombard the user with
        alert dialogs which falsely claim that an application has
        crashed (when in fact the application in question has
        routinely handled a routine exception).  On Darwin/OSX,
        &CCL; uses Mach thread-level exception handling facilities
        which run before GDB or CrashReporter get a chance to confuse
        themselves; &CCL;'s Mach exception handling tries to force
        the thread which received a synchronous exception to invoke a
        signal handling function ("as if" signal handling worked more
        usefully under Darwin).  Mach exception handlers run in a
        dedicated thread (which basically does nothing but wait for
        exception messages from the lisp kernel, obtain and modify
        information about the state of threads in which exceptions
        have occurred, and reply to the exception messages with an
        indication that the exception has been handled).  The reply
        from a thread-level exception handler keeps the exception from
        being reported to GDB or CrashReporter and avoids the problems
        related to those programs.  Since &CCL;'s Mach exception
        handler doesn't claim to handle debugging-related exceptions
        (from breakpoints or single-step operations), it's possible to
        use GDB to debug &CCL;.</para>
        <para>On platforms where signal handling and debugging don't
        get in each other's way, a signal handler is entered with all
        signals blocked.  (This behavior is specified in the call to
        the sigaction() function which established the signal
        handler.)  The signal handler receives three arguments from
        the OS kernel: the first is an integer which identifies the
        signal, the second is a pointer to an object of type
        "siginfo_t", which may or may not contain a few fields that
        would help to identify the cause of the exception, and the
        third is a pointer to a data structure (called a
        "ucontext" or something similar) which contains
        machine-dependent information about the state of the thread at
        the time that the exception/signal occurred.  While
        asynchronous signals are blocked, the signal handler stores
        the pointer to its third argument (the "signal context") in a
        field in the current thread's TCR, sets some bits in another
        TCR field to indicate that the thread is now waiting to handle
        an exception, unblocks asynchronous signals, and waits for a
        global exception lock which serializes exception
        processing.</para>
        <para>On Darwin, the Mach exception thread creates a signal
        context (and maybe a siginfo_t structure), stores the signal
        context in the thread's TCR, sets the TCR field which describes
        the thread's state, and arranges that the thread resume
        execution at its signal handling function (with a signal
        number, possibly NULL siginfo_t, and signal context as
        arguments).  When the thread resumes, it waits for the global
        exception lock.</para>
        <para>On x86-64 platforms where signal handling can be used to
        handle synchronous exceptions, there's an additional
        complication: the OS kernel ordinarily allocates the signal
        context and siginfo structures on the stack of the thread
        which received the signal; in practice, that means "wherever
        RSP is pointing."  &CCL;'s <xref
        linkend="Register-and-stack-usage-conventions"/> require that
        the thread's value stack - where RSP is usually pointing while
        lisp code is running - contain only "nodes" (properly tagged
        lisp objects), and scribbling a signal context all over the
        value stack would violate this requirement.  To maintain
        consistency, the sigaltstack() mechanism is used to cause the
        signal to be delivered on (and the signal context and siginfo
        to be allocated on) a special stack area (the last few pages
        of the thread's control stack, in practice).  When the signal
        handler runs, it (carefully) copies the signal context and
        siginfo to the thread's control stack and makes RSP point into
        that stack before invoking the "real" signal handler.  (The
        effect of this hack is that the "real" signal handler always
        runs on the thread's control stack.)</para>
        <para>Once the exception handler has obtained the global
        exception lock, it uses the values of the signal number,
        siginfo_t, and signal context arguments to determine the
        (logical) cause of the exception.  Some exceptions may be
        caused by factors that should generate lisp errors or other
        serious conditions (such as stack overflow); if this is the
        case, the kernel code may release the global exception lock
        and call out to lisp code.  (The lisp code in question may
        need to repeat some of the exception decoding process; in
        particular, it needs to be able to interpret register values
        in the signal context that it receives as an argument.)</para>
        <para>In some cases, the lisp kernel exception handler may not
        be able to recover from the exception (this is currently true
        of some types of memory-access fault and is also true of traps
        or illegal instructions that occur during foreign code
        execution).  In such cases, the kernel exception handler
        reports the exception as "unhandled", and the kernel debugger
        is invoked.</para>
        <para>If the kernel exception handler identifies the
        exception's cause as being a transient out-of-memory condition
        (indicating that the current thread needs more memory to cons
        in), it tries to make that memory available.  In some cases,
        doing so involves invoking the GC.</para>
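        <para>The overall flow described in this section (record the
        signal context in the TCR, wait for the global exception lock,
        then decide what to do based on the decoded cause) might be
        modelled like this.  It is a deliberately simplified sketch:
        the cause and action labels are invented for illustration, the
        TCR is a plain dictionary, and real decoding works from the
        signal number, siginfo_t, and signal context rather than from a
        ready-made string.</para>
        <programlisting language="python"><![CDATA[
```python
import threading

# Stands in for the global exception lock that serializes exception processing.
exception_lock = threading.Lock()

# Hypothetical labels for causes that should become lisp errors or conditions.
LISP_CONDITION_CAUSES = {"type-error", "wrong-number-of-args", "stack-overflow"}


def handle_exception(tcr, signal_context, cause):
    """Return a label for the action taken; every label here is illustrative."""
    tcr["signal_context"] = signal_context   # saved before waiting for the lock
    tcr["waiting_for_exception"] = True
    with exception_lock:
        tcr["waiting_for_exception"] = False
        if cause == "out-of-memory":
            action = "allocate-or-gc"        # try to make memory available
        elif cause in LISP_CONDITION_CAUSES:
            action = "call-out-to-lisp"      # performed after the lock is released
        else:
            action = "kernel-debugger"       # reported as "unhandled"
    return action


tcr = {}
action = handle_exception(tcr, {"pc": 0x1234}, "out-of-memory")
```
]]></programlisting>
        <para>Returning the action instead of performing it inside the
        locked region mirrors the remark that the lock may be released
        before lisp code is called.</para>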
      </sect2>

      <sect2 id="Threads-comma---exceptions-comma---and-the-GC">
        <title>Threads, exceptions, and the GC</title>
        <para>&CCL;'s GC is not concurrent: when the GC is invoked in
        response to an exception in a particular thread, all other
        lisp threads must stop until the GC's work is done.  The
        thread that triggered the GC iterates over the global TCR
        list, sending each other thread a distinguished "suspend"
        signal, then iterates over the list again, waiting for a
        per-thread semaphore that indicates that the thread has
        received the "suspend" signal and responded appropriately.
        Once all other threads have acknowledged the request to
        suspend themselves, the GC thread can run the GC proper (after
        doing any necessary <xref linkend="PC-lusering"/>).  Once the
        GC has completed its work, the thread that invoked the GC
        iterates over the global TCR list, raising a per-thread
        "resume" semaphore for each other thread.</para>
        <para>The signal handler for the asynchronous "suspend" signal
        is entered with all asynchronous signals blocked.  It saves
        its signal-context argument in a TCR slot, raises the TCR's
        "suspend" semaphore, then waits on the TCR's "resume"
        semaphore.</para>
        <para>The GC thread has access to the signal contexts of all
        TCRs (including its own) at the time when the thread received
        an exception or acknowledged a request to suspend itself.
        This information (and information about stack areas in the TCR
        itself) allows the GC to identify the "stack locations and
        register contents" that are elements of the GC's root
        set.</para>
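        <para>The suspend/acknowledge/resume handshake can be
        simulated with ordinary Python threads.  The sketch below is
        only a model of the protocol: a threading.Event stands in for
        delivery of the "suspend" signal, the saved "context" is just a
        string, and the class and function names are hypothetical.
        What it does preserve is the two-pass structure (signal
        everyone, then wait for everyone) and the pair of per-thread
        semaphores.</para>
        <programlisting language="python"><![CDATA[
```python
import threading


class TCR:
    """Toy per-thread record holding the two semaphores described above."""

    def __init__(self, name):
        self.name = name
        self.suspend_signal = threading.Event()  # stands in for the "suspend" signal
        self.suspend = threading.Semaphore(0)    # raised by the thread once stopped
        self.resume = threading.Semaphore(0)     # raised by the GC thread when done
        self.saved_context = None


def lisp_thread(tcr, log):
    tcr.suspend_signal.wait()                    # the "signal" arrives
    tcr.saved_context = "context of " + tcr.name # handler saves its signal context
    tcr.suspend.release()                        # acknowledge the suspend request
    tcr.resume.acquire()                         # sleep until the GC is finished
    log.append(tcr.name + " resumed")


def run_gc(other_tcrs, log):
    for tcr in other_tcrs:                       # first pass: ask everyone to stop
        tcr.suspend_signal.set()
    for tcr in other_tcrs:                       # second pass: wait for each ack
        tcr.suspend.acquire()
    roots = sorted(tcr.saved_context for tcr in other_tcrs)
    log.append("gc ran with roots: " + ", ".join(roots))
    for tcr in other_tcrs:                       # let the other threads continue
        tcr.resume.release()


log = []
tcrs = [TCR("t1"), TCR("t2")]
threads = [threading.Thread(target=lisp_thread, args=(t, log)) for t in tcrs]
for thread in threads:
    thread.start()
run_gc(tcrs, log)
for thread in threads:
    thread.join()
```
]]></programlisting>
        <para>Because no thread can pass its "resume" semaphore before
        the GC has recorded its work, the GC's log entry always comes
        first; the order in which the two threads resume is not
        determined.</para>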
      </sect2>

      <sect2 id="PC-lusering">
        <title>PC-lusering</title>
        <para>It's not quite accurate to say that &CCL;'s compiler
        and runtime follow precise stack and register usage
        conventions at all times; there are a few exceptions:</para>

        <itemizedlist>
          <listitem>
            <para>On both PPC and x86-64 platforms, consing isn't
            fully atomic.  It takes at least a few instructions to
            allocate an object in memory (and slap a header on it if
            necessary); if a thread is interrupted in the middle of
            that instruction sequence, the new object may or may
            not have been created or fully initialized at the point in
            time that the interrupt occurred.  (There are actually a
            few different states of partial initialization.)</para>
          </listitem>
          <listitem>
            <para>On the PPC, the common act of building a lisp
            control stack frame involves allocating a four-word frame
            and storing three register values into that frame.  (The
            fourth word - the back pointer to the previous frame - is
            automatically set when the frame is allocated.)  The
            previous contents of those three words are unknown (there
            might have been a foreign stack frame at the same address a
            few instructions earlier), so interrupting a thread that's
            in the process of initializing a PPC control stack frame
            isn't GC-safe.</para>
          </listitem>
          <listitem>
            <para>There are similar problems with the initialization
            of temp stack frames on the PPC.  (Allocation and
            initialization don't happen atomically, and the newly
            allocated stack memory may have undefined contents.)</para>
          </listitem>
          <listitem>
            <para><xref linkend="The-ephemeral-GC"/>'s write barrier
            has to be implemented atomically (i.e., both an
            intergenerational store and the update of a
            corresponding reference bit have to happen without
            interruption, or neither of these events can
            happen).</para>
          </listitem>
          <listitem>
            <para>There are a few more similar cases.</para>
          </listitem>
        </itemizedlist>

        <para>Fortunately, the number of these non-atomic instruction
        sequences is small, and fortunately it's fairly easy for the
        interrupting thread to recognize when the interrupted thread
        is in the middle of such a sequence.  When this is detected,
        the interrupting thread modifies the state of the interrupted
        thread (modifying its PC and other registers) so that it is no
        longer in the middle of such a sequence (it's either backed
        out of it or the remaining instructions are emulated).</para>
        <para>This works because (a) many of the troublesome
        instruction sequences are PPC-specific, and it's relatively
        easy to partially disassemble the instructions surrounding the
        interrupted thread's PC on the PPC, and (b) those instruction
        sequences are heavily stylized and intended to be easily
        recognized.</para>
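        <para>A toy version of the "back out or emulate" decision is
        shown below.  It assumes, purely for illustration, that each
        troublesome sequence can be described by a known address range
        and a fixed policy; the table, addresses, and policy names are
        hypothetical, and the real kernel recognizes sequences by
        inspecting the instructions themselves.  Emulating the
        remaining instructions also has side effects on registers and
        memory that this sketch does not model.</para>
        <programlisting language="python"><![CDATA[
```python
# Hypothetical table of non-atomic sequences: (start, end, policy).
# "end" is the address just past the last instruction of the sequence.
NON_ATOMIC_SEQUENCES = [
    (0x1000, 0x100C, "back-out"),   # e.g. an allocation sequence
    (0x2000, 0x2008, "emulate"),    # e.g. a write-barrier sequence
]


def pc_luser(pc):
    """Return a PC that is no longer in the middle of a known sequence."""
    for start, end, policy in NON_ATOMIC_SEQUENCES:
        if start < pc < end:        # strictly inside: some but not all steps ran
            if policy == "back-out":
                return start        # restart the sequence from the beginning
            return end              # treat the remaining steps as emulated
    return pc                       # not inside any sequence: leave it alone


adjusted = [pc_luser(pc) for pc in (0x1004, 0x2004, 0x3000)]
```
]]></programlisting>
        <para>A PC that sits exactly at the start of a sequence is left
        unchanged, since none of the sequence has executed yet.</para>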
      </sect2>
    </sect1>

    <sect1 id="Register-usage-and-tagging">
      <title>Register usage and tagging</title>

      <sect2 id="Register-usage-and-tagging-overview">
        <title>Overview</title>
        <para>Regardless of other details of its implementation, a
        garbage collector's job is to partition the set of all
        heap-allocated lisp objects (CONSes, STRINGs, INSTANCEs, etc.)
        into two subsets.  The first subset contains all objects that
        are transitively referenced from a small set of "root" objects
        (the contents of the stacks and registers of all active
        threads at the time the GC occurs and the values of some
        global variables).  The second subset contains everything
        else: those lisp objects that are not transitively reachable
        from the roots are garbage, and the memory occupied by garbage
        objects can be reclaimed (since the GC has just proven that
        it's impossible to reference them).</para>
        <para>The set of live, reachable lisp objects basically forms
        the nodes of a (usually large) graph, with edges from each
        node A to any other objects (nodes) that object A
        references.</para>
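        <para>The partition described above is ordinary graph
        reachability.  The following sketch shows the idea on a toy
        heap represented as a dictionary from object names to the names
        they reference; the representation and the function name are
        assumptions made for the example and say nothing about how
        &CCL;'s GC is actually implemented.</para>
        <programlisting language="python"><![CDATA[
```python
def partition_heap(heap, roots):
    """Split `heap` (object -> list of referenced objects) into live and garbage."""
    live = set()
    pending = [obj for obj in roots if obj in heap]
    while pending:
        obj = pending.pop()
        if obj in live:
            continue                    # already visited
        live.add(obj)
        pending.extend(heap[obj])       # follow the outgoing edges
    garbage = set(heap) - live
    return live, garbage


# "d" refers to a live object but nothing refers to "d"; "e" refers only to itself.
heap = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"], "e": ["e"]}
live, garbage = partition_heap(heap, roots=["a"])
```
]]></programlisting>
        <para>Note that "d" is garbage even though it points at a live
        object, and "e" is garbage even though something (itself)
        points at it: only reachability from the roots matters.</para>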
        <para>Some nodes in this graph can never have outgoing edges:
        an array with a specialized numeric or character type usually
        represents its elements in some (possibly more compact)
        specialized way.  Some nodes may refer to lisp objects that
        are never allocated in memory (FIXNUMs, CHARACTERs,
        SINGLE-FLOATs on 64-bit platforms, and so on).  This latter
        class of objects is sometimes called "immediates", but that's
        a little confusing because the term "immediate" is sometimes
        used to refer to things that can never be part of the big
        connectivity graph (e.g., the "raw" bits that make up a
        floating-point value, foreign address, or numeric value that
        needs to be used - at least fleetingly - in compiled
        code).</para>
        <para>For the GC to be able to build the connectivity graph
        reliably, it's necessary for it to be able to reliably tell
        (a) whether or not a "potential root" - the contents of a
        machine register or stack location - is in fact a node and (b)
        for any node, whether it may have components that refer to
        other nodes.</para>
        <para>There's no reliable way to answer the first question on
        stock hardware.  (If everything were a node, as might be the
        case on specially microcoded "lisp machine" hardware, it
        wouldn't even need to be asked.)  Since there's no way to just
        look at a machine word (the contents of a machine register or
        stack location) and tell whether it's a node or just
        some random non-node value, we have to either adopt and
        enforce strict conventions on register and stack usage or
        tolerate ambiguity.</para>
        <para>"Tolerating ambiguity" is an approach taken by some
        ("conservative") GC schemes; by contrast, &CCL;'s GC is
        "precise", which in this case means that it believes that the
        contents of certain machine registers and stack locations are
        always nodes, that other registers and stack locations are
        never nodes, and that these conventions are never violated by
        the compiler or runtime system.  The fact that threads are
        preemptively scheduled means that a GC could occur (because of
        activity in some other thread) on any instruction boundary,
        which in turn means that the compiler and runtime system must
        follow precise <xref
        linkend="Register-and-stack-usage-conventions"/> at all
        times.</para>
        <para>Once we've decided that a given machine word is a node,
        a <xref linkend="Tagging-scheme"/> describes how the node's
        value and type are encoded in that machine word.</para>
        <para>Most of this - so far - has discussed things from the
        GC's very low-level perspective.  From a much higher point of
        view, lisp functions accept nodes as arguments, return nodes
        as values, and (usually) perform some operations on those
        arguments in order to produce those results.  (In many cases,
        the operations in question involve raw non-node values.)
        Higher-level parts of the lisp type system (functions like
        TYPE-OF and CLASS-OF, etc.) depend on the <xref
        linkend="Tagging-scheme"/>.</para>
      </sect2>

      <sect2 id="pc-locatives-on-the-PPC">
        <title>pc-locatives on the PPC</title>
        <para>On the PPC, there's a third case (besides "node" and
        "immediate" values).  As discussed below, a node that denotes
        a memory-allocated lisp object is a biased (tagged) pointer
        -to- that object; it's not generally possible to point -into-
        some composite (multi-element) object (such a pointer would
        not be a node, and the GC would have no way to update the
        pointer if it were to move the underlying object).</para>
        <para>Such a pointer ("into" the interior of a heap-allocated
        object) is often called a <emphasis>locative</emphasis>; the
        cases where locatives are allowed in &CCL; mostly involve
        the behavior of function call and return instructions.  (To be
        technically accurate, this third case also arises on x86-64,
        but there it isn't as user-visible.)</para>
        <para>On the PowerPC (both PPC32 and PPC64), all machine
        instructions are 32 bits wide and all instruction words are
        allocated on 32-bit boundaries.  In PPC &CCL;, a CODE-VECTOR
        is a specialized type of vector-like object; its elements are
        32-bit PPC machine instructions.  A CODE-VECTOR is an
        attribute of a FUNCTION object; a function call involves
        accessing the function's code-vector and jumping to the
        address of its first instruction.</para>
        <para>As each instruction in the code vector sequentially
        executes, the hardware program counter (PC) register advances
        to the address of the next instruction (a locative into the
        code vector); since PPC instructions are always 32 bits wide
        and aligned on 32-bit boundaries, the low two bits of the PC
        are always 0.  If the function executes a call (simple call
        instructions have the mnemonic "bl" on the PPC, which stands
        for "branch and link"), the address of the next instruction
        (also a word-aligned locative into a code-vector) is copied
        into the special-purpose PPC "link register" (lr); a function
        returns to its caller via a "branch to link register" (blr)
        instruction.  Some cases of function call and return might
        also use the PPC's "count register" (ctr), and if either the
        lr or ctr needs to be stored in memory it needs to first be
        copied to a general-purpose register.</para>
        <para>&CCL;'s GC understands that certain registers contain
        these special "pc-locatives" (locatives that point into
        CODE-VECTOR objects); it contains special support for finding
        the containing CODE-VECTOR object and for adjusting all of
        these "pc-locatives" if the containing object is moved in
        memory.  The first part of that - finding the containing
        object - is possible and practical on the PPC because of
        architectural artifacts (fixed-width instructions and arcana
        of instruction encoding).  It's not possible on x86-64, but
        fortunately not necessary either (though the second part -
        adjusting the PC/RIP when the containing object moves - is
        both necessary and simple).</para>
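        <para>The "simple" half of the job, adjusting a pc-locative
        once its containing code vector is known and has moved, is
        just offset arithmetic.  The sketch below assumes that the old
        and new addresses of the vector's first instruction are both
        known and word-aligned; the function name and the addresses
        are hypothetical, and the harder problem of finding the
        containing object is not modelled.</para>
        <programlisting language="python"><![CDATA[
```python
INSTRUCTION_BYTES = 4   # PPC instructions are 32 bits wide


def relocate_pc_locative(pc, old_base, new_base):
    """Adjust `pc` after its code vector's first instruction moves.

    `old_base` and `new_base` are the addresses of the first instruction
    before and after the move (an assumption of this sketch).
    """
    offset = pc - old_base
    if offset < 0 or offset % INSTRUCTION_BYTES != 0:
        raise ValueError("pc does not address an instruction in this vector")
    return new_base + offset


new_pc = relocate_pc_locative(0x4010, old_base=0x4000, new_base=0x9000)
```
]]></programlisting>
        <para>Because the offset is preserved and both bases are
        word-aligned, the relocated PC still has its low two bits
        clear.</para>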
      </sect2>

      <sect2 id="Register-and-stack-usage-conventions">
        <title>Register and stack usage conventions</title>

        <sect3 id="Stack-conventions">
          <title>Stack conventions</title>
          <para>On both PPC and x86-64 platforms, each lisp thread
          uses 3 stacks; the ways in which these stacks are used
          differ between the PPC and x86-64.</para>
          <para>Each thread has:</para>
          <itemizedlist>
            <listitem>
              <para>A "control stack".  On both platforms, this is
              "the stack" used by foreign code.  On the PPC, it
              consists of a linked list of frames where the first word
              in each frame points to the first word in the previous
              frame (and the outermost frame points to 0).  Some
              frames on a PPC control stack are lisp frames; lisp
              frames are always 4 words in size and contain (in
              addition to the back pointer to the previous frame) the
              calling function (a node), the return address (a
              "locative" into the calling function's code-vector), and
              the value to which the value-stack pointer (see below)
              should be restored on function exit.  On the PPC, the GC
              has to look at control-stack frames, identify which of
              those frames are lisp frames, and treat the contents of
              the saved function slot as a node (and handle the return
              address locative specially).  On x86-64, the control
              stack is used for dynamic-extent allocation of immediate
              objects.  Since the control stack never contains nodes
              on x86-64, the GC ignores it on that platform.
              Alignment of the control stack follows the ABI
              conventions of the platform (at least at any point in
              time where foreign code could run).  On PPC, the r1
              register always points to the top of the current
              thread's control stack; on x86-64, the RSP register
              points to the top of the current thread's control stack
              when the thread is running foreign code, and the address
              of the top of the control stack is kept in the thread's
              TCR (see <xref linkend="The-Thread-Context-Record"/>)
              when not running foreign code.  The control stack "grows
              down."</para>
            </listitem>
            <listitem>
              <para>A "value stack".  On both platforms, all values on
              the value stack are nodes (including "tagged return
              addresses" on x86-64).  The value stack is always
              aligned to the native word size; objects are always
              pushed on the value stack using atomic instructions
              ("stwu"/"stdu" on PPC, "push" on x86-64), so the
              contents of the value stack between its bottom and top
              are always unambiguously nodes.  The compiler usually
              tries to pop or discard nodes from the value stack as
              soon as possible after their last use (as soon as they
              may have become garbage).  On x86-64, the RSP register
              addresses the top of the value stack when running lisp
              code; that address is saved in the TCR when running
              foreign code.  On the PPC, a dedicated register (VSP,
              currently r15) is used to address the top of the value
              stack when running lisp code, and the VSP value is saved
              in the TCR when running foreign code.  The value stack
              grows down.</para>
            </listitem>
            <listitem>
              <para>A "temp stack".  The temp stack consists of a
              linked list of frames, each of which points to the
              previous temp stack frame.  The number of native machine
              words in each temp stack frame is always even, so the
              temp stack is aligned on a two-word (64- or 128-bit)
              boundary.  The temp stack is used for dynamic-extent
              objects on both platforms; on the PPC, it's used for
              essentially all such objects (regardless of whether or
              not the objects contain nodes); on the x86-64, immediate
              dynamic-extent objects (strings, foreign pointers, etc.)
              are allocated on the control stack and only
              node-containing dynamic-extent objects are allocated on
              the temp stack.  Data structures used to implement CATCH
              and UNWIND-PROTECT are stored on the temp stack on both
              PPC and x86-64.  Temp stack frames are always doublenode
              aligned and objects within a temp stack frame are
              aligned on doublenode boundaries.  The first word in
              each frame contains a back pointer to the previous
              frame; on the PPC, the second word is used to indicate
              to the GC whether the remaining objects are nodes (if
              the second word is 0) or immediate (otherwise.)  On
              x86-64, where temp stack frames always contain nodes,
              the second word is always 0.  The temp stack grows down.
              It usually takes several instructions to allocate and
              safely initialize a temp stack frame that's intended to
              contain nodes, and the GC has to recognize the case
              where a thread is in the process of allocating and
              initializing a temp stack frame and take care not to
              interpret any uninitialized words in the frame as nodes.
              See (someplace).  The PPC keeps the current top of the
              temp stack in a dedicated register (TSP, currently r12)
              when running lisp code and saves this register's value
              in the TCR when running foreign code.  The x86-64 keeps
              the address of the top of each thread's temp stack in
              the thread's TCR.</para>
            </listitem>
          </itemizedlist>
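          <para>As a rough illustration of the PPC temp stack frame
          convention described above, the following Python sketch
          models each frame as a back pointer, a second word, and a
          payload, and collects the words that a collector would treat
          as nodes.  The names "TempFrame" and "nodes_to_scan" are
          hypothetical and the addresses are made up; the real frame
          layout is defined by the lisp kernel sources, not by this
          model.</para>
          <programlisting><![CDATA[
```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TempFrame:
    """Toy model of one temp stack frame (not the kernel's real layout)."""
    previous: Optional["TempFrame"]  # first word: back pointer to the previous frame
    second_word: int                 # PPC: 0 means the remaining words are nodes
    payload: List[int]               # remaining words in the frame


def nodes_to_scan(top: Optional[TempFrame]) -> List[int]:
    """Walk frames from the top and collect words that would be treated as nodes."""
    found: List[int] = []
    frame = top
    while frame is not None:
        if frame.second_word == 0:   # node-containing frame
            found.extend(frame.payload)
        frame = frame.previous       # follow the back pointer
    return found


# Two frames, each with an even number of words (2 header words + 2 payload words).
older = TempFrame(previous=None, second_word=0, payload=[0x1008, 0x2010])
newer = TempFrame(previous=older, second_word=1, payload=[0xDEADBEEF, 0x0])
scanned = nodes_to_scan(newer)
```
]]></programlisting>
          <para>In this model the newer frame is skipped because its
          second word is nonzero, which mirrors the PPC rule that such
          frames hold immediate data.</para>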
        </sect3>

        <sect3 id="Register-conventions">
          <title>Register conventions</title>
          <para>If there are a "reasonable" (for some value of
          "reasonable") number of general-purpose registers and the
          instruction set is "reasonably" orthogonal (most
          instructions that operate on GPRs can operate on any GPR),
          then it's possible to statically partition the GPRs into at
          least two sets: "immediate registers" never contain nodes,
          and "node registers" always contain nodes.  (On the PPC, a
          few registers are members of a third set of "PC locatives",
          and on both platforms some registers may have dedicated
          roles as stack or heap pointers; the latter class is treated
          as immediates by the GC proper but may be used to help
          determine the bounds of stack and heap memory areas.)</para>
          <para>The ultimate definition of register partitioning is
          hardwired into the GC in functions like "mark_xp()" and
          "forward_xp()", which process the values of some of the
          registers in an exception frame as nodes and may give some
          sort of special treatment to other register values they
          encounter there.</para>
          <para>On x86-64, the static register partitioning scheme involves:</para>
          <itemizedlist>
            <listitem>
              <para>(only) three "immediate" registers.</para>
              <para>The RAX, RCX, and RDX registers are used as the
              implicit operands and results of some extended-precision
              multiply and divide instructions which generally involve
              non-node values; since their use in these instructions
              means that they can't be guaranteed to contain node
              values at all times, it's natural to put these registers
              in the "immediate" set. RAX is generally given the
              symbolic name "imm0", RDX is given the symbolic name
              "imm1" and RCX is given the symbolic name "imm2"; you
              may see these names in disassembled code, usually in
              operations involving type checking, array indexing, and
              foreign memory and function access.</para>
            </listitem>
            <listitem>
              <para>(only) two "dedicated" registers.</para>
              <para>RSP and RBP have
              dedicated functionality dictated by the hardware and
              calling conventions.</para>
            </listitem>
            <listitem>
              <para>11 "node" registers.</para>
              <para>All other registers (RBX, RSI, RDI, and R8-R15)
              are asserted to contain node values at (almost) all
              times; legacy "string" operations that implicitly use RSI
              and/or RDI are not used.</para>
            </listitem>
          </itemizedlist>
          <para>On the PPC, the static register partitioning scheme
          involves:</para>

          <itemizedlist>
            <listitem>
              <para>6 "immediate" registers.</para>
              <para>Registers r3-r8 are given
              the symbolic names imm0-imm5.  As a RISC architecture
              with simpler addressing modes, the PPC probably
              uses immediate registers a bit more often than the CISC
              x86-64 does, but they're generally used for the same sort
              of things (type checking, array indexing, FFI,
              etc.)</para>
            </listitem>
            <listitem>
              <para>9 dedicated registers
              <itemizedlist>
                <listitem>
                  <para>r0 (symbolic name rzero) always contains the
                  value 0 when running lisp code.  Its value is
                  sometimes read as 0 when it's used as the base
                  register in a memory address; keeping the value 0
                  there is sometimes convenient and avoids
                  asymmetry.</para>
                </listitem>
                <listitem>
                  <para>r1 (symbolic name sp) is the control stack
                  pointer, by PPC convention.</para>
                </listitem>
                <listitem>
                  <para>r2 is used to hold the current thread's TCR on
                  PPC64 systems; it's not used on PPC32.</para>
                </listitem>
                <listitem>
                  <para>r9 and r10 (symbolic names allocptr and
                  allocbase) are used to do per-thread memory
                  allocation.</para>
                </listitem>
                <listitem>
                  <para>r11 (symbolic name nargs) contains the number
                  of function arguments on entry and the number of
                  return values in multiple-value returning
                  constructs.  It's not used more generally as either
                  a node or immediate register because of the way that
                  certain trap instruction encodings are
                  interpreted.</para>
                </listitem>
                <listitem>
                  <para>r12 (symbolic name tsp) holds the top of the
                  current thread's temp stack.</para>
                </listitem>
                <listitem>
                  <para>r13 is used to hold the TCR on PPC32 systems;
                  it's not used on PPC64.</para>
                </listitem>
                <listitem>
                  <para>r14 (symbolic name loc-pc) is used to copy
                  "pc-locative" values between main memory and
                  special-purpose PPC registers (LR and CTR) used in
                  function-call and return instructions.</para>
                </listitem>
                <listitem>
                  <para>r15 (symbolic name vsp) addresses the top of
                  the current thread's value stack.</para>
                </listitem>
                <listitem>
                  <para>lr and ctr are PPC branch-unit registers used
                  in function call and return instructions; they're
                  always treated as "pc-locatives", which precludes
                  the use of the ctr in some PPC looping
                  constructs.</para>
                </listitem>

              </itemizedlist>
              </para>
            </listitem>
            <listitem>
              <para>16 "node" registers</para>
              <para>r16-r31 are always treated as node registers (r15
              is the dedicated vsp register listed above.)</para>
            </listitem>

          </itemizedlist>
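          <para>To make the idea of a static partition concrete, here
          is a small Python sketch of the x86-64 partition described
          above, together with a hypothetical helper that picks the
          node registers out of a saved register set.  It is only a
          model of the bookkeeping; the name "node_roots" and the
          dictionary representation of an exception frame are
          assumptions, and the real logic lives in kernel functions
          such as "mark_xp()".</para>
          <programlisting><![CDATA[
```python
# x86-64 general-purpose registers, in conventional encoding order.
ALL_GPRS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"] + [
    f"r{n}" for n in range(8, 16)
]

IMMEDIATE = {"rax", "rcx", "rdx"}   # imm0, imm2 and imm1 respectively
DEDICATED = {"rsp", "rbp"}          # roles fixed by hardware and calling conventions

# Everything else is asserted to hold nodes at (almost) all times.
NODE = [r for r in ALL_GPRS if r not in IMMEDIATE and r not in DEDICATED]


def node_roots(saved_registers):
    """Return (name, value) pairs for saved registers a collector would treat as nodes."""
    return [(name, saved_registers[name]) for name in NODE if name in saved_registers]


# Made-up register values, for illustration only.
roots = node_roots({"rax": 5, "rbx": 0x1003, "rsp": 0x7FFF0000, "r8": 0x2013})
```
]]></programlisting>
          <para>The partition is static: whether a register value is
          examined depends only on the register's name, never on the
          bits it happens to contain.</para>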
        </sect3>
      </sect2>

      <sect2 id="Tagging-scheme">
        <title>Tagging scheme</title>
        <para>&CCL; always allocates lisp objects on doublenode
        (64-bit for 32-bit platforms, 128-bit for 64-bit platforms)
        boundaries; this means that the low 3 bits (32-bit lisp) or 4
        bits (64-bit lisp) of an object's address are always 0 and
        are therefore redundant (we only really need to know the
        upper 29 or 60 bits in order to identify the aligned object
        address.)  The extra bits in a lisp node can be used to
        encode at least some information about the node's type, and
        the other 29/60 bits represent either an immediate value or a
        doublenode-aligned memory address.  The low 3 or 4 bits of a
        node are called the node's "tag bits", and the conventions
        used to encode type information in those tag bits are called
        a "tagging scheme."</para>
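        <para>The relationship between doublenode alignment and the
        number of tag bits can be sketched in a few lines of Python.
        The function names are hypothetical and the example address
        is made up; the tag value 1 is used only because it is the
        PPC32 CONS tag mentioned later in this chapter.</para>
        <programlisting><![CDATA[
```python
def tag_bit_count(word_bits):
    """Low bits freed by doublenode alignment: 3 for 32-bit lisp, 4 for 64-bit."""
    doublenode_bytes = 2 * (word_bits // 8)
    return doublenode_bytes.bit_length() - 1   # log2 of the alignment in bytes


def split_node(node, word_bits):
    """Split a node into (upper bits with the tag cleared, tag bits)."""
    mask = (1 << tag_bit_count(word_bits)) - 1
    return node & ~mask, node & mask


# A doublenode-aligned address on a 32-bit lisp, with a tag of 1 added.
address, tag = split_node(0x10008 | 1, 32)
```
]]></programlisting>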
        <para>It might be possible to use the same tagging scheme on
        all platforms (at least on all platforms with the same word
        size and/or the same number of available tag bits), but there
        are often some strong reasons for not doing so.  These
        arguments tend to be very machine-specific: sometimes, there
        are fairly obvious machine-dependent tricks that can be
        exploited to make common operations on some types of tagged
        objects faster; other times, there are architectural
        restrictions that make it impractical to use certain tags for
        certain types.  (On PPC64, the "ld" (load doubleword) and
        "std" (store doubleword) instructions - which load and store a
        GPR operand at the effective address formed by adding the
        value of another GPR operand and a 16-bit constant operand -
        require that the low two bits of that constant operand be 0.
        Since such instructions would typically be used to access the
        fields of things like CONS cells and structures, it's
        desirable that the tags chosen for CONS cells and
        structures allow the use of these instructions as opposed to
        more expensive alternatives.)</para>
        <para>One architecture-independent tagging trick that works
        well on all architectures is to use a tag of 0 for FIXNUMs: a
        fixnum basically encodes its value shifted left a few bits and
        keeps those low bits clear.  FIXNUM addition, subtraction, and
        binary logical operations can operate directly on the node
        operands, addition and subtraction can exploit hardware-based
        overflow detection, and (in the absence of overflow) the
        hardware result of those operations is a node (fixnum).  Some
        other slightly-less-common operations may require a few extra
        instructions, but arithmetic operations on FIXNUMs should be
        as cheap as possible and using a tag of zero for FIXNUMs helps
        to ensure that they will be.</para>
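        <para>A minimal Python sketch of zero-tagged fixnums follows.
        The shift of 2 is an assumption matching the 32-bit case,
        "box" and "unbox" are hypothetical names, and the
        multiplication at the end is just one illustration of an
        operation that needs an extra step.  Python integers never
        overflow, so hardware overflow detection is not modelled.</para>
        <programlisting><![CDATA[
```python
FIXNUM_SHIFT = 2   # assumed number of low bits kept clear (32-bit lisp)


def box(n):
    """Encode an integer as a fixnum node: the value shifted left, low bits zero."""
    return n << FIXNUM_SHIFT


def unbox(node):
    """Recover the integer from a fixnum node (arithmetic shift right)."""
    return node >> FIXNUM_SHIFT


a, b = box(20), box(-3)
total = a + b                 # addition operates directly on the nodes
both = box(12) & box(10)      # so do binary logical operations
product = unbox(a) * b        # multiplication needs one operand unboxed first
```
]]></programlisting>
        <para>In each case the result is itself a fixnum node: its
        low bits are still clear and no re-tagging step is
        needed.</para>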
        <para>If we have N available tag bits (N = 3 for 32-bit
        &CCL; and N = 4 for 64-bit &CCL;), this way of
        representing fixnums with the low M bits forced to 0 works as
        long as M &lt;= N.  The smaller we make M, the larger the
        magnitudes of MOST-POSITIVE-FIXNUM and MOST-NEGATIVE-FIXNUM
        become; the larger we make M, the more distinct non-FIXNUM
        tags become available.  A reasonable compromise is to choose
        M = N-1; this basically yields two distinct FIXNUM tags (one
        for even fixnums, one for odd fixnums), gives 30-bit fixnums
        on 32-bit platforms and 61-bit fixnums on 64-bit platforms,
        and leaves us with 6 or 14 tags to encode other types.</para>
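        <para>The consequences of choosing M = N-1 can be checked
        with a little arithmetic.  The helper below is a hypothetical
        sketch that derives the fixnum width, the number of tags left
        for other types, and the fixnum range from the word size and
        the number of tag bits.</para>
        <programlisting><![CDATA[
```python
def fixnum_layout(word_bits, n_tag_bits):
    """Return (fixnum bits, non-fixnum tags, most positive, most negative) for M = N-1."""
    m = n_tag_bits - 1                        # low bits forced to zero in a fixnum
    fixnum_bits = word_bits - m               # bits left for the signed value
    fixnum_tags = 2 ** (n_tag_bits - m)       # tag values whose low M bits are zero
    other_tags = 2 ** n_tag_bits - fixnum_tags
    most_positive = 2 ** (fixnum_bits - 1) - 1
    most_negative = -(2 ** (fixnum_bits - 1))
    return fixnum_bits, other_tags, most_positive, most_negative


layout32 = fixnum_layout(32, 3)
layout64 = fixnum_layout(64, 4)
```
]]></programlisting>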
        <para>Once we get past the assignment of FIXNUM tags, things
        quickly devolve into machine-dependencies.  We can fairly
        easily see that we can't directly tag all other primitive lisp
        object types with only 6 or 14 available tag values; the
        details of how types are encoded vary between the PPC32,
        PPC64, and x86-64 implementations, but there are some general
        common principles:</para>

        <itemizedlist>
          <listitem>
            <para>CONS cells always contain exactly 2 elements and are
            usually fairly common.  It therefore makes sense to give
            CONS cells their own tag.  Unlike the fixnum case - where a
            tag value of 0 had positive implications - there doesn't
            seem to be any advantage to using any particular value.
            (A long time ago - in the case of 68K MCL - the CONS tag
            and the order of CAR and CDR in memory were chosen to allow
            smaller, cheaper addressing modes to be used to "cdr down a
            list."  That's not a factor on PPC or x86-64, but all
            versions of &CCL; still store the CDR of a CONS cell
            first in memory.  It doesn't matter, but doing it the way
            that the host system did made bootstrapping to a new target
            system a little easier.)
            </para>
          </listitem>
          <listitem>
            <para>Any way you look at it, NIL is a bit ... unusual. NIL
            is both a SYMBOL and a LIST (as well as being a canonical
            truth value and probably a few other things.)  Its role as
            a LIST is probably much more important to most programs
            than its role as a SYMBOL is: LISTP has to be true of NIL,
            and primitives like CAR and CDR do LISTP implicitly when
            safe and want that operation to be fast.  There are several
            possible approaches to this; &CCL; uses two of them.  On
            PPC32 and x86-64, NIL is basically a weird CONS cell that
            straddles two doublenodes; the tag of NIL is unique and
            congruent modulo 4 (modulo 8 on 64-bit) with the tag used
            for CONS cells.  LISTP is therefore true of any node whose
            low 2 (or 3) bits contain the appropriate tag value (it's
            not otherwise necessary to special-case NIL.)
            SYMBOL accessors (SYMBOL-NAME, SYMBOL-VALUE, SYMBOL-PLIST
            ..) -do- have to special-case NIL (and access the
            components of an internal proxy symbol.)  On PPC64 (where
            architectural restrictions dictate the set of tags that can
            be used to access fixed components of an object),
            that approach wasn't practical.  NIL is just a
            distinguished SYMBOL, and it just happens to be the case
            that its pname slot and value slot are at the same offsets
            from a tagged pointer as a CONS cell's CDR and CAR would be.
            NIL's pname is set to NIL (SYMBOL-NAME checks for this and
            returns the string "NIL"), and LISTP (and therefore safe
            CAR and CDR) have to check for (OR NULL CONSP).  At least in
            the case of CAR and CDR, the fact that the PPC has multiple
            condition-code fields keeps that extra test from
            being prohibitively expensive.</para>
          </listitem>
          <listitem>
            <para>Some objects are immediate (but not FIXNUMs).  This is
            true of CHARACTERs and, on 64-bit platforms,
            SINGLE-FLOATs.  It's also true of some nodes used in the
            runtime system (special values used to indicate unbound
            variables and slots, for instance.)  On 64-bit platforms,
            SINGLE-FLOATs have their own unique tag (making them a
            little easier to recognize); on all platforms, CHARACTERs
            share a tag with other immediate objects (unbound markers)
            but are easy to recognize (by looking at several of their
            low bits.)  The GC treats any node with an immediate tag
            (and any node with a fixnum tag) as a leaf.</para>
          </listitem>
          <listitem>
            <para>There are some advantages to treating everything
            else - memory-allocated objects that aren't CONS cells -
            uniformly.  There are some disadvantages to that uniform
            treatment as well, and the treatment of "memory-allocated
            non-CONS objects" isn't entirely uniform across all
            &CCL; implementations.  Let's first pretend that
            the treatment is uniform, then discuss the ways in which it
            isn't.</para>
            <para>The "uniform approach" is to treat all
            memory-allocated non-CONS objects as if they were vectors;
            this use of the term is a little looser than what's implied
            by the CL VECTOR type.  &CCL; actually uses the
            term "uvector" to mean "a memory-allocated lisp object
            other than a CONS cell, whose first word is a header which
            describes the object's type and the number of elements that
            it contains."  In this view, a SYMBOL is a UVECTOR, as is a
            STRING, a STANDARD-INSTANCE, a CL array or vector, a
            FUNCTION, and even a DOUBLE-FLOAT.</para>
            <para>In the PPC implementations (where things are a
            little more ... uniform), a single tag value is used to
            denote any uvector; in order to determine something more
            specific about the type of the object in question, it's
            necessary to fetch the low byte of the header word from
            memory.  On the x86-64 platform, certain types of uvectors
            - SYMBOLs and FUNCTIONs - are given their own unique tags.
            The good news about the x86-64 approach is that SYMBOLs and
            FUNCTIONs can be recognized without referencing memory; the
            slightly bad news is that primitive operations that work on
            UVECTOR-tagged objects - like the function CCL:UVREF -
            don't work on SYMBOLs or FUNCTIONs on x86-64 (but -do- work
            on those types of objects in the PPC ports.)</para>
            <para>The header word which precedes a UVECTOR's data in
            memory contains 8 bits of type information in the low byte
            and either 24 or 56 bits of "element-count" information in
            the rest of the word.  (This is where the
            sometimes-limiting value of 2^24 for
            ARRAY-TOTAL-SIZE-LIMIT on PPC32 platforms comes from.)
            The low byte of the header - sometimes called the uvector's
            subtag - is itself tagged (which means that the header is
            tagged.)  The (3 or 4) tag bits in the subtag are used to
            determine whether the uvector's elements are nodes or
            immediates.  (A UVECTOR whose elements are nodes is called a
            GVECTOR; a UVECTOR whose elements are immediates is called
            an IVECTOR.  This terminology came from Spice Lisp, which
            was a predecessor of CMUCL.)  Even though a uvector header
            is tagged, a header is not a node.  There's no (supported)
            way to get your hands on one in lisp and doing so could be
            dangerous.  (If the value of a header wound up in a lisp
            node register and that register wound up getting pushed on
            a thread's value stack, the GC might misinterpret that
            situation to mean that there was a stack-allocated UVECTOR
            on the value stack.)</para>
          </listitem>

        </itemizedlist>
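        <para>The uvector header layout described in the last item
        above (an 8-bit subtag in the low byte, an element count in
        the remaining 24 or 56 bits) can be sketched as follows.  The
        function names and the subtag value are hypothetical; real
        subtag values are defined in the architecture-specific
        source files.</para>
        <programlisting><![CDATA[
```python
def make_header(subtag, element_count, word_bits):
    """Pack an 8-bit subtag and an element count into one header word."""
    count_bits = word_bits - 8                 # 24 bits on 32-bit, 56 on 64-bit
    if not 0 <= subtag < 256:
        raise ValueError("subtag must fit in the low byte")
    if not 0 <= element_count < 2 ** count_bits:
        raise ValueError("element count does not fit in the header")
    return (element_count << 8) | subtag


def header_subtag(header):
    """Low byte of the header: the type information."""
    return header & 0xFF


def header_element_count(header):
    """Everything above the low byte: the number of elements."""
    return header >> 8


HYPOTHETICAL_SUBTAG = 0x2A                     # made-up value, for illustration only
header = make_header(HYPOTHETICAL_SUBTAG, 5, 32)
max_count_32 = 2 ** (32 - 8) - 1               # largest count a 32-bit header can hold
```
]]></programlisting>
        <para>Because only 24 bits are available for the count on a
        32-bit platform, a request for 2^24 elements cannot be
        represented, which is the origin of the PPC32 limit mentioned
        above.</para>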
      </sect2>
    </sect1>

    <sect1 id="Heap-Allocation">
      <title>Heap Allocation</title>
      <para>When the &CCL; kernel first
      starts up, a large contiguous chunk of the process's address
      space is mapped as "anonymous, no access" memory. ("Large" means
      different things in different contexts; on LinuxPPC32, it means
      "about 1 gigabyte", on DarwinPPC32, it means "about 2
      gigabytes", and on current 64-bit platforms it ranges from 128
      to 512 gigabytes, depending on OS. These values are both
      defaults and upper limits; the --heap-reserve argument can be
      used to try to reserve less than the default.)</para>
      <para>Reserving address space that can't (yet) be read or
      written to doesn't cost much; in particular, it doesn't require
      that corresponding swap space or physical memory be available.
      Marking the address range as being "mapped" helps to ensure that
      other things (results from random calls to malloc(), dynamically
      loaded shared libraries) won't be allocated in this region that
      lisp has reserved for its own heap growth.</para>
      <para>A small portion (around 1/32 on 32-bit platforms and 1/64
      on 64-bit platforms) of that large chunk of address space is
      reserved for GC data structures.  Memory pages reserved for
      these data structures are mapped read-write as pages are made
      writable in the main portion of the heap.</para>
      <para>The initial heap image is mapped into this reserved
      address space and an additional (LISP-HEAP-GC-THRESHOLD) bytes
      are mapped read-write.  GC data structures grow to match the
      amount of GC-able memory in the initial image + the gc
      threshold, and control is transferred to lisp code.  Inevitably,
      that code spoils everything and starts consing; there are
      basically three layers of memory allocation that can go
      on.</para>

      <sect2 id="Per-thread-object-allocation">
        <title>Per-thread object allocation</title>
        <para>Each lisp thread has a private "reserved memory
        segment"; when a thread starts up, its reserved memory segment
        is empty.  PPC ports maintain the highest unallocated address
        and the lowest allocatable address in the current segment in
        registers when running lisp code; on x86-64, these values are
        maintained in the current thread's TCR.  (An "empty" heap
        segment is one whose high pointer and low pointer are equal.)
        When a thread is not in the middle of allocating something, the
        low 3 or 4 bits of the high and low pointers are clear (the
        pointers are doublenode-aligned.)</para>
        <para>A thread tries to allocate an object whose physical size
        in bytes is X and whose tag is Y by:</para>
        <orderedlist>
          <listitem>
            <para>decrementing the "high" pointer by (- X Y)</para>
          </listitem>
          <listitem>
            <para>trapping if the high pointer is less than the low
            pointer</para>
          </listitem>
          <listitem>
            <para>using the (tagged) high pointer to initialize the
            object, if necessary</para>
          </listitem>
          <listitem>
            <para>clearing the low bits of the high pointer</para>
          </listitem>
        </orderedlist>
        <para>On PPC32, where the size of a CONS cell is 8 bytes and
        the tag of a CONS cell is 1, machine code which sets the arg_z
        register to the result of doing (CONS arg_y arg_z) looks
        like:</para>
        <programlisting>
  (SUBI ALLOCPTR ALLOCPTR 7)    ; decrement the high pointer by (- 8 1)
  (TWLLT ALLOCPTR ALLOCBASE)    ; trap if the high pointer is below the base
  (STW ARG_Z -1 ALLOCPTR)       ; set the CDR of the tagged high pointer
  (STW ARG_Y 3 ALLOCPTR)        ; set the CAR
  (MR ARG_Z ALLOCPTR)           ; arg_z is the new CONS cell
  (RLWINM ALLOCPTR ALLOCPTR 0 0 28)     ; clear tag bits
        </programlisting>
        <para>On x86-64, the idea's similar but the implementation is
        different.  The high and low pointers to the current thread's
        reserved segment are kept in the TCR, which is addressed by
        the gs segment register.  An x86-64 CONS cell is 16 bytes wide
        and has a tag of 3; we canonically use the temp0 register to
        initialize the object:</para>
        <programlisting>
  (subq ($ 13) ((% gs) 216))     ; decrement allocptr by (- 16 3)
  (movq ((% gs) 216) (% temp0))  ; load allocptr into temp0
  (cmpq ((% gs) 224) (% temp0))  ; compare to allocbase
  (jg L1)                        ; skip trap
  (uuo-alloc)                    ; uh, don't skip trap
L1
  (andb ($ 240) ((% gs) 216))    ; untag allocptr in the tcr
  (movq (% arg_y) (5 (% temp0))) ; set the car
  (movq (% arg_z) (-3 (% temp0))); set the cdr
  (movq (% temp0) (% arg_z))     ; return the cons
        </programlisting>
        <para>If we don't take the trap (if allocating 8-16 bytes
        doesn't exhaust the thread's reserved memory segment), that's
        a fairly short and simple instruction sequence.  If we do take
        the trap, we'll have to do some additional work in order to
        get a new segment for the current thread.</para>
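        <para>The four allocation steps listed earlier in this
        section can also be modelled at a higher level.  The Python
        sketch below is a toy: "Segment" and "AllocationTrap" are
        hypothetical names, the addresses are made up, and, unlike
        the real instruction sequences, the model leaves the segment
        untouched when it traps.</para>
        <programlisting><![CDATA[
```python
class AllocationTrap(Exception):
    """Stands in for the trap taken when the segment cannot satisfy a request."""


class Segment:
    """Toy model of a thread's reserved memory segment."""

    def __init__(self, low, high):
        self.low = low    # lowest allocatable address
        self.high = high  # highest unallocated address

    def allocate(self, size, tag, tag_mask):
        """Return a tagged pointer to `size` fresh bytes, or raise AllocationTrap."""
        candidate = self.high - (size - tag)  # step 1: decrement by (- X Y)
        if candidate < self.low:              # step 2: trap if below the low pointer
            # Simplification: the model leaves the segment unchanged on a trap.
            raise AllocationTrap(size)
        tagged = candidate                    # step 3: tagged pointer used for initialization
        self.high = candidate & ~tag_mask     # step 4: clear the low (tag) bits
        return tagged


# PPC32-style CONS cells: 8 bytes, tag 1, 3 tag bits (mask 7).
segment = Segment(low=0x1000, high=0x1010)
first = segment.allocate(8, 1, 7)
second = segment.allocate(8, 1, 7)
try:
    segment.allocate(8, 1, 7)
    trapped = False
except AllocationTrap:
    trapped = True
```
]]></programlisting>
        <para>A 16-byte segment holds exactly two 8-byte CONS cells,
        so the third request is the one that traps; at that point the
        high and low pointers are equal, which is the "empty" state
        described above.</para>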
      </sect2>

      <sect2 id="Allocation-of-reserved-heap-segments">
        <title>Allocation of reserved heap segments</title>
        <para>After the lisp image is first mapped into memory - and after
        each full GC - the lisp kernel ensures that
        (LISP-HEAP-GC-THRESHOLD) additional bytes beyond the current
        end of the heap are mapped read-write.</para>
        <para>If a thread traps while trying to allocate memory, the
        thread goes through the usual exception-handling protocol (to
        ensure that any other thread that GCs "sees" the state of the
        trapping thread and to serialize exception handling.)  When
        the exception handler runs, it determines the nature and size
        of the failed allocation and tries to complete the allocation
        on the thread's behalf (and leave it with a reasonably large
        thread-specific memory segment so that the next small
        allocation is unlikely to trap.)</para>
        <para>Depending on the size of the requested segment
        allocation, the number of segment allocations that have
        occurred since the last GC, and the EGC and GC thresholds, the
        segment allocation trap handler may invoke a full or ephemeral
        GC before returning a new segment.  It's worth noting that the
        [E]GC is triggered based on the number and size of the
        segments that have been allocated since the last GC; it doesn't
        have much to do with how "full" each of those per-thread
        segments is.  It's possible for a large number of threads to
        do fairly incidental memory allocation and trigger the GC as a
        result; avoiding this involves tuning the per-thread
        allocation quantum and the GC/EGC thresholds
        appropriately.</para>
      </sect2>

      <sect2 id="Heap-growth">
        <title>Heap growth</title>
        <para>All OSes on which &CCL; currently runs use an
        "overcommit" memory allocation strategy by default (though
        some of them provide ways of overriding that default.)  What
        this means in general is that the OS doesn't necessarily
        ensure that backing store is available when asked to map pages
        as read-write; it'll often return a success indicator from the
        mapping attempt (mapping the pages as "zero-fill,
        copy-on-write"), and only try to allocate the backing store
        (swap space and/or physical memory) when non-zero contents are
        written to the pages.</para>
        <para>It -sounds- like it'd be better to have the mmap() call
        fail immediately, but it's actually a complicated issue.
        (It's possible that other applications will stop using some
        backing store before lisp code actually touches the pages that
        need it, for instance.)  It's also not guaranteed that lisp
        code would be able to "cleanly" signal an out-of-memory
        condition if lisp is ... out of memory.</para>
        <para>I don't know that I've ever seen an abrupt out-of-memory
        failure that wasn't preceded by several minutes of excessive
        paging activity.  The most expedient course in cases like this
        is to either (a) use less memory or (b) get more memory; it's
        generally hard to use memory that you don't have.</para>
      </sect2>
    </sect1>

    <sect1 id="GC-details">
      <title>GC details</title>
      <para>The GC uses a Mark/Compact algorithm; its
      execution time is essentially proportional to the amount of live
      data in the heap. (The somewhat better-known Mark/Sweep
      algorithms don't compact the live data but instead traverse the
      garbage to rebuild free-lists; their execution time is therefore
      proportional to the total heap size.)</para>
1033      <para>As mentioned in <xref linkend="Heap-Allocation"/>, two
1034      auxiliary data structures (proportional to the size of the lisp
1035      heap) are maintained. These are</para>
1036      <orderedlist>
1037        <listitem>
1038          <para>the markbits bitvector, which contains a bit for
1039          every doublenode in the dynamic heap (plus a few extra words
1040          for alignment and so that sub-bitvectors can start on word
1041          boundaries.)</para>
1042        </listitem>
1043        <listitem>
1044          <para>the relocation table, which contains a native word for
1045          every 32 or 64 doublenodes in the dynamic heap, plus an
1046          extra word used to keep track of the end of the heap.</para>
1047        </listitem>
1048      </orderedlist>
1049      <para>The total GC space overhead is therefore on the order of
1050      3% (2/64 or 1/32).</para>
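      <para>As a rough check on that figure, the following Python
      sketch (not part of &CCL;) adds up the two tables as fractions
      of the dynamic heap size. It assumes that a doublenode is two
      native words and that each relocation table entry is one native
      word; the function name gc_overhead is made up for
      illustration.</para>
      <programlisting><![CDATA[
```python
from fractions import Fraction


def gc_overhead(word_bits):
    """Fraction of the dynamic heap taken by markbits plus relocation table.

    Assumption: a doublenode is two native words, and the relocation
    table has one native word per `word_bits` doublenodes (32 or 64).
    """
    doublenode_bits = 2 * word_bits
    # One mark bit for every doublenode.
    markbits = Fraction(1, doublenode_bits)
    # One native word for every `word_bits` doublenodes.
    relocation = Fraction(word_bits, word_bits * doublenode_bits)
    return markbits + relocation


overhead_32 = gc_overhead(32)
overhead_64 = gc_overhead(64)
```
]]></programlisting>
      <para>Under these assumptions the 32-bit case works out to 1/32
      of the heap (about 3%); the same arithmetic with 64-bit words
      gives 1/64.</para>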
1051      <para>The general algorithm proceeds as follows:</para>
1052
1053      <sect2 id="Mark-phase">
1054        <title>Mark phase</title>
1055        <para>Each doublenode in the dynamic heap has a corresponding
1056        bit in the markbits vector. (For any doublenode in the heap,
1057        the index of its mark bit is determined by subtracting the
1058        address of the start of the heap from the address of the
1059        object and dividing the result by 8 or 16.) The GC knows the
1060        markbit index of the free pointer, so determining that the
1061        markbit index of a doubleword address is between the start of
1062        the heap and the free pointer can be done with a single
1063        unsigned comparison.</para>
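        <para>The index computation and the single-comparison range
        check might look roughly like this. This is an illustrative
        Python sketch, not the kernel's C code: the function names are
        hypothetical, and the masking step only emulates the unsigned
        machine arithmetic that makes the trick work.</para>
        <programlisting><![CDATA[
```python
def markbit_index(addr, heap_start, word_bits=32):
    """Mark bit index of the doublenode at `addr`, with unsigned wraparound."""
    mask = (1 << word_bits) - 1        # emulate word_bits-wide unsigned arithmetic
    doublenode_size = word_bits // 4   # 8 bytes on 32-bit, 16 bytes on 64-bit
    return ((addr - heap_start) & mask) // doublenode_size


def in_dynamic_heap(addr, heap_start, free_ptr, word_bits=32):
    """True if heap_start <= addr < free_ptr, decided by one comparison.

    An address below heap_start wraps around to a very large index, so it
    fails the same test that rejects addresses at or above the free pointer.
    """
    free_index = markbit_index(free_ptr, heap_start, word_bits)
    return markbit_index(addr, heap_start, word_bits) < free_index
```
]]></programlisting>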
1064        <para>The markbits of all doublenodes in the dynamic heap are
1065        zeroed before the mark phase begins. An object is
1066        <emphasis>marked</emphasis> if the markbits of all of its
        constituent doublenodes are set and unmarked otherwise;
1068        setting an object's markbits involves setting the corresponding
1069        markbits of all constituent doublenodes in the object.</para>
1070        <para>The mark phase traverses each root. If the tag of the
1071        value of the root indicates that it's a non-immediate node
1072        whose address lies in the lisp heap, then:</para>
1073        <orderedlist>
1074          <listitem>
1075            <para>If the object is already marked, do nothing.</para>
1076          </listitem>
1077          <listitem>
1078            <para>Set the object's markbit(s).</para>
1079          </listitem>
1080          <listitem>
1081            <para>If the object is an ivector, do nothing further.</para>
1082          </listitem>
1083          <listitem>
1084            <para>If the object is a cons cell, recursively mark its
1085            car and cdr.</para>
1086          </listitem>
1087          <listitem>
1088            <para>Otherwise, the object is a gvector. Recursively mark
1089            its elements.</para>
1090          </listitem>
1091        </orderedlist>
1092        <para>Marking an object thus involves ensuring that its mark
1093        bits are set and then recursively marking any pointers
1094        contained within the object if the object was originally
1095        unmarked. If this recursive step was implemented in the
1096        obvious manner, marking an object would take stack space
1097        proportional to the length of the pointer chain from some root
1098        to that object. Rather than storing that pointer chain
1099        implicitly on the stack (in a series of recursive calls to the
        mark subroutine), the &CCL; marker uses a mixture of recursion
1101        and a technique called <emphasis>link inversion</emphasis> to
1102        store the pointer chain in the objects themselves.  (Recursion
1103        tends to be simpler and faster; if a recursive step notes that
1104        stack space is becoming limited, the link-inversion technique
1105        is used.)</para>
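        <para>To make the idea of link inversion concrete, here is a
        toy Python sketch restricted to cons cells (the general
        technique is usually credited to Deutsch, Schorr and Waite).
        It is not the &CCL; marker: the Cons class and its marked and
        in_cdr fields are stand-ins invented for this example, where a
        real collector would use the markbits vector and spare tag
        bits. While a cell's subtree is being traversed, one of its
        own fields temporarily holds the pointer back to its parent,
        so no stack is needed.</para>
        <programlisting><![CDATA[
```python
class Cons:
    """Toy cons cell; `marked` and `in_cdr` stand in for GC bookkeeping bits."""
    __slots__ = ("car", "cdr", "marked", "in_cdr")

    def __init__(self, car=None, cdr=None):
        self.car, self.cdr = car, cdr
        self.marked = False
        self.in_cdr = False   # True while the cdr field holds the parent link


def mark_link_inversion(root):
    """Mark everything reachable from `root` without recursion."""
    prev, cur = None, root
    while True:
        # Descend through car fields, reversing links as we go.
        while isinstance(cur, Cons) and not cur.marked:
            cur.marked = True
            nxt = cur.car
            cur.car = prev            # car now points back to the parent
            prev, cur = cur, nxt
        # Retreat past cells whose cdr side is finished, restoring cdr.
        while prev is not None and prev.in_cdr:
            prev.in_cdr = False
            parent = prev.cdr
            prev.cdr = cur
            cur, prev = prev, parent
        if prev is None:
            return
        # The car side of `prev` is done: restore car, switch to the cdr side.
        prev.in_cdr = True
        parent = prev.car
        prev.car = cur
        cur = prev.cdr
        prev.cdr = parent             # cdr now points back to the parent
```
]]></programlisting>
        <para>When the function returns, every field it borrowed has
        been restored to its original value. A chain of many thousands
        of cells is marked in constant stack space, which is the
        situation in which a purely recursive marker would run out of
        stack.</para>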
1106        <para>Certain types of objects are treated a little specially:</para>
1107        <orderedlist>
1108        <listitem>
          <para>To support a feature called
              <emphasis>GCTWA</emphasis><footnote>
                <para>I believe that the acronym comes from MACLISP,
                where it stood for "Garbage Collection of Truly
                Worthless Atoms".</para>
              </footnote>, the vector which contains the
1116              internal symbols of the current package is marked on
1117              entry to the mark phase, but the symbols themselves are
1118              not marked at this time. Near the end of the mark phase,
1119              symbols referenced from this vector which are
1120              not otherwise marked are marked if and only if they're
1121              somehow distinguishable from newly created symbols (by
1122              virtue of their having function bindings, value bindings,
1123              plists, or other attributes.)</para>
1124        </listitem>
1125        <listitem>
1126          <para>Pools have their first element set to NIL before any
1127          other elements are marked.</para>
1128        </listitem>
1129        <listitem>
1130          <para>All hash tables have certain fields (used to cache
1131          previous results) invalidated.</para>
1132        </listitem>
1133        <listitem>
1134          <para>Weak Hash Tables and other weak objects are put on a
          linked list as they're encountered; their contents are only
1136          retained if there are other (non-weak) references to
1137          them.</para>
1138        </listitem>
1139        </orderedlist>
1140        <para>At the end of the mark phase, the markbits of all
1141        objects which are transitively reachable from the roots are
1142        set and all other markbits are clear.</para>
1143      </sect2>
1144
1145      <sect2 id="Relocation-phase">
1146        <title>Relocation phase</title>
        <para>The <emphasis>forwarding address</emphasis> of a
        doublenode in the dynamic heap is (&lt;its current address&gt; -
        (size_of_doublenode * &lt;the number of unmarked markbits that
        precede it&gt;)) or alternately (&lt;the base of the heap&gt; +
        (size_of_doublenode * &lt;the number of marked markbits that
        precede it&gt;)). Rather than count the number of preceding
1153        markbits each time, the relocation table is used to precompute
1154        an approximation of the forwarding addresses for all
1155        doublewords. Given this approximate address and a pointer into
1156        the markbits vector, it's relatively easy to compute the exact
1157        forwarding address.</para>
1158        <para>The relocation table contains the forwarding addresses
1159        of each <emphasis>pagelet</emphasis>, where a pagelet is 256
1160        bytes (or 32 doublenodes). The forwarding address of the first
1161        pagelet is the base of the heap. The forwarding address of the
1162        second pagelet is the sum of the forwarding address of the
1163        first and 8 bytes for each mark bit set in the first 32-bit
1164        word in the markbits table. The last entry in the relocation
1165        table contains the forwarding address that the freepointer
        would have, i.e., the new value of the freepointer after
1167        compaction.</para>
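        <para>Building the table amounts to a running sum over the
        markbits, one pagelet at a time. The following Python sketch
        assumes the 32-bit layout described above (8-byte doublenodes,
        32 per pagelet) and represents the markbits as a plain list of
        0s and 1s; the function name is hypothetical.</para>
        <programlisting><![CDATA[
```python
def build_relocation_table(markbits, heap_base, doublenode_size=8, per_pagelet=32):
    """Return one forwarding address per pagelet, plus a final entry.

    `markbits` is a list of 0/1 values, one per doublenode. The final
    entry is the address the free pointer will have after compaction.
    """
    table = [heap_base]   # the first pagelet forwards to the base of the heap
    for start in range(0, len(markbits), per_pagelet):
        live = sum(markbits[start:start + per_pagelet])
        table.append(table[-1] + live * doublenode_size)
    return table
```
]]></programlisting>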
1168        <para>In many programs, old objects rarely become garbage and
1169        new objects often do. When building the relocation table, the
1170        relocation phase notes the address of the first unmarked
1171        object in the dynamic heap. Only the area of the heap between
1172        the first unmarked object and the freepointer needs to be
1173        compacted; only pointers to this area will need to be
1174        forwarded (the forwarding address of all other pointers to the
1175        dynamic heap is the address of that pointer.)  Often, the
1176        first unmarked object is much nearer the free pointer than it
1177        is to the base of the heap.</para>
1178      </sect2>
1179
1180      <sect2 id="Forwarding-phase">
1181        <title>Forwarding phase</title>
1182        <para>The forwarding phase traverses all roots and the "old"
1183        part of the dynamic heap (the part between the base of the
1184        heap and the first unmarked object.) All references to objects
1185        whose address is between the first unmarked object and the
1186        free pointer are updated to point to the address the object
1187        will have after compaction by using the relocation table and
1188        the markbits vector and interpolating.</para>
1189        <para>The relocation table entry for the pagelet nearest the
1190        object is found. If the pagelet's address is less than the
1191        object's address, the number of set markbits that precede the
1192        object on the pagelet is used to determine the object's
        address; otherwise, the number of set markbits that follow the
1194        object on the pagelet is used.</para>
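        <para>The interpolation step can be sketched in Python as
        follows. The sketch assumes a relocation table with one
        forwarding address per pagelet plus a final entry for the free
        pointer, as described in the relocation phase, and a markbits
        list of 0s and 1s; the names and the list representation are
        assumptions made for illustration.</para>
        <programlisting><![CDATA[
```python
def forwarding_address(index, markbits, table, doublenode_size=8, per_pagelet=32):
    """Forwarding address of doublenode number `index`.

    Works forward from the start of the doublenode's own pagelet when that
    boundary is nearer, and backward from the start of the next pagelet
    otherwise; both directions give the same answer.
    """
    pagelet, offset = divmod(index, per_pagelet)
    start = pagelet * per_pagelet
    if offset < per_pagelet // 2:
        preceding = sum(markbits[start:index])
        return table[pagelet] + preceding * doublenode_size
    following = sum(markbits[index:start + per_pagelet])
    return table[pagelet + 1] - following * doublenode_size
```
]]></programlisting>
        <para>Counting from the nearer boundary means that at most
        half a pagelet's worth of markbits is examined per
        lookup.</para>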
1195        <para>Since forwarding views the heap as a set of doublewords,
1196        locatives are (mostly) treated like any other pointers. (The
1197        basic difference is that locatives may appear to be tagged as
1198        fixnums, in which case they're treated as word-aligned
1199        pointers into the object.)</para>
        <para>If the forwarding phase changes the address of any hash
1201        table key in a hash table that hashes by address (e.g., an EQ
1202        hash table), it sets a bit in the hash table's header. The
1203        hash table code will rehash the hash table's contents if it
1204        tries to do a lookup on a key in such a table.</para>
1205        <para>Profiling reveals that about half of the total time
1206        spent in the GC is spent in the subroutine which determines a
1207        pointer's forwarding address. Exploiting GCC-specific idioms,
1208        hand-coding the routine, and inlining calls to it could all be
1209        expected to improve GC performance.</para>
1210      </sect2>
1211
1212      <sect2 id="Compact-phase">
1213        <title>Compact phase</title>
1214        <para>The compact phase compacts the area between the first
1215        unmarked object and the freepointer so that it contains only
1216        marked objects.  While doing so, it forwards any pointers it
1217        finds in the objects it copies.</para>
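        <para>A much simplified Python model of sliding compaction
        follows. It treats the heap as a list of slots, folds pointer
        forwarding into the copying loop, and assumes that marked
        slots only point at marked slots (which holds after a correct
        mark phase). The Ptr class and the compact function are
        invented for this sketch and do not correspond to names in the
        lisp kernel.</para>
        <programlisting><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ptr:
    """A toy pointer: the index of another heap slot."""
    index: int


def compact(heap, markbits):
    """Slide marked slots toward the base of the heap, forwarding pointers.

    Returns (new_heap, new_free_index).
    """
    # forward[i] is the number of marked slots before slot i: its new index.
    forward, live = [], 0
    for bit in markbits:
        forward.append(live)
        live += bit

    new_heap = []
    for value, bit in zip(heap, markbits):
        if bit:
            if isinstance(value, Ptr):
                value = Ptr(forward[value.index])   # forward while copying
            new_heap.append(value)
    return new_heap, live
```
]]></programlisting>
        <para>Slots that precede the first unmarked slot keep their
        indices, which is why a real collector can skip that part of
        the heap when copying.</para>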
1218        <para>When the compact phase is finished, so is the GC (more
1219        or less): the free pointer and some other data structures are
1220        updated and control returns to the exception handler that
1221        invoked the GC. If sufficient memory has been freed to satisfy
1222        any allocation request that may have triggered the GC, the
1223        exception handler returns; otherwise, a "seriously low on
1224        memory" condition is signaled, possibly after releasing a
1225        small emergency pool of memory.</para>
1226      </sect2>
1227    </sect1>
1228
1229    <sect1 id="The-ephemeral-GC">
1230      <title>The ephemeral GC</title>
1231      <para>In the &CCL; memory management scheme, the relative age
1232      of two objects in the dynamic heap can be determined by their
1233      addresses: if addresses X and Y are both addresses in the
1234      dynamic heap, X is younger than Y (X was created more recently
1235      than Y) if it is nearer to the free pointer (and farther from
1236      the base of the heap) than Y.</para>
1237      <para>Ephemeral (or generational) garbage collectors attempt to
1238      exploit the following assumptions:</para>
1239      <itemizedlist>
1240        <listitem>
1241          <para>most newly created objects become garbage soon after
          they're created.</para>
1243        </listitem>
1244        <listitem>
1245          <para>most objects that have already survived several GCs
1246          are unlikely to ever become garbage.</para>
1247        </listitem>
1248        <listitem>
1249          <para>old objects can only point to newer objects as the
1250          result of a destructive modification (e.g., via
1251          SETF.)</para>
1252        </listitem>
1253      </itemizedlist>
1254
1255      <para>By concentrating its efforts on (frequently and quickly)
1256      reclaiming newly created garbage, an ephemeral collector hopes
1257      to postpone the more costly full GC as long as possible. It's
1258      important to note that most programs create some long-lived
1259      garbage, so an EGC can't typically eliminate the need for full
1260      GC.</para>
1261      <para>An EGC views each object in the heap as belonging to
1262      exactly one <emphasis>generation</emphasis>; generations are
1263      sets of objects that are related to each other by age: some
1264      generation is the youngest, some the oldest, and there's an age
1265      relationship between any intervening generations. Objects are
1266      typically assigned to the youngest generation when first
1267      allocated; any object that has survived some number of GCs in
1268      its current generation is promoted (or
1269      <emphasis>tenured</emphasis>) into an older generation.</para>
1270      <para>When a generation is GCed, the roots consist of the
1271      stacks, registers, and global variables as always and also of
1272      any pointers to objects in that generation from other
1273      generations. To avoid the need to scan those (often large) other
1274      generations looking for such intergenerational references, the
1275      runtime system must note all such intergenerational references
1276      at the point where they're created (via Setf).<footnote><para>This is
1277      sometimes called "The Write Barrier": all assignments which
1278      might result in intergenerational references must be noted, as
1279      if the other generations were write-protected.</para></footnote> The
1280      set of pointers that may contain intergenerational references is
1281      sometimes called <emphasis>the remembered set</emphasis>.</para>
1282      <para>In &CCL;'s EGC, the heap is organized exactly the same
1283      as otherwise; "generations" are merely structures which contain
1284      pointers to regions of the heap (which is already ordered by
1285      age.) When a generation needs to be GCed, any younger generation
1286      is incorporated into it; all objects which survive a GC of a
1287      given generation are promoted into the next older
1288      generation. The only intergenerational references that can exist
1289      are therefore those where an old object is modified to contain a
1290      pointer to a new object.</para>
1291      <para>The EGC uses exactly the same code as the full GC. When a
1292      given GC is "ephemeral",</para>
1293      <itemizedlist>
1294        <listitem>
1295          <para>the "base of the heap" used to determine an object's
1296          markbit address is the base of the generation
1297          being collected;</para>
1298        </listitem>
1299        <listitem>
1300          <para>the markbits vector is actually a pointer into the
1301          middle of the global markbits table; preceding entries in
1302          this table are used to note doubleword addresses in older
1303          generations that (may) contain intergenerational
1304          references;</para>
1305        </listitem>
1306        <listitem>
1307          <para>some steps (notably GCTWA and the handling of weak
1308          objects) are not performed;</para>
1309        </listitem>
1310        <listitem>
1311          <para>the intergenerational references table is used to
1312          find additional roots for the mark and forward phases. If a
          bit is set in the intergenerational references table, that
1314          means that the corresponding doubleword (in some "old"
1315          generation, in some "earlier" part of the heap) may have had
1316          a pointer to an object in a younger generation stored into
1317          it.</para>
1318        </listitem>
1319     
1320      </itemizedlist>
1321      <para>With one exception (the implicit setfs that occur on entry
1322      to and exit from the binding of a special variable), all setfs
1323      that might introduce an intergenerational reference must be
      memoized.<footnote><para>Note that the implicit setfs that occur when
      initializing an object - as in the case of a call to cons or
      vector - can't introduce intergenerational references, since the
      newly created object is always younger than the objects used to
      initialize it.</para></footnote> It's always safe to push any cons cell or
1329      gvector locative onto the memo stack; it's never safe to push
1330      anything else.
1331      </para>
1332
1333      <para>Typically, the intergenerational references bitvector is
1334      sparse: a relatively small number of old locations are stored
1335      into, although some of them may have been stored into many
1336      times. The routine that scans the memoization buffer does a lot
1337      of work and usually does it fairly often; it uses a simple,
1338      brute-force method but might run faster if it was smarter about
1339      recognizing addresses that it'd already seen.
1340      </para>
1341
1342      <para>When the EGC mark and forward phases scan the
1343      intergenerational reference bits, they can clear any bits that
1344      denote doublewords that definitely do not contain
1345      intergenerational references.
1346      </para>
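      <para>The interplay between memoized stores and the
      intergenerational reference bits can be modelled with a toy
      heap. In this Python sketch, slots are ordered by age (a lower
      index is older), there is one reference bit per slot, and the
      barrier records a store only when the stored pointer is younger
      than the slot. The ToyHeap and Ref names, and the decision to
      filter at store time, are assumptions made for the example
      rather than a description of the &CCL; write barrier.</para>
      <programlisting><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """A toy pointer: the index of another slot."""
    index: int


class ToyHeap:
    """Slots ordered by age: index 0 is the oldest."""

    def __init__(self, size, young_start):
        self.slots = [0] * size
        self.young_start = young_start   # first slot of the generation being collected
        self.refbits = [0] * size        # one intergenerational reference bit per slot

    def store(self, index, value):
        """Store with a write barrier that notes old-to-young pointers."""
        self.slots[index] = value
        if isinstance(value, Ref) and value.index > index:
            self.refbits[index] = 1

    def egc_roots(self):
        """Scan the reference bits of the older slots for extra roots.

        Bits whose slot no longer holds a pointer to a younger slot are
        cleared, since they cannot denote an intergenerational reference.
        """
        roots = []
        for i in range(self.young_start):
            if not self.refbits[i]:
                continue
            value = self.slots[i]
            if isinstance(value, Ref) and value.index > i:
                if value.index >= self.young_start:
                    roots.append(value.index)
            else:
                self.refbits[i] = 0      # stale bit: slot was overwritten
        return roots
```
]]></programlisting>
      <para>A slot that is stored into many times still occupies a
      single bit, which is one reason the bitvector tends to stay
      sparse.</para>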
1347    </sect1>
1348
1349    <sect1 id="Fasl-files">
1350      <title>Fasl files</title>
1351      <para>Saving and loading of Fasl files is implemented in
1352      xdump/faslenv.lisp, level-0/nfasload.lisp, and lib/nfcomp.lisp.
1353      The information here is only an overview, which might help when
1354      reading the source.</para>
1355      <para>The &CCL; Fasl format is forked from the old MCL Fasl
1356      format; there are a few differences, but they are minor.  The
1357      name "nfasload" comes from the fact that this is the so-called
1358      "new" Fasl system, which was true in 1986 or so.  </para>
1359      <para>A Fasl file begins with a "file header", which contains
1360      version information and a count of the following "blocks".
1361      There's typically only one "block" per Fasl file.  The blocks
1362      are part of a mechanism for combining multiple logical files
1363      into a single physical file, in order to simplify the
1364      distribution of precompiled programs. </para>
1365      <para>Each block begins with a header for itself, which just
1366      describes the size of the data that follows.</para>
1367      <para>The data in each block is treated as a simple stream of
1368      bytes, which define a bytecode program.  The actual bytecodes,
1369      "fasl operators", are defined in xdump/faslenv.lisp.  The
1370      descriptions in the source file are terse, but, according to
1371      Gary, "probably accurate".</para>
1372      <para>Some of the operators are used to create a per-block
1373      "object table", which is a vector used to keep track of
1374      previously-loaded objects and simplify references to them.  When
1375      the table is created, an index associated with it is set to
1376      zero; this is analogous to an array fill-pointer, and allows the
1377      table to be treated like a stack.</para>
1378      <para>The low seven bits of each bytecode are used to specify
1379      the fasl operator; currently, about fifty operators are defined.
      The high bit, when set, indicates that the result of the
1381      operation should be pushed onto the object table.</para>
1382      <para>Most bytecodes are followed by operands; the operand data
1383      is byte-aligned.  How many operands there are, and their type,
1384      depend on the bytecode.  Operands can be indices into the object
1385      table, immediate values, or some combination of these.</para>
1386      <para>An exception is the bytecode #xFF, which has the symbolic
1387      name ccl::$faslend; it is used to mark the end of the
1388      block.</para>
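      <para>Splitting a bytecode into its operator number and its
      "push" flag might be sketched as below. This is illustrative
      Python, not a transcription of nfasload.lisp: the function name
      and the returned tuple are made up, and the only values taken
      from the text are the seven-bit operator field, the high bit,
      and #xFF as the end marker.</para>
      <programlisting><![CDATA[
```python
FASL_END = 0xFF    # ccl::$faslend, per the text above
OP_MASK = 0x7F     # low seven bits select the fasl operator
PUSH_BIT = 0x80    # high bit: push the result onto the object table


def decode_fasl_byte(byte):
    """Return (kind, operator, push) for one bytecode.

    The end marker is checked first, because #xFF would otherwise
    decode as operator #x7F with the push bit set.
    """
    if byte == FASL_END:
        return ("end", None, False)
    return ("op", byte & OP_MASK, bool(byte & PUSH_BIT))
```
]]></programlisting>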
1389    </sect1>
1390
1391
1392
1393    <sect1 id="The-Objective-C-Bridge--1-">
1394      <title>The Objective-C Bridge</title>
1395
1396      <sect2 id="How-CCL-Recognizes-Objective-C-Objects">
1397        <title>How &CCL; Recognizes Objective-C Objects</title>
1398        <para>In most cases, pointers to instances of Objective-C
1399        classes are recognized as such; the recognition is (and
1400        probably always will be) slightly heuristic. Basically, any
1401        pointer that passes basic sanity checks and whose first word
1402        is a pointer to a known ObjC class is considered to be an
1403        instance of that class; the Objective-C runtime system would
1404        reach the same conclusion.</para>
        <para>It's certainly possible that a random pointer to an
        arbitrary memory address could look enough like an ObjC
        instance to fool the lisp runtime system, and it's also
        possible for the memory that a pointer refers to to change,
        so that something that had been a true ObjC instance (or had
        looked a lot like one) no longer is one (possibly by virtue
        of having been deallocated.)</para>
1412        <para>In the first case, we can improve the heuristics
1413        substantially: we can make stronger assertions that a
1414        particular pointer is really "of type :ID" when it's a
1415        parameter to a function declared to take such a pointer as an
1416        argument or a similarly declared function result; we can be
1417        more confident of something we obtained via SLOT-VALUE of a
1418        slot defined to be of type :ID than if we just dug a pointer
1419        out of memory somewhere.</para>
1420        <para>The second case is a little more subtle: ObjC memory
1421        management is based on a reference-counting scheme, and it's
1422        possible for an object to ... cease to be an object while lisp
1423        is still referencing it.  If we don't want to deal with this
1424        possibility (and we don't), we'll basically have to ensure
1425        that the object is not deallocated while lisp is still
1426        thinking of it as a first-class object. There's some support
1427        for this in the case of objects created with MAKE-INSTANCE,
1428        but we may need to give similar treatment to foreign objects
1429        that are introduced to the lisp runtime in other ways (as
1430        function arguments, return values, SLOT-VALUE results, etc. as
1431        well as those instances that are created under lisp
1432        control.)</para>
1433        <para>This doesn't all work yet (in fact, not much of it works
1434        yet); in practice, this has not yet been as much of a problem
1435        as anticipated, but that may be because existing Cocoa code
1436        deals primarily with relatively long-lived objects such as
1437        windows, views, menus, etc.</para>
1438      </sect2>
1439
1440      <sect2>
1441        <title>Recommended Reading</title>
1442
1443        <variablelist>
1444          <varlistentry>
1445            <term>
1446              <ulink url="http://developer.apple.com/documentation/Cocoa/">Cocoa Documentation</ulink>
1447            </term>
1448           
1449           <listitem>
1450             <para>
1451               This is the top page for all of Apple's documentation on
1452               Cocoa.  If you are unfamiliar with Cocoa, it is a good
1453               place to start.
1454             </para>
1455           </listitem>
1456        </varlistentry>
1457        <varlistentry>
1458          <term>
1459            <ulink url="http://developer.apple.com/documentation/Cocoa/Reference/Foundation/ObjC_classic/index.html">Foundation Reference for Objective-C</ulink>
1460          </term>
1461
1462          <listitem>
1463            <para>
1464              This is one of the two most important Cocoa references; it
1465              covers all of the basics, except for GUI programming.  This is
1466              a reference, not a tutorial.
1467            </para>
1468          </listitem>
1469        </varlistentry>
1470      </variablelist>
1471      </sect2>
1472    </sect1>
1473  </chapter>