source: trunk/source/doc/src/implementation.xml @ 8820

Last change on this file since 8820 was 8820, checked in by jaj, 12 years ago

This commit includes support for docbook 4.5, stylesheet changes, and updated documentation.

In order to support docbook 4.5 in nXML mode, I have added a new directory called docbook-rng-4.5 and changed schemas.xml to point to it. This should just work when editing the documentation in EMACS.

The two most obvious changes to the stylesheets are that the table of contents for each chapter now occurs at the beginning of the chapter, and the format for refentries is cleaner and more concise.

I think that we should consistently use refentry elements for all of the definitions of functions, macros, variables, etc. This retains the structured data for the definitions that can be reformatted to have different appearances by the stylesheets. We should also consistently use other docbook elements such as function and varname. I'm not really happy with their appearance right now, but that can be easily tweaked in the stylesheets as long as they are consistently used throughout the documentation.

1<?xml version="1.0" encoding="utf-8"?>
2<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"[
3<!ENTITY rest "<varname>&amp;rest</varname>">
4<!ENTITY key "<varname>&amp;key</varname>">
5<!ENTITY optional "<varname>&amp;optional</varname>">
6<!ENTITY body "<varname>&amp;body</varname>">
7<!ENTITY aux "<varname>&amp;aux</varname>">
8<!ENTITY allow-other-keys "<varname>&amp;allow-other-keys</varname>">
9<!ENTITY CCL "Clozure CL">
10]>
11  <chapter id="Implementation-Details-of-CCL">
12    <title>Implementation Details of &CCL;</title>
13    <para>This chapter describes many aspects of &CCL;'s
14    implementation as of (roughly) version 1.1.  Details vary a bit
15    between the three architectures (PPC32, PPC64, and X86-64)
16    currently supported and those details change over time, so the
17    definitive reference is the source code (especially some files in
18    the ccl/compiler/ directory whose names contain the string "arch"
19    and some files in the ccl/lisp-kernel/ directory whose names
20    contain the string "constants".)  Hopefully, this chapter will
21    make it easier for someone who's interested to read and understand
22    the contents of those files.</para>
23
24    <sect1 id="Threads-and-exceptions">
25      <title>Threads and exceptions</title>
26      <para>&CCL;'s threads are "native" (meaning that they're
27      scheduled and controlled by the operating system.)  Most of the
28      implications of this are discussed elsewhere; this section tries
29      to describe how threads look from the lisp kernel's perspective
30      (and especially from the GC's point of view.)</para>
31      <para>&CCL;'s runtime system tries to use machine-level
32      exception mechanisms (conditional traps when available, illegal
33      instructions, memory access protection in some cases) to detect
34      and handle ...  exceptional situations.  These situations
35      include some TYPE-ERRORs and PROGRAM-ERRORS (notably
36      wrong-number-of-args errors), and also include cases like "not
37      being able to allocate memory without GCing or obtaining more
38      memory from the OS."  The general idea is that it's usually
39      faster to pay (very occasional) exception-processing overhead
40      and figure out what's going on in an exception handler than it
41      is to maintain enough state and context to handle an exceptional
42      case via a lighter-weight mechanism when that exceptional case
43      (by definition) rarely occurs.</para>
44      <para>Some emulated execution environments (the Rosetta PPC
45      emulator on x86 versions of Mac OS X) don't provide accurate
46      exception information to exception handling functions. &CCL;
47      can't run in such environments.</para>
48
49      <sect2 id="The-Thread-Context-Record">
50        <title>The Thread Context Record</title>
51
52        <para>When a lisp thread is first created (or when a thread
53        created by foreign code first calls back to lisp), a data
54        structure called a Thread Context Record (or TCR) is allocated
55        and initialized.  On modern versions of Linux and FreeBSD, the
56        allocation actually happens via a set of thread-local-storage
57        ABI extensions, so a thread's TCR is created when the thread
58        is created and dies when the thread dies.  (The World's Most
59        Advanced Operating System - as Apple's marketing literature
60        refers to Darwin - is not very advanced in this regard, and I
61        know of no reason to assume that advances will be made in this
62        area anytime soon.)</para>
63        <para>A TCR contains a few dozen fields (and is therefore a
64        few hundred bytes in size.)  The fields are mostly
65        thread-specific information about the thread's stacks'
66        locations and sizes, information about the underlying (POSIX)
67        thread, and information about the thread's dynamic binding
68        history and pending CATCH/UNWIND-PROTECTs.  Some of this
69        information could be kept in individual machine registers
70        while the thread is running (and the PPC - which has more
71        registers available - keeps a few things in registers that the
72        X86-64 has to access via the TCR), but it's important to
73        remember that the information is thread-specific and can't
74        (for instance) be kept in a fixed global memory
75        location.</para>
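        <para>As a rough, hypothetical illustration (the real layout is
        defined in the lisp kernel sources mentioned above and varies
        between releases and architectures), a TCR can be thought of as a C
        structure along these lines; the field names here are illustrative,
        not necessarily &CCL;'s:</para>
        <programlisting><![CDATA[
#include <stdint.h>

/* Simplified, illustrative sketch of a Thread Context Record. */
typedef struct tcr {
  struct tcr *next, *prev;            /* links in the global list of TCRs */
  uintptr_t   valence;                /* running lisp code, foreign code, or an exception handler? */
  void       *vs_area, *ts_area, *cs_area;    /* value, temp, and control stack areas */
  void       *save_vsp, *save_tsp, *save_rsp; /* stack pointers saved around foreign calls */
  void       *db_link;                /* dynamic-binding history */
  void       *catch_top;              /* innermost CATCH/UNWIND-PROTECT frame */
  void       *pending_exception_context; /* signal context stored by exception handlers */
  void       *suspend, *resume;       /* per-thread semaphores used when the GC runs */
  uintptr_t   osid;                   /* identifier of the underlying POSIX thread */
} TCR;
]]></programlisting>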
76        <para>When lisp code is running, the current thread's TCR is
77        kept in a register.  On PPC platforms, a general purpose
78        register is used; on x86-64, an (otherwise nearly useless)
79        segment register works well (prevents the expenditure of a
80        more generally useful general-purpose register for this
81        purpose.)</para>
82        <para>The address of a TCR is aligned in memory in such a way
83        that a FIXNUM can be used to represent it.  The lisp function
84        CCL::%CURRENT-TCR returns the calling thread's TCR as a
85        fixnum; the actual value of the TCR's address is 4 or 8 times the
86        value of this fixnum.</para>
87        <para>When the lisp kernel initializes a new TCR, it's added
88        to a global list maintained by the kernel; when a thread
89        exits, its TCR is removed from this list.</para>
90        <para>When a thread calls foreign code, lisp stack pointers
91        are saved in its TCR, lisp registers (at least those whose
92        value should be preserved across the call) are saved on the
93        thread's value stack, and (on x86-64) RSP is switched to the
94        control stack.  A field in the TCR (tcr.valence) is then set
95        to indicate that the thread is running foreign code, foreign
96        argument registers are loaded from a frame on the foreign
97        stack, and the foreign function is called. (That's a little
98        oversimplified and possibly inaccurate, but the important
99        things to note are that the thread "stops following lisp stack
100        and register usage conventions" and that it advertises the
101        fact that it's done so.)  Similar transitions in a thread's
102        state ("valence") occur when it enters or exits an exception
103        handler (which is sort of an OS/hardware-mandated foreign
104        function call where the OS thoughtfully saves the thread's
105        register state for it beforehand.)</para>
106      </sect2>
107
108      <sect2 id="Exception-contexts-comma---and-exception-handling-in-general">
109        <title>Exception contexts, and exception-handling in general</title>
110        <para>Unix-like OSes tend to refer to exceptions as "signals";
111        the same general mechanism ("signal handling") is used to
112        process both asynchronous OS-level events (such as the result
113        of the keyboard driver noticing that ^C or ^Z has been
114        pressed) and synchronous hardware-level events (like trying to
115        execute an illegal instruction or access protected memory.)
116        It makes some sense to defer ("block") handling of
117        asynchronous signals so that some critical code sequences
118        complete without interruption; since it's generally not
119        possible for a thread to proceed after a synchronous exception
120        unless and until its state is modified by an exception
121        handler, it makes no sense to talk about blocking synchronous
122        signals (though some OSes will let you do so and doing so can
123        have mysterious effects.)</para>
124        <para>On OSX/Darwin, the POSIX signal handling facilities
125        coexist with lower-level Mach-based exception handling
126        facilities.  Unfortunately, the way that this is implemented
127        interacts poorly with debugging tools: GDB will generally stop
128        whenever the target program encounters a Mach-level exception
129        and offers no way to proceed from that point (and let the
130        program's POSIX signal handler try to handle the exception);
131        Apple's CrashReporter program has had a similar issue and,
132        depending on how it's configured, may bombard the user with
133        alert dialogs which falsely claim that an application has
134        crashed (when in fact the application in question has
135        routinely handled a routine exception.)  On Darwin/OSX,
136        &CCL; uses Mach thread-level exception handling facilities
137        which run before GDB or CrashReporter get a chance to confuse
138        themselves; &CCL;'s Mach exception handling tries to force
139        the thread which received a synchronous exception to invoke a
140        signal handling function ("as if" signal handling worked more
141        usefully under Darwin.)  Mach exception handlers run in a
142        dedicated thread (which basically does nothing but wait for
143        exception messages from the lisp kernel, obtain and modify
144        information about the state of threads in which exceptions
145        have occurred, and reply to the exception messages with an
146        indication that the exception has been handled.)  The reply
147        from a thread-level exception handler keeps the exception from
148        being reported to GDB or CrashReporter and avoids the problems
149        related to those programs.  Since &CCL;'s Mach exception
150        handler doesn't claim to handle debugging-related exceptions
151        (from breakpoints or single-step operations), it's possible to
152        use GDB to debug &CCL;.</para>
153        <para>On platforms where signal handling and debugging don't get in each
154other's way, a signal handler is entered with all signals blocked.
155(This behavior is specified in the call to the sigaction() function
156which established the signal handler.)  The signal handler receives
157three arguments from the OS kernel; the first is an integer which
158identifies the signal, the second is a pointer to an object of
159type "siginfo_t", which may or may not contain a few fields that
160would help to identify the cause of the exception, and the third
161argument is a pointer to a data structure (called a "ucontext"
162or something similar) which contains machine-dependent information
163about the state of the thread at the time that the exception/signal
164occurred.  While asynchronous signals are blocked, the signal handler
165stores the pointer to its third argument (the "signal context") in
166a field in the current thread's TCR, sets some bits in another TCR
167field to indicate that the thread is now waiting to handle an
168exception, unblocks asynchronous signals, and waits for a global
169exception lock which serializes exception processing.</para>
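        <para>The following is a minimal sketch - not the lisp kernel's
        actual code - of how such a handler might be installed with
        sigaction() so that it's entered with all signals blocked and
        receives the three arguments described above:</para>
        <programlisting><![CDATA[
#include <signal.h>
#include <string.h>

/* Hypothetical synchronous-exception handler. */
static void
exception_handler(int signum, siginfo_t *info, void *context)
{
  /* "context" points at the machine-dependent signal context.  The real
     handler stores it in the current thread's TCR, marks the thread as
     handling an exception, unblocks asynchronous signals, and then waits
     for the global exception lock. */
  (void)signum; (void)info; (void)context;
}

/* Install the handler for one signal, e.g. SIGSEGV or SIGILL. */
static void
install_exception_handler(int signum)
{
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = exception_handler;
  sa.sa_flags = SA_SIGINFO;     /* ask for the siginfo_t and context arguments */
  sigfillset(&sa.sa_mask);      /* enter the handler with all signals blocked */
  sigaction(signum, &sa, NULL);
}
]]></programlisting>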
170        <para>On Darwin, the Mach exception thread creates a signal
171        context (and maybe a siginfo_t structure), stores the signal
172        context in the thread's TCR, sets the TCR field which describes
173        the thread's state, and arranges that the thread resume
174        execution at its signal handling function (with a signal
175        number, a possibly-NULL siginfo_t, and the signal context as
176        arguments.)  When the thread resumes, it waits for the global
177        exception lock.</para>
178        <para>On x86-64 platforms where signal handling can be used to
179        handle synchronous exceptions, there's an additional
180        complication: the OS kernel ordinarily allocates the signal
181        context and siginfo structures on the stack of the thread
182        which received the signal; in practice, that means "wherever
183        RSP is pointing."  &CCL; requires that the thread's value
184        stack - where RSP is usually pointing while lisp code is
185        running - contain only "nodes" (properly tagged lisp objects),
186        and scribbling a signal context all over the value stack would
187        violate this requirement.  To maintain consistency, the
188        sigaltstack() mechanism is used to cause the signal to be
189        delivered on (and the signal context and siginfo to be
190        allocated on) a special stack area (the last few pages of the
191        thread's control stack, in practice.)  When the signal handler
192        runs, it (carefully) copies the signal context and siginfo to
193        the thread's control stack and makes RSP point into that stack
194        before invoking the "real" signal handler.  (The effect of
195        this hack is that the "real" signal handler always runs on the
196        thread's control stack.)</para>
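        <para>A sketch of how a thread might arrange for its signals to be
        delivered on an alternate stack (hypothetically, a few pages near
        the end of its control stack) looks something like this; the
        handler itself must also be installed with SA_ONSTACK for the
        alternate stack to be used:</para>
        <programlisting><![CDATA[
#include <signal.h>
#include <stddef.h>

/* Illustrative only: deliver this thread's signals on an alternate stack
   so that the kernel never writes a signal context over the value stack. */
static void
use_alternate_signal_stack(void *base, size_t size)
{
  stack_t ss;
  ss.ss_sp = base;        /* e.g., the last few pages of the control stack */
  ss.ss_size = size;
  ss.ss_flags = 0;
  sigaltstack(&ss, NULL);
}
]]></programlisting>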
197        <para>Once the exception handler has obtained the global
198        exception lock, it uses the values of the signal number,
199        siginfo_t, and signal context arguments to determine the
200        (logical) cause of the exception.  Some exceptions may be
201        caused by factors that should generate lisp errors or other
202        serious conditions (stack overflow); if this is the case, the
203        kernel code may release the global exception lock and call out
204        to lisp code.  (The lisp code in question may need to repeat
205        some of the exception decoding process; in particular, it
206        needs to be able to interpret register values in the signal
207        context that it receives as an argument.)</para>
208        <para>In some cases, the lisp kernel exception handler may not
209        be able to recover from the exception (this is currently true
210        of some types of memory-access fault and is also true of traps
211        or illegal instructions that occur during foreign code
212        execution.)  In such cases, the kernel exception handler
213        reports the exception as "unhandled", and the kernel debugger
214        is invoked.</para>
215        <para>If the kernel exception handler identifies the
216        exception's cause as being a transient out-of-memory condition
217        (indicating that the current thread needs more memory to cons
218        in), it tries to make that memory available.  In some cases,
219        doing so involves invoking the GC.</para>
220      </sect2>
221
222      <sect2 id="Threads-comma---exceptions-comma---and-the-GC">
223        <title>Threads, exceptions, and the GC</title>
224        <para>&CCL;'s GC is not concurrent: when the GC is invoked
225        in response to an exception in a particular thread, all other
226        lisp threads must stop until the GC's work is done.  The
227        thread that triggered the GC iterates over the global TCR
228        list, sending each other thread a distinguished "suspend"
229        signal, then iterates over the list again, waiting for a
230        per-thread semaphore that indicates that the thread has
231        received the "suspend" signal and responded appropriately.
232        Once all other threads have acknowledged the request to
233        suspend themselves, the GC thread can run the GC proper (after
234        doing any necessary preparation.)  Once the GC's completed its work, the
235        thread that invoked the GC iterates over the global TCR list,
236        raising a per-thread "resume" semaphore for each other
237        thread.</para>
238        <para>The signal handler for the asynchronous "suspend" signal
239        is entered with all asynchronous signals blocked.  It saves
240        its signal-context argument in a TCR slot, raises the tcr's
241        "suspend" semaphore, then waits on the TCR's "resume"
242        semaphore.</para>
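        <para>A heavily simplified sketch of this "stop the world" protocol
        is shown below; the TCR fields, list traversal, and SIG_SUSPEND
        signal here are stand-ins for whatever the lisp kernel actually
        uses:</para>
        <programlisting><![CDATA[
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>

#define SIG_SUSPEND SIGUSR1            /* stand-in for the real "suspend" signal */

/* Hypothetical fragment of a TCR, for illustration only. */
typedef struct tcr {
  struct tcr *next;
  pthread_t   thread;
  sem_t       suspend_sem, resume_sem;
  void       *suspend_context;         /* signal context saved by the suspend handler */
} TCR;

/* Ask every other thread to suspend itself, then wait for each to acknowledge. */
static void
suspend_other_threads(TCR *all_tcrs, TCR *self)
{
  TCR *t;
  for (t = all_tcrs; t != NULL; t = t->next)
    if (t != self)
      pthread_kill(t->thread, SIG_SUSPEND);
  for (t = all_tcrs; t != NULL; t = t->next)
    if (t != self)
      sem_wait(&t->suspend_sem);       /* raised by the "suspend" signal handler */
}

/* After the GC has finished, let every other thread run again. */
static void
resume_other_threads(TCR *all_tcrs, TCR *self)
{
  TCR *t;
  for (t = all_tcrs; t != NULL; t = t->next)
    if (t != self)
      sem_post(&t->resume_sem);
}
]]></programlisting>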
243        <para>The GC thread has access to the signal contexts of all
244        TCRs (including its own) at the time when the thread received
245        an exception or acknowledged a request to suspend itself.
246        This information (and information about stack areas in the TCR
247        itself) allows the GC to identify the "stack locations and
248        register contents" that are elements of the GC's root
249        set.</para>
250      </sect2>
251
252      <sect2 id="PC-lusering">
253        <title>PC-lusering</title>
254        <para>It's not quite accurate to say that &CCL;'s compiler
255        and runtime follow precise stack and register usage
256        conventions at all times; there are a few exceptions:</para>
257
258        <itemizedlist>
259          <listitem>
260<para>On both PPC and x86-64 platforms, consing isn't fully atomic.  It takes at least a few instructions to allocate an object in memory (and slap a header on it if necessary); if a thread is interrupted in the middle of that instruction sequence, the new object may or may not have been created or fully initialized at the point in time that the interrupt occurred.  (There are actually a few different states of partial initialization.)</para>
261</listitem>
262          <listitem>
263<para>On the PPC, the common act of building a lisp control stack frame involves allocating a four-word frame and storing three register values into that frame.  (The fourth word - the back pointer to the previous frame - is automatically set when the frame is allocated.)  The previous contents of those three words are unknown (there might have been a foreign stack frame at the same address a few instructions earlier), so interrupting a thread that's in the process of initializing a PPC control stack frame isn't GC-safe.</para>
264</listitem>
265          <listitem>
266<para>There are similar problems with the initialization of temp stack frames on the PPC.  (Allocation and initialization don't happen atomically, and the newly allocated stack memory may have undefined contents.)</para>
267</listitem>
268          <listitem>
269<para>&CCL;'s write barrier has to be implemented atomically (i.e., both an intergenerational store and the update of the corresponding reference bit have to happen without interruption, or neither of these events can happen.)</para>
270</listitem>
271          <listitem>
272<para>There are a few more similar cases.</para>
273</listitem>
274       
275        </itemizedlist>
276
277        <para>Fortunately, the number of these non-atomic instruction sequences is
278small, and fortunately it's fairly easy for the interrupting thread
279to recognize when the interrupted thread is in the middle of such
280a sequence.  When this is detected, the interrupting thread modifies
281the state of the interrupted thread (modifying its PC and other
282registers) so that it is no longer in the middle of such a sequence
283(it's either backed out of it or the remaining instructions are
284emulated.)</para>
285        <para>This works because (a) many of the troublesome instruction sequences
286are PPC-specific and it's relatively easy to partially disassemble the
287instructions surrounding the interrupted thread's PC on the PPC and
288(b) those instruction sequences are heavily stylized and intended to
289be easily recognized.</para>
290      </sect2>
291    </sect1>
292
293    <sect1 id="Register-usage-and-tagging">
294      <title>Register usage and tagging</title>
295
296      <sect2 id="Register-usage-and-tagging-overview">
297        <title>Overview</title>
298        <para>Regardless of other details of its implementation, a
299        garbage collector's job is to partition the set of all
300        heap-allocated lisp objects (CONSes, STRINGs, INSTANCEs, etc.)
301        into two subsets.  The first subset contains all objects that
302        are transitively referenced from a small set of "root" objects
303        (the contents of the stacks and registers of all active
304        threads at the time the GC occurs and the values of some
305        global variables.)  The second subset contains everything
306        else: those lisp objects that are not transitively reachable
307        from the roots are garbage, and the memory occupied by garbage
308        objects can be reclaimed (since the GC has just proven that
309        it's impossible to reference them.)</para>
310        <para>The set of live, reachable lisp objects basically forms
311        the nodes of a (usually large) graph, with edges from each
312        node A to any other objects (nodes) that object A
313        references.</para>
314        <para>Some nodes in this graph can never have outgoing edges:
315        an array with a specialized numeric or character type usually
316        represents its elements in some (possibly more compact)
317        specialized way.  Some nodes may refer to lisp objects that
318        are never allocated in memory (FIXNUMs, CHARACTERs,
319        SINGLE-FLOATs on 64-bit platforms ..)  This latter class of
320        objects are sometimes called "immediates", but that's a little
321        confusing because the term "immediate" is sometimes used to
322        refer to things that can never be part of the big connectivity
323        graph (e.g., the "raw" bits that make up a floating-point
324        value, foreign address, or numeric value that needs to be used
325        - at least fleetingly - in compiled code.)</para>
326        <para>For the GC to be able to build the connectivity graph
327        reliably, it's necessary for it to be able to reliably tell
328        (a) whether or not a "potential root" - the contents of a
329        machine register or stack location - is in fact a node and (b)
330        for any node, whether it may have components that refer to
331        other nodes.</para>
332        <para>There's no reliable way to answer the first question on
333        stock hardware.  (If everything was a node, as might be the
334        case on specially microcoded "lisp machine" hardware, it
335        wouldn't even need to be asked.)  Since there's no way to just
336        look at a machine word (the contents of a machine register or
337        stack location) and tell whether or not it's a node or just
338        some random non-node value, we have to either adopt and
339        enforce strict conventions on register and stack usage or
340        tolerate ambiguity.</para>
341        <para>"Tolerating ambiguity" is an approach taken by some
342        ("conservative") GC schemes; by contrast, &CCL;'s GC is
343        "precise", which in this case means that it believes that the
344        contents of certain machine registers and stack locations are
345        always nodes and that other registers and stack locations are
346        never nodes and that these conventions are never violated by
347        the compiler or runtime system.  The fact that threads are
348        preemptively scheduled means that a GC could occur (because of
349        activity in some other thread) on any instruction boundary,
350        which in turn means that the compiler and runtime system must
351        follow these precise conventions at all times.</para>
352        <para>Once we've decided that a given machine word is a node,
353        a tagging scheme (described below) determines how the node's value and type are encoded in that
354        machine word.</para>
355        <para>Most of this - so far - has discussed things from the
356        GC's very low-level perspective.  From a much higher point of
357        view, lisp functions accept nodes as arguments, return nodes
358        as values, and (usually) perform some operations on those
359        arguments in order to produce those results.  (In many cases,
360        the operations in question involve raw non-node values.)
361        Higher-level parts of the lisp type system (functions like
362        TYPE-OF and CLASS-OF, etc.) depend on the tagging scheme described below.</para>
363      </sect2>
364
365      <sect2 id="pc-locatives-on-the-PPC">
366        <title>pc-locatives on the PPC</title>
367        <para>On the PPC, there's a third case (besides "node" and
368        "immediate" values).  As discussed below, a node that denotes
369        a memory-allocated lisp object is a biased (tagged) pointer
370        -to- that object; it's not generally possible to point -into-
371        some composite (multi-element) object (such a pointer would
372        not be a node, and the GC would have no way to update the
373        pointer if it were to move the underlying object.)</para>
374        <para>Such a pointer ("into" the interior of a heap-allocated
375        object) is often called a <emphasis>locative</emphasis>; the
376        cases where locatives are allowed in &CCL; mostly involve
377        the behavior of function call and return instructions.  (To be
378        technically accurate, the other case also arises on x86-64, but
379        that case isn't as user-visible.)</para>
380        <para>On the PowerPC (both PPC32 and PPC64), all machine
381        instructions are 32 bits wide and all instruction words are
382        allocated on 32-bit boundaries.  In PPC &CCL;, a CODE-VECTOR
383        is a specialized type of vector-like object; its elements are
384        32-bit PPC machine instructions.  A CODE-VECTOR is an
385        attribute of a FUNCTION object; a function call involves
386        accessing the function's code-vector and jumping to the
387        address of its first instruction.</para>
388        <para>As each instruction in the code vector sequentially
389        executes, the hardware program counter (PC) register advances
390        to the address of the next instruction (a locative into the
391        code vector); since PPC instructions are always 32 bits wide
392        and aligned on 32-bit boundaries, the low two bits of the PC
393        are always 0.  If the function executes a call (simple call
394        instructions have the mnemonic "bl" on the PPC, which stands
395        for "branch and link"), the address of the next instruction
396        (also a word-aligned locative into a code-vector) is copied
397        into the special-purpose PPC "link register" (lr); a function
398        returns to its caller via a "branch to link register" (blr)
399        instruction.  Some cases of function call and return might
400        also use the PPC's "count register" (ctr), and if either the
401        lr or ctr needs to be stored in memory it needs to first be
402        copied to a general-purpose register.</para>
403        <para>&CCL;'s GC understands that certain registers contain
404        these special "pc-locatives" (locatives that point into
405        CODE-VECTOR objects); it contains special support for finding
406        the containing CODE-VECTOR object and for adjusting all of
407        these "pc-locatives" if the containing object is moved in
408        memory.  The first part of that - finding the containing
409        object - is possible and practical on the PPC because of
410        architectural artifacts (fixed-width instructions and arcana
411        of instruction encoding.)  It's not possible on x86-64, but
412        fortunately not necessary either (though the second part -
413        adjusting the PC/RIP when the containing object moves - is both
414        necessary and simple.)</para>
415      </sect2>
416
417      <sect2 id="Register-and-stack-usage-conventions">
418        <title>Register and stack usage conventions</title>
419
420        <sect3 id="Stack-conventions">
421          <title>Stack conventions</title>
422          <para>On both PPC and X86 platforms, each lisp thread uses 3
423          stacks; the ways in which these stacks are used differs
424          between the PPC and X86.</para>
425          <para>Each thread has:</para>
426          <itemizedlist>
427            <listitem>
428              <para>A "control stack".  On both platforms, this is
429              "the stack" used by foreign code.  On the PPC, it
430              consists of a linked list of frames where the first word
431              in each frame points to the first word in the previous
432              frame (and the outermost frame points to 0.)  Some
433              frames on a PPC control stack are lisp frames; lisp
434              frames are always 4 words in size and contain (in
435              addition to the back pointer to the previous frame) the
436              calling function (a node), the return address (a
437              "locative" into the calling function's code-vector), and
438              the value to which the value-stack pointer (see below)
439              should be restored on function exit.  On the PPC, the GC
440              has to look at control-stack frames, identify which of
441              those frames are lisp frames, and treat the contents
442              of the saved function slot as a node (and handle the
443              return address locative specially.)  On x86-64, the
444              control stack is used for dynamic-extent allocation of
445              immediate objects.  Since the control stack never
446              contains nodes on x86-64, the GC ignores it on that
447              platform.  Alignment of the control stack follows the
448              ABI conventions of the platform (at least at any point
449              in time where foreign code could run.)  On PPC, the r1
450              register always points to the top of the current
451              thread's control stack; on x86-64, the RSP register
452              points to the top of the current thread's control stack
453              when the thread is running foreign code and the address
454              of the top of the control stack is kept in the thread's
455              TCR when not running foreign code.  The control
456              stack "grows down."</para>
457            </listitem>
458            <listitem>
459              <para>A "value stack".  On both platforms, all values on
460              the value stack are nodes (including "tagged return
461              addresses" on x86-64.)  The value stack is always
462              aligned to the native word size; objects are always
463              pushed on the value stack using atomic instructions
464              ("stwu"/"stdu" on PPC, "push" on x86-64), so the
465              contents of the value stack between its bottom and top
466              are always unambiguously nodes; the compiler usually
467              tries to pop or discard nodes from the value stack as
468              soon as possible after their last use (as soon as they
469              may have become garbage.)  On x86-64, the RSP register
470              addresses the top of the value stack when running lisp
471              code; that address is saved in the TCR when running
472              foreign code.  On the PPC, a dedicated register (VSP,
473              currently r15) is used to address the top of the value
474              stack when running lisp code, and the VSP value is saved
475              in the TCR when running foreign code.  The value stack
476              grows down.</para>
477            </listitem>
478            <listitem>
479              <para>A "temp stack".  The temp stack consists of a
480              linked list of frames, each of which points to the
481              previous temp stack frame.  The number of native machine
482              words in each temp stack frame is always even, so the
483              temp stack is aligned on a two-word (64- or 128-bit)
484              boundary.  The temp stack is used for dynamic-extent
485              objects on both platforms; on the PPC, it's used for
486              essentially all such objects (regardless of whether or
487              not the objects contain nodes); on the x86-64, immediate
488              dynamic-extent objects (strings, foreign pointers, etc.)
489              are allocated on the control stack and only
490              node-containing dynamic-extent objects are allocated on
491              the temp stack.  Data structures used to implement CATCH
492              and UNWIND-PROTECT are stored on the temp stack on both
493              ppc and x86-64.  Temp stack frames are always doublenode
494              aligned and objects within a temp stack frame are
495              aligned on doublenode boundaries.  The first word in
496              each frame contains a back pointer to the previous
497              frame; on the PPC, the second word is used to indicate
498              to the GC whether the remaining objects are nodes (if
499              the second word is 0) or immediate (otherwise.)  On
500              x86-64, where temp stack frames always contain nodes,
501              the second word is always 0.  The temp stack grows down.
502              It usually takes several instructions to allocate and
503              safely initialize a temp stack frame that's intended to
504              contain nodes, and the GC has to recognize the case
505              where a thread is in the process of allocating and
506              initializing a temp stack frame and take care not to
507              interpret any uninitialized words in the frame as nodes.
508              See (someplace).  The PPC keeps the current top of the
509              temp stack in a dedicated register (TSP, currently r12)
510              when running lisp code and saves this register's value
511              in the TCR when running foreign code.  The x86-64 keeps
512              the address of the top of each thread's temp stack in
513              the thread's TCR.</para>
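              <para>(A sketch of a temp stack frame's layout appears just
              after this list.)</para>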
514            </listitem>
515          </itemizedlist>
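          <para>As noted in the temp stack item above, each temp stack frame
          starts with a back pointer and a word that tells the GC how to
          treat the rest of the frame.  A minimal, hypothetical C sketch of
          that layout (for illustration only, not the lisp kernel's actual
          definitions):</para>
          <programlisting><![CDATA[
#include <stdint.h>

/* Illustrative temp stack frame layout.  Frames are allocated on
   doublenode boundaries and always occupy an even number of words. */
typedef struct temp_stack_frame {
  struct temp_stack_frame *backptr; /* previous (older) temp stack frame */
  uintptr_t contents_kind;          /* 0 => remaining words are nodes (always 0 on x86-64);
                                       non-zero => remaining words are immediate data */
  uintptr_t data[];                 /* the frame's doublenode-aligned contents */
} temp_stack_frame;
]]></programlisting>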
516        </sect3>
517
518        <sect3 id="Register-conventions">
519          <title>Register conventions</title>
520          <para>If there are a "reasonable" (for some value of
521          "reasonable") number of general-purpose registers and the
522          instruction set is "reasonably" orthogonal (most
523          instructions that operate on GPRs can operate on any GPR),
524          then it's possible to statically partition the GPRs into at
525          least two sets: "immediate registers" never contain nodes,
526          and "node registers" always contain nodes.  (On the PPC, a
527          few registers are members of a third set of "PC locatives",
528          and on both platforms some registers may have dedicated
529          roles as stack or heap pointers; the latter class is treated
530          as immediates by the GC proper but may be used to help
531          determine the bounds of stack and heap memory areas.)</para>
532          <para>The ultimate definition of register partitioning is
533          hardwired into the GC in functions like "mark_xp()" and
534          "forward_xp()", which process the values of some of the
535          registers in an exception frame as nodes and may give some
536          sort of special treatment to other register values they
537          encounter there.</para>
538          <para>On x86-64, the static register partitioning scheme involves:</para>
539          <itemizedlist>
540            <listitem>
541              <para>(only) two "immediate" registers.  The RAX and RDX
542              registers are used as the implicit operands and results
543              of some extended-precision multiply and divide
544              instructions which generally involve non-node values;
545              since their use in these instructions means that they
546              can't be guaranteed to contain node values at all times,
547              it's natural to put these registers in the "immediate"
548              set.  RAX is generally given the symbolic name
549              "imm0", and RDX is given the symbolic name "imm1"; you
550              may see these names in disassembled code, usually in
551              operations involving type checking, array indexing, and
552              foreign memory and function access.</para>
553            </listitem>
554            <listitem>
555              <para>(only) two "dedicated" registers.  RSP and RBP have
556              dedicated functionality dictated by the hardware and
557              calling conventions.  (There are a few places where RBP
558              is temporarily used as an extra immediate
559              register.)</para>
560            </listitem>
561            <listitem>
562              <para>12 "node" registers.  All other registers (RBX, RCX,
563              RSI, RDI, and R8-R15) are asserted to contain node values
564              at (almost) all times; legacy "string" operations that
565              implicitly use RSI and/or RDI are not used.  Shift and
566              rotate instructions which shift/rotate by a variable
567              number of bits are required by the architecture to use
568              the low byte of RCX (the traditional CL register) as the
569              implicit shift count; when it's necessary to keep a
570              non-node shift count in the low byte of RCX, the upper 7
571              bytes of the register are zeroed (so that
572              misinterpretation of the immediate value in RCX as a node
573              will not have negative GC effects.  The GC might briefly
574              treat it as a node, but since it's not pointing
575              anywhere near the lisp heap it'll soon lose interest in
576              it.)  Legacy instructions that use RCX (or some portions
577              of it) as a loop counter can not be used (since such
578              instructions might introduce non-node values into
579              RCX.)</para>
580</listitem>
581          </itemizedlist>
582          <para>On the PPC, the static register partitioning scheme involves:</para>
583
584          <itemizedlist>
585            <listitem>
586              <para>6 "immediate" registers.  Registers r3-r8 are given
587              the symbolic names imm0-imm5.  As a RISC architecture
588              with simpler addressing modes, the PPC probably
589              uses immediate registers a bit more often than the CISC
590              x86-64 does, but they're generally used for the same sort
591              of things (type checking, array indexing, FFI,
592              etc.)</para>
593            </listitem>
594            <listitem>
595              <para>9 dedicated registers
596              <itemizedlist>
597                <listitem>
598                  <para>r0 (symbolic name rzero) always contains the
599                  value 0 when running lisp code.  Its value is
600                  sometimes read as 0 when it's used as the base
601                  register in a memory address; keeping the value 0
602                  there is sometimes convenient and avoids
603                  asymmetry.</para>
604                </listitem>
605                <listitem>
606                  <para>r1 (symbolic name sp) is the control stack
607                  pointer, by PPC convention.</para>
608                </listitem>
609                <listitem>
610                  <para>r2 is used to hold the current thread's TCR on
611                  ppc64 systems; it's not used on ppc32.</para>
612                </listitem>
613                <listitem>
614                  <para>r9 and r10 (symbolic names allocptr and
615                  allocbase) are used to do per-thread memory
616                  allocation</para>
617                </listitem>
618                <listitem>
619                  <para>r11 (symbolic name nargs) contains the number
620                  of function arguments on entry and the number of
621                  return values in multiple-value returning
622                  constructs.  It's not used more generally as either
623                  a node or immediate register because of the way that
624                  certain trap instruction encodings are
625                  interpreted.</para>
626                </listitem>
627                <listitem>
628                  <para>r12 (symbolic name tsp) holds the top of the current thread's temp stack.</para>
629                </listitem>
630                <listitem>
631                  <para>r13 is used to hold the TCR on PPC32 systems; it's not used on PPC64.</para>
632                </listitem>
633                <listitem>
634                  <para>r14 (symbolic name loc-pc) is used to copy
635                  "pc-locative" values between main memory and
636                  special-purpose PPC registers (LR and CTR) used in
637                  function-call and return instructions.</para>
638                </listitem>
639                <listitem>
640                  <para>r15 (symbolic name vsp) addresses the top of
641                  the current thread's value stack.</para>
642                </listitem>
643                <listitem>
644                  <para>lr and ctr are PPC branch-unit registers used
645                  in function call and return instructions; they're
646                  always treated as "pc-locatives", which precludes
647                  the use of the ctr in some PPC looping
648                  constructs.</para>
649                </listitem>
650             
651              </itemizedlist>
652              </para>
653            </listitem>
654            <listitem>
655              <para>17 "node" registers.  r15-r31 are always treated as
656              node registers.</para>
657            </listitem>
658           
659          </itemizedlist>
660        </sect3>
661      </sect2>
662
663      <sect2 id="Tagging-scheme">
664        <title>Tagging scheme</title>
665        <para>&CCL; always allocates lisp objects on double-node
666        (64-bit for 32-bit platforms, 128-bit for 64-bit platforms)
667        boundaries; this means that the low 3 bits (32-bit lisp) or 4
668        bits (64-bit lisp) are always 0 and are therefore redundant
669        (we only really need to know the upper 29 or 60 bits in order
670        to identify the aligned object address.)  The extra bits in a
671        lisp node can be used to encode at least some information
672        about the node's type, and the other 29/60 bits represent
673        either an immediate value or a doublenode-aligned memory
674        address.  The low 3 or 4 bits of a node are called the node's
675        "tag bits", and the conventions used to encode type
676        information in those tag bits are called a "tagging
677        scheme."</para>
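        <para>In C terms, recovering a node's tag and the doublenode-aligned
        address that it encodes is just a matter of masking; the following is
        an illustrative sketch (not &CCL;'s actual definitions), assuming the
        3-bit and 4-bit tag widths described above:</para>
        <programlisting><![CDATA[
#include <stdint.h>

#define TAG_MASK_32 0x7    /* low 3 bits on 32-bit platforms */
#define TAG_MASK_64 0xf    /* low 4 bits on 64-bit platforms */

/* The tag bits of a node. */
static inline unsigned tag_of_32(uint32_t node) { return node & TAG_MASK_32; }
static inline unsigned tag_of_64(uint64_t node) { return node & TAG_MASK_64; }

/* The doublenode-aligned address encoded by a node. */
static inline uint32_t untagged_32(uint32_t node) { return node & ~(uint32_t)TAG_MASK_32; }
static inline uint64_t untagged_64(uint64_t node) { return node & ~(uint64_t)TAG_MASK_64; }
]]></programlisting>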
678        <para>It might be possible to use the same tagging scheme on
679        all platforms (at least on all platforms with the same word
680        size and/or the same number of available tag bits), but there
681        are often some strong reasons for not doing so.  These
682        arguments tend to be very machine-specific: sometimes, there
683        are fairly obvious machine-dependent tricks that can be
684        exploited to make common operations on some types of tagged
685        objects faster; other times, there are architectural
686        restrictions that make it impractical to use certain tags for
687        certain types.  (On PPC64, the "ld" (load doubleword) and
688        "std" (store doubleword) instructions - which load and store a
689        GPR operand at the effective address formed by adding the
690        value of another GPR operand and a 16-bit constant operand -
691        require that the low two bits of that constant operand be 0.
692        Since such instructions would typically be used to access the
693        fields of things like CONS cells and structures, it's
694        desirable that the tags chosen for CONS cells and
695        structures allow the use of these instructions as opposed to
696        more expensive alternatives.)</para>
697        <para>One architecture-dependent tagging trick that works well
698        on all architectures is to use a tag of 0 for FIXNUMs: a
699        fixnum basically encodes its value shifted left a few bits and
700        keeps those low bits clear. FIXNUM addition, subtraction, and
701        binary logical operations can operate directly on the node
702        operands, addition and subtraction can exploit hardware-based
703        overflow detection, and (in the absence of overflow) the
704        hardware result of those operations is a node (fixnum).  Some
705        other slightly-less-common operations may require a few extra
706        instructions, but arithmetic operations on FIXNUMs should be
707        as cheap as possible and using a tag of zero for FIXNUMs helps
708        to ensure that it will be.</para> 
709        <para>If we have N available tag bits (N = 3 for 32-bit
710        &CCL; and N = 4 for 64-bit &CCL;), this way of
711        representing fixnums with the low M bits forced to 0 works as
712        long as M &lt;= N.  The smaller we make M, the larger the
713        values of MOST-POSITIVE-FIXNUM and MOST-NEGATIVE-FIXNUM become; the
714        larger we make M, the more distinct non-FIXNUM tags become
715        available.  A reasonable compromise is to choose M = N-1; this
716        basically yields two distinct FIXNUM tags (one for even
717        fixnums, one for odd fixnums), gives 30-bit fixnums on 32-bit
718        platforms and 61-bit fixnums on 64-bit platforms, and leaves
719        us with 6 or 14 tags to encode other types.</para>
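        <para>Concretely, with M = 2 on 32-bit platforms and M = 3 on 64-bit
        platforms, fixnum "boxing" and "unboxing" reduce to shifts.  The
        following is an illustrative sketch, not &CCL;'s actual
        definitions:</para>
        <programlisting><![CDATA[
#include <stdint.h>

#define FIXNUM_SHIFT_32 2   /* M on 32-bit platforms: 30-bit fixnums */
#define FIXNUM_SHIFT_64 3   /* M on 64-bit platforms: 61-bit fixnums */

/* A fixnum node is just the value shifted left so its low M bits are 0;
   both the "even" and "odd" fixnum tags satisfy this test. */
static inline int is_fixnum_32(uint32_t node) { return (node & ((1u << FIXNUM_SHIFT_32) - 1)) == 0; }
static inline int is_fixnum_64(uint64_t node) { return (node & ((1ull << FIXNUM_SHIFT_64) - 1)) == 0; }

static inline int32_t box_fixnum_32(int32_t value)  { return (int32_t)((uint32_t)value << FIXNUM_SHIFT_32); }
static inline int32_t unbox_fixnum_32(int32_t node) { return node >> FIXNUM_SHIFT_32; }

static inline int64_t box_fixnum_64(int64_t value)  { return (int64_t)((uint64_t)value << FIXNUM_SHIFT_64); }
static inline int64_t unbox_fixnum_64(int64_t node) { return node >> FIXNUM_SHIFT_64; }
]]></programlisting>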
720        <para>Once we get past the assignment of FIXNUM tags, things
721        quickly devolve into machine-dependencies.  We can fairly
722        easily see that we can't directly encode all other primitive lisp
723        object types with only 6 or 14 available tag values; the
724        details of how types are encoded vary between the ppc32,
725        ppc64, and x86-64 implementations, but there are some general
726        common principles:</para>
727
728        <itemizedlist>
729          <listitem>
730            <para>CONS cells always contain exactly 2 elements and are
731            usually fairly common.  It therefore makes sense to give
732            CONS cells their own tag.  Unlike the fixnum case - where a
733            tag value of 0 had positive implications - there doesn't
734            seem to be any advantage to using any particular value.
735            (A long time ago - in the case of 68K MCL - the CONS tag
736            and the order of CAR and CDR in memory were chosen to allow
737            smaller, cheaper addressing modes to be used to "cdr down a
738            list."  That's not a factor on ppc or x86-64, but all
739            versions of &CCL; still store the CDR of a CONS cell
740            first in memory.  It doesn't matter, but doing it the way
741            that the host system did made bootstrapping to a new target
742            system a little easier.)
743            </para>
744          </listitem>
745          <listitem>
746            <para>Any way you look at it, NIL is a bit ... unusual.  NIL
747            is both a SYMBOL and a LIST (as well as being a canonical
748            truth value and probably a few other things.)  Its role as
749            a LIST is probably much more important to most programs
750            than its role as a SYMBOL is: LISTP has to be true of NIL
751            and primitives like CAR and CDR do LISTP implicitly when
752            safe and want that operation to be fast.  There are several
753            possible approaches to this; &CCL; uses two of them. On
754            PPC32 and X86-64, NIL is basically a weird CONS cell that
755            straddles two doublenodes; the tag of NIL is unique and
756            congruent modulo 4 (modulo 8 on 64-bit) with the tag used
757            for CONS cells.  LISTP is therefore true of any node whose
758            low 2 (or 3) bits contain the appropriate tag value (it's
759            not otherwise necessary to special-case NIL.)
760            SYMBOL accessors (SYMBOL-NAME, SYMBOL-VALUE, SYMBOL-PLIST
761            ..) -do- have to special-case NIL (and access the
762            components of an internal proxy symbol.) On PPC64 (where
763            architectural restrictions dictate the set of tags that can
764            be used to access fixed components of an object),
765            that approach wasn't practical.  NIL is just a
766            distinguished SYMBOL, and it just happens to be the case
767            that its pname slot and value slots are at the same offsets
768            from a tagged pointer as a CONS cell's CDR and CAR would be.
769            NIL's pname is set to NIL (SYMBOL-NAME checks for this and
770            returns the string "NIL"), and LISTP (and therefore safe
771            CAR and CDR) have to check for (OR NULL CONSP).  At least in
772            the case of CAR and CDR, the fact that the PPC has multiple
773            condition-code fields keeps that extra test from
774            being prohibitively expensive.</para>
775          </listitem>
776          <listitem>
777            <para>Some objects (but not FIXNUMs) are immediate.  This is
778            true of CHARACTERs and, on 64-bit platforms,
779            SINGLE-FLOATs.  It's also true of some nodes used in the
780            runtime system (special values used to indicate unbound
781            variables and slots, for instance.)  On 64-bit platforms,
782            SINGLE-FLOATs have their own unique tag (making them a
783            little easier to recognize); on all platforms, CHARACTERs
784            share a tag with other immediate objects (unbound markers)
785            but are easy to recognize (by looking at several of their
786            low bits.)  The GC treats any node with an immediate tag
787            (and any node with a fixnum tag) as a leaf.</para>
788          </listitem>
789          <listitem>
790            <para>There are some advantages to treating everything
791            else - memory-allocated objects that aren't CONS cells -
792            uniformly.  There are some disadvantages to that uniform
793            treatment as well, and the treatment of "memory-allocated
794            non-CONS objects" isn't entirely uniform across all
795            &CCL; implementations.  Let's first pretend that
796            the treatment is uniform, then discuss the ways in which it
797            isn't.  The "uniform approach" is to treat all
798            memory-allocated non-CONS objects as if they were vectors;
799            this use of the term is a little looser than what's implied
800            by the CL VECTOR type.  &CCL; actually uses the
801            term "uvector" to mean "a memory-allocated lisp object
802            other than a CONS cell, whose first word is a header which
803            describes the object's type and the number of elements that
804            it contains."  In this view, a SYMBOL is a UVECTOR, as is a
805            STRING, a STANDARD-INSTANCE, a CL array or vector, a
806            FUNCTION, and even a DOUBLE-FLOAT.  In the PPC
807            implementations (where things are a little more
808            ... uniform), a single tag value is used to denote any
809            uvector; in order to determine something more specific
810            about the type of the object in question, it's necessary to
811            fetch the low byte of the header word from memory.  On
812            the x86-64 platform, certain types of uvectors - SYMBOLs
813            and FUNCTIONs - are given their own unique tags.  The good
814            news about the x86-64 approach is that SYMBOLs and
815            FUNCTIONs can be recognized without referencing memory; the
816            slightly bad news is that primitive operations that work on
817            UVECTOR-tagged objects - like the function CCL:UVREF -
818            don't work on SYMBOLs or FUNCTIONs on x86-64 (but -do- work
819            on those types of objects in the PPC ports.)  The header word
820            which precedes a UVECTOR's data in memory contains 8 bits
821            of type information in the low byte and either 24 or 56
822            bits of "element-count" information in the rest of the
823            word.  (This is where the sometimes-limiting value of 2^24
824            for ARRAY-TOTAL-SIZE-LIMIT on PPC32 platforms comes from.)
825            The low byte of the header - sometimes called the uvector's
826            subtag - is itself tagged (which means that the header is
827            tagged.)  The (3 or 4) tag bits in the subtag are used to
828            determine whether the uvector's elements are nodes or
829            immediates.  (A UVECTOR whose elements are nodes is called a
830            GVECTOR; a UVECTOR whose elements are immediates is called
831            an IVECTOR.  This terminology came from Spice Lisp, which
832            was a predecessor of CMUCL.)  Even though a uvector header
833            is tagged, a header is not a node.  There's no (supported)
834            way to get your hands on one in lisp and doing so could be
835            dangerous.  (If the value of a header wound up in a lisp
836            node register and that register wound up getting pushed on
837            a thread's value stack, the GC might misinterpret that
838            situation to mean that there was a stack-allocated UVECTOR
839            on the value stack.)</para>
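            <para>(A sketch of how a uvector header word can be decoded
            appears just after this list.)</para>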
840          </listitem>
841       
842        </itemizedlist>
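        <para>As mentioned in the last item above, a uvector's header packs
        8 bits of (tagged) subtag information into its low byte and an
        element count into the remaining bits.  A hypothetical sketch of
        decoding such a header (illustrative macros, not &CCL;'s actual
        definitions):</para>
        <programlisting><![CDATA[
#include <stdint.h>

/* Illustrative decoding of a uvector header word: the subtag occupies the
   low 8 bits, the element count the remaining 24 (32-bit) or 56 (64-bit)
   bits.  The subtag's own low 3 or 4 tag bits say whether the uvector's
   elements are nodes (a "gvector") or immediate data (an "ivector"). */
#define UVECTOR_SUBTAG(header)        ((unsigned)((header) & 0xff))
#define UVECTOR_ELEMENT_COUNT(header) ((uintptr_t)(header) >> 8)
]]></programlisting>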
843      </sect2>
844    </sect1>
845
846    <sect1 id="Heap-Allocation">
847      <title>Heap Allocation</title> <para>When the &CCL; kernel
848      first starts up, a large contiguous chunk of the process's
849      address space is mapped as "anonymous, no access"
850      memory. ("Large" means different things in different contexts;
851      on LinuxPPC32, it means "about 1 gigabyte", on DarwinPPC32, it
852      means "about 2 gigabytes", and on current 64-bit platforms it
853      ranges from 128 to 512 gigabytes, depending on OS. These values
854      are both defaults and upper limits; the --heap-reserve
855      argument can be used to try to reserve less than the
856      default.)</para>
857      <para>Reserving address space that can't (yet) be read or
858      written to doesn't cost much; in particular, it doesn't require
859      that corresponding swap space or physical memory be available.
860      Marking the address range as being "mapped" helps to ensure that
861      other things (results from random calls to malloc(), dynamically
862      loaded shared libraries) won't be allocated in this region that
863      lisp has reserved for its own heap growth.</para>
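      <para>A sketch of that initial reservation (illustrative, not the lisp
      kernel's actual startup code) might look like this; pages are later
      made usable with mprotect() as the heap and the GC's data structures
      grow:</para>
      <programlisting><![CDATA[
#include <sys/mman.h>
#include <stddef.h>

/* Reserve a large, contiguous, inaccessible region of address space for
   future heap growth.  PROT_NONE means the pages can't be read or written
   until they're explicitly made accessible later. */
static void *
reserve_heap(size_t nbytes)
{
  return mmap(NULL, nbytes, PROT_NONE,
              MAP_PRIVATE | MAP_ANON,
              -1, 0);
}
]]></programlisting>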
864      <para>A small portion (around 1/32 on 32-bit platforms and 1/64
865      on 64-bit platforms) of that large chunk of address space is
866      reserved for GC data structures.  Memory pages reserved for
867      these data structures are mapped read-write as pages are made
868      writable in the main portion of the heap.</para>
869      <para>The initial heap image is mapped into this reserved
870      address space and an additional (LISP-HEAP-GC-THRESHOLD) bytes
871      are mapped read-write.  GC data structures grow to match the
872      amount of GC-able memory in the initial image + the gc
873      threshold, and control is transferred to lisp code.  Inevitably,
874      that code spoils everything and starts consing; there are
875      basically three layers of memory allocation that can go
876      on.</para>
877
878      <sect2 id="Per-thread-object-allocation">
879        <title>Per-thread object allocation</title>
880        <para>Each lisp thread has a private "reserved memory
881        segment"; when a thread starts up, its reserved memory segment
882        is empty.  PPC ports maintain the highest unallocated address
883        and the lowest allocated address in the current segment in
884        registers when running lisp code; on x86-64, these values are
885        maintained in the current thread's TCR.  (An "empty" heap
886        segment is one whose high pointer and low pointer are equal.)
887        When a thread is not in the middle of allocating something, the
888        low 3 or 4 bits of the high and low pointers are clear (the
889        pointers are doublenode-aligned.)</para>
890        <para>A thread tries to allocate an object whose physical size
891        in bytes is X and whose tag is Y by:</para>
892        <orderedlist>
893          <listitem>
894            <para>decrementing the "high" pointer by (- X Y)</para>
895          </listitem>
896          <listitem>
897            <para>trapping if the high pointer is less than the low
898            pointer</para>
899          </listitem>
900          <listitem>
901            <para>using the (tagged) high pointer to initialize the
902            object, if necessary</para>
903          </listitem>
904          <listitem>
905            <para>clearing the low bits of the high pointer</para>
906          </listitem>
907        </orderedlist>
908        <para>On PPC32, where the size of a CONS cell is 8 bytes and
909        the tag of a CONS cell is 1, machine code which sets the arg_z
910        register to the result of doing (CONS arg_y arg_z) looks
911        like:</para>
912        <programlisting>
  (SUBI ALLOCPTR ALLOCPTR 7)    ; decrement the high pointer by (- 8 1)
  (TWLLT ALLOCPTR ALLOCBASE)    ; trap if the high pointer is below the base
  (STW ARG_Z -1 ALLOCPTR)       ; set the CDR of the tagged high pointer
  (STW ARG_Y 3 ALLOCPTR)        ; set the CAR
  (MR ARG_Z ALLOCPTR)           ; arg_z is the new CONS cell
  (RLWINM ALLOCPTR ALLOCPTR 0 0 28)     ; clear tag bits
</programlisting>
920        <para>On x86-64, the idea's similar but the implementation is
921        different.  The high and low pointers to the current thread's
922        reserved segment are kept in the TCR, which is addressed by
923        the gs segment register. An x86-64 CONS cell is 16 bytes wide
924        and has a tag of 3; we canonically use the temp0 register to
        initialize the object.</para>
926        <programlisting>
  (subq ($ 13) ((% gs) 216))    ; decrement allocptr
  (movq ((% gs) 216) (% temp0)) ; load allocptr into temp0
  (cmpq ((% gs) 224) (% temp0)) ; compare to allocbase
  (jg L1)                       ; skip trap
  (uuo-alloc)                   ; uh, don't skip trap
L1
  (andb ($ 240) ((% gs) 216))   ; untag allocptr in the tcr
  (movq (% arg_y) (5 (% temp0))) ; set the car
  (movq (% arg_z) (-3 (% temp0))); set the cdr
  (movq (% temp0) (% arg_z))    ; return the cons
        </programlisting>
938        <para>If we don't take the trap (if allocating 8-16 bytes
939        doesn't exhaust the thread's reserved memory segment), that's
940        a fairly short and simple instruction sequence.  If we do take
941        the trap, we'll have to do some additional work in order to
942        get a new segment for the current thread.</para>
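        <para>As a hedged C sketch (illustrative field and function
        names only; the real fast path is the inline machine code shown
        above, and the real slow path runs in the kernel's exception
        handler), the same four steps look like this, with a toy slow
        path that simply hands the thread a fresh malloc'd
        segment:</para>
        <programlisting>
#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;

/* Illustrative stand-in for the allocation-related TCR fields. */
typedef struct {
  char *allocptr;    /* "high" pointer of the reserved segment */
  char *allocbase;   /* "low" pointer of the reserved segment  */
} tcr_t;

#define TAG_MASK ((uintptr_t)0xf)   /* 4 low tag bits (64-bit ports)    */
#define SEGMENT_BYTES (64 * 1024)   /* made-up per-thread segment size  */

/* Toy slow path: the real kernel goes through the exception handler,
   possibly GCs, and then gives the thread a new segment.             */
static void alloc_trap(tcr_t *tcr) {
  char *seg = malloc(SEGMENT_BYTES);   /* assume suitably aligned      */
  tcr->allocbase = seg;
  tcr->allocptr = seg + SEGMENT_BYTES;
}

/* Allocate an object of physical size 'bytes' whose tag is 'tag';
   assumes bytes is doublenode-aligned and no larger than a segment.  */
static uintptr_t allocate(tcr_t *tcr, uintptr_t bytes, uintptr_t tag) {
  tcr->allocptr -= (bytes - tag);           /* 1. decrement the high pointer */
  if (tcr->allocptr &lt; tcr->allocbase) {     /* 2. trap if it passed the base */
    alloc_trap(tcr);
    tcr->allocptr -= (bytes - tag);         /*    redo in the new segment    */
  }
  uintptr_t tagged = (uintptr_t)tcr->allocptr;   /* 3. tagged object pointer */
  tcr->allocptr = (char *)(tagged &amp; ~TAG_MASK);  /* 4. clear the low bits    */
  return tagged;
}
        </programlisting>
        <para>With these made-up parameters, consing on a 64-bit port
        corresponds to <literal>allocate(&amp;tcr, 16, 3)</literal>, which
        decrements the high pointer by 13, matching the
        <literal>(subq ($ 13) ...)</literal> in the listing above.</para>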
943      </sect2>
944
945      <sect2 id="Allocation-of-reserved-heap-segments">
946        <title>Allocation of reserved heap segments</title>
947        <para>After the lisp image is first mapped into memory - and after
948        each full GC - the lisp kernel ensures that
        (LISP-HEAP-GC-THRESHOLD) additional bytes beyond the current
950        end of the heap are mapped read-write.</para>
        <para>If a thread traps while trying to allocate memory, the
        thread goes through the usual exception-handling protocol (to
        ensure that any other thread that GCs "sees" the state of the
        trapping thread and to serialize exception handling.)  When
        the exception handler runs, it determines the nature and size
        of the failed allocation and tries to complete the allocation
        on the thread's behalf (and leave it with a reasonably large
        thread-specific memory segment, so that the next small
        allocation is unlikely to trap.)</para>
960        <para>Depending on the size of the requested segment
961        allocation, the number of segment allocations that have
962        occurred since the last GC, and the EGC and GC thresholds, the
963        segment allocation trap handler may invoke a full or ephemeral
        GC before returning a new segment.  It's worth noting that the
        [E]GC is triggered by the number and total size of the
        segments that have been allocated since the last GC; it doesn't
        have much to do with how "full" each of those per-thread
        segments is.  It's possible for a large number of threads to
        do fairly incidental memory allocation and trigger the GC as a
        result; avoiding this involves tuning the per-thread
        allocation quantum and the GC/EGC thresholds
        appropriately.</para>
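        <para>One way to picture that policy is the hypothetical C
        sketch below; none of these names or counters belong to the
        kernel, whose actual decision logic lives in the
        segment-allocation trap handler.  The point is only that the
        trigger depends on how much segment space has been handed out,
        not on how much of it the threads have used.</para>
        <programlisting>
#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

/* Hypothetical bookkeeping for a segment-allocation trap handler. */
typedef struct {
  uint64_t bytes_since_full_gc;  /* segment bytes handed out since a full GC */
  uint64_t bytes_since_egc;      /* ditto, since the last ephemeral GC       */
  uint64_t gc_threshold;         /* cf. LISP-HEAP-GC-THRESHOLD, in bytes     */
  uint64_t egc_threshold;        /* youngest-generation threshold, in bytes  */
  bool     egc_enabled;
} gc_policy_t;

enum gc_action { GIVE_SEGMENT, RUN_EGC, RUN_FULL_GC };

static enum gc_action on_segment_request(gc_policy_t *p, uint64_t request) {
  if (p->bytes_since_full_gc + request > p->gc_threshold)
    return RUN_FULL_GC;              /* full GC, then hand out the segment */
  if (p->egc_enabled &amp;&amp; p->bytes_since_egc + request > p->egc_threshold)
    return RUN_EGC;                  /* ephemeral GC first                 */
  p->bytes_since_full_gc += request;
  p->bytes_since_egc += request;
  return GIVE_SEGMENT;               /* just hand out a new segment        */
}
        </programlisting>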
973      </sect2>
974
975      <sect2 id="Heap-growth">
976        <title>Heap growth</title>
977        <para>All OSes on which &CCL; currently runs use an
978        "overcommit" memory allocation strategy by default (though
979        some of them provide ways of overriding that default.)  What
980        this means in general is that the OS doesn't necessarily
981        ensure that backing store is available when asked to map pages
982        as read-write; it'll often return a success indicator from the
983        mapping attempt (mapping the pages as "zero-fill,
984        copy-on-write"), and only try to allocate the backing store
985        (swap space and/or physical memory) when non-zero contents are
986        written to the pages.</para>
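        <para>A small C experiment illustrates the point; the exact
        behavior depends on the OS and its overcommit settings, so
        treat this as a sketch.  The mapping call itself usually
        succeeds, and failure only shows up once pages are actually
        written.</para>
        <programlisting>
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/mman.h&gt;

int main(void) {
  size_t huge = (size_t)64 &lt;&lt; 30;   /* 64GB: likely more than RAM + swap */
  char *p = mmap(NULL, huge, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANON, -1, 0);
  if (p == MAP_FAILED) {
    puts("refused up front (stricter overcommit settings)");
    return 1;
  }
  puts("mapped read-write; backing store not yet allocated");
  /* Backing store is only claimed as pages are written; touching all
     of these pages would eventually mean heavy paging or an abrupt
     kill, not a tidy error return.                                   */
  memset(p, 0x5a, 4096);            /* touch a single page: cheap      */
  return 0;
}
        </programlisting>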
        <para>It <emphasis>sounds</emphasis> like it'd be better to have
        the mmap() call fail immediately, but it's actually a
        complicated issue.  (It's possible that other applications will
        stop using some backing store before lisp code actually touches
        the pages that need it, for instance.)  It's also not guaranteed
        that lisp code would be able to "cleanly" signal an
        out-of-memory condition if lisp is ... out of memory.</para>
        <para>I don't know that I've ever seen an abrupt out-of-memory
        failure that wasn't preceded by several minutes of excessive
        paging activity.  The most expedient course in cases like this
        is to either (a) use less memory or (b) get more memory; it's
        generally hard to use memory that you don't have.</para>
999      </sect2>
1000    </sect1>
1001
1002    <sect1 id="GC-details">
1003      <title>GC details</title>
      <para>The GC uses a Mark/Compact algorithm; its
      execution time is essentially proportional to the amount of live
      data in the heap. (The somewhat better-known Mark/Sweep
      algorithms don't compact the live data but instead traverse the
      garbage to rebuild free-lists; their execution time is therefore
      proportional to the total heap size.)</para>
      <para>As mentioned above, two auxiliary data structures
      (proportional to the size of the lisp heap) are maintained. These are:</para>
1012      <orderedlist>
1013        <listitem>
          <para>the markbits bitvector, which contains a bit for
          every doublenode in the dynamic heap (plus a few extra words
          for alignment and so that sub-bitvectors can start on word
          boundaries.)</para>
1018        </listitem>
1019        <listitem>
          <para>the relocation table, which contains a native word for
          every 32 or 64 doublenodes in the dynamic heap, plus an
          extra word used to keep track of the end of the heap.</para>
1023        </listitem>
1024      </orderedlist>
1025      <para>The total GC space overhead is therefore on the order of
1026      3% (2/64 or 1/32).</para>
1027      <para>The general algorithm proceeds as follows:</para>
1028
1029      <sect2 id="Mark-phase">
1030        <title>Mark phase</title>
1031        <para>Each doublenode in the dynamic heap has a corresponding
1032        bit in the markbits vector. (For any doublenode in the heap,
        the index of its mark bit is determined by subtracting the
1034        address of the start of the heap from the address of the
1035        object and dividing the result by 8 or 16.) The GC knows the
1036        markbit index of the free pointer, so determining that the
1037        markbit index of a doubleword address is between the start of
1038        the heap and the free pointer can be done with a single
1039        unsigned comparison.</para>
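        <para>In C-like terms (PPC32 parameters, made-up names, and an
        arbitrary choice of bit-numbering convention), the index
        computation and the single-comparison range check look roughly
        like this:</para>
        <programlisting>
#include &lt;stdint.h&gt;

#define DNODE_SHIFT 3           /* 8-byte doublenodes on 32-bit ports;
                                   16-byte doublenodes (shift 4) on 64-bit */

/* Illustrative globals; the real ones are lisp-kernel variables. */
static uintptr_t heap_base;     /* start of the dynamic heap           */
static uintptr_t free_pointer;  /* end of the allocated dynamic heap   */
static uint32_t *markbits;      /* one bit per doublenode              */

/* Markbit index of a doublenode address in the dynamic heap. */
static inline uintptr_t dnode_index(uintptr_t addr) {
  return (addr - heap_base) >> DNODE_SHIFT;
}

/* One unsigned comparison decides whether an (untagged) address lies
   between the heap base and the free pointer.                        */
static inline int in_dynamic_heap(uintptr_t addr) {
  return (addr - heap_base) &lt; (free_pointer - heap_base);
}

static inline void set_markbit(uintptr_t i) {
  markbits[i >> 5] |= (uint32_t)1 &lt;&lt; (i &amp; 31);
}

static inline int markbit_set(uintptr_t i) {
  return (markbits[i >> 5] >> (i &amp; 31)) &amp; 1;
}
        </programlisting>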
1040        <para>The markbits of all doublenodes in the dynamic heap are
1041        zeroed before the mark phase begins. An object is
1042        <emphasis>marked</emphasis> if the markbits of all of its
1043        constituent doublewords are set and unmarked otherwise;
        setting an object's markbits involves setting the corresponding
1045        markbits of all constituent doublenodes in the object.</para>
1046        <para>The mark phase traverses each root. If the tag of the
1047        value of the root indicates that it's a non-immediate node
1048        whose address lies in the lisp heap, then:</para>
1049        <orderedlist>
1050          <listitem>
1051            <para>If the object is already marked, do nothing.</para>
1052          </listitem>
1053          <listitem>
1054            <para>Set the object's markbit(s).</para>
1055          </listitem>
1056          <listitem>
1057            <para>If the object is an ivector, do nothing further.</para>
1058          </listitem>
1059          <listitem>
1060            <para>If the object is a cons cell, recursively mark its
1061            car and cdr.</para>
1062          </listitem>
1063          <listitem>
            <para>Otherwise, the object is a gvector. Recursively mark
            its elements.</para>
1066          </listitem>
1067        </orderedlist>
1068        <para>Marking an object thus involves ensuring that its mark
1069        bits are set and then recursively marking any pointers
1070        contained within the object if the object was originally
1071        unmarked. If this recursive step was implemented in the
1072        obvious manner, marking an object would take stack space
1073        proportional to the length of the pointer chain from some root
1074        to that object. Rather than storing that pointer chain
1075        implicitly on the stack (in a series of recursive calls to the
        mark subroutine), the &CCL; marker uses a mixture of recursion
1077        and a technique called <emphasis>link inversion</emphasis> to
1078        store the pointer chain in the objects themselves.  (Recursion
1079        tends to be simpler and faster; if a recursive step notes that
1080        stack space is becoming limited, the link-inversion technique
1081        is used.)</para>
1082        <para>Certain types of objects are treated a little specially:</para>
1083        <orderedlist>
1084        <listitem>
          <para>To support a feature called <emphasis>GCTWA
              <footnote>
                <para>I believe that the acronym comes from MACLISP,
                where it stood for "Garbage Collection of Truly
                Worthless Atoms".</para>
              </footnote>
              ,</emphasis> the vector which contains the
              internal symbols of the current package is marked on
              entry to the mark phase, but the symbols themselves are
              not marked at this time. Near the end of the mark phase,
              symbols referenced from this vector which are not
              otherwise marked are marked if and only if they're
              somehow distinguishable from newly created symbols (by
              virtue of their having function bindings, value bindings,
              plists, or other attributes.)</para>
1100        </listitem>
1101        <listitem>
          <para>Pools have their first element set to NIL before any
          other elements are marked.</para>
1104        </listitem>
1105        <listitem>
1106          <para>All hash tables have certain fields (used to cache
1107          previous results) invalidated.</para>
1108        </listitem>
1109        <listitem>
          <para>Weak Hash Tables and other weak objects are put on a
          linked list as they're encountered; their contents are only
          retained if there are other (non-weak) references to
          them.</para>
1114        </listitem>
1115        </orderedlist>
1116        <para>At the end of the mark phase, the markbits of all objects which
1117        are transitively reachable from the roots are set and all other markbits
1118        are clear.</para>
1119      </sect2>
1120
1121      <sect2 id="Relocation-phase">
1122        <title>Relocation phase</title>
1123        <para>The <emphasis>forwarding address</emphasis> of a
1124        doublenode in the dynamic heap is (&lt;its current address> -
1125        (size_of_doublenode * &lt;the number of unmarked markbits that
1126        precede it>)) or alternately (&lt;the base of the heap> +
1127        (size_of_doublenode * &lt;the number of marked markbits that
        precede it&gt;)). Rather than count the number of preceding
1129        markbits each time, the relocation table is used to precompute
1130        an approximation of the forwarding addresses for all
1131        doublewords. Given this approximate address and a pointer into
1132        the markbits vector, it's relatively easy to compute the exact
1133        forwarding address.</para>
        <para>The relocation table contains the forwarding address
1135        of each <emphasis>pagelet</emphasis>, where a pagelet is 256
1136        bytes (or 32 doublenodes). The forwarding address of the first
1137        pagelet is the base of the heap. The forwarding address of the
1138        second pagelet is the sum of the forwarding address of the
1139        first and 8 bytes for each mark bit set in the first 32-bit
1140        word in the markbits table. The last entry in the relocation
1141        table contains the forwarding address that the freepointer
        would have, i.e., the new value of the freepointer after
1143        compaction.</para>
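        <para>With the PPC32 parameters just described (32 doublenodes
        per pagelet, one 32-bit markbits word per pagelet, 8 bytes per
        doublenode), building the table amounts to a running sum of
        population counts.  This is only a sketch; the names are made
        up:</para>
        <programlisting>
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define DNODE_BYTES 8              /* PPC32: 8-byte doublenodes        */

static unsigned popcount32(uint32_t w) {
  unsigned n = 0;
  while (w) { w &amp;= w - 1; n++; }   /* clear the lowest set bit         */
  return n;
}

/* relocation[i] holds the forwarding address of the first doublenode
   of pagelet i; relocation[npagelets] is the post-compaction free
   pointer.                                                            */
static void build_relocation_table(uintptr_t heap_base,
                                   const uint32_t *markbits,
                                   uintptr_t *relocation,
                                   size_t npagelets) {
  uintptr_t forward = heap_base;       /* pagelet 0 forwards to the base */
  for (size_t i = 0; i &lt; npagelets; i++) {
    relocation[i] = forward;
    forward += (uintptr_t)popcount32(markbits[i]) * DNODE_BYTES;
  }
  relocation[npagelets] = forward;     /* new free pointer               */
}
        </programlisting>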
1144        <para>In many programs, old objects rarely become garbage and
1145        new objects often do. When building the relocation table, the
1146        relocation phase notes the address of the first unmarked
1147        object in the dynamic heap. Only the area of the heap between
1148        the first unmarked object and the freepointer needs to be
1149        compacted; only pointers to this area will need to be
        forwarded (any other pointer into the dynamic heap is its own
        forwarding address.)  Often, the
1152        first unmarked object is much nearer the free pointer than it
1153        is to the base of the heap.</para>
1154      </sect2>
1155
1156      <sect2 id="Forwarding-phase">
1157        <title>Forwarding phase</title>
1158        <para>The forwarding phase traverses all roots and the "old"
1159        part of the dynamic heap (the part between the base of the
1160        heap and the first unmarked object.) All references to objects
1161        whose address is between the first unmarked object and the
1162        free pointer are updated to point to the address the object
1163        will have after compaction by using the relocation table and
1164        the markbits vector and interpolating.</para>
1165        <para>The relocation table entry for the pagelet nearest the
1166        object is found. If the pagelet's address is less than the
1167        object's address, the number of set markbits that precede the
1168        object on the pagelet is used to determine the object's
        address; otherwise, the number of set markbits that follow the
1170        object on the pagelet is used.</para>
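        <para>A sketch of that interpolation, using the same made-up
        PPC32 parameters and bit convention as the earlier sketches;
        for simplicity it always counts the preceding markbits, whereas
        the real routine counts from whichever end of the pagelet is
        nearer:</para>
        <programlisting>
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define DNODE_SHIFT 3
#define DNODE_BYTES 8
#define DNODES_PER_PAGELET 32

static unsigned popcount32(uint32_t w) {
  unsigned n = 0;
  while (w) { w &amp;= w - 1; n++; }
  return n;
}

/* Forwarding address of a marked doublenode in the area being
   compacted; 'markbits' has one word per pagelet and 'relocation' is
   built as in the previous sketch.                                    */
static uintptr_t forwarding_address(uintptr_t addr, uintptr_t heap_base,
                                    const uint32_t *markbits,
                                    const uintptr_t *relocation) {
  uintptr_t dnode = (addr - heap_base) >> DNODE_SHIFT;
  size_t pagelet = dnode / DNODES_PER_PAGELET;
  unsigned bit = dnode % DNODES_PER_PAGELET;
  /* Count the marked doublenodes that precede this one on its pagelet. */
  uint32_t preceding = markbits[pagelet] &amp; (((uint32_t)1 &lt;&lt; bit) - 1);
  return relocation[pagelet] + (uintptr_t)popcount32(preceding) * DNODE_BYTES;
}
        </programlisting>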
1171        <para>Since forwarding views the heap as a set of doublewords,
1172        locatives are (mostly) treated like any other pointers. (The
1173        basic difference is that locatives may appear to be tagged as
1174        fixnums, in which case they're treated as word-aligned
1175        pointers into the object.)</para>
1176        <para>If the forward phase changes the address of any hash
1177        table key in a hash table that hashes by address (e.g., an EQ
1178        hash table), it sets a bit in the hash table's header. The
1179        hash table code will rehash the hash table's contents if it
1180        tries to do a lookup on a key in such a table.</para>
1181        <para>Profiling reveals that about half of the total time
1182        spent in the GC is spent in the subroutine which determines a
1183        pointer's forwarding address. Exploiting GCC-specific idioms,
1184        hand-coding the routine, and inlining calls to it could all be
1185        expected to improve GC performance.</para>
1186      </sect2>
1187
1188      <sect2 id="Compact-phase">
1189        <title>Compact phase</title>
1190        <para>The compact phase compacts the area between the first
1191        unmarked object and the freepointer so that it contains only
1192        marked objects.  While doing so, it forwards any pointers it
1193        finds in the objects it copies.</para>
1194        <para>When the compact phase is finished, so is the GC (more
1195        or less): the free pointer and some other data structures are
1196        updated and control returns to the exception handler that
1197        invoked the GC. If sufficient memory has been freed to satisfy
1198        any allocation request that may have triggered the GC, the
1199        exception handler returns; otherwise, a "seriously low on
1200        memory" condition is signalled, possibly after releasing a
1201        small emergency pool of memory.</para>
1202      </sect2>
1203    </sect1>
1204
1205    <sect1 id="The-ephemeral-GC">
1206      <title>The ephemeral GC</title>
1207      <para>In the &CCL; memory management scheme, the relative age
1208      of two objects in the dynamic heap can be determined by their
1209      addresses: if addresses X and Y are both addresses in the
1210      dynamic heap, X is younger than Y (X was created more recently
1211      than Y) if it is nearer to the free pointer (and farther from
1212      the base of the heap) than Y.</para>
1213      <para>Ephemeral (or generational) garbage collectors attempt to
1214      exploit the following assumptions:</para>
1215      <itemizedlist>
1216        <listitem>
          <para>most newly created objects become garbage soon after
          they're created.</para>
1219        </listitem>
1220        <listitem>
1221          <para>most objects that have already survived several GCs
1222          are unlikely to ever become garbage.</para>
1223        </listitem>
1224        <listitem>
          <para>old objects can only point to newer objects as the
          result of a destructive modification (e.g., via
          SETF.)</para>
1228        </listitem>
1229      </itemizedlist>
1230
1231      <para>By concentrating its efforts on (frequently and quickly)
1232      reclaiming newly created garbage, an ephemeral collector hopes
1233      to postpone the more costly full GC as long as possible. It's
1234      important to note that most programs create some long-lived
1235      garbage, so an EGC can't typically eliminate the need for full
1236      GC.</para>
1237      <para>An EGC views each object in the heap as belonging to
1238      exactly one <emphasis>generation</emphasis>; generations are
1239      sets of objects that are related to each other by age: some
1240      generation is the youngest, some the oldest, and there's an age
1241      relationship between any intervening generations. Objects are
1242      typically assigned to the youngest generation when first
1243      allocated; any object that has survived some number of GCs in
1244      its current generation is promoted (or
1245      <emphasis>tenured</emphasis>) into an older generation.</para>
1246      <para>When a generation is GCed, the roots consist of the
1247      stacks, registers, and global variables as always and also of
1248      any pointers to objects in that generation from other
1249      generations. To avoid the need to scan those (often large) other
1250      generations looking for such intergenerational references, the
1251      runtime system must note all such intergenerational references
      at the point where they're created (via SETF).<footnote><para>This is
1253      sometimes called "The Write Barrier": all assignments which
1254      might result in intergenerational references must be noted, as
1255      if the other generations were write-protected.</para></footnote> The
1256      set of pointers that may contain intergenerational references is
1257      sometimes called <emphasis>the remembered set</emphasis>.</para>
1258      <para>In &CCL;'s EGC, the heap is organized exactly the same
1259      as otherwise; "generations" are merely structures which contain
1260      pointers to regions of the heap (which is already ordered by
1261      age.) When a generation needs to be GCed, any younger generation
1262      is incorporated into it; all objects which survive a GC of a
1263      given generation are promoted into the next older
1264      generation. The only intergenerational references that can exist
1265      are therefore those where an old object is modified to contain a
1266      pointer to a new object.</para>
1267      <para>The EGC uses exactly the same code as the full GC. When a
1268      given GC is "ephemeral",</para>
1269      <itemizedlist>
1270        <listitem>
          <para>the "base of the heap" used to determine an object's
          markbit address is the base of the generation
          being collected;</para>
1274        </listitem>
1275        <listitem>
1276          <para>the markbits vector is actually a pointer into the
1277          middle of the global markbits table; preceding entries in
1278          this table are used to note doubleword addresses in older
1279          generations that (may) contain intergenerational
1280          references;</para>
1281        </listitem>
1282        <listitem>
1283          <para>some steps (notably GCTWA and the handling of weak
1284          objects) are not performed;</para>
1285        </listitem>
1286        <listitem>
          <para>the intergenerational references table is used to
          find additional roots for the mark and forward phases. If a
          bit is set in the intergenerational references table, that
          means that the corresponding doubleword (in some "old"
          generation, in some "earlier" part of the heap) may have had
          a pointer to an object in a younger generation stored into
          it.</para>
1294        </listitem>
1295     
1296      </itemizedlist>
1297      <para>The intergenerational references table is maintained
1298      indirectly: whenever a setf operation that may introduce an
1299      intergenerational reference occurs, a pointer to the doubleword
1300      being stored into is pushed onto the <emphasis>memo
      buffer</emphasis>, which is a stack whose top is addressed by the
      memo register. Whenever the memo buffer overflows<tip><para>A
      guard page at the end of the memo buffer simplifies overflow
      detection.</para></tip> while the EGC is active, the handler
1305      scans the buffer and sets bits in the intergenerational
1306      references table for each doubleword address it finds in the
1307      buffer that belongs to some generation other than the youngest;
1308      the same scan is performed on entry to any ephemeral GC.  After
1309      (possibly) performing this scan, the handler resets the memo
1310      register to point to the bottom of the memo stack; this means
1311      that when the EGC is inactive, the memo buffer is constantly
1312      being filled and emptied for no apparent reason.</para>
1313      <para>With one exception (the implicit setfs that occur on entry
1314      to and exit from the binding of a special variable), all setfs
1315      that might introduce an intergenerational reference must be
1316      memoized.<tip><para>Note that the implicit setfs that occur when
1317      initializing an object - as in the case of a call to cons or
1318      vector - can't introduce intergenerational references, since the
1319      newly created object is always younger than the objects used to
1320      initialize it.</para></tip> It's always safe to push any cons
1321      cell or gvector locative onto the memo stack; it's never safe to
1322      push anything else.</para>
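      <para>A rough C model of that memoization is sketched below; the
      names are made up, and a fixed array with an explicit bound check
      stands in for the real memo buffer and its guard page.</para>
      <programlisting>
#include &lt;stdint.h&gt;

typedef uintptr_t LispObj;        /* a tagged lisp value                */

#define MEMO_CAPACITY 4096
static LispObj *memo_buffer[MEMO_CAPACITY];    /* addresses stored into */
static LispObj **memo_top = memo_buffer;

static void handle_memo_overflow(void) {
  /* If the EGC is active: set a bit in the intergenerational-references
     table for each buffered address that lies in an older generation.
     Either way, reset the memo pointer to the bottom of the buffer.    */
  memo_top = memo_buffer;
}

/* Store 'newval' into a cons cell field or gvector element and remember
   the location, since this may create an intergenerational reference.  */
static void memoized_store(LispObj *slot, LispObj newval) {
  *slot = newval;
  *memo_top++ = slot;                           /* push the stored-into address */
  if (memo_top == memo_buffer + MEMO_CAPACITY)  /* stands in for the guard page */
    handle_memo_overflow();
}
      </programlisting>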
1323      <para>Typically, the intergenerational references bitvector is
1324      sparse: a relatively small number of old locations are stored
1325      into, although some of them may have been stored into many
1326      times. The routine that scans the memoization buffer does a lot
1327      of work and usually does it fairly often; it uses a simple,
1328      brute-force method but might run faster if it was smarter about
1329      recognizing addresses that it'd already seen.</para>
1330      <para>When the EGC mark and forward phases scan the
1331      intergenerational reference bits, they can clear any bits that
1332      denote doublewords that definitely do not contain
1333      intergenerational references.</para>
1334    </sect1>
1335
1336    <sect1 id="Fasl-files">
1337      <title>Fasl files</title>
1338      <para>The information in this section was current in November
1339      2004.  Saving and loading of Fasl files is implemented in
1340      xdump/faslenv.lisp, level-0/nfasload.lisp, and lib/nfcomp.lisp.
1341      The information here is only an overview, which might help when
1342      reading the source.</para>
1343      <para>The &CCL; Fasl format is forked from the old MCL Fasl
1344      format; there are a few differences, but they are minor.  The
1345      name "nfasload" comes from the fact that this is the so-called
1346      "new" Fasl system, which was true in 1986 or so.  The format has
1347      held up well, although it would certainly need extensions to
1348      deal with 64-bit data, and some other modernization might be
1349      possible.</para>
1350      <para>A Fasl file begins with a "file header", which contains
1351      version information and a count of the following "blocks".
1352      There's typically only one "block" per Fasl file.  The blocks
1353      are part of a mechanism for combining multiple logical files
1354      into a single physical file, in order to simplify the
1355      distribution of precompiled programs.  (Nobody seems to be doing
1356      anything interesting with this feature, at the moment, probably
1357      because it isn't documented.)</para>
1358      <para>Each block begins with a header for itself, which just
1359      describes the size of the data that follows.</para>
1360      <para>The data in each block is treated as a simple stream of
1361      bytes, which define a bytecode program.  The actual bytecodes,
1362      "fasl operators", are defined in xdump/faslenv.lisp.  The
1363      descriptions in the source file are terse, but, according to
1364      Gary, "probably accurate".</para>
1365      <para>Some of the operators are used to create a per-block
1366      "object table", which is a vector used to keep track of
1367      previously-loaded objects and simplify references to them.  When
1368      the table is created, an index associated with it is set to
1369      zero; this is analogous to an array fill-pointer, and allows the
1370      table to be treated like a stack.</para>
1371      <para>The low seven bits of each bytecode are used to specify
1372      the fasl operator; currently, about fifty operators are defined.
      The high bit, when set, indicates that the result of the
1374      operation should be pushed onto the object table.</para>
1375      <para>Most bytecodes are followed by operands; the operand data
1376      is byte-aligned.  How many operands there are, and their type,
1377      depend on the bytecode.  Operands can be indices into the object
1378      table, immediate values, or some combination of these.</para>
1379      <para>An exception is the bytecode #xFF, which has the symbolic
1380      name ccl::$faslend; it is used to mark the end of the
1381      block.</para>
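      <para>Putting those pieces together, the outer loop of a loader
      for this encoding might look like the C sketch below; operand
      reading is stubbed out, the table size is arbitrary, and the
      names are invented (the real loader is the lisp code in
      level-0/nfasload.lisp).</para>
      <programlisting>
#include &lt;stdio.h&gt;

#define FASL_END      0xFF   /* ccl::$faslend: marks the end of the block */
#define FASL_OP_MASK  0x7F   /* low seven bits select the operator        */
#define FASL_PUSH_BIT 0x80   /* high bit: push the result onto the table  */

typedef struct {
  void *table[1024];         /* per-block object table                    */
  unsigned fill;             /* fill-pointer-style index, starts at zero  */
} fasl_block;

/* Stub: a real loader would read this operator's operands here and
   build or look up the corresponding lisp object.                       */
static void *dispatch_fasl_op(unsigned op, FILE *in, fasl_block *blk) {
  (void)op; (void)in; (void)blk;
  return NULL;
}

static void load_fasl_block(FILE *in, fasl_block *blk) {
  int byte;
  blk->fill = 0;
  while ((byte = fgetc(in)) != EOF) {
    if (byte == FASL_END)
      return;                                   /* end of this block      */
    void *result = dispatch_fasl_op((unsigned)byte &amp; FASL_OP_MASK, in, blk);
    if (byte &amp; FASL_PUSH_BIT)
      blk->table[blk->fill++] = result;         /* keep for later refs    */
  }
}
      </programlisting>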
1382    </sect1>
1383
1384
1385
1386    <sect1 id="The-Objective-C-Bridge--1-">
1387      <title>The Objective-C Bridge</title>
1388
1389      <sect2 id="How-CCL-Recognizes-Objective-C-Objects">
1390        <title>How &CCL; Recognizes Objective-C Objects</title>
1391        <para>In most cases, pointers to instances of Objective-C
1392        classes are recognized as such; the recognition is (and
1393        probably always will be) slightly heuristic. Basically, any
1394        pointer that passes basic sanity checks and whose first word
1395        is a pointer to a known ObjC class is considered to be an
1396        instance of that class; the Objective-C runtime system would
1397        reach the same conclusion.</para>
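        <para>Schematically, the check has the shape of the C sketch
        below; every name here is hypothetical, and the real logic
        lives in the bridge and the lisp kernel.</para>
        <programlisting>
#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

/* Stand-ins for the bridge's real bookkeeping: a real implementation
   would probe the address safely (so that reading it can't fault) and
   consult its table of Objective-C classes seen so far.               */
static bool address_is_readable(const void *p)  { (void)p;   return true;  }
static bool is_known_objc_class(const void *cls) { (void)cls; return false; }

/* Heuristic: any sane-looking pointer whose first word is a known
   Objective-C class is treated as an instance of that class.          */
static bool looks_like_objc_instance(const void *p) {
  if (p == NULL || ((uintptr_t)p &amp; (sizeof(void *) - 1)) != 0)
    return false;                  /* null or unaligned: not an object   */
  if (!address_is_readable(p))
    return false;                  /* reading the class word would fault */
  const void *isa = *(const void *const *)p;  /* first word: the class?  */
  return is_known_objc_class(isa);
}
        </programlisting>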
        <para>It's certainly possible that a random pointer to an
        arbitrary memory address could look enough like an ObjC
        instance to fool the lisp runtime system, and it's possible
        for memory that had been a true ObjC instance (or had looked
        a lot like one) to change so that it no longer is one, possibly
        by virtue of having been deallocated.</para>
1405        <para>In the first case, we can improve the heuristics
1406        substantially: we can make stronger assertions that a
1407        particular pointer is really "of type :ID" when it's a
1408        parameter to a function declared to take such a pointer as an
1409        argument or a similarly declared function result; we can be
1410        more confident of something we obtained via SLOT-VALUE of a
1411        slot defined to be of type :ID than if we just dug a pointer
1412        out of memory somewhere.</para>
1413        <para>The second case is a little more subtle: ObjC memory
1414        management is based on a reference-counting scheme, and it's
1415        possible for an object to ... cease to be an object while lisp
1416        is still referencing it.  If we don't want to deal with this
1417        possibility (and we don't), we'll basically have to ensure
1418        that the object is not deallocated while lisp is still
1419        thinking of it as a first-class object. There's some support
1420        for this in the case of objects created with MAKE-INSTANCE,
1421        but we may need to give similar treatment to foreign objects
1422        that are introduced to the lisp runtime in other ways (as
1423        function arguments, return values, SLOT-VALUE results, etc. as
1424        well as those instances that're created under lisp
1425        control.)</para>
1426        <para>This doesn't all work yet (in fact, not much of it works
1427        yet); in practice, this has not yet been as much of a problem
1428        as anticipated, but that may be because existing Cocoa code
1429        deals primarily with relatively long-lived objects such as
1430        windows, views, menus, etc.</para>
1431      </sect2>
1432
1433      <sect2>
1434        <title>Recommended Reading</title>
1435
1436        <variablelist>
1437          <varlistentry>
1438            <term>
1439              <ulink url="http://developer.apple.com/documentation/Cocoa/">Cocoa Documentation</ulink>
1440            </term>
1441           
1442           <listitem>
1443             <para>
1444               This is the top page for all of Apple's documentation on
1445               Cocoa.  If you are unfamiliar with Cocoa, it is a good
1446               place to start.
1447             </para>
1448           </listitem>
1449        </varlistentry>
1450        <varlistentry>
1451          <term>
1452            <ulink url="http://developer.apple.com/documentation/Cocoa/Reference/Foundation/ObjC_classic/index.html">Foundation Reference for Objective-C</ulink>
1453          </term>
1454
1455          <listitem>
1456            <para>
1457              This is one of the two most important Cocoa references; it
1458              covers all of the basics, except for GUI programming.  This is
1459              a reference, not a tutorial.
1460            </para>
1461          </listitem>
1462        </varlistentry>
1463      </variablelist>
1464      </sect2>
1465    </sect1>
1466  </chapter>