source: release/1.2/source/doc/src/implementation.xml @ 9200

Last change on this file since 9200 was 9200, checked in by gb, 13 years ago

synch with trunk

1<?xml version="1.0" encoding="utf-8"?>
2<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"[
3<!ENTITY rest "<varname>&amp;rest</varname>">
4<!ENTITY key "<varname>&amp;key</varname>">
5<!ENTITY optional "<varname>&amp;optional</varname>">
6<!ENTITY body "<varname>&amp;body</varname>">
7<!ENTITY aux "<varname>&amp;aux</varname>">
8<!ENTITY allow-other-keys "<varname>&amp;allow-other-keys</varname>">
9<!ENTITY CCL "Clozure CL">
10]>
11  <chapter id="Implementation-Details-of-CCL">
12    <title>Implementation Details of &CCL;</title>
13    <para>This chapter describes many aspects of OpenMCL's
14    implementation as of (roughly) version 1.1. Details vary a bit
15    between the three architectures (PPC32, PPC64, and x86-64)
16    currently supported and those details change over time, so the
17    definitive reference is the source code (especially some files in
18    the ccl/compiler/ directory whose names contain the string "arch"
19    and some files in the ccl/lisp-kernel/ directory whose names
20    contain the string "constants".) Hopefully, this chapter will make
21    it easier for someone who's interested to read and understand the
22    contents of those files.</para>
23
24    <sect1 id="Threads-and-exceptions">
25      <title>Threads and exceptions</title>
26
27      <para>&CCL;'s threads are "native" (meaning that they're
28        scheduled and controlled by the operating system.)  Most of the
29        implications of this are discussed elsewhere; this section tries
30        to describe how threads look from the lisp kernel's perspective
31        (and especially from the GC's point of view.)</para>
32      <para>&CCL;'s runtime system tries to use machine-level
33        exception mechanisms (conditional traps when available,
34        illegal instructions, memory access protection in some cases)
35        to detect and handle exceptional situations.  These situations
36        include some TYPE-ERRORs and PROGRAM-ERRORs (notably
37        wrong-number-of-args errors), and also include cases like "not
38        being able to allocate memory without GCing or obtaining more
39        memory from the OS."  The general idea is that it's usually
40        faster to pay (very occasional) exception-processing overhead
41        and figure out what's going on in an exception handler than it
42        is to maintain enough state and context to handle an
43        exceptional case via a lighter-weight mechanism when that
44        exceptional case (by definition) rarely occurs.</para>
45      <para>Some emulated execution environments (the Rosetta PPC
46        emulator on x86 versions of Mac OS X) don't provide accurate
47        exception information to exception handling functions. &CCL;
48        can't run in such environments.</para>
49
50      <sect2 id="The-Thread-Context-Record">
51            <title>The Thread Context Record</title>
52
53            <para>When a lisp thread is first created (or when a thread
54          created by foreign code first calls back to lisp), a data
55          structure called a Thread Context Record (or TCR) is
56          allocated and initialized.  On modern versions of Linux and
57          FreeBSD, the allocation actually happens via a set of
58          thread-local-storage ABI extensions, so a thread's TCR is
59          created when the thread is created and dies when the thread
60          dies.  (The World's Most Advanced Operating System&mdash;as
61          Apple's marketing literature refers to Darwin&mdash;is not
62          very advanced in this regard, and I know of no reason to
63          assume that advances will be made in this area anytime
64          soon.)</para>
65        <para>A TCR contains a few dozen fields (and is therefore a
66          few hundred bytes in size.)  The fields are mostly
67          thread-specific information about the thread's stacks'
68          locations and sizes, information about the underlying (POSIX)
69          thread, and information about the thread's dynamic binding
70          history and pending CATCH/UNWIND-PROTECTs.  Some of this
71          information could be kept in individual machine registers
72          while the thread is running (and the PPC - which has more
73          registers available - keeps a few things in registers that the
74          X86-64 has to access via the TCR), but it's important to
75          remember that the information is thread-specific and can't
76          (for instance) be kept in a fixed global memory
77          location.</para>
78        <para>When lisp code is running, the current thread's TCR is
79          kept in a register.  On PPC platforms, a general purpose
80          register is used; on x86-64, an (otherwise nearly useless)
81          segment register works well (and avoids tying up a more
82          generally useful general-purpose register for this
83          purpose.)</para>
84        <para>The address of a TCR is aligned in memory in such a way
85          that a FIXNUM can be used to represent it.  The lisp function
86          CCL::%CURRENT-TCR returns the calling thread's TCR as a
87          fixnum; the actual value of the TCR's address is 4 or 8 times
88          the value of this fixnum.</para>
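        <para>The address arithmetic described above can be sketched as
          follows.  This is only an illustrative model, not code from
          &CCL;: the function names are hypothetical, and the pairing of
          the factor 4 with 32-bit platforms and the factor 8 with 64-bit
          platforms is an assumption based on the native word size.</para>

```python
# Hypothetical model of "a TCR address represented as a fixnum".
# Assumption: the scale factor is the native word size in bytes
# (4 on 32-bit platforms, 8 on 64-bit platforms).
FIXNUM_SCALE = {32: 4, 64: 8}


def tcr_address_to_fixnum(address, word_bits):
    """Return the fixnum whose value stands for the given TCR address."""
    scale = FIXNUM_SCALE[word_bits]
    if address % scale != 0:
        # Only suitably aligned addresses can be represented this way.
        raise ValueError("TCR address is not suitably aligned")
    return address // scale


def fixnum_to_tcr_address(fixnum, word_bits):
    """Recover the real address: 4 or 8 times the fixnum's value."""
    return fixnum * FIXNUM_SCALE[word_bits]
```

        <para>The alignment check reflects the point made in the text:
          the representation works only because a TCR's address is
          aligned so that no information is lost in the division.</para>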
89        <para>When the lisp kernel initializes a new TCR, it's added
90          to a global list maintained by the kernel; when a thread
91          exits, its TCR is removed from this list.</para>
92        <para>When a thread calls foreign code, lisp stack pointers
93          are saved in its TCR, lisp registers (at least those whose
94          value should be preserved across the call) are saved on the
95          thread's value stack, and (on x86-64) RSP is switched to the
96          control stack.  A field in the TCR (tcr.valence) is then set
97          to indicate that the thread is running foreign code, foreign
98          argument registers are loaded from a frame on the foreign
99          stack, and the foreign function is called. (That's a little
100          oversimplified and possibly inaccurate, but the important
101          things to note are that the thread "stops following lisp
102          stack and register usage conventions" and that it advertises
103          the fact that it's done so.)  Similar transitions in a
104          thread's state ("valence") occur when it enters or exits an
105          exception handler (which is sort of an OS/hardware-mandated
106          foreign function call where the OS thoughtfully saves the
107          thread's register state for it beforehand.)</para>
108      </sect2>
109
110      <sect2 id="Exception-contexts-comma---and-exception-handling-in-general">
111            <title>Exception contexts, and exception-handling in general</title>
112        <para>Unix-like OSes tend to refer to exceptions as "signals";
113          the same general mechanism ("signal handling") is used to
114          process both asynchronous OS-level events (such as the result
115          of the keyboard driver noticing that ^C or ^Z has been
116          pressed) and synchronous hardware-level events (like trying to
117          execute an illegal instruction or access protected memory.)
118          It makes some sense to defer ("block") handling of
119          asynchronous signals so that some critical code sequences
120          complete without interruption; since it's generally not
121          possible for a thread to proceed after a synchronous exception
122          unless and until its state is modified by an exception
123          handler, it makes no sense to talk about blocking synchronous
124          signals (though some OSes will let you do so and doing so can
125          have mysterious effects.)</para>
126        <para>On OSX/Darwin, the POSIX signal handling facilities
127          coexist with lower-level Mach-based exception handling
128          facilities.  Unfortunately, the way that this is implemented
129          interacts poorly with debugging tools: GDB will generally stop
130          whenever the target program encounters a Mach-level exception
131          and offers no way to proceed from that point (and let the
132          program's POSIX signal handler try to handle the exception);
133          Apple's CrashReporter program has had a similar issue and,
134          depending on how it's configured, may bombard the user with
135          alert dialogs which falsely claim that an application has
136          crashed (when in fact the application in question has
137          routinely handled a routine exception.)  On Darwin/OSX,
138          &CCL; uses Mach thread-level exception handling facilities
139          which run before GDB or CrashReporter get a chance to confuse
140          themselves; &CCL;'s Mach exception handling tries to force
141          the thread which received a synchronous exception to invoke a
142          signal handling function ("as if" signal handling worked more
143          usefully under Darwin.)  Mach exception handlers run in a
144          dedicated thread (which basically does nothing but wait for
145          exception messages from the lisp kernel, obtain and modify
146          information about the state of threads in which exceptions
147          have occurred, and reply to the exception messages with an
148          indication that the exception has been handled.)  The reply
149          from a thread-level exception handler keeps the exception from
150          being reported to GDB or CrashReporter and avoids the problems
151          related to those programs.  Since &CCL;'s Mach exception
152          handler doesn't claim to handle debugging-related exceptions
153          (from breakpoints or single-step operations), it's possible to
154          use GDB to debug &CCL;.</para>
155        <para>On platforms where signal handling and debugging don't
156          get in each other's way, a signal handler is entered with
157          all signals blocked.  (This behavior is specified in the
158          call to the sigaction() function which established the
159          signal handler.)  The signal handler receives three
160          arguments from the OS kernel: the first is an integer that
161          identifies the signal, the second is a pointer to an object
162          of type "siginfo_t", which may or may not contain a few
163          fields that would help to identify the cause of the
164          exception, and the third argument is a pointer to a data
165          structure (called a "ucontext" or something similar), which
166          contains machine-dependent information about the state of
167          the thread at the time that the exception/signal occurred.
168          While asynchronous signals are blocked, the signal handler
169          stores the pointer to its third argument (the "signal
170          context") in a field in the current thread's TCR, sets some
171          bits in another TCR field to indicate that the thread is now
172          waiting to handle an exception, unblocks asynchronous
173          signals, and waits for a global exception lock that
174          serializes exception processing.</para>
175        <para>On Darwin, the Mach exception thread creates a signal
176          context (and maybe a siginfo_t structure), stores the signal
177          context in the thread's TCR, sets the TCR field which describes
178          the thread's state, and arranges that the thread resume
179          execution at its signal handling function (with a signal
180          number, a possibly NULL siginfo_t, and a signal context as
181          arguments.)  When the thread resumes, it waits for the global
182          exception lock.</para>
183        <para>On x86-64 platforms where signal handling can be used to
184          handle synchronous exceptions, there's an additional
185          complication: the OS kernel ordinarily allocates the signal
186          context and siginfo structures on the stack of the thread
187          that received the signal; in practice, that means "wherever
188          RSP is pointing."  &CCL;'s
189          <xref linkend="Register-and-stack-usage-conventions"/>
190          require that the thread's value stack&mdash;where RSP is
191          usually pointing while lisp code is running&mdash;contain
192          only "nodes" (properly tagged lisp objects), and scribbling
193          a signal context all over the value stack would violate this
194          requirement.  To maintain consistency, the sigaltstack()
195          mechanism is used to cause the signal to be delivered on
196          (and the signal context and siginfo to be allocated on) a
197          special stack area (the last few pages of the thread's
198          control stack, in practice).  When the signal handler runs,
199          it (carefully) copies the signal context and siginfo to the
200          thread's control stack and makes RSP point into that stack
201          before invoking the "real" signal handler. The effect of
202          this hack is that the "real" signal handler always runs on
203          the thread's control stack.</para>
204        <para>Once the exception handler has obtained the global
205          exception lock, it uses the values of the signal number,
206          siginfo_t, and signal context arguments to determine the
207          (logical) cause of the exception.  Some exceptions may be
208          caused by factors that should generate lisp errors or other
209          serious conditions (such as stack overflow); if this is the case, the
210          kernel code may release the global exception lock and call out
211          to lisp code.  (The lisp code in question may need to repeat
212          some of the exception decoding process; in particular, it
213          needs to be able to interpret register values in the signal
214          context that it receives as an argument.)</para>
215        <para>In some cases, the lisp kernel exception handler may not
216          be able to recover from the exception (this is currently true
217          of some types of memory-access fault and is also true of traps
218          or illegal instructions that occur during foreign code
219          execution.)  In such cases, the kernel exception handler
220          reports the exception as "unhandled", and the kernel debugger
221          is invoked.</para>
222        <para>If the kernel exception handler identifies the
223          exception's cause as being a transient out-of-memory condition
224          (indicating that the current thread needs more memory to cons
225          in), it tries to make that memory available.  In some cases,
226          doing so involves invoking the GC.</para>
227      </sect2>
228
229      <sect2 id="Threads-comma---exceptions-comma---and-the-GC">
230            <title>Threads, exceptions, and the GC</title>
231        <para>&CCL;'s GC is not concurrent: when the GC is invoked in
232          response to an exception in a particular thread, all other
233          lisp threads must stop until the GC's work is done.  The
234          thread that triggered the GC iterates over the global TCR
235          list, sending each other thread a distinguished "suspend"
236          signal, then iterates over the list again, waiting for a
237          per-thread semaphore that indicates that the thread has
238          received the "suspend" signal and responded appropriately.
239          Once all other threads have acknowledged the request to
240          suspend themselves, the GC thread can run the GC proper (after
241          doing any necessary <xref linkend="PC-lusering"/>.)  Once the
242          GC's completed its work, the thread that invoked the GC
243          iterates over the global TCR list, raising a per-thread
244          "resume" semaphore for each other thread.</para>
245        <para>The signal handler for the asynchronous "suspend" signal
246          is entered with all asynchronous signals blocked.  It saves
247          its signal-context argument in a TCR slot, raises the TCR's
248          "suspend" semaphore, then waits on the TCR's "resume"
249          semaphore.</para>
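        <para>The suspend/resume handshake described in the two
          paragraphs above can be modeled roughly as follows.  This is a
          sketch under stated assumptions, not the lisp kernel's code:
          the <literal>TCR</literal> class and all function names are
          hypothetical, and because Python cannot deliver a signal to a
          specific thread, the "suspend" signal is replaced by a flag
          that each worker thread polls.</para>

```python
import threading


class TCR:
    """Hypothetical stand-in for a thread context record."""

    def __init__(self, name):
        self.name = name
        self.suspend_requested = threading.Event()  # models the "suspend" signal
        self.suspend = threading.Semaphore(0)       # raised by the suspended thread
        self.resume = threading.Semaphore(0)        # raised by the GC thread
        self.saved_context = None                   # models the saved signal context
        self.done = threading.Event()


def mutator(tcr, log):
    # A real thread would be interrupted by a signal; this model polls instead.
    while not tcr.done.is_set():
        if tcr.suspend_requested.wait(timeout=0.01):
            tcr.suspend_requested.clear()
            tcr.saved_context = "context of " + tcr.name
            tcr.suspend.release()   # acknowledge the suspend request
            tcr.resume.acquire()    # block until the GC has finished
            log.append(tcr.name + " resumed")


def run_gc(other_tcrs, log):
    for tcr in other_tcrs:          # pass 1: send "suspend" to each other thread
        tcr.suspend_requested.set()
    for tcr in other_tcrs:          # pass 2: wait for every acknowledgement
        tcr.suspend.acquire()
    log.append("gc ran with %d threads stopped" % len(other_tcrs))
    for tcr in other_tcrs:          # pass 3: raise each "resume" semaphore
        tcr.resume.release()


def demo():
    log = []
    tcrs = [TCR("t%d" % i) for i in range(3)]
    threads = [threading.Thread(target=mutator, args=(t, log)) for t in tcrs]
    for th in threads:
        th.start()
    run_gc(tcrs, log)
    for t in tcrs:
        t.done.set()
    for th in threads:
        th.join()
    return log, tcrs
```

        <para>Using two passes (signal everyone first, then wait for
          everyone) lets the other threads stop in parallel rather than
          one at a time.</para>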
250        <para>The GC thread has access to the signal contexts of all
251          TCRs (including its own) at the time when the thread received
252          an exception or acknowledged a request to suspend itself.
253          This information (and information about stack areas in the TCR
254          itself) allows the GC to identify the "stack locations and
255          register contents" that are elements of the GC's root
256          set.</para>
257      </sect2>
258
259      <sect2 id="PC-lusering">
260            <title>PC-lusering</title>
261        <para>It's not quite accurate to say that &CCL;'s compiler
262          and runtime follow precise stack and register usage
263          conventions at all times; there are a few exceptions:</para>
264
265            <itemizedlist>
266          <listitem>
267                <para>On both PPC and x86-64 platforms, consing isn't
268                  fully atomic.  It takes at least a few instructions to
269                  allocate an object in memory (and slap a header on it if
270                  necessary); if a thread is interrupted in the middle of
271                  that instruction sequence, the new object may or may
272                  not have been created or fully initialized at the point in
273                  time that the interrupt occurred.  (There are actually a
274                  few different states of partial initialization.)</para>
275              </listitem>
276              <listitem>
277                <para>On the PPC, the common act of building a lisp
278                  control stack frame involves allocating a four-word frame
279                  and storing three register values into that frame.  (The
280                  fourth word - the back pointer to the previous frame - is
281                  automatically set when the frame is allocated.)  The
282                  previous contents of those three words are unknown (there
283                  might have been a foreign stack frame at the same address a
284                  few instructions earlier), so interrupting a thread that's
285                  in the process of initializing a PPC control stack frame
286                  isn't GC-safe.</para>
287              </listitem>
288          <listitem>
289                <para>There are similar problems with the initialization
290                  of temp stackframes on the PPC.  (Allocation and
291                  initialization don't happen atomically, and the newly
292                  allocated stack memory may have undefined contents.)</para>
293              </listitem>
294          <listitem>
295                <para><xref linkend="The-ephemeral-GC"/>'s write barrier
296                  has to be implemented atomically (i.e., both an
297                  intergenerational store and the update of a
298                  corresponding reference bit have to happen without
299                  interruption, or neither of these events can
300                  happen.)</para>
301              </listitem>
302          <listitem>
303                <para>There are a few more similar cases.</para>
304              </listitem>
305        </itemizedlist>
306
307        <para>Fortunately, the number of these non-atomic instruction
308          sequences is small, and fortunately it's fairly easy for the
309          interrupting thread to recognize when the interrupted thread
310          is in the middle of such a sequence.  When this is detected,
311          the interrupting thread modifies the state of the interrupted
312          thread (modifying its PC and other registers) so that it is no
313          longer in the middle of such a sequence (it's either backed
314          out of it or the remaining instructions are emulated.)</para>
315        <para>This works because (a) many of the troublesome
316          instruction sequences are PPC-specific and it's relatively
317          easy to partially disassemble the instructions surrounding the
318          interrupted thread's PC on the PPC and (b) those instruction
319          sequences are heavily stylized and intended to be easily
320          recognized.</para>
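        <para>The decision the interrupting thread makes can be
          illustrated with a toy model.  Everything here is hypothetical
          (the table of sequences, the policy names, and the function
          itself); the real kernel recognizes sequences by partially
          disassembling instructions and also repairs other registers,
          which this sketch does not attempt.</para>

```python
PPC_INSTRUCTION_BYTES = 4  # all PPC instructions are 32 bits wide


def pc_luser(pc, sequences):
    """Return (new_pc, action) for a thread interrupted at pc.

    sequences is a list of hypothetical (first_pc, last_pc, policy) tuples
    describing non-atomic instruction sequences; policy is "back_out" or
    "finish".  pc is the address of the next instruction to execute.
    """
    for first_pc, last_pc, policy in sequences:
        # "In the middle" means at least one instruction of the sequence has
        # already executed and at least one has not.
        middle = range(first_pc + PPC_INSTRUCTION_BYTES,
                       last_pc + PPC_INSTRUCTION_BYTES,
                       PPC_INSTRUCTION_BYTES)
        if pc in middle:
            if policy == "back_out":
                # Restart the sequence from its first instruction.
                return first_pc, "backed out"
            # Stand-in for emulating the remaining instructions.
            return last_pc + PPC_INSTRUCTION_BYTES, "emulated to completion"
    return pc, "unchanged"
```

        <para>A thread stopped exactly at the first instruction of a
          sequence has not started it yet, so the model leaves its PC
          alone.</para>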
321      </sect2>
322    </sect1>
323
324    <sect1 id="Register-usage-and-tagging">
325      <title>Register usage and tagging</title>
326     
327      <sect2 id="Register-usage-and-tagging-overview">
328            <title>Overview</title>
329            <para>Regardless of other details of its implementation, a
330              garbage collector's job is to partition the set of all
331              heap-allocated lisp objects (CONSes, STRINGs, INSTANCEs, etc.)
332              into two subsets.  The first subset contains all objects that
333              are transitively referenced from a small set of "root" objects
334              (the contents of the stacks and registers of all active
335              threads at the time the GC occurs and the values of some
336              global variables.)  The second subset contains everything
337              else: those lisp objects that are not transitively reachable
338              from the roots are garbage, and the memory occupied by garbage
339              objects can be reclaimed (since the GC has just proven that
340              it's impossible to reference them.)</para>
341        <para>The set of live, reachable lisp objects basically forms
342          the nodes of a (usually large) graph, with edges from each
343          node A to any other objects (nodes) that object A
344          references.</para>
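        <para>The partition into reachable objects and garbage can be
          sketched as a graph traversal.  This is a generic illustration
          of reachability, not a description of how &CCL;'s GC is
          implemented; the heap representation (a dictionary from object
          names to the names they reference) is an assumption made for
          the example.</para>

```python
def partition_heap(heap, roots):
    """Split heap objects into (live, garbage).

    heap maps each object's name to the list of names it references;
    roots is an iterable of names found in stacks, registers and globals.
    """
    live = set()
    pending = list(roots)
    while pending:
        obj = pending.pop()
        if obj in live:
            continue                  # already visited
        live.add(obj)
        pending.extend(heap[obj])     # follow the outgoing edges
    garbage = set(heap) - live        # everything never reached
    return live, garbage
```

        <para>Note that objects which reference only each other, with no
          path from a root, end up in the garbage set.</para>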
345        <para>Some nodes in this graph can never have outgoing edges:
346          an array with a specialized numeric or character type usually
347          represents its elements in some (possibly more compact)
348          specialized way.  Some nodes may refer to lisp objects that
349          are never allocated in memory (FIXNUMs, CHARACTERs,
350          SINGLE-FLOATs on 64-bit platforms, etc.)  This latter class of
351          objects is sometimes called "immediates", but that's a little
352          confusing because the term "immediate" is sometimes used to
353          refer to things that can never be part of the big connectivity
354          graph (e.g., the "raw" bits that make up a floating-point
355          value, foreign address, or numeric value that needs to be used
356          - at least fleetingly - in compiled code.)</para>
357        <para>For the GC to be able to build the connectivity graph
358          reliably, it's necessary for it to be able to reliably tell
359          (a) whether or not a "potential root" - the contents of a
360          machine register or stack location - is in fact a node and (b)
361          for any node, whether it may have components that refer to
362          other nodes.</para>
363        <para>There's no reliable way to answer the first question on
364          stock hardware.  (If everything were a node, as might be the
365          case on specially microcoded "lisp machine" hardware, it
366          wouldn't even need to be asked.)  Since there's no way to just
367          look at a machine word (the contents of a machine register or
368          stack location) and tell whether or not it's a node or just
369          some random non-node value, we have to either adopt and
370          enforce strict conventions on register and stack usage or
371          tolerate ambiguity.</para>
372        <para>"Tolerating ambiguity" is an approach taken by some
373          ("conservative") GC schemes; by contrast, &CCL;'s GC is
374          "precise", which in this case means that it believes that the
375          contents of certain machine registers and stack locations are
376          always nodes and that other registers and stack locations are
377          never nodes and that these conventions are never violated by
378          the compiler or runtime system.  The fact that threads are
379          preemptively scheduled means that a GC could occur (because of
380          activity in some other thread) on any instruction boundary,
381          which in turn means that the compiler and runtime system must
382          follow precise <xref
383                            linkend="Register-and-stack-usage-conventions"/> at all
384          times.</para>
385        <para>Once we've decided that a given machine word is a node,
386          a <xref linkend="Tagging-scheme"/> describes how the node's
387          value and type are encoded in that machine word.</para>
388        <para>Most of this discussion&mdash;so far&mdash;has treated
389          things from the GC's very low-level perspective. From a much
390          higher point of view, lisp functions accept nodes as
391          arguments, return nodes as values, and (usually) perform
392          some operations on those arguments in order to produce those
393          results.  (In many cases, the operations in question involve
394          raw non-node values.)  Higher-level parts of the lisp type
395          system (functions like TYPE-OF and CLASS-OF, etc.) depend on
396          the <xref linkend="Tagging-scheme"/>.</para>
397      </sect2>
398
399      <sect2 id="pc-locatives-on-the-PPC">
400            <title>pc-locatives on the PPC</title>
401        <para>On the PPC, there's a third case (besides "node" and
402          "immediate" values).  As discussed below, a node that denotes
403          a memory-allocated lisp object is a biased (tagged) pointer
404          -to- that object; it's not generally possible to point -into-
405          some composite (multi-element) object (such a pointer would
406          not be a node, and the GC would have no way to update the
407          pointer if it were to move the underlying object.)</para>
408        <para>Such a pointer ("into" the interior of a heap-allocated
409          object) is often called a <emphasis>locative</emphasis>; the
410          cases where locatives are allowed in &CCL; mostly involve
411          the behavior of function call and return instructions.  (To be
412          technically accurate, the other case also arises on x86-64, but
413          that case isn't as user-visible.)</para>
414        <para>On the PowerPC (both PPC32 and PPC64), all machine
415          instructions are 32 bits wide and all instruction words are
416          allocated on 32-bit boundaries.  In PPC &CCL;, a CODE-VECTOR
417          is a specialized type of vector-like object; its elements
418          are 32-bit PPC machine instructions.  A CODE-VECTOR is an
419          attribute of a FUNCTION object; a function call involves
420          accessing the function's code-vector and jumping to the
421          address of its first instruction.</para>
422        <para>As each instruction in the code vector sequentially
423          executes, the hardware program counter (PC) register advances
424          to the address of the next instruction (a locative into the
425          code vector); since PPC instructions are always 32 bits wide
426          and aligned on 32-bit boundaries, the low two bits of the PC
427          are always 0.  If the function executes a call (simple call
428          instructions have the mnemonic "bl" on the PPC, which stands
429          for "branch and link"), the address of the next instruction
430          (also a word-aligned locative into a code-vector) is copied
431          into the special-purpose PPC "link register" (lr); a function
432          returns to its caller via a "branch to link register" (blr)
433          instruction.  Some cases of function call and return might
434          also use the PPC's "count register" (ctr), and if either the
435          lr or ctr needs to be stored in memory it needs to first be
436          copied to a general-purpose register.</para>
437        <para>&CCL;'s GC understands that certain registers contain
438          these special "pc-locatives" (locatives that point into
439          CODE-VECTOR objects); it contains special support for
440          finding the containing CODE-VECTOR object and for adjusting
441          all of these "pc-locatives" if the containing object is
442          moved in memory.  The first part of that
443          operation&mdash;finding the containing object&mdash;is
444          possible and practical on the PPC because of architectural
445          artifacts (fixed-width instructions and arcana of
446          instruction encoding.)  It's not possible on x86-64, but
447          fortunately not necessary either (though the second part,
448          adjusting the PC/RIP when the containing object moves, is
449          both necessary and simple.)</para>
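        <para>The arithmetic behind return addresses and the "adjusting"
          step can be sketched as follows.  The function names and the
          idea of a single base address for a code vector's first
          instruction are assumptions made for illustration; the sketch
          does not show how the GC finds the containing CODE-VECTOR.</para>

```python
PPC_INSTRUCTION_BYTES = 4  # fixed-width, 32-bit-aligned instructions


def return_address(call_pc):
    """Address that "bl" copies into the link register: the next instruction."""
    return call_pc + PPC_INSTRUCTION_BYTES


def relocate_locative(locative, old_base, new_base):
    """Adjust a pc-locative after its code vector moves.

    The locative keeps the same byte offset from the (hypothetical) base
    address of the code vector's first instruction.
    """
    offset = locative - old_base
    return new_base + offset
```

        <para>Because both base addresses and offsets are multiples of
          four, a relocated locative still has its low two bits
          clear.</para>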
450      </sect2>
451
452      <sect2 id="Register-and-stack-usage-conventions">
453        <title>Register and stack usage conventions</title>
454       
455        <sect3 id="Stack-conventions">
456              <title>Stack conventions</title>
457          <para>On both PPC and X86 platforms, each lisp thread uses 3
458            stacks; the ways in which these stacks are used differ
459            between the PPC and X86.</para>
460          <para>Each thread has:</para>
461              <itemizedlist>
462            <listitem>
463                  <para>A "control stack".  On both platforms, this is
464                    "the stack" used by foreign code.  On the PPC, it
465                    consists of a linked list of frames where the first word
466                    in each frame points to the first word in the previous
467                    frame (and the outermost frame points to 0.)  Some
468                    frames on a PPC control stack are lisp frames; lisp
469                    frames are always 4 words in size and contain (in
470                    addition to the back pointer to the previous frame) the
471                    calling function (a node), the return address (a
472                    "locative" into the calling function's code-vector), and
473                    the value to which the value-stack pointer (see below)
474                    should be restored on function exit.  On the PPC, the GC
475                    has to look at control-stack frames, identify which of
476                    those frames are lisp frames, and treat the contents of
477                    the saved function slot as a node (and handle the return
478                    address locative specially.)  On x86-64, the control
479                    stack is used for dynamic-extent allocation of immediate
480                    objects.  Since the control stack never contains nodes
481                    on x86-64, the GC ignores it on that platform.
482                    Alignment of the control stack follows the ABI
483                    conventions of the platform (at least at any point in
484                    time where foreign code could run.)  On PPC, the r1
485                    register always points to the top of the current
486                    thread's control stack; on x86-64, the RSP register
487                    points to the top of the current thread's control stack
488                    when the thread is running foreign code and the address
489                    of the top of the control stack is kept in the thread's
490                    TCR (see <xref linkend="The-Thread-Context-Record"/>)
491                    when not running foreign code.  The control stack "grows
492                    down."</para>
493                </listitem>
494            <listitem>
495                  <para>A "value stack".  On both platforms, all values on
496                    the value stack are nodes (including "tagged return
497                    addresses" on x86-64.)  The value stack is always
498                    aligned to the native word size; objects are always
499                    pushed on the value stack using atomic instructions
500                    ("stwu"/"stdu" on PPC, "push" on x86-64), so the
501                    contents of the value stack between its bottom and top
502                    are always unambiguously nodes; the compiler usually
503                    tries to pop or discard nodes from the value stack as
504                    soon as possible after their last use (as soon as they
505                    may have become garbage.)  On x86-64, the RSP register
506                    addresses the top of the value stack when running lisp
507                    code; that address is saved in the TCR when running
508                    foreign code.  On the PPC, a dedicated register (VSP,
509                    currently r15) is used to address the top of the value
510                    stack when running lisp code, and the VSP value is saved
511                    in the TCR when running foreign code.  The value stack
512                    grows down.</para>
513                </listitem>
514                <listitem>
515                  <para>A "temp stack".  The temp stack consists of a
516                    linked list of frames, each of which points to the
517                    previous temp stack frame.  The number of native
518                    machine words in each temp stack frame is always even,
519                    so the temp stack is aligned on a two-word (64- or
520                    128-bit) boundary.  The temp stack is used for
521                    dynamic-extent objects on both platforms; on the PPC,
522                    it's used for essentially all such objects (regardless
523                    of whether or not the objects contain nodes); on the
524                    x86-64, immediate dynamic-extent objects (strings,
525                    foreign pointers, etc.)  are allocated on the control
526                    stack and only node-containing dynamic-extent objects
527                    are allocated on the temp stack.  Data structures used
528                    to implement CATCH and UNWIND-PROTECT are stored on
529                    the temp stack on both ppc and x86-64.  Temp stack
530                    frames are always doublenode aligned and objects
531                    within a temp stack frame are aligned on doublenode
532                    boundaries.  The first word in each frame contains a
533                    back pointer to the previous frame; on the PPC, the
534                    second word is used to indicate to the GC whether the
535                    remaining objects are nodes (if the second word is 0)
536                    or immediate (otherwise.)  On x86-64, where temp stack
537                    frames always contain nodes, the second word is always
538                    0.  The temp stack grows down.  It usually takes
539                    several instructions to allocate and safely initialize
540                    a temp stack frame that's intended to contain nodes,
541                    and the GC has to recognize the case where a thread is
542                    in the process of allocating and initializing a temp
543                    stack frame and take care not to interpret any
544                    uninitialized words in the frame as nodes. The PPC
545                    keeps the current top of the temp stack in a dedicated
546                    register (TSP, currently r12) when running lisp code
547                    and saves this register's value in the TCR when
548                    running foreign code.  The x86-64 keeps the address of
549                    the top of each thread's temp stack in the thread's
550                    TCR.</para>
551                </listitem>
552          </itemizedlist>
553        </sect3>
554
555        <sect3 id="Register-conventions">
556              <title>Register conventions</title>
557          <para>If there are a "reasonable" (for some value of
558            "reasonable") number of general-purpose registers and the
559            instruction set is "reasonably" orthogonal (most
560            instructions that operate on GPRs can operate on any GPR),
561            then it's possible to statically partition the GPRs into at
562            least two sets: "immediate registers" never contain nodes,
563            and "node registers" always contain nodes.  (On the PPC, a
564            few registers are members of a third set of "PC locatives",
565            and on both platforms some registers may have dedicated
566            roles as stack or heap pointers; the latter class is treated
567            as immediates by the GC proper but may be used to help
568            determine the bounds of stack and heap memory areas.)</para>
569              <para>The ultimate definition of register partitioning is
570            hardwired into the GC in functions like "mark_xp()" and
571            "forward_xp()", which process the values of some of the
572            registers in an exception frame as nodes and may give some
573            sort of special treatment to other register values they
574            encounter there.</para>
575          <para>On x86-64, the static register partitioning scheme involves:</para>
576              <itemizedlist>
577            <listitem>
578                  <para>(only) three "immediate" registers.</para>
579                  <para>The RAX, RCX, and RDX registers are used as the
580                    implicit operands and results of some extended-precision
581                    multiply and divide instructions which generally involve
582                    non-node values; since their use in these instructions
583                    means that they can't be guaranteed to contain node
584                    values at all times, it's natural to put these registers
585                    in the "immediate" set. RAX is generally given the
586                    symbolic name "imm0", RDX is given the symbolic name
587                    "imm1" and RCX is given the symbolic name "imm2"; you
588                    may see these names in disassembled code, usually in
589                    operations involving type checking, array indexing, and
590                    foreign memory and function access.</para>
591                </listitem>
592            <listitem>
593                  <para>(only) two "dedicated" registers.</para>
594                  <para>RSP and RBP have
595                    dedicated functionality dictated by the hardware and
596                    calling conventions.</para>
597                </listitem>
598            <listitem>
599                  <para>11 "node" registers.</para>
600                  <para>All other registers (RBX, RSI, RDI, and R8-R15)
601                    are asserted to contain node values at (almost) all
602                    times; legacy "string" operations that implicitly use RSI
603                    and/or RDI are not used.</para>
604                </listitem>
605              </itemizedlist>
606
607          <para>On the PPC, the static register partitioning scheme
608            involves:</para>
609              <itemizedlist>
610            <listitem>
611                  <para>6 "immediate" registers.</para>
612                  <para>Registers r3-r8 are given
613                    the symbolic names imm0-imm5.  As a RISC architecture
614                    with simpler addressing modes, the PPC probably
615                    uses immediate registers a bit more often than the CISC
616                    x86-64 does, but they're generally used for the same sort
617                    of things (type checking, array indexing, FFI,
618                    etc.)</para>
619                </listitem>
620                <listitem>
621                  <para>9 dedicated registers
622                    <itemizedlist>
623                          <listitem>
624                            <para>r0 (symbolic name rzero) always contains the
625                              value 0 when running lisp code.  Its value is
626                              sometimes read as 0 when it's used as the base
627                              register in a memory address; keeping the value 0
628                              there is sometimes convenient and avoids
629                              asymmetry.</para>
630                          </listitem>
631                          <listitem>
632                            <para>r1 (symbolic name sp) is the control stack
633                              pointer, by PPC convention.</para>
634                          </listitem>
635                  <listitem>
636                            <para>r2 is used to hold the current thread's TCR on
637                              ppc64 systems; it's not used on ppc32.</para>
638                          </listitem>
639                  <listitem>
640                            <para>r9 and r10 (symbolic names allocptr and
641                              allocbase) are used to do per-thread memory
642                              allocation.</para>
643                          </listitem>
644                  <listitem>
645                            <para>r11 (symbolic name nargs) contains the number
646                              of function arguments on entry and the number of
647                              return values in multiple-value returning
648                              constructs.  It's not used more generally as either
649                              a node or immediate register because of the way that
650                              certain trap instruction encodings are
651                              interpreted.</para>
652                          </listitem>
653                  <listitem>
654                            <para>r12 (symbolic name tsp) holds the top of the
655                              current thread's temp stack.</para>
656                          </listitem>
657                          <listitem>
658                            <para>r13 is used to hold the TCR on PPC32 systems;
659                              it's not used on PPC64.</para>
660                          </listitem>
661                          <listitem>
662                            <para>r14 (symbolic name loc-pc) is used to copy
663                              "pc-locative" values between main memory and
664                              special-purpose PPC registers (LR and CTR) used in
665                              function-call and return instructions.</para>
666                          </listitem>
667                  <listitem>
668                            <para>r15 (symbolic name vsp) addresses the top of
669                              the current thread's value stack.</para>
670                          </listitem>
671                          <listitem>
672                            <para>lr and ctr are PPC branch-unit registers used
673                              in function call and return instructions; they're
674                              always treated as "pc-locatives", which precludes
675                              the use of the ctr in some PPC looping
676                              constructs.</para>
677                          </listitem>
678                 
679                    </itemizedlist>
680                  </para>
681                </listitem>
682            <listitem>
683                  <para>16 "node" registers.</para>
684                  <para>r16-r31 are always treated as node
685                    registers (r15 is the dedicated vsp register listed above.)</para>
686                </listitem>
687               
688          </itemizedlist>
689        </sect3>
690      </sect2>
691
692      <sect2 id="Tagging-scheme">
693            <title>Tagging scheme</title>
694        <para>&CCL; always allocates lisp objects on double-node
695          (64-bit for 32-bit platforms, 128-bit for 64-bit platforms)
696          boundaries; this means that the low 3 bits (32-bit lisp) or 4
697          bits (64-bit lisp) are always 0 and are therefore redundant
698          (we only really need to know the upper 29 or 60 bits in order
699          to identify the aligned object address.)  The extra bits in a
700          lisp node can be used to encode at least some information
701          about the node's type, and the other 29/60 bits represent
702          either an immediate value or a doublenode-aligned memory
703          address.  The low 3 or 4 bits of a node are called the node's
704          "tag bits", and the conventions used to encode type
705          information in those tag bits are called a "tagging
706          scheme."</para>
707        <para>It might be possible to use the same tagging scheme on
708          all platforms (at least on all platforms with the same word
709          size and/or the same number of available tag bits), but there
710          are often some strong reasons for not doing so.  These
711          arguments tend to be very machine-specific: sometimes, there
712          are fairly obvious machine-dependent tricks that can be
713          exploited to make common operations on some types of tagged
714          objects faster; other times, there are architectural
715          restrictions that make it impractical to use certain tags for
716          certain types.  (On PPC64, the "ld" (load doubleword) and
717          "std" (store doubleword) instructions - which load and store a
718          GPR operand at the effective address formed by adding the
719          value of another GPR operand and a 16-bit constant operand -
720          require that the low two bits of that constant operand be 0.
721          Since such instructions would typically be used to access the
722          fields of things like CONS cells and structures, it's
723          desirable that the tags chosen for CONS cells and
724          structures allow the use of these instructions as opposed to
725          more expensive alternatives.)</para>
726        <para>One architecture-independent tagging trick that works well
727          on all architectures is to use a tag of 0 for FIXNUMs: a
728          fixnum basically encodes its value shifted left a few bits
729          and keeps those low bits clear. FIXNUM addition,
730          subtraction, and binary logical operations can operate
731          directly on the node operands, addition and subtraction can
732          exploit hardware-based overflow detection, and (in the
733          absence of overflow) the hardware result of those operations
734          is a node (fixnum).  Some other slightly-less-common
735          operations may require a few extra instructions, but
736          arithmetic operations on FIXNUMs should be as cheap as
737          possible and using a tag of zero for FIXNUMs helps to ensure
738          that it will be.</para>
739            <para>If we have N available tag bits (N = 3 for 32-bit &CCL;
740              and N = 4 for 64-bit &CCL;), this way of representing
741              fixnums with the low M bits forced to 0 works as long as M
742              &lt;= N.  The smaller we make M, the larger the values of
743              MOST-POSITIVE-FIXNUM and MOST-NEGATIVE-FIXNUM become; the larger we
744              make M, the more distinct non-FIXNUM tags become available.
745              A reasonable compromise is to choose M = N-1; this basically
746              yields two distinct FIXNUM tags (one for even fixnums, one
747              for odd fixnums), gives 30-bit fixnums on 32-bit platforms
748              and 61-bit fixnums on 64-bit platforms, and leaves us with 6
749              or 14 tags to encode other types.</para>
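        <para>As a rough sketch of the M = N-1 case on a 64-bit lisp
          (N = 4, M = 3), the C fragment below boxes and unboxes
          fixnums and adds two of them without untagging.  The names
          used here (node, box_fixnum, and so on) are made up for the
          example and aren't taken from the kernel sources, and
          overflow detection is left out.</para>
        <programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

typedef int64_t node;              /* a 64-bit lisp node (name is made up) */

#define FIXNUM_SHIFT 3             /* M = N-1 low bits are forced to 0 */
#define FIXNUM_MASK  7             /* (1 << FIXNUM_SHIFT) - 1 */

static int is_fixnum(node x) { return (x & FIXNUM_MASK) == 0; }

/* Multiply and divide by 8 rather than shifting, since shifting
   negative values is not portable C. */
static node box_fixnum(int64_t n) { return n * 8; }

static int64_t unbox_fixnum(node x)
{
    assert(is_fixnum(x));
    return x / 8;
}

/* The sum of two tagged fixnums is the tagged sum; no untagging needed. */
static node fixnum_add(node a, node b) { return a + b; }

/* Low N = 4 bits: 0 for even fixnums, 8 for odd ones. */
static int low_tag(node x) { return (int)(x & 15); }
]]></programlisting>
        <para>For example, box_fixnum(5) is 40 (binary 101000), whose
          low four bits are 1000: the "odd fixnum" tag.</para>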
750        <para>Once we get past the assignment of FIXNUM tags, things
751          quickly devolve into machine-dependencies.  We can fairly
752          easily see that we can't directly tag all other primitive
753          lisp object types with only 6 or 14 available tag values;
754          the details of how types are encoded vary between the ppc32,
755          ppc64, and x86-64 implementations, but there are some
756          general common principles:</para>
757
758            <itemizedlist>
759              <listitem>
760                <para>CONS cells always contain exactly 2 elements and are
761                  usually fairly common.  It therefore makes sense to give
762                  CONS cells their own tag.  Unlike the fixnum case -
763                  where a tag value of 0 had positive implications - there
764                  doesn't seem to be any advantage to using any particular
765                  value.  (A long time ago - in the case of 68K MCL - the
766                  CONS tag and the order of CAR and CDR in memory were
767                  chosen to allow smaller, cheaper addressing modes to be
768                  used to "cdr down a list."  That's not a factor on ppc
769                  or x86-64, but all versions of &CCL; still store the CDR
770                  of a CONS cell first in memory.  It doesn't matter, but
771                  doing it the way that the host system did made
772                  bootstrapping to a new target system a little easier.)
773                </para>
774              </listitem>
775              <listitem>
776                <para>Any way you look at it, NIL is a bit
777                  ... unusual. NIL is both a SYMBOL and a LIST (as well as
778                  being a canonical truth value and probably a few other
779                  things.)  Its role as a LIST is probably much more
780                  important to most programs than its role as a SYMBOL is:
781                  LISTP has to be true of NIL and primitives like CAR and
782                  CDR do LISTP implicitly when safe and want that
783                  operation to be fast. There are several possible
784                  approaches to this problem; &CCL; uses two of them. On
785                  PPC32 and X86-64, NIL is basically a weird CONS cell
786                  that straddles two doublenodes; the tag of NIL is unique
787                  and congruent modulo 4 (modulo 8 on 64-bit) with the tag
788                  used for CONS cells.  LISTP is therefore true of any
789                  node whose low 2 (or 3) bits contain the appropriate tag
790                  value (it's not otherwise necessary to special-case
791                  NIL.)  SYMBOL accessors (SYMBOL-NAME, SYMBOL-VALUE,
792                  SYMBOL-PLIST ..) -do- have to special-case NIL (and
793                  access the components of an internal proxy symbol.) On
794                  PPC64 (where architectural restrictions dictate the set
795                  of tags that can be used to access fixed components of
796                  an object), that approach wasn't practical.  NIL is just
797                  a distinguished SYMBOL, and it just happens to be the
798                  case that its pname slot and values slot are at the same
799                  offsets from a tagged pointer as a CONS cell's CDR and
800                  CAR would be.  NIL's pname is set to NIL (SYMBOL-NAME
801                  checks for this and returns the string "NIL"), and LISTP
802                  (and therefore safe CAR and CDR) has to check for (OR
803                  NULL CONSP). At least in the case of CAR and CDR, the
804                  fact that the PPC has multiple condition-code fields
805                  keeps that extra test from being prohibitively
806                  expensive.</para>
807              </listitem>
808              <listitem>
809                <para>Some objects are immediate (but not FIXNUMs). This
810                  is true of CHARACTERs and, on 64-bit platforms,
811                  SINGLE-FLOATs. It's also true of some nodes used in the
812                  runtime system (special values used to indicate unbound
813                  variables and slots, for instance.) On 64-bit platforms,
814                  SINGLE-FLOATs have their own unique tag (making them a
815                  little easier to recognize); on all platforms, CHARACTERs
816                  share a tag with other immediate objects (unbound
817                  markers) but are easy to recognize (by looking at
818                  several of their low bits.)  The GC treats any node with
819                  an immediate tag (and any node with a fixnum tag) as a
820                  leaf.</para>
821              </listitem>
822          <listitem>
823                <para>There are some advantages to treating everything
824                  else&mdash;memory-allocated objects that aren't CONS
825                  cells&mdash;uniformly.  There are some disadvantages to
826                  that uniform treatment as well, and the treatment of
827                  "memory-allocated non-CONS objects" isn't entirely
828                  uniform across all &CCL; implementations.  Let's first
829                  pretend that the treatment is uniform, then discuss the
830                  ways in which it isn't.  The "uniform approach" is to
831                  treat all memory-allocated non-CONS objects as if they
832                  were vectors; this use of the term is a little looser
833                  than what's implied by the CL VECTOR type.  &CCL;
834                  actually uses the term "uvector" to mean "a
835                  memory-allocated lisp object other than a CONS cell,
836                  whose first word is a header that describes the object's
837                  type and the number of elements that it contains."  In
838                  this view, a SYMBOL is a UVECTOR, as is a STRING, a
839                  STANDARD-INSTANCE, a CL array or vector, a FUNCTION, and
840                  even a DOUBLE-FLOAT. In the PPC implementations (where
841                  things are a little more ... uniform), a single tag
842                  value is used to denote any uvector; in order to
843                  determine something more specific about the type of the
844                  object in question, it's necessary to fetch the low byte
845                  of the header word from memory.  On the x86-64 platform,
846                  certain types of uvectors - SYMBOLs and FUNCTIONs - are
847                  given their own unique tags.  The good news about the
848                  x86-64 approach is that SYMBOLs and FUNCTIONs can be
849                  recognized without referencing memory; the slightly bad
850                  news is that primitive operations that work on
851                  UVECTOR-tagged objects&mdash;like the function
852                  CCL:UVREF&mdash;don't work on SYMBOLs or FUNCTIONs on
853                  x86-64 (but -do- work on those types of objects in the
854                  PPC ports.) The header word that precedes a UVECTOR's
855                  data in memory contains 8 bits of type information in
856                  the low byte and either 24 or 56 bits of "element-count"
857                  information in the rest of the word.  (This is where the
858                  sometimes-limiting value of 2^24 for
859                  ARRAY-TOTAL-SIZE-LIMIT on PPC32 platforms comes from.)
860                  The low byte of the header&mdash;sometimes called the
861                  uvector's subtag&mdash;is itself tagged (which means
862                  that the header is tagged.)  The (3 or 4) tag bits in
863                  the subtag are used to determine whether the uvector's
864                  elements are nodes or immediates. (A UVECTOR whose
865                  elements are nodes is called a GVECTOR; a UVECTOR whose
866                  elements are immediates is called an IVECTOR.  This
867                  terminology came from Spice Lisp, which was a
868                  predecessor of CMUCL.)  Even though a uvector header is
869                  tagged, a header is not a node.  There's no (supported)
870                  way to get your hands on one in lisp and doing so could
871                  be dangerous.  (If the value of a header wound up in a
872                  lisp node register and that register wound up getting
873                  pushed on a thread's value stack, the GC might
874                  misinterpret that situation to mean that there was a
875                  stack-allocated UVECTOR on the value stack.)</para>
876              </listitem>
877         
878            </itemizedlist>
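        <para>The header layout described in the last item can be
          sketched in C for a 32-bit lisp (an 8-bit subtag in the low
          byte, with a 24-bit element count above it.)  The function
          names are invented for the example, and no real subtag
          values are shown; the subtag is just treated as an opaque
          byte.</para>
        <programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

#define SUBTAG_BITS   8
#define SUBTAG_MASK   0xFFu
#define MAX_ELEMENTS  (1u << 24)   /* 2^24: counts must be below this */

/* Pack an element count and a subtag into one 32-bit header word. */
static uint32_t make_header(uint32_t element_count, uint8_t subtag)
{
    assert(element_count < MAX_ELEMENTS);
    return (element_count << SUBTAG_BITS) | subtag;
}

/* The subtag is the low byte of the header. */
static uint8_t header_subtag(uint32_t header)
{
    return (uint8_t)(header & SUBTAG_MASK);
}

/* The element count is everything above the low byte. */
static uint32_t header_element_count(uint32_t header)
{
    return header >> SUBTAG_BITS;
}
]]></programlisting>
        <para>Since only 24 bits are left for the count in this
          layout, no element count of 2^24 or more can be
          represented.</para>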
879      </sect2>
880    </sect1>
881
882    <sect1 id="Heap-Allocation">
883      <title>Heap Allocation</title> <para>When the &CCL; kernel first
884        starts up, a large contiguous chunk of the process's address
885        space is mapped as "anonymous, no access" memory. ("Large"
886        means different things in different contexts; on LinuxPPC32,
887        it means "about 1 gigabyte", on DarwinPPC32, it means "about 2
888        gigabytes", and on current 64-bit platforms it ranges from 128
889        to 512 gigabytes, depending on OS. These values are both
890        defaults and upper limits;
891        the <literal>--heap-reserve</literal> argument can be used to
892        try to reserve less than the default.)</para>
893      <para>Reserving address space that can't (yet) be read or
894        written to doesn't cost much; in particular, it doesn't require
895        that corresponding swap space or physical memory be available.
896        Marking the address range as being "mapped" helps to ensure that
897        other things (results from random calls to malloc(), dynamically
898        loaded shared libraries) won't be allocated in this region that
899        lisp has reserved for its own heap growth.</para>
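      <para>On POSIX systems, this kind of reservation can be sketched
        with mmap() and mprotect().  The C fragment below only
        illustrates the idea; the function names are hypothetical and
        it is not the kernel's actual code.</para>
      <programlisting><![CDATA[
#define _DEFAULT_SOURCE            /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON     /* older BSD/Darwin spelling */
#endif

/* Reserve address space without making it readable or writable.
   Returns NULL on failure. */
static void *reserve_address_space(size_t nbytes)
{
    void *p = mmap(NULL, nbytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Make the first nbytes of a reserved region readable and writable.
   Returns 0 on success, -1 on failure. */
static int commit_read_write(void *start, size_t nbytes)
{
    assert(start != NULL);
    return mprotect(start, nbytes, PROT_READ | PROT_WRITE);
}
]]></programlisting>
      <para>A caller would reserve one large region at startup and
        then call something like commit_read_write() on successive
        parts of it as the heap grows.</para>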
900      <para>A small portion (around 1/32 on 32-bit platforms and 1/64
901        on 64-bit platforms) of that large chunk of address space is
902        reserved for GC data structures.  Memory pages reserved for
903        these data structures are mapped read-write as pages are made
904        writable in the main portion of the heap.</para>
905      <para>The initial heap image is mapped into this reserved
906        address space and an additional (LISP-HEAP-GC-THRESHOLD) bytes
907        are mapped read-write.  GC data structures grow to match the
908        amount of GC-able memory in the initial image plus the gc
909        threshold, and control is transferred to lisp code.
910        Inevitably, that code spoils everything and starts consing;
911        there are basically three layers of memory allocation that can
912        go on.</para>
913
914      <sect2 id="Per-thread-object-allocation">
915            <title>Per-thread object allocation</title>
916        <para>Each lisp thread has a private "reserved memory
917          segment"; when a thread starts up, its reserved memory segment
918          is empty.  PPC ports maintain the highest unallocated address
919          and the lowest allocatable address in the current segment in
920          registers when running lisp code; on x86-64, these values are
921          maintained in the current thread's TCR.  (An "empty" heap
922          segment is one whose high pointer and low pointer are equal.)
923          When a thread is not in the middle of allocating something, the
924          low 3 or 4 bits of the high and low pointers are clear (the
925          pointers are doublenode-aligned.)</para>
926        <para>A thread tries to allocate an object whose physical size
927          in bytes is X and whose tag is Y by:</para>
928            <orderedlist>
929              <listitem>
930                <para>decrementing the "high" pointer by (- X Y)</para>
931              </listitem>
932              <listitem>
933                <para>trapping if the high pointer is less than the low
934                  pointer</para>
935              </listitem>
936              <listitem>
937                <para>using the (tagged) high pointer to initialize the
938                  object, if necessary</para>
939              </listitem>
940              <listitem>
941                <para>clearing the low bits of the high pointer</para>
942              </listitem>
943            </orderedlist>
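        <para>The four steps above can be modeled in C roughly as
          follows.  This is a simulation with made-up names (segment,
          allocate), using 3 tag bits as on a 32-bit lisp; where the
          real code would trap, the model just undoes the decrement
          and returns 0.</para>
        <programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one thread's reserved memory segment. */
typedef struct {
    uintptr_t high;   /* highest unallocated address */
    uintptr_t low;    /* lowest allocatable address */
} segment;

#define TAG_MASK ((uintptr_t)7)   /* 3 tag bits, as on a 32-bit lisp */

/* Allocate `size` bytes with tag `tag`; return the tagged pointer,
   or 0 at the point where the real code would trap. */
static uintptr_t allocate(segment *s, uintptr_t size, uintptr_t tag)
{
    assert(size >= 8 && size % 8 == 0 && tag <= TAG_MASK);
    s->high -= size - tag;            /* 1: decrement by (- X Y) */
    if (s->high < s->low) {           /* 2: "trap" if below the low pointer */
        s->high += size - tag;        /*    (the model just undoes step 1) */
        return 0;
    }
    uintptr_t tagged = s->high;       /* 3: caller initializes through this */
    s->high &= ~TAG_MASK;             /* 4: clear the low (tag) bits */
    return tagged;
}
]]></programlisting>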
944        <para>On PPC32, where the size of a CONS cell is 8 bytes and
945          the tag of a CONS cell is 1, machine code which sets the arg_z
946          register to the result of doing (CONS arg_y arg_z) looks
947          like:</para>
948        <programlisting>
949  (SUBI ALLOCPTR ALLOCPTR 7)    ; decrement the high pointer by (- 8 1)
950  (TWLLT ALLOCPTR ALLOCBASE)    ; trap if the high pointer is below the base
951  (STW ARG_Z -1 ALLOCPTR)       ; set the CDR of the tagged high pointer
952  (STW ARG_Y 3 ALLOCPTR)        ; set the CAR
953  (MR ARG_Z ALLOCPTR)           ; arg_z is the new CONS cell
954  (RLWINM ALLOCPTR ALLOCPTR 0 0 28)     ; clear tag bits
955            </programlisting>
956            <para>On x86-64, the idea's similar but the implementation is
957          different.  The high and low pointers to the current thread's
958          reserved segment are kept in the TCR, which is addressed by
959          the gs segment register. An x86-64 CONS cell is 16 bytes wide
960          and has a tag of 3; we canonically use the temp0 register to
961          initialize the object:</para>
962        <programlisting>
963  (subq ($ 13) ((% gs) 216))    ; decrement allocptr
964  (movq ((% gs) 216) (% temp0)) ; load allocptr into temp0
965  (cmpq ((% gs) 224) (% temp0)) ; compare to allocbase
966  (jg L1)                       ; skip trap
967  (uuo-alloc)                   ; uh, don't skip trap
968L1
969  (andb ($ 240) ((% gs) 216))   ; untag allocptr in the tcr
970  (movq (% arg_y) (5 (% temp0))) ; set the car
971  (movq (% arg_z) (-3 (% temp0))); set the cdr
972  (movq (% temp0) (% arg_z))    ; return the cons
973            </programlisting>
974        <para>If we don't take the trap (if allocating 8-16 bytes
975          doesn't exhaust the thread's reserved memory segment), that's
976          a fairly short and simple instruction sequence.  If we do take
977          the trap, we'll have to do some additional work in order to
978          get a new segment for the current thread.</para>
979      </sect2>
980
981      <sect2 id="Allocation-of-reserved-heap-segments">
982            <title>Allocation of reserved heap segments</title>
983        <para>After the lisp image is first mapped into memory - and after
984          each full GC - the lisp kernel ensures that
985          (LISP-HEAP-GC-THRESHOLD) additional bytes beyond the current
986          end of the heap are mapped read-write.</para>
987        <para>If a thread traps while trying to allocate memory, the
988          thread goes through the usual exception-handling protocol (to
989          ensure that any other thread that GCs "sees" the state of the
990          trapping thread and to serialize exception handling.)  When
991          the exception handler runs, it determines the nature and size
992          of the failed allocation and tries to complete the allocation
993          on the thread's behalf (and leave it with a reasonably large
994          thread-specific memory segment so that the next small
995          allocation is unlikely to trap.)</para>
996        <para>Depending on the size of the requested segment
997          allocation, the number of segment allocations that have
998          occurred since the last GC, and the EGC and GC thresholds, the
999          segment allocation trap handler may invoke a full or ephemeral
1000          GC before returning a new segment.  It's worth noting that the
1001          [E]GC is triggered based on the number of and size of these
1002          segments that have been allocated since the last GC; it doesn't
1003          have much to do with how "full" each of those per-thread
1004          segments is.  It's possible for a large number of threads to
1005          do fairly incidental memory allocation and trigger the GC as a
1006          result; avoiding this involves tuning the per-thread
1007          allocation quantum and the GC/EGC thresholds
1008          appropriately.</para>
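        <para>The effect described above can be illustrated with a toy
          model. This is only a sketch: the class names, the 64KB
          quantum and the 1MB threshold are assumptions made for the
          example, not values or structures taken from the lisp kernel.
          The model counts the bytes handed out as segments since the
          last GC and ignores how much of each segment is actually
          used.</para>
        <programlisting><![CDATA[
# Toy model of segment-based GC triggering; all names and sizes are
# illustrative assumptions, not Clozure CL internals.

SEGMENT_QUANTUM = 64 * 1024      # assumed per-thread allocation quantum (bytes)
GC_THRESHOLD = 1024 * 1024       # assumed trigger: bytes handed out in segments


class ToyHeap:
    def __init__(self):
        self.bytes_in_segments = 0   # segment bytes handed out since last GC
        self.gc_count = 0

    def new_segment(self):
        """Handle a segment-allocation trap: maybe GC, then hand out a segment."""
        if self.bytes_in_segments + SEGMENT_QUANTUM > GC_THRESHOLD:
            self.gc_count += 1
            self.bytes_in_segments = 0
        self.bytes_in_segments += SEGMENT_QUANTUM
        return SEGMENT_QUANTUM


class ToyThread:
    def __init__(self, heap):
        self.heap = heap
        self.free = 0                # bytes left in this thread's segment

    def allocate(self, nbytes):
        if nbytes > self.free:       # the "trap": segment exhausted
            self.free = self.heap.new_segment()
        self.free -= nbytes


def bytes_used_when_gc_fires(n_threads):
    """Each thread conses 16 bytes in turn until the first GC happens."""
    heap = ToyHeap()
    threads = [ToyThread(heap) for _ in range(n_threads)]
    used = 0
    while True:
        for t in threads:
            t.allocate(16)
            if heap.gc_count:
                return used
            used += 16
]]></programlisting>
        <para>In this model a single thread conses a full megabyte
          before the first GC, while seventeen threads trigger it after
          consing only 256 bytes in total, because each of them has
          claimed its own segment.</para>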
1009      </sect2>
1010
1011      <sect2 id="Heap-growth">
1012            <title>Heap growth</title>
1013        <para>All OSes on which &CCL; currently runs use an
1014          "overcommit" memory allocation strategy by default (though
1015          some of them provide ways of overriding that default.)  What
1016          this means in general is that the OS doesn't necessarily
1017          ensure that backing store is available when asked to map pages
1018          as read-write; it'll often return a success indicator from the
1019          mapping attempt (mapping the pages as "zero-fill,
1020          copy-on-write"), and only try to allocate the backing store
1021          (swap space and/or physical memory) when non-zero contents are
1022          written to the pages.</para>
1023        <para>It <emphasis>sounds</emphasis> like it'd be better to have the mmap() call
1024          fail immediately, but it's actually a complicated issue.
1025          (It's possible that other applications will stop using some
1026          backing store before lisp code actually touches the pages that
1027          need it, for instance.)  It's also not guaranteed that lisp
1028          code would be able to "cleanly" signal an out-of-memory
1029          condition if lisp is ... out of memory.</para>
1030            <para>I don't know that I've ever seen an abrupt out-of-memory
1031              failure that wasn't preceded by several minutes of excessive
1032              paging activity.  The most expedient course in cases like this
1033              is to either (a) use less memory or (b) get more memory; it's
1034              generally hard to use memory that you don't have.</para>
1035      </sect2>
1036    </sect1>
1037
1038    <sect1 id="GC-details">
1039      <title>GC details</title>
1040      <para>The GC uses a Mark/Compact algorithm; its
1041        execution time is essentially a function of the amount of live
1042        data in the heap. (The somewhat better-known Mark/Sweep
1043        algorithms don't compact the live data but instead traverse the
1044        garbage to rebuild free-lists; their execution time is therefore
1045        a function of the total heap size.)</para>
1046      <para>As mentioned in <xref linkend="Heap-Allocation"/>, two
1047        auxiliary data structures (proportional to the size of the lisp
1048        heap) are maintained. These are</para>
1049      <orderedlist>
1050            <listitem>
1051              <para>the markbits bitvector, which contains a bit for
1052                every doublenode in the dynamic heap (plus a few extra words
1053                for alignment and so that sub-bitvectors can start on word
1054                boundaries.)</para>
1055            </listitem>
1056            <listitem>
1057              <para>the relocation table, which contains a native word for
1058                every 32 or 64 doublenodes in the dynamic heap, plus an
1059                extra word used to keep track of the end of the heap.</para>
1060            </listitem>
1061      </orderedlist>
1062      <para>The total GC space overhead is therefore on the order of
1063        3% (2/64 or 1/32).</para>
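      <para>The arithmetic behind that estimate can be written out as
        follows. This is a sketch that ignores the few extra alignment
        words; it assumes that a doublenode is two native words and that
        the 64-bit layout uses one relocation word per 64 doublenodes,
        as the list above suggests. The function name is
        hypothetical.</para>
      <programlisting><![CDATA[
from fractions import Fraction


def gc_overhead(word_bits, dnodes_per_reloc_word):
    """Fraction of heap size used by markbits plus relocation table."""
    dnode_bits = 2 * word_bits                       # a doublenode is two words
    markbits = Fraction(1, dnode_bits)               # one bit per doublenode
    reloc = Fraction(word_bits, dnodes_per_reloc_word * dnode_bits)
    return markbits + reloc


overhead_32 = gc_overhead(32, 32)   # 1/64 + 1/64
overhead_64 = gc_overhead(64, 64)   # 1/128 + 1/128
]]></programlisting>
      <para>Under these assumptions the 32-bit figure is 1/32 (about
        3%), and the 64-bit figure comes out at 1/64 (about
        1.6%).</para>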
1064      <para>The general algorithm proceeds as follows:</para>
1065
1066      <sect2 id="Mark-phase">
1067            <title>Mark phase</title>
1068        <para>Each doublenode in the dynamic heap has a corresponding
1069          bit in the markbits vector. (For any doublenode in the heap,
1070          the index of its mark bit is determined by subtracting the
1071          address of the start of the heap from the address of the
1072          object and dividing the result by 8 or 16.) The GC knows the
1073          markbit index of the free pointer, so determining that the
1074          markbit index of a doubleword address is between the start of
1075          the heap and the free pointer can be done with a single
1076          unsigned comparison.</para>
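        <para>A minimal sketch of that index computation and range
          check follows. The function names are hypothetical, and the
          mask is only there to make Python integers wrap around the way
          64-bit unsigned machine arithmetic does.</para>
        <programlisting><![CDATA[
WORD_MASK = (1 << 64) - 1          # model 64-bit unsigned arithmetic


def markbit_index(addr, heap_start, dnode_size=16):
    """Index of the mark bit for the doublenode at addr (wraps like unsigned)."""
    return ((addr - heap_start) & WORD_MASK) // dnode_size


def in_dynamic_heap(addr, heap_start, free_index, dnode_size=16):
    """Single unsigned compare: addresses below heap_start wrap to huge indices."""
    return markbit_index(addr, heap_start, dnode_size) < free_index
]]></programlisting>
        <para>Because an address below the start of the heap wraps to a
          very large index, one comparison against the free pointer's
          index rejects addresses on either side of the dynamic
          heap.</para>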
1077        <para>The markbits of all doublenodes in the dynamic heap are
1078          zeroed before the mark phase begins. An object is
1079          <emphasis>marked</emphasis> if the markbits of all of its
1080          constituent doublewords are set and unmarked otherwise;
1081          setting an object's markbits involves setting the corresponding
1082          markbits of all constituent doublenodes in the object.</para>
1083        <para>The mark phase traverses each root. If the tag of the
1084          value of the root indicates that it's a non-immediate node
1085          whose address lies in the lisp heap, then:</para>
1086            <orderedlist>
1087              <listitem>
1088                <para>If the object is already marked, do nothing.</para>
1089              </listitem>
1090              <listitem>
1091                <para>Set the object's markbit(s).</para>
1092              </listitem>
1093              <listitem>
1094                <para>If the object is an ivector, do nothing further.</para>
1095              </listitem>
1096              <listitem>
1097                <para>If the object is a cons cell, recursively mark its
1098                  car and cdr.</para>
1099              </listitem>
1100              <listitem>
1101                <para>Otherwise, the object is a gvector. Recursively mark
1102                  its elements.</para>
1103              </listitem>
1104            </orderedlist>
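        <para>The steps above can be sketched with plain recursion.
          This is a toy model rather than the kernel's marker: the
          <literal>Obj</literal> class and its fields are hypothetical,
          a boolean stands in for the markbits, and anything that is not
          an <literal>Obj</literal> plays the role of an immediate
          value.</para>
        <programlisting><![CDATA[
class Obj:
    """Toy heap object; kind is 'ivector', 'cons' or 'gvector'."""
    def __init__(self, kind, slots=()):
        self.kind = kind
        self.slots = list(slots)   # car/cdr for a cons, elements for a gvector
        self.marked = False


def mark(value):
    """Mark value and everything reachable from it (plain recursion)."""
    if not isinstance(value, Obj):     # immediate, or not in the lisp heap
        return
    if value.marked:                   # step 1: already marked
        return
    value.marked = True                # step 2: set the markbit(s)
    if value.kind == 'ivector':        # step 3: ivectors hold no pointers
        return
    for slot in value.slots:           # steps 4 and 5: cons cells and gvectors
        mark(slot)
]]></programlisting>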
1105        <para>Marking an object thus involves ensuring that its mark
1106          bits are set and then recursively marking any pointers
1107          contained within the object if the object was originally
1108          unmarked. If this recursive step were implemented in the
1109          obvious manner, marking an object would take stack space
1110          proportional to the length of the pointer chain from some root
1111          to that object. Rather than storing that pointer chain
1112          implicitly on the stack (in a series of recursive calls to the
1113          mark subroutine), the &CCL; marker uses a mixture of recursion
1114          and a technique called <emphasis>link inversion</emphasis> to
1115          store the pointer chain in the objects themselves.  (Recursion
1116          tends to be simpler and faster; if a recursive step notes that
1117          stack space is becoming limited, the link-inversion technique
1118          is used.)</para>
1119        <para>Certain types of objects are treated a little specially:</para>
1120            <orderedlist>
1121              <listitem>
1122                <para>To support a feature called
1123                <emphasis>GCTWA</emphasis><footnote>
1124                          <para>I believe that the acronym comes from MACLISP,
1125                            where it stood for "Garbage Collection of Truly
1126                            Worthless Atoms".</para>
1127                </footnote>,
1128                  the vector that contains the internal
1129                  symbols of the current package is marked on entry to the
1130                  mark phase, but the symbols themselves are not marked at
1131                  this time. Near the end of the mark phase, symbols
1132                  referenced from this vector which are not otherwise
1133                  marked are marked if and only if they're somehow
1134                  distinguishable from newly created symbols (by virtue of
1135                  their having function bindings, value bindings, plists,
1136                  or other attributes.)</para>
1137              </listitem>
1138              <listitem>
1139                <para>Pools have their first element set to NIL before any
1140                  other elements are marked.</para>
1141              </listitem>
1142              <listitem>
1143                <para>All hash tables have certain fields (used to cache
1144                  previous results) invalidated.</para>
1145              </listitem>
1146              <listitem>
1147                <para>Weak hash tables and other weak objects are put on a
1148                  linked list as they're encountered; their contents are only
1149                  retained if there are other (non-weak) references to
1150                  them.</para>
1151              </listitem>
1152            </orderedlist>
1153        <para>At the end of the mark phase, the markbits of all
1154          objects that are transitively reachable from the roots are
1155          set and all other markbits are clear.</para>
1156      </sect2>
1157
1158      <sect2 id="Relocation-phase">
1159            <title>Relocation phase</title>
1160            <para>The <emphasis>forwarding address</emphasis> of a
1161              doublenode in the dynamic heap is (&lt;its current address&gt; -
1162              (size_of_doublenode * &lt;the number of clear markbits that
1163              precede it&gt;)) or alternately (&lt;the base of the heap&gt; +
1164              (size_of_doublenode * &lt;the number of set markbits that
1165              precede it&gt;)). Rather than count the number of preceding
1166              markbits each time, the relocation table is used to precompute
1167              an approximation of the forwarding addresses for all
1168              doublenodes. Given this approximate address and a pointer into
1169              the markbits vector, it's relatively easy to compute the exact
1170              forwarding address.</para>
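            <para>The two formulations can be checked against each other
              with a brute-force sketch. The function names are
              hypothetical, the markbits are modelled as a Python list
              of 0s and 1s, and the 8-byte doublenode size assumes the
              32-bit layout.</para>
            <programlisting><![CDATA[
def forwarding_address(index, markbits, heap_base, dnode_size=8):
    """Forwarding address of doublenode number index, by counting set bits."""
    return heap_base + dnode_size * sum(markbits[:index])


def forwarding_address_alt(index, markbits, heap_base, dnode_size=8):
    """Same value, computed from the current address and the clear bits."""
    current = heap_base + dnode_size * index
    clear_before = index - sum(markbits[:index])
    return current - dnode_size * clear_before
]]></programlisting>
            <para>Counting bits from the base of the heap on every call
              is what the relocation table is meant to avoid.</para>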
1171            <para>The relocation table contains the forwarding addresses
1172              of each <emphasis>pagelet</emphasis>, where a pagelet is 256
1173              bytes (or 32 doublenodes). The forwarding address of the first
1174              pagelet is the base of the heap. The forwarding address of the
1175              second pagelet is the sum of the forwarding address of the
1176              first and 8 bytes for each mark bit set in the first 32-bit
1177              word in the markbits table. The last entry in the relocation
1178              table contains the forwarding address that the freepointer
1179              would have, i.e., the new value of the freepointer after
1180              compaction.</para>
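            <para>A sketch of how such a table could be built from the
              markbits follows. It assumes the 32-bit layout described
              above (8-byte doublenodes, 32 doublenodes per pagelet) and
              models the markbits as a list of 0s and 1s; the function
              name is hypothetical.</para>
            <programlisting><![CDATA[
DNODE_SIZE = 8           # bytes per doublenode (32-bit layout assumed)
DNODES_PER_PAGELET = 32  # one 32-bit markbits word per 256-byte pagelet


def build_relocation_table(markbits, heap_base):
    """markbits: list of 0/1, one per doublenode; returns one entry per
    pagelet plus a final entry holding the post-compaction free pointer."""
    table = [heap_base]
    for start in range(0, len(markbits), DNODES_PER_PAGELET):
        live = sum(markbits[start:start + DNODES_PER_PAGELET])
        table.append(table[-1] + DNODE_SIZE * live)
    return table
]]></programlisting>
            <para>For a three-pagelet heap whose first pagelet is fully
              live, whose second is all garbage and whose third is half
              live, the entries are the heap base, base + 256,
              base + 256 again, and finally base + 384 as the new free
              pointer.</para>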
1181            <para>In many programs, old objects rarely become garbage and
1182              new objects often do. When building the relocation table, the
1183              relocation phase notes the address of the first unmarked
1184              object in the dynamic heap. Only the area of the heap between
1185              the first unmarked object and the freepointer needs to be
1186              compacted; only pointers to this area will need to be
1187              forwarded (any other pointer into the dynamic heap already
1188              holds its own forwarding address.)  Often, the
1189              first unmarked object is much nearer the free pointer than it
1190              is to the base of the heap.</para>
1191      </sect2>
1192
1193      <sect2 id="Forwarding-phase">
1194            <title>Forwarding phase</title>
1195        <para>The forwarding phase traverses all roots and the "old"
1196          part of the dynamic heap (the part between the base of the
1197          heap and the first unmarked object.) All references to objects
1198          whose address is between the first unmarked object and the
1199          free pointer are updated to point to the address the object
1200          will have after compaction by using the relocation table and
1201          the markbits vector and interpolating.</para>
1202            <para>The relocation table entry for the pagelet nearest the
1203              object is found. If the pagelet's address is less than the
1204              object's address, the number of set markbits that precede
1205              the object on the pagelet is used to determine the object's
1206              forwarding address; otherwise, the number of set markbits that follow
1207              the object on the pagelet is used.</para>
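            <para>The interpolation can be sketched as below, again
              assuming the 32-bit layout and a list of 0s and 1s for
              the markbits. The helper names are hypothetical. When
              working backwards from the following table entry, this
              sketch subtracts the set bits from the object itself
              onward.</para>
            <programlisting><![CDATA[
DNODE_SIZE = 8
DNODES_PER_PAGELET = 32


def relocation_table(markbits, heap_base):
    """One entry per pagelet, plus a final entry for the new free pointer."""
    table = [heap_base]
    for start in range(0, len(markbits), DNODES_PER_PAGELET):
        live = sum(markbits[start:start + DNODES_PER_PAGELET])
        table.append(table[-1] + DNODE_SIZE * live)
    return table


def forward(index, markbits, table):
    """Forwarding address of doublenode index via the nearest pagelet entry."""
    pagelet, offset = divmod(index, DNODES_PER_PAGELET)
    start = pagelet * DNODES_PER_PAGELET
    if offset < DNODES_PER_PAGELET // 2:
        # Nearest entry precedes the object: add the set bits before it.
        return table[pagelet] + DNODE_SIZE * sum(markbits[start:index])
    # Nearest entry follows the object: subtract the set bits from it onward.
    end = start + DNODES_PER_PAGELET
    return table[pagelet + 1] - DNODE_SIZE * sum(markbits[index:end])
]]></programlisting>
            <para>Either direction examines at most half a pagelet's
              worth of markbits, which is the point of choosing the
              nearest entry.</para>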
1208        <para>Since forwarding views the heap as a set of doublewords,
1209          locatives are (mostly) treated like any other pointers. (The
1210          basic difference is that locatives may appear to be tagged as
1211          fixnums, in which case they're treated as word-aligned
1212          pointers into the object.)</para>
1213        <para>If the forwarding phase changes the address of any hash
1214          table key in a hash table that hashes by address (e.g., an EQ
1215          hash table), it sets a bit in the hash table's header. The
1216          hash table code will rehash the hash table's contents the next
1217          time it tries to look up a key in such a table.</para>
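        <para>The deferred rehash can be modelled in a few lines. This
          is a sketch under loose assumptions: the classes are
          hypothetical, a boolean stands in for the header bit, and a
          Python dictionary keyed by address stands in for the table's
          buckets.</para>
        <programlisting><![CDATA[
class Key:
    """Toy object whose identity hash is its (movable) address."""
    def __init__(self, address):
        self.address = address


class AddressKeyedTable:
    """Toy EQ-style table: keys hash by address."""
    def __init__(self):
        self.buckets = {}            # address -> (key, value)
        self.needs_rehash = False    # stands in for the header bit
        self.rehash_count = 0

    def _rehash_if_needed(self):
        if self.needs_rehash:        # set by the GC, noticed on next access
            self.buckets = {k.address: (k, v) for k, v in self.buckets.values()}
            self.needs_rehash = False
            self.rehash_count += 1

    def put(self, key, value):
        self._rehash_if_needed()
        self.buckets[key.address] = (key, value)

    def get(self, key):
        self._rehash_if_needed()
        entry = self.buckets.get(key.address)
        return entry[1] if entry else None


def gc_moves(table, key, new_address):
    """What the forwarding phase does in this model: move key, flag table."""
    key.address = new_address
    table.needs_rehash = True
]]></programlisting>
        <para>The GC itself only flips the flag; the cost of rehashing
          is paid by the first lookup that follows.</para>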
1218        <para>Profiling reveals that about half of the total time
1219          spent in the GC is spent in the subroutine which determines a
1220          pointer's forwarding address. Exploiting GCC-specific idioms,
1221          hand-coding the routine, and inlining calls to it could all be
1222          expected to improve GC performance.</para>
1223      </sect2>
1224
1225      <sect2 id="Compact-phase">
1226            <title>Compact phase</title>
1227        <para>The compact phase compacts the area between the first
1228          unmarked object and the freepointer so that it contains only
1229          marked objects.  While doing so, it forwards any pointers it
1230          finds in the objects it copies.</para>
1231        <para>When the compact phase is finished, so is the GC (more
1232          or less): the free pointer and some other data structures are
1233          updated and control returns to the exception handler that
1234          invoked the GC. If sufficient memory has been freed to satisfy
1235          any allocation request that may have triggered the GC, the
1236          exception handler returns; otherwise, a "seriously low on
1237          memory" condition is signaled, possibly after releasing a
1238          small emergency pool of memory.</para>
1239      </sect2>
1240    </sect1>
1241
1242    <sect1 id="The-ephemeral-GC">
1243      <title>The ephemeral GC</title>
1244      <para>In the &CCL; memory management scheme, the relative age
1245        of two objects in the dynamic heap can be determined by their
1246        addresses: if addresses X and Y are both addresses in the
1247        dynamic heap, X is younger than Y (X was created more recently
1248        than Y) if it is nearer to the free pointer (and farther from
1249        the base of the heap) than Y.</para>
1250      <para>Ephemeral (or generational) garbage collectors attempt to
1251        exploit the following assumptions:</para>
1252      <itemizedlist>
1253            <listitem>
1254              <para>most newly created objects become garbage soon after
1255                they're created.</para>
1256            </listitem>
1257            <listitem>
1258              <para>most objects that have already survived several GCs
1259                are unlikely to ever become garbage.</para>
1260            </listitem>
1261            <listitem>
1262              <para>old objects can only point to newer objects as the
1263                result of a destructive modification (e.g., via
1264                SETF.)</para>
1265            </listitem>
1266      </itemizedlist>
1267
1268      <para>By concentrating its efforts on (frequently and quickly)
1269        reclaiming newly created garbage, an ephemeral collector hopes
1270        to postpone the more costly full GC as long as possible. It's
1271        important to note that most programs create some long-lived
1272        garbage, so an EGC can't typically eliminate the need for full
1273        GC.</para>
1274      <para>An EGC views each object in the heap as belonging to
1275        exactly one <emphasis>generation</emphasis>; generations are
1276        sets of objects that are related to each other by age: some
1277        generation is the youngest, some the oldest, and there's an age
1278        relationship between any intervening generations. Objects are
1279        typically assigned to the youngest generation when first
1280        allocated; any object that has survived some number of GCs in
1281        its current generation is promoted (or
1282        <emphasis>tenured</emphasis>) into an older generation.</para>
1283      <para>When a generation is GCed, the roots consist of the
1284        stacks, registers, and global variables as always and also of
1285        any pointers to objects in that generation from other
1286        generations. To avoid the need to scan those (often large) other
1287        generations looking for such intergenerational references, the
1288        runtime system must note all such intergenerational references
1289        at the point where they're created (via SETF).<footnote><para>This is
1290            sometimes called "The Write Barrier": all assignments which
1291            might result in intergenerational references must be noted, as
1292            if the other generations were write-protected.</para></footnote> The
1293        set of pointers that may contain intergenerational references is
1294        sometimes called <emphasis>the remembered set</emphasis>.</para>
1295      <para>In &CCL;'s EGC, the heap is organized exactly the same
1296        as otherwise; "generations" are merely structures which contain
1297        pointers to regions of the heap (which is already ordered by
1298        age.) When a generation needs to be GCed, any younger generation
1299        is incorporated into it; all objects which survive a GC of a
1300        given generation are promoted into the next older
1301        generation. The only intergenerational references that can exist
1302        are therefore those where an old object is modified to contain a
1303        pointer to a new object.</para>
1304      <para>The EGC uses exactly the same code as the full GC. When a
1305        given GC is "ephemeral",</para>
1306      <itemizedlist>
1307        <listitem>
1308              <para>the "base of the heap" used to determine an object's
1309                markbit address is the base of the generation
1310                being collected;</para>
1311            </listitem>
1312        <listitem>
1313              <para>the markbits vector is actually a pointer into the
1314                middle of the global markbits table; preceding entries in
1315                this table are used to note doubleword addresses in older
1316                generations that (may) contain intergenerational
1317                references;</para>
1318            </listitem>
1319        <listitem>
1320              <para>some steps (notably GCTWA and the handling of weak
1321                objects) are not performed;</para>
1322            </listitem>
1323        <listitem>
1324              <para>the intergenerational references table is used to
1325                find additional roots for the mark and forward phases. If a
1326                bit is set in the intergenerational references table, that
1327                means that the corresponding doubleword (in some "old"
1328                generation, in some "earlier" part of the heap) may have had
1329                a pointer to an object in a younger generation stored into
1330                it.</para>
1331            </listitem>
1332       
1333      </itemizedlist>
1334      <para>With one exception (the implicit setfs that occur on entry
1335        to and exit from the binding of a special variable), all setfs
1336        that might introduce an intergenerational reference must be
1337        memoized.
1338        <footnote><para>Note that the implicit setfs that occur when
1339        initializing an object - as in the case of a call to cons or
1340        vector - can't introduce intergenerational references, since
1341        the newly created object is always younger than the objects
1342        used to initialize it.</para></footnote> It's always safe to
1343        push any cons cell or gvector locative onto the memo stack;
1344        it's never safe to push anything else.
1345      </para>
1346
1347      <para>Typically, the intergenerational references bitvector is
1348        sparse: a relatively small number of old locations are stored
1349        into, although some of them may have been stored into many
1350        times. The routine that scans the memoization buffer does a lot
1351        of work and usually does it fairly often; it uses a simple,
1352        brute-force method but might run faster if it were smarter about
1353        recognizing addresses that it'd already seen.
1354      </para>
1355
1356      <para>When the EGC mark and forward phases scan the
1357        intergenerational reference bits, they can clear any bits that
1358        denote doublewords that definitely do not contain
1359        intergenerational references.
1360      </para>
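      <para>The interplay between memoized stores and the
        intergenerational reference bits can be sketched with a toy
        model. Everything here is an assumption made for illustration:
        the class and method names are hypothetical, a Python set
        stands in for the bitvector, a dictionary stands in for memory,
        and doublenodes are taken to be 16 bytes (two 8-byte
        words).</para>
      <programlisting><![CDATA[
DNODE_SIZE = 16


class ToyEGC:
    def __init__(self, heap_base, young_base):
        self.heap_base = heap_base
        self.young_base = young_base     # addresses >= this are "young"
        self.refbits = set()             # indices of noted old doublenodes
        self.memory = {}                 # word address -> stored address or None

    def _is_young(self, value):
        return value is not None and value >= self.young_base

    def store(self, address, value):
        """Memoizing store: note old locations that receive young pointers."""
        self.memory[address] = value
        if address < self.young_base and self._is_young(value):
            self.refbits.add((address - self.heap_base) // DNODE_SIZE)

    def extra_roots(self):
        """Scan noted doublenodes; clear bits that no longer hold young pointers."""
        roots = []
        for index in sorted(self.refbits):
            base = self.heap_base + index * DNODE_SIZE
            words = (base, base + DNODE_SIZE // 2)
            young = [self.memory[a] for a in words
                     if self._is_young(self.memory.get(a))]
            if young:
                roots.extend(young)
            else:
                self.refbits.discard(index)
        return roots
]]></programlisting>
      <para>Stores into young objects, and stores of old pointers into
        old objects, leave the set untouched; a noted doublenode stays
        noted until a scan finds that it no longer refers to anything
        young.</para>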
1361    </sect1>
1362
1363    <sect1 id="Fasl-files">
1364      <title>Fasl files</title>
1365      <para>Saving and loading of Fasl files is implemented in
1366        xdump/faslenv.lisp, level-0/nfasload.lisp, and lib/nfcomp.lisp.
1367        The information here is only an overview, which might help when
1368        reading the source.</para>
1369      <para>The &CCL; Fasl format is forked from the old MCL Fasl
1370        format; there are a few differences, but they are minor.  The
1371        name "nfasload" comes from the fact that this is the so-called
1372        "new" Fasl system, which was true in 1986 or so.  </para>
1373      <para>A Fasl file begins with a "file header", which contains
1374        version information and a count of the following "blocks".
1375        There's typically only one "block" per Fasl file.  The blocks
1376        are part of a mechanism for combining multiple logical files
1377        into a single physical file, in order to simplify the
1378        distribution of precompiled programs. </para>
1379      <para>Each block begins with a header for itself, which just
1380        describes the size of the data that follows.</para>
1381      <para>The data in each block is treated as a simple stream of
1382        bytes, which define a bytecode program.  The actual bytecodes,
1383        "fasl operators", are defined in xdump/faslenv.lisp.  The
1384        descriptions in the source file are terse, but, according to
1385        Gary, "probably accurate".</para>
1386      <para>Some of the operators are used to create a per-block
1387        "object table", which is a vector used to keep track of
1388        previously-loaded objects and simplify references to them.  When
1389        the table is created, an index associated with it is set to
1390        zero; this is analogous to an array fill-pointer, and allows the
1391        table to be treated like a stack.</para>
1392      <para>The low seven bits of each bytecode are used to specify
1393        the fasl operator; currently, about fifty operators are defined.
1394        The high bit, when set, indicates that the result of the
1395        operation should be pushed onto the object table.</para>
1396      <para>Most bytecodes are followed by operands; the operand data
1397        is byte-aligned.  How many operands there are, and their type,
1398        depend on the bytecode.  Operands can be indices into the object
1399        table, immediate values, or some combination of these.</para>
1400      <para>An exception is the bytecode #xFF, which has the symbolic
1401        name ccl::$faslend; it is used to mark the end of the
1402        block.</para>
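      <para>The bytecode layout described above can be illustrated with
        a toy interpreter. Only the seven-bit operator field, the
        high "push" bit and the #xFF end marker come from the text;
        the two operators below are hypothetical and exist only for
        this sketch (the real operators are defined in
        xdump/faslenv.lisp).</para>
      <programlisting><![CDATA[
FASL_END = 0xFF          # ccl::$faslend, per the text above
PUSH_BIT = 0x80          # high bit: push the result onto the object table
OP_MASK = 0x7F           # low seven bits: operator number

# Hypothetical operators for this sketch only; each takes one operand byte.
OP_BYTE = 1              # operand is an immediate integer
OP_REF = 2               # operand is an index into the object table


def run_block(data):
    """Interpret a toy fasl block; returns (results, object_table)."""
    table, results, pos = [], [], 0
    while data[pos] != FASL_END:
        opcode = data[pos]
        op, push = opcode & OP_MASK, bool(opcode & PUSH_BIT)
        operand = data[pos + 1]
        pos += 2
        if op == OP_BYTE:
            value = operand
        elif op == OP_REF:
            value = table[operand]
        else:
            raise ValueError("unknown toy operator %d" % op)
        if push:
            table.append(value)      # the table grows like a stack
        results.append(value)
    return results, table
]]></programlisting>
      <para>In this model the byte #x81 means "toy operator 1, and
        remember the result", so a later reference operator can fetch
        the value back from the object table by index.</para>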
1403    </sect1>
1404
1405
1406
1407    <sect1 id="The-Objective-C-Bridge--1-">
1408      <title>The Objective-C Bridge</title>
1409
1410      <sect2 id="How-CCL-Recognizes-Objective-C-Objects">
1411            <title>How &CCL; Recognizes Objective-C Objects</title>
1412        <para>In most cases, pointers to instances of Objective-C
1413          classes are recognized as such; the recognition is (and
1414          probably always will be) slightly heuristic. Basically, any
1415          pointer that passes basic sanity checks and whose first word
1416          is a pointer to a known ObjC class is considered to be an
1417          instance of that class; the Objective-C runtime system would
1418          reach the same conclusion.</para>
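        <para>The shape of that heuristic can be sketched as follows.
          The specific sanity checks used here (non-null and
          word-aligned) are assumptions, as are the function name and
          the dictionary that stands in for foreign memory; the actual
          checks performed by the bridge are not described in this
          chapter.</para>
        <programlisting><![CDATA[
def looks_like_objc_instance(pointer, memory, known_classes, word_size=8):
    """True if the first word at pointer is the address of a known class."""
    if pointer == 0 or pointer % word_size != 0:   # assumed sanity checks
        return False
    return memory.get(pointer) in known_classes
]]></programlisting>
        <para>Any block of memory whose first word happens to match a
          known class address passes this test, which is why the
          recognition can only ever be heuristic.</para>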
1419        <para>It's certainly possible that a random pointer to an
1420          arbitrary memory address could look enough like an ObjC
1421          instance to fool the lisp runtime system, and it's possible
1422          that the memory a pointer refers to could change so that
1423          something that had been a true ObjC instance (or had
1424          looked a lot like one) no longer is (possibly by virtue of
1425          having been deallocated.)</para>
1426        <para>In the first case, we can improve the heuristics
1427          substantially: we can make stronger assertions that a
1428          particular pointer is really "of type :ID" when it's a
1429          parameter to a function declared to take such a pointer as an
1430          argument or a similarly declared function result; we can be
1431          more confident of something we obtained via SLOT-VALUE of a
1432          slot defined to be of type :ID than if we just dug a pointer
1433          out of memory somewhere.</para>
1434        <para>The second case is a little more subtle: ObjC memory
1435          management is based on a reference-counting scheme, and it's
1436          possible for an object to ... cease to be an object while lisp
1437          is still referencing it.  If we don't want to deal with this
1438          possibility (and we don't), we'll basically have to ensure
1439          that the object is not deallocated while lisp is still
1440          thinking of it as a first-class object. There's some support
1441          for this in the case of objects created with MAKE-INSTANCE,
1442          but we may need to give similar treatment to foreign objects
1443          that are introduced to the lisp runtime in other ways (as
1444          function arguments, return values, SLOT-VALUE results, etc. as
1445          well as those instances that are created under lisp
1446          control.)</para>
1447        <para>This doesn't all work yet (in fact, not much of it works
1448          yet); in practice, this has not yet been as much of a problem
1449          as anticipated, but that may be because existing Cocoa code
1450          deals primarily with relatively long-lived objects such as
1451          windows, views, menus, etc.</para>
1452      </sect2>
1453
1454      <sect2>
1455            <title>Recommended Reading</title>
1456
1457            <variablelist>
1458              <varlistentry>
1459                <term>
1460                  <ulink url="http://developer.apple.com/documentation/Cocoa/">Cocoa Documentation</ulink>
1461                </term>
1462               
1463                <listitem>
1464                  <para>
1465                    This is the top page for all of Apple's documentation on
1466                    Cocoa.  If you are unfamiliar with Cocoa, it is a good
1467                    place to start.
1468                  </para>
1469                </listitem>
1470              </varlistentry>
1471              <varlistentry>
1472                <term>
1473                  <ulink url="http://developer.apple.com/documentation/Cocoa/Reference/Foundation/ObjC_classic/index.html">Foundation Reference for Objective-C</ulink>
1474                </term>
1475
1476                <listitem>
1477                  <para>
1478                    This is one of the two most important Cocoa references; it
1479                    covers all of the basics, except for GUI programming.  This is
1480                    a reference, not a tutorial.
1481                  </para>
1482                </listitem>
1483              </varlistentry>
1484        </variablelist>
1485      </sect2>
1486    </sect1>
1487  </chapter>