% \setlength{\oddsidemargin}{0in}
% \setlength{\evensidemargin}{0in}
% \setlength{\footskip}{1in}
% \setlength{\textwidth}{6.5in}
% \usepackage{theorem}
\title{The GCL ANSI Common Lisp Test Suite}
\author{Paul F. Dietz\footnote{Motorola Global Software Group, 1303
E. Algonquin Road, Annex 2, Schaumburg, IL 60196.}}
I describe the conformance test suite for ANSI Common Lisp distributed
as part of GNU Common Lisp (GCL).  The test suite includes more than
20,000 individual tests, as well as random test generators for
exercising specific parts of Common Lisp implementations, and has
revealed many conformance bugs in all implementations on
which it has been run.
One of the strengths of Common Lisp is the existence of a large,
detailed standard specifying the behavior of conforming
implementations.  The value of the standard to users is enhanced when
they can be confident that implementations that purport to conform
actually do.
In the 1990s I found substantial numbers of conformance bugs in many
Lisp implementations.  As a result, I decided to build a
comprehensive functional test suite for Common Lisp.  The goals of the
effort were, in no particular order:
\begin{itemize}
\item To thoroughly familiarize myself with the standard.
\item To provide a tool to locate conformance problems in CL
implementations, both commercial and free.
\item To enable implementors to improve CL implementations while
      maintaining conformance.
\item To explore the standard itself for ambiguities, unintended
      consequences, and other problems.
\item To explore different testing strategies.
\end{itemize}
I deliberately did not design the test suite to measure or rank the
conformance of Lisp implementations.  For this reason, I will not
report the overall score of any implementation here.
I decided to locate the test suite in the GCL development tree for two
reasons.  First, the GCL development team had a goal of making GCL
more ANSI compliant, and the tests would assist that effort.  Second,
the GCL CVS tree is easily accessible to the public\footnote{See
\url{}}, so developers and users of any
Common Lisp implementation would have easy access to the suite.
The test suite was constructed over the period from 1998 to 2005, with
most of the work done between 2002 and 2004.
As of 24 May 2005, the test suite contains over 20,000 tests.
The test suite is based on a version of the ANSI Common Lisp
specification (ANSI/INCITS 226-1994, formerly ANSI X3.226-1994) that
was made publicly available by Harlequin (now LispWorks) in
hyperlinked form in 1996 \cite{X3J13:94}.
Table \ref{lispimpltab} lists the Lisp implementations on which, to my
knowledge, the test suite has been run.
\begin{table}
\begin{center}
\begin{tabular}{|l|l|} \hline
Implementation & Hardware Platforms \\ \hline
GNU Common Lisp & All Debian platforms \\
GNU CLISP & x86 \\
CMUCL & x86, Sparc \\
SBCL & x86, x86-64, Sparc, MIPS, Alpha, PowerPC \\
Allegro CL (6.2, 7) & x86, Sparc, PowerPC \\
LispWorks (4.3) & x86 \\
OpenMCL & PowerPC \\
ABCL & x86 (JVM) \\
ECL & x86 \\ \hline
\end{tabular}
\end{center}
\caption{\label{lispimpltab} Implementations Tested}
\end{table}
\section {Infrastructure}

The test suite uses Waters' RT package \cite{Waters:91a}.  This
package provides a simple interface for defining tests.  In its
original form, tests are defined with a name (typically a symbol
or string), a form to be evaluated, and zero or more expected
values.  The test passes if the form evaluates to the specified
number of values, and those values are as specified.  See figure
\ref{examplefig} for an example from the test suite.
\begin{figure}
\begin{verbatim}
   (deftest let.17
     (let ((x :bad))
       (declare (special x))
       (let ((x :good)) ;; lexical binding
         (let ((y x))
           (declare (special x)) ;; free declaration
           y)))
     :good)
\end{verbatim}
\caption{\label{examplefig} Example of a test}
\end{figure}
As the test suite evolved, RT was extended.  Features added include:
\begin{itemize}
   \item Error conditions raised by tests may be trapped.
   \item Tests may optionally be executed by wrapping the form to be
evaluated in a lambda form, compiling it, and calling the compiled
code.  This makes sense for testing Lisp itself, but would not be
useful for testing Lisp applications.
   \item A subset of the tests can be run repeatedly, in random order, a
style of testing called \emph{Repeated Random Regression} by Kaner,
Bond and McGee \cite{KanerBondMcGee:04}.\footnote{This was previously
called `Extended Random Regression'; McGee renamed it to avoid the
confusing acronym.}
   \item Notes may be attached to tests, and these notes may be used to
turn off groups of tests.
   \item Tests can be marked as being expected to fail.  Unexpected
      failures are reported separately.
\end{itemize}
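As a sketch of the first of these extensions (the test name and the
condition handling here are illustrative, not the suite's actual
macros), an error-expecting test can trap the condition itself and
normalize it to a value RT can compare against:
\begin{verbatim}
;; Hypothetical error-trapping test: CAR on a non-list should
;; signal a TYPE-ERROR in safe code; the handler maps any such
;; condition to an expected value.
(deftest car.error.example
  (handler-case (car 'not-a-list)
    (type-error () :caught-type-error))
  :caught-type-error)
\end{verbatim}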
\section {Functional Tests}

The bulk of the test suite consists of functional tests derived from
specific parts of the ANSI specification.  Typically, for each
standardized operator there is a file \emph{operator}.lsp containing
tests for that operator.  This provides a crude form of traceability.
There are exceptions to this naming convention, and many tests that
test more than one operator are located somewhat arbitrarily.
Table \ref{tab:testsize} shows the number and size of tests for each
section of the ANSI specification.
\begin{table}
\begin{center}
\begin{tabular}{|l|r|r|} \hline
Section of CLHS & Size (Bytes) & Number of Tests \\
\hline \hline
Arrays & 212623 & 1109 \\
Characters & 38655 & 256  \\
Conditions & 71250 & 658 \\
Cons & 264208 & 1816 \\
Data \& Control Flow & 185973 & 1217 \\
Environment & 51110 & 206 \\
Eval/Compile & 41638 & 234 \\
Files & 26375 & 87 \\
Hash Tables & 38752 & 158 \\
Iteration & 98339 & 767  \\
Numbers & 290991 & 1382 \\
Objects & 283549 & 774 \\
Packages & 162203 & 493 \\
Pathnames & 47100 & 215 \\
Printer & 454314 & 2364 \\
Reader & 101662 & 663 \\
Sequences & 562210 & 3219 \\
Streams & 165956 & 796 \\
Strings & 83982 & 415 \\
Structures & 46271 & 1366 \\
Symbols & 106063 & 1141 \\
System Construction & 16909 & 77 \\
Types & 104804 & 599 \\
Misc & 291883 & 679 \\ \hline
Infrastructure & 115090 & \\
Random Testers & 190575 & \\ \hline
Total & 4052485 & 20702 \\ \hline
\end{tabular}
\end{center}
\caption{\label{tab:testsize} Sizes of Parts of the Test Suite}
\end{table}
Individual tests vary widely in power.  Some are as simple as a
test that {\tt (CAR NIL)} is {\tt NIL}.  Others are more involved.
For example, {\tt TYPES.9} checks that {\tt SUBTYPEP} is transitive
on a large collection of built-in types.
The time required to run the test suite depends on the implementation,
but it is not excessive on modern hardware.  SBCL on a machine with a
2 GHz 64-bit AMD processor, for example, runs the test suite in under
eight minutes.
Error tests have been written where the error behavior is specified by
the standard.  This includes specifications in the `Exceptional
Situations' sections of operator dictionary entries, as well as tests
for calls to functions with too few or too many arguments, keyword
parameter errors, and violations of the first paragraph of the
relevant CLHS section.  When type errors are specified, or when the
CLHS requires that some operator have a well-defined meaning on any
Lisp value, the tests iterate over a set of precomputed Lisp objects
called the `universe' that contains representatives of all
standardized Lisp classes.  In some cases a subset of this universe is
used, for efficiency reasons.
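In outline, such a universe-driven test looks like the following
sketch (the miniature universe and the function name are illustrative;
the suite's actual universe is far larger):
\begin{verbatim}
;; Collect every non-character object for which CHAR-UPCASE fails
;; to signal a TYPE-ERROR; a non-NIL result indicates a failure.
(defparameter *mini-universe*
  (list 0 -1 1/2 1.0 #\a "a" 'a '(a) #(a) (make-hash-table) t nil))

(defun check-char-upcase-type-errors ()
  (loop for x in *mini-universe*
        when (and (not (characterp x))
                  (eq :bad (handler-case (progn (char-upcase x) :bad)
                             (type-error () :good))))
        collect x))
\end{verbatim}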
Some tests perform random input testing; this technique is described
more fully in the next section.  Other tests are themselves
deterministic, but are the product of one of the suite's high volume
random test harnesses.  The `Misc' entry in table \ref{tab:testsize}
refers to these randomly generated tests.  Each of these tests caused
a failure in at least one implementation.
Inevitably, bugs have appeared in the test suite.  Running the test
suite on multiple implementations (see table \ref{lispimpltab})
exposes most problems.  If a test fails in most of them, it is likely
(but not certain) that the test is flawed.  Feedback from implementors
has also been invaluable, and is deeply appreciated.  In some cases,
when it has not been possible to agree on the proper interpretation
of the standard, I've added a note to the set of disputed tests so
they can be disabled as a group.  This is in keeping with the purpose
of the test suite -- to help implementors, not judge implementations.
\section {Random Testing}

Random testing (more properly, random-input testing) is a standard
technique in the testing of hardware systems.  However, it has been the
subject of controversy in the software testing community for more than
two decades.  Myers \cite{Myers:79} called it ``Probably the poorest
... methodology of all''.  This assessment presumes that the cost of
executing tests and checking their results for validity dominates the
cost of constructing the tests.  If test inputs can be constructed and
results checked automatically, it may be very cost-effective to
generate and execute many lower quality tests.  Kaner et
al. call this High Volume Automated Testing \cite{KanerBondMcGee:04}.
Duran and Ntafos \cite{DuranNtafos:81} report favorably on the ability
of random testing to find relatively subtle bugs without a great deal
of effort.  Random testing has been used to test Unix utilities
(so-called `fuzz testing') \cite{MillerFredriksenSo:90}, database
systems \cite{Slutz:98}, and C compilers \cite{McKeeman:98,Lindig:05,Faigon:05}.
Bach and Schroeder \cite{BachSchroeder:04} report that random input
testing compares well with the popular All-Pairs testing technique
at actually finding bugs.
Random input testing provides a powerful means of testing algebraic
properties of systems.  Common Lisp has many instances where such
properties can be checked, and the test suite tests many of them.
Random testing is used to test numeric operators, type operators,
the compiler, some sequence operators, and the readability of
objects printed in `print readably' mode.
One criticism of random testing is its irreproducibility.
With care, this needn't be a problem.  If a random failure
is sufficiently frequent, it can be reproduced with high
probability simply by running the randomized test again.  Tests
can also be designed so that, on failure, they print enough
information that a non-randomized test exercising the bug can
be constructed.  Most of the randomized tests in the test
suite have this property.
\subsection {Compiler Tests} \label{sec:compilertests}
Efficiency of compiled code has long been one of Common Lisp's
strengths.  Implementations have been touted as in some cases
approaching the speed of statically typed languages.  Achieving this
efficiency places strong demands on Lisp compilers.  A sufficiently
smart compiler needs a sufficiently smart test suite.
Compilers (and Lisp compilers in particular) are an ideal target for
random input testing.  Inputs may have many parts that interact in
the compiler in unpredictable ways.  Because the language has a
well-defined semantics, it is easy to generate related, but different,
forms that should yield the same result (thereby providing a test
oracle).
The Random Tester performs the following steps, for input
parameters $n$ and $s$ (each a positive integer):
\begin{enumerate}
  \item Produce a list of $n$ symbols that will be the parameters
        of a lambda expression.  These parameters will have integer
        values.
  \item Produce a list of $n$ finite integer subrange types.  These
        will be the types of the lambda parameters.  The endpoints of
        these types are not uniformly distributed, but instead follow
        an approximately exponential distribution, preferring small
        integers over larger ones.  Integers close in absolute value
        to integer powers of 2 are also overrepresented.
  \item Generate a random conforming Lisp form of `size' approximately $s$
        containing (mostly) integer-valued forms.  The parameters from
        step 1 occur as free variables.
  \item From this form, construct two lambda forms.  In the first,
        the lambda parameters are declared to have their integer
        types, and random {\tt OPTIMIZE} settings are included.  In the
        second, a different set of {\tt OPTIMIZE} settings is declared, and
        all the standardized Lisp functions that occur in the form
        are declared {\tt NOTINLINE}.  The goal here is to attempt to make
        optimizations work differently on the two forms.
  \item For each lambda form, its value on each set of inputs is
        computed.  This is done either by compiling the lambda form
        and calling it on the inputs, or by evaluating forms in which
        the lambda form is the {\tt CAR} and the argument list the
        {\tt CDR}.
  \item A failure occurs if any call to the compiler or evaluator
        signals an error, or if the two lambda forms yield different
        results on any of the inputs.
\end{enumerate}
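The final two steps amount to differential testing of the two lambda
forms; in outline (the function and variable names here are
illustrative, not the tester's actual code):
\begin{verbatim}
;; FORM1 and FORM2 are the two lambda forms; INPUTS is a list of
;; argument lists satisfying the declared types.  Any argument list
;; on which the compiled functions disagree is collected.
(defun differential-check (form1 form2 inputs)
  (let ((fn1 (compile nil form1))
        (fn2 (compile nil form2)))
    (loop for args in inputs
          unless (equal (multiple-value-list (apply fn1 args))
                        (multiple-value-list (apply fn2 args)))
          collect args)))
\end{verbatim}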
This procedure very quickly -- within seconds -- found failures in
every Lisp implementation on which it was tried.  Failures included
assertion failures in the compiler, type errors, differing return
values, code that caused segmentation faults, and in some cases code
that crashed the Lisp entirely.  Most of the 679 `Misc' tests in
table \ref{tab:testsize} were produced by this tester; each represents
a failure in one or more implementations.
Generating failing tests was easy, but minimizing them was tedious
and time-consuming.  I therefore wrote a pruner that repeatedly tries
to simplify a failing random form, replacing integer-valued subforms
with simpler ones, until no failure-preserving substitution
exists.  In most cases, this greatly reduced the size of the failing
form.  Others have previously observed that bug-exposing random inputs
can often be automatically simplified
\cite{HildZeller:02a,McKeeman:98}.  The desire to be able to
automatically simplify the failing forms constrained the tester;
I discuss this problem in section \ref{sec:future}.
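The core of the pruner is a simple greedy loop; the following is a
sketch of the idea, not the suite's actual code:
\begin{verbatim}
;; FAILS-P reruns the differential test on a candidate form;
;; SIMPLER-VERSIONS proposes smaller forms, e.g. by replacing an
;; integer-valued subform with one of its arguments or a constant.
(defun prune-failing-form (form fails-p simpler-versions)
  (loop
    (let ((next (find-if fails-p (funcall simpler-versions form))))
      (if next
          (setf form next)   ; keep shrinking while the failure persists
          (return form)))))  ; no failure-preserving substitution left
\end{verbatim}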
\begin{table}
\begin{center}
\begin{tabular}{|l|l|l|} \hline
Sourceforge Bug \# & Type of Bug & Description \\ \hline
813119 & C & Simplification of conditional forms \\
842910 & C & Simplification of conditional forms \\
842912 & R & Incorrect generated code \\
842913 & R & Incorrect generated code \\
858011 & C & Compiler didn't handle implicit block in {\tt FLET} \\
858658 & R & Incorrect code for {\tt UNWIND-PROTECT} and multiple values \\
860052 & C & Involving {\tt RETURN-FROM} and {\tt MULTIPLE-VALUE-PROG1} \\
864220 & C & Integer tags in tagbody forms \\
864479 & C & Compiler bug in stack analysis \\
866282 & V & Incorrect value computed due to erroneous side effect \\
& & analysis in compiler on special variables \\
874859 & R & Stack mixup causing catch tag to be returned \\
889037 & V & Bug involving nested {\tt LABELS}, {\tt UNWIND-PROTECT},
{\tt DOTIMES} forms \\
890138 & R & Incorrect bytecodes for {\tt CASE}, crashing the Lisp \\
1167991 & C & Simplification of conditional forms \\ \hline
\end{tabular}

\medskip

\begin{tabular}{ll}
C & Condition thrown by the compiler (assert or type check failure) \\
R & Condition thrown at runtime (incorrectly compiled code) \\
V & Incorrect value returned by compiled code \\
\end{tabular}
\end{center}
\caption{\label{clispbugs} Compiler bugs found in GNU CLISP by Random Tester}
\end{table}
Table \ref{clispbugs} lists the fourteen compiler bugs
detected by the random tester in GNU CLISP.  Roughly 200 million
iterations of the random tester were executed to find these bugs,
using a single 1.2 GHz Athlon XP+ workstation running intermittently
over a period of months.  All of these bugs have been fixed (in CVS),
and CLISP now fails only when the random forms produce bignum values
that exceed CLISP's internal limit.
The greatest obstacle to using the random tester is the presence of
unfixed, high probability bugs.  If an implementation has such a bug,
it will generate many useless hits that conceal lower probability
bugs.
\subsection {Types and Compilation}

Type inference and type-based specialization of built-in operators
are a vital part of any high performance Lisp compiler for stock
hardware, so it makes sense to focus testing effort there.  The test
suite contains a facility for generating random inputs for operators
and compiling them with appropriate randomly generated type
annotations, then checking whether the result matches that from an
unoptimized version of the operator.
As an example, the operator {\tt ISQRT} had this bug in one commercial
implementation:
\begin{verbatim}
  (compile nil '(lambda (x) (declare (type (member 4 -1) x)
                                     (optimize speed (safety 1)))
                            (isqrt x)))
  ==> Error: -1 is illegal argument to isqrt
\end{verbatim}
Amusingly, the bug occurs only when the negative integer is the second
item in the {\tt MEMBER} list.  The test that found this bug is
succinctly defined via a macro:
\begin{verbatim}
  (def-type-prop-test isqrt 'isqrt '((integer 0)) 1)
\end{verbatim}
The function to be compiled can be generated in such a way that it stores
the result value into an array specialized to a type that contains
the expected value.  This is intended to allow the result value to
remain unboxed.
The general random testing framework of section
\ref{sec:compilertests} is also useful for testing type-based compiler
optimizations, with two drawbacks: it currently handles only integer
operators, and it is less efficient than the more focused tests.
Even so, it was used to improve unboxed arithmetic in several
implementations (SBCL, CMUCL, GCL, ABCL).
\subsection {{\tt SUBTYPEP} Testing}

The test suite uses the algebraic properties of the {\tt SUBTYPEP}
function in both deterministic and randomized tests.  For example,
if {\tt T1} is known to be a subtype of {\tt T2}, we can also check:
\begin{verbatim}
    (subtypep '(not t2) '(not t1))
    (subtypep '(and t1 (not t2)) nil)
    (subtypep '(or (not t1) t2) t)
\end{verbatim}
The generator/pruner approach of the compiler random tester was also
applied to testing {\tt SUBTYPEP}.  Random types were generated and,
if one was a subtype of the other, the three alternative formulas
were also tested.  If any of these calls returns the two values
{\tt NIL} and {\tt T} (that is, `definitely not a subtype'), a
failure has been found.
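A checker for these derived queries can be sketched as follows (the
function name is illustrative, not the suite's actual code):
\begin{verbatim}
;; Assumes T1 is already known to be a subtype of T2.  Each derived
;; query that definitively answers `no' -- returning the two values
;; NIL and T -- is collected as a conformance failure.
(defun subtypep-consequence-failures (t1 t2)
  (loop for (sub super) in (list (list `(not ,t2) `(not ,t1))
                                 (list `(and ,t1 (not ,t2)) nil)
                                 (list `(or (not ,t1) ,t2) t))
        append (multiple-value-bind (res sure) (subtypep sub super)
                 (when (and sure (not res))
                   (list (list sub super))))))
\end{verbatim}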
Christophe Rhodes used feedback from this tester to fix logic and
performance bugs in SBCL's {\tt SUBTYPEP} implementation.  The
handling of {\tt CONS} types is particularly interesting, since
deciding the subtype relationship in the presence of cons types is
NP-hard.  At least one implementation's {\tt SUBTYPEP} will run wild
on moderately complicated cons types, consuming large amounts of
memory before aborting.
\subsection {Repeated Random Regression}

As mentioned earlier, RRR is a technique for executing tests in an
extended random sequence, in order to flush out interaction bugs and
slow corruption problems.  As an experiment, RT was extended to
support RRR on subsets of the tests.  The main result was to find many
unwanted dependencies in the test suite, particularly among the
package tests.  These dependencies had not surfaced when the tests had
been run in their normal order.
After fixing these problems, RRR did find one CLOS bug in CLISP,
involving interaction between generic functions and class
redefinitions.  The bug was localized by bisecting the set of tests
being run until a minimal core had been found, then minimizing the
sequence of invocations of those tests.  If more bugs of this kind are
found, it may be worthwhile to add a delta debugging
\cite{HildZeller:02a} facility to perform automatic test minimization.
In Lisps that support preemptively scheduled threads, it would be
interesting to use RRR with subsets of the tests that lack global side
effects.  The tests would be run in two or more threads at once in
order to find thread safety problems.
\section {Issues with the ANSI Common Lisp Specification}

Building the test suite involved going over the standard in detail.
Many points were unclear, ambiguous, or contradictory; some
parts of the standard proved difficult to test in a portable
way.  This section describes some of these findings.
See `Proposed ANSI Revisions and Clarifications' on
\url{} for a more complete list that includes
issues arising from the test suite.
\subsection {Testability}

Some parts of the standard proved difficult to test in a completely
conforming way.  The specification of pathnames is one example; the
suite assumes that UNIX-like filenames are legal as physical
pathnames.
Floating point operators presented problems.  The standard does not
specify the accuracy of floating point computations, even though it
does specify a minimum precision for each of the standardized float
types.\footnote{The standard does specify a feature indicating that
the implementation purports to conform to the IEEE Standard for Binary
Floating Point Arithmetic (ANSI/IEEE Std 754-1985); this suite
does not test this.}  Some implementations have accuracy that varies
depending on the details of compilation; in particular, boxed values
may be constrained to 64 bits while unboxed values in machine
registers may have additional `hidden' bits.  These differences
make differential testing challenging.
The Objects chapter contains interfaces that are intended to be used
with the Metaobject Protocol (MOP).  Since the MOP is not part of the
standard, some of these cannot be tested.  For example, there is
apparently no conforming way to obtain an instance of class {\tt
METHOD-COMBINATION}, or to produce any subclass of {\tt
METHOD-COMBINATION}.
\subsection {Unintended Consequences}
There seem to be many issues associated with Common Lisp's type
system.  One example is the {\tt TYPE-OF} function.  According
to the standard, this function has the property that
\begin{quote}
  For any object that is an element of some built-in type: [\ldots]
  the type returned is a recognizable subtype of that built-in type.
\end{quote}
A \emph{built-in} type is defined to be
\begin{quote}
  built-in type {\it n}. one of the types in Figure 4-2.
\end{quote}
Figure 4-2 of the standard contains {\tt UNSIGNED-BYTE}, the type of
nonnegative integers.  These constraints imply that {\tt TYPE-OF} can
never return {\tt FIXNUM} or {\tt BIGNUM} for any nonnegative integer,
since neither of those types is a subtype of {\tt UNSIGNED-BYTE}.
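The constraint is easy to probe directly; in a conforming
implementation, both of the following must return true:
\begin{verbatim}
;; TYPE-OF on a nonnegative integer must be a recognizable subtype
;; of UNSIGNED-BYTE; FIXNUM and BIGNUM both contain negative
;; integers, so neither qualifies.
(subtypep (type-of 42) 'unsigned-byte)
(subtypep (type-of (expt 10 100)) 'unsigned-byte)
\end{verbatim}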
A more serious set of problems involves {\tt
UPGRADED-ARRAY-ELEMENT-TYPE}.\footnote{I ignore the issue that,
strictly speaking, {\tt UPGRADED-ARRAY-ELEMENT-TYPE} is either an
identity function or is not computable, since as defined it must work
on {\tt SATISFIES} types.}  This function (from types to types) is
specified to satisfy these two axioms for all types $T_1$ and $T_2$:
\begin{displaymath}
   T_1 \subseteq UAET(T_1)
\end{displaymath}
\begin{displaymath}
   T_1 \subseteq T_2 \Longrightarrow UAET(T_1) \subseteq UAET(T_2)
\end{displaymath}
A type $T_1$ is a \emph{specialized array element type} if $T_1 = UAET(T_1)$.
These axioms imply:
\begin{quote}
If two types $T_1$ and $T_2$ are specialized
array element types, then so is $T_1 \cap T_2$.
\end{quote}
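The proof is short: since $T_1 \cap T_2 \subseteq T_1$, the
monotonicity axiom gives $UAET(T_1 \cap T_2) \subseteq UAET(T_1) =
T_1$, and symmetrically $UAET(T_1 \cap T_2) \subseteq T_2$, so
$UAET(T_1 \cap T_2) \subseteq T_1 \cap T_2$.  The first axiom supplies
the reverse inclusion, and so $UAET(T_1 \cap T_2) = T_1 \cap T_2$.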
This theorem has a number of unpleasant consequences.  For example,
if {\tt (UNSIGNED-BYTE 16)} and {\tt (SIGNED-BYTE 16)} are specialized
array element types, then so must be their intersection,
{\tt (UNSIGNED-BYTE 15)}.  Even worse, since {\tt BIT} and
{\tt CHARACTER} are required to be specialized array element types,
and since they are disjoint, {\tt NIL}, the empty type, must also be
a specialized array element type.  Topping all this off, note that
\begin{quote}
    A string is a specialized vector whose elements are of type
    character or a subtype of type character. (CLHS page for {\tt STRING})
\end{quote}
Since {\tt NIL} is a subtype of {\tt CHARACTER}, a vector with
array element type {\tt NIL} is a string.  It is therefore
impossible for a conforming implementation to have only a
single representation of strings.\footnote{But since `nil strings' can
never be accessed, it is acceptable in non-safe code to simply assume
string accesses are to some other string representation.  The SBCL
implementors took advantage of this when using nil strings as a stepping
stone to Unicode support.}
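This consequence can be observed directly; in a conforming
implementation, the following returns true:
\begin{verbatim}
;; An array with element type NIL is a string, even though no
;; element can ever be stored in or read from it.
(stringp (make-array 3 :element-type nil))
\end{verbatim}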
\section {Directions For Future Work} \label{sec:future}
The test suite still leaves a few areas insufficiently covered.
Setf expanders need more testing, as do logical pathnames and file
compilation.  Floating point functions are inadequately tested.  As
mentioned earlier, it isn't clear what precision is expected of these
functions, but perhaps tests can be written that check whether the
error is too large (in some sufficiently useful sense).
The random compiler tester, as implemented, is constrained to generate
forms that remain conforming as they are simplified.  This limits the
use of certain operators that do not take the entire set of integers
as their arguments.  For example, {\tt ISQRT} appears only in forms
like {\tt (ISQRT (ABS ...))}, and this pattern is preserved during
pruning.  The forms also make very limited use of non-numeric types.
A more sophisticated random tester could avoid these limitations.  One
approach would be to randomly generate trees from which Lisp forms
could be produced, but that also carry along information that would
enable pruning to be done more intelligently.  Another approach would
be to check each pruned form for validity on the set of chosen random
inputs by doing a trial run with all operators replaced by special
versions that always check for illegal behaviors.  I intend to explore
both options.
The test suite has been written mostly as a `black box' suite (aside
from the randomly generated Misc tests).  It would be interesting to
add more implementation knowledge, with tests that, while conforming,
will be more useful if the Lisp has been implemented in a particular
way.  The type propagation tester is an example of this kind of `gray
box' testing.
It would be interesting to determine the level of coverage achieved by
the test suite in various implementations.  The coverage is probably
not very good, since the suite cannot contain tests of nonstandardized
error situations, but this should be confirmed, and compared against
the coverage obtained from running typical applications.  Internal
coverage could also provide feedback for nudging the random tester
toward testing relatively untested parts of the compiler, say by using
an evolutionary algorithm on the parameters governing the construction
of random forms.
\section {Acknowledgments}

I would like to thank Camm Maguire, the head of the GCL development
team, for allowing the GCL ANSI test suite to be a part of that
project.  I would also like to thank the users of the test suite who
have returned feedback, including Camm, Christophe Rhodes, Sam
Steingold, Bruno Haible, Duane Rettig, Raymond Toy, Dan Barlow, Juan
Jos\'{e} Garc\'{\i}a-Ripoll, Brian Mastenbrook, and many others.