Changeset 5415


Timestamp:
Oct 23, 2006, 4:03:08 PM (18 years ago)
Author:
Gary Byers
Message:

Updated; almost ready to go (still a WITH-OUTPUT-TO-STRING/PPRINT (?) bug).

File:
1 edited

  • trunk/ccl/release-notes.txt

    r5039 r5415  
     1OpenMCL 1.1-pre-060923
     2- There's now a port of OpenMCL to FreeBSD/amd64; it claims to be
     3  of beta quality.  (The problems that made it too unstable
     4  to release as of a few months ago have been fixed;  I still run
     5  into occasional FreeBSD-specific issues, and some such issues
     6  may remain.)
     7- CHAR-CODE-LIMIT is now #x110000, which means that all Unicode
     8  characters can be directly represented.  There is one CHARACTER
     9  type (all CHARACTERs are BASE-CHARs) and one string type (all
     10  STRINGs are BASE-STRINGs.)  This change (and some other changes
     11  in the compiler and runtime) made the heap images a few MB larger
     12  than in previous versions.
     13- As of Unicode 5.0, only about 100,000 of 1114112./#x110000 CHAR-CODEs
     14  are actually defined; the function CODE-CHAR knows that certain
     15  ranges of code values (notably #xd800-#xdfff) will never be valid
     16  character codes and will return NIL for arguments in that range,
     17  but may return a non-NIL value (an undefined/non-standard CHARACTER
     18  object) for other unassigned code values.
     19- The :EXTERNAL-FORMAT argument to OPEN/LOAD/COMPILE-FILE has been
     20  extended to allow the stream's character encoding scheme (as well
     21  as line-termination conventions) to be specified; see more
     22  details below.  MAKE-SOCKET has been extended to allow an
     23  :EXTERNAL-FORMAT argument with similar semantics.
     24- Strings of the form "u+xxxx" - where "xxxx" is a sequence of one
     25  or more hex digits - can be used as character names to denote
     26  the character whose code is the value of the string of hex digits.
     27  (The +  character is actually optional, so  #\u+0020, #\U0020, and
     28  #\U+20 all refer to the #\Space character.)  Characters with codes
     29  in the range #xa0-#x7ff (IIRC) also have symbolic names (the
     30  names from the Unicode standard with spaces replaced with underscores),
     31  so #\Greek_Capital_Letter_Epsilon can be used to refer to the character
     32  whose CHAR-CODE is #x395.
     33- The line-termination convention popularized with the CP/M operating
     34  system (and used in its descendants) - i.e., CRLF - is now supported,
     35  as is the use of Unicode #\Line_Separator (#\u+2028).
     36- About 15-20 character encoding schemes are defined (so far); these
     37  include UTF-8/16/32 and the big-endian/little-endian variants of
     38  the latter two and ISO-8859-* 8-bit encodings.  (There is not
     39  yet any support for traditional (non-Unicode) ways of externally
     40  encoding characters used in Asian languages, support for legacy
     41  MacOS encodings, legacy Windows/DOS/IBM encodings, ...)  It's hoped
     42  that the existing infrastructure will handle most (if not all) of
     43  what's missing; that may not be the case for "stateful" encodings
     44  (where the way that a given character is encoded/decoded depends
     45  on context, like the value of the preceding/following character.)
     46- There isn't yet any support for Unicode-aware collation (CHAR>
     47  and related CL functions just compare character codes, which
     48  can give meaningless results for non-STANDARD-CHARs), case-inversion,
     49  or normalization/denormalization.  There's generally good support
     50  for this sort of thing in OS-provided libraries (e.g., CoreFoundation
     51  on MacOSX), and it's not yet clear whether it'd be best to duplicate
     52  that in lisp or leverage library support.
     53- Unicode-aware FFI functions and macros are still in a sort of
     54  embryonic state if they're there at all; things like WITH-CSTRs
     55  continue to exist (and continue to assume an 8-bit character
     56  encoding.)
     57- Characters that can't be represented in a fixed-width 8-bit
     58  character encoding are replaced with #\Sub (= (code-char 26) =
     59  ^Z) on output, so if you do something like:
     60
     61? (format t "~a" #\u+20a0)
     62
     63  you might see a #\Sub character (however that's displayed on
     64  the terminal device/Emacs buffer) or a Euro currency sign or
     65  practically anything else (depending on how lisp is configured
     66  to encode output to *TERMINAL-IO* and on how the terminal/Emacs
     67  is configured to decode its input.)
     68
     69  On output to streams with character encodings that can encode
     70  the full range of Unicode - and on input from any stream -
     71  "unencodable characters" are represented using the Unicode
     72  #\Replacement_Character (= #\U+fffd); the presence of such a
     73  character usually indicates that something got lost in translation
     74  (data wasn't encoded properly or there was a bug in the decoding
     75  process.)
     76- Streams encoded in schemes which use more than one octet per code unit
     77  (UTF-16, UTF-32, ...) and whose endianness is not explicit will be
     78  written with a leading byte-order-mark character on (new) output and
     79  will expect a BOM on input; if a BOM is missing from input data,
     80  that data will be assumed to have been serialized in big-endian order.
     81  Streams encoded in variants of these schemes whose endianness is
     82  explicit (UTF-16BE, UCS-4LE, ...) will not have byte-order-marks written
     83  on output or expected on input.  (UTF-8 streams might also contain
     84  encoded byte-order-marks; even though UTF-8 uses a single octet per
     85  code unit - and possibly more than one code unit per character - this
     86  convention is sometimes used to advertise that the stream is UTF-8-
     87  encoded.  The current implementation doesn't skip over/ignore leading
     88  BOMs on UTF-8-encoded input, but it probably should.)
     89
     90  If the preceding paragraph made little sense, a shorter version is
     91  that sometimes the endianness of encoded data matters and there
     92  are conventions for expressing the endianness of encoded data; I
     93  think that OpenMCL gets it mostly right, but (even if that's true)
     94  the real world may be messier.
     95- By default, OpenMCL uses ISO-8859-1 encoding for *TERMINAL-IO*
     96  and for all streams whose EXTERNAL-FORMAT isn't explicitly specified.
     97  (ISO-8859-1 just covers the first 256 Unicode code points, where
     98  the first 128 code points are equivalent to US-ASCII.)  That should
     99  be pretty much equivalent to what previous versions (that only
     100  supported 8-bit characters) did, but it may not be optimal for
     101  users working in a particular locale.  The default for *TERMINAL-IO*
     102  can be set via a command-line argument (see below) and this setting
     103  persists across calls to SAVE-APPLICATION, but it's not clear that
     104  there's a good way of setting it automatically (e.g., by checking
     105  the POSIX "locale" settings on startup.)  Things like POSIX locales
     106  aren't always set correctly (even if they're set correctly for
     107  the shell/terminal, they may not be set correctly when running
     108  under Emacs ...) and in general, *TERMINAL-IO*'s notion of the
     109  character encoding it's using and the "terminal device"/Emacs subprocess's
     110  notion need to agree (and fonts need to contain glyphs for the
     111  right set of characters) in order for everything to "work".  Using
     112  ISO-8859-1 as the default seemed to increase the likelihood that
     113  most things would work even if things aren't quite set up ideally
     114  (since no character translation occurs for 8-bit characters in
     115  ISO-8859-1.)
     116- In non-Unicode-related news: the rewrite of OpenMCL's stream code
     117  that was started a few months ago should now be complete (no more
     118  "missing method for BASIC-STREAM" errors, or at least there shouldn't
     119  be any.)
     120- I haven't done anything with the Cocoa bridge/demos lately, besides
     121  a little bit of smoke-testing.
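
The CHAR-CODE-LIMIT and CODE-CHAR items above can be checked from the listener. The following is a minimal sketch that uses only standard CL functions; DESCRIBE-CODE is a hypothetical helper name, not something OpenMCL provides, and the results for surrogate codes and symbolic names are what these notes lead one to expect rather than something guaranteed here.

```lisp
;; DESCRIBE-CODE is a hypothetical helper, not part of OpenMCL.
(defun describe-code (code)
  "Return a string saying what CODE-CHAR makes of the integer CODE."
  ;; Guard against codes at or above CHAR-CODE-LIMIT, which are not
  ;; valid arguments to CODE-CHAR.
  (let ((ch (and (< code char-code-limit) (code-char code))))
    (if ch
        (format nil "#x~x -> ~s" code ch)
        (format nil "#x~x -> no character" code))))

(describe-code #x395)    ; a Greek capital letter
(describe-code #xd800)   ; a surrogate code; per these notes CODE-CHAR gives NIL
(describe-code #x110000) ; equal to CHAR-CODE-LIMIT in this release

;; NAME-CHAR is standard; whether it recognizes this name depends on the
;; symbolic names described above being present (it returns NIL if not).
(name-char "Greek_Capital_Letter_Epsilon")
```

The helper returns a string instead of printing, so the result doesn't depend on how *TERMINAL-IO* happens to be encoded.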
     122
     123Some implementation/usage details:
     124
     125Character encodings.
     126
     127CHARACTER-ENCODINGs are objects (structures) that're named by keywords
     128(:ISO-8859-1, :UTF-8, etc.).  The structures contain attributes of
     129the encoding and functions used to encode/decode external data, but
     130unless you're trying to define or debug an encoding there's little
     131reason to know much about the CHARACTER-ENCODING objects and it's
     132generally desirable (and sometimes necessary) to refer to the encoding
     133via its name.
     134
     135Most encodings have "aliases"; the encoding named :ISO-8859-1 can
     136also be referred to by the names :LATIN1 and :IBM819, among others.
     137Where possible, the keywordized name of an encoding is equivalent
     138to the preferred MIME charset name (and the aliases are all registered
     139IANA charset names.)
     140
     141NIL is an alias for the :ISO-8859-1 encoding; it's treated a little
     142specially by the I/O system.
     143
     144The function CCL:DESCRIBE-CHARACTER-ENCODINGS will write descriptions
     145of all defined character encodings to *terminal-io*; these descriptions
     146include the names of the encoding's aliases and a doc string which
     147briefly describes each encoding's properties and intended use.
     148
     149Line-termination conventions.
     150
     151As noted in the <=1.0 documentation, the keywords :UNIX, :MACOS, and
     152:INFERRED can be used to denote a stream's line-termination conventions.
     153(:INFERRED is only useful for FILE-STREAMs that're open for :INPUT or
     154:IO.)  In this release, the keyword :CR can also be used to indicate
     155that a stream uses #\Return characters for line-termination (equivalent
     156to :MACOS), the keyword :UNICODE denotes that the stream uses Unicode
     157#\Line_Separator characters to terminate lines, and the keywords :CRLF,
     158:CP/M, :MSDOS, :DOS, and :WINDOWS all indicate that lines are terminated
     159via a #\Return #\Linefeed sequence.
     160
     161In some contexts (when specifying EXTERNAL-FORMATs), the keyword :DEFAULT
     162can also be used; in this case, it's equivalent to specifying the value
     163of the variable CCL:*DEFAULT-LINE-TERMINATION*.  The initial value of
     164this variable is :UNIX.
     165
     166Note that the set of keywords used to denote CHARACTER-ENCODINGs and
     167the set of keywords used to denote line-termination conventions are
     168disjoint: a keyword denotes at most a character encoding or a line
     169termination convention, but never both.
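
One way to see a line-termination keyword at work is to write a file with it and then read the raw octets back. This is a sketch only: CRLF-OCTETS is a hypothetical helper, the file name in the usage line is made up, and it assumes that a line-termination keyword is accepted directly as the :EXTERNAL-FORMAT argument to OPEN.

```lisp
;; CRLF-OCTETS is a hypothetical helper, not part of OpenMCL.
(defun crlf-octets (path)
  "Write two short lines to PATH with :CRLF line termination and
return the file's contents as a vector of octets."
  (with-open-file (out path :direction :output :if-exists :supersede
                            :external-format :crlf)
    (write-line "ab" out)
    (write-line "cd" out))
  ;; Reopen as a binary stream so that no decoding or line-termination
  ;; translation happens on the way back in.
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((octets (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence octets in)
      octets)))

(crlf-octets "crlf-demo.tmp")
```

If things behave as described, each line's text should be followed by the octets 13 and 10 (a #\Return #\Linefeed sequence) rather than by a lone 10.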
     170
     171External-formats.
     172
     173EXTERNAL-FORMATs are also objects (structures) with two read-only
     174fields that can be accessed via the functions EXTERNAL-FORMAT-LINE-TERMINATION
     175and EXTERNAL-FORMAT-CHARACTER-ENCODING; the values of these fields are
     176line-termination-convention-names and character-encoding names as described
     177above.
     178
     179An EXTERNAL-FORMAT object can be created via the function MAKE-EXTERNAL-FORMAT:
     180
     181MAKE-EXTERNAL-FORMAT &key domain character-encoding line-termination
     182
     183(Despite the function's name, it doesn't necessarily create a new,
     184unique EXTERNAL-FORMAT object: two calls to MAKE-EXTERNAL-FORMAT
     185with the same arguments made in the same dynamic environment will
     186return the same (eq) object.)
     187
     188Both the :LINE-TERMINATION and :CHARACTER-ENCODING arguments default
     189to :DEFAULT; if :LINE-TERMINATION is specified as or defaults to
     190:DEFAULT, the value of CCL:*DEFAULT-LINE-TERMINATION* is used to
     191provide a concrete value.
     192
     193When the :CHARACTER-ENCODING argument is specified as/defaults to
     194:DEFAULT, the concrete character encoding name that's actually used
     195depends on the value of the :DOMAIN argument to MAKE-EXTERNAL-FORMAT.
     196The :DOMAIN argument's value can be practically anything; when it's
     197the keyword :FILE and the :CHARACTER-ENCODING argument's value is
     198:DEFAULT, the concrete character encoding name that's used will be
     199the value of the variable CCL:*DEFAULT-FILE-CHARACTER-ENCODING*; the
     200initial value of this variable is NIL (which is an alias for :ISO-8859-1);
     201if the value of the :DOMAIN argument is anything else, :ISO-8859-1 is
     202also used (but there's no way to override this.)  The intent is that
     203other values of the DOMAIN argument - notably :SOCKET - could be
     204used to provide defaults for other classes of streams, but this
     205isn't yet implemented.
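
The defaulting rules for MAKE-EXTERNAL-FORMAT can be explored with a few calls. This is a sketch, assuming that MAKE-EXTERNAL-FORMAT and the two accessors are exported from the CCL package as the text suggests; EXTERNAL-FORMAT-SUMMARY is a hypothetical helper name.

```lisp
;; EXTERNAL-FORMAT-SUMMARY is a hypothetical helper, not part of OpenMCL.
(defun external-format-summary ()
  (let ((a (ccl:make-external-format :character-encoding :utf-8
                                     :line-termination :unix))
        (b (ccl:make-external-format :character-encoding :utf-8
                                     :line-termination :unix))
        ;; Both fields left as :DEFAULT, with the :FILE domain.
        (d (ccl:make-external-format :domain :file)))
    (list
     ;; Same arguments, same dynamic environment: expected to be EQ.
     (eq a b)
     (ccl:external-format-character-encoding a)
     (ccl:external-format-line-termination a)
     ;; Expected to reflect CCL:*DEFAULT-FILE-CHARACTER-ENCODING* and
     ;; CCL:*DEFAULT-LINE-TERMINATION* respectively.
     (ccl:external-format-character-encoding d)
     (ccl:external-format-line-termination d))))

(external-format-summary)
```

Rebinding CCL:*DEFAULT-LINE-TERMINATION* around the call (with LET) should change the last element of the result, since :DEFAULT is resolved when MAKE-EXTERNAL-FORMAT is called.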
     206
     207The result of a call to MAKE-EXTERNAL-FORMAT can be used as the value
     208of the :EXTERNAL-FORMAT argument to OPEN, LOAD, COMPILE-FILE, and
     209MAKE-SOCKET; it's also possible to use a few shorthand constructs
     210in these contexts.
     211
     212* if ARG is unspecified or specified as :DEFAULT, the value of the
     213  variable CCL:*DEFAULT-EXTERNAL-FORMAT* is used.  Since the value
     214  of this variable has historically been used to name a default
     215  line-termination convention, this case effectively falls into
     216  the next one:
     217* if ARG is a keyword which names a concrete line-termination convention,
     218  an EXTERNAL-FORMAT equivalent to the result of calling
     219  (MAKE-EXTERNAL-FORMAT :line-termination ARG)
     220   will be used
     221* if ARG is a keyword which names a character encoding, an EXTERNAL-FORMAT
     222  equivalent to the result of calling
     223  (MAKE-EXTERNAL-FORMAT :character-encoding ARG)
     224  will be used
     225* if ARG is a list, the result of (APPLY #'MAKE-EXTERNAL-FORMAT ARG)
     226  will be used
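
The shorthand forms listed above can be compared by opening the same file several ways and asking each stream for its external format. This is a sketch: OPEN-FOUR-WAYS is a hypothetical helper, the file name is made up, and it assumes that a list argument is treated as a list of keyword arguments to MAKE-EXTERNAL-FORMAT.

```lisp
;; OPEN-FOUR-WAYS is a hypothetical helper, not part of OpenMCL.
(defun open-four-ways (path)
  "Open PATH for output with four different :EXTERNAL-FORMAT designators
and return the list of resulting STREAM-EXTERNAL-FORMAT values."
  (flet ((fmt (external-format)
           (with-open-file (s path :direction :output
                                   :if-exists :supersede
                                   :external-format external-format)
             (stream-external-format s))))
    (list
     (fmt :utf-8)                 ; keyword naming a character encoding
     (fmt :crlf)                  ; keyword naming a line-termination convention
     (fmt '(:character-encoding :utf-8 :line-termination :crlf)) ; list
     (fmt (ccl:make-external-format :character-encoding :utf-8
                                    :line-termination :crlf))))) ; object

(open-four-ways "ef-demo.tmp")
```

Applying EXTERNAL-FORMAT-CHARACTER-ENCODING and EXTERNAL-FORMAT-LINE-TERMINATION to each element shows which field a given shorthand filled in and which one was defaulted.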
     227
     228STREAM-EXTERNAL-FORMAT.
     229The CL function STREAM-EXTERNAL-FORMAT - which is portably defined
     230on FILE-STREAMs - can be applied to any open stream in this release
     231and will return an EXTERNAL-FORMAT object when applied to an open
     232CHARACTER-STREAM. For open CHARACTER-STREAMs (other than STRING-STREAMs),
     233SETF can be used with STREAM-EXTERNAL-FORMAT to change the stream's
     234character encoding, line-termination, or both.
     235
     236(I'm not sure if all of the (SETF STREAM-EXTERNAL-FORMAT) methods
     237that're implemented accept "shorthand" designators for EXTERNAL-FORMAT
     238objects; they probably should, but there may be some inconsistencies
     239there.)
     240
     241Note that the effect of doing something like:
     242
     243(let* ((s (open "foo" ... :external-format :utf-8)))
     244  ...
     245  (unread-char ch s)
     246  (setf (stream-external-format s) :us-ascii)
     247  (read-char s))
     248
     249might or might not be what was intended.  The current behavior is
     250that the call to READ-CHAR will return the previously unread character
     251CH, which might surprise any code which assumes that the READ-CHAR
     252will return something encodable in 7 or 8 bits.  Since functions
     253like READ may call UNREAD-CHAR "behind your back", it may or may
     254not be obvious that this has even occurred; the best approach to
     255dealing with this issue might be to avoid using READ or explicit
     256calls to UNREAD-CHAR when processing content encoded in multiple
     257external formats.
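
The UNREAD-CHAR scenario above can be reproduced in a self-contained form. This is a sketch under assumptions: SWITCH-FORMAT-DEMO is a hypothetical helper, the file name is made up, and an explicit EXTERNAL-FORMAT object is passed to SETF because it isn't certain that every (SETF STREAM-EXTERNAL-FORMAT) method accepts shorthand keywords.

```lisp
;; SWITCH-FORMAT-DEMO is a hypothetical helper, not part of OpenMCL.
(defun switch-format-demo (path)
  "Unread a non-ASCII character, switch the stream to :US-ASCII, and
return the CHAR-CODE of the next character read."
  ;; Write one non-ASCII character followed by ASCII text, as UTF-8.
  (with-open-file (out path :direction :output :if-exists :supersede
                            :external-format :utf-8)
    (write-char (code-char #x395) out)
    (write-string "abc" out))
  (with-open-file (s path :external-format :utf-8)
    (let ((ch (read-char s)))
      (unread-char ch s)
      (setf (stream-external-format s)
            (ccl:make-external-format :character-encoding :us-ascii))
      ;; According to the behavior described above, this READ-CHAR
      ;; returns the character that was unread, not a 7-bit character.
      (char-code (read-char s)))))

(switch-format-demo "switch-demo.tmp")
```

If the current behavior is as described, the result is #x395 even though the stream's encoding is :US-ASCII by the time READ-CHAR is called.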
     258
     259There's a similar issue with "bivalent" streams (sockets) which
     260can do both character and binary I/O with an :ELEMENT-TYPE of
     261(UNSIGNED-BYTE 8).  Historically, the sequence:
     262
     263   (unread-char ch s)
     264   (read-byte s)
     265
     266caused the READ-BYTE to return (CHAR-CODE CH); that made sense
     267when everything was implicitly encoded as :ISO-8859-1, but may not
     268make any sense anymore.  (The only thing that seems to make sense
     269in that case is to clear the unread character and read the next
     270octet; that's implemented in some cases but I don't think that
     271things are always handled consistently.)
     272
    1273OpenMCL 1.1-pre-069826
    2274- There's an (alpha-quality, maybe) port to x86-64 Darwin (e.g., the