Changeset 5415
Timestamp: Oct 23, 2006, 4:03:08 PM
File: 1 edited
  trunk/ccl/release-notes.txt (modified) (1 diff)
trunk/ccl/release-notes.txt
r5039 r5415
OpenMCL 1.1-pre-060923
- There's now a port of OpenMCL to FreeBSD/amd64; it claims to be
  of beta quality. (The problems that made it too unstable
  to release as of a few months ago have been fixed; I still run
  into occasional FreeBSD-specific issues, and some such issues
  may remain.)
- CHAR-CODE-LIMIT is now #x110000, which means that all Unicode
  characters can be directly represented. There is one CHARACTER
  type (all CHARACTERs are BASE-CHARs) and one string type (all
  STRINGs are BASE-STRINGs.) This change (and some other changes
  in the compiler and runtime) made the heap images a few MB larger
  than in previous versions.
- As of Unicode 5.0, only about 100,000 of 1114112./#x110000 CHAR-CODEs
  are actually defined; the function CODE-CHAR knows that certain
  ranges of code values (notably #xd800-#xdfff) will never be valid
  character codes and will return NIL for arguments in that range,
  but may return a non-NIL value (an undefined/non-standard CHARACTER
  object) for other unassigned code values.
- The :EXTERNAL-FORMAT argument to OPEN/LOAD/COMPILE-FILE has been
  extended to allow the stream's character encoding scheme (as well
  as line-termination conventions) to be specified; see more
  details below. MAKE-SOCKET has been extended to allow an
  :EXTERNAL-FORMAT argument with similar semantics.
- Strings of the form "u+xxxx" - where "xxxx" is a sequence of one
  or more hex digits - can be used as character names to denote
  the character whose code is the value of the string of hex digits.
  (The + character is actually optional, so #\u+0020, #\U0020, and
  #\U+20 all refer to the #\Space character.) Characters with codes
  in the range #xa0-#x7ff (IIRC) also have symbolic names (the
  names from the Unicode standard with spaces replaced with underscores),
  so #\Greek_Capital_Letter_Epsilon can be used to refer to the character
  whose CHAR-CODE is #x395.
- The line-termination convention popularized with the CP/M operating
  system (and used in its descendants) - i.e., CRLF - is now supported,
  as is the use of Unicode #\Line_Separator (#\u+2028).
- About 15-20 character encoding schemes are defined (so far); these
  include UTF-8/16/32 and the big-endian/little-endian variants of
  the latter two and ISO-8859-* 8-bit encodings. (There is not
  yet any support for traditional (non-Unicode) ways of externally
  encoding characters used in Asian languages, support for legacy
  MacOS encodings, legacy Windows/DOS/IBM encodings, ...) It's hoped
  that the existing infrastructure will handle most (if not all) of
  what's missing; that may not be the case for "stateful" encodings
  (where the way that a given character is encoded/decoded depends
  on context, like the value of the preceding/following character.)
- There isn't yet any support for Unicode-aware collation (CHAR>
  and related CL functions just compare character codes, which
  can give meaningless results for non-STANDARD-CHARs), case-inversion,
  or normalization/denormalization. There's generally good support
  for this sort of thing in OS-provided libraries (e.g., CoreFoundation
  on MacOSX), and it's not yet clear whether it'd be best to duplicate
  that in lisp or leverage library support.
- Unicode-aware FFI functions and macros are still in a sort of
  embryonic state if they're there at all; things like WITH-CSTRs
  continue to exist (and continue to assume an 8-bit character
  encoding.)
- Characters that can't be represented in a fixed-width 8-bit
  character encoding are replaced with #\Sub (= (code-char 26) =
  ^Z) on output, so if you do something like:

  ? (format t "~a" #\u+20a0)

  you might see a #\Sub character (however that's displayed on
  the terminal device/Emacs buffer) or a Euro currency sign or
  practically anything else (depending on how lisp is configured
  to encode output to *TERMINAL-IO* and on how the terminal/Emacs
  is configured to decode its input.)

  On output to streams with character encodings that can encode
  the full range of Unicode - and on input from any stream -
  "unencodable characters" are represented using the Unicode
  #\Replacement_Character (= #\U+fffd); the presence of such a
  character usually indicates that something got lost in translation
  (data wasn't encoded properly or there was a bug in the decoding
  process.)
- Streams encoded in schemes which use more than one octet per code unit
  (UTF-16, UTF-32, ...) and whose endianness is not explicit will be
  written with a leading byte-order-mark character on (new) output and
  will expect a BOM on input; if a BOM is missing from input data,
  that data will be assumed to have been serialized in big-endian order.
  Streams encoded in variants of these schemes whose endianness is
  explicit (UTF-16BE, UCS-4LE, ...) will not have byte-order-marks written
  on output or expected on input. (UTF-8 streams might also contain
  encoded byte-order-marks; even though UTF-8 uses a single octet per
  code unit - and possibly more than one code unit per character - this
  convention is sometimes used to advertise that the stream is
  UTF-8-encoded. The current implementation doesn't skip over/ignore
  leading BOMs on UTF-8-encoded input, but it probably should.)

  If the preceding paragraph made little sense, a shorter version is
  that sometimes the endianness of encoded data matters and there
  are conventions for expressing the endianness of encoded data; I
  think that OpenMCL gets it mostly right, but (even if that's true)
  the real world may be messier.
- By default, OpenMCL uses ISO-8859-1 encoding for *TERMINAL-IO*
  and for all streams whose EXTERNAL-FORMAT isn't explicitly specified.
  (ISO-8859-1 just covers the first 256 Unicode code points, where
  the first 128 code points are equivalent to US-ASCII.) That should
  be pretty much equivalent to what previous versions (that only
  supported 8-bit characters) did, but it may not be optimal for
  users working in a particular locale. The default for *TERMINAL-IO*
  can be set via a command-line argument (see below) and this setting
  persists across calls to SAVE-APPLICATION, but it's not clear that
  there's a good way of setting it automatically (e.g., by checking
  the POSIX "locale" settings on startup.) Things like POSIX locales
  aren't always set correctly (even if they're set correctly for
  the shell/terminal, they may not be set correctly when running
  under Emacs ...) and in general, *TERMINAL-IO*'s notion of the
  character encoding it's using and the "terminal device"/Emacs subprocess's
  notion need to agree (and fonts need to contain glyphs for the
  right set of characters) in order for everything to "work". Using
  ISO-8859-1 as the default seemed to increase the likelihood that
  most things would work even if things aren't quite set up ideally
  (since no character translation occurs for 8-bit characters in
  ISO-8859-1.)
- In non-Unicode-related news: the rewrite of OpenMCL's stream code
  that was started a few months ago should now be complete (no more
  "missing method for BASIC-STREAM" errors, or at least there shouldn't
  be any.)
- I haven't done anything with the Cocoa bridge/demos lately, besides
  a little bit of smoke-testing.

Some implementation/usage details:

Character encodings.

CHARACTER-ENCODINGs are objects (structures) that're named by keywords
(:ISO-8859-1, :UTF-8, etc.). The structures contain attributes of
the encoding and functions used to encode/decode external data, but
unless you're trying to define or debug an encoding there's little
reason to know much about the CHARACTER-ENCODING objects and it's
generally desirable (and sometimes necessary) to refer to the encoding
via its name.

Most encodings have "aliases"; the encoding named :ISO-8859-1 can
also be referred to by the names :LATIN1 and :IBM819, among others.
Where possible, the keywordized name of an encoding is equivalent
to the preferred MIME charset name (and the aliases are all registered
IANA charset names.)

NIL is an alias for the :ISO-8859-1 encoding; it's treated a little
specially by the I/O system.

The function CCL:DESCRIBE-CHARACTER-ENCODINGS will write descriptions
of all defined character encodings to *terminal-io*; these descriptions
include the names of the encoding's aliases and a doc string which
briefly describes each encoding's properties and intended use.

Line-termination conventions.

As noted in the <=1.0 documentation, the keywords :UNIX, :MACOS, and
:INFERRED can be used to denote a stream's line-termination conventions.
(:INFERRED is only useful for FILE-STREAMs that're open for :INPUT or
:IO.) In this release, the keyword :CR can also be used to indicate
that a stream uses #\Return characters for line-termination (equivalent
to :MACOS), the keyword :UNICODE denotes that the stream uses Unicode
#\Line_Separator characters to terminate lines, and the keywords :CRLF,
:CP/M, :MSDOS, :DOS, and :WINDOWS all indicate that lines are terminated
via a #\Return #\Linefeed sequence.

In some contexts (when specifying EXTERNAL-FORMATs), the keyword :DEFAULT
can also be used; in this case, it's equivalent to specifying the value
of the variable CCL:*DEFAULT-LINE-TERMINATION*. The initial value of
this variable is :UNIX.
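
As a rough illustration of the line-termination keywords (a sketch only,
not taken from the OpenMCL sources): the forms below write a file with
:CRLF line endings and read it back using the same convention. The file
name "crlf-demo.txt" is made up, and the sketch assumes that a
line-termination keyword can be passed directly as the :EXTERNAL-FORMAT
argument to OPEN, as the External-formats section of these notes describes.

```lisp
;; Sketch only; assumes an OpenMCL 1.1 listener.
;; "crlf-demo.txt" is a hypothetical file name.
(with-open-file (out "crlf-demo.txt"
                     :direction :output
                     :if-exists :supersede
                     :external-format :crlf)
  ;; Each WRITE-LINE should end its line with #\Return #\Linefeed.
  (write-line "first line" out)
  (write-line "second line" out))

;; Read the two lines back using the same convention.
(with-open-file (in "crlf-demo.txt" :external-format :crlf)
  (list (read-line in) (read-line in)))
```

Substituting :UNIX, :CR, or :UNICODE for :CRLF should select the other
conventions in the same way.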

Note that the set of keywords used to denote CHARACTER-ENCODINGs and
the set of keywords used to denote line-termination conventions are
disjoint: a keyword denotes at most a character encoding or a line
termination convention, but never both.

External-formats.

EXTERNAL-FORMATs are also objects (structures) with two read-only
fields that can be accessed via the functions EXTERNAL-FORMAT-LINE-TERMINATION
and EXTERNAL-FORMAT-CHARACTER-ENCODING; the values of these fields are
line-termination-convention names and character-encoding names as described
above.

An EXTERNAL-FORMAT object can be created via the function MAKE-EXTERNAL-FORMAT:

  MAKE-EXTERNAL-FORMAT &key domain character-encoding line-termination

(Despite the function's name, it doesn't necessarily create a new,
unique EXTERNAL-FORMAT object: two calls to MAKE-EXTERNAL-FORMAT
with the same arguments made in the same dynamic environment will
return the same (eq) object.)

Both the :LINE-TERMINATION and :CHARACTER-ENCODING arguments default
to :DEFAULT; if :LINE-TERMINATION is specified as or defaults to
:DEFAULT, the value of CCL:*DEFAULT-LINE-TERMINATION* is used to
provide a concrete value.

When the :CHARACTER-ENCODING argument is specified as/defaults to
:DEFAULT, the concrete character encoding name that's actually used
depends on the value of the :DOMAIN argument to MAKE-EXTERNAL-FORMAT.
The :DOMAIN argument's value can be practically anything; when it's
the keyword :FILE and the :CHARACTER-ENCODING argument's value is
:DEFAULT, the concrete character encoding name that's used will be
the value of the variable CCL:*DEFAULT-FILE-CHARACTER-ENCODING*; the
initial value of this variable is NIL (which is an alias for :ISO-8859-1);
if the value of the :DOMAIN argument is anything else, :ISO-8859-1 is
also used (but there's no way to override this.) The intent is that
other values of the :DOMAIN argument - notably :SOCKET - could be
used to provide defaults for other classes of streams, but this
isn't yet implemented.

The result of a call to MAKE-EXTERNAL-FORMAT can be used as the value
of the :EXTERNAL-FORMAT argument to OPEN, LOAD, COMPILE-FILE, and
MAKE-SOCKET; it's also possible to use a few shorthand constructs
in these contexts.

* if ARG is unspecified or specified as :DEFAULT, the value of the
  variable CCL:*DEFAULT-EXTERNAL-FORMAT* is used. Since the value
  of this variable has historically been used to name a default
  line-termination convention, this case effectively falls into
  the next one:
* if ARG is a keyword which names a concrete line-termination convention,
  an EXTERNAL-FORMAT equivalent to the result of calling
  (MAKE-EXTERNAL-FORMAT :line-termination ARG)
  will be used
* if ARG is a keyword which names a character encoding, an EXTERNAL-FORMAT
  equivalent to the result of calling
  (MAKE-EXTERNAL-FORMAT :character-encoding ARG)
  will be used
* if ARG is a list, the result of (APPLY #'MAKE-EXTERNAL-FORMAT ARG)
  will be used

STREAM-EXTERNAL-FORMAT.

The CL function STREAM-EXTERNAL-FORMAT - which is portably defined
on FILE-STREAMs - can be applied to any open stream in this release
and will return an EXTERNAL-FORMAT object when applied to an open
CHARACTER-STREAM. For open CHARACTER-STREAMs (other than STRING-STREAMs),
SETF can be used with STREAM-EXTERNAL-FORMAT to change the stream's
character encoding, line-termination, or both.

(I'm not sure if all of the (SETF STREAM-EXTERNAL-FORMAT) methods
that're implemented accept "shorthand" designators for EXTERNAL-FORMAT
objects; they probably should, but there may be some inconsistencies
there.)

Note that the effect of doing something like:

  (let* ((s (open "foo" ... :external-format :utf-8)))
    ...
    (unread-char ch s)
    (setf (stream-external-format s) :us-ascii)
    (read-char s))

might or might not be what was intended. The current behavior is
that the call to READ-CHAR will return the previously unread character
CH, which might surprise any code which assumes that the READ-CHAR
will return something encodable in 7 or 8 bits. Since functions
like READ may call UNREAD-CHAR "behind your back", it may or may
not be obvious that this has even occurred; the best approach to
dealing with this issue might be to avoid using READ or explicit
calls to UNREAD-CHAR when processing content encoded in multiple
external formats.

There's a similar issue with "bivalent" streams (sockets) which
can do both character and binary I/O with an :ELEMENT-TYPE of
(UNSIGNED-BYTE 8). Historically, the sequence:

  (unread-char ch s)
  (read-byte s)

caused the READ-BYTE to return (CHAR-CODE CH); that made sense
when everything was implicitly encoded as :ISO-8859-1, but may not
make any sense anymore. (The only thing that seems to make sense
in that case is to clear the unread character and read the next
octet; that's implemented in some cases but I don't think that
things are always handled consistently.)

OpenMCL 1.1-pre-069826
- There's an (alpha-quality, maybe) port to x86-64 Darwin (e.g., the
