Opened 9 years ago

Closed 7 years ago

#749 closed defect (fixed)

Unicode encoding fails silently on illegal characters

Reported by: rongarret
Owned by:
Priority: normal
Milestone:
Component: other
Version: trunk
Keywords:
Cc:

Description

If an attempt is made to encode a string to octets using an encoding that does not support some of the characters in the string, the result should be an error, but is in fact a bogus byte vector, e.g.:

? (encode-string-to-octets "(λ (μ) μ)" :external-format :ascii)
#(40 26 32 40 26 41 32 26 41)

Change History (3)

comment:1 Changed 9 years ago by gb

Signaling an error is certainly legal (according to one or more passages in the Unicode standard whose URLs I can't find at the moment), as is CCL's behavior (encoding a #\SUB character if the character can't be encoded, and returning a replacement character on a decoding error). I don't think that signaling an error at this level is as desirable. A little bit of substitution (a copyright symbol in a comment) might be harmless, while a lot of substitution might indicate that the source wasn't encoded in the same way that it's being decoded. Decisions as to which of these cases should be treated as errors should probably be made at a higher level.

There are arguments against quietly losing information and returning

#(40 32 40 41 32 41) 

in your example, and I think that that sort of behavior has been the basis for a lot of security exploits over the years.

In the case of encoding ASCII to an octet vector, it's trivial to detect (via POSITION or COUNT or ...) whether the result contains substituted characters; for some other encodings, it's at least a little harder. Having some utility that did this for you - something like

  (count-substituted-encodings vector :external-format :ascii)
- could be useful. (Or the in-memory encoding/decoding functions could return a second value indicating whether or not substitution occurred, and let higher-level code decide whether to treat that as an error or not.)
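For the ASCII case, the POSITION/COUNT check mentioned above can be sketched in portable Common Lisp. The names `+ascii-sub-octet+` and `count-sub-octets` are hypothetical and are not part of CCL; the input vector is the one reported in this ticket.

```lisp
;; Hypothetical helper, not part of CCL: count octets equal to the
;; ASCII SUB code (26) in an already-encoded octet vector.
(defconstant +ascii-sub-octet+ 26)

(defun count-sub-octets (octets)
  "Return the number of elements of OCTETS equal to the SUB octet (26)."
  (count +ascii-sub-octet+ octets))

;; The vector reported in this ticket for "(λ (μ) μ)" encoded as :ascii.
(count-sub-octets #(40 26 32 40 26 41 32 26 41)) ; => 3

;; POSITION gives the index of the first substituted octet.
(position +ascii-sub-octet+ #(40 26 32 40 26 41 32 26 41)) ; => 1
```

A check like this cannot tell a substituted character from a #\SUB that was really present in the input string, which is one reason a second return value or a condition may be preferable.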

comment:2 Changed 9 years ago by rongarret

> (Or the in-memory encoding/decoding functions could return a second value indicating whether or not substitution occurred, and let higher-level code decide whether to treat that as an error or not.)

That sounds like the best plan to me.

comment:3 Changed 7 years ago by gb

  • Resolution set to fixed
  • Status changed from new to closed

(In [15236]) Change the initial values of *TERMINAL-CHARACTER-ENCODING-NAME* and *DEFAULT-FILE-CHARACTER-ENCODING* to :UTF-8, mostly for the benefit of the Init-File-Editing-Impaired. (I've resolved not to make fun of the IFEI.) Note that this may require changes to startup scripts etc.

Define new conditions CCL:DECODING-PROBLEM and CCL:ENCODING-PROBLEM. Signal these conditions (via SIGNAL) when decoding characters from, or encoding them to, a stream, pointer, or octet vector and a substitution or replacement character would be used.

New macros (CCL:WITH-DECODING-PROBLEMS-AS-ERRORS &body body) and (CCL:WITH-ENCODING-PROBLEMS-AS-ERRORS &body body) signal the corresponding conditions as ERRORs if they are signaled during execution of the body.
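A rough sketch of how higher-level code might use these, assuming a Clozure CL build that includes this change. CCL:ENCODING-PROBLEM, CCL:WITH-ENCODING-PROBLEMS-AS-ERRORS and ENCODE-STRING-TO-OCTETS are the names used in this ticket; `encode-counting-problems`, `encode-strictly` and `*lambda-form*` are hypothetical names introduced only for illustration.

```lisp
;; Assumes Clozure CL with the conditions and macros described above.

(defun encode-counting-problems (string external-format)
  "Encode STRING; return the octets and the number of ENCODING-PROBLEM signals seen."
  (let ((problems 0))
    (handler-bind ((ccl:encoding-problem
                     (lambda (condition)
                       (declare (ignore condition))
                       ;; Returning normally declines the condition, so
                       ;; SIGNAL returns and encoding carries on with the
                       ;; substitution character.
                       (incf problems))))
      (values (ccl:encode-string-to-octets string
                                           :external-format external-format)
              problems))))

(defun encode-strictly (string external-format)
  "Return the octets, or NIL and the condition if a substitution was needed."
  (handler-case
      (ccl:with-encoding-problems-as-errors
        (ccl:encode-string-to-octets string
                                     :external-format external-format))
    ;; Matches whether the problem arrives as an ERROR or as the
    ;; ENCODING-PROBLEM condition itself (an assumption either way).
    ((or error ccl:encoding-problem) (condition)
      (values nil condition))))

;; The string from this ticket, built with CODE-CHAR so the example does
;; not depend on the encoding of the source file.
(defparameter *lambda-form*
  (format nil "(~C (~C) ~C)"
          (code-char #x3BB) (code-char #x3BC) (code-char #x3BC)))

(encode-counting-problems *lambda-form* :ascii)
(encode-strictly *lambda-form* :ascii)
```

The first function keeps the substitution behavior and leaves the decision to the caller; the second opts in to treating any substitution as a failure.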

(Arguably) fixes ticket:749.

FILE-STRING-LENGTH checks to see if the encoding wants to use a byte-order-mark before subtracting the length of an encoded BOM from the encoded string length if the file is at its beginning.
