Ticket #756 (closed defect: invalid)

Opened 4 years ago

Last modified 4 years ago

read-line() broken for utf-16 and ucs-2

Reported by: rigus Owned by:
Priority: normal Milestone:
Component: Compiler Version: trunk
Keywords: Cc:

Description

I get a crash when executing something like

(with-open-file (stream ucs-2-file :external-format :ucs-2)
   (print (read-line stream)))

The same for :utf-16. read-char() is OK.

  • Paul

Change History

comment:1 Changed 4 years ago by gb

What crash or error do you get?

(with-open-file (f "home:example.txt" :direction :output :if-exists :supersede :if-does-not-exist :create :external-format :ucs-2)
  (write-line "this is a test" f)
  (write-line "this is only a test" f)
  (write-line "we control the audio, we control the video." f))

(with-open-file (f "home:example.txt" :direction :input :external-format :ucs-2)
  (dotimes (i 3) (print (read-line f))))

prints the 3 lines of text that were written to the file.

comment:2 Changed 4 years ago by rigus

Your example works fine for me. My guess is that it doesn't work for files missing a BOM (U+FEFF). In that case either CCL loops forever while consuming RAM, or Emacs (running Slime) crashes. If I recall correctly, BOM usage is optional.

comment:3 Changed 4 years ago by rigus

I get the crash with a really big file (the one I originally ran into the problem with). When I modify your file and remove the BOM, I get:

Unexpected end of file on #<BASIC-FILE-CHARACTER-INPUT-STREAM ("home:example-no-bom.txt"/7 UCS-2) #x302000DBDE6D>, near position 158
   [Condition of type END-OF-FILE]

comment:4 Changed 4 years ago by gb

  • Status changed from new to closed
  • Resolution set to invalid

BOM usage isn't exactly "optional": if you say that a file is encoded in UCS-2 (or UTF-16), you're saying that the file either begins with a BOM or is implicitly big-endian (UCS-2BE, UTF-16BE.) If it's in fact UCS-2LE, then the first call to READ-LINE will likely try to read the entire file into a string whose characters have byte-reversed codes and return a second value of T, and subsequent calls will immediately return EOF.

(See http://tools.ietf.org/html/rfc2781. Note that http://en.wikipedia.org/wiki/UTF-16/UCS-2 claims that some Windows software assumes little-endian encoding when no BOM is present, and this behavior may be what you expect.)
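The byte-reversal described above can be illustrated without touching a file. The following is only a sketch in portable Common Lisp: DECODE-UCS-2 is a hypothetical helper written for this example, not part of CCL, and it models only how pairs of octets are combined into 16-bit codes.

```lisp
;; Hypothetical helper for illustration only (not a CCL function).
;; Combines consecutive octet pairs into 16-bit code units.
(defun decode-ucs-2 (octets &key (big-endian t))
  "Return the list of 16-bit codes obtained from the list OCTETS.
A trailing unpaired octet is ignored."
  (loop for (b0 b1) on octets by #'cddr
        while b1
        collect (if big-endian
                    (+ (ash b0 8) b1)      ; first octet is the high byte
                    (+ (ash b1 8) b0))))   ; first octet is the low byte

;; "h" followed by a newline, as written by a little-endian encoder:
(defparameter *le-octets* '(#x68 #x00 #x0A #x00))

;; Read with the matching byte order, the codes are #x68 and #x0A (newline).
(decode-ucs-2 *le-octets* :big-endian nil)

;; Read as big-endian, the codes are #x6800 and #x0A00; neither is a newline.
(decode-ucs-2 *le-octets* :big-endian t)
```

Since the newline (code 10) turns into code #x0A00 when the byte order is wrong, a line-oriented reader has nothing to stop at before the end of the file, which is consistent with the behavior described above.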

If you have data that's encoded as UCS-2LE (or UTF-16LE) with no BOM, you don't want to claim that it's UCS-2/UTF-16. E.g., you want to say:

(with-open-file (f path :external-format :ucs-2le)
  (read-line f)
  ...)
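If it isn't known in advance whether a file starts with a BOM, one possible approach is to peek at the first two octets and pick the external format from that. This is only a sketch: UCS-2-FORMAT-FOR-PREFIX and GUESS-UCS-2-FORMAT are hypothetical names made up for this example, and the little-endian fallback is an assumption about where the data came from, not something that can be detected reliably.

```lisp
;; Hypothetical helpers for illustration only (not CCL functions).
(defun ucs-2-format-for-prefix (b0 b1 &optional (fallback :ucs-2le))
  "Return :UCS-2 if octets B0 B1 form a BOM in either byte order,
otherwise FALLBACK. B0 and B1 may be NIL for very short files."
  (if (or (and (eql b0 #xFE) (eql b1 #xFF))    ; big-endian BOM
          (and (eql b0 #xFF) (eql b1 #xFE)))   ; little-endian BOM
      :ucs-2
      fallback))

(defun guess-ucs-2-format (path &optional (fallback :ucs-2le))
  "Open PATH as raw octets and choose an external format keyword."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (ucs-2-format-for-prefix (read-byte in nil nil)
                             (read-byte in nil nil)
                             fallback)))
```

It could then be used as (with-open-file (f path :external-format (guess-ucs-2-format path)) (read-line f)). A file that is really big-endian without a BOM would still be misread under the default fallback, so the fallback should be chosen to match the source of the data.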

comment:5 Changed 4 years ago by rigus

Sorry for my confusion, and thanks for the clarification. This makes sense.

  • Paul