Ticket #732 (closed defect: worksforme)

Opened 4 years ago

Last modified 4 years ago

small unicode problem

Reported by: kmorgan Owned by:
Priority: normal Milestone:
Component: IDE Version: 1.4
Keywords: Cc:

Description (last modified by gb)

I'm doing an experiment in formatting some files that use Devanagari Unicode characters. The input file is UTF-8, and it's my intention to produce a UTF-8 output file. The following function reads each sexp from the input file and, for each one, correctly prints a Devanagari word to the screen but apparently writes the same word as garbage to the output file. Can you please tell me the right stream parameters? Thanks.

(defun format-dict ()
  (let ((fi "/Users/kmorgan/documents/yoga/sanskrit/roots/roots.txt")
        (fo "/Users/kmorgan/documents/yoga/sanskrit/roots/dict.txt"))
    (with-open-file (si fi :external-format :utf-8)
      (with-open-file (so fo :direction :output :if-exists :supersede :external-format :utf-8)
        (let ((*print-miser-width* 120))
          ;; Read forms until end of file; READ returns NIL at EOF here.
          (do ((x (read si nil nil) (read si nil nil)))
              ((null x))
            (princ (second (second x)) so) ; write the word to the output file
            (princ (second (second x)))    ; echo the same word to the screen
            ;; (print-entry so x)
            ))))))

Attachments

dict.txt (6 bytes) - added by kmorgan 4 years ago.
roots.txt (1.2 KB) - added by kmorgan 4 years ago.

Change History

Changed 4 years ago by kmorgan

dict.txt

Changed 4 years ago by kmorgan

roots.txt

comment:1 Changed 4 years ago by gb

  • Status changed from new to closed
  • Resolution set to invalid
  • Description modified

Saying :external-format :utf-8 should cause both the input and output files to be treated as being encoded in utf-8. You probably do need to add :IF-DOES-NOT-EXIST :CREATE to the clause that opens the output file.

As far as I can tell, your test worked correctly. "roots.txt" contains one form; the second element of the second element of that form is a token containing the two characters #\U+0915 and #\U+0943. Those characters were encoded (in UTF-8) in the input file as the octet sequences #xe0 #xa4 #x95 and #xe0 #xa5 #x83, and that same sequence of octets was written to the output file. That sequence will look like garbage unless whatever's looking at it knows that the file is encoded in UTF-8; to something that does know the encoding, it'll look like two Devanagari characters.
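
The round trip described in this comment can be checked from the REPL. The following is a minimal sketch, not part of the original report: the helper names file-octets and utf-8-round-trip and the scratch path are hypothetical, and it assumes an implementation (such as Clozure CL) that accepts :external-format :utf-8. It writes the two characters to a file, then reads the file back as raw octets.

```lisp
(defun file-octets (path)
  "Return the contents of the file at PATH as a list of octets."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((octets (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence octets in)
      (coerce octets 'list))))

(defun utf-8-round-trip (string path)
  "Write STRING to PATH encoded as UTF-8, then return the octets written."
  (with-open-file (out path :direction :output
                            :if-exists :supersede
                            :if-does-not-exist :create
                            :external-format :utf-8)
    (write-string string out))
  (file-octets path))

;; #x0915 and #x0943 are the two code points named in the comment.
;; The path is a hypothetical scratch file.
(format t "~{~2,'0X~^ ~}~%"
        (utf-8-round-trip (coerce (list (code-char #x0915) (code-char #x0943))
                                  'string)
                          "/tmp/utf8-check.txt"))
```

If the output stream is encoding correctly, the octets printed should be the six quoted in the comment (e0 a4 95 e0 a5 83), which also matches the 6-byte size of the attached dict.txt.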

Unless I'm missing something, I don't see a bug here. I get the same results that you report, and they seem to be correct.

comment:2 Changed 4 years ago by kmorgan

  • Status changed from closed to reopened
  • Resolution invalid deleted

I can open roots.txt in TextEdit on Mac OS and it looks like (root (1 कृ) ...). TextEdit opens dict.txt as ‡§ï‡•É when I expect कृ. Do I need to write a beginning-of-file marker of some kind?

comment:3 Changed 4 years ago by rme

  • Status changed from reopened to closed
  • Resolution set to worksforme

Your Lisp code is producing correct output. You need to tell TextEdit.app that dict.txt is encoded in UTF-8.

There should be a pop-up menu at the bottom of the open file dialog called "Plain Text Encoding". Select "Unicode (UTF-8)" and the file should open correctly. "Automatic" will guess wrong.

On recent-enough versions of Mac OS X, TextEdit.app will remember the file's encoding by saving it in an extended attribute attached to the file when the file is saved.
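
On the question in comment:2 about a beginning-of-file marker: UTF-8 does have an optional signature, the byte order mark U+FEFF, which some editors use as a hint when guessing an encoding. The following is a hedged sketch, not something proposed by anyone in this ticket. The helper write-utf-8-with-bom is hypothetical, it assumes the :utf-8 external format does not emit a signature on its own, and whether TextEdit's "Automatic" setting honors the mark has not been verified here.

```lisp
(defun write-utf-8-with-bom (string path)
  "Write STRING to PATH as UTF-8, preceded by the U+FEFF signature."
  (with-open-file (out path :direction :output
                            :if-exists :supersede
                            :if-does-not-exist :create
                            :external-format :utf-8)
    ;; U+FEFF is encoded in UTF-8 as the octets EF BB BF.
    (write-char (code-char #xFEFF) out)
    (write-string string out))
  path)
```

The trade-off is that programs which do not expect the signature, possibly including a later READ of the same file, may treat it as an extra character at the start of the text. Choosing the encoding in the open dialog, as described above, avoids changing the file at all.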

comment:4 Changed 4 years ago by kmorgan

"On recent-enough versions of Mac OS X, TextEdit.app will remember the file's encoding by saving it in an extended attribute attached to the file when the file is saved."

That explains why my source file looked fine in TextEdit but the generated one didn't. Thanks for your help.
