Opened 11 years ago

Closed 11 years ago

#363 closed enhancement (fixed)

German umlaut #\Latin_Small_Letter_Sharp_S is not alpha-char-p

Reported by: bbeu Owned by: gz
Priority: normal Milestone:
Component: ANSI CL Compliance Version: trunk
Keywords: umlaut, alpha-char-p Cc:

Description (last modified by gb)

German umlaut ß (#\Latin_Small_Letter_Sharp_S) is not alpha-char-p

CL-USER> (lisp-implementation-version)

"Version 1.3-dev-r11173M-trunk (DarwinX8664)"

CL-USER> (map 'list #'alpha-char-p "äöüÄÖÜß")

(T T T T T T NIL)

CL-USER> (char-code #\ß)

223

CL-USER> (code-char 223)

#\Latin_Small_Letter_Sharp_S

SBCL does it right:

* (lisp-implementation-version)

"1.0.20"

* (map 'list #'alpha-char-p "äöüÄÖÜß")

(T T T T T T T)

Change History (5)

comment:1 Changed 11 years ago by gb

  • Component changed from IDE to ANSI CL Compliance
  • Description modified (diff)
  • Resolution set to invalid
  • Status changed from new to closed
  • Type changed from defect to enhancement

Well, since the behavior of ALPHA-CHAR-P on non STANDARD-CHARs is implementation-dependent, all that we can really say is that the fact that ALPHA-CHAR-P isn't true of #\Latin_Small_Letter_Sharp_S in CCL doesn't meet your expectations. CCL's behavior isn't arbitrary, and I think that it's more consistent than what you may be expecting.

ALPHA-CHAR-P should be true of characters that have "case".

A character "has case" if it's either an upper-case character (UPPER-CASE-P is true of it) or a lower-case character (LOWER-CASE-P is true of it.)

In CCL:

(LOWER-CASE-P X) is true if (CHAR-UPCASE X) returns Y, Y is not EQL to X, and (CHAR-DOWNCASE Y) is EQL to X. Similarly,

(UPPER-CASE-P X) is true if (CHAR-DOWNCASE X) returns Y, Y is not EQL to X, and (CHAR-UPCASE Y) is EQL to X.

The requirement that there be a 1:1 mapping between upper- and lower-case characters is imposed by CLHS 13.1.4.3.4.
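The round-trip rule described above can be sketched in portable Common Lisp. This is only an illustration of the rule as stated in this comment, not CCL's actual source; the names ROUND-TRIP-LOWER-CASE-P and ROUND-TRIP-UPPER-CASE-P are made up for the example.

```lisp
;; Illustrative only: these helper names are hypothetical, not CCL internals.

(defun round-trip-lower-case-p (x)
  "True if X upcases to a different character Y that downcases back to X."
  (let ((y (char-upcase x)))
    (and (not (eql y x))
         (eql (char-downcase y) x))))

(defun round-trip-upper-case-p (x)
  "True if X downcases to a different character Y that upcases back to X."
  (let ((y (char-downcase x)))
    (and (not (eql y x))
         (eql (char-upcase y) x))))
```

For standard characters, (round-trip-lower-case-p #\a) and (round-trip-upper-case-p #\Z) are true, while both tests are false for #\1. A character with no single-character form in the other case fails both tests, provided CHAR-UPCASE and CHAR-DOWNCASE return it unchanged.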

In Unicode (see http://www.unicode.org/Public/UNIDATA/CaseFolding.txt), the opposite-case form of #\Latin_Small_Letter_Sharp_S is the sequence of two "s" characters (each with code #x73). In other words, there is no 1:1 mapping, so neither UPPER-CASE-P nor LOWER-CASE-P can be true of #\Latin_Small_Letter_Sharp_S. It seems consistent to conclude that the character is not a "character with case" and that ALPHA-CHAR-P should therefore be false of it.

I understand that this is somewhat unintuitive, since informally it's easy to think of that character as being "alphabetic". An implementation could decide that some character "has case" even though neither UPPER-CASE-P nor LOWER-CASE-P can be true of it, but that strikes me as being both unintuitive and inconsistent, and it's difficult for me to believe that the behavior that you expect is somehow "right."

comment:2 Changed 11 years ago by gz

The spec says that characters that have case should be alpha-char-p, but it doesn't say that those are the only characters that can be alpha-char-p. In fact it's pretty careful to say otherwise, e.g. section 13.1.4.3 explicitly states that characters that have case are a subset of all alpha-char-p characters, and the glossary definition of alphabetic also explicitly mentions that non-cased alphabetic characters are allowed. It was clear even then that there would be alphabetic characters in other languages that don't meet the strict definition of characters with case, and that in and of itself is not a reason to consider them non-alphabetic.

There may be some implementation reasons why it would be difficult to return the expected results for all languages, but I don't buy the argument that it would somehow be wrong to do so.
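The subset relationship (every character with case is alphabetic, but not necessarily the other way round) can be checked with a short portable loop. This is a sketch; CASED-BUT-NOT-ALPHABETIC is a hypothetical helper name, and the optional LIMIT argument exists only to keep the scan short.

```lisp
;; Hypothetical helper: collects characters with codes below LIMIT that
;; have case (BOTH-CASE-P) but are not ALPHA-CHAR-P.
(defun cased-but-not-alphabetic (&optional (limit char-code-limit))
  (loop for code from 0 below (min limit char-code-limit)
        for char = (code-char code)   ; CODE-CHAR may return NIL
        when (and char
                  (both-case-p char)
                  (not (alpha-char-p char)))
          collect char))
```

On a conforming implementation (cased-but-not-alphabetic 128) should return NIL. The reverse situation, a character that is ALPHA-CHAR-P without having case, is what the spec leaves open.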

comment:3 Changed 11 years ago by bbeu

  • Resolution invalid deleted
  • Status changed from closed to reopened

Well, it's new but there is now a Capital Sharp S in Unicode.

http://www.unicode.org/versions/Unicode5.1.0/

  1. Notable Changes From Unicode 5.0.0 to Unicode 5.1.0

Characters

  • Additions to Malayalam and Myanmar; characters to complete support of Indic scripts
  • New symbols: Mahjong, editorial punctuation marks, significant additions for math
  • Capital Sharp S for German

comment:4 Changed 11 years ago by gb

http://www.unicode.org/versions/Unicode5.1.0/#Character_Assignment_Overview says:

"The Latin additions include U+1E9E LATIN CAPITAL LETTER SHARP S for use in German. The recommended uppercase form for most casing operations on U+00DF LATIN SMALL LETTER SHARP S continues to be "SS", as a capital sharp s is only used in restricted circumstances."

So, LATIN_SMALL_LETTER_SHARP_S is not retroactively changed into a "character with case" by changes in 5.1 and is not required to be ALPHA-CHAR-P.

As gz points out, it's perfectly legal for characters that don't "have case" to be considered ALPHA-CHAR-P. (Or not.) It's pretty hard to portably assume anything about which non-STANDARD-CHARs are ALPHA-CHAR-P. Some implementations (LW and ALLEGRO) define CHAR-CODE-LIMIT to be 65536; I believe that they both claim to support the BMP (UCS-2) subset of Unicode.

(let* ((n 0))
  (dotimes (i 65536 n)
    (let* ((char (code-char i)))
      (when (and char (alpha-char-p char)) (incf n)))))

Lispworks:     114
Allegro:     65428
CLISP:       45654
SBCL:        45851
CCL:          1746

Unicode defines 48381 characters with codes less than 65536 to have the "ALPHABETIC" attribute (and all characters that have a 1:1 case mapping have this attribute.) None of the answers above is any more or less correct than any other; I don't know what criteria other implementations use, but I would agree that restricting ALPHA-CHAR-P to be true only when it's absolutely required to be isn't particularly useful, especially since there is a standard (Unicode) that already provides a definition of what characters are ALPHABETIC and what aren't.

Using the Unicode ALPHABETIC attribute seems like the sanest and least arbitrary approach (and it would cause both #\Latin_Small_Letter_Sharp_S and the new large sharp_s to be ALPHA-CHAR-P.)

comment:5 Changed 11 years ago by rme

  • Resolution set to fixed
  • Status changed from reopened to closed

r11224 (bitmap for alpha-char-p), r11226 (use said bitmap)
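For readers unfamiliar with the approach, a bitmap-backed predicate might look roughly like the sketch below. It is not the code from r11224 or r11226: every name here is hypothetical, and the range list is a deliberately tiny sample, where a real table would be generated from the Unicode data files.

```lisp
;; Hypothetical sketch of a bitmap lookup; not the code committed in r11224/r11226.

(defparameter *sample-alphabetic-ranges*
  '((#x41 . #x5A)        ; A-Z
    (#x61 . #x7A)        ; a-z
    (#xDF . #xDF)        ; LATIN SMALL LETTER SHARP S
    (#x1E9E . #x1E9E))   ; LATIN CAPITAL LETTER SHARP S
  "Inclusive (START . END) code ranges; a tiny illustrative sample only.")

(defun make-alpha-bitmap (ranges size)
  "Return a simple bit vector of SIZE bits with a 1 for every code in RANGES."
  (let ((bits (make-array size :element-type 'bit :initial-element 0)))
    (loop for (start . end) in ranges
          do (fill bits 1 :start start :end (1+ end)))
    bits))

(defparameter *alpha-bitmap*
  (make-alpha-bitmap *sample-alphabetic-ranges* #x10000))

(defun bitmap-alpha-code-p (code)
  "True if CODE is in range and marked alphabetic in *ALPHA-BITMAP*."
  (and (< -1 code (length *alpha-bitmap*))
       (= 1 (sbit *alpha-bitmap* code))))
```

With this sample table, (bitmap-alpha-code-p 223) and (bitmap-alpha-code-p #x1E9E) are both true, while (bitmap-alpha-code-p (char-code #\1)) is false. The lookup costs one bit per code point and needs no case-mapping calls at run time.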
