30 Matching Annotations
  1. Nov 2022
    1. The btoa() function takes a JavaScript string as a parameter. In JavaScript, strings are represented using the UTF-16 character encoding: in this encoding, strings are represented as a sequence of 16-bit (2 byte) units. Every ASCII character fits into the first byte of one of these units, but many other characters don't. Base64, by design, expects binary data as its input. In terms of JavaScript strings, this means strings in which each character occupies only one byte. So if you pass a string into btoa() containing characters that occupy more than one byte, you will get an error, because this is not considered binary data.
    2. If you need to encode Unicode text as ASCII using btoa(), one option is to convert the string such that each 16-bit unit occupies only one byte.
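A minimal sketch of the conversion the two passages above describe, assuming a runtime with global btoa() and atob() (browsers, recent Node.js). The helper names toBinaryString and fromBinaryString are hypothetical, and the low-byte-first order is an arbitrary choice made here so the round trip is deterministic:

```javascript
// Split every 16-bit code unit into two one-byte characters (low byte first),
// so the resulting string only contains characters in the range 0x00-0xFF.
function toBinaryString(text) {
  let binary = "";
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    binary += String.fromCharCode(unit & 0xff, unit >> 8);
  }
  return binary;
}

// Reverse step: join each pair of one-byte characters back into one code unit.
function fromBinaryString(binary) {
  let text = "";
  for (let i = 0; i < binary.length; i += 2) {
    const low = binary.charCodeAt(i);
    const high = binary.charCodeAt(i + 1);
    text += String.fromCharCode(low | (high << 8));
  }
  return text;
}

const original = "\u2713 \u00E0 la mode"; // contains U+2713, which needs two bytes
const encoded = btoa(toBinaryString(original));
const decoded = fromBinaryString(atob(encoded));
```

Anything that decodes the result has to apply the same byte order. Encoding the text as UTF-8 bytes first (for example with TextEncoder) is a common alternative, but it is a different technique from the one quoted above.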
    1. Thus the replacement character is now only seen for encoding errors, such as invalid UTF-8.
    2. At one time the replacement character was often used when there was no glyph available in a font for that character. However, most modern text rendering systems instead use a font's .notdef character, which in most cases is an empty box (or "?" or "X" in a box[5]), sometimes called a "tofu" (this browser displays 􏿾). There is no Unicode code point for this symbol.
    3. The replacement character � (often displayed as a black rhombus with a white question mark) is a symbol found in the Unicode standard at code point U+FFFD in the Specials table. It is used to indicate problems when a system is unable to render a stream of data to a correct symbol.[4] It is usually seen when the data is invalid and does not match any character.
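To illustrate the encoding-error case described above, here is a small sketch using the standard TextDecoder API. The byte sequence is an invented example: 0xFF can never appear in well-formed UTF-8, so a lenient decoder substitutes U+FFFD for it, while a decoder created with the fatal option rejects the input instead.

```javascript
// "Hi" followed by 0xFF, a byte that is never valid in UTF-8.
const invalidBytes = new Uint8Array([0x48, 0x69, 0xff]);

// Default (lenient) decoding replaces the bad byte with U+FFFD.
const lenient = new TextDecoder("utf-8").decode(invalidBytes);

// With fatal: true the decoder throws instead of substituting.
let strictFailed = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(invalidBytes);
} catch (err) {
  strictFailed = true;
}

const containsReplacement = lenient.includes("\uFFFD");
```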
    1. By the way, I am not talking about � (replacement character). This one is displayed when a Unicode character could not be correctly decoded from a data stream. It does not necessarily produce the same glyph.
    2. replacement glyph
    3. U+25A1 □ WHITE SQUARE may be used to represent a missing ideograph

      apparently distinct from: Unicode replacement character (U+FFFD)

    1. However after doing a bit of testing I see that this character is not used to represent missing glyphs on either my Windows 7 computer or the Android phone I've tested with (Motorola Atrix).
    2. The Unicode replacement character sounds promising when reading about it on Wikipedia: "It is used to indicate problems when a system is not able to render a stream of data to a correct symbol. It is most commonly seen when a font does not contain a character, but is also seen when the data is invalid and does not match any character."
  2. Oct 2022
    1. Of course, if super-intelligent aliens arrive on our planet, bearing a writing system with billions of characters, I will withdraw this proposal and donate the name "UTF-64" to the Unicode Consortium.
  3. Aug 2022
    1. Unicode is developed on the basis of the Universal Character Set (UCS) standard, and is also published in book form[1]

      UTF-8 is one of the encodings of the Unicode character set

  4. Apr 2022
  5. Mar 2022
  6. Dec 2021
    1. Here are the single characters which can be normalised down to a valid TLD. They're mostly country codes, but there are a few interesting exceptions:

      ㏕ - US Military
      ℡ - .tel registry
      № - Norway
      ㍳ - Australia
      ㍷ - Dominica
      ㎀ - Panama
      ㎁ - Namibia
      ㎃ - Morocco
      ㎊ - French Polynesia
      ㎋ - Norfolk Island
      ㎏ - Kyrgyzstan
      ㎖ - Mali
      ㎙ - Federated States of Micronesia
      fi - Finland
      ㎜ - Myanmar
      ㎝ - Cameroon
      ㎞ & ㏎ - Comoros
      ㎰ - Palestine
      ㎳ - Montserrat
      ㎷ & ㎹ - Republic of Maldives
      ㎺ - Palau
      ㎽ & ㎿ - Malawi
      ㏄ - Cocos (Keeling) Islands
      ㏅ - Democratic Republic of Congo
      ㏉ - Guyana
      ㏗ - Philippines
      ㏘ - Saint Pierre and Miquelon
      ㏚ - Puerto Rico
      ㏛ - Suriname
      ㏜ - El Salvador
      ℠ - San Marino
      ™ - Turkmenistan
      st & ſt - São Tomé and Príncipe
      ㎇ - Great Britain (Obsolete)
      ß - South Sudan (Not available)
      ㏌ - India and Indiana (subdomain of .us)
      Ⅵ & ⅵ - Virgin Islands and Virginia (subdomain of .us)
      fl - Florida (subdomain of .us)
      ㎚ - New Mexico (subdomain of .us)
      ㎵ - Nevada (subdomain of .us)
      ㍵ - As part of .ovh
      
    2. Nestling among the "Letterlike Symbols" are two curious entries. Both of these are single characters:

      • Telephone symbol - ℡
      • Numero Sign - №

      What's interesting is both .tel and .no are Top-Level-Domains (TLD) on the Domain Name System (DNS).

      So my contact site - https://edent.tel/ - can be written as - https://edent.℡/

      And the Norwegian domain name registry NORID can be accessed at https://www.norid.№/

      Copy and paste those links - they work in any browser!
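The folding behind the two passages above can be approximated with String.prototype.normalize. This is only a sketch: it applies NFKC compatibility normalisation followed by lowercasing, on the assumption that this is close to what browsers do when they map a hostname. Their actual IDNA processing has additional rules, and the sample list is a handful of characters picked from the quoted table.

```javascript
// A few single characters from the list above, written as escapes so the
// exact code points are unambiguous.
const candidates = [
  "\u2121", // TELEPHONE SIGN
  "\u2116", // NUMERO SIGN
  "\u2122", // TRADE MARK SIGN
  "\u338F", // SQUARE KG
  "\uFB01", // LATIN SMALL LIGATURE FI
];

// NFKC replaces compatibility characters with their plain equivalents;
// lowercasing then gives the form a domain label would use.
function foldToLabel(ch) {
  return ch.normalize("NFKC").toLowerCase();
}

const labels = candidates.map(foldToLabel);

// Each candidate is a single code point, yet folds to a two- or three-letter label.
const allSingleCodePoints = candidates.every((ch) => [...ch].length === 1);
```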

  7. Jun 2021
    1. Through a link called the "Property Value Alias", Unicode has made a 1:1 connection between each script it defines and that script's ISO 15924 code.
  8. Apr 2021
  9. Feb 2021
    1. But the circle on its own doesn’t seem to be available as a nonspacing diacritic in Unicode. Bugger.
  10. Sep 2020
    1. The value of dotAll is a Boolean and true if the "s" flag was used; otherwise, false. The "s" flag indicates that the dot special character (".") should additionally match the following line terminator ("newline") characters in a string, which it would not match otherwise:

      • U+000A LINE FEED (LF) ("\n")
      • U+000D CARRIAGE RETURN (CR) ("\r")
      • U+2028 LINE SEPARATOR
      • U+2029 PARAGRAPH SEPARATOR

      This effectively means the dot will match any character on the Unicode Basic Multilingual Plane (BMP). To allow it to match astral characters, the "u" (unicode) flag should be used. Using both flags in conjunction allows the dot to match any Unicode character, without exceptions.
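The behaviour described above can be checked with a few small patterns. This sketch assumes an engine that supports the "s" flag (ES2018 or later); the variable names are illustrative only.

```javascript
const withoutS = /a.b/;
const withS = /a.b/s;

// The dotAll property reflects whether the "s" flag was used.
const flagValues = [withoutS.dotAll, withS.dotAll];

// Without "s" the dot skips line terminators; with "s" it matches them.
const lineFeedPlain = withoutS.test("a\nb");
const lineFeedDotAll = withS.test("a\nb");
const lineSeparatorDotAll = withS.test("a\u2028b");

// An astral character (here U+1F600) is two UTF-16 code units, so a single
// dot only matches all of it when the "u" flag is also set.
const astral = "\u{1F600}";
const astralWithSOnly = /^.$/s.test(astral);
const astralWithSAndU = /^.$/su.test(astral);
```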
  11. Jun 2020
  12. Feb 2020
  13. Oct 2018
  14. Sep 2018
  15. Sep 2015
  16. Apr 2015
    1. This part of the Character Model for the World Wide Web covers string matching—the process by which a specification or implementation defines whether two string values are the same or different from one another.
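As one small illustration of why "the same or different" needs a definition, the sketch below compares two spellings of "é". The helper sameAfterNfc is hypothetical, and comparing after NFC normalisation is just one possible matching rule, not necessarily the one the quoted document recommends.

```javascript
const precomposed = "\u00E9"; // single code point: e with acute
const decomposed = "e\u0301"; // letter e followed by a combining acute accent

// A code-unit-by-code-unit comparison treats the two spellings as different.
const rawEqual = precomposed === decomposed;

// Normalising both sides to NFC first makes them compare equal.
function sameAfterNfc(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

const normalisedEqual = sameAfterNfc(precomposed, decomposed);
```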