Encodings And Character Display

From Notepad++ Wiki
Wrong display of non-ASCII characters


There have been reports of users typing non-Western characters (Russian, Hebrew, etc.) and getting garbled text (strange characters).

Old plugins had been considered a likely cause, but as of Notepad++ v5.4.2 this no longer seems to hold.

Since there are many ways to represent the same character, called encodings, let's delve a bit into the matter. Numerous examples follow the general description.

Encodings : a short overview

(contributed by guy038)

Notepad++ supports many character encodings:

A Single Byte encoding, also called, by usage, Single Byte Character Set (SBCS)

For each of these, one character is encoded as one byte. These are the oldest encodings used by computers. Since one byte offers only 256 values, there has to be at least one encoding per language, more or less:

  • The Windows ANSI Code Pages : Windows-nnnn
  • The Windows OEM Code Pages : OEM nnn
  • The ISO Code Pages : ISO8859-n and ISO8859-nn ( No code-point in the range [\x80 - \x9F] )
  • The Cyrillic Code Pages : KOI8-R, KOI8-U, Macintosh
  • The Thai Code Page : TIS-620 ( Windows-874 )

See the different SBCS encodings on top of these three Web pages: http://msdn.microsoft.com/en-us/goglobal/bb964654, http://msdn.microsoft.com/en-us/goglobal/bb964654 and http://msdn.microsoft.com/en-us/goglobal/bb964654
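As a small illustration of why one single-byte encoding per language is needed, the sketch below decodes the same byte value under four code pages. The codec names (cp1252, cp1251, cp866, koi8_r) are Python's names for these code pages, and the helper decode_everywhere is a hypothetical name used only for this example.

```python
# The same byte means a different character in each single-byte code page.
# Codec names are Python's; decode_everywhere is a hypothetical helper.
CODE_PAGES = ["cp1252", "cp1251", "cp866", "koi8_r"]


def decode_everywhere(raw: bytes) -> dict:
    """Decode the same bytes under every code page listed in CODE_PAGES."""
    return {cp: raw.decode(cp, errors="replace") for cp in CODE_PAGES}


if __name__ == "__main__":
    for code_page, text in decode_everywhere(b"\xe9").items():
        print(f"{code_page:8} 0xE9 -> {text}")
```

A file contains only the byte; which character appears depends entirely on the code page the editor assumes.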

Double Byte encodings, also called, by usage, Double Byte Character Set (DBCS)

Some of them preexisted Unicode, and were designed to encode character sets with a large number of characters, mainly found in Far East languages with ideographic or syllabic scripts:

  • The 2 Bytes Universal Character Set : UCS-2 Big Endian and UCS-2 Little Endian
  • The Japanese Code Page : Shift-JIS ( Windows-932 )
  • The Chinese Code Pages : Simplified Chinese GB2312 ( Windows-936 ), Traditional Chinese Big5 ( Windows-950 )
  • The Korean Code Pages : Windows-949, EUC-KR

Note:

Formally, the Chinese, Japanese and Korean (CJK) encodings should be named Multi Byte encodings because a character, in these encodings, can be coded with one or two bytes. When a character is encoded in two bytes, the first one is always > 0x7F.
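The note above can be checked with a short sketch. It assumes Python's codec names for these code pages (shift_jis, gb2312, big5, euc_kr); the sample characters and the helper names are chosen only for illustration.

```python
# ASCII characters take one byte, ideographs/syllables take two,
# and the lead byte of a two-byte character is above 0x7F.
SAMPLES = {
    "shift_jis": "日",
    "gb2312": "中",
    "big5": "中",
    "euc_kr": "한",
}


def byte_lengths(codec: str, text: str) -> list:
    """Return the encoded length, in bytes, of each character of text."""
    return [len(ch.encode(codec)) for ch in text]


def lead_byte(codec: str, ch: str) -> int:
    """Return the first byte of the encoded form of a single character."""
    return ch.encode(codec)[0]


if __name__ == "__main__":
    for codec, ch in SAMPLES.items():
        print(codec, byte_lengths(codec, "A" + ch), hex(lead_byte(codec, ch)))
```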

See the different DBCS CJK encodings at the bottom of the Web page http://msdn.microsoft.com/en-us/goglobal/bb964654

In these four Web pages, the first byte of a two-byte sequence, called the lead byte, is indicated by a light grey background, and unused lead bytes are identified by a darker grey background.

Concerning UNICODE characters, consult the Unicode Character Chart at the Web pages below:

http://www.unicode.org/charts/ and http://www.unicode.org/charts/charindex.html

See, also, the UCS-2 Big Endian, UCS-2 Little Endian and UTF-16 encodings, at the Web page http://en.wikipedia.org/wiki/UCS-2

A Multi Byte encoding, also called, by usage, Multi Byte Character Set (MBCS)

The UCS 8-bit Transformation Format : UTF-8 without BOM and UTF-8

The UTF-8 encoding is the most efficient encoding for languages with a small alphabet, such as European languages, Hebrew or Arabic, because, although it is able to encode ALL Unicode characters, between 0x0 and 0x10FFFF, the file produced is generally smaller than the one produced with a UCS-2 encoding. See its full description at the Web page http://en.wikipedia.org/wiki/UTF-8

By contrast, for Asian languages, with numerous ideographic characters, the best encodings are UCS-2BE or UCS-2LE.
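The size difference can be measured directly. The sketch below is only an illustration: it uses Python's utf-16-le codec as a stand-in for UCS-2 Little Endian (the two are identical for characters inside the Basic Multilingual Plane), and the sample strings are arbitrary.

```python
# Compare the encoded size of the same text in UTF-8 and in UCS-2 LE.
# Assumption: "utf-16-le" stands in for UCS-2 LE (same bytes inside the BMP).
SAMPLES = {
    "English": "Hello, world",
    "Hebrew": "שלום עולם",
    "Japanese": "日本語のテキスト",
}


def encoded_sizes(text: str) -> tuple:
    """Return (size in UTF-8, size in UCS-2 LE), both in bytes, without BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


if __name__ == "__main__":
    for language, text in SAMPLES.items():
        utf8_size, ucs2_size = encoded_sizes(text)
        print(f"{language:9} UTF-8: {utf8_size:3} bytes   UCS-2: {ucs2_size:3} bytes")
```

ASCII characters cost one byte in UTF-8 against two in UCS-2, most alphabetic scripts cost two in both, and CJK ideographs cost three in UTF-8 against two in UCS-2.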

Going out of the Basic Multilingual Plane

The Double-Byte encodings UTF-16BE and UTF-16LE, NOT used in N++, can represent ALL Unicode characters, with code points between 0x0 and 0x10FFFF. This is more than the traditional 0x0 - 0xFFFF range, which is allocated to nearly all letters, symbols and ideograms currently in use on Earth. Unicode characters whose code point is above 0xFFFF are represented in a pseudo 32-bit form with the "surrogate pair" mechanism. See the Web page http://en.wikipedia.org/wiki/Utf-16#Code_points_U.2B10000_to_U.2B10FFFF. This is used for encoding ancient Egyptian hieroglyphs, for instance.

By contrast, the Double-Byte encodings UCS-2 Big Endian ( UCS-2BE ) and UCS-2 Little Endian ( UCS-2LE ), used by Notepad++, can ONLY represent characters with code points under 0x10000, which belong to the Basic Multilingual Plane ( BMP ) of Unicode. See the Web page http://en.wikipedia.org/wiki/Utf-16#Code_points_U.2B0000_to_U.2BD7FF_and_U.2BE000_to_U.2BFFFF
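The surrogate pair mechanism mentioned above is plain arithmetic, sketched here following the UTF-16 definition. The function name to_surrogate_pair is hypothetical; the result is compared against Python's built-in utf-16-be codec.

```python
# Split a code point above 0xFFFF into a UTF-16 surrogate pair.
def to_surrogate_pair(code_point: int) -> tuple:
    """Return (high surrogate, low surrogate) for a code point outside the BMP."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is inside the BMP or out of range")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x10480)
    print(f"U+10480 -> {high:04X} {low:04X}")
    print(chr(0x10480).encode("utf-16-be").hex(" ").upper())
```

A UCS-2 editor sees such a pair as two separate 16-bit units, neither of which is a character on its own.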

However, in practice, this limitation is NOT a problem, because very few fonts are able to display characters, with code points above \xFFFF. Assessing the capabilities of fonts can be done using free software, like Arjan Mels' font viewer ( http://www.windows8downloads.com/win8-arjan-mels-font-viewer-xivmapcj/ ).

The Multi Byte encoding UTF-8 or UTF-8 without BOM, used by Notepad++, displays ALL Unicode characters, with code points between 0x0 and 0x10FFFF.

So, in order to display or write characters outside the BMP, the following four conditions must hold:

  1. Install a special font, which covers code points above 0xFFFF, on your operating system. For example, download the Andagii font at http://www.i18nguy.com/unicode/unicode-font.html
    (See Osmanya characters, of the Unicode Plane 1, at the Web page http://www.i18nguy.com/unicode-example-plane1.html )
  2. If necessary, change some registry settings in your operating system to support Unicode Supplementary Characters
    (See http://www.i18nguy.com/surrogates.html for Windows NT, Windows 2000 or Windows XP)
  3. Define this font as the "Default Style" font, of "Global Styles" language, in Notepad++
  4. Set the UTF-8 or the UTF-8 without BOM encoding, in Notepad++

For example, the 48 Osmanya code points, from 0x10480 to 0x104AF, are coded from F0 90 92 80 to F0 90 92 AF in a UTF-8 file, and are displayed, in Notepad++, as in the PDF file http://www.unicode.org/charts/PDF/U10480.pdf
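The byte values quoted for the Osmanya range can be reproduced by hand. The sketch below applies the standard four-byte UTF-8 layout (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx); utf8_four_bytes is a hypothetical helper name, and the result is cross-checked against Python's own encoder.

```python
# Encode a code point above 0xFFFF as a four-byte UTF-8 sequence by hand.
def utf8_four_bytes(code_point: int) -> bytes:
    """Return the UTF-8 encoding of a code point in 0x10000..0x10FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP use four bytes")
    return bytes([
        0xF0 | (code_point >> 18),            # 11110xxx: top 3 bits
        0x80 | ((code_point >> 12) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | ((code_point >> 6) & 0x3F),    # 10xxxxxx: next 6 bits
        0x80 | (code_point & 0x3F),           # 10xxxxxx: last 6 bits
    ])


if __name__ == "__main__":
    for cp in (0x10480, 0x104AF):
        print(f"U+{cp:05X} -> {utf8_four_bytes(cp).hex(' ').upper()}")
```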

External links for further documentation

You may find further documentation, at the different addresses below:

http://en.wikipedia.org/wiki/Unicode

http://en.wikipedia.org/wiki/UTF-16

http://en.wikipedia.org/wiki/UTF-8

http://en.wikipedia.org/wiki/UTF-32


http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane

http://en.wikipedia.org/wiki/Byte_order_mark

http://en.wikipedia.org/wiki/Code_page

http://en.wikipedia.org/wiki/Endianness

http://en.wikipedia.org/wiki/Unicode_Specials


http://www.unicode.org/charts/

http://www.unicode.org/charts/charindex.html

http://www.unicode.org/versions/Unicode6.3.0/

http://www.unicode.org/reports/tr10/tr10-28.html , about the UNICODE collation algorithm

http://www.unicode.org/reports/tr18/ , about the UNICODE regular expressions


http://www.i18nguy.com/unicode/codepages.html

http://en.wikibooks.org/wiki/Unicode/Character_reference

http://www.lingua-systems.com/knowledge/unicode-mappings

http://www.fileformat.info/info/charset/index.htm

http://www.cogsci.ed.ac.uk/%7erichard/utf-8.html , for converting between UNICODE, UTF-8 and UTF-16

And for French-speaking users, this article is oriented toward the Ada language, but very detailed:

http://bulleforum.net/decodeur-encodeur-unicode-en-ada-t4295.html

http://gskinner.com/RegExr/ (for testing regular expressions)

Also on this wiki:

http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Convert_Or_Encode%3F

http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Encodings_And_Character_Display

What to check when characters display in an unusable way?

Since version 6.5.5, Notepad++ tries hard to automatically detect the encoding of any document, so the following may not be very useful any more. But just in case... By the way, this detection can be toggled from Settings -> Preferences -> MISC -> Autodetect character encoding
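One of the simplest signals any detector can use is the byte order mark at the start of a file. The sketch below is not Notepad++'s actual detection code, only an illustration of the idea; sniff_bom and the labels are hypothetical, while the BOM constants come from Python's codecs module.

```python
import codecs

# Illustrative BOM check only; this is NOT how Notepad++ itself is implemented.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UCS-2 Little Endian"),
    (codecs.BOM_UTF16_BE, "UCS-2 Big Endian"),
]


def sniff_bom(data: bytes):
    """Return an encoding label if data starts with a known BOM, else None."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None  # no BOM: ANSI, UTF-8 without BOM, or something else


if __name__ == "__main__":
    print(sniff_bom(b"\xef\xbb\xbfabc"))
    print(sniff_bom(b"\xff\xfea\x00"))
    print(sniff_bom(b"abc"))
```

When no BOM is present, a detector has to guess from the content, which is where misdetection can occur.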


There are a few things to check:

  1. Encoding for the file may need to be set to UCS2 Little Endian, on the Format menu. This is necessary when there is no OEM code page for your language, as is the case for Esperanto.
  2. Does the font selected in Settings -> Styler Configurator -> Global Styles , Default Style support the characters you wish to type?
  3. Make sure advanced text services are on. This will also enable you to show the language bar, which will help in the step below.
  4. What is the input language for your keyboard? Notepad++ was shown to reset it according to the current locale (still true in v6.6.6). Press Left Alt+Shift repeatedly until the right input language is found.
  5. If you get the right character set, but the wrong characters, repeatedly press Ctrl+Shift until the right keyboard layout is back.

Notepad++ not only provides standard (OEM437) support in ANSI mode, but also allows selecting a different code page. All known OEM/Windows/ISO standards are supported. They are to be found in Format -> Character Sets as a series of submenus grouping related languages together.

HTML and XML files allow auto detection of the encoding being used, and Notepad++ uses these mechanisms.
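The declarations referred to here are the encoding attribute of the XML prolog and the charset given in an HTML meta tag. The sketch below shows one way to read them from the first bytes of a file. It is an assumption-laden illustration, not Notepad++'s implementation: declared_encoding and both regular expressions are hypothetical and deliberately simplified.

```python
import re

# Simplified patterns for the XML prolog and the HTML <meta> charset.
XML_DECL = re.compile(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', re.I)
HTML_META = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9._-]+)', re.I)


def declared_encoding(head: bytes):
    """Return the encoding name declared near the start of a file, or None."""
    for pattern in (XML_DECL, HTML_META):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None


if __name__ == "__main__":
    print(declared_encoding(b'<?xml version="1.0" encoding="windows-1251"?>'))
    print(declared_encoding(b'<meta charset="utf-8">'))
```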

Please note that, as of v6.6.6, conversion from some Unicode format to a non-default ANSI code page is not supported.

If the encoding is still wrong

If you need to deal with an encoding different from these, for instance an EBCDIC code page, or if the font you selected doesn't switch to the right character set on its own, the following macro may be useful. It is to be inserted in shortcuts.xml inside the <Macros> tag:

<Macro name="setCharSet" Ctrl="yes" Alt="yes" Shift="no" Key="100">
   <Action type="0" message="2066" wParam="32" lParam="charset" sParam=""/>
</Macro>

Replace charset with the integer value you need; as usual, it must be enclosed in double quotes. This can be any OEM or Windows code page number.

NOTE: you must fire this macro, and then change to some font. You may have to change to some bogus font and then back to the one you had.

If the style you wish to modify is not the default style, please look up the relevant styleID in styler.xml, and then replace "32" with that number.

Cyrillic scripts specific issues

For writers in Cyrillic scripts: three different non-Unicode encodings are in common use:

  1. CP866 (DOS, OEM, ASCII). This dates from the DOS era; Microsoft still uses it in *.bat files, even in Windows 7 and Windows Server 2008 R2.
  2. KOI8-R. This comes from the Unix world. Many Unix/Linux servers still use it, but more and more of them are moving to UTF-8.
  3. CP1251 (ANSI). This is the Russian Windows code page. Modern Windows versions use Unicode, but programs that do not understand it still use code page CP1251.

Check whether the font you use has the encoding you expect. If not, then the font needs to be changed.

The following chart shows various mangling patterns for Cyrillic text, depending on how it is encoded and decoded.

Various renderings of Ещё раз ("Once again" in Russian)

Initial encoding | Decoding | Rendering without UTF-8 conversion | Rendering with UTF-8 conversion
CP 866 (Hex: 85 e9 f1 20 e0 a0 a7) | CP 866 | Ещё раз | ┬Е├й├▒ ├а┬а┬з
CP 866 | KOI-8 | ┘ИЯ Ю═ї | б┘ц╘ц╠ б═ц═цї
CP 866 | CP 1251 | …йс а § | …éñ Г В В§
CP 866 | CP 850 | àÚ± ÓẠ | ┬à├®├░ ├á┬á┬º
KOI-8 (Hex: e5 dd a3 20 d2 c1 da) | CP 866 | х▌г ╥┴┌ | ┬е┬Э└г ├Т┬а┬Ъ
KOI-8 | KOI-8 | Ещё раз | ц╔ц²бё ц▓ц│䑆
KOI-8 | CP 1251 | еЭЈ ТБЪ | ГµГќВЈ Г’ГЃГљ
KOI-8 | CP 850 | Õ¦ú Ê┴┌ | ├Á├Ø┬ú ├Æ├ü├Ü
CP 1251 (Hex: c5 f9 b8 20 f0 e0 e7) | CP 866 | ┼∙╕ ≡рч | ├Е├╣┬╕ ├░├а├з
CP 1251 | KOI-8 | еЫ╦ ПЮГ | ц┘ц╧б╦ ц╟ц═цї
CP 1251 | CP 1251 | Ещё раз | Г…Г№Вё Г°Г Г§
CP 1251 | CP 850 | ┼¨© ­Óþ | ├à├╣┬© ├░├á├º
Unicode (UCS-2 LE) (Hex: 15 04 49 04 51 04 20 00 40 04 30 04 37 04) | CP 866 | §♦I♦Q♦ @♦0♦7♦ | ╨Х╤З╤С ╤А╨░╨╖
Unicode (UCS-2 LE) | KOI-8 | §♦I♦Q♦ @♦0♦7♦ | п┙я┴я▒ я─п╟Я╥
Unicode (UCS-2 LE) | CP 1251 | §♦I♦Q♦ @♦0♦7♦ | Ещё СЂР°Р·
Unicode (UCS-2 LE) | CP 850 | §♦I♦Q♦ @♦0♦7♦ | ðòÐëÐæ ÐÇð░ðÀ
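Mangling of this kind can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the codec names (cp866, koi8_r, cp1251, cp850) are Python's, both helper functions are hypothetical, and the "UTF-8 conversion" column is assumed to mean that the raw bytes were taken as Latin-1 code points and re-encoded as UTF-8 before being viewed, which matches the CP 866 / CP 866 row. It applies to the single-byte rows only, and Python's code page tables may differ in a few positions from those used to build the chart.

```python
# Reproduce mojibake for the single-byte rows of the chart.
TEXT = "Ещё раз"


def view_as(text: str, stored_as: str, viewed_as: str) -> str:
    """Encode text in one code page, then decode the bytes in another."""
    raw = text.encode(stored_as)
    return raw.decode(viewed_as, errors="replace")


def view_after_utf8_conversion(text: str, stored_as: str, viewed_as: str) -> str:
    """Assumed model: bytes read as Latin-1, converted to UTF-8, then viewed."""
    raw = text.encode(stored_as)
    converted = raw.decode("latin-1").encode("utf-8")
    return converted.decode(viewed_as, errors="replace")


if __name__ == "__main__":
    for stored in ("cp866", "koi8_r", "cp1251"):
        print(stored, TEXT.encode(stored).hex(" "))
        for viewed in ("cp866", "koi8_r", "cp1251", "cp850"):
            print("   ", viewed, view_as(TEXT, stored, viewed),
                  "|", view_after_utf8_conversion(TEXT, stored, viewed))
```

Only the rows where the stored and viewed code pages agree, with no intermediate conversion, give back the original text.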

Windows 7

Problems viewing or typing various accented letters have been reported under Windows 7. It has also been reported that running Notepad++ in XP or Vista compatibility mode would solve the issue.