Convert Or Encode?

Why converting and encoding, in the Encoding menu, differ


This article mainly applies to text files that are not written in plain English, i.e. files containing characters beyond the ASCII range. So what is the difference between the two options? Note that using either of them flags the file as changed.

Differences between the options Encoding -> Convert to ... and Encoding -> Encode in ....

The option Convert to... transforms the current file, from its current encoding, into a file with the same contents, translated into the chosen encoding.

Use this option ONLY if all the characters of the file are currently displayed correctly. Characters that cannot be represented in the new encoding will be replaced by a question mark (?).

The contents of the current file, after conversion, are ALWAYS modified.
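
In Python terms, a conversion is a decode of the current bytes into Unicode text, followed by an encode into the target encoding. Here is a minimal sketch of that behaviour (the byte values are those of the test file used later in this article; Notepad++ of course does all of this internally):

  # "Convert to ..." = decode with the current encoding + re-encode with the new one
  raw = b"A\xb0\x92"                  # current ANSI (Windows-1252) bytes of A°’
  text = raw.decode("cp1252")         # interpret the bytes as Unicode text
  new = text.encode("utf-8")          # write them back in the chosen encoding
  print(new.hex(" ").upper())         # 41 C2 B0 E2 80 99 : the bytes really changed

  # A character missing from the target code page becomes a question mark:
  print("\u03a9".encode("cp1252", errors="replace"))   # b'?' : Omega has no slot in cp1252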

The option Encode in .... OR Character sets / .... / .... applies the chosen encoding to the existing bytes of the current file.

Generally, the contents of the current file, after encoding, are NOT modified and ONLY the display changes. This is because

  • All ANSI code pages encode each character as one byte, except for some related to Asian languages, which use one or two bytes
  • Both UCS-2 (or UTF-16) encodings encode each character as two bytes
  • UTF-8 encodes each character as a variable number of bytes, and switching between the with/without BOM variants does change the file, by adding or removing 3 bytes at the beginning.

But, if the current OR the target encoding is UTF-8, UCS-2 Big Endian or UCS-2 Little Endian, then the contents of the current file are modified.

Use this option ONLY if some characters of the file are unreadable or displayed as small boxes.

Conversely, if your current file is already displayed correctly, applying a new encoding will generally display the special characters differently (see the sketch after these examples):

  • If the current encoding of the test file is ANSI, the option Encode in UTF-8 displays the string A followed by the bytes xB1 and x80 shown in the control-character way, as these bytes are not part of a legal UTF-8 sequence.
  • If the current encoding of the test file is UTF-8 without BOM, the option Encode in ANSI, for example, displays the string A±â‚¬, according to the actual contents of the file ( 41 , C2 , B1 , E2 , 82 , AC )
  • If the current encoding of the test file is ANSI, the option Character Sets / Western European / OEM850 displays the string A▒Ç, in DOS CP 850, according to the actual contents of the test file ( 41 , B1 , 80 )
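
A quick way to reproduce these three cases is to decode one and the same byte string with several Python codecs, which is essentially what Encode in ... does (a sketch; note that Python prints the replacement character U+FFFD where Notepad++ shows the xNN boxes):

  raw = b"A\xb1\x80"                            # ANSI (Windows-1252) bytes of A±€
  print(raw.decode("cp1252"))                   # A±€ : the correct, current view
  print(raw.decode("cp850"))                    # A▒Ç : same bytes, read as DOS CP 850
  print(raw.decode("utf-8", errors="replace"))  # A�� : B1 and 80 are illegal in UTF-8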

Encoding is not the whole story

Changing the code page used to view a text file is not only a matter of picking the right menu entry: correct display also depends on the font, presumably the one set in Settings -> Style Configurator -> Global Styles, Default Style. This means you need fonts covering the Norwegian, Turkish, Hungarian (well, Magyar), ... languages installed if you want to correctly display text written in those languages. Notepad++ cannot provide that many fonts, so it leaves the matter with you.

By the way, even though you could send Scintilla the SCI_SETCODEPAGE message to set the code page somehow, this is not what it is intended for. Scintilla's docs state that the message is only honoured for a limited number of code pages, those related to DBCS scripts.

A hands-on example using an actual text file

Setting up the test file

Let's suppose a new file containing ONLY the three characters A°’.

The two characters ° (in HTML, &deg;) and ’ (the typographic apostrophe, U+2019) will be used because they are part of most Microsoft ANSI code pages ( Windows-1252, Windows-1255, ..., except the Thai encoding Windows-874 / TIS-620 ), as well as part of all the different Asian encodings of Notepad++.

The hexadecimal byte values that represent these three characters are, respectively (see the sketch after the table):

Byte contents for various encodings

  Bytes                              Encoding
  41 , B0 , 92                       Microsoft one-byte ANSI encodings ( Windows-nnnn )
  41 , 81 8B , 81 66                 two-byte Japanese Shift-JIS encoding ( Windows-932 )
  41 , A1 E3 , A1 AF                 two-byte Chinese GB2312 encoding ( Windows-936 )
  41 , A1 C6 , A1 AF                 two-byte Korean encoding ( Windows-949 )
  41 , A2 58 , A1 A6                 two-byte Chinese Big5 encoding ( Windows-950 )
  0x000041 , 0x0000B0 , 0x002019     Universal Character Set ( UCS ) code-points ( Unicode )
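
The table above can be checked with Python's codecs, whose names cp932, cp936, cp949 and cp950 correspond to the Windows-932/936/949/950 encodings listed (a minimal sketch):

  s = "A\u00b0\u2019"   # the test string A°’
  for codec in ("cp1252", "cp932", "cp936", "cp949", "cp950"):
      print(codec, s.encode(codec).hex(" ").upper())
  # cp1252 41 B0 92
  # cp932 41 81 8B 81 66
  # cp936 41 A1 E3 A1 AF
  # cp949 41 A1 C6 A1 AF
  # cp950 41 A2 58 A1 A6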

As this file contains characters with code-points > \x7F, then, regardless of its current encoding, if you convert this test file ( menu Encoding / sub-menu Convert to ... ), the real byte contents of the test file become (see the sketch after the table):

Byte contents for various encodings

  Bytes                               Encoding
  41 , B0 , 92                        ANSI
  41 , C2 B0 , E2 80 99               UTF-8 without BOM
  EF BB BF , 41 , C2 B0 , E2 80 99    UTF-8
  FE FF , 00 41 , 00 B0 , 20 19       UCS-2 Big Endian
  FF FE , 41 00 , B0 00 , 19 20       UCS-2 Little Endian
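
These byte sequences, too, can be reproduced in Python, prepending the BOM by hand where the table shows one (a sketch):

  import codecs

  s = "A\u00b0\u2019"   # A°’
  print(s.encode("utf-8").hex(" ").upper())                               # 41 C2 B0 E2 80 99
  print((codecs.BOM_UTF8 + s.encode("utf-8")).hex(" ").upper())           # EF BB BF 41 C2 B0 E2 80 99
  print((codecs.BOM_UTF16_BE + s.encode("utf-16-be")).hex(" ").upper())   # FE FF 00 41 00 B0 20 19
  print((codecs.BOM_UTF16_LE + s.encode("utf-16-le")).hex(" ").upper())   # FF FE 41 00 B0 00 19 20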

IMPORTANT:

If a file is correctly encoded in an Asian encoding, any conversion to the ANSI encoding must FIRST be preceded by a conversion to a Unicode encoding (UTF-8, UCS-2 BE or UCS-2 LE).

In this way, any character with code-point > \x7F and a correct two-byte Asian encoding will be converted properly.
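
In Python terms, such a re-encoding always pivots through Unicode, which is exactly the detour recommended above. A sketch with the Shift-JIS bytes of the test file:

  sjis = b"A\x81\x8b\x81\x66"      # A°’ encoded in Shift-JIS ( Windows-932 )
  text = sjis.decode("cp932")      # step 1: convert to Unicode
  ansi = text.encode("cp1252")     # step 2: convert to the ANSI code page
  print(ansi.hex(" ").upper())     # 41 B0 92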

Notes:

I underlined the words Convert to ... because the option Encode in ... has a different behaviour, which will be discussed later.

The underlined part (the leading bytes EF BB BF, FE FF or FF FE in the table above) represents a HEADER, which is NEVER displayed in Notepad++ and allows Notepad++ to identify the right encoding of a file.

Contents of the test file, after Conversion

In UCS-2 Big Endian:

The header is the Unicode character 0xFEFF, the BOM (Byte Order Mark). If this character is found further on in the file, it stands for the character ZWNBSP (Zero Width No-Break Space).

Every valid Unicode character, with code-point <= \xFFFF, is coded with TWO bytes.

The FIRST byte stored is the Most Significant Byte of each two-byte sequence, so the three characters of the test file are stored : 00 41 , 00 B0 , 20 19

In UCS-2 Little Endian:

The header is the byte sequence FF FE, which represents the character 0xFEFF (the BOM) with the Least Significant Byte written FIRST.

Every valid Unicode character, with code-point <= \xFFFF, is coded with TWO bytes.

The FIRST byte stored is the Least Significant Byte of each two-byte sequence, so the three characters of the test file are stored : 41 00 , B0 00 , 19 20
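
Both byte orders, and the header itself, can be checked with Python's utf-16-be and utf-16-le codecs (a sketch):

  s = "A\u00b0\u2019"   # A°’
  print(s.encode("utf-16-be").hex(" ").upper())         # 00 41 00 B0 20 19
  print(s.encode("utf-16-le").hex(" ").upper())         # 41 00 B0 00 19 20

  # The BOM is just the character 0xFEFF written in the file's byte order:
  print("\ufeff".encode("utf-16-be").hex(" ").upper())  # FE FF
  print("\ufeff".encode("utf-16-le").hex(" ").upper())  # FF FE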

In UTF-8:

The header is the sequence EFBBBF, which represents the UTF-8 form of the Byte Order Mark (BOM) 0xFEFF.

Every valid Unicode character, with code-point in the range [\x0 - \x10FFFF], is coded with:

  • 1 byte if the code-point is >= 0x0000 ( 0 ) and <= 0x007F ( 127 )
  • 2 bytes if the code-point is >= 0x0080 ( 128 ) and <= 0x07FF ( 2047 )
  • 3 bytes if the code-point is >= 0x0800 ( 2048 ) and <= 0xFFFF ( 65535 )
  • 4 bytes if the code-point is >= 0x10000 ( 65536 ) and <= 0x10FFFF ( 1114111 )

A single UTF-8 byte, depending on its hexadecimal value:

  • from 00 to 7F, represents a STANDARD character, in a one-byte sequence ( 0xxxxxxx )
  • from 80 to BF, represents a CONTINUATION byte, in a two-, three- or four-byte sequence ( 10xxxxxx )
  • from C0 to C1, is ALWAYS a forbidden value
  • from C2 to DF, represents the FIRST byte of a two-byte sequence ( 110xxxxx )
  • from E0 to EF, represents the FIRST byte of a three-byte sequence ( 1110xxxx )
  • from F0 to F4, represents the FIRST byte of a four-byte sequence ( 11110xxx )
  • from F5 to FF, is ALWAYS a forbidden value

So the three characters of the test file are coded: 41 , C2 B0 , E2 80 99 (one byte for the character A, two bytes for the character ° and three bytes for the character ’).
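
The lead and continuation roles are easy to see in the bit patterns of these six bytes. A small Python sketch (the classifier below only covers the byte values that actually occur here):

  for b in "A\u00b0\u2019".encode("utf-8"):
      kind = ("CONTINUATION byte"          if 0x80 <= b <= 0xBF else
              "FIRST byte of 2-byte seq."  if 0xC2 <= b <= 0xDF else
              "FIRST byte of 3-byte seq."  if 0xE0 <= b <= 0xEF else
              "STANDARD one-byte character")
      print(f"{b:02X} {b:08b} {kind}")
  # 41 01000001 STANDARD one-byte character
  # C2 11000010 FIRST byte of 2-byte seq.
  # B0 10110000 CONTINUATION byte
  # E2 11100010 FIRST byte of 3-byte seq.
  # 80 10000000 CONTINUATION byte
  # 99 10011001 CONTINUATION byte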

Refer to this link for further information about UTF-8: http://en.wikipedia.org/wiki/UTF-8

In UTF-8 without BOM:

The encoding of characters is identical to UTF-8, but there is NO header ( BOM ). So the three invisible bytes, at the very beginning of the file, are ABSENT.

In ANSI:

NO header is present at the very beginning of the file.

Each character present in the ANSI code page is coded as a ONE-byte sequence, even when its Unicode code-point is > \xFF (for example ’, U+2019, becomes the single byte 92 in Windows-1252). So, the three characters of the test file are simply stored : 41 , B0 , 92


IMPORTANT:

If the default encoding for a new document ( menu Settings / Preferences / New Document ) is set to UTF-8 without BOM, with the box "Apply on opened ANSI files" checked, then, for a file without any byte of value > \x7F, regardless of its current encoding, a conversion to ANSI automatically sets the encoding of this file to UTF-8 without BOM on the next opening in N++.

If the default encoding for a new document is DIFFERENT from the above, then, for a file without any byte of value > \x7F, regardless of its current encoding, a conversion to UTF-8 without BOM automatically sets the encoding of this file to ANSI on the next opening in N++.

Conversion of the current file to the encodings UTF-8, UCS-2 Big Endian or UCS-2 Little Endian ALWAYS takes effect immediately.

Conversion of the current file to the encodings ANSI or UTF-8 without BOM takes effect immediately ONLY if the file contains at least one byte of value > \x7F, as shown below.
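
The reason is that, below \x80, ANSI and UTF-8 without BOM produce byte-for-byte identical files, so there is nothing left to distinguish the two encodings (a sketch):

  print("Hello".encode("cp1252") == "Hello".encode("utf-8"))            # True : nothing to convert
  print("H\u00e9llo".encode("cp1252") == "H\u00e9llo".encode("utf-8"))  # False : é is encoded differently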

DIFFERENCES between the option "Convert to ...." and the option "Encode in ...." OR "Character sets/..."

  • The option "Convert to..." change the current file, with its actual encoding, to the same contents file, translated in the chosen encoding.

    Use this option ONLY if ALL the characters of the file are correctly displayed. Characters that cannot be displayed will be replaced by a question mark 0x3F (?), by the UNICODE replacement character 0xFFFD or by a small rectangular or square empty box.

The contents of the current file, AFTER conversion, are ALWAYS modified.

  • The option "Encode in ...." OR "Character sets/ ..." apply the chosen encoding to the actual contents of the current file.
    • Generally, the contents of the file, AFTER encoding, are NOT modified and ONLY displaying is changed.
    • BUT, if the actual OR the future encoding is UTF-8, UCS-2 Big Endian or UCS-2 Little Endian, the contents of the current file are, of course, modified, because the header, of the current file, is added, changed or removed.



Use this option ONLY if some characters of the file are unreadable or displayed as small boxes or question marks.

IMPORTANT:

When the ANSI encoding is chosen (menu "Encoding\Encode in ANSI"), Notepad++ automatically selects the Windows code page Windows-nnnn corresponding to the default locale, defined in "Control Panel\Regional and Language Options" and used by Windows for all NON-Unicode programs.

Examples:

  • For a French user, the options "Encode in ANSI" and "Character Sets\Western European\Windows-1252" are identical
  • For a Greek user, the options "Encode in ANSI" and "Character Sets\Greek\Windows-1253" are identical
  • For a Turkish user, the options "Encode in ANSI" and "Character Sets\Turkish\Windows-1254" are identical

And so on...
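
The same system setting can be queried from Python on Windows; on a French system this typically prints cp1252, on a Greek one cp1253, and so on (a sketch; note that recent Python versions may report utf-8 instead when UTF-8 mode is active):

  import locale
  print(locale.getpreferredencoding(False))   # e.g. cp1252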

If the current encoding is a UNICODE encoding (UTF-8 without BOM, UTF-8, UCS-2 Big Endian or UCS-2 Little Endian) AND some characters of the current file are still missing or badly displayed, then, unfortunately, NO other encoding will help, because these UNICODE encodings already cover almost the totality of the characters. The UNIQUE possibility is to use or install another default font, which can correctly display the offending characters (menu "Settings\Style Configurator...\Global Styles\Default Style" of Notepad++).

If the current encoding is an ANSI, an OEM or an ISO encoding AND some characters of the current file are missing or badly displayed, it is possible to try another ANSI, OEM or ISO encoding, with the options "Encode in ANSI" or "Character Sets\...", to get these characters displayed correctly.

On the contrary, if the current file is correctly displayed in its current encoding, special characters will generally be displayed differently in the new encoding:

For example, if the current encoding of the test file is "ANSI", the option "Encode in UTF-8" displays the string A followed by the bytes xB0 and x92 shown in the 'Control character' way, as these two final bytes are NOT part of a legal UTF-8 sequence.

If the actual encoding of the test file is "UTF-8 without BOM", the option "Encode in ANSI", for example, displays the string A°’, according to the actual contents of the file (41 , C2 , B0 , E2 , 80 , 99).

And, if the current encoding of the test file is "ANSI", the option "Character Sets/Western European/OEM850" displays the string A░Æ, in DOS CP 850, according to the actual contents of the test file (41 , B0 , 92).
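
These three displays can be checked with Python codecs as well (a sketch; again, Python prints the replacement character U+FFFD where Notepad++ shows the xNN boxes):

  ansi = b"A\xb0\x92"                            # the ANSI bytes of A°’
  print(ansi.decode("utf-8", errors="replace"))  # A�� : B0 and 92 are illegal in UTF-8
  print(ansi.decode("cp850"))                    # A░Æ
  utf8 = "A\u00b0\u2019".encode("utf-8")         # 41 C2 B0 E2 80 99
  print(utf8.decode("cp1252"))                   # A°â€™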