Support¶
If you are running:
Python >=2.7,<3.5: Unsupported
Python 3.5: charset-normalizer < 2.1
Python 3.6: charset-normalizer < 3.1
Upgrade your Python interpreter as soon as possible.
Supported Encodings¶
Here are a list of supported encoding and supported language with latest update. Also this list may change depending of your python version.
Charset Normalizer is able to detect any of those encoding. This list is NOT static and depends heavily on what your current cPython version is shipped with. See https://docs.python.org/3/library/codecs.html#standard-encodings
IANA Code Page |
Aliases |
---|---|
ascii |
646, ansi_x3.4_1968, ansi_x3_4_1968, ansi_x3.4_1986, cp367, csascii, ibm367, iso646_us, iso_646.irv_1991, iso_ir_6, us, us_ascii |
big5 |
big5_tw, csbig5, x_mac_trad_chinese |
big5hkscs |
big5_hkscs, hkscs |
cp037 |
037, csibm037, ebcdic_cp_ca, ebcdic_cp_nl, ebcdic_cp_us, ebcdic_cp_wt, ibm037, ibm039 |
cp1026 |
1026, csibm1026, ibm1026 |
cp1125 |
1125, ibm1125, cp866u, ruscii |
cp1140 |
1140, ibm1140 |
cp1250 |
1250, windows_1250 |
cp1251 |
1251, windows_1251 |
cp1252 |
1252, windows_1252 |
cp1253 |
1253, windows_1253 |
cp1254 |
1254, windows_1254 |
cp1255 |
1255, windows_1255 |
cp1256 |
1256, windows_1256 |
cp1257 |
1257, windows_1257 |
cp1258 |
1258, windows_1258 |
cp273 |
273, ibm273, csibm273 |
cp424 |
424, csibm424, ebcdic_cp_he, ibm424 |
cp437 |
437, cspc8codepage437, ibm437 |
cp500 |
500, csibm500, ebcdic_cp_be, ebcdic_cp_ch, ibm500 |
cp775 |
775, cspc775baltic, ibm775 |
cp850 |
850, cspc850multilingual, ibm850 |
cp852 |
852, cspcp852, ibm852 |
cp855 |
855, csibm855, ibm855 |
cp857 |
857, csibm857, ibm857 |
cp858 |
858, csibm858, ibm858 |
cp860 |
860, csibm860, ibm860 |
cp861 |
861, cp_is, csibm861, ibm861 |
cp862 |
862, cspc862latinhebrew, ibm862 |
cp863 |
863, csibm863, ibm863 |
cp864 |
864, csibm864, ibm864 |
cp865 |
865, csibm865, ibm865 |
cp866 |
866, csibm866, ibm866 |
cp869 |
869, cp_gr, csibm869, ibm869 |
cp932 |
932, ms932, mskanji, ms_kanji |
cp949 |
949, ms949, uhc |
cp950 |
950, ms950 |
euc_jis_2004 |
jisx0213, eucjis2004, euc_jis2004 |
euc_jisx0213 |
eucjisx0213 |
euc_jp |
eucjp, ujis, u_jis |
euc_kr |
euckr, korean, ksc5601, ks_c_5601, ks_c_5601_1987, ksx1001, ks_x_1001, x_mac_korean |
gb18030 |
gb18030_2000 |
gb2312 |
chinese, csiso58gb231280, euc_cn, euccn, eucgb2312_cn, gb2312_1980, gb2312_80, iso_ir_58, x_mac_simp_chinese |
gbk |
936, cp936, ms936 |
hp_roman8 |
roman8, r8, csHPRoman8 |
hz |
hzgb, hz_gb, hz_gb_2312 |
iso2022_jp |
csiso2022jp, iso2022jp, iso_2022_jp |
iso2022_jp_1 |
iso2022jp_1, iso_2022_jp_1 |
iso2022_jp_2 |
iso2022jp_2, iso_2022_jp_2 |
iso2022_jp_3 |
iso2022jp_3, iso_2022_jp_3 |
iso2022_jp_ext |
iso2022jp_ext, iso_2022_jp_ext |
iso2022_kr |
csiso2022kr, iso2022kr, iso_2022_kr |
iso8859_10 |
csisolatin6, iso_8859_10, iso_8859_10_1992, iso_ir_157, l6, latin6 |
iso8859_11 |
thai, iso_8859_11, iso_8859_11_2001 |
iso8859_13 |
iso_8859_13, l7, latin7 |
iso8859_14 |
iso_8859_14, iso_8859_14_1998, iso_celtic, iso_ir_199, l8, latin8 |
iso8859_15 |
iso_8859_15, l9, latin9 |
iso8859_16 |
iso_8859_16, iso_8859_16_2001, iso_ir_226, l10, latin10 |
iso8859_2 |
csisolatin2, iso_8859_2, iso_8859_2_1987, iso_ir_101, l2, latin2 |
iso8859_3 |
csisolatin3, iso_8859_3, iso_8859_3_1988, iso_ir_109, l3, latin3 |
iso8859_4 |
csisolatin4, iso_8859_4, iso_8859_4_1988, iso_ir_110, l4, latin4 |
iso8859_5 |
csisolatincyrillic, cyrillic, iso_8859_5, iso_8859_5_1988, iso_ir_144 |
iso8859_6 |
arabic, asmo_708, csisolatinarabic, ecma_114, iso_8859_6, iso_8859_6_1987, iso_ir_127 |
iso8859_7 |
csisolatingreek, ecma_118, elot_928, greek, greek8, iso_8859_7, iso_8859_7_1987, iso_ir_126 |
iso8859_8 |
csisolatinhebrew, hebrew, iso_8859_8, iso_8859_8_1988, iso_ir_138 |
iso8859_9 |
csisolatin5, iso_8859_9, iso_8859_9_1989, iso_ir_148, l5, latin5 |
iso2022_jp_2004 |
iso_2022_jp_2004, iso2022jp_2004 |
johab |
cp1361, ms1361 |
koi8_r |
cskoi8r |
kz1048 |
kz_1048, rk1048, strk1048_2002 |
latin_1 |
8859, cp819, csisolatin1, ibm819, iso8859, iso8859_1, iso_8859_1, iso_8859_1_1987, iso_ir_100, l1, latin, latin1 |
mac_cyrillic |
maccyrillic |
mac_greek |
macgreek |
mac_iceland |
maciceland |
mac_latin2 |
maccentraleurope, maclatin2 |
mac_roman |
macintosh, macroman |
mac_turkish |
macturkish |
ptcp154 |
csptcp154, pt154, cp154, cyrillic_asian |
shift_jis |
csshiftjis, shiftjis, sjis, s_jis, x_mac_japanese |
shift_jis_2004 |
shiftjis2004, sjis_2004, s_jis_2004 |
shift_jisx0213 |
shiftjisx0213, sjisx0213, s_jisx0213 |
tis_620 |
tis620, tis_620_0, tis_620_2529_0, tis_620_2529_1, iso_ir_166 |
utf_16 |
u16, utf16 |
utf_16_be |
unicodebigunmarked, utf_16be |
utf_16_le |
unicodelittleunmarked, utf_16le |
utf_32 |
u32, utf32 |
utf_32_be |
utf_32be |
utf_32_le |
utf_32le |
utf_8 |
u8, utf, utf8, utf8_ucs2, utf8_ucs4 (+utf_8_sig) |
utf_7* |
u7, unicode-1-1-utf-7 |
cp720 |
N.A. |
cp737 |
N.A. |
cp856 |
N.A. |
cp874 |
N.A. |
cp875 |
N.A. |
cp1006 |
N.A. |
koi8_r |
N.A. |
koi8_t |
N.A. |
koi8_u |
N.A. |
*: Only if a SIG/mark is found.
Supported Languages¶
Those language can be detected inside your content. All of these are specified in ./charset_normalizer/assets/__init__.py .
Incomplete Sequence / Stream¶
It is not (yet) officially supported. If you feed an incomplete byte sequence (eg. truncated multi-byte sequence) the detector will most likely fail to return a proper result. If you are purposely feeding part of your payload for performance concerns, you may stop doing it as this package is fairly optimized.
We are working on a dedicated way to handle streams.