Handling Result

When you run detection on a buffer, a bytes object, or a file, you can assign the return value and inspect it in full.

from charset_normalizer import from_bytes

my_byte_str = '我没有埋怨,磋砣的只是一些时间。'.encode('gb18030')

# Assign the return value so the result can be inspected afterwards
result = from_bytes(my_byte_str).best()

print(result.encoding)  # gb18030

Using CharsetMatch

Here, result is a CharsetMatch object or None.
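
Because the result may be None, it is worth checking it before reading any property. The following is a minimal sketch; `detect_encoding` is a hypothetical helper introduced for illustration, not part of the library.

```python
from typing import Optional

from charset_normalizer import from_bytes


def detect_encoding(payload: bytes) -> Optional[str]:
    """Return the detected encoding name, or None when nothing fits.

    Hypothetical helper, not part of charset_normalizer.
    """
    result = from_bytes(payload).best()
    if result is None:
        # No candidate encoding was retained for this payload.
        return None
    return result.encoding


sample = "Le cœur a ses raisons que la raison ne connaît point.".encode("utf_8")
print(detect_encoding(sample))
```

Returning None rather than raising keeps the decision about a fallback encoding with the caller.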

class charset_normalizer.CharsetMatch(payload: bytes, guessed_encoding: str, mean_mess_ratio: float, has_sig_or_bom: bool, languages: List[Tuple[str, float]], decoded_payload: str | None = None)
best() -> CharsetMatch

Kept for backward-compatibility (BC) reasons. Will be removed in 3.0.

property chaos_secondary_pass: float

Checks the chaos of the decoded text once more, this time over the full content. Use with caution; this can be very slow. Notice: Will be removed in 3.0.

property coherence_non_latin: float

Coherence ratio for the first non-Latin language detected, if any. Notice: Will be removed in 3.0.

property could_be_from_charset: List[str]

The complete list of encodings that produce exactly the same str result and could therefore be the originating encoding. This list includes the encoding available in the ‘encoding’ property.

property encoding_aliases: List[str]

Encodings are known by many names. This property helps when, for example, you search for IBM855 but the encoding is listed as CP855.
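
A short sketch of how the two properties above might be combined to test a match against a user-supplied encoding name. `is_known_as` is a hypothetical helper, and the lowercase/underscore normalization is an assumption about how names are spelled in the returned lists.

```python
from charset_normalizer import CharsetMatch, from_bytes


def is_known_as(match: CharsetMatch, name: str) -> bool:
    """Check a name against the detected encoding and its aliases.

    Hypothetical helper; the normalization below is an assumption.
    """
    normalized = name.lower().replace("-", "_")
    return normalized in {match.encoding, *match.encoding_aliases}


result = from_bytes("Plain ASCII text for a quick check.".encode("ascii")).best()

if result is not None:
    # Alternative names for the detected encoding.
    print(result.encoding, result.encoding_aliases)
    # Every encoding that decodes the payload to the same str.
    print(result.could_be_from_charset)
    print(is_known_as(result, result.encoding))
```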

property fingerprint: str

Retrieves the unique SHA-256 digest computed from the transformed (re-encoded) payload, not from the original one.
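
To illustrate the distinction, the sketch below prints the fingerprint next to a SHA-256 digest computed by hand from the original bytes. The comparison uses the standard `hashlib` module and is only an illustration; whether the two values differ depends on the detected encoding of the payload.

```python
import hashlib

from charset_normalizer import from_bytes

payload = "我没有埋怨,磋砣的只是一些时间。".encode("gb18030")
result = from_bytes(payload).best()

if result is not None:
    # Digest of the original, untouched bytes, computed by hand.
    raw_digest = hashlib.sha256(result.raw).hexdigest()
    # Digest reported by the library for the re-encoded payload.
    print(result.fingerprint)
    print(result.fingerprint == raw_digest)
```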

first() -> CharsetMatch

Kept for backward-compatibility (BC) reasons. Will be removed in 3.0.

property language: str

Most probable language found in the decoded sequence. If none was detected or inferred, the property returns “Unknown”.

property languages: List[str]

Returns the complete list of possible languages found in the decoded sequence. This is rarely useful. The returned list may be empty even if the ‘language’ property returns something other than ‘Unknown’.
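
A minimal sketch of reading both language properties while handling the “Unknown” case. The sample sentence is arbitrary, and which language is reported for it is not asserted here.

```python
from charset_normalizer import from_bytes

payload = "Ceci est un texte français avec des accents: été, élève, où, ça.".encode("cp1252")
result = from_bytes(payload).best()

if result is not None:
    if result.language == "Unknown":
        print("language could not be determined")
    else:
        print("most probable language:", result.language)
    # This list may be empty even when `language` is set.
    print("all candidates:", result.languages)
```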

output(encoding: str = 'utf_8') -> bytes

Returns the payload re-encoded as bytes in the given target encoding, which defaults to UTF-8. Encoding errors are ignored by the encoder, not replaced.

property raw: bytes

Original untouched bytes.
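
The sketch below contrasts `raw` with `output()`: the former should hand back the bytes that were passed in, while the latter returns the text re-encoded to the target encoding. It relies on `str(result)` yielding the decoded text, which is how the library exposes the decoded payload.

```python
from charset_normalizer import from_bytes

payload = "我没有埋怨,磋砣的只是一些时间。".encode("gb18030")
result = from_bytes(payload).best()

if result is not None:
    utf8_payload = result.output()  # re-encoded, UTF-8 by default
    text = str(result)              # decoded text
    # The original bytes are kept as they were supplied.
    print(result.raw == payload)
    # Decoding the re-encoded payload gives back the decoded text.
    print(utf8_payload.decode("utf_8") == text)
```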

property w_counter: Counter

Word counter instance on decoded text. Notice: Will be removed in 3.0.