Developer Interfaces#

Main Interfaces#

These functions are publicly exposed and are protected by our backward-compatibility (BC) guarantee.

charset_normalizer.from_bytes(sequences: bytes | bytearray, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: List[str] | None = None, cp_exclusion: List[str] | None = None, preemptive_behaviour: bool = True, explain: bool = False, language_threshold: float = 0.1, enable_fallback: bool = True) CharsetMatches[source]#

Given a raw bytes sequence, return the best possible charsets usable to render str objects. If there are no results, it is a strong indicator that the source is binary/not text. By default, the process will extract 5 blocks of 512 bytes each to assess the mess and coherence of a given sequence, and will give up on a particular code page after 20% measured mess. Those criteria are customizable at will.

The preemptive behaviour DOES NOT replace the traditional detection workflow; it prioritizes a particular code page but never takes it for granted. It can improve performance.

You may want to focus your attention on some code pages and/or exclude others; use cp_isolation and cp_exclusion for that purpose.

This function will strip the SIG from the payload/sequence every time except for UTF-16 and UTF-32. By default the library does not set up any handler other than the NullHandler; if you set the ‘explain’ toggle to True, it will alter the logger configuration to add a StreamHandler suitable for debugging. A custom logging format and handler can be set manually.
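
A minimal usage sketch; the sample payload is hypothetical and the encoding shown in the comments is illustrative, not a guaranteed output.

    from charset_normalizer import from_bytes

    payload = "Всеки човек има право на образование.".encode("cp1251")

    results = from_bytes(payload)   # CharsetMatches, most probable first
    best_guess = results.best()     # CharsetMatch or None

    if best_guess is None:
        print("No viable charset; the payload is likely binary.")
    else:
        print(best_guess.encoding)  # e.g. "cp1251"
        print(str(best_guess))      # payload decoded with the detected charset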

charset_normalizer.from_fp(fp: BinaryIO, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: List[str] | None = None, cp_exclusion: List[str] | None = None, preemptive_behaviour: bool = True, explain: bool = False, language_threshold: float = 0.1, enable_fallback: bool = True) CharsetMatches[source]#

Same as the function from_bytes but using a file pointer that is already opened and readable. Will not close the file pointer.
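
A short sketch, assuming a hypothetical file name; the pointer must be opened in binary mode and remains open afterwards.

    from charset_normalizer import from_fp

    with open("./unknown-encoding.txt", "rb") as fp:
        best_guess = from_fp(fp).best()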

charset_normalizer.from_path(path: str | bytes | PathLike, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: List[str] | None = None, cp_exclusion: List[str] | None = None, preemptive_behaviour: bool = True, explain: bool = False, language_threshold: float = 0.1, enable_fallback: bool = True) CharsetMatches[source]#

Same as the function from_bytes but with one extra step: opening and reading the given file path in binary mode. Can raise IOError.
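
A short sketch with a hypothetical path; IOError is raised if the file cannot be opened or read.

    from charset_normalizer import from_path

    try:
        best_guess = from_path("./unknown-encoding.txt").best()
    except IOError:
        best_guess = None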

charset_normalizer.is_binary(fp_or_path_or_payload: PathLike | str | BinaryIO | bytes, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: List[str] | None = None, cp_exclusion: List[str] | None = None, preemptive_behaviour: bool = True, explain: bool = False, language_threshold: float = 0.1, enable_fallback: bool = False) bool[source]#

Detect whether the given input (file, bytes, or path) points to binary content, i.e. not text. Based on the same main heuristic algorithms and default kwargs, with the sole exception that fallback matches are disabled, making it stricter with content that is ASCII-compatible but unlikely to be text.
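
A small sketch showing the three accepted input kinds; the file names are hypothetical and the boolean results are illustrative.

    from charset_normalizer import is_binary

    is_binary(b"\x00\x5f\x2b\xff")     # bytes payload, likely True here
    is_binary("./archive.zip")         # path to a file

    with open("./archive.zip", "rb") as fp:
        is_binary(fp)                  # already opened binary file pointer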

class charset_normalizer.models.CharsetMatches(results: List[CharsetMatch] | None = None)[source]#

Container holding every CharsetMatch item, ordered by default from most probable to least probable. Acts like a list (iterable) but does not implement all related methods.

append(item: CharsetMatch) None[source]#

Insert a single match. It will be inserted so as to preserve the existing order. It can be inserted as a submatch.

best() CharsetMatch | None[source]#

Simply return the first match. Strict equivalent to matches[0].

first() CharsetMatch | None[source]#

Redundant method; calls best(). Kept for BC reasons.
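
A short sketch of the list-like behaviour described above; the payload and the resulting encodings are illustrative.

    from charset_normalizer import from_bytes

    matches = from_bytes("héllo wörld".encode("latin-1"))

    len(matches)       # number of plausible charsets
    matches.best()     # most probable CharsetMatch, or None
    matches.first()    # same as best(), kept for BC

    for match in matches:
        print(match.encoding)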

class charset_normalizer.models.CharsetMatch(payload: bytes, guessed_encoding: str, mean_mess_ratio: float, has_sig_or_bom: bool, languages: List[Tuple[str, float]], decoded_payload: str | None = None)[source]#
property could_be_from_charset: List[str]#

The complete list of encodings that output the exact SAME str result and therefore could be the originating encoding. This list includes the encoding available in the ‘encoding’ property.

property encoding_aliases: List[str]#

Encodings are known by many names; using this property could help when searching for IBM855 when it is listed as CP855.

property fingerprint: str#

Retrieve the unique SHA256 computed using the transformed (re-encoded) payload, not the original one.

property language: str#

Most probable language found in the decoded sequence. If none were detected or inferred, the property will return “Unknown”.

property languages: List[str]#

Return the complete list of possible languages found in the decoded sequence. Usually not very useful. The returned list may be empty even if the ‘language’ property returns something other than ‘Unknown’.

output(encoding: str = 'utf_8') bytes[source]#

Method to get the re-encoded bytes payload using the given target encoding. Defaults to UTF-8. Any errors will be simply ignored by the encoder, NOT replaced.

property raw: bytes#

Original untouched bytes.
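
A sketch of the properties and methods above on a single match; the payload is hypothetical and the commented values are illustrative.

    from charset_normalizer import from_bytes

    match = from_bytes("Où êtes-vous ?".encode("cp1252")).best()

    if match is not None:
        match.encoding                  # e.g. "cp1252"
        match.could_be_from_charset     # every encoding giving the same str
        match.language                  # e.g. "French", or "Unknown"
        match.fingerprint               # SHA256 of the re-encoded payload
        match.raw                       # original untouched bytes
        match.output(encoding="utf_8")  # payload re-encoded as UTF-8 bytes
        str(match)                      # decoded text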

charset_normalizer.detect(byte_str: bytes, should_rename_legacy: bool = False, **kwargs: Any) Dict[str, str | float | None][source]#

chardet legacy method. Detect the encoding of the given byte string. It should be mostly backward-compatible. The encoding name will match Chardet’s own spelling whenever possible (not for encoding names it does not support). This function is deprecated and should only be used to migrate your project easily; consult the documentation for further information. Not planned for removal.

Parameters:
  • byte_str – The byte sequence to examine.

  • should_rename_legacy – Should we rename legacy encodings to their more modern equivalents?
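
A migration sketch; the byte string is hypothetical and the reported confidence is illustrative.

    from charset_normalizer import detect   # drop-in for chardet.detect

    result = detect("Bonjour tout le monde".encode("cp1252"))
    # result is a dict with 'encoding', 'language' and 'confidence' keys
    print(result["encoding"], result["confidence"])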

charset_normalizer.utils.set_logging_handler(name: str = 'charset_normalizer', level: int = 20, format_string: str = '%(asctime)s | %(levelname)s | %(message)s') None[source]#
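
A short sketch of configuring the library’s logging via this helper, using only the parameters shown in the signature above.

    import logging

    from charset_normalizer.utils import set_logging_handler

    # level defaults to 20 (logging.INFO); raise verbosity for debugging
    set_logging_handler(level=logging.DEBUG)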

Mess Detector#

charset_normalizer.md.mess_ratio(decoded_sequence: str, maximum_threshold: float = 0.2, debug: bool = False) float[source]#

Compute a mess ratio given a decoded bytes sequence. The maximum threshold stops the computation early.
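
A short sketch; the returned values are illustrative, not exact figures.

    from charset_normalizer.md import mess_ratio

    mess_ratio("This is a clean, ordinary sentence.")    # close to 0.0
    mess_ratio("ÃƒÂ©vÃƒÂ©nement Ã‚Â« dÃƒÂ©cÃƒÂ¨s Ã‚Â»")    # noticeably higher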

This library allows you to extend the capabilities of the mess detector by extending the class MessDetectorPlugin.

class charset_normalizer.md.MessDetectorPlugin[source]#

Base abstract class used for mess detection plugins. All detectors MUST extend and implement given methods.

eligible(character: str) bool[source]#

Determine if the given character should be fed to this detector.

feed(character: str) None[source]#

The main routine, executed for each character. Insert the logic by which the text would be considered chaotic.

property ratio: float#

Compute the chaos ratio based on what your feed() has seen. Must NOT be lower than 0.0; there is no upper restriction.

reset() None[source]#

Reset the plugin to its initial state.
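
A minimal subclass sketch based on the abstract interface documented above. The detection rule (counting U+FFFD replacement characters) and the class name are purely illustrative; how such a plugin is registered with the detector is not covered here.

    from charset_normalizer.md import MessDetectorPlugin

    class ReplacementCharacterPlugin(MessDetectorPlugin):
        """Hypothetical plugin penalising U+FFFD replacement characters."""

        def __init__(self) -> None:
            self._character_count = 0
            self._suspicious_count = 0

        def eligible(self, character: str) -> bool:
            return True  # inspect every character

        def feed(self, character: str) -> None:
            self._character_count += 1
            if character == "\ufffd":
                self._suspicious_count += 1

        def reset(self) -> None:
            self._character_count = 0
            self._suspicious_count = 0

        @property
        def ratio(self) -> float:
            if self._character_count == 0:
                return 0.0
            return self._suspicious_count / self._character_count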

charset_normalizer.md.is_suspiciously_successive_range(unicode_range_a: str | None, unicode_range_b: str | None) bool[source]#

Determine if two Unicode ranges seen next to each other can be considered suspicious.

Coherence Detector#

charset_normalizer.cd.coherence_ratio(decoded_sequence: str, threshold: float = 0.1, lg_inclusion: str | None = None) List[Tuple[str, float]][source]#

Detect ANY language that can be identified in the given sequence. The sequence will be analysed in layers; a layer is a character extraction by alphabets/ranges.
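
A short sketch; the sample text and the returned language/ratio pairs are illustrative.

    from charset_normalizer.cd import coherence_ratio

    coherence_ratio("Der schnelle braune Fuchs springt über den faulen Hund.")
    # e.g. [('German', 0.56), ('English', 0.31)]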

Utilities#

Some reusable functions used across the project. We do not guarantee BC in this area.

charset_normalizer.utils.is_accentuated(character: str) bool[source]#
charset_normalizer.utils.remove_accent(character: str) str[source]#
charset_normalizer.utils.unicode_range(character: str) str | None[source]#

Retrieve the official Unicode range name for a single character.

charset_normalizer.utils.is_latin(character: str) bool[source]#
charset_normalizer.utils.is_punctuation(character: str) bool[source]#
charset_normalizer.utils.is_symbol(character: str) bool[source]#
charset_normalizer.utils.is_emoticon(character: str) bool[source]#
charset_normalizer.utils.is_separator(character: str) bool[source]#
charset_normalizer.utils.is_case_variable(character: str) bool[source]#
charset_normalizer.utils.is_cjk(character: str) bool[source]#
charset_normalizer.utils.is_hiragana(character: str) bool[source]#
charset_normalizer.utils.is_katakana(character: str) bool[source]#
charset_normalizer.utils.is_hangul(character: str) bool[source]#
charset_normalizer.utils.is_thai(character: str) bool[source]#
charset_normalizer.utils.is_unicode_range_secondary(range_name: str) bool[source]#
charset_normalizer.utils.any_specified_encoding(sequence: bytes, search_zone: int = 8192) str | None[source]#

Extract, using an ASCII-only decoder, any specified encoding in the first n bytes.

charset_normalizer.utils.is_multi_byte_encoding(name: str) bool[source]#

Verify whether a specific encoding is a multi-byte one based on its IANA name.

charset_normalizer.utils.identify_sig_or_bom(sequence: bytes) Tuple[str | None, bytes][source]#

Identify and extract SIG/BOM in given sequence.

charset_normalizer.utils.should_strip_sig_or_bom(iana_encoding: str) bool[source]#
charset_normalizer.utils.iana_name(cp_name: str, strict: bool = True) str[source]#
charset_normalizer.utils.range_scan(decoded_sequence: str) List[str][source]#
charset_normalizer.utils.is_cp_similar(iana_name_a: str, iana_name_b: str) bool[source]#

Determine if two code pages are at least 80% similar. The IANA_SUPPORTED_SIMILAR dict was generated using the function cp_similarity.
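
A few illustrative calls to the helpers above; the commented return values are what one would expect but are not guaranteed outputs.

    from charset_normalizer.utils import (
        iana_name,
        is_accentuated,
        is_multi_byte_encoding,
        remove_accent,
        unicode_range,
    )

    is_accentuated("é")              # True
    remove_accent("é")               # "e"
    unicode_range("あ")              # "Hiragana"
    iana_name("CP855")               # normalized IANA name, e.g. "cp855"
    is_multi_byte_encoding("utf_8")  # True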

class os.PathLike#

An abstract base class for objects representing a file system path, e.g. pathlib.PurePath.

class typing.BinaryIO#

Type for binary streams, e.g. a file opened in binary mode (a subclass of IO[bytes]).