charset_normalizer 3.4.2 documentation

Advanced Search¶

The Charset Normalizer functions from_bytes, from_fp and from_path accept several optional parameters that can be tuned.

For example:

from charset_normalizer import from_bytes

my_byte_str = 'Всеки човек има право на образование.'.encode('cp1251')

results = from_bytes(
    my_byte_str,
    steps=10,  # Number of blocks to sample from my_byte_str
    chunk_size=512,  # Size, in bytes, of each sampled block
    threshold=0.2,  # Maximum amount of chaos (mess) allowed on the first pass
    cp_isolation=None,  # Finite list of encodings to restrict the search to
    cp_exclusion=None,  # Finite list of encodings to skip during the search
    preemptive_behaviour=True,  # Whether to look inside my_byte_str (in ASCII mode) for a declared encoding
    explain=False,  # Print what is happening while searching for a match
    language_threshold=0.1  # Minimum coherence ratio / language ratio accepted for a match
)
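As a rough sketch of how cp_isolation and cp_exclusion narrow the search, the snippet below runs the same payload twice: once restricted to two candidate code pages and once with one code page ruled out. The variable names are illustrative only, and the encoding names are assumed to be accepted in the spelling shown.

```python
from charset_normalizer import from_bytes

sample = 'Всеки човек има право на образование.'.encode('cp1251')

# Only these code pages are considered as candidates.
allowed = ['cp1251', 'utf_8']
isolated = from_bytes(sample, cp_isolation=allowed)

# Every supported code page is considered except this one.
excluded = from_bytes(sample, cp_exclusion=['cp1251'])

isolated_encodings = [match.encoding for match in isolated]
excluded_encodings = [match.encoding for match in excluded]

print('with isolation:', isolated_encodings)
print('with exclusion:', excluded_encodings)
```

Isolation is handy when the set of plausible encodings is known in advance; exclusion is the reverse, for ruling out code pages that are known to be wrong for the data at hand.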

Using CharsetMatches¶

Here, results is a CharsetMatches object. It behaves like a list but does not implement every list method. It is sorted from the outset, so calling best() is enough to extract the most probable result.

class charset_normalizer.CharsetMatches(results: list[CharsetMatch] | None = None)[source]¶

Container holding every CharsetMatch item, ordered by default from the most probable to the least. Acts like a list (it is iterable) but does not implement every list method.

append(item: CharsetMatch) → None[source]¶

Insert a single match. It is placed so that the sort order is preserved, and it may be stored as a submatch of an existing match.

best() → CharsetMatch | None[source]¶

Return the first match, or None if the container is empty. Equivalent to matches[0] when at least one match exists.

first() → CharsetMatch | None[source]¶

Redundant method that calls best(). Kept for backward-compatibility reasons.
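The difference between best() and plain indexing shows up on an empty container. The following is a minimal sketch, assuming that indexing an empty CharsetMatches raises a lookup error in the same way a list would:

```python
from charset_normalizer import CharsetMatches

empty = CharsetMatches()  # a container with no candidate at all

best_result = empty.best()    # no exception; returns None
first_result = empty.first()  # same behaviour as best()
print(best_result, first_result)

try:
    empty[0]
    indexing_failed = False
except LookupError:  # parent class of IndexError and KeyError
    indexing_failed = True

print('indexing an empty container failed:', indexing_failed)
```

For that reason best() is usually the safer call when a payload may yield no match.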

List behaviour¶

As mentioned earlier, a CharsetMatches object behaves like a list.

# len() also works on results; an empty container is falsy
if not results:
    print('No match for your sequence')

# Iterate over results like a list
for match in results:
    print(match.encoding, 'can properly decode your sequence using', match.alphabets, 'and language', match.language)

# Using index to access results
if results:
    print(str(results[0]))

Using best()¶

As mentioned above, a CharsetMatches object behaves like a list, and it is already sorted when returned by from_bytes, from_fp or from_path.

best() returns the most probable result, i.e. the first entry of the list (index 0). It returns a CharsetMatch object, or None if the container holds no results.

result = results.best()
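Because best() may return None, the result is worth checking before use. A minimal sketch (the payload and variable names are illustrative only):

```python
from charset_normalizer import from_bytes

payload = 'Всеки човек има право на образование.'.encode('cp1251')

best_guess = from_bytes(payload).best()

if best_guess is None:
    print('No plausible encoding found')
else:
    # match.encoding names the detected code page;
    # str(match) gives the payload decoded with it.
    print(best_guess.encoding)
    print(str(best_guess))
```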

Calling first()¶

Exactly the same as calling best().

Class aliases¶

CharsetMatches is also known as CharsetDetector, CharsetDoctor and CharsetNormalizerMatches. These aliases are useful if you prefer a shorter class name.

Verbose output¶

You may want to understand why charset_normalizer did not pick a specific encoding. All you have to do is set explain to True when calling from_bytes, from_fp or from_path.
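A minimal sketch of a verbose run is shown below. It assumes the trace is written to the console while the search runs; the exact wording and volume of the messages depend on the installed version.

```python
from charset_normalizer import from_bytes

payload = 'Всеки човек има право на образование.'.encode('cp1251')

# explain=True asks the library to report each decision it takes
# (code pages tried, why a candidate was rejected, and so on).
results = from_bytes(payload, explain=True)

best_guess = results.best()
print('picked:', best_guess.encoding if best_guess is not None else None)
```

Verbose mode is intended for debugging a surprising detection, so it is best left off in normal use.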

Copyright © 2023, Ahmed TAHRI