Advanced Search¶
The Charset Normalizer functions from_bytes, from_fp and from_path accept several optional parameters that can be tuned, as shown below:
from charset_normalizer import from_bytes
my_byte_str = 'Bсеки човек има право на образование.'.encode('cp1251')
results = from_bytes(
    my_byte_str,
    steps=10,  # Number of steps/blocks to extract from my_byte_str
    chunk_size=512,  # Size of each extracted block
    threshold=0.2,  # Maximum amount of chaos allowed on first pass
    cp_isolation=None,  # Finite list of encodings to use when searching for a match
    cp_exclusion=None,  # Finite list of encodings to avoid when searching for a match
    preemptive_behaviour=True,  # Whether to look into my_byte_str (ASCII mode) for a pre-defined encoding
    explain=False,  # Print on screen what is happening when searching for a match
    language_threshold=0.1  # Minimum coherence ratio / language ratio match accepted
)
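As a minimal sketch of how one of these parameters might be used, the snippet below restricts the search to a short list of candidate encodings through cp_isolation. The sample text and the candidate list are illustrative assumptions; every other parameter keeps its default value.

```python
from charset_normalizer import from_bytes

# Illustrative sample: Bulgarian text encoded with a legacy Cyrillic code page.
text = 'Всеки човек има право на образование.'
payload = text.encode('cp1251')

# Assumption for this sketch: we only want these two encodings considered.
candidates = ['cp1251', 'utf_8']
results = from_bytes(payload, cp_isolation=candidates)

# Every reported match should come from the isolation list.
for match in results:
    print(match.encoding)
```

Passing cp_exclusion instead works the other way around: the listed encodings are skipped and every other supported encoding remains a candidate.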
Using CharsetMatches¶
Here, results is a CharsetMatches object. It behaves like a list but does not implement every list method.
It is sorted from the start, so calling best() is enough to extract the most probable result.
- class charset_normalizer.CharsetMatches(results: List[CharsetMatch] | None = None)
Container holding every CharsetMatch item, ordered by default from the most probable to the least probable. It acts like a list (iterable) but does not implement every list method.
- append(item: CharsetMatch) -> None
Insert a single match. It is inserted at the position that preserves the sort order, and may be inserted as a submatch.
- best() -> CharsetMatch | None
Simply return the first match. Strictly equivalent to matches[0].
- first() -> CharsetMatch | None
Redundant method that calls best(). Kept for backward compatibility.
List behaviour¶
As mentioned earlier, a CharsetMatches object behaves like a list.
# Calling len() on results also works
if not results:
    print('No match for your sequence')

# Iterate over results like a list
for match in results:
    print(match.encoding, 'can decode properly your sequence using', match.alphabets, 'and language', match.language)

# Use an index to access results
if results:
    print(str(results[0]))
Using best()¶
As noted above, a CharsetMatches object behaves like a list and is sorted by default when it is returned by from_bytes, from_fp or from_path.
Calling best() returns the most probable result, that is, the first entry of the list (index 0).
It returns a CharsetMatch object, or None if the container holds no results.
result = results.best()
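Because best() may return None, callers generally need to guard the result before decoding. The helper below is a hypothetical wrapper (decode_best is not part of charset_normalizer) that sketches one way to do so; it relies on str() applied to a match, as used in the list example earlier.

```python
from typing import Optional

from charset_normalizer import from_bytes


def decode_best(payload: bytes) -> Optional[str]:
    """Hypothetical helper: decode payload using the most probable match, or return None."""
    best = from_bytes(payload).best()
    if best is None:
        # No encoding was judged plausible for this payload.
        return None
    return str(best)


sample = 'Всеки човек има право на образование.'
print(decode_best(sample.encode('utf_8')))
```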
Calling first()¶
Exactly the same as calling best().
Class aliases¶
CharsetMatches is also known as CharsetDetector, CharsetDoctor and CharsetNormalizerMatches.
These aliases are useful if you prefer a short class name.
Verbose output¶
You may want to understand why a specific encoding was not picked by charset_normalizer. All you have to do is pass explain=True when calling from_bytes, from_fp or from_path.
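As a rough sketch, verbose mode can be enabled for a single call as follows. The wrapper name detect_verbosely is hypothetical; only the explain parameter comes from the library. The trace is emitted while the call runs, and the returned object is used as usual.

```python
from charset_normalizer import from_bytes


def detect_verbosely(payload: bytes):
    """Hypothetical wrapper: run detection with explain=True and return the matches."""
    return from_bytes(payload, explain=True)


sample = 'Всеки човек има право на образование.'
results = detect_verbosely(sample.encode('cp1251'))

best = results.best()
if best is not None:
    print(best.encoding)
```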