Advanced Search
The Charset Normalizer methods from_bytes, from_fp and from_path accept several optional parameters that can be tweaked, as follows:
from charset_normalizer import CharsetNormalizerMatches as CnM

my_byte_str = '我没有埋怨,磋砣的只是一些时间。'.encode('gb18030')

results = CnM.from_bytes(
    my_byte_str,
    steps=10,  # Number of steps/blocks to extract from my_byte_str
    chunk_size=512,  # Size of each extracted block
    threshold=0.2,  # Maximum amount of chaos allowed on the first pass
    cp_isolation=None,  # Finite list of encodings to use when searching for a match
    cp_exclusion=None,  # Finite list of encodings to avoid when searching for a match
    preemptive_behaviour=True,  # Whether to look into my_byte_str (ASCII mode) for a pre-defined encoding
    explain=False  # Print what is happening while searching for a match
)
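To make steps and chunk_size more concrete, the sketch below shows one way a payload could be sampled in blocks. It is a plain-Python illustration and not the library's implementation: the helper name sample_chunks and the evenly spaced offsets are assumptions made for this example.

```python
def sample_chunks(payload: bytes, steps: int = 10, chunk_size: int = 512) -> list:
    """Return up to `steps` blocks of at most `chunk_size` bytes (hypothetical helper)."""
    if steps <= 0 or chunk_size <= 0:
        raise ValueError("steps and chunk_size must be positive")
    if len(payload) <= steps * chunk_size:
        # Small payload: read everything, one block after another.
        return [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]
    # Larger payload: assume evenly spaced starting offsets.
    stride = len(payload) // steps
    return [payload[i * stride:i * stride + chunk_size] for i in range(steps)]


my_byte_str = '我没有埋怨,磋砣的只是一些时间。'.encode('gb18030')

# Repeat the sample so the payload is larger than steps * chunk_size.
chunks = sample_chunks(my_byte_str * 1000, steps=10, chunk_size=512)
print(len(chunks), [len(c) for c in chunks])
```

Under these assumptions, raising steps or chunk_size inspects more of the payload at the cost of more work, while lowering them trades coverage for speed.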
Using CharsetNormalizerMatches
Here, results is a CharsetNormalizerMatches object. It behaves like a list.
Initially it is not sorted. Be cautious when extracting the first() result without calling best() beforehand.
List behaviour
As mentioned earlier, a CharsetNormalizerMatches object behaves like a list.
# Calling len() on results also works
if len(results) == 0:
    print('No match for your sequence')

# Iterate over results like a list
for match in results:
    print(match.encoding, 'can decode properly your sequence using', match.alphabets, 'and language', match.language)

# Use an index to access results
if len(results) > 0:
    print(str(results[0]))
Using best()
As mentioned above, a CharsetNormalizerMatches object behaves like a list, and it is not sorted after calling from_bytes, from_fp or from_path.
Calling best() keeps only the results with the lowest chaos and, if necessary, selects the most coherent result among them.
It also returns a CharsetNormalizerMatches object.
results = results.best()
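The selection rule described above can be sketched in plain Python. This is only an illustration of "lowest chaos first, then best coherence", not the library's code: the Candidate class, the pick_best function and the scores are all invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for one match; not the library's class."""
    encoding: str
    chaos: float      # lower means less mess after decoding
    coherence: float  # higher means the text looks more like a real language


def pick_best(candidates):
    """Keep the least chaotic candidates, then break ties on coherence."""
    if not candidates:
        return []
    lowest = min(c.chaos for c in candidates)
    least_chaotic = [c for c in candidates if c.chaos == lowest]
    if len(least_chaotic) == 1:
        return least_chaotic
    # Several candidates share the lowest chaos: keep the most coherent one.
    return [max(least_chaotic, key=lambda c: c.coherence)]


# Made-up scores, for illustration only.
candidates = [
    Candidate('cp1252', chaos=0.31, coherence=0.10),
    Candidate('gb18030', chaos=0.0, coherence=0.92),
    Candidate('big5', chaos=0.0, coherence=0.40),
]
best = pick_best(candidates)
print(best[0].encoding)  # gb18030
```

Like best(), the sketch returns a list rather than a single item, so the caller still decides how to take the first element.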
Calling first()
This method can be called on a CharsetNormalizerMatches object. It extracts the first match in the list and returns it as a CharsetNormalizerMatch object. See the Handling result section.
Class aliases
CharsetNormalizerMatches is also known as CharsetDetector, CharsetDoctor and EncodingDetector.
These aliases are useful if you prefer a shorter class name.
Verbose output
You may want to understand why a specific encoding was not picked by charset_normalizer. All you have to do is set explain to True when calling from_bytes, from_fp or from_path.