Command Line Interface
charset-normalizer ships with a CLI that should be available as normalizer. It is a great tool to fully exploit the detector's capabilities without having to write any Python code.
Possible use cases:
Quickly discover the probable originating charset of a file.
Quickly convert a non-Unicode file to Unicode.
Debug the charset detector.
Down below, we will guide you through some basic examples.
Arguments
You may simply invoke normalizer -h (the h(elp) flag) to learn the basics.
usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
file [file ...]
The Real First Universal Charset Detector. Discover originating encoding used
on text file. Normalize text to unicode.
positional arguments:
files File(s) to be analysed
optional arguments:
-h, --help show this help message and exit
-v, --verbose Display complementary information about file if any.
Stdout will contain logs about the detection process.
-a, --with-alternative
Output complementary possibilities if any. Top-level
JSON WILL be a list.
-n, --normalize Permit to normalize input file. If not set, program
does not write anything.
-m, --minimal Only output the charset detected to STDOUT. Disabling
JSON output.
-r, --replace Replace file when trying to normalize it instead of
creating a new one.
-f, --force Replace file without asking if you are sure, use this
flag with caution.
-t THRESHOLD, --threshold THRESHOLD
Define a custom maximum amount of chaos allowed in
decoded content. 0. <= chaos <= 1.
--version Show version information and exit.
normalizer ./data/sample.1.fr.srt
You may also run the command line interface using:
python -m charset_normalizer ./data/sample.1.fr.srt
Main JSON Output
🎉 Since version 1.4.0, the CLI produces an easily usable JSON result on stdout.
{
"path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
"encoding": "cp1252",
"encoding_aliases": [
"1252",
"windows_1252"
],
"alternative_encodings": [
"cp1254",
"cp1256",
"cp1258",
"iso8859_14",
"iso8859_15",
"iso8859_16",
"iso8859_3",
"iso8859_9",
"latin_1",
"mbcs"
],
"language": "French",
"alphabets": [
"Basic Latin",
"Latin-1 Supplement"
],
"has_sig_or_bom": false,
"chaos": 0.149,
"coherence": 97.152,
"unicode_path": null,
"is_preferred": true
}
I recommend the jq command line tool to easily parse and exploit specific data from the produced JSON.
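If you prefer to stay in Python rather than reach for jq, the same fields can be extracted with the standard json module. A minimal sketch; the string below is an abbreviated, hand-written stand-in for the JSON the CLI prints:

```python
import json

# Stand-in for the CLI's stdout (fields abbreviated from the sample above).
raw = """
{
  "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
  "encoding": "cp1252",
  "language": "French",
  "is_preferred": true
}
"""

result = json.loads(raw)

# Pick out the detected encoding, equivalent to `jq -r .encoding`.
print(result["encoding"])  # cp1252
print(result["language"])  # French
```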
Multiple File Input
You may pass multiple files to the CLI. It will then produce a list instead of a single object at the top level. When using the -m (minimal output) flag, it will instead print one result (encoding) per line.
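Consuming that top-level list in Python is a matter of iterating over it. A hedged sketch: the two entries below are made-up results standing in for real CLI output over two files:

```python
import json

# Stand-in for the stdout of a multi-file run such as `normalizer a.txt b.txt`:
# a top-level JSON list with one object per analysed file.
raw = """
[
  {"path": "a.txt", "encoding": "cp1252", "is_preferred": true},
  {"path": "b.txt", "encoding": "utf_8", "is_preferred": true}
]
"""

results = json.loads(raw)

# Map each file path to its detected encoding.
by_path = {entry["path"]: entry["encoding"] for entry in results}
print(by_path)  # {'a.txt': 'cp1252', 'b.txt': 'utf_8'}
```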
Unicode Conversion
If you want to convert a file to Unicode, append the -n flag. By default it produces a new file and does not replace the original.
The newly created file path will be declared in unicode_path (JSON output).
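Conceptually, the normalization step amounts to decoding the file with the detected charset and re-encoding it as UTF-8 into a new file. A rough stdlib-only sketch of that idea; the cp1252 sample bytes, file name, and output naming scheme are all made up for illustration and are not the CLI's actual internals:

```python
import pathlib
import tempfile

# Fabricated input: "Déjà vu" encoded with the charset the detector
# would report as "encoding" in its JSON output.
source_bytes = "Déjà vu".encode("cp1252")
detected = "cp1252"

with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "sample.srt"
    src.write_bytes(source_bytes)

    # Decode with the detected charset, then write a UTF-8 copy to a
    # new file -- the original is left untouched, mirroring the default
    # behaviour of -n without -r.
    text = src.read_bytes().decode(detected)
    dst = src.with_suffix(".utf_8.srt")  # hypothetical naming scheme
    dst.write_bytes(text.encode("utf-8"))

    print(dst.read_bytes().decode("utf-8"))  # Déjà vu
```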