| Name | Description | Size (bytes) |
| --- | --- | --- |
| `__init__.py` | Detect the encoding of a given byte string via `detect()`. Parameters: `byte_str` (`bytes` or `bytearray`), the byte sequence to examine; `should_rename_legacy` (`bool`), whether legacy encodings should be renamed to their more modern equivalents. (Usage sketch below the table.) | 4797 |
| `__main__.py` | Wrapper so people can run `python -m chardet`. | 123 |
| `big5freq.py` |  | 31274 |
| `big5prober.py` |  | 1763 |
| `chardistribution.py` | Reset the analyser, clear any state. | 10032 |
| `charsetgroupprober.py` |  | 3915 |
| `charsetprober.py` | Defines three types of bytes: alphabet, English letters `[a-zA-Z]`; international, international characters `[\x80-\xFF]`; marker, everything else `[^a-zA-Z\x80-\xFF]`. The input buffer can be thought of as a series of words delimited by markers. The documented filter keeps the words that contain at least one international character and replaces each contiguous sequence of markers with a single ASCII space; it applies to all scripts that do not use English characters. (Sketch below the table.) | 5420 |
| `cli/` | (directory) |  |
| `codingstatemachine.py` | A state machine to verify a byte sequence for a particular encoding. For each byte the detector receives, it feeds that byte to every active state machine, one byte at a time; each machine changes state based on its previous state and the byte it receives. Three states are of interest to an auto-detector: START, the starting state, or one where a legal byte sequence (i.e. a valid code point) for a character has been identified; ME, the machine has identified a byte sequence specific to the charset it is designed for, which no other encoding can contain, leading to an immediate positive answer; ERROR, the machine has identified an illegal byte sequence for that encoding, leading to an immediate negative answer, after which the detector excludes that encoding from consideration. (Toy sketch below the table.) | 3732 |
| `codingstatemachinedict.py` |  | 542 |
| `cp949prober.py` |  | 1860 |
| `enums.py` | All of the enums used throughout the chardet package. Author: Dan Blanchard (dan.blanchard@gmail.com). | 1683 |
| `escprober.py` | This CharSetProber uses a "code scheme" approach to detect encodings: easily recognizable escape or shift sequences are relied on to identify them. | 4006 |
| `escsm.py` |  | 12176 |
| `eucjpprober.py` |  | 3934 |
| `euckrfreq.py` |  | 13566 |
| `euckrprober.py` |  | 1753 |
| `euctwfreq.py` |  | 36913 |
| `euctwprober.py` |  | 1753 |
| `gb2312freq.py` |  | 20735 |
| `gb2312prober.py` |  | 1759 |
| `hebrewprober.py` |  | 14537 |
| `jisfreq.py` |  | 25796 |
| `johabfreq.py` |  | 42498 |
| `johabprober.py` |  | 1752 |
| `jpcntx.py` |  | 27055 |
| `langbulgarianmodel.py` |  | 104550 |
| `langgreekmodel.py` |  | 98472 |
| `langhebrewmodel.py` |  | 98184 |
| `langhungarianmodel.py` |  | 101351 |
| `langrussianmodel.py` |  | 128023 |
| `langthaimodel.py` |  | 102762 |
| `langturkishmodel.py` |  | 95360 |
| `latin1prober.py` |  | 5380 |
| `macromanprober.py` |  | 6077 |
| `mbcharsetprober.py` | MultiByteCharSetProber. | 3715 |
| `mbcsgroupprober.py` |  | 2131 |
| `mbcssm.py` |  | 30391 |
| `metadata/` | (directory) |  |
| `py.typed` |  | 0 |
| `resultdict.py` |  | 402 |
| `sbcharsetprober.py` |  | 6400 |
| `sbcsgroupprober.py` |  | 4137 |
| `sjisprober.py` |  | 4007 |
| `universaldetector.py` | Module containing the UniversalDetector class, the primary class a user of `chardet` should use. Authors: Mark Pilgrim (initial port to Python), Shy Shalom (original C code), Dan Blanchard (major refactoring for 3.0), Ian Cordasco. (Usage sketch below the table.) | 14848 |
| `utf8prober.py` |  | 2812 |
| `utf1632prober.py` | Looks for occurrences of zero bytes and infers whether the file is UTF-16 or UTF-32 (little-endian or big-endian). For instance, files that look like `( \0 \0 \0 [nonzero] )+` have a good probability of being UTF-32BE; files that look like `( \0 [nonzero] )+` may be guessed to be UTF-16BE, and inversely for the little-endian varieties. (Heuristic sketch below the table.) | 8505 |
| `version.py` | This module exists only to simplify retrieving the version number of chardet from within setuptools and from chardet subpackages. Author: Dan Blanchard (dan.blanchard@gmail.com). | 244 |
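
For `__init__.py`, a minimal usage sketch of the documented `detect()` parameters; the sample text and encoding are arbitrary:

```python
import chardet

# Detect the encoding of a raw byte string.
raw = "こんにちは、世界".encode("euc-jp")
print(chardet.detect(raw))
# -> a dict with 'encoding', 'confidence', and 'language' keys

# should_rename_legacy asks chardet to report the more modern name
# for certain legacy encodings instead of the legacy one.
print(chardet.detect(raw, should_rename_legacy=True))
```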
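
The filtering rule described for `charsetprober.py` translates naturally into a small standalone function. This is an illustrative sketch of the documented behaviour, not chardet's actual code; the function name and regex are assumptions:

```python
import re

# Words are runs of alphabet/international bytes; this simplified pattern
# matches only words containing at least one international byte (\x80-\xff).
INTERNATIONAL_WORD = re.compile(b"[a-zA-Z]*[\x80-\xff]+[a-zA-Z]*")

def filter_international_words(buf: bytes) -> bytearray:
    """Sketch: keep international words; each run of marker bytes between
    kept words is represented by a single ASCII space."""
    filtered = bytearray()
    for word in INTERNATIONAL_WORD.findall(buf):
        filtered.extend(word)
        filtered.extend(b" ")  # markers between words collapse to one space
    return filtered

# Example: English-only words are dropped, accented words are kept.
print(filter_international_words("plain café naïve".encode("latin-1")))
# -> bytearray(b'caf\xe9 na\xefve ')
```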
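
To make the START/ME/ERROR description in `codingstatemachine.py` concrete, here is a toy byte-level verifier for UTF-8. The class and its transition logic are a simplified illustration of the docstring's model, not chardet's table-driven machines:

```python
from enum import Enum

class MachineState(Enum):
    START = 0  # legal so far, or a complete code point was just identified
    ME = 1     # sequence unique to this encoding: immediate positive answer
    ERROR = 2  # illegal sequence for this encoding: immediate negative answer

class Utf8ToyStateMachine:
    """Toy UTF-8 verifier: track expected continuation bytes per character."""

    def __init__(self) -> None:
        self.pending = 0  # continuation bytes still required

    def next_state(self, byte: int) -> MachineState:
        if self.pending:                      # inside a multi-byte character
            if 0x80 <= byte <= 0xBF:
                self.pending -= 1
                # Treat a completed multi-byte sequence as UTF-8-specific here.
                return MachineState.ME if self.pending == 0 else MachineState.START
            return MachineState.ERROR         # a continuation byte was required
        if byte <= 0x7F:
            return MachineState.START         # ASCII: legal in many encodings
        if 0xC2 <= byte <= 0xDF:
            self.pending = 1                  # lead byte of a 2-byte character
        elif 0xE0 <= byte <= 0xEF:
            self.pending = 2                  # lead byte of a 3-byte character
        elif 0xF0 <= byte <= 0xF4:
            self.pending = 3                  # lead byte of a 4-byte character
        else:
            return MachineState.ERROR         # 0x80-0xC1 / 0xF5-0xFF cannot lead
        return MachineState.START

# Feed bytes one at a time, as the detector does with every active machine.
machine = Utf8ToyStateMachine()
for b in "é".encode("utf-8"):
    print(hex(b), machine.next_state(b))
```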
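
`universaldetector.py` supports the incremental feed/close pattern sketched below; `unknown.txt` is a placeholder file name:

```python
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with open("unknown.txt", "rb") as handle:  # placeholder path
    for chunk in iter(lambda: handle.read(4096), b""):
        detector.feed(chunk)
        if detector.done:   # confident enough; no need to read the rest
            break
detector.close()            # finalize the result
print(detector.result)      # {'encoding': ..., 'confidence': ..., 'language': ...}
```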
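
Finally, a rough sketch of the zero-byte heuristic that `utf1632prober.py`'s description outlines: tally where zero bytes fall within 2- and 4-byte groups. The 90% thresholds and the decision rule are illustrative assumptions, not the prober's actual scoring:

```python
def guess_utf_layout(data: bytes) -> str:
    """Illustrative guess at UTF-32/UTF-16 endianness from zero-byte positions."""
    if len(data) < 8:
        return "unknown"
    zeros = [0, 0, 0, 0]  # zero-byte counts per offset within 4-byte groups
    for i, byte in enumerate(data):
        if byte == 0:
            zeros[i % 4] += 1
    groups = len(data) // 4
    # ( \0 \0 \0 [nonzero] )+ suggests UTF-32BE; mirrored, UTF-32LE.
    if min(zeros[0], zeros[1], zeros[2]) > 0.9 * groups:
        return "utf-32be"
    if min(zeros[1], zeros[2], zeros[3]) > 0.9 * groups:
        return "utf-32le"
    # ( \0 [nonzero] )+ suggests UTF-16BE; ( [nonzero] \0 )+ suggests UTF-16LE.
    pairs = len(data) // 2
    even, odd = zeros[0] + zeros[2], zeros[1] + zeros[3]
    if even > 0.9 * pairs and odd < 0.1 * pairs:
        return "utf-16be"
    if odd > 0.9 * pairs and even < 0.1 * pairs:
        return "utf-16le"
    return "unknown"

print(guess_utf_layout("hello world".encode("utf-32-be")))  # utf-32be
print(guess_utf_layout("hello world".encode("utf-16-le")))  # utf-16le
```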