cld_generated_cjk_delta_bi_4.cc |
|
89683 |
cld_generated_cjk_uni_prop_80.cc |
|
372307 |
cld_generated_score_quad_octa_0122_2.cc |
|
26992 |
cld2_dynamic_data.h |
|
10143 |
cld2_dynamic_data_loader.h |
|
2986 |
cld2_generated_cjk_compatible.cc |
|
17649 |
cld2_generated_deltaoctachrome0122.cc |
|
468277 |
cld2_generated_distinctoctachrome0122.cc |
|
198411 |
cld2_generated_quadchrome0122_16.cc |
|
4731391 |
cld2tablesummary.h |
|
2061 |
cldutil.cc |
|
25007 |
cldutil.h |
|
2902 |
cldutil_shared.cc |
|
14916 |
cldutil_shared.h |
|
16690 |
compact_lang_det.cc |
|
11713 |
compact_lang_det_hint_code.cc |
|
55508 |
compact_lang_det_hint_code.h |
|
3107 |
compact_lang_det_impl.cc |
|
78168 |
compact_lang_det_impl.h |
Flag meanings:
Flags are used in the context of a recursive call from Detect to itself,
trying to deal in a more restrictive way with input that was not reliably
identified in the top-level call.
Finish -- Do not further recurse; return whatever result ensues, even if it is
unreliable. Typically set in any recursive call to take a second try
on unreliable text.
Squeeze -- For each text run, do an in-place cheapsqueeze to remove chunks of
highly repetitive text and chunks of text with too many 1- and
2-letter words. This avoids scoring repetitive or useless non-text
crap in large files, such as bogus JPEGs within an HTML file.
Repeats -- When scoring a text run, do a cheap prediction of each character
and do not score a unigram/quadgram if the last character of same is
correctly predicted. This is a slower, finer-grained form of
cheapsqueeze, typically used when the first pass got unreliable
results.
Top40 -- Restrict the set of scored languages to the Google "Top 40", which is
actually 38 languages. This gets rid of about 110 languages that
represent about 0.7% of the web. Typically used when the first pass
got unreliable results.
Short -- DEPRECATED, unused
Hint -- EXPERIMENTAL flag for compact_lang_det_test.cc to indicate a language
hint supplied in parameter plus_one.
UseWords -- In addition to scoring quad/uni/nil-grams, score complete words.
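As a rough illustration of how flags like these are commonly represented, the sketch below models them as a bitmask. The identifier names and bit values are assumptions made for this example and are not taken from the header; only the flag meanings come from the comment above.

```cpp
#include <cstdint>

// Hypothetical flag names and bit positions, for illustration only.
enum DetectFlag : std::uint32_t {
  kFlagFinish   = 1u << 0,  // Do not recurse further; return the result as-is.
  kFlagSqueeze  = 1u << 1,  // Cheapsqueeze each text run before scoring.
  kFlagRepeats  = 1u << 2,  // Skip n-grams whose last character is predicted.
  kFlagTop40    = 1u << 3,  // Restrict scoring to the "Top 40" languages.
  kFlagShort    = 1u << 4,  // Deprecated, unused.
  kFlagHint     = 1u << 5,  // Experimental: language hint in plus_one.
  kFlagUseWords = 1u << 6,  // Also score complete words.
};

// Returns true if the given bit is set in the flag word.
inline bool HasFlag(std::uint32_t flags, DetectFlag bit) {
  return (flags & bit) != 0;
}

// Per the description of Finish, a call that carries it must not recurse.
inline bool MayRecurse(std::uint32_t flags) {
  return !HasFlag(flags, kFlagFinish);
}
```

A recursive second try would then pass something like `flags | kFlagFinish`, so the nested call returns whatever it finds instead of recursing again.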
Tentative decision logic:
In the middle of first pass -- After 4KB of text, look at the front 256 bytes
of every full 4KB buffer. If it compresses very well (say 3:1) or has
lots of spaces (say 1 of every 4 bytes), assume that the input is
large and contains lots of bogus non-text. Recurse, passing the
Squeeze flag to strip out chunks of this non-text.
At the end of the first pass --
If the top language is reliable and >= 70% of the document, return.
Else if the top language is reliable and top+2nd >= say 94%, return.
Else, either the top language is not reliable or there is a lot of
other crap.
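The thresholds described above can be sketched as two small checks. This is a minimal sketch and not the library's code: the type and function names are hypothetical, the compression ratio is taken as an input because the comment does not say how it is estimated, and the step that follows an unreliable first pass is left out because the comment is cut off at that point.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical summary of a finished first pass.
struct FirstPassSummary {
  bool top_is_reliable;  // Is the top language considered reliable?
  int top_percent;       // Share of the document for the top language.
  int second_percent;    // Share of the document for the second language.
};

enum class FirstPassDecision { kReturnResult, kUnreliable };

// End-of-first-pass rule: return if the top language is reliable and covers
// >= 70% of the document, or is reliable and top + second cover >= 94%.
inline FirstPassDecision DecideAtEndOfFirstPass(const FirstPassSummary& s) {
  if (s.top_is_reliable && s.top_percent >= 70) {
    return FirstPassDecision::kReturnResult;
  }
  if (s.top_is_reliable && s.top_percent + s.second_percent >= 94) {
    return FirstPassDecision::kReturnResult;
  }
  return FirstPassDecision::kUnreliable;
}

// Mid-pass rule: inspect the front 256 bytes of a buffer. Text that
// compresses about 3:1 or better, or where at least 1 byte in 4 is a space,
// is treated as likely non-text (a candidate for a Squeeze pass).
inline bool LooksLikeBogusNonText(const std::string& buffer,
                                  double compression_ratio) {
  const std::size_t n = std::min<std::size_t>(buffer.size(), 256);
  if (n == 0) return false;
  const auto spaces = static_cast<std::size_t>(
      std::count(buffer.begin(), buffer.begin() + n, ' '));
  return compression_ratio >= 3.0 || spaces * 4 >= n;
}
```

For example, a reliable top language at 60% with a second language at 35% passes the second test (95 >= 94), while the same top language with a 20% runner-up does not.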
* |
7149 |
debug.h |
|
2015 |
debug_empty.cc |
|
2084 |
fixunicodevalue.cc |
|
1583 |
fixunicodevalue.h |
|
3141 |
generated_distinct_bi_0.cc |
|
1782 |
generated_entities.cc |
|
6278 |
generated_language.cc |
|
141928 |
generated_language.h |
|
28159 |
generated_ulscript.cc |
|
26715 |
generated_ulscript.h |
|
5839 |
getonescriptspan.cc |
|
37920 |
getonescriptspan.h |
|
4192 |
integral_types.h |
|
945 |
lang_script.cc |
|
20840 |
lang_script.h |
|
8326 |
langspan.h |
|
1403 |
LICENSE |
|
11358 |
offsetmap.cc |
|
18213 |
offsetmap.h |
|
5578 |
port.h |
|
4112 |
scoreonescriptspan.cc |
|
51724 |
scoreonescriptspan.h |
|
12114 |
stringpiece.h |
|
2337 |
tote.cc |
|
6831 |
tote.h |
|
4074 |
utf8prop_lettermarkscriptnum.h |
|
82751 |
utf8repl_lettermarklower.h |
|
40027 |
utf8scannot_lettermarkspecial.h |
|
70834 |
utf8statetable.cc |
|
48850 |
utf8statetable.h |
|
10072 |