Name Description Size
cld_generated_cjk_delta_bi_4.cc 89683
cld_generated_cjk_uni_prop_80.cc 372307
cld_generated_score_quad_octa_0122_2.cc 26992
cld2_dynamic_data_loader.h 2986
cld2_dynamic_data.h 10143
cld2_generated_cjk_compatible.cc 17649
cld2_generated_deltaoctachrome0122.cc 468277
cld2_generated_distinctoctachrome0122.cc 198411
cld2_generated_quadchrome0122_16.cc 4731391
cld2tablesummary.h 2061
cldutil_shared.cc 14916
cldutil_shared.h 16690
cldutil.cc 25007
cldutil.h 2902
compact_lang_det_hint_code.cc 55508
compact_lang_det_hint_code.h 3107
compact_lang_det_impl.cc 78168
compact_lang_det_impl.h Flag meanings: Flags are used in the context of a recursive call from Detect to itself, trying to deal in a more restrictive way with input that was not reliably identified in the top-level call. Finish -- Do not further recurse; return whatever result ensues, even if it is unreliable. Typically set in any recursive call to take a second try on unreliable text. Squeeze -- For each text run, do an in-place cheapsqueeze to remove chunks of highly repetitive text and chunks of text with too many 1- and 2-letter words. This avoids scoring repetitive or useless non-text crap in large files, such as bogus JPEGs within an HTML file. Repeats -- When scoring a text run, do a cheap prediction of each character and do not score a unigram/quadgram if the last character of same is correctly predicted. This is a slower, finer-grained form of cheapsqueeze, typically used when the first pass got unreliable results. Top40 -- Restrict the set of scored languages to the Google "Top 40", which is actually 38 languages. This gets rid of about 110 languages that represent about 0.7% of the web. Typically used when the first pass got unreliable results. Short -- DEPRECATED, unused. Hint -- EXPERIMENTAL flag for compact_lang_det_test.cc to indicate a language hint supplied in parameter plus_one. UseWords -- In addition to scoring quad/uni/nil-grams, score complete words. Tentative decision logic: In the middle of first pass -- After 4KB of text, look at the front 256 bytes of every full 4KB buffer. If it compresses very well (say 3:1) or has lots of spaces (say 1 of every 4 bytes), assume that the input is large and contains lots of bogus non-text. Recurse, passing the Squeeze flag to strip out chunks of this non-text. At the end of the first pass -- If the top language is reliable and >= 70% of the document, return. Else if the top language is reliable and top+2nd >= say 94%, return. Else, either the top language is not reliable or there is a lot of other crap. 7149
compact_lang_det.cc 11713
debug_empty.cc 2084
debug.h 2015
fixunicodevalue.cc 1583
fixunicodevalue.h 3141
generated_distinct_bi_0.cc 1782
generated_entities.cc 6278
generated_language.cc 141928
generated_language.h 28159
generated_ulscript.cc 26715
generated_ulscript.h 5839
getonescriptspan.cc 37920
getonescriptspan.h 4192
integral_types.h 945
lang_script.cc 20840
lang_script.h 8326
langspan.h 1403
LICENSE 11358
offsetmap.cc 18213
offsetmap.h 5578
port.h 4112
scoreonescriptspan.cc 51724
scoreonescriptspan.h 12114
stringpiece.h 2337
tote.cc 6831
tote.h 4074
utf8prop_lettermarkscriptnum.h 82751
utf8repl_lettermarklower.h 40027
utf8scannot_lettermarkspecial.h 70834
utf8statetable.cc 48850
utf8statetable.h 10072
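
The tentative decision logic quoted in the compact_lang_det_impl.h description can be read as two small checks. The following is a minimal sketch of that reading, not the code in compact_lang_det_impl.cc. The names FirstPassSummary, Decision, PrefixIsSpaceHeavy, and DecideAfterFirstPass are hypothetical and chosen for illustration only. The compression-ratio test (about 3:1) is left out, and because the listing cuts off before saying what follows an unresolved first pass, that case is returned as kUnresolved rather than guessed at.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Hypothetical summary of a first-pass result; these are not CLD2's own types.
struct FirstPassSummary {
  bool top_is_reliable;  // was the top language scored reliably?
  int top_percent;       // share of the document assigned to the top language
  int second_percent;    // share assigned to the second language
};

enum class Decision { kReturnResult, kUnresolved };

// Thresholds taken from the description ("say" values, so treat as approximate).
constexpr int kTopAlonePercent = 70;
constexpr int kTopTwoPercent = 94;
constexpr std::size_t kPrefixBytes = 256;

// Mid-pass heuristic: inspect the front 256 bytes of a buffer and report
// whether at least 1 of every 4 bytes is a space. The compression test that
// the description mentions alongside this one is not modeled here.
bool PrefixIsSpaceHeavy(std::string_view buffer) {
  const std::size_t n = std::min(buffer.size(), kPrefixBytes);
  if (n == 0) return false;
  std::size_t spaces = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (buffer[i] == ' ') ++spaces;
  }
  return spaces * 4 >= n;
}

// End-of-first-pass decision: accept a reliable top language that covers
// at least 70% of the document, or a reliable top language whose share plus
// the second language's share reaches 94%. Anything else is left unresolved.
Decision DecideAfterFirstPass(const FirstPassSummary& s) {
  if (s.top_is_reliable && s.top_percent >= kTopAlonePercent) {
    return Decision::kReturnResult;
  }
  if (s.top_is_reliable &&
      s.top_percent + s.second_percent >= kTopTwoPercent) {
    return Decision::kReturnResult;
  }
  return Decision::kUnresolved;
}
```

Under these assumptions, a reliable 60% / 35% split is accepted by the second rule, while an unreliable top language is never accepted regardless of its share.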