mozilla-central/toolkit/components/translation/cld2/internal

Name	Description	Size
cld_generated_cjk_delta_bi_4.cc		89683
cld_generated_cjk_uni_prop_80.cc		372307
cld_generated_score_quad_octa_0122_2.cc		26992
cld2_dynamic_data.h		10143
cld2_dynamic_data_loader.h		2986
cld2_generated_cjk_compatible.cc		17649
cld2_generated_deltaoctachrome0122.cc		468277
cld2_generated_distinctoctachrome0122.cc		198411
cld2_generated_quadchrome0122_16.cc		4731391
cld2tablesummary.h		2061
cldutil.cc		25007
cldutil.h		2902
cldutil_shared.cc		14916
cldutil_shared.h		16690
compact_lang_det.cc		11713
compact_lang_det_hint_code.cc		55508
compact_lang_det_hint_code.h		3107
compact_lang_det_impl.cc		78168
compact_lang_det_impl.h	Flag meanings: Flags are used in the context of a recursive call from Detect to itself, trying to deal in a more restrictive way with input that was not reliably identified in the top-level call. Finish -- Do not recurse further; return whatever result ensues, even if it is unreliable. Typically set in any recursive call to take a second try on unreliable text. Squeeze -- For each text run, do an in-place cheapsqueeze to remove chunks of highly repetitive text and chunks of text with too many 1- and 2-letter words. This avoids scoring repetitive or useless non-text crap in large files, such as bogus JPEGs within an HTML file. Repeats -- When scoring a text run, do a cheap prediction of each character and do not score a unigram/quadgram if its last character is correctly predicted. This is a slower, finer-grained form of cheapsqueeze, typically used when the first pass got unreliable results. Top40 -- Restrict the set of scored languages to the Google "Top 40", which is actually 38 languages. This gets rid of about 110 languages that represent about 0.7% of the web. Typically used when the first pass got unreliable results. Short -- DEPRECATED, unused. Hint -- EXPERIMENTAL flag for compact_lang_det_test.cc to indicate a language hint supplied in parameter plus_one. UseWords -- In addition to scoring quad/uni/nil-grams, score complete words. Tentative decision logic: In the middle of first pass -- After 4KB of text, look at the front 256 bytes of every full 4KB buffer. If it compresses very well (say 3:1) or has lots of spaces (say 1 of every 4 bytes), assume that the input is large and contains lots of bogus non-text. Recurse, passing the Squeeze flag to strip out chunks of this non-text. At the end of the first pass -- If the top language is reliable and >= 70% of the document, return. Else if the top language is reliable and top+2nd >= say 94%, return. Else, either the top language is not reliable or there is a lot of other crap.	7149
debug.h		2015
debug_empty.cc		2084
fixunicodevalue.cc		1583
fixunicodevalue.h		3141
generated_distinct_bi_0.cc		1782
generated_entities.cc		6278
generated_language.cc		141928
generated_language.h		28159
generated_ulscript.cc		26715
generated_ulscript.h		5839
getonescriptspan.cc		37920
getonescriptspan.h		4192
integral_types.h		945
lang_script.cc		20840
lang_script.h		8326
langspan.h		1403
LICENSE		11358
offsetmap.cc		18213
offsetmap.h		5578
port.h		4112
scoreonescriptspan.cc		51724
scoreonescriptspan.h		12114
stringpiece.h		2337
tote.cc		6831
tote.h		4074
utf8prop_lettermarkscriptnum.h		82751
utf8repl_lettermarklower.h		40027
utf8scannot_lettermarkspecial.h		70834
utf8statetable.cc		48850
utf8statetable.h		10072
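
The tentative decision logic quoted in the compact_lang_det_impl.h description can be sketched roughly as follows. This is an illustration only, not the code in compact_lang_det_impl.cc: the names SketchFlags, LooksLikeNonText, FirstPassResult, IsGoodEnough and ShouldRecurse are hypothetical, the flag bit values are made up, and the compressibility test (about 3:1) is left out so that only the space-density check is shown. The thresholds (256 bytes, 1 space in 4, 70%, 94%) are the approximate figures from the description.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical flag names and bit values, for illustration only. The real
// constants are defined in compact_lang_det_impl.h and may differ.
enum SketchFlags : std::uint32_t {
  kSketchFinish = 1u << 0,    // do not recurse further
  kSketchSqueeze = 1u << 1,   // strip repetitive or low-value chunks
  kSketchRepeats = 1u << 2,   // skip n-grams whose last char was predicted
  kSketchTop40 = 1u << 3,     // restrict scoring to the "Top 40" languages
  kSketchUseWords = 1u << 4,  // also score complete words
};

// Mid-pass check on the front 256 bytes of a buffer: returns true when
// roughly 1 of every 4 bytes is a space. The description also mentions a
// compressibility test, which this sketch omits.
bool LooksLikeNonText(std::string_view buffer) {
  const std::size_t n = buffer.size() < 256 ? buffer.size() : 256;
  if (n == 0) return false;
  std::size_t spaces = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (buffer[i] == ' ') ++spaces;
  }
  return spaces * 4 >= n;
}

// Hypothetical summary of a first pass over the document.
struct FirstPassResult {
  bool top_is_reliable;  // is the top language considered reliable?
  int top_percent;       // share of the document for the top language
  int second_percent;    // share of the document for the second language
};

// End-of-first-pass rule: accept if the top language is reliable and covers
// at least 70% of the document, or if it is reliable and the top two
// languages together cover at least 94%.
bool IsGoodEnough(const FirstPassResult& r) {
  if (!r.top_is_reliable) return false;
  if (r.top_percent >= 70) return true;
  return r.top_percent + r.second_percent >= 94;
}

// A second try is only attempted when Finish is not already set and the
// first pass was not good enough.
bool ShouldRecurse(std::uint32_t flags, const FirstPassResult& r) {
  if (flags & kSketchFinish) return false;
  return !IsGoodEnough(r);
}
```

Under these assumptions, a caller would check LooksLikeNonText on each full buffer and recurse with kSketchSqueeze set when it returns true, and would call ShouldRecurse at the end of the first pass, adding kSketchFinish to the flags of any second try so that the recursion stops there.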