Name Description Size 89683 372307 26992
cld2_dynamic_data_loader.h 2986
cld2_dynamic_data.h 10143 17649 468277 198411 4731391
cld2tablesummary.h 2061 14916
cldutil_shared.h 16690 25007
cldutil.h 2902 55508
compact_lang_det_hint_code.h 3107 78168
compact_lang_det_impl.h Flag meanings: Flags are used in the context of a recursive call from Detect to itself, trying to deal in a more restrictive way with input that was not reliably identified in the top-level call.
  Finish -- Do not recurse further; return whatever result ensues, even if it is unreliable. Typically set in any recursive call that takes a second try on unreliable text.
  Squeeze -- For each text run, do an in-place cheapsqueeze to remove chunks of highly repetitive text and chunks of text with too many 1- and 2-letter words. This avoids scoring repetitive or useless non-text crap in large files, such as bogus JPEGs within an HTML file.
  Repeats -- When scoring a text run, do a cheap prediction of each character and do not score a unigram/quadgram if its last character is correctly predicted. This is a slower, finer-grained form of cheapsqueeze, typically used when the first pass got unreliable results.
  Top40 -- Restrict the set of scored languages to the Google "Top 40", which is actually 38 languages. This gets rid of about 110 languages that represent about 0.7% of the web. Typically used when the first pass got unreliable results.
  Short -- DEPRECATED, unused.
  Hint -- EXPERIMENTAL flag to indicate a language hint supplied in parameter plus_one.
  UseWords -- In addition to scoring quad/uni/nil-grams, score complete words.
  Tentative decision logic:
  In the middle of the first pass -- After 4KB of text, look at the front 256 bytes of every full 4KB buffer. If it compresses very well (say 3:1) or has lots of spaces (say 1 of every 4 bytes), assume that the input is large and contains lots of bogus non-text. Recurse, passing the Squeeze flag to strip out chunks of this non-text.
  At the end of the first pass -- If the top language is reliable and >= 70% of the document, return. Else if the top language is reliable and top+2nd >= say 94%, return. Else, either the top language is not reliable or there is a lot of other crap. 7149 11713 2084
debug.h 2015 1583
fixunicodevalue.h 3141 1782 6278 141928
generated_language.h 28159 26715
generated_ulscript.h 5839 37920
getonescriptspan.h 4192
integral_types.h 945 20840
lang_script.h 8326
langspan.h 1403
LICENSE 11358 18213
offsetmap.h 5578
port.h 4112 51724
scoreonescriptspan.h 12114
stringpiece.h 2337 6831
tote.h 4074
utf8prop_lettermarkscriptnum.h 82751
utf8repl_lettermarklower.h 40027
utf8scannot_lettermarkspecial.h 70834 48850
utf8statetable.h 10072
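
The tentative decision logic quoted in the compact_lang_det_impl.h description can be sketched in C++ roughly as follows. This is an illustration only, not code from CLD2: the names FirstPassSummary, NextStep, DecideAfterFirstPass and LooksSpaceHeavy are hypothetical. The thresholds (70%, 94%, 1 space in every 4 bytes, first 256 bytes) are the "say" values from the description. The compression-ratio check is omitted. The description is cut off after the final "Else", so the retry branch here is an assumption based on the flag descriptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical types for illustration; not part of CLD2.
enum class NextStep { kReturn, kRetryRestrictive };

struct FirstPassSummary {
  bool top_is_reliable;  // was the top language scored reliably?
  int top_percent;       // share of the document for the top language
  int second_percent;    // share of the document for the second language
};

// End of first pass: return if the top language is reliable and covers
// >= 70% of the document, or is reliable and top + 2nd cover >= 94%.
// Otherwise fall through to a more restrictive retry (assumed here).
NextStep DecideAfterFirstPass(const FirstPassSummary& s) {
  if (s.top_is_reliable && s.top_percent >= 70) return NextStep::kReturn;
  if (s.top_is_reliable && s.top_percent + s.second_percent >= 94) {
    return NextStep::kReturn;
  }
  return NextStep::kRetryRestrictive;
}

// Middle of first pass: inspect the front 256 bytes of a buffer and
// report whether at least 1 of every 4 bytes is a space.
bool LooksSpaceHeavy(const std::string& buffer) {
  const std::size_t n = std::min<std::size_t>(buffer.size(), 256);
  if (n == 0) return false;
  std::size_t spaces = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (buffer[i] == ' ') ++spaces;
  }
  return spaces * 4 >= n;
}
```

In this sketch a caller would treat a true result from LooksSpaceHeavy as the cue to recurse with the Squeeze flag, and a kRetryRestrictive result as the cue to recurse with some of the more restrictive flags such as Repeats or Top40 together with Finish.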