compact_lang_det_impl.h

Flag meanings:

Flags are used in the context of a recursive call from Detect to itself, trying to deal in a more restrictive way with input that was not reliably identified in the top-level call.

  Finish   -- Do not recurse further; return whatever result ensues, even if it is unreliable. Typically set in any recursive call to take a second try on unreliable text.

  Squeeze  -- For each text run, do an in-place cheapsqueeze to remove chunks of highly repetitive text and chunks of text with too many 1- and 2-letter words. This avoids scoring repetitive or useless non-text crap in large files, such as bogus JPEGs within an HTML file.

  Repeats  -- When scoring a text run, do a cheap prediction of each character and do not score a unigram/quadgram if its last character is correctly predicted. This is a slower, finer-grained form of cheapsqueeze, typically used when the first pass got unreliable results.

  Top40    -- Restrict the set of scored languages to the Google "Top 40", which is actually 38 languages. This gets rid of about 110 languages that represent about 0.7% of the web. Typically used when the first pass got unreliable results.

  Short    -- DEPRECATED, unused.

  Hint     -- EXPERIMENTAL flag to indicate a language hint supplied in parameter plus_one.

  UseWords -- In addition to scoring quad/uni/nil-grams, score complete words.

Tentative decision logic:

In the middle of first pass -- After 4KB of text, look at the front 256 bytes of every full 4KB buffer. If it compresses very well (say 3:1) or has lots of spaces (say 1 of every 4 bytes), assume that the input is large and contains lots of bogus non-text. Recurse, passing the Squeeze flag to strip out chunks of this non-text.

At the end of the first pass --
  If the top language is reliable and >= 70% of the document, return.
  Else if the top language is reliable and top+2nd >= say 94%, return.
  Else, either the top language is not reliable or there is a lot of other crap.
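
The decision logic above can be sketched in C++ roughly as follows. This is a minimal illustration, not the code in this header: the flag names and bit values, FirstPassSummary, TooManySpaces, AcceptFirstPass, and RetryFlags are all hypothetical names introduced here. The 3:1 compression test is left out because it needs a compressor, and the retry flag set is only what the flag descriptions call "typical", not a confirmed choice.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical flag bits; the real names and values may differ.
enum DetectFlag : int {
  kFlagFinish  = 1 << 0,  // do not recurse again
  kFlagSqueeze = 1 << 1,  // cheapsqueeze each text run
  kFlagRepeats = 1 << 2,  // skip grams whose last character was predicted
  kFlagTop40   = 1 << 3,  // restrict scoring to the "Top 40" languages
};

// Hypothetical summary of what the first pass produced.
struct FirstPassSummary {
  bool top_is_reliable;  // is the top language's score reliable?
  int top_percent;       // share of the document for the top language
  int second_percent;    // share of the document for the second language
};

constexpr std::size_t kPeekBytes = 256;  // "front 256 bytes" of a 4KB buffer
constexpr int kTopAloneMinPercent = 70;  // ">= 70% of the document"
constexpr int kTopTwoMinPercent = 94;    // "top+2nd >= say 94%"

// Mid-pass check: true if at least 1 of every 4 peeked bytes is a space.
// The "compresses very well" half of the test is omitted in this sketch.
bool TooManySpaces(const char* buf, std::size_t len) {
  const std::size_t n = std::min(len, kPeekBytes);
  if (n == 0) return false;
  const std::size_t spaces =
      static_cast<std::size_t>(std::count(buf, buf + n, ' '));
  return spaces * 4 >= n;
}

// End-of-pass check: true if the first-pass answer can be returned as-is.
bool AcceptFirstPass(const FirstPassSummary& s) {
  if (!s.top_is_reliable) return false;
  if (s.top_percent >= kTopAloneMinPercent) return true;
  return s.top_percent + s.second_percent >= kTopTwoMinPercent;
}

// Flags for a second try on unreliable text, assuming the "typical" usage
// described for Finish, Repeats, and Top40.
int RetryFlags() {
  return kFlagFinish | kFlagRepeats | kFlagTop40;
}
```

Under these assumptions, a caller would recurse with kFlagSqueeze when TooManySpaces fires on a full buffer, and would recurse with RetryFlags() when AcceptFirstPass returns false at the end of the pass. Setting Finish in the retry guarantees that the second call returns its result rather than recursing again.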
