Source code

Revision control

Other Tools

1
#!/usr/bin/env python
2
# Copyright 2014 The Chromium Authors. All rights reserved.
3
# Use of this source code is governed by a BSD-style license that can be
4
# found in the LICENSE file.
5
6
"""
7
A Deterministic acyclic finite state automaton (DAFSA) is a compact
8
representation of an unordered word list (dictionary).
9
11
12
This python program converts a list of strings to a byte array in C++.
13
This python program fetches strings and return values from a gperf file
14
and generates a C++ file with a byte array representing graph that can be
15
used as a memory efficient replacement for the perfect hash table.
16
17
The input strings are assumed to consist of printable 7-bit ASCII characters
18
and the return values are assumed to be one digit integers.
19
20
In this program a DAFSA is a diamond shaped graph starting at a common
21
source node and ending at a common sink node. All internal nodes contain
22
a label and each word is represented by the labels in one path from
23
the source node to the sink node.
24
25
The following python represention is used for nodes:
26
27
Source node: [ children ]
28
Internal node: (label, [ children ])
29
Sink node: None
30
31
The graph is first compressed by prefixes like a trie. In the next step
32
suffixes are compressed so that the graph gets diamond shaped. Finally
33
one to one linked nodes are replaced by nodes with the labels joined.
34
35
The order of the operations is crucial since lookups will be performed
36
starting from the source with no backtracking. Thus a node must have at
37
most one child with a label starting by the same character. The output
38
is also arranged so that all jumps are to increasing addresses, thus forward
39
in memory.
40
41
The generated output has suffix free decoding so that the sign of leading
42
bits in a link (a reference to a child node) indicate if it has a size of one,
43
two or three bytes and if it is the last outgoing link from the actual node.
44
A node label is terminated by a byte with the leading bit set.
45
46
The generated byte array can described by the following BNF:
47
48
<byte> ::= < 8-bit value in range [0x00-0xFF] >
49
50
<char> ::= < printable 7-bit ASCII character, byte in range [0x20-0x7F] >
51
<end_char> ::= < char + 0x80, byte in range [0xA0-0xFF] >
52
<return value> ::= < value + 0x80, byte in range [0x80-0x8F] >
53
54
<offset1> ::= < byte in range [0x00-0x3F] >
55
<offset2> ::= < byte in range [0x40-0x5F] >
56
<offset3> ::= < byte in range [0x60-0x7F] >
57
58
<end_offset1> ::= < byte in range [0x80-0xBF] >
59
<end_offset2> ::= < byte in range [0xC0-0xDF] >
60
<end_offset3> ::= < byte in range [0xE0-0xFF] >
61
62
<prefix> ::= <char>
63
64
<label> ::= <end_char>
65
| <char> <label>
66
67
<end_label> ::= <return_value>
68
| <char> <end_label>
69
70
<offset> ::= <offset1>
71
| <offset2> <byte>
72
| <offset3> <byte> <byte>
73
74
<end_offset> ::= <end_offset1>
75
| <end_offset2> <byte>
76
| <end_offset3> <byte> <byte>
77
78
<offsets> ::= <end_offset>
79
| <offset> <offsets>
80
81
<source> ::= <offsets>
82
83
<node> ::= <label> <offsets>
84
| <prefix> <node>
85
| <end_label>
86
87
<dafsa> ::= <source>
88
| <dafsa> <node>
89
90
Decoding:
91
92
<char> -> printable 7-bit ASCII character
93
<end_char> & 0x7F -> printable 7-bit ASCII character
94
<return value> & 0x0F -> integer
95
<offset1 & 0x3F> -> integer
96
((<offset2> & 0x1F>) << 8) + <byte> -> integer
97
((<offset3> & 0x1F>) << 16) + (<byte> << 8) + <byte> -> integer
98
99
end_offset1, end_offset2 and and_offset3 are decoded same as offset1,
100
offset2 and offset3 respectively.
101
102
The first offset in a list of offsets is the distance in bytes between the
103
offset itself and the first child node. Subsequent offsets are the distance
104
between previous child node and next child node. Thus each offset links a node
105
to a child node. The distance is always counted between start addresses, i.e.
106
first byte in decoded offset or first byte in child node.
107
108
Example 1:
109
110
%%
111
aa, 1
112
a, 2
113
%%
114
115
The input is first parsed to a list of words:
116
["aa1", "a2"]
117
118
A fully expanded graph is created from the words:
119
source = [node1, node4]
120
node1 = ("a", [node2])
121
node2 = ("a", [node3])
122
node3 = ("\x01", [sink])
123
node4 = ("a", [node5])
124
node5 = ("\x02", [sink])
125
sink = None
126
127
Compression results in the following graph:
128
source = [node1]
129
node1 = ("a", [node2, node3])
130
node2 = ("\x02", [sink])
131
node3 = ("a\x01", [sink])
132
sink = None
133
134
A C++ representation of the compressed graph is generated:
135
136
const unsigned char dafsa[7] = {
137
0x81, 0xE1, 0x02, 0x81, 0x82, 0x61, 0x81,
138
};
139
140
The bytes in the generated array has the following meaning:
141
142
0: 0x81 <end_offset1> child at position 0 + (0x81 & 0x3F) -> jump to 1
143
144
1: 0xE1 <end_char> label character (0xE1 & 0x7F) -> match "a"
145
2: 0x02 <offset1> child at position 2 + (0x02 & 0x3F) -> jump to 4
146
147
3: 0x81 <end_offset1> child at position 4 + (0x81 & 0x3F) -> jump to 5
148
4: 0x82 <return_value> 0x82 & 0x0F -> return 2
149
150
5: 0x61 <char> label character 0x61 -> match "a"
151
6: 0x81 <return_value> 0x81 & 0x0F -> return 1
152
153
Example 2:
154
155
%%
156
aa, 1
157
bbb, 2
158
baa, 1
159
%%
160
161
The input is first parsed to a list of words:
162
["aa1", "bbb2", "baa1"]
163
164
Compression results in the following graph:
165
source = [node1, node2]
166
node1 = ("b", [node2, node3])
167
node2 = ("aa\x01", [sink])
168
node3 = ("bb\x02", [sink])
169
sink = None
170
171
A C++ representation of the compressed graph is generated:
172
173
const unsigned char dafsa[11] = {
174
0x02, 0x83, 0xE2, 0x02, 0x83, 0x61, 0x61, 0x81, 0x62, 0x62, 0x82,
175
};
176
177
The bytes in the generated array has the following meaning:
178
179
0: 0x02 <offset1> child at position 0 + (0x02 & 0x3F) -> jump to 2
180
1: 0x83 <end_offset1> child at position 2 + (0x83 & 0x3F) -> jump to 5
181
182
2: 0xE2 <end_char> label character (0xE2 & 0x7F) -> match "b"
183
3: 0x02 <offset1> child at position 3 + (0x02 & 0x3F) -> jump to 5
184
4: 0x83 <end_offset1> child at position 5 + (0x83 & 0x3F) -> jump to 8
185
186
5: 0x61 <char> label character 0x61 -> match "a"
187
6: 0x61 <char> label character 0x61 -> match "a"
188
7: 0x81 <return_value> 0x81 & 0x0F -> return 1
189
190
8: 0x62 <char> label character 0x62 -> match "b"
191
9: 0x62 <char> label character 0x62 -> match "b"
192
10: 0x82 <return_value> 0x82 & 0x0F -> return 2
193
"""
194
195
import sys
196
import struct
197
198
199
class InputError(Exception):
200
"""Exception raised for errors in the input file."""
201
202
203
def to_dafsa(words):
204
"""Generates a DAFSA from a word list and returns the source node.
205
206
Each word is split into characters so that each character is represented by
207
a unique node. It is assumed the word list is not empty.
208
"""
209
if not words:
210
raise InputError('The domain list must not be empty')
211
212
def ToNodes(word):
213
"""Split words into characters"""
214
if not 0x1F < ord(word[0]) < 0x80:
215
raise InputError('Domain names must be printable 7-bit ASCII')
216
if len(word) == 1:
217
return chr(ord(word[0]) & 0x0F), [None]
218
return word[0], [ToNodes(word[1:])]
219
return [ToNodes(word) for word in words]
220
221
222
def to_words(node):
223
"""Generates a word list from all paths starting from an internal node."""
224
if not node:
225
return ['']
226
return [(node[0] + word) for child in node[1] for word in to_words(child)]
227
228
229
def reverse(dafsa):
230
"""Generates a new DAFSA that is reversed, so that the old sink node becomes
231
the new source node.
232
"""
233
sink = []
234
nodemap = {}
235
236
def dfs(node, parent):
237
"""Creates reverse nodes.
238
239
A new reverse node will be created for each old node. The new node will
240
get a reversed label and the parents of the old node as children.
241
"""
242
if not node:
243
sink.append(parent)
244
elif id(node) not in nodemap:
245
nodemap[id(node)] = (node[0][::-1], [parent])
246
for child in node[1]:
247
dfs(child, nodemap[id(node)])
248
else:
249
nodemap[id(node)][1].append(parent)
250
251
for node in dafsa:
252
dfs(node, None)
253
return sink
254
255
256
def join_labels(dafsa):
257
"""Generates a new DAFSA where internal nodes are merged if there is a one to
258
one connection.
259
"""
260
parentcount = {id(None): 2}
261
nodemap = {id(None): None}
262
263
def count_parents(node):
264
"""Count incoming references"""
265
if id(node) in parentcount:
266
parentcount[id(node)] += 1
267
else:
268
parentcount[id(node)] = 1
269
for child in node[1]:
270
count_parents(child)
271
272
def join(node):
273
"""Create new nodes"""
274
if id(node) not in nodemap:
275
children = [join(child) for child in node[1]]
276
if len(children) == 1 and parentcount[id(node[1][0])] == 1:
277
child = children[0]
278
nodemap[id(node)] = (node[0] + child[0], child[1])
279
else:
280
nodemap[id(node)] = (node[0], children)
281
return nodemap[id(node)]
282
283
for node in dafsa:
284
count_parents(node)
285
return [join(node) for node in dafsa]
286
287
288
def join_suffixes(dafsa):
289
"""Generates a new DAFSA where nodes that represent the same word lists
290
towards the sink are merged.
291
"""
292
nodemap = {frozenset(('',)): None}
293
294
def join(node):
295
"""Returns a macthing node. A new node is created if no matching node
296
exists. The graph is accessed in dfs order.
297
"""
298
suffixes = frozenset(to_words(node))
299
if suffixes not in nodemap:
300
nodemap[suffixes] = (node[0], [join(child) for child in node[1]])
301
return nodemap[suffixes]
302
303
return [join(node) for node in dafsa]
304
305
306
def top_sort(dafsa):
307
"""Generates list of nodes in topological sort order."""
308
incoming = {}
309
310
def count_incoming(node):
311
"""Counts incoming references."""
312
if node:
313
if id(node) not in incoming:
314
incoming[id(node)] = 1
315
for child in node[1]:
316
count_incoming(child)
317
else:
318
incoming[id(node)] += 1
319
320
for node in dafsa:
321
count_incoming(node)
322
323
for node in dafsa:
324
incoming[id(node)] -= 1
325
326
waiting = [node for node in dafsa if incoming[id(node)] == 0]
327
nodes = []
328
329
while waiting:
330
node = waiting.pop()
331
assert incoming[id(node)] == 0
332
nodes.append(node)
333
for child in node[1]:
334
if child:
335
incoming[id(child)] -= 1
336
if incoming[id(child)] == 0:
337
waiting.append(child)
338
return nodes
339
340
341
def encode_links(children, offsets, current):
342
"""Encodes a list of children as one, two or three byte offsets."""
343
if not children[0]:
344
# This is an <end_label> node and no links follow such nodes
345
assert len(children) == 1
346
return []
347
guess = 3 * len(children)
348
assert children
349
children = sorted(children, key=lambda x: -offsets[id(x)])
350
while True:
351
offset = current + guess
352
buf = []
353
for child in children:
354
last = len(buf)
355
distance = offset - offsets[id(child)]
356
assert distance > 0 and distance < (1 << 21)
357
358
if distance < (1 << 6):
359
# A 6-bit offset: "s0xxxxxx"
360
buf.append(distance)
361
elif distance < (1 << 13):
362
# A 13-bit offset: "s10xxxxxxxxxxxxx"
363
buf.append(0x40 | (distance >> 8))
364
buf.append(distance & 0xFF)
365
else:
366
# A 21-bit offset: "s11xxxxxxxxxxxxxxxxxxxxx"
367
buf.append(0x60 | (distance >> 16))
368
buf.append((distance >> 8) & 0xFF)
369
buf.append(distance & 0xFF)
370
# Distance in first link is relative to following record.
371
# Distance in other links are relative to previous link.
372
offset -= distance
373
if len(buf) == guess:
374
break
375
guess = len(buf)
376
# Set most significant bit to mark end of links in this node.
377
buf[last] |= (1 << 7)
378
buf.reverse()
379
return buf
380
381
382
def encode_prefix(label):
383
"""Encodes a node label as a list of bytes without a trailing high byte.
384
385
This method encodes a node if there is exactly one child and the
386
child follows immediately after so that no jump is needed. This label
387
will then be a prefix to the label in the child node.
388
"""
389
assert label
390
return [ord(c) for c in reversed(label)]
391
392
393
def encode_label(label):
394
"""Encodes a node label as a list of bytes with a trailing high byte >0x80.
395
"""
396
buf = encode_prefix(label)
397
# Set most significant bit to mark end of label in this node.
398
buf[0] |= (1 << 7)
399
return buf
400
401
402
def encode(dafsa):
403
"""Encodes a DAFSA to a list of bytes"""
404
output = []
405
offsets = {}
406
407
for node in reversed(top_sort(dafsa)):
408
if (len(node[1]) == 1 and node[1][0] and
409
(offsets[id(node[1][0])] == len(output))):
410
output.extend(encode_prefix(node[0]))
411
else:
412
output.extend(encode_links(node[1], offsets, len(output)))
413
output.extend(encode_label(node[0]))
414
offsets[id(node)] = len(output)
415
416
output.extend(encode_links(dafsa, offsets, len(output)))
417
output.reverse()
418
return output
419
420
421
def encode_words(words):
422
"""Generates a dafsa representation of a word list"""
423
dafsa = to_dafsa(words)
424
for fun in (reverse, join_suffixes, reverse, join_suffixes, join_labels):
425
dafsa = fun(dafsa)
426
return dafsa
427
428
429
def to_cxx(data, preamble=None):
430
"""Generates C++ code from a list of encoded bytes."""
431
text = '/* This file is generated. DO NOT EDIT!\n\n'
432
text += 'The byte array encodes a dictionary of strings and values. See '
433
text += 'make_dafsa.py for documentation.'
434
text += '*/\n\n'
435
436
if preamble:
437
text += preamble
438
text += '\n\n'
439
440
text += 'const unsigned char kDafsa[%s] = {\n' % len(data)
441
for i in range(0, len(data), 12):
442
text += ' '
443
text += ', '.join('0x%02x' % byte for byte in data[i:i + 12])
444
text += ',\n'
445
text += '};\n'
446
return text
447
448
449
def words_to_cxx(words, preamble=None):
450
"""Generates C++ code from a word list"""
451
dafsa = encode_words(words)
452
return to_cxx(encode(dafsa), preamble)
453
454
455
def words_to_bin(words):
456
"""Generates bytes from a word list"""
457
dafsa = encode_words(words)
458
data = encode(dafsa)
459
return struct.pack('%dB' % len(data), *data)
460
461
462
def parse_gperf(infile):
463
"""Parses gperf file and extract strings and return code"""
464
lines = [line.strip() for line in infile]
465
466
# Extract the preamble.
467
first_delimeter = lines.index('%%')
468
preamble = '\n'.join(lines[0:first_delimeter])
469
470
# Extract strings after the first '%%' and before the second '%%'.
471
begin = first_delimeter + 1
472
end = lines.index('%%', begin)
473
lines = lines[begin:end]
474
for line in lines:
475
if line[-3:-1] != ', ':
476
raise InputError('Expected "domainname, <digit>", found "%s"' % line)
477
# Technically the DAFSA format could support return values in range [0-31],
478
# but the values below are the only with a defined meaning.
479
if line[-1] not in '0124':
480
raise InputError('Expected value to be one of {0,1,2,4}, found "%s"' %
481
line[-1])
482
return (preamble, [line[:-3] + line[-1] for line in lines])
483
484
485
def main(outfile, infile):
486
with open(infile, 'r') as infile:
487
preamble, words = parse_gperf(infile)
488
outfile.write(words_to_cxx(words, preamble))
489
return 0
490
491
492
if __name__ == '__main__':
493
if len(sys.argv) != 3:
494
print('usage: %s infile outfile' % sys.argv[0])
495
sys.exit(1)
496
497
with open(sys.argv[2], 'w') as outfile:
498
sys.exit(main(outfile, sys.argv[1]))