LZ-Trees: Context-Free Grammar Enumeration via Subtree Reuse

Table of Links

4 LZ-trees

An interesting family of variants to Algorithm A can be created by noting that an integer can encode information other than rule expansions in the grammar. For example, an integer might reference complete subtrees that have been generated previously. This idea is inspired by work formulating probabilistic models that expand CFGs to favor re-use of existing subtrees [18, 19]. We call this approach LZ-trees because we draw on an idea from the LZ77/LZ78 algorithm [20], which compresses strings by permitting pointers back to previously emitted strings. Here, we permit our enumeration to potentially point back to previously generated complete trees. For instance, suppose we are currently decoding an integer at the point x in the tree.

Then at x, we should be allowed to draw on prior complete trees (rooted at B and D) assuming they are of the correct non-terminal type for x. Since there are two previously-generated trees when expanding x, we can let IntegerizedStack values of 0 and 1 reference these trees, and otherwise, we encode the node below x according to Algorithm A. This has the effect of preferentially re-using subtrees that have previously been generated early in the enumeration, although it should be noted that the mapping is no longer a bijection. Note that unlike LZ77, this algorithm does not require us to store an integer for the length of the string/tree that is pointed to, because we assume it is a complete subtree. Also, the integer pointing to a previous tree is an integer specifying the target in any enumeration of nodes in the tree. A listing for this algorithm, Algorithm B, is shown below:

Results from enumerating the grammar in (9) are shown in Figure 4. Note here that the main differences are places where the Algorithm B re-uses a component earlier in the string. However, the algorithms do agree in many places, likely because of the requirement that only complete subtrees of the same type can be references (of which there are often not any). Similar approaches might allow us to write potentially write any kind of encoder and enumerate trees relative to that encoding scheme. For instance, we might permit a pointer to a previous subtree, we might use an integer coding which codes prior tree components relative to their frequency, etc.

Author:

(1) Steven T. Piantadosi.

This paper is available on arxiv under CC BY 4.0 license.