Every Program There Is, Part One

[This is part of a series on generating every string in a language. The previous part is here. The next part is here.]

We can now enumerate every binary tree and every arbitrary tree of a given size, and therefore we can enumerate all of them, period: enumerate all the trees of size one, then all the trees of size two, and so on. This certainly is interesting for generating test cases for algorithms that consume trees. However, I also write algorithms that take programs; of course a compiler is chock full of algorithms that manipulate parts of programs. Can we use these same techniques to automatically generate test cases for compilers?

It seems plausible that we ought to be able to do so. I chose my notation for arbitrary trees as sequences of nested parentheses for a reason. If we take one of those strings, say {{}{{}}}, and we do a search-and-replace replacing the first { with class A {, and each subsequent one with class B { and so on then we get (with whitespace added)

class A
{
class B { }
class C { class D { } }
}

In short, we can use our binary tree generator to generate all possible nestings of classes just by making minor modifications to the code which prints out the string associated with the tree.

We can go farther than this though. I want to see if we can come up with some more general mechanism for generating all members of a particular language.

To do this I’m going to start today by describing an incredibly important tool in the design of programming languages, the Context Free Grammar, or henceforth, CFG.

It’s probably easiest to simply start with an example. Here’s a CFG for some simple declarative English sentences:

SENTENCE: SUBJECT VERB OBJECT
SUBJECT: ARTICLE ADJECTIVE NOUN
OBJECT: ARTICLE ADJECTIVE NOUN
ARTICLE: a | one | the
NOUN: dog | man
ADJECTIVE: fat | short
VERB: chases | bites

The idea is that any time you see one of the things on the left, you can substitute one of the things on the right. For example, we could start with SENTENCE. Only one legal substitution there: SUBJECT VERB OBJECT. Then we make a substitution for SUBJECT; again, there is only one legal substitution, and now we have ARTICLE ADJECTIVE NOUN VERB OBJECT. Then we make a substitution for ARTICLE. Now we have three choices; let’s say the. So we have the ADJECTIVE NOUN VERB OBJECT. Now make a choice for ADJECTIVE, say the fat NOUN VERB OBJECT. Keep on going until you cannot make any more substitutions, and you end up with something like the fat man bites the short dog.

To formalize this a bit: the words in ALL CAPS are called “nonterminals” because if you’ve still got one then you’re not done yet. The other words are called “terminals”; they cannot be further substituted. The grammar formally consists of (1) a set of distinct terminals and nonterminals, (2) a set of rules mapping each nonterminal onto a set of legal substitutions, and (3) which nonterminal is the one you start with - SENTENCE is the start symbol in our example. Rather than calling it out we'll usually just list the start symbol first.

The usual problem faced by compiler writers is the opposite of the process we just went through. That is, given a file containing only terminals (the short dog chases the fat man), go backwards: figure out exactly which rules were used to produce that statement, all the way back to the start, or, if you cannot, give a sensible error message describing why not. This is called “parsing the code”. Unsurprisingly, this is quite a difficult process for certain grammars; the grammar I’ve laid out here is relatively easy to analyze. Fortunately, in this series of articles I won’t be talking about the messy details of building parsers; we want to use CFGs to generate legal programs, not to parse them, so we’ll concentrate on that.

Where CFGs get interesting is when they become recursive. Consider this simple grammar with only two rules, two nonterminals and two terminals:

S: { X } | { }
X: X X | { X } | { }

The start is S. Let's make a substitution { X }. We have three possible substitutions for X, let’s pick the first, so { X X }. (Notice that unlike in our previous grammar, we can make arbitrarily long statements! In the previous grammar no matter what you do you end up with exactly seven terminals. Here we can end up with any even number of terminals.) Continuing on, lets go to { { } X }, and then to { { } { X } } and then { { } { { } } }, and we’re done.

Apparently we’ve come up with a grammar for all the possible arbitrary trees; we’re on the right track. If we could find a way to enumerate all members of this grammar we’d be in good shape.

Next time: addition causes ambiguity.

[This is part of a series on generating every string in a language. The previous part is here. The next part is here.]

Comments

  • Anonymous
    April 26, 2010
    I'm loving this series - thanks.  Can't wait for the next installment!

  • Anonymous
    April 26, 2010
    Seconded. I'm currently doing what I can to learn more about compilers in my spare time, so any extra info I can get is welcomed, especially when it comes from someone who works on something as influential as C#!

  • Anonymous
    April 26, 2010
    Wait, you're missing the {} tree. S: { X } X: X X | { X } | nothing ?

  • Anonymous
    April 26, 2010
    Bahbar...No, Eric has it right, and you appear mistaken. The empty tree is properly a member of S as shown i the post. Adding "nothing" to the grammer of X breaks the syntax... No, Bahbar was right, I made a mistake. I've fixed the grammar. You guys are leaving comments faster than I am editing. :-) -- Eric

  • Anonymous
    April 26, 2010
    The comment has been removed

  • Anonymous
    April 26, 2010
    Err... you're mixing up nonterminals and terminals in the last part. You always end up with 7 terminals. Yeah, I'm not the one who swore.

  • Anonymous
    April 26, 2010
    The first example you gave: {{}{{}}} reminds me of a plugin I use every day (in Gedit). It's called zen-coding, basically you type something like:    div#nameOfMyCalss>ul.nameOfMyId>li*5>a and it turns it into this (not sure if this will show up since it's HTML so I'll replace <> with these {}):    {div class="nameOfMyClass"}        {ul id="nameOfMyId"}            {li}{/li}            {li}{/li}            {li}{/li}            {li}{/li}            {li}{/li}        {/ul}    {/div} Anyway thats what it made me think of. Also after reading your post I now want to read through some of their code and see how they implemented it. Awesome post BTW!

  • Anonymous
    April 26, 2010
    Great twist on the last post, this! :-)

  • Anonymous
    April 26, 2010
    >> If we could find a way to enumerate all members of this grammar we’d be in good shape. Couldn't you enumerate over the number of productions?  Ie: Given:  A grammar and a string representing a possible result of (n) productions Desired:  A list of strings containing all possible results of applying one further production IEnumerable<string> GetProductions(Grammar G, string Input)
    {
        return from nonTerm in G.FindNonTerminals(Input)
        // nonTerm: label and position of NonTerminal in Input
           from rule in G.GetRules(nonTerm.Label)  
        // grammar rules having nonTerm.Label as their left hand side [ Rules in CFG:  NT => (NT | T)* ]
           select G.ApplyProduction(Input, nonTerm, rule.Replacement); // string substitution of nonTerm.Label at nonTerm.Position with rule.Replacement in Input
    }
    IEnumerable<string> GetProductions(Grammar G, IEnumerable<string> Inputs)
    {
       return Inputs.SelectMany((input) => GetProductions(G, input));
    }
    IEnumerable<string> AllProductions(Grammar G, int Length)
    {
       if(Length == 0) return new string[] { G.StartSymbol };
       return GetProductions(G, AllProductions(G, Length - 1));
    }
    IEnumerable<string> AllSentences(Grammar G, int Length)
    {
       return from value in AllProductions(G, Length) where G.IsTerminated(value);
    }
    For this grammar we would have: n = 0:  S n = 1: { X } , { } n = 2: { X X } , { { X } }, { { } } ... You have prematurely anticipated my denouement, Commander. -- Eric

  • Anonymous
    April 26, 2010
    I feel like I'm back in Automata class. Nice post!

  • Anonymous
    April 27, 2010
    Sorry, didn't mean to steal your thunder!  Thought you were going to move in a different direction with "Next time: addition causes ambiguity."

  • Anonymous
    April 27, 2010
    The program my boss told me to write? What about a program that writes the program my boss told me to write? Would the effort I'd put into specifying the program be less than the effort of writing the program?

  • Anonymous
    April 27, 2010
    Is it just me, or do these recent articles have thematic similarity to topics found in Hofstadter's book "Gödel Escher Bach"? I feel like it's 25 years ago and I'm exploring recursive infinities all over again! Or, wait…oh no! Am I in a recursive infinity myself? Aaaaahhhh…

  • Anonymous
    October 23, 2011
    you genius bro, could you tell me how you design your mind set to under how computer work