Every Program There Is, Part Two

[This is part of a series on generating every string in a language. The previous part is here. The next part is here.]

Suppose we want to come up with a CFG for numbers with addition. Consider this very simple grammar with only one nonterminal. We’ll say that an expression is either a number, or a sum of two expressions:

X: 1 | 2 | 3 | X + X

We can make the following series of substitutions, each time substituting the leftmost remaining nonterminal:

X
X + X
X + X + X
1 + X + X
1 + 2 + X
1 + 2 + 3
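
That step-by-step substitution is easy to mechanize. Here is a short Python sketch (my own code, not from the post; the grammar encoding and function names are invented for illustration) that applies each production to the leftmost nonterminal and reproduces the derivation above:

```python
# A sketch of leftmost substitution for the grammar X: 1 | 2 | 3 | X + X.
# A sentential form is a list of symbols; a symbol is a nonterminal
# exactly when it appears as a key in GRAMMAR.
GRAMMAR = {"X": [["1"], ["2"], ["3"], ["X", "+", "X"]]}

def leftmost_expand(form, production):
    """Replace the leftmost nonterminal in `form` with `production`."""
    i = next(k for k, sym in enumerate(form) if sym in GRAMMAR)
    return form[:i] + production + form[i + 1:]

# Reproduce the first derivation from the text, one production per step:
form = ["X"]
for prod in [["X", "+", "X"], ["X", "+", "X"], ["1"], ["2"], ["3"]]:
    form = leftmost_expand(form, prod)
    print(" ".join(form))
# The last line printed is: 1 + 2 + 3
```

Feeding in the productions in the other order from the second derivation below ends at the same terminal string, which is exactly the problem discussed next.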

We can also make this series of substitutions, again, every time substituting for the leftmost remaining nonterminal:

X
X + X
1 + X <-- different!
1 + X + X
1 + 2 + X
1 + 2 + 3

How interesting; we had two rather similar but nonetheless different “leftmost” derivations for the same sequence of terminals! CFGs in which some string of terminals has two different leftmost derivations are called “ambiguous” grammars. It is often extremely difficult to correctly parse an ambiguous CFG, though of course we are not attempting to parse here. But an ambiguous CFG is also a poor tool for generating all the strings in a language, because the same string can end up being generated more than once. Ideally for our purposes we would like a guarantee that each string is generated exactly once. If the CFG is unambiguous then we know that trying all possible combinations of rule applications, always substituting for the leftmost remaining nonterminal, cannot produce the same string twice.
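
To see the duplication concretely, here is a sketch (again my own code, with invented names, not anything from the post) that exhaustively tries every leftmost substitution in the ambiguous grammar, caps the length of sentential forms so the recursion terminates, and counts how often each terminal string comes out:

```python
# Exhaustive leftmost generation with the ambiguous grammar
# X: 1 | 2 | 3 | X + X, counting how often each string is produced.
from collections import Counter

GRAMMAR = {"X": [["1"], ["2"], ["3"], ["X", "+", "X"]]}
MAX_LEN = 5  # long enough for "1 + 2 + 3" (five symbols)

def generate(form, counts):
    i = next((k for k, s in enumerate(form) if s in GRAMMAR), None)
    if i is None:                      # all terminals: record the string
        counts[" ".join(form)] += 1
        return
    for prod in GRAMMAR[form[i]]:      # try every production, leftmost only
        new = form[:i] + prod + form[i + 1:]
        if len(new) <= MAX_LEN:        # substitution never shrinks a form
            generate(new, counts)

counts = Counter()
generate(["X"], counts)
print(counts["1 + 2 + 3"])  # → 2: one per leftmost derivation
```

The length cap is safe to prune on because no production shrinks a sentential form; every string of two sums, such as 1 + 2 + 3, is generated twice, once per parse tree.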

Unfortunately, the general problem of ambiguity turns out to be provably undecidable. There is no algorithm that can (1) look at an arbitrary CFG and determine whether it is ambiguous, or (2) take an arbitrary ambiguous CFG and find an “equivalent” unambiguous CFG; indeed, some context-free languages are inherently ambiguous and have no unambiguous grammar at all. However, not all is lost. There are ways to prove that particular grammars are unambiguous, even though there is no general solution. And there are lots of common sneaky tricks for turning the sorts of grammars we typically encounter in programming languages into unambiguous grammars. In this particular case we could say:

S: N | S + N
N: 1 | 2 | 3

And now the only leftmost derivation for 1 + 2 + 3 is

S
S + N
S + N + N
N + N + N
1 + N + N
1 + 2 + N
1 + 2 + 3

I’m not going to prove that this is an unambiguous grammar; that would take us too far afield. Just take my word for it: this is unambiguous. From now on we’ll be trying to produce only unambiguous grammars.
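
Running the same sort of exhaustive leftmost generation against the new grammar (once more a sketch of my own, not code from the post) shows every string coming out exactly once:

```python
# Exhaustive leftmost generation with the unambiguous grammar
# S: N | S + N and N: 1 | 2 | 3, counting how often each string appears.
from collections import Counter

GRAMMAR = {
    "S": [["N"], ["S", "+", "N"]],
    "N": [["1"], ["2"], ["3"]],
}
MAX_LEN = 5  # up to two plusses, e.g. "1 + 2 + 3"

def generate(form, counts):
    i = next((k for k, s in enumerate(form) if s in GRAMMAR), None)
    if i is None:                      # all terminals: record the string
        counts[" ".join(form)] += 1
        return
    for prod in GRAMMAR[form[i]]:      # leftmost nonterminal only
        new = form[:i] + prod + form[i + 1:]
        if len(new) <= MAX_LEN:
            generate(new, counts)

counts = Counter()
generate(["S"], counts)
print(max(counts.values()))  # → 1: no string is generated twice
print(len(counts))           # → 39 distinct strings up to this length
```

The 39 strings are the 3 single numbers, the 9 one-plus sums, and the 27 two-plus sums; with the ambiguous grammar, each of those 27 would have been produced twice.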

Next time: saying nothing requires extra work


Comments

  • Anonymous
    April 29, 2010
    I think a proof could go something like this. Consider how we can make a given string with P plusses.

    (1) Starting from one S, it is only possible to produce strings with one S in the leftmost position, or strings with no S's. (I believe this is provable by induction.)

    (2) We can't increase the number of plusses if our string doesn't contain S. So if our current string has fewer than P plusses, we can't eliminate S. Thus, working from the possible strings as given by (1) and only doing leftmost replacement, the only replacement that will work is S --> S + N.

    (3) It's impossible to reduce the number of plusses, so if we currently have more than P plusses we can't produce the desired string.

    (4) If we currently have P plusses and we have an S on the left, then given (3) the only valid leftmost replacement is S --> N (anything else will give more than P plusses).

    (5) If we currently have P plusses and we don't have an S on the left, then the only nonterminals we have are N's (since from (1) we know we can't have an S anywhere else), and N's can only be replaced with terminals, so the only valid leftmost replacement is to replace the leftmost N with the appropriate terminal.

    Thus in all cases there is at most one valid replacement to make in order to proceed toward a given string with P plusses. So the route to our given string is fully determined, and the grammar is unambiguous.

  • Anonymous
    April 29, 2010
    As for step (1): you can't produce a string which contains an S from one which doesn't contain an S, so it suffices to show that without ever eliminating all S's we can only produce strings of the form S + N + N + N + ...

    So, considering only strings that contain an S, we know that for the P = 0 case we can only have the string S. We assume that for the P = N case the only string we can get without eliminating S is S + N + N + N + ... (P times). [This is of course how induction works: you assume the P = N case and show that this implies the P = N + 1 case.] Based on that assumption, the only replacement that won't eliminate S is the one that gives S + N + N + N + ... (P + 1 times).

    It's impossible to get a string with P + 1 plusses from one with fewer than P plusses or from one with more than P + 1 plusses, and it's impossible to go from S + N + N + N + ... (P + 1 times) to another string with P + 1 plusses without eliminating S, so we see that S + N + N + N + ... (P + 1 times) is the only string with P + 1 plusses that we can produce without eliminating S. This completes the proof by induction.

  • Anonymous
    April 29, 2010
    Your proof seems correct, but I don't see why you would use induction over P for part (1); induction over the productions seems simpler. Looking at the production rules, it is clear that a string that does not contain the symbol S can only produce other strings that also do not contain S. (N -> 1, N -> 2, N -> 3 are the only possibilities.) It is also clear that a string starting with S and containing S only once can produce either a string without S (S -> N) or a string that also starts with S and contains it only once (S -> S + N). [These two observations compose the inductive step.] The starting string is "S", which starts with S and contains it only once. [Basis step.] Therefore, the grammar only produces strings that do not contain S, and strings that contain S exactly once and indeed start with S.

  • Anonymous
    April 29, 2010
    You have a good point, Joren.  There's no particular reason I did it the way I did other than that it was the first thing that occurred to me (I was thinking in terms of number of plusses because my overall reasoning for the full proof was basically "We can't do anything except S --> S+N until we get the desired number of plusses.")

  • Anonymous
    April 30, 2010
    That helps a LOT.  I should have realized the problem with X<Y>Z was not the associativity of the operators, but rather the semantic meaning based on the message the compiler gave me (Operator cannot be applied to operands of type bool and bool).  OTOH, I do find it surprising that the comparison operators are left associative.  I can only assume that this is due to operator overloading and implicit conversions.  At least, I can't think of any other means to provide semantic meaning to such an expression in C#.

  • Anonymous
    May 02, 2010
    @Chris: I doubt they are deliberately left-associative in a sense that this is intended to be actually used. However, they have been historically left-associative since at least C, and probably older than that (IIRC, wasn't Boolean in Pascal an enumerated type, and thus its values ordered and comparable with < and >?). Definitely within the curly braces language family, every other language out there has done it that way.

  • Anonymous
    May 03, 2010
    Wouldn't it have been simpler to allow >> [one token] to close two type argument lists when in such a context? It'd support all of the same scenarios without introducing a brand-new kind of whitespace sensitivity.

    I highly doubt it. Try writing a BNF grammar for your way and see if it's concise. -- Eric

    Alternately, why not just say "ok, weird, but no harm done" when encountering "> >" or "> >=" in an operator context, and treat it as a shift operator anyway?

    That would be weird. -- Eric

    So... are these not-quite-ambiguities like A<B>C, List<Func<A>>B, etc, the reason it is illegal to use any expression as a statement [CS0201], or was this rule (which happened to avoid a whole class of ambiguities the spec would otherwise have had to deal with) made for another reason?

    Since that rule predates the invention of generics, it cannot be the reason. The rule that an expression statement must be a method call, assignment, increment, decrement or construction is from C# 1.0. Generics were not added until C# 2.0. The reason is that we like the principle that statements have side effects; ideally, a statement logically has exactly one side effect. Look at it this way: what is the compelling benefit of allowing a statement which has no side effect and only calculates a value and then discards it? Such a thing is almost certainly an error, so there had better be an extremely good reason why we should make it legal. -- Eric

    How does C deal with the cast thing?

    I don't know. I'm not an expert on the grammar of C. -- Eric

  • Anonymous
    May 04, 2010
    "That would be weird." Much less weird than "These productions are treated specially" without bothering to even put it in the grammar, particularly considering it would only be weird from a historical perspective [in that it was once a single token], whereas the current solution is weird even looking at the C# 2.0 language in isolation [literally nothing else anywhere is sensitive to whitespace occurring between tokens].

    I think you'll find that we did in fact "bother" to put it in the grammar. See section 2.4.5 of the specification. -- Eric

    What's the 'extremely good reason' for introducing this bizarrely singular case of special whitespace handling? Because not doing so would allow people to write programs that subjectively look 'weird'?

    Because doing so solves the ambiguity problem without introducing a weird inconsistency in the language. -- Eric

    Why not go further and regulate whitespace appearance in more places?

    I think that would have been awesome. I don't know if I would go as far as Python in making whitespace significant, but I think that having a standardized style for whitespace that could be optionally enforced by the compiler would be a nice feature. Unfortunately it is too late to make whitespace more significant now; that would break too many programs. Next time you design a new language from scratch, keep that in mind. -- Eric

    Requiring whitespace around binary operators and forbidding it for unary ones would disambiguate that cast, for instance - (A)-B is as different from (A) - B as x >> 1 is from x > > 1. Same for F(G < A, B > (7)) vs F(G<A, B>(7)) [the former is the one with relational operators, the latter with a generic method call].

    Sure. But of course that would break lots of existing programs, so we cannot do it now. -- Eric

  • Anonymous
    May 05, 2010
    "I think you'll find that we did in fact "bother" to put it in the grammar. See section 2.4.5 of the specification. -- Eric"

    "The vertical bar in the right-shift and right-shift-assignment productions are used to indicate that, unlike other productions in the syntactic grammar, no characters of any kind (not even whitespace) are allowed between the tokens. These productions are treated specially in order to enable the correct handling of type-parameter-lists (§10.1.3)."

    The fact that it provides a text explanation of what the vertical bar notation means made me think that it's a notation invented for the vaguely defined "treated specially" statement in the text, rather than being an actual part of a BNF grammar (never mind that it is unclear whether this is meant to be part of the lexical grammar or the syntactic grammar). It is also impossible for a parser to use this rule if the terminals of the syntactic grammar are in fact tokens (implied by 2.2) rather than something different like "tokens plus a way to find out what characters are between it and the adjacent tokens". The information on whether tokens were separated by whitespace is supposed to have been lost by the time it gets to the parser, and would be if not for vaguely defined "special" handling. 2.3 outright states that whitespace has no impact on syntactic structure except to separate tokens. Including it in the grammar would mean defining whitespace (maybe only after a > token) to be an actual token, and defining in the syntactic grammar where it is allowed (i.e. everywhere but inside a shift operator). Made-up >|> notation isn't a substitute for that.

    Yes, it is a substitute for that. That's precisely what it is. -- Eric

    "Because doing so solves the ambiguity problem without introducing a weird inconsistency in the language. -- Eric" The inconsistency avoided here is not in the language, it is between this language and other languages. Having two (and only two) pairs of tokens which cannot be separated by whitespace (in some contexts) is itself a weird inconsistency. If anything it would have been slightly more consistent to never allow them to be separated by whitespace (making List<Action<T> > a syntax error) - that at least keeps it so that the lexer is the only stage concerned with whitespace.

    So your proposal is to prioritize the logical independence of the lexer and the parser over the usefulness of the language to our customers? Do you by any chance have a background in academia? -- Eric

    If the design required you to make an operator that's two tokens, that's already an inconsistency [though a small one].

    The cast operator also has two tokens, a ( and a ). The conditional operator has two, a ? and a :. The checked and unchecked operators have three. An operator that has a bunch of tokens in it is not unusual. -- Eric

    Requiring the parser to break encapsulation just to maintain the illusion that it's still one token in the places it was used before doesn't "take it back" - it introduces a second, much larger, inconsistency.

    Sure, I can see that point of view. Since you're the first person, to my knowledge, to complain about it, I suspect that this "much larger" inconsistency is still pretty darn small. C# is certainly not the only professional programming language to have rules where whitespace around tokens affects the syntactic analysis. It is somewhat surprising in fact that there is only this one special case. JScript has many places where whitespace affects the grammar. In any event, it is not our goal to make a grammatically or lexically "pure" language. That is desirable of course, but it is not a primary goal. Our goal is to make a useful language that makes it easy for customers to express the desired idiom without weird "gotchas". I would bet that fewer than one user in ten thousand has any clue that there is something special about the handling of > in the grammar. That's a sign that we've succeeded. -- Eric