Goal
Transform a Context-free grammar into its Chomsky Normal Form (CNF) — ie: having its productions in on of these forms:
- $A \rightarrow BC$
- $A \rightarrow x$
where $A, B, C$ are variables and $x$ is a letter. This means productions like $A \rightarrow BCD$ or $A \rightarrow xC$ will need to be converted.
We will do this conversion in 4 steps. Each step will be illustrated on the following grammar:
Step 1 — remove null productions
A variable $V$ is nullable if there is a production $V \rightarrow \lambda$ or there is a derivation that starts with $V$ and ends in $\lambda$, ie: $V \rightarrow^* \lambda$.
This is done in 4 sub-steps:
- Mark every $N$ as nullable if there is a production $N \rightarrow \lambda$
- Mark every $V$ as nullable if there is a production $V \rightarrow N$ where $N$ is already nullable. Repeat this step until no more nullable variables are marked.
- For each production $V \rightarrow \alpha$, add to the grammar all productions $V \rightarrow \alpha’$ where $\alpha’$ is obtained from $\alpha$ by removing one or more nullable variables. Eg: $V \rightarrow NM$ (where $N$ and $M$ are nullable) will add $V \rightarrow N$ (removed the $M$), $V \rightarrow M$ (removed the $N$)
- Remove all $\lambda$-productions $V \rightarrow \lambda$
Example
In our grammar, the only nullable state is $X$ (because of the production $X \rightarrow \lambda$).
We will add the productions:
- $S \rightarrow XY$, because we had $S \rightarrow XXY$
- $S \rightarrow Y$ already existed
- $Y \rightarrow Y$, because we had $Y \rightarrow XY$
Finally, we remove the $\lambda$-production $X \rightarrow \lambda$.
That means our grammar now looks like this:
Step 2 — remove renaming productions
Instead of having the productions:
- $V \rightarrow W$
- $W \rightarrow \alpha$ (anything having $W$ on the left-hand-side)
we add the productions $V \rightarrow \alpha$. That is, directly derive $V$ into anything that $W$ can derive, instead of just renaming $V$ into $W$.
Example
In our grammar, for the renaming production is $S \rightarrow Y$, because $Y$ has the following productions:
- $Y \rightarrow XY$
- $Y \rightarrow Yb$
- $Y \rightarrow a$
- $Y \rightarrow b$
we add the productions:
- $S \rightarrow XY$ (already existed)
- $S \rightarrow Yb$
- $S \rightarrow a$ (already existed)
- $S \rightarrow b$
For the renaming production $X \rightarrow S$, because $S$ has the following productions:
- $S \rightarrow XXY$
- $S \rightarrow a$
- $S \rightarrow XY$
- $S \rightarrow Yb$ (added previously)
- $S \rightarrow b$ (added previously)
we add the productions:
- $X \rightarrow XXY$
- $X \rightarrow a$
- $X \rightarrow XY$
- $X \rightarrow Yb$
- $X \rightarrow b$
The last renaming production $Y \rightarrow Y$ is trivial and can be simply removed.
Our grammar now looks like this:
Step 3 — replace long productions
Replace any production $V \rightarrow \alpha_1 \alpha_2 … \alpha_n$, where $n > 2$ with:
- $V \rightarrow \alpha_1 W$ (the variable $W$ is a new one that did not exist in the grammar)
- $W \rightarrow \alpha_2 … \alpha_n$
Repeat until there are no productions with more than two symbols (variables or letters) on the right-hand-side.
Example
In our grammar, for the production $S \rightarrow XXY$, we chose an unused variable $Z$ and add the productions:
- $S \rightarrow XZ$
- $Z \rightarrow XY$
Same for the production $X \rightarrow XXY$: replace it with $X \rightarrow XZ$
Our grammar now looks like this:
Note: if we had a production $S \rightarrow Xab$, it would get replaced in the same manner: $S \rightarrow Tb$ and $T \rightarrow Xa$.
If we had even longer productions: $A \rightarrow abcd$, we would add $A \rightarrow aB$, then $B \rightarrow bC$ and $C \rightarrow cd$.
Step 4 — replace combined right-hand-sides
Replace every production in the form of:
- $V \rightarrow xW$ or
- $V \rightarrow Wx$
with the productions:
- $V \rightarrow V_xW$ or
- $V \rightarrow WV_x$
where $V_x$ is a new variable with its sole production $V_x \rightarrow x$.
Example
In our grammar, the productios combining variables and letters on the right-hand-side are:
- $S \rightarrow Yb$.
- $X \rightarrow Yb$.
- $Y \rightarrow Yb$.
We create a new variable $B$ and add the productions:
- $B \rightarrow b$
- $S \rightarrow YB$
- $X \rightarrow YB$
- $Y \rightarrow YB$
The final grammar looks like this:
IO
Input Grammar.in
Each row contains one production. _
stands for $\lambda$.
S -> XXY | Y | a
X -> S | c | _
Y -> XY | Yb | a | b
Output Grammar.out
in CNF
S -> XY | YB | a | b | XZ
X -> XY | YB | a | b | XZ | c
Y -> XY | YB | a | b
Z -> XY
B -> b
Another grammar
Input Grammar.in
Each row contains one production. _
stands for $\lambda$.
S -> aX | Yb
X -> S | _
Y -> bY | b
Output Grammar.out
in CNF
Solving steps can be found here.
S -> AX | a | YB
X -> AX | a | YB
Y -> BY | b
A -> a
B -> b