Goal

Transform a Context-free grammar into its Chomsky Normal Form (CNF) — ie: having its productions in on of these forms:

where $A, B, C​$ are variables and $x​$ is a letter. This means productions like $A \rightarrow BCD​$ or $A \rightarrow xC​$ will need to be converted.

We will do this conversion in 4 steps. Each step will be illustrated on the following grammar:

Step 1 — remove null productions

A variable $V$ is nullable if there is a production $V \rightarrow \lambda$ or there is a derivation that starts with $V$ and ends in $\lambda$, ie: $V \rightarrow^* \lambda$.

This is done in 4 sub-steps:

  1. Mark every $N$ as nullable if there is a production $N \rightarrow \lambda$
  2. Mark every $V$ as nullable if there is a production $V \rightarrow N$ where $N$ is already nullable. Repeat this step until no more nullable variables are marked.
  3. For each production $V \rightarrow \alpha$, add to the grammar all productions $V \rightarrow \alpha’$ where $\alpha’$ is obtained from $\alpha$ by removing one or more nullable variables. Eg: $V \rightarrow NM$ (where $N$ and $M$ are nullable) will add $V \rightarrow N$ (removed the $M$), $V \rightarrow M$ (removed the $N$)
  4. Remove all $\lambda$-productions $V \rightarrow \lambda$

Example

In our grammar, the only nullable state is $X$ (because of the production $X \rightarrow \lambda$).

We will add the productions:

Finally, we remove the $\lambda$-production $X \rightarrow \lambda$.

That means our grammar now looks like this:

Step 2 — remove renaming productions

Instead of having the productions:

we add the productions $V \rightarrow \alpha$. That is, directly derive $V$ into anything that $W$ can derive, instead of just renaming $V$ into $W$.

Example

In our grammar, for the renaming production is $S \rightarrow Y$, because $Y$ has the following productions:

we add the productions:

For the renaming production $X \rightarrow S$, because $S$ has the following productions:

we add the productions:

The last renaming production $Y \rightarrow Y$ is trivial and can be simply removed.

Our grammar now looks like this:

Step 3 — replace long productions

Replace any production $V \rightarrow \alpha_1 \alpha_2 … \alpha_n$, where $n > 2$ with:

Repeat until there are no productions with more than two symbols (variables or letters) on the right-hand-side.

Example

In our grammar, for the production $S \rightarrow XXY$, we chose an unused variable $Z$ and add the productions:

Same for the production $X \rightarrow XXY$: replace it with $X \rightarrow XZ$

Our grammar now looks like this:

Note: if we had a production $S \rightarrow Xab$, it would get replaced in the same manner: $S \rightarrow Tb$ and $T \rightarrow Xa$.

If we had even longer productions: $A \rightarrow abcd$, we would add $A \rightarrow aB$, then $B \rightarrow bC$ and $C \rightarrow cd$.

Step 4 — replace combined right-hand-sides

Replace every production in the form of:

with the productions:

where $V_x$ is a new variable with its sole production $V_x \rightarrow x$.

Example

In our grammar, the productios combining variables and letters on the right-hand-side are:

We create a new variable $B$ and add the productions:

The final grammar looks like this:

IO

Input Grammar.in

Each row contains one production. _ stands for $\lambda$.

S -> XXY | Y | a
X -> S | c | _
Y -> XY | Yb | a | b
Output Grammar.out in CNF
S -> XY | YB | a | b | XZ
X -> XY | YB | a | b | XZ | c
Y -> XY | YB | a | b
Z -> XY
B -> b

Another grammar

Input Grammar.in

Each row contains one production. _ stands for $\lambda$.

S -> aX | Yb
X -> S | _
Y -> bY | b
Output Grammar.out in CNF

Solving steps can be found here.

S -> AX | a | YB
X -> AX | a | YB
Y -> BY | b
A -> a
B -> b