## Goal

Transform a Context-free grammar into its Chomsky Normal Form (CNF) — ie: having its productions in on of these forms:

• $A \rightarrow BC$
• $A \rightarrow x$

where $A, B, C​$ are variables and $x​$ is a letter. This means productions like $A \rightarrow BCD​$ or $A \rightarrow xC​$ will need to be converted.

We will do this conversion in 4 steps. Each step will be illustrated on the following grammar:

## Step 1 — remove null productions

A variable $V$ is nullable if there is a production $V \rightarrow \lambda$ or there is a derivation that starts with $V$ and ends in $\lambda$, ie: $V \rightarrow^* \lambda$.

This is done in 4 sub-steps:

1. Mark every $N$ as nullable if there is a production $N \rightarrow \lambda$
2. Mark every $V$ as nullable if there is a production $V \rightarrow N$ where $N$ is already nullable. Repeat this step until no more nullable variables are marked.
3. For each production $V \rightarrow \alpha$, add to the grammar all productions $V \rightarrow \alpha’$ where $\alpha’$ is obtained from $\alpha$ by removing one or more nullable variables. Eg: $V \rightarrow NM$ (where $N$ and $M$ are nullable) will add $V \rightarrow N$ (removed the $M$), $V \rightarrow M$ (removed the $N$)
4. Remove all $\lambda$-productions $V \rightarrow \lambda$

### Example

In our grammar, the only nullable state is $X$ (because of the production $X \rightarrow \lambda$).

We will add the productions:

• $S \rightarrow XY$, because we had $S \rightarrow XXY$
• $S \rightarrow Y$ already existed
• $Y \rightarrow Y$, because we had $Y \rightarrow XY$

Finally, we remove the $\lambda$-production $X \rightarrow \lambda$.

That means our grammar now looks like this:

## Step 2 — remove renaming productions

Instead of having the productions:

• $V \rightarrow W$
• $W \rightarrow \alpha$ (anything having $W$ on the left-hand-side)

we add the productions $V \rightarrow \alpha$. That is, directly derive $V$ into anything that $W$ can derive, instead of just renaming $V$ into $W$.

### Example

In our grammar, for the renaming production is $S \rightarrow Y$, because $Y$ has the following productions:

• $Y \rightarrow XY$
• $Y \rightarrow Yb$
• $Y \rightarrow a$
• $Y \rightarrow b$

we add the productions:

• $S \rightarrow XY$ (already existed)
• $S \rightarrow Yb$
• $S \rightarrow a$ (already existed)
• $S \rightarrow b$

For the renaming production $X \rightarrow S$, because $S$ has the following productions:

• $S \rightarrow XXY$
• $S \rightarrow a$
• $S \rightarrow XY$
• $S \rightarrow Yb$ (added previously)
• $S \rightarrow b$ (added previously)

we add the productions:

• $X \rightarrow XXY$
• $X \rightarrow a$
• $X \rightarrow XY$
• $X \rightarrow Yb$
• $X \rightarrow b$

The last renaming production $Y \rightarrow Y$ is trivial and can be simply removed.

Our grammar now looks like this:

## Step 3 — replace long productions

Replace any production $V \rightarrow \alpha_1 \alpha_2 … \alpha_n$, where $n > 2$ with:

• $V \rightarrow \alpha_1 W$ (the variable $W$ is a new one that did not exist in the grammar)
• $W \rightarrow \alpha_2 … \alpha_n$

Repeat until there are no productions with more than two symbols (variables or letters) on the right-hand-side.

### Example

In our grammar, for the production $S \rightarrow XXY$, we chose an unused variable $Z$ and add the productions:

• $S \rightarrow XZ$
• $Z \rightarrow XY$

Same for the production $X \rightarrow XXY$: replace it with $X \rightarrow XZ$

Our grammar now looks like this:

Note: if we had a production $S \rightarrow Xab$, it would get replaced in the same manner: $S \rightarrow Tb$ and $T \rightarrow Xa$.

If we had even longer productions: $A \rightarrow abcd$, we would add $A \rightarrow aB$, then $B \rightarrow bC$ and $C \rightarrow cd$.

## Step 4 — replace combined right-hand-sides

Replace every production in the form of:

• $V \rightarrow xW$ or
• $V \rightarrow Wx$

with the productions:

• $V \rightarrow V_xW$ or
• $V \rightarrow WV_x$

where $V_x$ is a new variable with its sole production $V_x \rightarrow x$.

### Example

In our grammar, the productios combining variables and letters on the right-hand-side are:

• $S \rightarrow Yb$.
• $X \rightarrow Yb$.
• $Y \rightarrow Yb$.

We create a new variable $B$ and add the productions:

• $B \rightarrow b$
• $S \rightarrow YB$
• $X \rightarrow YB$
• $Y \rightarrow YB$

The final grammar looks like this:

## IO

Input Grammar.in

Each row contains one production. _ stands for $\lambda$.

S -> XXY | Y | a
X -> S | c | _
Y -> XY | Yb | a | b

Output Grammar.out in CNF
S -> XY | YB | a | b | XZ
X -> XY | YB | a | b | XZ | c
Y -> XY | YB | a | b
Z -> XY
B -> b


## Another grammar

Input Grammar.in

Each row contains one production. _ stands for $\lambda$.

S -> aX | Yb
X -> S | _
Y -> bY | b

Output Grammar.out in CNF

Solving steps can be found here.

S -> AX | a | YB
X -> AX | a | YB
Y -> BY | b
A -> a
B -> b