Text Generation Using LSTM
In text generation, our goal is to predict the next character or word in a sequence. Typically, the text data comprises a series of characters, where each character serves as input. To tackle this task, deep learning models such as RNNs or LSTMs are commonly employed. LSTMs are favored over plain RNNs because their gated design mitigates the vanishing gradient problem that is prevalent in RNNs. Given that text generation often requires retaining a significant amount of preceding context, the LSTM’s ability to capture long-term dependencies makes it the preferred choice.
The neural network receives a sequence of words or characters as input, and its output is a vector of probabilities, one for each entry in the dictionary, describing how likely that entry is to follow the given sequence. Additionally, the model learns the semantic similarity between words or characters and computes the probabilities accordingly. Leveraging this information, we can predict or generate the subsequent word or character in the sequence.
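To make the idea of "probabilities over the dictionary" concrete, here is a minimal sketch in plain Python. The three-character vocabulary and the raw scores are made-up values for illustration, not the output of a trained model.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and scores (illustrative values only)
vocab = ["a", "b", "c"]
scores = [2.0, 0.5, 1.0]

probs = softmax(scores)
next_char = vocab[probs.index(max(probs))]  # pick the most probable entry
print(next_char, probs)
```

Picking the largest probability is the same greedy choice that the generation loop later in this article makes with argmax.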
Before the implementation, let us look at how an LSTM works.
LSTM (Long Short-Term Memory)
As we’ve discussed, RNNs struggle to retain information over long sequences, often forgetting earlier inputs as they process new ones. This issue is a consequence of the vanishing gradient problem, and LSTM networks are designed to address it. LSTMs maintain a short-term memory that can persist over many time steps, making them well suited to tasks that require retaining information over extended periods.
Moreover, in RNNs, adding new information often overwrites the existing data completely, without any distinction between crucial and less significant details. In contrast, LSTMs handle incoming data differently. With the help of specialized gates, LSTMs make small adjustments to the existing information when incorporating new inputs. This mechanism allows LSTMs to judge the relevance of incoming information and modify the stored data accordingly, facilitating more precise processing and learning.
The gates decide which data is important and can be useful in the future and which data has to be erased. The three gates are the input gate, output gate, and forget gate.
- Forget Gate: This gate decides which information is important and should be kept and which should be forgotten. It removes unimportant information from the cell state, which helps the network use its memory efficiently. The gate takes two inputs: the output generated by the previous cell and the input of the current cell. These inputs are multiplied by the gate's weights, a bias is added, and the sigmoid function is applied to the result. This produces a value between 0 and 1 that determines how much information to keep: a value of 0 makes the forget gate remove the information, while a value of 1 means the information is important and is remembered in full.
- Input Gate: This gate adds information to the cell state. A sigmoid function decides which values should be updated, and a tanh function creates a vector of candidate values between -1 and 1. The sigmoid output acts as a filter that regulates how much of each candidate value is added to the cell.
- Output Gate: This gate selects the important information from the current cell state and presents it as output. The cell state is passed through the tanh function, producing values between -1 and 1. A sigmoid function applied to the previous output and the current input then acts as a regulator that decides which of these values are shown as output.
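The three gates can be sketched as a single time step in NumPy. This is a minimal illustration of the standard LSTM update, not the Keras implementation; the function name `lstm_step`, the layer sizes, and the random weights are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W maps the concatenated [h_prev, x_t] to the four pre-activations,
    stacked in the order (forget, input, candidate, output).
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:n])            # forget gate: 0 = erase, 1 = keep
    i = sigmoid(z[n:2 * n])        # input gate: how much new information to admit
    g = np.tanh(z[2 * n:3 * n])    # candidate values in (-1, 1)
    o = sigmoid(z[3 * n:4 * n])    # output gate: what to expose as output
    c_t = f * c_prev + i * g       # updated cell state
    h_t = o * np.tanh(c_t)         # new hidden state (the cell's output)
    return h_t, c_t

# Illustrative sizes and random weights (assumptions, not trained values)
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_step(rng.normal(size=input_size), h, c, W, b)
print(h.shape, c.shape)
```

Note how the forget gate scales the old cell state while the input gate scales the new candidates, so the stored information is adjusted rather than overwritten.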
Implementation
For text generation, we will perform the following tasks:
- Load the necessary libraries required for LSTM and NLP purposes
- Load the text data
- Perform the required text cleaning
- Create a dictionary that maps each unique character to an integer
- Prepare the dataset as input and output sets using a dictionary
- Define our LSTM model for text generation
We will also implement some techniques of Natural Language Processing using NLTK like tokenization, pre-processing text, etc.
The text we use comes from a book written by Charles Dickens. We will be using 5 chapters of that book.
We will load the necessary libraries required for LSTM, data preprocessing, and NLP purposes:
#All Libraries required
import numpy
import sys
from nltk.corpus import stopwords
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.utils import np_utils  # older Keras; newer releases provide keras.utils.to_categorical instead
from keras.callbacks import ModelCheckpoint
# Load the text data
file = open("Two-Tails.txt").read()
A computer cannot process raw text directly, so we have to convert it into a numerical form. This can be done in two ways: one-hot encoding or word embedding.
One-hot encoding converts each token into a vector of 0s and 1s, with a single 1 at the position assigned to that token. Adding up these vectors over a document yields a bag of words, which represents the frequency of each word in the document. Such representations are considered simple models, yet they retain a lot of important information and are very versatile.
Word embedding, in contrast, represents each word as a vector of real numbers, so its entries are not restricted to 0 and 1.
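As a small illustration of one-hot encoding and its relation to word frequencies, consider the following sketch in plain Python. The toy sentence and the helper name `one_hot` are assumptions made for this example.

```python
def one_hot(token, vocab):
    """Return a list with a 1 at the token's vocabulary position and 0s elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab.index(token)] = 1
    return vec

tokens = ["the", "cat", "saw", "the", "dog"]  # toy sentence (illustrative)
vocab = sorted(set(tokens))                    # alphabetical list of unique words

encoded = [one_hot(t, vocab) for t in tokens]

# Summing the one-hot vectors column by column gives the word counts
bag_of_words = [sum(column) for column in zip(*encoded)]
print(vocab, bag_of_words)
```

Each encoded vector is as long as the vocabulary, which is why one-hot vectors become very large for word-level models and why embeddings are often preferred there.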
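Now we will apply tokenization, which means considering each word separately. As part of the cleaning step, we also lowercase the text and remove stop words.
from nltk.tokenize import word_tokenize
# Assumed cleaning step: lowercase the text and drop English stop words (requires NLTK's tokenizer and stop-word data)
tokens = word_tokenize(file.lower())
stop_words = set(stopwords.words('english'))
words = " ".join(token for token in tokens if token not in stop_words)
processed_inputs = words  # cleaned text used in the steps below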
Now we have to convert our text into numbers, as discussed above. We start by assigning an integer to each unique character:
chars = sorted(list(set(processed_inputs)))
char_to_num = dict((c, i) for i, c in enumerate(chars))
This creates a sorted list of the unique characters in the text and then uses the enumerate function to assign a number to each character.
We will also store the total length of the input and the number of unique characters:
input_len = len(processed_inputs)
vocab_len = len(chars)
Now we have to define the sequence length, which is the number of characters in each input sequence. We will set it to 100 and create empty lists for the inputs and outputs:
seq_length = 100
x_data = []
y_data = []
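Now we will slide a window over the text and convert each input sequence, together with the character that follows it, into integers:
for i in range(0, input_len - seq_length, 1):
    # Input is the current window of seq_length characters
    in_seq = processed_inputs[i:i + seq_length]
    # Output is the character that follows the window
    out_seq = processed_inputs[i + seq_length]
    x_data.append([char_to_num[char] for char in in_seq])
    y_data.append(char_to_num[out_seq])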
n_patterns = len(x_data)
Now we have the input sequences in x_data and the corresponding outputs in y_data, and we have stored the number of patterns. Next, we reshape the input sequences so they can be fed to the neural network and scale the values to the range 0 to 1. We also have to one-hot encode the output variable:
X = numpy.reshape(x_data, (n_patterns, seq_length, 1))
X = X/float(vocab_len)
y = np_utils.to_categorical(y_data)
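To see what shapes these steps produce, here is a minimal, self-contained sketch on a toy string. The text "hello world" and the window length of 4 are assumptions for illustration, and the targets are one-hot encoded with plain NumPy instead of the Keras helper so that the example runs without Keras.

```python
import numpy as np

text = "hello world"   # toy corpus (illustrative)
seq_length = 4         # much shorter than the 100 used in this article

chars = sorted(set(text))
char_to_num = {c: i for i, c in enumerate(chars)}

x_data, y_data = [], []
for i in range(len(text) - seq_length):
    window = text[i:i + seq_length]   # input: seq_length characters
    target = text[i + seq_length]     # output: the character that follows
    x_data.append([char_to_num[c] for c in window])
    y_data.append(char_to_num[target])

# Shape (samples, time steps, features), scaled to the range [0, 1)
X = np.reshape(x_data, (len(x_data), seq_length, 1)) / float(len(chars))

# One-hot encode the targets, one column per character in the vocabulary
y = np.eye(len(chars))[y_data]

print(X.shape, y.shape)
```

The 11-character string yields 7 windows, so X has 7 samples of 4 time steps each, and every row of y contains a single 1 marking the next character.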
We now have the input and the output in the required shape and form, so we can implement our LSTM model.
We will implement it using Keras, a high-level deep learning API that ships with TensorFlow. We will use a model with three LSTM layers, each followed by dropout to prevent overfitting. The first LSTM layer has 256 memory units and returns the full sequence of its outputs rather than only the last one, so that the next LSTM layer receives a sequence as input. The last layer is the output layer, which generates a probability for each candidate next character in the sequence. Finally, we compile the model with the Adam optimizer.
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
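After compilation, we will fit our model with the generated input and output. We pass a list of callbacks to fit; here we assume it holds a ModelCheckpoint that saves the weights whenever the training loss improves (the file name is illustrative).
desired_callbacks = [ModelCheckpoint("best.weights.h5", monitor='loss', save_best_only=True, save_weights_only=True, mode='min')]
model.fit(X, y, epochs=4, batch_size=256, callbacks=desired_callbacks)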
For better results, train for more than 15 epochs.
Since we converted the characters to numbers earlier, we need to define a dictionary that converts the output of the model back into characters:
num_to_char = dict((i, c) for i, c in enumerate(chars))
Now we will generate characters with our trained model. We pick a random seed sequence from the input data, from which the model can generate a sequence of characters:
start = numpy.random.randint(0, len(x_data) - 1)
pattern = x_data[start]
print("Random Seed:")
print("\"", ''.join([num_to_char[value] for value in pattern]), "\"")
>>> Random Seed:
" burned alive kneeled rain honour dirty procession monks passed within view distance fifty sixty yard "
Now we will predict the characters, which involves converting the output numbers into characters and then appending each predicted index to the pattern:
for i in range(1000):
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(vocab_len)
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)  # most probable next character
    result = num_to_char[index]
    seq_in = [num_to_char[value] for value in pattern]
    sys.stdout.write(result)
    pattern.append(index)
    pattern = pattern[1:len(pattern)]  # drop the oldest character to keep the window length fixed
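The sliding-window mechanics of this loop can be tried out without a trained network. The sketch below replaces the model with a hypothetical stand-in predictor; the names `generate` and `fake_predict` and the three-character vocabulary are assumptions made for illustration.

```python
def generate(predict_next, seed, num_to_char, n_chars):
    """Repeatedly predict one index, emit its character, and slide the window."""
    pattern = list(seed)
    out = []
    for _ in range(n_chars):
        index = predict_next(pattern)
        out.append(num_to_char[index])
        pattern.append(index)   # the prediction becomes part of the next input
        pattern = pattern[1:]   # drop the oldest entry to keep the length fixed
    return "".join(out)

# Hypothetical stand-in for model.predict followed by argmax:
# it always returns the index after the last one, wrapping around.
num_to_char = {0: "a", 1: "b", 2: "c"}

def fake_predict(pattern):
    return (pattern[-1] + 1) % len(num_to_char)

text = generate(fake_predict, seed=[0, 1, 2], num_to_char=num_to_char, n_chars=5)
print(text)
```

With a real model, predict_next would reshape and scale the pattern, call model.predict, and take the argmax, as in the loop above.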
With this, you have seen the complete workflow for text generation using an LSTM.