# Nature Language Processing

In this series of posts related to NLP, we will begin by discussing several foundational concepts and basic methodologies used to address the problem. Subsequently, we will cover the latest methods, such as GPTs.

I presume the audience is familiar with the fundamental concepts of neural networks and multi-layer perceptrons (MLP).

## Word Embedding

Computer-based methods operate on numerical data. Therefore, it is necessary to find a way to transform “word” in the phonetic alphabet or “character” in hieroglyphic languages into numbers.

A one-hot vector is a straightforward concept where a vector consists of a single element set to 1 and all other elements set to 0 to represent a word or character. A one-hot vector essentially functions like a word index. However, the issue with a one-hot vector is its sparsity. The vector is so sparse that it makes most methods difficult to process. Therefore, we need a method to map this one-hot vector (index) to an embedding (a vector).

Certain neural network models feature an embedding layer that can be trained during the neural network’s training process. Pretrained embedding models like Word2Vec and GloVe are also available for direct use.

### How to Get Word Embedding

**Futher Reading**

- https://arxiv.org/pdf/1301.3781

The primary idea is to construct a network aimed at forecasting the missing word. For illustration, take the sentence ‘I eat apple.’ If we remove ‘apple’, converting the sentence to ‘I eat __’, the network would predict the missing word. Through this approach, we generate the training dataset without requiring any labels.

We shall provide a formal description. Given a dictionary that contains $n$ words, our aim is to develop a model that maps the one-hot encoding of a word to an embedding space $R^d$. By using the embedded word as the input of the network, the network predicts the omitted word. Hence, we obtain the word embedding by training the network.

**Comments: Word embedding is similar to dimensionality reduction. Therefore, techniques such as t-SNE and UMAP can be viewed as approaches to obtain word embeddings. However, these methods cannot be applied directly, as one-hot vectors do not encapsulate the relationships between words. Nonetheless, if we can map one-hot vectors to a feature space without using a neural network, it might be possible to develop methods that offer greater human interpretability**.

## Recurrent Neural Network (RNN)

Recurrent Neural Networks (RNNs) are a type of neural network architecture that allows for the reuse of the same network across different time steps. This makes them particularly useful for tasks that involve sequential data, such as natural language processing and time series analysis.

The basic idea behind RNNs is to maintain a hidden state that captures information from previous time steps and combines it with the current input to generate an output. This hidden state serves as a memory that allows the network to capture dependencies and patterns in the sequential data.

The architecture of an RNN is illustrated in the following diagram:

At each time step, the RNN takes an input vector, denoted as $x_t$, and combines it with the hidden state from the previous time step, denoted as $h_{t-1}$. This combination is typically done using a weighted sum, where the weights are learned during the training process. The resulting hidden state, denoted as $h_t$, is then used to generate the output at that time step, denoted as $y_t$.

The equations that govern the behavior of an RNN can be summarized as follows:

\[h_t = f(W_{hh}h_{t-1} + W_{hx}x_t)\] \[y_t = f(W_{yh}h_t)\]In these equations, $f$ represents the activation function, which is typically a sigmoid or a hyperbolic tangent function. $W_{hh}$, $W_{hx}$, and $W_{yh}$ are the weight matrices that are learned during the training process.

It is important to note that RNNs have a limitation known as the “vanishing gradient” problem, which makes it difficult for them to capture long-term dependencies in the data. This means that the output at a given time step is primarily influenced by the input at that time step and the previous hidden state, rather than inputs from earlier time steps. However, there are variations of RNNs, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), that have been developed to address this issue.

RNNs have a wide range of applications, including language translation, text generation, sentiment analysis, and speech recognition. They are particularly effective when dealing with sequential data that exhibits temporal dependencies.

### Training an RNN

To train an RNN, we typically formulate the problem as a classification or regression task and use appropriate loss functions, such as cross-entropy or mean squared error, to measure the difference between the predicted output and the ground truth. The weights of the RNN are then updated using gradient-based optimization algorithms, such as backpropagation through time (BPTT) or truncated backpropagation through time (TBPTT).

Training an RNN can be challenging due to the vanishing gradient problem and the need to handle variable-length sequences. Techniques such as gradient clipping, regularization, and batch normalization can be employed to mitigate these challenges and improve the training process.

In summary, RNNs are a powerful tool for modeling sequential data and capturing temporal dependencies. They have been widely used in various domains and have paved the way for more advanced architectures, such as Transformers and GPTs, in the field of natural language processing.

### Further Reading on RNN

Here are some online resources you can explore for further reading on Recurrent Neural Networks (RNNs):

- Recurrent Neural Networks - Stanford University
- Understanding LSTM Networks - Christopher Olah
- The Unreasonable Effectiveness of Recurrent Neural Networks - Andrej Karpathy
- Deep Learning Book - Chapter 10: Sequence Modeling: Recurrent and Recursive Nets
- Sequence to Sequence Learning with Neural Networks - Ilya Sutskever, Oriol Vinyals, and Quoc V. Le

## Temperature Sigmoid

Temperature is an important parameter when working with sequence-to-sequence models. In traditional classification networks, we use cross-entropy loss to train the network and aim for high confidence in the predicted class. However, when it comes to generating creative content with sequence-to-sequence models, we want the model to explore beyond the highest probability word.

In a traditional classification network, we typically generate the output using the top-1 or top-5 predictions. However, if we use the top-1 prediction in a recurrent neural network (RNN), the generated sequence will always be the same for a given input. This limits the model’s ability to generate diverse outputs. For example, in English-to-Chinese translation, there are multiple valid translations for a given input.

To address this, we can sample the output word based on the predicted probabilities. Instead of always selecting the word with the highest probability, we can use the probability distribution to randomly select a word. For instance, if the predicted probabilities for the next word after “I ate” are apple: 0.1, orange: 0.8, and banana: 0.1, we can generate apple or banana with a 10% probability and generate orange with an 80% probability.

However, after training, the network tends to favor certain words that were heavily used during training. To encourage the model to generate more diverse and creative results, we can introduce the concept of temperature.

The temperature parameter is applied when calculating the probability using the sigmoid function $\frac{e^z}{1 + e^z}$. By replacing it with $\frac{e^{z/T}}{1 + e^{z/T}}$, where T is the temperature, we can control the entropy of the probability distribution. Increasing the temperature will increase the randomness and diversity of the generated outputs.

This simple technique of using temperature in the sigmoid function can help make the network more creative and generate a wider range of results.

## Attention Mechanism

The attention mechanism is a key component in many advanced neural network architectures, particularly in natural language processing tasks such as machine translation and text summarization. It allows the model to focus on different parts of the input sequence when making predictions, giving it the ability to capture relevant information and improve performance.

At a high level, the attention mechanism works by assigning weights to different elements of the input sequence based on their relevance to the current prediction. These weights are then used to compute a weighted sum of the input elements, which is combined with the hidden state of the model to generate the final output.

There are several types of attention mechanisms, including additive attention, multiplicative attention, and self-attention. Each type has its own advantages and is suitable for different scenarios.

Additive attention, also known as Bahdanau attention, uses a feed-forward neural network to compute the attention weights. It has been widely used in sequence-to-sequence models and has shown good performance in various tasks.

Multiplicative attention, also known as Luong attention, uses a dot product or a bilinear function to compute the attention weights. It is computationally more efficient than additive attention and has been successfully applied in machine translation tasks.

Self-attention, also known as transformer attention, is a type of attention mechanism that allows the model to attend to different positions within the same input sequence. It has been a key component in transformer models, which have achieved state-of-the-art performance in various natural language processing tasks.