Building a Transformer: A Comprehensive Walkthrough


Imagine you’re tasked with creating a language model that generates a coherent and context-aware continuation of a given sentence. This article walks you through implementing a Transformer architecture from scratch, delving into the mechanics with TensorFlow code snippets while applying the theory to the practical task of generating text.


Scenario: Predicting the Next Word in a Sentence

Consider the sentence:

“The sun sets in the west, and the moon rises in the…”

Our goal is to train a Transformer model to predict the most likely word to complete this sentence. Let’s break this process into steps:


Step 1: Input Embeddings

The first step is to map the input tokens to high-dimensional vectors. The Transformer operates on these fixed-size embeddings rather than on the raw words themselves.

import tensorflow as tf
from tensorflow.keras.layers import Embedding
import numpy as np

# Parameters
vocab_size = 10000  # Example vocabulary size
d_model = 512       # Embedding dimension

# Input Embedding Layer
class InputEmbeddings(tf.keras.layers.Layer):
    def __init__(self, vocab_size, d_model):
        super(InputEmbeddings, self).__init__()
        self.d_model = d_model
        self.embedding = Embedding(input_dim=vocab_size, output_dim=d_model)

    def call(self, x):
        # Scale the embeddings by sqrt(d_model), as in the original Transformer
        return self.embedding(x) * tf.math.sqrt(tf.cast(self.d_model, tf.float32))

# Example usage
token_ids = np.array([[1, 5, 10, 7, 2]])  # Mock tokenized input
embedding_layer = InputEmbeddings(vocab_size, d_model)
embedded_inputs = embedding_layer(token_ids)
print("Embedded Inputs Shape:", embedded_inputs.shape)  # (1, 5, 512)

The output embeddings are scaled by √d_model to stabilize the gradient flow.


Step 2: Adding Positional Encoding

Transformers lack recurrence, so positional encoding provides information about token order.

class PositionalEncoding(tf.keras.layers.Layer):
    def __init__(self, seq_len, d_model):
        super(PositionalEncoding, self).__init__()
        self.pos_encoding = self.positional_encoding(seq_len, d_model)

    def positional_encoding(self, seq_len, d_model):
        positions = np.arange(seq_len)[:, np.newaxis]
        div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(positions * div_term)  # even dimensions use sine
        pe[:, 1::2] = np.cos(positions * div_term)  # odd dimensions use cosine
        return tf.constant(pe[np.newaxis, ...], dtype=tf.float32)

    def call(self, x):
        # Add the positional encoding (truncated to the input length) to the embeddings
        return x + self.pos_encoding[:, :tf.shape(x)[1], :]

# Example usage
seq_len = 10
pos_enc_layer = PositionalEncoding(seq_len, d_model)
pos_encoded_inputs = pos_enc_layer(embedded_inputs)
print("Positional Encoded Inputs Shape:", pos_encoded_inputs.shape)  # (1, 5, 512)

This adds sine and cosine positional information to each token embedding.
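For reference, the values produced by positional_encoding follow the sinusoidal scheme from the original Transformer paper: even embedding dimensions get a sine and odd dimensions a cosine, each at a different frequency.

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Because these frequencies are fixed rather than learned, the same layer can encode positions for any sequence up to seq_len tokens long.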


Step 3: Multi-Head Attention

Multi-head attention is the backbone of the Transformer. It computes a weighted representation of the input tokens.

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        assert d_model % self.num_heads == 0
        self.depth = d_model // self.num_heads  # dimension per head

        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, q, k, v, mask):
        batch_size = tf.shape(q)[0]

        q = self.split_heads(self.wq(q), batch_size)
        k = self.split_heads(self.wk(k), batch_size)
        v = self.split_heads(self.wv(v), batch_size)

        scaled_attention, attention_weights = self.scaled_dot_product_attention(q, k, v, mask)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])

        # Concatenate the heads back into a single d_model-dimensional representation
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
        return self.dense(concat_attention)

    def scaled_dot_product_attention(self, q, k, v, mask):
        matmul_qk = tf.matmul(q, k, transpose_b=True)
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

        if mask is not None:
            # Masked positions receive a large negative logit, so softmax drives them to ~0
            scaled_attention_logits += (mask * -1e9)

        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
        output = tf.matmul(attention_weights, v)
        return output, attention_weights

# Example usage
num_heads = 8
attention_layer = MultiHeadAttention(d_model, num_heads)
attention_output = attention_layer(pos_encoded_inputs, pos_encoded_inputs, pos_encoded_inputs, None)
print("Attention Output Shape:", attention_output.shape)  # (1, 5, 512)

This ensures the model can attend to different parts of the input sequence simultaneously.
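Concretely, scaled_dot_product_attention implements the standard formula

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

where d_k is the per-head depth. Dividing by √d_k keeps the dot products from growing too large before the softmax, which would otherwise push the attention weights toward a hard one-hot distribution and shrink the gradients.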


Step 4: Building the Transformer Model

Using the components defined above, let’s assemble a simplified encoder-decoder model. A full Transformer would also stack multiple layers with feed-forward sublayers, residual connections, and layer normalization, but this stripped-down version is enough to show how the pieces fit together.

# Simplified Transformer Model
class Transformer(tf.keras.Model):
    def __init__(self, src_vocab_size, tgt_vocab_size, seq_len, d_model, num_heads, ff_dim, num_layers):
        super(Transformer, self).__init__()
        # Note: ff_dim and num_layers are kept for interface completeness but are
        # not used in this simplified, single-layer version.
        self.encoder = InputEmbeddings(src_vocab_size, d_model)
        self.decoder = InputEmbeddings(tgt_vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(seq_len, d_model)
        self.attention = MultiHeadAttention(d_model, num_heads)

    def call(self, src, tgt, src_mask, tgt_mask):
        # Encode: embed the source tokens and add positional information
        enc_output = self.encoder(src)
        enc_output = self.pos_encoding(enc_output)

        # Decode: embed the target tokens and add positional information
        dec_output = self.decoder(tgt)
        dec_output = self.pos_encoding(dec_output)

        # Cross-attention: the target queries attend over the encoded source
        return self.attention(dec_output, enc_output, enc_output, src_mask)

# Build the Transformer (the same vocabulary is used for source and target here)
transformer_model = Transformer(vocab_size, vocab_size, seq_len, d_model, num_heads, 2048, 6)

This model serves as a baseline for text generation tasks.
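As a quick sanity check (not part of the original walkthrough), you can run a forward pass with mock token IDs; the masks are passed as None because this simplified model does not apply them:

# Hypothetical smoke test for the simplified model
src_ids = np.array([[1, 5, 10, 7, 2]])  # mock source tokens
tgt_ids = np.array([[1, 5, 10, 7]])     # mock (shifted) target tokens
output = transformer_model(src_ids, tgt_ids, None, None)
print("Transformer Output Shape:", output.shape)  # (1, 4, 512)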


Step 5: Training and Generating Text

To train, you’d feed tokenized sequences into the Transformer, compute the loss against the shifted target sequence, and backpropagate. Once trained, the model generates the next word by sampling from (or taking the argmax of) the output distribution over the vocabulary.
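Here is a minimal sketch of what that could look like. It assumes a few things not defined in the article: a final Dense(vocab_size) projection (final_layer) that turns the model output into vocabulary logits, and tokenized src_batch / tgt_batch tensors no longer than seq_len. Treat it as an illustration rather than a complete training pipeline; it omits padding and look-ahead masks, batching, and evaluation.

# A minimal, hypothetical training step and greedy next-word prediction.
# final_layer, src_batch, and tgt_batch are assumptions, not part of the article's code.
final_layer = tf.keras.layers.Dense(vocab_size)  # maps d_model -> vocabulary logits
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

def train_step(src_batch, tgt_batch):
    # Teacher forcing: the decoder sees tgt[:-1] and learns to predict tgt[1:]
    tgt_in, tgt_out = tgt_batch[:, :-1], tgt_batch[:, 1:]
    with tf.GradientTape() as tape:
        dec_output = transformer_model(src_batch, tgt_in, None, None)
        logits = final_layer(dec_output)           # (batch, seq_len - 1, vocab_size)
        loss = loss_fn(tgt_out, logits)
    variables = transformer_model.trainable_variables + final_layer.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss

def predict_next_token(src_ids, tgt_ids):
    # Greedy decoding: take the most likely token at the last position
    dec_output = transformer_model(src_ids, tgt_ids, None, None)
    logits = final_layer(dec_output)[:, -1, :]     # logits for the final position
    return tf.argmax(logits, axis=-1)              # predicted next token id

Repeatedly appending the predicted token to tgt_ids and calling predict_next_token again gives simple greedy generation, which is how the model would complete “the moon rises in the…”.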


By following these steps, you’ve built a Transformer model capable of processing input sequences and generating meaningful predictions. From embedding tokens to generating attention scores, every component works together to understand and produce language.
