How to Build and Train a Transformer Model from Scratch with Hugging Face Transformers



 

The Hugging Face Transformers library provides tools for easily loading and using pre-trained language models (LMs) based on the transformer architecture. But did you know the library also allows you to implement and train a transformer model from scratch? This tutorial shows how, through a step-by-step sentiment classification example.

Important note: Training a transformer model from scratch is computationally expensive, with a single training loop typically taking hours at the very least. To run the code in this tutorial, it is highly recommended to have access to high-performance computing resources, either on-premises or via a cloud provider.

 

Step-by-Step Process

 

Initial Setup and Dataset Loading

Depending on the Python development environment you are working in, you may need to install Hugging Face’s transformers and datasets libraries, as well as the accelerate library, to train your transformer model in a distributed computing setting.

!pip install transformers datasets
!pip install accelerate -U
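
If you want to double-check which versions ended up installed (the Trainer requires accelerate 0.21.0 or newer, as the troubleshooting section below mentions), a quick optional check is:

import transformers
import datasets
import accelerate

# Print the installed versions; the Trainer needs accelerate >= 0.21.0
print(transformers.__version__, datasets.__version__, accelerate.__version__)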

 

Once the necessary libraries are installed, let’s load the emotions dataset for sentiment classification of Twitter messages from the Hugging Face Hub:

from datasets import load_dataset
dataset = load_dataset('jeffnyman/emotions')
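
Before tokenizing anything, it is worth taking a quick look at what load_dataset returned. A minimal, optional inspection (the exact split names and sizes come from the jeffnyman/emotions dataset card) looks like this:

# Show the DatasetDict structure: available splits and number of rows in each
print(dataset)

# Peek at one raw training example: the tweet text and its integer emotion label
print(dataset['train'][0])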

 

Using the data for training a transformer-based LM requires tokenizing the text. The following code initializes a BERT tokenizer (BERT is a family of transformer models suitable for text classification tasks), defines a function to tokenize text data with padding and truncation, and applies it to the dataset in batches.

from transformers import AutoTokenizer

# Load a BERT tokenizer; its vocabulary will also be reused when configuring the model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
  # Pad/truncate every message to the tokenizer's maximum sequence length
  return tokenizer(examples['text'], padding="max_length", truncation=True)

# Tokenize the whole dataset in batches
tokenized_datasets = dataset.map(tokenize_function, batched=True)
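
To confirm the tokenization behaved as expected, you can peek at a single processed example. The input_ids and attention_mask fields are standard BERT tokenizer outputs; with padding="max_length", each sequence is padded to the tokenizer's maximum length (512 for bert-base-uncased):

sample = tokenized_datasets['train'][0]

# The original columns are kept, with the tokenizer's outputs added alongside them
print(sample.keys())

# Every example is padded/truncated to the same length
print(len(sample['input_ids']))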

 

Before moving on to initializing the transformer model, let’s check the unique labels in the dataset. Verifying that the set of class labels is consistent and correct helps prevent GPU-related errors during training. We will use this label set later on.

# Collect the distinct label ids present in the training split
unique_labels = set(tokenized_datasets['train']['label'])
print(f"Unique labels in the training set: {unique_labels}")

def check_labels(dataset):
  # Flag any training label that falls outside the collected set
  for label in dataset['train']['label']:
    if label not in unique_labels:
      print(f"Found invalid label: {label}")

check_labels(tokenized_datasets)
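
If the dataset stores its labels as a ClassLabel feature, you can also map the integer ids back to human-readable emotion names. This is an optional check, since not every dataset exposes that metadata:

label_feature = tokenized_datasets['train'].features['label']

# Only ClassLabel features carry a 'names' list; plain integer columns do not
if hasattr(label_feature, 'names'):
  print({i: name for i, name in enumerate(label_feature.names)})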

 

Next, we define a model configuration and instantiate the transformer model with it. This is where we specify hyperparameters of the transformer architecture, such as the embedding size, the number of attention heads, and the previously computed set of unique labels, which is key to building the final output layer for sentiment classification.

from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(
  vocab_size=tokenizer.vocab_size,
  hidden_size=512,
  num_hidden_layers=6,
  num_attention_heads=8,
  intermediate_size=2048,
  max_position_embeddings=512,
  num_labels=len(unique_labels)
)

model = BertForSequenceClassification(config)
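
This configuration is deliberately smaller than the original bert-base architecture (512 hidden units and 6 layers instead of 768 and 12), since we are training from scratch. An optional check of the resulting model size, using plain PyTorch:

# Count the model's parameters to get a feel for its size
num_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {num_params / 1e6:.1f}M")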

 

We are almost ready to train our transformer model. It just remains to instantiate two objects: TrainingArguments, which holds specifications about the training loop such as the number of epochs, and Trainer, which glues together the model instance, the training arguments, and the data used for training and evaluation.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
  output_dir="./results",
  evaluation_strategy="epoch",
  learning_rate=2e-5,
  per_device_train_batch_size=16,
  per_device_eval_batch_size=16,
  num_train_epochs=3,
  weight_decay=0.01,
)

trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=tokenized_datasets["train"],
  eval_dataset=tokenized_datasets["test"],
)
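
As configured, the Trainer will only report the evaluation loss at the end of each epoch. If you also want an accuracy figure, you can optionally pass a compute_metrics function when building the Trainer. This is an addition to the setup above, sketched here with NumPy:

import numpy as np

def compute_metrics(eval_pred):
  # The Trainer passes (logits, labels) for the evaluation set
  logits, labels = eval_pred
  predictions = np.argmax(logits, axis=-1)
  return {"accuracy": (predictions == labels).mean()}

# Then build the Trainer with: Trainer(..., compute_metrics=compute_metrics)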

 

Time to train the model: sit back and relax, and remember that this instruction will take a significant amount of time to complete:
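
# Launch the training loop (expect this to run for hours on modest hardware)
trainer.train()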

 

Once trained, your transformer model should be ready to take in input examples for sentiment prediction.
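
As a brief illustration of what that looks like (prediction is left as a follow-up exercise at the end of this tutorial, so treat this as a minimal sketch with a made-up example message), you could tokenize a new text and take the argmax over the model's logits:

import torch

# Hypothetical input message
text = "I can't wait to see my friends this weekend!"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

model.eval()
with torch.no_grad():
  logits = model(**inputs).logits

# The predicted class id corresponds to one of the dataset's emotion labels
predicted_label = logits.argmax(dim=-1).item()
print(f"Predicted label id: {predicted_label}")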

 

Troubleshooting

If problems appear or persist when executing the training loop or during its setup, you may need to inspect the configuration of the GPU/CPU resources being used. For instance, if using a CUDA GPU, adding these instructions at the beginning of your code can make errors in the training loop easier to diagnose:

import os

# Make CUDA kernel launches synchronous so errors surface at the exact failing operation
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
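
It can also help to confirm which accelerator is actually visible to PyTorch before launching the training loop; a quick optional check:

import torch

# Report whether a CUDA GPU is available and, if so, which one
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
  print(f"GPU: {torch.cuda.get_device_name(0)}")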

 

This setting makes CUDA operations synchronous, providing more immediate and accurate error messages for debugging. If you want to rule out the GPU entirely and force CPU execution instead, you can additionally set os.environ["CUDA_VISIBLE_DEVICES"] = "" before any CUDA initialization.

On the other hand, if you are running this code in a Google Colab instance, chances are this error message shows up during execution, even if you previously installed the accelerate library:

ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.21.0`: Please run `pip install transformers[torch]` or `pip install accelerate -U`

 

To address this issue, try restarting your session from the ‘Runtime’ menu: the accelerate library typically requires restarting the runtime environment after installation.

 

Summary and Wrap-Up

 

This tutorial showcased the key steps to build your own transformer-based LM from scratch using Hugging Face libraries. The main steps and elements involved can be summarized as follows:

  • Loading the dataset and tokenizing the text data.
  • Initializing your model with a configuration instance suited to the type of model (language task) it is intended for, e.g. BertConfig.
  • Setting up TrainingArguments and Trainer instances and running the training loop.

As a next learning step, we encourage you to explore how to make predictions and inferences with your newly trained model.
 
 

Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.


