ML · Python · Training · May 2024

GPT Trainer

A reusable framework for training GPT-2 style transformers on any PDF or text dataset. Built from scratch in PyTorch with tiktoken tokenization, loss tracking, and checkpoint management. Demonstrated on Bram Stoker's Dracula.

PyTorch · Transformers · Python · tiktoken · AdamW

Overview

GPT Trainer is a command-line tool for training GPT-2 style language models on arbitrary text or PDF datasets. The goal was to build the full training pipeline from scratch — not a wrapper around a pretrained model — to develop genuine understanding of how transformers learn.

The framework emphasizes reproducibility: fixed tokenization with tiktoken, configurable context window size, checkpointing at every milestone, and visual loss curves that make it easy to diagnose training behavior. It was validated on Bram Stoker's Dracula, which later became the foundation for the Dracula AI Agent.

This project is the engine underneath the Dracula AI Agent. It was extracted as a standalone reusable tool so any text dataset can be dropped in and trained with a single command.

Architecture

🧠
Transformer Model
  • GPT-2 style with token + positional embeddings
  • Multi-head self-attention blocks
  • Cross-entropy loss, AdamW optimizer
  • Configurable depth, heads, and context size
📊
Data Pipeline
  • PDF and plaintext ingestion
  • tiktoken GPT-2 tokenization
  • Context-window chunking with stride
  • Train/validation split via DataLoader
🧪
Training + Eval
  • Multi-epoch loop with configurable steps
  • Periodic validation loss evaluation
  • Token throughput tracking
  • Qualitative sampling between epochs
💾
Checkpointing
  • Save at milestones with torch.save
  • Resume training from any checkpoint
  • Inference-only load path
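The data pipeline's "context-window chunking with stride" step can be sketched as follows. This is an illustrative sketch, not the framework's actual code: `chunk_tokens` is a hypothetical helper, and the toy integer token stream stands in for real ids, which the pipeline would produce with tiktoken's GPT-2 encoding.

```python
def chunk_tokens(token_ids, context_size, stride):
    """Slice a token stream into overlapping (input, target) pairs.

    Targets are the inputs shifted one position right: the standard
    next-token-prediction setup used for GPT-style training.
    """
    inputs, targets = [], []
    for start in range(0, len(token_ids) - context_size, stride):
        inputs.append(token_ids[start : start + context_size])
        targets.append(token_ids[start + 1 : start + context_size + 1])
    return inputs, targets

# Toy token stream; a real run would tokenize text first, e.g. with
# tiktoken.get_encoding("gpt2").encode(text).
ids = list(range(10))
xs, ys = chunk_tokens(ids, context_size=4, stride=2)
# xs[0] is [0, 1, 2, 3]; ys[0] is the same window shifted by one: [1, 2, 3, 4]
```

A stride smaller than the context size makes consecutive windows overlap, which trades more training examples per document against some repetition between batches.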

Features

📚
Any text or PDF dataset
Drop a PDF or .txt file into data/ and the pipeline handles everything — extraction, tokenization, batching.
🔠
GPT-2 compatible tokenization
tiktoken ensures the vocabulary is consistent with GPT-2, making the model interoperable with standard tooling.
🔁
Flexible training loop
Pure PyTorch — no Trainer API abstraction. Full control over learning rate, batch size, epochs, and evaluation frequency.
📉
Loss visualization
Matplotlib plots train and validation loss side-by-side, with tokens-per-second throughput tracked throughout.
💾
Checkpoint management
Save at any point. Resume training or load for inference-only. Each checkpoint includes model weights and optimizer state.

Workflow

Add dataset
Place PDF or text file in data/. Supports single and multi-document datasets.
Run training
Execute python gpt_train.py. Configure epochs, batch size, and context window via command-line flags or config file.
Watch loss curves
Matplotlib plots update after each eval pass; training and validation loss are printed to stdout along with token throughput.
Sample output
Between epochs, the model generates text from a seed prompt. Useful for qualitative evaluation of training progress.
Save checkpoint
Checkpoint saved automatically at configurable intervals. Resume or load for inference at any point.
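The shape of the training step in the workflow above (cross-entropy loss, AdamW optimizer, next-token targets) can be sketched with a tiny stand-in model. This is a minimal illustration, not the framework's `gpt_train.py`: the embedding-plus-linear model replaces the real GPT-2 style transformer, and the random batches replace the DataLoader built over the chunked dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in model; the real framework trains a full multi-head
# attention transformer, but the loop shape is identical.
vocab_size, context_size = 50, 8
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

losses = []
for step in range(20):
    # Random (input, target) token ids stand in for real chunked batches.
    x = torch.randint(0, vocab_size, (4, context_size))
    y = torch.randint(0, vocab_size, (4, context_size))
    logits = model(x)  # (batch, context, vocab)
    # Cross-entropy over every position: flatten batch and context dims.
    loss = F.cross_entropy(logits.flatten(0, 1), y.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

The recorded `losses` list is what the Matplotlib step plots; periodic validation passes run the same forward computation under `torch.no_grad()` without the optimizer update.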

Tech Stack

🐍
Python · Core language
🔦
PyTorch · Deep learning
🔡
tiktoken · GPT-2 tokenization
🧠
Custom Transformer · Built from scratch
🧪
AdamW · Optimizer
📈
Matplotlib · Loss visualization
📁
File I/O · PDF + text ingestion
💾
torch.save/load · Checkpointing