GPT-2 Transformer Trainer
May 2024
Custom GPT-2 transformer trainer for PDF/text datasets.
Project Overview
A transformer-based language model trainer built with PyTorch and GPT-2 tokenization (tiktoken). This project provides a robust framework for fine-tuning GPT-style models on arbitrary PDF or text datasets. It emphasizes reproducibility, performance monitoring, and experimentation with transformer architectures.
The trainer is a command-line tool that handles dataset preparation, tokenization, model training, and inference sampling. Ideal for machine learning researchers, hobbyists, and developers looking to fine-tune GPT-2 on custom text datasets.
As a demonstration, the trainer was tested on Bram Stoker's Dracula PDF, showing its ability to ingest and fine-tune on a single, coherent text source while generating character-consistent output. This illustrates the trainer's applicability to literary datasets and other long-form text inputs.
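The project does not say which library it uses to read PDFs; the following is a minimal ingestion sketch assuming the pypdf package and a data/ input directory (both assumptions, not confirmed by the repository).

```python
# Minimal ingestion sketch. Assumes the pypdf package and a data/ directory;
# the project's actual PDF-handling code may differ.
from pathlib import Path
from pypdf import PdfReader

def extract_text(pdf_path: Path) -> str:
    """Concatenate the text of every page of a PDF into one string."""
    reader = PdfReader(str(pdf_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Convert each PDF in the data directory to a plain-text file for training.
for pdf in Path("data").glob("*.pdf"):
    pdf.with_suffix(".txt").write_text(extract_text(pdf), encoding="utf-8")
```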
🧠 Model Architecture
- GPT-2 style transformer with positional embeddings (see the sketch after this list)
- Trained with cross-entropy loss and the AdamW optimizer
- Evaluated at configurable intervals on a validation dataset
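The repository builds its GPT blocks from scratch; purely to illustrate the architecture described above, here is a minimal sketch that substitutes PyTorch's built-in transformer layers under an illustrative MiniGPT name, with arbitrarily chosen hyperparameters.

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    """GPT-2-style decoder-only model: token + positional embeddings,
    a stack of pre-norm transformer blocks with a causal mask, and an
    output head projecting back to the vocabulary."""

    def __init__(self, vocab_size=50257, context_len=256, d_model=384,
                 n_heads=6, n_layers=6, dropout=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)      # token embeddings
        self.pos_emb = nn.Embedding(context_len, d_model)     # learned positional embeddings
        block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, dropout=dropout,
            activation="gelu", batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):                                   # idx: (batch, seq) of token ids
        seq_len = idx.shape[1]
        pos = torch.arange(seq_len, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)             # (batch, seq, d_model)
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len).to(idx.device)
        x = self.blocks(x, mask=causal)                       # attention cannot look ahead
        return self.head(self.ln_f(x))                        # logits: (batch, seq, vocab)
```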
📚 Tokenization & Data Pipeline
- Tokenization via the tiktoken GPT-2 encoder
- Dataset split into context-length windows
- A PyTorch DataLoader feeds batched input/target tensors into the training loop (see the sketch below)
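A minimal version of that pipeline, using the standard tiktoken and torch.utils.data APIs; the window stride, batch size, and file name are illustrative assumptions.

```python
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

class TextWindowDataset(Dataset):
    """Encodes raw text with the GPT-2 tokenizer and slices it into
    fixed-length windows of (input, next-token target) pairs."""

    def __init__(self, text, context_len=256, stride=256):
        enc = tiktoken.get_encoding("gpt2")
        ids = enc.encode(text)
        self.inputs, self.targets = [], []
        for i in range(0, len(ids) - context_len, stride):
            chunk = ids[i : i + context_len + 1]
            self.inputs.append(torch.tensor(chunk[:-1]))
            self.targets.append(torch.tensor(chunk[1:]))   # targets shifted by one token

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

# Batches of shape (batch, context_len) flow straight into the training loop.
loader = DataLoader(TextWindowDataset(open("data/dracula.txt").read()),
                    batch_size=8, shuffle=True, drop_last=True)
```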
🧪 Training & Evaluation
- Runs multiple epochs over the dataset (training loop sketched below)
- Tracks training/validation loss
- Saves model checkpoints for reproducibility
- Optional text sample generation for qualitative evaluation
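A stripped-down training loop consistent with the bullets above, reusing the illustrative MiniGPT model and loader from the earlier sketches; the learning rate, epoch count, evaluation interval, checkpoint format, and the evaluate helper are assumptions, not the repository's actual values.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = MiniGPT().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
train_losses, val_losses = [], []

for epoch in range(5):
    model.train()
    for step, (xb, yb) in enumerate(loader):
        xb, yb = xb.to(device), yb.to(device)
        logits = model(xb)                                  # (batch, seq, vocab)
        loss = loss_fn(logits.flatten(0, 1), yb.flatten())  # cross-entropy over all tokens
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 100 == 0:                                 # evaluate at configurable intervals
            train_losses.append(loss.item())
            # val_losses.append(evaluate(model, val_loader))  # hypothetical helper: same loss on a held-out split
    # Checkpoint after each epoch so runs can be resumed or compared.
    torch.save({"model": model.state_dict(), "optim": optimizer.state_dict()},
               f"checkpoint_epoch{epoch}.pt")
```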
Project Features
- 🚀 Trains a GPT-2 model on one or more PDF/text datasets
- 🔤 Uses GPT-2-compatible tokenization via tiktoken
- 🔁 Custom training loop with PyTorch for flexible experimentation
- 📉 Tracks training/validation loss and tokens processed
- 💾 Save/load model checkpoints for reproducibility and iterative development
Project User Workflow
- Place PDF/text datasets in the data directory
- Run training with python gpt_train.py
- Monitor training loss and generate visual plots
- Generate text samples using provided prompts (see the sampling sketch after this list)
- Save and load model checkpoints for further experimentation
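For the last two workflow steps, a minimal checkpoint-load-and-sample routine might look like the sketch below. It reuses the illustrative MiniGPT class and checkpoint format from the earlier sketches and uses greedy decoding; the real sampler may use temperature or top-k instead, and the prompt and file name are placeholders.

```python
import tiktoken
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
enc = tiktoken.get_encoding("gpt2")

# Load a checkpoint saved by the training-loop sketch above (illustrative file name).
model = MiniGPT().to(device)
model.load_state_dict(torch.load("checkpoint_epoch4.pt", map_location=device)["model"])
model.eval()

@torch.no_grad()
def generate(prompt, max_new_tokens=50, context_len=256):
    """Greedy next-token decoding from a text prompt."""
    ids = torch.tensor([enc.encode(prompt)], device=device)       # (1, seq)
    for _ in range(max_new_tokens):
        logits = model(ids[:, -context_len:])                     # crop to the context window
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
        ids = torch.cat([ids, next_id], dim=1)
    return enc.decode(ids[0].tolist())

print(generate("The castle stood on the edge of"))
```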
Technologies Used
- 🐍 Python: core programming language
- 📦 PyTorch: deep learning framework
- 💡 tiktoken: GPT-2 tokenization library
- 🧠 Custom GPT: transformer model built from scratch
- 🧪 AdamW: optimizer for stable convergence
- 📈 Matplotlib: visualizing training and validation loss (plotting sketch below)
- 📂 File I/O: data reading and checkpoint saving
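As a sketch of the loss-visualization step, a small Matplotlib helper could look like this; the function name, figure size, and output path are illustrative, and the loss lists are the ones accumulated in the training-loop sketch above.

```python
import matplotlib.pyplot as plt

def plot_losses(train_losses, val_losses, out_path="loss_curve.png"):
    """Plot training and validation loss against evaluation steps."""
    plt.figure(figsize=(6, 4))
    plt.plot(range(len(train_losses)), train_losses, label="training loss")
    if val_losses:
        plt.plot(range(len(val_losses)), val_losses, label="validation loss")
    plt.xlabel("evaluation step")
    plt.ylabel("cross-entropy loss")
    plt.legend()
    plt.tight_layout()
    plt.savefig(out_path)   # saved alongside checkpoints for later comparison
```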