GPT Trainer
A reusable framework for training GPT-2 style transformers on any PDF or text dataset. Built from scratch in PyTorch with tiktoken tokenization, loss tracking, and checkpoint management. Demonstrated on Bram Stoker's Dracula.
Overview
GPT Trainer is a command-line tool for training GPT-2 style language models on arbitrary text or PDF datasets. The goal was to build the full training pipeline from scratch — not a wrapper around a pretrained model — to develop genuine understanding of how transformers learn.
The framework emphasizes reproducibility: fixed tokenization with tiktoken, configurable context window size, checkpointing at every milestone, and visual loss curves that make it easy to diagnose training behavior. It was validated on Bram Stoker's Dracula, which later became the foundation for the Dracula AI Agent.
This project is the engine underneath the Dracula AI Agent. It was extracted as a standalone reusable tool so any text dataset can be dropped in and trained with a single command.
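The periodic loss tracking that produces those curves can be sketched in outline. Here `eval_every`, `train_step`, and `evaluate` are hypothetical names standing in for the project's actual functions, not its real API:

```python
def run_training(num_steps, eval_every, train_step, evaluate):
    """Run train steps, recording validation loss every `eval_every` steps.

    Returns a history of (step, val_loss) pairs, which is what gets
    plotted as a loss curve to diagnose training behavior.
    """
    history = []
    for step in range(1, num_steps + 1):
        train_step()  # one forward/backward pass + optimizer update
        if step % eval_every == 0:
            history.append((step, evaluate()))  # cheap periodic validation
    return history

# Toy stand-ins: a no-op train step and a constant validation loss.
history = run_training(10, eval_every=3,
                       train_step=lambda: None, evaluate=lambda: 1.0)
```

Keeping evaluation on a fixed step interval (rather than per epoch) makes curves comparable across datasets of different sizes.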
Architecture
- GPT-2 style with token + positional embeddings
- Multi-head self-attention blocks
- Cross-entropy loss, AdamW optimizer
- Configurable depth, heads, and context size
- PDF and plaintext ingestion
- tiktoken GPT-2 tokenization
- Context-window chunking with stride
- Train/validation split via DataLoader
- Multi-epoch loop with configurable steps
- Periodic validation loss evaluation
- Token throughput tracking
- Qualitative sampling between epochs
- Save at milestones with torch.save
- Resume training from any checkpoint
- Inference-only load path
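The context-window chunking with stride listed above can be sketched as follows; `max_length` and `stride` are illustrative parameter names, and the toy integer ids stand in for a tiktoken-encoded corpus:

```python
def chunk_tokens(token_ids, max_length, stride):
    """Slice a token stream into overlapping (input, target) windows.

    Targets are the inputs shifted one position right -- the standard
    next-token-prediction setup for GPT-style training. A stride smaller
    than max_length yields overlapping windows and more training examples.
    """
    inputs, targets = [], []
    # Stop early enough that the shifted target window stays in bounds.
    for start in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[start:start + max_length])
        targets.append(token_ids[start + 1:start + max_length + 1])
    return inputs, targets

# Toy example: 10 fake token ids, context of 4, stride of 2.
ids = list(range(10))
x, y = chunk_tokens(ids, max_length=4, stride=2)
```

In the real pipeline these windows would be wrapped in a PyTorch Dataset and batched by the DataLoader that performs the train/validation split.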
Features
Drop any PDF or plain-text file into data/ and the pipeline handles everything: extraction, tokenization, batching.
Workflow
1. Place your dataset in data/. Supports single and multi-document datasets.
2. Run python gpt_train.py. Configure epochs, batch size, and context window via command-line flags or config file.
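A typical run might look like the following; the flag names are illustrative assumptions, since the source does not list the exact command-line interface:

```shell
# Place the dataset in data/ (PDF or plain text), then train.
# Flag names below are hypothetical, not the project's exact interface.
cp ~/books/dracula.pdf data/
python gpt_train.py --epochs 10 --batch-size 8 --context-length 256
```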