MrTimmyJ

GPT-2 Transformer Trainer

May 2024

Custom GPT-2 transformer trainer for PDF/text datasets.

Project Overview

        A transformer-based language model trainer built with PyTorch and GPT-2 tokenization (tiktoken). This project provides a robust framework for fine-tuning GPT-style models on arbitrary PDF or text datasets. It emphasizes reproducibility, performance monitoring, and experimentation with transformer architectures.

        The trainer is a command-line tool that handles dataset preparation, tokenization, model training, and inference sampling. It is ideal for machine learning researchers, hobbyists, and developers who want to fine-tune GPT-2 on custom text datasets.

        As a demonstration, the trainer was tested on a PDF of Bram Stoker’s Dracula, showing that it can ingest and fine-tune on a single, coherent text source while generating character-consistent output. This illustrates the trainer’s applicability to literary datasets and other structured text inputs.

    🧠 Model Architecture

  • GPT-2-style transformer with learned positional embeddings (sketched below)
  • Trained with cross-entropy loss and the AdamW optimizer
  • Evaluated at configurable intervals against a validation dataset
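
A minimal sketch of this architecture: the project builds its transformer blocks from scratch, while the version below substitutes PyTorch's built-in TransformerEncoderLayer with a causal mask for brevity. The class name and hyperparameters (embedding width, head count, layer count, context length) are illustrative assumptions, not the project's actual values; only the vocabulary size comes from the GPT-2 tokenizer.

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    """GPT-2-style decoder: token embeddings plus learned positional
    embeddings, a stack of causally masked transformer layers, and a
    linear head projecting back to vocabulary logits."""

    def __init__(self, vocab_size=50257, ctx_len=256, d_model=256,
                 n_head=4, n_layer=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(ctx_len, d_model)   # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model, n_head, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):                 # idx: (batch, time) token ids
        t = idx.shape[1]
        pos = torch.arange(t, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(idx.device)
        x = self.blocks(x, mask=mask)       # causal mask = decoder-only behavior
        return self.head(self.ln_f(x))      # logits: (batch, time, vocab)
```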

    📊 Tokenization & Data Pipeline

  • Tokenization via tiktoken GPT-2 encoder
  • Dataset split into context-length windows (see the sketch below)
  • A PyTorch DataLoader feeds token batches as input tensors into the training loop
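
A sketch of that pipeline, assuming the PDF text has already been extracted to a plain string; the class name, context length, stride, and file path are hypothetical, not the project's actual values.

```python
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

class WindowDataset(Dataset):
    """Encodes raw text with the GPT-2 BPE tokenizer, then slices the
    token stream into fixed-length windows; each target window is the
    input shifted one token right (next-token prediction)."""

    def __init__(self, text, ctx_len=256, stride=128):
        ids = tiktoken.get_encoding("gpt2").encode(text)
        self.x, self.y = [], []
        for i in range(0, len(ids) - ctx_len, stride):
            self.x.append(torch.tensor(ids[i:i + ctx_len]))
            self.y.append(torch.tensor(ids[i + 1:i + ctx_len + 1]))

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i], self.y[i]

# The DataLoader batches windows into (batch, ctx_len) input tensors.
loader = DataLoader(WindowDataset(open("data/dracula.txt").read()),
                    batch_size=8, shuffle=True, drop_last=True)
```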

    🧪 Training & Evaluation

  • Runs multiple epochs over the training data (loop sketched below)
  • Tracks training/validation loss
  • Saves model checkpoints for reproducibility
  • Optional text sample generation for qualitative evaluation
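
Put together, the loop might look roughly like the following; the learning rate, evaluation interval, and checkpoint naming are assumptions rather than the project's actual settings.

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, val_loader, epochs=3, eval_every=100,
          device="cuda" if torch.cuda.is_available() else "cpu"):
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    model.to(device)
    step = tokens_seen = 0
    for epoch in range(epochs):
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            # Cross-entropy over every position's next-token prediction.
            loss = F.cross_entropy(model(xb).flatten(0, 1), yb.flatten())
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
            tokens_seen += xb.numel()
            if step % eval_every == 0:          # periodic validation pass
                model.eval()
                with torch.no_grad():
                    val = sum(F.cross_entropy(model(x.to(device)).flatten(0, 1),
                                              y.to(device).flatten()).item()
                              for x, y in val_loader) / len(val_loader)
                model.train()
                print(f"step {step} | train {loss.item():.3f} "
                      f"| val {val:.3f} | tokens {tokens_seen}")
        torch.save(model.state_dict(), f"ckpt_epoch{epoch}.pt")
```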

Project Features

  • 📚 Trains a GPT-2 model on single or multiple PDF/text datasets
  • 🔠 Uses GPT-2-compatible tokenization via tiktoken
  • 🔁 Custom training loop with PyTorch for flexible experimentation
  • 📉 Tracks training/validation loss and tokens processed
  • 💾 Save/load model checkpoints for reproducibility and iterative development (see the sketch below)
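
For the checkpointing feature, a minimal save/load round trip might bundle model and optimizer state so training can resume exactly where it stopped; the helper names and file path below are hypothetical.

```python
import torch

def save_checkpoint(model, optimizer, path="checkpoint.pt"):
    # Saving both state dicts lets a later run resume optimization exactly.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, path)

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
```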

Project User Workflow

  • Insert PDF/text datasets into the data directory
  • Run training with python gpt_train.py
  • Monitor training loss and generate visual plots
  • Generate text samples from provided prompts (sampling sketched below)
  • Save and load model checkpoints for further experimentation
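
Text generation presumably follows the standard autoregressive pattern; the function name, temperature, and context length here are illustrative, not taken from gpt_train.py.

```python
import tiktoken
import torch

@torch.no_grad()
def sample(model, prompt, max_new_tokens=100, temperature=0.8, ctx_len=256):
    """Feed the running sequence back into the model and sample each
    next token from the temperature-scaled output distribution."""
    enc = tiktoken.get_encoding("gpt2")
    ids = torch.tensor([enc.encode(prompt)])
    model.eval()
    for _ in range(max_new_tokens):
        logits = model(ids[:, -ctx_len:])        # crop to the context window
        probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
        ids = torch.cat([ids, torch.multinomial(probs, 1)], dim=1)
    return enc.decode(ids[0].tolist())
```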

Technologies Used

  • 🐍 Python: core programming language
  • 🔦 PyTorch: deep learning framework
  • 🔑 tiktoken: GPT-2 tokenization library
  • 🧠 Custom GPT: transformer model built from scratch
  • 🧪 AdamW: optimizer for stable convergence
  • 📈 Matplotlib: visualization of training and validation loss (plot sketched below)
  • 📁 File I/O: data reading and checkpoint saving