# Torch-Linguist (Hugging Face Edition)

A modern, modular **causal language model (CLM) pretraining** baseline built on the **Hugging Face ecosystem**.

This repository is a modernization of an older **torchtext-based** language modeling project into current best practices using:

* **πŸ€— Transformers** (models, Trainer, checkpoints)
* **πŸ€— Datasets** (HF datasets + local text + streaming)
* **πŸ€— Tokenizers** (optional BPE tokenizer training)
* Optional: **Weights & Biases** logging and **Hugging Face Hub** publishing

The focus is **causal LLMs** (next-token prediction). The code is designed to be simple now and easy to extend later to:

* Supervised fine-tuning (SFT)
* Instruction tuning
* Preference tuning (DPO/ORPO, etc.)

---

## Features

### Training

* βœ… **Causal LM pretraining** (labels = input_ids)
* βœ… Standard **tokenization β†’ packing (block_size) β†’ Trainer** pipeline
* βœ… **Checkpointing** and **resume from latest checkpoint**
* βœ… Configurable mixed precision (**fp16/bf16**)
* βœ… Optional **W&B logging**
* βœ… Optional **push_to_hub** (upload checkpoints/model)

### Data

* βœ… Train on **HF datasets** (e.g., WikiText)
* βœ… Train on **local text files** (`load_dataset("text", data_files=...)`)
* βœ… **Streaming mode** for large datasets
* βœ… Optional `max_train_samples` / `max_eval_samples` for fast tutorials

### Tokenizer

* βœ… Use a pretrained tokenizer (e.g., GPT-2), **or**
* βœ… Train a **BPE tokenizer** from your local corpus and save it as an HF tokenizer folder

---

## Project Layout

```txt
Torch-Linguist/
β”œβ”€ README.md
β”œβ”€ requirements.txt
β”œβ”€ configs/
β”‚  β”œβ”€ pretrain_gpt2_small.yaml
β”‚  └─ tokenizer_bpe.yaml
β”œβ”€ scripts/
β”‚  β”œβ”€ pretrain.py
β”‚  β”œβ”€ train_tokenizer.py
β”‚  └─ generate.py
└─ src/torch_linguist/
   β”œβ”€ data/      # dataset loading + packing
   β”œβ”€ modeling/  # model building (from scratch or pretrained)
   β”œβ”€ training/  # Trainer wiring + perplexity helpers
   └─ utils/     # seed, logging, hub helpers
```

---

## How the Data Flows (CLM Pretraining)

The training pipeline follows the standard causal LM pretraining recipe:

1. **Load dataset**
   * HF dataset: `load_dataset(dataset_name, dataset_config)`
   * Local text: `load_dataset("text", data_files=...)`
2. **Tokenize**
   * Convert raw text into token IDs: `input_ids`
3. **Pack into fixed blocks**
   * Concatenate token streams
   * Split into fixed-length sequences: `block_size` (e.g., 512)
   * Create labels: `labels = input_ids`
4. **Train**
   * Use `Trainer` + `DataCollatorForLanguageModeling(mlm=False)`
   * Save checkpoints and optionally resume / push to hub

---

## Installation

```bash
pip install -r requirements.txt
```

Optional (for Weights & Biases logging):

```bash
pip install wandb
```

---

## Quickstart: Pretraining on WikiText-2

This is the recommended β€œtutorial default” dataset: small, fast, and good for learning.
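Before running the quickstart, it can help to see what the four data-flow steps above look like as plain πŸ€— Datasets / Transformers code. The sketch below is illustrative only: it does not use this repo's modules, and the model choice, hyperparameters, and output directory are assumptions made for the example.

```python
# Minimal CLM pretraining sketch: load -> tokenize -> pack -> train.
# Illustrative only; the repo's actual implementation lives in src/torch_linguist/.
from itertools import chain

from datasets import load_dataset
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

block_size = 512
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token; reuse EOS

# 1) Load dataset
raw = load_dataset("wikitext", "wikitext-2-raw-v1")

# 2) Tokenize raw text into input_ids
def tokenize(batch):
    return tokenizer(batch["text"])

# 3) Pack: concatenate token streams, cut into block_size chunks, labels = input_ids
def pack(examples):
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total = (len(concatenated["input_ids"]) // block_size) * block_size
    blocks = {
        k: [v[i : i + block_size] for i in range(0, total, block_size)]
        for k, v in concatenated.items()
    }
    blocks["labels"] = blocks["input_ids"].copy()
    return blocks

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
lm_data = tokenized.map(pack, batched=True)

# 4) Train a small GPT-2-shaped model from scratch with the standard Trainer wiring
model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained("gpt2"))
args = TrainingArguments(
    output_dir="ckpts/sketch",  # hypothetical output directory
    per_device_train_batch_size=8,
    num_train_epochs=1,
    report_to="none",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=lm_data["train"],
    eval_dataset=lm_data["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The config-driven entry point below runs the same recipe, with these choices exposed in `configs/pretrain_gpt2_small.yaml`: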
```bash
PYTHONPATH=src python scripts/pretrain.py --config configs/pretrain_gpt2_small.yaml
```

Checkpoints will be written to:

```
ckpts/pretrain_gpt2_small/
```

---

## Generate Text from a Checkpoint

```bash
PYTHONPATH=src python scripts/generate.py \
  --model_dir ckpts/pretrain_gpt2_small/checkpoint-500 \
  --prompt "Hello my name is"
```

---

## Tokenizer Options

### Option A (Simplest): Use a Pretrained Tokenizer

In `configs/pretrain_gpt2_small.yaml`:

```yaml
tokenizer:
  pretrained_name: gpt2
  pad_to_eos_if_missing: true
```

### Option B: Train Your Own BPE Tokenizer

1. Put your corpus files in `data/raw/` (or adjust the paths in the config).
2. Train the tokenizer:

   ```bash
   PYTHONPATH=src python scripts/train_tokenizer.py --config configs/tokenizer_bpe.yaml
   ```

3. Update `configs/pretrain_gpt2_small.yaml` to use the tokenizer folder:

   ```yaml
   tokenizer:
     tokenizer_dir: tokenizers/tokenizer_bpe
     pad_to_eos_if_missing: true
   ```

Then train:

```bash
PYTHONPATH=src python scripts/pretrain.py --config configs/pretrain_gpt2_small.yaml
```

---

## Training on Local Text Files

Set in `configs/pretrain_gpt2_small.yaml`:

```yaml
data:
  source: local_text
  local_files:
    train: data/raw/wiki.train.tokens
    validation: data/raw/wiki.valid.tokens
    test: data/raw/wiki.test.tokens
```

Then run:

```bash
PYTHONPATH=src python scripts/pretrain.py --config configs/pretrain_gpt2_small.yaml
```

---

## Streaming Mode (Large Datasets)

For very large datasets (web-scale text), enable streaming:

```yaml
data:
  streaming: true
```

This switches to a streaming-friendly block-packing approach.

---

## Resume Training

In `configs/pretrain_gpt2_small.yaml`:

```yaml
train:
  resume_from_checkpoint: auto
```

* `auto` resumes from the most recent `checkpoint-*` in `output_dir`.
* You can also set an explicit checkpoint path:

  ```yaml
  resume_from_checkpoint: ckpts/pretrain_gpt2_small/checkpoint-1000
  ```

---

## Logging with Weights & Biases (Optional)

1. Install and log in:

   ```bash
   pip install wandb
   wandb login
   ```

2. Enable it in the config:

   ```yaml
   train:
     report_to: wandb
     run_name: "wikitext2-gpt2-small"
   ```

---

## Push to Hugging Face Hub (Optional)

1. Log in once:

   ```bash
   huggingface-cli login
   ```

2. Enable it in the config:

   ```yaml
   hub:
     push_to_hub: true
     repo_id: "YOUR_USERNAME/torch-linguist-pretrain"
     private: true
   ```

During training, checkpoints can be uploaded depending on `hub_strategy`, and at the end the script calls `trainer.push_to_hub()`.

---

## Configuration Guide (What to Tune First)

### `data.block_size`

* 256/512 for quick experiments
* 1024+ for stronger modeling (more VRAM)

### Batch Size vs. Gradient Accumulation

If VRAM is limited:

* keep `per_device_train_batch_size` small
* increase `gradient_accumulation_steps`

### Model Size

In the `model` config:

* `n_layer`, `n_head`, `n_embd` control model capacity
* start small to validate the pipeline, then scale up

---

## Roadmap: Fine-tuning & Instruction Tuning

This repo is structured so that adding later stages is straightforward:

### 1) Supervised Fine-Tuning (SFT)

Add:

* `scripts/finetune_sft.py`
* `src/torch_linguist/data/sft.py` (format: prompt β†’ response)
* a data collator that masks prompt tokens if needed

### 2) Instruction / Preference Tuning (Optional)

Add:

* `scripts/train_dpo.py` (or ORPO)
* dataset formatting for (prompt, chosen, rejected)
* a trainer from `trl` (HF TRL library)

---

## License

Add your license here (MIT / Apache-2.0 / etc.).

---

## Acknowledgements

* Hugging Face Transformers, Datasets, Tokenizers, Accelerate
* WikiText dataset authors (if using WikiText)
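---

## Appendix: Generation Without the Helper Script (Sketch)

The generation step shown earlier can also be done directly with πŸ€— Transformers. The snippet below is a rough, illustrative equivalent of `scripts/generate.py`, not its actual implementation; it assumes the tokenizer was saved alongside the model in the checkpoint directory, and the sampling settings are arbitrary.

```python
# Load a saved checkpoint and sample a continuation for a prompt.
# Illustrative sketch; scripts/generate.py may use different flags and defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "ckpts/pretrain_gpt2_small/checkpoint-500"  # any checkpoint-* directory
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)
model.eval()

inputs = tokenizer("Hello my name is", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        top_p=0.9,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,  # silences the missing-pad-token warning
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```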