nanoGPT repo reading notes
I personally haven't used PyTorch or written model code in a long time, and I found Andrej Karpathy's nanoGPT video a super helpful refresher (https://www.youtube.com/watch?v=kCc8FmEb1nY)
He also released a repo with slightly more involved examples. I took some notes below as I went through it, learning about implementation details and PyTorch features along the way. Hope it's useful to others as well.
Note: the repo was accessed around 5/18/2023.
https://github.com/karpathy/nanoGPT
Alternative implementation:
https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/trainer.py
train.py
Distributed training
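The script supports single- and multi-GPU runs via DDP, launched with torchrun. A minimal sketch of the pattern (the stand-in Linear model and variable names here are mine, not the repo's):

```python
import os
import torch
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK / LOCAL_RANK / WORLD_SIZE in the environment;
# init_process_group picks RANK / WORLD_SIZE up from there
init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(16, 16).to(f"cuda:{local_rank}")  # stand-in model
model = DDP(model, device_ids=[local_rank])

# ... training loop goes here ...
destroy_process_group()
```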
Wandb logging
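Logging is optional, gated behind a config flag. The API boils down to init + log (the project name below is made up):

```python
import wandb

wandb.init(project="nanogpt-notes", config={"lr": 6e-4})  # hypothetical project
for it in range(3):
    loss = 1.0 / (it + 1)  # stand-in for the real training loss
    wandb.log({"iter": it, "train/loss": loss})
```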
Simulating larger batch
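Gradient accumulation: run several micro-batches, scale each loss by 1/steps so the gradients average, and only then step the optimizer. A self-contained sketch (model and sizes are arbitrary):

```python
import torch

model = torch.nn.Linear(32, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
grad_accum_steps = 4  # effective batch = micro-batch size * grad_accum_steps

optimizer.zero_grad(set_to_none=True)
for _ in range(grad_accum_steps):
    x, y = torch.randn(8, 32), torch.randn(8, 1)    # one micro-batch
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / grad_accum_steps).backward()            # scale so grads average
optimizer.step()
```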
Automatic mixed precision (AMP)
https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html
The autocast context (ctx)
Creating the GradScaler
Using the scaler in the training step
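Putting the pieces together, roughly how the train step uses them (needs a CUDA device; float16 shown since that's the case where the scaler matters, bfloat16 runs with it disabled):

```python
import torch

model = torch.nn.Linear(32, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

ctx = torch.amp.autocast(device_type="cuda", dtype=torch.float16)
scaler = torch.cuda.amp.GradScaler(enabled=True)  # enabled=False for bfloat16

x, y = torch.randn(8, 32).cuda(), torch.randn(8, 1).cuda()
with ctx:  # forward runs in mixed precision
    loss = torch.nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()  # scale loss to avoid fp16 gradient underflow
scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
scaler.update()
optimizer.zero_grad(set_to_none=True)
```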
Learning rate schedule
It looks like this: linear warmup to the peak learning rate, then cosine decay down to min_lr, then flat.
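Paraphrasing get_lr from train.py (the constants are the defaults I remember from the config; treat them as illustrative):

```python
import math

warmup_iters, lr_decay_iters = 2000, 600000
learning_rate, min_lr = 6e-4, 6e-5

def get_lr(it):
    # 1) linear warmup
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) past the decay horizon, hold at min_lr
    if it > lr_decay_iters:
        return min_lr
    # 3) cosine decay from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```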
model.py
Batched MHA
scaled_dot_product_attention
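All heads are computed in one batched matmul by folding the head dimension into the batch, and PyTorch >= 2.0's fused kernel does the attention itself. A simplified take on model.py's CausalSelfAttention (dropout and the bias option dropped):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.n_head = n_head
        self.n_embd = n_embd
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # q, k, v in one projection
        self.c_proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # (B, T, C) -> (B, n_head, T, head_size): heads become a batch dim
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # fused (flash) attention kernel when available
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-merge heads
        return self.c_proj(y)

y = CausalSelfAttention()(torch.randn(2, 10, 768))  # -> (2, 10, 768)
```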
register_buffer
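The manual-attention fallback stores the causal mask with register_buffer: it moves with the module (.to(device)) and lands in the state_dict, but it is not a trainable parameter. A small illustration (the module name is made up):

```python
import torch
import torch.nn as nn

class MaskedModule(nn.Module):  # hypothetical module, just to show the mechanics
    def __init__(self, block_size=8):
        super().__init__()
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("bias", mask.view(1, 1, block_size, block_size))

m = MaskedModule()
print(m.bias.requires_grad)        # False: no gradients flow to it
print(list(m.named_parameters()))  # []: the mask is not a parameter
print(list(m.state_dict().keys())) # ['bias']: but it is saved/loaded
```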
Weight tying
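Weight tying is literally one assignment in model.py: the token embedding and the LM head share a single weight tensor. The standalone version of the idea:

```python
import torch.nn as nn

vocab_size, n_embd = 50304, 768
wte = nn.Embedding(vocab_size, n_embd)               # token embedding
lm_head = nn.Linear(n_embd, vocab_size, bias=False)  # output projection
wte.weight = lm_head.weight  # one tensor, two uses: ~38M params saved at GPT-2 size
```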
Weight init
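model.apply walks the module tree and re-initializes Linear/Embedding weights with normal(0, 0.02); paraphrasing _init_weights (the extra residual-projection scaling is noted in a comment):

```python
import torch.nn as nn

def _init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))  # stand-in model
model.apply(_init_weights)
# model.py additionally re-inits the residual projections (c_proj.weight)
# with std 0.02 / sqrt(2 * n_layer), following the GPT-2 paper
```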
Model output for train vs inference
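At inference time the forward pass only pushes the last position through the LM head, since that's all sampling needs. A runnable paraphrase of the branch (lm_head here stands in for the real head):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

lm_head = nn.Linear(16, 100, bias=False)  # stand-in for the real LM head
x = torch.randn(2, 5, 16)                 # (B, T, C) transformer output
targets = torch.randint(0, 100, (2, 5))   # set to None to hit the inference path

if targets is not None:
    logits = lm_head(x)  # (B, T, vocab): every position, needed for the loss
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           targets.view(-1), ignore_index=-1)
else:
    logits = lm_head(x[:, [-1], :])  # (B, 1, vocab): list index keeps the time dim
    loss = None
```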
Copying weights from GPT-2
Thanks to deliberate variable naming, the two implementations end up with essentially the same parameter names
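The one wrinkle: HF's GPT-2 uses Conv1D modules, so those weight matrices arrive transposed relative to nn.Linear. The core of from_pretrained, paraphrased (`model` below is assumed to be an already-constructed nanoGPT module with matching names):

```python
import torch
from transformers import GPT2LMHeadModel

sd_hf = GPT2LMHeadModel.from_pretrained("gpt2").state_dict()
# these are Conv1D in HF, so their weights need a transpose for nn.Linear
transposed = ("attn.c_attn.weight", "attn.c_proj.weight",
              "mlp.c_fc.weight", "mlp.c_proj.weight")
sd = model.state_dict()  # `model`: the nanoGPT GPT module (assumed to exist)
for k in sd_hf:
    if k.endswith(".attn.masked_bias") or k.endswith(".attn.bias"):
        continue  # HF-only mask buffers, skipped in the real function too
    with torch.no_grad():
        src = sd_hf[k].t() if any(k.endswith(w) for w in transposed) else sd_hf[k]
        sd[k].copy_(src)
```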
Optimizer parameter groups and fused AdamW
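The split is by tensor dimensionality: 2D+ tensors (matmul weights, embeddings) get weight decay, 1D tensors (biases, layernorm params) don't; the fused AdamW kernel is opted into only when the installed PyTorch supports it. Sketch:

```python
import inspect
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(32, 32).to(device)  # stand-in model

params = [p for p in model.parameters() if p.requires_grad]
groups = [
    {"params": [p for p in params if p.dim() >= 2], "weight_decay": 0.1},
    {"params": [p for p in params if p.dim() < 2], "weight_decay": 0.0},
]
# the fused kernel only exists in newer PyTorch and only for CUDA tensors
use_fused = ("fused" in inspect.signature(torch.optim.AdamW).parameters
             and device == "cuda")
extra = {"fused": True} if use_fused else {}
optimizer = torch.optim.AdamW(groups, lr=6e-4, betas=(0.9, 0.95), **extra)
```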
Model FLOPs utilization (MFU)
Computed by estimate_mfu in model.py and logged from the training loop in train.py
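The accounting follows the PaLM paper's appendix: roughly 6N FLOPs per token for the parameter matmuls plus an attention term, divided by the hardware's peak (312 TFLOPS for A100 bfloat16). Paraphrased:

```python
def estimate_mfu(n_params, n_layer, n_head, head_dim, seq_len,
                 fwdbwd_per_iter, dt):
    # FLOPs per token: 6N for parameter matmuls + attention term
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * seq_len
    flops_per_iter = flops_per_token * seq_len * fwdbwd_per_iter
    flops_achieved = flops_per_iter / dt  # per second
    return flops_achieved / 312e12        # fraction of A100 bfloat16 peak

# e.g. GPT-2 124M config, 40 fwd/bwd passes in a 1-second iteration:
print(estimate_mfu(124e6, n_layer=12, n_head=12, head_dim=64,
                   seq_len=1024, fwdbwd_per_iter=40, dt=1.0))
```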
Generate
A few things to note here:
- The sampled token is concatenated onto the running sequence, and the input (idx_cond) is cropped once it grows beyond block_size
- Variable-length input is fine, because forward derives t from idx.size()
This is quite a bit different from the slightly more complicated version in LLaMA, which relies on pad_id to identify what was generated, and on bos/eos tokens to decide when decoding stops.
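Paraphrasing the generate loop (the model is assumed to return a (logits, loss) tuple, as nanoGPT's forward does):

```python
import torch
from torch.nn import functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size, temperature=1.0, top_k=None):
    for _ in range(max_new_tokens):
        # crop the running sequence to at most the last block_size tokens
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature   # only the last position matters
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float("inf")
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)   # append and keep going
    return idx
```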
openwebtext/prepare.py
dataset.map
dataset.shard
np.memmap
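These three pieces fit together like so: map tokenizes in parallel, shard streams the tokenized dataset out in fixed-size chunks, and memmap lets the result be written (and later read during training) without holding it all in RAM. A condensed sketch of prepare.py (num_proc and total_batches values are arbitrary):

```python
import numpy as np
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")

def process(example):
    ids = enc.encode_ordinary(example["text"])
    ids.append(enc.eot_token)  # delimit documents with <|endoftext|>
    return {"ids": ids, "len": len(ids)}

dset = load_dataset("openwebtext", split="train")
tokenized = dset.map(process, remove_columns=["text"], num_proc=8)

# pre-size one flat uint16 file (GPT-2's vocab fits in uint16), then fill it
arr_len = int(np.sum(tokenized["len"], dtype=np.uint64))
arr = np.memmap("train.bin", dtype=np.uint16, mode="w+", shape=(arr_len,))
total_batches, idx = 1024, 0
for i in range(total_batches):
    batch = tokenized.shard(num_shards=total_batches, index=i,
                            contiguous=True).with_format("numpy")
    ids = np.concatenate(batch["ids"])
    arr[idx:idx + len(ids)] = ids
    idx += len(ids)
arr.flush()
```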