Finetune MVC

An MVC for finetuning. This MR is only meant to get the training loop running; optimizations are still needed to train a good model.
- The dataloader is adapted from https://github.com/huggingface/blog/blob/main/codeparrot.md
- We can choose to use DeepSpeed or HF Accelerate, but neither is optimized much yet.
What this MR includes:

- A data iterator that is agnostic to input sources. Currently I'm experimenting with this table. (A minimal sketch of the iterator is included after this list.)
  - The iterator is adapted from codeparrot.
  - The iterator iterates through the whole dataset in chunks (e.g. each chunk contains 100 files).
    - The whole chunk is tokenized using the codegen tokenizer.
    - Each file in the tokenized chunk is separated by the `eos_token`.
    - The inputs to the model have a fixed context length, and they are randomly drawn from the whole tokenized chunk. In the figure below the inputs are drawn from non-overlapping windows, which I think is not necessary; we could use sliding windows or random windows. I opted for random windows in this MR.
    - The number of inputs drawn from each chunk is a configurable hyperparameter.
  - The iterator has 2 modes: `infinite` and `loop-once`. (Also sketched after this list.)
    - In `infinite` mode the iterator repeats from the beginning after going through the data once. The current implementation (`itertools.repeat`) requires all the data to be stored in memory in order to loop through the data repeatedly. The validation dataset uses this mode.
    - In `loop-once` mode the iterator exits after iterating through the data once. This has a negligible memory footprint. The training dataset uses this mode. If we want to train for `n` epochs, we just need to initialize the iterator `n` times in the trainer.
- A deepspeed trainer.
  - I'm experimenting with the codegen config.
- An HF accelerate trainer.
  - I'm experimenting with a simple config from the tutorial.
- Training objective: next-token prediction.
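The sketch below illustrates the chunked iterator described above, assuming an HF codegen tokenizer. The names and defaults (`random_window_iterator`, `chunk_size`, `samples_per_chunk`) are hypothetical, not the actual identifiers in this MR:

```python
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

def random_window_iterator(files, chunk_size=100, context_length=2048,
                           samples_per_chunk=16):
    """Yield fixed-length token windows drawn at random from tokenized chunks.

    Hypothetical sketch; names and defaults are illustrative only.
    """
    for start in range(0, len(files), chunk_size):
        chunk = files[start:start + chunk_size]
        # Concatenate the files in the chunk, separating them with eos_token.
        text = tokenizer.eos_token.join(chunk)
        token_ids = tokenizer(text)["input_ids"]
        if len(token_ids) <= context_length:
            continue  # chunk too small to draw a full-length window
        for _ in range(samples_per_chunk):
            # Random (possibly overlapping) windows of fixed context length.
            lo = random.randrange(len(token_ids) - context_length)
            yield token_ids[lo:lo + context_length]
```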
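The two modes could then be a thin wrapper over that generator. Note the MR's `infinite` mode is implemented with `itertools.repeat`; this sketch uses `itertools.cycle` for a similar effect, and it has the same drawback of keeping all samples in memory:

```python
import itertools

def make_iterator(files, mode="loop-once", **kwargs):
    """Hypothetical wrapper around random_window_iterator for the two modes."""
    if mode == "loop-once":
        # Stream through the data once; negligible memory footprint.
        return random_window_iterator(files, **kwargs)
    if mode == "infinite":
        # Materialize every sample so we can loop over them repeatedly;
        # this keeps the whole dataset's samples in memory.
        return itertools.cycle(list(random_window_iterator(files, **kwargs)))
    raise ValueError(f"Unknown mode: {mode}")
```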
Run training

- Deepspeed

  Use the deepspeed launcher:

  ```
  deepspeed --include localhost:0 --no_local_rank --master_port 29501 finetune.py .py .go --working-dir your/working/dir --deepspeed-config-file your/config/file --accelerator deepspeed --context-length 2048 2>out.log
  ```

  This command runs training on GPU 0 and uses port 29501 for communication during distributed training. Each job must use a different GPU id and a different port. The default port is 29500, so it's a good idea to increment port numbers starting from 29500.
- HF Accelerate (🛑 Outdated, I won't experiment with HF accelerate in this MR)

  Use the accelerate launcher:

  ```
  accelerate launch --config_file accelerate_config.yaml finetune.py --accelerator hf_accelerate
  ```
Optionally, we can add `2>out.log` at the end of either command to redirect all logs and errors to a file `out.log`. I found this useful when I want to keep a training job running in the background so I can safely log off my ssh session.
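For reference, the core of a DeepSpeed training loop follows the standard `deepspeed.initialize` / `backward` / `step` pattern. The sketch below is not the exact code in `finetune.py`: it assumes an HF causal LM (where passing `labels=input_ids` yields the shifted next-token-prediction loss), reuses the hypothetical `make_iterator` and `files` from the iterator sketch above, and passes an inline config dict instead of `--deepspeed-config-file`:

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

# Minimal illustrative config; the real run reads --deepspeed-config-file.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
}
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for window in make_iterator(files, mode="loop-once"):  # sketched above
    input_ids = torch.tensor([window], device=model_engine.device)
    # Next-token prediction: HF shifts the labels by one position internally.
    loss = model_engine(input_ids=input_ids, labels=input_ids).loss
    model_engine.backward(loss)
    model_engine.step()
```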
Results
Right now the training process doesn't decrease the validation loss, which is expected because the hyperparameters are terrible. I'll add a better set of hyperparameters in a follow-up MR.