Finetune MVC

Hongtao Yang requested to merge finetune_mvc into main

An MVC for finetuning. This MR is just meant to get the training loop running; optimizations are still needed to train a good model.

What this MR includes:

  • A data iterator that is agnostic to input sources. Currently I'm experimenting with this table

    • The iterator is adapted from codeparrot.

    • The iterator will iterate through the whole dataset in chunks (e.g. each chunk contains 100 files)

      • The whole chunk is tokenized using the CodeGen tokenizer.
      • Each file in the tokenized chunk is separated by the eos_token.
      • The inputs to the model have a fixed context length, and they are randomly drawn from the whole tokenized chunk. In the figure below the inputs are drawn from non-overlapping windows, which I don't think is necessary; we could use sliding windows or random windows. I opted for random windows in this MR (see the first sketch after this list).
      • The number of inputs drawn from each chunk is a configurable hyperparameter.

      [Figure: chunk_tokens]

    • The iterator has two modes: infinite and loop-once (see the second sketch after this list).

      • In infinite mode the iterator repeats from the beginning after going through the data once. The current implementation (itertools.repeat) requires all the data to be stored in memory in order to loop through it repeatedly. The validation dataset uses this mode.
      • In loop-once mode the iterator exits after iterating through the data once, which has a negligible memory footprint. The training dataset uses this mode. If we want to train for n epochs, we just need to initialize the iterator n times in the trainer.
  • A DeepSpeed trainer.

  • An HF Accelerate trainer.

    • I'm experimenting with a simple config from the tutorial.
  • Training objective: next-token-prediction.
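
Below is a minimal sketch of the chunk-tokenize-and-sample step described above. It assumes the CodeGen tokenizer from HF transformers; the checkpoint name, function name, and parameters (sample_inputs_from_chunk, samples_per_chunk, etc.) are illustrative, not the actual identifiers in this MR.

    import random
    from transformers import AutoTokenizer

    # Assumption: any CodeGen checkpoint works for illustration; the one used in the MR may differ.
    tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

    def sample_inputs_from_chunk(files, context_length=2048, samples_per_chunk=16):
        """Tokenize a chunk of files and draw random fixed-length windows from it (sketch)."""
        # Tokenize every file in the chunk and concatenate the results,
        # separating files with the eos token.
        tokens = []
        for code in files:
            tokens.extend(tokenizer(code)["input_ids"])
            tokens.append(tokenizer.eos_token_id)
        # Draw randomly positioned windows of fixed context length from the tokenized chunk.
        windows = []
        for _ in range(samples_per_chunk):
            start = random.randint(0, max(0, len(tokens) - context_length))
            windows.append(tokens[start:start + context_length])
        return windows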

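The two iterator modes can be sketched with a plain generator (illustrative only; the actual implementation in this MR uses itertools.repeat and keeps the data in memory for the infinite case):

    def data_iterator(chunks, infinite=False):
        """Yield model inputs chunk by chunk; loop forever in infinite mode."""
        while True:
            for chunk in chunks:
                for window in sample_inputs_from_chunk(chunk):
                    yield window
            if not infinite:
                # loop-once mode (training dataset): stop after a single pass.
                return
            # infinite mode (validation dataset): start over from the beginning.
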
Run training

  • Deepspeed

    Use the deepspeed launcher:

    deepspeed --include localhost:0 --no_local_rank --master_port 29501 finetune.py .py .go --working-dir your/working/dir --deepspeed-config-file your/config/file --accelerator deepspeed --context-length 2048 2>out.log

    The command will run training on GPU 0 and use port 29501 for communication during distributed training. Each job must use a different GPU id and a different port. The default port is 29500, so it's a good idea to increment the port number from 29500.

  • HF Accelerate (🛑 Outdated, I won't experiment with HF accelerate in this MR)

    Use the accelerate launcher:

    accelerate launch --config accelerate_config.yaml finetune.py --accelerator hf_accelerate

Optionally, we can add 2>out.log at the end of each command to redirect all logs and errors to a file out.log. I found this useful for keeping training jobs running in the background so I can safely log off my SSH session.

Results

Right now the training process won't decrease the validation loss, which is expected because the hyperparameters are terrible. I'll add a better set of hyperparameters in a follow-up MR.


ref: ai-assist#22 (closed)

