Add script to resize the tokenizer and model with the newly added tokens
After converting the model weights with zero_to_fp32.py, we still need to load the weights into a model with the correct embedding size. The embedding shape changes because we add tokens to the tokenizer.
This script adds the new tokens to the tokenizer of a given model_id (e.g. Salesforce/codegen-16B-multi), resizes the base model's embeddings to the new tokenizer length, and saves both the tokenizer config and the model config.
Example:
python3 resize_tokenizer_and_model.py \
--out_folder=~/code-suggestions/model_checkpoints/script_new_config/ \
--model_id=Salesforce/codegen-16B-multi
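
For reference, a minimal sketch of what the script does, assuming the Hugging Face transformers API; the NEW_TOKENS list below is a placeholder, since the actual tokens added during training are project-specific:

import argparse

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with the tokens actually added for training.
NEW_TOKENS = ["<fim_prefix>", "<fim_middle>", "<fim_suffix>"]


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", required=True,
                        help="Base model, e.g. Salesforce/codegen-16B-multi")
    parser.add_argument("--out_folder", required=True,
                        help="Where to save the resized model and tokenizer")
    args = parser.parse_args()

    # Load the base tokenizer and register the new tokens.
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
    tokenizer.add_tokens(NEW_TOKENS)

    # Load the base model and grow its embedding matrix to the new vocabulary size.
    model = AutoModelForCausalLM.from_pretrained(args.model_id)
    model.resize_token_embeddings(len(tokenizer))

    # Save both so the weights converted by zero_to_fp32.py can be loaded
    # into a model config with matching embedding dimensions.
    tokenizer.save_pretrained(args.out_folder)
    model.save_pretrained(args.out_folder)


if __name__ == "__main__":
    main()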