Fine-tune GPT with a Line-by-Line Dataset
The latest language model training/fine-tuning tutorial from Hugging Face Transformers can be found here: Transformers Language Model Training
There are three scripts: run_clm.py, run_mlm.py, and run_plm.py. For GPT, which is a causal language model, we should use run_clm.py. However, run_clm.py doesn't support a line-by-line dataset. Its default behavior is to concatenate all the training examples and split the result into chunks of block_size tokens, so the boundaries between individual lines are not preserved.
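To see why this matters, here is a minimal sketch of the grouping step (a simplified version of the `group_texts` function used in run_clm.py; the exact implementation in the script may differ). It concatenates every tokenized example into one long sequence and then slices that sequence into fixed-size blocks, discarding the final partial block:

```python
def group_texts(examples, block_size):
    # Concatenate all token lists for each column (e.g. "input_ids").
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated[next(iter(examples))])
    # Drop the tail that doesn't fill a complete block.
    total_length = (total_length // block_size) * block_size
    # Split the concatenated stream into chunks of block_size tokens.
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Three "lines" of different lengths end up merged across block boundaries:
batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11, 12]]}
print(group_texts(batch, block_size=4)["input_ids"])
# [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
```

Note how the first block mixes tokens from two different lines, which is exactly the behavior you need to change if each line should be treated as an independent training example.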
