Fine-tune GPT with Line-by-Line Dataset
The latest language model training/fine-tuning tutorial from Hugging Face Transformers can be found here: Transformers Language Model Training
There are three scripts: run_clm.py, run_mlm.py, and run_plm.py. Since GPT is a causal language model, we should use run_clm.py. However, run_clm.py doesn't support line-by-line datasets: by default, it concatenates the training examples and splits the result into chunks of block_size tokens.
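For reference, the default grouping step looks roughly like the sketch below, simplified from the group_texts function in run_clm.py. block_size is hard-coded here for illustration; in the script it comes from the --block_size argument or the model's maximum sequence length.

```python
# Simplified sketch of the default grouping behavior in run_clm.py
# (the group_texts function). block_size is hard-coded here; the script
# takes it from --block_size or the model's maximum sequence length.
block_size = 1024

def group_texts(examples):
    # Concatenate the token lists of every example in the batch.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail that does not fill a complete block.
    total_length = (total_length // block_size) * block_size
    # Cut the concatenated sequence into chunks of block_size tokens.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels are the input ids themselves.
    result["labels"] = result["input_ids"].copy()
    return result
```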
However, grouping text doesn't make sense for datasets whose lines are unrelated, such as a QA dataset:

Q1 [SEP] A1
Q2 [SEP] A2
...

Concatenating them into Q1 [SEP] A1 Q2 [SEP] A2 ... might mislead the language model.
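One workaround is to tokenize each line as its own example and skip the grouping step entirely. The sketch below is a minimal illustration, not part of run_clm.py itself; the file name train.txt, the gpt2 checkpoint, and the 1024-token limit are placeholder assumptions.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Placeholder names for illustration: any causal-LM checkpoint and any
# text file with one training example per line would work here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize_line_by_line(examples):
    # Tokenize each line on its own; nothing is concatenated across lines.
    return tokenizer(examples["text"], truncation=True, max_length=1024)

tokenized = dataset["train"].map(
    tokenize_line_by_line,
    batched=True,
    remove_columns=["text"],
)

# With mlm=False the collator pads each batch and builds causal-LM labels
# from the input ids, so every Q [SEP] A pair stays a separate example.
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
```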