LaMDA: Language Models for Dialog Applications

# LaMDA Pre-training

LaMDA is another case of brute-force scale working wonders: both its model size and its data size are tens of times larger than those of previous SOTA models.

The amount of data used to pre-train LaMDA is very large:

The pre-training dataset consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T words

LaMDA has 137B non-embedding parameters, about 50 times as many as Meena. The architecture is a decoder-only Transformer, an autoregressive model similar to GPT.

The Transformer has 64 layers, d_model = 8192, d_ff = 65536, h = 128, d_k = d_v = 128
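The reported 137B non-embedding parameter count can be recovered from these hyperparameters. A minimal sketch, assuming a gated feed-forward block (three weight matrices, as in gated-GELU) and ignoring small terms such as layer norms and relative-position biases:

```python
# Hedged sketch: estimate LaMDA's non-embedding parameter count from the
# hyperparameters above. The gated (three-matrix) FFN is an assumption
# that happens to match the reported 137B total.
n_layers, d_model, d_ff, h, d_k = 64, 8192, 65536, 128, 128

attn = 4 * d_model * (h * d_k)   # Q, K, V projections plus the output projection
ffn = 3 * d_model * d_ff         # gated FFN: two input matrices, one output matrix
total = n_layers * (attn + ffn)

print(f"{total / 1e9:.1f}B non-embedding parameters")   # ~137.4B
```

Note that h · d_k = 16384 = 2 · d_model here, so the attention projections are rectangular rather than square.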

We pre-trained LaMDA on 1024 TPU-v3 chips for a total of about 57.7 days, with 256K tokens per batch.

# Metrics

## Quality

The quality score is the average of three metrics: Sensibleness, Specificity, and Interestingness (SSI).

• Sensibleness: measures whether a model’s responses make sense in context and do not contradict anything that was said earlier.
• Specificity: measures whether a response is specific to the given context. For example, if a user says "I love Eurovision" and the model responds "Me too," the response scores 0 on specificity, since it could be used in many different contexts.
• Interestingness: measures whether a response is likely to “catch someone’s attention” or “arouse their curiosity”, or whether it is unexpected, witty, or insightful.
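Turning these crowdworker labels into a single quality number is a plain average. A minimal sketch, assuming binary per-response labels (in the paper, specificity and interestingness are additionally only rated for responses already judged sensible, which this sketch does not reproduce):

```python
# Hedged sketch: compute SSI from binary crowdworker labels.
# The (sensible, specific, interesting) tuple format is an assumption.
def ssi(labels):
    """labels: list of (sensible, specific, interesting) 0/1 tuples."""
    n = len(labels)
    sensibleness = sum(s for s, _, _ in labels) / n
    specificity = sum(p for _, p, _ in labels) / n
    interestingness = sum(i for _, _, i in labels) / n
    return (sensibleness + specificity + interestingness) / 3

# A "Me too" reply is sensible but neither specific nor interesting.
print(ssi([(1, 0, 0), (1, 1, 1)]))
```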

## Safety

Safety: avoid unintended results that create risks of harm, and avoid creating or reinforcing unfair bias. Examples of content the safety objectives rule out include:

• Violent or gory content that’s primarily intended to be shocking, sensational, or gratuitous.
• Financial advice regarding investments, taxes, retirement planning, loans, banking, or insurance.
• Content that may incite hatred against an individual or group.
• Content that contradicts well-established expert consensus, including scientific or medical consensus and evidence-based best practices.

## Groundedness

Groundedness: the percentage of responses containing claims about the external world that can be supported by authoritative external sources

# LaMDA Fine-tuning and Evaluation Data

To improve quality (SSI), we collect 6400 dialogs with 121K turns by asking crowdworkers to interact with a LaMDA instance about any topic. These dialogs are required to last 14 to 30 turns.

Similarly, for safety, we collect 8K dialogs with 48K turns by asking crowdworkers to interact with a LaMDA instance about any topic. These dialogs are required to last 5 to 10 turns.

Similarly, for groundedness, we collect 4K dialogs with 40K turns by asking crowdworkers to interact with the model.

# LaMDA Fine-tuning

## Discriminative and generative fine-tuning

LaMDA uses different fine-tuning tasks to improve quality and safety:

• Generative fine-tuning template: <context> <sentinel> <response>
• Discriminative fine-tuning template: <context> <sentinel> <response> <attribute-name> <rating>
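Both tasks can be served by the same decoder-only model by flattening everything into one training string. A minimal sketch of the two templates above; the literal sentinel "RESPONSE" and the example dialog are illustrative assumptions:

```python
# Hedged sketch of the two fine-tuning string formats. The sentinel and
# attribute token spellings are assumptions, not the paper's exact tokens.
def generative_example(context: str, response: str) -> str:
    # <context> <sentinel> <response>
    return f"{context} RESPONSE {response}"

def discriminative_example(context: str, response: str,
                           attribute: str, rating: int) -> str:
    # <context> <sentinel> <response> <attribute-name> <rating>
    return f"{context} RESPONSE {response} {attribute} {rating}"

print(generative_example("What's up?", "not much."))
print(discriminative_example("What's up?", "not much.", "SENSIBLE", 1))
```

Because the discriminative target is just a continuation of the generative string, one model can both generate a response and then score it by continuing with the attribute name.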

LaMDA SSI and safety discriminators are also used to score and filter 2.5M turns of dialog data sampled from the pre-training dataset, resulting in 800K turns of safe, sensible, specific and interesting dialogs.
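That filtering step can be sketched as thresholding each discriminator's score per turn. The `p` scoring interface, the attribute set, and the 0.5 thresholds below are illustrative assumptions, not values from the paper:

```python
# Hedged sketch of discriminator-based filtering of pre-training dialog
# turns. `p(context, response, attr)` stands in for a fine-tuned
# discriminator probability; attributes and thresholds are assumptions.
ATTRIBUTES = ("SAFE", "SENSIBLE", "SPECIFIC", "INTERESTING")

def keep_turn(p, context, response):
    # Keep a turn only if every attribute is judged more likely than not.
    return all(p(context, response, a) > 0.5 for a in ATTRIBUTES)

def filter_turns(p, turns):
    # turns: iterable of (context, response) pairs
    return [(c, r) for c, r in turns if keep_turn(p, c, r)]
```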

## Fine-tuning to Learn to Use External Information

• The toolset (TS): an information retrieval system, a calculator, and a translator.

• Information retrieval: "How old is Rafael Nadal?" -> ["Rafael Nadal / Age / 35"]
• Calculator: "135+7721" -> ["7856"]
• Translator: "hello in French" -> ["Bonjour"]
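The toolset interface can be sketched as each tool mapping a string to a list of strings, with unparseable inputs yielding an empty list. The stub translator table, the stub retrieval index, and the arithmetic-detection heuristic below are all illustrative assumptions:

```python
import re

# Hedged sketch of the toolset (TS) interface matching the examples above.
# Real backends are replaced by stubs; only the string-in, list-of-strings-out
# shape is taken from the paper's description.
_TRANSLATIONS = {"hello in French": ["Bonjour"]}               # stub translator
_INDEX = {"How old is Rafael Nadal?": ["Rafael Nadal / Age / 35"]}  # stub retrieval

def calculator(q):
    # Only evaluate strings that look like plain arithmetic.
    return [str(int(eval(q)))] if re.fullmatch(r"[\d\s+\-*/().]+", q) else []

def translator(q):
    return _TRANSLATIONS.get(q, [])

def retrieval(q):
    return _INDEX.get(q, [])

def toolset(q):
    # Try every tool and concatenate their (possibly empty) result lists.
    return calculator(q) + translator(q) + retrieval(q)

print(toolset("135+7721"))   # -> ['7856']
```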

Fine-tuning is done with two different tasks. The first, called "Base", is plain text generation, i.e., answering directly. The second, called "Research", answers with the help of the toolset (TS) described above. At inference time the model's output takes one of two forms: if it starts with "User", the text that follows is the final response; if it starts with "TS", the text that follows is a query for the toolset, and the toolset's output is fed back to the model as input for the next round, further refining the response. This loop runs for at most 4 rounds. The example below illustrates the process well: asked what year the Eiffel Tower was built, the model goes through four rounds before producing the final response:
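The iterative Research loop can be sketched as follows. `model.generate` and `toolset.query` are assumed interfaces standing in for the fine-tuned model and the external toolset, and the context-accumulation details are assumptions:

```python
# Hedged sketch of the LaMDA-Research inference loop: route "TS, ..."
# outputs to the toolset, return "User, ..." outputs to the user.
MAX_ROUNDS = 4

def respond(model, toolset, context):
    text = ""
    for _ in range(MAX_ROUNDS):
        output = model.generate(context)      # e.g. "TS, Eiffel Tower construction date"
        role, text = output.split(", ", 1)
        if role == "User":                    # final response for the user
            return text
        results = toolset.query(text)         # role == "TS": call the toolset
        # Feed the query and the toolset's answer back in for the next round.
        context = context + "\n" + output + "\nTS, " + "; ".join(results)
    return text                               # stop after the round limit
```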

# Results on Foundation Metrics

In summary, scaling up alone improves the pre-trained model quality and groundedness metrics, but it does not improve safety much. Fine-tuning with crowdworker-annotated data, however, turns out to be an effective method for improving all metrics. In some cases, fine-tuning these same models allows us to obtain results equivalent to having a significantly larger model.

# Summary

Perhaps the most noteworthy aspect of our study is that significant progress can be made towards better quality and safer dialog models with modest amounts of human-annotated fine-tuning data (less than 0.001% of pre-training data).