Finisky Garden

NLP, Software Engineering, Product Design

Recently we found that traffic is not balanced across the shards of our MongoDB cluster. After investigation, the root cause is that data on each shard is not evenly distributed (chunk balancing != data balancing != traffic balancing). The data distribution looks like this:

Shard      Data Size
mongo-0    10.55 GB
mongo-1    25.76 GB
mongo-2    10.04 GB

Why is the data size of mongo-1 significantly larger than the others, while the chunk counts on the three shards are almost the same? To find out, we need to analyze the chunk size distribution across these shards.
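
To see where the bytes actually live, one way is to walk config.chunks and ask the cluster for each chunk's size via the dataSize command. Below is a minimal pymongo sketch under stated assumptions: the URI, the namespace X.A, and the shard key Uuid are placeholders, and the chunk metadata layout is the pre-5.0 one keyed by ns.

    from collections import defaultdict
    from pymongo import MongoClient

    # Connect through mongos; the URI and namespace are placeholders.
    client = MongoClient("mongodb://mongos:27017")

    per_shard = defaultdict(int)
    # Before MongoDB 5.0, config.chunks keys chunks by the "ns" field.
    for chunk in client["config"]["chunks"].find({"ns": "X.A"}):
        stats = client.admin.command(
            "dataSize",
            "X.A",
            keyPattern={"Uuid": 1},
            min=chunk["min"],
            max=chunk["max"],
        )
        per_shard[chunk["shard"]] += stats["size"]

    for shard, size in sorted(per_shard.items()):
        print(f"{shard}: {size / 1024**3:.2f} GB")

Note that dataSize scans each chunk's range, so this can take a while on a large collection.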

Read more »

After adding new shards to our production MongoDB cluster (v4.4.6-ent with 5 shards, 3 replicas for each shard), we found that the balancer was not working. sh.status() displayed many chunk migration errors:

...
  balancer:
        Currently enabled:  yes
        Currently running:  no
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours:
                7 : Failed with error 'aborted', from mongo-1 to mongo-3
                7208 : Failed with error 'aborted', from mongo-1 to mongo-4
  databases:
        {  "_id" : "X",  "primary" : "mongo-1",  "partitioned" : true,  "version" : {  "uuid" : UUID("xxx"),  "lastMod" : 1 } }
                X.A
                        shard key: { "Uuid" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                mongo-0       231
                                mongo-1       327
                                mongo-2       230
                                mongo-3       208
...
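
To dig into why the migrations abort, the config servers keep a per-migration record in config.changelog. A small pymongo sketch (the URI is a placeholder; verify the event names against your MongoDB version):

    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos:27017")  # placeholder URI

    # Failed migrations leave "moveChunk.error" entries in config.changelog.
    recent = (
        client["config"]["changelog"]
        .find({"what": {"$regex": "^moveChunk"}})
        .sort("time", -1)
        .limit(20)
    )
    for entry in recent:
        print(entry["time"], entry["what"], entry.get("details", {}))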
Read more »

When a pod is in an error state (CrashLoopBackOff), Kubernetes keeps restarting it. If you try to exec into the pod to check logs or debug, the following error message appears:

unable to upgrade connection: container not found ("")

This is because the old container has already been killed, so you cannot exec into it anymore. So how can we prevent the pod from endlessly restarting?
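
One common trick (a sketch of a general workaround, not necessarily what the full post does) is to temporarily override the container's entrypoint so it idles instead of crashing, then exec in at leisure. With the Python Kubernetes client, patching the owning Deployment might look like this (my-app, app, and the namespace are placeholders, and the image must contain a sleep binary):

    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()

    # Replace the crashing entrypoint with an idle command.
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": "app", "command": ["sleep", "infinity"]}
                    ]
                }
            }
        }
    }
    apps.patch_namespaced_deployment(name="my-app", namespace="default", body=patch)

Once the replacement pod is up, you can exec into it and run the original binary by hand to see why it crashes.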

Read more »

Multiple Backup Daemons are typically run when the storage requirements or the load generated by the deployment is too much for a single daemon.

Directly scaling the statefulset ops-manager-backup-daemon to multiple instances (e.g. 3) doesn't work: because the mongodb-enterprise-operator is watching the statefulset, the replica count will be scaled back down to 1 by the operator several minutes later.

So how can we scale up the backup daemons through the MongoDB Kubernetes operator?

Read more »

Apex Compile Error

The environment (CUDA 10.0):

$ conda install pytorch==1.1.0 torchvision==0.3.0 cudatoolkit=10.0 -c pytorch

The apex repo master HEAD:

commit 0c2c6eea6556b208d1a8711197efc94899e754e1 (HEAD -> master, origin/master, origin/HEAD)
Author: Nan Zheng <80790206+nanz-nv@users.noreply.github.com>
Date:   Sat Jul 17 08:53:59 2021 +0800
...

Read more »

What is a change stream? From the official documentation:

Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them. Because change streams use the aggregation framework, applications can also filter for specific changes or transform the notifications at will.

Here we use change streams to do real-time primary-secondary replication. I couldn't find an existing solution online, presumably because the straightforward approach is to rely on a replica set rather than replicate manually. But there are real business needs for it, such as cross-region data backup between heterogeneous clusters.

The only existing wheel I found is MongoShake. But MongoShake is not a commercial product after all, and when I pulled the code and ran it, it did not work properly in our environment:

  • TLS validation had some issues, which we worked around by modifying the source code
  • In the all sync mode, only the full sync via the oplog worked; the incremental sync via change streams kept throwing errors and could not run properly

Considering that fixing the existing wheel might take more effort than building a new one, I looked into how to do primary-secondary replication myself. The simplest principle is to read the oplog from the source cluster in real time and replay it on the target. That sounds simple, but implementing it is not easy, especially when the source is a sharded cluster: you cannot pull the oplog through mongos, but have to pull it from each shard manually, which is considerably harder.

The good news is that MongoDB has supported change streams since v3.6, and since we use MongoDB Ops Manager to manage the sharded cluster, snapshot restores are easy. So all the replication needs to do is replay the real-time changes made after the snapshot's point in time.

It looks like we can build this wheel ourselves.
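
As a sketch of the core replay loop, assuming pymongo, placeholder connection strings, the namespace X.A, and a hypothetical snapshot timestamp, it could look like this:

    from bson.timestamp import Timestamp
    from pymongo import MongoClient

    source = MongoClient("mongodb://source-mongos:27017")  # placeholder
    target = MongoClient("mongodb://target-mongos:27017")  # placeholder

    # Start at the snapshot's cluster time so every later change is replayed.
    snapshot_time = Timestamp(1625097600, 1)  # hypothetical snapshot point

    with source["X"]["A"].watch(
        start_at_operation_time=snapshot_time,
        full_document="updateLookup",
    ) as stream:
        for change in stream:
            op = change["operationType"]
            if op in ("insert", "replace", "update"):
                # fullDocument is the post-image; it can be None if the
                # document was deleted before the lookup ran.
                doc = change.get("fullDocument")
                if doc is not None:
                    target["X"]["A"].replace_one(
                        change["documentKey"], doc, upsert=True
                    )
            elif op == "delete":
                target["X"]["A"].delete_one(change["documentKey"])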

Read more »

From the official manual:

Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them. Because change streams use the aggregation framework, applications can also filter for specific changes or transform the notifications at will.

Here we leverage change streams to replicate data from one MongoDB cluster to another in real time.

There are existing tools such as MongoShake that do the same thing. However, MongoShake is a little complicated to use, and we encountered two issues:

  • We had to modify the source code to make TLS authentication work
  • Incremental sync could not be performed in the all sync_mode

Since our goal is real-time replication, we chose a more straightforward and controllable way: restore a snapshot via MongoDB Ops Manager, then use a change stream to apply real-time changes.
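
One practical detail worth sketching (the bookkeeping collection and names here are hypothetical) is persisting the change stream's resume token, so the replicator can restart without missing changes:

    from pymongo import MongoClient

    client = MongoClient("mongodb://source-mongos:27017")  # placeholder
    tokens = client["meta"]["resume_tokens"]  # hypothetical bookkeeping collection

    saved = tokens.find_one({"_id": "X.A"})
    with client["X"]["A"].watch(
        resume_after=saved["token"] if saved else None,
        full_document="updateLookup",
    ) as stream:
        for change in stream:
            # ... apply the change to the target cluster (as above) ...
            # Persist the resume token so a restart continues from here.
            tokens.replace_one(
                {"_id": "X.A"}, {"_id": "X.A", "token": change["_id"]}, upsert=True
            )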

Read more »

MongoDB change stream is a nice feature. It allows applications to access real-time data changes without the complexity and risk of tailing the oplog.

Recently, when we used a change stream to replicate data from one sharded cluster to another, it immediately made the cluster unstable (several nodes broke down and a primary election was triggered). Read/write latency then increased significantly.

Read more »

After upgrading Hexo to v8.5.0, I found that mathjax no longer rendered formulas correctly. Checking the documentation, the recommended hexo renderer is hexo-renderer-pandoc, while the one I was using, hexo-renderer-kramed, is no longer maintained and no longer recommended.

So I switched to hexo-renderer-pandoc. Formulas now render correctly, but there are new problems: embedded HTML is not recognized correctly, and blockquotes and lists are rendered without line breaks.

Read more »