Finisky Garden

NLP, Software Engineering, Product Design


Static site generators such as Hexo/Hugo/Jekyll have been very popular in recent years. They are fast, easy to write with, and easy to deploy and host. However, there is no free lunch: it is non-trivial to store dynamic information such as pageview counts and comments under a serverless architecture. This site uses Waline to implement the article view count and comment system.

By accident, I found that we do not have a full-site pageview counter: Waline provides a post-level counter, not a site-level one.

Stylish used to be an excellent Chrome extension. It can customize the CSS style of any website, and I used it to change the font to Monaco for many years.

Unfortunately, a recent auto update (July 6, 2022) made it completely unusable: it seems the custom CSS style is now applied after the page loads, so the page keeps the original font until the whole page has finished loading and then suddenly switches to the custom style. Besides, the extension UI received a big redesign that cannot load properly.

After Deploy Hexo From Private Repository to GitHub Pages, we encountered many issues: GitHub Checkout Action Preserve File Modification Time, and now some posts' permalink dates may shift by one day. For instance, if the original markdown date is 2020-07-13 00:50:05, the generated permalink date becomes 2020/07/12. Since the permalinks changed, search engines will regard these posts as not found, which hurts SEO performance.

In Deploy Hexo From Private Repository to GitHub Pages, we leveraged GitHub Actions to automatically deploy the Hexo website. However, for each deployment commit, a post's edit time is changed to the current time instead of its actual modification time. This may mislead search engines into regarding the website as a frequently modified site.

By default, Hexo uses the post file's modification time as its edit time. By design, git doesn't preserve file modification times (refer to this). After the checkout action, the file modification time will be the current time.
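
One common workaround (a rough sketch, not necessarily the approach this post ends up using) is to reset each tracked file's mtime to its last commit time after checkout, assuming the full git history is available:

#!/usr/bin/env python3
# Sketch: restore every tracked file's mtime to its last commit time.
# Run at the repository root after a full (non-shallow) checkout.
import os
import subprocess

for path in subprocess.check_output(["git", "ls-files"], text=True).splitlines():
    # %ct = committer timestamp (Unix epoch) of the last commit touching this file
    ts = subprocess.check_output(
        ["git", "log", "-1", "--format=%ct", "--", path], text=True).strip()
    if ts:
        os.utime(path, (int(ts), int(ts)))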

After I upgraded pandoc from 2.14.0.3 to 2.18, hexo-renderer-pandoc cannot render one of my posts correctly. Everything worked fine in 2.14.0.3. The error looks like:

INFO  Start processing
FATAL {
  err: Error: [ERROR][hexo-renderer-pandoc] On /home/finisky/source/_posts/test.md
  [ERROR][hexo-renderer-pandoc] pandoc exited with code 64: YAML parse exception at line 4, column 0, while scanning a simple key: could not find expected ':'

  at Hexo.pandocRenderer (/home/finisky/node_modules/hexo-renderer-pandoc/index.js:114:11)
  at Hexo.tryCatcher (/home/finisky/node_modules/bluebird/js/release/util.js:16:23)
  at Hexo.<anonymous> (/home/finisky/node_modules/bluebird/js/release/method.js:15:34)
  at /home/finisky/node_modules/hexo/lib/hexo/render.js:75:22
  at tryCatcher (/home/finisky/node_modules/bluebird/js/release/util.js:16:23)
  at Promise._settlePromiseFromHandler (/home/finisky/node_modules/bluebird/js/release/promise.js:547:31)
  at Promise._settlePromise (/home/finisky/node_modules/bluebird/js/release/promise.js:604:18)
  at Promise._settlePromiseCtx (/home/finisky/node_modules/bluebird/js/release/promise.js:641:10)
  at _drainQueueStep (/home/finisky/node_modules/bluebird/js/release/async.js:97:12)
  at _drainQueue (/home/finisky/node_modules/bluebird/js/release/async.js:86:9)
  at Async._drainQueues (/home/finisky/node_modules/bluebird/js/release/async.js:102:5)
  at Immediate.Async.drainQueues [as _onImmediate] (/home/finisky/node_modules/bluebird/js/release/async.js:15:14)
  at processImmediate (node:internal/timers:464:21)

}
Something's wrong. Maybe you can find the solution here: %s https://hexo.io/docs/troubleshooting.html

Hugo is a nice static site generator. A common scenario is to store your website source code in a private repository and serve it on GitHub Pages. Can we leverage GitHub Actions to automatically build and deploy the site from the private repository to GitHub Pages? The answer is absolutely yes!

Before we start, you need two repos (they can belong to different GitHub accounts):

  • Source Repo: the repo that stores the website source code
  • Target Repo: the repo that hosts the GitHub Pages site (xxx.github.io)

What we need to do is create a personal access token in the target GitHub account, configure it as a secret in the source repo, and create an Actions workflow YAML.
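
A minimal workflow sketch (.github/workflows/deploy.yml) under these assumptions: the personal access token is saved in the source repo as a secret named DEPLOY_TOKEN, the target repo is xxx/xxx.github.io, and the site is built by Hugo into ./public using the peaceiris actions:

name: Deploy to GitHub Pages
on:
  push:
    branches: [master]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: peaceiris/actions-hugo@v2
        with:
          hugo-version: 'latest'
      - run: hugo --minify
      - uses: peaceiris/actions-gh-pages@v3
        with:
          personal_token: ${{ secrets.DEPLOY_TOKEN }}
          external_repository: xxx/xxx.github.io
          publish_branch: master
          publish_dir: ./public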

My first understanding of language models originated from n-grams. When I learned about RNNLM, I had a question: why can a neural network represent a language model?

After some research, I found the answer. Essentially, a language model is a probability distribution:

A statistical language model is a probability distribution over sequences of words.

In a language model, the probability of a sequence $w_1, w_2, \ldots w_m$ can be represented as:
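
$$P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1})$$

This is the chain rule factorization: the joint probability of the sequence is a product of conditional probabilities, and an n-gram model simply approximates each conditional by truncating the history to the previous $n-1$ words.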

LeetCode Top K Frequent Words requires the STL priority_queue to solve.

The priority_queue definition:

template<
    class T,
    class Container = std::vector<T>,
    class Compare = std::less<typename Container::value_type>
> class priority_queue;

There are two ways to implement the priority_queue compare function for a customized type.

Overload Operator

Define a struct Cmp and overload operator():

struct Cmp {
    bool operator()(const pair<string, int> &a, const pair<string, int> &b)
    {
        if (a.second == b.second) return a.first < b.first;
        return a.second > b.second;
    }
};

priority_queue<pair<string, int>, vector<pair<string, int>>, Cmp> pq;

Lambda Function

It would be simpler to implement the compare function using a lambda; no extra struct is needed:
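
A minimal sketch (the comparison logic mirrors the Cmp struct above; the lambda and its type via decltype are passed to the priority_queue):

auto cmp = [](const pair<string, int> &a, const pair<string, int> &b) {
    // Keep the lowest-count pair on top; break count ties by the lexicographically larger word
    if (a.second == b.second) return a.first < b.first;
    return a.second > b.second;
};
priority_queue<pair<string, int>, vector<pair<string, int>>, decltype(cmp)> pq(cmp);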

After migration, this blog uses GitHub as its image hosting service. However, sometimes the image loading speed is slow. After some investigation, we found that we can easily improve the image loading speed with the jsDelivr CDN.

jsDelivr Image URL Format

Assume the original GitHub image URL is:

https://raw.githubusercontent.com/{user}/{repo}/master/{folderpath}/{filename}

To leverage the jsDelivr CDN, just convert the URL to this:

https://cdn.jsdelivr.net/gh/{user}/{repo}/{folderpath}/{filename}

Or just do a prefix replacement:

https://raw.githubusercontent.com/{user}/{repo}/master/

-->

https://cdn.jsdelivr.net/gh/{user}/{repo}/

For example, for the image p.png in GitHub user user's repository repo, under folder a/:
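
https://raw.githubusercontent.com/user/repo/master/a/p.png

-->

https://cdn.jsdelivr.net/gh/user/repo/a/p.png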

Previously we talked about How to Retry MongoDB Transaction. However, if you use BulkWrite() and one of the operations is retryable (e.g. a duplicate key error), the new transactions API will retry the bulk write endlessly, which might drive the server CPU to 100%. (MongoDB Server v4.4.6-ent, MongoDB Driver v2.12.2)

To avoid this issue, we have three suggestions:

  • Add a cancellation token to limit the maximum retry time (a sketch follows below)
  • Break the transaction after a maximum retry count
  • Set BulkWriteOptions { IsOrdered = true }

The first two suggestions also apply to transactions that don't use BulkWrite().
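
For the first suggestion, a rough C# sketch using the driver's WithTransactionAsync together with a timed CancellationTokenSource (the 30-second budget and the client/collection/requests variable names are illustrative, not taken from the original post):

// Bound the transaction retry loop so a permanently failing BulkWrite
// cannot be retried forever; 30 seconds is an arbitrary illustrative budget.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
using var session = await client.StartSessionAsync(cancellationToken: cts.Token);
await session.WithTransactionAsync(
    (s, ct) => collection.BulkWriteAsync(
        s, requests, new BulkWriteOptions { IsOrdered = true }, ct),
    cancellationToken: cts.Token);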

Prompting is one of the hottest NLP techniques. This is a brief introduction to prompting through three questions: what is prompting, why prompting, and how to prompt. As a brief introduction, we do not cover too many details but try to summarize the main ideas of prompting. For more details, please refer to the original papers.

What’s Prompting

I couldn't find a rigorous definition of prompting, so I'll just quote some pieces from papers.

Recently I switched the static site generator from Hexo to Hugo. The main reason is that Hexo is too slow and cannot generate a website with thousands of pages.

Then I found this in Who Should Use Hugo?

Hugo is for people building a blog, a company site, a portfolio site, documentation, a single landing page, or a website with thousands of pages.

The latest language model training/fine-tuning tutorial from Hugging Face Transformers can be found here: Transformers Language Model Training

There are three scripts: run_clm.py, run_mlm.py and run_plm.py. For GPT, which is a causal language model, we should use run_clm.py. However, run_clm.py doesn't support line-by-line datasets: for each batch, the default behavior is to concatenate the training examples and split them into chunks of block_size tokens.
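
Roughly, the grouping behaves like this simplified Python sketch (adapted from the idea in run_clm.py; the function name and the block_size default here are only illustrative):

def group_texts(examples, block_size=1024):
    # Concatenate every tokenized field (input_ids, attention_mask, ...) of the batch
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    # Drop the tail so that every block has exactly block_size tokens
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM training the labels start as a copy of input_ids
    result["labels"] = result["input_ids"].copy()
    return result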

Follow Dynamically create and use a persistent volume with Azure Files in Azure Kubernetes Service (AKS) to create a new storage class in AKS:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard-grs
provisioner: kubernetes.io/azure-disk
parameters:
  cachingmode: ReadOnly
  kind: Managed
  skuName: Standard_GRS
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate

If you create the new resource via the Kubernetes dashboard, an error might occur:

How can we use different GitHub accounts for different repositories? For instance, we have two GitHub accounts, x1 and x2, where x1 is used for repo1 and x2 for repo2. At first glance, we can set the git config in each repository folder by git config user.name xxx. However, this approach has two drawbacks:

  • We need to configure the user name/email in every repository
  • In some cases, the git user cannot be configured by git config at all. For example, with hexo-deployer-git, the git repo is automatically generated by the deployer, so it's hard to manually set the user name.

Fortunately, we can leverage the SSH config to associate different GitHub accounts with different repos. Defining different host entries does the trick: since we log in to GitHub via SSH, we can use a virtual host as an alias for the real host name.
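
For example, a ~/.ssh/config along these lines (the host aliases and key paths are placeholders):

# Account x1
Host github-x1
    HostName github.com
    User git
    IdentityFile ~/.ssh/id_ed25519_x1

# Account x2
Host github-x2
    HostName github.com
    User git
    IdentityFile ~/.ssh/id_ed25519_x2

Then clone repo1 as git@github-x1:x1/repo1.git and repo2 as git@github-x2:x2/repo2.git, so each repo authenticates with the matching key.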

Recently we found that traffic is not balanced across the MongoDB cluster shards. After investigation, the root cause is that data on each shard is not evenly distributed (chunk balancing != data balancing != traffic balancing). The data distribution looks like this:

Shard      Data Size
mongo-0    10.55 GB
mongo-1    25.76 GB
mongo-2    10.04 GB

Why is the data size of mongo-1 significantly larger than the others while the chunk counts among the 3 shards are almost the same? We need to analyze the chunk size distribution across these shards.

After adding new shards to our production MongoDB cluster (v4.4.6-ent with 5 shards, 3 replicas for each shard), we found that the balancer was not working. sh.status() displays many chunk migration errors:

...
  balancer:
        Currently enabled:  yes
        Currently running:  no
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours:
                7 : Failed with error 'aborted', from mongo-1 to mongo-3
                7208 : Failed with error 'aborted', from mongo-1 to mongo-4
  databases:
        {  "_id" : "X",  "primary" : "mongo-1",  "partitioned" : true,  "version" : {  "uuid" : UUID("xxx"),  "lastMod" : 1 } }
                X.A
                        shard key: { "Uuid" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                mongo-0       231
                                mongo-1       327
                                mongo-2       230
                                mongo-3       208
...

Obviously, the chunks are unbalanced across shards (327 vs 208). Since the balancer is enabled, we tried to debug the issue through mongodb.log on the config server. There are many migration failure logs (sensitive information masked):

When a pod is in an error state (CrashLoopBackOff), Kubernetes keeps restarting it. If you try to exec into the pod to check the logs or debug, the following error message appears:

unable to upgrade connection: container not found ("")

This is because the old pod has been killed and you cannot exec into it anymore. So how can we prevent the pod from endlessly restarting?

Just add a command to the deployment YAML to override the default command of the container image, and make the pod never finish by using sleep infinity or tail -f /dev/null:
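
A minimal sketch of the pod template portion of the deployment YAML (the container name and image are placeholders):

spec:
  template:
    spec:
      containers:
        - name: app
          image: myapp:latest
          # Override the image's default entrypoint so the container just idles,
          # keeping the pod running and letting you kubectl exec into it
          command: ["sleep", "infinity"]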

Multiple Backup Daemons are typically run when the storage requirements or the load generated by the deployment is too much for a single daemon.

Directly scaling the statefulset ops-manager-backup-daemon to multiple instances (e.g. 3) doesn't work: because the mongodb-enterprise-operator is watching the statefulset, the instance count will be scaled back down to 1 by the MongoDB operator several minutes later.

So how can we scale up the backup daemons via the MongoDB Kubernetes operator?