Finisky Garden

NLP, Software Engineering, Product Design

How do you keep your local repository always in sync with a GitHub repository? The answer is a webhook.

When the repo receives a push event, GitHub sends a POST request to the webhook URL with details of the subscribed event. All we need to do is implement a webhook endpoint (on the local side) that performs git pull to stay in sync with the remote.

When I installed elasticdump, the following error appeared:

$ npm install elasticdump
...
npm WARN @1.0.0 No description
npm WARN @1.0.0 No repository field.
npm ERR! Linux 5.4.0-1091-azure
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "install" "elasticdump"
npm ERR! node v8.10.0
npm ERR! npm  v3.5.2
npm ERR! path /home/finisky/node_modules/.staging/@types/node-1f2b596d/package.json
npm ERR! code ENOTDIR
npm ERR! errno -20
npm ERR! syscall open

npm ERR! ENOTDIR: not a directory, open '/home/finisky/node_modules/.staging/@types/node-1f2b596d/package.json'

git reset --hard does not work: every time you reset, the file flips between file.txt and File.txt. Really weird…

It’s not a joke, just clone this repo on Windows and you can reproduce it:

D:\$ git clone https://github.com/finisky/git-case-demo.git
Cloning into 'git-case-demo'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 11 (delta 0), reused 8 (delta 0), pack-reused 0
Unpacking objects: 100% (11/11), 1.85 KiB | 126.00 KiB/s, done.
warning: the following paths have collided (e.g. case-sensitive paths
on a case-insensitive filesystem) and only one from the same
colliding group is in the working tree:

  'File.txt'
  'file.txt'

After cloning the repo, you will find that the main branch is not clean, and git reset --hard does not fix it:

GitHub Pages cannot perform HTTP 301 redirects because you cannot modify the server configuration. However, 301 redirects are crucial for SEO. To keep your site ranking, you need to 301-redirect the old GitHub Pages site to your new site and manually notify Google Search Console:

Do you lose credit for links when you redirect to new URLs?
No, 301 or 302 redirects do not cause a loss in PageRank

So how do you migrate GitHub Pages to a new site without losing site ranking?

I bound the custom domain finisky.eu.org to the GitHub Pages site finisky.github.io and later removed it. However, when I visit finisky.github.io, it still redirects to finisky.eu.org, which is unavailable. I suspect the issue is caused by caching.

As the number of posts grows, Hexo generates the site more and more slowly. Recently, generation has usually taken several minutes and then reported the following error:

[ERROR][hexo-renderer-pandoc] pandoc exited with code null. at Object._prettifyError (/home/finisky/node_modules/nunjucks/src/lib.js:36:11)

I spent several hours figuring out the issue. Finally, I found the root cause … the VM memory was too small … :-(

Static site generators such as Hexo, Hugo and Jekyll have become very popular in recent years. Static sites are fast and easy to write, deploy and host. However, there is no free lunch: it is non-trivial to store dynamic information such as pageview counts and comments under a serverless architecture. This site uses Waline to implement the article view count and the comment system.

I accidentally found that we do not have a site-wide pageview counter: Waline provides a post-level counter but not a site-level one.

Stylish used to be an excellent Chrome extension. It can customize the CSS of any website. I used it for many years to change the font to Monaco.

Unfortunately, a recent auto update (July 6, 2022) made it completely unusable: the custom CSS seems to be applied only after the page loads, so the page keeps the original font until it has finished loading and then suddenly switches to the custom style. Besides, the extension UI changed significantly and no longer loads properly.

After setting up Deploy Hexo From Private Repository to GitHub Pages, we encountered several issues: first GitHub Checkout Action Preserve File Modification Time, and now the date in some posts’ permalinks may shift by one day. For instance, if the original markdown date is 2020-07-13 00:50:05, the generated permalink date becomes 2020/07/12. Since the permalinks change, search engines will treat the old URLs as not found, which hurts SEO performance.

Following Deploy Hexo From Private Repository to GitHub Pages, we can leverage GitHub Actions to automatically deploy the Hexo website. However, on each deployment commit, every post’s edit time is set to the current time instead of its actual modification time. This may mislead search engines into treating the website as frequently modified.

By default, Hexo uses the post file's modification time as its edit time. By design, git doesn’t preserve file modification times (refer to this). After the checkout action runs, every file's modification time is the current time.

After I upgraded pandoc from 2.14.0.3 to 2.18, hexo-renderer-pandoc could no longer render one of my posts correctly. Everything worked fine in 2.14.0.3. The error looks like:

INFO Start processing
FATAL { err: Error:
[ERROR][hexo-renderer-pandoc] On /home/finisky/source/_posts/test.md
[ERROR][hexo-renderer-pandoc] pandoc exited with code 64: YAML parse exception at line 4, column 0, while scanning a simple key: could not find expected ':'

  at Hexo.pandocRenderer (/home/finisky/node_modules/hexo-renderer-pandoc/index.js:114:11)
  at Hexo.tryCatcher (/home/finisky/node_modules/bluebird/js/release/util.js:16:23)
  at Hexo.<anonymous> (/home/finisky/node_modules/bluebird/js/release/method.js:15:34)
  at /home/finisky/node_modules/hexo/lib/hexo/render.js:75:22
  at tryCatcher (/home/finisky/node_modules/bluebird/js/release/util.js:16:23)
  at Promise._settlePromiseFromHandler (/home/finisky/node_modules/bluebird/js/release/promise.js:547:31)
  at Promise._settlePromise (/home/finisky/node_modules/bluebird/js/release/promise.js:604:18)
  at Promise._settlePromiseCtx (/home/finisky/node_modules/bluebird/js/release/promise.js:641:10)
  at _drainQueueStep (/home/finisky/node_modules/bluebird/js/release/async.js:97:12)
  at _drainQueue (/home/finisky/node_modules/bluebird/js/release/async.js:86:9)
  at Async._drainQueues (/home/finisky/node_modules/bluebird/js/release/async.js:102:5)
  at Immediate.Async.drainQueues [as _onImmediate] (/home/finisky/node_modules/bluebird/js/release/async.js:15:14)
  at processImmediate (node:internal/timers:464:21)

}
Something’s wrong. Maybe you can find the solution here: %s https://hexo.io/docs/troubleshooting.html

Hugo is a nice static site generator. A common scenario is to store your website source code in a private repository and serve it on GitHub Pages. Can we leverage GitHub Actions to automatically build and deploy the site from the private repository to GitHub Pages? The answer is absolutely yes!

Before we start, you need two repos (they can belong to different GitHub accounts):

  • Source Repo: the private repo that stores the website source code
  • Target Repo: the repo that hosts the GitHub Pages site (xxx.github.io)

What we need to do is create a personal access token in the target GitHub account, configure it in the source repo, and create a GitHub Actions workflow YAML file.

My first understanding of language models came from n-grams. When I learned about RNNLM, I had a question: why can a neural network represent a language model?

After some research, I found the answer. Essentially, a language model is a probability distribution:

A statistical language model is a probability distribution over sequences of words.

In a language model, the probability of a sequence $w_1, w_2, \ldots, w_m$ can be represented as:
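
A common way to write this, assuming the standard chain-rule factorization of a joint probability, is:

$$
P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1})
$$

An n-gram model approximates each factor by conditioning only on the previous $n-1$ words, $P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$, while an RNN language model summarizes the whole history $w_1, \ldots, w_{i-1}$ in its hidden state and predicts $w_i$ from that state.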

The LeetCode problem Top K Frequent Words can be solved with the STL priority_queue.

priority_queue definition:

template<
    class T,
    class Container = std::vector<T>,
    class Compare = std::less<typename Container::value_type>
> class priority_queue;

There are two ways to implement the priority_queue comparison function for a custom type.

Overload Operator

Define a struct Cmp and overload operator():

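#include <queue>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// Orders the heap so that the least frequent word sits on top;
// among equal counts, the alphabetically larger word sits on top.
struct Cmp {
    bool operator()(const pair<string, int> &a, const pair<string, int> &b) const
    {
        if (a.second == b.second) return a.first < b.first;
        return a.second > b.second;
    }
};

priority_queue<pair<string, int>, vector<pair<string, int>>, Cmp> pq;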

Lambda Function

It is simpler to implement the comparison function with a lambda. No extra struct is needed:
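
A minimal sketch of the lambda version, assuming the same pair<string, int> element type and the same ordering as Cmp. The function name topKFrequent and its signature are illustrative assumptions, not taken from the original post:

```cpp
#include <algorithm>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: returns the k most frequent words, most frequent first;
// ties are broken alphabetically.
std::vector<std::string> topKFrequent(const std::vector<std::string> &words, int k)
{
    using Entry = std::pair<std::string, int>;

    // Count how often each word occurs.
    std::unordered_map<std::string, int> freq;
    for (const auto &w : words) ++freq[w];

    // Same ordering as Cmp: the least frequent entry sits on top of the heap,
    // and among equal counts the alphabetically larger word sits on top.
    auto cmp = [](const Entry &a, const Entry &b) {
        if (a.second == b.second) return a.first < b.first;
        return a.second > b.second;
    };

    // The lambda's type has no name, so it is spelled with decltype and the
    // lambda object itself is passed to the constructor.
    std::priority_queue<Entry, std::vector<Entry>, decltype(cmp)> pq(cmp);

    for (const auto &entry : freq) {
        pq.emplace(entry.first, entry.second);
        if (static_cast<int>(pq.size()) > k) pq.pop();  // evict the current worst
    }

    // Entries pop from worst to best, so collect them and reverse.
    std::vector<std::string> result;
    while (!pq.empty()) {
        result.push_back(pq.top().first);
        pq.pop();
    }
    std::reverse(result.begin(), result.end());
    return result;
}
```

Keeping only k entries in the heap means the top is always the weakest candidate, which is why the comparison puts the least frequent word on top rather than the most frequent one.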

Since the migration, this blog has used GitHub as its image hosting service. However, image loading is sometimes slow. After some investigation, we found that the jsDelivr CDN can easily improve the image loading speed.

jsDelivr Image URL Format

Assume the original GitHub image URL is:

https://raw.githubusercontent.com/{user}/{repo}/master/{folderpath}/{filename}

To use the jsDelivr CDN, convert the URL to this form:

https://cdn.jsdelivr.net/gh/{user}/{repo}/{folderpath}/{filename}

Or just do a prefix replacement:

https://raw.githubusercontent.com/{user}/{repo}/master/

-->

https://cdn.jsdelivr.net/gh/{user}/{repo}/

For example, for the image p.png in folder a/ of the repository repo owned by GitHub user user:
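
The prefix replacement can be sketched in code as follows. The function name toJsdelivrUrl is a hypothetical helper, and the sketch assumes the image lives on the master branch, as in the URL format above:

```cpp
#include <string>

// Hypothetical helper: rewrites a raw.githubusercontent.com image URL on the
// master branch into the cdn.jsdelivr.net/gh form.
// URLs that do not match the expected shape are returned unchanged.
std::string toJsdelivrUrl(const std::string &url)
{
    const std::string rawPrefix = "https://raw.githubusercontent.com/";
    const std::string cdnPrefix = "https://cdn.jsdelivr.net/gh/";
    const std::string branch = "/master/";

    // The URL must start with the raw.githubusercontent.com prefix.
    if (url.compare(0, rawPrefix.size(), rawPrefix) != 0) return url;

    // After the prefix comes {user}/{repo}/master/{folderpath}/{filename}.
    const auto userEnd = url.find('/', rawPrefix.size());
    if (userEnd == std::string::npos) return url;
    const auto repoEnd = url.find('/', userEnd + 1);
    if (repoEnd == std::string::npos) return url;

    // The path segment right after {repo} must be the master branch.
    if (url.compare(repoEnd, branch.size(), branch) != 0) return url;

    const std::string userAndRepo =
        url.substr(rawPrefix.size(), repoEnd - rawPrefix.size());
    const std::string path = url.substr(repoEnd + branch.size());
    return cdnPrefix + userAndRepo + "/" + path;
}
```

Under these assumptions, toJsdelivrUrl("https://raw.githubusercontent.com/user/repo/master/a/p.png") yields https://cdn.jsdelivr.net/gh/user/repo/a/p.png.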

Previously we talked about How to Retry MongoDB Transaction. However, if you use BulkWrite() and one of the operations hits a retryable error (e.g. a duplicate key error), the new transactions API will retry the bulk write endlessly, which might drive server CPU usage to 100%. (MongoDB Server v4.4.6-ent, MongoDB Driver v2.12.2)

To avoid this issue, we have three suggestions:

  • Add a cancellation token to limit the maximum retry time
  • Abort the transaction after a maximum retry count
  • Set BulkWriteOptions { IsOrdered = true }

The first two suggestions are also applicable to transactions which don’t use BulkWrite().

Prompting is one of the hottest NLP techniques. This is a brief introduction to prompting through three questions: what is prompting, why use prompting, and how to prompt. Since it is a brief introduction, we do not cover many details but try to summarize the main idea of prompting. For more details, please refer to the original papers.

What’s Prompting

I have not found a rigorous definition of prompting, so I just quote some passages from papers.

Recently I switched my static website generator from Hexo to Hugo. The main reason is that Hexo is too slow: it cannot cope with websites that have thousands of pages.

Then I found this: Who Should Use Hugo?

Hugo is for people building a blog, a company site, a portfolio site, documentation, a single landing page, or a website with thousands of pages.