Finisky Garden

NLP, Software Engineering, Product Design


The latest language model training/fine-tuning tutorial for Hugging Face Transformers can be found here: Transformers Language Model Training

There are three scripts: run_clm.py, run_mlm.py and run_plm.py. For GPT, which is a causal language model, we should use run_clm.py. However, run_clm.py doesn’t support line-by-line datasets. By default, each batch is built by concatenating the training examples and splitting them into blocks of block_size tokens.
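The default grouping behavior can be sketched as follows. This is a simplified illustration of the idea, not the exact Hugging Face implementation; the function name is made up:

```python
def group_into_blocks(token_ids_per_example, block_size):
    """Concatenate all tokenized examples, then split the result into
    fixed-size blocks of block_size tokens. The trailing remainder
    shorter than block_size is dropped (as run_clm.py does by default)."""
    concatenated = [t for example in token_ids_per_example for t in example]
    total = (len(concatenated) // block_size) * block_size
    return [concatenated[i:i + block_size] for i in range(0, total, block_size)]
```

Note how example boundaries disappear: two short lines may end up in the same block, which is why a line-by-line dataset is not preserved.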

Follow Dynamically create and use a persistent volume with Azure Files in Azure Kubernetes Service (AKS) to create a new storage class in AKS:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard-grs
provisioner: kubernetes.io/azure-disk
parameters:
  cachingmode: ReadOnly
  kind: Managed
  skuName: Standard_GRS
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate
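A PersistentVolumeClaim can then request this storage class by name (the claim name and size below are illustrative):

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: standard-grs-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard-grs
  resources:
    requests:
      storage: 10Gi
```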

If you create the new resource via the Kubernetes dashboard, an error might occur:

How can we use different GitHub accounts for different repositories? For instance, suppose we have two GitHub accounts x1 and x2, where x1 is for repo1 and x2 is for repo2. At first glance, we can set git config in each repository folder by git config user.name xxx. However, this approach has two drawbacks:

  • Need to configure the user name/email in every repository
  • In some cases, the git user cannot be configured by git config. For example, with hexo-deployer-git: since the git repo is automatically generated by the deployer, it’s hard to manually set the user name.

Fortunately, we can leverage SSH config to associate different GitHub accounts with different repos. Defining different host entries does the trick: since we log in to GitHub via SSH, we can use a virtual host as an alias to represent the real host name.
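For example, in ~/.ssh/config (the alias names and key paths below are placeholders):

```
# Account x1, used for repo1
Host github-x1
    HostName github.com
    User git
    IdentityFile ~/.ssh/id_rsa_x1

# Account x2, used for repo2
Host github-x2
    HostName github.com
    User git
    IdentityFile ~/.ssh/id_rsa_x2
```

Then clone each repo via its alias host, e.g. git clone git@github-x1:someuser/repo1.git, and SSH will pick the matching key (and thus the matching account) automatically.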

Recently we found that traffic was not balanced across the MongoDB cluster shards. After investigation, the root cause is that the data on each shard is not evenly distributed (chunk balancing != data balancing != traffic balancing). The data distribution looks like this:

Shard      Data Size
mongo-0    10.55 GB
mongo-1    25.76 GB
mongo-2    10.04 GB

Why is the data size of mongo-1 significantly larger than the others while the chunk counts among the 3 shards are almost the same? To answer this, we need to analyze the chunk size distribution across these shards.

After adding new shards to our production MongoDB cluster (v4.4.6-ent with 5 shards, 3 replicas for each shard), we found that the balancer is not working. sh.status() displays many chunk migration errors:

...
  balancer:
        Currently enabled:  yes
        Currently running:  no
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours:
                7 : Failed with error 'aborted', from mongo-1 to mongo-3
                7208 : Failed with error 'aborted', from mongo-1 to mongo-4
  databases:
        {  "_id" : "X",  "primary" : "mongo-1",  "partitioned" : true,  "version" : {  "uuid" : UUID("xxx"),  "lastMod" : 1 } }
                X.A
                        shard key: { "Uuid" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                mongo-0       231
                                mongo-1       327
                                mongo-2       230
                                mongo-3       208
...

Obviously, the chunks are unbalanced across shards (327 vs 208). Since the balancer is enabled, we tried to debug the issue through mongodb.log on the config server. There are many migration failure logs (sensitive information masked):

When a pod is in an error state (CrashLoopBackOff), Kubernetes restarts it. If you try to exec into the pod to check the logs or debug, the following error message appears:

unable to upgrade connection: container not found ("")

This is because the old pod has been killed and you cannot exec into it anymore. So how can we prevent the pod from endlessly restarting?

Just add a command to the deployment YAML to override the container image’s default command. Make the pod never finish by sleep infinity or tail -f /dev/null:
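An illustrative deployment fragment (container name and image are placeholders):

```
spec:
  containers:
    - name: app
      image: your-image:tag
      # Override the image's default entrypoint so the container
      # stays alive and can be exec'd into for debugging.
      command: ["sleep", "infinity"]   # or: ["tail", "-f", "/dev/null"]
```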

Multiple Backup Daemons are typically run when the storage requirements or the load generated by the deployment is too much for a single daemon.

Directly scaling the statefulset ops-manager-backup-daemon to multiple instances (e.g. 3) doesn’t work: because the mongodb-enterprise-operator is watching the statefulset, the instance count will be scaled back down to 1 by the MongoDB operator several minutes later.

So how can we scale up the backup daemons through the MongoDB Kubernetes operator?
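With the Enterprise Operator, the supported way is to declare the Backup Daemon count in the MongoDBOpsManager resource itself rather than scaling the statefulset. A sketch, with the resource name and member count chosen for illustration (check the operator docs for the exact fields supported by your version):

```
apiVersion: mongodb.com/v1
kind: MongoDBOpsManager
metadata:
  name: ops-manager
spec:
  backup:
    enabled: true
    members: 3   # desired number of Backup Daemon instances
```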

Apex Compile Error

The environment (CUDA 10.0):

$ conda install pytorch==1.1.0 torchvision==0.3.0 cudatoolkit=10.0 -c pytorch

The apex repo master HEAD:

commit 0c2c6eea6556b208d1a8711197efc94899e754e1 (HEAD -> master, origin/master, origin/HEAD)
Author: Nan Zheng <80790206+nanz-nv@users.noreply.github.com>
Date:   Sat Jul 17 08:53:59 2021 +0800
...

Install apex:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

The compile error looks like this:

    csrc/mlp.cpp:127:54: error: expected primary-expression before ‘>’ token
           w_ptr.push_back(inputs[i + 1].data_ptr<scalar_t>());
                                                          ^

Solution

Similar discussions: https://github.com/NVIDIA/apex/issues/802 https://github.com/NVIDIA/apex/issues/1139

From the official manual:

Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them. Because change streams use the aggregation framework, applications can also filter for specific changes or transform the notifications at will.

MongoDB change stream is a nice feature. It allows applications to access real-time data changes without the complexity and risk of tailing the oplog.

Recently, when we used a change stream to replicate data from one sharded cluster to another, it immediately made the cluster unstable (several nodes broke down and a primary election was triggered). The read/write latency then increased significantly.

Observations

Observations on our production environment:

Recently I found that Google auto ads significantly slow down the page loading speed. There are also many discussions about this. For a static website, fast loading speed is crucial. In this post, we will optimize the PageSpeed Insights score by delaying the loading of auto ads.

First, let’s check the current PSI score on mobile: [PSI Mobile screenshot]

It seems that the Reduce unused JavaScript section has many items to be improved. Check the official Google auto ads script:
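A common delay-loading technique is to inject the auto ads script a few seconds after the page loads instead of including it synchronously in the page head. A sketch (the 3-second delay and the client ID are placeholders):

```
<script>
  // Inject the AdSense auto ads script after a delay, so it no longer
  // competes with the initial page render measured by PSI.
  function loadAds() {
    var s = document.createElement('script');
    s.src = 'https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-XXXXXXXX';
    s.async = true;
    s.crossOrigin = 'anonymous';
    document.head.appendChild(s);
  }
  window.addEventListener('load', function () {
    setTimeout(loadAds, 3000);
  });
</script>
```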

Blogroll is natively supported in the NexT theme: all links are shown in the sidebar. However, as your links increase, the sidebar grows as well, which makes the page lengthy and distracting. Therefore, we consider creating a dedicated blogroll page.

After some searching, most existing approaches need to modify the NexT source code (theme swig template files). The implementation is a bit complicated and breaks the theme’s integrity: when you update the theme later, you will need to manually merge or rebase master onto your code.

Recently I wanted to simplify the permanent link for each post. From:

/2021/03/21/migrateopsmanager.en/

To:

/migrateopsmanager.en/

A shorter URL is more concise and readable, as the date string means nothing to users. But what about URL backward compatibility?

Change Permanent Link Format Issue

According to the official document, changing the permalink format is pretty easy: just modify :year/:month/:day/:title/ to :title/. However, the real problem is that all existing incoming links become invalid after this modification. The ranking of our site would then be affected, which is unacceptable.
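One backward-compatible option is to leave a redirect page at each old URL; for Hexo, plugins such as hexo-generator-alias can generate these, and the generated page is essentially just a meta refresh. An illustrative redirect page for the post above:

```
<!-- Served at the old path /2021/03/21/migrateopsmanager.en/index.html -->
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <link rel="canonical" href="/migrateopsmanager.en/">
    <meta http-equiv="refresh" content="0; url=/migrateopsmanager.en/">
  </head>
  <body>Redirecting to <a href="/migrateopsmanager.en/">/migrateopsmanager.en/</a></body>
</html>
```

The canonical link tells search engines which URL should carry the ranking, while the refresh keeps old incoming links working for readers.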

We have a costly SQL Server database with bad performance. Specifically, some stored procedures (joining several tables on their primary keys, each table with ~10M rows) took several minutes to execute. The execution plan showed that index seeks cost 90% of the total time. We finally found the root cause: the indexes had a very high degree of fragmentation. Since the DBA had changed many times, we needed to analyze the database schemas, table disk usage and stored procedure table dependencies. Based on these results, we cleaned up tables and stored procedures and rebuilt the indexes to improve the DB performance. Here are the queries to accomplish these tasks.

Recently we wanted to deploy MongoDB Ops Manager and the MongoDB deployments in different data centers to improve disaster recovery. If they are deployed in the same data center and it unfortunately fails, you cannot restore the backup data to a new cluster, as both Ops Manager and the deployments are unavailable.

Of course, we don’t want to re-deploy the existing MongoDB deployments in Kubernetes. But how can we make the deployments send data to the new Ops Manager URL?
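With the Kubernetes operator, each deployment links to Ops Manager through a project ConfigMap, so pointing the deployments at the new Ops Manager comes down to updating its baseUrl. A sketch, with all names and the URL as placeholders:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-project
  namespace: mongodb
data:
  baseUrl: https://new-ops-manager.example.com:8080
  projectName: my-project
  orgId: ""
```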

Using MongoDB in .NET is easy. However, there are two ways to manipulate documents in C# code: raw BSON documents or strongly-typed documents. In this article, we will compare the two by example. Basically, a strongly-typed collection is preferred unless you have a strong reason to use weakly-typed documents (different types in the same collection?).

BsonDocument CRUD

The MongoDB C# Driver official document provides examples in this style. I guess the reason is that MongoDB is schemaless and the driver would like to demonstrate how to access documents without a schema. Actually, NoSQL doesn’t mean no SQL but stands for not only SQL. Creating a schema for a collection is still recommended because it makes it easier to access documents and use indexes.

MongoDB transactions are a nice feature. Although MongoDB uses optimistic concurrency control, write conflicts are unavoidable. The situation becomes worse in multi-document transactions, which modify many documents at once. If a write conflict happens, a MongoCommandException will be thrown:

Exception: Command update failed: Encountered error from mongodb.svc.cluster.local:27017 during a transaction :: caused by :: WriteConflict error: this operation conflicted with another operation. Please retry your operation or multi-document transaction..
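As the error message itself suggests, a common mitigation is to retry the whole transaction when a write conflict occurs. A minimal retry sketch (in Python for brevity; the exception class and back-off values are illustrative, not the driver's actual API):

```python
import random
import time

class WriteConflictError(Exception):
    """Stand-in for the driver's write-conflict exception (illustrative)."""

def run_transaction_with_retry(txn_func, max_retries=5, base_delay=0.05):
    """Run a transactional callback, retrying on write conflicts
    with exponential back-off and jitter to de-correlate retries."""
    for attempt in range(max_retries):
        try:
            return txn_func()
        except WriteConflictError:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt) * random.random())
```

The callback must be idempotent up to the transaction commit, since everything before the conflict is rolled back and re-executed.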

Today I tried to import an existing MongoDB deployment (outside the Kubernetes cluster) into a MongoDB Ops Manager running in Kubernetes. After installing the MongoDB Agent on the deployment, only the automation functionality works, while monitoring and backup do not. The root cause is that the agent still tries to post data to Ops Manager’s internal endpoint.

The latest MongoDB Agent is a single binary that contains all three functions: Automation, Monitoring, and Backup. So theoretically, installing and configuring the all-in-one agent on the deployment VM should be enough.

I have a sharded cluster (2 shards with 3 mongods each; 3 config servers; 2 mongoses) which is deployed by MongoDB Ops Manager.

Last week, one of the shard hosts’ status was shown as a grey diamond (hover text: “Last Ping: Never”). Besides, on the Ops Manager’s server page, one server had two processes (e.g. sharddb-0 and sharddb-config). However, the cluster still works well and we can see the host sharddb-0-0 (shard 0, replica 0) in the mongo shell via sh.status() and rs.status(). What’s wrong with the cluster?

When I executed MongoDB transactions in parallel, I encountered lots of MongoCommandException: code 251, codename NoSuchTransaction:

Command find failed: cannot continue txnId 4 for session 38604515-2584-45a5-a17a-5eb5d34ea6c4 - = with txnId 5. Command find failed: cannot continue txnId 4 for session 38604515-2584-45a5-a17a-5eb5d34ea6c4 - = with txnId 6. Command insert failed: cannot continue txnId 31 for session 3ed7ea61-eae1-440f-8d95-b6e066b35b69 - = with txnId 34.

Problem Analysis

I performed some tests to pinpoint the issue: