Finisky Garden

NLP, Software Engineering, Product Design


The latest language model training/fine-tuning tutorial for Hugging Face Transformers can be found here: Transformers Language Model Training

There are three scripts: run_clm.py, run_mlm.py and run_plm.py. For GPT, which is a causal language model, we should use run_clm.py. However, run_clm.py doesn’t support line-by-line datasets. By default, each batch is built by concatenating the training examples and splitting them into blocks of block_size tokens.
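The default grouping behavior can be sketched as follows. This is a simplified illustration of the idea, not the exact Hugging Face implementation; the function name is made up:

```python
def group_into_blocks(token_ids_per_example, block_size):
    """Concatenate all tokenized examples, then split the result into
    fixed-size blocks of block_size tokens. The trailing remainder
    shorter than block_size is dropped (as run_clm.py does by default)."""
    concatenated = [t for example in token_ids_per_example for t in example]
    total = (len(concatenated) // block_size) * block_size
    return [concatenated[i:i + block_size] for i in range(0, total, block_size)]
```

Note how example boundaries disappear: two short lines may end up in the same block, which is why a line-by-line dataset is not preserved.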

Follow Dynamically create and use a persistent volume with Azure Files in Azure Kubernetes Service (AKS) to create a new storage class in AKS:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard-grs
provisioner: kubernetes.io/azure-disk
parameters:
  cachingmode: ReadOnly
  kind: Managed
  skuName: Standard_GRS
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate
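A PersistentVolumeClaim can then request this storage class by name (the claim name and size below are illustrative):

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: standard-grs-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard-grs
  resources:
    requests:
      storage: 10Gi
```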

If you create the new resource via the Kubernetes dashboard, an error might occur:

How can we use different GitHub accounts for different repositories? For instance, suppose we have two GitHub accounts x1 and x2, where x1 is for repo1 and x2 is for repo2. At first glance, we can set git config in each repository folder by git config user.name xxx. However, this approach has two drawbacks:

  • Need to configure the user name/email in every repository
  • In some cases, the git user cannot be configured by git config. For example, with hexo-deployer-git: since the git repo is automatically generated by the deployer, it’s hard to manually set the user name.

Fortunately, we can leverage SSH config to associate different GitHub accounts with different repos. Defining different host entries does the trick: since we log in to GitHub via SSH, we can use a virtual host as an alias to represent the real host name.
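For example, in ~/.ssh/config (the alias names and key paths below are placeholders):

```
# Account x1, used for repo1
Host github-x1
    HostName github.com
    User git
    IdentityFile ~/.ssh/id_rsa_x1

# Account x2, used for repo2
Host github-x2
    HostName github.com
    User git
    IdentityFile ~/.ssh/id_rsa_x2
```

Then clone each repo via its alias host, e.g. git clone git@github-x1:someuser/repo1.git, and SSH will pick the matching key (and thus the matching account) automatically.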

Recently we found that traffic was not balanced across the MongoDB cluster shards. After investigation, the root cause is that the data on each shard is not evenly distributed (chunk balancing != data balancing != traffic balancing). The data distribution looks like this:

Shard      Data Size
mongo-0    10.55 GB
mongo-1    25.76 GB
mongo-2    10.04 GB

Why is the data size of mongo-1 significantly larger than the others while the chunk counts among the 3 shards are almost the same? To answer this, we need to analyze the chunk size distribution across these shards.

After adding new shards to our production MongoDB cluster (v4.4.6-ent with 5 shards, 3 replicas for each shard), we found that the balancer is not working. sh.status() displays many chunk migration errors:

...
  balancer:
        Currently enabled:  yes
        Currently running:  no
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours:
                7 : Failed with error 'aborted', from mongo-1 to mongo-3
                7208 : Failed with error 'aborted', from mongo-1 to mongo-4
  databases:
        {  "_id" : "X",  "primary" : "mongo-1",  "partitioned" : true,  "version" : {  "uuid" : UUID("xxx"),  "lastMod" : 1 } }
                X.A
                        shard key: { "Uuid" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                mongo-0       231
                                mongo-1       327
                                mongo-2       230
                                mongo-3       208
...

Obviously, the chunks are unbalanced across shards (327 vs 208). Since the balancer is enabled, we tried to debug the issue through mongodb.log on the config server. There are many migration failure logs (sensitive information masked):

When a pod is in an error state (CrashLoopBackOff), Kubernetes restarts it. If you try to exec into the pod to check the logs or debug, the following error message appears:

unable to upgrade connection: container not found ("")

This is because the old pod has been killed and you cannot exec into it anymore. So how can we prevent the pod from endlessly restarting?

Just add a command to the deployment YAML to override the container image’s default command. Make the pod never finish by sleep infinity or tail -f /dev/null:
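An illustrative deployment fragment (container name and image are placeholders):

```
spec:
  containers:
    - name: app
      image: your-image:tag
      # Override the image's default entrypoint so the container
      # stays alive and can be exec'd into for debugging.
      command: ["sleep", "infinity"]   # or: ["tail", "-f", "/dev/null"]
```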

Multiple Backup Daemons are typically run when the storage requirements or the load generated by the deployment is too much for a single daemon.

Directly scaling the statefulset ops-manager-backup-daemon to multiple instances (e.g. 3) doesn’t work: because the mongodb-enterprise-operator is watching the statefulset, the instance count will be scaled back down to 1 by the MongoDB operator several minutes later.

So how can we scale up the backup daemons through the MongoDB Kubernetes operator?
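With the Enterprise Operator, the supported way is to declare the Backup Daemon count in the MongoDBOpsManager resource itself rather than scaling the statefulset. A sketch, with the resource name and member count chosen for illustration (check the operator docs for the exact fields supported by your version):

```
apiVersion: mongodb.com/v1
kind: MongoDBOpsManager
metadata:
  name: ops-manager
spec:
  backup:
    enabled: true
    members: 3   # desired number of Backup Daemon instances
```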

Apex Compile Error

The environment (CUDA 10.0):

$ conda install pytorch==1.1.0 torchvision==0.3.0 cudatoolkit=10.0 -c pytorch

The apex repo master HEAD:

commit 0c2c6eea6556b208d1a8711197efc94899e754e1 (HEAD -> master, origin/master, origin/HEAD)
Author: Nan Zheng <80790206+nanz-nv@users.noreply.github.com>
Date:   Sat Jul 17 08:53:59 2021 +0800
...

Install apex:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

The compile error looks like this:

    csrc/mlp.cpp:127:54: error: expected primary-expression before ‘>’ token
           w_ptr.push_back(inputs[i + 1].data_ptr<scalar_t>());
                                                          ^

Solution

Similar discussions: https://github.com/NVIDIA/apex/issues/802 https://github.com/NVIDIA/apex/issues/1139

From the official manual:

Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them. Because change streams use the aggregation framework, applications can also filter for specific changes or transform the notifications at will.

MongoDB change stream is a nice feature. It allows applications to access real-time data changes without the complexity and risk of tailing the oplog.

Recently, when we used a change stream to replicate data from one sharded cluster to another, it immediately made the cluster unstable (several nodes broke down and a primary election was triggered). The read/write latency then increased significantly.

Observations

Observations on our production environment:

Recently I found that Google auto ads significantly slow down the page loading speed. There are also many discussions about this. For a static website, fast loading speed is crucial. In this post, we will optimize the PageSpeed Insights score by delaying the loading of auto ads.

First, let’s check the current PSI score on mobile: [PSI Mobile screenshot]

It seems that the Reduce unused JavaScript section has many items to be improved. Check the official Google auto ads script:
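A common delay-loading technique is to inject the auto ads script a few seconds after the page loads instead of including it synchronously in the page head. A sketch (the 3-second delay and the client ID are placeholders):

```
<script>
  // Inject the AdSense auto ads script after a delay, so it no longer
  // competes with the initial page render measured by PSI.
  function loadAds() {
    var s = document.createElement('script');
    s.src = 'https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-XXXXXXXX';
    s.async = true;
    s.crossOrigin = 'anonymous';
    document.head.appendChild(s);
  }
  window.addEventListener('load', function () {
    setTimeout(loadAds, 3000);
  });
</script>
```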

Blogroll is natively supported in the NexT theme: all links are shown in the sidebar. However, as your links increase, the sidebar grows as well, which makes the page lengthy and distracting. Therefore, we consider creating a dedicated blogroll page.

After some searching, most existing approaches need to modify the NexT source code (theme swig template files). The implementation is a bit complicated and breaks the theme’s integrity: when you update the theme later, you will need to manually merge or rebase master onto your code.

Recently I wanted to simplify the permanent link for each post. From:

/2021/03/21/migrateopsmanager.en/

To:

/migrateopsmanager.en/

A shorter URL is more concise and readable, as the date string means nothing to users. But what about URL backward compatibility?

Change Permanent Link Format Issue

According to the official document, changing the permalink format is pretty easy: just modify :year/:month/:day/:title/ to :title/. However, the real problem is that all existing incoming links become invalid after this modification. The ranking of our site would then be affected, which is unacceptable.
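One backward-compatible option is to leave a redirect page at each old URL; for Hexo, plugins such as hexo-generator-alias can generate these, and the generated page is essentially just a meta refresh. An illustrative redirect page for the post above:

```
<!-- Served at the old path /2021/03/21/migrateopsmanager.en/index.html -->
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <link rel="canonical" href="/migrateopsmanager.en/">
    <meta http-equiv="refresh" content="0; url=/migrateopsmanager.en/">
  </head>
  <body>Redirecting to <a href="/migrateopsmanager.en/">/migrateopsmanager.en/</a></body>
</html>
```

The canonical link tells search engines which URL should carry the ranking, while the refresh keeps old incoming links working for readers.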

We have a costly SQL Server database with bad performance. Specifically, some stored procedures (joining several tables on their primary keys, each table with ~10M rows) took several minutes to execute. The execution plan showed that index seeks cost 90% of the total time. We finally found the root cause: the indexes had a very high degree of fragmentation. Since the DBA had changed many times, we needed to analyze the database schemas, table disk usage and stored procedure table dependencies. Based on these results, we cleaned up tables and stored procedures and rebuilt the indexes to improve the DB performance. Here are the queries to accomplish these tasks.

Recently we wanted to deploy MongoDB Ops Manager and the MongoDB deployments in different data centers to improve disaster recovery. If they are deployed in the same data center and it unfortunately fails, you cannot restore the backup data to a new cluster, as both Ops Manager and the deployments are unavailable.

Of course, we don’t want to re-deploy the existing MongoDB deployments in Kubernetes. But how can we make the deployments send data to the new Ops Manager URL?
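With the Kubernetes operator, each deployment links to Ops Manager through a project ConfigMap, so pointing the deployments at the new Ops Manager comes down to updating its baseUrl. A sketch, with all names and the URL as placeholders:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-project
  namespace: mongodb
data:
  baseUrl: https://new-ops-manager.example.com:8080
  projectName: my-project
  orgId: ""
```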

Using MongoDB in .NET is easy. However, there are two ways to manipulate documents in C# code: raw BSON documents or strongly-typed documents. In this article, we will compare the two by example. Basically, a strongly-typed collection is preferred unless you have a strong reason to use weakly-typed documents (different types in the same collection?).

BsonDocument CRUD

The MongoDB C# Driver official document provides examples in this style. I guess the reason is that MongoDB is schemaless and the driver would like to demonstrate how to access documents without a schema. Actually, NoSQL doesn’t mean no SQL but stands for not only SQL. Creating a schema for a collection is still recommended because it makes it easier to access documents and use indexes.

MongoDB transactions are a nice feature. Although MongoDB uses optimistic concurrency control, write conflicts are unavoidable. The situation becomes worse in multi-document transactions, which modify many documents at once. If a write conflict happens, a MongoCommandException will be thrown:

Exception: Command update failed: Encountered error from mongodb.svc.cluster.local:27017 during a transaction :: caused by :: WriteConflict error: this operation conflicted with another operation. Please retry your operation or multi-document transaction..
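As the error message itself suggests, a common mitigation is to retry the whole transaction when a write conflict occurs. A minimal retry sketch (in Python for brevity; the exception class and back-off values are illustrative, not the driver's actual API):

```python
import random
import time

class WriteConflictError(Exception):
    """Stand-in for the driver's write-conflict exception (illustrative)."""

def run_transaction_with_retry(txn_func, max_retries=5, base_delay=0.05):
    """Run a transactional callback, retrying on write conflicts
    with exponential back-off and jitter to de-correlate retries."""
    for attempt in range(max_retries):
        try:
            return txn_func()
        except WriteConflictError:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt) * random.random())
```

The callback must be idempotent up to the transaction commit, since everything before the conflict is rolled back and re-executed.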

Today I tried to import an existing MongoDB deployment (outside the Kubernetes cluster) into a MongoDB Ops Manager running in Kubernetes. After installing the MongoDB Agent on the deployment, only the automation functionality works, while monitoring and backup do not. The root cause is that the agent still tries to post data to Ops Manager’s internal endpoint.

The latest MongoDB Agent is a single binary that contains all three functions: Automation, Monitoring, and Backup. So theoretically, installing and configuring the all-in-one agent on the deployment VM should be enough.

I have a sharded cluster (2 shards with 3 mongods each; 3 config servers; 2 mongoses) which is deployed by MongoDB Ops Manager.

Last week, one of the shard hosts’ status was shown as a grey diamond (hover text: “Last Ping: Never”). Besides, on the Ops Manager’s server page, one server had two processes (e.g. sharddb-0 and sharddb-config). However, the cluster still works well and we can see the host sharddb-0-0 (shard 0, replica 0) in the mongo shell via sh.status() and rs.status(). What’s wrong with the cluster?

When I executed MongoDB transactions in parallel, I encountered lots of MongoCommandException: code 251, codename NoSuchTransaction:

Command find failed: cannot continue txnId 4 for session 38604515-2584-45a5-a17a-5eb5d34ea6c4 - = with txnId 5. Command find failed: cannot continue txnId 4 for session 38604515-2584-45a5-a17a-5eb5d34ea6c4 - = with txnId 6. Command insert failed: cannot continue txnId 31 for session 3ed7ea61-eae1-440f-8d95-b6e066b35b69 - = with txnId 34.

Problem Analysis

I performed some tests to pinpoint the issue: