ceph-devel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Yan, Zheng" <ukernel@gmail.com>
To: ceph-devel <ceph-devel@vger.kernel.org>
Subject: CephFS optimizated for machine learning workload
Date: Wed, 15 Sep 2021 15:21:23 +0800	[thread overview]
Message-ID: <CAAM7YAkJxr8+g=kbtk8uU4BV4TAqriQ-_FqWfzJWzbpHkx+oLw@mail.gmail.com> (raw)

Following PRs are optimization we (Kuaishou) made for machine learning
workloads (randomly read billions of small files) .

[1] https://github.com/ceph/ceph/pull/39315
[2] https://github.com/ceph/ceph/pull/43126
[3] https://github.com/ceph/ceph/pull/43125

The first PR adds an option that disables dirfrag prefetch. When files
are accessed randomly, dirfrag prefetch adds lots of useless files to
cache and causes cache thrash. Performance of MDS can be dropped below
100 RPS. When dirfrag prefetch is disabled, MDS sends a getomapval
request to rados for cache missed lookup.  Single mds can handle about
6k cache missed lookup requests per second (all ssd metadata pool).

The second PR optimizes MDS performance for a large number of clients
and a large number of read-only opened files. It also can greatly
reduce mds recovery time for read-mostly wordload.

The third PR makes MDS cluster randomly distribute all dirfrags.  MDS
uses consistent hash to calculate target rank for each dirfrag.
Compared to dynamic balancer and subtree pin, metadata can be
distributed among MDSs more evenly. Besides, MDS only migrates single
dirfrag (instead of big subtree) for load balancing. So MDS has
shorter pause when doing metadata migration.  The drawbacks of this
change are:  stat(2) directory can be slow; rename(2) file to
different directory can be slow. The reason is, with random dirfrag
distribution, these operations likely involve multiple MDS.

Above three PRs are all merged into an integration branch
https://github.com/ukernel/ceph/tree/wip-mds-integration.

We (Kuaishou) have run these codes for months, 16 active MDS cluster
serve billions of small files. In file random read test, single MDS
can handle about 6k ops,  performance increases linearly with the
number of active MDS.  In file creation test (mpirun -np 160 -host
xxx:160 mdtest -F -L -w 4096 -z 2 -b 10 -I 200 -u -d ...), 16 active
MDS can serve over 100k file creation per second.

Yan, Zheng

             reply	other threads:[~2021-09-15  7:21 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-15  7:21 Yan, Zheng [this message]
2021-09-15 12:36 ` CephFS optimizated for machine learning workload Mark Nelson
2021-09-16  4:05   ` Yan, Zheng
2021-09-16 16:14     ` Mark Nelson
2021-09-17  8:56       ` Yan, Zheng
2021-10-15 10:05 ` Dan van der Ster
     [not found]   ` <CAAM7YAktCSwTORmKwvNBsPskDz8=TRmyDs6qakkmhpahtAs8qA@mail.gmail.com>
2021-10-18  7:54     ` Dan van der Ster
2021-10-18  9:42       ` Yan, Zheng
2021-09-15  7:25 Yan, Zheng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAAM7YAkJxr8+g=kbtk8uU4BV4TAqriQ-_FqWfzJWzbpHkx+oLw@mail.gmail.com' \
    --to=ukernel@gmail.com \
    --cc=ceph-devel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).