From: "Yan, Zheng" <ukernel@gmail.com>
To: Dan van der Ster <dan@vanderster.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: CephFS optimized for machine learning workload
Date: Mon, 18 Oct 2021 17:42:35 +0800
Message-ID: <CAAM7YA=Z8+CrpYhqPFMxtPzuLog1MaqXrmAsfSya1UHs3ri1MQ@mail.gmail.com>
In-Reply-To: <CABZ+qq=XBFC1tZsurdwqxop3B=z62YjczoAsVg5n2NPR_JB-rQ@mail.gmail.com>

On Mon, Oct 18, 2021 at 3:55 PM Dan van der Ster <dan@vanderster.com> wrote:
>
> On Mon, Oct 18, 2021 at 9:23 AM Yan, Zheng <ukernel@gmail.com> wrote:
> >
> >
> >
> > On Fri, Oct 15, 2021 at 6:06 PM Dan van der Ster <dan@vanderster.com> wrote:
> >>
> >> Hi Zheng,
> >>
> >> Thanks for this really nice set of PRs -- we will try them at our site
> >> in the coming weeks and try to come back with practical feedback.
> >> A few questions:
> >>
> >> 1. How many clients did you scale to, with improvements in the 2nd PR?
> >
> >
> > We have FS clusters with over 10k clients.  If you find CInode::get_caps_{issued,wanted} and/or EOpen::encode use lots of CPU, that PR should help.
>
> That's an impressive number, relevant for our possible future plans.
>
> >
> >> 2. Do these PRs improve the process of scaling up/down the number of active MDS?
> >
> >
> > What problem did you encounter?  Decreasing the number of active MDS works well (although a little slowly) in my local tests. Migrating a big subtree (after increasing active MDS) can cause slow ops; the 3rd PR solves that.
>
> Stopping has in the past taken ~30mins, with slow requests while the
> pinned subtrees are re-imported.

This case should be improved by the 3rd PR; subtree migrations become
much smoother.

>
>
> -- dan
>
>
> >
> >>
> >>
> >> Thanks!
> >>
> >> Dan
> >>
> >>
> >> On Wed, Sep 15, 2021 at 9:21 AM Yan, Zheng <ukernel@gmail.com> wrote:
> >> >
> >> > The following PRs are optimizations we (Kuaishou) made for machine
> >> > learning workloads (randomly reading billions of small files).
> >> >
> >> > [1] https://github.com/ceph/ceph/pull/39315
> >> > [2] https://github.com/ceph/ceph/pull/43126
> >> > [3] https://github.com/ceph/ceph/pull/43125
> >> >
> >> > The first PR adds an option that disables dirfrag prefetch. When
> >> > files are accessed randomly, dirfrag prefetch adds lots of useless
> >> > entries to the cache and causes cache thrashing; MDS performance can
> >> > drop below 100 requests per second. With dirfrag prefetch disabled,
> >> > the MDS instead sends a getomapval request to RADOS for each
> >> > cache-missed lookup. A single MDS can handle about 6k cache-missed
> >> > lookup requests per second (all-SSD metadata pool).
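> >> >
> >> > Purely as an illustration of the lookup path (this is not the PR's
> >> > code, and the pool name and example object/key values below are
> >> > assumptions): a cache-missed dentry can be fetched with a single
> >> > omap read against the dirfrag object, instead of listing the whole
> >> > dirfrag.
> >> >
> >> >   // sketch: read one dentry key from a dirfrag object's omap
> >> >   #include <rados/librados.hpp>
> >> >   #include <iostream>
> >> >   #include <map>
> >> >   #include <set>
> >> >
> >> >   int main() {
> >> >     librados::Rados cluster;
> >> >     cluster.init2("client.admin", "ceph", 0);
> >> >     cluster.conf_read_file(nullptr);   // default ceph.conf locations
> >> >     cluster.connect();                 // error handling omitted
> >> >
> >> >     librados::IoCtx ioctx;
> >> >     cluster.ioctx_create("cephfs_metadata", ioctx);  // assumed pool
> >> >
> >> >     // dirfrag objects are named "<ino hex>.<frag>"; head dentries
> >> >     // are stored as omap keys "<name>_head" (values are examples)
> >> >     const std::string dirfrag_oid = "10000000000.00000000";
> >> >     std::set<std::string> keys = {"somefile_head"};
> >> >
> >> >     std::map<std::string, librados::bufferlist> vals;
> >> >     int prval = 0;
> >> >     librados::ObjectReadOperation op;
> >> >     op.omap_get_vals_by_keys(keys, &vals, &prval);
> >> >
> >> >     librados::bufferlist unused;
> >> >     int r = ioctx.operate(dirfrag_oid, &op, &unused);
> >> >     std::cout << "operate=" << r << " found=" << vals.size() << "\n";
> >> >
> >> >     ioctx.close();
> >> >     cluster.shutdown();
> >> >     return 0;
> >> >   }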
> >> >
> >> > The second PR optimizes MDS performance for a large number of
> >> > clients and a large number of files opened read-only. It can also
> >> > greatly reduce MDS recovery time for read-mostly workloads.
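> >> >
> >> > To give a rough idea of the kind of cost involved (an illustration
> >> > only, not the PR's actual change): recomputing an aggregate caps
> >> > mask in the style of CInode::get_caps_issued means scanning every
> >> > client's cap entry, which gets expensive once thousands of mostly
> >> > read-only clients hold caps on the same inode; caching the aggregate
> >> > and invalidating it only when a cap changes avoids the repeated scan.
> >> >
> >> >   // illustrative sketch, not MDS code
> >> >   #include <cstdint>
> >> >   #include <unordered_map>
> >> >
> >> >   struct InodeCapsSketch {
> >> >     std::unordered_map<uint64_t, uint32_t> client_caps; // id -> caps
> >> >     mutable uint32_t cached_issued = 0;
> >> >     mutable bool cache_valid = false;
> >> >
> >> >     // O(#clients) on every call -- costly with 10k+ clients
> >> >     uint32_t caps_issued_slow() const {
> >> >       uint32_t c = 0;
> >> >       for (const auto &p : client_caps) c |= p.second;
> >> >       return c;
> >> >     }
> >> >
> >> >     // O(1) while nothing changes
> >> >     uint32_t caps_issued_cached() const {
> >> >       if (!cache_valid) {
> >> >         cached_issued = caps_issued_slow();
> >> >         cache_valid = true;
> >> >       }
> >> >       return cached_issued;
> >> >     }
> >> >
> >> >     void set_cap(uint64_t client, uint32_t caps) {
> >> >       client_caps[client] = caps;
> >> >       cache_valid = false;  // invalidate on any cap change
> >> >     }
> >> >   };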
> >> >
> >> > The third PR makes the MDS cluster distribute all dirfrags randomly.
> >> > The MDS uses consistent hashing to calculate the target rank for
> >> > each dirfrag. Compared to the dynamic balancer and subtree pinning,
> >> > metadata is distributed among MDSs more evenly. Besides, the MDS
> >> > only migrates a single dirfrag (instead of a big subtree) for load
> >> > balancing, so it pauses for a shorter time when migrating metadata.
> >> > The drawbacks of this change are that stat(2) on a directory can be
> >> > slow, and rename(2) of a file to a different directory can be slow,
> >> > because with random dirfrag distribution these operations likely
> >> > involve multiple MDSs.
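> >> >
> >> > A minimal sketch of the placement idea (names, hash function and
> >> > virtual-node count are mine, not the PR's): hash each (inode, frag)
> >> > pair onto a ring of MDS ranks, so every dirfrag has a stable target
> >> > rank and rebalancing only moves the dirfrags whose hash points are
> >> > affected.
> >> >
> >> >   // sketch of consistent-hash placement of dirfrags onto MDS ranks
> >> >   #include <cstdint>
> >> >   #include <functional>
> >> >   #include <map>
> >> >   #include <string>
> >> >
> >> >   class DirfragHashRing {
> >> >     std::map<uint64_t, int> ring;       // hash point -> mds rank
> >> >     static constexpr int vnodes = 128;  // virtual nodes per rank
> >> >
> >> >     static uint64_t hash64(const std::string &s) {
> >> >       return std::hash<std::string>{}(s); // stand-in for a stable hash
> >> >     }
> >> >
> >> >   public:
> >> >     explicit DirfragHashRing(int num_ranks) {
> >> >       for (int rank = 0; rank < num_ranks; ++rank)
> >> >         for (int v = 0; v < vnodes; ++v)
> >> >           ring[hash64("mds." + std::to_string(rank) + "#" +
> >> >                       std::to_string(v))] = rank;
> >> >     }
> >> >
> >> >     // map a dirfrag (inode number + frag id) to its target rank
> >> >     int target_rank(uint64_t ino, uint32_t frag) const {
> >> >       uint64_t h = hash64(std::to_string(ino) + "." +
> >> >                           std::to_string(frag));
> >> >       auto it = ring.lower_bound(h);    // first point clockwise of h
> >> >       if (it == ring.end()) it = ring.begin();
> >> >       return it->second;
> >> >     }
> >> >   };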
> >> >
> >> > The above three PRs are all merged into an integration branch:
> >> > https://github.com/ukernel/ceph/tree/wip-mds-integration.
> >> >
> >> > We (Kuaishou) have run this code for months; a cluster with 16
> >> > active MDS serves billions of small files. In a random file read
> >> > test, a single MDS can handle about 6k ops, and performance
> >> > increases linearly with the number of active MDS. In a file creation
> >> > test (mpirun -np 160 -host xxx:160 mdtest -F -L -w 4096 -z 2 -b 10
> >> > -I 200 -u -d ...), 16 active MDS can serve over 100k file creations
> >> > per second.
> >> >
> >> > Yan, Zheng
