ceph-devel.vger.kernel.org archive mirror
* CephFS optimized for machine learning workload
@ 2021-09-15  7:21 Yan, Zheng
  2021-09-15 12:36 ` Mark Nelson
  2021-10-15 10:05 ` Dan van der Ster
  0 siblings, 2 replies; 9+ messages in thread
From: Yan, Zheng @ 2021-09-15  7:21 UTC (permalink / raw)
  To: ceph-devel

The following PRs are optimizations we (Kuaishou) made for machine
learning workloads (random reads of billions of small files).

[1] https://github.com/ceph/ceph/pull/39315
[2] https://github.com/ceph/ceph/pull/43126
[3] https://github.com/ceph/ceph/pull/43125

The first PR adds an option that disables dirfrag prefetch.  When
files are accessed randomly, dirfrag prefetch adds lots of useless
files to the cache and causes cache thrashing; MDS performance can
drop below 100 requests per second.  When dirfrag prefetch is
disabled, the MDS sends a getomapval request to RADOS for each
cache-missed lookup.  A single MDS can handle about 6k cache-missed
lookup requests per second (with an all-SSD metadata pool).
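
To make the cost concrete: with prefetch disabled, a cache-missed
lookup becomes roughly one omap value read from the directory's
dirfrag object in the metadata pool.  The pool, object and dentry
names below are illustrative only (dirfrag objects are typically
named <ino-hex>.<frag-hex> and head dentry keys end in "_head"), not
values taken from the PR:

    # one dentry lookup ~= one omap value fetch from the dirfrag object
    rados -p cephfs_metadata getomapval 10000000000.00000000 somefile_head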

The second PR optimizes MDS performance for a large number of clients
and a large number of files opened read-only.  It can also greatly
reduce MDS recovery time for read-mostly workloads.

The third PR makes the MDS cluster distribute all dirfrags randomly.
The MDS uses consistent hashing to calculate the target rank for each
dirfrag.  Compared to the dynamic balancer and subtree pinning,
metadata is distributed among MDSs more evenly.  Besides, the MDS only
migrates a single dirfrag (instead of a big subtree) for load
balancing, so it pauses for a shorter time during metadata migration.
The drawbacks of this change are that stat(2) on a directory can be
slow and rename(2) of a file to a different directory can be slow,
because with random dirfrag distribution these operations likely
involve multiple MDSs.
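
For illustration only, the idea is in the spirit of a consistent-hash
ring.  The sketch below is not the code from PR 43125; the names, the
hash function and the virtual-node count are all made up:

    // Minimal consistent-hash sketch: dirfrag -> MDS rank (illustrative only).
    #include <cstdint>
    #include <functional>
    #include <map>
    #include <string>

    struct DirfragId {
      uint64_t ino;   // directory inode number
      uint32_t frag;  // encoded frag_t
    };

    class FragPlacement {
    public:
      // Place each rank at many points ("virtual nodes") on the ring so
      // dirfrags spread evenly and adding/removing a rank moves few frags.
      explicit FragPlacement(int num_ranks, int vnodes = 128) {
        for (int r = 0; r < num_ranks; ++r)
          for (int v = 0; v < vnodes; ++v)
            ring[hasher(std::to_string(r) + "#" + std::to_string(v))] = r;
      }

      // Target rank for a dirfrag: first ring point at or after its hash.
      int target_rank(const DirfragId& df) const {
        uint64_t h = hasher(std::to_string(df.ino) + "." + std::to_string(df.frag));
        auto it = ring.lower_bound(h);
        if (it == ring.end()) it = ring.begin();   // wrap around the ring
        return it->second;
      }

    private:
      std::map<uint64_t, int> ring;        // hash point -> MDS rank
      std::hash<std::string> hasher;
    };

With a ring like this (rather than hashing modulo the number of
ranks), changing the number of active MDSs remaps only a fraction of
the dirfrags.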

The above three PRs are all merged into an integration branch:
https://github.com/ukernel/ceph/tree/wip-mds-integration.

We (Kuaishou) have run this code for months; a cluster with 16 active
MDSs serves billions of small files.  In a random file read test, a
single MDS can handle about 6k ops, and performance increases linearly
with the number of active MDSs.  In a file creation test (mpirun -np
160 -host xxx:160 mdtest -F -L -w 4096 -z 2 -b 10 -I 200 -u -d ...),
16 active MDSs can serve over 100k file creations per second.
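
For readers who do not use mdtest, the flags in that command roughly
mean the following (per mdtest's usage text; the host list and target
directory stay site-specific):

    # -F       test files only (no directory tests)
    # -L       create items only at the leaf level of the directory tree
    # -w 4096  write 4096 bytes to each file after creating it
    # -z 2     depth of the directory tree   -b 10  branching factor
    # -I 200   items per directory           -u     unique working dir per task
    mpirun -np 160 -host <hosts> mdtest -F -L -w 4096 -z 2 -b 10 -I 200 -u -d <dir>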

Yan, Zheng

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: CephFS optimized for machine learning workload
  2021-09-15  7:21 CephFS optimized for machine learning workload Yan, Zheng
@ 2021-09-15 12:36 ` Mark Nelson
  2021-09-16  4:05   ` Yan, Zheng
  2021-10-15 10:05 ` Dan van der Ster
  1 sibling, 1 reply; 9+ messages in thread
From: Mark Nelson @ 2021-09-15 12:36 UTC (permalink / raw)
  To: Yan, Zheng, ceph-devel

Hi Zheng,


This looks great!  Have you noticed any slow performance during 
directory splitting?  One of the things I was playing around with last 
year was pre-fragmenting directories based on a user supplied hint that 
the directory would be big (falling back to normal behavior if it grows 
beyond the hint size).  That way you can create the dirfrags upfront and 
do the migration before they ever have any associated files.  Do you 
think that might be worth trying again given your PRs below?
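
To make the idea concrete, the hint could be something like a
directory vxattr -- purely hypothetical, this attribute does not exist
and the name is made up:

    # hypothetical hint: this directory will hold ~10M entries, so the MDS
    # could pre-create dirfrags and migrate them before any files land in them
    setfattr -n ceph.dir.expected_entries -v 10000000 /mnt/cephfs/dataset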


Mark


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: CephFS optimized for machine learning workload
  2021-09-15 12:36 ` Mark Nelson
@ 2021-09-16  4:05   ` Yan, Zheng
  2021-09-16 16:14     ` Mark Nelson
  0 siblings, 1 reply; 9+ messages in thread
From: Yan, Zheng @ 2021-09-16  4:05 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel

On Wed, Sep 15, 2021 at 8:36 PM Mark Nelson <mnelson@redhat.com> wrote:
>
> Hi Zheng,
>
>
> This looks great!  Have you noticed any slow performance during
> directory splitting?  One of the things I was playing around with last
> year was pre-fragmenting directories based on a user supplied hint that
> the directory would be big (falling back to normal behavior if it grows
> beyond the hint size).  That way you can create the dirfrags upfront and
> do the migration before they ever have any associated files.  Do you
> think that might be worth trying again given your PRs below?
>

These PRs do not change the directory splitting logic, so they are
unlikely to improve the mdtest hard numbers.  But they remove the
overhead of journaling the subtree map and distribute metadata more
evenly, so they should improve the mdtest easy numbers.  I think it's
worth a retest.

Yan, Zheng


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: CephFS optimized for machine learning workload
  2021-09-16  4:05   ` Yan, Zheng
@ 2021-09-16 16:14     ` Mark Nelson
  2021-09-17  8:56       ` Yan, Zheng
  0 siblings, 1 reply; 9+ messages in thread
From: Mark Nelson @ 2021-09-16 16:14 UTC (permalink / raw)
  To: Yan, Zheng, Mark Nelson; +Cc: ceph-devel, ceph-users, dev



On 9/15/21 11:05 PM, Yan, Zheng wrote:
> These PRs do not change the directory splitting logic, so they are
> unlikely to improve the mdtest hard numbers.  But they remove the
> overhead of journaling the subtree map and distribute metadata more
> evenly, so they should improve the mdtest easy numbers.  I think it's
> worth a retest.
> 
> Yan, Zheng


I was mostly thinking about:

[3] https://github.com/ceph/ceph/pull/43125

Shouldn't this allow workloads like mdtest hard where you have many 
clients performing file writes/reads/deletes inside a single directory 
(that is split into dirfrags randomly distributed across MDSes) to 
parallelize some of the work? (minus whatever needs to be synchronized 
on the authoritative mds)

We discussed some of this in the performance standup today.  From what 
I've seen the real meat of the problem still rests in the distributed 
cache, locking, and cap revocation, but it seems like anything we can do 
to reduce the overhead of dirfrag migration is a win.

Mark





^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: CephFS optimized for machine learning workload
  2021-09-16 16:14     ` Mark Nelson
@ 2021-09-17  8:56       ` Yan, Zheng
  0 siblings, 0 replies; 9+ messages in thread
From: Yan, Zheng @ 2021-09-17  8:56 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Mark Nelson, ceph-devel, ceph-users, dev

On Fri, Sep 17, 2021 at 12:14 AM Mark Nelson <mark.a.nelson@gmail.com> wrote:
>
> I was mostly thinking about:
>
> [3] https://github.com/ceph/ceph/pull/43125
>
> Shouldn't this allow workloads like mdtest hard where you have many
> clients performing file writes/reads/deletes inside a single directory
> (that is split into dirfrags randomly distributed across MDSes) to
> parallelize some of the work? (minus whatever needs to be synchronized
> on the authoritative mds)
>

The triggers for dirfrag migration in this PR are mkdir and dirfrag
fetch.  A dirfrag first needs to be split, then gets migrated.  I
don't know how often these events happen in mdtest hard, or how much
the pauses for splitting/migration affect the test result.



> We discussed some of this in the performance standup today.  From what
> I've seen the real meat of the problem still rests in the distributed
> cache, locking, and cap revocation,

For single-thread or single-MDS performance, yes.  The purpose of PR
43125 is to distribute metadata more evenly and improve aggregate
performance.

Yan, Zheng

> but it seems like anything we can do
> to reduce the overhead of dirfrag migration is a win.
>




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: CephFS optimized for machine learning workload
  2021-09-15  7:21 CephFS optimized for machine learning workload Yan, Zheng
  2021-09-15 12:36 ` Mark Nelson
@ 2021-10-15 10:05 ` Dan van der Ster
       [not found]   ` <CAAM7YAktCSwTORmKwvNBsPskDz8=TRmyDs6qakkmhpahtAs8qA@mail.gmail.com>
  1 sibling, 1 reply; 9+ messages in thread
From: Dan van der Ster @ 2021-10-15 10:05 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel

Hi Zheng,

Thanks for this really nice set of PRs -- we will try them at our site
over the next few weeks and try to come back with practical feedback.
A few questions:

1. How many clients did you scale to, with improvements in the 2nd PR?
2. Do these PRs improve the process of scaling up/down the number of active MDS?

Thanks!

Dan



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: CephFS optimized for machine learning workload
       [not found]   ` <CAAM7YAktCSwTORmKwvNBsPskDz8=TRmyDs6qakkmhpahtAs8qA@mail.gmail.com>
@ 2021-10-18  7:54     ` Dan van der Ster
  2021-10-18  9:42       ` Yan, Zheng
  0 siblings, 1 reply; 9+ messages in thread
From: Dan van der Ster @ 2021-10-18  7:54 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel

On Mon, Oct 18, 2021 at 9:23 AM Yan, Zheng <ukernel@gmail.com> wrote:
>
>
>
> On Fri, Oct 15, 2021 at 6:06 PM Dan van der Ster <dan@vanderster.com> wrote:
>>
>> Hi Zheng,
>>
>> Thanks for this really nice set of PRs -- we will try them at our site
>> over the next few weeks and try to come back with practical feedback.
>> A few questions:
>>
>> 1. How many clients did you scale to, with improvements in the 2nd PR?
>
>
> We have FS clusters with over 10k clients.  If you find that CInode::get_caps_{issued,wanted} and/or EOpen::encode use lots of CPU, that PR should help.

That's an impressive number, relevant for our possible future plans.

>
>> 2. Do these PRs improve the process of scaling up/down the number of active MDS?
>
>
> What problem did you encounter?  Decreasing the number of active MDSs works well (although a little slowly) in my local test.  Migrating a big subtree (after increasing the number of active MDSs) can cause slow ops; the 3rd PR solves that.

Stopping has in the past taken ~30mins, with slow requests while the
pinned subtrees are re-imported.

-- dan



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: CephFS optimized for machine learning workload
  2021-10-18  7:54     ` Dan van der Ster
@ 2021-10-18  9:42       ` Yan, Zheng
  0 siblings, 0 replies; 9+ messages in thread
From: Yan, Zheng @ 2021-10-18  9:42 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: ceph-devel

On Mon, Oct 18, 2021 at 3:55 PM Dan van der Ster <dan@vanderster.com> wrote:
>
> Stopping has in the past taken ~30mins, with slow requests while the
> pinned subtrees are re-imported.

This case should be improved by the 3rd PR; subtree migrations become
much smoother.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* CephFS optimized for machine learning workload
@ 2021-09-15  7:25 Yan, Zheng
  0 siblings, 0 replies; 9+ messages in thread
From: Yan, Zheng @ 2021-09-15  7:25 UTC (permalink / raw)
  To: dev, ceph-users, ceph-devel

The following PRs are optimizations we (Kuaishou) made for machine
learning workloads (random reads of billions of small files).

[1] https://github.com/ceph/ceph/pull/39315
[2] https://github.com/ceph/ceph/pull/43126
[3] https://github.com/ceph/ceph/pull/43125

The first PR adds an option that disables dirfrag prefetch.  When
files are accessed randomly, dirfrag prefetch adds lots of useless
files to the cache and causes cache thrashing; MDS performance can
drop below 100 requests per second.  When dirfrag prefetch is
disabled, the MDS sends a getomapval request to RADOS for each
cache-missed lookup.  A single MDS can handle about 6k cache-missed
lookup requests per second (with an all-SSD metadata pool).

The second PR optimizes MDS performance for a large number of clients
and a large number of files opened read-only.  It can also greatly
reduce MDS recovery time for read-mostly workloads.

The third PR makes the MDS cluster distribute all dirfrags randomly.
The MDS uses consistent hashing to calculate the target rank for each
dirfrag.  Compared to the dynamic balancer and subtree pinning,
metadata is distributed among MDSs more evenly.  Besides, the MDS only
migrates a single dirfrag (instead of a big subtree) for load
balancing, so it pauses for a shorter time during metadata migration.
The drawbacks of this change are that stat(2) on a directory can be
slow and rename(2) of a file to a different directory can be slow,
because with random dirfrag distribution these operations likely
involve multiple MDSs.

The above three PRs are all merged into an integration branch:
https://github.com/ukernel/ceph/tree/wip-mds-integration.

We (Kuaishou) have run this code for months; a cluster with 16 active
MDSs serves billions of small files.  In a random file read test, a
single MDS can handle about 6k ops, and performance increases linearly
with the number of active MDSs.  In a file creation test (mpirun -np
160 -host xxx:160 mdtest -F -L -w 4096 -z 2 -b 10 -I 200 -u -d ...),
16 active MDSs can serve over 100k file creations per second.

Yan, Zheng

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-10-18  9:42 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-15  7:21 CephFS optimized for machine learning workload Yan, Zheng
2021-09-15 12:36 ` Mark Nelson
2021-09-16  4:05   ` Yan, Zheng
2021-09-16 16:14     ` Mark Nelson
2021-09-17  8:56       ` Yan, Zheng
2021-10-15 10:05 ` Dan van der Ster
     [not found]   ` <CAAM7YAktCSwTORmKwvNBsPskDz8=TRmyDs6qakkmhpahtAs8qA@mail.gmail.com>
2021-10-18  7:54     ` Dan van der Ster
2021-10-18  9:42       ` Yan, Zheng
2021-09-15  7:25 Yan, Zheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).