* cephfs performance benchmark -- metadata intensive
@ 2016-08-11  7:29 Xiaoxi Chen
       [not found] ` <CAEYCsVKksnVV+HeNBo4YofPz4Re1CRT3jCTCE2L12NqieMVYWA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Xiaoxi Chen @ 2016-08-11  7:29 UTC (permalink / raw)
  To: Ceph Development, ceph-users, xiaoxchen, Mark Nelson, james.liu

Hi,

     Here is the slide I shared yesterday at the performance meeting.
Thanks, and hoping for your input.


http://www.slideshare.net/XiaoxiChen3/cephfs-jewel-mds-performance-benchmark


Xiaoxi


* Re: cephfs performance benchmark -- metadata intensive
       [not found] ` <CAEYCsVKksnVV+HeNBo4YofPz4Re1CRT3jCTCE2L12NqieMVYWA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-08-11 10:07   ` John Spray
       [not found]     ` <CALe9h7fqrOCGn564HLPyMS6ouOHaY0phuZ64OHnq0Vx8VJWoug-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-08-12  2:02     ` Chen, Xiaoxi
  0 siblings, 2 replies; 5+ messages in thread
From: John Spray @ 2016-08-11 10:07 UTC (permalink / raw)
  To: Xiaoxi Chen
  Cc: Ceph Development, ceph-users, james.liu-gPhfCIXyaqCqndwCJWfcng,
	Chen, Xiaoxi

On Thu, Aug 11, 2016 at 8:29 AM, Xiaoxi Chen <superdebuger-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi ,
>
>
>      Here is the slide I shared yesterday on performance meeting.
> Thanks and hoping for inputs.
>
>
> http://www.slideshare.net/XiaoxiChen3/cephfs-jewel-mds-performance-benchmark

These are definitely useful results and I encourage everyone working
with cephfs to go and look at Xiaoxi's slides.

The main thing that this highlighted for me was our lack of testing so
far on systems with full caches.  Too much of our existing testing is
done on freshly configured systems that never fill the MDS cache.

Test 2.1 notes that we don't enable directory fragmentation by default
currently -- this is an issue, and I'm hoping we can switch it on by
default in Kraken (see thread "Switching on mds_bal_frag by default").
In the meantime we have the fix that Patrick wrote for Jewel, which at
least prevents people from creating dirfrags too large for the OSDs to
handle.
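
For anyone who wants to experiment with fragmentation in the meantime,
it can be switched on explicitly.  A minimal sketch (daemon name is
illustrative, and depending on the release there may be an additional
filesystem-level flag to set, so treat this as something to verify):

    # persistent: add "mds bal frag = true" to the [mds] section of ceph.conf
    # at runtime, on a test cluster:
    ceph tell mds.0 injectargs '--mds_bal_frag true'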

Test 2.2: since a "failing to respond to cache pressure" bug is
affecting this, I would guess we see the performance fall off at about
the point where the *client* caches fill up (so they start trimming
things even though they're ignoring cache pressure).  It would be
interesting to see this chart with additional lines for some related
perf counters like mds_log.evtrm and mds.inodes_expired; that might
make it pretty obvious where the MDS is entering different stages that
see a decrease in the rate of handling client requests.
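
Those counters are easy to watch from the MDS admin socket while the
test runs, something along these lines (daemon id illustrative):

    # sample every few seconds during the create workload
    ceph daemon mds.a perf dump | grep -E '"evtrm"|"inodes_expired"'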

We really need to sort out the "failing to respond to cache pressure"
issues that keep popping up, especially if they're still happening on
a comparatively simple test that is just creating files.  We have a
specific test for this[1] that is currently being run against the fuse
client but not the kernel client[2].  This is a good time to try and
push that forward so I've kicked off an experimental run here:
http://pulpito.ceph.com/jspray-2016-08-10_16:14:52-kcephfs:recovery-master-testing-basic-mira/

In the meantime, although there are reports of similar issues with
newer kernels, it would be very useful to confirm if the same issue is
still occurring with more recent kernels.  Issues with cache trimming
have occurred due to various (separate) bugs, so it's possible that
while some people are still seeing cache trimming issues with recent
kernels, the specific case you're hitting might be fixed.

Test 2.3: restarting the MDS doesn't actually give you a completely
empty cache (everything in the journal gets replayed to pre-populate
the cache on MDS startup).  However, the results are still valid
because you're using a different random order in the non-caching test
case, and the number of inodes in your journal is probably much
smaller than the overall cache size so it's only a little bit
populated.  We don't currently have a "drop cache" command built into
the MDS but it would be pretty easy to add one for use in testing
(basically just call mds->mdcache->trim(0)).
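
For what it's worth, you can see roughly how populated the cache is
right after a restart by checking the inode counters over the admin
socket, e.g. (daemon id illustrative, counter names worth
double-checking):

    ceph daemon mds.a perf dump | grep -E '"inodes"|"inode_max"'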

As one would imagine, the non-caching case is latency-dominated when
the working set is larger than the cache, where each client is waiting
for one open to finish before proceeding to the next.  The MDS is
probably capable of handling many more operations per second, but it
would need more parallel IO operations from the clients.  When a
single client is doing opens one by one, you're potentially seeing a
full network+disk latency for each one (though in practice the OSD
read cache will be helping a lot here).  This non-caching case would
be the main argument for giving the metadata pool low latency (SSD)
storage.
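
As a rough back-of-the-envelope illustration (assuming ~1ms of
network+metadata-read latency per cache miss):

    per-client ceiling  ~= 1 / per-open latency         (1ms -> ~1,000 opens/s)
    aggregate ceiling   ~= clients x per-client ceiling (until the MDS CPU saturates)

which is why cutting the per-miss latency helps even when the MDS
itself is nowhere near saturated.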

Test 2.5: The observation that the CPU bottleneck makes using fast
storage for the metadata pool less useful (in sequential/cached cases)
is valid, although it could still be useful to isolate the metadata
OSDs (probably SSDs since not so much capacity is needed) to avoid
competing with data operations.  For random access in the non-caching
cases (2.3, 2.4) I think you would probably see an improvement from
SSDs.
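
For reference, isolating the metadata pool is mostly a CRUSH placement
exercise; a rough sketch (bucket, host and pool names are illustrative,
and the exact commands and pool property names are worth checking
against your release):

    # build an SSD-only subtree and point the metadata pool's rule at it
    ceph osd crush add-bucket ssd root
    ceph osd crush move ssd-host1 root=ssd               # repeat per SSD host
    ceph osd crush rule create-simple ssd-rule ssd host
    ceph osd pool set cephfs_metadata crush_ruleset <rule-id>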

Thanks again to the team from eBay for sharing all this.

John



1. https://github.com/ceph/ceph-qa-suite/blob/master/tasks/cephfs/test_client_limits.py#L96
2. http://tracker.ceph.com/issues/9466


>
> Xiaoxi
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: cephfs performance benchmark -- metadata intensive
       [not found]     ` <CALe9h7fqrOCGn564HLPyMS6ouOHaY0phuZ64OHnq0Vx8VJWoug-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-08-11 12:24       ` Brett Niver
  2016-08-12 10:03         ` [ceph-users] " John Spray
  0 siblings, 1 reply; 5+ messages in thread
From: Brett Niver @ 2016-08-11 12:24 UTC (permalink / raw)
  To: John Spray
  Cc: Ceph Development, ceph-users, Chen, Xiaoxi, Xiaoxi Chen,
	james.liu-gPhfCIXyaqCqndwCJWfcng


Patrick and I had a related question yesterday: are we able to
dynamically vary the cache size to artificially manipulate cache
pressure?



* Re: cephfs performance benchmark -- metadata intensive
  2016-08-11 10:07   ` John Spray
       [not found]     ` <CALe9h7fqrOCGn564HLPyMS6ouOHaY0phuZ64OHnq0Vx8VJWoug-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-08-12  2:02     ` Chen, Xiaoxi
  1 sibling, 0 replies; 5+ messages in thread
From: Chen, Xiaoxi @ 2016-08-12  2:02 UTC (permalink / raw)
  To: John Spray, Xiaoxi Chen
  Cc: Ceph Development, ceph-users, Mark Nelson, james.liu


Hi John,
   Thanks for your input.  My replies are inlined. :)
Xiaoxi

On 8/11/16, 6:07 PM, "John Spray" <jspray@redhat.com> wrote:

>On Thu, Aug 11, 2016 at 8:29 AM, Xiaoxi Chen <superdebuger@gmail.com> wrote:
>> Hi ,
>>
>>
>>      Here is the slide I shared yesterday on performance meeting.
>> Thanks and hoping for inputs.
>>
>>
>> http://www.slideshare.net/XiaoxiChen3/cephfs-jewel-mds-performance-benchmark
>
>These are definitely useful results and I encourage everyone working
>with cephfs to go and look at Xiaoxi's slides.
>
>The main thing that this highlighted for me was our lack of testing so
>far on systems with full caches.  Too much of our existing testing is
>done on freshly configured systems that never fill the MDS cache.
>
>Test 2.1 notes that we don't enable directory fragmentation by default
>currently -- this is an issue, and I'm hoping we can switch it on by
>default in Kraken (see thread "Switching on mds_bal_frag by default").
>In the meantime we have the fix that Patrick wrote for Jewel which at
>least prevents people creating dirfrags too large for the OSDs to
>handle.
>
>Test 2.2: since a "failing to respond to cache pressure" bug is
>affecting this, I would guess we see the performance fall off at about
>the point where the *client* caches fill up (so they start trimming
>things even though they're ignore cache pressure).  It would be
>interesting to see this chart with addition lines for some related
>perf counters like mds_log.evtrm and mds.inodes_expired, that might
>make it pretty obvious where the MDS is entering different stages that
>see a decrease in the rate of handling client requests.
>
>We really need to sort out the "failing to respond to cache pressure"
>issues that keep popping up, especially if they're still happening on
>a comparatively simple test that is just creating files.  We have a
>specific test for this[1] that is currently being run against the fuse
>client but not the kernel client[2].  This is a good time to try and
>push that forward so I've kicked off an experimental run here:
>http://pulpito.ceph.com/jspray-2016-08-10_16:14:52-kcephfs:recovery-master-testing-basic-mira/
>
>In the meantime, although there are reports of similar issues with
>newer kernels, it would be very useful to confirm if the same issue is
>still occurring with more recent kernels.  Issues with cache trimming
>have occurred due to various (separate) bugs, so it's possible that
>while some people are still seeing cache trimming issues with recent
>kernels, the specific case you're hitting might be fixed.

AFAIK RHEL 7.2 will backport most (all?) of the cephfs/krbd fixes from
newer kernels.  If that is the case, we would like to try 7.2, as Ubuntu
16.04 is still not fully integrated with our OpenStack environment yet.
>
>Test 2.3: restarting the MDS doesn't actually give you a completely
>empty cache (everything in the journal gets replayed to pre-populate
>the cache on MDS startup).  However, the results are still valid
>because you're using a different random order in the non-caching test
>case, and the number of inodes in your journal is probably much
>smaller than the overall cache size so it's only a little bit
>populated.  We don't currently have a "drop cache" command built into
>the MDS but it would be pretty easy to add one for use in testing
>(basically just call mds->mdcache->trim(0)).
>
>As one would imagine, the non-caching case is latency-dominated when
>the working set is larger than the cache, where each client is waiting
>for one open to finish before proceeding to the next.  The MDS is
>probably capable of handling many more operations per second, but it
>would need more parallel IO operations from the clients.  When a
>single client is doing opens one by one, you're potentially seeing a
>full network+disk latency for each one (though in practice the OSD
>read cache will be helping a lot here).  This non-caching case would
>be the main argument for giving the metadata pool low latency (SSD)
>storage.

I guess amplification might be the issue: in the MDS debug log I can see
lots of cache inserts/evictions.  Because the working set is random, the
parent dir is likely not in the cache, so opening a file needs to load
the dentries of the parent dir, which means loading (and evicting) 4096
dentries?

And as I never drop the page cache on the OSD side (we do this
deliberately to make the OSDs as fast as possible and never the
bottleneck), most of the IO should be served from the page cache (the
iostat output on the OSD side proves this; it is pretty idle).
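
A cold-cache comparison run would be easy to collect too, roughly (run
on each OSD node before the test):

    sync; echo 3 > /proc/sys/vm/drop_caches   # drop the clean page cache
    iostat -x 5                               # see whether the data disks become the bottleneck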
>
>Test 2.5: The observation that the CPU bottleneck makes using fast
>storage for the metadata pool less useful (in sequential/cached cases)
>is valid, although it could still be useful to isolate the metadata
>OSDs (probably SSDs since not so much capacity is needed) to avoid
>competing with data operations.  For random access in the non-caching
>cases (2.3, 2.4) I think you would probably see an improvement from
>SSDs.

Is there any chance we could break mds_lock into smaller locks?  Going
multi-threaded could be a big win.
>
>Thanks again to the team from ebay for sharing all this.
>
>John
>
>
>
>1. https://github.com/ceph/ceph-qa-suite/blob/master/tasks/cephfs/test_client_limits.py#L96
>2. http://tracker.ceph.com/issues/9466
>
>
>>
>> Xiaoxi
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [ceph-users] cephfs performance benchmark -- metadata intensive
  2016-08-11 12:24       ` Brett Niver
@ 2016-08-12 10:03         ` John Spray
  0 siblings, 0 replies; 5+ messages in thread
From: John Spray @ 2016-08-12 10:03 UTC (permalink / raw)
  To: Brett Niver
  Cc: Xiaoxi Chen, Ceph Development, ceph-users, james.liu, Chen, Xiaoxi

On Thu, Aug 11, 2016 at 1:24 PM, Brett Niver <bniver@redhat.com> wrote:
> Patrick and I had a related question yesterday, are we able to dynamically
> vary cache size to artificially manipulate cache pressure?

Yes -- at the top of MDCache::trim the max size is read straight out
of g_conf, so it should pick up any changes you make with "tell
injectargs".  Things might be a little bit funny, though, because the
new cache limit wouldn't be reflected in the logic in lru_adjust().
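
So, for example, something like this should tighten the limit on a
running MDS (daemon name and value illustrative; mds_cache_size is an
inode count, 100000 by default if I remember right):

    ceph tell mds.0 injectargs '--mds_cache_size 50000'
    # or over the admin socket on the MDS host:
    ceph daemon mds.a config set mds_cache_size 50000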

John




Thread overview: 5+ messages
2016-08-11  7:29 cephfs performance benchmark -- metadata intensive Xiaoxi Chen
     [not found] ` <CAEYCsVKksnVV+HeNBo4YofPz4Re1CRT3jCTCE2L12NqieMVYWA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-08-11 10:07   ` John Spray
     [not found]     ` <CALe9h7fqrOCGn564HLPyMS6ouOHaY0phuZ64OHnq0Vx8VJWoug-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-08-11 12:24       ` Brett Niver
2016-08-12 10:03         ` [ceph-users] " John Spray
2016-08-12  2:02     ` Chen, Xiaoxi
