* rados benchmark, even distribution vs. random distribution
@ 2016-01-26 19:11 Deneau, Tom
  2016-01-26 19:33 ` [Cbt] " Gregory Farnum
  0 siblings, 1 reply; 8+ messages in thread
From: Deneau, Tom @ 2016-01-26 19:11 UTC (permalink / raw)
  To: ceph-devel, cbt

Looking for some help with a Ceph performance puzzler...

Background:
I had been experimenting with running rados benchmarks with a more
controlled distribution of objects across osds.  The main reason was
to reduce run-to-run variability, since I was running on fairly small
clusters.

I modified rados bench itself to optionally take a file with a list of
names and use those instead of the usual randomly generated names that
rados bench uses ("benchmark_data_hostname_pid_objectxxx").  It was
then possible to generate and use a list of names that hashed to an
"even distribution" across osds if desired.  The list of names was
generated by finding a set of pgs that gave an even distribution and
then generating names that mapped to those pgs.  So, for example, with
5 osds on a replicated=2 pool we might find 5 pgs mapping to [0,2]
[1,4] [2,3] [3,1] [4,0], and then all the generated names would map to
only those 5 pgs.
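
For reference, a minimal sketch of that name-generation step (this is
not the actual rados bench modification): it brute-forces candidate
names through the real "ceph osd map <pool> <name>" command and keeps
the ones that land in a chosen PG set.  The pool name, target PGs, and
output file below are made-up placeholders.

    # Sketch only: collect object names whose PG is in a chosen set,
    # using "ceph osd map" to look up each candidate's placement.
    import json
    import subprocess

    POOL = "testpool"                    # hypothetical pool name
    TARGET_PGS = {"1.0", "1.3", "1.7"}   # hypothetical PGs picked for an even spread
    WANTED = 100                         # number of names to collect

    def pg_of(name):
        out = subprocess.check_output(
            ["ceph", "osd", "map", POOL, name, "-f", "json"])
        return json.loads(out.decode("utf-8"))["pgid"]

    names, i = [], 0
    while len(names) < WANTED:
        candidate = "evenobj%06d" % i
        if pg_of(candidate) in TARGET_PGS:
            names.append(candidate)
        i += 1

    # file handed to the (locally modified) rados bench
    with open("object_names.txt", "w") as f:
        f.write("\n".join(names) + "\n")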

Up until recently, the results from these experiments were what I expected:
   * better repeatability across runs
   * even disk utilization
   * generally higher total bandwidth

Recently, however, I saw results on one platform that had much lower
bandwidth (less than half) with the "even distribution" run compared
to the default random-distribution run.  Some notes:

   * It showed up on an erasure-coded pool with k=2, m=1, using 4M
     objects.  The discrepancy did not appear on a replicated (size 2)
     pool on the same cluster.

   * In general, larger objects showed the problem more than smaller
     objects.

   * It showed up only on reads from the erasure-coded pool.  Writes to
     the same pool had higher bandwidth with the "even distribution".

   * It showed up on only one cluster; a separate cluster with the same
     number of nodes and disks but a different system architecture did
     not show it.

   * It showed up only when the total number of client threads got "high
     enough".  For example, it showed up with 64 total client threads
     but not with 16.  The distribution of threads across client
     processes did not seem to matter.

I tried looking at "dump_historic_ops" and did indeed see some read
ops logged with high latency in the "even distribution" case.  The
larger elapsed times in the historic ops were always in the
"reached_pg" and "done" steps.  But I saw similarly high latencies and
elapsed times for "reached_pg" and "done" in the historic read ops of
the random case.
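
As an illustration, a rough sketch of pulling those historic ops off
one OSD's admin socket and sorting them by duration.  "ceph daemon
osd.N dump_historic_ops" is the real admin-socket command (run it on
the host where the OSD lives), but the JSON key names vary a little
between releases, so the lookups below are best-effort and osd.0 is
just an example.

    # Sketch: list the slowest recent ops recorded by one OSD.
    import json
    import subprocess

    OSD_ID = 0  # example OSD id

    raw = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % OSD_ID, "dump_historic_ops"])
    data = json.loads(raw.decode("utf-8"))

    # key name differs between releases ("ops" vs "Ops")
    ops = data.get("ops") or data.get("Ops") or []
    ops.sort(key=lambda op: float(op.get("duration", 0)), reverse=True)

    for op in ops[:10]:
        print("%10.6fs  %s" % (float(op.get("duration", 0)),
                               op.get("description", "?")))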

I have perf counters from before and after the read tests.  I see big
differences in op_r_out_bytes, which makes sense because the higher
bandwidth run processed more bytes.  For some osds, op_r_latency/sum is
slightly higher in the "even" run, but I am not sure whether that is
significant.
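
A sketch of that before/after comparison, assuming the usual "osd"
section of "perf dump" with op_r_out_bytes and an op_r_latency
{sum, avgcount} pair (the counter layout can differ slightly between
releases); the OSD ids are placeholders.

    # Sketch: snapshot per-OSD read counters around a test and report deltas.
    import json
    import subprocess

    def osd_counters(osd_id):
        raw = subprocess.check_output(
            ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
        return json.loads(raw.decode("utf-8"))["osd"]

    def report(osd_id, before, after):
        dbytes = after["op_r_out_bytes"] - before["op_r_out_bytes"]
        dsum = after["op_r_latency"]["sum"] - before["op_r_latency"]["sum"]
        dcnt = after["op_r_latency"]["avgcount"] - before["op_r_latency"]["avgcount"]
        avg = float(dsum) / dcnt if dcnt else 0.0
        print("osd.%d  read bytes: %d  reads: %d  avg read latency: %.4fs"
              % (osd_id, dbytes, dcnt, avg))

    # usage sketch (run on each OSD host, or via a wrapper that can reach them):
    #   before = {i: osd_counters(i) for i in range(5)}
    #   ... run the rados bench read test ...
    #   after = {i: osd_counters(i) for i in range(5)}
    #   for i in sorted(before):
    #       report(i, before[i], after[i])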

Anyway, I will probably just stop doing these "even distribution" runs,
but I was hoping to get an understanding of why they might have such
reduced bandwidth in this particular case.  Is there something about
mapping to a smaller number of pgs that becomes a bottleneck?


-- Tom Deneau



* Re: [Cbt] rados benchmark, even distribution vs. random distribution
  2016-01-26 19:11 rados benchmark, even distribution vs. random distribution Deneau, Tom
@ 2016-01-26 19:33 ` Gregory Farnum
  2016-01-26 20:14   ` Deneau, Tom
  2016-01-27 17:01   ` Deneau, Tom
  0 siblings, 2 replies; 8+ messages in thread
From: Gregory Farnum @ 2016-01-26 19:33 UTC (permalink / raw)
  To: Deneau, Tom; +Cc: ceph-devel, cbt

On Tue, Jan 26, 2016 at 11:11 AM, Deneau, Tom <tom.deneau@amd.com> wrote:
> Looking for some help with a Ceph performance puzzler...
>
> Background:
> I had been experimenting with running rados benchmarks with a more
> controlled distribution of objects across osds.  The main reason was
> to reduce run to run variability since I was running on fairly small
> clusters.
>
> I modified rados bench itself to optionally take a file with a list of
> names and use those instead of the usual randomly generated names that
> rados bench uses, "benchmark_data_hostname_pid_objectxxx".  It was
> then possible to generate and use a list of names which hashed to an
> "even distribution" across osds if desired.  The list of names was
> generated by finding a set of pgs that gave even distribution and then
> generating names that mapped to those pgs.  So for example with 5 osds
> on a replicated=2 pool we might find 5 pgs mapping to [0,2] [1,4]
> [2,3] [3,1] [4,0] and then all the names generated would map to only
> those 5 pgs.
>
> Up until recently, the results from these experiments were what I expected,
>    * better repeatability across runs
>    * even disk utilization.
>    * generally higher total Bandwidth
>
> Recently however I saw results on one platform that had much lower
> bandwidth (less than half) with the "even distribution" run compared
> to the default random distribution run.  Some notes were:
>
>    * showed up in erasure coded pool with k=2, m=1, with 4M size
>      objects.  Did not show the discrepancy on a replicated 2 pool on
>      the same cluster.
>
>    * In general, larger objects showed the problem more than smaller
>      objects.
>
>    * showed up only on reads from the erasure pool.  Writes to the
>      same pool had higher bandwidth with the "even distribution"
>
>    * showed up on only one cluster, a separate cluster with the same
>      number of nodes and disks but different system architecture did
>      not show this.
>
>    * showed up only when the total number of client threads got "high
>      enough".  For example showed up with 64 total client threads but
>      not with 16.  The distribution of threads across client processes
>      did not seem to matter.
>
> I tried looking at "dump_historic_ops" and did indeed see some read
> ops logged with high latency in the "even distribution" case.  The
> larger elapsed times in the historic ops were always in the
> "reached_pg" and "done" steps.  But I saw similar high latencies and
> elapsed times for "reached_pg" and "done" for historic read ops in the
> random case.
>
> I have perf counters before and after the read tests.  I see big
> differences in the op_r_out_bytes which makes sense because the higher
> bw run processed more bytes.  For some osds, op_r_latency/sum is
> slightly higher in the "even" run but not sure if this is significant.
>
> Anyway, I will probably just stop doing these "even distribution" runs
> but I was hoping to get an understanding of why they might have such
> reduced bandwidth in this particular case.  Is there something about mapping
> to a smaller number of pgs that becomes a bottleneck?

There's a lot of per-pg locking and pipelining that happens within the
OSD process. If you're mapping to only a single PG per OSD, then
you're basically forcing it to run single-threaded and to only handle
one read at a time. If you want to force an even distribution of
operations across OSDs, you'll need to calculate names for enough PGs
to exceed the sharding counts you're using in order to avoid
"artificial" bottlenecks.
-Greg
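
For reference, a small sketch of reading the shard count Greg mentions
(osd_op_num_shards) off the admin socket and turning it into a rough
floor on how many PGs an "even" name list should touch; osd.0 and the
5-OSD figure are placeholders from the example earlier in the thread.

    # Sketch: how many PGs per OSD are needed to keep every op-queue shard busy.
    import json
    import subprocess

    OSD_ID = 0
    NUM_OSDS = 5

    raw = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % OSD_ID,
         "config", "get", "osd_op_num_shards"])
    shards = int(json.loads(raw.decode("utf-8"))["osd_op_num_shards"])

    print("op queue shards per OSD: %d" % shards)
    print("to avoid an artificial single-shard bottleneck, target more than")
    print("%d PGs per OSD, i.e. roughly %d+ PGs in the name list"
          % (shards, shards * NUM_OSDS))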


* RE: [Cbt] rados benchmark, even distribution vs. random distribution
  2016-01-26 19:33 ` [Cbt] " Gregory Farnum
@ 2016-01-26 20:14   ` Deneau, Tom
  2016-01-27 17:01   ` Deneau, Tom
  1 sibling, 0 replies; 8+ messages in thread
From: Deneau, Tom @ 2016-01-26 20:14 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, cbt


> -----Original Message-----
> From: Gregory Farnum [mailto:gfarnum@redhat.com]
> Sent: Tuesday, January 26, 2016 1:33 PM
> To: Deneau, Tom
> Cc: ceph-devel@vger.kernel.org; cbt@lists.ceph.com
> Subject: Re: [Cbt] rados benchmark, even distribution vs. random
> distribution
> 
> On Tue, Jan 26, 2016 at 11:11 AM, Deneau, Tom <tom.deneau@amd.com> wrote:
> > Looking for some help with a Ceph performance puzzler...
> >
> > Background:
> > I had been experimenting with running rados benchmarks with a more
> > controlled distribution of objects across osds.  The main reason was
> > to reduce run to run variability since I was running on fairly small
> > clusters.
> >
> > I modified rados bench itself to optionally take a file with a list of
> > names and use those instead of the usual randomly generated names that
> > rados bench uses, "benchmark_data_hostname_pid_objectxxx".  It was
> > then possible to generate and use a list of names which hashed to an
> > "even distribution" across osds if desired.  The list of names was
> > generated by finding a set of pgs that gave even distribution and then
> > generating names that mapped to those pgs.  So for example with 5 osds
> > on a replicated=2 pool we might find 5 pgs mapping to [0,2] [1,4]
> > [2,3] [3,1] [4,0] and then all the names generated would map to only
> > those 5 pgs.
> >
> > Up until recently, the results from these experiments were what I
> expected,
> >    * better repeatability across runs
> >    * even disk utilization.
> >    * generally higher total Bandwidth
> >
> > Recently however I saw results on one platform that had much lower
> > bandwidth (less than half) with the "even distribution" run compared
> > to the default random distribution run.  Some notes were:
> >
> >    * showed up in erasure coded pool with k=2, m=1, with 4M size
> >      objects.  Did not show the discrepancy on a replicated 2 pool on
> >      the same cluster.
> >
> >    * In general, larger objects showed the problem more than smaller
> >      objects.
> >
> >    * showed up only on reads from the erasure pool.  Writes to the
> >      same pool had higher bandwidth with the "even distribution"
> >
> >    * showed up on only one cluster, a separate cluster with the same
> >      number of nodes and disks but different system architecture did
> >      not show this.
> >
> >    * showed up only when the total number of client threads got "high
> >      enough".  For example showed up with 64 total client threads but
> >      not with 16.  The distribution of threads across client processes
> >      did not seem to matter.
> >
> > I tried looking at "dump_historic_ops" and did indeed see some read
> > ops logged with high latency in the "even distribution" case.  The
> > larger elapsed times in the historic ops were always in the
> > "reached_pg" and "done" steps.  But I saw similar high latencies and
> > elapsed times for "reached_pg" and "done" for historic read ops in the
> > random case.
> >
> > I have perf counters before and after the read tests.  I see big
> > differences in the op_r_out_bytes which makes sense because the higher
> > bw run processed more bytes.  For some osds, op_r_latency/sum is
> > slightly higher in the "even" run but not sure if this is significant.
> >
> > Anyway, I will probably just stop doing these "even distribution" runs
> > but I was hoping to get an understanding of why they might have such
> > reduced bandwidth in this particular case.  Is there something about
> > mapping to a smaller number of pgs that becomes a bottleneck?
> 
> There's a lot of per-pg locking and pipelining that happens within the OSD
> process. If you're mapping to only a single PG per OSD, then you're
> basically forcing it to run single-threaded and to only handle one read at
> a time. If you want to force an even distribution of operations across
> OSDs, you'll need to calculate names for enough PGs to exceed the sharding
> counts you're using in order to avoid "artificial" bottlenecks.
> -Greg

I see.  So I guess a pool that has "too small" a number of pgs would have the same problem...

It was curious that the single-PG-per-OSD mapping only affected a limited
number of test configurations, but maybe other bottlenecks were dominating
in the other configurations.
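
As a rough check of that point, the usual pg_num rule of thumb (about
100 PGs per OSD divided by the replica count, or by k+m for an
erasure-coded pool, rounded up to a power of two) already lands far
above one PG per OSD; a quick sketch with the numbers from this thread:

    # Sketch: back-of-the-envelope pg_num for the 5-OSD, k=2/m=1 example.
    num_osds = 5
    width = 3          # replicas, or k+m for the erasure-coded pool (2+1)

    target = num_osds * 100.0 / width
    pg_num = 1
    while pg_num < target:
        pg_num *= 2

    print("suggested pg_num: %d (vs. the 5 PGs the 'even' name list used)" % pg_num)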

-- Tom


* RE: [Cbt] rados benchmark, even distribution vs. random distribution
  2016-01-26 19:33 ` [Cbt] " Gregory Farnum
  2016-01-26 20:14   ` Deneau, Tom
@ 2016-01-27 17:01   ` Deneau, Tom
  2016-01-27 18:07     ` Gregory Farnum
  1 sibling, 1 reply; 8+ messages in thread
From: Deneau, Tom @ 2016-01-27 17:01 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, cbt


> -----Original Message-----
> From: Gregory Farnum [mailto:gfarnum@redhat.com]
> Sent: Tuesday, January 26, 2016 1:33 PM
> To: Deneau, Tom
> Cc: ceph-devel@vger.kernel.org; cbt@lists.ceph.com
> Subject: Re: [Cbt] rados benchmark, even distribution vs. random
> distribution
> 
> On Tue, Jan 26, 2016 at 11:11 AM, Deneau, Tom <tom.deneau@amd.com> wrote:
> > Looking for some help with a Ceph performance puzzler...
> >
> > Background:
> > I had been experimenting with running rados benchmarks with a more
> > controlled distribution of objects across osds.  The main reason was
> > to reduce run to run variability since I was running on fairly small
> > clusters.
> >
> > I modified rados bench itself to optionally take a file with a list of
> > names and use those instead of the usual randomly generated names that
> > rados bench uses, "benchmark_data_hostname_pid_objectxxx".  It was
> > then possible to generate and use a list of names which hashed to an
> > "even distribution" across osds if desired.  The list of names was
> > generated by finding a set of pgs that gave even distribution and then
> > generating names that mapped to those pgs.  So for example with 5 osds
> > on a replicated=2 pool we might find 5 pgs mapping to [0,2] [1,4]
> > [2,3] [3,1] [4,0] and then all the names generated would map to only
> > those 5 pgs.
> >
> > Up until recently, the results from these experiments were what I
> expected,
> >    * better repeatability across runs
> >    * even disk utilization.
> >    * generally higher total Bandwidth
> >
> > Recently however I saw results on one platform that had much lower
> > bandwidth (less than half) with the "even distribution" run compared
> > to the default random distribution run.  Some notes were:
> >
> >    * showed up in erasure coded pool with k=2, m=1, with 4M size
> >      objects.  Did not show the discrepancy on a replicated 2 pool on
> >      the same cluster.
> >
> >    * In general, larger objects showed the problem more than smaller
> >      objects.
> >
> >    * showed up only on reads from the erasure pool.  Writes to the
> >      same pool had higher bandwidth with the "even distribution"
> >
> >    * showed up on only one cluster, a separate cluster with the same
> >      number of nodes and disks but different system architecture did
> >      not show this.
> >
> >    * showed up only when the total number of client threads got "high
> >      enough".  For example showed up with 64 total client threads but
> >      not with 16.  The distribution of threads across client processes
> >      did not seem to matter.
> >
> > I tried looking at "dump_historic_ops" and did indeed see some read
> > ops logged with high latency in the "even distribution" case.  The
> > larger elapsed times in the historic ops were always in the
> > "reached_pg" and "done" steps.  But I saw similar high latencies and
> > elapsed times for "reached_pg" and "done" for historic read ops in the
> > random case.
> >
> > I have perf counters before and after the read tests.  I see big
> > differences in the op_r_out_bytes which makes sense because the higher
> > bw run processed more bytes.  For some osds, op_r_latency/sum is
> > slightly higher in the "even" run but not sure if this is significant.
> >
> > Anyway, I will probably just stop doing these "even distribution" runs
> > but I was hoping to get an understanding of why they might have such
> > reduced bandwidth in this particular case.  Is there something about
> > mapping to a smaller number of pgs that becomes a bottleneck?
> 
> There's a lot of per-pg locking and pipelining that happens within the OSD
> process. If you're mapping to only a single PG per OSD, then you're
> basically forcing it to run single-threaded and to only handle one read at
> a time. If you want to force an even distribution of operations across
> OSDs, you'll need to calculate names for enough PGs to exceed the sharding
> counts you're using in order to avoid "artificial" bottlenecks.
> -Greg

Greg --

Is there any performance counter which would show the
fact that we were basically single-threading in the OSDs?

-- Tom



* Re: [Cbt] rados benchmark, even distribution vs. random distribution
  2016-01-27 17:01   ` Deneau, Tom
@ 2016-01-27 18:07     ` Gregory Farnum
  2016-01-27 18:44       ` Pavan R
  2016-01-27 19:14       ` Deneau, Tom
  0 siblings, 2 replies; 8+ messages in thread
From: Gregory Farnum @ 2016-01-27 18:07 UTC (permalink / raw)
  To: Deneau, Tom; +Cc: ceph-devel, cbt

On Wed, Jan 27, 2016 at 9:01 AM, Deneau, Tom <tom.deneau@amd.com> wrote:
>
>> -----Original Message-----
>> From: Gregory Farnum [mailto:gfarnum@redhat.com]
>> Sent: Tuesday, January 26, 2016 1:33 PM
>> To: Deneau, Tom
>> Cc: ceph-devel@vger.kernel.org; cbt@lists.ceph.com
>> Subject: Re: [Cbt] rados benchmark, even distribution vs. random
>> distribution
>>
>> On Tue, Jan 26, 2016 at 11:11 AM, Deneau, Tom <tom.deneau@amd.com> wrote:
>> > Looking for some help with a Ceph performance puzzler...
>> >
>> > Background:
>> > I had been experimenting with running rados benchmarks with a more
>> > controlled distribution of objects across osds.  The main reason was
>> > to reduce run to run variability since I was running on fairly small
>> > clusters.
>> >
>> > I modified rados bench itself to optionally take a file with a list of
>> > names and use those instead of the usual randomly generated names that
>> > rados bench uses, "benchmark_data_hostname_pid_objectxxx".  It was
>> > then possible to generate and use a list of names which hashed to an
>> > "even distribution" across osds if desired.  The list of names was
>> > generated by finding a set of pgs that gave even distribution and then
>> > generating names that mapped to those pgs.  So for example with 5 osds
>> > on a replicated=2 pool we might find 5 pgs mapping to [0,2] [1,4]
>> > [2,3] [3,1] [4,0] and then all the names generated would map to only
>> > those 5 pgs.
>> >
>> > Up until recently, the results from these experiments were what I
>> expected,
>> >    * better repeatability across runs
>> >    * even disk utilization.
>> >    * generally higher total Bandwidth
>> >
>> > Recently however I saw results on one platform that had much lower
>> > bandwidth (less than half) with the "even distribution" run compared
>> > to the default random distribution run.  Some notes were:
>> >
>> >    * showed up in erasure coded pool with k=2, m=1, with 4M size
>> >      objects.  Did not show the discrepancy on a replicated 2 pool on
>> >      the same cluster.
>> >
>> >    * In general, larger objects showed the problem more than smaller
>> >      objects.
>> >
>> >    * showed up only on reads from the erasure pool.  Writes to the
>> >      same pool had higher bandwidth with the "even distribution"
>> >
>> >    * showed up on only one cluster, a separate cluster with the same
>> >      number of nodes and disks but different system architecture did
>> >      not show this.
>> >
>> >    * showed up only when the total number of client threads got "high
>> >      enough".  For example showed up with 64 total client threads but
>> >      not with 16.  The distribution of threads across client processes
>> >      did not seem to matter.
>> >
>> > I tried looking at "dump_historic_ops" and did indeed see some read
>> > ops logged with high latency in the "even distribution" case.  The
>> > larger elapsed times in the historic ops were always in the
>> > "reached_pg" and "done" steps.  But I saw similar high latencies and
>> > elapsed times for "reached_pg" and "done" for historic read ops in the
>> > random case.
>> >
>> > I have perf counters before and after the read tests.  I see big
>> > differences in the op_r_out_bytes which makes sense because the higher
>> > bw run processed more bytes.  For some osds, op_r_latency/sum is
>> > slightly higher in the "even" run but not sure if this is significant.
>> >
>> > Anyway, I will probably just stop doing these "even distribution" runs
>> > but I was hoping to get an understanding of why they might have such
>> > reduced bandwidth in this particular case.  Is there something about
>> > mapping to a smaller number of pgs that becomes a bottleneck?
>>
>> There's a lot of per-pg locking and pipelining that happens within the OSD
>> process. If you're mapping to only a single PG per OSD, then you're
>> basically forcing it to run single-threaded and to only handle one read at
>> a time. If you want to force an even distribution of operations across
>> OSDs, you'll need to calculate names for enough PGs to exceed the sharding
>> counts you're using in order to avoid "artificial" bottlenecks.
>> -Greg
>
> Greg --
>
> Is there any performance counter which would show the
> fact that we were basically single-threading in the OSDs?

I'm not aware of anything covering that. It's probably not too hard to
add counters on how many ops per shard have been performed; PRs and
tickets welcome.
-Greg


* Re: [Cbt] rados benchmark, even distribution vs. random distribution
  2016-01-27 18:07     ` Gregory Farnum
@ 2016-01-27 18:44       ` Pavan R
  2016-01-27 19:14       ` Deneau, Tom
  1 sibling, 0 replies; 8+ messages in thread
From: Pavan R @ 2016-01-27 18:44 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Deneau, Tom, ceph-devel, cbt

I recall doing something along those lines during the tcmalloc perf
issue days, when we wanted to see how evenly the shards in an OSD's
work queue were populated for purely random workloads.  I can pull that
into a production form if that helps.

Thanks,
Pavan.

On Wed, Jan 27, 2016 at 11:37 PM, Gregory Farnum <gfarnum@redhat.com> wrote:
> On Wed, Jan 27, 2016 at 9:01 AM, Deneau, Tom <tom.deneau@amd.com> wrote:
>>
>>> -----Original Message-----
>>> From: Gregory Farnum [mailto:gfarnum@redhat.com]
>>> Sent: Tuesday, January 26, 2016 1:33 PM
>>> To: Deneau, Tom
>>> Cc: ceph-devel@vger.kernel.org; cbt@lists.ceph.com
>>> Subject: Re: [Cbt] rados benchmark, even distribution vs. random
>>> distribution
>>>
>>> On Tue, Jan 26, 2016 at 11:11 AM, Deneau, Tom <tom.deneau@amd.com> wrote:
>>> > Looking for some help with a Ceph performance puzzler...
>>> >
>>> > Background:
>>> > I had been experimenting with running rados benchmarks with a more
>>> > controlled distribution of objects across osds.  The main reason was
>>> > to reduce run to run variability since I was running on fairly small
>>> > clusters.
>>> >
>>> > I modified rados bench itself to optionally take a file with a list of
>>> > names and use those instead of the usual randomly generated names that
>>> > rados bench uses, "benchmark_data_hostname_pid_objectxxx".  It was
>>> > then possible to generate and use a list of names which hashed to an
>>> > "even distribution" across osds if desired.  The list of names was
>>> > generated by finding a set of pgs that gave even distribution and then
>>> > generating names that mapped to those pgs.  So for example with 5 osds
>>> > on a replicated=2 pool we might find 5 pgs mapping to [0,2] [1,4]
>>> > [2,3] [3,1] [4,0] and then all the names generated would map to only
>>> > those 5 pgs.
>>> >
>>> > Up until recently, the results from these experiments were what I
>>> expected,
>>> >    * better repeatability across runs
>>> >    * even disk utilization.
>>> >    * generally higher total Bandwidth
>>> >
>>> > Recently however I saw results on one platform that had much lower
>>> > bandwidth (less than half) with the "even distribution" run compared
>>> > to the default random distribution run.  Some notes were:
>>> >
>>> >    * showed up in erasure coded pool with k=2, m=1, with 4M size
>>> >      objects.  Did not show the discrepancy on a replicated 2 pool on
>>> >      the same cluster.
>>> >
>>> >    * In general, larger objects showed the problem more than smaller
>>> >      objects.
>>> >
>>> >    * showed up only on reads from the erasure pool.  Writes to the
>>> >      same pool had higher bandwidth with the "even distribution"
>>> >
>>> >    * showed up on only one cluster, a separate cluster with the same
>>> >      number of nodes and disks but different system architecture did
>>> >      not show this.
>>> >
>>> >    * showed up only when the total number of client threads got "high
>>> >      enough".  For example showed up with 64 total client threads but
>>> >      not with 16.  The distribution of threads across client processes
>>> >      did not seem to matter.
>>> >
>>> > I tried looking at "dump_historic_ops" and did indeed see some read
>>> > ops logged with high latency in the "even distribution" case.  The
>>> > larger elapsed times in the historic ops were always in the
>>> > "reached_pg" and "done" steps.  But I saw similar high latencies and
>>> > elapsed times for "reached_pg" and "done" for historic read ops in the
>>> > random case.
>>> >
>>> > I have perf counters before and after the read tests.  I see big
>>> > differences in the op_r_out_bytes which makes sense because the higher
>>> > bw run processed more bytes.  For some osds, op_r_latency/sum is
>>> > slightly higher in the "even" run but not sure if this is significant.
>>> >
>>> > Anyway, I will probably just stop doing these "even distribution" runs
>>> > but I was hoping to get an understanding of why they might have such
>>> > reduced bandwidth in this particular case.  Is there something about
>>> > mapping to a smaller number of pgs that becomes a bottleneck?
>>>
>>> There's a lot of per-pg locking and pipelining that happens within the OSD
>>> process. If you're mapping to only a single PG per OSD, then you're
>>> basically forcing it to run single-threaded and to only handle one read at
>>> a time. If you want to force an even distribution of operations across
>>> OSDs, you'll need to calculate names for enough PGs to exceed the sharding
>>> counts you're using in order to avoid "artificial" bottlenecks.
>>> -Greg
>>
>> Greg --
>>
>> Is there any performance counter which would show the
>> fact that we were basically single-threading in the OSDs?
>
> I'm not aware of anything covering that. It's probably not too hard to
> add counters on how many ops per shard have been performed; PRs and
> tickets welcome.
> -Greg


* RE: [Cbt] rados benchmark, even distribution vs. random distribution
  2016-01-27 18:07     ` Gregory Farnum
  2016-01-27 18:44       ` Pavan R
@ 2016-01-27 19:14       ` Deneau, Tom
  2016-01-27 19:17         ` Gregory Farnum
  1 sibling, 1 reply; 8+ messages in thread
From: Deneau, Tom @ 2016-01-27 19:14 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, cbt



> -----Original Message-----
> From: Gregory Farnum [mailto:gfarnum@redhat.com]
> Sent: Wednesday, January 27, 2016 12:08 PM
> To: Deneau, Tom
> Cc: ceph-devel@vger.kernel.org; cbt@lists.ceph.com
> Subject: Re: [Cbt] rados benchmark, even distribution vs. random
> distribution
> 
> On Wed, Jan 27, 2016 at 9:01 AM, Deneau, Tom <tom.deneau@amd.com> wrote:
> >
> >> -----Original Message-----
> >> From: Gregory Farnum [mailto:gfarnum@redhat.com]
> >> Sent: Tuesday, January 26, 2016 1:33 PM
> >> To: Deneau, Tom
> >> Cc: ceph-devel@vger.kernel.org; cbt@lists.ceph.com
> >> Subject: Re: [Cbt] rados benchmark, even distribution vs. random
> >> distribution
> >>
> >> On Tue, Jan 26, 2016 at 11:11 AM, Deneau, Tom <tom.deneau@amd.com>
> wrote:
> >> > Looking for some help with a Ceph performance puzzler...
> >> >
> >> > Background:
> >> > I had been experimenting with running rados benchmarks with a more
> >> > controlled distribution of objects across osds.  The main reason
> >> > was to reduce run to run variability since I was running on fairly
> >> > small clusters.
> >> >
> >> > I modified rados bench itself to optionally take a file with a list
> >> > of names and use those instead of the usual randomly generated
> >> > names that rados bench uses,
> >> > "benchmark_data_hostname_pid_objectxxx".  It was then possible to
> >> > generate and use a list of names which hashed to an "even
> >> > distribution" across osds if desired.  The list of names was
> >> > generated by finding a set of pgs that gave even distribution and
> >> > then generating names that mapped to those pgs.  So for example
> >> > with 5 osds on a replicated=2 pool we might find 5 pgs mapping to
> >> > [0,2] [1,4] [2,3] [3,1] [4,0] and then all the names generated would
> map to only those 5 pgs.
> >> >
> >> > Up until recently, the results from these experiments were what I
> >> expected,
> >> >    * better repeatability across runs
> >> >    * even disk utilization.
> >> >    * generally higher total Bandwidth
> >> >
> >> > Recently however I saw results on one platform that had much lower
> >> > bandwidth (less than half) with the "even distribution" run
> >> > compared to the default random distribution run.  Some notes were:
> >> >
> >> >    * showed up in erasure coded pool with k=2, m=1, with 4M size
> >> >      objects.  Did not show the discrepancy on a replicated 2 pool on
> >> >      the same cluster.
> >> >
> >> >    * In general, larger objects showed the problem more than smaller
> >> >      objects.
> >> >
> >> >    * showed up only on reads from the erasure pool.  Writes to the
> >> >      same pool had higher bandwidth with the "even distribution"
> >> >
> >> >    * showed up on only one cluster, a separate cluster with the same
> >> >      number of nodes and disks but different system architecture did
> >> >      not show this.
> >> >
> >> >    * showed up only when the total number of client threads got "high
> >> >      enough".  For example showed up with 64 total client threads but
> >> >      not with 16.  The distribution of threads across client
> processes
> >> >      did not seem to matter.
> >> >
> >> > I tried looking at "dump_historic_ops" and did indeed see some read
> >> > ops logged with high latency in the "even distribution" case.  The
> >> > larger elapsed times in the historic ops were always in the
> >> > "reached_pg" and "done" steps.  But I saw similar high latencies
> >> > and elapsed times for "reached_pg" and "done" for historic read ops
> >> > in the random case.
> >> >
> >> > I have perf counters before and after the read tests.  I see big
> >> > differences in the op_r_out_bytes which makes sense because the
> >> > higher bw run processed more bytes.  For some osds,
> >> > op_r_latency/sum is slightly higher in the "even" run but not sure if
> this is significant.
> >> >
> >> > Anyway, I will probably just stop doing these "even distribution"
> >> > runs but I was hoping to get an understanding of why they might
> >> > have such reduced bandwidth in this particular case.  Is there
> >> > something about mapping to a smaller number of pgs that becomes a
> bottleneck?
> >>
> >> There's a lot of per-pg locking and pipelining that happens within
> >> the OSD process. If you're mapping to only a single PG per OSD, then
> >> you're basically forcing it to run single-threaded and to only handle
> >> one read at a time. If you want to force an even distribution of
> >> operations across OSDs, you'll need to calculate names for enough PGs
> >> to exceed the sharding counts you're using in order to avoid
> "artificial" bottlenecks.
> >> -Greg
> >
> > Greg --
> >
> > Is there any performance counter which would show the fact that we
> > were basically single-threading in the OSDs?
> 
> I'm not aware of anything covering that. It's probably not too hard to add
> counters on how many ops per shard have been performed; PRs and tickets
> welcome.
> -Greg

Greg --

What is the meaning of 'shard' in this context?  Would this tell us
how much parallelism was going on in the osd?

-- Tom


* Re: [Cbt] rados benchmark, even distribution vs. random distribution
  2016-01-27 19:14       ` Deneau, Tom
@ 2016-01-27 19:17         ` Gregory Farnum
  0 siblings, 0 replies; 8+ messages in thread
From: Gregory Farnum @ 2016-01-27 19:17 UTC (permalink / raw)
  To: Deneau, Tom; +Cc: ceph-devel, cbt

On Wed, Jan 27, 2016 at 11:14 AM, Deneau, Tom <tom.deneau@amd.com> wrote:
>
>
>> -----Original Message-----
>> From: Gregory Farnum [mailto:gfarnum@redhat.com]
>> Sent: Wednesday, January 27, 2016 12:08 PM
>> To: Deneau, Tom
>> Cc: ceph-devel@vger.kernel.org; cbt@lists.ceph.com
>> Subject: Re: [Cbt] rados benchmark, even distribution vs. random
>> distribution
>>
>> On Wed, Jan 27, 2016 at 9:01 AM, Deneau, Tom <tom.deneau@amd.com> wrote:
>> >
>> >> -----Original Message-----
>> >> From: Gregory Farnum [mailto:gfarnum@redhat.com]
>> >> Sent: Tuesday, January 26, 2016 1:33 PM
>> >> To: Deneau, Tom
>> >> Cc: ceph-devel@vger.kernel.org; cbt@lists.ceph.com
>> >> Subject: Re: [Cbt] rados benchmark, even distribution vs. random
>> >> distribution
>> >>
>> >> On Tue, Jan 26, 2016 at 11:11 AM, Deneau, Tom <tom.deneau@amd.com>
>> wrote:
>> >> > Looking for some help with a Ceph performance puzzler...
>> >> >
>> >> > Background:
>> >> > I had been experimenting with running rados benchmarks with a more
>> >> > controlled distribution of objects across osds.  The main reason
>> >> > was to reduce run to run variability since I was running on fairly
>> >> > small clusters.
>> >> >
>> >> > I modified rados bench itself to optionally take a file with a list
>> >> > of names and use those instead of the usual randomly generated
>> >> > names that rados bench uses,
>> >> > "benchmark_data_hostname_pid_objectxxx".  It was then possible to
>> >> > generate and use a list of names which hashed to an "even
>> >> > distribution" across osds if desired.  The list of names was
>> >> > generated by finding a set of pgs that gave even distribution and
>> >> > then generating names that mapped to those pgs.  So for example
>> >> > with 5 osds on a replicated=2 pool we might find 5 pgs mapping to
>> >> > [0,2] [1,4] [2,3] [3,1] [4,0] and then all the names generated would
>> map to only those 5 pgs.
>> >> >
>> >> > Up until recently, the results from these experiments were what I
>> >> expected,
>> >> >    * better repeatability across runs
>> >> >    * even disk utilization.
>> >> >    * generally higher total Bandwidth
>> >> >
>> >> > Recently however I saw results on one platform that had much lower
>> >> > bandwidth (less than half) with the "even distribution" run
>> >> > compared to the default random distribution run.  Some notes were:
>> >> >
>> >> >    * showed up in erasure coded pool with k=2, m=1, with 4M size
>> >> >      objects.  Did not show the discrepancy on a replicated 2 pool on
>> >> >      the same cluster.
>> >> >
>> >> >    * In general, larger objects showed the problem more than smaller
>> >> >      objects.
>> >> >
>> >> >    * showed up only on reads from the erasure pool.  Writes to the
>> >> >      same pool had higher bandwidth with the "even distribution"
>> >> >
>> >> >    * showed up on only one cluster, a separate cluster with the same
>> >> >      number of nodes and disks but different system architecture did
>> >> >      not show this.
>> >> >
>> >> >    * showed up only when the total number of client threads got "high
>> >> >      enough".  For example showed up with 64 total client threads but
>> >> >      not with 16.  The distribution of threads across client
>> processes
>> >> >      did not seem to matter.
>> >> >
>> >> > I tried looking at "dump_historic_ops" and did indeed see some read
>> >> > ops logged with high latency in the "even distribution" case.  The
>> >> > larger elapsed times in the historic ops were always in the
>> >> > "reached_pg" and "done" steps.  But I saw similar high latencies
>> >> > and elapsed times for "reached_pg" and "done" for historic read ops
>> >> > in the random case.
>> >> >
>> >> > I have perf counters before and after the read tests.  I see big
>> >> > differences in the op_r_out_bytes which makes sense because the
>> >> > higher bw run processed more bytes.  For some osds,
>> >> > op_r_latency/sum is slightly higher in the "even" run but not sure if
>> this is significant.
>> >> >
>> >> > Anyway, I will probably just stop doing these "even distribution"
>> >> > runs but I was hoping to get an understanding of why they might
>> >> > have such reduced bandwidth in this particular case.  Is there
>> >> > something about mapping to a smaller number of pgs that becomes a
>> bottleneck?
>> >>
>> >> There's a lot of per-pg locking and pipelining that happens within
>> >> the OSD process. If you're mapping to only a single PG per OSD, then
>> >> you're basically forcing it to run single-threaded and to only handle
>> >> one read at a time. If you want to force an even distribution of
>> >> operations across OSDs, you'll need to calculate names for enough PGs
>> >> to exceed the sharding counts you're using in order to avoid
>> "artificial" bottlenecks.
>> >> -Greg
>> >
>> > Greg --
>> >
>> > Is there any performance counter which would show the fact that we
>> > were basically single-threading in the OSDs?
>>
>> I'm not aware of anything covering that. It's probably not too hard to add
>> counters on how many ops per shard have been performed; PRs and tickets
>> welcome.
>> -Greg
>
> Greg --
>
> What is the meaning of 'shard' in this context?  Would this tell us
> how much parallelism was going on in the osd?

We have a "ShardedOpQueue" (or similar) in the OSD that handles all
the worker threads.  PGs are mapped to a single shard for all
processing, and while operations within a single shard might be
concurrent (e.g., a write can go to disk and leave the CPU free to
process an op on another PG within the same shard), the shard is the
unit of parallelism.  So if you've got ops within only a single shard,
you'll know you're not getting an even spread and are probably
bottlenecking on that thread.  You can do similar comparisons across
time by taking snapshots of the counters and seeing how they change,
or by introducing more complicated counters to try to directly measure
parallelism.
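
A small sketch of that snapshot-over-time idea, polling the existing
per-OSD op_r counter during a run (there is no per-shard counter at
this point, as noted above, so this only shows per-OSD spread, not
per-shard spread); the OSD ids, sample count, and interval are
placeholders.

    # Sketch: watch how read ops spread across OSDs while a benchmark runs.
    import json
    import subprocess
    import time

    OSD_IDS = range(5)   # example: osd.0 .. osd.4, reachable via local admin sockets
    SAMPLES = 6
    INTERVAL = 5         # seconds between samples

    def op_r(osd_id):
        raw = subprocess.check_output(
            ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
        return json.loads(raw.decode("utf-8"))["osd"]["op_r"]

    prev = {i: op_r(i) for i in OSD_IDS}
    for _ in range(SAMPLES):
        time.sleep(INTERVAL)
        cur = {i: op_r(i) for i in OSD_IDS}
        print("  ".join("osd.%d:%+d" % (i, cur[i] - prev[i]) for i in OSD_IDS))
        prev = cur
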
-Greg

