* CBT: New RGW getput benchmark and testing diary
@ 2017-02-06  5:55 Mark Nelson
  2017-02-06 15:33 ` Matt Benjamin
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Nelson @ 2017-02-06  5:55 UTC (permalink / raw)
  To: ceph-devel, cbt; +Cc: Mark Seger, Kyle Bader, Karan Singh, Brent Compton

[-- Attachment #1: Type: text/plain, Size: 10461 bytes --]

Hi All,

Over the weekend I took a stab at improving our ability to run RGW 
performance tests in CBT.  Previously the only way to do this was to use 
the cosbench plugin, which requires a fair amount of additional setup 
and, while quite powerful, can be overkill when you want to rapidly 
iterate over tests looking for specific issues.  A while ago Mark Seger 
from HP told me he had created a Swift benchmark called "getput" that is 
written in Python and is much more convenient to run quickly in an 
automated fashion.  Normally getput is used in conjunction with gpsuite, 
a tool for coordinating benchmarks across multiple getput processes. 
That is how you would likely use getput on a typical Ceph or Swift 
cluster, but since CBT builds the cluster and has its own way of 
launching multiple benchmark processes, it uses getput directly.

Thankfully it was fairly easy to implement a CBT getput wrapper. 
Several aspects of CBT's RGW support and user/key management were also 
improved so that the whole process of testing RGW is now completely 
automated via the CBT yaml file.  As part of testing and debugging the 
new getput wrapper, I ran through a series of benchmarks and tests to 
investigate 4MB write performance anomalies previously reported in the 
field.  I kept something of a diary while doing this and thought I would 
document it here for the community.  These were not extremely scientific 
tests, though I believe the findings are relevant and may be useful for 
folks.


Test Cluster Setup

The test cluster has 8 nodes: 4 are used for OSDs and 4 for RGW and 
clients.  Each OSD node has the option to use any combination of 6 
Seagate Constellation ES.2 7200RPM HDDs and 4 Intel 800GB P3700 NVMe 
drives.  These machines also have dual Intel Xeon E5-2650v3 CPUs and 
Intel 40GbE Ethernet adapters.  In all of these tests, 1X replication 
was used to eliminate replication as a bottleneck and put maximum 
pressure on RGW.  It should be noted that the spinning disks in this 
cluster are attached through motherboard SATA2 ports and may not perform 
as well as they would behind a dedicated SAS controller.  A copy of the 
cbt yaml configuration file used to run the tests is attached.


Some Notes

When using getput as a benchmark for RGW, it's very important to keep 
track of the total number of getput processes and the number of RGW 
threads.  If getput's runtime option is used, some processes may wait 
until RGW threads open up before they determine their runtime, leading 
to skewed results.  I believe this could be fixed in getput by changing 
the way donetime is calculated, but the issue can also be avoided by 
paying close attention to the RGW thread and getput process counts.
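
To make the skew concrete, here is a hedged sketch contrasting the two 
modes; the flags are illustrative and may not match getput's actual CLI:

   # Hypothetical flags: -c container, -o object prefix, -s size,
   # -t test type (p = PUT), -n object count, --runtime seconds.
   # Fixed work: every process PUTs the same number of objects, even
   # if it queues behind busy RGW threads:
   getput -c bench-bucket -o obj -s 4m -t p -n 1000
   # Time-bound: a process that stalls waiting for an RGW thread may
   # miscompute its effective runtime and skew the results:
   getput -c bench-bucket -o obj -s 4m -t p --runtime 60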

It's easy to create the wrong pools for RGW since the defaults changed 
in Jewel.  Now we must create default.rgw.buckets.index and 
default.rgw.buckets.data.  It wasn't until I looked at disk usage via 
ceph df that I discovered that my .rgw.buckets and .rgw.buckets.index 
pools were not being used, which invalidated some of my initial 
performance data.
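
For reference, a minimal sketch of creating the Jewel-era pool names and 
verifying they are actually used; the PG counts here are placeholders, 
not recommendations:

   ceph osd pool create default.rgw.buckets.index 2048 2048
   ceph osd pool create default.rgw.buckets.data 2048 2048
   # After a short test run, confirm the new pools (and not the old
   # .rgw.buckets* pools) are accumulating data:
   ceph df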


RGW with HDD backed OSDs (Filestore)

The first set of tests was run against a 4 node cluster configured with 
24 HDD backed OSDs.  The first thing I noticed is that the number of 
buckets and/or bucket index shards absolutely affects large sequential 
write performance on HDD backed OSDs.  Using a single bucket with no 
shards (i.e. the default) resulted in 220MB/s of write throughput, while 
rados bench was able to achieve 550-600MB/s.  Both of these numbers are 
quite low, though that is partially explained by the lack of SSD 
journals and dedicated controller hardware.

Setting the number of bucket index shards to 8 improved RGW write 
throughput to 400MB/s.  The highest throughput for this setup was 
achieved by setting either the number of buckets or the number of bucket 
index shards substantially higher than the number of OSDs; in this case, 
64 appeared to be sufficient.  Three high concurrency 4MB object RGW PUT 
tests showed 602MB/s, 557MB/s and 563MB/s, while three rados bench runs 
at similar concurrency showed 580MB/s, 580MB/s, and 564MB/s 
respectively.  Write IO from multiple clients appeared to cause a slight 
(~5%) performance drop vs write IO from a single client.  In all cases, 
tests were stopped prior to PG splitting.
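
For anyone reproducing this, one way to get sharded bucket indexes is 
the rgw_override_bucket_index_max_shards option, which applies to 
buckets created after it is set.  A sketch, with a hypothetical instance 
section name:

   # Append to ceph.conf on the RGW host, then restart radosgw:
   cat >> /etc/ceph/ceph.conf <<'EOF'
   [client.rgw.gateway1]
   rgw override bucket index max shards = 64
   EOF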

It was observed that RGW uses a lot of CPU, especially if low tcmalloc 
threadcache values are used.  With 32MB of threadcache, RGW used around 
500-600% CPU to serve 500-600MB/s of 4MB writes, occasionally spiking 
into the 1000-1200% region.  Perf showed a high percentage of time spent 
in tcmalloc servicing civetweb threads.  With 128MB of threadcache, this 
effect was greatly diminished: RGW appeared to use around 300-400% CPU 
to serve the same workload, though in these tests there was little 
performance impact since we were not CPU bound.
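
The threadcache size is controlled by an environment variable that 
tcmalloc reads at process start.  A sketch of the 128MB setting; on 
packaged systems this usually belongs in /etc/sysconfig/ceph or 
/etc/default/ceph rather than an interactive shell:

   # 128MB = 134217728 bytes; export before launching radosgw:
   export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728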


PG counts and PG splitting

Based on the bucket/shard results above, it may be inferred that an RGW 
index pool with a very low PG count could significantly hurt 
performance.  What about the case where small PG counts are used in the 
buckets.data pool?  In that case, PG splitting behavior might actually 
matter more than the clumpiness of the random distribution caused by low 
sample counts.  To test this, a buckets.index pool was created with 2048 
PGs while the buckets.data pool was created with 128 PGs.  Initial 4MB 
sequential writes with both rados bench and RGW were about 20% slower 
than what was seen with 2048 PGs in the data pool, likely due to the 
worse data distribution.

While this is significant, I was more interested in the effects of PG 
splitting in filestore.  It has been observed in the past that PG 
splitting can have a huge performance impact, especially when SELinux is 
enabled: SELinux reads a security xattr on every file access, which 
greatly slows operations like link/unlink during PG splits.  While 
SELinux is not enabled on this test cluster, PG splitting may still hurt 
performance through worse dentry and inode caching and extra kernel 
overhead.  Rados bench was used to hit the split thresholds by writing 
out a large number of 4K objects.
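
For background, filestore splits a PG's directory once it holds more 
than roughly filestore_split_multiple * abs(filestore_merge_threshold) 
* 16 objects (320 with the defaults of 2 and 10), so a small-object 
prefill like the following sketch will eventually push every PG past 
the threshold (pool name and duration are placeholders):

   # 4K objects, 128 concurrent ops; objects are left in place:
   rados bench -p default.rgw.buckets.data 600 write -b 4096 -t 128 --no-cleanup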

After approximately 1.3M objects were written, 4K write performance 
dropped by roughly an order of magnitude.  At this point rados bench and 
RGW were used to write out 4MB objects with high concurrency and high 
numbers of bucket index shards.  In both cases, performance started out 
slightly diminished but quickly climbed to near the levels observed on 
the fresh cluster.  At least in this setup, PG splitting in the data 
pool did not appear to majorly affect 4MB object writes, though it may 
have affected the 4K object writes used to pre-fill the data pool.


RGW with OSDs using HDD Data and NVMe Journals (Filestore)

Next, 24 OSDs were configured with the filestore data partitions on HDD 
and journals on NVMe.  With minimal tuning, rados bench delivered around 
1350MB/s, or about 56MB/s per drive.  This is lower per drive than other 
configurations we've tested, but roughly double what these OSDs achieved 
without NVMe journals.  Interestingly, the difference between using a 
single bucket with no shards and many buckets/shards appeared to be 
minimal: tests against a single bucket with no shards resulted in 
1000-1400MB/s, while a configuration with 128 buckets resulted in 
1200-1400MB/s over several repeated tests.  It should be noted that the 
single RGW instance was using anywhere from 1000-1600% CPU to maintain 
these numbers, even with 128MB of TCMalloc threadcache!  Again a 
performance drop was seen when using multiple clients vs a single 
client.
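
For illustration only (CBT deploys the OSDs itself), a hedged sketch of 
a filestore OSD with its journal on NVMe using the ceph-disk tool of 
that era; device names are placeholders:

   # Data on the HDD, journal partition carved from the NVMe device:
   ceph-disk prepare /dev/sdb /dev/nvme0n1
   ceph-disk activate /dev/sdb1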


RGW with NVMe data and journals (Filestore)

The test machines in this cluster have 4 Intel P3700 NVMe drives, each 
capable of about 2GB/s.  The cluster was again reconfigured, this time 
with 16 OSDs backed by the NVMe drives.  In this case, a single client 
running rados bench with 128 concurrent ops could saturate the network 
and write data at about 4600MB/s; 4 clients, however, appeared to 
saturate the OSDs and achieved 11620MB/s.  A single getput client using 
128 processes to write 4MB objects to a single bucket with no shards 
achieved 1700-1800MB/s, with radosgw CPU usage hovering around 
1500-1700%.  Using 128 buckets (1 bucket per process) yielded results 
ranging from 1700MB/s to 3500MB/s over multiple tests, with radosgw CPU 
usage topping out around 2100%.  With 4 clients, a single radosgw 
instance was able to maintain roughly 3700MB/s of writes over several 
independent tests despite using roughly the same amount of CPU as the 
single client tests.
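
For comparison, a sketch of the rados bench pattern used as the 
baseline here: 4MB writes with 128 concurrent ops, run from one client 
or from several in parallel (pool name and duration are placeholders):

   rados bench -p default.rgw.buckets.data 60 write -b 4194304 -t 128 --no-cleanup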


Testing 4 gateways

CBT can stand up multiple rados gateways and launch tests against all of 
them concurrently.  4 clients were configured to drive traffic to all 4 
rados gateways and ultimately the 16 NVMe backed OSDs.  The first tests 
run resulted in 404 key-not-found errors, apparently because multiple 
copies of getput attempted to write into the same bucket.  This was 
resolved by making sure that each copy of getput had a distinct bucket 
(the processes within a single getput copy did not require the same 
attention).  Thus, the lowest number of buckets targeted in this test 
was 16.  In this configuration, the aggregate write throughput across 
all 4 gateways was 9306MB/s.  With a bucket for every process (512 
total), the aggregate throughput increased to 9445MB/s.  That is roughly 
2361MB/s per gateway and 81% of the rados bench throughput.
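
A hedged sketch of the fix, assuming illustrative getput flags and a 
hypothetical per-client identifier: give each writer its own bucket so 
no two collide (this is the bucket-per-process variant, 512 buckets 
total across 4 clients):

   # CLIENT_ID distinguishes the 4 client hosts; flags are illustrative.
   for i in $(seq 0 127); do
       getput -c "bench-${CLIENT_ID}-${i}" -o obj -s 4m -t p -n 200 &
   done
   wait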


Next Steps and Conclusions

RGW is now roughly as easy to test in CBT as rados and rbd, which should 
make it far easier to examine bluestore performance with RGW.  I hope we 
will be able to do some filestore and bluestore comparisons shortly.  I 
will also note that I was quite happy with how well RGW handled large 
object writes in these tests: despite the high CPU overhead, I could 
achieve 3700MB/s with a single RGW instance and over 9GB/s with 4 RGW 
instances.  Even so, it's clear that performance is still quite 
dependent on bucket index update latency.  Especially on spinning disks, 
it appears to be very important to use bucket index sharding or to 
spread writes over many buckets.

Mark

[-- Attachment #2: runtests.xfs.16.rgw.yaml --]
[-- Type: application/x-yaml, Size: 2453 bytes --]


* Re: CBT: New RGW getput benchmark and testing diary
  2017-02-06  5:55 CBT: New RGW getput benchmark and testing diary Mark Nelson
@ 2017-02-06 15:33 ` Matt Benjamin
  2017-02-06 15:42   ` Mark Nelson
  0 siblings, 1 reply; 13+ messages in thread
From: Matt Benjamin @ 2017-02-06 15:33 UTC (permalink / raw)
  To: Mark Nelson
  Cc: ceph-devel, cbt, Mark Seger, Kyle Bader, Karan Singh, Brent Compton

Thanks for the detailed effort and analysis, Mark.

As we get closer to the L time-frame, it will become relevant to look at the boost::asio frontend rework I/O paths, which are the ongoing effort to reduce CPU overhead and revise the threading model in general.

Matt

----- Original Message -----
> From: "Mark Nelson" <mnelson@redhat.com>
> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader" <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent
> Compton" <bcompton@redhat.com>
> Sent: Monday, February 6, 2017 12:55:20 AM
> Subject: CBT: New RGW getput benchmark and testing diary
> 
> Hi All,
> 
> Over the weekend I took a stab at improving our ability to run RGW
> performance tests in CBT.  Previously the only way to do this was to use
> the cosbench plugin, which required a fair amount of additional
> setup and while quite powerful can be overkill in situations where you
> want to rapidly iterate over tests looking for specific issues.  A while
> ago Mark Seger from HP told me he had created a swift benchmark called
> "getput" that is written in python and is much more convenient to run
> quickly in an automated fashion.  Normally getput is used in conjunction
> with gpsuite, a tool for coordinating benchmarking multiple getput
> processes.  This is how you would likely use getput on a typical ceph or
> swift cluster, but since CBT builds the cluster and has it's own way for
> launching multiple benchmark processes, it uses getput directly.
> 


-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309


* Re: CBT: New RGW getput benchmark and testing diary
  2017-02-06 15:33 ` Matt Benjamin
@ 2017-02-06 15:42   ` Mark Nelson
  2017-02-06 15:44     ` Matt Benjamin
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Nelson @ 2017-02-06 15:42 UTC (permalink / raw)
  To: Matt Benjamin
  Cc: ceph-devel, cbt, Mark Seger, Kyle Bader, Karan Singh, Brent Compton

Just based on what I saw during these tests, it looks to me like a lot 
more time was spent dealing with civetweb's threads than in RGW proper. 
I didn't look too closely, but it may be worth looking at whether 
there's any low hanging fruit in civetweb itself.

Mark

On 02/06/2017 09:33 AM, Matt Benjamin wrote:
> Thanks for the detailed effort and analysis, Mark.
>
> As we get closer to the L time-frame, it should become relevant to look at the relative boost::asio frontend rework i/o paths, which are the open effort to reduce CPU overhead/revise threading model, in general.
>
> Matt
>
> ----- Original Message -----
>> From: "Mark Nelson" <mnelson@redhat.com>
>> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
>> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader" <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent
>> Compton" <bcompton@redhat.com>
>> Sent: Monday, February 6, 2017 12:55:20 AM
>> Subject: CBT: New RGW getput benchmark and testing diary
>>
>> Hi All,
>>
>> Over the weekend I took a stab at improving our ability to run RGW
>> performance tests in CBT.  Previously the only way to do this was to use
>> the cosbench plugin, which required a fair amount of additional
>> setup and while quite powerful can be overkill in situations where you
>> want to rapidly iterate over tests looking for specific issues.  A while
>> ago Mark Seger from HP told me he had created a swift benchmark called
>> "getput" that is written in python and is much more convenient to run
>> quickly in an automated fashion.  Normally getput is used in conjunction
>> with gpsuite, a tool for coordinating benchmarking multiple getput
>> processes.  This is how you would likely use getput on a typical ceph or
>> swift cluster, but since CBT builds the cluster and has it's own way for
>> launching multiple benchmark processes, it uses getput directly.
>>
>
>


* Re: CBT: New RGW getput benchmark and testing diary
  2017-02-06 15:42   ` Mark Nelson
@ 2017-02-06 15:44     ` Matt Benjamin
  2017-02-06 17:02       ` Orit Wasserman
  0 siblings, 1 reply; 13+ messages in thread
From: Matt Benjamin @ 2017-02-06 15:44 UTC (permalink / raw)
  To: Mark Nelson
  Cc: ceph-devel, cbt, Mark Seger, Kyle Bader, Karan Singh, Brent Compton

Keep in mind, RGW does most of its request processing work in civetweb threads, so high utilization there does not necessarily imply civetweb-internal processing.

Matt

----- Original Message -----
> From: "Mark Nelson" <mnelson@redhat.com>
> To: "Matt Benjamin" <mbenjamin@redhat.com>
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com, "Mark Seger" <mjseger@gmail.com>, "Kyle Bader"
> <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent Compton" <bcompton@redhat.com>
> Sent: Monday, February 6, 2017 10:42:04 AM
> Subject: Re: CBT: New RGW getput benchmark and testing diary
> 
> Just based on what I saw during these tests, it looks to me like a lot
> more time was spent dealing with civetweb's threads than RGW.  I didn't
> look too closely, but it may be worth looking at whether there's any low
> hanging fruit in civetweb itself.
> 
> Mark
> 
> On 02/06/2017 09:33 AM, Matt Benjamin wrote:
> > Thanks for the detailed effort and analysis, Mark.
> >
> > As we get closer to the L time-frame, it should become relevant to look at
> > the relative boost::asio frontend rework i/o paths, which are the open
> > effort to reduce CPU overhead/revise threading model, in general.
> >
> > Matt
> >
> > ----- Original Message -----
> >> From: "Mark Nelson" <mnelson@redhat.com>
> >> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
> >> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader" <kbader@redhat.com>,
> >> "Karan Singh" <karan@redhat.com>, "Brent
> >> Compton" <bcompton@redhat.com>
> >> Sent: Monday, February 6, 2017 12:55:20 AM
> >> Subject: CBT: New RGW getput benchmark and testing diary
> >>
> >> Hi All,
> >>
> >> Over the weekend I took a stab at improving our ability to run RGW
> >> performance tests in CBT.  Previously the only way to do this was to use
> >> the cosbench plugin, which required a fair amount of additional
> >> setup and while quite powerful can be overkill in situations where you
> >> want to rapidly iterate over tests looking for specific issues.  A while
> >> ago Mark Seger from HP told me he had created a swift benchmark called
> >> "getput" that is written in python and is much more convenient to run
> >> quickly in an automated fashion.  Normally getput is used in conjunction
> >> with gpsuite, a tool for coordinating benchmarking multiple getput
> >> processes.  This is how you would likely use getput on a typical ceph or
> >> swift cluster, but since CBT builds the cluster and has it's own way for
> >> launching multiple benchmark processes, it uses getput directly.
> >>
> >
> >
> 

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309


* Re: CBT: New RGW getput benchmark and testing diary
  2017-02-06 15:44     ` Matt Benjamin
@ 2017-02-06 17:02       ` Orit Wasserman
  2017-02-06 17:07         ` Mark Nelson
  0 siblings, 1 reply; 13+ messages in thread
From: Orit Wasserman @ 2017-02-06 17:02 UTC (permalink / raw)
  To: Matt Benjamin
  Cc: Mark Nelson, ceph-devel, cbt, Mark Seger, Kyle Bader,
	Karan Singh, Brent Compton

On Mon, Feb 6, 2017 at 5:44 PM, Matt Benjamin <mbenjamin@redhat.com> wrote:
> Keep in mind, RGW does most of its request processing work in civetweb threads, so high utilization there does not necessarily imply civetweb-internal processing.
>

True, but request processing is not a CPU intensive operation.
It does seem to indicate that the civetweb threading model simply
doesn't scale (we noticed this already), or maybe it points to
some locking issue. We need to run a profiler to understand what is
consuming CPU.
There may be a simple fix until we move to the asynchronous frontend.
It is worth investigating, as the CPU usage Mark is seeing is really high.

Mark,
How many concurrent requests were handled?

Orit

> Matt
>
> ----- Original Message -----
>> From: "Mark Nelson" <mnelson@redhat.com>
>> To: "Matt Benjamin" <mbenjamin@redhat.com>
>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com, "Mark Seger" <mjseger@gmail.com>, "Kyle Bader"
>> <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent Compton" <bcompton@redhat.com>
>> Sent: Monday, February 6, 2017 10:42:04 AM
>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>
>> Just based on what I saw during these tests, it looks to me like a lot
>> more time was spent dealing with civetweb's threads than RGW.  I didn't
>> look too closely, but it may be worth looking at whether there's any low
>> hanging fruit in civetweb itself.
>>
>> Mark
>>
>> On 02/06/2017 09:33 AM, Matt Benjamin wrote:
>> > Thanks for the detailed effort and analysis, Mark.
>> >
>> > As we get closer to the L time-frame, it should become relevant to look at
>> > the relative boost::asio frontend rework i/o paths, which are the open
>> > effort to reduce CPU overhead/revise threading model, in general.
>> >
>> > Matt
>> >
>> > ----- Original Message -----
>> >> From: "Mark Nelson" <mnelson@redhat.com>
>> >> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
>> >> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader" <kbader@redhat.com>,
>> >> "Karan Singh" <karan@redhat.com>, "Brent
>> >> Compton" <bcompton@redhat.com>
>> >> Sent: Monday, February 6, 2017 12:55:20 AM
>> >> Subject: CBT: New RGW getput benchmark and testing diary
>> >>
>> >> Hi All,
>> >>
>> >> Over the weekend I took a stab at improving our ability to run RGW
>> >> performance tests in CBT.  Previously the only way to do this was to use
>> >> the cosbench plugin, which required a fair amount of additional
>> >> setup and while quite powerful can be overkill in situations where you
>> >> want to rapidly iterate over tests looking for specific issues.  A while
>> >> ago Mark Seger from HP told me he had created a swift benchmark called
>> >> "getput" that is written in python and is much more convenient to run
>> >> quickly in an automated fashion.  Normally getput is used in conjunction
>> >> with gpsuite, a tool for coordinating benchmarking multiple getput
>> >> processes.  This is how you would likely use getput on a typical ceph or
>> >> swift cluster, but since CBT builds the cluster and has it's own way for
>> >> launching multiple benchmark processes, it uses getput directly.
>> >>
>> >
>> >
>>
>
> --
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel.  734-821-5101
> fax.  734-769-8938
> cel.  734-216-5309
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: CBT: New RGW getput benchmark and testing diary
  2017-02-06 17:02       ` Orit Wasserman
@ 2017-02-06 17:07         ` Mark Nelson
  2017-02-07 13:50           ` Orit Wasserman
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Nelson @ 2017-02-06 17:07 UTC (permalink / raw)
  To: Orit Wasserman, Matt Benjamin
  Cc: ceph-devel, cbt, Mark Seger, Kyle Bader, Karan Singh, Brent Compton



On 02/06/2017 11:02 AM, Orit Wasserman wrote:
> On Mon, Feb 6, 2017 at 5:44 PM, Matt Benjamin <mbenjamin@redhat.com> wrote:
>> Keep in mind, RGW does most of its request processing work in civetweb threads, so high utilization there does not necessarily imply civetweb-internal processing.
>>
>
> True but the request processing is not a CPU intensive operation.
> It does seems to indicate that the civetweb threading model simply
> doesn't scale (we already noticed it already) or maybe it can point to
> some locking issue. We need to run a profiler to understand what is
> consuming CPU.
> It maybe a simple fix until we move to asynchronous frontend.
> It worth investigating as the CPU usage mark is seeing  is really high.

The initial profiling I did definitely showed a lot of tcmalloc 
threading activity, which diminished after increasing threadcache.  This 
is quite similar to what we saw in simplemessenger with low threadcache 
values, though it is likely less true with the async messenger.  Sadly, 
a profiler like perf probably isn't going to help much with debugging 
lock contention; grabbing GDB stack traces might help, or LTTng.
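
For reference, the sampling described above can be as simple as 
attaching perf to the running gateway; perf shows where CPU time goes 
(e.g. tcmalloc frames) but not lock wait time, hence the suggestion of 
GDB stack traces or LTTng for contention:

   perf top -g -p "$(pidof radosgw)"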

>
> Mark,
> How many concurrent request were handled?

Most of the tests had 128 concurrent IOs per radosgw daemon.  The max 
thread count was increased to 512.  It was very obvious when the thread 
count was exceeded, since some getput processes would end up stalling 
and doing their writes after the others, leading to bogus performance 
data.
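
For anyone reproducing this, a sketch of raising the civetweb thread 
count via the rgw frontends option; the instance section name is 
hypothetical, and radosgw needs a restart afterwards:

   cat >> /etc/ceph/ceph.conf <<'EOF'
   [client.rgw.gateway1]
   rgw frontends = civetweb port=7480 num_threads=512
   EOF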

>
> Orit
>
>> Matt
>>
>> ----- Original Message -----
>>> From: "Mark Nelson" <mnelson@redhat.com>
>>> To: "Matt Benjamin" <mbenjamin@redhat.com>
>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com, "Mark Seger" <mjseger@gmail.com>, "Kyle Bader"
>>> <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent Compton" <bcompton@redhat.com>
>>> Sent: Monday, February 6, 2017 10:42:04 AM
>>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>>
>>> Just based on what I saw during these tests, it looks to me like a lot
>>> more time was spent dealing with civetweb's threads than RGW.  I didn't
>>> look too closely, but it may be worth looking at whether there's any low
>>> hanging fruit in civetweb itself.
>>>
>>> Mark
>>>
>>> On 02/06/2017 09:33 AM, Matt Benjamin wrote:
>>>> Thanks for the detailed effort and analysis, Mark.
>>>>
>>>> As we get closer to the L time-frame, it should become relevant to look at
>>>> the relative boost::asio frontend rework i/o paths, which are the open
>>>> effort to reduce CPU overhead/revise threading model, in general.
>>>>
>>>> Matt
>>>>
>>>> ----- Original Message -----
>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
>>>>> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader" <kbader@redhat.com>,
>>>>> "Karan Singh" <karan@redhat.com>, "Brent
>>>>> Compton" <bcompton@redhat.com>
>>>>> Sent: Monday, February 6, 2017 12:55:20 AM
>>>>> Subject: CBT: New RGW getput benchmark and testing diary
>>>>>
>>>>> Hi All,
>>>>>
>>>>> Over the weekend I took a stab at improving our ability to run RGW
>>>>> performance tests in CBT.  Previously the only way to do this was to use
>>>>> the cosbench plugin, which required a fair amount of additional
>>>>> setup and while quite powerful can be overkill in situations where you
>>>>> want to rapidly iterate over tests looking for specific issues.  A while
>>>>> ago Mark Seger from HP told me he had created a swift benchmark called
>>>>> "getput" that is written in python and is much more convenient to run
>>>>> quickly in an automated fashion.  Normally getput is used in conjunction
>>>>> with gpsuite, a tool for coordinating benchmarking multiple getput
>>>>> processes.  This is how you would likely use getput on a typical ceph or
>>>>> swift cluster, but since CBT builds the cluster and has it's own way for
>>>>> launching multiple benchmark processes, it uses getput directly.
>>>>>
>>>>
>>>>
>>>
>>
>> --
>> Matt Benjamin
>> Red Hat, Inc.
>> 315 West Huron Street, Suite 140A
>> Ann Arbor, Michigan 48103
>>
>> http://www.redhat.com/en/technologies/storage
>>
>> tel.  734-821-5101
>> fax.  734-769-8938
>> cel.  734-216-5309
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: CBT: New RGW getput benchmark and testing diary
  2017-02-06 17:07         ` Mark Nelson
@ 2017-02-07 13:50           ` Orit Wasserman
  2017-02-07 14:47             ` Mark Nelson
  0 siblings, 1 reply; 13+ messages in thread
From: Orit Wasserman @ 2017-02-07 13:50 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Matt Benjamin, ceph-devel, cbt, Mark Seger, Kyle Bader,
	Karan Singh, Brent Compton

Mark,
On what version did you run the tests?

Orit

On Mon, Feb 6, 2017 at 7:07 PM, Mark Nelson <mnelson@redhat.com> wrote:
>
>
> On 02/06/2017 11:02 AM, Orit Wasserman wrote:
>>
>> On Mon, Feb 6, 2017 at 5:44 PM, Matt Benjamin <mbenjamin@redhat.com>
>> wrote:
>>>
>>> Keep in mind, RGW does most of its request processing work in civetweb
>>> threads, so high utilization there does not necessarily imply
>>> civetweb-internal processing.
>>>
>>
>> True but the request processing is not a CPU intensive operation.
>> It does seems to indicate that the civetweb threading model simply
>> doesn't scale (we already noticed it already) or maybe it can point to
>> some locking issue. We need to run a profiler to understand what is
>> consuming CPU.
>> It maybe a simple fix until we move to asynchronous frontend.
>> It worth investigating as the CPU usage mark is seeing  is really high.
>
>
> The initial profiling I did definitely showed a lot of tcmalloc threading
> activity, which diminshed after increasing threadcache.  This is quite
> similar to what we saw in simplemessenger with low threadcache values,
> though likely is less true with async messenger.  Sadly a profiler like perf
> probably isn't going to help much with debugging lock contention.  grabbing
> GDB stack traces might help, or lttng.
>
>>
>> Mark,
>> How many concurrent request were handled?
>
>
> Most of the tests had 128 concurrent IOs per radosgw daemon.  The max thread
> count was increased to 512.  It was very obvious when exceeding the thread
> count since some getput processes will end up stalling and doing their
> writes after others, leading to bogus performance data.
>
>
>>
>> Orit
>>
>>> Matt
>>>
>>> ----- Original Message -----
>>>>
>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>> To: "Matt Benjamin" <mbenjamin@redhat.com>
>>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com, "Mark
>>>> Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>> <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent Compton"
>>>> <bcompton@redhat.com>
>>>> Sent: Monday, February 6, 2017 10:42:04 AM
>>>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>>>
>>>> Just based on what I saw during these tests, it looks to me like a lot
>>>> more time was spent dealing with civetweb's threads than RGW.  I didn't
>>>> look too closely, but it may be worth looking at whether there's any low
>>>> hanging fruit in civetweb itself.
>>>>
>>>> Mark
>>>>
>>>> On 02/06/2017 09:33 AM, Matt Benjamin wrote:
>>>>>
>>>>> Thanks for the detailed effort and analysis, Mark.
>>>>>
>>>>> As we get closer to the L time-frame, it should become relevant to look
>>>>> at
>>>>> the relative boost::asio frontend rework i/o paths, which are the open
>>>>> effort to reduce CPU overhead/revise threading model, in general.
>>>>>
>>>>> Matt
>>>>>
>>>>> ----- Original Message -----
>>>>>>
>>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>>> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
>>>>>> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>>>> <kbader@redhat.com>,
>>>>>> "Karan Singh" <karan@redhat.com>, "Brent
>>>>>> Compton" <bcompton@redhat.com>
>>>>>> Sent: Monday, February 6, 2017 12:55:20 AM
>>>>>> Subject: CBT: New RGW getput benchmark and testing diary
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> Over the weekend I took a stab at improving our ability to run RGW
>>>>>> performance tests in CBT.  Previously the only way to do this was to
>>>>>> use
>>>>>> the cosbench plugin, which required a fair amount of additional
>>>>>> setup and while quite powerful can be overkill in situations where you
>>>>>> want to rapidly iterate over tests looking for specific issues.  A
>>>>>> while
>>>>>> ago Mark Seger from HP told me he had created a swift benchmark called
>>>>>> "getput" that is written in python and is much more convenient to run
>>>>>> quickly in an automated fashion.  Normally getput is used in
>>>>>> conjunction
>>>>>> with gpsuite, a tool for coordinating benchmarking multiple getput
>>>>>> processes.  This is how you would likely use getput on a typical ceph
>>>>>> or
>>>>>> swift cluster, but since CBT builds the cluster and has it's own way
>>>>>> for
>>>>>> launching multiple benchmark processes, it uses getput directly.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Matt Benjamin
>>> Red Hat, Inc.
>>> 315 West Huron Street, Suite 140A
>>> Ann Arbor, Michigan 48103
>>>
>>> http://www.redhat.com/en/technologies/storage
>>>
>>> tel.  734-821-5101
>>> fax.  734-769-8938
>>> cel.  734-216-5309
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: CBT: New RGW getput benchmark and testing diary
  2017-02-07 13:50           ` Orit Wasserman
@ 2017-02-07 14:47             ` Mark Nelson
  2017-02-07 15:03               ` Orit Wasserman
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Nelson @ 2017-02-07 14:47 UTC (permalink / raw)
  To: Orit Wasserman
  Cc: Matt Benjamin, ceph-devel, cbt, Mark Seger, Kyle Bader,
	Karan Singh, Brent Compton

Hi Orit,

This was a pull from master over the weekend:
5bf39156d8312d65ef77822fbede73fd9454591f

Btw, I've been noticing that when bucket index sharding is used, there 
appears to be a higher likelihood that client connection attempts are 
delayed or starved out entirely under high concurrency.  I haven't 
looked at the code yet; does this match what you'd expect to happen?  I 
assume the threadpool is shared?

Mark

On 02/07/2017 07:50 AM, Orit Wasserman wrote:
> Mark,
> On what version did you run the tests?
>
> Orit
>
> On Mon, Feb 6, 2017 at 7:07 PM, Mark Nelson <mnelson@redhat.com> wrote:
>>
>>
>> On 02/06/2017 11:02 AM, Orit Wasserman wrote:
>>>
>>> On Mon, Feb 6, 2017 at 5:44 PM, Matt Benjamin <mbenjamin@redhat.com>
>>> wrote:
>>>>
>>>> Keep in mind, RGW does most of its request processing work in civetweb
>>>> threads, so high utilization there does not necessarily imply
>>>> civetweb-internal processing.
>>>>
>>>
>>> True but the request processing is not a CPU intensive operation.
>>> It does seems to indicate that the civetweb threading model simply
>>> doesn't scale (we already noticed it already) or maybe it can point to
>>> some locking issue. We need to run a profiler to understand what is
>>> consuming CPU.
>>> It maybe a simple fix until we move to asynchronous frontend.
>>> It worth investigating as the CPU usage mark is seeing  is really high.
>>
>>
>> The initial profiling I did definitely showed a lot of tcmalloc threading
>> activity, which diminshed after increasing threadcache.  This is quite
>> similar to what we saw in simplemessenger with low threadcache values,
>> though likely is less true with async messenger.  Sadly a profiler like perf
>> probably isn't going to help much with debugging lock contention.  grabbing
>> GDB stack traces might help, or lttng.
>>
>>>
>>> Mark,
>>> How many concurrent request were handled?
>>
>>
>> Most of the tests had 128 concurrent IOs per radosgw daemon.  The max thread
>> count was increased to 512.  It was very obvious when exceeding the thread
>> count since some getput processes will end up stalling and doing their
>> writes after others, leading to bogus performance data.
>>
>>
>>>
>>> Orit
>>>
>>>> Matt
>>>>
>>>> ----- Original Message -----
>>>>>
>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>> To: "Matt Benjamin" <mbenjamin@redhat.com>
>>>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com, "Mark
>>>>> Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>>> <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent Compton"
>>>>> <bcompton@redhat.com>
>>>>> Sent: Monday, February 6, 2017 10:42:04 AM
>>>>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>>>>
>>>>> Just based on what I saw during these tests, it looks to me like a lot
>>>>> more time was spent dealing with civetweb's threads than RGW.  I didn't
>>>>> look too closely, but it may be worth looking at whether there's any low
>>>>> hanging fruit in civetweb itself.
>>>>>
>>>>> Mark
>>>>>
>>>>> On 02/06/2017 09:33 AM, Matt Benjamin wrote:
>>>>>>
>>>>>> Thanks for the detailed effort and analysis, Mark.
>>>>>>
>>>>>> As we get closer to the L time-frame, it should become relevant to look
>>>>>> at
>>>>>> the relative boost::asio frontend rework i/o paths, which are the open
>>>>>> effort to reduce CPU overhead/revise threading model, in general.
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>>
>>>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>>>> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
>>>>>>> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>>>>> <kbader@redhat.com>,
>>>>>>> "Karan Singh" <karan@redhat.com>, "Brent
>>>>>>> Compton" <bcompton@redhat.com>
>>>>>>> Sent: Monday, February 6, 2017 12:55:20 AM
>>>>>>> Subject: CBT: New RGW getput benchmark and testing diary
>>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> Over the weekend I took a stab at improving our ability to run RGW
>>>>>>> performance tests in CBT.  Previously the only way to do this was to
>>>>>>> use
>>>>>>> the cosbench plugin, which required a fair amount of additional
>>>>>>> setup and while quite powerful can be overkill in situations where you
>>>>>>> want to rapidly iterate over tests looking for specific issues.  A
>>>>>>> while
>>>>>>> ago Mark Seger from HP told me he had created a swift benchmark called
>>>>>>> "getput" that is written in python and is much more convenient to run
>>>>>>> quickly in an automated fashion.  Normally getput is used in
>>>>>>> conjunction
>>>>>>> with gpsuite, a tool for coordinating benchmarking multiple getput
>>>>>>> processes.  This is how you would likely use getput on a typical ceph
>>>>>>> or
>>>>>>> swift cluster, but since CBT builds the cluster and has it's own way
>>>>>>> for
>>>>>>> launching multiple benchmark processes, it uses getput directly.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Matt Benjamin
>>>> Red Hat, Inc.
>>>> 315 West Huron Street, Suite 140A
>>>> Ann Arbor, Michigan 48103
>>>>
>>>> http://www.redhat.com/en/technologies/storage
>>>>
>>>> tel.  734-821-5101
>>>> fax.  734-769-8938
>>>> cel.  734-216-5309
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: CBT: New RGW getput benchmark and testing diary
  2017-02-07 14:47             ` Mark Nelson
@ 2017-02-07 15:03               ` Orit Wasserman
  2017-02-07 15:23                 ` Mark Nelson
  0 siblings, 1 reply; 13+ messages in thread
From: Orit Wasserman @ 2017-02-07 15:03 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Matt Benjamin, ceph-devel, cbt, Mark Seger, Kyle Bader,
	Karan Singh, Brent Compton

On Tue, Feb 7, 2017 at 4:47 PM, Mark Nelson <mnelson@redhat.com> wrote:
> Hi Orit,
>
> This was a pull from master over the weekend:
> 5bf39156d8312d65ef77822fbede73fd9454591f
>
> Btw, I've been noticing that it appears when bucket index sharding is used,
> there's a higher likelyhood that client connection attempts are delayed or
> starved out entirely under high concurrency.  I haven't looked at the code
> yet, does this match with what you'd expect to happen?  I assume the
> threadpool is shared?
>
Yes, it is shared.

> Mark
>
>
> On 02/07/2017 07:50 AM, Orit Wasserman wrote:
>>
>> Mark,
>> On what version did you run the tests?
>>
>> Orit
>>
>> On Mon, Feb 6, 2017 at 7:07 PM, Mark Nelson <mnelson@redhat.com> wrote:
>>>
>>>
>>>
>>> On 02/06/2017 11:02 AM, Orit Wasserman wrote:
>>>>
>>>>
>>>> On Mon, Feb 6, 2017 at 5:44 PM, Matt Benjamin <mbenjamin@redhat.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> Keep in mind, RGW does most of its request processing work in civetweb
>>>>> threads, so high utilization there does not necessarily imply
>>>>> civetweb-internal processing.
>>>>>
>>>>
>>>> True but the request processing is not a CPU intensive operation.
>>>> It does seems to indicate that the civetweb threading model simply
>>>> doesn't scale (we already noticed it already) or maybe it can point to
>>>> some locking issue. We need to run a profiler to understand what is
>>>> consuming CPU.
>>>> It maybe a simple fix until we move to asynchronous frontend.
>>>> It worth investigating as the CPU usage mark is seeing  is really high.
>>>
>>>
>>>
>>> The initial profiling I did definitely showed a lot of tcmalloc threading
>>> activity, which diminshed after increasing threadcache.  This is quite
>>> similar to what we saw in simplemessenger with low threadcache values,
>>> though likely is less true with async messenger.  Sadly a profiler like
>>> perf
>>> probably isn't going to help much with debugging lock contention.
>>> grabbing
>>> GDB stack traces might help, or lttng.
>>>
>>>>
>>>> Mark,
>>>> How many concurrent request were handled?
>>>
>>>
>>>
>>> Most of the tests had 128 concurrent IOs per radosgw daemon.  The max
>>> thread
>>> count was increased to 512.  It was very obvious when exceeding the
>>> thread
>>> count since some getput processes will end up stalling and doing their
>>> writes after others, leading to bogus performance data.
>>>
>>>
>>>>
>>>> Orit
>>>>
>>>>> Matt
>>>>>
>>>>> ----- Original Message -----
>>>>>>
>>>>>>
>>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>>> To: "Matt Benjamin" <mbenjamin@redhat.com>
>>>>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com,
>>>>>> "Mark
>>>>>> Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>>>> <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent Compton"
>>>>>> <bcompton@redhat.com>
>>>>>> Sent: Monday, February 6, 2017 10:42:04 AM
>>>>>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>>>>>
>>>>>> Just based on what I saw during these tests, it looks to me like a lot
>>>>>> more time was spent dealing with civetweb's threads than RGW.  I
>>>>>> didn't
>>>>>> look too closely, but it may be worth looking at whether there's any
>>>>>> low
>>>>>> hanging fruit in civetweb itself.
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> On 02/06/2017 09:33 AM, Matt Benjamin wrote:
>>>>>>>
>>>>>>>
>>>>>>> Thanks for the detailed effort and analysis, Mark.
>>>>>>>
>>>>>>> As we get closer to the L time-frame, it should become relevant to
>>>>>>> look
>>>>>>> at
>>>>>>> the relative boost::asio frontend rework i/o paths, which are the
>>>>>>> open
>>>>>>> effort to reduce CPU overhead/revise threading model, in general.
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>>
>>>>>>>>
>>>>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>>>>> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
>>>>>>>> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>>>>>> <kbader@redhat.com>,
>>>>>>>> "Karan Singh" <karan@redhat.com>, "Brent
>>>>>>>> Compton" <bcompton@redhat.com>
>>>>>>>> Sent: Monday, February 6, 2017 12:55:20 AM
>>>>>>>> Subject: CBT: New RGW getput benchmark and testing diary
>>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Over the weekend I took a stab at improving our ability to run RGW
>>>>>>>> performance tests in CBT.  Previously the only way to do this was to
>>>>>>>> use
>>>>>>>> the cosbench plugin, which required a fair amount of additional
>>>>>>>> setup and while quite powerful can be overkill in situations where
>>>>>>>> you
>>>>>>>> want to rapidly iterate over tests looking for specific issues.  A
>>>>>>>> while
>>>>>>>> ago Mark Seger from HP told me he had created a swift benchmark
>>>>>>>> called
>>>>>>>> "getput" that is written in python and is much more convenient to
>>>>>>>> run
>>>>>>>> quickly in an automated fashion.  Normally getput is used in
>>>>>>>> conjunction
>>>>>>>> with gpsuite, a tool for coordinating benchmarking multiple getput
>>>>>>>> processes.  This is how you would likely use getput on a typical
>>>>>>>> ceph
>>>>>>>> or
>>>>>>>> swift cluster, but since CBT builds the cluster and has it's own way
>>>>>>>> for
>>>>>>>> launching multiple benchmark processes, it uses getput directly.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Matt Benjamin
>>>>> Red Hat, Inc.
>>>>> 315 West Huron Street, Suite 140A
>>>>> Ann Arbor, Michigan 48103
>>>>>
>>>>> http://www.redhat.com/en/technologies/storage
>>>>>
>>>>> tel.  734-821-5101
>>>>> fax.  734-769-8938
>>>>> cel.  734-216-5309
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: CBT: New RGW getput benchmark and testing diary
  2017-02-07 15:03               ` Orit Wasserman
@ 2017-02-07 15:23                 ` Mark Nelson
  2017-02-07 16:02                   ` Matt Benjamin
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Nelson @ 2017-02-07 15:23 UTC (permalink / raw)
  To: Orit Wasserman
  Cc: Matt Benjamin, ceph-devel, cbt, Mark Seger, Kyle Bader,
	Karan Singh, Brent Compton



On 02/07/2017 09:03 AM, Orit Wasserman wrote:
> On Tue, Feb 7, 2017 at 4:47 PM, Mark Nelson <mnelson@redhat.com> wrote:
>> Hi Orit,
>>
>> This was a pull from master over the weekend:
>> 5bf39156d8312d65ef77822fbede73fd9454591f
>>
>> Btw, I've been noticing that it appears when bucket index sharding is used,
>> there's a higher likelyhood that client connection attempts are delayed or
>> starved out entirely under high concurrency.  I haven't looked at the code
>> yet, does this match with what you'd expect to happen?  I assume the
>> threadpool is shared?
>>
> yes it is shared.

Ok, so that probably explains the behavior I'm seeing.  Perhaps a more 
serious issue:  Do we have anything in place to stop a herd of clients 
from connecting, starving out bucket index lookups, and making 
everything deadlock?

>
>> Mark
>>
>>
>> On 02/07/2017 07:50 AM, Orit Wasserman wrote:
>>>
>>> Mark,
>>> On what version did you run the tests?
>>>
>>> Orit
>>>
>>> On Mon, Feb 6, 2017 at 7:07 PM, Mark Nelson <mnelson@redhat.com> wrote:
>>>>
>>>>
>>>>
>>>> On 02/06/2017 11:02 AM, Orit Wasserman wrote:
>>>>>
>>>>>
>>>>> On Mon, Feb 6, 2017 at 5:44 PM, Matt Benjamin <mbenjamin@redhat.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Keep in mind, RGW does most of its request processing work in civetweb
>>>>>> threads, so high utilization there does not necessarily imply
>>>>>> civetweb-internal processing.
>>>>>>
>>>>>
>>>>> True but the request processing is not a CPU intensive operation.
>>>>> It does seems to indicate that the civetweb threading model simply
>>>>> doesn't scale (we already noticed it already) or maybe it can point to
>>>>> some locking issue. We need to run a profiler to understand what is
>>>>> consuming CPU.
>>>>> It maybe a simple fix until we move to asynchronous frontend.
>>>>> It worth investigating as the CPU usage mark is seeing  is really high.
>>>>
>>>>
>>>>
>>>> The initial profiling I did definitely showed a lot of tcmalloc threading
>>>> activity, which diminshed after increasing threadcache.  This is quite
>>>> similar to what we saw in simplemessenger with low threadcache values,
>>>> though likely is less true with async messenger.  Sadly a profiler like
>>>> perf
>>>> probably isn't going to help much with debugging lock contention.
>>>> grabbing
>>>> GDB stack traces might help, or lttng.
>>>>
>>>>>
>>>>> Mark,
>>>>> How many concurrent request were handled?
>>>>
>>>>
>>>>
>>>> Most of the tests had 128 concurrent IOs per radosgw daemon.  The max
>>>> thread
>>>> count was increased to 512.  It was very obvious when exceeding the
>>>> thread
>>>> count since some getput processes will end up stalling and doing their
>>>> writes after others, leading to bogus performance data.
>>>>
>>>>
>>>>>
>>>>> Orit
>>>>>
>>>>>> Matt
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>>
>>>>>>>
>>>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>>>> To: "Matt Benjamin" <mbenjamin@redhat.com>
>>>>>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com,
>>>>>>> "Mark
>>>>>>> Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>>>>> <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent Compton"
>>>>>>> <bcompton@redhat.com>
>>>>>>> Sent: Monday, February 6, 2017 10:42:04 AM
>>>>>>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>>>>>>
>>>>>>> Just based on what I saw during these tests, it looks to me like a lot
>>>>>>> more time was spent dealing with civetweb's threads than RGW.  I
>>>>>>> didn't
>>>>>>> look too closely, but it may be worth looking at whether there's any
>>>>>>> low
>>>>>>> hanging fruit in civetweb itself.
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>> On 02/06/2017 09:33 AM, Matt Benjamin wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for the detailed effort and analysis, Mark.
>>>>>>>>
>>>>>>>> As we get closer to the L time-frame, it should become relevant to
>>>>>>>> look
>>>>>>>> at
>>>>>>>> the relative boost::asio frontend rework i/o paths, which are the
>>>>>>>> open
>>>>>>>> effort to reduce CPU overhead/revise threading model, in general.
>>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>>>>>> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
>>>>>>>>> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>>>>>>> <kbader@redhat.com>,
>>>>>>>>> "Karan Singh" <karan@redhat.com>, "Brent
>>>>>>>>> Compton" <bcompton@redhat.com>
>>>>>>>>> Sent: Monday, February 6, 2017 12:55:20 AM
>>>>>>>>> Subject: CBT: New RGW getput benchmark and testing diary
>>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> Over the weekend I took a stab at improving our ability to run RGW
>>>>>>>>> performance tests in CBT.  Previously the only way to do this was to
>>>>>>>>> use
>>>>>>>>> the cosbench plugin, which required a fair amount of additional
>>>>>>>>> setup and while quite powerful can be overkill in situations where
>>>>>>>>> you
>>>>>>>>> want to rapidly iterate over tests looking for specific issues.  A
>>>>>>>>> while
>>>>>>>>> ago Mark Seger from HP told me he had created a swift benchmark
>>>>>>>>> called
>>>>>>>>> "getput" that is written in python and is much more convenient to
>>>>>>>>> run
>>>>>>>>> quickly in an automated fashion.  Normally getput is used in
>>>>>>>>> conjunction
>>>>>>>>> with gpsuite, a tool for coordinating benchmarking multiple getput
>>>>>>>>> processes.  This is how you would likely use getput on a typical
>>>>>>>>> ceph
>>>>>>>>> or
>>>>>>>>> swift cluster, but since CBT builds the cluster and has it's own way
>>>>>>>>> for
>>>>>>>>> launching multiple benchmark processes, it uses getput directly.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Matt Benjamin
>>>>>> Red Hat, Inc.
>>>>>> 315 West Huron Street, Suite 140A
>>>>>> Ann Arbor, Michigan 48103
>>>>>>
>>>>>> http://www.redhat.com/en/technologies/storage
>>>>>>
>>>>>> tel.  734-821-5101
>>>>>> fax.  734-769-8938
>>>>>> cel.  734-216-5309
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: CBT: New RGW getput benchmark and testing diary
  2017-02-07 15:23                 ` Mark Nelson
@ 2017-02-07 16:02                   ` Matt Benjamin
  2017-02-07 16:11                     ` Mark Nelson
  0 siblings, 1 reply; 13+ messages in thread
From: Matt Benjamin @ 2017-02-07 16:02 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Orit Wasserman, ceph-devel, cbt, Mark Seger, Kyle Bader,
	Karan Singh, Brent Compton

Hi Mark,

There are rgw and rados-level throttling parameters, and there are known fairness issues.  The only scenario we know of where something like the "deadlock" you're theorizing can happen is when byte-throttling is incorrectly configured.

Matt

----- Original Message -----
> From: "Mark Nelson" <mnelson@redhat.com>
> To: "Orit Wasserman" <owasserm@redhat.com>
> Cc: "Matt Benjamin" <mbenjamin@redhat.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com, "Mark
> Seger" <mjseger@gmail.com>, "Kyle Bader" <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent Compton"
> <bcompton@redhat.com>
> Sent: Tuesday, February 7, 2017 10:23:05 AM
> Subject: Re: CBT: New RGW getput benchmark and testing diary
> 
> 
> 
> On 02/07/2017 09:03 AM, Orit Wasserman wrote:
> > On Tue, Feb 7, 2017 at 4:47 PM, Mark Nelson <mnelson@redhat.com> wrote:
> >> Hi Orit,
> >>
> >> This was a pull from master over the weekend:
> >> 5bf39156d8312d65ef77822fbede73fd9454591f
> >>
> >> Btw, I've been noticing that it appears when bucket index sharding is
> >> used,
> >> there's a higher likelyhood that client connection attempts are delayed or
> >> starved out entirely under high concurrency.  I haven't looked at the code
> >> yet, does this match with what you'd expect to happen?  I assume the
> >> threadpool is shared?
> >>
> > yes it is shared.
> 
> Ok, so that probably explains the behavior I'm seeing.  Perhaps a more
> serious issue:  Do we have anything in place to stop a herd of clients
> from connecting, starving out bucket index lookups, and making
> everything deadlock?
> 
> >
> >> Mark
> >>
> >>
> >> On 02/07/2017 07:50 AM, Orit Wasserman wrote:
> >>>
> >>> Mark,
> >>> On what version did you run the tests?
> >>>
> >>> Orit
> >>>
> >>> On Mon, Feb 6, 2017 at 7:07 PM, Mark Nelson <mnelson@redhat.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 02/06/2017 11:02 AM, Orit Wasserman wrote:
> >>>>>
> >>>>>
> >>>>> On Mon, Feb 6, 2017 at 5:44 PM, Matt Benjamin <mbenjamin@redhat.com>
> >>>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>> Keep in mind, RGW does most of its request processing work in civetweb
> >>>>>> threads, so high utilization there does not necessarily imply
> >>>>>> civetweb-internal processing.
> >>>>>>
> >>>>>
> >>>>> True but the request processing is not a CPU intensive operation.
> >>>>> It does seems to indicate that the civetweb threading model simply
> >>>>> doesn't scale (we already noticed it already) or maybe it can point to
> >>>>> some locking issue. We need to run a profiler to understand what is
> >>>>> consuming CPU.
> >>>>> It maybe a simple fix until we move to asynchronous frontend.
> >>>>> It worth investigating as the CPU usage mark is seeing  is really high.
> >>>>
> >>>>
> >>>>
> >>>> The initial profiling I did definitely showed a lot of tcmalloc
> >>>> threading
> >>>> activity, which diminshed after increasing threadcache.  This is quite
> >>>> similar to what we saw in simplemessenger with low threadcache values,
> >>>> though likely is less true with async messenger.  Sadly a profiler like
> >>>> perf
> >>>> probably isn't going to help much with debugging lock contention.
> >>>> grabbing
> >>>> GDB stack traces might help, or lttng.
> >>>>
> >>>>>
> >>>>> Mark,
> >>>>> How many concurrent request were handled?
> >>>>
> >>>>
> >>>>
> >>>> Most of the tests had 128 concurrent IOs per radosgw daemon.  The max
> >>>> thread
> >>>> count was increased to 512.  It was very obvious when exceeding the
> >>>> thread
> >>>> count since some getput processes will end up stalling and doing their
> >>>> writes after others, leading to bogus performance data.
> >>>>
> >>>>
> >>>>>
> >>>>> Orit
> >>>>>
> >>>>>> Matt
> >>>>>>
> >>>>>> ----- Original Message -----
> >>>>>>>
> >>>>>>>
> >>>>>>> From: "Mark Nelson" <mnelson@redhat.com>
> >>>>>>> To: "Matt Benjamin" <mbenjamin@redhat.com>
> >>>>>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com,
> >>>>>>> "Mark
> >>>>>>> Seger" <mjseger@gmail.com>, "Kyle Bader"
> >>>>>>> <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent
> >>>>>>> Compton"
> >>>>>>> <bcompton@redhat.com>
> >>>>>>> Sent: Monday, February 6, 2017 10:42:04 AM
> >>>>>>> Subject: Re: CBT: New RGW getput benchmark and testing diary
> >>>>>>>
> >>>>>>> Just based on what I saw during these tests, it looks to me like a
> >>>>>>> lot
> >>>>>>> more time was spent dealing with civetweb's threads than RGW.  I
> >>>>>>> didn't
> >>>>>>> look too closely, but it may be worth looking at whether there's any
> >>>>>>> low
> >>>>>>> hanging fruit in civetweb itself.
> >>>>>>>
> >>>>>>> Mark
> >>>>>>>
> >>>>>>> On 02/06/2017 09:33 AM, Matt Benjamin wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks for the detailed effort and analysis, Mark.
> >>>>>>>>
> >>>>>>>> As we get closer to the L time-frame, it should become relevant
> >>>>>>>> to look at the boost::asio frontend rework's i/o paths, which
> >>>>>>>> are the open effort to reduce CPU overhead and revise the
> >>>>>>>> threading model in general.
> >>>>>>>>
> >>>>>>>> Matt
> >>>>>>>>
> >>>>>>>> ----- Original Message -----
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> From: "Mark Nelson" <mnelson@redhat.com>
> >>>>>>>>> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
> >>>>>>>>> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader"
> >>>>>>>>> <kbader@redhat.com>,
> >>>>>>>>> "Karan Singh" <karan@redhat.com>, "Brent
> >>>>>>>>> Compton" <bcompton@redhat.com>
> >>>>>>>>> Sent: Monday, February 6, 2017 12:55:20 AM
> >>>>>>>>> Subject: CBT: New RGW getput benchmark and testing diary
> >>>>>>>>>
> >>>>>>>>> [original announcement snipped]
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> 

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309


* Re: CBT: New RGW getput benchmark and testing diary
  2017-02-07 16:02                   ` Matt Benjamin
@ 2017-02-07 16:11                     ` Mark Nelson
  2017-02-07 20:01                       ` Casey Bodley
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Nelson @ 2017-02-07 16:11 UTC (permalink / raw)
  To: Matt Benjamin
  Cc: Orit Wasserman, ceph-devel, cbt, Mark Seger, Kyle Bader,
	Karan Singh, Brent Compton

Thanks Matt!

Just so I understand, how does the byte throttling impact the number of
threads used under a heavy client connection scenario?  I.e., if you have
2000 threads and 2000 clients connect (1 thread per client?), what
ensures that additional threads are available for bucket index lookups?
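
For reference, these are the knobs I have in mind (a sketch from memory,
so treat the exact values as illustrative rather than what cbt actually
wrote out):

   [client.rgw]
   # civetweb frontend: one thread per client connection
   rgw frontends = civetweb port=7480 num_threads=512
   # shared request-processing pool (I believe civetweb falls back to
   # this when num_threads isn't set explicitly)
   rgw thread pool size = 512
   # rados-level throttles (defaults shown) -- I'm assuming these are
   # the byte-throttling parameters being referred to
   objecter inflight ops = 1024
   objecter inflight op bytes = 104857600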

Sorry for the weedy questions; I'm just trying to make sure I understand
how this all works, since I've never really looked closely at it and I'm
seeing some strange behavior.

Mark

On 02/07/2017 10:02 AM, Matt Benjamin wrote:
> Hi Mark,
>
> There are rgw and rados-level throttling parameters.  There are known fairness issues.  The only scenario we know of where something like the "deadlock" you're theorizing is possible is when byte-throttling is incorrectly configured.
>
> Matt
>
> ----- Original Message -----
>> From: "Mark Nelson" <mnelson@redhat.com>
>> To: "Orit Wasserman" <owasserm@redhat.com>
>> Cc: "Matt Benjamin" <mbenjamin@redhat.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com, "Mark
>> Seger" <mjseger@gmail.com>, "Kyle Bader" <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent Compton"
>> <bcompton@redhat.com>
>> Sent: Tuesday, February 7, 2017 10:23:05 AM
>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>
>>
>>
>> On 02/07/2017 09:03 AM, Orit Wasserman wrote:
>>> On Tue, Feb 7, 2017 at 4:47 PM, Mark Nelson <mnelson@redhat.com> wrote:
>>>> Hi Orit,
>>>>
>>>> This was a pull from master over the weekend:
>>>> 5bf39156d8312d65ef77822fbede73fd9454591f
>>>>
>>>> Btw, I've been noticing that when bucket index sharding is used, there
>>>> appears to be a higher likelihood that client connection attempts are
>>>> delayed or starved out entirely under high concurrency.  I haven't
>>>> looked at the code yet; does this match what you'd expect to happen?
>>>> I assume the threadpool is shared?
>>>>
>>> yes it is shared.
>>
>> Ok, so that probably explains the behavior I'm seeing.  Perhaps a more
>> serious issue:  Do we have anything in place to stop a herd of clients
>> from connecting, starving out bucket index lookups, and making
>> everything deadlock?
>>
>>>
>>>> Mark
>>>>
>>>>
>>>> On 02/07/2017 07:50 AM, Orit Wasserman wrote:
>>>>>
>>>>> Mark,
>>>>> On what version did you run the tests?
>>>>>
>>>>> Orit
>>>>>
>>>>> On Mon, Feb 6, 2017 at 7:07 PM, Mark Nelson <mnelson@redhat.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 02/06/2017 11:02 AM, Orit Wasserman wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 6, 2017 at 5:44 PM, Matt Benjamin <mbenjamin@redhat.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Keep in mind, RGW does most of its request processing work in civetweb
>>>>>>>> threads, so high utilization there does not necessarily imply
>>>>>>>> civetweb-internal processing.
>>>>>>>>
>>>>>>>
>>>>>>> True, but request processing is not a CPU-intensive operation.
>>>>>>> It does seem to indicate that the civetweb threading model simply
>>>>>>> doesn't scale (we already noticed this) or maybe it points to
>>>>>>> some locking issue. We need to run a profiler to understand what is
>>>>>>> consuming CPU.
>>>>>>> It may be a simple fix until we move to the asynchronous frontend.
>>>>>>> It is worth investigating, as the CPU usage Mark is seeing is really high.
>>>>>>
>>>>>>
>>>>>>
>>>>>> The initial profiling I did definitely showed a lot of tcmalloc
>>>>>> threading activity, which diminished after increasing threadcache.
>>>>>> This is quite similar to what we saw in simplemessenger with low
>>>>>> threadcache values, though it is likely less true with async
>>>>>> messenger.  Sadly, a profiler like perf probably isn't going to help
>>>>>> much with debugging lock contention; grabbing GDB stack traces might
>>>>>> help, or lttng.
>>>>>>
>>>>>>>
>>>>>>> Mark,
>>>>>>> How many concurrent requests were handled?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Most of the tests had 128 concurrent IOs per radosgw daemon.  The
>>>>>> max thread count was increased to 512.  It was very obvious when the
>>>>>> thread count was exceeded, since some getput processes would end up
>>>>>> stalling and doing their writes after others, leading to bogus
>>>>>> performance data.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Orit
>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>>>>>> To: "Matt Benjamin" <mbenjamin@redhat.com>
>>>>>>>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com,
>>>>>>>>> "Mark
>>>>>>>>> Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>>>>>>> <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent
>>>>>>>>> Compton"
>>>>>>>>> <bcompton@redhat.com>
>>>>>>>>> Sent: Monday, February 6, 2017 10:42:04 AM
>>>>>>>>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>>>>>>>>
>>>>>>>>> Just based on what I saw during these tests, it looks to me like a
>>>>>>>>> lot
>>>>>>>>> more time was spent dealing with civetweb's threads than RGW.  I
>>>>>>>>> didn't
>>>>>>>>> look too closely, but it may be worth looking at whether there's any
>>>>>>>>> low
>>>>>>>>> hanging fruit in civetweb itself.
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>> On 02/06/2017 09:33 AM, Matt Benjamin wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks for the detailed effort and analysis, Mark.
>>>>>>>>>>
>>>>>>>>>> As we get closer to the L time-frame, it should become relevant
>>>>>>>>>> to look at the boost::asio frontend rework's i/o paths, which
>>>>>>>>>> are the open effort to reduce CPU overhead and revise the
>>>>>>>>>> threading model in general.
>>>>>>>>>>
>>>>>>>>>> Matt
>>>>>>>>>>
>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>>>>>>>> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
>>>>>>>>>>> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>>>>>>>>> <kbader@redhat.com>,
>>>>>>>>>>> "Karan Singh" <karan@redhat.com>, "Brent
>>>>>>>>>>> Compton" <bcompton@redhat.com>
>>>>>>>>>>> Sent: Monday, February 6, 2017 12:55:20 AM
>>>>>>>>>>> Subject: CBT: New RGW getput benchmark and testing diary
>>>>>>>>>>>
>>>>>>>>>>> [original announcement snipped]
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>
>


* Re: CBT: New RGW getput benchmark and testing diary
  2017-02-07 16:11                     ` Mark Nelson
@ 2017-02-07 20:01                       ` Casey Bodley
  0 siblings, 0 replies; 13+ messages in thread
From: Casey Bodley @ 2017-02-07 20:01 UTC (permalink / raw)
  To: Mark Nelson, Matt Benjamin
  Cc: Orit Wasserman, ceph-devel, cbt, Mark Seger, Kyle Bader,
	Karan Singh, Brent Compton



On 02/07/2017 11:11 AM, Mark Nelson wrote:
> Thanks Matt!
>
> Just so I understand, how does the byte throttling impact the number
> of threads used under a heavy client connection scenario?  I.e., if you
> have 2000 threads and 2000 clients connect (1 thread per client?),
> what ensures that additional threads are available for bucket index
> lookups?

Hi Mark,

You're correct that civetweb is 1 thread per client connection. These 
frontend threads are owned by civetweb, and they call our 
process_request() function synchronously. Any rados operations required 
to satisfy a request (bucket index or otherwise) are also synchronous. 
We're not scheduling other work on frontend threads, so there isn't any 
potential for deadlock there.
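
If it helps, here's a toy model of that flow in python (purely
illustrative, not actual rgw code):

   import threading, queue, time

   NUM_THREADS = 4        # stand-in for civetweb's num_threads
   conns = queue.Queue()  # accepted client connections

   def process_request(conn):
       # all rados round trips (bucket index or otherwise) happen
       # synchronously here, so the thread blocks until they finish
       time.sleep(0.01)

   def frontend_thread():
       while True:
           conn = conns.get()      # claim one connection
           process_request(conn)   # hold the thread for the whole request
           conns.task_done()

   for _ in range(NUM_THREADS):
       threading.Thread(target=frontend_thread, daemon=True).start()

   # with more connections than threads, the extras just wait their
   # turn in the queue -- they can be delayed, but nothing deadlocks,
   # because no frontend thread ever waits on another frontend thread
   for c in range(100):
       conns.put(c)
   conns.join()

So connections can certainly be starved while the pool is saturated, but
there's no circular wait.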

Casey

>
> Sorry for the weedy questions; I'm just trying to make sure I understand
> how this all works, since I've never really looked closely at it and
> I'm seeing some strange behavior.
>
> Mark
>
> On 02/07/2017 10:02 AM, Matt Benjamin wrote:
>> Hi Mark,
>>
>> There are rgw and rados-level throttling parameters.  There are known
>> fairness issues.  The only scenario we know of where something like
>> the "deadlock" you're theorizing is possible is when byte-throttling
>> is incorrectly configured.
>>
>> Matt
>>
>> ----- Original Message -----
>>> From: "Mark Nelson" <mnelson@redhat.com>
>>> To: "Orit Wasserman" <owasserm@redhat.com>
>>> Cc: "Matt Benjamin" <mbenjamin@redhat.com>, "ceph-devel" 
>>> <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com, "Mark
>>> Seger" <mjseger@gmail.com>, "Kyle Bader" <kbader@redhat.com>, "Karan 
>>> Singh" <karan@redhat.com>, "Brent Compton"
>>> <bcompton@redhat.com>
>>> Sent: Tuesday, February 7, 2017 10:23:05 AM
>>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>>
>>>
>>>
>>> On 02/07/2017 09:03 AM, Orit Wasserman wrote:
>>>> On Tue, Feb 7, 2017 at 4:47 PM, Mark Nelson <mnelson@redhat.com> 
>>>> wrote:
>>>>> Hi Orit,
>>>>>
>>>>> This was a pull from master over the weekend:
>>>>> 5bf39156d8312d65ef77822fbede73fd9454591f
>>>>>
>>>>> Btw, I've been noticing that when bucket index sharding is used,
>>>>> there appears to be a higher likelihood that client connection
>>>>> attempts are delayed or starved out entirely under high concurrency.
>>>>> I haven't looked at the code yet; does this match what you'd expect
>>>>> to happen?  I assume the threadpool is shared?
>>>>>
>>>> yes it is shared.
>>>
>>> Ok, so that probably explains the behavior I'm seeing. Perhaps a more
>>> serious issue:  Do we have anything in place to stop a herd of clients
>>> from connecting, starving out bucket index lookups, and making
>>> everything deadlock?
>>>
>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>> On 02/07/2017 07:50 AM, Orit Wasserman wrote:
>>>>>>
>>>>>> Mark,
>>>>>> On what version did you run the tests?
>>>>>>
>>>>>> Orit
>>>>>>
>>>>>> On Mon, Feb 6, 2017 at 7:07 PM, Mark Nelson <mnelson@redhat.com> 
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 02/06/2017 11:02 AM, Orit Wasserman wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 6, 2017 at 5:44 PM, Matt Benjamin 
>>>>>>>> <mbenjamin@redhat.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Keep in mind, RGW does most of its request processing work in 
>>>>>>>>> civetweb
>>>>>>>>> threads, so high utilization there does not necessarily imply
>>>>>>>>> civetweb-internal processing.
>>>>>>>>>
>>>>>>>>
>>>>>>>> True, but request processing is not a CPU-intensive operation.
>>>>>>>> It does seem to indicate that the civetweb threading model simply
>>>>>>>> doesn't scale (we already noticed this) or maybe it points to
>>>>>>>> some locking issue. We need to run a profiler to understand what
>>>>>>>> is consuming CPU.
>>>>>>>> It may be a simple fix until we move to the asynchronous frontend.
>>>>>>>> It is worth investigating, as the CPU usage Mark is seeing is
>>>>>>>> really high.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The initial profiling I did definitely showed a lot of tcmalloc
>>>>>>> threading activity, which diminished after increasing threadcache.
>>>>>>> This is quite similar to what we saw in simplemessenger with low
>>>>>>> threadcache values, though it is likely less true with async
>>>>>>> messenger.  Sadly, a profiler like perf probably isn't going to
>>>>>>> help much with debugging lock contention; grabbing GDB stack traces
>>>>>>> might help, or lttng.
>>>>>>>
>>>>>>>>
>>>>>>>> Mark,
>>>>>>>> How many concurrent requests were handled?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Most of the tests had 128 concurrent IOs per radosgw daemon.  The
>>>>>>> max thread count was increased to 512.  It was very obvious when
>>>>>>> the thread count was exceeded, since some getput processes would
>>>>>>> end up stalling and doing their writes after others, leading to
>>>>>>> bogus performance data.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Orit
>>>>>>>>
>>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>>>>>>> To: "Matt Benjamin" <mbenjamin@redhat.com>
>>>>>>>>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, 
>>>>>>>>>> cbt@lists.ceph.com,
>>>>>>>>>> "Mark
>>>>>>>>>> Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>>>>>>>> <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent
>>>>>>>>>> Compton"
>>>>>>>>>> <bcompton@redhat.com>
>>>>>>>>>> Sent: Monday, February 6, 2017 10:42:04 AM
>>>>>>>>>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>>>>>>>>>
>>>>>>>>>> Just based on what I saw during these tests, it looks to me 
>>>>>>>>>> like a
>>>>>>>>>> lot
>>>>>>>>>> more time was spent dealing with civetweb's threads than RGW.  I
>>>>>>>>>> didn't
>>>>>>>>>> look too closely, but it may be worth looking at whether 
>>>>>>>>>> there's any
>>>>>>>>>> low
>>>>>>>>>> hanging fruit in civetweb itself.
>>>>>>>>>>
>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>> On 02/06/2017 09:33 AM, Matt Benjamin wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the detailed effort and analysis, Mark.
>>>>>>>>>>>
>>>>>>>>>>> As we get closer to the L time-frame, it should become
>>>>>>>>>>> relevant to look at the boost::asio frontend rework's i/o
>>>>>>>>>>> paths, which are the open effort to reduce CPU overhead and
>>>>>>>>>>> revise the threading model in general.
>>>>>>>>>>>
>>>>>>>>>>> Matt
>>>>>>>>>>>
>>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>>>>>>>>> To: "ceph-devel" <ceph-devel@vger.kernel.org>, 
>>>>>>>>>>>> cbt@lists.ceph.com
>>>>>>>>>>>> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>>>>>>>>>> <kbader@redhat.com>,
>>>>>>>>>>>> "Karan Singh" <karan@redhat.com>, "Brent
>>>>>>>>>>>> Compton" <bcompton@redhat.com>
>>>>>>>>>>>> Sent: Monday, February 6, 2017 12:55:20 AM
>>>>>>>>>>>> Subject: CBT: New RGW getput benchmark and testing diary
>>>>>>>>>>>>
>>>>>>>>>>>> [original announcement snipped]
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>
>>



end of thread

Thread overview: 13+ messages
2017-02-06  5:55 CBT: New RGW getput benchmark and testing diary Mark Nelson
2017-02-06 15:33 ` Matt Benjamin
2017-02-06 15:42   ` Mark Nelson
2017-02-06 15:44     ` Matt Benjamin
2017-02-06 17:02       ` Orit Wasserman
2017-02-06 17:07         ` Mark Nelson
2017-02-07 13:50           ` Orit Wasserman
2017-02-07 14:47             ` Mark Nelson
2017-02-07 15:03               ` Orit Wasserman
2017-02-07 15:23                 ` Mark Nelson
2017-02-07 16:02                   ` Matt Benjamin
2017-02-07 16:11                     ` Mark Nelson
2017-02-07 20:01                       ` Casey Bodley
