* CBT: New RGW getput benchmark and testing diary
@ 2017-02-06 5:55 Mark Nelson
2017-02-06 15:33 ` Matt Benjamin
0 siblings, 1 reply; 13+ messages in thread
From: Mark Nelson @ 2017-02-06 5:55 UTC (permalink / raw)
To: ceph-devel, cbt; +Cc: Mark Seger, Kyle Bader, Karan Singh, Brent Compton
[-- Attachment #1: Type: text/plain, Size: 10461 bytes --]
Hi All,
Over the weekend I took a stab at improving our ability to run RGW
performance tests in CBT. Previously the only way to do this was to use
the cosbench plugin, which required a fair amount of additional
setup and, while quite powerful, can be overkill in situations where you
want to rapidly iterate over tests looking for specific issues. A while
ago Mark Seger from HP told me he had created a swift benchmark called
"getput" that is written in python and is much more convenient to run
quickly in an automated fashion. Normally getput is used in conjunction
with gpsuite, a tool for coordinating benchmarking multiple getput
processes. This is how you would likely use getput on a typical ceph or
swift cluster, but since CBT builds the cluster and has its own way of
launching multiple benchmark processes, it uses getput directly.
Thankfully it was fairly easy to implement a CBT getput wrapper.
Several aspects of CBT's RGW support and user/key management were also
improved to make the whole process of testing RGW completely automated
via the CBT yaml file. As part of the process of testing and debugging
the new getput wrapper, I ran through a series of benchmarks and tests
to investigate 4MB write performance anomalies previously reported in
the field. I wrote something of a diary while doing this and thought I
would document it here for the community. These were not extremely
scientific tests, though I believe the findings are relevant and may be
useful for folks.
Test Cluster Setup
The test cluster being used has 8 nodes, 4 of which are used for OSDs
and 4 of which are being used for RGW and clients. Each OSD node has
the option to use any combination of 6 Seagate Constellation ES.2
7200RPM HDDs and 4 Intel 800GB P3700 NVMe drives. These machines also
have dual Intel Xeon E5-2650v3 CPUs and Intel 40GbE ethernet adapters.
In all of these tests, 1X replication was used to eliminate it as a
bottleneck and put maximum pressure on RGW. It should be noted that the
spinning disks in this cluster are attached through motherboard SATA2
ports, and may not be performing as well as if they were using a
dedicated SAS controller. A copy of the cbt yaml configuration file
used to run the tests is attached.
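For reference, a minimal sketch of what a CBT getput benchmark section can
look like. The parameter names below are illustrative examples rather than
the authoritative CBT schema; the attached runtests.xfs.16.rgw.yaml is the
real configuration used for these tests.

```yaml
# Illustrative shape only -- parameter names here are examples, not the
# authoritative CBT schema; see the attached yaml for the real file.
benchmarks:
  getput:
    op_size: 4194304          # 4MB objects
    procs: 128                # getput processes per client
    containers: 64            # buckets; keep >= OSD count on HDD
    tests: ['p']              # p = PUT workload
```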
Some Notes
When using getput as a benchmark for RGW, it's very important to keep
track of the total number of getput processes and the number of RGW
threads. If the runtime option in getput is used, some processes may
wait until RGW threads open up before they determine their runtime,
leading to skewed results. I believe this can be resolved in getput by
changing the way that donetime is calculated, however the issue can be
avoided by paying close attention to the RGW thread and getput process
counts.
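A minimal guard of the kind described above, as a sketch. The function
name and signature are mine, not CBT's: the idea is simply to refuse a run
where the total number of getput processes could exceed the RGW thread
pool, since queued processes start late and skew runtime-based results.

```python
# Hypothetical pre-flight check (not part of CBT): reject configurations
# where getput processes would queue behind a full RGW thread pool.
def check_budget(clients, procs_per_client, rgw_threads):
    total = clients * procs_per_client
    if total > rgw_threads:
        raise ValueError(
            f"{total} getput processes exceed {rgw_threads} RGW threads")
    return total

print(check_budget(4, 128, 512))  # 512 processes exactly fill 512 threads
```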
It's easy to create the wrong pools for RGW since the defaults changed
in Jewel. Now we must create default.rgw.buckets.index and
default.rgw.buckets.data. It wasn't until I looked at disk usage via
ceph df that I discovered that my .rgw.buckets and .rgw.buckets.index
pools were not being used and thus resulted in some bogus initial
performance data.
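A sketch of the pool setup described above. These are the Jewel-era
default zone pool names from the text; the PG counts mirror the values
used later in this diary and are illustrative choices for this cluster,
not general recommendations.

```shell
# Create the Jewel-era default zone data and index pools; PG counts are
# the per-test choices used in this diary, not recommendations.
ceph osd pool create default.rgw.buckets.index 2048 2048
ceph osd pool create default.rgw.buckets.data 2048 2048
ceph df   # confirm objects are actually landing in these pools
```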
RGW with HDD backed OSDs (Filestore)
The first set of tests run were with a 4 node cluster configured with 24
HDD backed OSDs. The first thing I noticed was that the
number of buckets and/or bucket index shards absolutely affects large
sequential write performance on HDD backed OSDs. Using a single bucket
with no shards (ie the default) resulted in 220MB/s of write throughput
while rados bench was able to achieve 550-600MB/s. Both of these
numbers are quite low, though are partially explained by the lack of SSD
journals and lack of dedicated controller hardware.
Setting the number of bucket index shards to 8 improved the RGW write
throughput to 400MB/s. The highest throughput for this setup was
achieved by either setting the number of buckets or the number of bucket
index shards substantially higher than the number of OSDs. In this
case, 64 appeared to be sufficient. Three high concurrency 4MB Object
RGW PUT tests showed results of 602MB/s, 557MB/s and 563MB/s. 3 rados
bench runs using similar concurrency showed 580MB/s, 580MB/s, and
564MB/s respectively. Write IO from multiple clients appeared to cause
a slight (~5%) performance drop vs write IO from a single client. In
all cases, tests were stopped prior to PG splitting.
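The bucket/shard effect can be illustrated with a toy placement model.
This is a stand-in hash placement, not CRUSH, and the numbers are only
illustrative: it maps bucket index shards pseudo-randomly onto 24 OSDs
and counts how many distinct OSDs end up carrying index-update load.

```python
import hashlib

# Toy model (not CRUSH): hash each shard id onto one of num_osds OSDs
# and count how many distinct OSDs receive index load.
def osds_hit(num_shards, num_osds=24):
    return len({int(hashlib.md5(str(s).encode()).hexdigest(), 16) % num_osds
                for s in range(num_shards)})

for shards in (1, 8, 64):
    print(shards, osds_hit(shards))
# With a single unsharded bucket index, every PUT serializes its index
# update on one OSD; with shards well above the OSD count, the index
# load spreads across most of the cluster.
```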
It was observed that RGW uses high amounts of CPU, especially if low
tcmalloc threadcache values are used. With 32MB of threadcache, RGW
used around 500-600% CPU to serve 500-600MB/s of 4MB writes.
Occasionally it would spike into the 1000-1200% region. Perf showed a
high percentage of time in tcmalloc managing civetweb threads. With
128MB of thread cache, this effect was greatly diminished. RGW appeared
to use around 300-400% CPU to serve the same workload, though in these
tests there was little performance impact as we were not CPU bound.
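The threadcache size is controlled by tcmalloc's standard environment
variable, typically set in the ceph sysconfig/default file. A sketch for
the 128MB setting used above (the exact file location varies by distro
and ceph packaging):

```shell
# /etc/sysconfig/ceph (Fedora/RHEL) or /etc/default/ceph (Debian) --
# the exact file varies by distro.
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728   # 128MB
```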
PG counts and PG splitting
Based on the bucket/shard results above, it may be inferred that an RGW
index pool with very low PG counts could significantly hurt performance.
What about the case where small PG counts are used in the bucket.data
pool? In this case, PG splitting behavior might actually be more
important than the effect of clumpiness in the random distribution due
to low sample counts. To test this, a buckets.index pool was created
with 2048 PGs while the buckets.data pool was created with 128 PGs.
Initial 4MB sequential writes with both rados bench and with RGW were
about 20% slower than what was seen with 2048 PGs in the data pool,
likely due to the worse data distribution properties.
While this is significant, I was more interested in the effects of PG
splitting in filestore. It has been observed in the past that PG
splitting can have a huge performance impact, especially when SELinux is
enabled. SELinux reads a security xattr on every file access, which
greatly impacts how quickly operations like link/unlink can be performed
during PG splits. While SELinux is not enabled on this test cluster, PG
splitting may still have other performance impacts due to worse dentry
and inode caching and kernel overhead. Rados bench was used to hit the
thresholds by writing out a large number of 4K objects.
After approximately 1.3M objects were written, performance of 4K writes
dropped by roughly an order of magnitude. At this point rados bench and
RGW were used to write out 4MB objects with high levels of concurrency
and high numbers of bucket index shards. In both cases, performance
started out slightly diminished, but quickly increased to near levels
observed on the fresh cluster. At least in this setup, PG splitting in
the data pool did not appear to majorly affect the performance of 4MB
object writes, though it may have affected the 4K object writes used to
pre-fill the data pool.
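As a rough sanity check on when splitting kicks in, assuming filestore's
default split/merge settings (the actual values on this test cluster may
differ):

```python
# Filestore splits a collection subdirectory once it holds more than
# filestore_split_multiple * abs(filestore_merge_threshold) * 16 files.
# With the defaults (2 and 10) that is 320 files per directory.
split_multiple = 2
merge_threshold = 10
split_point = split_multiple * abs(merge_threshold) * 16
print(split_point)                     # 320

# 1.3M 4K objects across 128 PGs puts each PG far past several rounds
# of directory splitting.
objects, pgs = 1_300_000, 128
print(round(objects / pgs))            # ~10156 objects per PG
```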
RGW with OSDs using HDD Data and NVMe Journals (Filestore)
Next, 24 OSDs were configured with the filestore data partitions on HDD
and journals on NVMe. With minimal tuning rados bench delivered
around 1350MB/s, or around 56MB/s per drive. This is lower per drive
than other configurations we've tested, but roughly double what the
OSDs achieved without NVMe journals. Interestingly, the
difference between using a single bucket with no shards and many
buckets/shards appeared to be minimal. Tests against a single bucket
with no shards resulted in 1000-1400MB/s while a test configuration with
128 buckets resulted in 1200-1400MB/s over several repeated tests. It
should be noted that the single RGW instance was using anywhere from
1000-1600% CPU to maintain these numbers even with 128MB of TCMalloc
threadcache! Again a performance drop was seen when using multiple
clients vs a single client.
RGW with NVMe data and journals (Filestore)
The test machines in this cluster have 4 Intel P3700 NVMe drives, each
capable of about 2GB/s. The cluster was again
reconfigured with 16 OSDs backed by the NVMe drives. In this case, a
single client running rados bench with 128 concurrent ops could saturate
the network and write data at about 4600MB/s. 4 clients, however,
appeared to be able to saturate the OSDs and achieve 11620MB/s. A single getput
client using 128 processes to write 4MB objects to a single bucket with
no shards achieves 1700-1800MB/s. Radosgw CPU usage hovered around
1500-1700%. Using 128 buckets (1 bucket per process) yielded a variety
of performance results ranging from 1700MB/s to 3500MB/s over multiple
tests. Radosgw CPU usage appeared to top out around 2100%. With 4
clients, a single radosgw instance was able to maintain roughly 3700MB/s
of writes over several independent tests despite using roughly the same
amount of CPU as the single client tests.
Testing 4 gateways
CBT can stand up multiple rados gateways and launch tests against all of
them concurrently. 4 clients were configured to drive traffic to all 4
rados gateways and ultimately the 16 NVMe backed OSDs. The first tests
run resulted in 404 keynotfound errors, apparently because multiple
copies of getput attempted to write into the same bucket. This was
resolved by making sure that each copy of getput being run had a
distinct bucket (though getput processes did not require the same
attention). Thus, the lowest number of buckets targeted in this test
was 16. In this configuration, the aggregate write throughput across
all 4 gateways was 9306MB/s. With a bucket for every process (512
total), the aggregate throughput increased to 9445MB/s. This was
roughly 2361MB/s per gateway and 81% of the rados bench throughput.
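A quick arithmetic check of those aggregate numbers, taking the quoted
throughputs at face value:

```python
aggregate = 9445          # MB/s across all 4 gateways (512 buckets)
rados_ceiling = 11620     # MB/s from the 4-client rados bench run

per_gateway = aggregate / 4
fraction = aggregate / rados_ceiling
print(round(per_gateway))        # 2361 MB/s per gateway
print(round(fraction * 100))     # 81% of rados bench throughput
```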
Next Steps and Conclusions
RGW is now roughly as easy to test in CBT as rados and rbd. It should
be far easier to examine bluestore performance with RGW now, and I hope
we will be able to do some filestore and bluestore comparisons shortly.
I will also note that I was quite happy with how well RGW handled large
object writes in these tests. Despite the high CPU overhead, I could
achieve 3700MB/s with a single RGW instance and over 9GB/s with 4 RGW
instances. That said, it's clear that performance is still quite
dependent on bucket index update latency. Especially on spinning disks,
it appears to be very important to use bucket index sharding or to
spread writes over many buckets.
Mark
[-- Attachment #2: runtests.xfs.16.rgw.yaml --]
[-- Type: application/x-yaml, Size: 2453 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: CBT: New RGW getput benchmark and testing diary
2017-02-06 5:55 CBT: New RGW getput benchmark and testing diary Mark Nelson
@ 2017-02-06 15:33 ` Matt Benjamin
2017-02-06 15:42 ` Mark Nelson
0 siblings, 1 reply; 13+ messages in thread
From: Matt Benjamin @ 2017-02-06 15:33 UTC (permalink / raw)
To: Mark Nelson
Cc: ceph-devel, cbt, Mark Seger, Kyle Bader, Karan Singh, Brent Compton
Thanks for the detailed effort and analysis, Mark.
As we get closer to the L time-frame, it should become relevant to look at the relative boost::asio frontend rework i/o paths, which are the open effort to reduce CPU overhead/revise threading model, in general.
Matt
----- Original Message -----
> From: "Mark Nelson" <mnelson@redhat.com>
> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader" <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent
> Compton" <bcompton@redhat.com>
> Sent: Monday, February 6, 2017 12:55:20 AM
> Subject: CBT: New RGW getput benchmark and testing diary
>
> Hi All,
>
> Over the weekend I took a stab at improving our ability to run RGW
> performance tests in CBT. Previously the only way to do this was to use
> the cosbench plugin, which required a fair amount of additional
> setup and while quite powerful can be overkill in situations where you
> want to rapidly iterate over tests looking for specific issues. A while
> ago Mark Seger from HP told me he had created a swift benchmark called
> "getput" that is written in python and is much more convenient to run
> quickly in an automated fashion. Normally getput is used in conjunction
> with gpsuite, a tool for coordinating benchmarking multiple getput
> processes. This is how you would likely use getput on a typical ceph or
> swift cluster, but since CBT builds the cluster and has it's own way for
> launching multiple benchmark processes, it uses getput directly.
>
--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103
http://www.redhat.com/en/technologies/storage
tel. 734-821-5101
fax. 734-769-8938
cel. 734-216-5309
* Re: CBT: New RGW getput benchmark and testing diary
2017-02-06 15:33 ` Matt Benjamin
@ 2017-02-06 15:42 ` Mark Nelson
2017-02-06 15:44 ` Matt Benjamin
0 siblings, 1 reply; 13+ messages in thread
From: Mark Nelson @ 2017-02-06 15:42 UTC (permalink / raw)
To: Matt Benjamin
Cc: ceph-devel, cbt, Mark Seger, Kyle Bader, Karan Singh, Brent Compton
Just based on what I saw during these tests, it looks to me like a lot
more time was spent dealing with civetweb's threads than RGW. I didn't
look too closely, but it may be worth looking at whether there's any low
hanging fruit in civetweb itself.
Mark
On 02/06/2017 09:33 AM, Matt Benjamin wrote:
> Thanks for the detailed effort and analysis, Mark.
>
> As we get closer to the L time-frame, it should become relevant to look at the relative boost::asio frontend rework i/o paths, which are the open effort to reduce CPU overhead/revise threading model, in general.
>
> Matt
>
> ----- Original Message -----
>> From: "Mark Nelson" <mnelson@redhat.com>
>> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
>> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader" <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent
>> Compton" <bcompton@redhat.com>
>> Sent: Monday, February 6, 2017 12:55:20 AM
>> Subject: CBT: New RGW getput benchmark and testing diary
>>
>> Hi All,
>>
>> Over the weekend I took a stab at improving our ability to run RGW
>> performance tests in CBT. Previously the only way to do this was to use
>> the cosbench plugin, which required a fair amount of additional
>> setup and while quite powerful can be overkill in situations where you
>> want to rapidly iterate over tests looking for specific issues. A while
>> ago Mark Seger from HP told me he had created a swift benchmark called
>> "getput" that is written in python and is much more convenient to run
>> quickly in an automated fashion. Normally getput is used in conjunction
>> with gpsuite, a tool for coordinating benchmarking multiple getput
>> processes. This is how you would likely use getput on a typical ceph or
>> swift cluster, but since CBT builds the cluster and has it's own way for
>> launching multiple benchmark processes, it uses getput directly.
>>
>
>
* Re: CBT: New RGW getput benchmark and testing diary
2017-02-06 15:42 ` Mark Nelson
@ 2017-02-06 15:44 ` Matt Benjamin
2017-02-06 17:02 ` Orit Wasserman
0 siblings, 1 reply; 13+ messages in thread
From: Matt Benjamin @ 2017-02-06 15:44 UTC (permalink / raw)
To: Mark Nelson
Cc: ceph-devel, cbt, Mark Seger, Kyle Bader, Karan Singh, Brent Compton
Keep in mind, RGW does most of its request processing work in civetweb threads, so high utilization there does not necessarily imply civetweb-internal processing.
Matt
----- Original Message -----
> From: "Mark Nelson" <mnelson@redhat.com>
> To: "Matt Benjamin" <mbenjamin@redhat.com>
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com, "Mark Seger" <mjseger@gmail.com>, "Kyle Bader"
> <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent Compton" <bcompton@redhat.com>
> Sent: Monday, February 6, 2017 10:42:04 AM
> Subject: Re: CBT: New RGW getput benchmark and testing diary
>
> Just based on what I saw during these tests, it looks to me like a lot
> more time was spent dealing with civetweb's threads than RGW. I didn't
> look too closely, but it may be worth looking at whether there's any low
> hanging fruit in civetweb itself.
>
> Mark
>
> On 02/06/2017 09:33 AM, Matt Benjamin wrote:
> > Thanks for the detailed effort and analysis, Mark.
> >
> > As we get closer to the L time-frame, it should become relevant to look at
> > the relative boost::asio frontend rework i/o paths, which are the open
> > effort to reduce CPU overhead/revise threading model, in general.
> >
> > Matt
> >
> > ----- Original Message -----
> >> From: "Mark Nelson" <mnelson@redhat.com>
> >> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
> >> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader" <kbader@redhat.com>,
> >> "Karan Singh" <karan@redhat.com>, "Brent
> >> Compton" <bcompton@redhat.com>
> >> Sent: Monday, February 6, 2017 12:55:20 AM
> >> Subject: CBT: New RGW getput benchmark and testing diary
> >>
> >> Hi All,
> >>
> >> Over the weekend I took a stab at improving our ability to run RGW
> >> performance tests in CBT. Previously the only way to do this was to use
> >> the cosbench plugin, which required a fair amount of additional
> >> setup and while quite powerful can be overkill in situations where you
> >> want to rapidly iterate over tests looking for specific issues. A while
> >> ago Mark Seger from HP told me he had created a swift benchmark called
> >> "getput" that is written in python and is much more convenient to run
> >> quickly in an automated fashion. Normally getput is used in conjunction
> >> with gpsuite, a tool for coordinating benchmarking multiple getput
> >> processes. This is how you would likely use getput on a typical ceph or
> >> swift cluster, but since CBT builds the cluster and has it's own way for
> >> launching multiple benchmark processes, it uses getput directly.
> >>
> >
> >
>
--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103
http://www.redhat.com/en/technologies/storage
tel. 734-821-5101
fax. 734-769-8938
cel. 734-216-5309
* Re: CBT: New RGW getput benchmark and testing diary
2017-02-06 15:44 ` Matt Benjamin
@ 2017-02-06 17:02 ` Orit Wasserman
2017-02-06 17:07 ` Mark Nelson
0 siblings, 1 reply; 13+ messages in thread
From: Orit Wasserman @ 2017-02-06 17:02 UTC (permalink / raw)
To: Matt Benjamin
Cc: Mark Nelson, ceph-devel, cbt, Mark Seger, Kyle Bader,
Karan Singh, Brent Compton
On Mon, Feb 6, 2017 at 5:44 PM, Matt Benjamin <mbenjamin@redhat.com> wrote:
> Keep in mind, RGW does most of its request processing work in civetweb threads, so high utilization there does not necessarily imply civetweb-internal processing.
>
True, but request processing is not a CPU-intensive operation.
It does seem to indicate that the civetweb threading model simply
doesn't scale (we have noticed this before), or it may point to
some locking issue. We need to run a profiler to understand what is
consuming CPU.
It may be a simple fix until we move to the asynchronous frontend.
It's worth investigating, as the CPU usage Mark is seeing is really high.
Mark,
How many concurrent requests were handled?
Orit
> Matt
>
> ----- Original Message -----
>> From: "Mark Nelson" <mnelson@redhat.com>
>> To: "Matt Benjamin" <mbenjamin@redhat.com>
>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com, "Mark Seger" <mjseger@gmail.com>, "Kyle Bader"
>> <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent Compton" <bcompton@redhat.com>
>> Sent: Monday, February 6, 2017 10:42:04 AM
>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>
>> Just based on what I saw during these tests, it looks to me like a lot
>> more time was spent dealing with civetweb's threads than RGW. I didn't
>> look too closely, but it may be worth looking at whether there's any low
>> hanging fruit in civetweb itself.
>>
>> Mark
>>
>> On 02/06/2017 09:33 AM, Matt Benjamin wrote:
>> > Thanks for the detailed effort and analysis, Mark.
>> >
>> > As we get closer to the L time-frame, it should become relevant to look at
>> > the relative boost::asio frontend rework i/o paths, which are the open
>> > effort to reduce CPU overhead/revise threading model, in general.
>> >
>> > Matt
>> >
>> > ----- Original Message -----
>> >> From: "Mark Nelson" <mnelson@redhat.com>
>> >> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
>> >> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader" <kbader@redhat.com>,
>> >> "Karan Singh" <karan@redhat.com>, "Brent
>> >> Compton" <bcompton@redhat.com>
>> >> Sent: Monday, February 6, 2017 12:55:20 AM
>> >> Subject: CBT: New RGW getput benchmark and testing diary
>> >>
>> >> Hi All,
>> >>
>> >> Over the weekend I took a stab at improving our ability to run RGW
>> >> performance tests in CBT. Previously the only way to do this was to use
>> >> the cosbench plugin, which required a fair amount of additional
>> >> setup and while quite powerful can be overkill in situations where you
>> >> want to rapidly iterate over tests looking for specific issues. A while
>> >> ago Mark Seger from HP told me he had created a swift benchmark called
>> >> "getput" that is written in python and is much more convenient to run
>> >> quickly in an automated fashion. Normally getput is used in conjunction
>> >> with gpsuite, a tool for coordinating benchmarking multiple getput
>> >> processes. This is how you would likely use getput on a typical ceph or
>> >> swift cluster, but since CBT builds the cluster and has it's own way for
>> >> launching multiple benchmark processes, it uses getput directly.
>> >>
>> >
>> >
>>
>
> --
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel. 734-821-5101
> fax. 734-769-8938
> cel. 734-216-5309
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: CBT: New RGW getput benchmark and testing diary
2017-02-06 17:02 ` Orit Wasserman
@ 2017-02-06 17:07 ` Mark Nelson
2017-02-07 13:50 ` Orit Wasserman
0 siblings, 1 reply; 13+ messages in thread
From: Mark Nelson @ 2017-02-06 17:07 UTC (permalink / raw)
To: Orit Wasserman, Matt Benjamin
Cc: ceph-devel, cbt, Mark Seger, Kyle Bader, Karan Singh, Brent Compton
On 02/06/2017 11:02 AM, Orit Wasserman wrote:
> On Mon, Feb 6, 2017 at 5:44 PM, Matt Benjamin <mbenjamin@redhat.com> wrote:
>> Keep in mind, RGW does most of its request processing work in civetweb threads, so high utilization there does not necessarily imply civetweb-internal processing.
>>
>
> True but the request processing is not a CPU intensive operation.
> It does seems to indicate that the civetweb threading model simply
> doesn't scale (we already noticed it already) or maybe it can point to
> some locking issue. We need to run a profiler to understand what is
> consuming CPU.
> It maybe a simple fix until we move to asynchronous frontend.
> It worth investigating as the CPU usage mark is seeing is really high.
The initial profiling I did definitely showed a lot of tcmalloc
threading activity, which diminished after increasing threadcache. This
is quite similar to what we saw in simplemessenger with low threadcache
values, though it is likely less true with async messenger. Sadly, a
profiler like perf probably isn't going to help much with debugging lock
contention; grabbing GDB stack traces might help, or lttng.
>
> Mark,
> How many concurrent request were handled?
Most of the tests had 128 concurrent IOs per radosgw daemon. The max
thread count was increased to 512. It was very obvious when exceeding
the thread count since some getput processes will end up stalling and
doing their writes after others, leading to bogus performance data.
>
> Orit
>
>> Matt
>>
>> ----- Original Message -----
>>> From: "Mark Nelson" <mnelson@redhat.com>
>>> To: "Matt Benjamin" <mbenjamin@redhat.com>
>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com, "Mark Seger" <mjseger@gmail.com>, "Kyle Bader"
>>> <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent Compton" <bcompton@redhat.com>
>>> Sent: Monday, February 6, 2017 10:42:04 AM
>>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>>
>>> Just based on what I saw during these tests, it looks to me like a lot
>>> more time was spent dealing with civetweb's threads than RGW. I didn't
>>> look too closely, but it may be worth looking at whether there's any low
>>> hanging fruit in civetweb itself.
>>>
>>> Mark
>>>
>>> On 02/06/2017 09:33 AM, Matt Benjamin wrote:
>>>> Thanks for the detailed effort and analysis, Mark.
>>>>
>>>> As we get closer to the L time-frame, it should become relevant to look at
>>>> the relative boost::asio frontend rework i/o paths, which are the open
>>>> effort to reduce CPU overhead/revise threading model, in general.
>>>>
>>>> Matt
>>>>
>>>> ----- Original Message -----
>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
>>>>> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader" <kbader@redhat.com>,
>>>>> "Karan Singh" <karan@redhat.com>, "Brent
>>>>> Compton" <bcompton@redhat.com>
>>>>> Sent: Monday, February 6, 2017 12:55:20 AM
>>>>> Subject: CBT: New RGW getput benchmark and testing diary
>>>>>
>>>>> Hi All,
>>>>>
>>>>> Over the weekend I took a stab at improving our ability to run RGW
>>>>> performance tests in CBT. Previously the only way to do this was to use
>>>>> the cosbench plugin, which required a fair amount of additional
>>>>> setup and while quite powerful can be overkill in situations where you
>>>>> want to rapidly iterate over tests looking for specific issues. A while
>>>>> ago Mark Seger from HP told me he had created a swift benchmark called
>>>>> "getput" that is written in python and is much more convenient to run
>>>>> quickly in an automated fashion. Normally getput is used in conjunction
>>>>> with gpsuite, a tool for coordinating benchmarking multiple getput
>>>>> processes. This is how you would likely use getput on a typical ceph or
>>>>> swift cluster, but since CBT builds the cluster and has it's own way for
>>>>> launching multiple benchmark processes, it uses getput directly.
>>>>>
>>>>
>>>>
>>>
>>
>> --
>> Matt Benjamin
>> Red Hat, Inc.
>> 315 West Huron Street, Suite 140A
>> Ann Arbor, Michigan 48103
>>
>> http://www.redhat.com/en/technologies/storage
>>
>> tel. 734-821-5101
>> fax. 734-769-8938
>> cel. 734-216-5309
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: CBT: New RGW getput benchmark and testing diary
2017-02-06 17:07 ` Mark Nelson
@ 2017-02-07 13:50 ` Orit Wasserman
2017-02-07 14:47 ` Mark Nelson
0 siblings, 1 reply; 13+ messages in thread
From: Orit Wasserman @ 2017-02-07 13:50 UTC (permalink / raw)
To: Mark Nelson
Cc: Matt Benjamin, ceph-devel, cbt, Mark Seger, Kyle Bader,
Karan Singh, Brent Compton
Mark,
On what version did you run the tests?
Orit
On Mon, Feb 6, 2017 at 7:07 PM, Mark Nelson <mnelson@redhat.com> wrote:
>
>
> On 02/06/2017 11:02 AM, Orit Wasserman wrote:
>>
>> On Mon, Feb 6, 2017 at 5:44 PM, Matt Benjamin <mbenjamin@redhat.com>
>> wrote:
>>>
>>> Keep in mind, RGW does most of its request processing work in civetweb
>>> threads, so high utilization there does not necessarily imply
>>> civetweb-internal processing.
>>>
>>
>> True but the request processing is not a CPU intensive operation.
>> It does seems to indicate that the civetweb threading model simply
>> doesn't scale (we already noticed it already) or maybe it can point to
>> some locking issue. We need to run a profiler to understand what is
>> consuming CPU.
>> It maybe a simple fix until we move to asynchronous frontend.
>> It worth investigating as the CPU usage mark is seeing is really high.
>
>
> The initial profiling I did definitely showed a lot of tcmalloc threading
> activity, which diminshed after increasing threadcache. This is quite
> similar to what we saw in simplemessenger with low threadcache values,
> though likely is less true with async messenger. Sadly a profiler like perf
> probably isn't going to help much with debugging lock contention. grabbing
> GDB stack traces might help, or lttng.
>
>>
>> Mark,
>> How many concurrent request were handled?
>
>
> Most of the tests had 128 concurrent IOs per radosgw daemon. The max thread
> count was increased to 512. It was very obvious when exceeding the thread
> count since some getput processes will end up stalling and doing their
> writes after others, leading to bogus performance data.
>
>
>>
>> Orit
>>
>>> Matt
>>>
>>> ----- Original Message -----
>>>>
>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>> To: "Matt Benjamin" <mbenjamin@redhat.com>
>>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com, "Mark
>>>> Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>> <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent Compton"
>>>> <bcompton@redhat.com>
>>>> Sent: Monday, February 6, 2017 10:42:04 AM
>>>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>>>
>>>> Just based on what I saw during these tests, it looks to me like a lot
>>>> more time was spent dealing with civetweb's threads than RGW. I didn't
>>>> look too closely, but it may be worth looking at whether there's any low
>>>> hanging fruit in civetweb itself.
>>>>
>>>> Mark
>>>>
>>>> On 02/06/2017 09:33 AM, Matt Benjamin wrote:
>>>>>
>>>>> Thanks for the detailed effort and analysis, Mark.
>>>>>
>>>>> As we get closer to the L time-frame, it should become relevant to look
>>>>> at
>>>>> the relative boost::asio frontend rework i/o paths, which are the open
>>>>> effort to reduce CPU overhead/revise threading model, in general.
>>>>>
>>>>> Matt
>>>>>
>>>>> ----- Original Message -----
>>>>>>
>>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>>> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
>>>>>> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>>>> <kbader@redhat.com>,
>>>>>> "Karan Singh" <karan@redhat.com>, "Brent
>>>>>> Compton" <bcompton@redhat.com>
>>>>>> Sent: Monday, February 6, 2017 12:55:20 AM
>>>>>> Subject: CBT: New RGW getput benchmark and testing diary
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> Over the weekend I took a stab at improving our ability to run RGW
>>>>>> performance tests in CBT. Previously the only way to do this was to
>>>>>> use
>>>>>> the cosbench plugin, which required a fair amount of additional
>>>>>> setup and while quite powerful can be overkill in situations where you
>>>>>> want to rapidly iterate over tests looking for specific issues. A
>>>>>> while
>>>>>> ago Mark Seger from HP told me he had created a swift benchmark called
>>>>>> "getput" that is written in python and is much more convenient to run
>>>>>> quickly in an automated fashion. Normally getput is used in
>>>>>> conjunction
>>>>>> with gpsuite, a tool for coordinating benchmarking multiple getput
>>>>>> processes. This is how you would likely use getput on a typical ceph
>>>>>> or
>>>>>> swift cluster, but since CBT builds the cluster and has it's own way
>>>>>> for
>>>>>> launching multiple benchmark processes, it uses getput directly.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Matt Benjamin
>>> Red Hat, Inc.
>>> 315 West Huron Street, Suite 140A
>>> Ann Arbor, Michigan 48103
>>>
>>> http://www.redhat.com/en/technologies/storage
>>>
>>> tel. 734-821-5101
>>> fax. 734-769-8938
>>> cel. 734-216-5309
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: CBT: New RGW getput benchmark and testing diary
2017-02-07 13:50 ` Orit Wasserman
@ 2017-02-07 14:47 ` Mark Nelson
2017-02-07 15:03 ` Orit Wasserman
0 siblings, 1 reply; 13+ messages in thread
From: Mark Nelson @ 2017-02-07 14:47 UTC (permalink / raw)
To: Orit Wasserman
Cc: Matt Benjamin, ceph-devel, cbt, Mark Seger, Kyle Bader,
Karan Singh, Brent Compton
Hi Orit,
This was a pull from master over the weekend:
5bf39156d8312d65ef77822fbede73fd9454591f
Btw, I've noticed that when bucket index sharding is used, there appears
to be a higher likelihood that client connection attempts are delayed or
starved out entirely under high concurrency. I haven't looked at the
code yet; does this match what you'd expect to happen? I assume the
threadpool is shared?
Mark
On 02/07/2017 07:50 AM, Orit Wasserman wrote:
> Mark,
> On what version did you run the tests?
>
> Orit
>
> On Mon, Feb 6, 2017 at 7:07 PM, Mark Nelson <mnelson@redhat.com> wrote:
>>
>>
>> On 02/06/2017 11:02 AM, Orit Wasserman wrote:
>>>
>>> On Mon, Feb 6, 2017 at 5:44 PM, Matt Benjamin <mbenjamin@redhat.com>
>>> wrote:
>>>>
>>>> Keep in mind, RGW does most of its request processing work in civetweb
>>>> threads, so high utilization there does not necessarily imply
>>>> civetweb-internal processing.
>>>>
>>>
>>> True, but the request processing is not a CPU-intensive operation.
>>> It does seem to indicate that the civetweb threading model simply
>>> doesn't scale (we already noticed this), or it may point to some
>>> locking issue. We need to run a profiler to understand what is
>>> consuming CPU.
>>> It may be a simple fix until we move to the asynchronous frontend.
>>> It's worth investigating, as the CPU usage Mark is seeing is really high.
>>
>>
>> The initial profiling I did definitely showed a lot of tcmalloc threading
>> activity, which diminished after increasing the threadcache. This is quite
>> similar to what we saw in simplemessenger with low threadcache values,
>> though it is likely less true with the async messenger. Sadly, a profiler
>> like perf probably isn't going to help much with debugging lock contention;
>> grabbing GDB stack traces might help, or LTTng.
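For anyone wanting to try the two techniques mentioned above, a minimal sketch (the 128 MB threadcache value and the output filename are illustrative assumptions, not settings from this thread):

```shell
# Enlarge tcmalloc's aggregate thread cache before starting the daemon;
# 128 MB here is only an example value.
export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=$((128 * 1024 * 1024))

# Capture a one-shot backtrace of every thread in a running radosgw;
# unlike perf, this shows where threads are blocked on locks.
if pid=$(pidof -s radosgw 2>/dev/null); then
    gdb --batch -p "$pid" -ex "thread apply all bt" > rgw_stacks.txt
fi
```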
>>
>>>
>>> Mark,
>>> How many concurrent requests were handled?
>>
>>
>> Most of the tests had 128 concurrent IOs per radosgw daemon. The max thread
>> count was increased to 512. It was very obvious when we exceeded the thread
>> count, since some getput processes would end up stalling and doing their
>> writes after the others, leading to bogus performance data.
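For context, a frontend thread count like the one described above would typically be set via the civetweb frontend options in ceph.conf; a minimal sketch (the section name and port are assumptions, not the actual test configuration):

```ini
# ceph.conf fragment -- illustrative, not the configuration used here
[client.rgw.gateway1]
    rgw frontends = civetweb port=7480 num_threads=512
    rgw thread pool size = 512
```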
>>
>>
>>>
>>> Orit
>>>
>>>> Matt
>>>>
>>>> ----- Original Message -----
>>>>>
>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>> To: "Matt Benjamin" <mbenjamin@redhat.com>
>>>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com, "Mark
>>>>> Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>>> <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent Compton"
>>>>> <bcompton@redhat.com>
>>>>> Sent: Monday, February 6, 2017 10:42:04 AM
>>>>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>>>>
>>>>> Just based on what I saw during these tests, it looks to me like a lot
>>>>> more time was spent dealing with civetweb's threads than in RGW. I
>>>>> didn't look too closely, but it may be worth looking at whether there's
>>>>> any low-hanging fruit in civetweb itself.
>>>>>
>>>>> Mark
* Re: CBT: New RGW getput benchmark and testing diary
2017-02-07 14:47 ` Mark Nelson
@ 2017-02-07 15:03 ` Orit Wasserman
2017-02-07 15:23 ` Mark Nelson
0 siblings, 1 reply; 13+ messages in thread
From: Orit Wasserman @ 2017-02-07 15:03 UTC (permalink / raw)
To: Mark Nelson
Cc: Matt Benjamin, ceph-devel, cbt, Mark Seger, Kyle Bader,
Karan Singh, Brent Compton
On Tue, Feb 7, 2017 at 4:47 PM, Mark Nelson <mnelson@redhat.com> wrote:
> Hi Orit,
>
> This was a pull from master over the weekend:
> 5bf39156d8312d65ef77822fbede73fd9454591f
>
> Btw, I've been noticing that it appears when bucket index sharding is used,
> there's a higher likelihood that client connection attempts are delayed or
> starved out entirely under high concurrency. I haven't looked at the code
> yet, does this match with what you'd expect to happen? I assume the
> threadpool is shared?
>
Yes, it is shared.
> Mark
* Re: CBT: New RGW getput benchmark and testing diary
2017-02-07 15:03 ` Orit Wasserman
@ 2017-02-07 15:23 ` Mark Nelson
2017-02-07 16:02 ` Matt Benjamin
0 siblings, 1 reply; 13+ messages in thread
From: Mark Nelson @ 2017-02-07 15:23 UTC (permalink / raw)
To: Orit Wasserman
Cc: Matt Benjamin, ceph-devel, cbt, Mark Seger, Kyle Bader,
Karan Singh, Brent Compton
On 02/07/2017 09:03 AM, Orit Wasserman wrote:
> On Tue, Feb 7, 2017 at 4:47 PM, Mark Nelson <mnelson@redhat.com> wrote:
>> Hi Orit,
>>
>> This was a pull from master over the weekend:
>> 5bf39156d8312d65ef77822fbede73fd9454591f
>>
>> Btw, I've been noticing that it appears when bucket index sharding is used,
>> there's a higher likelihood that client connection attempts are delayed or
>> starved out entirely under high concurrency. I haven't looked at the code
>> yet, does this match with what you'd expect to happen? I assume the
>> threadpool is shared?
>>
> yes it is shared.
OK, so that probably explains the behavior I'm seeing. Perhaps a more
serious issue: do we have anything in place to stop a herd of clients
from connecting, starving out bucket index lookups, and making
everything deadlock?
* Re: CBT: New RGW getput benchmark and testing diary
2017-02-07 15:23 ` Mark Nelson
@ 2017-02-07 16:02 ` Matt Benjamin
2017-02-07 16:11 ` Mark Nelson
0 siblings, 1 reply; 13+ messages in thread
From: Matt Benjamin @ 2017-02-07 16:02 UTC (permalink / raw)
To: Mark Nelson
Cc: Orit Wasserman, ceph-devel, cbt, Mark Seger, Kyle Bader,
Karan Singh, Brent Compton
Hi Mark,
There are RGW- and RADOS-level throttling parameters, and there are known fairness issues. The only scenario we know of where something like the "deadlock" you're theorizing is possible is when byte-throttling is incorrectly configured.
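A rough sketch of the kind of throttling knobs being referred to (the objecter option names are real Ceph options of this era, but the section name and values are illustrative assumptions, not settings from this thread):

```ini
# ceph.conf fragment -- illustrative values only
[client.rgw.gateway1]
    # RADOS-level (objecter) throttles: cap in-flight ops and bytes
    objecter inflight ops = 1024
    # byte-throttle: 100 MB of in-flight data
    objecter inflight op bytes = 104857600
```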
Matt
----- Original Message -----
> From: "Mark Nelson" <mnelson@redhat.com>
> To: "Orit Wasserman" <owasserm@redhat.com>
> Cc: "Matt Benjamin" <mbenjamin@redhat.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com, "Mark
> Seger" <mjseger@gmail.com>, "Kyle Bader" <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent Compton"
> <bcompton@redhat.com>
> Sent: Tuesday, February 7, 2017 10:23:05 AM
> Subject: Re: CBT: New RGW getput benchmark and testing diary
>
>
>
> On 02/07/2017 09:03 AM, Orit Wasserman wrote:
> > On Tue, Feb 7, 2017 at 4:47 PM, Mark Nelson <mnelson@redhat.com> wrote:
> >> Hi Orit,
> >>
> >> This was a pull from master over the weekend:
> >> 5bf39156d8312d65ef77822fbede73fd9454591f
> >>
> >> Btw, I've been noticing that it appears when bucket index sharding is
> >> used,
> >> there's a higher likelihood that client connection attempts are delayed or
> >> starved out entirely under high concurrency. I haven't looked at the code
> >> yet, does this match with what you'd expect to happen? I assume the
> >> threadpool is shared?
> >>
> > yes it is shared.
>
> Ok, so that probably explains the behavior I'm seeing. Perhaps a more
> serious issue: Do we have anything in place to stop a herd of clients
> from connecting, starving out bucket index lookups, and making
> everything deadlock?
--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103
http://www.redhat.com/en/technologies/storage
tel. 734-821-5101
fax. 734-769-8938
cel. 734-216-5309
* Re: CBT: New RGW getput benchmark and testing diary
2017-02-07 16:02 ` Matt Benjamin
@ 2017-02-07 16:11 ` Mark Nelson
2017-02-07 20:01 ` Casey Bodley
0 siblings, 1 reply; 13+ messages in thread
From: Mark Nelson @ 2017-02-07 16:11 UTC (permalink / raw)
To: Matt Benjamin
Cc: Orit Wasserman, ceph-devel, cbt, Mark Seger, Kyle Bader,
Karan Singh, Brent Compton
Thanks Matt!
Just so I understand, how does the byte throttling impact the number of
threads used under a heavy client connection scenario? I.e., if you have
2000 threads and 2000 clients connect (1 thread per client?), what
ensures that additional threads are available for bucket index lookups?
Sorry for the weedy questions; I'm just trying to make sure I understand
how this all works, since I've never really looked closely at it and I'm
seeing some strange behavior.
Mark
On 02/07/2017 10:02 AM, Matt Benjamin wrote:
> Hi Mark,
>
> There are RGW- and RADOS-level throttling parameters, and there are known fairness issues. The only scenario we know of where something like the "deadlock" you're theorizing is possible is when byte-throttling is incorrectly configured.
>
> Matt
>
> ----- Original Message -----
>> From: "Mark Nelson" <mnelson@redhat.com>
>> To: "Orit Wasserman" <owasserm@redhat.com>
>> Cc: "Matt Benjamin" <mbenjamin@redhat.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com, "Mark
>> Seger" <mjseger@gmail.com>, "Kyle Bader" <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent Compton"
>> <bcompton@redhat.com>
>> Sent: Tuesday, February 7, 2017 10:23:05 AM
>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>
>>
>>
>> On 02/07/2017 09:03 AM, Orit Wasserman wrote:
>>> On Tue, Feb 7, 2017 at 4:47 PM, Mark Nelson <mnelson@redhat.com> wrote:
>>>> Hi Orit,
>>>>
>>>> This was a pull from master over the weekend:
>>>> 5bf39156d8312d65ef77822fbede73fd9454591f
>>>>
>>>> Btw, I've been noticing that it appears when bucket index sharding is
>>>> used,
>>>> there's a higher likelyhood that client connection attempts are delayed or
>>>> starved out entirely under high concurrency. I haven't looked at the code
>>>> yet, does this match with what you'd expect to happen? I assume the
>>>> threadpool is shared?
>>>>
>>> yes it is shared.
>>
>> Ok, so that probably explains the behavior I'm seeing. Perhaps a more
>> serious issue: Do we have anything in place to stop a herd of clients
>> from connecting, starving out bucket index lookups, and making
>> everything deadlock?
>>
>>>
>>>> Mark
>>>>
>>>>
>>>> On 02/07/2017 07:50 AM, Orit Wasserman wrote:
>>>>>
>>>>> Mark,
>>>>> On what version did you run the tests?
>>>>>
>>>>> Orit
>>>>>
>>>>> On Mon, Feb 6, 2017 at 7:07 PM, Mark Nelson <mnelson@redhat.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 02/06/2017 11:02 AM, Orit Wasserman wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 6, 2017 at 5:44 PM, Matt Benjamin <mbenjamin@redhat.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Keep in mind, RGW does most of its request processing work in civetweb
>>>>>>>> threads, so high utilization there does not necessarily imply
>>>>>>>> civetweb-internal processing.
>>>>>>>>
>>>>>>>
>>>>>>> True but the request processing is not a CPU intensive operation.
>>>>>>> It does seems to indicate that the civetweb threading model simply
>>>>>>> doesn't scale (we already noticed it already) or maybe it can point to
>>>>>>> some locking issue. We need to run a profiler to understand what is
>>>>>>> consuming CPU.
>>>>>>> It maybe a simple fix until we move to asynchronous frontend.
>>>>>>> It worth investigating as the CPU usage mark is seeing is really high.
>>>>>>
>>>>>>
>>>>>>
>>>>>> The initial profiling I did definitely showed a lot of tcmalloc
>>>>>> threading
>>>>>> activity, which diminshed after increasing threadcache. This is quite
>>>>>> similar to what we saw in simplemessenger with low threadcache values,
>>>>>> though likely is less true with async messenger. Sadly a profiler like
>>>>>> perf
>>>>>> probably isn't going to help much with debugging lock contention.
>>>>>> grabbing
>>>>>> GDB stack traces might help, or lttng.
>>>>>>
>>>>>>>
>>>>>>> Mark,
>>>>>>> How many concurrent request were handled?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Most of the tests had 128 concurrent IOs per radosgw daemon. The max
>>>>>> thread
>>>>>> count was increased to 512. It was very obvious when exceeding the
>>>>>> thread
>>>>>> count since some getput processes will end up stalling and doing their
>>>>>> writes after others, leading to bogus performance data.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Orit
>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>>>>>> To: "Matt Benjamin" <mbenjamin@redhat.com>
>>>>>>>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com,
>>>>>>>>> "Mark
>>>>>>>>> Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>>>>>>> <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent
>>>>>>>>> Compton"
>>>>>>>>> <bcompton@redhat.com>
>>>>>>>>> Sent: Monday, February 6, 2017 10:42:04 AM
>>>>>>>>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>>>>>>>>
>>>>>>>>> Just based on what I saw during these tests, it looks to me like a
>>>>>>>>> lot more time was spent dealing with civetweb's threads than in
>>>>>>>>> RGW itself. I didn't look too closely, but it may be worth looking
>>>>>>>>> at whether there's any low-hanging fruit in civetweb.
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>> On 02/06/2017 09:33 AM, Matt Benjamin wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks for the detailed effort and analysis, Mark.
>>>>>>>>>>
>>>>>>>>>> As we get closer to the L time-frame, it should become relevant to
>>>>>>>>>> look at the boost::asio frontend rework I/O paths, which are the
>>>>>>>>>> open effort to reduce CPU overhead and revise the threading model
>>>>>>>>>> in general.
>>>>>>>>>>
>>>>>>>>>> Matt
>>>>>>>>>>
>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>>>>>>>> To: "ceph-devel" <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com
>>>>>>>>>>> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>>>>>>>>> <kbader@redhat.com>,
>>>>>>>>>>> "Karan Singh" <karan@redhat.com>, "Brent
>>>>>>>>>>> Compton" <bcompton@redhat.com>
>>>>>>>>>>> Sent: Monday, February 6, 2017 12:55:20 AM
>>>>>>>>>>> Subject: CBT: New RGW getput benchmark and testing diary
>>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> Over the weekend I took a stab at improving our ability to run RGW
>>>>>>>>>>> performance tests in CBT. Previously the only way to do this was
>>>>>>>>>>> to
>>>>>>>>>>> use
>>>>>>>>>>> the cosbench plugin, which required a fair amount of additional
>>>>>>>>>>> setup and while quite powerful can be overkill in situations where
>>>>>>>>>>> you
>>>>>>>>>>> want to rapidly iterate over tests looking for specific issues. A
>>>>>>>>>>> while
>>>>>>>>>>> ago Mark Seger from HP told me he had created a swift benchmark
>>>>>>>>>>> called
>>>>>>>>>>> "getput" that is written in python and is much more convenient to
>>>>>>>>>>> run
>>>>>>>>>>> quickly in an automated fashion. Normally getput is used in
>>>>>>>>>>> conjunction
>>>>>>>>>>> with gpsuite, a tool for coordinating benchmarking multiple getput
>>>>>>>>>>> processes. This is how you would likely use getput on a typical
>>>>>>>>>>> ceph
>>>>>>>>>>> or
>>>>>>>>>>> swift cluster, but since CBT builds the cluster and has its own
>>>>>>>>>>> way of launching multiple benchmark processes, it uses getput
>>>>>>>>>>> directly.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Matt Benjamin
>>>>>>>> Red Hat, Inc.
>>>>>>>> 315 West Huron Street, Suite 140A
>>>>>>>> Ann Arbor, Michigan 48103
>>>>>>>>
>>>>>>>> http://www.redhat.com/en/technologies/storage
>>>>>>>>
>>>>>>>> tel. 734-821-5101
>>>>>>>> fax. 734-769-8938
>>>>>>>> cel. 734-216-5309
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>> in
>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: CBT: New RGW getput benchmark and testing diary
2017-02-07 16:11 ` Mark Nelson
@ 2017-02-07 20:01 ` Casey Bodley
0 siblings, 0 replies; 13+ messages in thread
From: Casey Bodley @ 2017-02-07 20:01 UTC (permalink / raw)
To: Mark Nelson, Matt Benjamin
Cc: Orit Wasserman, ceph-devel, cbt, Mark Seger, Kyle Bader,
Karan Singh, Brent Compton
On 02/07/2017 11:11 AM, Mark Nelson wrote:
> Thanks Matt!
>
> Just so I understand, how does the byte throttling impact the number
> of threads used under a heavy client connection scenario? I.e. if you
> have 2000 threads and 2000 clients connect (1 thread per client?),
> what ensures that additional threads are available for bucket index
> lookups?
Hi Mark,
You're correct that civetweb uses one thread per client connection. These
frontend threads are owned by civetweb, and they call our
process_request() function synchronously. Any rados operations required
to satisfy a request (bucket index or otherwise) are also synchronous.
We're not scheduling other work on the frontend threads, so there isn't
any potential for deadlock there.
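[Editorial sketch: the thread-per-connection model described above can be illustrated in a few lines of Python. This is purely illustrative, not RGW code; `process_request` below is a stand-in for RGW's real synchronous handler, and the demo drives a handful of local clients against it.]

```python
import socket
import threading

def process_request(conn):
    # Stand-in for RGW's synchronous process_request(): any backend
    # (rados) I/O would block this same thread until it completes.
    conn.recv(64)
    conn.sendall(b"ok")
    conn.close()

def serve(sock, n_requests):
    # civetweb-style model: one dedicated thread per accepted connection,
    # so the thread budget bounds how many clients are served at once.
    threads = []
    for _ in range(n_requests):
        conn, _ = sock.accept()
        t = threading.Thread(target=process_request, args=(conn,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

def demo(n_clients=8):
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(n_clients)
    port = srv.getsockname()[1]
    server = threading.Thread(target=serve, args=(srv, n_clients))
    server.start()
    replies = []
    for _ in range(n_clients):
        c = socket.create_connection(("127.0.0.1", port))
        c.sendall(b"GET /")
        replies.append(c.recv(64))
        c.close()
    server.join()
    srv.close()
    return replies
```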
Casey
>
> Sorry for the weedy questions, just trying to make sure I understand
> how this all works since I've never really looked closely at it and
> I'm seeing some strange behavior.
>
> Mark
>
> On 02/07/2017 10:02 AM, Matt Benjamin wrote:
>> Hi Mark,
>>
>> There are rgw and rados-level throttling parameters. There are known
>> fairness issues. The only scenario we know of where something
>> like the "deadlock" you're theorizing can happen is when
>> byte-throttling is incorrectly configured.
>>
>> Matt
>>
>> ----- Original Message -----
>>> From: "Mark Nelson" <mnelson@redhat.com>
>>> To: "Orit Wasserman" <owasserm@redhat.com>
>>> Cc: "Matt Benjamin" <mbenjamin@redhat.com>, "ceph-devel"
>>> <ceph-devel@vger.kernel.org>, cbt@lists.ceph.com, "Mark
>>> Seger" <mjseger@gmail.com>, "Kyle Bader" <kbader@redhat.com>, "Karan
>>> Singh" <karan@redhat.com>, "Brent Compton"
>>> <bcompton@redhat.com>
>>> Sent: Tuesday, February 7, 2017 10:23:05 AM
>>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>>
>>>
>>>
>>> On 02/07/2017 09:03 AM, Orit Wasserman wrote:
>>>> On Tue, Feb 7, 2017 at 4:47 PM, Mark Nelson <mnelson@redhat.com>
>>>> wrote:
>>>>> Hi Orit,
>>>>>
>>>>> This was a pull from master over the weekend:
>>>>> 5bf39156d8312d65ef77822fbede73fd9454591f
>>>>>
>>>>> Btw, I've been noticing that when bucket index sharding is used,
>>>>> there appears to be a higher likelihood that client connection
>>>>> attempts are delayed or starved out entirely under high concurrency.
>>>>> I haven't looked at the code yet; does this match what you'd expect
>>>>> to happen? I assume the threadpool is shared?
>>>>>
>>>> yes it is shared.
>>>
>>> Ok, so that probably explains the behavior I'm seeing. Perhaps a more
>>> serious issue: Do we have anything in place to stop a herd of clients
>>> from connecting, starving out bucket index lookups, and making
>>> everything deadlock?
>>>
>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>> On 02/07/2017 07:50 AM, Orit Wasserman wrote:
>>>>>>
>>>>>> Mark,
>>>>>> On what version did you run the tests?
>>>>>>
>>>>>> Orit
>>>>>>
>>>>>> On Mon, Feb 6, 2017 at 7:07 PM, Mark Nelson <mnelson@redhat.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 02/06/2017 11:02 AM, Orit Wasserman wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 6, 2017 at 5:44 PM, Matt Benjamin
>>>>>>>> <mbenjamin@redhat.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Keep in mind, RGW does most of its request processing work in
>>>>>>>>> civetweb
>>>>>>>>> threads, so high utilization there does not necessarily imply
>>>>>>>>> civetweb-internal processing.
>>>>>>>>>
>>>>>>>>
>>>>>>>> True, but the request processing is not a CPU-intensive operation.
>>>>>>>> It does seem to indicate that the civetweb threading model simply
>>>>>>>> doesn't scale (we already noticed this) or maybe it points to
>>>>>>>> some locking issue. We need to run a profiler to understand
>>>>>>>> what is consuming CPU.
>>>>>>>> It may be a simple fix until we move to the asynchronous frontend.
>>>>>>>> It's worth investigating, as the CPU usage Mark is seeing is
>>>>>>>> really high.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The initial profiling I did definitely showed a lot of tcmalloc
>>>>>>> threading activity, which diminished after increasing the
>>>>>>> threadcache. This is quite similar to what we saw in simplemessenger
>>>>>>> with low threadcache values, though it is likely less true with the
>>>>>>> async messenger. Sadly, a profiler like perf probably isn't going to
>>>>>>> help much with debugging lock contention; grabbing GDB stack traces
>>>>>>> might help, or lttng.
>>>>>>>
>>>>>>>>
>>>>>>>> Mark,
>>>>>>>> How many concurrent requests were handled?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Most of the tests had 128 concurrent IOs per radosgw daemon. The
>>>>>>> max thread count was increased to 512. It was very obvious when we
>>>>>>> exceeded the thread count, since some getput processes would end up
>>>>>>> stalling and doing their writes after the others, leading to bogus
>>>>>>> performance data.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Orit
>>>>>>>>
>>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>>>>>>> To: "Matt Benjamin" <mbenjamin@redhat.com>
>>>>>>>>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>,
>>>>>>>>>> cbt@lists.ceph.com,
>>>>>>>>>> "Mark
>>>>>>>>>> Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>>>>>>>> <kbader@redhat.com>, "Karan Singh" <karan@redhat.com>, "Brent
>>>>>>>>>> Compton"
>>>>>>>>>> <bcompton@redhat.com>
>>>>>>>>>> Sent: Monday, February 6, 2017 10:42:04 AM
>>>>>>>>>> Subject: Re: CBT: New RGW getput benchmark and testing diary
>>>>>>>>>>
>>>>>>>>>> Just based on what I saw during these tests, it looks to me
>>>>>>>>>> like a lot more time was spent dealing with civetweb's threads
>>>>>>>>>> than in RGW itself. I didn't look too closely, but it may be
>>>>>>>>>> worth looking at whether there's any low-hanging fruit in
>>>>>>>>>> civetweb.
>>>>>>>>>>
>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>> On 02/06/2017 09:33 AM, Matt Benjamin wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the detailed effort and analysis, Mark.
>>>>>>>>>>>
>>>>>>>>>>> As we get closer to the L time-frame, it should become
>>>>>>>>>>> relevant to look at the boost::asio frontend rework I/O
>>>>>>>>>>> paths, which are the open effort to reduce CPU overhead and
>>>>>>>>>>> revise the threading model in general.
>>>>>>>>>>> Matt
>>>>>>>>>>>
>>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> From: "Mark Nelson" <mnelson@redhat.com>
>>>>>>>>>>>> To: "ceph-devel" <ceph-devel@vger.kernel.org>,
>>>>>>>>>>>> cbt@lists.ceph.com
>>>>>>>>>>>> Cc: "Mark Seger" <mjseger@gmail.com>, "Kyle Bader"
>>>>>>>>>>>> <kbader@redhat.com>,
>>>>>>>>>>>> "Karan Singh" <karan@redhat.com>, "Brent
>>>>>>>>>>>> Compton" <bcompton@redhat.com>
>>>>>>>>>>>> Sent: Monday, February 6, 2017 12:55:20 AM
>>>>>>>>>>>> Subject: CBT: New RGW getput benchmark and testing diary
>>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> Over the weekend I took a stab at improving our ability to
>>>>>>>>>>>> run RGW
>>>>>>>>>>>> performance tests in CBT. Previously the only way to do
>>>>>>>>>>>> this was
>>>>>>>>>>>> to
>>>>>>>>>>>> use
>>>>>>>>>>>> the cosbench plugin, which required a fair amount of
>>>>>>>>>>>> additional
>>>>>>>>>>>> setup and while quite powerful can be overkill in
>>>>>>>>>>>> situations where
>>>>>>>>>>>> you
>>>>>>>>>>>> want to rapidly iterate over tests looking for specific
>>>>>>>>>>>> issues. A
>>>>>>>>>>>> while
>>>>>>>>>>>> ago Mark Seger from HP told me he had created a swift
>>>>>>>>>>>> benchmark
>>>>>>>>>>>> called
>>>>>>>>>>>> "getput" that is written in python and is much more
>>>>>>>>>>>> convenient to
>>>>>>>>>>>> run
>>>>>>>>>>>> quickly in an automated fashion. Normally getput is used in
>>>>>>>>>>>> conjunction
>>>>>>>>>>>> with gpsuite, a tool for coordinating benchmarking multiple
>>>>>>>>>>>> getput
>>>>>>>>>>>> processes. This is how you would likely use getput on a
>>>>>>>>>>>> typical
>>>>>>>>>>>> ceph
>>>>>>>>>>>> or
>>>>>>>>>>>> swift cluster, but since CBT builds the cluster and has
>>>>>>>>>>>> its own way of launching multiple benchmark processes, it
>>>>>>>>>>>> uses getput directly.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Matt Benjamin
>>>>>>>>> Red Hat, Inc.
>>>>>>>>> 315 West Huron Street, Suite 140A
>>>>>>>>> Ann Arbor, Michigan 48103
>>>>>>>>>
>>>>>>>>> http://www.redhat.com/en/technologies/storage
>>>>>>>>>
>>>>>>>>> tel. 734-821-5101
>>>>>>>>> fax. 734-769-8938
>>>>>>>>> cel. 734-216-5309
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>> ceph-devel"
>>>>>>>>> in
>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2017-02-07 20:01 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-06 5:55 CBT: New RGW getput benchmark and testing diary Mark Nelson
2017-02-06 15:33 ` Matt Benjamin
2017-02-06 15:42 ` Mark Nelson
2017-02-06 15:44 ` Matt Benjamin
2017-02-06 17:02 ` Orit Wasserman
2017-02-06 17:07 ` Mark Nelson
2017-02-07 13:50 ` Orit Wasserman
2017-02-07 14:47 ` Mark Nelson
2017-02-07 15:03 ` Orit Wasserman
2017-02-07 15:23 ` Mark Nelson
2017-02-07 16:02 ` Matt Benjamin
2017-02-07 16:11 ` Mark Nelson
2017-02-07 20:01 ` Casey Bodley