CBT: New RGW getput benchmark and testing diary

* CBT: New RGW getput benchmark and testing diary
@ 2017-02-06  5:55 Mark Nelson
  2017-02-06 15:33 ` Matt Benjamin
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Nelson @ 2017-02-06  5:55 UTC (permalink / raw)
  To: ceph-devel, cbt; +Cc: Mark Seger, Kyle Bader, Karan Singh, Brent Compton

[-- Attachment #1: Type: text/plain, Size: 10461 bytes --]

Hi All,

Over the weekend I took a stab at improving our ability to run RGW 
performance tests in CBT.  Previously the only way to do this was to use 
the cosbench plugin, which required a fair amount of additional
setup and while quite powerful can be overkill in situations where you 
want to rapidly iterate over tests looking for specific issues.  A while 
ago Mark Seger from HP told me he had created a swift benchmark called 
"getput" that is written in python and is much more convenient to run 
quickly in an automated fashion.  Normally getput is used in conjunction 
with gpsuite, a tool for coordinating benchmarking multiple getput 
processes.  This is how you would likely use getput on a typical ceph or 
swift cluster, but since CBT builds the cluster and has it's own way for 
launching multiple benchmark processes, it uses getput directly.

Thankfully it was fairly easy to implement a CBT getput wrapper. 
Several aspects of CBT's RGW support and user/key management were also 
improved to make the whole process of testing RGW completely automated 
via the CBT yaml file.  As part of the process of testing and debugging 
the new getput wrapper, I ran through a series of benchmarks and tests 
to investigate 4MB write performance anomalies previously reported in 
the field.  I wrote something of a diary while doing this and thought I 
would document it here for the community.  These were not extremely 
scientific tests, though I believe the findings are relevant and my be 
useful for folks.

Test Cluster Setup

The test cluster being used has 8 nodes, 4 of which are used for OSDs 
and 4 of which are being used for RGW and clients.  Each OSD node has 
the option to use any combination of 6 Seagate Constellation ES.2 
7200RPM HDDs and 4 Intel 800GB P3700 NVMe drives.  These machines also 
have dual Intel Xeon E5-2650v3 CPUs and Intel 40GbE ethernet adapters. 
In all of these tests, 1X replication was used to eliminate it as a 
bottleneck and put maximum pressure on RGW.  It should be noted that the 
spinning disks in this cluster are attached through motherboard SATA2 
ports, and may not be performing as well as if they were using a 
dedicated SAS controller.  A copy of the cbt yaml configuration file 
used to run the tests is attached.

Some Notes

When using getput as a benchmark for RGW, it's very important to keep 
track of the total number of getput processes and the number of RGW 
threads.  If the runtime option in getput is used, some processes may 
wait until RGW threads open up before they determine their runtime 
leading to skewed results.  I believe this can be resolved in getput by 
changing the way that donetime is calculated, however the issue can be 
avoided by paying close attention to the RGW thread and getput process 
counts.

It's easy to create the wrong pools for RGW since the defaults changed 
in Jewel.  Now we must create default.rgw.buckets.index and 
default.rgw.buckets.data.  It wasn't until I looked at disk usage via 
ceph df that I disovered that my .rgw.buckets and .rgw.buckets.index 
pools were not being used and thus resulted in some bogus initial 
performance data.

RGW with HDD backed OSDs (Filestore)

The first set of tests run were with a 4 node cluster configured with 24 
HDD backed OSDs.  Immediately the first thing I noticed is that the 
number of buckets and/or bucket hsards absolutely affects large 
sequential write performance on HDD backed OSDs.  Using a single bucket 
with no shards (ie the default) resulted in 220MB/s of write throughput 
while rados bench was able to achieve 550-600MB/s.  Both of these 
numbers are quite low, though are partially explained by the lack of SSD 
journals and lack of dedicated controller hardware.

Setting the number of bucket index shards to 8 improved the RGW write 
throughput to 400MB/s. The highest throughput for this setup was 
achieved by either setting the number of buckets or the number of bucket 
index shards substantially higher than the number of OSDs.  In this 
case, 64 appeared to be sufficient.  Three high concurrency 4MB Object 
RGW PUT tests showed results of 602MB/s, 557MB/s and 563MB/s.  3 rados 
bench runs using similar concurrency showed 580MB/s, 580MB/s, and 
564MB/s respectively.  Write IO from multiple clients appeared to cause 
a slight (~5%) performance drop vs write IO from a single client.  In 
all cases, tests were stopped prior to PG splitting.

It was observed that RGW uses high amounts of CPU, especially if low 
tcmalloc threadcache values are used.  With 32MB of threadcache, RGW 
used around 500-600% CPU to serve 500-600MB/s of 4MB writes. 
Occasionally it would spike into the 1000-1200% region.  Perf showed a 
high percentage of time in tcmalloc managing civetweb threads.  With 
128MB of thread cache, this effect was greatly diminished.  RGW appeared 
to use around 300-400% CPU to serve the same workload, though in these 
tests there was little performance impact as we were not CPU bound.

PG counts and PG splitting

Based on the bucket/shard results above, it may be inferred that a RGW 
index pool with very low PG counts could significantly hurt performance. 
  What about the case where small PG counts are used in the bucket.data 
pool?  In this case, PG splitting behavior might actually be more 
important than the effect of clumpiness in the random distribution due 
to low sample counts.  To test this, a buckets.index pool was created 
with 2048 PGs while the buckets.data pool was created with 128 PGs. 
Initial 4MB sequential writes with both rados bench and with RGW were 
about 20% slower than what was seen with 2048 PGs in the data pool, 
likely due to the worse data distribution properties.

While this is significant, I was more interested in the affects of PG 
splitting in filestore.  It has been observed in the past that PG 
splitting can have a huge performance impact, especially when SELinux is 
enabled. Selinux reads a security xattr on every file access which 
grealy impacts how quickly operations like link/unlink can be performed 
during PG splits. While SELinux is not enabled on this test cluster, PG 
splitting may still have other performance impacts due to worse dentry 
and inode caching and kernel overhead.  Rados bench was used to hit the 
thresholds by writing out a large number of 4K objects.

After approximately 1.3M objects were written, performance of 4K writes 
dropped by roughly an order of magnitude.  At this point rados bench and 
RGW were used to write out 4MB objects with high levels of concurrency 
and high numbers of bucket index shards.  In both cases, performance 
started out slightly diminished, but quickly increased to near levels 
observed on the fresh cluster.  At least in this setup, pg splitting in 
the data pool did not appear to majorly affect the performance of 4MB 
object writes, though may have affect the 4k object writes used to 
pre-fill the data pool.

RGW with OSDs using HDD Data and NVMe Journals (Filestore)

Next, 24 OSDs were configured with the filestore data partitions on HDD 
and journals on NVMe.  With minimal tuning rados bench delivered
around 1350MB/s or around 56MB/s per drive.  This is lower per drive 
than other configurations we've tested, but roughly double vs how the 
OSDs were performing wihtout NVMe journals.  Interestingly, the 
difference between using a single bucket with no shards and many 
buckets/shards appeared to be minimal.  Tests against a single bucket 
with no shards resulted in 1000-1400MB/s while a test configuration with 
128 buckets resulted in 1200-1400MB/s over several repeated tests.  It 
should be noted that the single RGW instance was using anywhere from 
1000-1600% CPU to maintain these numbers even with 128MB of TCMalloc 
threadcache!  Again a performance drop was seen when using multiple 
clients vs a single client.

RGW with NVMe data and journals (Filestore)

The test machines in this cluster have 4 Intel P3700 NVMe drives which 
are each capable of about 2GB/s each.  The cluster was again 
reconfigured with 16 OSDs backed by the NVMe drives.   In this case, a 
single client running radosbench with 128 concurrent ops could saturate 
the network and write data at about 4600MB/s.  4 clients however appear 
to be able to saturate the OSDs and achieve 11620MB/s.  A single getput 
client using 128 processes to write 4MB objects to a single bucket with 
no shards achieves 1700-1800MB/s.  Radosgw CPU usage hovered around 
1500-1700%. Using 128 buckets (1 bucket per process) yielded a variety 
of performance results ranging from 1700MB/s to 3500MB/s over multiple 
tests. Radosgw CPU usage appeared to top out around 2100%.  With 4 
clients, A single radosgw instance was able to maintain roughly 3700MB/s 
of writes over several independent tests despite using roughly the same 
amount of CPU as the single client tests.

Testing 4 gateways

CBT can stand up multiple rados gateways and launch tests against all of 
them concurrently.  4 clients were configured to drive traffic to all 4 
rados gateways and ultimately the 16 NVMe backed OSDs.  The first tests 
run resulted in 404 keynotfound errors, apparently because multiple 
copies of getput attempted to write into the same bucket.  This was 
resolved by making sure that each copy of getput being run had a 
distinct bucket (though getput processes did not require the same 
attention).  Thus, the lowest number of buckets targeted in this test 
was 16.  In this configuration, the aggregate write throughput across 
all 4 gateways was 9306MB/s.  With a bucket for every process (512 
total), the aggregate throughput increased to 9445MB/s.  This was 
roughly 2361MB/s per gateway and 81% of the rados bench throughput.

Next Steps and Conclusions

RGW is now roughly as easy to test in CBT as rados and rbd.  It should 
be far easier to examine bluestore performance with RGW now.  I hope we 
should be able to do some filestore and bluestore comparisons shortly. 
I will also note that I was quite happy with how well RGW handled large 
object writes in these tests.  Despite using a large amount of CPU 
overhead, I could achieve 3700MB/s with a single RGW instance and over 
9GB/s with 4 RGW instances.  Despite this, it's clear that performance 
is still quite dependent on bucket index update latency.  Especially on 
spinning disks, it appears to be very important to use bucket index 
sharding or spread write over many buckets.

Mark

[-- Attachment #2: runtests.xfs.16.rgw.yaml --]
[-- Type: application/x-yaml, Size: 2453 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread