rados benchmark, even distribution vs. random distribution

* rados benchmark, even distribution vs. random distribution
@ 2016-01-26 19:11 Deneau, Tom
  2016-01-26 19:33 ` [Cbt] " Gregory Farnum
  0 siblings, 1 reply; 8+ messages in thread
From: Deneau, Tom @ 2016-01-26 19:11 UTC (permalink / raw)
  To: ceph-devel, cbt

Looking for some help with a Ceph performance puzzler...

Background:
I had been experimenting with running rados benchmarks with a more
controlled distribution of objects across osds.  The main reason was
to reduce run to run variability since I was running on fairly small
clusters.

I modified rados bench itself to optionally take a file with a list of
names and use those instead of the usual randomly generated names that
rados bench uses, "benchmark_data_hostname_pid_objectxxx".  It was
then possible to generate and use a list of names which hashed to an
"even distribution" across osds if desired.  The list of names was
generated by finding a set of pgs that gave even distribution and then
generating names that mapped to those pgs.  So for example with 5 osds
on a replicated=2 pool we might find 5 pgs mapping to [0,2] [1,4]
[2,3] [3,1] [4,0] and then all the names generated would map to only
those 5 pgs.

Up until recently, the results from these experiments were what I expected,
   * better repeatability across runs
   * even disk utilization.
   * generally higher total Bandwidth

Recently however I saw results on one platform that had much lower
bandwidth (less than half) with the "even distribution" run compared
to the default random distribution run.  Some notes were:

   * showed up in erasure coded pool with k=2, m=1, with 4M size
     objects.  Did not show the discrepancy on a replicated 2 pool on
     the same cluster.

   * In general, larger objects showed the problem more than smaller
     objects.

   * showed up only on reads from the erasure pool.  Writes to the
     same pool had higher bandwidth with the "even distribution"

   * showed up on only one cluster, a separate cluster with the same
     number of nodes and disks but different system architecture did
     not show this.

   * showed up only when the total number of client threads got "high
     enough".  For example showed up with 64 total client threads but
     not with 16.  The distribution of threads across client processes
     did not seem to matter.

I tried looking at "dump_historic_ops" and did indeed see some read
ops logged with high latency in the "even distribution" case.  The
larger elapsed times in the historic ops were always in the
"reached_pg" and "done" steps.  But I saw similar high latencies and
elapsed times for "reached_pg" and "done" for historic read ops in the
random case.

I have perf counters before and after the read tests.  I see big
differences in the op_r_out_bytes which makes sense because the higher
bw run processed more bytes.  For some osds, op_r_latency/sum is
slightly higher in the "even" run but not sure if this is significant.

Anyway, I will probably just stop doing these "even distribution" runs
but I was hoping to get an understanding of why they might have suche
reduced bandwidth in this particular case.  Is there something about mapping
to a smaller number of pgs that becomes a bottleneck?

-- Tom Deneau

^ permalink raw reply	[flat|nested] 8+ messages in thread