* high overhead of functions blkg_*stats_* in bfq
@ 2017-10-17 10:11 Paolo Valente
  2017-10-17 12:45 ` Paolo Valente
  2017-10-18 13:19 ` Tejun Heo
  0 siblings, 2 replies; 35+ messages in thread
From: Paolo Valente @ 2017-10-17 10:11 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-block, Luca Miccio

Hi Tejun, all,
in our work for reducing bfq overhead, we bumped into an unexpected
fact: the functions blkg_*stats_*, invoked in bfq to update cgroups
statistics as in cfq, take about 40% of the total execution time of
bfq.  This causes an additional serious slowdown on any multicore cpu,
as most bfq functions, from which blkg_*stats_* get invoked, are
protected by a per-device scheduler lock.  To give you an idea, on an
Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on
null_blk (configured with 0 latency), if the update of groups stats is
removed, then the throughput grows from 260 to 404 KIOPS.  This and
all the other results we might share in this thread can be reproduced
very easily with a (useful) script made by Luca Miccio [1].

We tried to understand the reason for this high overhead, and, in
particular, to find out whether there was some issue that we
could address on our own.  But the causes seem somehow substantial:
one of the most time-consuming operations needed by some blkg_*stats_*
functions is, e.g., find_next_bit, for which we don't see any trivial
replacement.

So, as a first attempt to reduce this severe slowdown, we have made a
patch that moves the invocation of blkg_*stats_* functions outside the
critical sections protected by the bfq lock.  Still, these functions
apparently need to be protected with the request_queue lock, because
the group they are invoked on may otherwise disappear before or while
these functions are executed.  Fortunately, tests run without even
this lock have shown that the serialization caused by this lock has
little impact (a 5% throughput reduction).  As for results, moving
these functions outside the bfq lock does improve throughput: it
grows, e.g., from 260 to 316 KIOPS in the above test case.  But we are
still rather far from the optimum.

Do you have any clue about possible solutions to reduce the overhead
of these functions?  If no relatively quick solution is available, we
are planning to prepare, in addition to the above patch to increase
parallelism, a further patch to give the user the possibility to
disable stats update, so as to gain a full throughput boost of up to
55% (according to the tests we have run so far on a few different
systems).

Thanks,
Paolo

[1] https://github.com/Algodev-github/IOSpeed

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: high overhead of functions blkg_*stats_* in bfq
  2017-10-17 10:11 high overhead of functions blkg_*stats_* in bfq Paolo Valente
@ 2017-10-17 12:45 ` Paolo Valente
  2017-10-17 16:45   ` Linus Walleij
  2017-10-18 13:19 ` Tejun Heo
  1 sibling, 1 reply; 35+ messages in thread
From: Paolo Valente @ 2017-10-17 12:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-block, Luca Miccio, Mark Brown, Ulf Hansson, Linus Walleij

+Ulf Hansson, Mark Brown, Linus Walleij

> On 17 Oct 2017, at 12:11, Paolo Valente <paolo.valente@linaro.org> wrote:
>
> Hi Tejun, all,
> in our work for reducing bfq overhead, we bumped into an unexpected
> fact: the functions blkg_*stats_*, invoked in bfq to update cgroups
> statistics as in cfq, take about 40% of the total execution time of
> bfq.  This causes an additional serious slowdown on any multicore cpu,
> as most bfq functions, from which blkg_*stats_* get invoked, are
> protected by a per-device scheduler lock.  To give you an idea, on an
> Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on
> null_blk (configured with 0 latency), if the update of groups stats is
> removed, then the throughput grows from 260 to 404 KIOPS.  This and
> all the other results we might share in this thread can be reproduced
> very easily with a (useful) script made by Luca Miccio [1].
>
> We tried to understand the reason for this high overhead, and, in
> particular, to find out whether there was some issue that we
> could address on our own.  But the causes seem somehow substantial:
> one of the most time-consuming operations needed by some blkg_*stats_*
> functions is, e.g., find_next_bit, for which we don't see any trivial
> replacement.
>
> So, as a first attempt to reduce this severe slowdown, we have made a
> patch that moves the invocation of blkg_*stats_* functions outside the
> critical sections protected by the bfq lock.  Still, these functions
> apparently need to be protected with the request_queue lock, because
> the group they are invoked on may otherwise disappear before or while
> these functions are executed.  Fortunately, tests run without even
> this lock have shown that the serialization caused by this lock has
> little impact (a 5% throughput reduction).  As for results, moving
> these functions outside the bfq lock does improve throughput: it
> grows, e.g., from 260 to 316 KIOPS in the above test case.  But we are
> still rather far from the optimum.
>
> Do you have any clue about possible solutions to reduce the overhead
> of these functions?  If no relatively quick solution is available, we
> are planning to prepare, in addition to the above patch to increase
> parallelism, a further patch to give the user the possibility to
> disable stats update, so as to gain a full throughput boost of up to
> 55% (according to the tests we have run so far on a few different
> systems).
>
> Thanks,
> Paolo
>
> [1] https://github.com/Algodev-github/IOSpeed


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-10-17 12:45 ` Paolo Valente
@ 2017-10-17 16:45   ` Linus Walleij
  2017-10-17 16:49     ` Jens Axboe
  0 siblings, 1 reply; 35+ messages in thread
From: Linus Walleij @ 2017-10-17 16:45 UTC (permalink / raw)
  To: Paolo Valente, H. Peter Anvin, Akinobu Mita, David Howells
  Cc: Tejun Heo, linux-block, Luca Miccio, Mark Brown, Ulf Hansson

On Tue, Oct 17, 2017 at 2:45 PM, Paolo Valente <paolo.valente@linaro.org> wrote:

> one of the most time-consuming operations needed by some blkg_*stats_*
> functions is, e.g., find_next_bit, for which we don't see any trivial
> replacement.

So this is one of the things that often falls down to a per-arch
assembly optimization, c.f. arch/arm/include/asm/bitops.h

On x86 I can't see any assembly optimization of this, so the
generic routines in lib/find_bit.c are used AFAICT.

This might be a silly question, but if you are testing this on x86,
do you think it would help if someone stepped in and slapped in
some optimized assembly for those functions?

(I guess that is like saying, "instead of a trivial replacement,
what about a really complicated one"?)

A simple git log arch/x86/include/asm/bitops.h doesn't show
any traces of anyone trying to optimize those for x86.

I paged in the x86 assembly people; they definitely know whether
that is a good idea or if it sucks. (And if it was done in the past.)

Yours,
Linus Walleij


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-10-17 16:45   ` Linus Walleij
@ 2017-10-17 16:49     ` Jens Axboe
  0 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2017-10-17 16:49 UTC (permalink / raw)
  To: Linus Walleij, Paolo Valente, H. Peter Anvin, Akinobu Mita,
	David Howells
  Cc: Tejun Heo, linux-block, Luca Miccio, Mark Brown, Ulf Hansson

On 10/17/2017 10:45 AM, Linus Walleij wrote:
> On Tue, Oct 17, 2017 at 2:45 PM, Paolo Valente <paolo.valente@linaro.org> wrote:
> 
>> one of the most time-consuming operations needed by some blkg_*stats_*
>> functions is, e.g., find_next_bit, for which we don't see any trivial
>> replacement.
> 
> So this is one of the things that often falls down to a per-arch
> assembly optimization, c.f. arch/arm/include/asm/bitops.h
> 
> On x86 I can't see any assembly optimization of this, so the
> generic routines in lib/find_bit.c are used AFAICT.
> 
> This might be a silly question, but if you are testing this on x86,
> do you think it would help if someone stepped in and slapped in
> some optimized assembly for those functions?
> 
> (I guess that is like saying, "instead of a trivial replacement,
> what about a really complicated one"?)
> 
> A simple git log arch/x86/include/asm/bitops.h doesn't show
> any traces of anyone trying to optimize those for x86.
> 
> I paged in the x86 assembly people; they definitely know whether
> that is a good idea or if it sucks. (And if it was done in the past.)

If the problem is as big as described, I don't think an optimized
version will matter at all. Maybe it'll make things 10% faster, but
that's not solving the issue.

It's probably more likely that a better data structure should be
used, if we're spending a lot of time in find_bit. Maybe this
happens when the space is mostly full? A bitmap of bitmaps
might help for that.

But I'm just guessing here, as I haven't looked into the problem.

-- 
Jens Axboe


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-10-17 10:11 high overhead of functions blkg_*stats_* in bfq Paolo Valente
  2017-10-17 12:45 ` Paolo Valente
@ 2017-10-18 13:19 ` Tejun Heo
  2017-10-18 14:45   ` Jens Axboe
                     ` (2 more replies)
  1 sibling, 3 replies; 35+ messages in thread
From: Tejun Heo @ 2017-10-18 13:19 UTC (permalink / raw)
  To: Paolo Valente; +Cc: linux-block, Luca Miccio

Hello, Paolo.

On Tue, Oct 17, 2017 at 12:11:01PM +0200, Paolo Valente wrote:
...
> protected by a per-device scheduler lock.  To give you an idea, on an
> Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on
> null_blk (configured with 0 latency), if the update of groups stats is
> removed, then the throughput grows from 260 to 404 KIOPS.  This and
> all the other results we might share in this thread can be reproduced
> very easily with a (useful) script made by Luca Miccio [1].

I don't think the old request_queue is ever built for multiple CPUs
hitting on a mem-backed device.

> We tried to understand the reason for this high overhead, and, in
> particular, to find out whether there was some issue that we
> could address on our own.  But the causes seem somehow substantial:
> one of the most time-consuming operations needed by some blkg_*stats_*
> functions is, e.g., find_next_bit, for which we don't see any trivial
> replacement.

Can you point to the specific ones?  I can't find find_next_bit usages
in generic blkg code.

> So, as a first attempt to reduce this severe slowdown, we have made a
> patch that moves the invocation of blkg_*stats_* functions outside the
> critical sections protected by the bfq lock.  Still, these functions
> apparently need to be protected with the request_queue lock, because

blkgs are already protected with RCU, so RCU protection should be
enough.

Thanks.

-- 
tejun


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-10-18 13:19 ` Tejun Heo
@ 2017-10-18 14:45   ` Jens Axboe
  2017-10-18 15:05     ` Paolo Valente
       [not found]   ` <2B52CB68-213C-470F-945C-0ADFF9AA7A66@linaro.org>
  2017-11-05  7:39   ` Paolo Valente
  2 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2017-10-18 14:45 UTC (permalink / raw)
  To: Tejun Heo, Paolo Valente; +Cc: linux-block, Luca Miccio

On 10/18/2017 07:19 AM, Tejun Heo wrote:
>> We tried to understand the reason for this high overhead, and, in
>> particular, to find out whether there was some issue that we
>> could address on our own.  But the causes seem somehow substantial:
>> one of the most time-consuming operations needed by some blkg_*stats_*
>> functions is, e.g., find_next_bit, for which we don't see any trivial
>> replacement.
> 
> Can you point to the specific ones?  I can't find find_next_bit usages
> in generic blkg code.

Yeah, in general a report like this is pretty much useless without
any sort of call traces or perf output. The best way to get help
is to post exactly what to run to reproduce the performance issue,
and profile output that shows/highlights the issues.

-- 
Jens Axboe


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-10-18 14:45   ` Jens Axboe
@ 2017-10-18 15:05     ` Paolo Valente
  2017-10-18 15:44       ` Jens Axboe
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Valente @ 2017-10-18 15:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Tejun Heo, linux-block, Luca Miccio


> On 18 Oct 2017, at 16:45, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 10/18/2017 07:19 AM, Tejun Heo wrote:
>>> We tried to understand the reason for this high overhead, and, in
>>> particular, to find out whether there was some issue that we
>>> could address on our own.  But the causes seem somehow substantial:
>>> one of the most time-consuming operations needed by some blkg_*stats_*
>>> functions is, e.g., find_next_bit, for which we don't see any trivial
>>> replacement.
>>
>> Can you point to the specific ones?  I can't find find_next_bit usages
>> in generic blkg code.
>
> Yeah, in general a report like this is pretty much useless without
> any sort of call traces or perf output. The best way to get help
> is to post exactly what to run to reproduce the performance issue,
> and profile output that shows/highlights the issues.
>

Yes, sorry.  To be very brief, I just provided a link to the script
with which one can immediately reproduce the issue.

I hope the information I have now provided in my reply to Tejun is
enough.

Thanks,
Paolo

> -- 
> Jens Axboe
>


* Re: high overhead of functions blkg_*stats_* in bfq
       [not found]   ` <2B52CB68-213C-470F-945C-0ADFF9AA7A66@linaro.org>
@ 2017-10-18 15:08     ` Paolo Valente
  2017-10-18 15:40       ` Paolo Valente
       [not found]       ` <D6586934-DF02-4102-8839-8912DFA86BB0@linaro.org>
  0 siblings, 2 replies; 35+ messages in thread
From: Paolo Valente @ 2017-10-18 15:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-block, Luca Miccio, hpa, akinobu.mita, dhowells,
	Mark Brown, Ulf Hansson, Linus Walleij

Adding again Ulf, Linus and all the others, because Tejun replied to my
initial email, which did not include them yet as recipients.

> On 18 Oct 2017, at 17:02, Paolo Valente <paolo.valente@linaro.org> wrote:
>
>>
>> On 18 Oct 2017, at 15:19, Tejun Heo <tj@kernel.org> wrote:
>>
>> Hello, Paolo.
>>
>> On Tue, Oct 17, 2017 at 12:11:01PM +0200, Paolo Valente wrote:
>> ...
>>> protected by a per-device scheduler lock.  To give you an idea, on an
>>> Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on
>>> null_blk (configured with 0 latency), if the update of groups stats is
>>> removed, then the throughput grows from 260 to 404 KIOPS.  This and
>>> all the other results we might share in this thread can be reproduced
>>> very easily with a (useful) script made by Luca Miccio [1].
>>
>> I don't think the old request_queue is ever built for multiple CPUs
>> hitting on a mem-backed device.
>>
>>> We tried to understand the reason for this high overhead, and, in
>>> particular, to find out whether there was some issue that we
>>> could address on our own.  But the causes seem somehow substantial:
>>> one of the most time-consuming operations needed by some blkg_*stats_*
>>> functions is, e.g., find_next_bit, for which we don't see any trivial
>>> replacement.
>>
>> Can you point to the specific ones?  I can't find find_next_bit usages
>> in generic blkg code.
>>
>
> Yes, sorry for being too generic in the first place (fear of writing too
> much).
>
> I have attached a flame graph (made by Luca), showing all involved
> functions.  Look, e.g., for the blkg_*stat_* functions invoked
> indirectly by bfq_dispatch_request, inside any of the worker
> processes.  As I already wrote, find_next_bit seems to be only part of
> the cost of these functions (although an important part).
>
> You can obtain/reproduce the information in the flame graph (on an
> 8-logical-core CPU), by invoking
>
> perf record -g -a --call-graph dwarf -F 999
>
> and, in parallel,
>
> sudo ./IO_sched-speedtest.sh 20 8 bfq randread
>
> where IO_sched-speedtest.sh is the script I mentioned in my previous
> email [1]
>
> [1] https://github.com/Algodev-github/IOSpeed
>
>>> So, as a first attempt to reduce this severe slowdown, we have made a
>>> patch that moves the invocation of blkg_*stats_* functions outside the
>>> critical sections protected by the bfq lock.  Still, these functions
>>> apparently need to be protected with the request_queue lock, because
>>
>> blkgs are already protected with RCU, so RCU protection should be
>> enough.
>>
>
> blkgs are, but the blkg_stat objects passed to the blkg_*stat_*
> functions by bfq are not.  In particular, these objects are contained
> in bfq_group objects.  Anyway, as I wrote, the cost of using the
> request_queue lock seems to be a loss of about 5% of the throughput.
> So, I guess that replacing this lock with RCU protection would
> probably reduce this loss to only 2% or 3%.  I wonder whether such a
> gain would be worth the additional conceptual complexity of RCU; at
> least until the major problem, i.e., the apparently high cost of stat
> update, is solved somehow.
>
> Thanks,
> Paolo
>
>> Thanks.
>>
>> -- 
>> tejun
>
> <bfq-tracing-cgroup.svg>


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-10-18 15:08     ` Paolo Valente
@ 2017-10-18 15:40       ` Paolo Valente
       [not found]       ` <D6586934-DF02-4102-8839-8912DFA86BB0@linaro.org>
  1 sibling, 0 replies; 35+ messages in thread
From: Paolo Valente @ 2017-10-18 15:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-block, Luca Miccio, hpa, akinobu.mita, dhowells,
	Mark Brown, Ulf Hansson, Linus Walleij


> On 18 Oct 2017, at 17:08, Paolo Valente <paolo.valente@linaro.org> wrote:
>
> Adding again Ulf, Linus and all the others, because Tejun replied to my
> initial email, which did not include them yet as recipients.
>
>> On 18 Oct 2017, at 17:02, Paolo Valente <paolo.valente@linaro.org> wrote:
>>
>>>
>>> On 18 Oct 2017, at 15:19, Tejun Heo <tj@kernel.org> wrote:
>>>
>>> Hello, Paolo.
>>>
>>> On Tue, Oct 17, 2017 at 12:11:01PM +0200, Paolo Valente wrote:
>>> ...
>>>> protected by a per-device scheduler lock.  To give you an idea, on an
>>>> Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on
>>>> null_blk (configured with 0 latency), if the update of groups stats is
>>>> removed, then the throughput grows from 260 to 404 KIOPS.  This and
>>>> all the other results we might share in this thread can be reproduced
>>>> very easily with a (useful) script made by Luca Miccio [1].
>>>
>>> I don't think the old request_queue is ever built for multiple CPUs
>>> hitting on a mem-backed device.
>>>
>>>> We tried to understand the reason for this high overhead, and, in
>>>> particular, to find out whether there was some issue that we
>>>> could address on our own.  But the causes seem somehow substantial:
>>>> one of the most time-consuming operations needed by some blkg_*stats_*
>>>> functions is, e.g., find_next_bit, for which we don't see any trivial
>>>> replacement.
>>>
>>> Can you point to the specific ones?  I can't find find_next_bit usages
>>> in generic blkg code.
>>>
>>
>> Yes, sorry for being too generic in the first place (fear of writing too
>> much).
>>
>> I have attached a flame graph (made by Luca), showing all involved
>> functions.  Look, e.g., for the blkg_*stat_* functions invoked
>> indirectly by bfq_dispatch_request, inside any of the worker
>> processes.  As I already wrote, find_next_bit seems to be only part of
>> the cost of these functions (although an important part).
>>
>> You can obtain/reproduce the information in the flame graph (on an
>> 8-logical-core CPU),

Sorry, that flame graph was actually made from a test run on an AMD
A8-3850 with 4 physical/4 logical cores, and passing 4, to the script,
for the number of fio threads.  Anyway, we got the same results on all
the different CPUs we used for our tests.

Thanks,
Paolo

>> by invoking
>>
>> perf record -g -a --call-graph dwarf -F 999
>>
>> and, in parallel,
>>
>> sudo ./IO_sched-speedtest.sh 20 8 bfq randread
>>
>> where IO_sched-speedtest.sh is the script I mentioned in my previous
>> email [1]
>>
>> [1] https://github.com/Algodev-github/IOSpeed
>>
>>>> So, as a first attempt to reduce this severe slowdown, we have made a
>>>> patch that moves the invocation of blkg_*stats_* functions outside the
>>>> critical sections protected by the bfq lock.  Still, these functions
>>>> apparently need to be protected with the request_queue lock, because
>>>
>>> blkgs are already protected with RCU, so RCU protection should be
>>> enough.
>>>
>>
>> blkgs are, but the blkg_stat objects passed to the blkg_*stat_*
>> functions by bfq are not.  In particular, these objects are contained
>> in bfq_group objects.  Anyway, as I wrote, the cost of using the
>> request_queue lock seems to be a loss of about 5% of the throughput.
>> So, I guess that replacing this lock with RCU protection would
>> probably reduce this loss to only 2% or 3%.  I wonder whether such a
>> gain would be worth the additional conceptual complexity of RCU; at
>> least until the major problem, i.e., the apparently high cost of stat
>> update, is solved somehow.
>>
>> Thanks,
>> Paolo
>>
>>> Thanks.
>>>
>>> -- 
>>> tejun
>>
>> <bfq-tracing-cgroup.svg>
>


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-10-18 15:05     ` Paolo Valente
@ 2017-10-18 15:44       ` Jens Axboe
  0 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2017-10-18 15:44 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Tejun Heo, linux-block, Luca Miccio

On 10/18/2017 09:05 AM, Paolo Valente wrote:
> 
>> On 18 Oct 2017, at 16:45, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 10/18/2017 07:19 AM, Tejun Heo wrote:
>>>> We tried to understand the reason for this high overhead, and, in
>>>> particular, to find out whether there was some issue that we
>>>> could address on our own.  But the causes seem somehow substantial:
>>>> one of the most time-consuming operations needed by some blkg_*stats_*
>>>> functions is, e.g., find_next_bit, for which we don't see any trivial
>>>> replacement.
>>>
>>> Can you point to the specific ones?  I can't find find_next_bit usages
>>> in generic blkg code.
>>
>> Yeah, in general a report like this is pretty much useless without
>> any sort of call traces or perf output. The best way to get help
>> is to post exactly what to run to reproduce the performance issue,
>> and profile output that shows/highlights the issues.
>>
> 
> Yes, sorry.  To be very brief, I just provided a link to the script
> with which one can immediately reproduce the issue.

Brief is about the number of words; you can never include too much
actual information or data ;-)

> I hope the information I have now provided in my reply to Tejun is
> enough.

The picture is no longer attached. What list was this on?

-- 
Jens Axboe


* Re: high overhead of functions blkg_*stats_* in bfq
       [not found]       ` <D6586934-DF02-4102-8839-8912DFA86BB0@linaro.org>
@ 2017-10-19  6:50         ` Paolo Valente
  2017-10-21 16:13           ` Tejun Heo
  2017-10-30  9:49           ` David Howells
  0 siblings, 2 replies; 35+ messages in thread
From: Paolo Valente @ 2017-10-19  6:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-block, Luca Miccio, hpa, akinobu.mita, dhowells,
	Mark Brown, Ulf Hansson, Linus Walleij


> On 19 Oct 2017, at 08:46, Paolo Valente <paolo.valente@linaro.org> wrote:
> ...
>>>
>>> blkgs are, but the blkg_stat objects passed to the blkg_*stat_*
>>> functions by bfq are not.  In particular, these objects are contained
>>> in bfq_group objects.

I talked partial nonsense here.  bfqg objects are in bfq, but a bfqg
may die only after the corresponding blkg has died.  So, if RCU
protection guarantees that the blkg is alive when stat-update
functions are invoked, then the RCU guarantees that bfqg is alive too.
Still, I'm scratching my head over your claim that there is such a
protection.

The blkg obtained through a blkg_lookup, in a rcu_read section, is
protected.  But, outside that section, a pointer to that blkg is not
guaranteed to be valid any longer.  Stat-update functions seem safe in
cfq and bfq, just because they are executed within locks that happen
to be taken also before destroying the blkg.  They are the
request_queue lock for cfq and the scheduler lock for bfq.  Thus, at
least the request_queue lock apparently needs to be taken around
stat-update functions in bfq, if they are moved outside the section
protected by the scheduler lock.

Thanks,
Paolo

>>>   Anyway, as I wrote, the cost of using the
>>> request_queue lock seems to be a loss of about 5% of the throughput.
>>> So, I guess that replacing this lock with RCU protection would
>>> probably reduce this loss to only 2% or 3%.  I wonder whether such a
>>> gain would be worth the additional conceptual complexity of RCU; at
>>> least until the major problem, i.e., the apparently high cost of stat
>>> update, is solved somehow.
>>>
>>> Thanks,
>>> Paolo
>>>
>>>> Thanks.
>>>>
>>>> -- 
>>>> tejun
>>>
>>> <bfq-tracing-cgroup.svg>
>>
>


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-10-19  6:50         ` Paolo Valente
@ 2017-10-21 16:13           ` Tejun Heo
  2017-10-22  8:25             ` Paolo Valente
  2017-10-30  9:49           ` David Howells
  1 sibling, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2017-10-21 16:13 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-block, Luca Miccio, hpa, akinobu.mita, dhowells,
	Mark Brown, Ulf Hansson, Linus Walleij

Hello, Paolo.

On Thu, Oct 19, 2017 at 08:50:17AM +0200, Paolo Valente wrote:
> The blkg obtained through a blkg_lookup, in a rcu_read section, is
> protected.  But, outside that section, a pointer to that blkg is not
> guaranteed to be valid any longer.  Stat-update functions seem safe in

blkg's destruction is rcu delayed.  If you have access to a blkg under
rcu, it won't get freed until the rcu read lock is released.

> cfq and bfq, just because they are executed within locks that happen
> to be taken also before destroying the blkg.  They are the
> request_queue lock for cfq and the scheduler lock for bfq.  Thus, at
> least the request_queue lock apparently needs to be taken around
> stat-update functions in bfq, if they are moved outside the section
> protected by the scheduler lock.

So, a blkg stays alive if the queue lock is held, or if the cgroup and
request_queue stay alive, and won't be freed (which is different from
being alive) as long as the RCU read lock is held.

Thanks.

-- 
tejun


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-10-21 16:13           ` Tejun Heo
@ 2017-10-22  8:25             ` Paolo Valente
  0 siblings, 0 replies; 35+ messages in thread
From: Paolo Valente @ 2017-10-22  8:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-block, Luca Miccio, hpa, akinobu.mita, dhowells,
	Mark Brown, Ulf Hansson, Linus Walleij


> On 21 Oct 2017, at 18:13, Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Paolo.
>
> On Thu, Oct 19, 2017 at 08:50:17AM +0200, Paolo Valente wrote:
>> The blkg obtained through a blkg_lookup, in a rcu_read section, is
>> protected.  But, outside that section, a pointer to that blkg is not
>> guaranteed to be valid any longer.  Stat-update functions seem safe in
>
> blkg's destruction is rcu delayed.  If you have access to a blkg under
> rcu, it won't get freed until the rcu read lock is released.
>
>> cfq and bfq, just because they are executed within locks that happen
>> to be taken also before destroying the blkg.  They are the
>> request_queue lock for cfq and the scheduler lock for bfq.  Thus, at
>> least the request_queue lock apparently needs to be taken around
>> stat-update functions in bfq, if they are moved outside the section
>> protected by the scheduler lock.
>
> So, a blkg stays alive if the queue lock is held, or if the cgroup and
> request_queue stay alive, and won't be freed (which is different from
> being alive) as long as the RCU read lock is held.
>

Ok, then I have to either protect these stats-update functions with
the queue lock, or somehow extend RCU protection to the bfq groups
referred to by these functions.  According to our tests with different
processors, the gain of the latter solution is probably around a 2 or
3% IOPS boost with respect to using just the queue lock.  That would
be a small gain, set against the current rather high overhead of the
functions to protect, and it would cost additional code complexity.
So I'm oriented towards using just the queue lock.

Thank you very much for helping me with these doubts,
Paolo

> Thanks.
>=20
> --=20
> tejun


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-10-19  6:50         ` Paolo Valente
  2017-10-21 16:13           ` Tejun Heo
@ 2017-10-30  9:49           ` David Howells
  1 sibling, 0 replies; 35+ messages in thread
From: David Howells @ 2017-10-30  9:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: dhowells, Paolo Valente, linux-block, Luca Miccio, hpa,
	akinobu.mita, Mark Brown, Ulf Hansson, Linus Walleij

Tejun Heo <tj@kernel.org> wrote:

> > The blkg obtained through a blkg_lookup, in a rcu_read section, is
> > protected.  But, outside that section, a pointer to that blkg is not
> > guaranteed to be valid any longer.  Stat-update functions seem safe in
> 
> blkg's destruction is rcu delayed.  If you have access to a blkg under
> rcu, it won't get freed until the rcu read lock is released.

You also have to remember that if you take the RCU read lock just for this,
there is potentially a cost to be paid later as you may cause extra grace
period handling.

David


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-10-18 13:19 ` Tejun Heo
  2017-10-18 14:45   ` Jens Axboe
       [not found]   ` <2B52CB68-213C-470F-945C-0ADFF9AA7A66@linaro.org>
@ 2017-11-05  7:39   ` Paolo Valente
  2017-11-06  2:21     ` Jens Axboe
  2 siblings, 1 reply; 35+ messages in thread
From: Paolo Valente @ 2017-11-05  7:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-block, Luca Miccio, Ulf Hansson, Linus Walleij, Mark Brown


> On 18 Oct 2017, at 15:19, Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Paolo.
>
> On Tue, Oct 17, 2017 at 12:11:01PM +0200, Paolo Valente wrote:
> ...
>> protected by a per-device scheduler lock.  To give you an idea, on an
>> Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on
>> null_blk (configured with 0 latency), if the update of groups stats is
>> removed, then the throughput grows from 260 to 404 KIOPS.  This and
>> all the other results we might share in this thread can be reproduced
>> very easily with a (useful) script made by Luca Miccio [1].
>
> I don't think the old request_queue is ever built for multiple CPUs
> hitting on a mem-backed device.
>

Hi,
from our measurements, the code and the comments received so far in
this thread, I guess that reducing the execution time of blkg_*stats_*
functions is not an easy task, and is unlikely to be accomplished in
the short term.  In this respect, we have unfortunately found out that
executing these functions causes a very high reduction of the
sustainable throughput on some CPUs.  For example, -70% on an ARM
Cortex-A53 octa-core.

Thus, to deal with such a considerable slowdown, until the overhead of
these functions gets reduced, it may make more sense to switch the
update of these statistics off, in all cases where these statistics
are not used, while higher performance (or lower power consumption) is
welcome/needed.

We wondered, however, how hazardous it might be to switch the update
of these statistics off.  To answer this question, we investigated the
extent to which these statistics are used by applications and
services.  Mainly, we tried to survey relevant people or
forums/mailing lists for involved communities: Linux distributions,
systemd, containers and other minor communities.  Nobody reported any
application or service using these statistics (either the variant
updated by bfq, or that updated by cfq).

So, one of the patches we are working on gives the user the
possibility to disable the update of these statistics online.

Thanks,
Paolo

>> We tried to understand the reason for this high overhead, and, in
>> particular, to find out whether there was some issue that we
>> could address on our own.  But the causes seem somehow substantial:
>> one of the most time-consuming operations needed by some blkg_*stats_*
>> functions is, e.g., find_next_bit, for which we don't see any trivial
>> replacement.
>
> Can you point to the specific ones?  I can't find find_next_bit usages
> in generic blkg code.
>
>> So, as a first attempt to reduce this severe slowdown, we have made a
>> patch that moves the invocation of blkg_*stats_* functions outside the
>> critical sections protected by the bfq lock.  Still, these functions
>> apparently need to be protected with the request_queue lock, because
>
> blkgs are already protected with RCU, so RCU protection should be
> enough.
>
> Thanks.
>
> --
> tejun


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-05  7:39   ` Paolo Valente
@ 2017-11-06  2:21     ` Jens Axboe
  2017-11-06  9:22       ` Ulf Hansson
                         ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Jens Axboe @ 2017-11-06  2:21 UTC (permalink / raw)
  To: Paolo Valente, Tejun Heo
  Cc: linux-block, Luca Miccio, Ulf Hansson, Linus Walleij, Mark Brown

On 11/05/2017 01:39 AM, Paolo Valente wrote:
> 
>> On 18 Oct 2017, at 15:19, Tejun Heo <tj@kernel.org> wrote:
>>
>> Hello, Paolo.
>>
>> On Tue, Oct 17, 2017 at 12:11:01PM +0200, Paolo Valente wrote:
>> ...
>>> protected by a per-device scheduler lock.  To give you an idea, on an
>>> Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on
>>> null_blk (configured with 0 latency), if the update of groups stats is
>>> removed, then the throughput grows from 260 to 404 KIOPS.  This and
>>> all the other results we might share in this thread can be reproduced
>>> very easily with a (useful) script made by Luca Miccio [1].
>>
>> I don't think the old request_queue is ever built for multiple CPUs
>> hitting on a mem-backed device.
>>
> 
> Hi,
> from our measurements, the code and the comments received so far in
> this thread, I guess that reducing the execution time of blkg_*stats_*
> functions is not an easy task, and is unlikely to be accomplished in
> the short term.  In this respect, we have unfortunately found out that
> executing these functions causes a very high reduction of the
> sustainable throughput on some CPUs.  For example, -70% on an ARM
> Cortex-A53 octa-core.
> 
> Thus, to deal with such a considerable slowdown, until the overhead of
> these functions gets reduced, it may make more sense to switch the
> update of these statistics off, in all cases where these statistics
> are not used, while higher performance (or lower power consumption) is
> welcome/needed.
> 
> We wondered, however, how hazardous it might be to switch the update
> of these statistics off.  To answer this question, we investigated the
> extent to which these statistics are used by applications and
> services.  Mainly, we tried to survey relevant people or
> forums/mailing lists for involved communities: Linux distributions,
> systemd, containers and other minor communities.  Nobody reported any
> application or service using these statistics (either the variant
> updated by bfq, or that updated by cfq).
> 
> So, one of the patches we are working on gives the user the
> possibility to disable the update of these statistics online.

If you want help with this, provide an easy way to reproduce this,
and/or some decent profiling output. There was one flamegraph posted,
but that was basically useless. Just do:

perf record -g -- whatever test
perf report -g --no-children

and post the top 10 entries from the perf report.

It's pointless to give up on this so soon, when no effort has apparently
been dedicated to figuring out what the actual issue is yet. So no, no
patch that will just disable the stats is going to be accepted.

That said, I have no idea who uses these stats. Surely someone can
answer that question. Tejun?

-- 
Jens Axboe


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06  2:21     ` Jens Axboe
@ 2017-11-06  9:22       ` Ulf Hansson
  2017-11-06  9:49         ` Paolo Valente
  2017-11-06 15:00       ` Tejun Heo
  2017-11-06 18:46       ` Paolo Valente
  2 siblings, 1 reply; 35+ messages in thread
From: Ulf Hansson @ 2017-11-06  9:22 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Paolo Valente, linux-block, Luca Miccio, Linus Walleij, Mark Brown

On 6 November 2017 at 03:21, Jens Axboe <axboe@kernel.dk> wrote:
> On 11/05/2017 01:39 AM, Paolo Valente wrote:
>>
>>> On 18 Oct 2017, at 15:19, Tejun Heo <tj@kernel.org> wrote:
>>>
>>> Hello, Paolo.
>>>
>>> On Tue, Oct 17, 2017 at 12:11:01PM +0200, Paolo Valente wrote:
>>> ...
>>>> protected by a per-device scheduler lock.  To give you an idea, on an
>>>> Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on
>>>> null_blk (configured with 0 latency), if the update of groups stats is
>>>> removed, then the throughput grows from 260 to 404 KIOPS.  This and
>>>> all the other results we might share in this thread can be reproduced
>>>> very easily with a (useful) script made by Luca Miccio [1].
>>>
>>> I don't think the old request_queue is ever built for multiple CPUs
>>> hitting on a mem-backed device.
>>>
>>
>> Hi,
>> from our measurements, the code and the comments received so far in
>> this thread, I guess that reducing the execution time of blkg_*stats_*
>> functions is not an easy task, and is unlikely to be accomplished in
>> the short term.  In this respect, we have unfortunately found out that
>> executing these functions causes a very high reduction of the
>> sustainable throughput on some CPUs.  For example, -70% on an ARM
>> Cortex-A53 octa-core.
>>
>> Thus, to deal with such a considerable slowdown, until the overhead of
>> these functions gets reduced, it may make more sense to switch the
>> update of these statistics off, in all cases where these statistics
>> are not used, while higher performance (or lower power consumption) is
>> welcome/needed.
>>
>> We wondered, however, how hazardous it might be to switch the update
>> of these statistics off.  To answer this question, we investigated the
>> extent to which these statistics are used by applications and
>> services.  Mainly, we tried to survey relevant people or
>> forums/mailing lists for involved communities: Linux distributions,
>> systemd, containers and other minor communities.  Nobody reported any
>> application or service using these statistics (either the variant
>> updated by bfq, or that updated by cfq).
>>
>> So, one of the patches we are working on gives the user the
>> possibility to disable the update of these statistics online.
>
> If you want help with this, provide an easy way to reproduce this,
> and/or some decent profiling output. There was one flamegraph posted,
> but that was basically useless. Just do:
>
> perf record -g -- whatever test
> perf report -g --no-children
>
> and post the top 10 entries from the perf report.
>
> It's pointless to give up on this so soon, when no effort has apparently
> been dedicated to figuring out what the actual issue is yet. So no, no
> patch that will just disable the stats is going to be accepted.
>
> That said, I have no idea who uses these stats. Surely someone can
> answer that question. Tejun?

Jens, Tejun, apologies for side-tracking the discussion.

It sounds to me that these stats should have been put into debugfs,
rather than sysfs from the beginning.

Perhaps we could consider moving them to debugfs for the
mq-schedulers, as those are still rather new?

Of course that doesn't solve the high overhead of stat computation,
which seems very reasonable to investigate further, no matter what.

Kind regards
Uffe


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06  9:22       ` Ulf Hansson
@ 2017-11-06  9:49         ` Paolo Valente
  2017-11-06 10:48           ` Ulf Hansson
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Valente @ 2017-11-06  9:49 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jens Axboe, Tejun Heo, linux-block, Luca Miccio, Linus Walleij,
	Mark Brown, akinobu.mita, hpa, dhowells


> On 6 Nov 2017, at 10:22, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On 6 November 2017 at 03:21, Jens Axboe <axboe@kernel.dk> wrote:
>> On 11/05/2017 01:39 AM, Paolo Valente wrote:
>>>
>>>> On 18 Oct 2017, at 15:19, Tejun Heo <tj@kernel.org> wrote:
>>>>
>>>> Hello, Paolo.
>>>>
>>>> On Tue, Oct 17, 2017 at 12:11:01PM +0200, Paolo Valente wrote:
>>>> ...
>>>>> protected by a per-device scheduler lock.  To give you an idea, on an
>>>>> Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on
>>>>> null_blk (configured with 0 latency), if the update of groups stats is
>>>>> removed, then the throughput grows from 260 to 404 KIOPS.  This and
>>>>> all the other results we might share in this thread can be reproduced
>>>>> very easily with a (useful) script made by Luca Miccio [1].
>>>>
>>>> I don't think the old request_queue is ever built for multiple CPUs
>>>> hitting on a mem-backed device.
>>>>
>>>
>>> Hi,
>>> from our measurements, the code and the comments received so far in
>>> this thread, I guess that reducing the execution time of blkg_*stats_*
>>> functions is not an easy task, and is unlikely to be accomplished in
>>> the short term.  In this respect, we have unfortunately found out that
>>> executing these functions causes a very high reduction of the
>>> sustainable throughput on some CPUs.  For example, -70% on an ARM
>>> Cortex-A53 octa-core.
>>>
>>> Thus, to deal with such a considerable slowdown, until the overhead of
>>> these functions gets reduced, it may make more sense to switch the
>>> update of these statistics off, in all cases where these statistics
>>> are not used, while higher performance (or lower power consumption) is
>>> welcome/needed.
>>>
>>> We wondered, however, how hazardous it might be to switch the update
>>> of these statistics off.  To answer this question, we investigated the
>>> extent to which these statistics are used by applications and
>>> services.  Mainly, we tried to survey relevant people or
>>> forums/mailing lists for involved communities: Linux distributions,
>>> systemd, containers and other minor communities.  Nobody reported any
>>> application or service using these statistics (either the variant
>>> updated by bfq, or that updated by cfq).
>>>
>>> So, one of the patches we are working on gives the user the
>>> possibility to disable the update of these statistics online.
>>
>> If you want help with this, provide an easy way to reproduce this,
>> and/or some decent profiling output. There was one flamegraph posted,
>> but that was basically useless. Just do:
>>
>> perf record -g -- whatever test
>> perf report -g --no-children
>>
>> and post the top 10 entries from the perf report.
>>
>> It's pointless to give up on this so soon, when no effort has apparently
>> been dedicated to figuring out what the actual issue is yet. So no, no
>> patch that will just disable the stats is going to be accepted.
>>
>> That said, I have no idea who uses these stats. Surely someone can
>> answer that question. Tejun?
>
> Jens, Tejun, apologies for side-tracking the discussion.
>
> It sounds to me that these stats should have been put into debugfs,
> rather than sysfs from the beginning.
>

Ulf,
let me just add a bit of info, if useful: four of those stat files are
explicitly meant for debugging (as per the documentation), and created
if CONFIG_DEBUG_BLK_CGROUP=y.

Paolo


> Perhaps we could consider moving them to debugfs for the
> mq-schedulers, as those are still rather new?
>
> Of course that doesn't solve the high overhead of stat computation,
> which seems very reasonable to investigate further, no matter what.
>
> Kind regards
> Uffe


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06  9:49         ` Paolo Valente
@ 2017-11-06 10:48           ` Ulf Hansson
  2017-11-06 11:20             ` Paolo Valente
  0 siblings, 1 reply; 35+ messages in thread
From: Ulf Hansson @ 2017-11-06 10:48 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, Tejun Heo, linux-block, Luca Miccio, Linus Walleij,
	Mark Brown, Akinobu Mita, hpa, David Howells

On 6 November 2017 at 10:49, Paolo Valente <paolo.valente@linaro.org> wrote:
>
>> On 6 Nov 2017, at 10:22, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>>
>> On 6 November 2017 at 03:21, Jens Axboe <axboe@kernel.dk> wrote:
>>> On 11/05/2017 01:39 AM, Paolo Valente wrote:
>>>>
>>>>> On 18 Oct 2017, at 15:19, Tejun Heo <tj@kernel.org> wrote:
>>>>>
>>>>> Hello, Paolo.
>>>>>
>>>>> On Tue, Oct 17, 2017 at 12:11:01PM +0200, Paolo Valente wrote:
>>>>> ...
>>>>>> protected by a per-device scheduler lock.  To give you an idea, on an
>>>>>> Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on
>>>>>> null_blk (configured with 0 latency), if the update of groups stats is
>>>>>> removed, then the throughput grows from 260 to 404 KIOPS.  This and
>>>>>> all the other results we might share in this thread can be reproduced
>>>>>> very easily with a (useful) script made by Luca Miccio [1].
>>>>>
>>>>> I don't think the old request_queue is ever built for multiple CPUs
>>>>> hitting on a mem-backed device.
>>>>>
>>>>
>>>> Hi,
>>>> from our measurements, the code and the comments received so far in
>>>> this thread, I guess that reducing the execution time of blkg_*stats_*
>>>> functions is not an easy task, and is unlikely to be accomplished in
>>>> the short term.  In this respect, we have unfortunately found out that
>>>> executing these functions causes a very high reduction of the
>>>> sustainable throughput on some CPUs.  For example, -70% on an ARM
>>>> Cortex-A53 octa-core.
>>>>
>>>> Thus, to deal with such a considerable slowdown, until the overhead of
>>>> these functions gets reduced, it may make more sense to switch the
>>>> update of these statistics off, in all cases where these statistics
>>>> are not used, while higher performance (or lower power consumption) is
>>>> welcome/needed.
>>>>
>>>> We wondered, however, how hazardous it might be to switch the update
>>>> of these statistics off.  To answer this question, we investigated the
>>>> extent to which these statistics are used by applications and
>>>> services.  Mainly, we tried to survey relevant people or
>>>> forums/mailing lists for involved communities: Linux distributions,
>>>> systemd, containers and other minor communities.  Nobody reported any
>>>> application or service using these statistics (either the variant
>>>> updated by bfq, or that updated by cfq).
>>>>
>>>> So, one of the patches we are working on gives the user the
>>>> possibility to disable the update of these statistics online.
>>>
>>> If you want help with this, provide an easy way to reproduce this,
>>> and/or some decent profiling output. There was one flamegraph posted,
>>> but that was basically useless. Just do:
>>>
>>> perf record -g -- whatever test
>>> perf report -g --no-children
>>>
>>> and post the top 10 entries from the perf report.
>>>
>>> It's pointless to give up on this so soon, when no effort has apparently
>>> been dedicated to figuring out what the actual issue is yet. So no, no
>>> patch that will just disable the stats is going to be accepted.
>>>
>>> That said, I have no idea who uses these stats. Surely someone can
>>> answer that question. Tejun?
>>
>> Jens, Tejun, apologies for side-tracking the discussion.
>>
>> It sounds to me that these stats should have been put into debugfs,
>> rather than sysfs from the beginning.
>>
>
> Ulf,
> let me just add a bit of info, if useful: four of those stat files are
> explicitly meant for debugging (as per the documentation), and created
> if CONFIG_DEBUG_BLK_CGROUP=y.
>
> Paolo

Right, so it's a mixture of debugfs/sysfs then.

In the BFQ case, it seems like CONFIG_DEBUG_BLK_CGROUP isn't checked.
I assume that should be changed, which would remove at least some of
the computation overhead when this Kconfig is unset.

Perhaps one may even consider moving all stats for BFQ within that
Kconfig (and for other mq-schedulers if those ever intend to
implement support for the stats).

Kind regards
Uffe


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06 10:48           ` Ulf Hansson
@ 2017-11-06 11:20             ` Paolo Valente
  0 siblings, 0 replies; 35+ messages in thread
From: Paolo Valente @ 2017-11-06 11:20 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jens Axboe, Tejun Heo, linux-block, Luca Miccio, Linus Walleij,
	Mark Brown, Akinobu Mita, hpa, David Howells


> On 6 Nov 2017, at 11:48, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On 6 November 2017 at 10:49, Paolo Valente <paolo.valente@linaro.org> wrote:
>>
>>> On 6 Nov 2017, at 10:22, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>>>
>>> On 6 November 2017 at 03:21, Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 11/05/2017 01:39 AM, Paolo Valente wrote:
>>>>>
>>>>>> On 18 Oct 2017, at 15:19, Tejun Heo <tj@kernel.org> wrote:
>>>>>>
>>>>>> Hello, Paolo.
>>>>>>
>>>>>> On Tue, Oct 17, 2017 at 12:11:01PM +0200, Paolo Valente wrote:
>>>>>> ...
>>>>>>> protected by a per-device scheduler lock.  To give you an idea, on an
>>>>>>> Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on
>>>>>>> null_blk (configured with 0 latency), if the update of groups stats is
>>>>>>> removed, then the throughput grows from 260 to 404 KIOPS.  This and
>>>>>>> all the other results we might share in this thread can be reproduced
>>>>>>> very easily with a (useful) script made by Luca Miccio [1].
>>>>>>
>>>>>> I don't think the old request_queue is ever built for multiple CPUs
>>>>>> hitting on a mem-backed device.
>>>>>>
>>>>>
>>>>> Hi,
>>>>> from our measurements, the code and the comments received so far in
>>>>> this thread, I guess that reducing the execution time of blkg_*stats_*
>>>>> functions is not an easy task, and is unlikely to be accomplished in
>>>>> the short term.  In this respect, we have unfortunately found out that
>>>>> executing these functions causes a very high reduction of the
>>>>> sustainable throughput on some CPUs.  For example, -70% on an ARM
>>>>> Cortex-A53 octa-core.
>>>>>
>>>>> Thus, to deal with such a considerable slowdown, until the overhead of
>>>>> these functions gets reduced, it may make more sense to switch the
>>>>> update of these statistics off, in all cases where these statistics
>>>>> are not used, while higher performance (or lower power consumption) is
>>>>> welcome/needed.
>>>>>
>>>>> We wondered, however, how hazardous it might be to switch the update
>>>>> of these statistics off.  To answer this question, we investigated the
>>>>> extent to which these statistics are used by applications and
>>>>> services.  Mainly, we tried to survey relevant people or
>>>>> forums/mailing lists for involved communities: Linux distributions,
>>>>> systemd, containers and other minor communities.  Nobody reported any
>>>>> application or service using these statistics (either the variant
>>>>> updated by bfq, or that updated by cfq).
>>>>>
>>>>> So, one of the patches we are working on gives the user the
>>>>> possibility to disable the update of these statistics online.
>>>>
>>>> If you want help with this, provide an easy way to reproduce this,
>>>> and/or some decent profiling output. There was one flamegraph posted,
>>>> but that was basically useless. Just do:
>>>>
>>>> perf record -g -- whatever test
>>>> perf report -g --no-children
>>>>
>>>> and post the top 10 entries from the perf report.
>>>>
>>>> It's pointless to give up on this so soon, when no effort has apparently
>>>> been dedicated to figuring out what the actual issue is yet. So no, no
>>>> patch that will just disable the stats is going to be accepted.
>>>>
>>>> That said, I have no idea who uses these stats. Surely someone can
>>>> answer that question. Tejun?
>>>
>>> Jens, Tejun, apologies for side-tracking the discussion.
>>>
>>> It sounds to me that these stats should have been put into debugfs,
>>> rather than sysfs from the beginning.
>>>
>>
>> Ulf,
>> let me just add a bit of info, if useful: four of those stat files are
>> explicitly meant for debugging (as per the documentation), and created
>> if CONFIG_DEBUG_BLK_CGROUP=y.
>>
>> Paolo
>
> Right, so it's a mixture of debugfs/sysfs then.
>
> In the BFQ case, it seems like CONFIG_DEBUG_BLK_CGROUP isn't checked.
> I assume that should be changed,

Yes, it's in the patch series that we would like to propose.

> which would remove at least some of
> the computation overhead when this Kconfig is unset.
>

Yes. My concern is the following.

If, as our one-month survey apparently confirms, there is no code
using these bfq stats, in particular the DEBUG_BLK_CGROUP ones, then
it sounds unfair to make a bfq user suffer from up to a 70% loss of
sustainable throughput just because, for reasons that the user may not
even know about, the 'wrong' options are set for his/her system (for
example, because DEBUG_BLK_CGROUP stats happen or happened to be used
for cfq in that system, and in legacy blk one can't even tell the
difference between switching the update of those stats on or off).

Of course, this is just a point of view, and holds only until those
update functions possibly get optimized.

Thanks,
Paolo

> Perhaps one may even consider moving all stats for BFQ within that
> Kconfig (and for other mq-schedulers if those ever intend to
> implement support for the stats).
>
> Kind regards
> Uffe


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06  2:21     ` Jens Axboe
  2017-11-06  9:22       ` Ulf Hansson
@ 2017-11-06 15:00       ` Tejun Heo
  2017-11-06 15:47         ` Paolo Valente
  2017-11-06 16:03         ` Jens Axboe
  2017-11-06 18:46       ` Paolo Valente
  2 siblings, 2 replies; 35+ messages in thread
From: Tejun Heo @ 2017-11-06 15:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Paolo Valente, linux-block, Luca Miccio, Ulf Hansson,
	Linus Walleij, Mark Brown

Hello,

On Sun, Nov 05, 2017 at 07:21:01PM -0700, Jens Axboe wrote:
> It's pointless to give up on this so soon, when no effort has apparently
> been dedicated to figuring out what the actual issue is yet. So no, no
> patch that will just disable the stats is going to be accepted.
> 
> That said, I have no idea who uses these stats. Surely someone can
> answer that question. Tejun?

Except for the basic bytes / ios counts, it's all debug fluff, which
should have been hidden behind a debug boot param or go under debugfs.
I'm not sure we can get rid of them at this point for cfq but I don't
see why we'd have them for bfq.

Thanks.

-- 
tejun


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06 15:00       ` Tejun Heo
@ 2017-11-06 15:47         ` Paolo Valente
  2017-11-06 16:11           ` Paolo Valente
  2017-11-06 16:03         ` Jens Axboe
  1 sibling, 1 reply; 35+ messages in thread
From: Paolo Valente @ 2017-11-06 15:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, linux-block, Luca Miccio, Ulf Hansson, Linus Walleij,
	Mark Brown, dhowells, hpa, akinobu.mita


> On 6 Nov 2017, at 16:00, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Sun, Nov 05, 2017 at 07:21:01PM -0700, Jens Axboe wrote:
>> It's pointless to give up on this so soon, when no effort has apparently
>> been dedicated to figuring out what the actual issue is yet. So no, no
>> patch that will just disable the stats is going to be accepted.
>>
>> That said, I have no idea who uses these stats. Surely someone can
>> answer that question. Tejun?
>
> Except for the basic bytes / ios counts, it's all debug fluff, which
> should have been hidden behind a debug boot param or go under debugfs.
> I'm not sure we can get rid of them at this point for cfq but I don't
> see why we'd have them for bfq.
>

Ok, then I think this is the right time to ask you what I can throw
away for bfq.  According to your documentation, basic non-debug stats
should be:

blkio.time
blkio.time_recursive
blkio.sectors
blkio.io_service_bytes
blkio.io_service_bytes_recursive
blkio.io_serviced
blkio.io_serviced_recursive
blkio.io_service_time
blkio.io_service_time_recursive
blkio.io_wait_time
blkio.io_wait_time_recursive
blkio.io_merged
blkio.io_merged_recursive
blkio.io_queued
blkio.io_queued_recursive

So, I have to keep them in bfq, right?  Or is there something I can
remove for bfq, in your opinion?  Of course, I would be very happy to
remove stuff!

On the other hand, here are the debugging stats I have borrowed
from cfq:

blkio.avg_queue_size
blkio.dequeue
blkio.empty_time
blkio.group_wait_time
blkio.idle_time
blkio.io_wait_time
blkio.io_wait_time_recursive

These will be hidden behind the CONFIG_DEBUG_BLK_CGROUP option by one
of the patches we are about to submit.  Or do you want me to remove
these debugging stats completely?  That is ok for me; just, if these
parameters are removed, then I guess the documentation of cgroups-v1
will have to be somehow updated once the interface of bfq and cfq is
unified.

Thanks,
Paolo

> Thanks.
>
> --
> tejun


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06 15:00       ` Tejun Heo
  2017-11-06 15:47         ` Paolo Valente
@ 2017-11-06 16:03         ` Jens Axboe
  2017-11-06 16:10           ` Paolo Valente
  1 sibling, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2017-11-06 16:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paolo Valente, linux-block, Luca Miccio, Ulf Hansson,
	Linus Walleij, Mark Brown

On 11/06/2017 08:00 AM, Tejun Heo wrote:
> Hello,
> 
> On Sun, Nov 05, 2017 at 07:21:01PM -0700, Jens Axboe wrote:
>> It's pointless to give up on this so soon, when no effort has apparently
>> been dedicated to figuring out what the actual issue is yet. So no, no
>> patch that will just disable the stats is going to be accepted.
>>
>> That said, I have no idea who uses these stats. Surely someone can
>> answer that question. Tejun?
> 
> Except for the basic bytes / ios counts, it's all debug fluff, which
> should have been hidden behind a debug boot param or go under debugfs.
> I'm not sure we can get rid of them at this point for cfq but I don't
> see why we'd have them for bfq.

If that's the case, how about having some way to then turn it on and
leave it off by default? The metrics could still be useful, and
there's value in having them the same. But if the overhead is high
AND there's no use case for it outside of debugging, it should not
be on by default for either cfq or bfq.

-- 
Jens Axboe


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06 16:03         ` Jens Axboe
@ 2017-11-06 16:10           ` Paolo Valente
  0 siblings, 0 replies; 35+ messages in thread
From: Paolo Valente @ 2017-11-06 16:10 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, linux-block, Luca Miccio, Ulf Hansson, Linus Walleij,
	Mark Brown


> On 6 Nov 2017, at 17:03, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 11/06/2017 08:00 AM, Tejun Heo wrote:
>> Hello,
>>
>> On Sun, Nov 05, 2017 at 07:21:01PM -0700, Jens Axboe wrote:
>>> It's pointless to give up on this so soon, when no effort has apparently
>>> been dedicated to figuring out what the actual issue is yet. So no, no
>>> patch that will just disable the stats is going to be accepted.
>>>
>>> That said, I have no idea who uses these stats. Surely someone can
>>> answer that question. Tejun?
>>
>> Except for the basic bytes / ios counts, it's all debug fluff, which
>> should have been hidden behind a debug boot param or go under debugfs.
>> I'm not sure we can get rid of them at this point for cfq but I don't
>> see why we'd have them for bfq.
>
> If that's the case, how about having some way to then turn it on and
> leave it off by default? The metrics could still be useful, and
> there's value in having them the same. But if the overhead is high
> AND there's no use case for it outside of debugging, it should not
> be on by default for either cfq or bfq.
>

Yes, these were exactly my motivations and plan for bfq!  However, at
this point, I think that Tejun's answer to my detailed question about
what to keep and what not to keep may greatly help in two respects:
first, to clarify the actual use of these files; second, to understand
the actual impact on performance.  In fact, as I wrote in some of my
previous emails in this thread, some functions are much heavier than
others.  Those that update debug stats seem to be the heaviest and,
fortunately, would already be behind the CONFIG_DEBUG_BLK_CGROUP
option.

Thanks,
Paolo

> --
> Jens Axboe
>


* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06 15:47         ` Paolo Valente
@ 2017-11-06 16:11           ` Paolo Valente
  2017-11-06 16:13             ` Jens Axboe
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Valente @ 2017-11-06 16:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, linux-block, Luca Miccio, Ulf Hansson, Linus Walleij,
	Mark Brown, dhowells, hpa, akinobu.mita


> On 06 Nov 2017, at 16:47, Paolo Valente <paolo.valente@linaro.org> wrote:
>
>>
>> On 06 Nov 2017, at 16:00, Tejun Heo <tj@kernel.org> wrote:
>>
>> Hello,
>>
>> On Sun, Nov 05, 2017 at 07:21:01PM -0700, Jens Axboe wrote:
>>> It's pointless to give up on this so soon, when no effort has apparently
>>> been dedicated to figuring out what the actual issue is yet. So no, no
>>> patch that will just disable the stats is going to be accepted.
>>>
>>> That said, I have no idea who uses these stats. Surely someone can
>>> answer that question. Tejun?
>>
>> Except for the basic bytes / ios counts, it's all debug fluff, which
>> should have been hidden behind a debug boot param or go under debugfs.
>> I'm not sure we can get rid of them at this point for cfq but I don't
>> see why we'd have them for bfq.
>>
>
> Ok, then I think this is the right time to ask you what I can throw
> away for bfq.  According to your documentation, the basic non-debug stats
> should be:
>
> blkio.time
> blkio.time_recursive
> blkio.sectors
> blkio.io_service_bytes
> blkio.io_service_bytes_recursive
> blkio.io_serviced
> blkio.io_serviced_recursive
> blkio.io_service_time
> blkio.io_service_time_recursive
> blkio.io_wait_time
> blkio.io_wait_time_recursive
> blkio.io_merged
> blkio.io_merged_recursive
> blkio.io_queued
> blkio.io_queued_recursive
>
> So, I have to keep them in bfq, right?  Or is there something I can
> remove for bfq, in your opinion?  Of course, I would be very happy to
> remove stuff!
>

Or, more mildly, could some of these stats be moved behind
CONFIG_DEBUG_BLK_CGROUP?

Thanks,
Paolo

> On the other hand, here are the debugging stats I have borrowed
> from cfq:
>
> blkio.avg_queue_size
> blkio.dequeue
> blkio.empty_time
> blkio.group_wait_time
> blkio.idle_time
> blkio.io_wait_time
> blkio.io_wait_time_recursive
>
> These will be hidden behind the CONFIG_DEBUG_BLK_CGROUP option by one
> of the patches we are about to submit.  Or do you want me to remove
> these debugging stats completely?  That is fine with me; just note
> that, if these parameters are removed, the cgroups-v1 documentation
> will have to be updated accordingly once the interfaces of bfq and
> cfq are unified.
>
>=20
> Thanks,
> Paolo
>
>> Thanks.
>>
>> --
>> tejun

* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06 16:11           ` Paolo Valente
@ 2017-11-06 16:13             ` Jens Axboe
  2017-11-06 16:21               ` Paolo Valente
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2017-11-06 16:13 UTC (permalink / raw)
  To: Paolo Valente, Tejun Heo
  Cc: linux-block, Luca Miccio, Ulf Hansson, Linus Walleij, Mark Brown,
	dhowells, hpa, akinobu.mita

On 11/06/2017 09:11 AM, Paolo Valente wrote:
> 
>> Il giorno 06 nov 2017, alle ore 16:47, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>>
>>>
>>> Il giorno 06 nov 2017, alle ore 16:00, Tejun Heo <tj@kernel.org> ha scritto:
>>>
>>> Hello,
>>>
>>> On Sun, Nov 05, 2017 at 07:21:01PM -0700, Jens Axboe wrote:
>>>> It's pointless to give up on this so soon, when no effort has apparently
>>>> been dedicated to figuring out what the actual issue is yet. So no, no
>>>> patch that will just disable the stats is going to be accepted.
>>>>
>>>> That said, I have no idea who uses these stats. Surely someone can
>>>> answer that question. Tejun?
>>>
>>> Except for the basic bytes / ios counts, it's all debug fluff, which
>>> should have been hidden behind a debug boot param or go under debugfs.
>>> I'm not sure we can get rid of them at this point for cfq but I don't
>>> see why we'd have them for bfq.
>>>
>>
>> Ok, then I think this is the right time to ask you what I can throw
>> away for bfq.  According to your documentation, the basic non-debug stats
>> should be:
>>
>> blkio.time
>> blkio.time_recursive
>> blkio.sectors
>> blkio.io_service_bytes
>> blkio.io_service_bytes_recursive
>> blkio.io_serviced
>> blkio.io_serviced_recursive
>> blkio.io_service_time
>> blkio.io_service_time_recursive
>> blkio.io_wait_time
>> blkio.io_wait_time_recursive
>> blkio.io_merged
>> blkio.io_merged_recursive
>> blkio.io_queued
>> blkio.io_queued_recursive
>>
>> So, I have to keep them in bfq, right?  Or is there something I can
>> remove for bfq, in your opinion?  Of course, I would be very happy to remove stuff!
>>
> 
> Or, more mildly, could some of these stats be moved behind
> CONFIG_DEBUG_BLK_CGROUP?

It'd be nice to have them be runtime enabled, but I don't think
that's a hard requirement at all. So if you can just move them
under DEBUG_BLK_CGROUP, that should be good enough.

-- 
Jens Axboe

* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06 16:13             ` Jens Axboe
@ 2017-11-06 16:21               ` Paolo Valente
  2017-11-06 16:22                 ` Jens Axboe
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Valente @ 2017-11-06 16:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, linux-block, Luca Miccio, Ulf Hansson, Linus Walleij,
	Mark Brown, dhowells, hpa, akinobu.mita


> On 06 Nov 2017, at 17:13, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 11/06/2017 09:11 AM, Paolo Valente wrote:
>>
>>> On 06 Nov 2017, at 16:47, Paolo Valente <paolo.valente@linaro.org> wrote:
>>>
>>>>
>>>> On 06 Nov 2017, at 16:00, Tejun Heo <tj@kernel.org> wrote:
>>>>
>>>> Hello,
>>>>
>>>> On Sun, Nov 05, 2017 at 07:21:01PM -0700, Jens Axboe wrote:
>>>>> It's pointless to give up on this so soon, when no effort has apparently
>>>>> been dedicated to figuring out what the actual issue is yet. So no, no
>>>>> patch that will just disable the stats is going to be accepted.
>>>>>
>>>>> That said, I have no idea who uses these stats. Surely someone can
>>>>> answer that question. Tejun?
>>>>
>>>> Except for the basic bytes / ios counts, it's all debug fluff, which
>>>> should have been hidden behind a debug boot param or go under debugfs.
>>>> I'm not sure we can get rid of them at this point for cfq but I don't
>>>> see why we'd have them for bfq.
>>>>
>>>
>>> Ok, then I think this is the right time to ask you what I can throw
>>> away for bfq.  According to your documentation, the basic non-debug stats
>>> should be:
>>>
>>> blkio.time
>>> blkio.time_recursive
>>> blkio.sectors
>>> blkio.io_service_bytes
>>> blkio.io_service_bytes_recursive
>>> blkio.io_serviced
>>> blkio.io_serviced_recursive
>>> blkio.io_service_time
>>> blkio.io_service_time_recursive
>>> blkio.io_wait_time
>>> blkio.io_wait_time_recursive
>>> blkio.io_merged
>>> blkio.io_merged_recursive
>>> blkio.io_queued
>>> blkio.io_queued_recursive
>>>
>>> So, I have to keep them in bfq, right?  Or is there something I can
>>> remove for bfq, in your opinion?  Of course, I would be very happy
>>> to remove stuff!
>>>
>>
>> Or, more mildly, could some of these stats be moved behind
>> CONFIG_DEBUG_BLK_CGROUP?
>
> It'd be nice to have them be runtime enabled, but I don't think
> that's a hard requirement at all.

So, you mean *all* stats dynamically enabled/disabled, and disabled by
default, plus some of them also statically enabled/disabled through
CONFIG_DEBUG_BLK_CGROUP.  Is my understanding correct?  If it is, we
will be happy to do it.

Thanks,
Paolo

> So if you can just move them
> under DEBUG_BLK_CGROUP, that should be good enough.
>
> --
> Jens Axboe
>

* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06 16:21               ` Paolo Valente
@ 2017-11-06 16:22                 ` Jens Axboe
  2017-11-06 16:26                   ` Paolo Valente
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2017-11-06 16:22 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Tejun Heo, linux-block, Luca Miccio, Ulf Hansson, Linus Walleij,
	Mark Brown, dhowells, hpa, akinobu.mita

On 11/06/2017 09:21 AM, Paolo Valente wrote:
> 
>> Il giorno 06 nov 2017, alle ore 17:13, Jens Axboe <axboe@kernel.dk> ha scritto:
>>
>> On 11/06/2017 09:11 AM, Paolo Valente wrote:
>>>
>>>> Il giorno 06 nov 2017, alle ore 16:47, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>>>>
>>>>>
>>>>> Il giorno 06 nov 2017, alle ore 16:00, Tejun Heo <tj@kernel.org> ha scritto:
>>>>>
>>>>> Hello,
>>>>>
>>>>> On Sun, Nov 05, 2017 at 07:21:01PM -0700, Jens Axboe wrote:
>>>>>> It's pointless to give up on this so soon, when no effort has apparently
>>>>>> been dedicated to figuring out what the actual issue is yet. So no, no
>>>>>> patch that will just disable the stats is going to be accepted.
>>>>>>
>>>>>> That said, I have no idea who uses these stats. Surely someone can
>>>>>> answer that question. Tejun?
>>>>>
>>>>> Except for the basic bytes / ios counts, it's all debug fluff, which
>>>>> should have been hidden behind a debug boot param or go under debugfs.
>>>>> I'm not sure we can get rid of them at this point for cfq but I don't
>>>>> see why we'd have them for bfq.
>>>>>
>>>>
>>>> Ok, then I think this is the right time to ask you what I can throw
>>>> away for bfq.  According to your documentation, the basic non-debug stats
>>>> should be:
>>>>
>>>> blkio.time
>>>> blkio.time_recursive
>>>> blkio.sectors
>>>> blkio.io_service_bytes
>>>> blkio.io_service_bytes_recursive
>>>> blkio.io_serviced
>>>> blkio.io_serviced_recursive
>>>> blkio.io_service_time
>>>> blkio.io_service_time_recursive
>>>> blkio.io_wait_time
>>>> blkio.io_wait_time_recursive
>>>> blkio.io_merged
>>>> blkio.io_merged_recursive
>>>> blkio.io_queued
>>>> blkio.io_queued_recursive
>>>>
>>>> So, I have to keep them in bfq, right?  Or is there something I can
>>>> remove for bfq, in your opinion?  Of course, I would be very happy to remove stuff!
>>>>
>>>
>>> Or, more mildly, could some of these stats be moved behind
>>> CONFIG_DEBUG_BLK_CGROUP?
>>
>> It'd be nice to have them be runtime enabled, but I don't think
>> that's a hard requirement at all.
> 
> So, you mean *all* stats dynamically enabled/disabled, and disabled by
> default, plus some of them also statically enabled/disabled through
> CONFIG_DEBUG_BLK_CGROUP.  Is my understanding correct?  If it is, we
> will be happy to do it.

If you go the enable/disable route, don't make them depend on
DEBUG_BLK_CGROUP at all. Just have them be default off, with
some way to switch them on.

-- 
Jens Axboe

* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06 16:22                 ` Jens Axboe
@ 2017-11-06 16:26                   ` Paolo Valente
  2017-11-06 16:30                     ` Tejun Heo
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Valente @ 2017-11-06 16:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, linux-block, Luca Miccio, Ulf Hansson, Linus Walleij,
	Mark Brown, dhowells, hpa, akinobu.mita


> On 06 Nov 2017, at 17:22, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 11/06/2017 09:21 AM, Paolo Valente wrote:
>>
>>> On 06 Nov 2017, at 17:13, Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 11/06/2017 09:11 AM, Paolo Valente wrote:
>>>>
>>>>> On 06 Nov 2017, at 16:47, Paolo Valente <paolo.valente@linaro.org> wrote:
>>>>>
>>>>>>
>>>>>> On 06 Nov 2017, at 16:00, Tejun Heo <tj@kernel.org> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> On Sun, Nov 05, 2017 at 07:21:01PM -0700, Jens Axboe wrote:
>>>>>>> It's pointless to give up on this so soon, when no effort has apparently
>>>>>>> been dedicated to figuring out what the actual issue is yet. So no, no
>>>>>>> patch that will just disable the stats is going to be accepted.
>>>>>>>
>>>>>>> That said, I have no idea who uses these stats. Surely someone can
>>>>>>> answer that question. Tejun?
>>>>>>
>>>>>> Except for the basic bytes / ios counts, it's all debug fluff, which
>>>>>> should have been hidden behind a debug boot param or go under debugfs.
>>>>>> I'm not sure we can get rid of them at this point for cfq but I don't
>>>>>> see why we'd have them for bfq.
>>>>>>
>>>>>
>>>>> Ok, then I think this is the right time to ask you what I can throw
>>>>> away for bfq.  According to your documentation, the basic non-debug stats
>>>>> should be:
>>>>>
>>>>> blkio.time
>>>>> blkio.time_recursive
>>>>> blkio.sectors
>>>>> blkio.io_service_bytes
>>>>> blkio.io_service_bytes_recursive
>>>>> blkio.io_serviced
>>>>> blkio.io_serviced_recursive
>>>>> blkio.io_service_time
>>>>> blkio.io_service_time_recursive
>>>>> blkio.io_wait_time
>>>>> blkio.io_wait_time_recursive
>>>>> blkio.io_merged
>>>>> blkio.io_merged_recursive
>>>>> blkio.io_queued
>>>>> blkio.io_queued_recursive
>>>>>
>>>>> So, I have to keep them in bfq, right?  Or is there something I can
>>>>> remove for bfq, in your opinion?  Of course, I would be very happy
>>>>> to remove stuff!
>>>>>
>>>>
>>>> Or, more mildly, could some of these stats be moved behind
>>>> CONFIG_DEBUG_BLK_CGROUP?
>>>
>>> It'd be nice to have them be runtime enabled, but I don't think
>>> that's a hard requirement at all.
>>
>> So, you mean *all* stats dynamically enabled/disabled, and disabled by
>> default, plus some of them also statically enabled/disabled through
>> CONFIG_DEBUG_BLK_CGROUP.  Is my understanding correct?  If it is, we
>> will be happy to do it.
>
> If you go the enable/disable route, don't make them depend on
> DEBUG_BLK_CGROUP at all. Just have them be default off, with
> some way to switch them on.
>

Yes.  DEBUG_BLK_CGROUP will just serve to keep the for-debugging
subset switched off even when the dynamic parameter is on.

We'll wait for possible counter-arguments from Tejun, and then start
to work on what you propose.

Thanks,
Paolo

> --
> Jens Axboe

* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06 16:26                   ` Paolo Valente
@ 2017-11-06 16:30                     ` Tejun Heo
  2017-11-06 16:33                       ` Paolo Valente
  0 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2017-11-06 16:30 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, linux-block, Luca Miccio, Ulf Hansson, Linus Walleij,
	Mark Brown, dhowells, hpa, akinobu.mita

Hello,

On Mon, Nov 06, 2017 at 05:26:38PM +0100, Paolo Valente wrote:
> Yes.  DEBUG_BLK_CGROUP will just serve to keep the for-debugging
> subset switched off even when the dynamic parameter is on.
> 
> We'll wait for possible counter-arguments from Tejun, and then start
> to work on what you propose.

So, I think the only *really* required ones are io_serviced and the
corresponding byte counts, which are tracked by blk-cgroup core
anyway.  Everything else can be hidden behind an obviously debug boot
param, say, bfq.__DEBUG__extra_stats or whatever.

Thanks.

-- 
tejun

* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06 16:30                     ` Tejun Heo
@ 2017-11-06 16:33                       ` Paolo Valente
  2017-11-06 16:37                         ` Tejun Heo
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Valente @ 2017-11-06 16:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, linux-block, Luca Miccio, Ulf Hansson, Linus Walleij,
	Mark Brown, dhowells, hpa, akinobu.mita


> On 06 Nov 2017, at 17:30, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Mon, Nov 06, 2017 at 05:26:38PM +0100, Paolo Valente wrote:
>> Yes.  DEBUG_BLK_CGROUP will just serve to keep the for-debugging
>> subset switched off even when the dynamic parameter is on.
>>
>> We'll wait for possible counter-arguments from Tejun, and then start
>> to work on what you propose.
>
> So, I think the only *really* required ones are io_serviced and the
> corresponding byte counts, which are tracked by blk-cgroup core
> anyway.  Everything else can be hidden behind an obviously debug boot
> param, say, bfq.__DEBUG__extra_stats or whatever.
>

ok, can we change it, or do you want us to change it, for cfq too?

Thanks,
Paolo

> Thanks.
>
> --
> tejun

* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06 16:33                       ` Paolo Valente
@ 2017-11-06 16:37                         ` Tejun Heo
  2017-11-06 16:39                           ` Jens Axboe
  0 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2017-11-06 16:37 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, linux-block, Luca Miccio, Ulf Hansson, Linus Walleij,
	Mark Brown, dhowells, hpa, akinobu.mita

On Mon, Nov 06, 2017 at 05:33:45PM +0100, Paolo Valente wrote:
> 
> > Il giorno 06 nov 2017, alle ore 17:30, Tejun Heo <tj@kernel.org> ha scritto:
> > 
> > Hello,
> > 
> > On Mon, Nov 06, 2017 at 05:26:38PM +0100, Paolo Valente wrote:
> >> Yes.  DEBUG_BLK_CGROUP will just serve to keep the for-debugging
> >> subset switched off even when the dynamic parameter is on.
> >> 
> >> We'll wait for possible counter-arguments from Tejun, and then start
> >> to work on what you propose.
> > 
> > So, I think the only *really* required ones are io_serviced and the
> > corresponding byte counts, which are tracked by blk-cgroup core
> > anyway.  Everything else can be hidden behind an obviously debug boot
> > param, say, bfq.__DEBUG__extra_stats or whatever.
> > 
> 
> ok, can we change it, or do you want us to change it, for cfq too?

I don't have a strong opinion there, but it could be more hassle than
it's worth given how long the stats have been out there.  idk.

Please also note that you can still allow the extra stats to be
enabled at run time through /sys/kernel/module moduleparams and gate
them with a static_branch.  No idea whether making it that way is
worthwhile tho.  Creating / removing the files dynamically is
supported, but given that some stats need to be tracked all the time
to be correct (like # in flight), it seems kinda silly.

Thanks.

-- 
tejun

* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06 16:37                         ` Tejun Heo
@ 2017-11-06 16:39                           ` Jens Axboe
  2017-11-06 17:05                             ` Paolo Valente
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2017-11-06 16:39 UTC (permalink / raw)
  To: Tejun Heo, Paolo Valente
  Cc: linux-block, Luca Miccio, Ulf Hansson, Linus Walleij, Mark Brown,
	dhowells, hpa, akinobu.mita

On 11/06/2017 09:37 AM, Tejun Heo wrote:
> On Mon, Nov 06, 2017 at 05:33:45PM +0100, Paolo Valente wrote:
>>
>>> Il giorno 06 nov 2017, alle ore 17:30, Tejun Heo <tj@kernel.org> ha scritto:
>>>
>>> Hello,
>>>
>>> On Mon, Nov 06, 2017 at 05:26:38PM +0100, Paolo Valente wrote:
>>>> Yes.  DEBUG_BLK_CGROUP will just serve to keep the for-debugging
>>>> subset switched off even when the dynamic parameter is on.
>>>>
>>>> We'll wait for possible counter-arguments from Tejun, and then start
>>>> to work on what you propose.
>>>
>>> So, I think the only *really* required ones are io_serviced and the
>>> corresponding byte counts, which are tracked by blk-cgroup core
>>> anyway.  Everything else can be hidden behind an obviously debug boot
>>> param, say, bfq.__DEBUG__extra_stats or whatever.
>>>
>>
>> ok, can we change it, or do you want us to change it, for cfq too?
> 
> I don't have a strong opinion there, but it could be more hassle than
> it's worth given how long the stats have been out there.  idk.
> 
> Please also note that you can still allow the extra stats to be
> enabled at run time through /sys/kernel/module moduleparams and gate them
> with a static_branch.  No idea whether making it that way is
> worthwhile tho.  Creating / removing the files dynamically is
> supported, but given that some stats need to be tracked all the time
> to be correct (like # in flight), it seems kinda silly.

Yeah, I'd keep them enable-only through some boot/module parameter.
Folks using this can just reboot to turn them back off.

Bonus points for doing the same for CFQ.

-- 
Jens Axboe

* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06 16:39                           ` Jens Axboe
@ 2017-11-06 17:05                             ` Paolo Valente
  0 siblings, 0 replies; 35+ messages in thread
From: Paolo Valente @ 2017-11-06 17:05 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, linux-block, Luca Miccio, Ulf Hansson, Linus Walleij,
	Mark Brown, dhowells, hpa, akinobu.mita


> On 06 Nov 2017, at 17:39, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 11/06/2017 09:37 AM, Tejun Heo wrote:
>> On Mon, Nov 06, 2017 at 05:33:45PM +0100, Paolo Valente wrote:
>>>
>>>> On 06 Nov 2017, at 17:30, Tejun Heo <tj@kernel.org> wrote:
>>>>
>>>> Hello,
>>>>
>>>> On Mon, Nov 06, 2017 at 05:26:38PM +0100, Paolo Valente wrote:
>>>>> Yes.  DEBUG_BLK_CGROUP will just serve to keep the for-debugging
>>>>> subset switched off even when the dynamic parameter is on.
>>>>>
>>>>> We'll wait for possible counter-arguments from Tejun, and then
>>>>> start to work on what you propose.
>>>>
>>>> So, I think the only *really* required ones are io_serviced and the
>>>> corresponding byte counts, which are tracked by blk-cgroup core
>>>> anyway.  Everything else can be hidden behind an obviously debug
>>>> boot param, say, bfq.__DEBUG__extra_stats or whatever.
>>>>
>>>
>>> ok, can we change it, or do you want us to change it, for cfq too?
>>
>> I don't have a strong opinion there, but it could be more hassle than
>> it's worth given how long the stats have been out there.  idk.
>>
>> Please also note that you can still allow the extra stats to be
>> enabled at run time through /sys/kernel/module moduleparams and gate
>> them with a static_branch.  No idea whether making it that way is
>> worthwhile tho.  Creating / removing the files dynamically is
>> supported, but given that some stats need to be tracked all the time
>> to be correct (like # in flight), it seems kinda silly.
>
> Yeah, I'd keep them enable-only through some boot/module parameter.
> Folks using this can just reboot to turn them back off.
>

Ok to change our solution that way.  One question though: if this
switch must be system-wide (as I seem to understand), and not only
bfq-specific, is a per-scheduler-module parameter ok?

> Bonus points for doing the same for CFQ.
>

Prepare the points ... :)

I'm unsure we can finish all this by the next merge window, so, to
avoid missing that window for the other patches we have already almost
finished, in the next few days we'll submit only patches to:
1) Hide behind DEBUG_BLK_CGROUP all the parameters that Tejun said can
be hidden behind it (only for bfq, as agreed).
2) Move the execution of the stat-update functions outside the
scheduler lock, which does increase parallelism, and provides a boost
of about 30% in sustainable throughput when stats need to be updated.
Because of the patch in item 1), this boost will provide most of its
benefits when DEBUG_BLK_CGROUP is set and, once we have implemented it
too, when the boot/module parameter is set.

Thanks,
Paolo

> --
> Jens Axboe

* Re: high overhead of functions blkg_*stats_* in bfq
  2017-11-06  2:21     ` Jens Axboe
  2017-11-06  9:22       ` Ulf Hansson
  2017-11-06 15:00       ` Tejun Heo
@ 2017-11-06 18:46       ` Paolo Valente
  2 siblings, 0 replies; 35+ messages in thread
From: Paolo Valente @ 2017-11-06 18:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, linux-block, Luca Miccio, Ulf Hansson, Linus Walleij,
	Mark Brown, David Howells, akinobu.mita, hpa


> On 06 Nov 2017, at 03:21, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 11/05/2017 01:39 AM, Paolo Valente wrote:
>>
>>> On 18 Oct 2017, at 15:19, Tejun Heo <tj@kernel.org> wrote:
>>>
>>> Hello, Paolo.
>>>
>>> On Tue, Oct 17, 2017 at 12:11:01PM +0200, Paolo Valente wrote:
>>> ...
>>>> protected by a per-device scheduler lock.  To give you an idea, on an
>>>> Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on
>>>> null_blk (configured with 0 latency), if the update of groups stats is
>>>> removed, then the throughput grows from 260 to 404 KIOPS.  This and
>>>> all the other results we might share in this thread can be reproduced
>>>> very easily with a (useful) script made by Luca Miccio [1].
>>>
>>> I don't think the old request_queue is ever built for multiple CPUs
>>> hitting on a mem-backed device.
>>>
>>
>> Hi,
>> from our measurements, the code and the comments received so far in
>> this thread, I guess that reducing the execution time of blkg_*stats_*
>> functions is not an easy task, and is unlikely to be accomplished in
>> the short term.  In this respect, we have unfortunately found out that
>> executing these functions causes a very high reduction of the
>> sustainable throughput on some CPUs.  For example, -70% on an ARM
>> Cortex-A53 Octa-core.
>>
>> Thus, to deal with such a considerable slowdown, until the overhead of
>> these functions gets reduced, it may make more sense to switch the
>> update of these statistics off, in all cases where these statistics
>> are not used, while higher performance (or lower power consumption) is
>> welcome/needed.
>>
>> We wondered, however, how hazardous it might be to switch the update
>> of these statistics off.  To answer this question, we investigated the
>> extent to which these statistics are used by applications and
>> services.  Mainly, we tried to survey relevant people or
>> forums/mailing lists for the involved communities: Linux distributions,
>> systemd, containers and other minor communities.  Nobody reported any
>> application or service using these statistics (either the variant
>> updated by bfq, or the one updated by cfq).
>>
>> So, one of the patches we are working on gives the user the
>> possibility to disable the update of these statistics online.
>
> If you want help with this, provide an easy way to reproduce this,
> and/or some decent profiling output. There was one flamegraph posted,
> but that was basically useless. Just do:
>
> perf record -g -- whatever test
> perf report -g --no-children
>
> and post the top 10 entries from the perf report.
>

Probably not very meaningful any longer, but just for completeness ...

We have already gone through what you propose, and unfortunately it
was useless for us.  In fact, on laptop or desktop processors,
- bfq's total per-request overhead is about 2/3 of the CPU cycles, and
is distributed across about four functions;
- the portion of those 2/3 of the CPU cycles consumed by the stat-update
functions is about 40%, and is distributed across five further functions,
which are in their turn scattered among the bfq functions;
so a list of the ten top-consuming functions overall would tell you
very little.

That's why we ended up using a flamegraph.  In this respect, in
addition to one of our flamegraphs, in this thread we provided an
indication of which functions to look for, plus the steps to reproduce
the data, after your request.

Anyway, if you still want the top 10 entries, we'll provide them for
you.

Thanks,
Paolo

> It's pointless to give up on this so soon, when no effort has apparently
> been dedicated to figuring out what the actual issue is yet. So no, no
> patch that will just disable the stats is going to be accepted.
>
> That said, I have no idea who uses these stats. Surely someone can
> answer that question. Tejun?
>
> --
> Jens Axboe

end of thread, other threads:[~2017-11-06 18:46 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
2017-10-17 10:11 high overhead of functions blkg_*stats_* in bfq Paolo Valente
2017-10-17 12:45 ` Paolo Valente
2017-10-17 16:45   ` Linus Walleij
2017-10-17 16:49     ` Jens Axboe
2017-10-18 13:19 ` Tejun Heo
2017-10-18 14:45   ` Jens Axboe
2017-10-18 15:05     ` Paolo Valente
2017-10-18 15:44       ` Jens Axboe
     [not found]   ` <2B52CB68-213C-470F-945C-0ADFF9AA7A66@linaro.org>
2017-10-18 15:08     ` Paolo Valente
2017-10-18 15:40       ` Paolo Valente
     [not found]       ` <D6586934-DF02-4102-8839-8912DFA86BB0@linaro.org>
2017-10-19  6:50         ` Paolo Valente
2017-10-21 16:13           ` Tejun Heo
2017-10-22  8:25             ` Paolo Valente
2017-10-30  9:49           ` David Howells
2017-11-05  7:39   ` Paolo Valente
2017-11-06  2:21     ` Jens Axboe
2017-11-06  9:22       ` Ulf Hansson
2017-11-06  9:49         ` Paolo Valente
2017-11-06 10:48           ` Ulf Hansson
2017-11-06 11:20             ` Paolo Valente
2017-11-06 15:00       ` Tejun Heo
2017-11-06 15:47         ` Paolo Valente
2017-11-06 16:11           ` Paolo Valente
2017-11-06 16:13             ` Jens Axboe
2017-11-06 16:21               ` Paolo Valente
2017-11-06 16:22                 ` Jens Axboe
2017-11-06 16:26                   ` Paolo Valente
2017-11-06 16:30                     ` Tejun Heo
2017-11-06 16:33                       ` Paolo Valente
2017-11-06 16:37                         ` Tejun Heo
2017-11-06 16:39                           ` Jens Axboe
2017-11-06 17:05                             ` Paolo Valente
2017-11-06 16:03         ` Jens Axboe
2017-11-06 16:10           ` Paolo Valente
2017-11-06 18:46       ` Paolo Valente
