* Inconsistent performance of dbus call GetManagedObjects to PSUSensor in dbus-sensors
From: Alex Qiu @ 2020-08-07 22:42 UTC (permalink / raw)
  To: Ed Tanous, OpenBMC Maillist, James Feist
  Cc: Peter Lundgren, Josh Lehan, Jason Ling, Sui Chen, Jie Yang, Drew Macrae


Hi folks,

I'm branching this topic off since Jason mentioned it in the other
thread that he started, and I'll report the information I have for now.
Sui and Jie will follow up, as they are actively working on this issue.

The setup has a total of 249 IPMI sensors; among these, dbus-sensors
reports 59 objects from HwmonTempSensor and 195 objects from PSUSensor,
and we've already decreased the polling rate of PSUSensor to every 10
seconds to mitigate the issue. As intel-ipmi-oem does, we measure the
time of calling GetManagedObjects with these commands:

time busctl call xyz.openbmc_project.HwmonTempSensor /
org.freedesktop.DBus.ObjectManager GetManagedObjects
time busctl call xyz.openbmc_project.PSUSensor /
org.freedesktop.DBus.ObjectManager GetManagedObjects

The first command, for HwmonTempSensor, consistently finishes in about
60 ms. However, the run time of the second command, for PSUSensor, is
very inconsistent. Out of 50 consecutive runs, most finish in about
150 ms, but 2 or 3 of them take as long as about 6 seconds to return.
This results in a long time to scan the SDR and inconsistent
performance when polling IPMI sensors.

Thanks.

- Alex Qiu



* Re: Inconsistent performance of dbus call GetManagedObjects to PSUSensor in dbus-sensors
From: Ed Tanous @ 2020-08-07 23:17 UTC (permalink / raw)
  To: Alex Qiu
  Cc: OpenBMC Maillist, James Feist, Peter Lundgren, Josh Lehan,
	Jason Ling, Sui Chen, Jie Yang, Drew Macrae

This is great!  Thank you for taking the time to type this up.

On Fri, Aug 7, 2020 at 3:42 PM Alex Qiu <xqiu@google.com> wrote:
>
> The setup has a total of 249 IPMI sensors, and among these, dbus-sensors reports 59 objects from HwmonTempSensor and 195 objects from PSUSensor, and we've already decreased the polling rate of PSUSensor to every 10 seconds to mitigate the issue. As the intel-ipmi-oem does, we measure the time of calling GetManagedObjects with commands:

This isn't the biggest sensor usage I've ever seen, but it certainly
is the biggest usage of PSUSensor I've seen so far.  It's not
surprising you're finding performance issues other people haven't.
PSUSensor was originally supposed to be for physical pmbus power
supplies, but got abstracted a little at some point to be more
generic.

>
> time busctl call xyz.openbmc_project.HwmonTempSensor / org.freedesktop.DBus.ObjectManager GetManagedObjects
> time busctl call xyz.openbmc_project.PSUSensor / org.freedesktop.DBus.ObjectManager GetManagedObjects
>
> The first command for HwmonTempSensor constantly finishes in about 60 ms. However, the run time of the second command for PSUSensor is very inconsistent. Out of 50 continuous runs, most of them finish in about 150 ms, but 2 or 3 of them will take as long as about 6 seconds to return. This results in long time to scan the SDR and inconsistent performance polling IPMI sensors.
>

I don't have a system handy that uses PSUSensor, but based on what
you're saying, I'm going to guess that there's a blocking
io/wait/sleep call that snuck in somewhere in the PSUsensor, and it's
stopping the main reactor for some amount of time.  This is probably
exacerbated by how loaded your system is, which is causing the really
bad tail latencies.

If I were in your shoes, the first thing I would do is to recompile
PSUSensor with IO handler tracking enabled:
https://www.boost.org/doc/libs/1_73_0/doc/html/boost_asio/overview/core/handler_tracking.html

To do that, go here:
https://github.com/openbmc/dbus-sensors/blob/master/CMakeLists.txt#L194

and add a line like
target_compile_definitions(psusensor PUBLIC
-DBOOST_ASIO_ENABLE_HANDLER_TRACKING)

and recompile.

That's going to print loads of debug info to the console when it runs,
so be prepared.  Rerun your test with the flag enabled.  When your
GetManagedObjects command gets stuck, dump the log and try to find the
spot where IO seems to stop for a bit.  Hopefully you'll find one
async operation that is taking a looooong time to run.  Most operations
should be on the order of microseconds/milliseconds of runtime.  Once
you know where that spot is, we can probably triage further.  Each
individual callback is pretty simple and only does a couple of things,
so it should be pretty easy to sort out what's blocking within a given
callback.
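
If it helps, here's a tiny standalone example of what that flag does
(just a sketch, nothing to do with dbus-sensors itself): with the macro
defined before the asio headers are included, every handler
creation/invocation gets logged to stderr with timestamps, and a
blocking handler shows up as a big gap before the next entry.

// Standalone sketch, not dbus-sensors code.
#define BOOST_ASIO_ENABLE_HANDLER_TRACKING
#include <boost/asio.hpp>

#include <chrono>
#include <thread>

int main()
{
    boost::asio::io_context io;
    boost::asio::steady_timer timer(io, std::chrono::milliseconds(100));
    timer.async_wait([](const boost::system::error_code&) {
        // Deliberately blocking handler; in the tracking output you'll
        // see a long gap before the posted handler below gets to run.
        std::this_thread::sleep_for(std::chrono::seconds(2));
    });
    boost::asio::post(io, [] { /* delayed by the sleep above */ });
    io.run();
}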


My second theory is that because of the async nature of psusensor, if
you get unlucky, 195 concurrent IO completion operations get scheduled
right ahead of your GetManagedObjects call.  Right now the IO
scheduling is pretty dumb, and doesn't attempt to add jitter to
randomize the call starts, under the assumption that the reactor will
never hold more than 10 or so handlers at a given time.  Given the
number of sensors you've got, we might want to rethink that and try to
spread them out in time a little.  If we wanted to verify this, we
could instrument io_context with a little run_for() magic that breaks
out every N milliseconds and prints the size of the queue.  That would
tell us whether the queue really is growing too large.
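
As far as I know io_context doesn't expose its queue size publicly, so
the run_for() magic would have to count outstanding handlers itself.
Rough sketch of what I mean (made-up names, with a synthetic workload
standing in for the real sensor handlers):

#include <boost/asio.hpp>

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    boost::asio::io_context io;
    std::atomic<int> pending{0};

    // Stand-in for the real workload: a burst of posted completions.
    for (int i = 0; i < 200; i++)
    {
        pending++;
        boost::asio::post(io, [&pending] {
            // pretend each completion does a little work
            std::this_thread::sleep_for(std::chrono::microseconds(200));
            pending--;
        });
    }

    // Break out of the reactor every 10 ms and report the backlog.
    while (pending > 0)
    {
        io.run_for(std::chrono::milliseconds(10));
        std::cout << "pending handlers: " << pending << "\n";
    }
}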

Technically I think this is the embedded version of the thundering
herd problem.  There are ways to solve it that should be relatively
easy (if that's what it turns out to be).


* Re: Inconsistent performance of dbus call GetManagedObjects to PSUSensor in dbus-sensors
From: Sui Chen @ 2020-08-13  0:30 UTC (permalink / raw)
  To: OpenBMC Maillist



Hello,

Because machine configurations may change during development, we used a
microbenchmark to try to isolate the cause and reproduce the long DBus
latencies reliably, and another microbenchmark to demonstrate an idea
we tried that appeared to alleviate, but not completely eliminate, this
DBus latency problem.

The first microbenchmark, dbus-asio-bmk
(https://gerrit.openbmc-project.xyz/c/openbmc/openbmc-tools/+/35576),
mimics our patched psusensor: an ASIO worker that reads sensors at a
fixed timer interval; the ASIO worker is also used by sdbusplus for
handling DBus work. We continuously run "busctl tree" against the DBus
interface created by the microbenchmark binary.
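
Roughly, the benchmark has this shape (a simplified sketch, not the
actual dbus-asio-bmk code; in the real benchmark the same io_context is
also handed to sdbusplus):

#include <boost/asio.hpp>

#include <chrono>
#include <thread>

static void scheduleRead(boost::asio::steady_timer& timer)
{
    timer.expires_after(std::chrono::seconds(1)); // fixed polling interval
    timer.async_wait([&timer](const boost::system::error_code& ec) {
        if (ec)
        {
            return;
        }
        // Fake sensor read: blocks the only ASIO worker, which is
        // exactly the behavior we are trying to reproduce.
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        scheduleRead(timer);
    });
}

int main()
{
    boost::asio::io_context io; // shared with the DBus handlers
    boost::asio::steady_timer timer(io);
    scheduleRead(timer);
    io.run();
}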

By importing the resulting DBus traffic dump and ASIO handler activity
log into the timeline view of dbus-vis, we can clearly see the "sensor
reading" ASIO work items block the DBus work items, which in turn
causes very long DBus latencies to show up:

[image: busctlubmk.png]

Although this benchmark was run on an x86 workstation instead of the
BMC, due to a run-time error in its dependencies, the results above
suggest that this "thundering herd" problem appears on a desktop
platform as well.

As we modified various experimental parameters, it turned out that the
more time is occupied by non-DBus ASIO work, the more likely long DBus
latencies are to occur, since there is a higher chance that DBus calls
clash with the "fake sensor reading". Thus, we arrived at the
assumption that if we reduce the time the single ASIO worker spends on
non-DBus ASIO work, DBus latencies will be reduced.

Based on this assumption, we attempted a few methods to reduce the time
it takes psusensor to read the sensors. The second benchmark
(https://gerrit.openbmc-project.xyz/c/openbmc/openbmc-tools/+/35387)
explains the methods we experimented with. We were able to reduce the
sensor reading time as well as the chance of long DBus method calls,
but the inconsistent DBus call times did not completely go away. This
is probably because psusensor is much more complex than the two
benchmarks, with much other work contending for the ASIO worker's time.

To summarize, the points of this reply are:
1) We attempted the ASIO handler dump as suggested, and a method for
analyzing DBus+ASIO performance is now embodied in dbus-vis.
2) We are interested to know whether anyone else is looking at similar
problems.
3) We will examine GetManagedObjects again when we get a chance.

Thanks,
Sui




* Re: Inconsistent performance of dbus call GetManagedObjects to PSUSensor in dbus-sensors
From: Ed Tanous @ 2020-08-13 14:53 UTC (permalink / raw)
  To: Sui Chen; +Cc: OpenBMC Maillist

On Wed, Aug 12, 2020 at 6:38 PM Sui Chen <suichen@google.com> wrote:
>
> Hello,
>
> Because machine configurations may change during development, we used a microbenchmark to try to isolate the cause and reproduce the long DBus latencies reliably, and another microbenchmark to demonstrate the idea we had tried that appeared to alleviate but not completely eliminate this DBus latency problem.


This works great, so long as you constrain things to the way they look
on the bmc, and don't look at things too closely.  One thread, pinned
to a single core, tends to work best for reproducing this kind of
stuff.  If you can also pin something else to that same core that's
doing some work, it tends to be closer to a real bmc.  Also remember,
context switches are much cheaper on x86 multicore than they are on
other, less capable single core processors.  You need to keep that in
mind when running your microbenchmarks on a different platform, as the
profiles can vary significantly, especially in the worst case.

>
>
> The first microbenchmark, dbus-asio-bmk (https://gerrit.openbmc-project.xyz/c/openbmc/openbmc-tools/+/35576) mimics our patched psusensor: an ASIO worker that reads sensors at some fixed timer interval; the ASIO worker is also used by sdbusplus for handling DBus work. We continuously run "busctl tree" against the DBus interface created by the microbenchmark binary.

What patches did you apply?

>
>
> By importing the resultant DBus traffic dump and ASIO handler activity log onto the timeline view form dbus-vis, we can clearly see the the "sensor reading" ASIO work items block the DBus work items, which in turn causes very long DBus latencies to show up:


All I see is that there's a long lag during a single busctl tree call.
I don't see any indication that it's because of sensor reading
specifically, at least from what's below.  Dumping the state of the
work queue would explain a lot about what's going on during these
periods.  I wonder if this is as easy as smearing the initial start
time of each sensor when the application starts, to make sure the
timers don't synchronize like this.
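
To be concrete, by smearing I mean something like this (a sketch of the
idea only, not a dbus-sensors patch; the names are made up): give each
sensor's first expiry a random offset so the periodic reads don't all
land in the same instant.

#include <boost/asio.hpp>

#include <chrono>
#include <memory>
#include <random>
#include <vector>

int main()
{
    boost::asio::io_context io;
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<int> jitterMs(0, 999);

    std::vector<std::unique_ptr<boost::asio::steady_timer>> timers;
    for (int sensor = 0; sensor < 195; sensor++)
    {
        auto t = std::make_unique<boost::asio::steady_timer>(io);
        // Only the first expiry is offset; after that each timer would
        // free-run at its normal 1 s interval.
        t->expires_after(std::chrono::milliseconds(1000 + jitterMs(gen)));
        t->async_wait([](const boost::system::error_code&) {
            // read the sensor here, then re-arm as usual
        });
        timers.push_back(std::move(t));
    }
    io.run();
}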

>
> Although this benchmark is run on an x86 workstation instead of the BMC due to a run-time error in its dependencies, according to the results above, we see this "thundering herd" problem appear to occur on a desktop platform as well.


One odd thing I notice in that graph: we're using null_buffers?  I
thought we were using a direct async read call.

The other thing to realize is busctl tree is very inefficient.  It
executes a full introspection of all levels, then executes
GetProperties for each property.  You really need to move your
microbenchmark to GetManagedObjects (which I see you recognize below)
as that's more indicative of what a real caller would do.  I don't
think your profiles are worth looking at until you do this.

>
>
> As we modify various experimental parameters, it turns out that the more time is occupied by non-DBus ASIO work the more likely long DBus latencies are to happen, since there is a higher chance the DBus calls clash with the "fake sensor reading". Thus, we come up with an assumption that if we reduce the time spent (by the only ASIO worker) in non-DBus ASIO work, DBus latencies will be reduced.


Another option is to make it 2 threaded.  Have one thread (and
io_context) set up for handling dbus traffic, and one thread set up
for async reading of IO, with an atomic data structure in between for
passing values.
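
Sketched out, that split would look roughly like this (illustrative
only, made-up names; the dbus side is where sdbusplus would live):

#include <boost/asio.hpp>

#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

int main()
{
    boost::asio::io_context dbusIo;   // would be handed to sdbusplus
    boost::asio::io_context sensorIo; // owns the slow sensor reads
    std::atomic<double> latestReading{0.0};

    // Sensor-side loop: blocking reads only ever stall this context.
    boost::asio::steady_timer readTimer(sensorIo, std::chrono::seconds(1));
    std::function<void(const boost::system::error_code&)> onRead =
        [&](const boost::system::error_code& ec) {
            if (ec)
            {
                return;
            }
            latestReading.store(42.0); // pretend we read sysfs here
            readTimer.expires_after(std::chrono::seconds(1));
            readTimer.async_wait(onRead);
        };
    readTimer.async_wait(onRead);

    // DBus-side loop: stays responsive, publishes whatever is latest.
    boost::asio::steady_timer publishTimer(dbusIo, std::chrono::seconds(1));
    std::function<void(const boost::system::error_code&)> onPublish =
        [&](const boost::system::error_code& ec) {
            if (ec)
            {
                return;
            }
            double value = latestReading.load();
            (void)value; // would update the DBus property here
            publishTimer.expires_after(std::chrono::seconds(1));
            publishTimer.async_wait(onPublish);
        };
    publishTimer.async_wait(onPublish);

    std::thread sensorThread([&] { sensorIo.run(); });
    dbusIo.run(); // runs until the process is killed
    sensorThread.join();
}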

Another (probably preferred in this case) architecture would be to
implement an ASIO priority queue, and prioritize the dbus traffic
higher than the IO reading traffic.  There are examples of this in
asio, and I assumed the day when we needed this would've come a lot
sooner (we got away with a dumb FIFO IO queue for a long time), but
here we are.
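
If a full prioritised-handler setup is too much to start with, a cruder
approximation (sketch below, not a real patch) is to keep two contexts
on one thread and always drain the dbus one before giving the sensor
one a turn:

#include <boost/asio.hpp>

int main()
{
    boost::asio::io_context dbusWork;   // high priority
    boost::asio::io_context sensorWork; // low priority

    // Keep both contexts from running out of work in this toy example.
    auto dbusGuard = boost::asio::make_work_guard(dbusWork);
    auto sensorGuard = boost::asio::make_work_guard(sensorWork);

    boost::asio::post(sensorWork, [] { /* slow sensor read */ });
    boost::asio::post(dbusWork, [] { /* GetManagedObjects reply */ });

    // Single-threaded scheduler loop: dbus handlers always go first.
    for (int i = 0; i < 100; i++)
    {
        dbusWork.poll();       // run every ready dbus handler
        sensorWork.poll_one(); // then at most one sensor handler
    }
}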

>
>
> Based on this assumption, we attempted a few methods to reduce the time it takes psusensor to read the sensors. The second benchmark (https://gerrit.openbmc-project.xyz/c/openbmc/openbmc-tools/+/35387) explains the methods we had experimented with. It turns out we were able to reduce sensor reading time as well as the chance of long DBus method calls happening, but the inconsistent DBus call times do not completely go away. This is probably due to psusensors being much more complex than the two benchmarks with much other work contending for the ASIO worker's time.

Can you detail what methods?  It might help someone else pinpoint
another good option.

>
>
> So to summarize the point of this reply is to say:
> 1) We had attempted the ASIO handler dump as suggested and a method for analyzing DBus+ASIO performance has been embodied in dbus-vis.

Can you post the dump somewhere that others can help analyse?

>
> 2) We are interested to know if someone else is looking at similar problems.


PS: I'd be remiss if I didn't point out that your last email was sent
as HTML.  This list prefers plaintext.

>
> 3) We will examine GetManagedObjects again when we get a chance.


* Re: Inconsistent performance of dbus call GetManagedObjects to PSUSensor in dbus-sensors
From: Sui Chen @ 2020-08-13 23:33 UTC (permalink / raw)
  To: Ed Tanous
  Cc: OpenBMC Maillist, Peter Lundgren, Josh Lehan, Alex Qiu, Jie Yang,
	Drew Macrae

There are some questions that I can answer quickly; the rest may take some time:

> What patches did you apply?
This one: https://gerrit.openbmc-project.xyz/c/openbmc/dbus-sensors/+/30348

> Profiles of dbus-asio-bmk are not worth looking at until the benchmark is doing GetManagedObjects
Hmm, I believe it has been worth looking at, in that it shows how a
DBus method call may be blocked by other ASIO work. But I have to
admit it hasn't yielded much more insight.
Another reason I used "busctl tree" is that we had observed
inconsistent performance with "busctl tree" as well.
The sequence of events obtained by having the microbenchmark execute
GetManagedObjects does not look much different from that obtained
through "busctl tree": in both cases, we observe DBus calls that
"straddle" the other ASIO work running concurrently. An example can be
found here: https://imgur.com/a/ZiQdcX0 . The second row on the top
panel contains rectangles that represent method call/method return
pairs. One can see that the highlighted call/return pair takes
significantly longer than the other GetManagedObjects method calls,
and during the same time, the io_context@.post row on the lower pane
is filled with activity. These activities are the sleep() calls run by
the "fake sensor reading" procedures in the benchmark.
I am looking at a more important issue related to the "worthiness" of
the benchmark: on the BMC, DBus calls do not appear to align with the
async_read_some(null_buffers) ASIO handlers as neatly as they do in
the microbenchmark (see the screenshot here:
https://imgur.com/a/FIUM4W6 ). This gap needs to be filled by some
more investigation.
I suspect this might be due to the 7.7x single-core performance
difference (as measured via LMBench) between this particular BMC and
the host, as well as the BMC's DBus being a lot busier than the
host's. Perhaps some other methods are needed in place of this
"capture DBus and ASIO handler logs and align them" approach.

> Another option is to make it 2 threaded, one set up dbus traffic, another for async io
We experimented with a similar architecture using "Mode 32" in
hwmon-bmk (described in
https://gerrit.openbmc-project.xyz/c/openbmc/openbmc-tools/+/35387),
but we only measured sensor reading times there. This method did not
speed up sensor reading much. It might help DBus though. But
eventually we want to address both sensor reading time and DBus
response time for dbus-sensors.

> ASIO priority queue
This sounds promising. We will look into it, for both the DBus
scheduling issue and the scalability differences we had observed in
hwmon-bmk.

> Can you detail what methods (in https://gerrit.openbmc-project.xyz/c/openbmc/openbmc-tools/+/35387 )?
On top of patch 30348, we have 8 std::threads that directly run
PSUSensors::prepareInput and do not use ASIO, plus the main ASIO worker
thread handling the rest of the original PSUSensor logic. This
corresponds to "Mode 7" in the description of hwmon-bmk (change
35387). We found that this roughly halves the total time it takes to
read the 247 sensors. For comparison, a user would need to start a very
large number of threads (say, 32, 64, or 128) for the io_context to
achieve the same speedup, as can be seen in "Mode 42" in the
description of hwmon-bmk (change 35387).
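
For reference, the shape of that "Mode 7" arrangement is roughly the
following (an illustrative sketch only; the real change runs
PSUSensors::prepareInput, and the paths here are placeholders):

#include <fstream>
#include <string>
#include <thread>
#include <vector>

int main()
{
    // Placeholder paths standing in for the 247 hwmon attribute files.
    std::vector<std::string> sensorPaths(
        247, "/sys/class/hwmon/hwmon0/temp1_input");
    const unsigned numThreads = 8;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; t++)
    {
        workers.emplace_back([t, numThreads, &sensorPaths] {
            // Each thread takes every numThreads-th sensor.
            for (size_t i = t; i < sensorPaths.size(); i += numThreads)
            {
                std::ifstream in(sensorPaths[i]);
                std::string value;
                std::getline(in, value); // blocking read, off the ASIO thread
            }
        });
    }
    for (auto& w : workers)
    {
        w.join();
    }
}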

> Dumps
The dump for the particular dbus-asio-bmk run is here:
https://www.dropbox.com/sh/1xmo6vrz13zjkca/AADf1ySDYXyH6XIc7n2Hqc1Ba?dl=0
For the dump from the actual BMC, a scrub process may be needed and I
can't do it quickly yet.

I have set the mail to plaintext mode this time.




* Re: Inconsistent performance of dbus call GetManagedObjects to PSUSensor in dbus-sensors
From: Ed Tanous @ 2020-08-15 14:05 UTC (permalink / raw)
  To: Sui Chen
  Cc: OpenBMC Maillist, Peter Lundgren, Josh Lehan, Alex Qiu, Jie Yang,
	Drew Macrae

On Thu, Aug 13, 2020 at 4:33 PM Sui Chen <suichen@google.com> wrote:
>
> There are some questions that I can answer quickly, the rest may take some time:
>
> > What patches did you apply?
> This one: https://gerrit.openbmc-project.xyz/c/openbmc/dbus-sensors/+/30348

Looking at this patch, I get the feeling this is specifically the
cause of your problems.  Have you tried your tests on master yet,
without that patch, to see if it exhibits the same behavior?  There's
a good reason each sensor has an independent timer instance.  The
intent is that when the sensor system gets overloaded, the readings
get delayed, and each individual timer introduces its own
load-specific delay, since it's restarted after the read completes.
Said another way, a sensor scan rate isn't exactly 1 second; it's
1 second + queuing time + the time it takes to read the sensor.  So if
the sensor takes a long time to read (because the system is
overloaded), the sensor scans don't collapse into one another and
cause problems like you're seeing, and you still (hopefully) have
significant idle time in the io_context to handle other things, like
managing dbus connections.  (As a side note, for i2c based sensors,
this keeps the latency of other operations on the bus down as well.)
Unfortunately, that patch kind of turns all that design on its head.
"Avoid timer skew" is in the commit message.  That skew is there on
purpose, to prevent the problem you're seeing.  With that said, I'm
sure Josh was trying to fix something there, but in the commit message
he doesn't actually outline what the problem is.  Do you know?  My
best guess is maybe he was trying to get more accurate scan rates?  We
could think up designs with some logic that handles the loop more
accurately in the base case, but spreading out the sensor scans is a
good thing in general and reduces the peak work queue size (as you've
already found out).
Maybe there were some different performance issues that putting all
the sensor scans together solved?
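
For reference, the per-sensor pattern I'm describing looks roughly like
this (a sketch, not the literal dbus-sensors code): the timer is only
re-armed after the read finishes, so the effective period stretches
under load instead of the scans piling on top of each other.

#include <boost/asio.hpp>

#include <chrono>

class FakeSensor
{
  public:
    explicit FakeSensor(boost::asio::io_context& io) : timer(io) {}

    void setupRead()
    {
        timer.expires_after(std::chrono::seconds(1));
        timer.async_wait([this](const boost::system::error_code& ec) {
            if (ec)
            {
                return; // timer was cancelled
            }
            handleResponse();
        });
    }

  private:
    void handleResponse()
    {
        // ... read the hwmon/pmbus value here; may be slow under load ...
        setupRead(); // restart the 1 s wait only after the read finished
    }

    boost::asio::steady_timer timer;
};

int main()
{
    boost::asio::io_context io;
    FakeSensor sensor(io);
    sensor.setupRead();
    io.run();
}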


>
> > Profiles of dbus-asio-bmk are not worth looking at until the benchmark is doing GetManagedObjects
> Hmm, I believe it has been worth looking at in that it shows how a
> DBus method call may be blocked by other ASIO work. But I have to
> admit it hasn't shown much more insights.
> Another reason I used "busctl tree" is we had observed inconsistent
> performance with "busctl tree" as well.
> The sequence of events obtained by having the microbenchmark execute
> "GetManagedObjects" does not look much too different from that
> obtained through "busctl tree": In both cases, we observe DBus calls
> that "straddle" the other ASIO work that are running concurrently. An
> example might be found here: https://imgur.com/a/ZiQdcX0 ; The second
> row on the top panel are rectangles that represent method call-method
> return pairs. One could see that the highlighted method call-return
> pair takes significantly longer than the other GetManagedObjects
> method calls, and during the same time, the io_context@.post row on
> the lower pane is filled with activities. These activities are the
> sleep() functions run by the "fake sensor reading" procedures in the
> benchmark.

https://gerrit.openbmc-project.xyz/c/openbmc/openbmc-tools/+/35576/2/suichen/dbus-asio-bmk/dbus_asio_bmk.cpp#46
That sleep is not a very faithful stand-in for what a sysfs read looks
like, and is why the IO loop in your microbenchmark looks odd.  Is it
possible to focus on real profiles from an actual BMC reading from
hwmon?  I suspect that will be faster for getting to the root cause in
this case.

If you want to simulate an async reading of a sensor, use a
steady_timer instance with async_wait.  That's much closer to a real
sensor than sleep, as the reactor is aware of it and can schedule
other work ahead of it while it waits for the sensor read to complete.
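
Something like this is what I mean (sketch only, made-up names): the
wait is asynchronous, so the reactor is free to run the dbus handlers
while the "sensor" is pending, unlike a sleep() inside the handler.

#include <boost/asio.hpp>

#include <chrono>
#include <memory>

void fakeAsyncSensorRead(boost::asio::io_context& io,
                         std::chrono::milliseconds readLatency)
{
    auto readTimer = std::make_shared<boost::asio::steady_timer>(io);
    readTimer->expires_after(readLatency);
    readTimer->async_wait([readTimer](const boost::system::error_code& ec) {
        if (!ec)
        {
            // "read complete": hand off the value, schedule the next read
        }
    });
}

int main()
{
    boost::asio::io_context io;
    fakeAsyncSensorRead(io, std::chrono::milliseconds(20));
    boost::asio::post(io, [] { /* a dbus handler can run during the wait */ });
    io.run();
}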


> I am looking at a more important issue I'm aware of that is related to
> the "worthiness" of the benchmark: On the BMC, DBus calls do not
> appear to align perfectly with the async_read_some(null_buffers) asio
> handlers as they seem to be in the microbenchmark (see the screenshot
> here: https://imgur.com/a/FIUM4W6 ). This gap needs to be filled by
> some more investigation.

See above.  (PS: this is making me want to play with dbus-vis more.
It looks much better than doing all this via text files.)

> I suspect this might be the 7.7x single-core performance difference
> (as measured via LMBench) between the particular BMC and the host, as
> well as the BMC's DBus being a lot busier than the host's.
> Perhaps
> some other methods are needed in place of this "capture DBus and
> capture ASIO handler logs and align them" approach.

What approaches did you have in mind?  Most of what exists today in
dbus-sensors was debugged using those two things, but I'm always up
for something better.

>
> > Another option is to make it 2 threaded, one set up dbus traffic, another for async io
> We experimented with a similar architecture using "Mode 32" in
> hwmon-bmk (described in
> https://gerrit.openbmc-project.xyz/c/openbmc/openbmc-tools/+/35387),

That's super interesting, and the results don't really map to the
benchmarking I've done on the same thing.  One thing you don't list is
which BMC you're on, which might explain some of the discrepancy.  Most
of my testing was on an ast2500, which is single core, and doesn't
really benefit (at least linearly) from extra threads.  If you can't
list the specific model publicly, can you at least list the CPU
architecture type and core count you tested on?

The other thing I would say about that is that IMO you're measuring
the wrong thing.  The stat that thermal loops generally care about for
sensors is average sensor latency from reading to action (usually a
fan pwm change), not the actual time to read N sensors.  In network
terminology, dbus-sensors optimizes latency over throughput.  The
thinking is that the faster the thermal loop can act on each single
reading, the closer the thermal loop can be to the margin.

> but we only measured sensor reading times there. This method did not
> speed up sensor reading much. It might help DBus though. But
> eventually we want to address both sensor reading time and DBus
> response time for dbus-sensors.

Ignore the above idea.  Your problem is not what I thought it was.

>
> > ASIO priority queue
> This sounds promising. We will look into it, for both the DBus
> scheduling issue and the scalability differences we had observed in
> hwmon-bmk.

This would work in your case (assuming I understand what's going on,
which is a stretch), but I think the timer synchronization model is
broken.

>
> > Can you detail what methods (in https://gerrit.openbmc-project.xyz/c/openbmc/openbmc-tools/+/35387 )?
> On top of patch 30348, we have 8 std::threads that directly run
> PSUSensors::prepareInput and do not use ASIO, and the main asio worker
> thread for handling the rest of the original PSUSensor logic. This
> corresponds to "Mode 7" in the description in hwmon-bmk (change
> 35387). We found that this roughly halves the total time it takes to
> read the 247 sensors. To compare, a user would need to start a very
> large number of threads (say, 32, 64 or 128) for the io_context to
> achieve the same speedup as can be seen in "Mode 42" in the
> descriptions of hwmon-bmk (change 35387).
>
> > Dumps
> The dump for the particular dbus-asio-bmk run is here:
> https://www.dropbox.com/sh/1xmo6vrz13zjkca/AADf1ySDYXyH6XIc7n2Hqc1Ba?dl=0
> For the dump from the actual BMC, a scrub process may be needed and I
> can't do it quickly yet.

That's a lot of interesting data, and I'm still unpacking it in my
head.  With that said, this is all great!  Keep it coming if you learn
anything else.

>
> I have set the mail to plaintext mode this time.

Excellent!



* Re: Inconsistent performance of dbus call GetManagedObjects to PSUSensor in dbus-sensors
From: Josh Lehan @ 2020-08-31 21:35 UTC (permalink / raw)
  To: Ed Tanous
  Cc: Sui Chen, OpenBMC Maillist, Peter Lundgren, Alex Qiu, Jie Yang,
	Drew Macrae

On Sat, Aug 15, 2020 at 7:05 AM Ed Tanous <ed@tanous.net> wrote:
> On Thu, Aug 13, 2020 at 4:33 PM Sui Chen <suichen@google.com> wrote:
> > > What patches did you apply?
> > This one: https://gerrit.openbmc-project.xyz/c/openbmc/dbus-sensors/+/30348
>
> Looking at this patch, I get the feeling this is specifically the
> cause of your problems.  Have you tried your things on master yet,
> without that patch, to see if it exhibits the same behavior?  There's
> a good reason each sensor has an independent timer instance;  The
> intent is that when the sensor system gets overloaded, the readings
> get delayed, and each individual timer introduces its own
> load-specific delays as it's started after the read completes.  Said
> another way, technically, a sensor scan rate isn't exactly 1 second,
> it's 1 second + queuing time + the time it takes to read the sensor,
> so if the sensor takes a long time to read (because the system is
> overloaded) the sensor scans don't collapse into one another and cause
> problems like you're seeing, and you still (hopefully) have
> significant idle time in the io_context to handle other things, like
> managing dbus connections.  (as a side note, for i2c based sensors,
> this keeps the latency of other operations on the bus down as well.

That's a good point.

With each sensor having its own free-running timer, which auto-extends
time if the system is delayed as you mentioned, that does provide a
way for the timers to automatically slow down when the system is under
heavy load and can't respond as quickly.

I wonder if it would be worth adding some timing to detect this, and
perhaps try to back off the timer duration somewhat? For example, if
1-second timers are bogging down too much, back the timer off to
2-second timers, and so on, until they have been slowed down enough to
give the system enough time to avoid falling behind.
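
Something like this is what I have in mind (just a sketch of the idea,
not an existing dbus-sensors feature; the back-off threshold here is
arbitrary):

#include <boost/asio.hpp>

#include <chrono>
#include <functional>

int main()
{
    namespace chrono = std::chrono;
    boost::asio::io_context io;
    boost::asio::steady_timer timer(io);
    chrono::milliseconds interval{1000};
    chrono::steady_clock::time_point requested;

    std::function<void(const boost::system::error_code&)> onTick;
    std::function<void()> schedule = [&] {
        requested = chrono::steady_clock::now() + interval;
        timer.expires_at(requested);
        timer.async_wait(onTick);
    };
    onTick = [&](const boost::system::error_code& ec) {
        if (ec)
        {
            return;
        }
        // Fired much later than asked: assume the system is bogging
        // down and stretch the interval, 1 s -> 2 s -> 4 s ...
        if (chrono::steady_clock::now() - requested > interval / 2)
        {
            interval *= 2;
        }
        // ... read the sensor here ...
        schedule();
    };
    schedule();
    io.run();
}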

> Unfortunately, that patch kinda puts all that design on its head.
> "Avoid timer skew" is in the commit message.  That skew is there on
> purpose, to prevent the problem you're seeing.  With that said, I'm
> sure Josh was trying to fix something there, but in the commit message
> he doesn't actually outline what the problem is.  Do you know?  My
> best guess is maybe he was trying to get more accurate scan rates?  We
> could think up designs that have some logic that, in the base case
> handles the loop more accurately but spreading out the sensor scans is
> a good thing in general and reduces the peak work queue size (as
> you've already found out).
> Maybe there were some different performance issues that putting all
> the sensor scans together solved?

The sensor reading was causing the rest of the system to become
unusably slow, even before making any changes. I had 3 goals in mind:

1) The pmbus driver seems to ask the hardware for new data if it has
been long enough (HZ, roughly one second) since the last read, by
comparing the system jiffies clock. As all sensors would have their
own free-running timers, the driver readings would be smeared
throughout time, causing this interval to be quickly and repeatedly
reached. We suspected a driver was being slow, perhaps because of I2C,
so we were seeing a system slowdown every second. By reading more
sensors simultaneously, and then pausing for a while, I was hoping to
avoid triggering this interval as often, thus taking advantage of some
caching. The pmbus driver does a complete read of the chip every time
this interval is triggered, thus if I can satisfy more sensors at a
single interval, I don't need to do as many complete readings.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/pmbus/pmbus_core.c#n584

2) In general, it's good practice to read sensors at fixed intervals,
to avoid skew between sensor readings. An example is voltage and
amperage. Ideally, you would sample them simultaneously, so that
computations such as obtaining wattage (volts * amps) would be
correct. If the sensor readings were staggered, and you were reading
something that was changing, your calculated wattage would be
incorrect. Many embedded sensor networks have a command that
broadcasts out to the worker nodes, to "lock" a sensor reading into
local sensor memory, then, as time permits, the leader node can issue
commands individually to collect data from each worker, knowing that
the data was earlier all captured simultaneously. This technique
improves the quality of collected data, and is often used in industry.
I was hoping to bring PSUSensor a little closer to this ideal.

3) By synchronizing to one master timer, it becomes possible to slow
down the sensor collection interval, and introduce some resting time
between sensor data collection. The system became almost unusably slow
during sensor data collection. By introducing a rest time between the
readings, I was able to "PWM" the system load, and thus dial it down
to make the system usable again, by changing the duty cycle to
something like 70% instead of 100%.

I wonder if it would be worth it, instead of having each sensor have
its own free-running timer, to still have a single master timer for
all sensor collection, but stagger it? For example, let's say you need to
read 200 sensors, and you need to do a complete reading once every 5
seconds. Have the timer fire at intervals of 15 milliseconds. Collect
one sensor reading each tick. This completes 200 sensors in 3000
milliseconds. The system can then be quiet for the remaining 2000
milliseconds of the interval, giving it a chance to catch up and
remain responsive.
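
A sketch of what that could look like (illustrative only, using the
numbers from the example above):

#include <boost/asio.hpp>

#include <chrono>
#include <functional>

int main()
{
    boost::asio::io_context io;
    boost::asio::steady_timer timer(io);
    const int numSensors = 200;
    const auto tick = std::chrono::milliseconds(15);
    const auto sweepPeriod = std::chrono::seconds(5);
    int nextSensor = 0;
    auto sweepStart = std::chrono::steady_clock::now();

    std::function<void(const boost::system::error_code&)> onTick;
    std::function<void()> schedule = [&] {
        if (nextSensor == 0)
        {
            // Start of a sweep: anchor to the 5-second grid, so the
            // idle time after the last sensor absorbs the slack.
            timer.expires_at(sweepStart);
            sweepStart += sweepPeriod;
        }
        else
        {
            timer.expires_after(tick); // next sensor, 15 ms later
        }
        timer.async_wait(onTick);
    };
    onTick = [&](const boost::system::error_code& ec) {
        if (ec)
        {
            return;
        }
        // ... read sensor number nextSensor here ...
        nextSensor = (nextSensor + 1) % numSensors;
        schedule();
    };
    schedule();
    io.run();
}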

Josh


* Re: Inconsistent performance of dbus call GetManagedObjects to PSUSensor in dbus-sensors
From: Ed Tanous @ 2020-08-31 23:34 UTC (permalink / raw)
  To: Josh Lehan
  Cc: Sui Chen, OpenBMC Maillist, Peter Lundgren, Alex Qiu, Jie Yang,
	Drew Macrae

This is great feedback BTW.  Keep it coming.

-Ed

On Mon, Aug 31, 2020 at 2:35 PM Josh Lehan <krellan@google.com> wrote:
>
> On Sat, Aug 15, 2020 at 7:05 AM Ed Tanous <ed@tanous.net> wrote:
> > On Thu, Aug 13, 2020 at 4:33 PM Sui Chen <suichen@google.com> wrote:
> > > > What patches did you apply?
> > > This one: https://gerrit.openbmc-project.xyz/c/openbmc/dbus-sensors/+/30348
> >
> > Looking at this patch, I get the feeling this is specifically the
> > cause of your problems.  Have you tried your things on master yet,
> > without that patch, to see if it exhibits the same behavior?  There's
> > a good reason each sensor has an independent timer instance;  The
> > intent is that when the sensor system gets overloaded, the readings
> > get delayed, and each individual timer introduces its own
> > load-specific delays as it's started after the read completes.  Said
> > another way, technically, a sensor scan rate isn't exactly 1 second,
> > it's 1 second + queuing time + the time it takes to read the sensor,
> > so if the sensor takes a long time to read (because the system is
> > overloaded) the sensor scans don't collapse into one another and cause
> > problems like you're seeing, and you still (hopefully) have
> > significant idle time in the io_context to handle other things, like
> > managing dbus connections.  (as a side note, for i2c based sensors,
> > this keeps the latency of other operations on the bus down as well.
>
> That's a good point.
>
> With each sensor having its own free-running timer, which auto-extends
> time if the system is delayed as you mentioned, that does provide a
> way for the timers to automatically slow down when the system is under
> heavy load and can't respond as quickly.
>
> I wonder if it would be worth adding some timing to detect this, and
> perhaps try to back off the timer duration somewhat? For example, if
> 1-second timers are bogging down too much, back the timer off to
> 2-second timers, and so on, until they have been slowed down enough to
> give the system enough time to avoid falling behind.

What you described would be better in theory, but I've never seen
sensor problems bad enough to need something like that.
Usually the sensor reads get backed up, the kernel sorts them out in
some order (note, some day I'd like to understand better how that
ordering works, and if there's anything to optimize there), then we go
to the wait state, which allows the BMC to recover pretty quickly (a
couple of seconds).  In theory, your change could allow the BMC to
recover in a more orderly way, leaving more gaps of CPU time.  My gut
(ie, I have zero data to back this up) says that we might get some
minor latency gains, at the expense of recovery time.  Which is better
in that case, IDK.

It'd be really cool if we had a mode where we could just dynamically
scale sensor reads as BMC CPU and IO resources allow, and not have a
fixed scan rate.  In theory phosphor-pid-control is already timing
aware when it makes fan changes based on the integral component, so
that should "just work".  I wonder how other IO-driven applications do
CPU-usage based throttling in practice, and if it would be easy to
adapt to the BMC as the default.

>
> > Unfortunately, that patch kinda puts all that design on its head.
> > "Avoid timer skew" is in the commit message.  That skew is there on
> > purpose, to prevent the problem you're seeing.  With that said, I'm
> > sure Josh was trying to fix something there, but in the commit message
> > he doesn't actually outline what the problem is.  Do you know?  My
> > best guess is maybe he was trying to get more accurate scan rates?  We
> > could think up designs that have some logic that, in the base case
> > handles the loop more accurately but spreading out the sensor scans is
> > a good thing in general and reduces the peak work queue size (as
> > you've already found out).
> > Maybe there were some different performance issues that putting all
> > the sensor scans together solved?
>
> The sensor reading was causing the rest of the system to become
> unusably slow, even before making any changes. I had 3 goals in mind:
>
> 1) The pmbus driver seems to ask the hardware for new data if it has
> been long enough (HZ, roughly one second) since the last read, by
> comparing the system jiffies clock. As all sensors would have their
> own free-running timers, the driver readings would be smeared
> throughout time, causing this interval to be quickly and repeatedly
> reached. We suspected a driver was being slow, perhaps because of I2C,
> so we were seeing a system slowdown every second. By reading more
> sensors simultaneously, and then pausing for a while, I was hoping to
> avoid triggering this interval as often, thus taking advantage of some
> caching. The pmbus driver does a complete read of the chip every time
> this interval is triggered, thus if I can satisfy more sensors at a
> single interval, I don't need to do as many complete readings.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/pmbus/pmbus_core.c#n584

Gotcha.  That makes considerably more sense.  Did you profile it, and
did your patch work?  I wonder how we could work around that in a less
invasive way, that keeps the smearing properties in time, but doesn't
cause such bad read errors.

Alternatively (again, with no data) I wonder if the pmbus kernel
driver needs some improvement to not do that, or decrease that time to
something more reasonable.  This requires more thought on my part, but
it's really good information overall.

>
> 2) In general, it's good practice to read sensors at fixed intervals,
> to avoid skew between sensor readings. An example is voltage and
> amperage. Ideally, you would sample them simultaneously, so that
> computations such as obtaining wattage (volts * amps) would be
> correct.

Yep, and this is why each sensor owns its own IO loop rather than
something global, so when things like the above come up, we can code
the IO actions in whatever order makes the most sense.  The above
doesn't mean we need to scan all sensors at the same time; we just
need to schedule "adjacent" sensors together.  Also, if we're talking
best case, we would be using ICs that can do that calculation in
hardware, but I realize we don't always get to pick the hardware we
have :)

> If the sensor readings were staggered, and you were reading
> something that was changing, your calculated wattage would be
> incorrect. Many embedded sensor networks have a command that
> broadcasts out to the worker nodes, to "lock" a sensor reading into
> local sensor memory, then, as time permits, the leader node can issue
> commands individually to collect data from each worker, knowing that
> the data was earlier all captured simultaneously. This technique
> improves the quality of collected data, and is often used in industry.
> I was hoping to bring PSUSensor a little closer to this ideal.

While I agree with you in the spirit of your example, I think we need
to build it in such a way that groups together the things that benefit
from being together, while not imposing a single timer on the whole
system.  In the above example, I'd personally make sure that PSUSensor
did a voltage reading and a current reading on the same timing.  Also,
in this case, doesn't PMBus already have a value available for wattage
that the PSU calculates in hardware?

Getting very accurate snapshots of the system state is great in a lot
of industries, but I don't think it applies much to BMCs.  As a
general rule, thermal control loops care the most about the delay
between reading and action, so in almost all cases, we'd never want to
read a "locked in" value from the past (unless we were doing system
modeling, which is a whole other thing); we'd much rather read the
"latest" value.

>
> 3) By synchronizing to one master timer, it becomes possible to slow
> down the sensor collection interval, and introduce some resting time
> between sensor data collection. The system became almost unusably slow
> during sensor data collection. By introducing a rest time between the
> readings, I was able to "PWM" the system load, and thus dial it down
> to make the system usable again, by changing the duty cycle to
> something like 70% instead of 100%.

Did you try lowering the process priority of the sensors that were
causing your issues?  I've long wondered if we should do that by
default, but never really had a system that was too overloaded to do
the investigations on.

>
> I wonder if it would be worth it, instead of having each sensor have
> its own free-running timer, to still have a single master time for all
> sensor collection, but stagger it? For example, let's say you need to
> read 200 sensors, and you need to do a complete reading once every 5
> seconds. Have the timer fire at intervals of 15 milliseconds. Collect
> one sensor reading each tick. This completes 200 sensors in 3000
> milliseconds. The system can then be quiet for the remaining 2000
> milliseconds of the interval, giving it a chance to catch up and
> remain responsive.

That would be an interesting experiment, but I think the root of the
issue is that no sensor, no matter how expensive it is to read, should
be blocking other user space from running for a significant portion of
time.  Have you ever tried running a perf record session on your BMC
to see where it's spending all that time?  Is it in that pmbus kernel
loop you mentioned above?

I'm a little wary of trying to PWM it, because picking duty cycles is
going to rely on specifics of the implementation on that particular
system, and lead to a lot of necessary error/odd condition handling.
What if the same system has half the number of sensors?  What if it
has 4X?  What if you have a ground PMbus channel, and 4 of your 200
sensors take 10 seconds to read?

Trying to qualify all of those conditions gets really hairy, although
I suspect it could be done.

>
> Josh

