D-bus is three orders of magnitude too slow (Re: [GIT PULL] kdbus for 4.1-rc1)

* D-bus is three orders of magnitude too slow (Re: [GIT PULL] kdbus for 4.1-rc1)
@ 2015-04-27 20:08 Andy Lutomirski
  2015-04-28 10:25 ` Lukasz Skalski
  0 siblings, 1 reply; 2+ messages in thread
From: Andy Lutomirski @ 2015-04-27 20:08 UTC (permalink / raw)
  To: Lukasz Skalski
  Cc: Greg Kroah-Hartman, Linus Torvalds, Andrew Morton, Arnd Bergmann,
	Eric W. Biederman, One Thousand Gnomes, Tom Gundersen,
	Jiri Kosina, linux-kernel, Daniel Mack, David Herrmann,
	Djalal Harouni

On Fri, Apr 24, 2015 at 6:50 AM, Lukasz Skalski <l.skalski@samsung.com> wrote:
> Hi All,
>
> On 04/23/2015 07:16 PM, Greg Kroah-Hartman wrote:
>> On Thu, Apr 23, 2015 at 09:46:22AM -0700, Andy Lutomirski wrote:
>>>  - There's still an open performance question.  Namely: is kdbus performant?
>>
>> Yes, I thought that was already answered.  Tizen posted some numbers
>> with a much older version of the code, before David fixed a bunch of
>> issues that he and you found, and that averaged between 25-50% faster.
>> Details are in this presentation:
>>       http://download.tizen.org/misc/media/conference2014/slides/tdc2014-kdbus-in-tizen3.pdf
>>
>> The Tizen and GENIVI developers are off running numbers with the latest
>> code, or so they told me through emails, but I don't know when/if that
>> will ever happen, so I can't promise more than what is already here.
>>
>
> I'm working on kdbus support for GLib ([1],[2]). I saw some questions
> about kdbus performance, so I've prepared simple benchmark. Because
> David already has posted some comparison results between kdbus and UDS,
> I've decided to use my GLib port with native kdbus support (it should
> be noted, that this port is not finished yet and there are still some
> places for improvements, thus please do not treat these test results as
> final).
>
> To perform tests I've created two simple apps:
>
> - server: http://fpaste.org/215157/
> - client: http://fpaste.org/215156/

After some fiddling (what's this G_BUS_TYPE_USER thing?
G_BUS_TYPE_SESSION seems to work), I got this to run.

Can we please take a big step back from the kernel-vs-userspace debate
here?  Can we try to understand why it's so bloody slow before
thinking about merging something that might ossify it?

With a slightly improved test (it shows how many calls happened) and
userspace dbus, I got:

$ taskset -c 0 dbus-launch bash
$ perf stat ./test.sh
HANDLER: bus_acquired_handler
HANDLER: name_acquired_handler
20000 calls in 4.978810 s = 0.248940 ms per call

 Performance counter stats for './test.sh':

       3866.770672 task-clock (msec)         #    0.738 CPUs utilized
           300,633 context-switches          #    0.078 M/sec
                 3 cpu-migrations            #    0.001 K/sec
               817 page-faults               #    0.211 K/sec
    13,572,244,134 cycles                    #    3.510 GHz
         [83.26%]
     8,799,771,026 stalled-cycles-frontend   #   64.84% frontend
cycles idle    [82.75%]
     6,914,976,524 stalled-cycles-backend    #   50.95% backend
cycles idle    [68.92%]
     9,915,362,980 instructions              #    0.73  insns per cycle
                                             #    0.89  stalled cycles
per insn [84.82%]
     2,343,233,242 branches                  #  605.992 M/sec
         [82.25%]
        44,699,493 branch-misses             #    1.91% of all
branches         [82.83%]

       5.238823116 seconds time elapsed

That's more than 15 context switches per call.  Something is severely
broken.  There should be almost exactly four context switches per
call.

NB: subtract 250ms from the elapsed time.  test.sh contains a sleep
0.25 to work around a stupid race.

This performance is absolutely terrible.  Kdbus should be 100-1000x
faster, not 2x faster, so kdbus' performance is still absolutely
terrible.

Sure, we can crank the message size so high that the only thing that
matters is the per-byte overhead, in which case anything that uses
memfd will win, but dbus-daemon could pretty easily do that to.

Anyway, here's part of the problem.  The client does this for each
message (I added the nanosleep for this test so I could see what was
going on):

[pid 29592] eventfd2(0, O_NONBLOCK|O_CLOEXEC) = 6
[pid 29592] write(6, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 29592] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 29592] poll([{fd=6, events=POLLIN}], 1, 25000) = 1 ([{fd=6,
revents=POLLIN}])
[pid 29592] poll([{fd=6, events=POLLIN}], 1, 25000) = 1 ([{fd=6,
revents=POLLIN}])
[pid 29592] read(6, "\1\0\0\0\0\0\0\0", 16) = 8
[pid 29592] poll([{fd=6, events=POLLIN}], 1, 25000 <unfinished ...>
[pid 29593] <... poll resumed> )        = 1 ([{fd=5, revents=POLLIN}])
[pid 29593] read(5, "\1\0\0\0\0\0\0\0", 16) = 8
[pid 29593] sendmsg(4, {msg_name(0)=NULL,
msg_iov(1)=[{"l\1\0\1\5\4\0\0=\4\0\0e\0\0\0\10\1g\0\1s\0\0\1\1o\0\r\0\0\0"...,
1149}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 1149
[pid 29593] poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}], 2,
4294967295) = 1 ([{fd=4, revents=POLLIN}])
[pid 29593] read(5, 0x7f9115edfcf0, 16) = -1 EAGAIN (Resource
temporarily unavailable)
[pid 29593] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 29593] recvmsg(4, {msg_name(0)=NULL,
msg_iov(1)=[{"l\2\1\1\7\0\0\0>\4\0\0-\0\0\0", 16}], msg_controllen=0,
msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
[pid 29593] poll([{fd=4, events=POLLIN}], 1, 0) = 1 ([{fd=4, revents=POLLIN}])
[pid 29593] recvmsg(4, {msg_name(0)=NULL,
msg_iov(1)=[{"\10\1g\0\1s\0\0\5\1u\0=\4\0\0\6\1s\0\4\0\0\0:1.1\0\0\0\0"...,
55}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC)
= 55
[pid 29593] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 29593] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 29593] poll([{fd=5, events=POLLIN}], 1, 0) = 1 ([{fd=5, revents=POLLIN}])
[pid 29593] write(6, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
[pid 29592] <... poll resumed> )        = 1 ([{fd=6, revents=POLLIN}])
[pid 29592] futex(0x1f5c130, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 29593] <... write resumed> )       = 8
[pid 29593] futex(0x1f5c130, FUTEX_WAKE, 1 <unfinished ...>
[pid 29592] <... futex resumed> )       = 0
[pid 29592] read(6, "\1\0\0\0\0\0\0\0", 16) = 8
[pid 29592] futex(0x1f5c130, FUTEX_WAKE, 1) = 0
[pid 29592] write(6, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 29592] futex(0x1f5c140, FUTEX_WAKE, 2147483647) = 0
[pid 29592] nanosleep({0, 100000},  <unfinished ...>
[pid 29593] <... futex resumed> )       = 1
[pid 29593] close(6)                    = 0
[pid 29593] poll([{fd=4, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 29593] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 29593] poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}], 2,
4294967295) = 1 ([{fd=5, revents=POLLIN}])
[pid 29593] read(5, "\4\0\0\0\0\0\0\0", 16) = 8
[pid 29593] poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}], 2,
4294967295 <unfinished ...>
[pid 29592] <... nanosleep resumed> NULL) = 0

Hmm.  So you're closing an fd for each message, thereby forcing an RCU
cleanup, and you're doing some kind of inter-thread communication
using several poll calls and a few futex wakes.

Can someone who likes kdbus please benchmark kdbus against something
that doesn't go out of its way to be inefficient?  I don't even want
to review kernel code that has as a major claim to fame the ability to
beat this mess by a mere factor of two.

For comparison, I have some code that uses Thrift, which is dog-slow
[1], to serialize a message, send it, and deserialize it on the other
end.  The entire process takes 28 microseconds on average.  That would
be almost exactly three orders of magnitude faster.  Of course, I'm
not comparing apples to apples here, but the species of fruit being
compared does not justify a three order of magnitude difference.

[1] For all I know, the glib serialization stuff is also dog-slow, or
perhaps snail-slow.  Nonetheless, Thrift, or at least the version I'm
using, is not a shining example of performance.  If you want
serialization performance, use Cap'n Proto.

--Andy

^ permalink raw reply	[flat|nested] 2+ messages in thread