* kdbus: to merge or not to merge?
@ 2015-06-23  6:06 Andy Lutomirski
From: Andy Lutomirski @ 2015-06-23  6:06 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel, David Herrmann, Djalal Harouni,
	Greg KH, Havoc Pennington, Eric W. Biederman,
	One Thousand Gnomes, Tom Gundersen, Daniel Mack

Hi Linus,

Can you opine as to whether you think that kdbus should be merged?  I
don't mean whether you'd accept a pull request that Greg may or may
not send during this merge window -- I mean whether you think that
kdbus should be merged if it had appropriate review and people were
okay with the implementation.

The current state of uncertainty is problematic, I think.  The kdbus
team is spending a lot of time making things compatible with kdbus,
and the latest systemd release makes kdbus userspace support
mandatory.  The kernel people who would review it (myself included)
probably don't want to review new versions at a line-by-line level,
because we (myself included) either don't know whether there's any
point or don't think that it should be merged *even if the
implementation were flawless*.

For my part, here's my argument why the answer should be "no, kdbus
shouldn't be merged":

1. It's not necessary.  kdbus is a giant API surface.  The problems it
purports to solve are (very roughly) performance, ability to collect
metadata in a manner that doesn't suck, sandbox support, better
logging/monitoring, and availability very early in userspace startup.
I think that the performance issues should be solved in userspace --
gdbus performance is atrocious for reasons that have nothing to do
with the kernel or context switches [1].  The metadata problem, to the
extent that it's a real problem, can and should be solved by improving
AF_UNIX.  The logging, monitoring, and early userspace problems can
and should be solved in userspace.  See #3 below for my thoughts on
the sandbox.  Right now, kdbus sounds an awful lot like TUX, the
in-kernel web server: a performance argument for moving into the kernel
something that userspace could do well.

2. Kdbus introduces a novel buffering model.  Receivers allocate a big
chunk of what's essentially tmpfs space.  Assuming that space is
available (in a virtual memory sense), senders synchronously write to
the receivers' tmpfs space.  Broadcast senders synchronously write to
*all* receivers' tmpfs space.  I think that, regardless of
implementation, this is problematic if the sender and the receiver are
in different memcgs.  Suppose that the message is to be written to a
page in the receiver's tmpfs space that is not currently resident.  If
the write happens in the sender's memcg context, then a receiver can
effectively allocate an unlimited number of pages in the sender's
memcg, which will, in practice, be the init memcg if the sender is
systemd.  This breaks the memcg model.  If, on the other hand, the
sender writes to the receiver's tmpfs space in the receiver's memcg
context, then the sender will block (or fail?  presumably
unpredictable failures are a bad thing) if the receiver's memcg is at
capacity.

3. The sandbox model is, in my opinion, an experiment that isn't going
to succeed.  It's a poor model: a "restricted endpoint" (i.e. a
sandboxed kdbus client) sees a view of the world defined by a limited
policy language implemented by the kernel.  This completely fails to
express what I think should be common use cases.  If a sandboxed app
is given permission to access, say,
/org/gnome/evolution/dataserver/CalendarView/3125/12, then it knows
that it's looking at CalendarView/3125/12 (whatever that means) and
there's no way to hide the name.  If someone subsequently deletes that
CalendarView and creates a new one with that name, racelessly blocking
access to the new one for the app may be complicated.  If a sandbox
wants to prompt the user before allowing access to some resource, it
has a problem: the policy language doesn't seem to be able to express
request interception.

The sandbox model is also already starting to accumulate kludges.
Apparently it was recently discovered that the kdbus connection
lifetime model was incompatible with sandbox policy, so as of a recent
change [2] connection lifetime messages completely bypass sandbox
policy.  Maybe this isn't obviously insecure, but it seems like a bad
sign that "it's probably okay to poke this hole" is already happening
before the thing is even merged.

I'll point out that a pure userspace implementation of sandboxed dbus
connections would be straightforward to implement today, would have
none of these problems, and would allow arbitrarily complex policy and
the flexibility to redesign it in the future if the initial design
turned out to be inappropriate for the sandbox being written.  (You
could even have two different implementations to go with two different
sandboxes.  Let a thousand sandboxes bloom, which is easy in userspace
but not so great in the kernel.)

In summary, I think that a very high quality implementation of the
kdbus concept and API would be a bit faster than a very high quality
userspace implementation of dbus.  Other than that, I think it would
actually be worse.  The kdbus proponents seem to be comparing the
current kdbus implementation to the current userspace implementation,
and a favorable comparison there is not a good reason to merge it.

--Andy

[1] I spent a while today trying to benchmark sd-bus.  I gave up,
because I couldn't get test code to build.  I don't have the patience
to try harder.

[2] https://git.kernel.org/cgit/linux/kernel/git/gregkh/char-misc.git/commit/?h=kdbus&id=d27c8057699d164648b7d8c1559fa6529998f89d

* Re: kdbus: to merge or not to merge?
@ 2015-07-01  0:03 Kalle A. Sandstrom
From: Kalle A. Sandstrom @ 2015-07-01  0:03 UTC (permalink / raw)
  To: linux-kernel


[Delurk; apparently kdbus is not receiving the architectural review it should.
I've got quite a bit of knowledge of message-passing mechanisms in general, and
of kernel IPC in particular, so I'll weigh in uninvited. Apologies for length.

As my "proper" review on this topic is still under construction, I'll try (and
fail) to be brief here. I started down that road only to realize that kdbus is
quite the ball of mud even if the only thing under the scope is its interface,
and that if I held off until properly ready I'd risk kdbus having already been
merged, making review moot.]


Ingo Molnar wrote:

>- I've been closely monitoring Linux kernel changes for over 20 years, and for the
>  last 10 years the linux/ipc/* code has been dormant: it works and was kept good
>  for existing usecases, but no-one was maintaining and enhancing it with the
>  future in mind.

It's my understanding that linux/ipc/* contains only SysV IPC (shm, sem,
and message queues) plus POSIX message queues. There are other IPC-implementing
things in the kernel also, such as unix domain sockets, pipes, shared memory
via mmap(), signals, mappings that appear shared across fork(), and whatever
else provides either kernel-mediated multi-client buffer access or some
combination of shared memory and synchronization that lets userspace exchange
hot data across the address space boundary.

It's also my understanding that no-one in their right mind would call SysV IPC
state-of-the-art even at the level of interface; indeed its presence in the
hoariest of vendor unixes suggests it's not supposed to be even close.

However, the suggested replacement in kdbus replicates the worst[-1] of all
known user-to-user IPC mechanisms, i.e. Mach. I'm not suggesting that Linux
adopt e.g. a different microkernel IPC mechanism-- those are by and large
inapplicable to a monolithic kernel for reasons of ABI (and, well, why would
you do IPC when function calls are zomgfast already?)-- but rather, that the
existing ones either are good enough at this time or can be reworked to become
near-equivalent to the state of the art in terms of performance.


>  So there exists a technical vacuum: the kernel does not have any good, modern
>  IPC ABI at the moment that distros can rely on as a 'golden standard'. This is
>  partly technical, partly political. The technical reason is that SysV IPC is
>  ancient and cumbersome. The political reason is that SystemD could be using
>  and extending Android's existing kernel accelerated IPC subsystem (Binder)
>  that is already upstream - but does not.

I'll contend that the reason for this vacuum is that the existing kernel IPC
interfaces are fine, to the point that other mechanisms may be derived from
them solely in user-space without significant performance penalty, and without
pushing ca. 10k SLOC of IPC broker and policy engine into kernel space.

Furthermore, it's my well-ruminated opinion that implementations of the
userspace ABI specified in the kdbus 4.1-rc1 version (of April this year) will
always be necessarily slower than existing IPC primitives in terms of both
throughput and latency; and that the latter are directly applicable to
constructing a more convenient user-space IPC broker that implements what
kdbus seeks to provide: naming, broadcast, unidirectional signaling,
bidirectional "method calls", and a policy mechanism.

In addition I'll argue that as currently specified, the kdbus interface-- even
if tuned to its utmost-- is not only necessarily inferior to e.g. a well-tuned
version of unix domain sockets, but also fundamentally flawed in ways that
prohibit construction of robust in-system distributed programs by kdbus'
mechanisms alone (byzantine call-site workarounds notwithstanding).


For the first, compare unix domain sockets (i.e. point-to-point mode, access
control through filesystem [or fork() parentage], read/write/select) to the
kdbus message-sending ioctl. In the main data-exchanging portion, the former
requires only a connection identifier, a pointer to a buffer, and the length
of data in that buffer.  By contrast, kdbus takes a complex message-sending
command structure with 0..n items of m kinds that the ioctl must parse in an
m-way switching loop, and then another complex message-describing structure
with its own 1..n items of yet more kinds describing its contents,
destination-lookup options, negotiation of supported options, and so forth.

Consequently, a carefully optimized implementation of unix domain sockets (and
by extension all the data-carrying SysV etc. IPC primitives, optimized
similarly) will always be superior to kdbus for both message throughput and
latency, for the reason of kdbus' comparatively great interface complexity
alone.

There's an obvious caveat here, i.e. "well where is it, then?". Given the
overhead dictated by its interface, kdbus' performance is already inferior for
short messages. For long messages (> L1 cache size per Stetson-Harrison[0]) the
only performance benefit from kdbus is its claimed single-copy mode of
operation-- an equivalent to which could be had with ye olde sockets by copying
data from the writer directly into the reader while one of them blocks[1] in
the appropriate syscall. That the current Linux pipes, SysV queues, unix domain
sockets, etc. don't do this doesn't really factor in.


For the second, kdbus is fundamentally designed to buffer message data, up to
a fixed limit, in the pool associated with receivers' connections. I cannot
overstate the degree of this _outright architectural blunder_, so I'll put an
extra paragraph break here just for emphasis' sake.

A consequence of this buffering is that whenever a client sends a message with
kdbus, it must be prepared to handle an out-of-space non-delivery status.
(kdbus has two of those, one for queue length and another for buffer space.
Why, I have no idea-- do clients behave differently in response to one
of them than to the other?)  There's no option to e.g. overwrite a previous
message, or to discard queued messages in oldest-first order, instead of
rebuffing the sender.

For broadcast messaging, a recipient may observe that messages were dropped by
looking at a `dropped_msgs' field delivered (and then reset) as part of the
message reception ioctl.  Its value is the number of messages dropped since the
last read, so arguably a client could paper over the condition by explicitly
resynchronizing, whenever the value is >0, with every signal-sender on its
current bus whose protocol it knows.  This method could in principle apply to
1-to-1 unidirectional messaging as well[2].

Looking at the kdbus "send message, wait for tagged reply" feature in
conjunction with these details appears to reveal two holes in its state graph.
The first is that if replies are delivered through the requestor's buffer,
concurrent sends into that same buffer may cause it to become full (or the
queue to grow too long, whichever) before the service gets a chance to reply. If this
condition causes a reply to fall out of the IPC flow, the requestor will hang
until either its specified timeout happens or it gets interrupted by a signal.
If replies are delivered outside the shm pool, the requestor must be prepared
to pick them up using a different means from the "in your pool w/ offset X,
length Y" format the main-line kdbus interface provides. [I've seen no such
thing in the kdbus docs so far.]

As far as alternative solutions go, preallocation of space for a reply message
is an incomplete fix unless every reply's size has a known upper bound (e.g.
with use of an IDL compiler); in this scheme it'd be necessary for the
requestor to specify this, suffering consequences if the number is too low, and
to prepare to handle a "not enough buffer space for a reply" condition at send.
The kdbus docs specify no such condition.

The second problem is that given how there can be a timeout or interrupt on the
receive side of a "method call" transaction, it's possible for the requestor to
bow out of the IPC flow _while the service is processing its request_. This
results either in the reply message being lost, or its ending up in the
requestor's buffer to appear in a loop where it may not be expected. Either
way, the client must at that point resynchronize wrt all objects related to the
request's side effects, or abandon the IPC flow entirely and start over.
(services need only confirm their replies before effecting e.g. a chardev-like
"destructively read N bytes from buffer" operation's outcome, which is slightly
less ugly.)


Tying this back into the first point: to prevent this type of denial-of-service
against sanguinely-written software it's necessary for kdbus to invoke the
policy engine to determine that an unrelated participant isn't allowed to
consume a peer's buffer space. As this operation is absent in unix-domain
sockets, an ideal implementation of kdbus 4.1-rc1 will be slower in
point-to-point communication even if the particulars of its message-descriptor
format get reworked to a light-weight alternative. In addition, its API ends up
requiring highly involved state-tracking wrappers or inversion-of-control
machinery in its clients, to the point where just using unix domain sockets
with a heavyweight user-space broker would be nicer.


It's my opinionated conclusion that merging kdbus as-is would be the sort of
cock-up which we'll look back at, point a finger, giggle a bit, and wonder only
half-jokingly if there was something besides horse bones in that glue. Its docs
betray an absence of careful analysis, and the spec of its interface is so
loose as to make programs written for kdbus 4.1-rc1 subtly incompatible with
any later version, through deeply-baked design consequences stemming from
quirks of its current implementation.

I'm not a Linux kernel developer. But if I were, this would be where I'd put
my NAK.


Sincerely,
  -KS

[-1] author's opinion
[0] no bunny rabbits were harmed
[1] The case where both use non-blocking I/O requires either a buffer or
    support from the scheduler. The former is no optimization at all, and the
    latter may be _quite involved indeed_.
[2] as for whether freedesktop.org programs will be designed and built to such
    a standard, I suspend judgement.

^ permalink raw reply	[flat|nested] 72+ messages in thread


Thread overview: 72+ messages
2015-06-23  6:06 kdbus: to merge or not to merge? Andy Lutomirski
2015-06-23  6:31 ` Andy Lutomirski
2015-06-23  6:41 ` Greg KH
2015-06-23  7:22   ` Richard Weinberger
2015-06-23  9:25     ` Martin Steigerwald
2015-06-23  9:38       ` Martin Steigerwald
2015-06-23 15:07     ` Andy Lutomirski
2015-06-25  2:14       ` Steven Rostedt
2015-06-25  2:20         ` Linus Torvalds
2015-06-25  6:01           ` Martin Steigerwald
2015-06-25  6:05             ` Martin Steigerwald
2015-06-25 13:34               ` Theodore Ts'o
2015-06-25 14:03                 ` Martin Steigerwald
2015-06-23  9:12   ` Borislav Petkov
2015-07-08 13:54   ` Pavel Machek
2015-07-09  8:39     ` Geert Uytterhoeven
2015-07-09 10:29       ` Joe Perches
2015-07-09 10:57         ` Geert Uytterhoeven
2015-07-09 11:36       ` Pavel Machek
2015-06-23 23:19 ` Linus Torvalds
2015-06-24  0:52   ` Andy Lutomirski
2015-06-24  8:05   ` Ingo Molnar
2015-06-24 10:41     ` Eric W. Biederman
2015-06-24 10:46     ` Martin Steigerwald
2015-06-24 13:18       ` Ingo Molnar
2015-06-24 17:39         ` David Lang
2015-06-24 18:41           ` Eric W. Biederman
2015-06-24 18:50           ` Martin Steigerwald
2015-06-24 19:12             ` David Lang
2015-06-25  7:57               ` Geert Uytterhoeven
2015-06-25 15:26                 ` Steven Rostedt
2015-06-25  6:31           ` Greg KH
2015-06-25  6:48             ` David Lang
2015-06-25  7:47           ` Ingo Molnar
2015-06-25  7:51             ` Ingo Molnar
2015-06-24 11:43     ` Martin Steigerwald
2015-06-24 13:27       ` Ingo Molnar
2015-06-24  9:55 ` Alexander Larsson
2015-06-24 14:38   ` Andy Lutomirski
     [not found]     ` <CAHr-LrYWNwv6_YLoP-B3duQ1QsjPiTiaEnjBQ7j2brPMeTgA3A@mail.gmail.com>
     [not found]       ` <CALCETrW3F6YP_H1oRJa47f1DT7B35OubhJYSnq0U-_GmFQHNOA@mail.gmail.com>
2015-06-24 17:11         ` Alexander Larsson
2015-06-24 19:43           ` Andy Lutomirski
2015-06-24 20:45             ` Alexander Larsson
2015-08-03 23:02 ` Andy Lutomirski
2015-08-04  8:58   ` David Herrmann
2015-08-04 13:46     ` Linus Torvalds
2015-08-04 14:09       ` David Herrmann
2015-08-04 14:47         ` Andy Lutomirski
2015-08-05  0:18           ` Andy Lutomirski
2015-08-06  7:06             ` Daniel Mack
2015-08-06 15:27               ` Andy Lutomirski
2015-08-06 17:24                 ` Daniel Mack
2015-08-05  7:10           ` David Herrmann
2015-08-05 20:11             ` Andy Lutomirski
2015-08-06  8:04               ` David Herrmann
2015-08-06  8:25                 ` Martin Steigerwald
2015-08-06 15:21                 ` Andy Lutomirski
2015-08-06 18:14                   ` Daniel Mack
2015-08-06 18:43                     ` Andy Lutomirski
2015-08-07 14:40                       ` Daniel Mack
2015-08-07 15:09                         ` Andy Lutomirski
     [not found]                         ` <CA+55aFxDLt-5+=xXeYG4nJKMb8L_iD9FmwTZ2VuughBku-mW3g@mail.gmail.com>
2015-08-09 19:00                           ` Greg Kroah-Hartman
2015-08-09 22:11                               ` Daniel Mack
2015-08-10  2:10                                 ` Andy Lutomirski
2015-08-10 17:04                                 ` Linus Torvalds
2015-08-10  2:48                             ` David Lang
2015-08-07 15:37                       ` cee1
2015-07-01  0:03 Kalle A. Sandstrom
2015-07-01 16:51 ` David Herrmann
2015-07-06 21:18   ` Kalle A. Sandstrom
