Linux-Security-Module Archive on lore.kernel.org
 help / color / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: David Howells <dhowells@redhat.com>
Cc: Ray Strode <rstrode@redhat.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Steven Whitehouse <swhiteho@redhat.com>,
	Nicolas Dichtel <nicolas.dichtel@6wind.com>,
	raven@themaw.net, keyrings@vger.kernel.org,
	linux-usb@vger.kernel.org,
	linux-block <linux-block@vger.kernel.org>,
	Christian Brauner <christian@brauner.io>,
	LSM List <linux-security-module@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>,
	Linux List Kernel Mailing <linux-kernel@vger.kernel.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	"Ray, Debarshi" <debarshi.ray@gmail.com>,
	Robbie Harwood <rharwood@redhat.com>
Subject: Re: Why add the general notification queue and its sources
Date: Thu, 5 Sep 2019 17:07:25 -0700
Message-ID: <CAHk-=wjt2Eb+yEDOcQwCa0SrZ4cWu967OtQG8Vz21c=n5ZP1Nw@mail.gmail.com> (raw)
In-Reply-To: <14883.1567725508@warthog.procyon.org.uk>

On Thu, Sep 5, 2019 at 4:18 PM David Howells <dhowells@redhat.com> wrote:
>
> Can you write into a pipe from softirq context and/or with spinlocks held
> and/or with the RCU read lock held?  That is a requirement.  Another is that
> messages get inserted whole or not at all (or if they are truncated, the size
> field gets updated).

Right now we use a mutex for the buffer locking, so no, pipe buffers
are not irq-safe or atomic. That's due to the whole "we may block on
data from user space" when doing a write.

HOWEVER.

Pipes actually have buffers on two different levels: there's the
actual data buffers themselves (each described by a "struct
pipe_buffer"), and there's the circular queue of them (the
"pipe->buf[]" array, with pipe->curbuf/nrbufs) that points to
individual data buffers.

And we could easily separate out that data buffer management. Right
now it's not really all that separated: people just do things like

        int newbuf = (pipe->curbuf + bufs) & (pipe->buffers-1);
        struct pipe_buffer *buf = pipe->bufs + newbuf;
...
        pipe->nrbufs++;

to add a buffer into that circular array of buffers, but _that_ part
could be made separate.  It's just all protected by the pipe mutex
right now, so it has never been an issue.

And yes, atomicity of writes has actually been an integral part of
pipes since forever. It's actually the only unambiguous atomicity that
POSIX guarantees. It only holds for writes to pipes() of less than
PIPE_BUF blocks, but that's 4096 on Linux.

> Since one end would certainly be attached to an fd, it looks on the face of it
> that writing into the pipe would require taking pipe->mutex.

That's how the normal synchronization is done, yes. And changing that
in general would be pretty painful. For example, two concurrent
user-space writers might take page faults and just generally be
painful, and the pipe locking needs to serialize that.

So the mutex couldn't go away from pipes in general - it would remain
for read/write/splice mutual exclusion (and it's not just the data it
protects, it's the reader/writer logic for EPIPE etc).

But the low-level pipe->bufs[] handling is another issue entirely.
Even when a user space writer copies things from user space, it does
so into a pre-allocated buffer that is then attached to the list of
buffers somewhat separately (there's a magical special case where you
can re-use a buffer that is marked as "I can be reused" and append
into an already allocated buffer).

And adding new buffers *could* be done with it's own separate locking.
If you have a blocking writer (ie a user space data source), that
would still take the pipe mutex, and it would delay the user space
readers (because the readers also need the mutex), but it should not
be all that hard to just make the whole "curbuf/nrbufs" handling use
its own locking (maybe even some lockless atomics and cmpxchg).

So a kernel writer could "insert" a "struct pipe_buffer" atomically,
and wake up the reader atomically. No need for the other complexity
that is protected by the mutex.

The buggest problem is perhaps that the number of pipe buffers per
pipe is fairly limited by default. PIPE_DEF_BUFFERS is 16, and if we'd
insert using the ->bufs[] array, that would be the limit of "number of
messages". But each message could be any size (we've historically
limited pipe buffers to one page each, but that limit isn't all that
hard. You could put more data in there).

The number of pipe buffers _is_ dynamic, so the above PIPE_DEF_BUFFERS
isn't a hard limit, but it would be the default.

Would it be entirely trivial to do all the above? No. But it's
*literally* just finding the places that work with pipe->curbuf/nrbufs
and making them use atomic updates. You'd find all the places by just
renaming them (and making them atomic or whatever) and the compiler
will tell you "this area needs fixing".

We've actually used pipes for messages before: autofs uses a magic
packetized pipe buffer thing. It didn't need any extra atomicity,
though, so it stil all worked with the regular pipe->mutex thing.

And there is a big advantage from using pipes. They really would work
with almost anything. You could even mix-and-match "data generated by
kernel" and "data done by 'write()' or 'splice()' by a user process".

NOTE! I'm not at all saying that pipes are perfect. You'll find people
who swear by sockets instead. They have their own advantages (and
disadvantages). Most people who do packet-based stuff tend to prefer
sockets, because those have standard packet-based models (Linux pipes
have that packet mode too, but it's certainly not standard, and I'm
not even sure we ever exposed it to user space - it could be that it's
only used by the autofs daemon).

I have a soft spot for pipes, just because I think they are simpler
than sockets. But that soft spot might be misplaced.

                   Linus

  reply index

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-04 22:15 [PATCH 00/11] Keyrings, Block and USB notifications [ver #8] David Howells
2019-09-04 22:15 ` [PATCH 01/11] uapi: General notification ring definitions " David Howells
2019-09-04 22:16 ` [PATCH 02/11] security: Add hooks to rule on setting a watch " David Howells
2019-09-04 22:16 ` [PATCH 03/11] security: Add a hook for the point of notification insertion " David Howells
2019-09-04 22:16 ` [PATCH 04/11] General notification queue with user mmap()'able ring buffer " David Howells
2019-09-04 22:16 ` [PATCH 05/11] keys: Add a notification facility " David Howells
2019-09-04 22:16 ` [PATCH 06/11] Add a general, global device notification watch list " David Howells
2019-09-04 22:16 ` [PATCH 07/11] block: Add block layer notifications " David Howells
2019-09-04 22:16 ` [PATCH 08/11] usb: Add USB subsystem " David Howells
2019-09-04 22:17 ` [PATCH 09/11] Add sample notification program " David Howells
2019-09-04 22:17 ` [PATCH 10/11] selinux: Implement the watch_key security hook " David Howells
2019-09-04 22:17 ` [PATCH 11/11] smack: Implement the watch_key and post_notification hooks " David Howells
2019-09-04 22:28 ` [PATCH 00/11] Keyrings, Block and USB notifications " Linus Torvalds
2019-09-05 17:01 ` Why add the general notification queue and its sources David Howells
2019-09-05 17:19   ` Linus Torvalds
2019-09-05 18:32     ` Ray Strode
2019-09-05 20:39       ` Linus Torvalds
2019-09-06 19:32         ` Ray Strode
2019-09-06 19:41           ` Ray Strode
2019-09-06 19:53           ` Robbie Harwood
2019-09-05 21:32       ` David Howells
2019-09-05 22:08         ` Linus Torvalds
2019-09-05 23:18         ` David Howells
2019-09-06  0:07           ` Linus Torvalds [this message]
2019-09-06 10:09           ` David Howells
2019-09-06 15:35             ` Linus Torvalds
2019-09-06 15:53               ` Linus Torvalds
2019-09-06 16:12                 ` Steven Whitehouse
2019-09-06 17:07                   ` Linus Torvalds
2019-09-06 17:14                     ` Linus Torvalds
2019-09-06 21:19                       ` David Howells
2019-09-06 17:14                   ` Andy Lutomirski
2019-09-05 18:37     ` Steven Whitehouse
2019-09-05 18:51       ` Ray Strode
2019-09-05 20:09         ` David Lehman
2019-09-05 18:33   ` Greg Kroah-Hartman

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHk-=wjt2Eb+yEDOcQwCa0SrZ4cWu967OtQG8Vz21c=n5ZP1Nw@mail.gmail.com' \
    --to=torvalds@linux-foundation.org \
    --cc=christian@brauner.io \
    --cc=debarshi.ray@gmail.com \
    --cc=dhowells@redhat.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=keyrings@vger.kernel.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-security-module@vger.kernel.org \
    --cc=linux-usb@vger.kernel.org \
    --cc=nicolas.dichtel@6wind.com \
    --cc=raven@themaw.net \
    --cc=rharwood@redhat.com \
    --cc=rstrode@redhat.com \
    --cc=swhiteho@redhat.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Security-Module Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-security-module/0 linux-security-module/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-security-module linux-security-module/ https://lore.kernel.org/linux-security-module \
		linux-security-module@vger.kernel.org linux-security-module@archiver.kernel.org
	public-inbox-index linux-security-module

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-security-module


AGPL code for this site: git clone https://public-inbox.org/ public-inbox