From: Rasmus Villemoes <linux@rasmusvillemoes.dk> To: David Howells <dhowells@redhat.com>, torvalds@linux-foundation.org Cc: Casey Schaufler <casey@schaufler-ca.com>, Stephen Smalley <sds@tycho.nsa.gov>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, nicolas.dichtel@6wind.com, raven@themaw.net, Christian Brauner <christian@brauner.io>, keyrings@vger.kernel.org, linux-usb@vger.kernel.org, linux-block@vger.kernel.org, linux-security-module@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [RFC PATCH 03/21] pipe: Use head and tail pointers for the ring, not cursor and length Date: Wed, 16 Oct 2019 09:46:10 +0200 [thread overview] Message-ID: <b8799179-d389-8005-4f6d-845febc3bb23@rasmusvillemoes.dk> (raw) In-Reply-To: <157117609543.15019.17103851546424902507.stgit@warthog.procyon.org.uk> On 15/10/2019 23.48, David Howells wrote: > Convert pipes to use head and tail pointers for the buffer ring rather than > pointer and length as the latter requires two atomic ops to update (or a > combined op) whereas the former only requires one. > > (1) The head pointer is the point at which production occurs and points to > the slot in which the next buffer will be placed. This is equivalent > to pipe->curbuf + pipe->nrbufs. > > The head pointer belongs to the write-side. > > (2) The tail pointer is the point at which consumption occurs. It points > to the next slot to be consumed. This is equivalent to pipe->curbuf. > > The tail pointer belongs to the read-side. > > (3) head and tail are allowed to run to UINT_MAX and wrap naturally. They > are only masked off when the array is being accessed, e.g.: > > pipe->bufs[head & mask] > > This means that it is not necessary to have a dead slot in the ring as > head == tail isn't ambiguous. > > (4) The ring is empty if "head == tail". > > (5) The occupancy of the ring is "head - tail". > > (6) The number of free slots in the ring is "(tail + pipe->ring_size) - > head". Seems an odd way of writing pipe->ring_size - (head - tail) ; i.e. obviously #free slots is #size minus #occupancy. > (7) The ring is full if "head >= (tail + pipe->ring_size)", which can also > be written as "head - tail >= pipe->ring_size". > No it cannot, it _must_ be written in the latter form. Assuming sizeof(int)==1 for simplicity, consider ring_size = 16, tail = 240. Regardless whether head is 240, 241, ..., 255, 0, tail + ring_size wraps to 0, so the former expression states the ring is full in all cases. Better spell out somewhere that while head and tail are free-running, at any point in time they satisfy the invariant head - tail <= pipe_size (and also 0 <= head - tail, but that's a tautology for unsigned ints...). Then it's a matter of taste if one wants to write "full" as head-tail == pipe_size or head-tail >= pipe_size. > Also split pipe->buffers into pipe->ring_size (which indicates the size of the > ring) and pipe->max_usage (which restricts the amount of ring that write() is > allowed to fill). This allows for a pipe that is both writable by the kernel > notification facility and by userspace, allowing plenty of ring space for > notifications to be added whilst preventing userspace from being able to use > up too much buffer space. That seems like something that should be added in a separate patch - adding ->max_usage and switching appropriate users of ->ring_size over, so it's more clear where you're using one or the other. > @@ -1949,8 +1950,12 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe, > > pipe_lock(pipe); > > - bufs = kvmalloc_array(pipe->nrbufs, sizeof(struct pipe_buffer), > - GFP_KERNEL); > + head = pipe->head; > + tail = pipe->tail; > + mask = pipe->ring_size - 1; > + count = head - tail; > + > + bufs = kvmalloc_array(count, sizeof(struct pipe_buffer), GFP_KERNEL); > if (!bufs) { > pipe_unlock(pipe); > return -ENOMEM; > @@ -1958,8 +1963,8 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe, > > nbuf = 0; > rem = 0; > - for (idx = 0; idx < pipe->nrbufs && rem < len; idx++) > - rem += pipe->bufs[(pipe->curbuf + idx) & (pipe->buffers - 1)].len; > + for (idx = tail; idx < head && rem < len; idx++) > + rem += pipe->bufs[idx & mask].len; > > ret = -EINVAL; > if (rem < len) > @@ -1970,16 +1975,16 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe, > struct pipe_buffer *ibuf; > struct pipe_buffer *obuf; > > - BUG_ON(nbuf >= pipe->buffers); > - BUG_ON(!pipe->nrbufs); > - ibuf = &pipe->bufs[pipe->curbuf]; > + BUG_ON(nbuf >= pipe->ring_size); > + BUG_ON(tail == head); > + ibuf = &pipe->bufs[tail]; I don't see where tail gets masked between tail = pipe->tail; above and here, but I may be missing it. In any case, how about seeding head and tail with something like 1<<20 when creating the pipe so bugs like that are hit more quickly. > @@ -515,17 +525,19 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) > static long pipe_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) > { > struct pipe_inode_info *pipe = filp->private_data; > - int count, buf, nrbufs; > + int count, head, tail, mask; > > switch (cmd) { > case FIONREAD: > __pipe_lock(pipe); > count = 0; > - buf = pipe->curbuf; > - nrbufs = pipe->nrbufs; > - while (--nrbufs >= 0) { > - count += pipe->bufs[buf].len; > - buf = (buf+1) & (pipe->buffers - 1); > + head = pipe->head; > + tail = pipe->tail; > + mask = pipe->ring_size - 1; > + > + while (tail < head) { > + count += pipe->bufs[tail & mask].len; > + tail++; > } This is broken if head has wrapped but tail has not. It has to be "while (head - tail)" or perhaps just "while (tail != head)" or something along those lines. > @@ -1086,17 +1104,21 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg) > } > > /* > - * We can shrink the pipe, if arg >= pipe->nrbufs. Since we don't > - * expect a lot of shrink+grow operations, just free and allocate > - * again like we would do for growing. If the pipe currently > + * We can shrink the pipe, if arg is greater than the ring occupancy. > + * Since we don't expect a lot of shrink+grow operations, just free and > + * allocate again like we would do for growing. If the pipe currently > * contains more buffers than arg, then return busy. > */ > - if (nr_pages < pipe->nrbufs) { > + mask = pipe->ring_size - 1; > + head = pipe->head & mask; > + tail = pipe->tail & mask; > + n = pipe->head - pipe->tail; I think it's confusing to "premask" head and tail here. Can you either drop that (pipe_set_size should hardly be a hot path?), or perhaps call them something else to avoid a future reader seeing an unmasked bufs[head] and thinking that's a bug? > @@ -1254,9 +1290,10 @@ static ssize_t pipe_get_pages(struct iov_iter *i, > struct page **pages, size_t maxsize, unsigned maxpages, > size_t *start) > { > + unsigned int p_tail; > + unsigned int i_head; > unsigned npages; > size_t capacity; > - int idx; > > if (!maxsize) > return 0; > @@ -1264,12 +1301,15 @@ static ssize_t pipe_get_pages(struct iov_iter *i, > if (!sanity(i)) > return -EFAULT; > > - data_start(i, &idx, start); > - /* some of this one + all after this one */ > - npages = ((i->pipe->curbuf - idx - 1) & (i->pipe->buffers - 1)) + 1; > - capacity = min(npages,maxpages) * PAGE_SIZE - *start; > + data_start(i, &i_head, start); > + p_tail = i->pipe->tail; > + /* Amount of free space: some of this one + all after this one */ > + npages = (p_tail + i->pipe->ring_size) - i_head; Hm, it's not clear that this is equivalent to the old computation. Since it seems repeated in a few places, could it be factored to a little helper (before this patch) and the "some of this one + all after this one" comment perhaps expanded to explain what is going on? Rasmus
WARNING: multiple messages have this Message-ID (diff)
From: Rasmus Villemoes <linux@rasmusvillemoes.dk> To: David Howells <dhowells@redhat.com>, torvalds@linux-foundation.org Cc: Casey Schaufler <casey@schaufler-ca.com>, Stephen Smalley <sds@tycho.nsa.gov>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, nicolas.dichtel@6wind.com, raven@themaw.net, Christian Brauner <christian@brauner.io>, keyrings@vger.kernel.org, linux-usb@vger.kernel.org, linux-block@vger.kernel.org, linux-security-module@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [RFC PATCH 03/21] pipe: Use head and tail pointers for the ring, not cursor and length Date: Wed, 16 Oct 2019 07:46:10 +0000 [thread overview] Message-ID: <b8799179-d389-8005-4f6d-845febc3bb23@rasmusvillemoes.dk> (raw) In-Reply-To: <157117609543.15019.17103851546424902507.stgit@warthog.procyon.org.uk> On 15/10/2019 23.48, David Howells wrote: > Convert pipes to use head and tail pointers for the buffer ring rather than > pointer and length as the latter requires two atomic ops to update (or a > combined op) whereas the former only requires one. > > (1) The head pointer is the point at which production occurs and points to > the slot in which the next buffer will be placed. This is equivalent > to pipe->curbuf + pipe->nrbufs. > > The head pointer belongs to the write-side. > > (2) The tail pointer is the point at which consumption occurs. It points > to the next slot to be consumed. This is equivalent to pipe->curbuf. > > The tail pointer belongs to the read-side. > > (3) head and tail are allowed to run to UINT_MAX and wrap naturally. They > are only masked off when the array is being accessed, e.g.: > > pipe->bufs[head & mask] > > This means that it is not necessary to have a dead slot in the ring as > head = tail isn't ambiguous. > > (4) The ring is empty if "head = tail". > > (5) The occupancy of the ring is "head - tail". > > (6) The number of free slots in the ring is "(tail + pipe->ring_size) - > head". Seems an odd way of writing pipe->ring_size - (head - tail) ; i.e. obviously #free slots is #size minus #occupancy. > (7) The ring is full if "head >= (tail + pipe->ring_size)", which can also > be written as "head - tail >= pipe->ring_size". > No it cannot, it _must_ be written in the latter form. Assuming sizeof(int)=1 for simplicity, consider ring_size = 16, tail = 240. Regardless whether head is 240, 241, ..., 255, 0, tail + ring_size wraps to 0, so the former expression states the ring is full in all cases. Better spell out somewhere that while head and tail are free-running, at any point in time they satisfy the invariant head - tail <= pipe_size (and also 0 <= head - tail, but that's a tautology for unsigned ints...). Then it's a matter of taste if one wants to write "full" as head-tail = pipe_size or head-tail >= pipe_size. > Also split pipe->buffers into pipe->ring_size (which indicates the size of the > ring) and pipe->max_usage (which restricts the amount of ring that write() is > allowed to fill). This allows for a pipe that is both writable by the kernel > notification facility and by userspace, allowing plenty of ring space for > notifications to be added whilst preventing userspace from being able to use > up too much buffer space. That seems like something that should be added in a separate patch - adding ->max_usage and switching appropriate users of ->ring_size over, so it's more clear where you're using one or the other. > @@ -1949,8 +1950,12 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe, > > pipe_lock(pipe); > > - bufs = kvmalloc_array(pipe->nrbufs, sizeof(struct pipe_buffer), > - GFP_KERNEL); > + head = pipe->head; > + tail = pipe->tail; > + mask = pipe->ring_size - 1; > + count = head - tail; > + > + bufs = kvmalloc_array(count, sizeof(struct pipe_buffer), GFP_KERNEL); > if (!bufs) { > pipe_unlock(pipe); > return -ENOMEM; > @@ -1958,8 +1963,8 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe, > > nbuf = 0; > rem = 0; > - for (idx = 0; idx < pipe->nrbufs && rem < len; idx++) > - rem += pipe->bufs[(pipe->curbuf + idx) & (pipe->buffers - 1)].len; > + for (idx = tail; idx < head && rem < len; idx++) > + rem += pipe->bufs[idx & mask].len; > > ret = -EINVAL; > if (rem < len) > @@ -1970,16 +1975,16 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe, > struct pipe_buffer *ibuf; > struct pipe_buffer *obuf; > > - BUG_ON(nbuf >= pipe->buffers); > - BUG_ON(!pipe->nrbufs); > - ibuf = &pipe->bufs[pipe->curbuf]; > + BUG_ON(nbuf >= pipe->ring_size); > + BUG_ON(tail = head); > + ibuf = &pipe->bufs[tail]; I don't see where tail gets masked between tail = pipe->tail; above and here, but I may be missing it. In any case, how about seeding head and tail with something like 1<<20 when creating the pipe so bugs like that are hit more quickly. > @@ -515,17 +525,19 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) > static long pipe_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) > { > struct pipe_inode_info *pipe = filp->private_data; > - int count, buf, nrbufs; > + int count, head, tail, mask; > > switch (cmd) { > case FIONREAD: > __pipe_lock(pipe); > count = 0; > - buf = pipe->curbuf; > - nrbufs = pipe->nrbufs; > - while (--nrbufs >= 0) { > - count += pipe->bufs[buf].len; > - buf = (buf+1) & (pipe->buffers - 1); > + head = pipe->head; > + tail = pipe->tail; > + mask = pipe->ring_size - 1; > + > + while (tail < head) { > + count += pipe->bufs[tail & mask].len; > + tail++; > } This is broken if head has wrapped but tail has not. It has to be "while (head - tail)" or perhaps just "while (tail != head)" or something along those lines. > @@ -1086,17 +1104,21 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg) > } > > /* > - * We can shrink the pipe, if arg >= pipe->nrbufs. Since we don't > - * expect a lot of shrink+grow operations, just free and allocate > - * again like we would do for growing. If the pipe currently > + * We can shrink the pipe, if arg is greater than the ring occupancy. > + * Since we don't expect a lot of shrink+grow operations, just free and > + * allocate again like we would do for growing. If the pipe currently > * contains more buffers than arg, then return busy. > */ > - if (nr_pages < pipe->nrbufs) { > + mask = pipe->ring_size - 1; > + head = pipe->head & mask; > + tail = pipe->tail & mask; > + n = pipe->head - pipe->tail; I think it's confusing to "premask" head and tail here. Can you either drop that (pipe_set_size should hardly be a hot path?), or perhaps call them something else to avoid a future reader seeing an unmasked bufs[head] and thinking that's a bug? > @@ -1254,9 +1290,10 @@ static ssize_t pipe_get_pages(struct iov_iter *i, > struct page **pages, size_t maxsize, unsigned maxpages, > size_t *start) > { > + unsigned int p_tail; > + unsigned int i_head; > unsigned npages; > size_t capacity; > - int idx; > > if (!maxsize) > return 0; > @@ -1264,12 +1301,15 @@ static ssize_t pipe_get_pages(struct iov_iter *i, > if (!sanity(i)) > return -EFAULT; > > - data_start(i, &idx, start); > - /* some of this one + all after this one */ > - npages = ((i->pipe->curbuf - idx - 1) & (i->pipe->buffers - 1)) + 1; > - capacity = min(npages,maxpages) * PAGE_SIZE - *start; > + data_start(i, &i_head, start); > + p_tail = i->pipe->tail; > + /* Amount of free space: some of this one + all after this one */ > + npages = (p_tail + i->pipe->ring_size) - i_head; Hm, it's not clear that this is equivalent to the old computation. Since it seems repeated in a few places, could it be factored to a little helper (before this patch) and the "some of this one + all after this one" comment perhaps expanded to explain what is going on? Rasmus
next prev parent reply other threads:[~2019-10-16 7:46 UTC|newest] Thread overview: 87+ messages / expand[flat|nested] mbox.gz Atom feed top 2019-10-15 21:47 [RFC PATCH 00/21] pipe: Keyrings, Block and USB notifications David Howells 2019-10-15 21:47 ` David Howells 2019-10-15 21:47 ` David Howells 2019-10-15 21:47 ` [RFC PATCH 01/21] pipe: Reduce #inclusion of pipe_fs_i.h David Howells 2019-10-15 21:47 ` David Howells 2019-10-15 21:47 ` David Howells 2019-10-15 21:48 ` [RFC PATCH 02/21] Add a prelocked wake-up David Howells 2019-10-15 21:48 ` David Howells 2019-10-15 21:48 ` David Howells 2019-10-15 22:14 ` Linus Torvalds 2019-10-15 22:14 ` Linus Torvalds 2019-10-16 17:02 ` Tim Chen 2019-10-16 17:02 ` Tim Chen 2019-10-15 22:33 ` David Howells 2019-10-15 22:33 ` David Howells 2019-10-16 14:26 ` David Howells 2019-10-16 14:26 ` David Howells 2019-10-16 15:31 ` Linus Torvalds 2019-10-16 15:31 ` Linus Torvalds 2019-10-15 21:48 ` [RFC PATCH 03/21] pipe: Use head and tail pointers for the ring, not cursor and length David Howells 2019-10-15 21:48 ` David Howells 2019-10-15 21:48 ` David Howells 2019-10-16 7:46 ` Rasmus Villemoes [this message] 2019-10-16 7:46 ` Rasmus Villemoes 2019-10-17 10:53 ` David Howells 2019-10-17 10:53 ` David Howells 2019-10-15 21:48 ` [RFC PATCH 04/21] pipe: Advance tail pointer inside of wait spinlock in pipe_read() David Howells 2019-10-15 21:48 ` David Howells 2019-10-15 21:48 ` David Howells 2019-10-15 21:48 ` [RFC PATCH 05/21] pipe: Conditionalise wakeup " David Howells 2019-10-15 21:48 ` David Howells 2019-10-15 21:48 ` David Howells 2019-10-15 21:48 ` [RFC PATCH 06/21] pipe: Rearrange sequence in pipe_write() to preallocate slot David Howells 2019-10-15 21:48 ` David Howells 2019-10-15 21:48 ` David Howells 2019-10-15 21:48 ` [RFC PATCH 07/21] pipe: Remove redundant wakeup from pipe_write() David Howells 2019-10-15 21:48 ` David Howells 2019-10-15 21:48 ` David Howells 2019-10-15 21:49 ` [RFC PATCH 08/21] pipe: Check for ring full inside of the spinlock in pipe_write() David Howells 2019-10-15 21:49 ` David Howells 2019-10-15 21:49 ` David Howells 2019-10-15 22:20 ` Linus Torvalds 2019-10-15 22:20 ` Linus Torvalds 2019-10-15 21:49 ` [RFC PATCH 09/21] uapi: General notification queue definitions David Howells 2019-10-15 21:49 ` David Howells 2019-10-15 21:49 ` David Howells 2019-10-15 21:49 ` [RFC PATCH 10/21] security: Add hooks to rule on setting a watch David Howells 2019-10-15 21:49 ` David Howells 2019-10-15 21:49 ` David Howells 2019-10-15 21:49 ` [RFC PATCH 11/21] security: Add a hook for the point of notification insertion David Howells 2019-10-15 21:49 ` David Howells 2019-10-15 21:49 ` David Howells 2019-10-15 21:49 ` [RFC PATCH 12/21] pipe: Add general notification queue support David Howells 2019-10-15 21:49 ` David Howells 2019-10-15 21:49 ` David Howells 2019-10-15 21:49 ` [RFC PATCH 13/21] keys: Add a notification facility David Howells 2019-10-15 21:49 ` David Howells 2019-10-15 21:49 ` David Howells 2019-10-15 21:49 ` [RFC PATCH 14/21] Add sample notification program David Howells 2019-10-15 21:49 ` David Howells 2019-10-15 21:49 ` David Howells 2019-10-15 21:50 ` [RFC PATCH 15/21] pipe: Allow buffers to be marked read-whole-or-error for notifications David Howells 2019-10-15 21:50 ` David Howells 2019-10-15 21:50 ` David Howells 2019-10-15 21:50 ` [RFC PATCH 16/21] pipe: Add notification lossage handling David Howells 2019-10-15 21:50 ` David Howells 2019-10-15 21:50 ` David Howells 2019-10-15 21:50 ` [RFC PATCH 17/21] Add a general, global device notification watch list David Howells 2019-10-15 21:50 ` David Howells 2019-10-15 21:50 ` David Howells 2019-10-15 21:50 ` [RFC PATCH 18/21] block: Add block layer notifications David Howells 2019-10-15 21:50 ` David Howells 2019-10-15 21:50 ` David Howells 2019-10-15 21:50 ` [RFC PATCH 19/21] usb: Add USB subsystem notifications David Howells 2019-10-15 21:50 ` David Howells 2019-10-15 21:50 ` David Howells 2019-10-15 21:50 ` [RFC PATCH 20/21] selinux: Implement the watch_key security hook David Howells 2019-10-15 21:50 ` David Howells 2019-10-15 21:50 ` David Howells 2019-10-15 21:50 ` [RFC PATCH 21/21] smack: Implement the watch_key and post_notification hooks David Howells 2019-10-15 21:50 ` David Howells 2019-10-15 21:50 ` David Howells 2019-10-15 22:11 ` [RFC PATCH 00/21] pipe: Keyrings, Block and USB notifications James Morris 2019-10-15 22:11 ` James Morris 2019-10-15 22:11 ` James Morris 2019-10-15 22:32 ` Linus Torvalds 2019-10-15 22:32 ` Linus Torvalds
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=b8799179-d389-8005-4f6d-845febc3bb23@rasmusvillemoes.dk \ --to=linux@rasmusvillemoes.dk \ --cc=casey@schaufler-ca.com \ --cc=christian@brauner.io \ --cc=dhowells@redhat.com \ --cc=gregkh@linuxfoundation.org \ --cc=keyrings@vger.kernel.org \ --cc=linux-api@vger.kernel.org \ --cc=linux-block@vger.kernel.org \ --cc=linux-fsdevel@vger.kernel.org \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-security-module@vger.kernel.org \ --cc=linux-usb@vger.kernel.org \ --cc=nicolas.dichtel@6wind.com \ --cc=raven@themaw.net \ --cc=sds@tycho.nsa.gov \ --cc=torvalds@linux-foundation.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.