From: David Howells <dhowells@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: dhowells@redhat.com, Jakub Kicinski <kuba@kernel.org>,
	netdev@vger.kernel.org, "David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Paolo Abeni <pabeni@redhat.com>,
	Willem de Bruijn <willemdebruijn.kernel@gmail.com>,
	David Ahern <dsahern@kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Jens Axboe <axboe@kernel.dk>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Chuck Lever <chuck.lever@oracle.com>,
	Boris Pismenny <borisp@nvidia.com>,
	John Fastabend <john.fastabend@gmail.com>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: Bug in short splice to socket?
Date: Thu, 01 Jun 2023 15:34:08 +0100
Message-ID: <832277.1685630048@warthog.procyon.org.uk>
In-Reply-To: <CAHk-=wghhHghtvU_SzXxAzfaX35BkNs-x91-Vj6+6tnVEhPrZg@mail.gmail.com>


Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Jun 1, 2023 at 9:09 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > The reason the old code is garbage is that it sets SPLICE_F_MORE
> > entirely in the wrong place. It sets it *after* it has done the
> > splice(). That's just crazy.
> 
> Clarification, because there are two splice's (from and to): by "after
> the splice" I mean after the splice source, of course. It's still set
> before the splice _to_ the network.
> 
> (But it is still true that I hope the networking code itself then sets
> MSG_MORE even if SPLICE_F_MORE wasn't set, if it gets a buffer that is
> bigger than what it can handle right now - so there are two
> *different* reasons for "more data" - either the source knows it has
> more to give, or the destination knows it didn't use everything it
> got).

I'm in the process of changing things so that ->sendpage() is removed and
replaced with sendmsg() with MSG_SPLICE_PAGES.  The aim is to end up with a
splice_to_socket() function (see attached) that transcribes a chunk of the
pipe into a BVEC iterator and does a single sendmsg() that pushes a whole
chunk of data to the socket in one go.
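
Stripped right down, the caller side amounts to something like this (just a
sketch; send_two_pages() is a made-up illustrative helper, the real thing is
the splice_to_socket() attached below):

/* Illustrative sketch only: describe two already-pinned pages with a bio_vec
 * array, wrap that in an ITER_SOURCE iterator and hand the lot to the
 * protocol in a single sendmsg() call.
 */
static ssize_t send_two_pages(struct socket *sock,
			      struct page *p0, struct page *p1)
{
	struct bio_vec bvec[2];
	struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES };

	bvec_set_page(&bvec[0], p0, PAGE_SIZE, 0);
	bvec_set_page(&bvec[1], p1, PAGE_SIZE, 0);
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, 2, 2 * PAGE_SIZE);
	return sock_sendmsg(sock, &msg);
}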

In the network protocol, the loop inside sendmsg() splices those pages into
socket buffers, sleeping as necessary to obtain sufficient memory to
transcribe all of them.  It can return having only sent part of the data; the
pages that were fully transferred are consumed from the pipe and
splice_to_socket() then loops round to gather more.

At the moment, it transcribes 16 pages at a time.  I could make it set
MSG_MORE only if (a) SPLICE_F_MORE was passed into the splice() syscall or (b)
there's yet more data in the buffer.

However, this might well cause a malfunction in UDP, for example.  MSG_MORE
corks the current packet.  Say I ask sendfile() to shove 32K into a packet,
but only 16K is read from the source and that 16K is entirely transcribed
into the packet: if I understand what you're proposing, MSG_MORE wouldn't get
set and the packet would be transmitted early.  A similar issue might exist
for AF_TLS, where the lack of MSG_MORE triggers an end-of-record.
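
To illustrate the UDP semantics I'm worried about preserving, here's roughly
what MSG_MORE corking looks like from userspace (just a sketch, assuming fd
is a connected SOCK_DGRAM socket; it has nothing to do with the splice code
itself):

#include <sys/socket.h>

/* Both halves end up in a single 13-byte datagram: the first send() is
 * corked by MSG_MORE and nothing hits the wire until the second send(),
 * which doesn't carry MSG_MORE.
 */
static void send_one_datagram_in_two_parts(int fd)
{
	send(fd, "hello, ", 7, MSG_MORE);
	send(fd, "world\n", 6, 0);
}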

The problem isn't that the buffer supplied is bigger than the protocol could
handle in one go, it's that what we read was less than the amount that the
user requested - but we don't know if there will be more data.

One solution might be to pass MSG_MORE regardless, and then follow up with a
null sendmsg() if we then hit a zero-length read (ie. EOF) or -ENODATA (ie. a
hole).
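
Kernel-side, that would be something like this (a sketch only; whether every
protocol copes gracefully with a zero-length sendmsg() like this is an
assumption that would need checking):

/* Rough sketch: after having sent everything with MSG_MORE set, flush the
 * corked packet/record by doing a zero-byte sendmsg() without MSG_MORE once
 * we know the source has hit EOF or a hole.
 */
static int flush_corked_data(struct socket *sock)
{
	struct bio_vec bvec;	/* not dereferenced; just something for the iterator to point at */
	struct msghdr msg = {};	/* no MSG_MORE */

	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 0, 0);
	return sock_sendmsg(sock, &msg);
}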

> The point is that the splice *source* knows whether there is more data
> to be had, and that's where the "there is more" should be set.

Actually, that's not necessarily true.  If the source is a pipe or a socket,
there may be no more data coming, but we don't know that until the far end is
closed; for a file, the source probably does know (though we could be racing
with a writer).  And then there are seq_file-based things, such as the trace
buffer or procfs files, which are really a class of their own.

> Basically my argument is that the whole "there is more data" should be
> set by "->splice_read()" not by some hack in some generic
> splice_direct_to_actor() function that is absolutely not supposed to
> know about the internals of the source or the destination.
> 
> Do we have a good interface for that? No. I get the feeling that to
> actually fix this, we'd have to pass in the 'struct splice_desc"
> pointer to ->splice_read().

That might become more feasible, actually.  In Jens's tree are the patches to
clean up splice_read a lot and get rid of ITER_PIPE.  Most of the
->splice_read() implementations are going to be direct calls to
copy_splice_read() or filemap_splice_read(), or wrappers around
filemap_splice_read().  The latter,
at least, might be in a good position to note EOF, perhaps by marking a pipe
buffer as "no more".
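
For example (purely hypothetical - PIPE_BUF_FLAG_EOF doesn't exist; it's just
a stand-in for whatever "no more" marking we might invent, set by
filemap_splice_read() on the last buffer it inserts once it has seen EOF):

#define PIPE_BUF_FLAG_EOF	0x80	/* made-up flag value */

/* Hypothetical sketch: mark the most recently inserted pipe buffer as the
 * end of the data so that a consumer such as splice_to_socket() knows it
 * can drop MSG_MORE.
 */
static void pipe_mark_last_buf_eof(struct pipe_inode_info *pipe)
{
	unsigned int mask = pipe->ring_size - 1;

	if (!pipe_empty(pipe->head, pipe->tail))
		pipe->bufs[(pipe->head - 1) & mask].flags |= PIPE_BUF_FLAG_EOF;
}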

copy_splice_read() is more problematic because ->read_iter() doesn't indicate
if it hit EOF (or hit a hole).

David
---
ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
			 loff_t *ppos, size_t len, unsigned int flags)
{
	struct socket *sock = sock_from_file(out);
	struct bio_vec bvec[16];
	struct msghdr msg = {};
	ssize_t ret = 0;	/* in case the loop never runs (len == 0) */
	size_t spliced = 0;
	bool need_wakeup = false;

	pipe_lock(pipe);

	while (len > 0) {
		unsigned int head, tail, mask, bc = 0;
		size_t remain = len;

		/*
		 * Check for signal early to make process killable when there
		 * are always buffers available
		 */
		ret = -ERESTARTSYS;
		if (signal_pending(current))
			break;

		while (pipe_empty(pipe->head, pipe->tail)) {
			ret = 0;
			if (!pipe->writers)
				goto out;

			if (spliced)
				goto out;

			ret = -EAGAIN;
			if (flags & SPLICE_F_NONBLOCK)
				goto out;

			ret = -ERESTARTSYS;
			if (signal_pending(current))
				goto out;

			if (need_wakeup) {
				wakeup_pipe_writers(pipe);
				need_wakeup = false;
			}

			pipe_wait_readable(pipe);
		}

		head = pipe->head;
		tail = pipe->tail;
		mask = pipe->ring_size - 1;

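		/* Describe as much of the pipe's content as we can in the bvec[]
		 * array without yet marking anything as consumed - we don't know
		 * how much the protocol will actually take.
		 */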
		while (!pipe_empty(head, tail)) {
			struct pipe_buffer *buf = &pipe->bufs[tail & mask];
			size_t seg;

			if (!buf->len) {
				tail++;
				continue;
			}

			seg = min_t(size_t, remain, buf->len);
			seg = min_t(size_t, seg, PAGE_SIZE);

			ret = pipe_buf_confirm(pipe, buf);
			if (unlikely(ret)) {
				if (ret == -ENODATA)
					ret = 0;
				break;
			}

			bvec_set_page(&bvec[bc++], buf->page, seg, buf->offset);
			remain -= seg;
			if (seg >= buf->len)
				tail++;
			if (bc >= ARRAY_SIZE(bvec))
				break;
		}

		if (!bc)
			break;

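		/* Only pass MSG_MORE if the caller set SPLICE_F_MORE or if there
		 * is more data waiting in the pipe than we could describe in
		 * this batch of bvecs.
		 */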
		msg.msg_flags = 0;
		if (flags & SPLICE_F_MORE)
			msg.msg_flags |= MSG_MORE;
		if (remain && pipe_occupancy(pipe->head, tail) > 0)
			msg.msg_flags |= MSG_MORE;
		msg.msg_flags |= MSG_SPLICE_PAGES;

		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, bc, len - remain);
		ret = sock_sendmsg(sock, &msg);
		if (ret <= 0)
			break;

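		/* The protocol accepted 'ret' bytes: update the running totals
		 * and then release the pipe buffers that were fully
		 * transferred, keeping any partially sent one.
		 */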
		spliced += ret;
		len -= ret;
		tail = pipe->tail;
		while (ret > 0) {
			struct pipe_buffer *buf = &pipe->bufs[tail & mask];
			size_t seg = min_t(size_t, ret, buf->len);

			buf->offset += seg;
			buf->len -= seg;
			ret -= seg;

			if (!buf->len) {
				pipe_buf_release(pipe, buf);
				tail++;
			}
		}

		if (tail != pipe->tail) {
			pipe->tail = tail;
			if (pipe->files)
				need_wakeup = true;
		}
	}

out:
	pipe_unlock(pipe);
	if (need_wakeup)
		wakeup_pipe_writers(pipe);
	return spliced ?: ret;
}


Thread overview: 69+ messages
2023-05-24 15:32 [PATCH net-next 00/12] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 3 David Howells
2023-05-24 15:33 ` [PATCH net-next 01/12] mm: Move the page fragment allocator from page_alloc.c into its own file David Howells
2023-05-24 15:33 ` [PATCH net-next 02/12] mm: Provide a page_frag_cache allocator cleanup function David Howells
2023-05-24 15:33   ` David Howells
2023-05-24 15:33 ` [PATCH net-next 03/12] mm: Make the page_frag_cache allocator alignment param a pow-of-2 David Howells
2023-05-24 15:33   ` David Howells
2023-05-27 15:54   ` Alexander H Duyck
2023-05-27 15:54     ` Alexander H Duyck
2023-11-30  9:00     ` Yunsheng Lin
2023-11-30  9:00       ` Yunsheng Lin
2023-06-16 15:28   ` David Howells
2023-06-16 15:28     ` David Howells
2023-06-16 16:06     ` Alexander Duyck
2023-06-16 16:06       ` Alexander Duyck
2023-05-24 15:33 ` [PATCH net-next 04/12] mm: Make the page_frag_cache allocator use multipage folios David Howells
2023-05-24 15:33   ` David Howells
2023-05-26 11:56   ` Yunsheng Lin
2023-05-26 11:56     ` Yunsheng Lin
2023-05-27 15:47     ` Alexander H Duyck
2023-05-27 15:47       ` Alexander H Duyck
2023-06-06  8:25     ` David Howells
2023-06-06  8:25       ` David Howells
2023-06-06 14:59       ` Alexander Duyck
2023-06-06 14:59         ` Alexander Duyck
2023-05-26 12:47   ` David Howells
2023-05-26 12:47     ` David Howells
2023-05-26 14:06     ` Mika Penttilä
2023-05-26 14:06       ` Mika Penttilä
2023-05-27  0:50   ` Jakub Kicinski
2023-05-27  0:50     ` Jakub Kicinski
2023-05-24 15:33 ` [PATCH net-next 05/12] mm: Make the page_frag_cache allocator handle __GFP_ZERO itself David Howells
2023-05-24 15:33   ` David Howells
2023-05-27  0:57   ` Jakub Kicinski
2023-05-27  0:57     ` Jakub Kicinski
2023-05-27 15:54     ` Alexander Duyck
2023-05-27 15:54       ` Alexander Duyck
2023-05-24 15:33 ` [PATCH net-next 06/12] mm: Make the page_frag_cache allocator use per-cpu David Howells
2023-05-24 15:33   ` David Howells
2023-05-27  1:02   ` Jakub Kicinski
2023-05-27  1:02     ` Jakub Kicinski
2023-05-24 15:33 ` [PATCH net-next 07/12] net: Clean up users of netdev_alloc_cache and napi_frag_cache David Howells
2023-05-24 15:33 ` [PATCH net-next 08/12] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES) David Howells
2023-05-24 15:33 ` [PATCH net-next 09/12] tls/sw: Support MSG_SPLICE_PAGES David Howells
2023-05-27  1:08   ` Jakub Kicinski
2023-05-30 22:26   ` Bug in short splice to socket? David Howells
2023-05-31  0:32     ` Jakub Kicinski
2023-06-01 11:01     ` David Laight
2023-06-01 13:09     ` Linus Torvalds
2023-06-01 13:19       ` Linus Torvalds
2023-06-01 14:34       ` David Howells [this message]
2023-06-01 15:12         ` Linus Torvalds
2023-06-05 11:03           ` David Laight
2023-06-05 15:52           ` David Howells
2023-06-01 17:14         ` David Howells
2023-06-02  4:20           ` Jakub Kicinski
2023-06-02  8:23           ` David Howells
2023-06-02 11:28             ` Linus Torvalds
2023-06-02 11:44             ` David Howells
2023-06-02 12:11               ` Linus Torvalds
2023-06-02 16:39                 ` Jakub Kicinski
2023-06-02 16:53                   ` Linus Torvalds
2023-06-02 17:05                     ` Linus Torvalds
2023-06-02 17:38                       ` Jakub Kicinski
2023-06-02 20:38                     ` David Howells
2023-06-02 20:50                     ` David Howells
2023-05-24 15:33 ` [PATCH net-next 10/12] tls/sw: Convert tls_sw_sendpage() to use MSG_SPLICE_PAGES David Howells
2023-05-27  1:13   ` Jakub Kicinski
2023-05-24 15:33 ` [PATCH net-next 11/12] tls/device: Support MSG_SPLICE_PAGES David Howells
2023-05-24 15:33 ` [PATCH net-next 12/12] tls/device: Convert tls_device_sendpage() to use MSG_SPLICE_PAGES David Howells
