From: Stefan Metzmacher <metze@samba.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux API Mailing List <linux-api@vger.kernel.org>,
	io-uring <io-uring@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Samba Technical <samba-technical@lists.samba.org>
Subject: copy on write for splice() from file to pipe?
Date: Thu, 9 Feb 2023 14:55:59 +0100
Message-ID: <0cfd9f02-dea7-90e2-e932-c8129b6013c7@samba.org>

Hi Linus and others,

As written in a private mail before, I'm currently trying to
make use of IORING_OP_SPLICE in order to get zero-copy support
in Samba.

The most important use cases are 8 MByte reads and writes to
files, where "memcpy" (at the lowest level copy_user_enhanced_fast_string())
is the obvious performance killer.

I have a prototype that offers really great performance by
avoiding "memcpy" through splice() (using IORING_OP_SPLICE to get it async).

So we have two cases:

1. network -> socket -> splice -> pipe -> splice -> file -> storage

2. storage -> file -> splice -> pipe -> splice -> socket -> network
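
To make case 2 concrete, this is roughly what the data path looks like
in its simplest synchronous form. This is only a sketch: the helper name
splice_file_to_socket() is made up, file_fd/sock_fd are assumed to be
open already, error handling is minimal, and the real code uses
IORING_OP_SPLICE instead of blocking splice():

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/* Sketch of case 2: file (page cache) -> pipe -> socket. */
static int splice_file_to_socket(int file_fd, int sock_fd, off_t ofs, size_t count)
{
	int pfds[2];

	if (pipe(pfds) != 0)
		return -1;

	while (count > 0) {
		/* file -> pipe: no data copy, the file pages are just referenced */
		ssize_t n = splice(file_fd, &ofs, pfds[1], NULL, count, SPLICE_F_MOVE);
		if (n <= 0)
			break;

		/* pipe -> socket: again no copy, drain what we just spliced in */
		ssize_t done = 0;
		while (done < n) {
			ssize_t m = splice(pfds[0], NULL, sock_fd, NULL,
					   n - done, SPLICE_F_MOVE | SPLICE_F_MORE);
			if (m <= 0)
				goto out;
			done += m;
		}

		count -= n;
	}
out:
	close(pfds[0]);
	close(pfds[1]);
	return count == 0 ? 0 : -1;
}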

With 1. I guess everything can work reliably: once the pages are
created/filled in the socket receive buffer they are used
exclusively and they won't be shared on the way to the file.
That means we can be sure the data arrives unmodified in the
file(system).

But with 2. there's a problem: the pages from the file which are
spliced into the pipe are still shared, without copy on write,
with the file(system). That means writes to the file after the
first splice modify the content of the spliced pages!
So the content may change uncontrollably before it reaches the network!
I created a little example that demonstrates the problem (see below);
it gives the following output:

> open(O_TMPFILE) => ffd[3]
> pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0x1f sret[4096]
> pipe() => ret[0]
> splice(count=PIPE_BUF*2,ofs=0) sret[8192]
> pwrite(count=PIPE_BUF,ofs=0) 0xf0 sret[4096]
> pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0xf0 sret[4096]
> read(from_pipe, count=PIPE_BUF) sret[4096]
> memcmp() at ofs=0, expecting 0x00 => ret[240]
> memcmp() at ofs=0, checking for 0xf0 => ret[0]
> read(from_pipe, count=PIPE_BUF) sret[4096]
> memcmp() at ofs=PIPE_BUF, expecting 0x1f => ret[209]
> memcmp() at ofs=PIPE_BUF, checking for 0xf0 => ret[0]

After reading from the pipe we get the values we wrote to the file
after the splice, instead of the values the file had at the time of the splice.

For things like web servers, which mostly serve static content, this
isn't a problem, but it is for Samba, where reads and writes may happen
within microseconds of each other, before the content is pushed to the network.

I'm wondering if there's a possible way out of this, maybe triggered by a new
flag passed to splice().
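
Just to illustrate the idea (SPLICE_F_COW is a made-up name here, no
such flag exists):

	/* Hypothetical: SPLICE_F_COW does not exist yet; the idea is that
	 * the kernel would mark the spliced file pages copy-on-write, so a
	 * later write to the file gets a fresh page instead of modifying
	 * the page that is still referenced by the pipe/socket. */
	sret = splice(ffd, &ofs, pfds[1], NULL, PIPE_BUF*2, SPLICE_F_COW);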

I looked through the code and noticed the existence of IOMAP_F_SHARED.
Maybe the splice from the page cache to the pipe could set IOMAP_F_SHARED
while incrementing the refcount, and the network driver could remove it
again once the refcount drops back to 1.

Is there any other way we could achieve something like this?

In addition, and/or as an alternative, I was thinking about a flag to
preadv2() (and IORING_OP_READV) to indicate the use of something
like async_memcpy(), instead of doing the copy via the CPU.
That, in combination with IORING_OP_SENDMSG_ZC, would avoid "memcpy"
on the CPU.
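
Again only as an illustration of the direction (RWF_COPY_OFFLOAD is a
made-up flag name, nothing like it exists today):

	/* Hypothetical: RWF_COPY_OFFLOAD is a made-up name; the idea is that
	 * the copy from the page cache into the user buffer would be done by
	 * a DMA engine (async_memcpy()-like) instead of by the CPU. */
	struct iovec iov = { .iov_base = buf, .iov_len = 8 * 1024 * 1024 };
	sret = preadv2(fd, &iov, 1, 0, RWF_COPY_OFFLOAD);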

Any hints, remarks and prototype patches are highly welcome.

Thanks!
metze

#define _GNU_SOURCE         /* for O_TMPFILE and splice(), see feature_test_macros(7) */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <limits.h>

int main(void)
{
	int ffd;
	int pfds[2];
	char buf[PIPE_BUF] = {0, };
	char buf2[PIPE_BUF] = {0, };
	ssize_t sret;
	int ret;
	off_t ofs;

	memset(buf, 0x1f, PIPE_BUF);

	/* create an unnamed temporary file */
	ffd = open("/tmp/", O_RDWR | O_TMPFILE, S_IRUSR | S_IWUSR);
	printf("open(O_TMPFILE) => ffd[%d]\n", ffd);

	/* fill the second PIPE_BUF-sized block with 0x1f,
	 * the first block stays a hole and reads as 0x00 */
	sret = pwrite(ffd, buf, PIPE_BUF, PIPE_BUF);
	printf("pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0x1f sret[%zd]\n", sret);

	ret = pipe(pfds);
	printf("pipe() => ret[%d]\n", ret);

	/* splice both blocks from the file into the pipe */
	ofs = 0;
	sret = splice(ffd, &ofs, pfds[1], NULL, PIPE_BUF*2, 0);
	printf("splice(count=PIPE_BUF*2,ofs=0) sret[%zd]\n", sret);

	/* now overwrite both blocks in the file with 0xf0,
	 * after the data was already spliced into the pipe */
	memset(buf, 0xf0, PIPE_BUF);

	sret = pwrite(ffd, buf, PIPE_BUF, 0);
	printf("pwrite(count=PIPE_BUF,ofs=0) 0xf0 sret[%zd]\n", sret);
	sret = pwrite(ffd, buf, PIPE_BUF, PIPE_BUF);
	printf("pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0xf0 sret[%zd]\n", sret);

	/* read the first block back from the pipe ... */
	sret = read(pfds[0], buf, PIPE_BUF);
	printf("read(from_pipe, count=PIPE_BUF) sret[%zd]\n", sret);

	/* ... it should contain 0x00, but it contains 0xf0 */
	memset(buf2, 0x00, PIPE_BUF);
	ret = memcmp(buf, buf2, PIPE_BUF);
	printf("memcmp() at ofs=0, expecting 0x00 => ret[%d]\n", ret);
	memset(buf2, 0xf0, PIPE_BUF);
	ret = memcmp(buf, buf2, PIPE_BUF);
	printf("memcmp() at ofs=0, checking for 0xf0 => ret[%d]\n", ret);

	/* read the second block back from the pipe ... */
	sret = read(pfds[0], buf, PIPE_BUF);
	printf("read(from_pipe, count=PIPE_BUF) sret[%zd]\n", sret);

	/* ... it should contain 0x1f, but it contains 0xf0 */
	memset(buf2, 0x1f, PIPE_BUF);
	ret = memcmp(buf, buf2, PIPE_BUF);
	printf("memcmp() at ofs=PIPE_BUF, expecting 0x1f => ret[%d]\n", ret);
	memset(buf2, 0xf0, PIPE_BUF);
	ret = memcmp(buf, buf2, PIPE_BUF);
	printf("memcmp() at ofs=PIPE_BUF, checking for 0xf0 => ret[%d]\n", ret);
	return 0;
}
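
(For reference: PIPE_BUF is 4096 on Linux, so the test file consists of two
page-sized blocks; the example needs nothing beyond glibc plus a kernel with
O_TMPFILE and splice() support.)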
