Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
From: Kirill Smelkov <kirr@nexedi.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>,
	Al Viro <viro@zeniv.linux.org.uk>, Arnd Bergmann <arnd@arndb.de>,
	Christoph Hellwig <hch@lst.de>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux List Kernel Mailing <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
Date: Fri, 12 Apr 2019 12:42:03 +0000
Message-ID: <20190412124144.GA22786@deco.navytux.spb.ru> (raw)
In-Reply-To: <CAHk-=whJtZt52SnhBGrNMnuxFn3GE9X_e02x8BPxtkqrfyZukw@mail.gmail.com>

On Thu, Apr 11, 2019 at 09:22:56AM -0700, Linus Torvalds wrote:
> On Thu, Apr 11, 2019 at 5:38 AM Kirill Smelkov <kirr@nexedi.com> wrote:
> >
> > However file->f_pos writing is still there and it will bug under race
> > detector, e.g. under KTSAN because read and write can be running
> > simultaneously. Maybe it is not only race bug, but also a bit of
> > slowdown as read and write code paths write to the same memory thus
> > needing inter-cpu synchronization if read and write handlers are on
> > different cpus. However for this I'm not sure.
> 
> I doubt it's noticeable, but it would probably be good to simply not
> do the write for streams.
> 
> That *could* be done somewhat similarly, with just changing th address
> to be updated, or course.
> 
> In fact, we *used* to (long ago) pass in the address of "file->f_pos"
> itself to the low-level read/write routines. We then changed it to do
> that indirection through a local copy of pos (and
> file_pos_read/file_pos_write) because we didn't do the proper locking,
> so different read/write versions could mess with each other (and with
> lseek).
> 
> But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos
> accesses as per POSIX") did was to add the proper locking at least for
> the cases that we care about deeply, so we *could* say that we have
> three cases:
> 
>  - FMODE_ATOMIC_POS: properly locked,
>  - FMODE_STREAM: no pos at all
>  - otherwise a "mostly don't care - don't mix!"
> 
> and so we could go back to not copying the pos at all, and instead do
> something like
> 
>         loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
>         ret = vfs_write(f.file, buf, count, ppos);
> 
> and perhaps have a long-term plan to try to get rid of the "don't mix"
> case entirely (ie "if you use f_pos, then we'll do the proper
> locking")
> 
> (The above is obviously surrounded by the fdget_pos()/fdput_pos() that
> implements the locking decision).

Ok Linus, thanks for feedback. Please consider applying or queueing the
following patches.

If the patches are ok, the first one is probably better to be applied
now - to catch wrongly converted drivers right from start, while the
second one could be probably better delayed till merge window. However
please do as what you prefer is best.

Kirill

---- 8< ----
From 328d4cc5dbb1d6fdbff1f80f1eed9f04b9a5de1c Mon Sep 17 00:00:00 2001
From: Kirill Smelkov <kirr@nexedi.com>
Date: Fri, 12 Apr 2019 12:31:57 +0300
Subject: [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files

This amends commit 10dce8af3422 ("fs: stream_open - opener for
stream-like files so that read and write can run simultaneously without
deadlock") in how position is passed into .read()/.write() handler for
stream-like files:

Rasmus noticed that we currently pass 0 as position and ignore any position
change if that is done by a file implementation. This papers over bugs if ppos
is used in files that declare themselves as being stream-like as such bugs will
go unnoticed. Even if a file implementation is correctly converted into using
stream_open, its read/write later could be changed to use ppos and even though
that won't be working correctly, that bug might go unnoticed without someone
doing wrong behaviour analysis. It is thus better to pass ppos=NULL into
read/write for stream-like files as that don't give any chance for ppos usage
bugs because it will oops if ppos is ever used inside .read() or .write().

Note 1: rw_verify_area, new_sync_{read,write} needs to be updated
because they are called by vfs_read/vfs_write & friends before
file_operations .read/.write .

Note 2: if file backend uses new-style .read_iter/.write_iter, position
is still passed into there as non-pointer kiocb.ki_pos . Currently
stream_open.cocci (semantic patch added by 10dce8af3422) ignores files
whose file_operations has *_iter methods.

Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 fs/open.c       |  5 ++--
 fs/read_write.c | 75 +++++++++++++++++++++++++++++--------------------
 2 files changed, 47 insertions(+), 33 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index a00350018a47..9c7d724a6f67 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1219,8 +1219,9 @@ EXPORT_SYMBOL(nonseekable_open);
 /*
  * stream_open is used by subsystems that want stream-like file descriptors.
  * Such file descriptors are not seekable and don't have notion of position
- * (file.f_pos is always 0). Contrary to file descriptors of other regular
- * files, .read() and .write() can run simultaneously.
+ * (file.f_pos is always 0 and ppos passed to .read()/.write() is always NULL).
+ * Contrary to file descriptors of other regular files, .read() and .write()
+ * can run simultaneously.
  *
  * stream_open never fails and is marked to return int so that it could be
  * directly used as file_operations.open .
diff --git a/fs/read_write.c b/fs/read_write.c
index 61b43ad7608e..d62556be6848 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -371,7 +371,7 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
 	inode = file_inode(file);
 	if (unlikely((ssize_t) count < 0))
 		return retval;
-	pos = *ppos;
+	pos = (ppos ? *ppos : 0);
 	if (unlikely(pos < 0)) {
 		if (!unsigned_offsets(file))
 			return retval;
@@ -400,12 +400,13 @@ static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, lo
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 	iov_iter_init(&iter, READ, &iov, 1, len);
 
 	ret = call_read_iter(filp, &kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	*ppos = kiocb.ki_pos;
+	if (ppos)
+		*ppos = kiocb.ki_pos;
 	return ret;
 }
 
@@ -468,12 +469,12 @@ static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 	iov_iter_init(&iter, WRITE, &iov, 1, len);
 
 	ret = call_write_iter(filp, &kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	if (ret > 0)
+	if (ret > 0 && ppos)
 		*ppos = kiocb.ki_pos;
 	return ret;
 }
@@ -558,15 +559,10 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 	return ret;
 }
 
-static inline loff_t file_pos_read(struct file *file)
-{
-	return file->f_mode & FMODE_STREAM ? 0 : file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
+/* file_ppos returns &file->f_pos or NULL if file is stream */
+static inline loff_t *file_ppos(struct file *file)
 {
-	if ((file->f_mode & FMODE_STREAM) == 0)
-		file->f_pos = pos;
+	return file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
 }
 
 ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
@@ -575,10 +571,14 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_read(f.file, buf, count, &pos);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_read(f.file, buf, count, ppos);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 	return ret;
@@ -595,10 +595,14 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_write(f.file, buf, count, &pos);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_write(f.file, buf, count, ppos);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
@@ -673,14 +677,15 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
 	ret = kiocb_set_rw_flags(&kiocb, flags);
 	if (ret)
 		return ret;
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 
 	if (type == READ)
 		ret = call_read_iter(filp, &kiocb, iter);
 	else
 		ret = call_write_iter(filp, &kiocb, iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	*ppos = kiocb.ki_pos;
+	if (ppos)
+		*ppos = kiocb.ki_pos;
 	return ret;
 }
 
@@ -1013,10 +1018,14 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos, flags);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_readv(f.file, vec, vlen, ppos, flags);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
@@ -1033,10 +1042,14 @@ static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos, flags);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_writev(f.file, vec, vlen, ppos, flags);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
-- 
2.21.0.593.g511ec345e1


---- 8< ----
From ac88efdcdedb66230fcdf56e3fb244d8f5502b0b Mon Sep 17 00:00:00 2001
From: Kirill Smelkov <kirr@nexedi.com>
Date: Fri, 12 Apr 2019 14:53:57 +0300
Subject: [PATCH 2/2] vfs: use &file->f_pos directly on files that have position

Long ago vfs read/write operations were passing ppos=&file->f_pos directly to
.read / .write file_operations methods. That changed in 2004 in 55f09ec0087c
("read/write: pass down a copy of f_pos, not f_pos itself.") which started to
pass ppos=&local_var trying to avoid simultaneous read/write/lseek stepping
onto each other toes and overwriting file->f_pos racily. That measure was not
complete and in 2014 commit 9c225f2655e36 ("vfs: atomic f_pos accesses as per
POSIX") added file->f_pos_lock to completely disable simultaneous
read/write/lseek runs. After f_pos_lock was introduced the reason to avoid
passing ppos=&file->f_pos directly due to concurrency vanished.

Linus explains[1]:

	In fact, we *used* to (long ago) pass in the address of "file->f_pos"
	itself to the low-level read/write routines. We then changed it to do
	that indirection through a local copy of pos (and
	file_pos_read/file_pos_write) because we didn't do the proper locking,
	so different read/write versions could mess with each other (and with
	lseek).

	But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos
	accesses as per POSIX") did was to add the proper locking at least for
	the cases that we care about deeply, so we *could* say that we have
	three cases:

	 - FMODE_ATOMIC_POS: properly locked,
	 - FMODE_STREAM: no pos at all
	 - otherwise a "mostly don't care - don't mix!"

	and so we could go back to not copying the pos at all, and instead do
	something like

	        loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
	        ret = vfs_write(f.file, buf, count, ppos);

	and perhaps have a long-term plan to try to get rid of the "don't mix"
	case entirely (ie "if you use f_pos, then we'll do the proper
	locking")

	(The above is obviously surrounded by the fdget_pos()/fdput_pos() that
	implements the locking decision).

Currently for regular files we always set FMODE_ATOMIC_POS and change that to
FMODE_STREAM if stream_open is used explicitly on open. That leaves other
files, like e.g. sockets and pipes, for "mostly don't care - don't mix!" case.

Sockets, for example, always check that on read/write the initial pos they
receive is 0 and don't update it. And if it is !0 they return -ESPIPE.
That suggests that we can do the switch into passing &file->f_pos directly now
and incrementally convert to FMODE_STREAM files that were doing the stream-like
checking manually in their low-level .read/.write handlers.

Note: it is theoretically possible that a driver updates *ppos inside
even if read/write returns error. For such cases the conversion will
change IO semantic a bit. The semantic that is changing here was
introduced in 2013 in commit 5faf153ebf61 "don't call file_pos_write()
if vfs_{read,write}{,v}() fails".

[1] https://lore.kernel.org/linux-fsdevel/CAHk-=whJtZt52SnhBGrNMnuxFn3GE9X_e02x8BPxtkqrfyZukw@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 fs/read_write.c | 36 ++++--------------------------------
 1 file changed, 4 insertions(+), 32 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index d62556be6848..13550b65cb2c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -571,14 +571,7 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_read(f.file, buf, count, ppos);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_read(f.file, buf, count, file_ppos(f.file));
 		fdput_pos(f);
 	}
 	return ret;
@@ -595,14 +588,7 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_write(f.file, buf, count, ppos);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_write(f.file, buf, count, file_ppos(f.file));
 		fdput_pos(f);
 	}
 
@@ -1018,14 +1004,7 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_readv(f.file, vec, vlen, ppos, flags);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_readv(f.file, vec, vlen, file_ppos(f.file), flags);
 		fdput_pos(f);
 	}
 
@@ -1042,14 +1021,7 @@ static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_writev(f.file, vec, vlen, ppos, flags);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_writev(f.file, vec, vlen, file_ppos(f.file), flags);
 		fdput_pos(f);
 	}
 
-- 
2.21.0.593.g511ec345e1

  reply index

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-26 22:20 Kirill Smelkov
2019-03-26 23:22 ` [PATCH 3/3] fuse: Add FOPEN_STREAM and use stream_open() if filesystem returned that from open handler Kirill Smelkov
2019-04-24  7:13   ` [RESEND, PATCH " Kirill Smelkov
     [not found]     ` <20190424160611.2A71321900@mail.kernel.org>
2019-04-24 19:16       ` Kirill Smelkov
     [not found] ` <8794193f3040b798010970228d978c05ad56ec52.1553637462.git.kirr@nexedi.com>
2019-03-27  6:54   ` [PATCH 2/3] *: convert stream-like files from nonseekable_open -> stream_open Lubomir Rintel
2019-03-27 16:58 ` [PATCH 1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock Juergen Gross
2019-04-06 17:07 ` Linus Torvalds
2019-04-07 20:04   ` Kirill Smelkov
2019-04-08  0:09     ` Linus Torvalds
2019-04-14  7:11       ` Kirill Smelkov
     [not found] ` <4c4651e2-167e-bfcc-7b3e-cda118f98a69@rasmusvillemoes.dk>
     [not found]   ` <20190409203807.GA13855@deco.navytux.spb.ru>
     [not found]     ` <d8c23d05-8810-13a2-cc50-7a47ff35e90b@rasmusvillemoes.dk>
2019-04-11 12:38       ` Kirill Smelkov
2019-04-11 16:22         ` Linus Torvalds
2019-04-12 12:42           ` Kirill Smelkov [this message]
2019-04-13 16:54             ` Kirill Smelkov
2019-04-13 16:54               ` [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files Kirill Smelkov
2019-04-13 17:27                 ` Linus Torvalds
2019-04-13 17:38                   ` Al Viro
2019-04-13 18:44                   ` Kirill Smelkov
2019-04-13 16:55               ` [PATCH 2/2] vfs: use &file->f_pos directly on files that have position Kirill Smelkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190412124144.GA22786@deco.navytux.spb.ru \
    --to=kirr@nexedi.com \
    --cc=arnd@arndb.de \
    --cc=gregkh@linuxfoundation.org \
    --cc=hch@lst.de \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@rasmusvillemoes.dk \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git