All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
@ 2019-03-26 22:20 ` Kirill Smelkov
  0 siblings, 0 replies; 21+ messages in thread
From: Kirill Smelkov @ 2019-03-26 22:20 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds
  Cc: linux-fsdevel, linux-kernel, Kirill Smelkov, Michael Kerrisk,
	Yongzhi Pan, Jonathan Corbet, David Vrabel, Juergen Gross,
	Miklos Szeredi, Tejun Heo, Kirill Tkhai, Arnd Bergmann,
	Christoph Hellwig, Greg Kroah-Hartman, Julia Lawall,
	Nikolaus Rath, Han-Wen Nienhuys

Commit 9c225f2655 (vfs: atomic f_pos accesses as per POSIX) added locking for
file.f_pos access and in particular made concurrent read and write not possible
- now both those functions take f_pos lock for the whole run, and so if e.g. a
read is blocked waiting for data, write will deadlock waiting for that read to
complete. This caused regression for stream-like files where previously read
and write could run simultaneously, but after that patch could not do so
anymore. See e.g. 581d21a2d0 (xenbus: fix deadlock on writes to /proc/xen/xenbus)
which fixes such regression for particular case of /proc/xen/xenbus.

The patch that added f_pos lock in 2014 (see https://lkml.org/lkml/2014/2/17/324
for background discussion) did so to guarantee POSIX thread safety for
read/write/lseek and added the locking to file descriptors of all regular
files. In 2014 that thread-safety problem was not new as it was already discussed
earlier in 2006: https://lwn.net/Articles/180387. However even though 2006'th
version of Linus's patch (https://lwn.net/Articles/180396) was adding f_pos
locking "only for files that are marked seekable with FMODE_LSEEK (thus avoiding
the stream-like objects like pipes and sockets)", 2014'th version - the one that
actually made it into the tree as 9c225f2655 - is doing so irregardless of whether
a file is seekable or not. The reason that it did so is, probably, that there are
many files that are marked non-seekable, but e.g. their read implementation
actually depends on knowing current position to correctly handle the read. Some
examples:

	kernel/power/user.c		snapshot_read
	fs/debugfs/file.c		u32_array_read
	fs/fuse/control.c		fuse_conn_waiting_read + ...
	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
	arch/s390/hypfs/inode.c		hypfs_read_iter
	...

In despite that, many nonseekable_open users implement read and write with pure
stream semantics - they don't depend on passed ppos at all. And for those cases
where read could wait for something inside, it creates a situation similar to
xenbus - the write could be never made to go until read is done, and read is
waiting for some, potentially external, event, for potentially unbounded time
-> deadlock. Besides xenbus, there are 14 such places in the kernel that I've
found with semantic patch (see below):

	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()

In addition to the cases above another regression caused by f_pos locking is
that now FUSE filesystems that implement open with FOPEN_NONSEEKABLE flag, can
no longer implement bidirectional stream-like files - for the same reason
as above e.g. read can deadlock write locking on file.f_pos in the kernel.
FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f7 (fuse: implement
nonseekable open) to support OSSPD (https://github.com/libfuse/osspd;
https://lwn.net/Articles/308445). OSSPD implements /dev/dsp in userspace with
FOPEN_NONSEEKABLE flag, with corresponding read and write routines not
depending on current position at all, and with both read and write being
potentially blocking operations:

	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510

Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
"somewhat pipe-like files ..." with read handler not using offset. However
that test implements only read without write and cannot exercise the deadlock
scenario:

	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216

I've actually hit the read vs write deadlock for real while implementing my
FUSE filesystem where there is /head/watch file, for which open creates
separate bidirectional socket-like stream in between filesystem and its user
with both read and write being later performed simultaneously. And there it is
semantically not easy to split the stream into two separate read-only and
write-only channels:

	https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169

Let's fix this regression. The plan is:

1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS - doing so would
   break many in-kernel nonseekable_open users which actually use ppos in
   read/write handlers.

2. Add stream_open() to kernel to open stream-like non-seekable file descriptors.
   Read and write on such file descriptors would never use nor change ppos. And
   with that property on stream-like files read and write will be running without
   taking f_pos lock - i.e. read and write could be running simultaneously.

3. With semantic patch search and convert to stream_open all in-kernel
   nonseekable_open users for which read and write actually do not depend on ppos and
   where there is no other methods in file_operations which assume @offset access.

4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via steam_open
   if that bit is present in filesystem open reply.

   It was tempting to change fs/fuse/ open handler to use stream_open instead of
   nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian
   codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually
   uses offset in its read and write handlers

	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481

   so if we would do such a change it will break a real user.

5. Add stream_open and FOPEN_STREAM handling to stable kernels starting from
   v3.14+ (the kernel where 9c225f2655 first appeared). This will allow to patch
   OSSPD and other FUSE filesystems that provide stream-like files to return
   FOPEN_STREAM | FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on
   all kernel versions. This should work because fs/fuse/ ignores unknown open
   flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that
   is not aware of this flag cannot hurt. In turn the kernel that is not aware of
   FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to
   implement streams without read vs write deadlock.

This patch: adds stream_open, converts /proc/xen/xenbus to it and adds semantic
patch to automatically locate in-kernel places that are either required to be
converted due to read vs write deadlock, or that are just safe to be converted
because read and write do not use ppos and there are no other funky methods in
file_operations.

Followup patches are:

- apply the result of semantic patch;
- add FOPEN_STREAM to fs/fuse.

Regarding semantic patch I've verified each generated change manually - that it is
correct to convert - and each other nonseekable_open instance left - that it is
either not correct to convert there, or that it is not converted due to current
stream_open.cocci limitations. The script also does not convert files that should
be valid to convert, but that currently have .llseek = noop_llseek or
generic_file_llseek for unknown reason despite file being opened with
nonseekable_open (e.g. drivers/input/mousedev.c)

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Yongzhi Pan <panyongzhi@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Nikolaus Rath <Nikolaus@rath.org>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 drivers/xen/xenbus/xenbus_dev_frontend.c |   4 +-
 fs/open.c                                |  18 ++
 fs/read_write.c                          |   5 +-
 include/linux/fs.h                       |   4 +
 scripts/coccinelle/api/stream_open.cocci | 363 +++++++++++++++++++++++
 5 files changed, 389 insertions(+), 5 deletions(-)
 create mode 100644 scripts/coccinelle/api/stream_open.cocci

diff --git a/drivers/xen/xenbus/xenbus_dev_frontend.c b/drivers/xen/xenbus/xenbus_dev_frontend.c
index c3e201025ef0..0782ff3c2273 100644
--- a/drivers/xen/xenbus/xenbus_dev_frontend.c
+++ b/drivers/xen/xenbus/xenbus_dev_frontend.c
@@ -622,9 +622,7 @@ static int xenbus_file_open(struct inode *inode, struct file *filp)
 	if (xen_store_evtchn == 0)
 		return -ENOENT;
 
-	nonseekable_open(inode, filp);
-
-	filp->f_mode &= ~FMODE_ATOMIC_POS; /* cdev-style semantics */
+	stream_open(inode, filp);
 
 	u = kzalloc(sizeof(*u), GFP_KERNEL);
 	if (u == NULL)
diff --git a/fs/open.c b/fs/open.c
index 0285ce7dbd51..b6bf04a2ef8c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1209,3 +1209,21 @@ int nonseekable_open(struct inode *inode, struct file *filp)
 }
 
 EXPORT_SYMBOL(nonseekable_open);
+
+/*
+ * stream_open is used by subsystems that want stream-like file descriptors.
+ * Such file descriptors are not seekable and don't have notion of position
+ * (file.f_pos is always 0). Contrary to file descriptors of other regular
+ * files, .read() and .write() can run simultaneously.
+ *
+ * stream_open never fails and is marked to return int so that it could be
+ * directly used as file_operations.open .
+ */
+int stream_open(struct inode *inode, struct file *filp)
+{
+	filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE | FMODE_ATOMIC_POS);
+	filp->f_mode |= FMODE_STREAM;
+	return 0;
+}
+
+EXPORT_SYMBOL(stream_open);
diff --git a/fs/read_write.c b/fs/read_write.c
index d83003d856a0..91cb375f9840 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -560,12 +560,13 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 
 static inline loff_t file_pos_read(struct file *file)
 {
-	return file->f_pos;
+	return file->f_mode & FMODE_STREAM ? 0 : file->f_pos;
 }
 
 static inline void file_pos_write(struct file *file, loff_t pos)
 {
-	file->f_pos = pos;
+	if ((file->f_mode & FMODE_STREAM) == 0)
+		file->f_pos = pos;
 }
 
 ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3e85cb8e8c20..cd59946bc3a4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -155,6 +155,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define FMODE_OPENED		((__force fmode_t)0x80000)
 #define FMODE_CREATED		((__force fmode_t)0x100000)
 
+/* File is stream-like */
+#define FMODE_STREAM		((__force fmode_t)0x200000)
+
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY		((__force fmode_t)0x4000000)
 
@@ -3075,6 +3078,7 @@ extern loff_t no_seek_end_llseek_size(struct file *, loff_t, int, loff_t);
 extern loff_t no_seek_end_llseek(struct file *, loff_t, int);
 extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
+extern int stream_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(struct bio *bio, struct inode *inode,
diff --git a/scripts/coccinelle/api/stream_open.cocci b/scripts/coccinelle/api/stream_open.cocci
new file mode 100644
index 000000000000..350145da7669
--- /dev/null
+++ b/scripts/coccinelle/api/stream_open.cocci
@@ -0,0 +1,363 @@
+// SPDX-License-Identifier: GPL-2.0
+// Author: Kirill Smelkov (kirr@nexedi.com)
+//
+// Search for stream-like files that are using nonseekable_open and convert
+// them to stream_open. A stream-like file is a file that does not use ppos in
+// its read and write. Rationale for the conversion is to avoid deadlock in
+// between read and write.
+
+virtual report
+virtual patch
+virtual explain  // explain decisions in the patch (SPFLAGS="-D explain")
+
+// stream-like reader & writer - ones that do not depend on f_pos.
+@ stream_reader @
+identifier readstream, ppos;
+identifier f, buf, len;
+type loff_t;
+@@
+  ssize_t readstream(struct file *f, char *buf, size_t len, loff_t *ppos)
+  {
+    ... when != ppos
+  }
+
+@ stream_writer @
+identifier writestream, ppos;
+identifier f, buf, len;
+type loff_t;
+@@
+  ssize_t writestream(struct file *f, const char *buf, size_t len, loff_t *ppos)
+  {
+    ... when != ppos
+  }
+
+
+// a function that blocks
+@ blocks @
+identifier block_f;
+identifier wait_event =~ "^wait_event_.*";
+@@
+  block_f(...) {
+    ... when exists
+    wait_event(...)
+    ... when exists
+  }
+
+// stream_reader that can block inside.
+//
+// XXX wait_* can be called not directly from current function (e.g. func -> f -> g -> wait())
+// XXX currently reader_blocks supports only direct and 1-level indirect cases.
+@ reader_blocks_direct @
+identifier stream_reader.readstream;
+identifier wait_event =~ "^wait_event_.*";
+@@
+  readstream(...)
+  {
+    ... when exists
+    wait_event(...)
+    ... when exists
+  }
+
+@ reader_blocks_1 @
+identifier stream_reader.readstream;
+identifier blocks.block_f;
+@@
+  readstream(...)
+  {
+    ... when exists
+    block_f(...)
+    ... when exists
+  }
+
+@ reader_blocks depends on reader_blocks_direct || reader_blocks_1 @
+identifier stream_reader.readstream;
+@@
+  readstream(...) {
+    ...
+  }
+
+
+// file_operations + whether they have _any_ .read, .write, .llseek ... at all.
+//
+// XXX add support for file_operations xxx[N] = ...	(sound/core/pcm_native.c)
+@ fops0 @
+identifier fops;
+@@
+  struct file_operations fops = {
+    ...
+  };
+
+@ has_read @
+identifier fops0.fops;
+identifier read_f;
+@@
+  struct file_operations fops = {
+    .read = read_f,
+  };
+
+@ has_read_iter @
+identifier fops0.fops;
+identifier read_iter_f;
+@@
+  struct file_operations fops = {
+    .read_iter = read_iter_f,
+  };
+
+@ has_write @
+identifier fops0.fops;
+identifier write_f;
+@@
+  struct file_operations fops = {
+    .write = write_f,
+  };
+
+@ has_write_iter @
+identifier fops0.fops;
+identifier write_iter_f;
+@@
+  struct file_operations fops = {
+    .write_iter = write_iter_f,
+  };
+
+@ has_llseek @
+identifier fops0.fops;
+identifier llseek_f;
+@@
+  struct file_operations fops = {
+    .llseek = llseek_f,
+  };
+
+@ has_no_llseek @
+identifier fops0.fops;
+@@
+  struct file_operations fops = {
+    .llseek = no_llseek,
+  };
+
+@ has_mmap @
+identifier fops0.fops;
+identifier mmap_f;
+@@
+  struct file_operations fops = {
+    .mmap = mmap_f,
+  };
+
+@ has_copy_file_range @
+identifier fops0.fops;
+identifier copy_file_range_f;
+@@
+  struct file_operations fops = {
+    .copy_file_range = copy_file_range_f,
+  };
+
+@ has_remap_file_range @
+identifier fops0.fops;
+identifier remap_file_range_f;
+@@
+  struct file_operations fops = {
+    .remap_file_range = remap_file_range_f,
+  };
+
+@ has_splice_read @
+identifier fops0.fops;
+identifier splice_read_f;
+@@
+  struct file_operations fops = {
+    .splice_read = splice_read_f,
+  };
+
+@ has_splice_write @
+identifier fops0.fops;
+identifier splice_write_f;
+@@
+  struct file_operations fops = {
+    .splice_write = splice_write_f,
+  };
+
+
+// file_operations that is candidate for stream_open conversion - it does not
+// use mmap and other methods that assume @offset access to file.
+//
+// XXX for simplicity require no .{read/write}_iter and no .splice_{read/write} for now.
+// XXX maybe_steam.fops cannot be used in other rules - it gives "bad rule maybe_stream or bad variable fops".
+@ maybe_stream depends on (!has_llseek || has_no_llseek) && !has_mmap && !has_copy_file_range && !has_remap_file_range && !has_read_iter && !has_write_iter && !has_splice_read && !has_splice_write @
+identifier fops0.fops;
+@@
+  struct file_operations fops = {
+  };
+
+
+// ---- conversions ----
+
+// XXX .open = nonseekable_open -> .open = stream_open
+// XXX .open = func -> openfunc -> nonseekable_open
+
+// read & write
+//
+// if both are used in the same file_operations together with an opener -
+// under that conditions we can use stream_open instead of nonseekable_open.
+@ fops_rw depends on maybe_stream @
+identifier fops0.fops, openfunc;
+identifier stream_reader.readstream;
+identifier stream_writer.writestream;
+@@
+  struct file_operations fops = {
+      .open  = openfunc,
+      .read  = readstream,
+      .write = writestream,
+  };
+
+@ report_rw depends on report @
+identifier fops_rw.openfunc;
+position p1;
+@@
+  openfunc(...) {
+    <...
+     nonseekable_open@p1
+    ...>
+  }
+
+@ script:python depends on report && reader_blocks @
+fops << fops0.fops;
+p << report_rw.p1;
+@@
+coccilib.report.print_report(p[0],
+  "ERROR: %s: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix." % (fops,))
+
+@ script:python depends on report && !reader_blocks @
+fops << fops0.fops;
+p << report_rw.p1;
+@@
+coccilib.report.print_report(p[0],
+  "WARNING: %s: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open." % (fops,))
+
+
+@ explain_rw_deadlocked depends on explain && reader_blocks @
+identifier fops_rw.openfunc;
+@@
+  openfunc(...) {
+    <...
+-    nonseekable_open
++    nonseekable_open /* read & write (was deadlock) */
+    ...>
+  }
+
+
+@ explain_rw_nodeadlock depends on explain && !reader_blocks @
+identifier fops_rw.openfunc;
+@@
+  openfunc(...) {
+    <...
+-    nonseekable_open
++    nonseekable_open /* read & write (no direct deadlock) */
+    ...>
+  }
+
+@ patch_rw depends on patch @
+identifier fops_rw.openfunc;
+@@
+  openfunc(...) {
+    <...
+-   nonseekable_open
++   stream_open
+    ...>
+  }
+
+
+// read, but not write
+@ fops_r depends on maybe_stream && !has_write @
+identifier fops0.fops, openfunc;
+identifier stream_reader.readstream;
+@@
+  struct file_operations fops = {
+      .open  = openfunc,
+      .read  = readstream,
+  };
+
+@ report_r depends on report @
+identifier fops_r.openfunc;
+position p1;
+@@
+  openfunc(...) {
+    <...
+    nonseekable_open@p1
+    ...>
+  }
+
+@ script:python depends on report @
+fops << fops0.fops;
+p << report_r.p1;
+@@
+coccilib.report.print_report(p[0],
+  "WARNING: %s: .read() has stream semantic; safe to change nonseekable_open -> stream_open." % (fops,))
+
+@ explain_r depends on explain @
+identifier fops_r.openfunc;
+@@
+  openfunc(...) {
+    <...
+-   nonseekable_open
++   nonseekable_open /* read only */
+    ...>
+  }
+
+@ patch_r depends on patch @
+identifier fops_r.openfunc;
+@@
+  openfunc(...) {
+    <...
+-   nonseekable_open
++   stream_open
+    ...>
+  }
+
+
+// write, but not read
+@ fops_w depends on maybe_stream && !has_read @
+identifier fops0.fops, openfunc;
+identifier stream_writer.writestream;
+@@
+  struct file_operations fops = {
+      .open  = openfunc,
+      .write = writestream,
+  };
+
+@ report_w depends on report @
+identifier fops_w.openfunc;
+position p1;
+@@
+  openfunc(...) {
+    <...
+    nonseekable_open@p1
+    ...>
+  }
+
+@ script:python depends on report @
+fops << fops0.fops;
+p << report_w.p1;
+@@
+coccilib.report.print_report(p[0],
+  "WARNING: %s: .write() has stream semantic; safe to change nonseekable_open -> stream_open." % (fops,))
+
+@ explain_w depends on explain @
+identifier fops_w.openfunc;
+@@
+  openfunc(...) {
+    <...
+-   nonseekable_open
++   nonseekable_open /* write only */
+    ...>
+  }
+
+@ patch_w depends on patch @
+identifier fops_w.openfunc;
+@@
+  openfunc(...) {
+    <...
+-   nonseekable_open
++   stream_open
+    ...>
+  }
+
+
+// no read, no write - don't change anything
-- 
2.21.0.392.gf8f6787159

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
@ 2019-03-26 22:20 ` Kirill Smelkov
  0 siblings, 0 replies; 21+ messages in thread
From: Kirill Smelkov @ 2019-03-26 22:20 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds
  Cc: linux-fsdevel, linux-kernel, Kirill Smelkov, Michael Kerrisk,
	Yongzhi Pan, Jonathan Corbet, David Vrabel, Juergen Gross,
	Miklos Szeredi, Tejun Heo, Kirill Tkhai, Arnd Bergmann,
	Christoph Hellwig, Greg Kroah-Hartman, Julia Lawall,
	Nikolaus Rath, Han-Wen Nienhuys

Commit 9c225f2655 (vfs: atomic f_pos accesses as per POSIX) added locking for
file.f_pos access and in particular made concurrent read and write not possible
- now both those functions take f_pos lock for the whole run, and so if e.g. a
read is blocked waiting for data, write will deadlock waiting for that read to
complete. This caused regression for stream-like files where previously read
and write could run simultaneously, but after that patch could not do so
anymore. See e.g. 581d21a2d0 (xenbus: fix deadlock on writes to /proc/xen/xenbus)
which fixes such regression for particular case of /proc/xen/xenbus.

The patch that added f_pos lock in 2014 (see https://lkml.org/lkml/2014/2/17/324
for background discussion) did so to guarantee POSIX thread safety for
read/write/lseek and added the locking to file descriptors of all regular
files. In 2014 that thread-safety problem was not new as it was already discussed
earlier in 2006: https://lwn.net/Articles/180387. However even though 2006'th
version of Linus's patch (https://lwn.net/Articles/180396) was adding f_pos
locking "only for files that are marked seekable with FMODE_LSEEK (thus avoiding
the stream-like objects like pipes and sockets)", 2014'th version - the one that
actually made it into the tree as 9c225f2655 - is doing so irregardless of whether
a file is seekable or not. The reason that it did so is, probably, that there are
many files that are marked non-seekable, but e.g. their read implementation
actually depends on knowing current position to correctly handle the read. Some
examples:

	kernel/power/user.c		snapshot_read
	fs/debugfs/file.c		u32_array_read
	fs/fuse/control.c		fuse_conn_waiting_read + ...
	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
	arch/s390/hypfs/inode.c		hypfs_read_iter
	...

In despite that, many nonseekable_open users implement read and write with pure
stream semantics - they don't depend on passed ppos at all. And for those cases
where read could wait for something inside, it creates a situation similar to
xenbus - the write could be never made to go until read is done, and read is
waiting for some, potentially external, event, for potentially unbounded time
-> deadlock. Besides xenbus, there are 14 such places in the kernel that I've
found with semantic patch (see below):

	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()

In addition to the cases above another regression caused by f_pos locking is
that now FUSE filesystems that implement open with FOPEN_NONSEEKABLE flag, can
no longer implement bidirectional stream-like files - for the same reason
as above e.g. read can deadlock write locking on file.f_pos in the kernel.
FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f7 (fuse: implement
nonseekable open) to support OSSPD (https://github.com/libfuse/osspd;
https://lwn.net/Articles/308445). OSSPD implements /dev/dsp in userspace with
FOPEN_NONSEEKABLE flag, with corresponding read and write routines not
depending on current position at all, and with both read and write being
potentially blocking operations:

	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510

Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
"somewhat pipe-like files ..." with read handler not using offset. However
that test implements only read without write and cannot exercise the deadlock
scenario:

	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216

I've actually hit the read vs write deadlock for real while implementing my
FUSE filesystem where there is /head/watch file, for which open creates
separate bidirectional socket-like stream in between filesystem and its user
with both read and write being later performed simultaneously. And there it is
semantically not easy to split the stream into two separate read-only and
write-only channels:

	https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169

Let's fix this regression. The plan is:

1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS - doing so would
   break many in-kernel nonseekable_open users which actually use ppos in
   read/write handlers.

2. Add stream_open() to kernel to open stream-like non-seekable file descriptors.
   Read and write on such file descriptors would never use nor change ppos. And
   with that property on stream-like files read and write will be running without
   taking f_pos lock - i.e. read and write could be running simultaneously.

3. With semantic patch search and convert to stream_open all in-kernel
   nonseekable_open users for which read and write actually do not depend on ppos and
   where there is no other methods in file_operations which assume @offset access.

4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via steam_open
   if that bit is present in filesystem open reply.

   It was tempting to change fs/fuse/ open handler to use stream_open instead of
   nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian
   codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually
   uses offset in its read and write handlers

	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481

   so if we would do such a change it will break a real user.

5. Add stream_open and FOPEN_STREAM handling to stable kernels starting from
   v3.14+ (the kernel where 9c225f2655 first appeared). This will allow to patch
   OSSPD and other FUSE filesystems that provide stream-like files to return
   FOPEN_STREAM | FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on
   all kernel versions. This should work because fs/fuse/ ignores unknown open
   flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that
   is not aware of this flag cannot hurt. In turn the kernel that is not aware of
   FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to
   implement streams without read vs write deadlock.

This patch: adds stream_open, converts /proc/xen/xenbus to it and adds semantic
patch to automatically locate in-kernel places that are either required to be
converted due to read vs write deadlock, or that are just safe to be converted
because read and write do not use ppos and there are no other funky methods in
file_operations.

Followup patches are:

- apply the result of semantic patch;
- add FOPEN_STREAM to fs/fuse.

Regarding semantic patch I've verified each generated change manually - that it is
correct to convert - and each other nonseekable_open instance left - that it is
either not correct to convert there, or that it is not converted due to current
stream_open.cocci limitations. The script also does not convert files that should
be valid to convert, but that currently have .llseek = noop_llseek or
generic_file_llseek for unknown reason despite file being opened with
nonseekable_open (e.g. drivers/input/mousedev.c)

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Yongzhi Pan <panyongzhi@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Nikolaus Rath <Nikolaus@rath.org>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 drivers/xen/xenbus/xenbus_dev_frontend.c |   4 +-
 fs/open.c                                |  18 ++
 fs/read_write.c                          |   5 +-
 include/linux/fs.h                       |   4 +
 scripts/coccinelle/api/stream_open.cocci | 363 +++++++++++++++++++++++
 5 files changed, 389 insertions(+), 5 deletions(-)
 create mode 100644 scripts/coccinelle/api/stream_open.cocci

diff --git a/drivers/xen/xenbus/xenbus_dev_frontend.c b/drivers/xen/xenbus/xenbus_dev_frontend.c
index c3e201025ef0..0782ff3c2273 100644
--- a/drivers/xen/xenbus/xenbus_dev_frontend.c
+++ b/drivers/xen/xenbus/xenbus_dev_frontend.c
@@ -622,9 +622,7 @@ static int xenbus_file_open(struct inode *inode, struct file *filp)
 	if (xen_store_evtchn == 0)
 		return -ENOENT;
 
-	nonseekable_open(inode, filp);
-
-	filp->f_mode &= ~FMODE_ATOMIC_POS; /* cdev-style semantics */
+	stream_open(inode, filp);
 
 	u = kzalloc(sizeof(*u), GFP_KERNEL);
 	if (u == NULL)
diff --git a/fs/open.c b/fs/open.c
index 0285ce7dbd51..b6bf04a2ef8c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1209,3 +1209,21 @@ int nonseekable_open(struct inode *inode, struct file *filp)
 }
 
 EXPORT_SYMBOL(nonseekable_open);
+
+/*
+ * stream_open is used by subsystems that want stream-like file descriptors.
+ * Such file descriptors are not seekable and don't have notion of position
+ * (file.f_pos is always 0). Contrary to file descriptors of other regular
+ * files, .read() and .write() can run simultaneously.
+ *
+ * stream_open never fails and is marked to return int so that it could be
+ * directly used as file_operations.open .
+ */
+int stream_open(struct inode *inode, struct file *filp)
+{
+	filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE | FMODE_ATOMIC_POS);
+	filp->f_mode |= FMODE_STREAM;
+	return 0;
+}
+
+EXPORT_SYMBOL(stream_open);
diff --git a/fs/read_write.c b/fs/read_write.c
index d83003d856a0..91cb375f9840 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -560,12 +560,13 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 
 static inline loff_t file_pos_read(struct file *file)
 {
-	return file->f_pos;
+	return file->f_mode & FMODE_STREAM ? 0 : file->f_pos;
 }
 
 static inline void file_pos_write(struct file *file, loff_t pos)
 {
-	file->f_pos = pos;
+	if ((file->f_mode & FMODE_STREAM) == 0)
+		file->f_pos = pos;
 }
 
 ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3e85cb8e8c20..cd59946bc3a4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -155,6 +155,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define FMODE_OPENED		((__force fmode_t)0x80000)
 #define FMODE_CREATED		((__force fmode_t)0x100000)
 
+/* File is stream-like */
+#define FMODE_STREAM		((__force fmode_t)0x200000)
+
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY		((__force fmode_t)0x4000000)
 
@@ -3075,6 +3078,7 @@ extern loff_t no_seek_end_llseek_size(struct file *, loff_t, int, loff_t);
 extern loff_t no_seek_end_llseek(struct file *, loff_t, int);
 extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
+extern int stream_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(struct bio *bio, struct inode *inode,
diff --git a/scripts/coccinelle/api/stream_open.cocci b/scripts/coccinelle/api/stream_open.cocci
new file mode 100644
index 000000000000..350145da7669
--- /dev/null
+++ b/scripts/coccinelle/api/stream_open.cocci
@@ -0,0 +1,363 @@
+// SPDX-License-Identifier: GPL-2.0
+// Author: Kirill Smelkov (kirr@nexedi.com)
+//
+// Search for stream-like files that are using nonseekable_open and convert
+// them to stream_open. A stream-like file is a file that does not use ppos in
+// its read and write. Rationale for the conversion is to avoid deadlock in
+// between read and write.
+
+virtual report
+virtual patch
+virtual explain  // explain decisions in the patch (SPFLAGS="-D explain")
+
+// stream-like reader & writer - ones that do not depend on f_pos.
+@ stream_reader @
+identifier readstream, ppos;
+identifier f, buf, len;
+type loff_t;
+@@
+  ssize_t readstream(struct file *f, char *buf, size_t len, loff_t *ppos)
+  {
+    ... when != ppos
+  }
+
+@ stream_writer @
+identifier writestream, ppos;
+identifier f, buf, len;
+type loff_t;
+@@
+  ssize_t writestream(struct file *f, const char *buf, size_t len, loff_t *ppos)
+  {
+    ... when != ppos
+  }
+
+
+// a function that blocks
+@ blocks @
+identifier block_f;
+identifier wait_event =~ "^wait_event_.*";
+@@
+  block_f(...) {
+    ... when exists
+    wait_event(...)
+    ... when exists
+  }
+
+// stream_reader that can block inside.
+//
+// XXX wait_* can be called not directly from current function (e.g. func -> f -> g -> wait())
+// XXX currently reader_blocks supports only direct and 1-level indirect cases.
+@ reader_blocks_direct @
+identifier stream_reader.readstream;
+identifier wait_event =~ "^wait_event_.*";
+@@
+  readstream(...)
+  {
+    ... when exists
+    wait_event(...)
+    ... when exists
+  }
+
+@ reader_blocks_1 @
+identifier stream_reader.readstream;
+identifier blocks.block_f;
+@@
+  readstream(...)
+  {
+    ... when exists
+    block_f(...)
+    ... when exists
+  }
+
+@ reader_blocks depends on reader_blocks_direct || reader_blocks_1 @
+identifier stream_reader.readstream;
+@@
+  readstream(...) {
+    ...
+  }
+
+
+// file_operations + whether they have _any_ .read, .write, .llseek ... at all.
+//
+// XXX add support for file_operations xxx[N] = ...	(sound/core/pcm_native.c)
+@ fops0 @
+identifier fops;
+@@
+  struct file_operations fops = {
+    ...
+  };
+
+@ has_read @
+identifier fops0.fops;
+identifier read_f;
+@@
+  struct file_operations fops = {
+    .read = read_f,
+  };
+
+@ has_read_iter @
+identifier fops0.fops;
+identifier read_iter_f;
+@@
+  struct file_operations fops = {
+    .read_iter = read_iter_f,
+  };
+
+@ has_write @
+identifier fops0.fops;
+identifier write_f;
+@@
+  struct file_operations fops = {
+    .write = write_f,
+  };
+
+@ has_write_iter @
+identifier fops0.fops;
+identifier write_iter_f;
+@@
+  struct file_operations fops = {
+    .write_iter = write_iter_f,
+  };
+
+@ has_llseek @
+identifier fops0.fops;
+identifier llseek_f;
+@@
+  struct file_operations fops = {
+    .llseek = llseek_f,
+  };
+
+@ has_no_llseek @
+identifier fops0.fops;
+@@
+  struct file_operations fops = {
+    .llseek = no_llseek,
+  };
+
+@ has_mmap @
+identifier fops0.fops;
+identifier mmap_f;
+@@
+  struct file_operations fops = {
+    .mmap = mmap_f,
+  };
+
+@ has_copy_file_range @
+identifier fops0.fops;
+identifier copy_file_range_f;
+@@
+  struct file_operations fops = {
+    .copy_file_range = copy_file_range_f,
+  };
+
+@ has_remap_file_range @
+identifier fops0.fops;
+identifier remap_file_range_f;
+@@
+  struct file_operations fops = {
+    .remap_file_range = remap_file_range_f,
+  };
+
+@ has_splice_read @
+identifier fops0.fops;
+identifier splice_read_f;
+@@
+  struct file_operations fops = {
+    .splice_read = splice_read_f,
+  };
+
+@ has_splice_write @
+identifier fops0.fops;
+identifier splice_write_f;
+@@
+  struct file_operations fops = {
+    .splice_write = splice_write_f,
+  };
+
+
+// file_operations that is candidate for stream_open conversion - it does not
+// use mmap and other methods that assume @offset access to file.
+//
+// XXX for simplicity require no .{read/write}_iter and no .splice_{read/write} for now.
+// XXX maybe_steam.fops cannot be used in other rules - it gives "bad rule maybe_stream or bad variable fops".
+@ maybe_stream depends on (!has_llseek || has_no_llseek) && !has_mmap && !has_copy_file_range && !has_remap_file_range && !has_read_iter && !has_write_iter && !has_splice_read && !has_splice_write @
+identifier fops0.fops;
+@@
+  struct file_operations fops = {
+  };
+
+
+// ---- conversions ----
+
+// XXX .open = nonseekable_open -> .open = stream_open
+// XXX .open = func -> openfunc -> nonseekable_open
+
+// read & write
+//
+// if both are used in the same file_operations together with an opener -
+// under that conditions we can use stream_open instead of nonseekable_open.
+@ fops_rw depends on maybe_stream @
+identifier fops0.fops, openfunc;
+identifier stream_reader.readstream;
+identifier stream_writer.writestream;
+@@
+  struct file_operations fops = {
+      .open  = openfunc,
+      .read  = readstream,
+      .write = writestream,
+  };
+
+@ report_rw depends on report @
+identifier fops_rw.openfunc;
+position p1;
+@@
+  openfunc(...) {
+    <...
+     nonseekable_open@p1
+    ...>
+  }
+
+@ script:python depends on report && reader_blocks @
+fops << fops0.fops;
+p << report_rw.p1;
+@@
+coccilib.report.print_report(p[0],
+  "ERROR: %s: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix." % (fops,))
+
+@ script:python depends on report && !reader_blocks @
+fops << fops0.fops;
+p << report_rw.p1;
+@@
+coccilib.report.print_report(p[0],
+  "WARNING: %s: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open." % (fops,))
+
+
+@ explain_rw_deadlocked depends on explain && reader_blocks @
+identifier fops_rw.openfunc;
+@@
+  openfunc(...) {
+    <...
+-    nonseekable_open
++    nonseekable_open /* read & write (was deadlock) */
+    ...>
+  }
+
+
+@ explain_rw_nodeadlock depends on explain && !reader_blocks @
+identifier fops_rw.openfunc;
+@@
+  openfunc(...) {
+    <...
+-    nonseekable_open
++    nonseekable_open /* read & write (no direct deadlock) */
+    ...>
+  }
+
+@ patch_rw depends on patch @
+identifier fops_rw.openfunc;
+@@
+  openfunc(...) {
+    <...
+-   nonseekable_open
++   stream_open
+    ...>
+  }
+
+
+// read, but not write
+@ fops_r depends on maybe_stream && !has_write @
+identifier fops0.fops, openfunc;
+identifier stream_reader.readstream;
+@@
+  struct file_operations fops = {
+      .open  = openfunc,
+      .read  = readstream,
+  };
+
+@ report_r depends on report @
+identifier fops_r.openfunc;
+position p1;
+@@
+  openfunc(...) {
+    <...
+    nonseekable_open@p1
+    ...>
+  }
+
+@ script:python depends on report @
+fops << fops0.fops;
+p << report_r.p1;
+@@
+coccilib.report.print_report(p[0],
+  "WARNING: %s: .read() has stream semantic; safe to change nonseekable_open -> stream_open." % (fops,))
+
+@ explain_r depends on explain @
+identifier fops_r.openfunc;
+@@
+  openfunc(...) {
+    <...
+-   nonseekable_open
++   nonseekable_open /* read only */
+    ...>
+  }
+
+@ patch_r depends on patch @
+identifier fops_r.openfunc;
+@@
+  openfunc(...) {
+    <...
+-   nonseekable_open
++   stream_open
+    ...>
+  }
+
+
+// write, but not read
+@ fops_w depends on maybe_stream && !has_read @
+identifier fops0.fops, openfunc;
+identifier stream_writer.writestream;
+@@
+  struct file_operations fops = {
+      .open  = openfunc,
+      .write = writestream,
+  };
+
+@ report_w depends on report @
+identifier fops_w.openfunc;
+position p1;
+@@
+  openfunc(...) {
+    <...
+    nonseekable_open@p1
+    ...>
+  }
+
+@ script:python depends on report @
+fops << fops0.fops;
+p << report_w.p1;
+@@
+coccilib.report.print_report(p[0],
+  "WARNING: %s: .write() has stream semantic; safe to change nonseekable_open -> stream_open." % (fops,))
+
+@ explain_w depends on explain @
+identifier fops_w.openfunc;
+@@
+  openfunc(...) {
+    <...
+-   nonseekable_open
++   nonseekable_open /* write only */
+    ...>
+  }
+
+@ patch_w depends on patch @
+identifier fops_w.openfunc;
+@@
+  openfunc(...) {
+    <...
+-   nonseekable_open
++   stream_open
+    ...>
+  }
+
+
+// no read, no write - don't change anything
-- 
2.21.0.392.gf8f6787159

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/3] fuse: Add FOPEN_STREAM and use stream_open() if filesystem returned that from open handler
  2019-03-26 22:20 ` Kirill Smelkov
  (?)
@ 2019-03-26 23:22 ` Kirill Smelkov
  2019-04-24  7:13   ` [RESEND, PATCH " Kirill Smelkov
  -1 siblings, 1 reply; 21+ messages in thread
From: Kirill Smelkov @ 2019-03-26 23:22 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, Kirill Smelkov, Al Viro,
	Linus Torvalds, Michael Kerrisk, Yongzhi Pan, Jonathan Corbet,
	David Vrabel, Juergen Gross, Tejun Heo, Kirill Tkhai,
	Arnd Bergmann, Christoph Hellwig, Greg Kroah-Hartman,
	Julia Lawall, Nikolaus Rath, Han-Wen Nienhuys

Starting from 9c225f2655 (vfs: atomic f_pos accesses as per POSIX) files
opened even via nonseekable_open gate read and write via lock and do not
allow them to be run simultaneously. This can create read vs write
deadlock if a filesystem is trying to implement a socket-like file which
is intended to be simultaneously used for both read and write from
filesystem client. See previous patch "fs: stream_open - opener for
stream-like files so that read and write can run simultaneously without
deadlock" for details and e.g. 581d21a2d0 (xenbus: fix deadlock on
writes to /proc/xen/xenbus) for a similar deadlock example on /proc/xen/xenbus.

To avoid such deadlock it was tempting fuse_finish_open to use
stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags,
but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
and in particular GVFS which actually uses offset in its read and write
handlers

	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481

so if we would do such a change it will break a real user.

-> Add another flag (FOPEN_STREAM) for filesystem servers to indicate
that the opened handler is having stream-like semantics; does not use
file position and thus the kernel is free to issue simultaneous read and
write request on opened file handle.

This patch together with stream_open should be added to stable kernels starting from
v3.14+ (the kernel where 9c225f2655 first appeared). This will allow to patch
OSSPD and other FUSE filesystems that provide stream-like files to return
FOPEN_STREAM | FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on
all kernel versions. This should work because fuse_finish_open ignores unknown open
flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that
is not aware of this flag cannot hurt. In turn the kernel that is not aware of
FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to
implement streams without read vs write deadlock.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Yongzhi Pan <panyongzhi@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Nikolaus Rath <Nikolaus@rath.org>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 fs/fuse/file.c            | 4 +++-
 include/uapi/linux/fuse.h | 2 ++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ffaffe18352a..7ea4099cde16 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -181,7 +181,9 @@ void fuse_finish_open(struct inode *inode, struct file *file)
 		file->f_op = &fuse_direct_io_file_operations;
 	if (!(ff->open_flags & FOPEN_KEEP_CACHE))
 		invalidate_inode_pages2(inode->i_mapping);
-	if (ff->open_flags & FOPEN_NONSEEKABLE)
+	if (ff->open_flags & FOPEN_STREAM)
+		stream_open(inode, file);
+	else if (ff->open_flags & FOPEN_NONSEEKABLE)
 		nonseekable_open(inode, file);
 	if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) {
 		struct fuse_inode *fi = get_fuse_inode(inode);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index b4967d48bfda..93ac72a1e4ff 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -226,11 +226,13 @@ struct fuse_file_lock {
  * FOPEN_KEEP_CACHE: don't invalidate the data cache on open
  * FOPEN_NONSEEKABLE: the file is not seekable
  * FOPEN_CACHE_DIR: allow caching this directory
+ * FOPEN_STREAM: the file is stream-like
  */
 #define FOPEN_DIRECT_IO		(1 << 0)
 #define FOPEN_KEEP_CACHE	(1 << 1)
 #define FOPEN_NONSEEKABLE	(1 << 2)
 #define FOPEN_CACHE_DIR		(1 << 3)
+#define FOPEN_STREAM		(1 << 4)
 
 /**
  * INIT request/reply flags
-- 
2.21.0.392.gf8f6787159

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] *: convert stream-like files from nonseekable_open -> stream_open
       [not found] ` <8794193f3040b798010970228d978c05ad56ec52.1553637462.git.kirr@nexedi.com>
@ 2019-03-27  6:54   ` Lubomir Rintel
  0 siblings, 0 replies; 21+ messages in thread
From: Lubomir Rintel @ 2019-03-27  6:54 UTC (permalink / raw)
  To: Kirill Smelkov, Al Viro, Linus Torvalds; +Cc: linux-fsdevel, linux-kernel

On Tue, 2019-03-26 at 23:23 +0000, Kirill Smelkov wrote:
> Using scripts/coccinelle/api/stream_open.cocci added in the previous
> patch, search and convert to stream_open all in-kernel nonseekable_open
> users for which read and write actually do not depend on ppos and where
> there is no other methods in file_operations which assume @offset
> access.
> 
> I've verified each generated change manually - that it is correct to convert -
> and each other nonseekable_open instance left - that it is either not correct
> to convert there, or that it is not converted due to current stream_open.cocci
> limitations. The script also does not convert files that should be valid to
> convert, but that currently have .llseek = noop_llseek or generic_file_llseek
> for unknown reason despite file being opened with nonseekable_open (e.g.
> drivers/input/mousedev.c)
> 
> Among cases converted 14 were potentially vulnerable to read vs write deadlock
> (see details in the previous patch):

...

> and the reset were just safe to convert to stream_open because their
> read and write do not use ppos at all and corresponding file_operations
> do not have methods that assume @offset file access:

...

> 	drivers/char/pcmcia/scr24x_cs.c:95:8-24: WARNING: scr24x_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.

...

> diff --git a/drivers/char/pcmcia/scr24x_cs.c b/drivers/char/pcmcia/scr24x_cs.c
> index f6b43d9350f0..04b39c3596cc 100644
> --- a/drivers/char/pcmcia/scr24x_cs.c
> +++ b/drivers/char/pcmcia/scr24x_cs.c
> @@ -92,7 +92,7 @@ static int scr24x_open(struct inode *inode, struct file *filp)
>  	kref_get(&dev->refcnt);
>  	filp->private_data = dev;
>  
> -	return nonseekable_open(inode, filp);
> +	return stream_open(inode, filp);
>  }
>  
>  static int scr24x_release(struct inode *inode, struct file *filp)

Acked-by: Lubomir Rintel <lkundrak@v3.sk> [scr24x_cs]

Thanks,
Lubo


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
  2019-03-26 22:20 ` Kirill Smelkov
                   ` (2 preceding siblings ...)
  (?)
@ 2019-03-27 16:58 ` Juergen Gross
  -1 siblings, 0 replies; 21+ messages in thread
From: Juergen Gross @ 2019-03-27 16:58 UTC (permalink / raw)
  To: Kirill Smelkov, Al Viro, Linus Torvalds
  Cc: linux-fsdevel, linux-kernel, Michael Kerrisk, Yongzhi Pan,
	Jonathan Corbet, David Vrabel, Miklos Szeredi, Tejun Heo,
	Kirill Tkhai, Arnd Bergmann, Christoph Hellwig,
	Greg Kroah-Hartman, Julia Lawall, Nikolaus Rath,
	Han-Wen Nienhuys

On 26/03/2019 23:20, Kirill Smelkov wrote:
> Commit 9c225f2655 (vfs: atomic f_pos accesses as per POSIX) added locking for
> file.f_pos access and in particular made concurrent read and write not possible
> - now both those functions take f_pos lock for the whole run, and so if e.g. a
> read is blocked waiting for data, write will deadlock waiting for that read to
> complete. This caused regression for stream-like files where previously read
> and write could run simultaneously, but after that patch could not do so
> anymore. See e.g. 581d21a2d0 (xenbus: fix deadlock on writes to /proc/xen/xenbus)
> which fixes such regression for particular case of /proc/xen/xenbus.
> 
> The patch that added f_pos lock in 2014 (see https://lkml.org/lkml/2014/2/17/324
> for background discussion) did so to guarantee POSIX thread safety for
> read/write/lseek and added the locking to file descriptors of all regular
> files. In 2014 that thread-safety problem was not new as it was already discussed
> earlier in 2006: https://lwn.net/Articles/180387. However even though 2006'th
> version of Linus's patch (https://lwn.net/Articles/180396) was adding f_pos
> locking "only for files that are marked seekable with FMODE_LSEEK (thus avoiding
> the stream-like objects like pipes and sockets)", 2014'th version - the one that
> actually made it into the tree as 9c225f2655 - is doing so irregardless of whether
> a file is seekable or not. The reason that it did so is, probably, that there are
> many files that are marked non-seekable, but e.g. their read implementation
> actually depends on knowing current position to correctly handle the read. Some
> examples:
> 
> 	kernel/power/user.c		snapshot_read
> 	fs/debugfs/file.c		u32_array_read
> 	fs/fuse/control.c		fuse_conn_waiting_read + ...
> 	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
> 	arch/s390/hypfs/inode.c		hypfs_read_iter
> 	...
> 
> In despite that, many nonseekable_open users implement read and write with pure
> stream semantics - they don't depend on passed ppos at all. And for those cases
> where read could wait for something inside, it creates a situation similar to
> xenbus - the write could be never made to go until read is done, and read is
> waiting for some, potentially external, event, for potentially unbounded time
> -> deadlock. Besides xenbus, there are 14 such places in the kernel that I've
> found with semantic patch (see below):
> 
> 	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
> 	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
> 	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
> 	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
> 	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
> 	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
> 	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
> 	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
> 	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
> 	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
> 	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
> 	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
> 	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
> 	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
> 
> In addition to the cases above another regression caused by f_pos locking is
> that now FUSE filesystems that implement open with FOPEN_NONSEEKABLE flag, can
> no longer implement bidirectional stream-like files - for the same reason
> as above e.g. read can deadlock write locking on file.f_pos in the kernel.
> FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f7 (fuse: implement
> nonseekable open) to support OSSPD (https://github.com/libfuse/osspd;
> https://lwn.net/Articles/308445). OSSPD implements /dev/dsp in userspace with
> FOPEN_NONSEEKABLE flag, with corresponding read and write routines not
> depending on current position at all, and with both read and write being
> potentially blocking operations:
> 
> 	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
> 	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
> 	https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
> 
> Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
> "somewhat pipe-like files ..." with read handler not using offset. However
> that test implements only read without write and cannot exercise the deadlock
> scenario:
> 
> 	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
> 	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
> 	https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
> 
> I've actually hit the read vs write deadlock for real while implementing my
> FUSE filesystem where there is /head/watch file, for which open creates
> separate bidirectional socket-like stream in between filesystem and its user
> with both read and write being later performed simultaneously. And there it is
> semantically not easy to split the stream into two separate read-only and
> write-only channels:
> 
> 	https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
> 
> Let's fix this regression. The plan is:
> 
> 1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS - doing so would
>    break many in-kernel nonseekable_open users which actually use ppos in
>    read/write handlers.
> 
> 2. Add stream_open() to kernel to open stream-like non-seekable file descriptors.
>    Read and write on such file descriptors would never use nor change ppos. And
>    with that property on stream-like files read and write will be running without
>    taking f_pos lock - i.e. read and write could be running simultaneously.
> 
> 3. With semantic patch search and convert to stream_open all in-kernel
>    nonseekable_open users for which read and write actually do not depend on ppos and
>    where there is no other methods in file_operations which assume @offset access.
> 
> 4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via steam_open
>    if that bit is present in filesystem open reply.
> 
>    It was tempting to change fs/fuse/ open handler to use stream_open instead of
>    nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian
>    codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually
>    uses offset in its read and write handlers
> 
> 	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
> 	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
> 	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
> 	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
> 
>    so if we would do such a change it will break a real user.
> 
> 5. Add stream_open and FOPEN_STREAM handling to stable kernels starting from
>    v3.14+ (the kernel where 9c225f2655 first appeared). This will allow to patch
>    OSSPD and other FUSE filesystems that provide stream-like files to return
>    FOPEN_STREAM | FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on
>    all kernel versions. This should work because fs/fuse/ ignores unknown open
>    flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that
>    is not aware of this flag cannot hurt. In turn the kernel that is not aware of
>    FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to
>    implement streams without read vs write deadlock.
> 
> This patch: adds stream_open, converts /proc/xen/xenbus to it and adds semantic
> patch to automatically locate in-kernel places that are either required to be
> converted due to read vs write deadlock, or that are just safe to be converted
> because read and write do not use ppos and there are no other funky methods in
> file_operations.
> 
> Followup patches are:
> 
> - apply the result of semantic patch;
> - add FOPEN_STREAM to fs/fuse.
> 
> Regarding semantic patch I've verified each generated change manually - that it is
> correct to convert - and each other nonseekable_open instance left - that it is
> either not correct to convert there, or that it is not converted due to current
> stream_open.cocci limitations. The script also does not convert files that should
> be valid to convert, but that currently have .llseek = noop_llseek or
> generic_file_llseek for unknown reason despite file being opened with
> nonseekable_open (e.g. drivers/input/mousedev.c)
> 
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Yongzhi Pan <panyongzhi@gmail.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: David Vrabel <david.vrabel@citrix.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Julia Lawall <Julia.Lawall@lip6.fr>
> Cc: Nikolaus Rath <Nikolaus@rath.org>
> Cc: Han-Wen Nienhuys <hanwen@google.com>
> Signed-off-by: Kirill Smelkov <kirr@nexedi.com>

For the Xen changes: Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
  2019-03-26 22:20 ` Kirill Smelkov
                   ` (3 preceding siblings ...)
  (?)
@ 2019-04-06 17:07 ` Linus Torvalds
  2019-04-07 20:04   ` Kirill Smelkov
  -1 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2019-04-06 17:07 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Al Viro, linux-fsdevel, Linux List Kernel Mailing,
	Michael Kerrisk, Yongzhi Pan, Jonathan Corbet, David Vrabel,
	Juergen Gross, Miklos Szeredi, Tejun Heo, Kirill Tkhai,
	Arnd Bergmann, Christoph Hellwig, Greg Kroah-Hartman,
	Julia Lawall, Nikolaus Rath, Han-Wen Nienhuys

On Tue, Mar 26, 2019 at 12:20 PM Kirill Smelkov <kirr@nexedi.com> wrote:
>
> Commit 9c225f2655 (vfs: atomic f_pos accesses as per POSIX) added locking for
> file.f_pos access and in particular made concurrent read and write not possible
> [...]

Ok, I have applied this patch - but this patch only - as a
well-researched preparatory patch.

The actual conversion patch looks fine to me too, and I see a few
acks, but I think I'd like to see it during a merge window just
because it's large and does significant changes, while this one is
purely preparatory.

Al, comments?

One small note: please don't use lkml.org references since they tend
to be slow and flaky - use lore.kernel.org instead. Also, fix your git
config to use 12-character git hashes (best done by just removing the
explicit hash size entirely, and letting modern git shorten hashes
appropriately automatically).

                  Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
  2019-04-06 17:07 ` Linus Torvalds
@ 2019-04-07 20:04   ` Kirill Smelkov
  2019-04-08  0:09     ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Kirill Smelkov @ 2019-04-07 20:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, linux-fsdevel, Linux List Kernel Mailing,
	Michael Kerrisk, Yongzhi Pan, Jonathan Corbet, David Vrabel,
	Juergen Gross, Miklos Szeredi, Tejun Heo, Kirill Tkhai,
	Arnd Bergmann, Christoph Hellwig, Greg Kroah-Hartman,
	Julia Lawall, Nikolaus Rath, Han-Wen Nienhuys

On Sat, Apr 06, 2019 at 07:07:14AM -1000, Linus Torvalds wrote:
> On Tue, Mar 26, 2019 at 12:20 PM Kirill Smelkov <kirr@nexedi.com> wrote:
> >
> > Commit 9c225f2655 (vfs: atomic f_pos accesses as per POSIX) added locking for
> > file.f_pos access and in particular made concurrent read and write not possible
> > [...]
> 
> Ok, I have applied this patch - but this patch only - as a
> well-researched preparatory patch.
> 
> The actual conversion patch looks fine to me too, and I see a few
> acks, but I think I'd like to see it during a merge window just
> because it's large and does significant changes, while this one is
> purely preparatory.
> 
> Al, comments?

Linus, thanks a lot for feedback and for accepting the patch.

I agree that it is better to merge the bulk-conversion at the time of
next merge window. I hope Al will pick that patch up and send to you
when the next merge window time comes.

Could we please clarify one thing: is the general complete plan to deal
with this regression, as outlined in the merged patch, ok? I mean do we
agree to further proceed with fs/fuse/ FOPEN_STREAM (patch 3/3) and
backport it + stream_open to v3.14+ stable kernels? Unfortunately it
requires a bit of cooperation from userspace, but this scheme to fix the
regression is the best I could come up with.

If FUSE bits are ok, how should we go? Are we ok to merge that patch
now, or should we wait for the next merge window? I understand that
generally it should be merge window from one side, but since we are
trying to fix regression here and if the plan for regression fix is
accepted it should be merge now. If we merge now, what should be the way
for that patch to get merged?

Fixing regression on FUSE side is my reason to do this whole work -
that's why I care about it the most and ask.

> One small note: please don't use lkml.org references since they tend
> to be slow and flaky - use lore.kernel.org instead. Also, fix your git
> config to use 12-character git hashes (best done by just removing the
> explicit hash size entirely, and letting modern git shorten hashes
> appropriately automatically).

Thanks for pointing to lore.kernel.org - I'm not very keen to kernel
development and did not knew it exists. I was seeing lkml.org being
referenced here and there and that's why I used it. I will use
lore.kernel.org from now.

Short hashes were not due to git config, but due to me trimming them to
10 characters in vim manually. I will use longer hashes as you ask.

Kirill

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
  2019-04-07 20:04   ` Kirill Smelkov
@ 2019-04-08  0:09     ` Linus Torvalds
  2019-04-14  7:11       ` Kirill Smelkov
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2019-04-08  0:09 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Al Viro, linux-fsdevel, Linux List Kernel Mailing,
	Michael Kerrisk, Yongzhi Pan, Jonathan Corbet, David Vrabel,
	Juergen Gross, Miklos Szeredi, Tejun Heo, Kirill Tkhai,
	Arnd Bergmann, Christoph Hellwig, Greg Kroah-Hartman,
	Julia Lawall, Nikolaus Rath, Han-Wen Nienhuys

On Sun, Apr 7, 2019 at 10:04 AM Kirill Smelkov <kirr@nexedi.com> wrote:
>
> Fixing regression on FUSE side is my reason to do this whole work -
> that's why I care about it the most and ask.

Yeah, we can do the actual FUSE fix, I think. Preferably through the
FUSE tree. Miklos?

                 Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
       [not found]     ` <d8c23d05-8810-13a2-cc50-7a47ff35e90b@rasmusvillemoes.dk>
@ 2019-04-11 12:38       ` Kirill Smelkov
  2019-04-11 16:22         ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Kirill Smelkov @ 2019-04-11 12:38 UTC (permalink / raw)
  To: Rasmus Villemoes, Linus Torvalds
  Cc: Al Viro, Arnd Bergmann, Christoph Hellwig, Greg Kroah-Hartman,
	linux-fsdevel, linux-kernel

+Linus, +Al, +linux-fsdevel, +linux-kernel

On Tue, Apr 09, 2019 at 11:50:23PM +0200, Rasmus Villemoes wrote:
> On 09/04/2019 22.38, Kirill Smelkov wrote:
> > On Tue, Apr 09, 2019 at 09:43:37AM +0200, Rasmus Villemoes wrote:
> >> On 26/03/2019 23.20, Kirill Smelkov wrote:
> >>
> >>> 2. Add stream_open() to kernel to open stream-like non-seekable file descriptors.
> >>>    Read and write on such file descriptors would never use nor change ppos. And
> >>>    with that property on stream-like files read and write will be running without
> >>>    taking f_pos lock - i.e. read and write could be running simultaneously.
> >>>
> >>
> >>> diff --git a/fs/read_write.c b/fs/read_write.c
> >>> index d83003d856a0..91cb375f9840 100644
> >>> --- a/fs/read_write.c
> >>> +++ b/fs/read_write.c
> >>> @@ -560,12 +560,13 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
> >>>  
> >>>  static inline loff_t file_pos_read(struct file *file)
> >>>  {
> >>> -	return file->f_pos;
> >>> +	return file->f_mode & FMODE_STREAM ? 0 : file->f_pos;
> >>>  }
> >>>  
> >>>  static inline void file_pos_write(struct file *file, loff_t pos)
> >>>  {
> >>> -	file->f_pos = pos;
> >>> +	if ((file->f_mode & FMODE_STREAM) == 0)
> >>> +		file->f_pos = pos;
> >>>  }
> >>
> >> Perhaps leave file_pos_read alone (to avoid the conditional load), and
> >> do WARN_ON_ONCE((file->f_mode & FMODE_STREAM) && pos != 0) or maybe !=
> >> file->f_pos in _write to catch wrongly converted drivers - once the dust
> >> has settled and patches backported to -stable, that WARN_ON_ONCE can
> >> also be removed.
> >>
> >> Or, a more brutal way to detect .read or .write callbacks that do use
> >> ppos despite FMODE_STREAM is to lift the FMODE_STREAM handling to the
> >> callers of file_pos_{read,write} and pass on (file->f_mode &
> >> FMODE_STREAM) ? NULL : &pos instead of &pos. That would also just add
> >> one instead of two conditionals to the read/write paths.
> > 
> > Good idea to pass NULL as ppos into FMODE_STREAM users as this will show
> > as oops if a driver is wrongly converted and uses the position. Do you mean
> > something like this:
> > 
> 
> Not that invasive. I was more thinking of reverting the
> file_pos_read/write to their original state, then do this one-liner once
> per caller of those
> 
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -575,7 +575,7 @@ ssize_t ksys_read(unsigned int fd, char __user *buf,
> size_t count)
> 
>         if (f.file) {
>                 loff_t pos = file_pos_read(f.file);
> -               ret = vfs_read(f.file, buf, count, &pos);
> +               ret = vfs_read(f.file, buf, count, (file->f_mode & FMODE_STREAM) ? NULL : &pos);
>                 if (ret >= 0)
>                         file_pos_write(f.file, pos);
>                 fdput_pos(f);
> 
> The idea being to minimize the number of extra branches introduced in
> the read and write syscalls due to the FMODE_STREAM. So the above
> catches wrongly done conversions (or backports - perhaps an earlier
> version of some ->read callback did use ppos...) in a rather blunt way.

I checked generated x86_64 assembly and from the point of view of adding
minimum extra branches this is indeed the best version - no branches are
added at all as `(file->f_mode & FMODE_STREAM) ? NULL : &pos` is
translated to cmov instruction.

However file->f_pos writing is still there and it will bug under race
detector, e.g. under KTSAN because read and write can be running
simultaneously. Maybe it is not only race bug, but also a bit of
slowdown as read and write code paths write to the same memory thus
needing inter-cpu synchronization if read and write handlers are on
different cpus. However for this I'm not sure.

> The alternative was to revert file_pos_read() to just return file->f_pos
> as it used to, and just do
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 177ccc3d405a..97180d61a918 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -565,6 +565,7 @@ static inline loff_t file_pos_read(struct file *file)
> 
>  static inline void file_pos_write(struct file *file, loff_t pos)
>  {
> +       WARN_ON_ONCE((file->f_mode & FMODE_STREAM) && pos != 0);
>         file->f_pos = pos;
>  }
> 
> which is, essentially, also just one more branch (at least for all the
> ordinary, non-STREAM, files); this has the advantage of not oopsing, and
> just means that ->f_pos gets updated racily until somebody reacts to the
> WARN.

I prefer we pass NULL for ppos to FOPEN_STREAM files, because above
WARN, even if it triggers, will not tell which file implementation is
guilty - it will show the stack trace only till generic VFS code, if I
understand correctly.

> In either case, the current code papers over bugs introduced by
> conversion/backporting and has more branches.

Yes, thanks for brining this up. It is indeed better that we pass NULL
into converted drivers to catch correctness. Please see below for
updated patch and branches analysis.

> > but I wonder why we are discussing in private?
> 
> Well, the cc list looked too big for some mostly micro-optimzation
> comments, and I couldn't figure out a proper subset to trim it to. But
> if you think the sanity-checking (either variant) is worth it before the
> mass-conversion starts, I'm happy to take the discussion back to lkml.

Added relevant (I think) people and lists to Cc.

I was going to propose attached patch, but it turned out that if we
start to pass ppos=NULL, we also have to update at least new_sync_read,
new_sync_write, do_iter_writev_writev, rw_verify_area etc - functions
that can be called from under vfs_read/vfs_write/..., at least this way:

diff --git a/fs/read_write.c b/fs/read_write.c
index 4bdd1731859c..80724e9791bd 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -371,7 +371,7 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
        inode = file_inode(file);
        if (unlikely((ssize_t) count < 0))
                return retval;
-       pos = *ppos;
+       pos = (ppos ? *ppos : 0);
        if (unlikely(pos < 0)) {
                if (!unsigned_offsets(file))
                        return retval;
@@ -400,12 +400,13 @@ static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, lo
        ssize_t ret;

        init_sync_kiocb(&kiocb, filp);
-       kiocb.ki_pos = *ppos;
+       kiocb.ki_pos = (ppos ? *ppos : 0);
        iov_iter_init(&iter, READ, &iov, 1, len);

        ret = call_read_iter(filp, &kiocb, &iter);
        BUG_ON(ret == -EIOCBQUEUED);
-       *ppos = kiocb.ki_pos;
+       if (ppos)
+               *ppos = kiocb.ki_pos;
        return ret;
 }

@@ -468,12 +469,12 @@ static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t
        ssize_t ret;

        init_sync_kiocb(&kiocb, filp);
-       kiocb.ki_pos = *ppos;
+       kiocb.ki_pos = (ppos ? *ppos : 0);
        iov_iter_init(&iter, WRITE, &iov, 1, len);

        ret = call_write_iter(filp, &kiocb, &iter);
        BUG_ON(ret == -EIOCBQUEUED);
-       if (ret > 0)
+       if (ret > 0 && ppos)
                *ppos = kiocb.ki_pos;
        return ret;
 }
@@ -666,14 +667,15 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
        ret = kiocb_set_rw_flags(&kiocb, flags);
        if (ret)
                return ret;
-       kiocb.ki_pos = *ppos;
+       kiocb.ki_pos = (ppos ? *ppos : 0);

        if (type == READ)
                ret = call_read_iter(filp, &kiocb, iter);
        else
                ret = call_write_iter(filp, &kiocb, iter);
        BUG_ON(ret == -EIOCBQUEUED);
-       *ppos = kiocb.ki_pos;
+       if (ppos)
+               *ppos = kiocb.ki_pos;
        return ret;
 }

There is no notion of NULL ppos in kiocb, and I'm not familiar with that
subsystem, so it would be good to first get feedback before deciding on
how/where to move on.

Linus, anyone, comments?

Thanks beforehand,
Kirill

---- 8< ----
From facf84dfc75bf314c5f19160551750001d2d513b Mon Sep 17 00:00:00 2001
From: Kirill Smelkov <kirr@nexedi.com>
Date: Thu, 11 Apr 2019 13:07:03 +0300
Subject: [BUGGY PATCH] fs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files

( The patch is buggy - it will fail at NULL pointer dereference in e.g.
  rw_verify_area )

This amends commit 10dce8af3422 ("fs: stream_open - opener for
stream-like files so that read and write can run simultaneously without
deadlock") in how position is passed into .read()/.write() handler for
stream-like files:

Rasmus noticed that we currently pass 0 as position and ignore any position
change if that is done by a file implementation. This papers over bugs if ppos
is used in files that declare themselves as being stream-like as such bugs will
go unnoticed. Even if a file implementation is correctly converted into using
stream_open, its read/write later could be changed to use ppos and even though
that won't be working correctly, that bug might go unnoticed without someone
doing wrong behaviour analysis. It is thus better to pass ppos=NULL into
read/write for stream-like files as that don't give any chance for ppos usage
bugs because it will oops if ppos is ever used inside .read() or .write().

Rasmus also made the point that FMODE_STREAM patch added 2 branches into
ksys_read/ksys_write path which is too much unnecessary overhead and
could be reduces.

Let's rework the code to pass ppos=NULL and have less branches.

The first approach could be to revert file_pos_read and file_pos_write
into their pre-FMODE_STREAM version and just do

	--- a/fs/read_write.c
	+++ b/fs/read_write.c
	@@ -575,7 +575,7 @@ ssize_t ksys_read(unsigned int fd, char __user *buf,
	size_t count)

	        if (f.file) {
	                loff_t pos = file_pos_read(f.file);
	-               ret = vfs_read(f.file, buf, count, &pos);
	+               ret = vfs_read(f.file, buf, count, (file->f_mode & FMODE_STREAM) ? NULL : &pos);
	                if (ret >= 0)
	                        file_pos_write(f.file, pos);
	                fdput_pos(f);

this adds no branches at all as `(file->f_mode & FMODE_STREAM) ? NULL : &pos)`
translates to single cmov instruction on x86_64:

        if (f.file) {
    18e5:       48 89 c3                mov    %rax,%rbx
    18e8:       48 83 e3 fc             and    $0xfffffffffffffffc,%rbx
    18ec:       74 79                   je     1967 <ksys_read+0xa7>
    18ee:       48 89 c5                mov    %rax,%rbp
                loff_t pos = f.file->f_pos;
    18f1:       48 8b 43 68             mov    0x68(%rbx),%rax
                ret = vfs_read(f.file, buf, count, (f.file->f_mode & FMODE_STREAM) ? NULL : &pos);
    18f5:       48 89 e1                mov    %rsp,%rcx
    18f8:       f6 43 46 20             testb  $0x20,0x46(%rbx)
    18fc:       4c 89 e6                mov    %r12,%rsi
    18ff:       4c 89 ea                mov    %r13,%rdx
    1902:       48 89 df                mov    %rbx,%rdi
                loff_t pos = f.file->f_pos;
    1905:       48 89 04 24             mov    %rax,(%rsp)
                ret = vfs_read(f.file, buf, count, (f.file->f_mode & FMODE_STREAM) ? NULL : &pos);
    1909:       b8 00 00 00 00          mov    $0x0,%eax
    190e:       48 0f 45 c8             cmovne %rax,%rcx		<-- NOTE
    1912:       e8 00 00 00 00          callq  1917 <ksys_read+0x57>
                        1913: R_X86_64_PLT32    vfs_read-0x4

However this leaves on unconditional write into file->f_pos after
vfs_read call, and even though it should not be affecting semantic, it
will bug under race detector (e.g. KTSAN) and probably it might add some
inter-CPU overhead if read and write handlers are running on different
cores.

If we instead move `file->f_mode & FMODE_STREAM` checking into vfs
functions that actually dispatch IO and care to load/store file->f_fpos
only for non-stream files, even though it is 3 branches if looking at C
source, gcc generates only 1 more branch compared to how it was
pre-FMODE_STREAM. For example consider updated ksys_read:

	 ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
	 {
	 	struct fd f = fdget_pos(fd);
	 	ssize_t ret = -EBADF;

	 	if (f.file) {
	-		loff_t pos = file_pos_read(f.file);
	-		ret = vfs_read(f.file, buf, count, &pos);
	-		if (ret >= 0)
	-			file_pos_write(f.file, pos);
	+		loff_t pos, *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &pos;
	+		if (ppos)
	+			pos = f.file->f_pos;
	+		ret = vfs_read(f.file, buf, count, ppos);
	+		if (ret >= 0 && ppos)
	+			f.file->f_pos = pos;
	 		fdput_pos(f);
	 	}
	 	return ret;

The code that gcc generates here is essentially as if the source was:

	if (f.file->f_mode & FMODE_STREAM) {
		ret = vfs_read(f.file, buf, count, NULL);
	} else {
		loff_t pos = f.file->f_pos;
		ret = vfs_read(f.file, buf, count, ppos);
		if (ret >= 0)
			f.file->f_pos = pos;
	}
	fdput_pos(f);

i.e. it branches only once after checking file->f_mode and then has two
distinct codepaths each with its own vfs_read call:

        if (f.file) {
    18e5:       48 89 c5                mov    %rax,%rbp
    18e8:       48 83 e5 fc             and    $0xfffffffffffffffc,%rbp
    18ec:       0f 84 89 00 00 00       je     197b <ksys_read+0xbb>
    18f2:       48 89 c3                mov    %rax,%rbx
                loff_t pos, *ppos = (f.file->f_mode & FMODE_STREAM) ? NULL : &pos;
    18f5:       f6 45 46 20             testb  $0x20,0x46(%rbp)
    18f9:       74 3b                   je     1936 <ksys_read+0x76>	<-- branch pos/no-pos
                ret = vfs_read(f.file, buf, count, ppos);
    18fb:       4c 89 e6                mov    %r12,%rsi		<-- start of IO without pos
    18fe:       31 c9                   xor    %ecx,%ecx
    1900:       4c 89 ea                mov    %r13,%rdx
    1903:       48 89 ef                mov    %rbp,%rdi
    1906:       e8 00 00 00 00          callq  190b <ksys_read+0x4b>
                        1907: R_X86_64_PLT32    vfs_read-0x4		<-- vfs_read(..., ppos=NULL)
    190b:       49 89 c4                mov    %rax,%r12
        if (f.flags & FDPUT_POS_UNLOCK)
    190e:       f6 c3 02                test   $0x2,%bl
    1911:       75 51                   jne    1964 <ksys_read+0xa4>
        if (fd.flags & FDPUT_FPUT)
    1913:       83 e3 01                and    $0x1,%ebx
    1916:       75 59                   jne    1971 <ksys_read+0xb1>
}
    1918:       48 8b 4c 24 08          mov    0x8(%rsp),%rcx
    191d:       65 48 33 0c 25 28 00    xor    %gs:0x28,%rcx
    1924:       00 00
    1926:       4c 89 e0                mov    %r12,%rax
    1929:       75 59                   jne    1984 <ksys_read+0xc4>
    192b:       48 83 c4 10             add    $0x10,%rsp
    192f:       5b                      pop    %rbx
    1930:       5d                      pop    %rbp
    1931:       41 5c                   pop    %r12
    1933:       41 5d                   pop    %r13
    1935:       c3                      retq
                        pos = f.file->f_pos;
    1936:       48 8b 45 68             mov    0x68(%rbp),%rax		<-- start of IO with pos
                ret = vfs_read(f.file, buf, count, ppos);
    193a:       4c 89 e6                mov    %r12,%rsi
    193d:       48 89 e1                mov    %rsp,%rcx
    1940:       4c 89 ea                mov    %r13,%rdx
    1943:       48 89 ef                mov    %rbp,%rdi
                        pos = f.file->f_pos;
    1946:       48 89 04 24             mov    %rax,(%rsp)
                ret = vfs_read(f.file, buf, count, ppos);
    194a:       e8 00 00 00 00          callq  194f <ksys_read+0x8f>
                        194b: R_X86_64_PLT32    vfs_read-0x4		<-- vfs_read(..., ppos=&pos)
    194f:       49 89 c4                mov    %rax,%r12
                if (ret >= 0 && ppos)
    1952:       48 85 c0                test   %rax,%rax
    1955:       78 b7                   js     190e <ksys_read+0x4e>
                        f.file->f_pos = pos;
    1957:       48 8b 04 24             mov    (%rsp),%rax
    195b:       48 89 45 68             mov    %rax,0x68(%rbp)
        if (f.flags & FDPUT_POS_UNLOCK)
    ...

If adding one branch to handle FMODE_STREAM is acceptable, this approach should
be ok. Not duplicating vfs_read calls at C level is better as C-level code is
more clear without such duplication, and gcc cares to optimize the function
properly to have only 1 branch depending on file->f_mode.

If even 1 extra branch is unacceptable, we should indeed go with the first
option described in this patch but be prepared for race-detector bug reports
and probably some inter-CPU overhead.

Overall I would suggest to use simpler approach presented by hereby patch unless
1-extra jump could be shown to cause noticeable slowdowns in practice.

Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 fs/open.c       |  5 +++--
 fs/read_write.c | 51 +++++++++++++++++++++++--------------------------
 2 files changed, 27 insertions(+), 29 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index a00350018a47..9c7d724a6f67 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1219,8 +1219,9 @@ EXPORT_SYMBOL(nonseekable_open);
 /*
  * stream_open is used by subsystems that want stream-like file descriptors.
  * Such file descriptors are not seekable and don't have notion of position
- * (file.f_pos is always 0). Contrary to file descriptors of other regular
- * files, .read() and .write() can run simultaneously.
+ * (file.f_pos is always 0 and ppos passed to .read()/.write() is always NULL).
+ * Contrary to file descriptors of other regular files, .read() and .write()
+ * can run simultaneously.
  *
  * stream_open never fails and is marked to return int so that it could be
  * directly used as file_operations.open .
diff --git a/fs/read_write.c b/fs/read_write.c
index 61b43ad7608e..21d350df8109 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -558,27 +558,18 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 	return ret;
 }
 
-static inline loff_t file_pos_read(struct file *file)
-{
-	return file->f_mode & FMODE_STREAM ? 0 : file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
-{
-	if ((file->f_mode & FMODE_STREAM) == 0)
-		file->f_pos = pos;
-}
-
 ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_read(f.file, buf, count, &pos);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &pos;
+		if (ppos)
+			pos = f.file->f_pos;
+		ret = vfs_read(f.file, buf, count, ppos);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 	return ret;
@@ -595,10 +586,12 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_write(f.file, buf, count, &pos);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &pos;
+		if (ppos)
+			pos = f.file->f_pos;
+		ret = vfs_write(f.file, buf, count, ppos);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
@@ -1013,10 +1006,12 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos, flags);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &pos;
+		if (ppos)
+			pos = f.file->f_pos;
+		ret = vfs_readv(f.file, vec, vlen, ppos, flags);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
@@ -1033,10 +1028,12 @@ static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos, flags);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &pos;
+		if (ppos)
+			pos = f.file->f_pos;
+		ret = vfs_writev(f.file, vec, vlen, ppos, flags);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
-- 
2.21.0.392.gf8f6787159

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
  2019-04-11 12:38       ` Kirill Smelkov
@ 2019-04-11 16:22         ` Linus Torvalds
  2019-04-12 12:42           ` Kirill Smelkov
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2019-04-11 16:22 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Rasmus Villemoes, Al Viro, Arnd Bergmann, Christoph Hellwig,
	Greg Kroah-Hartman, linux-fsdevel, Linux List Kernel Mailing

On Thu, Apr 11, 2019 at 5:38 AM Kirill Smelkov <kirr@nexedi.com> wrote:
>
> However file->f_pos writing is still there and it will bug under race
> detector, e.g. under KTSAN because read and write can be running
> simultaneously. Maybe it is not only race bug, but also a bit of
> slowdown as read and write code paths write to the same memory thus
> needing inter-cpu synchronization if read and write handlers are on
> different cpus. However for this I'm not sure.

I doubt it's noticeable, but it would probably be good to simply not
do the write for streams.

That *could* be done somewhat similarly, with just changing th address
to be updated, or course.

In fact, we *used* to (long ago) pass in the address of "file->f_pos"
itself to the low-level read/write routines. We then changed it to do
that indirection through a local copy of pos (and
file_pos_read/file_pos_write) because we didn't do the proper locking,
so different read/write versions could mess with each other (and with
lseek).

But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos
accesses as per POSIX") did was to add the proper locking at least for
the cases that we care about deeply, so we *could* say that we have
three cases:

 - FMODE_ATOMIC_POS: properly locked,
 - FMODE_STREAM: no pos at all
 - otherwise a "mostly don't care - don't mix!"

and so we could go back to not copying the pos at all, and instead do
something like

        loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
        ret = vfs_write(f.file, buf, count, ppos);

and perhaps have a long-term plan to try to get rid of the "don't mix"
case entirely (ie "if you use f_pos, then we'll do the proper
locking")

(The above is obviously surrounded by the fdget_pos()/fdput_pos() that
implements the locking decision).

           Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
  2019-04-11 16:22         ` Linus Torvalds
@ 2019-04-12 12:42           ` Kirill Smelkov
  2019-04-13 16:54             ` Kirill Smelkov
  0 siblings, 1 reply; 21+ messages in thread
From: Kirill Smelkov @ 2019-04-12 12:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rasmus Villemoes, Al Viro, Arnd Bergmann, Christoph Hellwig,
	Greg Kroah-Hartman, linux-fsdevel, Linux List Kernel Mailing

On Thu, Apr 11, 2019 at 09:22:56AM -0700, Linus Torvalds wrote:
> On Thu, Apr 11, 2019 at 5:38 AM Kirill Smelkov <kirr@nexedi.com> wrote:
> >
> > However file->f_pos writing is still there and it will bug under race
> > detector, e.g. under KTSAN because read and write can be running
> > simultaneously. Maybe it is not only race bug, but also a bit of
> > slowdown as read and write code paths write to the same memory thus
> > needing inter-cpu synchronization if read and write handlers are on
> > different cpus. However for this I'm not sure.
> 
> I doubt it's noticeable, but it would probably be good to simply not
> do the write for streams.
> 
> That *could* be done somewhat similarly, with just changing th address
> to be updated, or course.
> 
> In fact, we *used* to (long ago) pass in the address of "file->f_pos"
> itself to the low-level read/write routines. We then changed it to do
> that indirection through a local copy of pos (and
> file_pos_read/file_pos_write) because we didn't do the proper locking,
> so different read/write versions could mess with each other (and with
> lseek).
> 
> But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos
> accesses as per POSIX") did was to add the proper locking at least for
> the cases that we care about deeply, so we *could* say that we have
> three cases:
> 
>  - FMODE_ATOMIC_POS: properly locked,
>  - FMODE_STREAM: no pos at all
>  - otherwise a "mostly don't care - don't mix!"
> 
> and so we could go back to not copying the pos at all, and instead do
> something like
> 
>         loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
>         ret = vfs_write(f.file, buf, count, ppos);
> 
> and perhaps have a long-term plan to try to get rid of the "don't mix"
> case entirely (ie "if you use f_pos, then we'll do the proper
> locking")
> 
> (The above is obviously surrounded by the fdget_pos()/fdput_pos() that
> implements the locking decision).

Ok Linus, thanks for feedback. Please consider applying or queueing the
following patches.

If the patches are ok, the first one is probably better to be applied
now - to catch wrongly converted drivers right from start, while the
second one could be probably better delayed till merge window. However
please do as what you prefer is best.

Kirill

---- 8< ----
From 328d4cc5dbb1d6fdbff1f80f1eed9f04b9a5de1c Mon Sep 17 00:00:00 2001
From: Kirill Smelkov <kirr@nexedi.com>
Date: Fri, 12 Apr 2019 12:31:57 +0300
Subject: [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files

This amends commit 10dce8af3422 ("fs: stream_open - opener for
stream-like files so that read and write can run simultaneously without
deadlock") in how position is passed into .read()/.write() handler for
stream-like files:

Rasmus noticed that we currently pass 0 as position and ignore any position
change if that is done by a file implementation. This papers over bugs if ppos
is used in files that declare themselves as being stream-like as such bugs will
go unnoticed. Even if a file implementation is correctly converted into using
stream_open, its read/write later could be changed to use ppos and even though
that won't be working correctly, that bug might go unnoticed without someone
doing wrong behaviour analysis. It is thus better to pass ppos=NULL into
read/write for stream-like files as that don't give any chance for ppos usage
bugs because it will oops if ppos is ever used inside .read() or .write().

Note 1: rw_verify_area, new_sync_{read,write} needs to be updated
because they are called by vfs_read/vfs_write & friends before
file_operations .read/.write .

Note 2: if file backend uses new-style .read_iter/.write_iter, position
is still passed into there as non-pointer kiocb.ki_pos . Currently
stream_open.cocci (semantic patch added by 10dce8af3422) ignores files
whose file_operations has *_iter methods.

Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 fs/open.c       |  5 ++--
 fs/read_write.c | 75 +++++++++++++++++++++++++++++--------------------
 2 files changed, 47 insertions(+), 33 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index a00350018a47..9c7d724a6f67 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1219,8 +1219,9 @@ EXPORT_SYMBOL(nonseekable_open);
 /*
  * stream_open is used by subsystems that want stream-like file descriptors.
  * Such file descriptors are not seekable and don't have notion of position
- * (file.f_pos is always 0). Contrary to file descriptors of other regular
- * files, .read() and .write() can run simultaneously.
+ * (file.f_pos is always 0 and ppos passed to .read()/.write() is always NULL).
+ * Contrary to file descriptors of other regular files, .read() and .write()
+ * can run simultaneously.
  *
  * stream_open never fails and is marked to return int so that it could be
  * directly used as file_operations.open .
diff --git a/fs/read_write.c b/fs/read_write.c
index 61b43ad7608e..d62556be6848 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -371,7 +371,7 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
 	inode = file_inode(file);
 	if (unlikely((ssize_t) count < 0))
 		return retval;
-	pos = *ppos;
+	pos = (ppos ? *ppos : 0);
 	if (unlikely(pos < 0)) {
 		if (!unsigned_offsets(file))
 			return retval;
@@ -400,12 +400,13 @@ static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, lo
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 	iov_iter_init(&iter, READ, &iov, 1, len);
 
 	ret = call_read_iter(filp, &kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	*ppos = kiocb.ki_pos;
+	if (ppos)
+		*ppos = kiocb.ki_pos;
 	return ret;
 }
 
@@ -468,12 +469,12 @@ static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 	iov_iter_init(&iter, WRITE, &iov, 1, len);
 
 	ret = call_write_iter(filp, &kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	if (ret > 0)
+	if (ret > 0 && ppos)
 		*ppos = kiocb.ki_pos;
 	return ret;
 }
@@ -558,15 +559,10 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 	return ret;
 }
 
-static inline loff_t file_pos_read(struct file *file)
-{
-	return file->f_mode & FMODE_STREAM ? 0 : file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
+/* file_ppos returns &file->f_pos or NULL if file is stream */
+static inline loff_t *file_ppos(struct file *file)
 {
-	if ((file->f_mode & FMODE_STREAM) == 0)
-		file->f_pos = pos;
+	return file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
 }
 
 ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
@@ -575,10 +571,14 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_read(f.file, buf, count, &pos);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_read(f.file, buf, count, ppos);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 	return ret;
@@ -595,10 +595,14 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_write(f.file, buf, count, &pos);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_write(f.file, buf, count, ppos);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
@@ -673,14 +677,15 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
 	ret = kiocb_set_rw_flags(&kiocb, flags);
 	if (ret)
 		return ret;
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 
 	if (type == READ)
 		ret = call_read_iter(filp, &kiocb, iter);
 	else
 		ret = call_write_iter(filp, &kiocb, iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	*ppos = kiocb.ki_pos;
+	if (ppos)
+		*ppos = kiocb.ki_pos;
 	return ret;
 }
 
@@ -1013,10 +1018,14 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos, flags);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_readv(f.file, vec, vlen, ppos, flags);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
@@ -1033,10 +1042,14 @@ static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos, flags);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_writev(f.file, vec, vlen, ppos, flags);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
-- 
2.21.0.593.g511ec345e1


---- 8< ----
From ac88efdcdedb66230fcdf56e3fb244d8f5502b0b Mon Sep 17 00:00:00 2001
From: Kirill Smelkov <kirr@nexedi.com>
Date: Fri, 12 Apr 2019 14:53:57 +0300
Subject: [PATCH 2/2] vfs: use &file->f_pos directly on files that have position

Long ago vfs read/write operations were passing ppos=&file->f_pos directly to
.read / .write file_operations methods. That changed in 2004 in 55f09ec0087c
("read/write: pass down a copy of f_pos, not f_pos itself.") which started to
pass ppos=&local_var trying to avoid simultaneous read/write/lseek stepping
onto each other toes and overwriting file->f_pos racily. That measure was not
complete and in 2014 commit 9c225f2655e36 ("vfs: atomic f_pos accesses as per
POSIX") added file->f_pos_lock to completely disable simultaneous
read/write/lseek runs. After f_pos_lock was introduced the reason to avoid
passing ppos=&file->f_pos directly due to concurrency vanished.

Linus explains[1]:

	In fact, we *used* to (long ago) pass in the address of "file->f_pos"
	itself to the low-level read/write routines. We then changed it to do
	that indirection through a local copy of pos (and
	file_pos_read/file_pos_write) because we didn't do the proper locking,
	so different read/write versions could mess with each other (and with
	lseek).

	But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos
	accesses as per POSIX") did was to add the proper locking at least for
	the cases that we care about deeply, so we *could* say that we have
	three cases:

	 - FMODE_ATOMIC_POS: properly locked,
	 - FMODE_STREAM: no pos at all
	 - otherwise a "mostly don't care - don't mix!"

	and so we could go back to not copying the pos at all, and instead do
	something like

	        loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
	        ret = vfs_write(f.file, buf, count, ppos);

	and perhaps have a long-term plan to try to get rid of the "don't mix"
	case entirely (ie "if you use f_pos, then we'll do the proper
	locking")

	(The above is obviously surrounded by the fdget_pos()/fdput_pos() that
	implements the locking decision).

Currently for regular files we always set FMODE_ATOMIC_POS and change that to
FMODE_STREAM if stream_open is used explicitly on open. That leaves other
files, like e.g. sockets and pipes, for "mostly don't care - don't mix!" case.

Sockets, for example, always check that on read/write the initial pos they
receive is 0 and don't update it. And if it is !0 they return -ESPIPE.
That suggests that we can do the switch into passing &file->f_pos directly now
and incrementally convert to FMODE_STREAM files that were doing the stream-like
checking manually in their low-level .read/.write handlers.

Note: it is theoretically possible that a driver updates *ppos inside
even if read/write returns error. For such cases the conversion will
change IO semantic a bit. The semantic that is changing here was
introduced in 2013 in commit 5faf153ebf61 "don't call file_pos_write()
if vfs_{read,write}{,v}() fails".

[1] https://lore.kernel.org/linux-fsdevel/CAHk-=whJtZt52SnhBGrNMnuxFn3GE9X_e02x8BPxtkqrfyZukw@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 fs/read_write.c | 36 ++++--------------------------------
 1 file changed, 4 insertions(+), 32 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index d62556be6848..13550b65cb2c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -571,14 +571,7 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_read(f.file, buf, count, ppos);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_read(f.file, buf, count, file_ppos(f.file));
 		fdput_pos(f);
 	}
 	return ret;
@@ -595,14 +588,7 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_write(f.file, buf, count, ppos);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_write(f.file, buf, count, file_ppos(f.file));
 		fdput_pos(f);
 	}
 
@@ -1018,14 +1004,7 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_readv(f.file, vec, vlen, ppos, flags);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_readv(f.file, vec, vlen, file_ppos(f.file), flags);
 		fdput_pos(f);
 	}
 
@@ -1042,14 +1021,7 @@ static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_writev(f.file, vec, vlen, ppos, flags);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_writev(f.file, vec, vlen, file_ppos(f.file), flags);
 		fdput_pos(f);
 	}
 
-- 
2.21.0.593.g511ec345e1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
  2019-04-12 12:42           ` Kirill Smelkov
@ 2019-04-13 16:54             ` Kirill Smelkov
  2019-04-13 16:54               ` [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files Kirill Smelkov
  2019-04-13 16:55                 ` Kirill Smelkov
  0 siblings, 2 replies; 21+ messages in thread
From: Kirill Smelkov @ 2019-04-13 16:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rasmus Villemoes, Al Viro, Arnd Bergmann, Christoph Hellwig,
	Greg Kroah-Hartman, linux-fsdevel, Linux List Kernel Mailing

On Fri, Apr 12, 2019 at 03:41:44PM +0300, Kirill Smelkov wrote:
> On Thu, Apr 11, 2019 at 09:22:56AM -0700, Linus Torvalds wrote:
> > On Thu, Apr 11, 2019 at 5:38 AM Kirill Smelkov <kirr@nexedi.com> wrote:
> > >
> > > However file->f_pos writing is still there and it will bug under race
> > > detector, e.g. under KTSAN because read and write can be running
> > > simultaneously. Maybe it is not only race bug, but also a bit of
> > > slowdown as read and write code paths write to the same memory thus
> > > needing inter-cpu synchronization if read and write handlers are on
> > > different cpus. However for this I'm not sure.
> > 
> > I doubt it's noticeable, but it would probably be good to simply not
> > do the write for streams.
> > 
> > That *could* be done somewhat similarly, with just changing th address
> > to be updated, or course.
> > 
> > In fact, we *used* to (long ago) pass in the address of "file->f_pos"
> > itself to the low-level read/write routines. We then changed it to do
> > that indirection through a local copy of pos (and
> > file_pos_read/file_pos_write) because we didn't do the proper locking,
> > so different read/write versions could mess with each other (and with
> > lseek).
> > 
> > But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos
> > accesses as per POSIX") did was to add the proper locking at least for
> > the cases that we care about deeply, so we *could* say that we have
> > three cases:
> > 
> >  - FMODE_ATOMIC_POS: properly locked,
> >  - FMODE_STREAM: no pos at all
> >  - otherwise a "mostly don't care - don't mix!"
> > 
> > and so we could go back to not copying the pos at all, and instead do
> > something like
> > 
> >         loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
> >         ret = vfs_write(f.file, buf, count, ppos);
> > 
> > and perhaps have a long-term plan to try to get rid of the "don't mix"
> > case entirely (ie "if you use f_pos, then we'll do the proper
> > locking")
> > 
> > (The above is obviously surrounded by the fdget_pos()/fdput_pos() that
> > implements the locking decision).
> 
> Ok Linus, thanks for feedback. Please consider applying or queueing the
> following patches.

Resending those patches one by one in separate emails, if maybe 2
patches inline was problematic to handle...

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files
  2019-04-13 16:54             ` Kirill Smelkov
@ 2019-04-13 16:54               ` Kirill Smelkov
  2019-04-13 17:27                 ` Linus Torvalds
  2019-04-13 16:55                 ` Kirill Smelkov
  1 sibling, 1 reply; 21+ messages in thread
From: Kirill Smelkov @ 2019-04-13 16:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Arnd Bergmann, Christoph Hellwig, Greg Kroah-Hartman,
	linux-fsdevel, linux-kernel, Kirill Smelkov, Rasmus Villemoes

This amends commit 10dce8af3422 ("fs: stream_open - opener for
stream-like files so that read and write can run simultaneously without
deadlock") in how position is passed into .read()/.write() handler for
stream-like files:

Rasmus noticed that we currently pass 0 as position and ignore any position
change if that is done by a file implementation. This papers over bugs if ppos
is used in files that declare themselves as being stream-like as such bugs will
go unnoticed. Even if a file implementation is correctly converted into using
stream_open, its read/write later could be changed to use ppos and even though
that won't be working correctly, that bug might go unnoticed without someone
doing wrong behaviour analysis. It is thus better to pass ppos=NULL into
read/write for stream-like files as that don't give any chance for ppos usage
bugs because it will oops if ppos is ever used inside .read() or .write().

Note 1: rw_verify_area, new_sync_{read,write} needs to be updated
because they are called by vfs_read/vfs_write & friends before
file_operations .read/.write .

Note 2: if file backend uses new-style .read_iter/.write_iter, position
is still passed into there as non-pointer kiocb.ki_pos . Currently
stream_open.cocci (semantic patch added by 10dce8af3422) ignores files
whose file_operations has *_iter methods.

Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 fs/open.c       |  5 ++--
 fs/read_write.c | 75 +++++++++++++++++++++++++++++--------------------
 2 files changed, 47 insertions(+), 33 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index a00350018a47..9c7d724a6f67 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1219,8 +1219,9 @@ EXPORT_SYMBOL(nonseekable_open);
 /*
  * stream_open is used by subsystems that want stream-like file descriptors.
  * Such file descriptors are not seekable and don't have notion of position
- * (file.f_pos is always 0). Contrary to file descriptors of other regular
- * files, .read() and .write() can run simultaneously.
+ * (file.f_pos is always 0 and ppos passed to .read()/.write() is always NULL).
+ * Contrary to file descriptors of other regular files, .read() and .write()
+ * can run simultaneously.
  *
  * stream_open never fails and is marked to return int so that it could be
  * directly used as file_operations.open .
diff --git a/fs/read_write.c b/fs/read_write.c
index 61b43ad7608e..d62556be6848 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -371,7 +371,7 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
 	inode = file_inode(file);
 	if (unlikely((ssize_t) count < 0))
 		return retval;
-	pos = *ppos;
+	pos = (ppos ? *ppos : 0);
 	if (unlikely(pos < 0)) {
 		if (!unsigned_offsets(file))
 			return retval;
@@ -400,12 +400,13 @@ static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, lo
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 	iov_iter_init(&iter, READ, &iov, 1, len);
 
 	ret = call_read_iter(filp, &kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	*ppos = kiocb.ki_pos;
+	if (ppos)
+		*ppos = kiocb.ki_pos;
 	return ret;
 }
 
@@ -468,12 +469,12 @@ static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 	iov_iter_init(&iter, WRITE, &iov, 1, len);
 
 	ret = call_write_iter(filp, &kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	if (ret > 0)
+	if (ret > 0 && ppos)
 		*ppos = kiocb.ki_pos;
 	return ret;
 }
@@ -558,15 +559,10 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 	return ret;
 }
 
-static inline loff_t file_pos_read(struct file *file)
-{
-	return file->f_mode & FMODE_STREAM ? 0 : file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
+/* file_ppos returns &file->f_pos or NULL if file is stream */
+static inline loff_t *file_ppos(struct file *file)
 {
-	if ((file->f_mode & FMODE_STREAM) == 0)
-		file->f_pos = pos;
+	return file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
 }
 
 ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
@@ -575,10 +571,14 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_read(f.file, buf, count, &pos);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_read(f.file, buf, count, ppos);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 	return ret;
@@ -595,10 +595,14 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_write(f.file, buf, count, &pos);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_write(f.file, buf, count, ppos);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
@@ -673,14 +677,15 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
 	ret = kiocb_set_rw_flags(&kiocb, flags);
 	if (ret)
 		return ret;
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 
 	if (type == READ)
 		ret = call_read_iter(filp, &kiocb, iter);
 	else
 		ret = call_write_iter(filp, &kiocb, iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	*ppos = kiocb.ki_pos;
+	if (ppos)
+		*ppos = kiocb.ki_pos;
 	return ret;
 }
 
@@ -1013,10 +1018,14 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos, flags);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_readv(f.file, vec, vlen, ppos, flags);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
@@ -1033,10 +1042,14 @@ static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos, flags);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_writev(f.file, vec, vlen, ppos, flags);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
-- 
2.21.0.593.g511ec345e1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2] vfs: use &file->f_pos directly on files that have position
  2019-04-13 16:54             ` Kirill Smelkov
@ 2019-04-13 16:55                 ` Kirill Smelkov
  2019-04-13 16:55                 ` Kirill Smelkov
  1 sibling, 0 replies; 21+ messages in thread
From: Kirill Smelkov @ 2019-04-13 16:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Arnd Bergmann, Christoph Hellwig, Greg Kroah-Hartman,
	linux-fsdevel, linux-kernel, Kirill Smelkov

Long ago vfs read/write operations were passing ppos=&file->f_pos directly to
.read / .write file_operations methods. That changed in 2004 in 55f09ec0087c
("read/write: pass down a copy of f_pos, not f_pos itself.") which started to
pass ppos=&local_var trying to avoid simultaneous read/write/lseek stepping
onto each other toes and overwriting file->f_pos racily. That measure was not
complete and in 2014 commit 9c225f2655e36 ("vfs: atomic f_pos accesses as per
POSIX") added file->f_pos_lock to completely disable simultaneous
read/write/lseek runs. After f_pos_lock was introduced the reason to avoid
passing ppos=&file->f_pos directly due to concurrency vanished.

Linus explains[1]:

	In fact, we *used* to (long ago) pass in the address of "file->f_pos"
	itself to the low-level read/write routines. We then changed it to do
	that indirection through a local copy of pos (and
	file_pos_read/file_pos_write) because we didn't do the proper locking,
	so different read/write versions could mess with each other (and with
	lseek).

	But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos
	accesses as per POSIX") did was to add the proper locking at least for
	the cases that we care about deeply, so we *could* say that we have
	three cases:

	 - FMODE_ATOMIC_POS: properly locked,
	 - FMODE_STREAM: no pos at all
	 - otherwise a "mostly don't care - don't mix!"

	and so we could go back to not copying the pos at all, and instead do
	something like

	        loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
	        ret = vfs_write(f.file, buf, count, ppos);

	and perhaps have a long-term plan to try to get rid of the "don't mix"
	case entirely (ie "if you use f_pos, then we'll do the proper
	locking")

	(The above is obviously surrounded by the fdget_pos()/fdput_pos() that
	implements the locking decision).

Currently for regular files we always set FMODE_ATOMIC_POS and change that to
FMODE_STREAM if stream_open is used explicitly on open. That leaves other
files, like e.g. sockets and pipes, for "mostly don't care - don't mix!" case.

Sockets, for example, always check that on read/write the initial pos they
receive is 0 and don't update it. And if it is !0 they return -ESPIPE.
That suggests that we can do the switch into passing &file->f_pos directly now
and incrementally convert to FMODE_STREAM files that were doing the stream-like
checking manually in their low-level .read/.write handlers.

Note: it is theoretically possible that a driver updates *ppos inside
even if read/write returns error. For such cases the conversion will
change IO semantic a bit. The semantic that is changing here was
introduced in 2013 in commit 5faf153ebf61 "don't call file_pos_write()
if vfs_{read,write}{,v}() fails".

[1] https://lore.kernel.org/linux-fsdevel/CAHk-=whJtZt52SnhBGrNMnuxFn3GE9X_e02x8BPxtkqrfyZukw@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 fs/read_write.c | 36 ++++--------------------------------
 1 file changed, 4 insertions(+), 32 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index d62556be6848..13550b65cb2c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -571,14 +571,7 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_read(f.file, buf, count, ppos);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_read(f.file, buf, count, file_ppos(f.file));
 		fdput_pos(f);
 	}
 	return ret;
@@ -595,14 +588,7 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_write(f.file, buf, count, ppos);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_write(f.file, buf, count, file_ppos(f.file));
 		fdput_pos(f);
 	}
 
@@ -1018,14 +1004,7 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_readv(f.file, vec, vlen, ppos, flags);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_readv(f.file, vec, vlen, file_ppos(f.file), flags);
 		fdput_pos(f);
 	}
 
@@ -1042,14 +1021,7 @@ static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_writev(f.file, vec, vlen, ppos, flags);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_writev(f.file, vec, vlen, file_ppos(f.file), flags);
 		fdput_pos(f);
 	}
 
-- 
2.21.0.593.g511ec345e1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2] vfs: use &file->f_pos directly on files that have position
@ 2019-04-13 16:55                 ` Kirill Smelkov
  0 siblings, 0 replies; 21+ messages in thread
From: Kirill Smelkov @ 2019-04-13 16:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Arnd Bergmann, Christoph Hellwig, Greg Kroah-Hartman,
	linux-fsdevel, linux-kernel, Kirill Smelkov

Long ago vfs read/write operations were passing ppos=&file->f_pos directly to
.read / .write file_operations methods. That changed in 2004 in 55f09ec0087c
("read/write: pass down a copy of f_pos, not f_pos itself.") which started to
pass ppos=&local_var trying to avoid simultaneous read/write/lseek stepping
onto each other toes and overwriting file->f_pos racily. That measure was not
complete and in 2014 commit 9c225f2655e36 ("vfs: atomic f_pos accesses as per
POSIX") added file->f_pos_lock to completely disable simultaneous
read/write/lseek runs. After f_pos_lock was introduced the reason to avoid
passing ppos=&file->f_pos directly due to concurrency vanished.

Linus explains[1]:

	In fact, we *used* to (long ago) pass in the address of "file->f_pos"
	itself to the low-level read/write routines. We then changed it to do
	that indirection through a local copy of pos (and
	file_pos_read/file_pos_write) because we didn't do the proper locking,
	so different read/write versions could mess with each other (and with
	lseek).

	But one of the things that commit 9c225f2655e36 ("vfs: atomic f_pos
	accesses as per POSIX") did was to add the proper locking at least for
	the cases that we care about deeply, so we *could* say that we have
	three cases:

	 - FMODE_ATOMIC_POS: properly locked,
	 - FMODE_STREAM: no pos at all
	 - otherwise a "mostly don't care - don't mix!"

	and so we could go back to not copying the pos at all, and instead do
	something like

	        loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
	        ret = vfs_write(f.file, buf, count, ppos);

	and perhaps have a long-term plan to try to get rid of the "don't mix"
	case entirely (ie "if you use f_pos, then we'll do the proper
	locking")

	(The above is obviously surrounded by the fdget_pos()/fdput_pos() that
	implements the locking decision).

Currently for regular files we always set FMODE_ATOMIC_POS and change that to
FMODE_STREAM if stream_open is used explicitly on open. That leaves other
files, like e.g. sockets and pipes, for "mostly don't care - don't mix!" case.

Sockets, for example, always check that on read/write the initial pos they
receive is 0 and don't update it. And if it is !0 they return -ESPIPE.
That suggests that we can do the switch into passing &file->f_pos directly now
and incrementally convert to FMODE_STREAM files that were doing the stream-like
checking manually in their low-level .read/.write handlers.

Note: it is theoretically possible that a driver updates *ppos inside
even if read/write returns error. For such cases the conversion will
change IO semantic a bit. The semantic that is changing here was
introduced in 2013 in commit 5faf153ebf61 "don't call file_pos_write()
if vfs_{read,write}{,v}() fails".

[1] https://lore.kernel.org/linux-fsdevel/CAHk-=whJtZt52SnhBGrNMnuxFn3GE9X_e02x8BPxtkqrfyZukw@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 fs/read_write.c | 36 ++++--------------------------------
 1 file changed, 4 insertions(+), 32 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index d62556be6848..13550b65cb2c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -571,14 +571,7 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_read(f.file, buf, count, ppos);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_read(f.file, buf, count, file_ppos(f.file));
 		fdput_pos(f);
 	}
 	return ret;
@@ -595,14 +588,7 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_write(f.file, buf, count, ppos);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_write(f.file, buf, count, file_ppos(f.file));
 		fdput_pos(f);
 	}
 
@@ -1018,14 +1004,7 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_readv(f.file, vec, vlen, ppos, flags);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_readv(f.file, vec, vlen, file_ppos(f.file), flags);
 		fdput_pos(f);
 	}
 
@@ -1042,14 +1021,7 @@ static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos, *ppos = file_ppos(f.file);
-		if (ppos) {
-			pos = *ppos;
-			ppos = &pos;
-		}
-		ret = vfs_writev(f.file, vec, vlen, ppos, flags);
-		if (ret >= 0 && ppos)
-			f.file->f_pos = pos;
+		ret = vfs_writev(f.file, vec, vlen, file_ppos(f.file), flags);
 		fdput_pos(f);
 	}
 
-- 
2.21.0.593.g511ec345e1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files
  2019-04-13 16:54               ` [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files Kirill Smelkov
@ 2019-04-13 17:27                 ` Linus Torvalds
  2019-04-13 17:38                   ` Al Viro
  2019-04-13 18:44                   ` Kirill Smelkov
  0 siblings, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2019-04-13 17:27 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Al Viro, Arnd Bergmann, Christoph Hellwig, Greg Kroah-Hartman,
	linux-fsdevel, Linux List Kernel Mailing, Rasmus Villemoes

On Sat, Apr 13, 2019 at 9:55 AM Kirill Smelkov <kirr@nexedi.com> wrote:
>
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -371,7 +371,7 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
>         inode = file_inode(file);
>         if (unlikely((ssize_t) count < 0))
>                 return retval;
> -       pos = *ppos;
> +       pos = (ppos ? *ppos : 0);
>         if (unlikely(pos < 0)) {
>                 if (!unsigned_offsets(file))
>                         return retval;

This part looks silly. We should just avoid all the position overflow
games when we don't have a position at all (ie streaming). You can't
overflow what you don't use.

Similarly, you can't use ranged mandatory locking on a stream, so the
mandatory locking thing seems dependent on pos too.

So I think that with a NULL ppos being possible, we should just change
the code to just do all of that conditionally on having a position,
rather than saying that the position of a stream is always 0.

That said, this whole "let's make it possible to not have a position
at all" is a big change, and there's no way I'll apply these before
the 5.2 merge window.

And I'd really like to have people (Al?) look at this and go "yeah,
makes sense". I do think that moving to a model where we wither have a
(properly locked) file position or no pos pointer at all is the right
model (ie I'd really like to get rid of the mixed case), but there
might be some practical problem that makes it impractical.

Because the *real* problem with the mixed case is not "insane people
who do bad things might get odd results". No, the real problem with
the mixed case is that it could be a security issue (ie: one process
intentionally changes the file position just as another process is
going a 'read' and then avoids some limit test because the limit test
was done using the old 'pos' value but the actual IO was done using
the new one).

So I suspect that we will have to either

 - get rid of the mixed case entirely (and do only properly locked
f_pos accesses or pass is a NULL f_pos)

 - continue to support the mixed case, but also continue to support
the nasty temporary 'pos' value with that file_pos_{read,write}()
thing.

IOW, I would not be ok with passing in a shared - and unlocked -
&file->f_pos value to random drivers that *might* do odd things when a
race happens.

                   Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files
  2019-04-13 17:27                 ` Linus Torvalds
@ 2019-04-13 17:38                   ` Al Viro
  2019-04-13 18:44                   ` Kirill Smelkov
  1 sibling, 0 replies; 21+ messages in thread
From: Al Viro @ 2019-04-13 17:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill Smelkov, Arnd Bergmann, Christoph Hellwig,
	Greg Kroah-Hartman, linux-fsdevel, Linux List Kernel Mailing,
	Rasmus Villemoes

On Sat, Apr 13, 2019 at 10:27:00AM -0700, Linus Torvalds wrote:

> And I'd really like to have people (Al?) look at this and go "yeah,
> makes sense". I do think that moving to a model where we wither have a
> (properly locked) file position or no pos pointer at all is the right
> model (ie I'd really like to get rid of the mixed case), but there
> might be some practical problem that makes it impractical.
>
> Because the *real* problem with the mixed case is not "insane people
> who do bad things might get odd results". No, the real problem with
> the mixed case is that it could be a security issue (ie: one process
> intentionally changes the file position just as another process is
> going a 'read' and then avoids some limit test because the limit test
> was done using the old 'pos' value but the actual IO was done using
> the new one).
> 
> So I suspect that we will have to either
> 
>  - get rid of the mixed case entirely (and do only properly locked
> f_pos accesses or pass is a NULL f_pos)

Good luck.  Character devices
	get no exclusion
	often do use position
	are many
	might be buried behind several layers of indirection
	... and often left unmaintained for a decade or two

IOW, I don't see how you'd go about eliminating the mixed case...

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files
  2019-04-13 17:27                 ` Linus Torvalds
  2019-04-13 17:38                   ` Al Viro
@ 2019-04-13 18:44                   ` Kirill Smelkov
  1 sibling, 0 replies; 21+ messages in thread
From: Kirill Smelkov @ 2019-04-13 18:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Arnd Bergmann, Christoph Hellwig, Greg Kroah-Hartman,
	linux-fsdevel, Linux List Kernel Mailing, Rasmus Villemoes

On Sat, Apr 13, 2019 at 10:27:00AM -0700, Linus Torvalds wrote:
> On Sat, Apr 13, 2019 at 9:55 AM Kirill Smelkov <kirr@nexedi.com> wrote:
> >
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -371,7 +371,7 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
> >         inode = file_inode(file);
> >         if (unlikely((ssize_t) count < 0))
> >                 return retval;
> > -       pos = *ppos;
> > +       pos = (ppos ? *ppos : 0);
> >         if (unlikely(pos < 0)) {
> >                 if (!unsigned_offsets(file))
> >                         return retval;
> 
> This part looks silly. We should just avoid all the position overflow
> games when we don't have a position at all (ie streaming). You can't
> overflow what you don't use.
> 
> Similarly, you can't use ranged mandatory locking on a stream, so the
> mandatory locking thing seems dependent on pos too.
> 
> So I think that with a NULL ppos being possible, we should just change
> the code to just do all of that conditionally on having a position,
> rather than saying that the position of a stream is always 0.

Linus, thanks for feedback. After checking a bit of history of
rw_verify_area and seeing what it does I agree with you. I was initially
confused because rw_verify_area calls security_file_permission() at its
tail and so I thought we cannot skip rw_verify_area for !ppos case.
After studying a bit I see now that there are several things going on
and indeed moving ranged locks checking into under `if ppos!=NULL` makes
sense. This code is new to me; apologize for not getting it right
initially.

> That said, this whole "let's make it possible to not have a position
> at all" is a big change, and there's no way I'll apply these before
> the 5.2 merge window.

Ok.

> And I'd really like to have people (Al?) look at this and go "yeah,
> makes sense". I do think that moving to a model where we wither have a
> (properly locked) file position or no pos pointer at all is the right
> model (ie I'd really like to get rid of the mixed case), but there
> might be some practical problem that makes it impractical.
> 
> Because the *real* problem with the mixed case is not "insane people
> who do bad things might get odd results". No, the real problem with
> the mixed case is that it could be a security issue (ie: one process
> intentionally changes the file position just as another process is
> going a 'read' and then avoids some limit test because the limit test
> was done using the old 'pos' value but the actual IO was done using
> the new one).
> 
> So I suspect that we will have to either
> 
>  - get rid of the mixed case entirely (and do only properly locked
> f_pos accesses or pass is a NULL f_pos)
> 
>  - continue to support the mixed case, but also continue to support
> the nasty temporary 'pos' value with that file_pos_{read,write}()
> thing.
> 
> IOW, I would not be ok with passing in a shared - and unlocked -
> &file->f_pos value to random drivers that *might* do odd things when a
> race happens.

I see. That's why I splitted the whole change into 2: the first patch
only adds ppos=NULL semantic for FMODE_STREAM and does not change to
passing &file->f_pos directly. There are signs that people can be
starting to use stream_open even now - before we do mass conversion.
That's why it is better to fix ppos semantic for stream case early.
You said it won't happen before next merge window - ok, but ppos=NULL
change is separate and does not depend on passing &file->f_pos directly.
So let's please apply ppos=NULL patch early when merge window begins and
independently of what happens with whether we'll switch to direct usage
of &file->f_pos.

Looking forward to get review from others too.

And I appreciate if people could help at least somehow with "getting rid
of mixed case entirely" (i.e. always lock f_pos_lock on !FMODE_STREAM),
because this transition starts to diverge from my particular use-case
too far. To me it makes sense to do that transition as follows:

- convert nonseekable_open -> stream_open via stream_open.cocci;
- audit other nonseekable_open calls and convert left users that truly
  don't depend on position to stream_open;
- extend stream_open.cocci to analyze alloc_file_pseudo as well (this
  will cover pipes and sockets), or maybe convert pipes and sockets to
  FMODE_STREAM manually;
- extend stream_open.cocci to analyze file_operations that use no_llseek
  or noop_llseek, but do not use nonseekable_open or alloc_file_pseudo.
  This might find files that have stream semantic but are opened
  differently;
- extend stream_open.cocci to analyze file_operations whose .read/.write
  do not use ppos at all (independently of how file was opened);
- ...
- after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
  !FMODE_STREAM;
- gather bug reports for deadlocked read/write and convert missed cases
  to FMODE_STREAM, probably extending stream_open.cocci along the road
  to catch similar cases.

i.e. always take f_pos_lock unless a file is explicitly marked as being
stream, and try to find and cover all files that are streams.

Updated patch for ppos=NULL for stream case with rw_verify_area change
fixed is below.

Kirill

---- 8<----
From 3df9bc6c8ec58baec37bf413e2c0cef07308988f Mon Sep 17 00:00:00 2001
From: Kirill Smelkov <kirr@nexedi.com>
Date: Fri, 12 Apr 2019 12:31:57 +0300
Subject: [PATCH v2 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files

This amends commit 10dce8af3422 ("fs: stream_open - opener for
stream-like files so that read and write can run simultaneously without
deadlock") in how position is passed into .read()/.write() handler for
stream-like files:

Rasmus noticed that we currently pass 0 as position and ignore any position
change if that is done by a file implementation. This papers over bugs if ppos
is used in files that declare themselves as being stream-like as such bugs will
go unnoticed. Even if a file implementation is correctly converted into using
stream_open, its read/write later could be changed to use ppos and even though
that won't be working correctly, that bug might go unnoticed without someone
doing wrong behaviour analysis. It is thus better to pass ppos=NULL into
read/write for stream-like files as that don't give any chance for ppos usage
bugs because it will oops if ppos is ever used inside .read() or .write().

Note 1: rw_verify_area, new_sync_{read,write} needs to be updated
because they are called by vfs_read/vfs_write & friends before
file_operations .read/.write .

Note 2: if file backend uses new-style .read_iter/.write_iter, position
is still passed into there as non-pointer kiocb.ki_pos . Currently
stream_open.cocci (semantic patch added by 10dce8af3422) ignores files
whose file_operations has *_iter methods.

Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---

v2: fixed ppos handling in rw_verify_area per Linus comment.

 fs/open.c       |   5 ++-
 fs/read_write.c | 113 ++++++++++++++++++++++++++++--------------------
 2 files changed, 70 insertions(+), 48 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index a00350018a47..9c7d724a6f67 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1219,8 +1219,9 @@ EXPORT_SYMBOL(nonseekable_open);
 /*
  * stream_open is used by subsystems that want stream-like file descriptors.
  * Such file descriptors are not seekable and don't have notion of position
- * (file.f_pos is always 0). Contrary to file descriptors of other regular
- * files, .read() and .write() can run simultaneously.
+ * (file.f_pos is always 0 and ppos passed to .read()/.write() is always NULL).
+ * Contrary to file descriptors of other regular files, .read() and .write()
+ * can run simultaneously.
  *
  * stream_open never fails and is marked to return int so that it could be
  * directly used as file_operations.open .
diff --git a/fs/read_write.c b/fs/read_write.c
index 61b43ad7608e..c543d965e288 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -365,29 +365,37 @@ SYSCALL_DEFINE5(llseek, unsigned int, fd, unsigned long, offset_high,
 int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t count)
 {
 	struct inode *inode;
-	loff_t pos;
 	int retval = -EINVAL;
 
 	inode = file_inode(file);
 	if (unlikely((ssize_t) count < 0))
 		return retval;
-	pos = *ppos;
-	if (unlikely(pos < 0)) {
-		if (!unsigned_offsets(file))
-			return retval;
-		if (count >= -pos) /* both values are in 0..LLONG_MAX */
-			return -EOVERFLOW;
-	} else if (unlikely((loff_t) (pos + count) < 0)) {
-		if (!unsigned_offsets(file))
-			return retval;
-	}
 
-	if (unlikely(inode->i_flctx && mandatory_lock(inode))) {
-		retval = locks_mandatory_area(inode, file, pos, pos + count - 1,
-				read_write == READ ? F_RDLCK : F_WRLCK);
-		if (retval < 0)
-			return retval;
+	/*
+	 * ranged mandatory locking does not apply to streams - it makes sense
+	 * only for files where position has a meaning.
+	 */
+	if (ppos) {
+		loff_t pos = *ppos;
+
+		if (unlikely(pos < 0)) {
+			if (!unsigned_offsets(file))
+				return retval;
+			if (count >= -pos) /* both values are in 0..LLONG_MAX */
+				return -EOVERFLOW;
+		} else if (unlikely((loff_t) (pos + count) < 0)) {
+			if (!unsigned_offsets(file))
+				return retval;
+		}
+
+		if (unlikely(inode->i_flctx && mandatory_lock(inode))) {
+			retval = locks_mandatory_area(inode, file, pos, pos + count - 1,
+					read_write == READ ? F_RDLCK : F_WRLCK);
+			if (retval < 0)
+				return retval;
+		}
 	}
+
 	return security_file_permission(file,
 				read_write == READ ? MAY_READ : MAY_WRITE);
 }
@@ -400,12 +408,13 @@ static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, lo
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 	iov_iter_init(&iter, READ, &iov, 1, len);
 
 	ret = call_read_iter(filp, &kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	*ppos = kiocb.ki_pos;
+	if (ppos)
+		*ppos = kiocb.ki_pos;
 	return ret;
 }
 
@@ -468,12 +477,12 @@ static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 	iov_iter_init(&iter, WRITE, &iov, 1, len);
 
 	ret = call_write_iter(filp, &kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	if (ret > 0)
+	if (ret > 0 && ppos)
 		*ppos = kiocb.ki_pos;
 	return ret;
 }
@@ -558,15 +567,10 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 	return ret;
 }
 
-static inline loff_t file_pos_read(struct file *file)
-{
-	return file->f_mode & FMODE_STREAM ? 0 : file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
+/* file_ppos returns &file->f_pos or NULL if file is stream */
+static inline loff_t *file_ppos(struct file *file)
 {
-	if ((file->f_mode & FMODE_STREAM) == 0)
-		file->f_pos = pos;
+	return file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
 }
 
 ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
@@ -575,10 +579,14 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_read(f.file, buf, count, &pos);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_read(f.file, buf, count, ppos);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 	return ret;
@@ -595,10 +603,14 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_write(f.file, buf, count, &pos);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_write(f.file, buf, count, ppos);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
@@ -673,14 +685,15 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
 	ret = kiocb_set_rw_flags(&kiocb, flags);
 	if (ret)
 		return ret;
-	kiocb.ki_pos = *ppos;
+	kiocb.ki_pos = (ppos ? *ppos : 0);
 
 	if (type == READ)
 		ret = call_read_iter(filp, &kiocb, iter);
 	else
 		ret = call_write_iter(filp, &kiocb, iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-	*ppos = kiocb.ki_pos;
+	if (ppos)
+		*ppos = kiocb.ki_pos;
 	return ret;
 }
 
@@ -1013,10 +1026,14 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos, flags);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_readv(f.file, vec, vlen, ppos, flags);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
@@ -1033,10 +1050,14 @@ static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
-		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos, flags);
-		if (ret >= 0)
-			file_pos_write(f.file, pos);
+		loff_t pos, *ppos = file_ppos(f.file);
+		if (ppos) {
+			pos = *ppos;
+			ppos = &pos;
+		}
+		ret = vfs_writev(f.file, vec, vlen, ppos, flags);
+		if (ret >= 0 && ppos)
+			f.file->f_pos = pos;
 		fdput_pos(f);
 	}
 
-- 
2.21.0.593.g511ec345e1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
  2019-04-08  0:09     ` Linus Torvalds
@ 2019-04-14  7:11       ` Kirill Smelkov
  0 siblings, 0 replies; 21+ messages in thread
From: Kirill Smelkov @ 2019-04-14  7:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, linux-fsdevel, Linux List Kernel Mailing,
	Michael Kerrisk, Yongzhi Pan, Jonathan Corbet, David Vrabel,
	Juergen Gross, Miklos Szeredi, Tejun Heo, Kirill Tkhai,
	Arnd Bergmann, Christoph Hellwig, Greg Kroah-Hartman,
	Julia Lawall, Nikolaus Rath, Han-Wen Nienhuys

On Sun, Apr 07, 2019 at 02:09:08PM -1000, Linus Torvalds wrote:
> On Sun, Apr 7, 2019 at 10:04 AM Kirill Smelkov <kirr@nexedi.com> wrote:
> >
> > Fixing regression on FUSE side is my reason to do this whole work -
> > that's why I care about it the most and ask.
> 
> Yeah, we can do the actual FUSE fix, I think. Preferably through the
> FUSE tree. Miklos?

Linus, thanks for feedback.

I'm not sure what is going on, but I'm afraid we can wait very long
here. It's been already 1 week, and I was also struggling to get
feedback from Miklos on my FUSE patches starting from end of February
without any luck:

https://lore.kernel.org/linux-fsdevel/cover.1553680185.git.kirr@nexedi.com/
	https://lore.kernel.org/linux-fsdevel/12f7d0d98555ee0d174d04bb47644f65c07f035a.1553680185.git.kirr@nexedi.com/
	https://lore.kernel.org/linux-fsdevel/d74b17b9d33c3dcc7a1f2fa2914fb3c4e7cda127.1553680185.git.kirr@nexedi.com/

https://lore.kernel.org/linux-fsdevel/cover.1553677194.git.kirr@nexedi.com/
	https://lore.kernel.org/linux-fsdevel/d916d401a80e9834c95970d86ca71c0154e988a6.1553677194.git.kirr@nexedi.com/
	https://lore.kernel.org/linux-fsdevel/e0b43507976d6ea9010f1bacaef067f18de49f1f.1553677194.git.kirr@nexedi.com/

https://lore.kernel.org/linux-fsdevel/dc47c061f20c464ccf46b43822b062dca6486e90.1553637462.git.kirr@nexedi.com/

I'm not sure what to do with my FUSE patches. It feels like knocking
into closed door...

Kirill

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RESEND, PATCH 3/3] fuse: Add FOPEN_STREAM and use stream_open() if filesystem returned that from open handler
  2019-03-26 23:22 ` [PATCH 3/3] fuse: Add FOPEN_STREAM and use stream_open() if filesystem returned that from open handler Kirill Smelkov
@ 2019-04-24  7:13   ` Kirill Smelkov
       [not found]     ` <20190424160611.2A71321900@mail.kernel.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Kirill Smelkov @ 2019-04-24  7:13 UTC (permalink / raw)
  To: Miklos Szeredi, Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, Kirill Smelkov, Al Viro,
	Linus Torvalds, Michael Kerrisk, Yongzhi Pan, Jonathan Corbet,
	David Vrabel, Juergen Gross, Tejun Heo, Kirill Tkhai,
	Arnd Bergmann, Christoph Hellwig, Greg Kroah-Hartman,
	Julia Lawall, Nikolaus Rath, Han-Wen Nienhuys, stable

Starting from 9c225f2655 (vfs: atomic f_pos accesses as per POSIX) files
opened even via nonseekable_open gate read and write via lock and do not
allow them to be run simultaneously. This can create read vs write
deadlock if a filesystem is trying to implement a socket-like file which
is intended to be simultaneously used for both read and write from
filesystem client. See 10dce8af3422 ("fs: stream_open - opener for
stream-like files so that read and write can run simultaneously without
deadlock") for details and e.g. 581d21a2d0 (xenbus: fix deadlock on
writes to /proc/xen/xenbus) for a similar deadlock example on /proc/xen/xenbus.

To avoid such deadlock it was tempting to adjust fuse_finish_open to use
stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags,
but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
and in particular GVFS which actually uses offset in its read and write
handlers

	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481

so if we would do such a change it will break a real user.

-> Add another flag (FOPEN_STREAM) for filesystem servers to indicate
that the opened handler is having stream-like semantics; does not use
file position and thus the kernel is free to issue simultaneous read and
write request on opened file handle.

This patch together with stream_open (10dce8af3422) should be added to
stable kernels starting from v3.14+ (the kernel where 9c225f2655 first
appeared). This will allow to patch OSSPD and other FUSE filesystems that
provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE in
open handler and this way avoid the deadlock on all kernel versions. This
should work because fuse_finish_open ignores unknown open flags returned
from a filesystem and so passing FOPEN_STREAM to a kernel that is not
aware of this flag cannot hurt. In turn the kernel that is not aware of
FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to
implement streams without read vs write deadlock.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Yongzhi Pan <panyongzhi@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Nikolaus Rath <Nikolaus@rath.org>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: stable@vger.kernel.org # v3.14+
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---

( resending the same patch with updated description to reference stream_open
  that was landed to master as 10dce8af3422; also added cc stable )

 fs/fuse/file.c            | 4 +++-
 include/uapi/linux/fuse.h | 2 ++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 06096b60f1df..44de96cb7871 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -178,7 +178,9 @@ void fuse_finish_open(struct inode *inode, struct file *file)
 
 	if (!(ff->open_flags & FOPEN_KEEP_CACHE))
 		invalidate_inode_pages2(inode->i_mapping);
-	if (ff->open_flags & FOPEN_NONSEEKABLE)
+	if (ff->open_flags & FOPEN_STREAM)
+		stream_open(inode, file);
+	else if (ff->open_flags & FOPEN_NONSEEKABLE)
 		nonseekable_open(inode, file);
 	if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) {
 		struct fuse_inode *fi = get_fuse_inode(inode);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 2ac598614a8f..26abf0a571c7 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -229,11 +229,13 @@ struct fuse_file_lock {
  * FOPEN_KEEP_CACHE: don't invalidate the data cache on open
  * FOPEN_NONSEEKABLE: the file is not seekable
  * FOPEN_CACHE_DIR: allow caching this directory
+ * FOPEN_STREAM: the file is stream-like
  */
 #define FOPEN_DIRECT_IO		(1 << 0)
 #define FOPEN_KEEP_CACHE	(1 << 1)
 #define FOPEN_NONSEEKABLE	(1 << 2)
 #define FOPEN_CACHE_DIR		(1 << 3)
+#define FOPEN_STREAM		(1 << 4)
 
 /**
  * INIT request/reply flags
-- 
2.21.0.765.geec228f530

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RESEND, PATCH 3/3] fuse: Add FOPEN_STREAM and use stream_open() if filesystem returned that from open handler
       [not found]     ` <20190424160611.2A71321900@mail.kernel.org>
@ 2019-04-24 19:16       ` Kirill Smelkov
  0 siblings, 0 replies; 21+ messages in thread
From: Kirill Smelkov @ 2019-04-24 19:16 UTC (permalink / raw)
  To: Sasha Levin, Greg Kroah-Hartman
  Cc: Miklos Szeredi, linux-fsdevel, linux-kernel, Al Viro,
	Linus Torvalds, Michael Kerrisk, Yongzhi Pan, Jonathan Corbet,
	David Vrabel, Juergen Gross, Tejun Heo, Kirill Tkhai,
	Arnd Bergmann, Christoph Hellwig, Julia Lawall, Nikolaus Rath,
	Han-Wen Nienhuys, stable

Hello up there,

On Wed, Apr 24, 2019 at 04:06:10PM +0000, Sasha Levin wrote:
> Hi,
>
> [This is an automated email]
>
> This commit has been processed because it contains a -stable tag.
> The stable tag indicates that it's relevant for the following trees: 3.14+
>
> The bot has tested the following trees: v5.0.9, v4.19.36, v4.14.113, v4.9.170, v4.4.178, v3.18.138.
>
> v5.0.9: Build failed! Errors:
>     fs/fuse/file.c:185:3: error: implicit declaration of function ‘stream_open’; did you mean ‘seq_open’? [-Werror=implicit-function-declaration]

This patch needs "fs: stream_open - opener for stream-like files so that
read and write can run simultaneously without deadlock" (10dce8af3422)
as its dependency. It documents so in its commit message. That base
dependency patch is being discussed here in stable context:

https://lore.kernel.org/linux-fsdevel/20190424183012.GB3798@deco.navytux.spb.ru/


> v4.19.36: Failed to apply! Possible dependencies:
>     6433b8998a21 ("fuse: add FOPEN_CACHE_DIR")
>     88bc7d5097a1 ("fuse: add support for copy_file_range()")

"fuse: add FOPEN_CACHE_DIR" added another nearby constant. The conflict
with that patch should be trivially resolvable (just add FOPEN_STREAM
irregardless that there is no FOPEN_CACHE_DIR in context).

> v4.14.113: Failed to apply! Possible dependencies:
>     3b7008b226f3 ("fuse: return -ECONNABORTED on /dev/fuse read after abort")
>     6433b8998a21 ("fuse: add FOPEN_CACHE_DIR")
>     88bc7d5097a1 ("fuse: add support for copy_file_range()")

same.

> v4.9.170: Failed to apply! Possible dependencies:
>     3b7008b226f3 ("fuse: return -ECONNABORTED on /dev/fuse read after abort")
>     6433b8998a21 ("fuse: add FOPEN_CACHE_DIR")
>     88bc7d5097a1 ("fuse: add support for copy_file_range()")

----//----

> v4.4.178: Failed to apply! Possible dependencies:
>     29433a2991fa ("fuse: get rid of fc->flags")
>     3767e255b390 ("switch ->setxattr() to passing dentry and inode separately")
>     60bcc88ad185 ("fuse: Add posix ACL support")
>     6192269444eb ("introduce a parallel variant of ->iterate()")
>     6433b8998a21 ("fuse: add FOPEN_CACHE_DIR")
>     6b2553918d8b ("replace ->follow_link() with new method that could stay in RCU mode")
>     703c73629f93 ("fuse: Use generic xattr ops")
>     84e710da2a1d ("parallel lookups machinery, part 2")
>     88bc7d5097a1 ("fuse: add support for copy_file_range()")
>     9902af79c01a ("parallel lookups: actual switch to rwsem")
>     9cf843e3f47c ("lookup_open(): lock the parent shared unless O_CREAT is given")
>     aa80deab33a8 ("namei: page_getlink() and page_follow_link_light() are the same thing")
>     cd3417c8fc95 ("kill free_page_put_link()")
>     ce23e6401334 ("->getxattr(): pass dentry and inode as separate arguments")
>     fceef393a538 ("switch ->get_link() to delayed_call, kill ->put_link()")

similar

> v3.18.138: Failed to apply! Possible dependencies:
>     09561a53b50d ("lustre: use %p[dD]")
>     29433a2991fa ("fuse: get rid of fc->flags")
>     2b0143b5c986 ("VFS: normal filesystems (and lustre): d_inode() annotations")
>     3767e255b390 ("switch ->setxattr() to passing dentry and inode separately")
>     60bcc88ad185 ("fuse: Add posix ACL support")
>     6192269444eb ("introduce a parallel variant of ->iterate()")
>     6433b8998a21 ("fuse: add FOPEN_CACHE_DIR")
>     680baacbca69 ("new ->follow_link() and ->put_link() calling conventions")
>     6b2553918d8b ("replace ->follow_link() with new method that could stay in RCU mode")
>     6e77137b363b ("don't pass nameidata to ->follow_link()")
>     703c73629f93 ("fuse: Use generic xattr ops")
>     84e710da2a1d ("parallel lookups machinery, part 2")
>     88bc7d5097a1 ("fuse: add support for copy_file_range()")
>     90e4fc8890da ("9p: don't bother with __getname() in ->follow_link()")
>     9902af79c01a ("parallel lookups: actual switch to rwsem")
>     9cf843e3f47c ("lookup_open(): lock the parent shared unless O_CREAT is given")
>     a06ae8609b3d ("coresight: add CoreSight core layer framework")
>     ce23e6401334 ("->getxattr(): pass dentry and inode as separate arguments")
>     dab363f938a5 ("Merge tag 'staging-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging")
>     fceef393a538 ("switch ->get_link() to delayed_call, kill ->put_link()")

similar

> How should we proceed with this patch?

I have backported this patch to above those stable trees. You can pull
the result from https://lab.nexedi.com/kirr/linux.git, branches:

	fopen_stream-5.0.y
	fopen_stream-4.19.y
	fopen_stream-4.14.y
	fopen_stream-4.4.y
		( actually fixed deadlock on /proc/xen/xenbus as
		  581d21a2d02a was not backported to 4.4 )
	fopen_stream-3.18.y
		( actually fixed deadlock on /proc/xen/xenbus as
		  581d21a2d02a was not backported to 3.18 )


Hope it helps a bit,

Kirill

P.S.

The fact that 4.4 and 3.18 versions of stream_open patch had to
resolve xenbus conflict in a way that actually fixes /proc/xen/xenbus
deadlock (introduced in 3.14) suggests that deadlock error messages
produced by stream_open.cocci should indeed be considered by relevant
maintainers including stable team...


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-04-24 19:17 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-26 22:20 [PATCH 1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock Kirill Smelkov
2019-03-26 22:20 ` Kirill Smelkov
2019-03-26 23:22 ` [PATCH 3/3] fuse: Add FOPEN_STREAM and use stream_open() if filesystem returned that from open handler Kirill Smelkov
2019-04-24  7:13   ` [RESEND, PATCH " Kirill Smelkov
     [not found]     ` <20190424160611.2A71321900@mail.kernel.org>
2019-04-24 19:16       ` Kirill Smelkov
     [not found] ` <8794193f3040b798010970228d978c05ad56ec52.1553637462.git.kirr@nexedi.com>
2019-03-27  6:54   ` [PATCH 2/3] *: convert stream-like files from nonseekable_open -> stream_open Lubomir Rintel
2019-03-27 16:58 ` [PATCH 1/3] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock Juergen Gross
2019-04-06 17:07 ` Linus Torvalds
2019-04-07 20:04   ` Kirill Smelkov
2019-04-08  0:09     ` Linus Torvalds
2019-04-14  7:11       ` Kirill Smelkov
     [not found] ` <4c4651e2-167e-bfcc-7b3e-cda118f98a69@rasmusvillemoes.dk>
     [not found]   ` <20190409203807.GA13855@deco.navytux.spb.ru>
     [not found]     ` <d8c23d05-8810-13a2-cc50-7a47ff35e90b@rasmusvillemoes.dk>
2019-04-11 12:38       ` Kirill Smelkov
2019-04-11 16:22         ` Linus Torvalds
2019-04-12 12:42           ` Kirill Smelkov
2019-04-13 16:54             ` Kirill Smelkov
2019-04-13 16:54               ` [PATCH 1/2] vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files Kirill Smelkov
2019-04-13 17:27                 ` Linus Torvalds
2019-04-13 17:38                   ` Al Viro
2019-04-13 18:44                   ` Kirill Smelkov
2019-04-13 16:55               ` [PATCH 2/2] vfs: use &file->f_pos directly on files that have position Kirill Smelkov
2019-04-13 16:55                 ` Kirill Smelkov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.