linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: Sasha Levin <sashal@kernel.org>
Cc: linux-kernel@vger.kernel.org, stable@vger.kernel.org,
	Kirill Smelkov <kirr@nexedi.com>,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	Yongzhi Pan <panyongzhi@gmail.com>,
	Jonathan Corbet <corbet@lwn.net>,
	David Vrabel <david.vrabel@citrix.com>,
	Juergen Gross <jgross@suse.com>,
	Miklos Szeredi <miklos@szeredi.hu>, Tejun Heo <tj@kernel.org>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	Arnd Bergmann <arnd@arndb.de>, Christoph Hellwig <hch@lst.de>,
	Julia Lawall <Julia.Lawall@lip6.fr>,
	Nikolaus Rath <Nikolaus@rath.org>,
	Han-Wen Nienhuys <hanwen@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH AUTOSEL 5.0 59/66] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
Date: Wed, 24 Apr 2019 18:34:15 +0200	[thread overview]
Message-ID: <20190424163415.GB21413@kroah.com> (raw)
In-Reply-To: <20190424143341.27665-59-sashal@kernel.org>

On Wed, Apr 24, 2019 at 10:33:33AM -0400, Sasha Levin wrote:
> From: Kirill Smelkov <kirr@nexedi.com>
> 
> [ Upstream commit 10dce8af34226d90fa56746a934f8da5dcdba3df ]
> 
> Commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") added
> locking for file.f_pos access and in particular made concurrent read and
> write not possible - now both those functions take f_pos lock for the
> whole run, and so if e.g. a read is blocked waiting for data, write will
> deadlock waiting for that read to complete.
> 
> This caused regression for stream-like files where previously read and
> write could run simultaneously, but after that patch could not do so
> anymore. See e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes
> to /proc/xen/xenbus") which fixes such regression for particular case of
> /proc/xen/xenbus.
> 
> The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
> safety for read/write/lseek and added the locking to file descriptors of
> all regular files. In 2014 that thread-safety problem was not new as it
> was already discussed earlier in 2006.
> 
> However even though 2006'th version of Linus's patch was adding f_pos
> locking "only for files that are marked seekable with FMODE_LSEEK (thus
> avoiding the stream-like objects like pipes and sockets)", the 2014
> version - the one that actually made it into the tree as 9c225f2655e3 -
> is doing so irregardless of whether a file is seekable or not.
> 
> See
> 
>     https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
>     https://lwn.net/Articles/180387
>     https://lwn.net/Articles/180396
> 
> for historic context.
> 
> The reason that it did so is, probably, that there are many files that
> are marked non-seekable, but e.g. their read implementation actually
> depends on knowing current position to correctly handle the read. Some
> examples:
> 
> 	kernel/power/user.c		snapshot_read
> 	fs/debugfs/file.c		u32_array_read
> 	fs/fuse/control.c		fuse_conn_waiting_read + ...
> 	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
> 	arch/s390/hypfs/inode.c		hypfs_read_iter
> 	...
> 
> Despite that, many nonseekable_open users implement read and write with
> pure stream semantics - they don't depend on passed ppos at all. And for
> those cases where read could wait for something inside, it creates a
> situation similar to xenbus - the write could be never made to go until
> read is done, and read is waiting for some, potentially external, event,
> for potentially unbounded time -> deadlock.
> 
> Besides xenbus, there are 14 such places in the kernel that I've found
> with semantic patch (see below):
> 
> 	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
> 	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
> 	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
> 	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
> 	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
> 	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
> 	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
> 	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
> 	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
> 	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
> 	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
> 	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
> 	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
> 	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
> 
> In addition to the cases above another regression caused by f_pos
> locking is that now FUSE filesystems that implement open with
> FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
> stream-like files - for the same reason as above e.g. read can deadlock
> write locking on file.f_pos in the kernel.
> 
> FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f715 ("fuse:
> implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
> in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
> write routines not depending on current position at all, and with both
> read and write being potentially blocking operations:
> 
> See
> 
>     https://github.com/libfuse/osspd
>     https://lwn.net/Articles/308445
> 
>     https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
>     https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
>     https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
> 
> Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
> "somewhat pipe-like files ..." with read handler not using offset.
> However that test implements only read without write and cannot exercise
> the deadlock scenario:
> 
>     https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
>     https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
>     https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
> 
> I've actually hit the read vs write deadlock for real while implementing
> my FUSE filesystem where there is /head/watch file, for which open
> creates separate bidirectional socket-like stream in between filesystem
> and its user with both read and write being later performed
> simultaneously. And there it is semantically not easy to split the
> stream into two separate read-only and write-only channels:
> 
>     https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
> 
> Let's fix this regression. The plan is:
> 
> 1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
>    doing so would break many in-kernel nonseekable_open users which
>    actually use ppos in read/write handlers.
> 
> 2. Add stream_open() to kernel to open stream-like non-seekable file
>    descriptors. Read and write on such file descriptors would never use
>    nor change ppos. And with that property on stream-like files read and
>    write will be running without taking f_pos lock - i.e. read and write
>    could be running simultaneously.
> 
> 3. With semantic patch search and convert to stream_open all in-kernel
>    nonseekable_open users for which read and write actually do not
>    depend on ppos and where there is no other methods in file_operations
>    which assume @offset access.
> 
> 4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
>    steam_open if that bit is present in filesystem open reply.
> 
>    It was tempting to change fs/fuse/ open handler to use stream_open
>    instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
>    grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
>    and in particular GVFS which actually uses offset in its read and
>    write handlers
> 
> 	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
> 	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
> 	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
> 	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
> 
>    so if we would do such a change it will break a real user.
> 
> 5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
>    from v3.14+ (the kernel where 9c225f2655 first appeared).
> 
>    This will allow to patch OSSPD and other FUSE filesystems that
>    provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
>    in their open handler and this way avoid the deadlock on all kernel
>    versions. This should work because fs/fuse/ ignores unknown open
>    flags returned from a filesystem and so passing FOPEN_STREAM to a
>    kernel that is not aware of this flag cannot hurt. In turn the kernel
>    that is not aware of FOPEN_STREAM will be < v3.14 where just
>    FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
>    write deadlock.
> 
> This patch adds stream_open, converts /proc/xen/xenbus to it and adds
> semantic patch to automatically locate in-kernel places that are either
> required to be converted due to read vs write deadlock, or that are just
> safe to be converted because read and write do not use ppos and there
> are no other funky methods in file_operations.
> 
> Regarding semantic patch I've verified each generated change manually -
> that it is correct to convert - and each other nonseekable_open instance
> left - that it is either not correct to convert there, or that it is not
> converted due to current stream_open.cocci limitations.
> 
> The script also does not convert files that should be valid to convert,
> but that currently have .llseek = noop_llseek or generic_file_llseek for
> unknown reason despite file being opened with nonseekable_open (e.g.
> drivers/input/mousedev.c)
> 
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Yongzhi Pan <panyongzhi@gmail.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: David Vrabel <david.vrabel@citrix.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Julia Lawall <Julia.Lawall@lip6.fr>
> Cc: Nikolaus Rath <Nikolaus@rath.org>
> Cc: Han-Wen Nienhuys <hanwen@google.com>
> Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
> ---
>  drivers/xen/xenbus/xenbus_dev_frontend.c |   4 +-
>  fs/open.c                                |  18 ++
>  fs/read_write.c                          |   5 +-
>  include/linux/fs.h                       |   4 +
>  scripts/coccinelle/api/stream_open.cocci | 363 +++++++++++++++++++++++
>  5 files changed, 389 insertions(+), 5 deletions(-)
>  create mode 100644 scripts/coccinelle/api/stream_open.cocci

I think there is a follow-on patch for this one as well, that adds the
proper "stream open" logic to all of the individual locations.

But even with that, I don't think this is stable material, it should
just be for 5.1 and newer kernels.

thanks,

greg k-h

  reply	other threads:[~2019-04-24 16:34 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20190424143341.27665-1-sashal@kernel.org>
2019-04-24 14:33 ` [PATCH AUTOSEL 5.0 56/66] kernel/sysctl.c: fix out-of-bounds access when setting file-max Sasha Levin
2019-04-24 14:33 ` [PATCH AUTOSEL 5.0 59/66] fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock Sasha Levin
2019-04-24 16:34   ` Greg Kroah-Hartman [this message]
2019-04-24 16:40     ` Linus Torvalds
2019-04-24 17:02       ` Greg Kroah-Hartman
2019-04-24 17:19       ` Sasha Levin
2019-04-24 17:26         ` Linus Torvalds
2019-04-24 18:30           ` Kirill Smelkov
2019-04-25 10:04             ` David Laight
2019-04-26  7:45               ` Kirill Smelkov
2019-04-26 11:00                 ` David Laight
2019-04-26 18:20                   ` Kirill Smelkov
2019-04-24 14:33 ` [PATCH AUTOSEL 5.0 62/66] pin iocb through aio Sasha Levin
2019-04-24 14:33 ` [PATCH AUTOSEL 5.0 63/66] aio: fold lookup_kiocb() into its sole caller Sasha Levin
2019-04-24 14:33 ` [PATCH AUTOSEL 5.0 64/66] aio: keep io_event in aio_kiocb Sasha Levin
2019-04-24 14:33 ` [PATCH AUTOSEL 5.0 65/66] aio: store event at final iocb_put() Sasha Levin
2019-04-24 14:33 ` [PATCH AUTOSEL 5.0 66/66] Fix aio_poll() races Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190424163415.GB21413@kroah.com \
    --to=gregkh@linuxfoundation.org \
    --cc=Julia.Lawall@lip6.fr \
    --cc=Nikolaus@rath.org \
    --cc=arnd@arndb.de \
    --cc=corbet@lwn.net \
    --cc=david.vrabel@citrix.com \
    --cc=hanwen@google.com \
    --cc=hch@lst.de \
    --cc=jgross@suse.com \
    --cc=kirr@nexedi.com \
    --cc=ktkhai@virtuozzo.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    --cc=mtk.manpages@gmail.com \
    --cc=panyongzhi@gmail.com \
    --cc=sashal@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tj@kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).