linux-fsdevel.vger.kernel.org archive mirror
* Re: xfs_file_splice_read: possible circular locking dependency detected
       [not found]                 ` <20160914133925.2fba4629@roar.ozlabs.ibm.com>
@ 2016-09-18  5:33                   ` Al Viro
  2016-09-19  3:08                     ` Nicholas Piggin
  0 siblings, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-09-18  5:33 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, linux-fsdevel

[finally Cc'd to fsdevel - should've done that several iterations upthread]

On Wed, Sep 14, 2016 at 01:39:25PM +1000, Nicholas Piggin wrote:

> Should not be so bad, but I don't have hard numbers for you. PAGEVEC_SIZE
> is 14, and that's conceptually rather similar operation (walk radix tree;
> grab pages). OTOH many archs are heavier and do locking and vmas walking etc.
> 
> Documentation/features/vm/pte_special/arch-support.txt
> 
> But even for those, at 16 entries, the bulk of the cost *should* be hitting
> struct page cachelines and refcounting. The rest should mostly stay in cache.

OK...  That's actually important only for vmsplice_to_pipe() and 16-page
array seems to be doing fine there.

Another question, now that you've finally resurfaced: could you reconstruct
the story with page-stealing and breakage(s) thereof that had led to
commit 485ddb4b9741bafb70b22e5c1f9b4f37dc3e85bd
Author: Nick Piggin <npiggin@suse.de>
Date:   Tue Mar 27 08:55:08 2007 +0200

    1/2 splice: dont steal

I realize that it had been 9 years ago, but anything resembling a braindump
would be very welcome.  Note that there are a couple of ->splice_write()
instances that _do_ use ->steal() (fuse_dev_splice_write() and virtio_console
port_fops_splice_write()) and I wonder if they suffer from the same problems;
your commit message is rather short on details, unfortunately.  FUSE one
is especially interesting...

^ permalink raw reply	[flat|nested] 104+ messages in thread

* skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
       [not found]                       ` <20160917190023.GA8039@ZenIV.linux.org.uk>
@ 2016-09-18 19:31                         ` Al Viro
  2016-09-18 20:12                           ` Linus Torvalds
  2016-09-23 19:00                         ` [RFC][CFT] splice_read reworked Al Viro
  1 sibling, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-09-18 19:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Nick Piggin, linux-fsdevel, netdev, Eric Dumazet

FWIW, I'm not sure if skb_splice_bits() can't land us in trouble; fragments
might come from compound pages and I'm not entirely convinced that we won't
end up with coalesced fragments putting more than PAGE_SIZE into a single
pipe_buffer.  And that could badly confuse a bunch of code.

Can that legitimately happen?  If so, we'll need to audit quite a few
->splice_write()-related codepaths; FUSE, in particular, is very likely
to be unhappy with that kind of stuff, and it's not the only place where
we might count upon never seeing e.g. longer than PAGE_SIZE chunks in
bio_vec.  It shouldn't be all that hard to fix, but if the whole thing
is simply impossible, I would rather avoid that round of RTFS at the moment...

Comments?

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-18 19:31                         ` skb_splice_bits() and large chunks in pipe (was " Al Viro
@ 2016-09-18 20:12                           ` Linus Torvalds
  2016-09-18 22:31                             ` Al Viro
  0 siblings, 1 reply; 104+ messages in thread
From: Linus Torvalds @ 2016-09-18 20:12 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Nick Piggin, linux-fsdevel, Network Development,
	Eric Dumazet

On Sun, Sep 18, 2016 at 12:31 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> FWIW, I'm not sure if skb_splice_bits() can't land us in trouble; fragments
> might come from compound pages and I'm not entirely convinced that we won't
> end up with coalesced fragments putting more than PAGE_SIZE into a single
> pipe_buffer.  And that could badly confuse a bunch of code.

The pipe buffer code is actually *supposed* to handle any size
allocations at all. They should *not* be limited by pages, exactly
because the data can come from huge-pages or just multi-page
allocations. It's definitely possible with networking, and networking
is one of the *primary* targets of splice in many ways.

So if the splice code ends up being confused by "this is not just
inside a single page", then the splice code is buggy, I think.

Why would splice_write() cases be confused anyway? A filesystem needs
to be able to handle the case of "this needs to be split" regardless,
since even if the source buffer were to fit in a page, the offset
might obviously mean that the target won't fit in a page.

Now, if you decide that you want to make the iterator always split
those possibly big cases and never have big iovec entries, I guess
that would potentially be ok. But my initial reaction is that they are
perfectly normal and should be handled normally, and any code that
depends on a splice buffer fitting in one page is just buggy and
should be fixed.

                 Linus

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-18 20:12                           ` Linus Torvalds
@ 2016-09-18 22:31                             ` Al Viro
  2016-09-19  0:18                               ` Linus Torvalds
  2016-09-19  0:22                               ` Al Viro
  0 siblings, 2 replies; 104+ messages in thread
From: Al Viro @ 2016-09-18 22:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Nick Piggin, linux-fsdevel, Network Development,
	Eric Dumazet

On Sun, Sep 18, 2016 at 01:12:21PM -0700, Linus Torvalds wrote:

> So if the splice code ends up being confused by "this is not just
> inside a single page", then the splice code is buggy, I think.
> 
> Why would splice_write() cases be confused anyway? A filesystem needs
> to be able to handle the case of "this needs to be split" regardless,
> since even if the source buffer were to fit in a page, the offset
> might obviously mean that the target won't fit in a page.

What worries me is iov_iter_get_pages() and friends.  The calling conventions
are
	size = iov_iter_get_pages(iter, pages, maxlen, maxpages, &start);

They are convenient enough for most of the callers - we fill an array of
pages, the first (and only in bvec case) one having start bytes skipped.

The thing is, the calculation of the number of pages returned is broken
in this case; normally it's DIV_ROUND_UP(start + n, PAGE_SIZE).  That,
of course, gets broken even by the offset being large enough.  We don't
have that many users of that thing (and iov_iter_get_pages_alloc()), but
it'll need careful review.  What's more, looking at those shows other
fun issues:
        sg_init_table(sgl->sg, npages + 1);

        for (i = 0, len = n; i < npages; i++) {
                int plen = min_t(int, len, PAGE_SIZE - off);

                sg_set_page(sgl->sg + i, sgl->pages[i], plen, off);

and that'll instantly blow up, due to PAGE_SIZE - off possibly becoming
negative.  That's af_alg_make_sg(), and it shouldn't see anything
coming from pipe buffers (right now the only way for that to happen is
iter_file_splice_write()), but the things like e.g. dio_refill_pages()
might, and they also get seriously confused by that.  Worse, some of those
callers have calling conventions that have similar problems of their own.
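
To make the arithmetic concrete, here is a minimal userspace sketch, not
kernel code.  It assumes 4096-byte pages; SKETCH_PAGE_SIZE, pages_spanned()
and first_chunk_len() are hypothetical names invented for illustration.  It
only shows that both the usual page count and the min(len, PAGE_SIZE - off)
idiom quietly presuppose off < PAGE_SIZE.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096L	/* assumption: 4 KiB pages */

/* Same rounding as DIV_ROUND_UP(start + n, PAGE_SIZE). */
static long pages_spanned(long start, long n)
{
	assert(start >= 0 && n > 0);
	return (start + n + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* The "min(len, PAGE_SIZE - off)" idiom, done in signed arithmetic. */
static long first_chunk_len(long len, long off)
{
	long room = SKETCH_PAGE_SIZE - off;	/* negative once off > PAGE_SIZE */

	return len < room ? len : room;
}
```

With off = 5000 and len = 100 the data sits entirely inside one page (the
second page of a compound one), yet pages_spanned(5000, 100) reports 2 and
first_chunk_len(100, 5000) returns -904.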

At the moment there are 11 callers (10 in mainline; one more added in
conversion of vmsplice_to_pipe() to new pipe locking, but it's irrelevant
anyway - it gets fed an iovec-backed iov_iter).  I'm looking through those
right now, hopefully will come up with something sane...

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-18 22:31                             ` Al Viro
@ 2016-09-19  0:18                               ` Linus Torvalds
  2016-09-19  0:22                               ` Al Viro
  1 sibling, 0 replies; 104+ messages in thread
From: Linus Torvalds @ 2016-09-19  0:18 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Nick Piggin, linux-fsdevel, Network Development,
	Eric Dumazet

On Sun, Sep 18, 2016 at 3:31 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What worries me is iov_iter_get_pages() and friends.

So honestly, if it worries you, I'm not going to complain at all if
you decide that you'd rather translate the pipe_buffer[] array into a
kvec by always splitting at page boundaries.

Even with large packets in networking, it's not going to be a huge
deal. And maybe we *should* make it a rule that a "kvec" is always
composed of individual entries that fit entirely within a page.

In this code, being safe rather than clever would be a welcome and
surprising change, I guess.
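
A rough userspace sketch of what "always splitting at page boundaries" could
look like, assuming 4096-byte pages.  struct chunk, SKETCH_PAGE_SIZE and
split_at_page_boundaries() are hypothetical names for illustration only; this
is not the kernel's pipe_buffer or kvec code.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* One entry that never crosses a page boundary. */
struct chunk {
	size_t page_index;	/* which page of the large buffer */
	size_t offset;		/* offset inside that page, < SKETCH_PAGE_SIZE */
	size_t len;		/* bytes in this entry, <= SKETCH_PAGE_SIZE - offset */
};

/*
 * Split the byte range [offset, offset + len) of a multi-page buffer into
 * per-page chunks.  Returns the number of chunks written, at most max.
 */
static size_t split_at_page_boundaries(size_t offset, size_t len,
				       struct chunk *out, size_t max)
{
	size_t n = 0;

	assert(out != NULL);
	while (len && n < max) {
		size_t in_page = offset % SKETCH_PAGE_SIZE;
		size_t room = SKETCH_PAGE_SIZE - in_page;	/* 1..PAGE_SIZE */
		size_t take = len < room ? len : room;

		out[n].page_index = offset / SKETCH_PAGE_SIZE;
		out[n].offset = in_page;
		out[n].len = take;
		offset += take;
		len -= take;
		n++;
	}
	return n;
}
```

Every entry produced this way satisfies offset + len <= PAGE_SIZE, which is
the invariant the per-page consumers rely on.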

             Linus

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-18 22:31                             ` Al Viro
  2016-09-19  0:18                               ` Linus Torvalds
@ 2016-09-19  0:22                               ` Al Viro
  2016-09-20  9:51                                 ` Herbert Xu
  1 sibling, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-09-19  0:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Nick Piggin, linux-fsdevel, Network Development,
	Eric Dumazet

On Sun, Sep 18, 2016 at 11:31:17PM +0100, Al Viro wrote:

> At the moment there are 11 callers (10 in mainline; one more added in
> conversion of vmsplice_to_pipe() to new pipe locking, but it's irrelevant
> anyway - it gets fed an iovec-backed iov_iter).  I'm looking through those
> right now, hopefully will come up with something sane...

FWIW, I wonder how many of those users are ready to cope with compound
pages in the first place; they end up passed to
	* skb_fill_page_desc().  Probably OK (as in all of them, modulo
calculating the number of pages and ranges for them).
	* shoved into scatterlist, which gets passed to virtqueue_add_sgs().
Need to check virtio to see what happens there.
	* shoved into nfs ->wb_page and fed into nfs_pageio_add_request() and
machinery behind it.  These, BTW, are reachable by pipe_buffer-derived ones
at the moment (splice to O_DIRECT nfs file).  The code looks like it's
playing fast and loose with ->wb_page - in some cases it's an NFS pagecache
one, in some - anything from userland, and there are places like
	inode = page_file_mapping(req->wb_page)->host;
which will do nasty things if they are ever reached by the second kind.
nfs_pgio_rpcsetup() looks like it won't be happy with compound pages, but
again, I'm not familiar enough with that code to tell if it's reachable
from nfs_pageio_add_request().
	* shoved into scatterlist, which gets fed into crypto/*.c machinery.
No way for a pipe_buffer stuff to get there, fortunately, because I would
be very surprised if it works correctly with compound pages and large
ranges in those.
	* shoved into lustre ->ldp_pages; almost certainly not ready for
compound pages.
	* fed to ceph_osd_data_pages_init(); again, practically certain not
to be ready.
	* put into dio_submit ->pages[], eventually fed to bio_add_page();
that might be fixable, but it would take some massage in fs/direct-io.c
	* fuse - probably OK, but that's only on a fairly cursory look.

It certainly won't be easy to verify in detail ;-/

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-18  5:33                   ` xfs_file_splice_read: possible circular locking dependency detected Al Viro
@ 2016-09-19  3:08                     ` Nicholas Piggin
  2016-09-19  6:11                       ` Al Viro
  0 siblings, 1 reply; 104+ messages in thread
From: Nicholas Piggin @ 2016-09-19  3:08 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, linux-fsdevel

On Sun, 18 Sep 2016 06:33:52 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> [finally Cc'd to fsdevel - should've done that several iterations upthread]
> 
> On Wed, Sep 14, 2016 at 01:39:25PM +1000, Nicholas Piggin wrote:
> 
> > Should not be so bad, but I don't have hard numbers for you. PAGEVEC_SIZE
> > is 14, and that's conceptually rather similar operation (walk radix tree;
> > grab pages). OTOH many archs are heavier and do locking and vmas walking etc.
> > 
> > Documentation/features/vm/pte_special/arch-support.txt
> > 
> > But even for those, at 16 entries, the bulk of the cost *should* be hitting
> > struct page cachelines and refcounting. The rest should mostly stay in cache.  
> 
> OK...  That's actually important only for vmsplice_to_pipe() and 16-page
> array seems to be doing fine there.
> 
> Another question, now that you've finally resurfaced: could you reconstruct
> the story with page-stealing and breakage(s) thereof that had led to
> commit 485ddb4b9741bafb70b22e5c1f9b4f37dc3e85bd
> Author: Nick Piggin <npiggin@suse.de>
> Date:   Tue Mar 27 08:55:08 2007 +0200
> 
>     1/2 splice: dont steal
> 
> I realize that it had been 9 years ago, but anything resembling a braindump
> would be very welcome.  Note that there are a couple of ->splice_write()
> instances that _do_ use ->steal() (fuse_dev_splice_write() and virtio_console
> port_fops_splice_write()) and I wonder if they suffer from the same problems;
> your commit message is rather short on details, unfortunately.  FUSE one
> is especially interesting...

Without looking through all the patches again, I believe the issue was
just that filesystems were not expecting (or at least, not audited to
expect) pages being added to their pagecache in that particular state
(they'd expect to go through ->readpage or see !uptodate in prepare_write).

If some wanted to attach metadata to uptodate pages for example, this
may have caused a problem. It wasn't some big fundamental problem, just a
mechanical one.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-19  3:08                     ` Nicholas Piggin
@ 2016-09-19  6:11                       ` Al Viro
  2016-09-19  7:26                         ` Nicholas Piggin
  0 siblings, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-09-19  6:11 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, linux-fsdevel

On Mon, Sep 19, 2016 at 01:08:30PM +1000, Nicholas Piggin wrote:

> Without looking through all the patches again, I believe the issue was
> just that filesystems were not expecting (or at least, not audited to
> expect) pages being added to their pagecache in that particular state
> (they'd expect to go through ->readpage or see !uptodate in prepare_write).
> 
> If some wanted to attach metadata to uptodate pages for example, this
> may have caused a problem. It wasn't some big fundamental problem, just a
> mechanical one.

Umm...  Why not make it non-uptodate/locked, try to replace the original
with it in pagecache and then do full-page ->write_begin immediately
followed by full-page ->write_end?  Looks like that ought to work in
all in-tree cases...

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-19  6:11                       ` Al Viro
@ 2016-09-19  7:26                         ` Nicholas Piggin
  0 siblings, 0 replies; 104+ messages in thread
From: Nicholas Piggin @ 2016-09-19  7:26 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, linux-fsdevel

On Mon, 19 Sep 2016 07:11:21 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Mon, Sep 19, 2016 at 01:08:30PM +1000, Nicholas Piggin wrote:
> 
> > Without looking through all the patches again, I believe the issue was
> > just that filesystems were not expecting (or at least, not audited to
> > expect) pages being added to their pagecache in that particular state
> > (they'd expect to go through ->readpage or see !uptodate in prepare_write).
> > 
> > If some wanted to attach metadata to uptodate pages for example, this
> > may have caused a problem. It wasn't some big fundamental problem, just a
> > mechanical one.  
> 
> Umm...  Why not make it non-uptodate/locked, try to replace the original
> with it in pagecache and then do full-page ->write_begin immediately
> followed by full-page ->write_end?  Looks like that ought to work in
> all in-tree cases...

That sounds like it probably should work for that case. IIRC, I was looking
at using a write_begin flag to notify the case of replacing the page, so
the fs could also handle the case of replacing existing pagecache.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-19  0:22                               ` Al Viro
@ 2016-09-20  9:51                                 ` Herbert Xu
  0 siblings, 0 replies; 104+ messages in thread
From: Herbert Xu @ 2016-09-20  9:51 UTC (permalink / raw)
  To: Al Viro; +Cc: torvalds, axboe, npiggin, linux-fsdevel, netdev, edumazet

Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>        * shoved into scatterlist, which gets fed into crypto/*.c machinery.
> No way for a pipe_buffer stuff to get there, fortunately, because I would
> be very surprised if it works correctly with compound pages and large
> ranges in those.

FWIW the crypto API has always been supposed to handle SG entries
that cross page boundaries.  There were a couple of bugs in this
area but AFAIK they've all been fixed.

Of course I cannot guarantee that every crypto driver also handles
it correctly, but at least we have a few test vectors which test
the page-crossing case specifically.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 104+ messages in thread

* [RFC][CFT] splice_read reworked
       [not found]                       ` <20160917190023.GA8039@ZenIV.linux.org.uk>
  2016-09-18 19:31                         ` skb_splice_bits() and large chunks in pipe (was " Al Viro
@ 2016-09-23 19:00                         ` Al Viro
  2016-09-23 19:01                           ` [PATCH 01/11] fix memory leaks in tracing_buffers_splice_read() Al Viro
                                             ` (11 more replies)
  1 sibling, 12 replies; 104+ messages in thread
From: Al Viro @ 2016-09-23 19:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

	The series is supposed to solve the locking order problems for
->splice_read() and get rid of code duplication between the read-side
methods.
	pipe_lock is lifted out of ->splice_read() instances, along with
waiting for empty space in pipe, etc. - we do that stuff in callers.
	A new variant of iov_iter is introduced - it's backed by a pipe,
copy_to_iter() results in allocating pages and copying into those,
copy_page_to_iter() just sticks a reference to that page into pipe.
Running out of space in pipe yields a short read, as a fault in iovec-backed
iov_iter would have.  Enough primitives are implemented for normal
->read_iter() instances to work.
	generic_file_splice_read() switched to feeding such iov_iter to
->read_iter() instance.  That turns out to be enough to kill almost all
->splice_read() instances; the only ones _not_ using generic_file_splice_read()
or default_file_splice_read() (== no zero-copy fallback) are
fuse_dev_splice_read(), 3 instances in kernel/{relay.c,trace/trace.c} and
sock_splice_read().  It's almost certainly possible to convert fuse one
and the same might be possible to do to socket one.  relay and tracing
stuff is just plain weird; might or might not be doable.
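
A toy userspace model of the zero-copy half of that idea, for illustration
only.  It assumes a pipe with 16 slots and 4096-byte pages; every name here
(toy_pipe, toy_page, toy_copy_page_to_pipe(), toy_read_pages()) is
hypothetical and none of this is the actual lib/iov_iter.c code.  Adding a
page just takes a reference, and a full pipe shows up as a short read rather
than an error.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PIPE_SLOTS 16	/* assumption: 16 buffers per pipe */
#define TOY_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

struct toy_page { int refcount; };
struct toy_buf { struct toy_page *page; size_t offset, len; };
struct toy_pipe { struct toy_buf bufs[TOY_PIPE_SLOTS]; size_t nrbufs; };

/*
 * Zero-copy case: stick a reference to the page into the next free slot.
 * Returns the number of bytes accepted, 0 if the pipe is full.
 */
static size_t toy_copy_page_to_pipe(struct toy_pipe *pipe,
				    struct toy_page *page,
				    size_t offset, size_t len)
{
	assert(offset + len <= TOY_PAGE_SIZE);
	if (pipe->nrbufs == TOY_PIPE_SLOTS)
		return 0;
	page->refcount++;	/* the pipe now holds its own reference */
	pipe->bufs[pipe->nrbufs++] = (struct toy_buf){ page, offset, len };
	return len;
}

/* Feed whole pages until the pipe refuses one; the result may be short. */
static size_t toy_read_pages(struct toy_pipe *pipe,
			     struct toy_page *pages, size_t npages)
{
	size_t done = 0;

	for (size_t i = 0; i < npages; i++) {
		size_t n = toy_copy_page_to_pipe(pipe, &pages[i], 0,
						 TOY_PAGE_SIZE);
		if (!n)
			break;	/* out of slots: stop, report what we have */
		done += n;
	}
	return done;
}
```

Asking this model for 20 pages yields 16 * 4096 bytes and leaves the last
four pages untouched, much as a fault partway through an iovec would cut a
normal read short.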

	Something hopefully working is in
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.splice_read

Several commits in that pipe (#1, #8 and #9) are trivial cleanups and fixes
for crap caught while doing the rest, probably ought to be separated.

Shortlog:
Al Viro (11):
      fix memory leaks in tracing_buffers_splice_read()
      splice_to_pipe(): don't open-code wakeup_pipe_readers()
      splice: switch get_iovec_page_array() to iov_iter
      splice: lift pipe_lock out of splice_to_pipe()
      skb_splice_bits(): get rid of callback
      new helper: add_to_pipe()
      fuse_dev_splice_read(): switch to add_to_pipe()
      cifs: don't use memcpy() to copy struct iov_iter
      fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter()
      new iov_iter flavour: pipe-backed
      switch generic_file_splice_read() to use of ->read_iter()

Diffstat:
 drivers/staging/lustre/lustre/llite/file.c         |  70 +--
 .../staging/lustre/lustre/llite/llite_internal.h   |  15 +-
 drivers/staging/lustre/lustre/llite/vvp_internal.h |  14 -
 drivers/staging/lustre/lustre/llite/vvp_io.c       |  45 +-
 fs/cifs/file.c                                     |  14 +-
 fs/coda/file.c                                     |  23 +-
 fs/fuse/dev.c                                      |  48 +-
 fs/fuse/file.c                                     |  30 +-
 fs/gfs2/file.c                                     |  28 +-
 fs/nfs/file.c                                      |  25 +-
 fs/nfs/internal.h                                  |   2 -
 fs/nfs/nfs4file.c                                  |   2 +-
 fs/ocfs2/file.c                                    |  34 +-
 fs/ocfs2/ocfs2_trace.h                             |   2 -
 fs/splice.c                                        | 578 +++++++--------------
 fs/xfs/xfs_file.c                                  |  41 +-
 fs/xfs/xfs_trace.h                                 |   1 -
 include/linux/fs.h                                 |   2 -
 include/linux/skbuff.h                             |   8 +-
 include/linux/splice.h                             |   3 +
 include/linux/uio.h                                |  14 +-
 kernel/trace/trace.c                               |  14 +-
 lib/iov_iter.c                                     | 390 +++++++++++++-
 mm/shmem.c                                         | 115 +---
 net/core/skbuff.c                                  |  28 +-
 net/ipv4/tcp.c                                     |   3 +-
 net/kcm/kcmsock.c                                  |  16 +-
 net/unix/af_unix.c                                 |  17 +-
 28 files changed, 648 insertions(+), 934 deletions(-)

	It's not all I would like to do there (in particular, I hadn't
done fuse splice_read conversion to read_iter, even though it does appear
to be doable; that'll take copy_page_to_iter_nosteal() as a new primitive
+ considerable amount of massage in fs/fuse/dev.c), but it should at least
	* make pipe lock the outermost
	* switch generic_file_splice_read() to ->read_iter(), making
it suitable for lustre/coda/gfs2/ocfs2/xfs/shmem without any wrappers
	* somewhat simplify socket ->splice_read() guts (not by much - to
start doing that right we'd need the same new primitive)
	* remove a considerable pile of code.
	* get rid of a bunch of splice_{grow,shrink}_spd/splice_to_pipe
callers; remaining ones are in default_file_splice_read() (trivially
killable by conversion to iov_iter_get_pages_alloc(), followed by the same
build iovec array + use kernel_readv as we do now + iov_iter_advance to
the length returned by kernel_readv), kernel/relay and kernel/trace/trace.c
ones (should switch to add_to_pipe(), AFAICS) and skb_splice_bits()
(again, a matter of copy_page_to_iter_nosteal(), which will take out
spd_can_coalesce/spd_fill_page in there as well).  Once the remaining ones
are taken care of, splice_pipe_desc and friends will go away.

	In its current form it survives LTP, xfstests and overlayfs testsuite;
if anybody has additional tests for splice and friends, I would like to hear
about such.  It really needs more beating, though.

	Please, help with review and testing.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 01/11] fix memory leaks in tracing_buffers_splice_read()
  2016-09-23 19:00                         ` [RFC][CFT] splice_read reworked Al Viro
@ 2016-09-23 19:01                           ` Al Viro
  2016-09-23 19:02                           ` [PATCH 02/11] splice_to_pipe(): don't open-code wakeup_pipe_readers() Al Viro
                                             ` (10 subsequent siblings)
  11 siblings, 0 replies; 104+ messages in thread
From: Al Viro @ 2016-09-23 19:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 kernel/trace/trace.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index dade4c9..9016f98 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6163,9 +6163,6 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
 		return -EBUSY;
 #endif
 
-	if (splice_grow_spd(pipe, &spd))
-		return -ENOMEM;
-
 	if (*ppos & (PAGE_SIZE - 1))
 		return -EINVAL;
 
@@ -6175,6 +6172,9 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
 		len &= PAGE_MASK;
 	}
 
+	if (splice_grow_spd(pipe, &spd))
+		return -ENOMEM;
+
  again:
 	trace_access_lock(iter->cpu_file);
 	entries = ring_buffer_entries_cpu(iter->trace_buffer->buffer, iter->cpu_file);
@@ -6232,19 +6232,21 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
 	/* did we read anything? */
 	if (!spd.nr_pages) {
 		if (ret)
-			return ret;
+			goto out;
 
+		ret = -EAGAIN;
 		if ((file->f_flags & O_NONBLOCK) || (flags & SPLICE_F_NONBLOCK))
-			return -EAGAIN;
+			goto out;
 
 		ret = wait_on_pipe(iter, true);
 		if (ret)
-			return ret;
+			goto out;
 
 		goto again;
 	}
 
 	ret = splice_to_pipe(pipe, &spd);
+out:
 	splice_shrink_spd(&spd);
 
 	return ret;
-- 
2.9.3
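
The shape of the fix in the diff above, as a small userspace sketch.  All
names (sketch_alloc(), sketch_free(), sketch_read(), live_allocs) are
hypothetical stand-ins, with malloc()/free() playing the role of
splice_grow_spd()/splice_shrink_spd(): check arguments before allocating,
and once the allocation exists leave only through a single cleanup label.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int live_allocs;	/* hypothetical counter so a leak would be visible */

static void *sketch_alloc(size_t n)
{
	void *p = malloc(n);

	if (p)
		live_allocs++;
	return p;
}

static void sketch_free(void *p)
{
	if (p)
		live_allocs--;
	free(p);
}

static int sketch_read(long pos, int have_data)
{
	void *buf;
	int ret;

	if (pos & 4095)		/* validate before anything is allocated */
		return -EINVAL;

	buf = sketch_alloc(64);
	if (!buf)
		return -ENOMEM;

	ret = -EAGAIN;
	if (!have_data)
		goto out;	/* a bare return here would leak buf */

	ret = 0;
out:
	sketch_free(buf);
	return ret;
}
```

Whatever path is taken, live_allocs is back at zero when sketch_read()
returns.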


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 02/11] splice_to_pipe(): don't open-code wakeup_pipe_readers()
  2016-09-23 19:00                         ` [RFC][CFT] splice_read reworked Al Viro
  2016-09-23 19:01                           ` [PATCH 01/11] fix memory leaks in tracing_buffers_splice_read() Al Viro
@ 2016-09-23 19:02                           ` Al Viro
  2016-09-23 19:02                           ` [PATCH 03/11] splice: switch get_iovec_page_array() to iov_iter Al Viro
                                             ` (9 subsequent siblings)
  11 siblings, 0 replies; 104+ messages in thread
From: Al Viro @ 2016-09-23 19:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/splice.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index dd9bf7e..36e9353 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -242,10 +242,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		}
 
 		if (do_wakeup) {
-			smp_mb();
-			if (waitqueue_active(&pipe->wait))
-				wake_up_interruptible_sync(&pipe->wait);
-			kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
+			wakeup_pipe_readers(pipe);
 			do_wakeup = 0;
 		}
 
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 03/11] splice: switch get_iovec_page_array() to iov_iter
  2016-09-23 19:00                         ` [RFC][CFT] splice_read reworked Al Viro
  2016-09-23 19:01                           ` [PATCH 01/11] fix memory leaks in tracing_buffers_splice_read() Al Viro
  2016-09-23 19:02                           ` [PATCH 02/11] splice_to_pipe(): don't open-code wakeup_pipe_readers() Al Viro
@ 2016-09-23 19:02                           ` Al Viro
  2016-09-23 19:03                           ` [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe() Al Viro
                                             ` (8 subsequent siblings)
  11 siblings, 0 replies; 104+ messages in thread
From: Al Viro @ 2016-09-23 19:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/splice.c | 135 ++++++++++++++++--------------------------------------------
 1 file changed, 36 insertions(+), 99 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 36e9353..31c52e0 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1434,106 +1434,32 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 	return -EINVAL;
 }
 
-/*
- * Map an iov into an array of pages and offset/length tupples. With the
- * partial_page structure, we can map several non-contiguous ranges into
- * our ones pages[] map instead of splitting that operation into pieces.
- * Could easily be exported as a generic helper for other users, in which
- * case one would probably want to add a 'max_nr_pages' parameter as well.
- */
-static int get_iovec_page_array(const struct iovec __user *iov,
-				unsigned int nr_vecs, struct page **pages,
-				struct partial_page *partial, bool aligned,
+static int get_iovec_page_array(struct iov_iter *from,
+				struct page **pages,
+				struct partial_page *partial,
 				unsigned int pipe_buffers)
 {
-	int buffers = 0, error = 0;
-
-	while (nr_vecs) {
-		unsigned long off, npages;
-		struct iovec entry;
-		void __user *base;
-		size_t len;
-		int i;
-
-		error = -EFAULT;
-		if (copy_from_user(&entry, iov, sizeof(entry)))
-			break;
-
-		base = entry.iov_base;
-		len = entry.iov_len;
-
-		/*
-		 * Sanity check this iovec. 0 read succeeds.
-		 */
-		error = 0;
-		if (unlikely(!len))
-			break;
-		error = -EFAULT;
-		if (!access_ok(VERIFY_READ, base, len))
-			break;
-
-		/*
-		 * Get this base offset and number of pages, then map
-		 * in the user pages.
-		 */
-		off = (unsigned long) base & ~PAGE_MASK;
-
-		/*
-		 * If asked for alignment, the offset must be zero and the
-		 * length a multiple of the PAGE_SIZE.
-		 */
-		error = -EINVAL;
-		if (aligned && (off || len & ~PAGE_MASK))
-			break;
-
-		npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
-		if (npages > pipe_buffers - buffers)
-			npages = pipe_buffers - buffers;
-
-		error = get_user_pages_fast((unsigned long)base, npages,
-					0, &pages[buffers]);
-
-		if (unlikely(error <= 0))
-			break;
-
-		/*
-		 * Fill this contiguous range into the partial page map.
-		 */
-		for (i = 0; i < error; i++) {
-			const int plen = min_t(size_t, len, PAGE_SIZE - off);
-
-			partial[buffers].offset = off;
-			partial[buffers].len = plen;
-
-			off = 0;
-			len -= plen;
+	int buffers = 0;
+	while (iov_iter_count(from)) {
+		ssize_t copied;
+		size_t start;
+
+		copied = iov_iter_get_pages(from, pages + buffers, ~0UL,
+					pipe_buffers - buffers, &start);
+		if (copied <= 0)
+			return buffers ? buffers : copied;
+
+		iov_iter_advance(from, copied);
+		while (copied) {
+			int size = min_t(int, copied, PAGE_SIZE - start);
+			partial[buffers].offset = start;
+			partial[buffers].len = size;
+			copied -= size;
+			start = 0;
 			buffers++;
 		}
-
-		/*
-		 * We didn't complete this iov, stop here since it probably
-		 * means we have to move some of this into a pipe to
-		 * be able to continue.
-		 */
-		if (len)
-			break;
-
-		/*
-		 * Don't continue if we mapped fewer pages than we asked for,
-		 * or if we mapped the max number of pages that we have
-		 * room for.
-		 */
-		if (error < npages || buffers == pipe_buffers)
-			break;
-
-		nr_vecs--;
-		iov++;
 	}
-
-	if (buffers)
-		return buffers;
-
-	return error;
+	return buffers;
 }
 
 static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
@@ -1587,10 +1513,13 @@ static long vmsplice_to_user(struct file *file, const struct iovec __user *uiov,
  * as splice-from-memory, where the regular splice is splice-from-file (or
  * to file). In both cases the output is a pipe, naturally.
  */
-static long vmsplice_to_pipe(struct file *file, const struct iovec __user *iov,
+static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 			     unsigned long nr_segs, unsigned int flags)
 {
 	struct pipe_inode_info *pipe;
+	struct iovec iovstack[UIO_FASTIOV];
+	struct iovec *iov = iovstack;
+	struct iov_iter from;
 	struct page *pages[PIPE_DEF_BUFFERS];
 	struct partial_page partial[PIPE_DEF_BUFFERS];
 	struct splice_pipe_desc spd = {
@@ -1607,11 +1536,18 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *iov,
 	if (!pipe)
 		return -EBADF;
 
-	if (splice_grow_spd(pipe, &spd))
+	ret = import_iovec(WRITE, uiov, nr_segs,
+			   ARRAY_SIZE(iovstack), &iov, &from);
+	if (ret < 0)
+		return ret;
+
+	if (splice_grow_spd(pipe, &spd)) {
+		kfree(iov);
 		return -ENOMEM;
+	}
 
-	spd.nr_pages = get_iovec_page_array(iov, nr_segs, spd.pages,
-					    spd.partial, false,
+	spd.nr_pages = get_iovec_page_array(&from, spd.pages,
+					    spd.partial,
 					    spd.nr_pages_max);
 	if (spd.nr_pages <= 0)
 		ret = spd.nr_pages;
@@ -1619,6 +1555,7 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *iov,
 		ret = splice_to_pipe(pipe, &spd);
 
 	splice_shrink_spd(&spd);
+	kfree(iov);
 	return ret;
 }
 
-- 
2.9.3



* [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-23 19:00                         ` [RFC][CFT] splice_read reworked Al Viro
                                             ` (2 preceding siblings ...)
  2016-09-23 19:02                           ` [PATCH 03/11] splice: switch get_iovec_page_array() to iov_iter Al Viro
@ 2016-09-23 19:03                           ` Al Viro
  2016-09-23 19:45                             ` Linus Torvalds
  2016-09-23 19:03                           ` [PATCH 05/11] skb_splice_bits(): get rid of callback Al Viro
                                             ` (7 subsequent siblings)
  11 siblings, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-09-23 19:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

* splice_to_pipe() stops at pipe overflow and does *not* take pipe_lock
* ->splice_read() instances do the same
* vmsplice_to_pipe() and do_splice() (ultimate callers of splice_to_pipe())
  arrange for waiting, looping, etc. themselves.

That should make pipe_lock the outermost one.
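
For illustration only, here is a userspace sketch of the "fill what fits,
never sleep" behaviour described above.  It is a toy model, not the kernel
code: struct toy_pipe, struct toy_buf, TOY_PIPE_BUFFERS and
toy_splice_to_pipe() are made-up names, and the caller is simply assumed
to hold the lock.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_PIPE_BUFFERS 4	/* hypothetical size; must be a power of two */

struct toy_buf {
	size_t len;
};

struct toy_pipe {
	unsigned int curbuf;	/* index of the oldest occupied slot */
	unsigned int nrbufs;	/* number of occupied slots */
	unsigned int readers;	/* 0 means nobody will ever read */
	struct toy_buf bufs[TOY_PIPE_BUFFERS];
};

/*
 * Caller is assumed to hold the (imaginary) pipe lock.  Fills free slots
 * until the ring is full or the input runs out; it never waits.
 * Returns bytes queued, -EPIPE if there are no readers, or -EAGAIN if
 * nothing could be queued.
 */
long toy_splice_to_pipe(struct toy_pipe *pipe,
			const size_t *lens, unsigned int nr)
{
	long ret = 0;
	unsigned int i = 0;

	/* the index mask below only works for power-of-two ring sizes */
	assert((TOY_PIPE_BUFFERS & (TOY_PIPE_BUFFERS - 1)) == 0);

	if (!nr)
		return 0;
	if (!pipe->readers)
		return -EPIPE;

	while (i < nr && pipe->nrbufs < TOY_PIPE_BUFFERS) {
		unsigned int slot = (pipe->curbuf + pipe->nrbufs) &
				    (TOY_PIPE_BUFFERS - 1);

		pipe->bufs[slot].len = lens[i];
		pipe->nrbufs++;
		ret += (long)lens[i];
		i++;
	}
	return ret ? ret : -EAGAIN;
}
```

Waiting, wakeups and retries are deliberately absent from the sketch; in
the scheme described above those belong to the callers.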

Unfortunately, existing rules for the amount passed by vmsplice_to_pipe()
and do_splice() are quite ugly _and_ userland code can be easily broken
by changing those.  It's not even "no more than the maximal capacity of
this pipe" - it's "once we've fed pipe->buffers pages into the pipe,
leave instead of waiting".  I would like to change it to something saner,
but that's for later.
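
The counting rule can be sketched in userspace as follows.  This is a toy
model under stated assumptions: toy_fill() stands in for the non-waiting
splice_to_pipe(), toy_wait() stands in for pipe_wait() with a reader that
drains a fixed number of slots, and every toy_* name is hypothetical.

```c
#include <assert.h>

struct toy_pipe {
	unsigned int buffers;	/* ring capacity */
	unsigned int nrbufs;	/* occupied slots */
};

/* Fill free slots without waiting; returns how many pages were queued. */
unsigned int toy_fill(struct toy_pipe *pipe, unsigned int want)
{
	unsigned int room = pipe->buffers - pipe->nrbufs;
	unsigned int n = want < room ? want : room;

	pipe->nrbufs += n;
	return n;
}

/* Stand-in for pipe_wait(): a reader consumes up to 'drain' slots. */
void toy_wait(struct toy_pipe *pipe, unsigned int drain)
{
	pipe->nrbufs -= drain < pipe->nrbufs ? drain : pipe->nrbufs;
}

/* Feed up to 'want' pages; returns how many were actually queued. */
unsigned int toy_feed(struct toy_pipe *pipe, unsigned int want,
		      unsigned int drain)
{
	unsigned int total = 0;
	int bogus_count = (int)pipe->buffers;

	assert(drain > 0);	/* otherwise the toy reader never makes room */

	for (;;) {
		unsigned int n;

		bogus_count += (int)pipe->nrbufs;
		n = toy_fill(pipe, want);
		bogus_count -= (int)pipe->nrbufs;	/* net effect: -= n */
		total += n;
		want -= n;
		/* leave once 'buffers' slots have been fed, or all is sent */
		if (bogus_count <= 0 || !want)
			break;
		toy_wait(pipe, drain);
	}
	return total;
}
```

In this model a 4-slot pipe that starts empty accepts 4 pages of a
10-page request and then returns instead of waiting for the reader.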

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/fuse/dev.c |   2 -
 fs/splice.c   | 171 ++++++++++++++++++++++++++++++++--------------------------
 2 files changed, 96 insertions(+), 77 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index a94d2ed..eaf56c6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1364,7 +1364,6 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 		goto out;
 
 	ret = 0;
-	pipe_lock(pipe);
 
 	if (!pipe->readers) {
 		send_sig(SIGPIPE, current, 0);
@@ -1400,7 +1399,6 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 	}
 
 out_unlock:
-	pipe_unlock(pipe);
 
 	if (do_wakeup) {
 		smp_mb();
diff --git a/fs/splice.c b/fs/splice.c
index 31c52e0..9ce6e62 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -183,79 +183,41 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		       struct splice_pipe_desc *spd)
 {
 	unsigned int spd_pages = spd->nr_pages;
-	int ret, do_wakeup, page_nr;
+	int ret = 0, page_nr = 0;
 
 	if (!spd_pages)
 		return 0;
 
-	ret = 0;
-	do_wakeup = 0;
-	page_nr = 0;
-
-	pipe_lock(pipe);
-
-	for (;;) {
-		if (!pipe->readers) {
-			send_sig(SIGPIPE, current, 0);
-			if (!ret)
-				ret = -EPIPE;
-			break;
-		}
-
-		if (pipe->nrbufs < pipe->buffers) {
-			int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
-			struct pipe_buffer *buf = pipe->bufs + newbuf;
-
-			buf->page = spd->pages[page_nr];
-			buf->offset = spd->partial[page_nr].offset;
-			buf->len = spd->partial[page_nr].len;
-			buf->private = spd->partial[page_nr].private;
-			buf->ops = spd->ops;
-			if (spd->flags & SPLICE_F_GIFT)
-				buf->flags |= PIPE_BUF_FLAG_GIFT;
-
-			pipe->nrbufs++;
-			page_nr++;
-			ret += buf->len;
-
-			if (pipe->files)
-				do_wakeup = 1;
+	if (unlikely(!pipe->readers)) {
+		send_sig(SIGPIPE, current, 0);
+		ret = -EPIPE;
+		goto out;
+	}
 
-			if (!--spd->nr_pages)
-				break;
-			if (pipe->nrbufs < pipe->buffers)
-				continue;
+	while (pipe->nrbufs < pipe->buffers) {
+		int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+		struct pipe_buffer *buf = pipe->bufs + newbuf;
 
-			break;
-		}
+		buf->page = spd->pages[page_nr];
+		buf->offset = spd->partial[page_nr].offset;
+		buf->len = spd->partial[page_nr].len;
+		buf->private = spd->partial[page_nr].private;
+		buf->ops = spd->ops;
+		if (spd->flags & SPLICE_F_GIFT)
+			buf->flags |= PIPE_BUF_FLAG_GIFT;
 
-		if (spd->flags & SPLICE_F_NONBLOCK) {
-			if (!ret)
-				ret = -EAGAIN;
-			break;
-		}
+		pipe->nrbufs++;
+		page_nr++;
+		ret += buf->len;
 
-		if (signal_pending(current)) {
-			if (!ret)
-				ret = -ERESTARTSYS;
+		if (!--spd->nr_pages)
 			break;
-		}
-
-		if (do_wakeup) {
-			wakeup_pipe_readers(pipe);
-			do_wakeup = 0;
-		}
-
-		pipe->waiting_writers++;
-		pipe_wait(pipe);
-		pipe->waiting_writers--;
 	}
 
-	pipe_unlock(pipe);
-
-	if (do_wakeup)
-		wakeup_pipe_readers(pipe);
+	if (!ret)
+		ret = -EAGAIN;
 
+out:
 	while (page_nr < spd_pages)
 		spd->spd_release(spd, page_nr++);
 
@@ -1339,6 +1301,27 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
 }
 EXPORT_SYMBOL(do_splice_direct);
 
+static bool splice_more(struct pipe_inode_info *pipe,
+			long *p, unsigned flags)
+{
+	if (pipe->nrbufs < pipe->buffers) // no overflows
+		return false;
+	if (flags & SPLICE_F_NONBLOCK) // not allowed to wait
+		return false;
+	if (*p < 0 && *p != -EAGAIN) // error happened
+		return false;
+	if (signal_pending(current)) { // interrupted
+		*p = -ERESTARTSYS;
+		return false;
+	}
+	if (*p > 0)
+		wakeup_pipe_readers(pipe);
+	pipe->waiting_writers++;
+	pipe_wait(pipe);
+	pipe->waiting_writers--;
+	return true;
+}
+
 static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			       struct pipe_inode_info *opipe,
 			       size_t len, unsigned int flags);
@@ -1410,6 +1393,8 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 	}
 
 	if (opipe) {
+		size_t total = 0;
+		int bogus_count;
 		if (off_out)
 			return -ESPIPE;
 		if (off_in) {
@@ -1421,8 +1406,25 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 			offset = in->f_pos;
 		}
 
-		ret = do_splice_to(in, &offset, opipe, len, flags);
-
+		ret = 0;
+		pipe_lock(opipe);
+		bogus_count = opipe->buffers;
+		do {
+			bogus_count += opipe->nrbufs;
+			ret = do_splice_to(in, &offset, opipe, len, flags);
+			if (ret > 0) {
+				total += ret;
+				len -= ret;
+			}
+			bogus_count -= opipe->nrbufs;
+			if (bogus_count <= 0)
+				break;
+		} while (len && splice_more(opipe, &ret, flags));
+		pipe_unlock(opipe);
+		if (total) {
+			wakeup_pipe_readers(opipe);
+			ret = total;
+		}
 		if (!off_in)
 			in->f_pos = offset;
 		else if (copy_to_user(off_in, &offset, sizeof(loff_t)))
@@ -1434,22 +1436,23 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 	return -EINVAL;
 }
 
-static int get_iovec_page_array(struct iov_iter *from,
+static int get_iovec_page_array(const struct iov_iter *from,
 				struct page **pages,
 				struct partial_page *partial,
 				unsigned int pipe_buffers)
 {
+	struct iov_iter i = *from;
 	int buffers = 0;
-	while (iov_iter_count(from)) {
+	while (iov_iter_count(&i)) {
 		ssize_t copied;
 		size_t start;
 
-		copied = iov_iter_get_pages(from, pages + buffers, ~0UL,
+		copied = iov_iter_get_pages(&i, pages + buffers, ~0UL,
 					pipe_buffers - buffers, &start);
 		if (copied <= 0)
 			return buffers ? buffers : copied;
 
-		iov_iter_advance(from, copied);
+		iov_iter_advance(&i, copied);
 		while (copied) {
 			int size = min_t(int, copied, PAGE_SIZE - start);
 			partial[buffers].offset = start;
@@ -1530,7 +1533,8 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 		.ops = &user_page_pipe_buf_ops,
 		.spd_release = spd_release_page,
 	};
-	long ret;
+	long ret, total = 0;
+	int bogus_count;
 
 	pipe = get_pipe_info(file);
 	if (!pipe)
@@ -1546,14 +1550,31 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 		return -ENOMEM;
 	}
 
-	spd.nr_pages = get_iovec_page_array(&from, spd.pages,
-					    spd.partial,
-					    spd.nr_pages_max);
-	if (spd.nr_pages <= 0)
-		ret = spd.nr_pages;
-	else
+	pipe_lock(pipe);
+	bogus_count = pipe->buffers;
+	do {
+		bogus_count += pipe->nrbufs;
+		spd.nr_pages = get_iovec_page_array(&from, spd.pages,
+						    spd.partial,
+						    spd.nr_pages_max);
+		if (spd.nr_pages <= 0) {
+			ret = spd.nr_pages;
+			break;
+		}
 		ret = splice_to_pipe(pipe, &spd);
-
+		if (ret > 0) {
+			total += ret;
+			iov_iter_advance(&from, ret);
+		}
+		bogus_count -= pipe->nrbufs;
+		if (bogus_count <= 0)
+			break;
+	} while (iov_iter_count(&from) && splice_more(pipe, &ret, flags));
+	pipe_unlock(pipe);
+	if (total) {
+		wakeup_pipe_readers(pipe);
+		ret = total;
+	}
 	splice_shrink_spd(&spd);
 	kfree(iov);
 	return ret;
-- 
2.9.3



* [PATCH 05/11] skb_splice_bits(): get rid of callback
  2016-09-23 19:00                         ` [RFC][CFT] splice_read reworked Al Viro
                                             ` (3 preceding siblings ...)
  2016-09-23 19:03                           ` [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe() Al Viro
@ 2016-09-23 19:03                           ` Al Viro
  2016-09-23 19:04                           ` [PATCH 06/11] new helper: add_to_pipe() Al Viro
                                             ` (6 subsequent siblings)
  11 siblings, 0 replies; 104+ messages in thread
From: Al Viro @ 2016-09-23 19:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

Since pipe_lock is the outermost now, we don't need to drop/regain
socket locks around the call of splice_to_pipe() from skb_splice_bits(),
which kills the need to have a socket-specific callback; we can just
call splice_to_pipe() and be done with that.
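
As a rough illustration of the ordering argument, here is a single-threaded
toy checker that only allows locks to be taken in increasing rank.  It is
not lockdep and not kernel code; the ranks and all toy_* names are
assumptions made for the example, with the pipe lock ranked outermost.

```c
#include <assert.h>

/* Hypothetical ranks: a lower rank must be taken first. */
enum toy_lock { TOY_PIPE_LOCK = 1, TOY_SOCK_LOCK = 2 };

struct toy_ctx {
	int held[2];	/* stack of ranks currently held */
	int depth;
};

/* Returns 0 on success, -1 if taking 'lock' now would invert the order. */
int toy_acquire(struct toy_ctx *c, enum toy_lock lock)
{
	if (c->depth && c->held[c->depth - 1] >= (int)lock)
		return -1;
	assert(c->depth < 2);
	c->held[c->depth++] = (int)lock;
	return 0;
}

/* Drops the most recently taken lock. */
void toy_release(struct toy_ctx *c)
{
	assert(c->depth > 0);
	c->depth--;
}
```

Taking the pipe lock and then the socket lock passes the check; taking
the socket lock first and then asking for the pipe lock is rejected,
which is the inversion the removed callbacks used to dodge by dropping
the socket lock.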

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 include/linux/skbuff.h |  8 +-------
 net/core/skbuff.c      | 28 ++--------------------------
 net/ipv4/tcp.c         |  3 +--
 net/kcm/kcmsock.c      | 16 +---------------
 net/unix/af_unix.c     | 17 +----------------
 5 files changed, 6 insertions(+), 66 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0f665cb..f520251 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3021,15 +3021,9 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
 int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
 __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
 			      int len, __wsum csum);
-ssize_t skb_socket_splice(struct sock *sk,
-			  struct pipe_inode_info *pipe,
-			  struct splice_pipe_desc *spd);
 int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int len,
-		    unsigned int flags,
-		    ssize_t (*splice_cb)(struct sock *,
-					 struct pipe_inode_info *,
-					 struct splice_pipe_desc *));
+		    unsigned int flags);
 void skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to);
 unsigned int skb_zerocopy_headlen(const struct sk_buff *from);
 int skb_zerocopy(struct sk_buff *to, struct sk_buff *from,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3864b4b6..208a9bc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1962,37 +1962,13 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
 	return false;
 }
 
-ssize_t skb_socket_splice(struct sock *sk,
-			  struct pipe_inode_info *pipe,
-			  struct splice_pipe_desc *spd)
-{
-	int ret;
-
-	/* Drop the socket lock, otherwise we have reverse
-	 * locking dependencies between sk_lock and i_mutex
-	 * here as compared to sendfile(). We enter here
-	 * with the socket lock held, and splice_to_pipe() will
-	 * grab the pipe inode lock. For sendfile() emulation,
-	 * we call into ->sendpage() with the i_mutex lock held
-	 * and networking will grab the socket lock.
-	 */
-	release_sock(sk);
-	ret = splice_to_pipe(pipe, spd);
-	lock_sock(sk);
-
-	return ret;
-}
-
 /*
  * Map data from the skb to a pipe. Should handle both the linear part,
  * the fragments, and the frag list.
  */
 int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int tlen,
-		    unsigned int flags,
-		    ssize_t (*splice_cb)(struct sock *,
-					 struct pipe_inode_info *,
-					 struct splice_pipe_desc *))
+		    unsigned int flags)
 {
 	struct partial_page partial[MAX_SKB_FRAGS];
 	struct page *pages[MAX_SKB_FRAGS];
@@ -2009,7 +1985,7 @@ int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
 	__skb_splice_bits(skb, pipe, &offset, &tlen, &spd, sk);
 
 	if (spd.nr_pages)
-		ret = splice_cb(sk, pipe, &spd);
+		ret = splice_to_pipe(pipe, &spd);
 
 	return ret;
 }
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ffbb218..ddd2179 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -688,8 +688,7 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
 	int ret;
 
 	ret = skb_splice_bits(skb, skb->sk, offset, tss->pipe,
-			      min(rd_desc->count, len), tss->flags,
-			      skb_socket_splice);
+			      min(rd_desc->count, len), tss->flags);
 	if (ret > 0)
 		rd_desc->count -= ret;
 	return ret;
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index cb39e05..994baae 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -1461,19 +1461,6 @@ out:
 	return copied ? : err;
 }
 
-static ssize_t kcm_sock_splice(struct sock *sk,
-			       struct pipe_inode_info *pipe,
-			       struct splice_pipe_desc *spd)
-{
-	int ret;
-
-	release_sock(sk);
-	ret = splice_to_pipe(pipe, spd);
-	lock_sock(sk);
-
-	return ret;
-}
-
 static ssize_t kcm_splice_read(struct socket *sock, loff_t *ppos,
 			       struct pipe_inode_info *pipe, size_t len,
 			       unsigned int flags)
@@ -1503,8 +1490,7 @@ static ssize_t kcm_splice_read(struct socket *sock, loff_t *ppos,
 	if (len > rxm->full_len)
 		len = rxm->full_len;
 
-	copied = skb_splice_bits(skb, sk, rxm->offset, pipe, len, flags,
-				 kcm_sock_splice);
+	copied = skb_splice_bits(skb, sk, rxm->offset, pipe, len, flags);
 	if (copied < 0) {
 		err = copied;
 		goto err_out;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f1dffe8..e7707ca 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2488,28 +2488,13 @@ static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
 	return unix_stream_read_generic(&state);
 }
 
-static ssize_t skb_unix_socket_splice(struct sock *sk,
-				      struct pipe_inode_info *pipe,
-				      struct splice_pipe_desc *spd)
-{
-	int ret;
-	struct unix_sock *u = unix_sk(sk);
-
-	mutex_unlock(&u->readlock);
-	ret = splice_to_pipe(pipe, spd);
-	mutex_lock(&u->readlock);
-
-	return ret;
-}
-
 static int unix_stream_splice_actor(struct sk_buff *skb,
 				    int skip, int chunk,
 				    struct unix_stream_read_state *state)
 {
 	return skb_splice_bits(skb, state->socket->sk,
 			       UNIXCB(skb).consumed + skip,
-			       state->pipe, chunk, state->splice_flags,
-			       skb_unix_socket_splice);
+			       state->pipe, chunk, state->splice_flags);
 }
 
 static ssize_t unix_stream_splice_read(struct socket *sock,  loff_t *ppos,
-- 
2.9.3



* [PATCH 06/11] new helper: add_to_pipe()
  2016-09-23 19:00                         ` [RFC][CFT] splice_read reworked Al Viro
                                             ` (4 preceding siblings ...)
  2016-09-23 19:03                           ` [PATCH 05/11] skb_splice_bits(): get rid of callback Al Viro
@ 2016-09-23 19:04                           ` Al Viro
  2016-09-23 19:04                           ` [PATCH 07/11] fuse_dev_splice_read(): switch to add_to_pipe() Al Viro
                                             ` (5 subsequent siblings)
  11 siblings, 0 replies; 104+ messages in thread
From: Al Viro @ 2016-09-23 19:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

Single-buffer analogue of splice_to_pipe(); vmsplice_to_pipe() is switched
to it, leaving splice_to_pipe() only for ->splice_read() instances
(and that only until they are converted as well).
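
A userspace sketch of the single-buffer idea, for illustration only.  The
point it tries to show is the ownership rule: on success the pipe takes
over the buffer, on failure the helper releases it on the caller's
behalf.  All toy_* names are hypothetical, and the refcount pointer is
just a stand-in for a page reference.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_PIPE_BUFFERS 4	/* hypothetical size; must be a power of two */

struct toy_buf {
	size_t len;
	int *refcount;	/* stand-in for the page reference held by the buffer */
};

struct toy_pipe {
	unsigned int curbuf, nrbufs, readers;
	struct toy_buf bufs[TOY_PIPE_BUFFERS];
};

/* Stand-in for buf->ops->release(): drop the reference the buffer held. */
void toy_release(struct toy_buf *buf)
{
	if (buf->refcount) {
		(*buf->refcount)--;
		buf->refcount = NULL;
	}
}

/*
 * Queue one buffer.  Returns its length on success; on -EPIPE or -EAGAIN
 * the buffer is released here, so the caller must not release it again.
 */
long toy_add_to_pipe(struct toy_pipe *pipe, struct toy_buf *buf)
{
	long ret;

	if (!pipe->readers) {
		ret = -EPIPE;
	} else if (pipe->nrbufs == TOY_PIPE_BUFFERS) {
		ret = -EAGAIN;
	} else {
		unsigned int slot = (pipe->curbuf + pipe->nrbufs) &
				    (TOY_PIPE_BUFFERS - 1);

		pipe->bufs[slot] = *buf;	/* the pipe now owns the reference */
		pipe->nrbufs++;
		return (long)buf->len;
	}
	toy_release(buf);
	return ret;
}
```

A caller looping over several pages can therefore stop at the first
negative return and only needs to clean up the pages it has not handed
over yet.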

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/splice.c            | 109 ++++++++++++++++++++++++++++---------------------
 include/linux/splice.h |   2 +
 2 files changed, 64 insertions(+), 47 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 9ce6e62..085ad37 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -203,8 +203,6 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		buf->len = spd->partial[page_nr].len;
 		buf->private = spd->partial[page_nr].private;
 		buf->ops = spd->ops;
-		if (spd->flags & SPLICE_F_GIFT)
-			buf->flags |= PIPE_BUF_FLAG_GIFT;
 
 		pipe->nrbufs++;
 		page_nr++;
@@ -225,6 +223,27 @@ out:
 }
 EXPORT_SYMBOL_GPL(splice_to_pipe);
 
+ssize_t add_to_pipe(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
+{
+	int ret;
+
+	if (unlikely(!pipe->readers)) {
+		send_sig(SIGPIPE, current, 0);
+		ret = -EPIPE;
+	} else if (pipe->nrbufs == pipe->buffers) {
+		ret = -EAGAIN;
+	} else {
+		int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+		pipe->bufs[newbuf] = *buf;
+		pipe->nrbufs++;
+		return buf->len;
+	}
+	buf->ops->release(pipe, buf);
+	buf->ops = NULL;
+	return ret;
+}
+EXPORT_SYMBOL(add_to_pipe);
+
 void spd_release_page(struct splice_pipe_desc *spd, unsigned int i)
 {
 	put_page(spd->pages[i]);
@@ -1436,33 +1455,50 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 	return -EINVAL;
 }
 
-static int get_iovec_page_array(const struct iov_iter *from,
-				struct page **pages,
-				struct partial_page *partial,
-				unsigned int pipe_buffers)
+static int iter_to_pipe(struct iov_iter *from,
+			struct pipe_inode_info *pipe,
+			unsigned flags)
 {
-	struct iov_iter i = *from;
-	int buffers = 0;
-	while (iov_iter_count(&i)) {
+	struct pipe_buffer buf = {
+		.ops = &user_page_pipe_buf_ops,
+		.flags = flags
+	};
+	size_t total = 0;
+	int ret = 0;
+	bool failed = false;
+
+	while (iov_iter_count(from) && !failed) {
+		struct page *pages[16];
 		ssize_t copied;
 		size_t start;
+		int n;
 
-		copied = iov_iter_get_pages(&i, pages + buffers, ~0UL,
-					pipe_buffers - buffers, &start);
-		if (copied <= 0)
-			return buffers ? buffers : copied;
+		copied = iov_iter_get_pages(from, pages, ~0UL, 16, &start);
+		if (copied <= 0) {
+			ret = copied;
+			break;
+		}
 
-		iov_iter_advance(&i, copied);
-		while (copied) {
+		for (n = 0; copied; n++, start = 0) {
 			int size = min_t(int, copied, PAGE_SIZE - start);
-			partial[buffers].offset = start;
-			partial[buffers].len = size;
+			if (!failed) {
+				buf.page = pages[n];
+				buf.offset = start;
+				buf.len = size;
+				ret = add_to_pipe(pipe, &buf);
+				if (unlikely(ret < 0)) {
+					failed = true;
+				} else {
+					iov_iter_advance(from, ret);
+					total += ret;
+				}
+			} else {
+				put_page(pages[n]);
+			}
 			copied -= size;
-			start = 0;
-			buffers++;
 		}
 	}
-	return buffers;
+	return total ? total : ret;
 }
 
 static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
@@ -1523,19 +1559,13 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 	struct iovec iovstack[UIO_FASTIOV];
 	struct iovec *iov = iovstack;
 	struct iov_iter from;
-	struct page *pages[PIPE_DEF_BUFFERS];
-	struct partial_page partial[PIPE_DEF_BUFFERS];
-	struct splice_pipe_desc spd = {
-		.pages = pages,
-		.partial = partial,
-		.nr_pages_max = PIPE_DEF_BUFFERS,
-		.flags = flags,
-		.ops = &user_page_pipe_buf_ops,
-		.spd_release = spd_release_page,
-	};
 	long ret, total = 0;
+	unsigned buf_flag = 0;
 	int bogus_count;
 
+	if (flags & SPLICE_F_GIFT)
+		buf_flag = PIPE_BUF_FLAG_GIFT;
+
 	pipe = get_pipe_info(file);
 	if (!pipe)
 		return -EBADF;
@@ -1545,27 +1575,13 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 	if (ret < 0)
 		return ret;
 
-	if (splice_grow_spd(pipe, &spd)) {
-		kfree(iov);
-		return -ENOMEM;
-	}
-
 	pipe_lock(pipe);
 	bogus_count = pipe->buffers;
 	do {
 		bogus_count += pipe->nrbufs;
-		spd.nr_pages = get_iovec_page_array(&from, spd.pages,
-						    spd.partial,
-						    spd.nr_pages_max);
-		if (spd.nr_pages <= 0) {
-			ret = spd.nr_pages;
-			break;
-		}
-		ret = splice_to_pipe(pipe, &spd);
-		if (ret > 0) {
+		ret = iter_to_pipe(&from, pipe, buf_flag);
+		if (ret > 0)
 			total += ret;
-			iov_iter_advance(&from, ret);
-		}
 		bogus_count -= pipe->nrbufs;
 		if (bogus_count <= 0)
 			break;
@@ -1575,7 +1591,6 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 		wakeup_pipe_readers(pipe);
 		ret = total;
 	}
-	splice_shrink_spd(&spd);
 	kfree(iov);
 	return ret;
 }
diff --git a/include/linux/splice.h b/include/linux/splice.h
index da2751d..58b300f 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -72,6 +72,8 @@ extern ssize_t __splice_from_pipe(struct pipe_inode_info *,
 				  struct splice_desc *, splice_actor *);
 extern ssize_t splice_to_pipe(struct pipe_inode_info *,
 			      struct splice_pipe_desc *);
+extern ssize_t add_to_pipe(struct pipe_inode_info *,
+			      struct pipe_buffer *);
 extern ssize_t splice_direct_to_actor(struct file *, struct splice_desc *,
 				      splice_direct_actor *);
 
-- 
2.9.3



* [PATCH 07/11] fuse_dev_splice_read(): switch to add_to_pipe()
  2016-09-23 19:00                         ` [RFC][CFT] splice_read reworked Al Viro
                                             ` (5 preceding siblings ...)
  2016-09-23 19:04                           ` [PATCH 06/11] new helper: add_to_pipe() Al Viro
@ 2016-09-23 19:04                           ` Al Viro
  2016-09-23 19:06                           ` [PATCH 08/11] cifs: don't use memcpy() to copy struct iov_iter Al Viro
                                             ` (4 subsequent siblings)
  11 siblings, 0 replies; 104+ messages in thread
From: Al Viro @ 2016-09-23 19:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/fuse/dev.c | 46 +++++++++-------------------------------------
 1 file changed, 9 insertions(+), 37 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index eaf56c6..0a6a808 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1342,9 +1342,8 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 				    struct pipe_inode_info *pipe,
 				    size_t len, unsigned int flags)
 {
-	int ret;
+	int total, ret;
 	int page_nr = 0;
-	int do_wakeup = 0;
 	struct pipe_buffer *bufs;
 	struct fuse_copy_state cs;
 	struct fuse_dev *fud = fuse_get_dev(in);
@@ -1363,50 +1362,23 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 	if (ret < 0)
 		goto out;
 
-	ret = 0;
-
-	if (!pipe->readers) {
-		send_sig(SIGPIPE, current, 0);
-		if (!ret)
-			ret = -EPIPE;
-		goto out_unlock;
-	}
-
 	if (pipe->nrbufs + cs.nr_segs > pipe->buffers) {
 		ret = -EIO;
-		goto out_unlock;
+		goto out;
 	}
 
-	while (page_nr < cs.nr_segs) {
-		int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
-		struct pipe_buffer *buf = pipe->bufs + newbuf;
-
-		buf->page = bufs[page_nr].page;
-		buf->offset = bufs[page_nr].offset;
-		buf->len = bufs[page_nr].len;
+	for (ret = total = 0; page_nr < cs.nr_segs; total += ret) {
 		/*
 		 * Need to be careful about this.  Having buf->ops in module
 		 * code can Oops if the buffer persists after module unload.
 		 */
-		buf->ops = &nosteal_pipe_buf_ops;
-
-		pipe->nrbufs++;
-		page_nr++;
-		ret += buf->len;
-
-		if (pipe->files)
-			do_wakeup = 1;
-	}
-
-out_unlock:
-
-	if (do_wakeup) {
-		smp_mb();
-		if (waitqueue_active(&pipe->wait))
-			wake_up_interruptible(&pipe->wait);
-		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
+		bufs[page_nr].ops = &nosteal_pipe_buf_ops;
+		ret = add_to_pipe(pipe, &bufs[page_nr++]);
+		if (unlikely(ret < 0))
+			break;
 	}
-
+	if (total)
+		ret = total;
 out:
 	for (; page_nr < cs.nr_segs; page_nr++)
 		put_page(bufs[page_nr].page);
-- 
2.9.3



* [PATCH 08/11] cifs: don't use memcpy() to copy struct iov_iter
  2016-09-23 19:00                         ` [RFC][CFT] splice_read reworked Al Viro
                                             ` (6 preceding siblings ...)
  2016-09-23 19:04                           ` [PATCH 07/11] fuse_dev_splice_read(): switch to add_to_pipe() Al Viro
@ 2016-09-23 19:06                           ` Al Viro
  2016-09-23 19:08                           ` [PATCH 09/11] fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter() Al Viro
                                             ` (3 subsequent siblings)
  11 siblings, 0 replies; 104+ messages in thread
From: Al Viro @ 2016-09-23 19:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

It's not the 70s anymore; plain struct assignment does the job.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
[obviously should be separated; trivial cleanup almost unrelated to series]
 fs/cifs/file.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 579e41b..42b99af 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2478,7 +2478,7 @@ cifs_write_from_iter(loff_t offset, size_t len, struct iov_iter *from,
 	size_t cur_len;
 	unsigned long nr_pages, num_pages, i;
 	struct cifs_writedata *wdata;
-	struct iov_iter saved_from;
+	struct iov_iter saved_from = *from;
 	loff_t saved_offset = offset;
 	pid_t pid;
 	struct TCP_Server_Info *server;
@@ -2489,7 +2489,6 @@ cifs_write_from_iter(loff_t offset, size_t len, struct iov_iter *from,
 		pid = current->tgid;
 
 	server = tlink_tcon(open_file->tlink)->ses->server;
-	memcpy(&saved_from, from, sizeof(struct iov_iter));
 
 	do {
 		unsigned int wsize, credits;
@@ -2551,8 +2550,7 @@ cifs_write_from_iter(loff_t offset, size_t len, struct iov_iter *from,
 			kref_put(&wdata->refcount,
 				 cifs_uncached_writedata_release);
 			if (rc == -EAGAIN) {
-				memcpy(from, &saved_from,
-				       sizeof(struct iov_iter));
+				*from = saved_from;
 				iov_iter_advance(from, offset - saved_offset);
 				continue;
 			}
@@ -2576,7 +2574,7 @@ ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from)
 	struct cifs_sb_info *cifs_sb;
 	struct cifs_writedata *wdata, *tmp;
 	struct list_head wdata_list;
-	struct iov_iter saved_from;
+	struct iov_iter saved_from = *from;
 	int rc;
 
 	/*
@@ -2597,8 +2595,6 @@ ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from)
 	if (!tcon->ses->server->ops->async_writev)
 		return -ENOSYS;
 
-	memcpy(&saved_from, from, sizeof(struct iov_iter));
-
 	rc = cifs_write_from_iter(iocb->ki_pos, iov_iter_count(from), from,
 				  open_file, cifs_sb, &wdata_list);
 
@@ -2631,13 +2627,11 @@ restart_loop:
 			/* resend call if it's a retryable error */
 			if (rc == -EAGAIN) {
 				struct list_head tmp_list;
-				struct iov_iter tmp_from;
+				struct iov_iter tmp_from = saved_from;
 
 				INIT_LIST_HEAD(&tmp_list);
 				list_del_init(&wdata->list);
 
-				memcpy(&tmp_from, &saved_from,
-				       sizeof(struct iov_iter));
 				iov_iter_advance(&tmp_from,
 						 wdata->offset - iocb->ki_pos);
 
-- 
2.9.3



* [PATCH 09/11] fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter()
  2016-09-23 19:00                         ` [RFC][CFT] splice_read reworked Al Viro
                                             ` (7 preceding siblings ...)
  2016-09-23 19:06                           ` [PATCH 08/11] cifs: don't use memcpy() to copy struct iov_iter Al Viro
@ 2016-09-23 19:08                           ` Al Viro
  2016-09-26  9:31                             ` Miklos Szeredi
  2016-09-23 19:09                           ` [PATCH 10/11] new iov_iter flavour: pipe-backed Al Viro
                                             ` (2 subsequent siblings)
  11 siblings, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-09-23 19:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
[another cleanup, will be moved out of that branch]
 fs/fuse/file.c | 30 +++++++-----------------------
 1 file changed, 7 insertions(+), 23 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3988b43..4c1db6c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2339,31 +2339,15 @@ static int fuse_ioctl_copy_user(struct page **pages, struct iovec *iov,
 
 	while (iov_iter_count(&ii)) {
 		struct page *page = pages[page_idx++];
-		size_t todo = min_t(size_t, PAGE_SIZE, iov_iter_count(&ii));
-		void *kaddr;
+		size_t copied;
 
-		kaddr = kmap(page);
-
-		while (todo) {
-			char __user *uaddr = ii.iov->iov_base + ii.iov_offset;
-			size_t iov_len = ii.iov->iov_len - ii.iov_offset;
-			size_t copy = min(todo, iov_len);
-			size_t left;
-
-			if (!to_user)
-				left = copy_from_user(kaddr, uaddr, copy);
-			else
-				left = copy_to_user(uaddr, kaddr, copy);
-
-			if (unlikely(left))
-				return -EFAULT;
-
-			iov_iter_advance(&ii, copy);
-			todo -= copy;
-			kaddr += copy;
-		}
+		if (!to_user)
+			copied = copy_page_from_iter(page, 0, PAGE_SIZE, &ii);
+		else
+			copied = copy_page_to_iter(page, 0, PAGE_SIZE, &ii);
 
-		kunmap(page);
+		if (unlikely(copied != PAGE_SIZE && iov_iter_count(&ii)))
+			return -EFAULT;
 	}
 
 	return 0;
-- 
2.9.3



* [PATCH 10/11] new iov_iter flavour: pipe-backed
  2016-09-23 19:00                         ` [RFC][CFT] splice_read reworked Al Viro
                                             ` (8 preceding siblings ...)
  2016-09-23 19:08                           ` [PATCH 09/11] fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter() Al Viro
@ 2016-09-23 19:09                           ` Al Viro
  2016-09-23 19:10                           ` [PATCH 11/11] switch generic_file_splice_read() to use of ->read_iter() Al Viro
  2016-09-30 13:32                           ` [RFC][CFT] splice_read reworked CAI Qian
  11 siblings, 0 replies; 104+ messages in thread
From: Al Viro @ 2016-09-23 19:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

An iov_iter variant for passing data into a pipe.  copy_to_iter()
copies data into page(s) it has allocated and stuffs them into
the pipe; copy_page_to_iter() instead stuffs a reference to the
page given to it into the pipe.  Both will try to coalesce with
the last buffer if possible.  iov_iter_zero() is similar to
copy_to_iter(); iov_iter_get_pages() and friends will allocate
pages as copy_to_iter() would have and return the pages where the
data would've been copied.  iov_iter_advance() will truncate
everything past the spot it has advanced to.
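
The coalescing of page references can be pictured with a small
userspace model.  This is only a sketch under simplifying assumptions,
not the kernel code: struct model_pipe, struct model_buf and
model_add_page() are made-up names, pages are plain pointers, and the
merge test looks at the last buffer in the ring, whereas the patch
keys it off the iterator's own position (i->idx and i->iov_offset).

```c
#include <stddef.h>

#define MODEL_NBUFS 16	/* power of two, like pipe->buffers */

struct model_buf {
	const void *page;	/* stands in for struct page * */
	size_t offset;
	size_t len;
};

struct model_pipe {
	struct model_buf bufs[MODEL_NBUFS];
	unsigned int curbuf;	/* first occupied slot */
	unsigned int nrbufs;	/* number of occupied slots */
};

/*
 * Queue a reference to [offset, offset + bytes) of page.
 * Returns bytes on success, 0 if nothing was requested or if the
 * ring is full and no merge was possible.
 */
static size_t model_add_page(struct model_pipe *p, const void *page,
			     size_t offset, size_t bytes)
{
	struct model_buf *b;

	if (!bytes)
		return 0;

	if (p->nrbufs) {
		b = &p->bufs[(p->curbuf + p->nrbufs - 1) & (MODEL_NBUFS - 1)];
		if (b->page == page && b->offset + b->len == offset) {
			/* same page, contiguous range: extend the last buffer */
			b->len += bytes;
			return bytes;
		}
	}
	if (p->nrbufs == MODEL_NBUFS)
		return 0;	/* no free slot left */

	b = &p->bufs[(p->curbuf + p->nrbufs) & (MODEL_NBUFS - 1)];
	b->page = page;
	b->offset = offset;
	b->len = bytes;
	p->nrbufs++;
	return bytes;
}
```

Two back-to-back calls for adjacent ranges of the same page end up in
one slot; a gap or a different page takes a new slot.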

New primitive: iov_iter_pipe(), used for initializing those.
The pipe should stay locked for as long as such an iterator is in use.
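
The index arithmetic that iov_iter_pipe() and the helpers in this
patch rely on can be checked in userspace.  A rough sketch follows;
the model_*() names are made up for illustration, and only the
expressions themselves are taken from the diff.  It assumes the
number of buffers is a power of two, which the masking requires.

```c
#include <assert.h>
#include <stdbool.h>

/* First free slot: where iov_iter_pipe() points i->idx initially. */
static unsigned int model_first_free(unsigned int curbuf,
				     unsigned int nrbufs,
				     unsigned int buffers)
{
	assert(buffers && !(buffers & (buffers - 1)));
	return (curbuf + nrbufs) & (buffers - 1);
}

/*
 * Slots from idx up to (not including) curbuf, going around the
 * ring: "some of this one + all after this one" in the patch.
 * Yields the full ring size when idx == curbuf, so it cannot tell
 * an empty ring from a full one by itself.
 */
static unsigned int model_slots_from(unsigned int curbuf,
				     unsigned int idx,
				     unsigned int buffers)
{
	return ((curbuf - idx - 1) & (buffers - 1)) + 1;
}

/* The extra test the patch uses to recognise a full ring. */
static bool model_ring_full(unsigned int curbuf, unsigned int nrbufs,
			    unsigned int idx)
{
	return idx == curbuf && nrbufs != 0;
}
```

With 16 buffers, curbuf == 14 and nrbufs == 3, the first free slot is
1 and 13 slots are available from there, which matches 16 - 3.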

Running out of space acts as a fault would for iovec-backed ones;
in other words, giving it to ->read_iter() may result in a short
read if the pipe overflows, or -EFAULT if that happens with nothing
copied there.
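
A toy model of that rule, for illustration only: model_read_iter()
and model_copy_to_pipe() are hypothetical names, the page size is
fixed at 4096, and a real ->read_iter() instance has far more going
on.  The only point is the mapping from "how much fitted" to the
return value.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

#define MODEL_PAGE_SIZE 4096

/* How much of 'bytes' fits into 'free_slots' empty pipe slots. */
static size_t model_copy_to_pipe(size_t free_slots, size_t bytes)
{
	size_t room = free_slots * MODEL_PAGE_SIZE;

	return bytes < room ? bytes : room;
}

/*
 * Short read if only part of the request fitted, -EFAULT if
 * nothing could be copied at all, 0 for an empty request.
 */
static ssize_t model_read_iter(size_t free_slots, size_t want)
{
	size_t copied;

	if (!want)
		return 0;
	copied = model_copy_to_pipe(free_slots, want);
	if (copied)
		return (ssize_t)copied;
	return -EFAULT;
}
```

So a request for two pages against one free slot comes back as a
one-page short read, while the same request against a full pipe
comes back as -EFAULT.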

So ->read_iter() on those acts pretty much like ->splice_read().
Moreover, all generic_file_splice_read() users, as well as many
other ->splice_read() instances, can be switched to that scheme -
that'll happen in the next commit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
[this certainly needs to be documented in more detail]
 fs/splice.c            |   2 +-
 include/linux/splice.h |   1 +
 include/linux/uio.h    |  14 +-
 lib/iov_iter.c         | 390 ++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 401 insertions(+), 6 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 085ad37..0daa7d1 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -524,7 +524,7 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 }
 EXPORT_SYMBOL(generic_file_splice_read);
 
-static const struct pipe_buf_operations default_pipe_buf_ops = {
+const struct pipe_buf_operations default_pipe_buf_ops = {
 	.can_merge = 0,
 	.confirm = generic_pipe_buf_confirm,
 	.release = generic_pipe_buf_release,
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 58b300f..00a2116 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -85,4 +85,5 @@ extern void splice_shrink_spd(struct splice_pipe_desc *);
 extern void spd_release_page(struct splice_pipe_desc *, unsigned int);
 
 extern const struct pipe_buf_operations page_cache_pipe_buf_ops;
+extern const struct pipe_buf_operations default_pipe_buf_ops;
 #endif
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 1b5d1cd..c4fe1ab 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -13,6 +13,7 @@
 #include <uapi/linux/uio.h>
 
 struct page;
+struct pipe_inode_info;
 
 struct kvec {
 	void *iov_base; /* and that should *never* hold a userland pointer */
@@ -23,6 +24,7 @@ enum {
 	ITER_IOVEC = 0,
 	ITER_KVEC = 2,
 	ITER_BVEC = 4,
+	ITER_PIPE = 8,
 };
 
 struct iov_iter {
@@ -33,8 +35,12 @@ struct iov_iter {
 		const struct iovec *iov;
 		const struct kvec *kvec;
 		const struct bio_vec *bvec;
+		struct pipe_inode_info *pipe;
+	};
+	union {
+		unsigned long nr_segs;
+		int idx;
 	};
-	unsigned long nr_segs;
 };
 
 /*
@@ -64,7 +70,7 @@ static inline struct iovec iov_iter_iovec(const struct iov_iter *iter)
 }
 
 #define iov_for_each(iov, iter, start)				\
-	if (!((start).type & ITER_BVEC))			\
+	if (!((start).type & (ITER_BVEC | ITER_PIPE)))		\
 	for (iter = (start);					\
 	     (iter).count &&					\
 	     ((iov = iov_iter_iovec(&(iter))), 1);		\
@@ -94,6 +100,8 @@ void iov_iter_kvec(struct iov_iter *i, int direction, const struct kvec *kvec,
 			unsigned long nr_segs, size_t count);
 void iov_iter_bvec(struct iov_iter *i, int direction, const struct bio_vec *bvec,
 			unsigned long nr_segs, size_t count);
+void iov_iter_pipe(struct iov_iter *i, int direction, struct pipe_inode_info *pipe,
+			size_t count);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
@@ -109,7 +117,7 @@ static inline size_t iov_iter_count(struct iov_iter *i)
 
 static inline bool iter_is_iovec(struct iov_iter *i)
 {
-	return !(i->type & (ITER_BVEC | ITER_KVEC));
+	return !(i->type & (ITER_BVEC | ITER_KVEC | ITER_PIPE));
 }
 
 /*
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 9e8c738..02efc898 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -3,8 +3,11 @@
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
+#include <linux/splice.h>
 #include <net/checksum.h>
 
+#define PIPE_PARANOIA /* for now */
+
 #define iterate_iovec(i, n, __v, __p, skip, STEP) {	\
 	size_t left;					\
 	size_t wanted = n;				\
@@ -290,6 +293,82 @@ done:
 	return wanted - bytes;
 }
 
+#ifdef PIPE_PARANOIA
+static bool sanity(const struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	int idx = i->idx;
+	int delta = (pipe->curbuf + pipe->nrbufs - idx) & (pipe->buffers - 1);
+	if (i->iov_offset) {
+		struct pipe_buffer *p;
+		if (unlikely(delta != 1) || unlikely(!pipe->nrbufs))
+			goto Bad;	// must be at the last buffer...
+
+		p = &pipe->bufs[idx];
+		if (unlikely(p->offset + p->len != i->iov_offset))
+			goto Bad;	// ... at the end of segment
+	} else {
+		if (delta)
+			goto Bad;	// must be right after the last buffer
+	}
+	return true;
+Bad:
+	WARN_ON(1);
+	return false;
+}
+#else
+#define sanity(i) true
+#endif
+
+static inline int next_idx(int idx, struct pipe_inode_info *pipe)
+{
+	return (idx + 1) & (pipe->buffers - 1);
+}
+
+static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
+			 struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	struct pipe_buffer *buf;
+	size_t off;
+	int idx;
+
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	if (!sanity(i))
+		return 0;
+
+	off = i->iov_offset;
+	idx = i->idx;
+	buf = &pipe->bufs[idx];
+	if (off) {
+		if (offset == off && buf->page == page) {
+			/* merge with the last one */
+			buf->len += bytes;
+			i->iov_offset += bytes;
+			goto out;
+		}
+		idx = next_idx(idx, pipe);
+		buf = &pipe->bufs[idx];
+	}
+	if (idx == pipe->curbuf && pipe->nrbufs)
+		return 0;
+	pipe->nrbufs++;
+	buf->ops = &page_cache_pipe_buf_ops;
+	get_page(buf->page = page);
+	buf->offset = offset;
+	buf->len = bytes;
+	i->iov_offset = offset + bytes;
+	i->idx = idx;
+out:
+	i->count -= bytes;
+	return bytes;
+}
+
 /*
  * Fault in the first iovec of the given iov_iter, to a maximum length
  * of bytes. Returns 0 on success, or non-zero if the memory could not be
@@ -376,9 +455,98 @@ static void memzero_page(struct page *page, size_t offset, size_t len)
 	kunmap_atomic(addr);
 }
 
+static inline bool allocated(struct pipe_buffer *buf)
+{
+	return buf->ops == &default_pipe_buf_ops;
+}
+
+static inline void data_start(const struct iov_iter *i, int *idxp, size_t *offp)
+{
+	size_t off = i->iov_offset;
+	int idx = i->idx;
+	if (off && (!allocated(&i->pipe->bufs[idx]) || off == PAGE_SIZE)) {
+		idx = next_idx(idx, i->pipe);
+		off = 0;
+	}
+	*idxp = idx;
+	*offp = off;
+}
+
+static size_t push_pipe(struct iov_iter *i, size_t size,
+			int *idxp, size_t *offp)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	size_t off;
+	int idx;
+	ssize_t left;
+
+	if (unlikely(size > i->count))
+		size = i->count;
+	if (unlikely(!size))
+		return 0;
+
+	left = size;
+	data_start(i, &idx, &off);
+	*idxp = idx;
+	*offp = off;
+	if (off) {
+		left -= PAGE_SIZE - off;
+		if (left <= 0) {
+			pipe->bufs[idx].len += size;
+			return size;
+		}
+		pipe->bufs[idx].len = PAGE_SIZE;
+		idx = next_idx(idx, pipe);
+	}
+	while (idx != pipe->curbuf || !pipe->nrbufs) {
+		struct page *page = alloc_page(GFP_USER);
+		if (!page)
+			break;
+		pipe->nrbufs++;
+		pipe->bufs[idx].ops = &default_pipe_buf_ops;
+		pipe->bufs[idx].page = page;
+		pipe->bufs[idx].offset = 0;
+		if (left <= PAGE_SIZE) {
+			pipe->bufs[idx].len = left;
+			return size;
+		}
+		pipe->bufs[idx].len = PAGE_SIZE;
+		left -= PAGE_SIZE;
+		idx = next_idx(idx, pipe);
+	}
+	return size - left;
+}
+
+static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
+				struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	size_t n, off;
+	int idx;
+
+	if (!sanity(i))
+		return 0;
+
+	bytes = n = push_pipe(i, bytes, &idx, &off);
+	if (unlikely(!n))
+		return 0;
+	for ( ; n; idx = next_idx(idx, pipe), off = 0) {
+		size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
+		memcpy_to_page(pipe->bufs[idx].page, off, addr, chunk);
+		i->idx = idx;
+		i->iov_offset = off + chunk;
+		n -= chunk;
+		addr += chunk;
+	}
+	i->count -= bytes;
+	return bytes;
+}
+
 size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
 	const char *from = addr;
+	if (unlikely(i->type & ITER_PIPE))
+		return copy_pipe_to_iter(addr, bytes, i);
 	iterate_and_advance(i, bytes, v,
 		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
 			       v.iov_len),
@@ -394,6 +562,10 @@ EXPORT_SYMBOL(copy_to_iter);
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 {
 	char *to = addr;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
 	iterate_and_advance(i, bytes, v,
 		__copy_from_user((to += v.iov_len) - v.iov_len, v.iov_base,
 				 v.iov_len),
@@ -409,6 +581,10 @@ EXPORT_SYMBOL(copy_from_iter);
 size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 {
 	char *to = addr;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
 	iterate_and_advance(i, bytes, v,
 		__copy_from_user_nocache((to += v.iov_len) - v.iov_len,
 					 v.iov_base, v.iov_len),
@@ -429,14 +605,20 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 		size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
 		kunmap_atomic(kaddr);
 		return wanted;
-	} else
+	} else if (likely(!(i->type & ITER_PIPE)))
 		return copy_page_to_iter_iovec(page, offset, bytes, i);
+	else
+		return copy_page_to_iter_pipe(page, offset, bytes, i);
 }
 EXPORT_SYMBOL(copy_page_to_iter);
 
 size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
 	if (i->type & (ITER_BVEC|ITER_KVEC)) {
 		void *kaddr = kmap_atomic(page);
 		size_t wanted = copy_from_iter(kaddr + offset, bytes, i);
@@ -447,8 +629,34 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 }
 EXPORT_SYMBOL(copy_page_from_iter);
 
+static size_t pipe_zero(size_t bytes, struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	size_t n, off;
+	int idx;
+
+	if (!sanity(i))
+		return 0;
+
+	bytes = n = push_pipe(i, bytes, &idx, &off);
+	if (unlikely(!n))
+		return 0;
+
+	for ( ; n; idx = next_idx(idx, pipe), off = 0) {
+		size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
+		memzero_page(pipe->bufs[idx].page, off, chunk);
+		i->idx = idx;
+		i->iov_offset = off + chunk;
+		n -= chunk;
+	}
+	i->count -= bytes;
+	return bytes;
+}
+
 size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 {
+	if (unlikely(i->type & ITER_PIPE))
+		return pipe_zero(bytes, i);
 	iterate_and_advance(i, bytes, v,
 		__clear_user(v.iov_base, v.iov_len),
 		memzero_page(v.bv_page, v.bv_offset, v.bv_len),
@@ -463,6 +671,11 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 		struct iov_iter *i, unsigned long offset, size_t bytes)
 {
 	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
+	if (unlikely(i->type & ITER_PIPE)) {
+		kunmap_atomic(kaddr);
+		WARN_ON(1);
+		return 0;
+	}
 	iterate_all_kinds(i, bytes, v,
 		__copy_from_user_inatomic((p += v.iov_len) - v.iov_len,
 					  v.iov_base, v.iov_len),
@@ -475,8 +688,55 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 }
 EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
 
+static void pipe_advance(struct iov_iter *i, size_t size)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	struct pipe_buffer *buf;
+	size_t off;
+	int idx;
+	
+	if (unlikely(i->count < size))
+		size = i->count;
+
+	idx = i->idx;
+	off = i->iov_offset;
+	if (size || off) {
+		/* take it relative to the beginning of buffer */
+		size += off - pipe->bufs[idx].offset;
+		while (1) {
+			buf = &pipe->bufs[idx];
+			if (size > buf->len) {
+				size -= buf->len;
+				idx = next_idx(idx, pipe);
+				off = 0;
+			} else {
+				buf->len = size;
+				i->idx = idx;
+				i->iov_offset = off = buf->offset + size;
+				break;
+			}
+		}
+		idx = next_idx(idx, pipe);
+	}
+	if (pipe->nrbufs) {
+		int unused = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+		/* [curbuf,unused) is in use.  Free [idx,unused) */
+		while (idx != unused) {
+			buf = &pipe->bufs[idx];
+			buf->ops->release(pipe, buf);
+			buf->ops = NULL;
+			idx = next_idx(idx, pipe);
+			pipe->nrbufs--;
+		}
+	}
+}
+
 void iov_iter_advance(struct iov_iter *i, size_t size)
 {
+	if (unlikely(i->type & ITER_PIPE)) {
+		pipe_advance(i, size);
+		return;
+	}
 	iterate_and_advance(i, size, v, 0, 0, 0)
 }
 EXPORT_SYMBOL(iov_iter_advance);
@@ -486,6 +746,8 @@ EXPORT_SYMBOL(iov_iter_advance);
  */
 size_t iov_iter_single_seg_count(const struct iov_iter *i)
 {
+	if (unlikely(i->type & ITER_PIPE))
+		return i->count;	// it is a silly place, anyway
 	if (i->nr_segs == 1)
 		return i->count;
 	else if (i->type & ITER_BVEC)
@@ -521,6 +783,19 @@ void iov_iter_bvec(struct iov_iter *i, int direction,
 }
 EXPORT_SYMBOL(iov_iter_bvec);
 
+void iov_iter_pipe(struct iov_iter *i, int direction,
+			struct pipe_inode_info *pipe,
+			size_t count)
+{
+	BUG_ON(direction != ITER_PIPE);
+	i->type = direction;
+	i->pipe = pipe;
+	i->idx = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+	i->iov_offset = 0;
+	i->count = count;
+}
+EXPORT_SYMBOL(iov_iter_pipe);
+
 unsigned long iov_iter_alignment(const struct iov_iter *i)
 {
 	unsigned long res = 0;
@@ -529,6 +804,11 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
 	if (!size)
 		return 0;
 
+	if (unlikely(i->type & ITER_PIPE)) {
+		if (i->iov_offset && allocated(&i->pipe->bufs[i->idx]))
+			return size | i->iov_offset;
+		return size;
+	}
 	iterate_all_kinds(i, size, v,
 		(res |= (unsigned long)v.iov_base | v.iov_len, 0),
 		res |= v.bv_offset | v.bv_len,
@@ -545,6 +825,11 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
 	if (!size)
 		return 0;
 
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return ~0U;
+	}
+
 	iterate_all_kinds(i, size, v,
 		(res |= (!res ? 0 : (unsigned long)v.iov_base) |
 			(size != v.iov_len ? size : 0), 0),
@@ -557,6 +842,47 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_gap_alignment);
 
+static inline size_t __pipe_get_pages(struct iov_iter *i,
+				size_t maxsize,
+				struct page **pages,
+				int idx,
+				size_t *start)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	size_t n = push_pipe(i, maxsize, &idx, start);
+	if (!n)
+		return 0;
+
+	maxsize = n;
+	n += *start;
+	while (n >= PAGE_SIZE) {
+		*pages++ = pipe->bufs[idx].page;
+		idx = next_idx(idx, pipe);
+		n -= PAGE_SIZE;
+	}
+
+	return maxsize;
+}
+
+static ssize_t pipe_get_pages(struct iov_iter *i,
+		   struct page **pages, size_t maxsize, unsigned maxpages,
+		   size_t *start)
+{
+	unsigned npages;
+	size_t capacity;
+	int idx;
+
+	if (!sanity(i))
+		return 0;
+
+	data_start(i, &idx, start);
+	/* some of this one + all after this one */
+	npages = ((i->pipe->curbuf - idx - 1) & (i->pipe->buffers - 1)) + 1;
+	capacity = min(npages,maxpages) * PAGE_SIZE - *start;
+
+	return __pipe_get_pages(i, min(maxsize, capacity), pages, idx, start);
+}
+
 ssize_t iov_iter_get_pages(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
@@ -567,6 +893,8 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 	if (!maxsize)
 		return 0;
 
+	if (unlikely(i->type & ITER_PIPE))
+		return pipe_get_pages(i, pages, maxsize, maxpages, start);
 	iterate_all_kinds(i, maxsize, v, ({
 		unsigned long addr = (unsigned long)v.iov_base;
 		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
@@ -602,6 +930,37 @@ static struct page **get_pages_array(size_t n)
 	return p;
 }
 
+static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
+		   struct page ***pages, size_t maxsize,
+		   size_t *start)
+{
+	struct page **p;
+	size_t n;
+	int idx;
+	int npages;
+
+	if (!sanity(i))
+		return 0;
+
+	data_start(i, &idx, start);
+	/* some of this one + all after this one */
+	npages = ((i->pipe->curbuf - idx - 1) & (i->pipe->buffers - 1)) + 1;
+	n = npages * PAGE_SIZE - *start;
+	if (maxsize > n)
+		maxsize = n;
+	else
+		npages = DIV_ROUND_UP(maxsize + *start, PAGE_SIZE);
+	p = get_pages_array(npages);
+	if (!p)
+		return -ENOMEM;
+	n = __pipe_get_pages(i, maxsize, p, idx, start);
+	if (n)
+		*pages = p;
+	else
+		kvfree(p);
+	return n;
+}
+
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start)
@@ -614,6 +973,8 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 	if (!maxsize)
 		return 0;
 
+	if (unlikely(i->type & ITER_PIPE))
+		return pipe_get_pages_alloc(i, pages, maxsize, start);
 	iterate_all_kinds(i, maxsize, v, ({
 		unsigned long addr = (unsigned long)v.iov_base;
 		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
@@ -655,6 +1016,10 @@ size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 	__wsum sum, next;
 	size_t off = 0;
 	sum = *csum;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
 	iterate_and_advance(i, bytes, v, ({
 		int err = 0;
 		next = csum_and_copy_from_user(v.iov_base, 
@@ -693,6 +1058,10 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum,
 	__wsum sum, next;
 	size_t off = 0;
 	sum = *csum;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);	/* for now */
+		return 0;
+	}
 	iterate_and_advance(i, bytes, v, ({
 		int err = 0;
 		next = csum_and_copy_to_user((from += v.iov_len) - v.iov_len,
@@ -732,7 +1101,20 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 	if (!size)
 		return 0;
 
-	iterate_all_kinds(i, size, v, ({
+	if (unlikely(i->type & ITER_PIPE)) {
+		struct pipe_inode_info *pipe = i->pipe;
+		size_t off;
+		int idx;
+
+		if (!sanity(i))
+			return 0;
+
+		data_start(i, &idx, &off);
+		/* some of this one + all after this one */
+		npages = ((pipe->curbuf - idx - 1) & (pipe->buffers - 1)) + 1;
+		if (npages >= maxpages)
+			return maxpages;
+	} else iterate_all_kinds(i, size, v, ({
 		unsigned long p = (unsigned long)v.iov_base;
 		npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
 			- p / PAGE_SIZE;
@@ -757,6 +1139,10 @@ EXPORT_SYMBOL(iov_iter_npages);
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags)
 {
 	*new = *old;
+	if (unlikely(new->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return NULL;
+	}
 	if (new->type & ITER_BVEC)
 		return new->bvec = kmemdup(new->bvec,
 				    new->nr_segs * sizeof(struct bio_vec),
-- 
2.9.3



* [PATCH 11/11] switch generic_file_splice_read() to use of ->read_iter()
  2016-09-23 19:00                         ` [RFC][CFT] splice_read reworked Al Viro
                                             ` (9 preceding siblings ...)
  2016-09-23 19:09                           ` [PATCH 10/11] new iov_iter flavour: pipe-backed Al Viro
@ 2016-09-23 19:10                           ` Al Viro
  2016-09-30 13:32                           ` [RFC][CFT] splice_read reworked CAI Qian
  11 siblings, 0 replies; 104+ messages in thread
From: Al Viro @ 2016-09-23 19:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

... and kill the ->splice_read() instances that can be switched to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 drivers/staging/lustre/lustre/llite/file.c         |  70 ++----
 .../staging/lustre/lustre/llite/llite_internal.h   |  15 +-
 drivers/staging/lustre/lustre/llite/vvp_internal.h |  14 --
 drivers/staging/lustre/lustre/llite/vvp_io.c       |  45 +---
 fs/coda/file.c                                     |  23 +-
 fs/gfs2/file.c                                     |  28 +--
 fs/nfs/file.c                                      |  25 +--
 fs/nfs/internal.h                                  |   2 -
 fs/nfs/nfs4file.c                                  |   2 +-
 fs/ocfs2/file.c                                    |  34 +--
 fs/ocfs2/ocfs2_trace.h                             |   2 -
 fs/splice.c                                        | 238 +++------------------
 fs/xfs/xfs_file.c                                  |  41 +---
 fs/xfs/xfs_trace.h                                 |   1 -
 include/linux/fs.h                                 |   2 -
 mm/shmem.c                                         | 115 +---------
 16 files changed, 57 insertions(+), 600 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
index 57281b9..2567b09 100644
--- a/drivers/staging/lustre/lustre/llite/file.c
+++ b/drivers/staging/lustre/lustre/llite/file.c
@@ -1153,36 +1153,21 @@ restart:
 		int write_mutex_locked = 0;
 
 		vio->vui_fd  = LUSTRE_FPRIVATE(file);
-		vio->vui_io_subtype = args->via_io_subtype;
-
-		switch (vio->vui_io_subtype) {
-		case IO_NORMAL:
-			vio->vui_iter = args->u.normal.via_iter;
-			vio->vui_iocb = args->u.normal.via_iocb;
-			if ((iot == CIT_WRITE) &&
-			    !(vio->vui_fd->fd_flags & LL_FILE_GROUP_LOCKED)) {
-				if (mutex_lock_interruptible(&lli->
-							       lli_write_mutex)) {
-					result = -ERESTARTSYS;
-					goto out;
-				}
-				write_mutex_locked = 1;
+		vio->vui_iter = args->u.normal.via_iter;
+		vio->vui_iocb = args->u.normal.via_iocb;
+		if ((iot == CIT_WRITE) &&
+		    !(vio->vui_fd->fd_flags & LL_FILE_GROUP_LOCKED)) {
+			if (mutex_lock_interruptible(&lli->lli_write_mutex)) {
+				result = -ERESTARTSYS;
+				goto out;
 			}
-			down_read(&lli->lli_trunc_sem);
-			break;
-		case IO_SPLICE:
-			vio->u.splice.vui_pipe = args->u.splice.via_pipe;
-			vio->u.splice.vui_flags = args->u.splice.via_flags;
-			break;
-		default:
-			CERROR("Unknown IO type - %u\n", vio->vui_io_subtype);
-			LBUG();
+			write_mutex_locked = 1;
 		}
+		down_read(&lli->lli_trunc_sem);
 		ll_cl_add(file, env, io);
 		result = cl_io_loop(env, io);
 		ll_cl_remove(file, env);
-		if (args->via_io_subtype == IO_NORMAL)
-			up_read(&lli->lli_trunc_sem);
+		up_read(&lli->lli_trunc_sem);
 		if (write_mutex_locked)
 			mutex_unlock(&lli->lli_write_mutex);
 	} else {
@@ -1237,7 +1222,7 @@ static ssize_t ll_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (IS_ERR(env))
 		return PTR_ERR(env);
 
-	args = ll_env_args(env, IO_NORMAL);
+	args = ll_env_args(env);
 	args->u.normal.via_iter = to;
 	args->u.normal.via_iocb = iocb;
 
@@ -1261,7 +1246,7 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (IS_ERR(env))
 		return PTR_ERR(env);
 
-	args = ll_env_args(env, IO_NORMAL);
+	args = ll_env_args(env);
 	args->u.normal.via_iter = from;
 	args->u.normal.via_iocb = iocb;
 
@@ -1271,31 +1256,6 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	return result;
 }
 
-/*
- * Send file content (through pagecache) somewhere with helper
- */
-static ssize_t ll_file_splice_read(struct file *in_file, loff_t *ppos,
-				   struct pipe_inode_info *pipe, size_t count,
-				   unsigned int flags)
-{
-	struct lu_env      *env;
-	struct vvp_io_args *args;
-	ssize_t	     result;
-	int		 refcheck;
-
-	env = cl_env_get(&refcheck);
-	if (IS_ERR(env))
-		return PTR_ERR(env);
-
-	args = ll_env_args(env, IO_SPLICE);
-	args->u.splice.via_pipe = pipe;
-	args->u.splice.via_flags = flags;
-
-	result = ll_file_io_generic(env, args, in_file, CIT_READ, ppos, count);
-	cl_env_put(env, &refcheck);
-	return result;
-}
-
 static int ll_lov_recreate(struct inode *inode, struct ost_id *oi, u32 ost_idx)
 {
 	struct obd_export *exp = ll_i2dtexp(inode);
@@ -3173,7 +3133,7 @@ struct file_operations ll_file_operations = {
 	.release	= ll_file_release,
 	.mmap	   = ll_file_mmap,
 	.llseek	 = ll_file_seek,
-	.splice_read    = ll_file_splice_read,
+	.splice_read    = generic_file_splice_read,
 	.fsync	  = ll_fsync,
 	.flush	  = ll_flush
 };
@@ -3186,7 +3146,7 @@ struct file_operations ll_file_operations_flock = {
 	.release	= ll_file_release,
 	.mmap	   = ll_file_mmap,
 	.llseek	 = ll_file_seek,
-	.splice_read    = ll_file_splice_read,
+	.splice_read    = generic_file_splice_read,
 	.fsync	  = ll_fsync,
 	.flush	  = ll_flush,
 	.flock	  = ll_file_flock,
@@ -3202,7 +3162,7 @@ struct file_operations ll_file_operations_noflock = {
 	.release	= ll_file_release,
 	.mmap	   = ll_file_mmap,
 	.llseek	 = ll_file_seek,
-	.splice_read    = ll_file_splice_read,
+	.splice_read    = generic_file_splice_read,
 	.fsync	  = ll_fsync,
 	.flush	  = ll_flush,
 	.flock	  = ll_file_noflock,
diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
index 4d6d589..0e738c8 100644
--- a/drivers/staging/lustre/lustre/llite/llite_internal.h
+++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
@@ -800,17 +800,11 @@ void vvp_write_complete(struct vvp_object *club, struct vvp_page *page);
  */
 struct vvp_io_args {
 	/** normal/splice */
-	enum vvp_io_subtype via_io_subtype;
-
 	union {
 		struct {
 			struct kiocb      *via_iocb;
 			struct iov_iter   *via_iter;
 		} normal;
-		struct {
-			struct pipe_inode_info  *via_pipe;
-			unsigned int       via_flags;
-		} splice;
 	} u;
 };
 
@@ -838,14 +832,9 @@ static inline struct ll_thread_info *ll_env_info(const struct lu_env *env)
 	return lti;
 }
 
-static inline struct vvp_io_args *ll_env_args(const struct lu_env *env,
-					      enum vvp_io_subtype type)
+static inline struct vvp_io_args *ll_env_args(const struct lu_env *env)
 {
-	struct vvp_io_args *via = &ll_env_info(env)->lti_args;
-
-	via->via_io_subtype = type;
-
-	return via;
+	return &ll_env_info(env)->lti_args;
 }
 
 void ll_queue_done_writing(struct inode *inode, unsigned long flags);
diff --git a/drivers/staging/lustre/lustre/llite/vvp_internal.h b/drivers/staging/lustre/lustre/llite/vvp_internal.h
index 79fc428..2fa49cc 100644
--- a/drivers/staging/lustre/lustre/llite/vvp_internal.h
+++ b/drivers/staging/lustre/lustre/llite/vvp_internal.h
@@ -49,14 +49,6 @@ struct obd_device;
 struct obd_export;
 struct page;
 
-/* specific architecture can implement only part of this list */
-enum vvp_io_subtype {
-	/** normal IO */
-	IO_NORMAL,
-	/** io started from splice_{read|write} */
-	IO_SPLICE
-};
-
 /**
  * IO state private to IO state private to VVP layer.
  */
@@ -99,10 +91,6 @@ struct vvp_io {
 			bool		ft_flags_valid;
 		} fault;
 		struct {
-			struct pipe_inode_info	*vui_pipe;
-			unsigned int		 vui_flags;
-		} splice;
-		struct {
 			struct cl_page_list vui_queue;
 			unsigned long vui_written;
 			int vui_from;
@@ -110,8 +98,6 @@ struct vvp_io {
 		} write;
 	} u;
 
-	enum vvp_io_subtype	vui_io_subtype;
-
 	/**
 	 * Layout version when this IO is initialized
 	 */
diff --git a/drivers/staging/lustre/lustre/llite/vvp_io.c b/drivers/staging/lustre/lustre/llite/vvp_io.c
index 94916dc..4864600 100644
--- a/drivers/staging/lustre/lustre/llite/vvp_io.c
+++ b/drivers/staging/lustre/lustre/llite/vvp_io.c
@@ -55,18 +55,6 @@ static struct vvp_io *cl2vvp_io(const struct lu_env *env,
 }
 
 /**
- * True, if \a io is a normal io, False for splice_{read,write}
- */
-static int cl_is_normalio(const struct lu_env *env, const struct cl_io *io)
-{
-	struct vvp_io *vio = vvp_env_io(env);
-
-	LASSERT(io->ci_type == CIT_READ || io->ci_type == CIT_WRITE);
-
-	return vio->vui_io_subtype == IO_NORMAL;
-}
-
-/**
  * For swapping layout. The file's layout may have changed.
  * To avoid populating pages to a wrong stripe, we have to verify the
  * correctness of layout. It works because swapping layout processes
@@ -391,9 +379,6 @@ static int vvp_mmap_locks(const struct lu_env *env,
 
 	LASSERT(io->ci_type == CIT_READ || io->ci_type == CIT_WRITE);
 
-	if (!cl_is_normalio(env, io))
-		return 0;
-
 	if (!vio->vui_iter) /* nfs or loop back device write */
 		return 0;
 
@@ -462,15 +447,10 @@ static void vvp_io_advance(const struct lu_env *env,
 			   const struct cl_io_slice *ios,
 			   size_t nob)
 {
-	struct vvp_io    *vio = cl2vvp_io(env, ios);
-	struct cl_io     *io  = ios->cis_io;
 	struct cl_object *obj = ios->cis_io->ci_obj;
-
+	struct vvp_io	 *vio = cl2vvp_io(env, ios);
 	CLOBINVRNT(env, obj, vvp_object_invariant(obj));
 
-	if (!cl_is_normalio(env, io))
-		return;
-
 	iov_iter_reexpand(vio->vui_iter, vio->vui_tot_count  -= nob);
 }
 
@@ -479,7 +459,7 @@ static void vvp_io_update_iov(const struct lu_env *env,
 {
 	size_t size = io->u.ci_rw.crw_count;
 
-	if (!cl_is_normalio(env, io) || !vio->vui_iter)
+	if (!vio->vui_iter)
 		return;
 
 	iov_iter_truncate(vio->vui_iter, size);
@@ -716,25 +696,8 @@ static int vvp_io_read_start(const struct lu_env *env,
 
 	/* BUG: 5972 */
 	file_accessed(file);
-	switch (vio->vui_io_subtype) {
-	case IO_NORMAL:
-		LASSERT(vio->vui_iocb->ki_pos == pos);
-		result = generic_file_read_iter(vio->vui_iocb, vio->vui_iter);
-		break;
-	case IO_SPLICE:
-		result = generic_file_splice_read(file, &pos,
-						  vio->u.splice.vui_pipe, cnt,
-						  vio->u.splice.vui_flags);
-		/* LU-1109: do splice read stripe by stripe otherwise if it
-		 * may make nfsd stuck if this read occupied all internal pipe
-		 * buffers.
-		 */
-		io->ci_continue = 0;
-		break;
-	default:
-		CERROR("Wrong IO type %u\n", vio->vui_io_subtype);
-		LBUG();
-	}
+	LASSERT(vio->vui_iocb->ki_pos == pos);
+	result = generic_file_read_iter(vio->vui_iocb, vio->vui_iter);
 
 out:
 	if (result >= 0) {
diff --git a/fs/coda/file.c b/fs/coda/file.c
index f47c748..8415d4f 100644
--- a/fs/coda/file.c
+++ b/fs/coda/file.c
@@ -38,27 +38,6 @@ coda_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 }
 
 static ssize_t
-coda_file_splice_read(struct file *coda_file, loff_t *ppos,
-		      struct pipe_inode_info *pipe, size_t count,
-		      unsigned int flags)
-{
-	ssize_t (*splice_read)(struct file *, loff_t *,
-			       struct pipe_inode_info *, size_t, unsigned int);
-	struct coda_file_info *cfi;
-	struct file *host_file;
-
-	cfi = CODA_FTOC(coda_file);
-	BUG_ON(!cfi || cfi->cfi_magic != CODA_MAGIC);
-	host_file = cfi->cfi_container;
-
-	splice_read = host_file->f_op->splice_read;
-	if (!splice_read)
-		splice_read = default_file_splice_read;
-
-	return splice_read(host_file, ppos, pipe, count, flags);
-}
-
-static ssize_t
 coda_file_write_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct file *coda_file = iocb->ki_filp;
@@ -225,6 +204,6 @@ const struct file_operations coda_file_operations = {
 	.open		= coda_open,
 	.release	= coda_release,
 	.fsync		= coda_fsync,
-	.splice_read	= coda_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 };
 
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 320e65e..7016a6a7 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -954,30 +954,6 @@ out_uninit:
 	return ret;
 }
 
-static ssize_t gfs2_file_splice_read(struct file *in, loff_t *ppos,
-				     struct pipe_inode_info *pipe, size_t len,
-				     unsigned int flags)
-{
-	struct inode *inode = in->f_mapping->host;
-	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_holder gh;
-	int ret;
-
-	inode_lock(inode);
-
-	ret = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
-	if (ret) {
-		inode_unlock(inode);
-		return ret;
-	}
-
-	gfs2_glock_dq_uninit(&gh);
-	inode_unlock(inode);
-
-	return generic_file_splice_read(in, ppos, pipe, len, flags);
-}
-
-
 static ssize_t gfs2_file_splice_write(struct pipe_inode_info *pipe,
 				      struct file *out, loff_t *ppos,
 				      size_t len, unsigned int flags)
@@ -1140,7 +1116,7 @@ const struct file_operations gfs2_file_fops = {
 	.fsync		= gfs2_fsync,
 	.lock		= gfs2_lock,
 	.flock		= gfs2_flock,
-	.splice_read	= gfs2_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= gfs2_file_splice_write,
 	.setlease	= simple_nosetlease,
 	.fallocate	= gfs2_fallocate,
@@ -1168,7 +1144,7 @@ const struct file_operations gfs2_file_fops_nolock = {
 	.open		= gfs2_open,
 	.release	= gfs2_release,
 	.fsync		= gfs2_fsync,
-	.splice_read	= gfs2_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= gfs2_file_splice_write,
 	.setlease	= generic_setlease,
 	.fallocate	= gfs2_fallocate,
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 7d62097..5048585 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -182,29 +182,6 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
 }
 EXPORT_SYMBOL_GPL(nfs_file_read);
 
-ssize_t
-nfs_file_splice_read(struct file *filp, loff_t *ppos,
-		     struct pipe_inode_info *pipe, size_t count,
-		     unsigned int flags)
-{
-	struct inode *inode = file_inode(filp);
-	ssize_t res;
-
-	dprintk("NFS: splice_read(%pD2, %lu@%Lu)\n",
-		filp, (unsigned long) count, (unsigned long long) *ppos);
-
-	nfs_start_io_read(inode);
-	res = nfs_revalidate_mapping(inode, filp->f_mapping);
-	if (!res) {
-		res = generic_file_splice_read(filp, ppos, pipe, count, flags);
-		if (res > 0)
-			nfs_add_stats(inode, NFSIOS_NORMALREADBYTES, res);
-	}
-	nfs_end_io_read(inode);
-	return res;
-}
-EXPORT_SYMBOL_GPL(nfs_file_splice_read);
-
 int
 nfs_file_mmap(struct file * file, struct vm_area_struct * vma)
 {
@@ -868,7 +845,7 @@ const struct file_operations nfs_file_operations = {
 	.fsync		= nfs_file_fsync,
 	.lock		= nfs_lock,
 	.flock		= nfs_flock,
-	.splice_read	= nfs_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.check_flags	= nfs_check_flags,
 	.setlease	= simple_nosetlease,
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 74935a1..d7b062b 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -365,8 +365,6 @@ int nfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *)
 int nfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync);
 loff_t nfs_file_llseek(struct file *, loff_t, int);
 ssize_t nfs_file_read(struct kiocb *, struct iov_iter *);
-ssize_t nfs_file_splice_read(struct file *, loff_t *, struct pipe_inode_info *,
-			     size_t, unsigned int);
 int nfs_file_mmap(struct file *, struct vm_area_struct *);
 ssize_t nfs_file_write(struct kiocb *, struct iov_iter *);
 int nfs_file_release(struct inode *, struct file *);
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index d085ad7..89a7795 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -248,7 +248,7 @@ const struct file_operations nfs4_file_operations = {
 	.fsync		= nfs_file_fsync,
 	.lock		= nfs_lock,
 	.flock		= nfs_flock,
-	.splice_read	= nfs_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.check_flags	= nfs_check_flags,
 	.setlease	= simple_nosetlease,
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 4e7b0dc..6596e41 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2307,36 +2307,6 @@ out_mutex:
 	return ret;
 }
 
-static ssize_t ocfs2_file_splice_read(struct file *in,
-				      loff_t *ppos,
-				      struct pipe_inode_info *pipe,
-				      size_t len,
-				      unsigned int flags)
-{
-	int ret = 0, lock_level = 0;
-	struct inode *inode = file_inode(in);
-
-	trace_ocfs2_file_splice_read(inode, in, in->f_path.dentry,
-			(unsigned long long)OCFS2_I(inode)->ip_blkno,
-			in->f_path.dentry->d_name.len,
-			in->f_path.dentry->d_name.name, len);
-
-	/*
-	 * See the comment in ocfs2_file_read_iter()
-	 */
-	ret = ocfs2_inode_lock_atime(inode, in->f_path.mnt, &lock_level);
-	if (ret < 0) {
-		mlog_errno(ret);
-		goto bail;
-	}
-	ocfs2_inode_unlock(inode, lock_level);
-
-	ret = generic_file_splice_read(in, ppos, pipe, len, flags);
-
-bail:
-	return ret;
-}
-
 static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
 				   struct iov_iter *to)
 {
@@ -2495,7 +2465,7 @@ const struct file_operations ocfs2_fops = {
 #endif
 	.lock		= ocfs2_lock,
 	.flock		= ocfs2_flock,
-	.splice_read	= ocfs2_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
 };
@@ -2540,7 +2510,7 @@ const struct file_operations ocfs2_fops_no_plocks = {
 	.compat_ioctl   = ocfs2_compat_ioctl,
 #endif
 	.flock		= ocfs2_flock,
-	.splice_read	= ocfs2_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
 };
diff --git a/fs/ocfs2/ocfs2_trace.h b/fs/ocfs2/ocfs2_trace.h
index f8f5fc5..0b58abc 100644
--- a/fs/ocfs2/ocfs2_trace.h
+++ b/fs/ocfs2/ocfs2_trace.h
@@ -1314,8 +1314,6 @@ DEFINE_OCFS2_FILE_OPS(ocfs2_file_aio_write);
 
 DEFINE_OCFS2_FILE_OPS(ocfs2_file_splice_write);
 
-DEFINE_OCFS2_FILE_OPS(ocfs2_file_splice_read);
-
 DEFINE_OCFS2_FILE_OPS(ocfs2_file_aio_read);
 
 DEFINE_OCFS2_ULL_ULL_ULL_EVENT(ocfs2_truncate_file);
diff --git a/fs/splice.c b/fs/splice.c
index 0daa7d1..7b756d3 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -281,207 +281,6 @@ void splice_shrink_spd(struct splice_pipe_desc *spd)
 	kfree(spd->partial);
 }
 
-static int
-__generic_file_splice_read(struct file *in, loff_t *ppos,
-			   struct pipe_inode_info *pipe, size_t len,
-			   unsigned int flags)
-{
-	struct address_space *mapping = in->f_mapping;
-	unsigned int loff, nr_pages, req_pages;
-	struct page *pages[PIPE_DEF_BUFFERS];
-	struct partial_page partial[PIPE_DEF_BUFFERS];
-	struct page *page;
-	pgoff_t index, end_index;
-	loff_t isize;
-	int error, page_nr;
-	struct splice_pipe_desc spd = {
-		.pages = pages,
-		.partial = partial,
-		.nr_pages_max = PIPE_DEF_BUFFERS,
-		.flags = flags,
-		.ops = &page_cache_pipe_buf_ops,
-		.spd_release = spd_release_page,
-	};
-
-	if (splice_grow_spd(pipe, &spd))
-		return -ENOMEM;
-
-	index = *ppos >> PAGE_SHIFT;
-	loff = *ppos & ~PAGE_MASK;
-	req_pages = (len + loff + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	nr_pages = min(req_pages, spd.nr_pages_max);
-
-	/*
-	 * Lookup the (hopefully) full range of pages we need.
-	 */
-	spd.nr_pages = find_get_pages_contig(mapping, index, nr_pages, spd.pages);
-	index += spd.nr_pages;
-
-	/*
-	 * If find_get_pages_contig() returned fewer pages than we needed,
-	 * readahead/allocate the rest and fill in the holes.
-	 */
-	if (spd.nr_pages < nr_pages)
-		page_cache_sync_readahead(mapping, &in->f_ra, in,
-				index, req_pages - spd.nr_pages);
-
-	error = 0;
-	while (spd.nr_pages < nr_pages) {
-		/*
-		 * Page could be there, find_get_pages_contig() breaks on
-		 * the first hole.
-		 */
-		page = find_get_page(mapping, index);
-		if (!page) {
-			/*
-			 * page didn't exist, allocate one.
-			 */
-			page = page_cache_alloc_cold(mapping);
-			if (!page)
-				break;
-
-			error = add_to_page_cache_lru(page, mapping, index,
-				   mapping_gfp_constraint(mapping, GFP_KERNEL));
-			if (unlikely(error)) {
-				put_page(page);
-				if (error == -EEXIST)
-					continue;
-				break;
-			}
-			/*
-			 * add_to_page_cache() locks the page, unlock it
-			 * to avoid convoluting the logic below even more.
-			 */
-			unlock_page(page);
-		}
-
-		spd.pages[spd.nr_pages++] = page;
-		index++;
-	}
-
-	/*
-	 * Now loop over the map and see if we need to start IO on any
-	 * pages, fill in the partial map, etc.
-	 */
-	index = *ppos >> PAGE_SHIFT;
-	nr_pages = spd.nr_pages;
-	spd.nr_pages = 0;
-	for (page_nr = 0; page_nr < nr_pages; page_nr++) {
-		unsigned int this_len;
-
-		if (!len)
-			break;
-
-		/*
-		 * this_len is the max we'll use from this page
-		 */
-		this_len = min_t(unsigned long, len, PAGE_SIZE - loff);
-		page = spd.pages[page_nr];
-
-		if (PageReadahead(page))
-			page_cache_async_readahead(mapping, &in->f_ra, in,
-					page, index, req_pages - page_nr);
-
-		/*
-		 * If the page isn't uptodate, we may need to start io on it
-		 */
-		if (!PageUptodate(page)) {
-			lock_page(page);
-
-			/*
-			 * Page was truncated, or invalidated by the
-			 * filesystem.  Redo the find/create, but this time the
-			 * page is kept locked, so there's no chance of another
-			 * race with truncate/invalidate.
-			 */
-			if (!page->mapping) {
-				unlock_page(page);
-retry_lookup:
-				page = find_or_create_page(mapping, index,
-						mapping_gfp_mask(mapping));
-
-				if (!page) {
-					error = -ENOMEM;
-					break;
-				}
-				put_page(spd.pages[page_nr]);
-				spd.pages[page_nr] = page;
-			}
-			/*
-			 * page was already under io and is now done, great
-			 */
-			if (PageUptodate(page)) {
-				unlock_page(page);
-				goto fill_it;
-			}
-
-			/*
-			 * need to read in the page
-			 */
-			error = mapping->a_ops->readpage(in, page);
-			if (unlikely(error)) {
-				/*
-				 * Re-lookup the page
-				 */
-				if (error == AOP_TRUNCATED_PAGE)
-					goto retry_lookup;
-
-				break;
-			}
-		}
-fill_it:
-		/*
-		 * i_size must be checked after PageUptodate.
-		 */
-		isize = i_size_read(mapping->host);
-		end_index = (isize - 1) >> PAGE_SHIFT;
-		if (unlikely(!isize || index > end_index))
-			break;
-
-		/*
-		 * if this is the last page, see if we need to shrink
-		 * the length and stop
-		 */
-		if (end_index == index) {
-			unsigned int plen;
-
-			/*
-			 * max good bytes in this page
-			 */
-			plen = ((isize - 1) & ~PAGE_MASK) + 1;
-			if (plen <= loff)
-				break;
-
-			/*
-			 * force quit after adding this page
-			 */
-			this_len = min(this_len, plen - loff);
-			len = this_len;
-		}
-
-		spd.partial[page_nr].offset = loff;
-		spd.partial[page_nr].len = this_len;
-		len -= this_len;
-		loff = 0;
-		spd.nr_pages++;
-		index++;
-	}
-
-	/*
-	 * Release any pages at the end, if we quit early. 'page_nr' is how far
-	 * we got, 'nr_pages' is how many pages are in the map.
-	 */
-	while (page_nr < nr_pages)
-		put_page(spd.pages[page_nr++]);
-	in->f_ra.prev_pos = (loff_t)index << PAGE_SHIFT;
-
-	if (spd.nr_pages)
-		error = splice_to_pipe(pipe, &spd);
-
-	splice_shrink_spd(&spd);
-	return error;
-}
-
 /**
  * generic_file_splice_read - splice data from file to a pipe
  * @in:		file to splice from
@@ -492,19 +291,17 @@ fill_it:
  *
  * Description:
  *    Will read pages from given file and fill them into a pipe. Can be
- *    used as long as the address_space operations for the source implements
- *    a readpage() hook.
+ *    used as long as it has more or less sane ->read_iter().
  *
  */
 ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
 {
+	struct iov_iter to;
+	struct kiocb kiocb;
 	loff_t isize, left;
-	int ret;
-
-	if (IS_DAX(in->f_mapping->host))
-		return default_file_splice_read(in, ppos, pipe, len, flags);
+	int idx, ret;
 
 	isize = i_size_read(in->f_mapping->host);
 	if (unlikely(*ppos >= isize))
@@ -514,10 +311,30 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	if (unlikely(left < len))
 		len = left;
 
-	ret = __generic_file_splice_read(in, ppos, pipe, len, flags);
+	iov_iter_pipe(&to, ITER_PIPE | READ, pipe, len);
+	idx = to.idx;
+	init_sync_kiocb(&kiocb, in);
+	kiocb.ki_pos = *ppos;
+	ret = in->f_op->read_iter(&kiocb, &to);
 	if (ret > 0) {
-		*ppos += ret;
+		*ppos = kiocb.ki_pos;
 		file_accessed(in);
+	} else if (ret < 0) {
+		if (WARN_ON(to.idx != idx || to.iov_offset)) {
+			/*
+			 * a bogus ->read_iter() has copied something and still
+			 * returned an error instead of a short read.
+			 */
+			to.idx = idx;
+			to.iov_offset = 0;
+			iov_iter_advance(&to, 0); /* to free what was emitted */
+		}
+		/*
+		 * callers of ->splice_read() expect -EAGAIN on
+		 * "can't put anything in there", rather than -EFAULT.
+		 */
+		if (ret == -EFAULT)
+			ret = -EAGAIN;
 	}
 
 	return ret;
@@ -580,7 +397,7 @@ ssize_t kernel_write(struct file *file, const char *buf, size_t count,
 }
 EXPORT_SYMBOL(kernel_write);
 
-ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
+static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
 {
@@ -675,7 +492,6 @@ err:
 	res = error;
 	goto shrink_ret;
 }
-EXPORT_SYMBOL(default_file_splice_read);
 
 /*
  * Send 'sd->len' bytes to socket from 'sd->file' at position 'sd->pos'
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e612a02..92f16cf 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -399,45 +399,6 @@ xfs_file_read_iter(
 	return ret;
 }
 
-STATIC ssize_t
-xfs_file_splice_read(
-	struct file		*infilp,
-	loff_t			*ppos,
-	struct pipe_inode_info	*pipe,
-	size_t			count,
-	unsigned int		flags)
-{
-	struct xfs_inode	*ip = XFS_I(infilp->f_mapping->host);
-	ssize_t			ret;
-
-	XFS_STATS_INC(ip->i_mount, xs_read_calls);
-
-	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
-		return -EIO;
-
-	trace_xfs_file_splice_read(ip, count, *ppos);
-
-	/*
-	 * DAX inodes cannot ues the page cache for splice, so we have to push
-	 * them through the VFS IO path. This means it goes through
-	 * ->read_iter, which for us takes the XFS_IOLOCK_SHARED. Hence we
-	 * cannot lock the splice operation at this level for DAX inodes.
-	 */
-	if (IS_DAX(VFS_I(ip))) {
-		ret = default_file_splice_read(infilp, ppos, pipe, count,
-					       flags);
-		goto out;
-	}
-
-	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
-	ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
-	xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
-out:
-	if (ret > 0)
-		XFS_STATS_ADD(ip->i_mount, xs_read_bytes, ret);
-	return ret;
-}
-
 /*
  * Zero any on disk space between the current EOF and the new, larger EOF.
  *
@@ -1652,7 +1613,7 @@ const struct file_operations xfs_file_operations = {
 	.llseek		= xfs_file_llseek,
 	.read_iter	= xfs_file_read_iter,
 	.write_iter	= xfs_file_write_iter,
-	.splice_read	= xfs_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.unlocked_ioctl	= xfs_file_ioctl,
 #ifdef CONFIG_COMPAT
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index d303a66..f31db44 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1170,7 +1170,6 @@ DEFINE_RW_EVENT(xfs_file_dax_read);
 DEFINE_RW_EVENT(xfs_file_buffered_write);
 DEFINE_RW_EVENT(xfs_file_direct_write);
 DEFINE_RW_EVENT(xfs_file_dax_write);
-DEFINE_RW_EVENT(xfs_file_splice_read);
 
 DECLARE_EVENT_CLASS(xfs_page_class,
 	TP_PROTO(struct inode *inode, struct page *page, unsigned long off,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 901e25d..b04883e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2794,8 +2794,6 @@ extern void block_sync_page(struct page *page);
 /* fs/splice.c */
 extern ssize_t generic_file_splice_read(struct file *, loff_t *,
 		struct pipe_inode_info *, size_t, unsigned int);
-extern ssize_t default_file_splice_read(struct file *, loff_t *,
-		struct pipe_inode_info *, size_t, unsigned int);
 extern ssize_t iter_file_splice_write(struct pipe_inode_info *,
 		struct file *, loff_t *, size_t, unsigned int);
 extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
diff --git a/mm/shmem.c b/mm/shmem.c
index fd8b2b5..84d7077 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2310,119 +2310,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return retval ? retval : error;
 }
 
-static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
-				struct pipe_inode_info *pipe, size_t len,
-				unsigned int flags)
-{
-	struct address_space *mapping = in->f_mapping;
-	struct inode *inode = mapping->host;
-	unsigned int loff, nr_pages, req_pages;
-	struct page *pages[PIPE_DEF_BUFFERS];
-	struct partial_page partial[PIPE_DEF_BUFFERS];
-	struct page *page;
-	pgoff_t index, end_index;
-	loff_t isize, left;
-	int error, page_nr;
-	struct splice_pipe_desc spd = {
-		.pages = pages,
-		.partial = partial,
-		.nr_pages_max = PIPE_DEF_BUFFERS,
-		.flags = flags,
-		.ops = &page_cache_pipe_buf_ops,
-		.spd_release = spd_release_page,
-	};
-
-	isize = i_size_read(inode);
-	if (unlikely(*ppos >= isize))
-		return 0;
-
-	left = isize - *ppos;
-	if (unlikely(left < len))
-		len = left;
-
-	if (splice_grow_spd(pipe, &spd))
-		return -ENOMEM;
-
-	index = *ppos >> PAGE_SHIFT;
-	loff = *ppos & ~PAGE_MASK;
-	req_pages = (len + loff + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	nr_pages = min(req_pages, spd.nr_pages_max);
-
-	spd.nr_pages = find_get_pages_contig(mapping, index,
-						nr_pages, spd.pages);
-	index += spd.nr_pages;
-	error = 0;
-
-	while (spd.nr_pages < nr_pages) {
-		error = shmem_getpage(inode, index, &page, SGP_CACHE);
-		if (error)
-			break;
-		unlock_page(page);
-		spd.pages[spd.nr_pages++] = page;
-		index++;
-	}
-
-	index = *ppos >> PAGE_SHIFT;
-	nr_pages = spd.nr_pages;
-	spd.nr_pages = 0;
-
-	for (page_nr = 0; page_nr < nr_pages; page_nr++) {
-		unsigned int this_len;
-
-		if (!len)
-			break;
-
-		this_len = min_t(unsigned long, len, PAGE_SIZE - loff);
-		page = spd.pages[page_nr];
-
-		if (!PageUptodate(page) || page->mapping != mapping) {
-			error = shmem_getpage(inode, index, &page, SGP_CACHE);
-			if (error)
-				break;
-			unlock_page(page);
-			put_page(spd.pages[page_nr]);
-			spd.pages[page_nr] = page;
-		}
-
-		isize = i_size_read(inode);
-		end_index = (isize - 1) >> PAGE_SHIFT;
-		if (unlikely(!isize || index > end_index))
-			break;
-
-		if (end_index == index) {
-			unsigned int plen;
-
-			plen = ((isize - 1) & ~PAGE_MASK) + 1;
-			if (plen <= loff)
-				break;
-
-			this_len = min(this_len, plen - loff);
-			len = this_len;
-		}
-
-		spd.partial[page_nr].offset = loff;
-		spd.partial[page_nr].len = this_len;
-		len -= this_len;
-		loff = 0;
-		spd.nr_pages++;
-		index++;
-	}
-
-	while (page_nr < nr_pages)
-		put_page(spd.pages[page_nr++]);
-
-	if (spd.nr_pages)
-		error = splice_to_pipe(pipe, &spd);
-
-	splice_shrink_spd(&spd);
-
-	if (error > 0) {
-		*ppos += error;
-		file_accessed(in);
-	}
-	return error;
-}
-
 /*
  * llseek SEEK_DATA or SEEK_HOLE through the radix_tree.
  */
@@ -3785,7 +3672,7 @@ static const struct file_operations shmem_file_operations = {
 	.read_iter	= shmem_file_read_iter,
 	.write_iter	= generic_file_write_iter,
 	.fsync		= noop_fsync,
-	.splice_read	= shmem_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= shmem_fallocate,
 #endif
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-23 19:03                           ` [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe() Al Viro
@ 2016-09-23 19:45                             ` Linus Torvalds
  2016-09-23 20:10                               ` Al Viro
  0 siblings, 1 reply; 104+ messages in thread
From: Linus Torvalds @ 2016-09-23 19:45 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Fri, Sep 23, 2016 at 12:03 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> @@ -1421,8 +1406,25 @@ static long do_splice(struct file *in, loff_t __user *off_in,
> +               ret = 0;
> +               pipe_lock(opipe);
> +               bogus_count = opipe->buffers;
> +               do {
> +                       bogus_count += opipe->nrbufs;
> +                       ret = do_splice_to(in, &offset, opipe, len, flags);
> +                       if (ret > 0) {
> +                               total += ret;
> +                               len -= ret;
> +                       }
> +                       bogus_count -= opipe->nrbufs;
> +                       if (bogus_count <= 0)
> +                               break;

I was like "oh, I'm sure this is some temporary hack, it will be gone
by the end of the series".

It wasn't gone by the end.

There's two copies of that pattern, and at the very least it needs a
big comment about what this pattern does and why.

But other than that reaction, I didn't get any hives from this. I
didn't *test* it, only looking at patches, but no red flags I could
notice.

               Linus

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-23 19:45                             ` Linus Torvalds
@ 2016-09-23 20:10                               ` Al Viro
  2016-09-23 20:36                                 ` Linus Torvalds
  0 siblings, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-09-23 20:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Fri, Sep 23, 2016 at 12:45:53PM -0700, Linus Torvalds wrote:

> I was like "oh, I'm sure this is some temporary hack, it will be gone
> by the end of the series".
> 
> It wasn't gone by the end.
> 
> There's two copies of that pattern, and at the very least it needs a
> big comment about what this pattern does and why.

The thing is, I'm not sure what to do with it; it was brought by the LTP
vmsplice test, which asks to feed 128Kb into a pipe.  With the caller
itself on the other end of that pipe, SPLICE_F_NONBLOCK *not* given and
the pipe capacity being 64Kb.  Unfortunately, "quietly truncate the
length down to 64Kb" does *not* suffice - the damn thing starts not at
the page boundary, so we only copy about 62Kb until hitting the pipe
overflow (the pipe is initially empty).  The reason why it doesn't go
to sleep indefinitely on the mainline kernel is that mainline collects
up to pipe->buffers *pages*, before feeding them into the pipe.  And these
~62Kb are just that.  Note that had there been anything already in the
pipe, the same call would've gone to sleep (and in the end transferred the
same ~62Kb worth of data).
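
To make that arithmetic concrete, here is a minimal userland sketch (not
kernel code).  It assumes 4Kb pages, the default 16 pipe buffers and one
page (or part of one) per buffer; bytes_that_fit(), PAGE_SZ and PIPE_BUFS
are names made up for the illustration, and the 2Kb starting offset is an
assumed value, not something taken from the LTP test.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ   4096u	/* assumed page size */
#define PIPE_BUFS 16u	/* assumed default number of pipe buffers */

/*
 * How many bytes end up in the pipe if every buffer holds (part of) one
 * page and the source range starts start_off bytes into its first page.
 */
static size_t bytes_that_fit(size_t start_off, size_t len, unsigned int nbufs)
{
	size_t total = 0;
	size_t off = start_off;

	assert(start_off < PAGE_SZ);

	while (nbufs-- && len) {
		size_t chunk = PAGE_SZ - off;	/* rest of the current page */

		if (chunk > len)
			chunk = len;
		total += chunk;
		len -= chunk;
		off = 0;	/* all later pages are used from their start */
	}
	return total;
}
```

Under those assumptions bytes_that_fit(0, 128 * 1024, PIPE_BUFS) gives 65536,
while bytes_that_fit(2048, 128 * 1024, PIPE_BUFS) gives 63488, i.e. 62Kb:
the first buffer is only half-used, and the pipe still runs out of slots
after 16 pages.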

All of that is completely undocumented in vmsplice(2) (or anywhere else that
I'd been able to find) ;-/

OTOH, considering the quality of documentation, I'm somewhat tempted to go
for "sleep only if it had been completely full when we entered; once there's
some space feed as much as fits and be done with that".  OTTH, I'm not sure
that no userland cr^Hode will manage to be hurt by that variant...

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-23 20:10                               ` Al Viro
@ 2016-09-23 20:36                                 ` Linus Torvalds
  2016-09-24  3:59                                   ` Al Viro
                                                     ` (5 more replies)
  0 siblings, 6 replies; 104+ messages in thread
From: Linus Torvalds @ 2016-09-23 20:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Fri, Sep 23, 2016 at 1:10 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> OTOH, considering the quality of documentation, I'm somewhat tempted to go
> for "sleep only if it had been completely full when we entered; once there's
> some space feed as much as fits and be done with that".  OTTH, I'm not sure
> that no userland cr^Hode will manage to be hurt by that variant...

Let's just try it.

If that then doesn't work, we can introduce your odd code (with a
*big* comment). Ok?

               Linus

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-23 20:36                                 ` Linus Torvalds
@ 2016-09-24  3:59                                   ` Al Viro
  2016-09-24 17:29                                     ` Al Viro
  2016-09-24  3:59                                   ` [PATCH 04/12] " Al Viro
                                                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-09-24  3:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Fri, Sep 23, 2016 at 01:36:12PM -0700, Linus Torvalds wrote:
> On Fri, Sep 23, 2016 at 1:10 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > OTOH, considering the quality of documentation, I'm somewhat tempted to go
> > for "sleep only if it had been completely full when we entered; once there's
> > some space feed as much as fits and be done with that".  OTTH, I'm not sure
> > that no userland cr^Hode will manage to be hurt by that variant...
> 
> Let's just try it.
> 
> If that then doesn't work, we can introduce your odd code (with a
> *big* comment). Ok?

	FWIW, updated (with fixes) and force-pushed.  Added piece:
default_file_splice_read() converted to iov_iter.  Seems to work, after
fixing a braino in __pipe_get_pages().  Changed: #4 (sleep only in the
beginning, as described above), #6 (context changes from #4), #10 (missing
get_page() added in __pipe_get_pages()), #11 (removed pointless truncation
of len - ->read_iter() can bloody well handle that on its own) and added #12.
Stands at 28 files changed, 657 insertions(+), 1009 deletions(-) now...

^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-23 20:36                                 ` Linus Torvalds
  2016-09-24  3:59                                   ` Al Viro
@ 2016-09-24  3:59                                   ` Al Viro
  2016-09-26 13:35                                     ` Miklos Szeredi
  2016-12-17 19:54                                     ` Andreas Schwab
  2016-09-24  4:00                                   ` [PATCH 06/12] new helper: add_to_pipe() Al Viro
                                                     ` (3 subsequent siblings)
  5 siblings, 2 replies; 104+ messages in thread
From: Al Viro @ 2016-09-24  3:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

* splice_to_pipe() stops at pipe overflow and does *not* take pipe_lock
* ->splice_read() instances do the same
* vmsplice_to_pipe() and do_splice() (ultimate callers of splice_to_pipe())
  arrange for waiting, looping, etc. themselves.

That should make pipe_lock the outermost one.

Unfortunately, existing rules for the amount passed by vmsplice_to_pipe()
and do_splice() are quite ugly _and_ userland code can be easily broken
by changing those.  It's not even "no more than the maximal capacity of
this pipe" - it's "once we'd fed pipe->buffers pages into the pipe,
leave instead of waiting".

Considering how poorly these rules are documented, let's try "wait for some
space to appear, unless given SPLICE_F_NONBLOCK, then push into pipe
and if we run into overflow, we are done".
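
The intended rule can be modelled in userland roughly as follows.  This is
only a sketch under stated assumptions: struct model_pipe, model_wait_for_space()
and model_push() are hypothetical names, the ring is fixed at 16 slots, and
the "wait" step only models the SPLICE_F_NONBLOCK case (the real helper sleeps
in pipe_wait() otherwise).  The slot arithmetic mirrors the
(curbuf + nrbufs) & (buffers - 1) computation in splice_to_pipe().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_BUFS 16	/* assumed ring size; must be a power of two */

struct model_pipe {
	unsigned int curbuf;	/* index of the oldest occupied slot */
	unsigned int nrbufs;	/* number of occupied slots */
	unsigned int buffers;	/* ring size in use, <= MODEL_BUFS */
	size_t len[MODEL_BUFS];	/* payload length stored in each slot */
};

/* Caller-side step: refuse (instead of sleeping) if the ring is full. */
static int model_wait_for_space(const struct model_pipe *p)
{
	return p->nrbufs == p->buffers ? -EAGAIN : 0;
}

/* Push chunks until we run out of chunks or of slots; never waits. */
static long model_push(struct model_pipe *p, const size_t *lens, unsigned int n)
{
	long ret = 0;
	unsigned int i = 0;

	assert(p->buffers && p->buffers <= MODEL_BUFS);
	assert((p->buffers & (p->buffers - 1)) == 0);

	if (!n)
		return 0;

	while (i < n && p->nrbufs < p->buffers) {
		/* next free slot, wrapping around the ring */
		unsigned int slot = (p->curbuf + p->nrbufs) & (p->buffers - 1);

		p->len[slot] = lens[i];
		p->nrbufs++;
		ret += (long)lens[i++];
	}
	return ret ? ret : -EAGAIN;	/* nothing fit: report overflow */
}
```

With an empty 16-slot ring and 20 page-sized chunks, model_push() stores 16 of
them and returns a short count; a second call on the now-full ring returns
-EAGAIN, which is the point where the caller would either sleep or bail out.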

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/fuse/dev.c |   2 -
 fs/splice.c   | 138 +++++++++++++++++++++++++++-------------------------------
 2 files changed, 63 insertions(+), 77 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index a94d2ed..eaf56c6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1364,7 +1364,6 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 		goto out;
 
 	ret = 0;
-	pipe_lock(pipe);
 
 	if (!pipe->readers) {
 		send_sig(SIGPIPE, current, 0);
@@ -1400,7 +1399,6 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 	}
 
 out_unlock:
-	pipe_unlock(pipe);
 
 	if (do_wakeup) {
 		smp_mb();
diff --git a/fs/splice.c b/fs/splice.c
index 31c52e0..02daa61 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -183,79 +183,41 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		       struct splice_pipe_desc *spd)
 {
 	unsigned int spd_pages = spd->nr_pages;
-	int ret, do_wakeup, page_nr;
+	int ret = 0, page_nr = 0;
 
 	if (!spd_pages)
 		return 0;
 
-	ret = 0;
-	do_wakeup = 0;
-	page_nr = 0;
-
-	pipe_lock(pipe);
-
-	for (;;) {
-		if (!pipe->readers) {
-			send_sig(SIGPIPE, current, 0);
-			if (!ret)
-				ret = -EPIPE;
-			break;
-		}
-
-		if (pipe->nrbufs < pipe->buffers) {
-			int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
-			struct pipe_buffer *buf = pipe->bufs + newbuf;
-
-			buf->page = spd->pages[page_nr];
-			buf->offset = spd->partial[page_nr].offset;
-			buf->len = spd->partial[page_nr].len;
-			buf->private = spd->partial[page_nr].private;
-			buf->ops = spd->ops;
-			if (spd->flags & SPLICE_F_GIFT)
-				buf->flags |= PIPE_BUF_FLAG_GIFT;
-
-			pipe->nrbufs++;
-			page_nr++;
-			ret += buf->len;
-
-			if (pipe->files)
-				do_wakeup = 1;
+	if (unlikely(!pipe->readers)) {
+		send_sig(SIGPIPE, current, 0);
+		ret = -EPIPE;
+		goto out;
+	}
 
-			if (!--spd->nr_pages)
-				break;
-			if (pipe->nrbufs < pipe->buffers)
-				continue;
+	while (pipe->nrbufs < pipe->buffers) {
+		int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+		struct pipe_buffer *buf = pipe->bufs + newbuf;
 
-			break;
-		}
+		buf->page = spd->pages[page_nr];
+		buf->offset = spd->partial[page_nr].offset;
+		buf->len = spd->partial[page_nr].len;
+		buf->private = spd->partial[page_nr].private;
+		buf->ops = spd->ops;
+		if (spd->flags & SPLICE_F_GIFT)
+			buf->flags |= PIPE_BUF_FLAG_GIFT;
 
-		if (spd->flags & SPLICE_F_NONBLOCK) {
-			if (!ret)
-				ret = -EAGAIN;
-			break;
-		}
+		pipe->nrbufs++;
+		page_nr++;
+		ret += buf->len;
 
-		if (signal_pending(current)) {
-			if (!ret)
-				ret = -ERESTARTSYS;
+		if (!--spd->nr_pages)
 			break;
-		}
-
-		if (do_wakeup) {
-			wakeup_pipe_readers(pipe);
-			do_wakeup = 0;
-		}
-
-		pipe->waiting_writers++;
-		pipe_wait(pipe);
-		pipe->waiting_writers--;
 	}
 
-	pipe_unlock(pipe);
-
-	if (do_wakeup)
-		wakeup_pipe_readers(pipe);
+	if (!ret)
+		ret = -EAGAIN;
 
+out:
 	while (page_nr < spd_pages)
 		spd->spd_release(spd, page_nr++);
 
@@ -1339,6 +1301,20 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
 }
 EXPORT_SYMBOL(do_splice_direct);
 
+static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
+{
+	while (pipe->nrbufs == pipe->buffers) {
+		if (flags & SPLICE_F_NONBLOCK)
+			return -EAGAIN;
+		if (signal_pending(current))
+			return -ERESTARTSYS;
+		pipe->waiting_writers++;
+		pipe_wait(pipe);
+		pipe->waiting_writers--;
+	}
+	return 0;
+}
+
 static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			       struct pipe_inode_info *opipe,
 			       size_t len, unsigned int flags);
@@ -1421,8 +1397,13 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 			offset = in->f_pos;
 		}
 
-		ret = do_splice_to(in, &offset, opipe, len, flags);
-
+		pipe_lock(opipe);
+		ret = wait_for_space(opipe, flags);
+		if (!ret)
+			ret = do_splice_to(in, &offset, opipe, len, flags);
+		pipe_unlock(opipe);
+		if (ret > 0)
+			wakeup_pipe_readers(opipe);
 		if (!off_in)
 			in->f_pos = offset;
 		else if (copy_to_user(off_in, &offset, sizeof(loff_t)))
@@ -1434,22 +1415,23 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 	return -EINVAL;
 }
 
-static int get_iovec_page_array(struct iov_iter *from,
+static int get_iovec_page_array(const struct iov_iter *from,
 				struct page **pages,
 				struct partial_page *partial,
 				unsigned int pipe_buffers)
 {
+	struct iov_iter i = *from;
 	int buffers = 0;
-	while (iov_iter_count(from)) {
+	while (iov_iter_count(&i)) {
 		ssize_t copied;
 		size_t start;
 
-		copied = iov_iter_get_pages(from, pages + buffers, ~0UL,
+		copied = iov_iter_get_pages(&i, pages + buffers, ~0UL,
 					pipe_buffers - buffers, &start);
 		if (copied <= 0)
 			return buffers ? buffers : copied;
 
-		iov_iter_advance(from, copied);
+		iov_iter_advance(&i, copied);
 		while (copied) {
 			int size = min_t(int, copied, PAGE_SIZE - start);
 			partial[buffers].offset = start;
@@ -1546,14 +1528,20 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 		return -ENOMEM;
 	}
 
-	spd.nr_pages = get_iovec_page_array(&from, spd.pages,
-					    spd.partial,
-					    spd.nr_pages_max);
-	if (spd.nr_pages <= 0)
-		ret = spd.nr_pages;
-	else
-		ret = splice_to_pipe(pipe, &spd);
-
+	pipe_lock(pipe);
+	ret = wait_for_space(pipe, flags);
+	if (!ret) {
+		spd.nr_pages = get_iovec_page_array(&from, spd.pages,
+						    spd.partial,
+						    spd.nr_pages_max);
+		if (spd.nr_pages <= 0)
+			ret = spd.nr_pages;
+		else
+			ret = splice_to_pipe(pipe, &spd);
+		pipe_unlock(pipe);
+		if (ret > 0)
+			wakeup_pipe_readers(pipe);
+	}
 	splice_shrink_spd(&spd);
 	kfree(iov);
 	return ret;
-- 
2.9.3



* [PATCH 06/12] new helper: add_to_pipe()
  2016-09-23 20:36                                 ` Linus Torvalds
  2016-09-24  3:59                                   ` Al Viro
  2016-09-24  3:59                                   ` [PATCH 04/12] " Al Viro
@ 2016-09-24  4:00                                   ` Al Viro
  2016-09-26 13:49                                     ` Miklos Szeredi
  2016-09-24  4:01                                   ` [PATCH 10/12] new iov_iter flavour: pipe-backed Al Viro
                                                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-09-24  4:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

Single-buffer analogue of splice_to_pipe().  vmsplice_to_pipe() is switched
to it, leaving splice_to_pipe() only for ->splice_read() instances
(and that only until they are converted as well).
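
To make the slot arithmetic concrete, here is a rough userspace model of
what a single-buffer add does with the ring.  It is only a sketch:
struct model_pipe, struct model_buf, RING_SLOTS and model_add_to_pipe()
are made-up names, not kernel API, and the readers check, SIGPIPE and
the ->release() call on failure are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define RING_SLOTS 4	/* must be a power of two, like pipe->buffers */

struct model_buf {
	size_t offset;
	size_t len;
};

struct model_pipe {
	unsigned int curbuf;	/* index of the oldest occupied slot */
	unsigned int nrbufs;	/* number of occupied slots */
	struct model_buf bufs[RING_SLOTS];
};

/* Returns buf->len on success, -EAGAIN if every slot is occupied. */
static long model_add_to_pipe(struct model_pipe *pipe,
			      const struct model_buf *buf)
{
	unsigned int slot;

	assert(pipe->nrbufs <= RING_SLOTS);
	if (pipe->nrbufs == RING_SLOTS)
		return -EAGAIN;
	/* first free slot; the mask takes care of wraparound */
	slot = (pipe->curbuf + pipe->nrbufs) & (RING_SLOTS - 1);
	pipe->bufs[slot] = *buf;
	pipe->nrbufs++;
	return (long)buf->len;
}
```

The mask only works because the slot count is a power of two; the model
assumes the caller already holds whatever lock protects the ring.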

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/splice.c            | 113 ++++++++++++++++++++++++++++---------------------
 include/linux/splice.h |   2 +
 2 files changed, 67 insertions(+), 48 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 02daa61..e13d935 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -203,8 +203,6 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		buf->len = spd->partial[page_nr].len;
 		buf->private = spd->partial[page_nr].private;
 		buf->ops = spd->ops;
-		if (spd->flags & SPLICE_F_GIFT)
-			buf->flags |= PIPE_BUF_FLAG_GIFT;
 
 		pipe->nrbufs++;
 		page_nr++;
@@ -225,6 +223,27 @@ out:
 }
 EXPORT_SYMBOL_GPL(splice_to_pipe);
 
+ssize_t add_to_pipe(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
+{
+	int ret;
+
+	if (unlikely(!pipe->readers)) {
+		send_sig(SIGPIPE, current, 0);
+		ret = -EPIPE;
+	} else if (pipe->nrbufs == pipe->buffers) {
+		ret = -EAGAIN;
+	} else {
+		int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+		pipe->bufs[newbuf] = *buf;
+		pipe->nrbufs++;
+		return buf->len;
+	}
+	buf->ops->release(pipe, buf);
+	buf->ops = NULL;
+	return ret;
+}
+EXPORT_SYMBOL(add_to_pipe);
+
 void spd_release_page(struct splice_pipe_desc *spd, unsigned int i)
 {
 	put_page(spd->pages[i]);
@@ -1415,33 +1434,50 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 	return -EINVAL;
 }
 
-static int get_iovec_page_array(const struct iov_iter *from,
-				struct page **pages,
-				struct partial_page *partial,
-				unsigned int pipe_buffers)
+static int iter_to_pipe(struct iov_iter *from,
+			struct pipe_inode_info *pipe,
+			unsigned flags)
 {
-	struct iov_iter i = *from;
-	int buffers = 0;
-	while (iov_iter_count(&i)) {
+	struct pipe_buffer buf = {
+		.ops = &user_page_pipe_buf_ops,
+		.flags = flags
+	};
+	size_t total = 0;
+	int ret = 0;
+	bool failed = false;
+
+	while (iov_iter_count(from) && !failed) {
+		struct page *pages[16];
 		ssize_t copied;
 		size_t start;
+		int n;
 
-		copied = iov_iter_get_pages(&i, pages + buffers, ~0UL,
-					pipe_buffers - buffers, &start);
-		if (copied <= 0)
-			return buffers ? buffers : copied;
+		copied = iov_iter_get_pages(from, pages, ~0UL, 16, &start);
+		if (copied <= 0) {
+			ret = copied;
+			break;
+		}
 
-		iov_iter_advance(&i, copied);
-		while (copied) {
+		for (n = 0; copied; n++, start = 0) {
 			int size = min_t(int, copied, PAGE_SIZE - start);
-			partial[buffers].offset = start;
-			partial[buffers].len = size;
+			if (!failed) {
+				buf.page = pages[n];
+				buf.offset = start;
+				buf.len = size;
+				ret = add_to_pipe(pipe, &buf);
+				if (unlikely(ret < 0)) {
+					failed = true;
+				} else {
+					iov_iter_advance(from, ret);
+					total += ret;
+				}
+			} else {
+				put_page(pages[n]);
+			}
 			copied -= size;
-			start = 0;
-			buffers++;
 		}
 	}
-	return buffers;
+	return total ? total : ret;
 }
 
 static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
@@ -1502,17 +1538,11 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 	struct iovec iovstack[UIO_FASTIOV];
 	struct iovec *iov = iovstack;
 	struct iov_iter from;
-	struct page *pages[PIPE_DEF_BUFFERS];
-	struct partial_page partial[PIPE_DEF_BUFFERS];
-	struct splice_pipe_desc spd = {
-		.pages = pages,
-		.partial = partial,
-		.nr_pages_max = PIPE_DEF_BUFFERS,
-		.flags = flags,
-		.ops = &user_page_pipe_buf_ops,
-		.spd_release = spd_release_page,
-	};
 	long ret;
+	unsigned buf_flag = 0;
+
+	if (flags & SPLICE_F_GIFT)
+		buf_flag = PIPE_BUF_FLAG_GIFT;
 
 	pipe = get_pipe_info(file);
 	if (!pipe)
@@ -1523,26 +1553,13 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 	if (ret < 0)
 		return ret;
 
-	if (splice_grow_spd(pipe, &spd)) {
-		kfree(iov);
-		return -ENOMEM;
-	}
-
 	pipe_lock(pipe);
 	ret = wait_for_space(pipe, flags);
-	if (!ret) {
-		spd.nr_pages = get_iovec_page_array(&from, spd.pages,
-						    spd.partial,
-						    spd.nr_pages_max);
-		if (spd.nr_pages <= 0)
-			ret = spd.nr_pages;
-		else
-			ret = splice_to_pipe(pipe, &spd);
-		pipe_unlock(pipe);
-		if (ret > 0)
-			wakeup_pipe_readers(pipe);
-	}
-	splice_shrink_spd(&spd);
+	if (!ret)
+		ret = iter_to_pipe(&from, pipe, buf_flag);
+	pipe_unlock(pipe);
+	if (ret > 0)
+		wakeup_pipe_readers(pipe);
 	kfree(iov);
 	return ret;
 }
diff --git a/include/linux/splice.h b/include/linux/splice.h
index da2751d..58b300f 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -72,6 +72,8 @@ extern ssize_t __splice_from_pipe(struct pipe_inode_info *,
 				  struct splice_desc *, splice_actor *);
 extern ssize_t splice_to_pipe(struct pipe_inode_info *,
 			      struct splice_pipe_desc *);
+extern ssize_t add_to_pipe(struct pipe_inode_info *,
+			      struct pipe_buffer *);
 extern ssize_t splice_direct_to_actor(struct file *, struct splice_desc *,
 				      splice_direct_actor *);
 
-- 
2.9.3



* [PATCH 10/12] new iov_iter flavour: pipe-backed
  2016-09-23 20:36                                 ` Linus Torvalds
                                                     ` (2 preceding siblings ...)
  2016-09-24  4:00                                   ` [PATCH 06/12] new helper: add_to_pipe() Al Viro
@ 2016-09-24  4:01                                   ` Al Viro
  2016-09-29 20:53                                     ` Miklos Szeredi
  2016-09-24  4:01                                   ` [PATCH 11/12] switch generic_file_splice_read() to use of ->read_iter() Al Viro
  2016-09-24  4:02                                   ` [PATCH 12/12] switch default_file_splice_read() to use of pipe-backed iov_iter Al Viro
  5 siblings, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-09-24  4:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

iov_iter variant for passing data into a pipe.  copy_to_iter()
copies data into page(s) it has allocated and stuffs them into
the pipe; copy_page_to_iter() stuffs a reference to the page
given to it into the pipe.  Both will try to coalesce if possible.
iov_iter_zero() is similar to copy_to_iter(); iov_iter_get_pages()
and friends will do as copy_to_iter() would have and return the
pages where the data would've been copied.  iov_iter_advance()
will truncate everything past the spot it has advanced to.
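
The coalescing rule for page references can be modelled in userspace
roughly as follows.  This is a sketch under simplifying assumptions: a
flat array stands in for the pipe ring, an int stands in for
struct page *, and struct model_ref, struct model_refs, MAX_REFS and
model_add_page_ref() are hypothetical names, not anything in the tree.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REFS 4

struct model_ref {
	int page_id;	/* stand-in for struct page * */
	size_t offset;	/* where the data starts within the page */
	size_t len;	/* how many bytes of the page are referenced */
};

struct model_refs {
	struct model_ref refs[MAX_REFS];
	unsigned int used;
};

/* Returns the number of bytes accepted; 0 means there was no room. */
static size_t model_add_page_ref(struct model_refs *r, int page_id,
				 size_t offset, size_t bytes)
{
	struct model_ref *ref;

	assert(r->used <= MAX_REFS);
	if (!bytes)
		return 0;
	if (r->used) {
		ref = &r->refs[r->used - 1];
		/* same page, starting right where the last one ended? */
		if (ref->page_id == page_id &&
		    ref->offset + ref->len == offset) {
			ref->len += bytes;
			return bytes;
		}
	}
	if (r->used == MAX_REFS)
		return 0;
	ref = &r->refs[r->used++];
	ref->page_id = page_id;
	ref->offset = offset;
	ref->len = bytes;
	return bytes;
}
```

Note that a merge can still succeed when every slot is taken, since it
only extends the last entry; a fresh reference in that state is refused.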

New primitive: iov_iter_pipe(), used for initializing those.
The pipe should stay locked for as long as the iterator is in use.

Running out of space acts as a fault would for iovec-backed ones;
in other words, giving it to ->read_iter() may result in a short
read if the pipe overflows, or -EFAULT if it happens with nothing
copied there.
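
As a toy illustration of that rule (a userspace sketch only;
model_read_result() and its 'room' parameter are made up for the
example and do not correspond to any kernel function):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical model: 'wanted' bytes are requested and the destination
 * can take at most 'room' bytes before it runs out of space.
 */
static long model_read_result(size_t wanted, size_t room)
{
	size_t copied = wanted < room ? wanted : room;

	assert(copied <= wanted);
	if (!wanted)
		return 0;		/* nothing was asked for */
	if (!copied)
		return -EFAULT;		/* ran out with nothing copied */
	return (long)copied;		/* full or short read */
}
```

A request for 8192 bytes with room for 4096 comes back as a short read
of 4096, while the same request with no room at all comes back as
-EFAULT.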

So ->read_iter() on those acts pretty much like ->splice_read().
Moreover, all generic_file_splice_read() users, as well as many
other ->splice_read() instances, can be switched to that scheme;
that'll happen in the next commit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/splice.c            |   2 +-
 include/linux/splice.h |   1 +
 include/linux/uio.h    |  14 +-
 lib/iov_iter.c         | 390 ++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 401 insertions(+), 6 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index e13d935..589a1d5 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -524,7 +524,7 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 }
 EXPORT_SYMBOL(generic_file_splice_read);
 
-static const struct pipe_buf_operations default_pipe_buf_ops = {
+const struct pipe_buf_operations default_pipe_buf_ops = {
 	.can_merge = 0,
 	.confirm = generic_pipe_buf_confirm,
 	.release = generic_pipe_buf_release,
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 58b300f..00a2116 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -85,4 +85,5 @@ extern void splice_shrink_spd(struct splice_pipe_desc *);
 extern void spd_release_page(struct splice_pipe_desc *, unsigned int);
 
 extern const struct pipe_buf_operations page_cache_pipe_buf_ops;
+extern const struct pipe_buf_operations default_pipe_buf_ops;
 #endif
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 1b5d1cd..c4fe1ab 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -13,6 +13,7 @@
 #include <uapi/linux/uio.h>
 
 struct page;
+struct pipe_inode_info;
 
 struct kvec {
 	void *iov_base; /* and that should *never* hold a userland pointer */
@@ -23,6 +24,7 @@ enum {
 	ITER_IOVEC = 0,
 	ITER_KVEC = 2,
 	ITER_BVEC = 4,
+	ITER_PIPE = 8,
 };
 
 struct iov_iter {
@@ -33,8 +35,12 @@ struct iov_iter {
 		const struct iovec *iov;
 		const struct kvec *kvec;
 		const struct bio_vec *bvec;
+		struct pipe_inode_info *pipe;
+	};
+	union {
+		unsigned long nr_segs;
+		int idx;
 	};
-	unsigned long nr_segs;
 };
 
 /*
@@ -64,7 +70,7 @@ static inline struct iovec iov_iter_iovec(const struct iov_iter *iter)
 }
 
 #define iov_for_each(iov, iter, start)				\
-	if (!((start).type & ITER_BVEC))			\
+	if (!((start).type & (ITER_BVEC | ITER_PIPE)))		\
 	for (iter = (start);					\
 	     (iter).count &&					\
 	     ((iov = iov_iter_iovec(&(iter))), 1);		\
@@ -94,6 +100,8 @@ void iov_iter_kvec(struct iov_iter *i, int direction, const struct kvec *kvec,
 			unsigned long nr_segs, size_t count);
 void iov_iter_bvec(struct iov_iter *i, int direction, const struct bio_vec *bvec,
 			unsigned long nr_segs, size_t count);
+void iov_iter_pipe(struct iov_iter *i, int direction, struct pipe_inode_info *pipe,
+			size_t count);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
@@ -109,7 +117,7 @@ static inline size_t iov_iter_count(struct iov_iter *i)
 
 static inline bool iter_is_iovec(struct iov_iter *i)
 {
-	return !(i->type & (ITER_BVEC | ITER_KVEC));
+	return !(i->type & (ITER_BVEC | ITER_KVEC | ITER_PIPE));
 }
 
 /*
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 9e8c738..405fdd6 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -3,8 +3,11 @@
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
+#include <linux/splice.h>
 #include <net/checksum.h>
 
+#define PIPE_PARANOIA /* for now */
+
 #define iterate_iovec(i, n, __v, __p, skip, STEP) {	\
 	size_t left;					\
 	size_t wanted = n;				\
@@ -290,6 +293,82 @@ done:
 	return wanted - bytes;
 }
 
+#ifdef PIPE_PARANOIA
+static bool sanity(const struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	int idx = i->idx;
+	int delta = (pipe->curbuf + pipe->nrbufs - idx) & (pipe->buffers - 1);
+	if (i->iov_offset) {
+		struct pipe_buffer *p;
+		if (unlikely(delta != 1) || unlikely(!pipe->nrbufs))
+			goto Bad;	// must be at the last buffer...
+
+		p = &pipe->bufs[idx];
+		if (unlikely(p->offset + p->len != i->iov_offset))
+			goto Bad;	// ... at the end of segment
+	} else {
+		if (delta)
+			goto Bad;	// must be right after the last buffer
+	}
+	return true;
+Bad:
+	WARN_ON(1);
+	return false;
+}
+#else
+#define sanity(i) true
+#endif
+
+static inline int next_idx(int idx, struct pipe_inode_info *pipe)
+{
+	return (idx + 1) & (pipe->buffers - 1);
+}
+
+static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
+			 struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	struct pipe_buffer *buf;
+	size_t off;
+	int idx;
+
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	if (!sanity(i))
+		return 0;
+
+	off = i->iov_offset;
+	idx = i->idx;
+	buf = &pipe->bufs[idx];
+	if (off) {
+		if (offset == off && buf->page == page) {
+			/* merge with the last one */
+			buf->len += bytes;
+			i->iov_offset += bytes;
+			goto out;
+		}
+		idx = next_idx(idx, pipe);
+		buf = &pipe->bufs[idx];
+	}
+	if (idx == pipe->curbuf && pipe->nrbufs)
+		return 0;
+	pipe->nrbufs++;
+	buf->ops = &page_cache_pipe_buf_ops;
+	get_page(buf->page = page);
+	buf->offset = offset;
+	buf->len = bytes;
+	i->iov_offset = offset + bytes;
+	i->idx = idx;
+out:
+	i->count -= bytes;
+	return bytes;
+}
+
 /*
  * Fault in the first iovec of the given iov_iter, to a maximum length
  * of bytes. Returns 0 on success, or non-zero if the memory could not be
@@ -376,9 +455,98 @@ static void memzero_page(struct page *page, size_t offset, size_t len)
 	kunmap_atomic(addr);
 }
 
+static inline bool allocated(struct pipe_buffer *buf)
+{
+	return buf->ops == &default_pipe_buf_ops;
+}
+
+static inline void data_start(const struct iov_iter *i, int *idxp, size_t *offp)
+{
+	size_t off = i->iov_offset;
+	int idx = i->idx;
+	if (off && (!allocated(&i->pipe->bufs[idx]) || off == PAGE_SIZE)) {
+		idx = next_idx(idx, i->pipe);
+		off = 0;
+	}
+	*idxp = idx;
+	*offp = off;
+}
+
+static size_t push_pipe(struct iov_iter *i, size_t size,
+			int *idxp, size_t *offp)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	size_t off;
+	int idx;
+	ssize_t left;
+
+	if (unlikely(size > i->count))
+		size = i->count;
+	if (unlikely(!size))
+		return 0;
+
+	left = size;
+	data_start(i, &idx, &off);
+	*idxp = idx;
+	*offp = off;
+	if (off) {
+		left -= PAGE_SIZE - off;
+		if (left <= 0) {
+			pipe->bufs[idx].len += size;
+			return size;
+		}
+		pipe->bufs[idx].len = PAGE_SIZE;
+		idx = next_idx(idx, pipe);
+	}
+	while (idx != pipe->curbuf || !pipe->nrbufs) {
+		struct page *page = alloc_page(GFP_USER);
+		if (!page)
+			break;
+		pipe->nrbufs++;
+		pipe->bufs[idx].ops = &default_pipe_buf_ops;
+		pipe->bufs[idx].page = page;
+		pipe->bufs[idx].offset = 0;
+		if (left <= PAGE_SIZE) {
+			pipe->bufs[idx].len = left;
+			return size;
+		}
+		pipe->bufs[idx].len = PAGE_SIZE;
+		left -= PAGE_SIZE;
+		idx = next_idx(idx, pipe);
+	}
+	return size - left;
+}
+
+static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
+				struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	size_t n, off;
+	int idx;
+
+	if (!sanity(i))
+		return 0;
+
+	bytes = n = push_pipe(i, bytes, &idx, &off);
+	if (unlikely(!n))
+		return 0;
+	for ( ; n; idx = next_idx(idx, pipe), off = 0) {
+		size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
+		memcpy_to_page(pipe->bufs[idx].page, off, addr, chunk);
+		i->idx = idx;
+		i->iov_offset = off + chunk;
+		n -= chunk;
+		addr += chunk;
+	}
+	i->count -= bytes;
+	return bytes;
+}
+
 size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
 	const char *from = addr;
+	if (unlikely(i->type & ITER_PIPE))
+		return copy_pipe_to_iter(addr, bytes, i);
 	iterate_and_advance(i, bytes, v,
 		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
 			       v.iov_len),
@@ -394,6 +562,10 @@ EXPORT_SYMBOL(copy_to_iter);
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 {
 	char *to = addr;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
 	iterate_and_advance(i, bytes, v,
 		__copy_from_user((to += v.iov_len) - v.iov_len, v.iov_base,
 				 v.iov_len),
@@ -409,6 +581,10 @@ EXPORT_SYMBOL(copy_from_iter);
 size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 {
 	char *to = addr;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
 	iterate_and_advance(i, bytes, v,
 		__copy_from_user_nocache((to += v.iov_len) - v.iov_len,
 					 v.iov_base, v.iov_len),
@@ -429,14 +605,20 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 		size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
 		kunmap_atomic(kaddr);
 		return wanted;
-	} else
+	} else if (likely(!(i->type & ITER_PIPE)))
 		return copy_page_to_iter_iovec(page, offset, bytes, i);
+	else
+		return copy_page_to_iter_pipe(page, offset, bytes, i);
 }
 EXPORT_SYMBOL(copy_page_to_iter);
 
 size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
 	if (i->type & (ITER_BVEC|ITER_KVEC)) {
 		void *kaddr = kmap_atomic(page);
 		size_t wanted = copy_from_iter(kaddr + offset, bytes, i);
@@ -447,8 +629,34 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 }
 EXPORT_SYMBOL(copy_page_from_iter);
 
+static size_t pipe_zero(size_t bytes, struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	size_t n, off;
+	int idx;
+
+	if (!sanity(i))
+		return 0;
+
+	bytes = n = push_pipe(i, bytes, &idx, &off);
+	if (unlikely(!n))
+		return 0;
+
+	for ( ; n; idx = next_idx(idx, pipe), off = 0) {
+		size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
+		memzero_page(pipe->bufs[idx].page, off, chunk);
+		i->idx = idx;
+		i->iov_offset = off + chunk;
+		n -= chunk;
+	}
+	i->count -= bytes;
+	return bytes;
+}
+
 size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 {
+	if (unlikely(i->type & ITER_PIPE))
+		return pipe_zero(bytes, i);
 	iterate_and_advance(i, bytes, v,
 		__clear_user(v.iov_base, v.iov_len),
 		memzero_page(v.bv_page, v.bv_offset, v.bv_len),
@@ -463,6 +671,11 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 		struct iov_iter *i, unsigned long offset, size_t bytes)
 {
 	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
+	if (unlikely(i->type & ITER_PIPE)) {
+		kunmap_atomic(kaddr);
+		WARN_ON(1);
+		return 0;
+	}
 	iterate_all_kinds(i, bytes, v,
 		__copy_from_user_inatomic((p += v.iov_len) - v.iov_len,
 					  v.iov_base, v.iov_len),
@@ -475,8 +688,55 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 }
 EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
 
+static void pipe_advance(struct iov_iter *i, size_t size)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	struct pipe_buffer *buf;
+	size_t off;
+	int idx;
+	
+	if (unlikely(i->count < size))
+		size = i->count;
+
+	idx = i->idx;
+	off = i->iov_offset;
+	if (size || off) {
+		/* take it relative to the beginning of buffer */
+		size += off - pipe->bufs[idx].offset;
+		while (1) {
+			buf = &pipe->bufs[idx];
+			if (size > buf->len) {
+				size -= buf->len;
+				idx = next_idx(idx, pipe);
+				off = 0;
+			} else {
+				buf->len = size;
+				i->idx = idx;
+				i->iov_offset = off = buf->offset + size;
+				break;
+			}
+		}
+		idx = next_idx(idx, pipe);
+	}
+	if (pipe->nrbufs) {
+		int unused = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+		/* [curbuf,unused) is in use.  Free [idx,unused) */
+		while (idx != unused) {
+			buf = &pipe->bufs[idx];
+			buf->ops->release(pipe, buf);
+			buf->ops = NULL;
+			idx = next_idx(idx, pipe);
+			pipe->nrbufs--;
+		}
+	}
+}
+
 void iov_iter_advance(struct iov_iter *i, size_t size)
 {
+	if (unlikely(i->type & ITER_PIPE)) {
+		pipe_advance(i, size);
+		return;
+	}
 	iterate_and_advance(i, size, v, 0, 0, 0)
 }
 EXPORT_SYMBOL(iov_iter_advance);
@@ -486,6 +746,8 @@ EXPORT_SYMBOL(iov_iter_advance);
  */
 size_t iov_iter_single_seg_count(const struct iov_iter *i)
 {
+	if (unlikely(i->type & ITER_PIPE))
+		return i->count;	// it is a silly place, anyway
 	if (i->nr_segs == 1)
 		return i->count;
 	else if (i->type & ITER_BVEC)
@@ -521,6 +783,19 @@ void iov_iter_bvec(struct iov_iter *i, int direction,
 }
 EXPORT_SYMBOL(iov_iter_bvec);
 
+void iov_iter_pipe(struct iov_iter *i, int direction,
+			struct pipe_inode_info *pipe,
+			size_t count)
+{
+	BUG_ON(direction != ITER_PIPE);
+	i->type = direction;
+	i->pipe = pipe;
+	i->idx = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+	i->iov_offset = 0;
+	i->count = count;
+}
+EXPORT_SYMBOL(iov_iter_pipe);
+
 unsigned long iov_iter_alignment(const struct iov_iter *i)
 {
 	unsigned long res = 0;
@@ -529,6 +804,11 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
 	if (!size)
 		return 0;
 
+	if (unlikely(i->type & ITER_PIPE)) {
+		if (i->iov_offset && allocated(&i->pipe->bufs[i->idx]))
+			return size | i->iov_offset;
+		return size;
+	}
 	iterate_all_kinds(i, size, v,
 		(res |= (unsigned long)v.iov_base | v.iov_len, 0),
 		res |= v.bv_offset | v.bv_len,
@@ -545,6 +825,11 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
 	if (!size)
 		return 0;
 
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return ~0U;
+	}
+
 	iterate_all_kinds(i, size, v,
 		(res |= (!res ? 0 : (unsigned long)v.iov_base) |
 			(size != v.iov_len ? size : 0), 0),
@@ -557,6 +842,47 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_gap_alignment);
 
+static inline size_t __pipe_get_pages(struct iov_iter *i,
+				size_t maxsize,
+				struct page **pages,
+				int idx,
+				size_t *start)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	size_t n = push_pipe(i, maxsize, &idx, start);
+	if (!n)
+		return 0;
+
+	maxsize = n;
+	n += *start;
+	while (n >= PAGE_SIZE) {
+		get_page(*pages++ = pipe->bufs[idx].page);
+		idx = next_idx(idx, pipe);
+		n -= PAGE_SIZE;
+	}
+
+	return maxsize;
+}
+
+static ssize_t pipe_get_pages(struct iov_iter *i,
+		   struct page **pages, size_t maxsize, unsigned maxpages,
+		   size_t *start)
+{
+	unsigned npages;
+	size_t capacity;
+	int idx;
+
+	if (!sanity(i))
+		return 0;
+
+	data_start(i, &idx, start);
+	/* some of this one + all after this one */
+	npages = ((i->pipe->curbuf - idx - 1) & (i->pipe->buffers - 1)) + 1;
+	capacity = min(npages,maxpages) * PAGE_SIZE - *start;
+
+	return __pipe_get_pages(i, min(maxsize, capacity), pages, idx, start);
+}
+
 ssize_t iov_iter_get_pages(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
@@ -567,6 +893,8 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 	if (!maxsize)
 		return 0;
 
+	if (unlikely(i->type & ITER_PIPE))
+		return pipe_get_pages(i, pages, maxsize, maxpages, start);
 	iterate_all_kinds(i, maxsize, v, ({
 		unsigned long addr = (unsigned long)v.iov_base;
 		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
@@ -602,6 +930,37 @@ static struct page **get_pages_array(size_t n)
 	return p;
 }
 
+static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
+		   struct page ***pages, size_t maxsize,
+		   size_t *start)
+{
+	struct page **p;
+	size_t n;
+	int idx;
+	int npages;
+
+	if (!sanity(i))
+		return 0;
+
+	data_start(i, &idx, start);
+	/* some of this one + all after this one */
+	npages = ((i->pipe->curbuf - idx - 1) & (i->pipe->buffers - 1)) + 1;
+	n = npages * PAGE_SIZE - *start;
+	if (maxsize > n)
+		maxsize = n;
+	else
+		npages = DIV_ROUND_UP(maxsize + *start, PAGE_SIZE);
+	p = get_pages_array(npages);
+	if (!p)
+		return -ENOMEM;
+	n = __pipe_get_pages(i, maxsize, p, idx, start);
+	if (n)
+		*pages = p;
+	else
+		kvfree(p);
+	return n;
+}
+
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start)
@@ -614,6 +973,8 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 	if (!maxsize)
 		return 0;
 
+	if (unlikely(i->type & ITER_PIPE))
+		return pipe_get_pages_alloc(i, pages, maxsize, start);
 	iterate_all_kinds(i, maxsize, v, ({
 		unsigned long addr = (unsigned long)v.iov_base;
 		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
@@ -655,6 +1016,10 @@ size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 	__wsum sum, next;
 	size_t off = 0;
 	sum = *csum;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
 	iterate_and_advance(i, bytes, v, ({
 		int err = 0;
 		next = csum_and_copy_from_user(v.iov_base, 
@@ -693,6 +1058,10 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum,
 	__wsum sum, next;
 	size_t off = 0;
 	sum = *csum;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);	/* for now */
+		return 0;
+	}
 	iterate_and_advance(i, bytes, v, ({
 		int err = 0;
 		next = csum_and_copy_to_user((from += v.iov_len) - v.iov_len,
@@ -732,7 +1101,20 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 	if (!size)
 		return 0;
 
-	iterate_all_kinds(i, size, v, ({
+	if (unlikely(i->type & ITER_PIPE)) {
+		struct pipe_inode_info *pipe = i->pipe;
+		size_t off;
+		int idx;
+
+		if (!sanity(i))
+			return 0;
+
+		data_start(i, &idx, &off);
+		/* some of this one + all after this one */
+		npages = ((pipe->curbuf - idx - 1) & (pipe->buffers - 1)) + 1;
+		if (npages >= maxpages)
+			return maxpages;
+	} else iterate_all_kinds(i, size, v, ({
 		unsigned long p = (unsigned long)v.iov_base;
 		npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
 			- p / PAGE_SIZE;
@@ -757,6 +1139,10 @@ EXPORT_SYMBOL(iov_iter_npages);
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags)
 {
 	*new = *old;
+	if (unlikely(new->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return NULL;
+	}
 	if (new->type & ITER_BVEC)
 		return new->bvec = kmemdup(new->bvec,
 				    new->nr_segs * sizeof(struct bio_vec),
-- 
2.9.3



* [PATCH 11/12] switch generic_file_splice_read() to use of ->read_iter()
  2016-09-23 20:36                                 ` Linus Torvalds
                                                     ` (3 preceding siblings ...)
  2016-09-24  4:01                                   ` [PATCH 10/12] new iov_iter flavour: pipe-backed Al Viro
@ 2016-09-24  4:01                                   ` Al Viro
  2016-09-24  4:02                                   ` [PATCH 12/12] switch default_file_splice_read() to use of pipe-backed iov_iter Al Viro
  5 siblings, 0 replies; 104+ messages in thread
From: Al Viro @ 2016-09-24  4:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

... and kill the ->splice_read() instances that can be switched to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 drivers/staging/lustre/lustre/llite/file.c         |  70 ++----
 .../staging/lustre/lustre/llite/llite_internal.h   |  15 +-
 drivers/staging/lustre/lustre/llite/vvp_internal.h |  14 --
 drivers/staging/lustre/lustre/llite/vvp_io.c       |  45 +---
 fs/coda/file.c                                     |  23 +-
 fs/gfs2/file.c                                     |  28 +--
 fs/nfs/file.c                                      |  25 +--
 fs/nfs/internal.h                                  |   2 -
 fs/nfs/nfs4file.c                                  |   2 +-
 fs/ocfs2/file.c                                    |  34 +--
 fs/ocfs2/ocfs2_trace.h                             |   2 -
 fs/splice.c                                        | 244 +++------------------
 fs/xfs/xfs_file.c                                  |  41 +---
 fs/xfs/xfs_trace.h                                 |   1 -
 include/linux/fs.h                                 |   2 -
 mm/shmem.c                                         | 115 +---------
 16 files changed, 58 insertions(+), 605 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
index 57281b9..2567b09 100644
--- a/drivers/staging/lustre/lustre/llite/file.c
+++ b/drivers/staging/lustre/lustre/llite/file.c
@@ -1153,36 +1153,21 @@ restart:
 		int write_mutex_locked = 0;
 
 		vio->vui_fd  = LUSTRE_FPRIVATE(file);
-		vio->vui_io_subtype = args->via_io_subtype;
-
-		switch (vio->vui_io_subtype) {
-		case IO_NORMAL:
-			vio->vui_iter = args->u.normal.via_iter;
-			vio->vui_iocb = args->u.normal.via_iocb;
-			if ((iot == CIT_WRITE) &&
-			    !(vio->vui_fd->fd_flags & LL_FILE_GROUP_LOCKED)) {
-				if (mutex_lock_interruptible(&lli->
-							       lli_write_mutex)) {
-					result = -ERESTARTSYS;
-					goto out;
-				}
-				write_mutex_locked = 1;
+		vio->vui_iter = args->u.normal.via_iter;
+		vio->vui_iocb = args->u.normal.via_iocb;
+		if ((iot == CIT_WRITE) &&
+		    !(vio->vui_fd->fd_flags & LL_FILE_GROUP_LOCKED)) {
+			if (mutex_lock_interruptible(&lli->lli_write_mutex)) {
+				result = -ERESTARTSYS;
+				goto out;
 			}
-			down_read(&lli->lli_trunc_sem);
-			break;
-		case IO_SPLICE:
-			vio->u.splice.vui_pipe = args->u.splice.via_pipe;
-			vio->u.splice.vui_flags = args->u.splice.via_flags;
-			break;
-		default:
-			CERROR("Unknown IO type - %u\n", vio->vui_io_subtype);
-			LBUG();
+			write_mutex_locked = 1;
 		}
+		down_read(&lli->lli_trunc_sem);
 		ll_cl_add(file, env, io);
 		result = cl_io_loop(env, io);
 		ll_cl_remove(file, env);
-		if (args->via_io_subtype == IO_NORMAL)
-			up_read(&lli->lli_trunc_sem);
+		up_read(&lli->lli_trunc_sem);
 		if (write_mutex_locked)
 			mutex_unlock(&lli->lli_write_mutex);
 	} else {
@@ -1237,7 +1222,7 @@ static ssize_t ll_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (IS_ERR(env))
 		return PTR_ERR(env);
 
-	args = ll_env_args(env, IO_NORMAL);
+	args = ll_env_args(env);
 	args->u.normal.via_iter = to;
 	args->u.normal.via_iocb = iocb;
 
@@ -1261,7 +1246,7 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (IS_ERR(env))
 		return PTR_ERR(env);
 
-	args = ll_env_args(env, IO_NORMAL);
+	args = ll_env_args(env);
 	args->u.normal.via_iter = from;
 	args->u.normal.via_iocb = iocb;
 
@@ -1271,31 +1256,6 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	return result;
 }
 
-/*
- * Send file content (through pagecache) somewhere with helper
- */
-static ssize_t ll_file_splice_read(struct file *in_file, loff_t *ppos,
-				   struct pipe_inode_info *pipe, size_t count,
-				   unsigned int flags)
-{
-	struct lu_env      *env;
-	struct vvp_io_args *args;
-	ssize_t	     result;
-	int		 refcheck;
-
-	env = cl_env_get(&refcheck);
-	if (IS_ERR(env))
-		return PTR_ERR(env);
-
-	args = ll_env_args(env, IO_SPLICE);
-	args->u.splice.via_pipe = pipe;
-	args->u.splice.via_flags = flags;
-
-	result = ll_file_io_generic(env, args, in_file, CIT_READ, ppos, count);
-	cl_env_put(env, &refcheck);
-	return result;
-}
-
 static int ll_lov_recreate(struct inode *inode, struct ost_id *oi, u32 ost_idx)
 {
 	struct obd_export *exp = ll_i2dtexp(inode);
@@ -3173,7 +3133,7 @@ struct file_operations ll_file_operations = {
 	.release	= ll_file_release,
 	.mmap	   = ll_file_mmap,
 	.llseek	 = ll_file_seek,
-	.splice_read    = ll_file_splice_read,
+	.splice_read    = generic_file_splice_read,
 	.fsync	  = ll_fsync,
 	.flush	  = ll_flush
 };
@@ -3186,7 +3146,7 @@ struct file_operations ll_file_operations_flock = {
 	.release	= ll_file_release,
 	.mmap	   = ll_file_mmap,
 	.llseek	 = ll_file_seek,
-	.splice_read    = ll_file_splice_read,
+	.splice_read    = generic_file_splice_read,
 	.fsync	  = ll_fsync,
 	.flush	  = ll_flush,
 	.flock	  = ll_file_flock,
@@ -3202,7 +3162,7 @@ struct file_operations ll_file_operations_noflock = {
 	.release	= ll_file_release,
 	.mmap	   = ll_file_mmap,
 	.llseek	 = ll_file_seek,
-	.splice_read    = ll_file_splice_read,
+	.splice_read    = generic_file_splice_read,
 	.fsync	  = ll_fsync,
 	.flush	  = ll_flush,
 	.flock	  = ll_file_noflock,
diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
index 4d6d589..0e738c8 100644
--- a/drivers/staging/lustre/lustre/llite/llite_internal.h
+++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
@@ -800,17 +800,11 @@ void vvp_write_complete(struct vvp_object *club, struct vvp_page *page);
  */
 struct vvp_io_args {
 	/** normal/splice */
-	enum vvp_io_subtype via_io_subtype;
-
 	union {
 		struct {
 			struct kiocb      *via_iocb;
 			struct iov_iter   *via_iter;
 		} normal;
-		struct {
-			struct pipe_inode_info  *via_pipe;
-			unsigned int       via_flags;
-		} splice;
 	} u;
 };
 
@@ -838,14 +832,9 @@ static inline struct ll_thread_info *ll_env_info(const struct lu_env *env)
 	return lti;
 }
 
-static inline struct vvp_io_args *ll_env_args(const struct lu_env *env,
-					      enum vvp_io_subtype type)
+static inline struct vvp_io_args *ll_env_args(const struct lu_env *env)
 {
-	struct vvp_io_args *via = &ll_env_info(env)->lti_args;
-
-	via->via_io_subtype = type;
-
-	return via;
+	return &ll_env_info(env)->lti_args;
 }
 
 void ll_queue_done_writing(struct inode *inode, unsigned long flags);
diff --git a/drivers/staging/lustre/lustre/llite/vvp_internal.h b/drivers/staging/lustre/lustre/llite/vvp_internal.h
index 79fc428..2fa49cc 100644
--- a/drivers/staging/lustre/lustre/llite/vvp_internal.h
+++ b/drivers/staging/lustre/lustre/llite/vvp_internal.h
@@ -49,14 +49,6 @@ struct obd_device;
 struct obd_export;
 struct page;
 
-/* specific architecture can implement only part of this list */
-enum vvp_io_subtype {
-	/** normal IO */
-	IO_NORMAL,
-	/** io started from splice_{read|write} */
-	IO_SPLICE
-};
-
 /**
  * IO state private to IO state private to VVP layer.
  */
@@ -99,10 +91,6 @@ struct vvp_io {
 			bool		ft_flags_valid;
 		} fault;
 		struct {
-			struct pipe_inode_info	*vui_pipe;
-			unsigned int		 vui_flags;
-		} splice;
-		struct {
 			struct cl_page_list vui_queue;
 			unsigned long vui_written;
 			int vui_from;
@@ -110,8 +98,6 @@ struct vvp_io {
 		} write;
 	} u;
 
-	enum vvp_io_subtype	vui_io_subtype;
-
 	/**
 	 * Layout version when this IO is initialized
 	 */
diff --git a/drivers/staging/lustre/lustre/llite/vvp_io.c b/drivers/staging/lustre/lustre/llite/vvp_io.c
index 94916dc..4864600 100644
--- a/drivers/staging/lustre/lustre/llite/vvp_io.c
+++ b/drivers/staging/lustre/lustre/llite/vvp_io.c
@@ -55,18 +55,6 @@ static struct vvp_io *cl2vvp_io(const struct lu_env *env,
 }
 
 /**
- * True, if \a io is a normal io, False for splice_{read,write}
- */
-static int cl_is_normalio(const struct lu_env *env, const struct cl_io *io)
-{
-	struct vvp_io *vio = vvp_env_io(env);
-
-	LASSERT(io->ci_type == CIT_READ || io->ci_type == CIT_WRITE);
-
-	return vio->vui_io_subtype == IO_NORMAL;
-}
-
-/**
  * For swapping layout. The file's layout may have changed.
  * To avoid populating pages to a wrong stripe, we have to verify the
  * correctness of layout. It works because swapping layout processes
@@ -391,9 +379,6 @@ static int vvp_mmap_locks(const struct lu_env *env,
 
 	LASSERT(io->ci_type == CIT_READ || io->ci_type == CIT_WRITE);
 
-	if (!cl_is_normalio(env, io))
-		return 0;
-
 	if (!vio->vui_iter) /* nfs or loop back device write */
 		return 0;
 
@@ -462,15 +447,10 @@ static void vvp_io_advance(const struct lu_env *env,
 			   const struct cl_io_slice *ios,
 			   size_t nob)
 {
-	struct vvp_io    *vio = cl2vvp_io(env, ios);
-	struct cl_io     *io  = ios->cis_io;
 	struct cl_object *obj = ios->cis_io->ci_obj;
-
+	struct vvp_io	 *vio = cl2vvp_io(env, ios);
 	CLOBINVRNT(env, obj, vvp_object_invariant(obj));
 
-	if (!cl_is_normalio(env, io))
-		return;
-
 	iov_iter_reexpand(vio->vui_iter, vio->vui_tot_count  -= nob);
 }
 
@@ -479,7 +459,7 @@ static void vvp_io_update_iov(const struct lu_env *env,
 {
 	size_t size = io->u.ci_rw.crw_count;
 
-	if (!cl_is_normalio(env, io) || !vio->vui_iter)
+	if (!vio->vui_iter)
 		return;
 
 	iov_iter_truncate(vio->vui_iter, size);
@@ -716,25 +696,8 @@ static int vvp_io_read_start(const struct lu_env *env,
 
 	/* BUG: 5972 */
 	file_accessed(file);
-	switch (vio->vui_io_subtype) {
-	case IO_NORMAL:
-		LASSERT(vio->vui_iocb->ki_pos == pos);
-		result = generic_file_read_iter(vio->vui_iocb, vio->vui_iter);
-		break;
-	case IO_SPLICE:
-		result = generic_file_splice_read(file, &pos,
-						  vio->u.splice.vui_pipe, cnt,
-						  vio->u.splice.vui_flags);
-		/* LU-1109: do splice read stripe by stripe otherwise if it
-		 * may make nfsd stuck if this read occupied all internal pipe
-		 * buffers.
-		 */
-		io->ci_continue = 0;
-		break;
-	default:
-		CERROR("Wrong IO type %u\n", vio->vui_io_subtype);
-		LBUG();
-	}
+	LASSERT(vio->vui_iocb->ki_pos == pos);
+	result = generic_file_read_iter(vio->vui_iocb, vio->vui_iter);
 
 out:
 	if (result >= 0) {
diff --git a/fs/coda/file.c b/fs/coda/file.c
index f47c748..8415d4f 100644
--- a/fs/coda/file.c
+++ b/fs/coda/file.c
@@ -38,27 +38,6 @@ coda_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 }
 
 static ssize_t
-coda_file_splice_read(struct file *coda_file, loff_t *ppos,
-		      struct pipe_inode_info *pipe, size_t count,
-		      unsigned int flags)
-{
-	ssize_t (*splice_read)(struct file *, loff_t *,
-			       struct pipe_inode_info *, size_t, unsigned int);
-	struct coda_file_info *cfi;
-	struct file *host_file;
-
-	cfi = CODA_FTOC(coda_file);
-	BUG_ON(!cfi || cfi->cfi_magic != CODA_MAGIC);
-	host_file = cfi->cfi_container;
-
-	splice_read = host_file->f_op->splice_read;
-	if (!splice_read)
-		splice_read = default_file_splice_read;
-
-	return splice_read(host_file, ppos, pipe, count, flags);
-}
-
-static ssize_t
 coda_file_write_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct file *coda_file = iocb->ki_filp;
@@ -225,6 +204,6 @@ const struct file_operations coda_file_operations = {
 	.open		= coda_open,
 	.release	= coda_release,
 	.fsync		= coda_fsync,
-	.splice_read	= coda_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 };
 
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 320e65e..7016a6a7 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -954,30 +954,6 @@ out_uninit:
 	return ret;
 }
 
-static ssize_t gfs2_file_splice_read(struct file *in, loff_t *ppos,
-				     struct pipe_inode_info *pipe, size_t len,
-				     unsigned int flags)
-{
-	struct inode *inode = in->f_mapping->host;
-	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_holder gh;
-	int ret;
-
-	inode_lock(inode);
-
-	ret = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
-	if (ret) {
-		inode_unlock(inode);
-		return ret;
-	}
-
-	gfs2_glock_dq_uninit(&gh);
-	inode_unlock(inode);
-
-	return generic_file_splice_read(in, ppos, pipe, len, flags);
-}
-
-
 static ssize_t gfs2_file_splice_write(struct pipe_inode_info *pipe,
 				      struct file *out, loff_t *ppos,
 				      size_t len, unsigned int flags)
@@ -1140,7 +1116,7 @@ const struct file_operations gfs2_file_fops = {
 	.fsync		= gfs2_fsync,
 	.lock		= gfs2_lock,
 	.flock		= gfs2_flock,
-	.splice_read	= gfs2_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= gfs2_file_splice_write,
 	.setlease	= simple_nosetlease,
 	.fallocate	= gfs2_fallocate,
@@ -1168,7 +1144,7 @@ const struct file_operations gfs2_file_fops_nolock = {
 	.open		= gfs2_open,
 	.release	= gfs2_release,
 	.fsync		= gfs2_fsync,
-	.splice_read	= gfs2_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= gfs2_file_splice_write,
 	.setlease	= generic_setlease,
 	.fallocate	= gfs2_fallocate,
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 7d62097..5048585 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -182,29 +182,6 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
 }
 EXPORT_SYMBOL_GPL(nfs_file_read);
 
-ssize_t
-nfs_file_splice_read(struct file *filp, loff_t *ppos,
-		     struct pipe_inode_info *pipe, size_t count,
-		     unsigned int flags)
-{
-	struct inode *inode = file_inode(filp);
-	ssize_t res;
-
-	dprintk("NFS: splice_read(%pD2, %lu@%Lu)\n",
-		filp, (unsigned long) count, (unsigned long long) *ppos);
-
-	nfs_start_io_read(inode);
-	res = nfs_revalidate_mapping(inode, filp->f_mapping);
-	if (!res) {
-		res = generic_file_splice_read(filp, ppos, pipe, count, flags);
-		if (res > 0)
-			nfs_add_stats(inode, NFSIOS_NORMALREADBYTES, res);
-	}
-	nfs_end_io_read(inode);
-	return res;
-}
-EXPORT_SYMBOL_GPL(nfs_file_splice_read);
-
 int
 nfs_file_mmap(struct file * file, struct vm_area_struct * vma)
 {
@@ -868,7 +845,7 @@ const struct file_operations nfs_file_operations = {
 	.fsync		= nfs_file_fsync,
 	.lock		= nfs_lock,
 	.flock		= nfs_flock,
-	.splice_read	= nfs_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.check_flags	= nfs_check_flags,
 	.setlease	= simple_nosetlease,
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 74935a1..d7b062b 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -365,8 +365,6 @@ int nfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *)
 int nfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync);
 loff_t nfs_file_llseek(struct file *, loff_t, int);
 ssize_t nfs_file_read(struct kiocb *, struct iov_iter *);
-ssize_t nfs_file_splice_read(struct file *, loff_t *, struct pipe_inode_info *,
-			     size_t, unsigned int);
 int nfs_file_mmap(struct file *, struct vm_area_struct *);
 ssize_t nfs_file_write(struct kiocb *, struct iov_iter *);
 int nfs_file_release(struct inode *, struct file *);
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index d085ad7..89a7795 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -248,7 +248,7 @@ const struct file_operations nfs4_file_operations = {
 	.fsync		= nfs_file_fsync,
 	.lock		= nfs_lock,
 	.flock		= nfs_flock,
-	.splice_read	= nfs_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.check_flags	= nfs_check_flags,
 	.setlease	= simple_nosetlease,
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 4e7b0dc..6596e41 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2307,36 +2307,6 @@ out_mutex:
 	return ret;
 }
 
-static ssize_t ocfs2_file_splice_read(struct file *in,
-				      loff_t *ppos,
-				      struct pipe_inode_info *pipe,
-				      size_t len,
-				      unsigned int flags)
-{
-	int ret = 0, lock_level = 0;
-	struct inode *inode = file_inode(in);
-
-	trace_ocfs2_file_splice_read(inode, in, in->f_path.dentry,
-			(unsigned long long)OCFS2_I(inode)->ip_blkno,
-			in->f_path.dentry->d_name.len,
-			in->f_path.dentry->d_name.name, len);
-
-	/*
-	 * See the comment in ocfs2_file_read_iter()
-	 */
-	ret = ocfs2_inode_lock_atime(inode, in->f_path.mnt, &lock_level);
-	if (ret < 0) {
-		mlog_errno(ret);
-		goto bail;
-	}
-	ocfs2_inode_unlock(inode, lock_level);
-
-	ret = generic_file_splice_read(in, ppos, pipe, len, flags);
-
-bail:
-	return ret;
-}
-
 static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
 				   struct iov_iter *to)
 {
@@ -2495,7 +2465,7 @@ const struct file_operations ocfs2_fops = {
 #endif
 	.lock		= ocfs2_lock,
 	.flock		= ocfs2_flock,
-	.splice_read	= ocfs2_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
 };
@@ -2540,7 +2510,7 @@ const struct file_operations ocfs2_fops_no_plocks = {
 	.compat_ioctl   = ocfs2_compat_ioctl,
 #endif
 	.flock		= ocfs2_flock,
-	.splice_read	= ocfs2_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
 };
diff --git a/fs/ocfs2/ocfs2_trace.h b/fs/ocfs2/ocfs2_trace.h
index f8f5fc5..0b58abc 100644
--- a/fs/ocfs2/ocfs2_trace.h
+++ b/fs/ocfs2/ocfs2_trace.h
@@ -1314,8 +1314,6 @@ DEFINE_OCFS2_FILE_OPS(ocfs2_file_aio_write);
 
 DEFINE_OCFS2_FILE_OPS(ocfs2_file_splice_write);
 
-DEFINE_OCFS2_FILE_OPS(ocfs2_file_splice_read);
-
 DEFINE_OCFS2_FILE_OPS(ocfs2_file_aio_read);
 
 DEFINE_OCFS2_ULL_ULL_ULL_EVENT(ocfs2_truncate_file);
diff --git a/fs/splice.c b/fs/splice.c
index 589a1d5..58c322a 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -281,207 +281,6 @@ void splice_shrink_spd(struct splice_pipe_desc *spd)
 	kfree(spd->partial);
 }
 
-static int
-__generic_file_splice_read(struct file *in, loff_t *ppos,
-			   struct pipe_inode_info *pipe, size_t len,
-			   unsigned int flags)
-{
-	struct address_space *mapping = in->f_mapping;
-	unsigned int loff, nr_pages, req_pages;
-	struct page *pages[PIPE_DEF_BUFFERS];
-	struct partial_page partial[PIPE_DEF_BUFFERS];
-	struct page *page;
-	pgoff_t index, end_index;
-	loff_t isize;
-	int error, page_nr;
-	struct splice_pipe_desc spd = {
-		.pages = pages,
-		.partial = partial,
-		.nr_pages_max = PIPE_DEF_BUFFERS,
-		.flags = flags,
-		.ops = &page_cache_pipe_buf_ops,
-		.spd_release = spd_release_page,
-	};
-
-	if (splice_grow_spd(pipe, &spd))
-		return -ENOMEM;
-
-	index = *ppos >> PAGE_SHIFT;
-	loff = *ppos & ~PAGE_MASK;
-	req_pages = (len + loff + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	nr_pages = min(req_pages, spd.nr_pages_max);
-
-	/*
-	 * Lookup the (hopefully) full range of pages we need.
-	 */
-	spd.nr_pages = find_get_pages_contig(mapping, index, nr_pages, spd.pages);
-	index += spd.nr_pages;
-
-	/*
-	 * If find_get_pages_contig() returned fewer pages than we needed,
-	 * readahead/allocate the rest and fill in the holes.
-	 */
-	if (spd.nr_pages < nr_pages)
-		page_cache_sync_readahead(mapping, &in->f_ra, in,
-				index, req_pages - spd.nr_pages);
-
-	error = 0;
-	while (spd.nr_pages < nr_pages) {
-		/*
-		 * Page could be there, find_get_pages_contig() breaks on
-		 * the first hole.
-		 */
-		page = find_get_page(mapping, index);
-		if (!page) {
-			/*
-			 * page didn't exist, allocate one.
-			 */
-			page = page_cache_alloc_cold(mapping);
-			if (!page)
-				break;
-
-			error = add_to_page_cache_lru(page, mapping, index,
-				   mapping_gfp_constraint(mapping, GFP_KERNEL));
-			if (unlikely(error)) {
-				put_page(page);
-				if (error == -EEXIST)
-					continue;
-				break;
-			}
-			/*
-			 * add_to_page_cache() locks the page, unlock it
-			 * to avoid convoluting the logic below even more.
-			 */
-			unlock_page(page);
-		}
-
-		spd.pages[spd.nr_pages++] = page;
-		index++;
-	}
-
-	/*
-	 * Now loop over the map and see if we need to start IO on any
-	 * pages, fill in the partial map, etc.
-	 */
-	index = *ppos >> PAGE_SHIFT;
-	nr_pages = spd.nr_pages;
-	spd.nr_pages = 0;
-	for (page_nr = 0; page_nr < nr_pages; page_nr++) {
-		unsigned int this_len;
-
-		if (!len)
-			break;
-
-		/*
-		 * this_len is the max we'll use from this page
-		 */
-		this_len = min_t(unsigned long, len, PAGE_SIZE - loff);
-		page = spd.pages[page_nr];
-
-		if (PageReadahead(page))
-			page_cache_async_readahead(mapping, &in->f_ra, in,
-					page, index, req_pages - page_nr);
-
-		/*
-		 * If the page isn't uptodate, we may need to start io on it
-		 */
-		if (!PageUptodate(page)) {
-			lock_page(page);
-
-			/*
-			 * Page was truncated, or invalidated by the
-			 * filesystem.  Redo the find/create, but this time the
-			 * page is kept locked, so there's no chance of another
-			 * race with truncate/invalidate.
-			 */
-			if (!page->mapping) {
-				unlock_page(page);
-retry_lookup:
-				page = find_or_create_page(mapping, index,
-						mapping_gfp_mask(mapping));
-
-				if (!page) {
-					error = -ENOMEM;
-					break;
-				}
-				put_page(spd.pages[page_nr]);
-				spd.pages[page_nr] = page;
-			}
-			/*
-			 * page was already under io and is now done, great
-			 */
-			if (PageUptodate(page)) {
-				unlock_page(page);
-				goto fill_it;
-			}
-
-			/*
-			 * need to read in the page
-			 */
-			error = mapping->a_ops->readpage(in, page);
-			if (unlikely(error)) {
-				/*
-				 * Re-lookup the page
-				 */
-				if (error == AOP_TRUNCATED_PAGE)
-					goto retry_lookup;
-
-				break;
-			}
-		}
-fill_it:
-		/*
-		 * i_size must be checked after PageUptodate.
-		 */
-		isize = i_size_read(mapping->host);
-		end_index = (isize - 1) >> PAGE_SHIFT;
-		if (unlikely(!isize || index > end_index))
-			break;
-
-		/*
-		 * if this is the last page, see if we need to shrink
-		 * the length and stop
-		 */
-		if (end_index == index) {
-			unsigned int plen;
-
-			/*
-			 * max good bytes in this page
-			 */
-			plen = ((isize - 1) & ~PAGE_MASK) + 1;
-			if (plen <= loff)
-				break;
-
-			/*
-			 * force quit after adding this page
-			 */
-			this_len = min(this_len, plen - loff);
-			len = this_len;
-		}
-
-		spd.partial[page_nr].offset = loff;
-		spd.partial[page_nr].len = this_len;
-		len -= this_len;
-		loff = 0;
-		spd.nr_pages++;
-		index++;
-	}
-
-	/*
-	 * Release any pages at the end, if we quit early. 'page_nr' is how far
-	 * we got, 'nr_pages' is how many pages are in the map.
-	 */
-	while (page_nr < nr_pages)
-		put_page(spd.pages[page_nr++]);
-	in->f_ra.prev_pos = (loff_t)index << PAGE_SHIFT;
-
-	if (spd.nr_pages)
-		error = splice_to_pipe(pipe, &spd);
-
-	splice_shrink_spd(&spd);
-	return error;
-}
-
 /**
  * generic_file_splice_read - splice data from file to a pipe
  * @in:		file to splice from
@@ -492,32 +291,46 @@ fill_it:
  *
  * Description:
  *    Will read pages from given file and fill them into a pipe. Can be
- *    used as long as the address_space operations for the source implements
- *    a readpage() hook.
+ *    used as long as it has more or less sane ->read_iter().
  *
  */
 ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
 {
-	loff_t isize, left;
-	int ret;
-
-	if (IS_DAX(in->f_mapping->host))
-		return default_file_splice_read(in, ppos, pipe, len, flags);
+	struct iov_iter to;
+	struct kiocb kiocb;
+	loff_t isize;
+	int idx, ret;
 
 	isize = i_size_read(in->f_mapping->host);
 	if (unlikely(*ppos >= isize))
 		return 0;
 
-	left = isize - *ppos;
-	if (unlikely(left < len))
-		len = left;
-
-	ret = __generic_file_splice_read(in, ppos, pipe, len, flags);
+	iov_iter_pipe(&to, ITER_PIPE | READ, pipe, len);
+	idx = to.idx;
+	init_sync_kiocb(&kiocb, in);
+	kiocb.ki_pos = *ppos;
+	ret = in->f_op->read_iter(&kiocb, &to);
 	if (ret > 0) {
-		*ppos += ret;
+		*ppos = kiocb.ki_pos;
 		file_accessed(in);
+	} else if (ret < 0) {
+		if (WARN_ON(to.idx != idx || to.iov_offset)) {
+			/*
+			 * a bogus ->read_iter() has copied something and still
+			 * returned an error instead of a short read.
+			 */
+			to.idx = idx;
+			to.iov_offset = 0;
+			iov_iter_advance(&to, 0); /* to free what was emitted */
+		}
+		/*
+		 * callers of ->splice_read() expect -EAGAIN on
+		 * "can't put anything in there", rather than -EFAULT.
+		 */
+		if (ret == -EFAULT)
+			ret = -EAGAIN;
 	}
 
 	return ret;
@@ -580,7 +393,7 @@ ssize_t kernel_write(struct file *file, const char *buf, size_t count,
 }
 EXPORT_SYMBOL(kernel_write);
 
-ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
+static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
 {
@@ -675,7 +488,6 @@ err:
 	res = error;
 	goto shrink_ret;
 }
-EXPORT_SYMBOL(default_file_splice_read);
 
 /*
  * Send 'sd->len' bytes to socket from 'sd->file' at position 'sd->pos'
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e612a02..92f16cf 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -399,45 +399,6 @@ xfs_file_read_iter(
 	return ret;
 }
 
-STATIC ssize_t
-xfs_file_splice_read(
-	struct file		*infilp,
-	loff_t			*ppos,
-	struct pipe_inode_info	*pipe,
-	size_t			count,
-	unsigned int		flags)
-{
-	struct xfs_inode	*ip = XFS_I(infilp->f_mapping->host);
-	ssize_t			ret;
-
-	XFS_STATS_INC(ip->i_mount, xs_read_calls);
-
-	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
-		return -EIO;
-
-	trace_xfs_file_splice_read(ip, count, *ppos);
-
-	/*
-	 * DAX inodes cannot ues the page cache for splice, so we have to push
-	 * them through the VFS IO path. This means it goes through
-	 * ->read_iter, which for us takes the XFS_IOLOCK_SHARED. Hence we
-	 * cannot lock the splice operation at this level for DAX inodes.
-	 */
-	if (IS_DAX(VFS_I(ip))) {
-		ret = default_file_splice_read(infilp, ppos, pipe, count,
-					       flags);
-		goto out;
-	}
-
-	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
-	ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
-	xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
-out:
-	if (ret > 0)
-		XFS_STATS_ADD(ip->i_mount, xs_read_bytes, ret);
-	return ret;
-}
-
 /*
  * Zero any on disk space between the current EOF and the new, larger EOF.
  *
@@ -1652,7 +1613,7 @@ const struct file_operations xfs_file_operations = {
 	.llseek		= xfs_file_llseek,
 	.read_iter	= xfs_file_read_iter,
 	.write_iter	= xfs_file_write_iter,
-	.splice_read	= xfs_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.unlocked_ioctl	= xfs_file_ioctl,
 #ifdef CONFIG_COMPAT
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index d303a66..f31db44 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1170,7 +1170,6 @@ DEFINE_RW_EVENT(xfs_file_dax_read);
 DEFINE_RW_EVENT(xfs_file_buffered_write);
 DEFINE_RW_EVENT(xfs_file_direct_write);
 DEFINE_RW_EVENT(xfs_file_dax_write);
-DEFINE_RW_EVENT(xfs_file_splice_read);
 
 DECLARE_EVENT_CLASS(xfs_page_class,
 	TP_PROTO(struct inode *inode, struct page *page, unsigned long off,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 901e25d..b04883e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2794,8 +2794,6 @@ extern void block_sync_page(struct page *page);
 /* fs/splice.c */
 extern ssize_t generic_file_splice_read(struct file *, loff_t *,
 		struct pipe_inode_info *, size_t, unsigned int);
-extern ssize_t default_file_splice_read(struct file *, loff_t *,
-		struct pipe_inode_info *, size_t, unsigned int);
 extern ssize_t iter_file_splice_write(struct pipe_inode_info *,
 		struct file *, loff_t *, size_t, unsigned int);
 extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
diff --git a/mm/shmem.c b/mm/shmem.c
index fd8b2b5..84d7077 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2310,119 +2310,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return retval ? retval : error;
 }
 
-static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
-				struct pipe_inode_info *pipe, size_t len,
-				unsigned int flags)
-{
-	struct address_space *mapping = in->f_mapping;
-	struct inode *inode = mapping->host;
-	unsigned int loff, nr_pages, req_pages;
-	struct page *pages[PIPE_DEF_BUFFERS];
-	struct partial_page partial[PIPE_DEF_BUFFERS];
-	struct page *page;
-	pgoff_t index, end_index;
-	loff_t isize, left;
-	int error, page_nr;
-	struct splice_pipe_desc spd = {
-		.pages = pages,
-		.partial = partial,
-		.nr_pages_max = PIPE_DEF_BUFFERS,
-		.flags = flags,
-		.ops = &page_cache_pipe_buf_ops,
-		.spd_release = spd_release_page,
-	};
-
-	isize = i_size_read(inode);
-	if (unlikely(*ppos >= isize))
-		return 0;
-
-	left = isize - *ppos;
-	if (unlikely(left < len))
-		len = left;
-
-	if (splice_grow_spd(pipe, &spd))
-		return -ENOMEM;
-
-	index = *ppos >> PAGE_SHIFT;
-	loff = *ppos & ~PAGE_MASK;
-	req_pages = (len + loff + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	nr_pages = min(req_pages, spd.nr_pages_max);
-
-	spd.nr_pages = find_get_pages_contig(mapping, index,
-						nr_pages, spd.pages);
-	index += spd.nr_pages;
-	error = 0;
-
-	while (spd.nr_pages < nr_pages) {
-		error = shmem_getpage(inode, index, &page, SGP_CACHE);
-		if (error)
-			break;
-		unlock_page(page);
-		spd.pages[spd.nr_pages++] = page;
-		index++;
-	}
-
-	index = *ppos >> PAGE_SHIFT;
-	nr_pages = spd.nr_pages;
-	spd.nr_pages = 0;
-
-	for (page_nr = 0; page_nr < nr_pages; page_nr++) {
-		unsigned int this_len;
-
-		if (!len)
-			break;
-
-		this_len = min_t(unsigned long, len, PAGE_SIZE - loff);
-		page = spd.pages[page_nr];
-
-		if (!PageUptodate(page) || page->mapping != mapping) {
-			error = shmem_getpage(inode, index, &page, SGP_CACHE);
-			if (error)
-				break;
-			unlock_page(page);
-			put_page(spd.pages[page_nr]);
-			spd.pages[page_nr] = page;
-		}
-
-		isize = i_size_read(inode);
-		end_index = (isize - 1) >> PAGE_SHIFT;
-		if (unlikely(!isize || index > end_index))
-			break;
-
-		if (end_index == index) {
-			unsigned int plen;
-
-			plen = ((isize - 1) & ~PAGE_MASK) + 1;
-			if (plen <= loff)
-				break;
-
-			this_len = min(this_len, plen - loff);
-			len = this_len;
-		}
-
-		spd.partial[page_nr].offset = loff;
-		spd.partial[page_nr].len = this_len;
-		len -= this_len;
-		loff = 0;
-		spd.nr_pages++;
-		index++;
-	}
-
-	while (page_nr < nr_pages)
-		put_page(spd.pages[page_nr++]);
-
-	if (spd.nr_pages)
-		error = splice_to_pipe(pipe, &spd);
-
-	splice_shrink_spd(&spd);
-
-	if (error > 0) {
-		*ppos += error;
-		file_accessed(in);
-	}
-	return error;
-}
-
 /*
  * llseek SEEK_DATA or SEEK_HOLE through the radix_tree.
  */
@@ -3785,7 +3672,7 @@ static const struct file_operations shmem_file_operations = {
 	.read_iter	= shmem_file_read_iter,
 	.write_iter	= generic_file_write_iter,
 	.fsync		= noop_fsync,
-	.splice_read	= shmem_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= shmem_fallocate,
 #endif
-- 
2.9.3



* [PATCH 12/12] switch default_file_splice_read() to use of pipe-backed iov_iter
  2016-09-23 20:36                                 ` Linus Torvalds
                                                     ` (4 preceding siblings ...)
  2016-09-24  4:01                                   ` [PATCH 11/12] switch generic_file_splice_read() to use of ->read_iter() Al Viro
@ 2016-09-24  4:02                                   ` Al Viro
  5 siblings, 0 replies; 104+ messages in thread
From: Al Viro @ 2016-09-24  4:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

we only use iov_iter_get_pages_alloc() and iov_iter_advance() -
pages are filled by kernel_readv() via a kvec array (as we used
to do all along), so iov_iter here is used only as a way of
arranging for those pages to be in pipe.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/splice.c | 111 ++++++++++++++++++++++--------------------------------------
 1 file changed, 40 insertions(+), 71 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 58c322a..0df907b 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -361,7 +361,7 @@ const struct pipe_buf_operations nosteal_pipe_buf_ops = {
 };
 EXPORT_SYMBOL(nosteal_pipe_buf_ops);
 
-static ssize_t kernel_readv(struct file *file, const struct iovec *vec,
+static ssize_t kernel_readv(struct file *file, const struct kvec *vec,
 			    unsigned long vlen, loff_t offset)
 {
 	mm_segment_t old_fs;
@@ -397,96 +397,65 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
 {
+	struct kvec *vec, __vec[PIPE_DEF_BUFFERS];
+	struct iov_iter to;
+	struct page **pages;
 	unsigned int nr_pages;
-	unsigned int nr_freed;
-	size_t offset;
-	struct page *pages[PIPE_DEF_BUFFERS];
-	struct partial_page partial[PIPE_DEF_BUFFERS];
-	struct iovec *vec, __vec[PIPE_DEF_BUFFERS];
+	size_t offset, dummy, copied = 0;
 	ssize_t res;
-	size_t this_len;
-	int error;
 	int i;
-	struct splice_pipe_desc spd = {
-		.pages = pages,
-		.partial = partial,
-		.nr_pages_max = PIPE_DEF_BUFFERS,
-		.flags = flags,
-		.ops = &default_pipe_buf_ops,
-		.spd_release = spd_release_page,
-	};
 
-	if (splice_grow_spd(pipe, &spd))
+	if (pipe->nrbufs == pipe->buffers)
+		return -EAGAIN;
+
+	/*
+	 * Try to keep page boundaries matching to source pagecache ones -
+	 * it probably won't be much help, but...
+	 */
+	offset = *ppos & ~PAGE_MASK;
+
+	iov_iter_pipe(&to, ITER_PIPE | READ, pipe, len + offset);
+
+	res = iov_iter_get_pages_alloc(&to, &pages, len + offset, &dummy);
+	if (res <= 0)
 		return -ENOMEM;
 
-	res = -ENOMEM;
+	nr_pages = res / PAGE_SIZE;
+
 	vec = __vec;
-	if (spd.nr_pages_max > PIPE_DEF_BUFFERS) {
-		vec = kmalloc(spd.nr_pages_max * sizeof(struct iovec), GFP_KERNEL);
-		if (!vec)
-			goto shrink_ret;
+	if (nr_pages > PIPE_DEF_BUFFERS) {
+		vec = kmalloc(nr_pages * sizeof(struct kvec), GFP_KERNEL);
+		if (unlikely(!vec)) {
+			res = -ENOMEM;
+			goto out;
+		}
 	}
 
-	offset = *ppos & ~PAGE_MASK;
-	nr_pages = (len + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;
-
-	for (i = 0; i < nr_pages && i < spd.nr_pages_max && len; i++) {
-		struct page *page;
-
-		page = alloc_page(GFP_USER);
-		error = -ENOMEM;
-		if (!page)
-			goto err;
+	pipe->bufs[to.idx].offset = offset;
+	pipe->bufs[to.idx].len -= offset;
 
-		this_len = min_t(size_t, len, PAGE_SIZE - offset);
-		vec[i].iov_base = (void __user *) page_address(page);
+	for (i = 0; i < nr_pages; i++) {
+		size_t this_len = min_t(size_t, len, PAGE_SIZE - offset);
+		vec[i].iov_base = page_address(pages[i]) + offset;
 		vec[i].iov_len = this_len;
-		spd.pages[i] = page;
-		spd.nr_pages++;
 		len -= this_len;
 		offset = 0;
 	}
 
-	res = kernel_readv(in, vec, spd.nr_pages, *ppos);
-	if (res < 0) {
-		error = res;
-		goto err;
-	}
-
-	error = 0;
-	if (!res)
-		goto err;
-
-	nr_freed = 0;
-	for (i = 0; i < spd.nr_pages; i++) {
-		this_len = min_t(size_t, vec[i].iov_len, res);
-		spd.partial[i].offset = 0;
-		spd.partial[i].len = this_len;
-		if (!this_len) {
-			__free_page(spd.pages[i]);
-			spd.pages[i] = NULL;
-			nr_freed++;
-		}
-		res -= this_len;
-	}
-	spd.nr_pages -= nr_freed;
-
-	res = splice_to_pipe(pipe, &spd);
-	if (res > 0)
+	res = kernel_readv(in, vec, nr_pages, *ppos);
+	if (res > 0) {
+		copied = res;
 		*ppos += res;
+	}
 
-shrink_ret:
 	if (vec != __vec)
 		kfree(vec);
-	splice_shrink_spd(&spd);
+out:
+	for (i = 0; i < nr_pages; i++)
+		put_page(pages[i]);
+	kvfree(pages);
+	iov_iter_advance(&to, copied);	/* truncates and discards */
 	return res;
-
-err:
-	for (i = 0; i < spd.nr_pages; i++)
-		__free_page(spd.pages[i]);
-
-	res = error;
-	goto shrink_ret;
 }
 
 /*
-- 
2.9.3



* Re: [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-24  3:59                                   ` Al Viro
@ 2016-09-24 17:29                                     ` Al Viro
  2016-09-27 15:38                                       ` Nicholas Piggin
  2016-09-27 15:53                                       ` Chuck Lever
  0 siblings, 2 replies; 104+ messages in thread
From: Al Viro @ 2016-09-24 17:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Sat, Sep 24, 2016 at 04:59:08AM +0100, Al Viro wrote:

> 	FWIW, updated (with fixes) and force-pushed.  Added piece:
> default_file_splice_read() converted to iov_iter.  Seems to work, after
> fixing a braino in __pipe_get_pages().  Changed: #4 (sleep only in the
> beginning, as described above), #6 (context changes from #4), #10 (missing
> get_page() added in __pipe_get_pages()), #11 (removed pointless truncation
> of len - ->read_iter() can bloody well handle that on its own) and added #12.
> Stands at 28 files changed, 657 insertions(+), 1009 deletions(-) now...

	I think I see how to get full zero-copy (including the write side
of things).  Just add a "from" side for ITER_PIPE iov_iter (advance,
get_pages, get_pages_alloc, npages and alignment will need to behave
differently for "to" and "from" ones) and pull the following trick:
have fault_in_readable return NULL instead of 0, ERR_PTR(-EFAULT) instead
of -EFAULT *and* return a struct page if it was asked for a full-page
range on a page that could be successfully stolen (only "from pipe" iov_iter
would go for the last one, of course).  Then we make generic_perform_write()
shove the return value of fault-in into 'page'.  ->write_begin() is given
&page as an argument, to return the resulting page via that.  All instances
currently just store into that pointer, completely ignoring the prior value.
And they'll keep working just fine.

	Let's make sure that all method call sites outside of
generic_perform_write() (there's only one such, actually) have NULL
stored in there prior to the call.  Now we can start switching the
instances to zero-copy support - all it takes is replacing
grab_cache_page_write_begin() with "if *page is non-NULL, try to
shove it (locked, non-uptodate) into pagecache; if that succeeds grab a
reference to our page and we are done, if it fails - fall back to
grab_cache_page_write_begin()".  Then do get_block, etc., or whatever that
->write_begin() instance would normally do, just remember not to zero anything
if the page had been passed to us by caller.

	Now all we need is to make sure that iov_iter_copy_from_user_atomic()
for those guys recognizes the case of full-page copy when source and target
are the same page and quietly returns PAGE_SIZE.  Voila - we can make
iter_file_splice_write() pass pipe-backed iov_iter instead of bvec-backed
one *and* get write-side zero-copy for all filesystems with ->write_begin()
taught to handle that (see above).  Since the filesystems with unmodified
->write_begin() will act correctly (just do the copying), we don't have
to make that a flagday change; ->write_begin() instances can be switched
one by one.  Similar treatment of iomap_write_begin()/iomap_write_actor()
would cover iomap-using ->write_iter() instances.

	It's clearly not something I want to touch until -rc1, but it looks
feasible for the next cycle, and if done right it promises to unify the
plain and splice sides of fuse_dev_...() stuff, simplifying the hell out
of them without losing zero-copy there.  And if everything really goes
right, we might be able to get rid of net/* ->splice_read() and ->sendpage()
methods as well...
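
	A userspace sketch of the ->write_begin() convention described above,
under loud assumptions: every name below (model_page, model_cache,
model_write_begin, model_copy) is hypothetical, the "pagecache" is a plain
array, and locking, uptodate state and error unwinding are all left out.
It only illustrates the "use the donated page if the slot is free, otherwise
fall back and copy" rule plus the same-page shortcut in the copy step; it is
not the kernel implementation.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096
#define MODEL_NR_SLOTS  8

/* Hypothetical stand-ins for struct page and an address_space. */
struct model_page {
	int refcount;
	char data[MODEL_PAGE_SIZE];
};

struct model_cache {
	struct model_page *slot[MODEL_NR_SLOTS];
};

/*
 * On entry *pagep is either NULL or a page donated by the caller.
 * On success *pagep is the page the caller should write into.
 * Returns 0 on success, -1 on a bad index or allocation failure.
 */
static int model_write_begin(struct model_cache *c, size_t index,
			     struct model_page **pagep)
{
	struct model_page *donated = *pagep;

	if (index >= MODEL_NR_SLOTS)
		return -1;

	if (donated && !c->slot[index]) {
		/* zero-copy path: the cache takes its own reference */
		c->slot[index] = donated;
		donated->refcount++;
		return 0;
	}

	if (!c->slot[index]) {
		/* fallback path: behave as an unmodified instance would */
		struct model_page *p = calloc(1, sizeof(*p));

		if (!p)
			return -1;
		p->refcount = 1;
		c->slot[index] = p;
	}
	*pagep = c->slot[index];
	return 0;
}

/* Full-page copy step; a no-op when source and target are the same page. */
static size_t model_copy(struct model_page *dst, const struct model_page *src)
{
	if (dst != src)
		memcpy(dst->data, src->data, MODEL_PAGE_SIZE);
	return MODEL_PAGE_SIZE;
}
```

	An instance that ignores the incoming *pagep simply always takes the
fallback path, which is why the switch would not need to be a flagday change.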

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 09/11] fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter()
  2016-09-23 19:08                           ` [PATCH 09/11] fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter() Al Viro
@ 2016-09-26  9:31                             ` Miklos Szeredi
  0 siblings, 0 replies; 104+ messages in thread
From: Miklos Szeredi @ 2016-09-26  9:31 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Fri, Sep 23, 2016 at 9:08 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> [another cleanup, will be moved out of that branch]

Picked up and pushed to fuse.git #for-next

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-24  3:59                                   ` [PATCH 04/12] " Al Viro
@ 2016-09-26 13:35                                     ` Miklos Szeredi
  2016-09-27  4:14                                       ` Al Viro
  2016-12-17 19:54                                     ` Andreas Schwab
  1 sibling, 1 reply; 104+ messages in thread
From: Miklos Szeredi @ 2016-09-26 13:35 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Sat, Sep 24, 2016 at 5:59 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> * splice_to_pipe() stops at pipe overflow and does *not* take pipe_lock
> * ->splice_read() instances do the same
> * vmsplice_to_pipe() and do_splice() (ultimate callers of splice_to_pipe())
>   arrange for waiting, looping, etc. themselves.
>
> That should make pipe_lock the outermost one.
>
> Unfortunately, existing rules for the amount passed by vmsplice_to_pipe()
> and do_splice() are quite ugly _and_ userland code can be easily broken
> by changing those.  It's not even "no more than the maximal capacity of
> this pipe" - it's "once we'd fed pipe->nr_buffers pages into the pipe,
> leave instead of waiting".
>
> Considering how poorly these rules are documented, let's try "wait for some
> space to appear, unless given SPLICE_F_NONBLOCK, then push into pipe
> and if we run into overflow, we are done".
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/fuse/dev.c |   2 -
>  fs/splice.c   | 138 +++++++++++++++++++++++++++-------------------------------
>  2 files changed, 63 insertions(+), 77 deletions(-)
>

[...]

> @@ -1546,14 +1528,20 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
>                 return -ENOMEM;
>         }
>
> -       spd.nr_pages = get_iovec_page_array(&from, spd.pages,
> -                                           spd.partial,
> -                                           spd.nr_pages_max);
> -       if (spd.nr_pages <= 0)
> -               ret = spd.nr_pages;
> -       else
> -               ret = splice_to_pipe(pipe, &spd);
> -
> +       pipe_lock(pipe);
> +       ret = wait_for_space(pipe, flags);
> +       if (!ret) {
> +               spd.nr_pages = get_iovec_page_array(&from, spd.pages,
> +                                                   spd.partial,
> +                                                   spd.nr_pages_max);
> +               if (spd.nr_pages <= 0)
> +                       ret = spd.nr_pages;
> +               else
> +                       ret = splice_to_pipe(pipe, &spd);
> +               pipe_unlock(pipe);
> +               if (ret > 0)
> +                       wakeup_pipe_readers(pipe);
> +       }

Unbalanced pipe_lock()?

Also, while it doesn't hurt, the constification of the "from" argument
of get_iovec_page_array() looks like pure noise in this patch.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 06/12] new helper: add_to_pipe()
  2016-09-24  4:00                                   ` [PATCH 06/12] new helper: add_to_pipe() Al Viro
@ 2016-09-26 13:49                                     ` Miklos Szeredi
  0 siblings, 0 replies; 104+ messages in thread
From: Miklos Szeredi @ 2016-09-26 13:49 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Sat, Sep 24, 2016 at 6:00 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> single-buffer analogue of splice_to_pipe(); vmsplice_to_pipe() switched
> to that, leaving splice_to_pipe() only for ->splice_read() instances
> (and that only until they are converted as well).
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/splice.c            | 113 ++++++++++++++++++++++++++++---------------------
>  include/linux/splice.h |   2 +
>  2 files changed, 67 insertions(+), 48 deletions(-)
>

[...]

> @@ -1523,26 +1553,13 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
>         if (ret < 0)
>                 return ret;
>
> -       if (splice_grow_spd(pipe, &spd)) {
> -               kfree(iov);
> -               return -ENOMEM;
> -       }
> -
>         pipe_lock(pipe);
>         ret = wait_for_space(pipe, flags);
> -       if (!ret) {
> -               spd.nr_pages = get_iovec_page_array(&from, spd.pages,
> -                                                   spd.partial,
> -                                                   spd.nr_pages_max);
> -               if (spd.nr_pages <= 0)
> -                       ret = spd.nr_pages;
> -               else
> -                       ret = splice_to_pipe(pipe, &spd);
> -               pipe_unlock(pipe);
> -               if (ret > 0)
> -                       wakeup_pipe_readers(pipe);
> -       }
> -       splice_shrink_spd(&spd);
> +       if (!ret)
> +               ret = iter_to_pipe(&from, pipe, buf_flag);
> +       pipe_unlock(pipe);

Ah, here it is :)

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-26 13:35                                     ` Miklos Szeredi
@ 2016-09-27  4:14                                       ` Al Viro
  0 siblings, 0 replies; 104+ messages in thread
From: Al Viro @ 2016-09-27  4:14 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Mon, Sep 26, 2016 at 03:35:12PM +0200, Miklos Szeredi wrote:
> > -       if (spd.nr_pages <= 0)
> > -               ret = spd.nr_pages;
> > -       else
> > -               ret = splice_to_pipe(pipe, &spd);
> > -
> > +       pipe_lock(pipe);
> > +       ret = wait_for_space(pipe, flags);
> > +       if (!ret) {
> > +               spd.nr_pages = get_iovec_page_array(&from, spd.pages,
> > +                                                   spd.partial,
> > +                                                   spd.nr_pages_max);
> > +               if (spd.nr_pages <= 0)
> > +                       ret = spd.nr_pages;
> > +               else
> > +                       ret = splice_to_pipe(pipe, &spd);
> > +               pipe_unlock(pipe);
		    ^^^^^^^^^^^^^^^^
> > +               if (ret > 0)
> > +                       wakeup_pipe_readers(pipe);
> > +       }
> 
> Unbalanced pipe_lock()?

Reordering braindamage; fixed.
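
The imbalance is easy to see in a userspace model.  This is only a sketch:
model_pipe, model_lock(), model_unlock() and model_wait_for_space() are
hypothetical stand-ins, and the "lock" is just a counter so that a leaked
lock is observable.  The second function has the shape that patch 06/12
ends up with (a single unlock after the conditional).

```c
#include <errno.h>

/* Hypothetical stand-in for a pipe; lock_depth counts outstanding locks. */
struct model_pipe {
	int lock_depth;
	int free_slots;
};

static void model_lock(struct model_pipe *p)   { p->lock_depth++; }
static void model_unlock(struct model_pipe *p) { p->lock_depth--; }

/* Stand-in for wait_for_space(): 0 if there is room, -EAGAIN otherwise. */
static int model_wait_for_space(const struct model_pipe *p)
{
	return p->free_slots > 0 ? 0 : -EAGAIN;
}

/* Shape of the hunk as posted: the unlock sits inside the success branch. */
static int push_as_posted(struct model_pipe *p)
{
	int ret;

	model_lock(p);
	ret = model_wait_for_space(p);
	if (!ret) {
		p->free_slots--;
		ret = 1;
		model_unlock(p);
	}
	return ret;		/* on error the lock is still held */
}

/* Balanced shape: one unlock on every path out of the function. */
static int push_balanced(struct model_pipe *p)
{
	int ret;

	model_lock(p);
	ret = model_wait_for_space(p);
	if (!ret) {
		p->free_slots--;
		ret = 1;
	}
	model_unlock(p);
	return ret;
}
```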

> Also, while it doesn't hurt, the constification of the "from" argument
> of get_iovec_page_array() looks like pure noise in this patch.

Rudiment of earlier variant, when we did a non-trivial loop in the caller.
Not needed anymore, removed.

Fixed variant force-pushed to the same branch.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-24 17:29                                     ` Al Viro
@ 2016-09-27 15:38                                       ` Nicholas Piggin
  2016-09-27 15:53                                       ` Chuck Lever
  1 sibling, 0 replies; 104+ messages in thread
From: Nicholas Piggin @ 2016-09-27 15:38 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, linux-fsdevel

On Sat, 24 Sep 2016 18:29:01 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Sat, Sep 24, 2016 at 04:59:08AM +0100, Al Viro wrote:
> 
> > 	FWIW, updated (with fixes) and force-pushed.  Added piece:
> > default_file_splice_read() converted to iov_iter.  Seems to work, after
> > fixing a braino in __pipe_get_pages().  Changed: #4 (sleep only in the
> > beginning, as described above), #6 (context changes from #4), #10 (missing
> > get_page() added in __pipe_get_pages()), #11 (removed pointless truncation
> > of len - ->read_iter() can bloody well handle that on its own) and added #12.
> > Stands at 28 files changed, 657 insertions(+), 1009 deletions(-) now...  
> 
> 	I think I see how to get full zero-copy (including the write side
> of things).  Just add a "from" side for ITER_PIPE iov_iter (advance,
> get_pages, get_pages_alloc, npages and alignment will need to behave
> differently for "to" and "from" ones) and pull the following trick:
> have fault_in_readable return NULL instead of 0, ERR_PTR(-EFAULT) instead
> of -EFAULT *and* return a struct page if it was asked for a full-page
> range on a page that could be successfully stolen (only "from pipe" iov_iter
> would go for the last one, of course).  Then we make generic_perform_write()
> shove the return value of fault-in into 'page'.  ->write_begin() is given
> &page as an argument, to return the resulting page via that.  All instances
> currently just store into that pointer, completely ignoring the prior value.
> And they'll keep working just fine.
> 
> 	Let's make sure that all method call sites outside of
> generic_perform_write() (there's only one such, actually) have NULL
> stored in there prior to the call.  Now we can start switching the
> instances to zero-copy support - all it takes is replacing
> grab_cache_page_write_begin() with "if *page is non-NULL, try to
> shove it (locked, non-uptodate) into pagecache; if that succeeds grab a
> reference to our page and we are done, if it fails - fall back to
> grab_cache_page_write_begin()".  Then do get_block, etc., or whatever that
> ->write_begin() instance would normally do, just remember not to zero anything  
> if the page had been passed to us by caller.

Interesting stuff. It should also be possible for a filesystem to replace
existing pagecache as a zero-copy overwrite with the migration APIs and
just a little bit of work.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-24 17:29                                     ` Al Viro
  2016-09-27 15:38                                       ` Nicholas Piggin
@ 2016-09-27 15:53                                       ` Chuck Lever
  1 sibling, 0 replies; 104+ messages in thread
From: Chuck Lever @ 2016-09-27 15:53 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel


> On Sep 24, 2016, at 1:29 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> On Sat, Sep 24, 2016 at 04:59:08AM +0100, Al Viro wrote:
> 
>> 	FWIW, updated (with fixes) and force-pushed.  Added piece:
>> default_file_splice_read() converted to iov_iter.  Seems to work, after
>> fixing a braino in __pipe_get_pages().  Changed: #4 (sleep only in the
>> beginning, as described above), #6 (context changes from #4), #10 (missing
>> get_page() added in __pipe_get_pages()), #11 (removed pointless truncation
>> of len - ->read_iter() can bloody well handle that on its own) and added #12.
>> Stands at 28 files changed, 657 insertions(+), 1009 deletions(-) now...
> 
> 	I think I see how to get full zero-copy (including the write side
> of things).  Just add a "from" side for ITER_PIPE iov_iter (advance,
> get_pages, get_pages_alloc, npages and alignment will need to behave
> differently for "to" and "from" ones) and pull the following trick:
> have fault_in_readable return NULL instead of 0, ERR_PTR(-EFAULT) instead
> of -EFAULT *and* return a struct page if it was asked for a full-page
> range on a page that could be successfully stolen (only "from pipe" iov_iter
> would go for the last one, of course).  Then we make generic_perform_write()
> shove the return value of fault-in into 'page'.  ->write_begin() is given
> &page as an argument, to return the resulting page via that.  All instances
> currently just store into that pointer, completely ignoring the prior value.
> And they'll keep working just fine.
> 
> 	Let's make sure that all method call sites outside of
> generic_perform_write() (there's only one such, actually) have NULL
> stored in there prior to the call.  Now we can start switching the
> instances to zero-copy support - all it takes is replacing
> grab_cache_page_write_begin() with "if *page is non-NULL, try to
> shove it (locked, non-uptodate) into pagecache; if that succeeds grab a
> reference to our page and we are done, if it fails - fall back to
> grab_cache_page_write_begin()".  Then do get_block, etc., or whatever that
> ->write_begin() instance would normally do, just remember not to zero anything
> if the page had been passed to us by caller.
> 
> 	Now all we need is to make sure that iov_iter_copy_from_user_atomic()
> for those guys recognizes the case of full-page copy when source and target
> are the same page and quietly returns PAGE_SIZE.  Voila - we can make
> iter_file_splice_write() pass pipe-backed iov_iter instead of bvec-backed
> one *and* get write-side zero-copy for all filesystems with ->write_begin()
> taught to handle that (see above).  Since the filesystems with unmodified
> ->write_begin() will act correctly (just do the copying), we don't have
> to make that a flagday change; ->write_begin() instances can be switched
> one by one.  Similar treatment of iomap_write_begin()/iomap_write_actor()
> would cover iomap-using ->write_iter() instances.
> 
> 	It's clearly not something I want to touch until -rc1, but it looks
> feasible for the next cycle, and if done right it promises to unify the
> plain and splice sides of fuse_dev_...() stuff, simplifying the hell out
> of them without losing zero-copy there.  And if everything really goes
> right, we might be able to get rid of net/* ->splice_read() and ->sendpage()
> methods as well...

Kernel NFS server already uses splice for its read path, but the
write path appears to require a full data copy of incoming payloads.
Would be awesome to see write-side support for zero-copy.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 10/12] new iov_iter flavour: pipe-backed
  2016-09-24  4:01                                   ` [PATCH 10/12] new iov_iter flavour: pipe-backed Al Viro
@ 2016-09-29 20:53                                     ` Miklos Szeredi
  2016-09-29 22:50                                       ` Al Viro
  0 siblings, 1 reply; 104+ messages in thread
From: Miklos Szeredi @ 2016-09-29 20:53 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Sat, Sep 24, 2016 at 6:01 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> iov_iter variant for passing data into pipe.  copy_to_iter()
> copies data into page(s) it has allocated and stuffs them into
> the pipe; copy_page_to_iter() stuffs there a reference to the
> page given to it.  Both will try to coalesce if possible.
> iov_iter_zero() is similar to copy_to_iter(); iov_iter_get_pages()
> and friends will do as copy_to_iter() would have and return the
> pages where the data would've been copied.  iov_iter_advance()
> will truncate everything past the spot it has advanced to.
>
> New primitive: iov_iter_pipe(), used for initializing those.
> pipe should be locked all along.
>
> Running out of space acts as fault would for iovec-backed ones;
> in other words, giving it to ->read_iter() may result in short
> read if the pipe overflows, or -EFAULT if it happens with nothing
> copied there.

This is the hardest part of the whole set.  I've been trying to
understand it, but the modular arithmetic makes it really tricky to
read.  Couldn't we have more small inline helpers like next_idx()?
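
To make the suggestion concrete, here is a sketch of the kind of helpers
meant.  Everything in it is hypothetical: struct ring is a cut-down stand-in
for the pipe with the same three fields (curbuf, nrbufs, buffers), the helper
names are invented, and it assumes the number of slots is a power of two so
that masking with (buffers - 1) is a valid wrap.

```c
/* Hypothetical, stripped-down stand-in for the pipe's ring of buffers. */
struct ring {
	unsigned int curbuf;	/* index of the first occupied slot */
	unsigned int nrbufs;	/* number of occupied slots */
	unsigned int buffers;	/* total slots; assumed to be a power of two */
};

/* Slot following idx, wrapping at the end of the ring. */
static inline unsigned int ring_next(const struct ring *r, unsigned int idx)
{
	return (idx + 1) & (r->buffers - 1);
}

/* First slot past the occupied ones; equals curbuf when the ring is full. */
static inline unsigned int ring_first_free(const struct ring *r)
{
	return (r->curbuf + r->nrbufs) & (r->buffers - 1);
}

static inline int ring_full(const struct ring *r)
{
	return r->nrbufs == r->buffers;
}

/* Number of slots that can still be filled. */
static inline unsigned int ring_space(const struct ring *r)
{
	return r->buffers - r->nrbufs;
}
```

With names like these, a condition such as "idx == pipe->curbuf &&
pipe->nrbufs" could be spelled as a fullness test rather than as index
arithmetic.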

Specific comments inline.

[...]

> +static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
> +                        struct iov_iter *i)
> +{
> +       struct pipe_inode_info *pipe = i->pipe;
> +       struct pipe_buffer *buf;
> +       size_t off;
> +       int idx;
> +
> +       if (unlikely(bytes > i->count))
> +               bytes = i->count;
> +
> +       if (unlikely(!bytes))
> +               return 0;
> +
> +       if (!sanity(i))
> +               return 0;
> +
> +       off = i->iov_offset;
> +       idx = i->idx;
> +       buf = &pipe->bufs[idx];
> +       if (off) {
> +               if (offset == off && buf->page == page) {
> +                       /* merge with the last one */
> +                       buf->len += bytes;
> +                       i->iov_offset += bytes;
> +                       goto out;
> +               }
> +               idx = next_idx(idx, pipe);
> +               buf = &pipe->bufs[idx];
> +       }
> +       if (idx == pipe->curbuf && pipe->nrbufs)
> +               return 0;

The EFAULT logic seems to be missing across the board.  And callers
don't expect a zero return value.  Most will loop indefinitely.

[...]

> +static size_t push_pipe(struct iov_iter *i, size_t size,
> +                       int *idxp, size_t *offp)
> +{
> +       struct pipe_inode_info *pipe = i->pipe;
> +       size_t off;
> +       int idx;
> +       ssize_t left;
> +
> +       if (unlikely(size > i->count))
> +               size = i->count;
> +       if (unlikely(!size))
> +               return 0;
> +
> +       left = size;
> +       data_start(i, &idx, &off);
> +       *idxp = idx;
> +       *offp = off;
> +       if (off) {
> +               left -= PAGE_SIZE - off;
> +               if (left <= 0) {
> +                       pipe->bufs[idx].len += size;
> +                       return size;
> +               }
> +               pipe->bufs[idx].len = PAGE_SIZE;
> +               idx = next_idx(idx, pipe);
> +       }
> +       while (idx != pipe->curbuf || !pipe->nrbufs) {
> +               struct page *page = alloc_page(GFP_USER);
> +               if (!page)
> +                       break;

Again, unexpected zero return if this is the first page.  Should
return -ENOMEM?  Some callers only expect -EFAULT, though.

[...]

> +static void pipe_advance(struct iov_iter *i, size_t size)
> +{
> +       struct pipe_inode_info *pipe = i->pipe;
> +       struct pipe_buffer *buf;
> +       size_t off;
> +       int idx;
> +
> +       if (unlikely(i->count < size))
> +               size = i->count;
> +
> +       idx = i->idx;
> +       off = i->iov_offset;
> +       if (size || off) {
> +               /* take it relative to the beginning of buffer */
> +               size += off - pipe->bufs[idx].offset;
> +               while (1) {
> +                       buf = &pipe->bufs[idx];
> +                       if (size > buf->len) {
> +                               size -= buf->len;
> +                               idx = next_idx(idx, pipe);
> +                               off = 0;

off is unused and reassigned before breaking out of the loop.

[...]

> @@ -732,7 +1101,20 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
>         if (!size)
>                 return 0;
>
> -       iterate_all_kinds(i, size, v, ({
> +       if (unlikely(i->type & ITER_PIPE)) {
> +               struct pipe_inode_info *pipe = i->pipe;
> +               size_t off;
> +               int idx;
> +
> +               if (!sanity(i))
> +                       return 0;
> +
> +               data_start(i, &idx, &off);
> +               /* some of this one + all after this one */
> +               npages = ((pipe->curbuf - idx - 1) & (pipe->buffers - 1)) + 1;

It's supposed to take i->count into account, no?  And that calculation
will result in really funny things if the pipe is full.  And we can't
return -EFAULT here, since that's not expected by callers...
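
A small sketch of the arithmetic, to show what goes wrong.  The first
function is the expression from the hunk lifted out unchanged; the second is
an assumption about where idx points (the first free slot, with a zero
offset into it), not something taken from the patch.  The number of slots is
assumed to be a power of two.

```c
/* The expression from the quoted hunk, as a pure function. */
static inline unsigned int npages_estimate(unsigned int curbuf,
					   unsigned int idx,
					   unsigned int buffers)
{
	return ((curbuf - idx - 1) & (buffers - 1)) + 1;
}

/* Assumption: idx is the first slot past the occupied ones. */
static inline unsigned int first_free(unsigned int curbuf,
				      unsigned int nrbufs,
				      unsigned int buffers)
{
	return (curbuf + nrbufs) & (buffers - 1);
}
```

Under those assumptions, with 16 slots and curbuf == 3: 4 occupied slots
give idx == 7 and an estimate of 12, which is right.  An empty pipe gives
idx == 3 and 16, also right.  A full pipe gives idx == 3 again, so the
estimate is 16 when the correct answer is 0; the expression by itself cannot
tell full from empty.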

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 10/12] new iov_iter flavour: pipe-backed
  2016-09-29 20:53                                     ` Miklos Szeredi
@ 2016-09-29 22:50                                       ` Al Viro
  2016-09-30  7:30                                         ` Miklos Szeredi
  0 siblings, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-09-29 22:50 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Thu, Sep 29, 2016 at 10:53:55PM +0200, Miklos Szeredi wrote:

> The EFAULT logic seems to be missing across the board.  And callers
> don't expect a zero return value.  Most will loop indefinitely.

Nope.  copy_page_to_iter() *never* returns -EFAULT.  Including the iovec
one - check copy_page_to_iter_iovec().  Any caller that does not expect
a zero return value from that primitive is a bug, triggerable as soon as
you feed it an iovec with NULL ->iov_base.
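
A sketch of what "expecting a zero return" means for a caller, in userspace
terms.  The names (struct sink, sink_copy(), copy_all()) are hypothetical;
sink_copy() only mimics the contract described above: it returns the number
of bytes copied, possibly fewer than asked for and possibly zero, and never
a negative error.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical destination with a limited amount of room left. */
struct sink {
	char *buf;
	size_t room;
};

/* Copies at most 'bytes'; returns the amount copied, 0 if there is no room. */
static size_t sink_copy(struct sink *s, const char *src, size_t bytes)
{
	size_t n = bytes < s->room ? bytes : s->room;

	memcpy(s->buf, src, n);
	s->buf += n;
	s->room -= n;
	return n;
}

/* Caller loop that ends on a short or zero copy instead of retrying. */
static size_t copy_all(struct sink *s, const char *src, size_t len,
		       size_t chunk)
{
	size_t done = 0;

	while (done < len) {
		size_t want = len - done < chunk ? len - done : chunk;
		size_t n = sink_copy(s, src + done, want);

		done += n;
		if (n < want)	/* short copy, including n == 0: give up */
			break;
	}
	return done;
}
```

A loop that only tests "done < len" and never looks at n would spin forever
once the sink is out of room, which is the failure mode under discussion.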

> Again, unexpected zero return if this is the first page.  Should
> return -ENOMEM?  Some callers only expect -EFAULT, though.

For copy_to_iter() and zero_iter() it's definitely "return zero".  For
get_pages...  Hell knows; those probably ought to return -EFAULT, but
I'll need to look some more at the callers.  It should end up triggering
a short read as the end result (or, as usual, EFAULT on zero-length read).

> > +               /* take it relative to the beginning of buffer */
> > +               size += off - pipe->bufs[idx].offset;
> > +               while (1) {
> > +                       buf = &pipe->bufs[idx];
> > +                       if (size > buf->len) {
> > +                               size -= buf->len;
> > +                               idx = next_idx(idx, pipe);
> > +                               off = 0;
> 
> off is unused and reassigned before breaking out of the loop.

True.

> [...]
> 
> > +       if (unlikely(i->type & ITER_PIPE)) {
> > +               struct pipe_inode_info *pipe = i->pipe;
> > +               size_t off;
> > +               int idx;
> > +
> > +               if (!sanity(i))
> > +                       return 0;
> > +
> > +               data_start(i, &idx, &off);
> > +               /* some of this one + all after this one */
> > +               npages = ((pipe->curbuf - idx - 1) & (pipe->buffers - 1)) + 1;
> 
> It's supposed to take i->count into account, no?  And that calculation
> will result in really funny things if the pipe is full.  And we can't
> return -EFAULT here, since that's not expected by callers...

It should look at i->count, in principle.  OTOH, overestimating the amount
is not really a problem for possible users of such iov_iter.  I'll look
into that.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 10/12] new iov_iter flavour: pipe-backed
  2016-09-29 22:50                                       ` Al Viro
@ 2016-09-30  7:30                                         ` Miklos Szeredi
  2016-10-03  3:34                                           ` [RFC] O_DIRECT vs EFAULT (was Re: [PATCH 10/12] new iov_iter flavour: pipe-backed) Al Viro
  0 siblings, 1 reply; 104+ messages in thread
From: Miklos Szeredi @ 2016-09-30  7:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Fri, Sep 30, 2016 at 12:50 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Sep 29, 2016 at 10:53:55PM +0200, Miklos Szeredi wrote:
>
>> The EFAULT logic seems to be missing across the board.  And callers
>> don't expect a zero return value.  Most will loop indefinitely.
>
> Nope.  copy_page_to_iter() *never* returns -EFAULT.  Including the iovec
> one - check copy_page_to_iter_iovec().  Any caller that does not expect
> a zero return value from that primitive is a bug, triggerable as soon as
> you feed it an iovec with NULL ->iov_base.

Right.

I was actually looking at iov_iter_get_pages() callers...

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-09-23 19:00                         ` [RFC][CFT] splice_read reworked Al Viro
                                             ` (10 preceding siblings ...)
  2016-09-23 19:10                           ` [PATCH 11/11] switch generic_file_splice_read() to use of ->read_iter() Al Viro
@ 2016-09-30 13:32                           ` CAI Qian
  2016-09-30 17:42                             ` CAI Qian
  11 siblings, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-09-30 13:32 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> From: "Al Viro" <viro@ZenIV.linux.org.uk>
> To: "Linus Torvalds" <torvalds@linux-foundation.org>
> Cc: "Dave Chinner" <david@fromorbit.com>, "CAI Qian" <caiqian@redhat.com>, "linux-xfs" <linux-xfs@vger.kernel.org>,
> xfs@oss.sgi.com, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>, linux-fsdevel@vger.kernel.org
> Sent: Friday, September 23, 2016 3:00:32 PM
> Subject: [RFC][CFT] splice_read reworked
> 
> The series is supposed to solve the locking order problems for
> ->splice_read() and get rid of code duplication between the read-side
> methods.
> 	pipe_lock is lifted out of ->splice_read() instances, along with
> waiting for empty space in pipe, etc. - we do that stuff in callers.
> 	A new variant of iov_iter is introduced - it's backed by a pipe,
> copy_to_iter() results in allocating pages and copying into those,
> copy_page_to_iter() just sticks a reference to that page into pipe.
> Running out of space in pipe yields a short read, as a fault in iovec-backed
> iov_iter would have.  Enough primitives are implemented for normal
> ->read_iter() instances to work.
> 	generic_file_splice_read() switched to feeding such iov_iter to
> ->read_iter() instance.  That turns out to be enough to kill almost all
> ->splice_read() instances; the only ones _not_ using
> generic_file_splice_read()
> or default_file_splice_read() (== no zero-copy fallback) are
> fuse_dev_splice_read(), 3 instances in kernel/{relay.c,trace/trace.c} and
> sock_splice_read().  It's almost certainly possible to convert fuse one
> and the same might be possible to do to socket one.  relay and tracing
> stuff is just plain weird; might or might not be doable.
> 
> 	Something hopefully working is in
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.splice_read
Tested-by: CAI Qian <caiqian@redhat.com>

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-09-30 13:32                           ` [RFC][CFT] splice_read reworked CAI Qian
@ 2016-09-30 17:42                             ` CAI Qian
  2016-09-30 18:33                               ` CAI Qian
  2016-10-03  1:42                               ` [RFC][CFT] splice_read reworked Al Viro
  0 siblings, 2 replies; 104+ messages in thread
From: CAI Qian @ 2016-09-30 17:42 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> From: "CAI Qian" <caiqian@redhat.com>
> To: "Al Viro" <viro@ZenIV.linux.org.uk>
> Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, xfs@oss.sgi.com, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Friday, September 30, 2016 9:32:53 AM
> Subject: Re: [RFC][CFT] splice_read reworked
> 
> 
> 
> ----- Original Message -----
> > From: "Al Viro" <viro@ZenIV.linux.org.uk>
> > To: "Linus Torvalds" <torvalds@linux-foundation.org>
> > Cc: "Dave Chinner" <david@fromorbit.com>, "CAI Qian" <caiqian@redhat.com>,
> > "linux-xfs" <linux-xfs@vger.kernel.org>,
> > xfs@oss.sgi.com, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin"
> > <npiggin@gmail.com>, linux-fsdevel@vger.kernel.org
> > Sent: Friday, September 23, 2016 3:00:32 PM
> > Subject: [RFC][CFT] splice_read reworked
> > 
> > The series is supposed to solve the locking order problems for
> > ->splice_read() and get rid of code duplication between the read-side
> > methods.
> > 	pipe_lock is lifted out of ->splice_read() instances, along with
> > waiting for empty space in pipe, etc. - we do that stuff in callers.
> > 	A new variant of iov_iter is introduced - it's backed by a pipe,
> > copy_to_iter() results in allocating pages and copying into those,
> > copy_page_to_iter() just sticks a reference to that page into pipe.
> > Running out of space in pipe yields a short read, as a fault in
> > iovec-backed
> > iov_iter would have.  Enough primitives are implemented for normal
> > ->read_iter() instances to work.
> > 	generic_file_splice_read() switched to feeding such iov_iter to
> > ->read_iter() instance.  That turns out to be enough to kill almost all
> > ->splice_read() instances; the only ones _not_ using
> > generic_file_splice_read()
> > or default_file_splice_read() (== no zero-copy fallback) are
> > fuse_dev_splice_read(), 3 instances in kernel/{relay.c,trace/trace.c} and
> > sock_splice_read().  It's almost certainly possible to convert fuse one
> > and the same might be possible to do to socket one.  relay and tracing
> > stuff is just plain weird; might or might not be doable.
> > 
> > 	Something hopefully working is in
> > git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git
> > #work.splice_read
> Tested-by: CAI Qian <caiqian@redhat.com>

Except...

One warning just popped up while running trinity.

[ 1599.151286] ------------[ cut here ]------------
[ 1599.156457] WARNING: CPU: 37 PID: 95143 at lib/iov_iter.c:316 sanity+0x75/0x80
[ 1599.164818] Modules linked in: af_key ieee802154_socket ieee802154 vmw_vsock_vmci_transport vsock vmw_vmci hidp cmtp kernelcapi bnep rfcomm bluetooth rfkill can_bcm can_raw can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr mei_me sg i2c_i801 mei shpchp lpc_ich i2c_smbus ipmi_ssif wmi ipmi_si ipmi_msghandler acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm ahci mdio libahci ptp libata i2c_core pps_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[ 1599.278669] CPU: 50 PID: 95143 Comm: trinity-c142 Not tainted 4.8.0-rc8-usrns-scale+ #8
[ 1599.287604] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 1599.298962]  0000000000000286 000000007794c41e ffff8803c6c7fbb0 ffffffff813d5e93
[ 1599.307259]  0000000000000000 0000000000000000 ffff8803c6c7fbf0 ffffffff8109c87b
[ 1599.315553]  0000013c00000000 0000000000000efe ffffea001de95240 ffff8802e1aca600
[ 1599.323847] Call Trace:
[ 1599.326580]  [<ffffffff813d5e93>] dump_stack+0x85/0xc2
[ 1599.332315]  [<ffffffff8109c87b>] __warn+0xcb/0xf0
[ 1599.337660]  [<ffffffff8109c9ad>] warn_slowpath_null+0x1d/0x20
[ 1599.344171]  [<ffffffff813e9b45>] sanity+0x75/0x80
[ 1599.349518]  [<ffffffff813ec739>] copy_page_to_iter+0xf9/0x1e0
[ 1599.356027]  [<ffffffff8120691f>] shmem_file_read_iter+0x9f/0x340
[ 1599.362829]  [<ffffffff812bbeb9>] generic_file_splice_read+0xb9/0x1b0
[ 1599.370015]  [<ffffffff812bc206>] do_splice_to+0x76/0x90
[ 1599.375941]  [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
[ 1599.382935]  [<ffffffff812bba80>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 1599.390220]  [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
[ 1599.396537]  [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
[ 1599.402563]  [<ffffffff81282973>] SyS_sendfile64+0x73/0xd0
[ 1599.408685]  [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 1599.414820]  [<ffffffff817d927f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 1599.422087] ---[ end trace a3fb2953df356f80 ]---

    CAI Qian


* Re: [RFC][CFT] splice_read reworked
  2016-09-30 17:42                             ` CAI Qian
@ 2016-09-30 18:33                               ` CAI Qian
  2016-10-03  1:37                                 ` Al Viro
  2016-10-03  1:42                               ` [RFC][CFT] splice_read reworked Al Viro
  1 sibling, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-09-30 18:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> One warning just popped up while running trinity.
Another run triggered a lockdep warning with splice in the trace:

[ 4787.875980] 
[ 4787.877645] ======================================================
[ 4787.884540] [ INFO: possible circular locking dependency detected ]
[ 4787.891533] 4.8.0-rc8-usrns-scale+ #8 Tainted: G        W      
[ 4787.898138] -------------------------------------------------------
[ 4787.905130] trinity-c116/106905 is trying to acquire lock:
[ 4787.911251]  (&p->lock){+.+.+.}, at: [<ffffffff812aca8c>] seq_read+0x4c/0x3e0
[ 4787.919264] 
[ 4787.919264] but task is already holding lock:
[ 4787.925773]  (sb_writers#8){.+.+.+}, at: [<ffffffff81284367>] __sb_start_write+0xb7/0xf0
[ 4787.934854] 
[ 4787.934854] which lock already depends on the new lock.
[ 4787.934854] 
[ 4787.943981] 
[ 4787.943981] the existing dependency chain (in reverse order) is:
[ 4787.952333] 
-> #3 (sb_writers#8){.+.+.+}:
[ 4787.957050]        [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4787.963960]        [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4787.970577]        [<ffffffff810f769a>] percpu_down_read+0x4a/0xa0
[ 4787.977487]        [<ffffffff81284367>] __sb_start_write+0xb7/0xf0
[ 4787.984395]        [<ffffffff812a8974>] mnt_want_write+0x24/0x50
[ 4787.991110]        [<ffffffffa05049af>] ovl_want_write+0x1f/0x30 [overlay]
[ 4787.998799]        [<ffffffffa05070c2>] ovl_do_remove+0x42/0x4a0 [overlay]
[ 4788.006483]        [<ffffffffa0507536>] ovl_rmdir+0x16/0x20 [overlay]
[ 4788.013682]        [<ffffffff8128d357>] vfs_rmdir+0xb7/0x130
[ 4788.020009]        [<ffffffff81292ed3>] do_rmdir+0x183/0x1f0
[ 4788.026335]        [<ffffffff81293cf2>] SyS_unlinkat+0x22/0x30
[ 4788.032853]        [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.039576]        [<ffffffff817d927f>] return_from_SYSCALL_64+0x0/0x7a
[ 4788.046962] 
-> #2 (&sb->s_type->i_mutex_key#16){++++++}:
[ 4788.053140]        [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4788.060049]        [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4788.066664]        [<ffffffff817d60e7>] down_read+0x47/0x70
[ 4788.072893]        [<ffffffff8128ce79>] lookup_slow+0xc9/0x200
[ 4788.079410]        [<ffffffff81290b9c>] walk_component+0x1ec/0x310
[ 4788.086315]        [<ffffffff81290e5f>] link_path_walk+0x19f/0x5f0
[ 4788.093219]        [<ffffffff8129151d>] path_openat+0xdd/0xb80
[ 4788.099748]        [<ffffffff81293511>] do_filp_open+0x91/0x100
[ 4788.106362]        [<ffffffff81286f56>] do_open_execat+0x76/0x180
[ 4788.113186]        [<ffffffff8128747b>] open_exec+0x2b/0x50
[ 4788.119404]        [<ffffffff812ec61d>] load_elf_binary+0x28d/0x1120
[ 4788.126511]        [<ffffffff81288487>] search_binary_handler+0x97/0x1c0
[ 4788.134002]        [<ffffffff81289619>] do_execveat_common.isra.36+0x6a9/0x9f0
[ 4788.142071]        [<ffffffff81289c4a>] SyS_execve+0x3a/0x50
[ 4788.148398]        [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.155110]        [<ffffffff817d927f>] return_from_SYSCALL_64+0x0/0x7a
[ 4788.162502] 
-> #1 (&sig->cred_guard_mutex){+.+.+.}:
[ 4788.168179]        [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4788.175085]        [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4788.181712]        [<ffffffff817d4557>] mutex_lock_killable_nested+0x87/0x500
[ 4788.189695]        [<ffffffff81099599>] mm_access+0x29/0xa0
[ 4788.195924]        [<ffffffff81302b6c>] proc_pid_auxv+0x1c/0x70
[ 4788.202540]        [<ffffffff813039d0>] proc_single_show+0x50/0x90
[ 4788.209445]        [<ffffffff812acb48>] seq_read+0x108/0x3e0
[ 4788.215774]        [<ffffffff8127fb07>] __vfs_read+0x37/0x150
[ 4788.222198]        [<ffffffff81280d35>] vfs_read+0x95/0x140
[ 4788.228425]        [<ffffffff81282268>] SyS_read+0x58/0xc0
[ 4788.234557]        [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.241268]        [<ffffffff817d927f>] return_from_SYSCALL_64+0x0/0x7a
[ 4788.248660] 
-> #0 (&p->lock){+.+.+.}:
[ 4788.252987]        [<ffffffff810fc062>] validate_chain.isra.37+0xe72/0x1150
[ 4788.260769]        [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4788.267676]        [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4788.274302]        [<ffffffff817d3807>] mutex_lock_nested+0x77/0x430
[ 4788.281406]        [<ffffffff812aca8c>] seq_read+0x4c/0x3e0
[ 4788.287633]        [<ffffffff81316b39>] kernfs_fop_read+0x129/0x1b0
[ 4788.294659]        [<ffffffff8127fca3>] do_loop_readv_writev+0x83/0xc0
[ 4788.301954]        [<ffffffff812811a8>] do_readv_writev+0x218/0x240
[ 4788.308959]        [<ffffffff81281209>] vfs_readv+0x39/0x50
[ 4788.315188]        [<ffffffff812bc6b1>] default_file_splice_read+0x1a1/0x2b0
[ 4788.323070]        [<ffffffff812bc206>] do_splice_to+0x76/0x90
[ 4788.329587]        [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
[ 4788.337173]        [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
[ 4788.344078]        [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
[ 4788.350694]        [<ffffffff812829c9>] SyS_sendfile64+0xc9/0xd0
[ 4788.357405]        [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.364119]        [<ffffffff817d927f>] return_from_SYSCALL_64+0x0/0x7a
[ 4788.371511] 
[ 4788.371511] other info that might help us debug this:
[ 4788.371511] 
[ 4788.380443] Chain exists of:
  &p->lock --> &sb->s_type->i_mutex_key#16 --> sb_writers#8

[ 4788.389881]  Possible unsafe locking scenario:
[ 4788.389881] 
[ 4788.396497]        CPU0                    CPU1
[ 4788.401549]        ----                    ----
[ 4788.406614]   lock(sb_writers#8);
[ 4788.410352]                                lock(&sb->s_type->i_mutex_key#16);
[ 4788.418354]                                lock(sb_writers#8);
[ 4788.424902]   lock(&p->lock);
[ 4788.428229] 
[ 4788.428229]  *** DEADLOCK ***
[ 4788.428229] 
[ 4788.434836] 1 lock held by trinity-c116/106905:
[ 4788.439888]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81284367>] __sb_start_write+0xb7/0xf0
[ 4788.449473] 
[ 4788.449473] stack backtrace:
[ 4788.454334] CPU: 16 PID: 106905 Comm: trinity-c116 Tainted: G        W       4.8.0-rc8-usrns-scale+ #8
[ 4788.464719] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 4788.476076]  0000000000000086 00000000cbfc6314 ffff8803ce78b760 ffffffff813d5e93
[ 4788.484371]  ffffffff82a3fbd0 ffffffff82a94890 ffff8803ce78b7a0 ffffffff810fa6ec
[ 4788.492663]  ffff8803ce78b7e0 ffff8802ead08000 0000000000000001 ffff8802ead08ca0
[ 4788.500966] Call Trace:
[ 4788.503694]  [<ffffffff813d5e93>] dump_stack+0x85/0xc2
[ 4788.509426]  [<ffffffff810fa6ec>] print_circular_bug+0x1ec/0x260
[ 4788.516128]  [<ffffffff810fc062>] validate_chain.isra.37+0xe72/0x1150
[ 4788.523319]  [<ffffffff811d4491>] ? ___perf_sw_event+0x171/0x290
[ 4788.530022]  [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4788.536335]  [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4788.542359]  [<ffffffff812aca8c>] ? seq_read+0x4c/0x3e0
[ 4788.548188]  [<ffffffff812aca8c>] ? seq_read+0x4c/0x3e0
[ 4788.554019]  [<ffffffff817d3807>] mutex_lock_nested+0x77/0x430
[ 4788.560528]  [<ffffffff812aca8c>] ? seq_read+0x4c/0x3e0
[ 4788.566358]  [<ffffffff812aca8c>] seq_read+0x4c/0x3e0
[ 4788.571995]  [<ffffffff81316a10>] ? kernfs_fop_open+0x3a0/0x3a0
[ 4788.578600]  [<ffffffff81316b39>] kernfs_fop_read+0x129/0x1b0
[ 4788.585012]  [<ffffffff81316a10>] ? kernfs_fop_open+0x3a0/0x3a0
[ 4788.591617]  [<ffffffff8127fca3>] do_loop_readv_writev+0x83/0xc0
[ 4788.598318]  [<ffffffff81316a10>] ? kernfs_fop_open+0x3a0/0x3a0
[ 4788.604924]  [<ffffffff812811a8>] do_readv_writev+0x218/0x240
[ 4788.611347]  [<ffffffff813e9535>] ? push_pipe+0xd5/0x190
[ 4788.617278]  [<ffffffff813ecec0>] ? iov_iter_get_pages_alloc+0x250/0x400
[ 4788.624746]  [<ffffffff81281209>] vfs_readv+0x39/0x50
[ 4788.630381]  [<ffffffff812bc6b1>] default_file_splice_read+0x1a1/0x2b0
[ 4788.637668]  [<ffffffff8134ae20>] ? security_file_permission+0xa0/0xc0
[ 4788.644954]  [<ffffffff812bc206>] do_splice_to+0x76/0x90
[ 4788.650880]  [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
[ 4788.657872]  [<ffffffff812bba80>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 4788.665157]  [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
[ 4788.671472]  [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
[ 4788.677499]  [<ffffffff812829c9>] SyS_sendfile64+0xc9/0xd0
[ 4788.683622]  [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.689744]  [<ffffffff817d927f>] entry_SYSCALL64_slow_path+0x25/0x25


* Re: [RFC][CFT] splice_read reworked
  2016-09-30 18:33                               ` CAI Qian
@ 2016-10-03  1:37                                 ` Al Viro
  2016-10-03 17:49                                   ` CAI Qian
  0 siblings, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-10-03  1:37 UTC (permalink / raw)
  To: CAI Qian
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Fri, Sep 30, 2016 at 02:33:23PM -0400, CAI Qian wrote:

OK, the immediate trigger is
	* sendfile() from something that uses seq_read to a regular file.
Does sb_start_write() around the call of do_splice_direct() (as always),
which ends up calling default_file_splice_read() (again, as usual), which
ends up calling ->read() of the source, i.e. seq_read().  No changes there.
 
	* sb_start_write() can be called under ->i_mutex.  The latter is
on overlayfs inode, the former is done to upper layer in that overlayfs.
Nothing new, again.

	* ->i_mutex can be taken under ->cred_guard_mutex.  Yes, it can -
in open_exec().  Again, no changes.

	* ->cred_guard_mutex can be taken in ->show() of a seq_file,
namely /proc/*/auxv...  Argh, ->cred_guard_mutex whack-a-mole strikes
again...
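
For illustration only: the four dependencies above close a cycle, which is
what lockdep is complaining about.  A toy user-space model of that check
might look like the following.  The lock names come from the report; the
enum, the helper names and the adjacency-matrix representation are made up
for this sketch and are not lockdep's real data structures.

```c
#include <stdbool.h>

/* Lock classes named in the report (hypothetical enum for this sketch). */
enum lock_class { P_LOCK, CRED_GUARD_MUTEX, I_MUTEX, SB_WRITERS, NR_LOCKS };

/* dep[a][b] is true if some code path takes b while already holding a. */
static bool dep[NR_LOCKS][NR_LOCKS];

static void add_dependency(enum lock_class held, enum lock_class taken)
{
	dep[held][taken] = true;
}

/* Depth-first search: is 'target' reachable from 'from' via dependencies? */
static bool reaches(enum lock_class from, enum lock_class target, bool *visited)
{
	if (from == target)
		return true;
	visited[from] = true;
	for (int next = 0; next < NR_LOCKS; next++)
		if (dep[from][next] && !visited[next] &&
		    reaches(next, target, visited))
			return true;
	return false;
}

/*
 * Taking 'taken' while holding 'held' closes a cycle if 'held' is already
 * reachable from 'taken' through previously recorded dependencies.
 */
static bool would_deadlock(enum lock_class held, enum lock_class taken)
{
	bool visited[NR_LOCKS] = { false };

	return reaches(taken, held, visited);
}

/* The three pre-existing dependencies (#1, #2 and #3 in the report). */
static void record_known_chain(void)
{
	add_dependency(P_LOCK, CRED_GUARD_MUTEX);	/* seq_read -> auxv ->show() */
	add_dependency(CRED_GUARD_MUTEX, I_MUTEX);	/* execve -> open_exec() */
	add_dependency(I_MUTEX, SB_WRITERS);		/* overlayfs rmdir */
}
```

With the three older dependencies recorded, would_deadlock(SB_WRITERS, P_LOCK)
reports the cycle that sendfile() from a seq_file closes, while the reverse
query, would_deadlock(P_LOCK, SB_WRITERS), does not.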

OK, I think essentially the same warning had been triggerable since _way_
back.  All changes around splice have no effect on it.

Look: to get a deadlock we need
	(1) sendfile from /proc/<pid>/auxv to a regular file on upper layer of
overlayfs requesting not to freeze the target.
	(2) attempt to freeze it blocking until (1) is done.
	(3) directory modification on overlayfs trying to request not to freeze
the upper layer and blocking until (2) is done.
	(4) execve() in <pid> holding ->cred_guard_mutex, trying to open
something in overlayfs and getting blocked on directory lock, held by (3).

Now (1) gets around to reading from /proc/<pid>/auxv, which blocks on
->cred_guard_mutex.  The mention of seq_read itself holding locks is irrelevant;
what matters is that ->read() grabs ->cred_guard_mutex.

We used to have similar problems in /proc/*/environ and /proc/*/mem; looks
like /proc/*/environ needs to get the treatment similar to e268337dfe26 and
b409e578d9a4.


* Re: [RFC][CFT] splice_read reworked
  2016-09-30 17:42                             ` CAI Qian
  2016-09-30 18:33                               ` CAI Qian
@ 2016-10-03  1:42                               ` Al Viro
  2016-10-03 14:06                                 ` CAI Qian
  1 sibling, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-10-03  1:42 UTC (permalink / raw)
  To: CAI Qian
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Fri, Sep 30, 2016 at 01:42:17PM -0400, CAI Qian wrote:

> [ 1599.151286] ------------[ cut here ]------------
> [ 1599.156457] WARNING: CPU: 37 PID: 95143 at lib/iov_iter.c:316 sanity+0x75/0x80

[snip]

> [ 1599.344171]  [<ffffffff813e9b45>] sanity+0x75/0x80
> [ 1599.349518]  [<ffffffff813ec739>] copy_page_to_iter+0xf9/0x1e0
> [ 1599.356027]  [<ffffffff8120691f>] shmem_file_read_iter+0x9f/0x340
> [ 1599.362829]  [<ffffffff812bbeb9>] generic_file_splice_read+0xb9/0x1b0
> [ 1599.370015]  [<ffffffff812bc206>] do_splice_to+0x76/0x90
> [ 1599.375941]  [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
> [ 1599.382935]  [<ffffffff812bba80>] ? generic_pipe_buf_nosteal+0x10/0x10
> [ 1599.390220]  [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
> [ 1599.396537]  [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
> [ 1599.402563]  [<ffffffff81282973>] SyS_sendfile64+0x73/0xd0
> [ 1599.408685]  [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
> [ 1599.414820]  [<ffffffff817d927f>] entry_SYSCALL64_slow_path+0x25/0x25

IOW, sendfile from shmem...  How easily is that reproduced (IOW, did you
get any more of those)?


* [RFC] O_DIRECT vs EFAULT (was Re: [PATCH 10/12] new iov_iter flavour: pipe-backed)
  2016-09-30  7:30                                         ` Miklos Szeredi
@ 2016-10-03  3:34                                           ` Al Viro
  2016-10-03 17:07                                             ` Linus Torvalds
  0 siblings, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-10-03  3:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, CAI Qian, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Fri, Sep 30, 2016 at 09:30:21AM +0200, Miklos Szeredi wrote:
> On Fri, Sep 30, 2016 at 12:50 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Thu, Sep 29, 2016 at 10:53:55PM +0200, Miklos Szeredi wrote:
> >
> >> The EFAULT logic seems to be missing across the board.  And callers
> >> don't expect a zero return value.  Most will loop indefinitely.
> >
> > Nope.  copy_page_to_iter() *never* returns -EFAULT.  Including the iovec
> > one - check copy_page_to_iter_iovec().  Any caller that does not expect
> > a zero return value from that primitive is a bug, triggerable as soon as
> > you feed it an iovec with NULL ->iov_base.
> 
> Right.
> 
> I was actually looking at iov_iter_get_pages() callers...

FWIW, that's interesting - O_DIRECT readv()/writev() reacts to fault anywhere
as "nothing done, return -EFAULT now", rather than a short read/write.
That happens even though some IO is actually done.  Note, BTW, that we are not even
consistent between the filesystems - local block ones do IO and give -EFAULT,
while NFS, Lustre and FUSE do short read/write, reporting -EFAULT only upon
shortening to nothing.  So does ceph, except that shortening might be for
more than one page.

Considering how weak POSIX is in that area, we are probably not violating
anything, but... it would be more convenient if we treated those as
short read/write, same way for all filesystems.

Linus, do you have any objections against such behaviour change?  AFAICS,
all it takes is this:

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 7c3ce73..3a8ebda 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -246,6 +246,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
 		if ((dio->op == REQ_OP_READ) &&
 		    ((offset + transferred) > dio->i_size))
 			transferred = dio->i_size - offset;
+		if (ret == -EFAULT)
+			ret = 0;
 	}
 
 	if (ret == 0)


* Re: [RFC][CFT] splice_read reworked
  2016-10-03  1:42                               ` [RFC][CFT] splice_read reworked Al Viro
@ 2016-10-03 14:06                                 ` CAI Qian
  2016-10-03 15:20                                   ` CAI Qian
  2016-10-03 20:32                                   ` CAI Qian
  0 siblings, 2 replies; 104+ messages in thread
From: CAI Qian @ 2016-10-03 14:06 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel



----- Original Message -----
> From: "Al Viro" <viro@ZenIV.linux.org.uk>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Sunday, October 2, 2016 9:42:18 PM
> Subject: Re: [RFC][CFT] splice_read reworked
> 
> On Fri, Sep 30, 2016 at 01:42:17PM -0400, CAI Qian wrote:
> 
> > [ 1599.151286] ------------[ cut here ]------------
> > [ 1599.156457] WARNING: CPU: 37 PID: 95143 at lib/iov_iter.c:316
> > sanity+0x75/0x80
> 
> [snip]
> 
> > [ 1599.344171]  [<ffffffff813e9b45>] sanity+0x75/0x80
> > [ 1599.349518]  [<ffffffff813ec739>] copy_page_to_iter+0xf9/0x1e0
> > [ 1599.356027]  [<ffffffff8120691f>] shmem_file_read_iter+0x9f/0x340
> > [ 1599.362829]  [<ffffffff812bbeb9>] generic_file_splice_read+0xb9/0x1b0
> > [ 1599.370015]  [<ffffffff812bc206>] do_splice_to+0x76/0x90
> > [ 1599.375941]  [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
> > [ 1599.382935]  [<ffffffff812bba80>] ? generic_pipe_buf_nosteal+0x10/0x10
> > [ 1599.390220]  [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
> > [ 1599.396537]  [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
> > [ 1599.402563]  [<ffffffff81282973>] SyS_sendfile64+0x73/0xd0
> > [ 1599.408685]  [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
> > [ 1599.414820]  [<ffffffff817d927f>] entry_SYSCALL64_slow_path+0x25/0x25
> 
> IOW, sendfile from shmem...  How easily is that reproduced (IOW, did you
> get any more of those)?
> 
It is pretty reproducible so far by just running trinity from a docker
container backed by overlayfs/xfs.

# su - test
$ trinity

   CAI Qian


* Re: [RFC][CFT] splice_read reworked
  2016-10-03 14:06                                 ` CAI Qian
@ 2016-10-03 15:20                                   ` CAI Qian
  2016-10-03 21:12                                     ` Dave Chinner
  2016-10-03 20:32                                   ` CAI Qian
  1 sibling, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-10-03 15:20 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel



----- Original Message -----
> From: "CAI Qian" <caiqian@redhat.com>
> To: "Al Viro" <viro@ZenIV.linux.org.uk>
> Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Monday, October 3, 2016 10:06:27 AM
> Subject: Re: [RFC][CFT] splice_read reworked
> 
> 
> 
> ----- Original Message -----
> > From: "Al Viro" <viro@ZenIV.linux.org.uk>
> > To: "CAI Qian" <caiqian@redhat.com>
> > Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner"
> > <david@fromorbit.com>, "linux-xfs"
> > <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin"
> > <npiggin@gmail.com>,
> > linux-fsdevel@vger.kernel.org
> > Sent: Sunday, October 2, 2016 9:42:18 PM
> > Subject: Re: [RFC][CFT] splice_read reworked
> > 
> > On Fri, Sep 30, 2016 at 01:42:17PM -0400, CAI Qian wrote:
> > 
> > > [ 1599.151286] ------------[ cut here ]------------
> > > [ 1599.156457] WARNING: CPU: 37 PID: 95143 at lib/iov_iter.c:316
> > > sanity+0x75/0x80
> > 
> > [snip]
> > 
> > > [ 1599.344171]  [<ffffffff813e9b45>] sanity+0x75/0x80
> > > [ 1599.349518]  [<ffffffff813ec739>] copy_page_to_iter+0xf9/0x1e0
> > > [ 1599.356027]  [<ffffffff8120691f>] shmem_file_read_iter+0x9f/0x340
> > > [ 1599.362829]  [<ffffffff812bbeb9>] generic_file_splice_read+0xb9/0x1b0
> > > [ 1599.370015]  [<ffffffff812bc206>] do_splice_to+0x76/0x90
> > > [ 1599.375941]  [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
> > > [ 1599.382935]  [<ffffffff812bba80>] ? generic_pipe_buf_nosteal+0x10/0x10
> > > [ 1599.390220]  [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
> > > [ 1599.396537]  [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
> > > [ 1599.402563]  [<ffffffff81282973>] SyS_sendfile64+0x73/0xd0
> > > [ 1599.408685]  [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
> > > [ 1599.414820]  [<ffffffff817d927f>] entry_SYSCALL64_slow_path+0x25/0x25
> > 
> > IOW, sendfile from shmem...  How easily is that reproduced (IOW, did you
> > get any more of those)?
> > 
> It is pretty reproducible so far by just running the trinity from a docker
> container backed by overlayfs/xfs.
Another warning has happened once so far. Not sure if it is related.

[  447.961826] ------------[ cut here ]------------
[  447.967020] WARNING: CPU: 39 PID: 27352 at fs/xfs/xfs_file.c:626 xfs_file_dio_aio_write+0x3dc/0x4b0 [xfs]
[  447.977736] Modules linked in: ieee802154_socket ieee802154 af_key vmw_vsock_vmci_transport vsock vmw_vmci bluetooth rfkill can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr i2c_i801 i2c_smbus ipmi_ssif mei_me sg mei shpchp lpc_ich wmi ipmi_si ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm mdio ahci ptp libahci pps_core libata i2c_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[  448.086775] CPU: 39 PID: 27352 Comm: trinity-c39 Not tainted 4.8.0-rc8-splice+ #1
[  448.095126] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[  448.106483]  0000000000000286 00000000389140f2 ffff880404833c48 ffffffff813d2eac
[  448.114776]  0000000000000000 0000000000000000 ffff880404833c88 ffffffff8109cf11
[  448.123067]  00000272389140f2 ffff880404833d80 ffff880404833dd8 ffff8803bfba88e8
[  448.131356] Call Trace:
[  448.134088]  [<ffffffff813d2eac>] dump_stack+0x85/0xc9
[  448.139821]  [<ffffffff8109cf11>] __warn+0xd1/0xf0
[  448.145167]  [<ffffffff8109d04d>] warn_slowpath_null+0x1d/0x20
[  448.151705]  [<ffffffffa044165c>] xfs_file_dio_aio_write+0x3dc/0x4b0 [xfs]
[  448.159394]  [<ffffffffa0441b10>] xfs_file_write_iter+0x90/0x130 [xfs]
[  448.166679]  [<ffffffff81280eee>] do_iter_readv_writev+0xae/0x130
[  448.173479]  [<ffffffff81281992>] do_readv_writev+0x1a2/0x230
[  448.179906]  [<ffffffffa0441a80>] ? xfs_file_buffered_aio_write+0x350/0x350 [xfs]
[  448.188256]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
[  448.195347]  [<ffffffff810fce1d>] ? trace_hardirqs_on+0xd/0x10
[  448.201855]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
[  448.208944]  [<ffffffff81281c6c>] vfs_writev+0x3c/0x50
[  448.214675]  [<ffffffff81281e22>] do_pwritev+0xa2/0xc0
[  448.220407]  [<ffffffff81282f11>] SyS_pwritev+0x11/0x20
[  448.226237]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[  448.232358]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[  448.239560] ---[ end trace 1c54e743f1fa4f5e ]---


* Re: [RFC] O_DIRECT vs EFAULT (was Re: [PATCH 10/12] new iov_iter flavour: pipe-backed)
  2016-10-03  3:34                                           ` [RFC] O_DIRECT vs EFAULT (was Re: [PATCH 10/12] new iov_iter flavour: pipe-backed) Al Viro
@ 2016-10-03 17:07                                             ` Linus Torvalds
  2016-10-03 18:54                                               ` Al Viro
  0 siblings, 1 reply; 104+ messages in thread
From: Linus Torvalds @ 2016-10-03 17:07 UTC (permalink / raw)
  To: Al Viro
  Cc: Miklos Szeredi, Dave Chinner, CAI Qian, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Sun, Oct 2, 2016 at 8:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Linus, do you have any objections against such behaviour change?  AFAICS,
> all it takes is this:
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 7c3ce73..3a8ebda 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -246,6 +246,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>                 if ((dio->op == REQ_OP_READ) &&
>                     ((offset + transferred) > dio->i_size))
>                         transferred = dio->i_size - offset;
> +               if (ret == -EFAULT)
> +                       ret = 0;

I don't think that's right. To me it looks like the short read case
might have changed "transferred" back to zero, in which case we do
*not* want to skip the EFAULT.

But if there's some reason that can't happen (ie "dio->i_size" is
guaranteed to be larger than "offset"), then with a comment to that
effect it's ok.

Otherwise I think it would need to be something like

        /* If we were partially successful, ignore later EFAULT */
        if (transferred && ret == -EFAULT)
                ret = 0;

or something. Yes?

                Linus


* Re: [RFC][CFT] splice_read reworked
  2016-10-03  1:37                                 ` Al Viro
@ 2016-10-03 17:49                                   ` CAI Qian
  2016-10-04 17:39                                     ` local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked) CAI Qian
  0 siblings, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-10-03 17:49 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> From: "Al Viro" <viro@ZenIV.linux.org.uk>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, xfs@oss.sgi.com, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Sunday, October 2, 2016 9:37:37 PM
> Subject: Re: [RFC][CFT] splice_read reworked
> 
> On Fri, Sep 30, 2016 at 02:33:23PM -0400, CAI Qian wrote:
> 
> OK, the immediate trigger is
> 	* sendfile() from something that uses seq_read to a regular file.
> Does sb_start_write() around the call of do_splice_direct() (as always),
> which ends up calling default_file_splice_read() (again, as usual), which
> ends up calling ->read() of the source, i.e. seq_read().  No changes there.
>  
> 	* sb_start_write() can be called under ->i_mutex.  The latter is
> on overlayfs inode, the former is done to upper layer in that overlayfs.
> Nothing new, again.
> 
> 	* ->i_mutex can be taken under ->cred_guard_mutex.  Yes, it can -
> in open_exec().  Again, no changes.
> 
> 	* ->cred_guard_mutex can be taken in ->show() of a seq_file,
> namely /proc/*/auxv...  Argh, ->cred_guard_mutex whack-a-mole strikes
> again...
> 
> OK, I think essentially the same warning had been triggerable since _way_
> back.  All changes around splice have no effect on it.
> 
> Look: to get a deadlock we need
> 	(1) sendfile from /proc/<pid>/auxv to a regular file on upper layer of
> overlayfs requesting not to freeze the target.
> 	(2) attempt to freeze it blocking until (1) is done.
> 	(3) directory modification on overlayfs trying to request not to freeze
> the upper layer and blocking until (2) is done.
> 	(4) execve() in <pid> holding ->cred_guard_mutex, trying to open
> something in overlayfs and getting blocked on directory lock, held by (3).
> 
> Now (1) gets around to reading from /proc/<pid>/auxv, which blocks on
> ->cred_guard_mutex.  Mentioning of seq_read itself holding locks is
> irrelevant;
> what matters is that ->read() grabs ->cred_guard_mutex.
> 
> We used to have similar problems in /proc/*/environ and /proc/*/mem; looks
> like /proc/*/environ needs to get the treatment similar to e268337dfe26 and
> b409e578d9a4.
> 
You are right. This is also reproducible on v4.8 mainline.
    CAI Qian


* Re: [RFC] O_DIRECT vs EFAULT (was Re: [PATCH 10/12] new iov_iter flavour: pipe-backed)
  2016-10-03 17:07                                             ` Linus Torvalds
@ 2016-10-03 18:54                                               ` Al Viro
  0 siblings, 0 replies; 104+ messages in thread
From: Al Viro @ 2016-10-03 18:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, CAI Qian, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Mon, Oct 03, 2016 at 10:07:39AM -0700, Linus Torvalds wrote:
> On Sun, Oct 2, 2016 at 8:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Linus, do you have any objections against such behaviour change?  AFAICS,
> > all it takes is this:
> >
> > diff --git a/fs/direct-io.c b/fs/direct-io.c
> > index 7c3ce73..3a8ebda 100644
> > --- a/fs/direct-io.c
> > +++ b/fs/direct-io.c
> > @@ -246,6 +246,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
> >                 if ((dio->op == REQ_OP_READ) &&
> >                     ((offset + transferred) > dio->i_size))
> >                         transferred = dio->i_size - offset;
> > +               if (ret == -EFAULT)
> > +                       ret = 0;
> 
> I don't think that's right. To me it looks like the short read case
> might have changed "transferred" back to zero, in which case we do
> *not* want to skip the EFAULT.

There's this in do_blockdev_direct_IO():
        /* Once we sampled i_size check for reads beyond EOF */
        dio->i_size = i_size_read(inode);
        if (iov_iter_rw(iter) == READ && offset >= dio->i_size) {
                if (dio->flags & DIO_LOCKING)
                        mutex_unlock(&inode->i_mutex);
                kmem_cache_free(dio_cache, dio);
                retval = 0;
                goto out;
        }
so that shouldn't happen.  That said,

> But if there's some reason that can't happen (ie "dio->i_size" is
> guaranteed to be larger than "offset"), then with a comment to that
> effect it's ok.
> 
> Otherwise I think it would need to be something like
> 
>         /* If we were partially successful, ignore later EFAULT */
>         if (transferred && ret == -EFAULT)
>                 ret = 0;

... it's certainly less brittle that way.  I'd probably still put it under
the same if (dio->result) and write it as
	if (unlikely(ret == -EFAULT) && transferred)
though.
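
A minimal userspace sketch of the rule being discussed, for illustration
only.  dio_result_model() is a hypothetical name and is not the kernel's
dio_complete(); the kernel-only unlikely() hint is left out, and the final
"error or byte count" step is an assumption that follows the usual
read(2)/write(2) convention.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/*
 * Hypothetical model of the return-value fixup discussed above.
 * 'ret' is the pending status (0 or a negative errno) and
 * 'transferred' is the number of bytes already moved.
 */
static ssize_t dio_result_model(ssize_t ret, ssize_t transferred)
{
	assert(transferred >= 0);

	/* If we were partially successful, ignore later EFAULT */
	if (ret == -EFAULT && transferred)
		ret = 0;

	/* Report an error only if one is still pending */
	return ret ? ret : transferred;
}
```

With this model, (-EFAULT, 4096) yields a short transfer of 4096 bytes,
(-EFAULT, 0) still yields -EFAULT, and any other error such as -EIO is
passed through unchanged.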

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-03 14:06                                 ` CAI Qian
  2016-10-03 15:20                                   ` CAI Qian
@ 2016-10-03 20:32                                   ` CAI Qian
  2016-10-03 20:35                                     ` Al Viro
  1 sibling, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-10-03 20:32 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel



----- Original Message -----
> From: "CAI Qian" <caiqian@redhat.com>
> To: "Al Viro" <viro@ZenIV.linux.org.uk>
> Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Monday, October 3, 2016 10:06:27 AM
> Subject: Re: [RFC][CFT] splice_read reworked
> 
> 
> 
> ----- Original Message -----
> > From: "Al Viro" <viro@ZenIV.linux.org.uk>
> > To: "CAI Qian" <caiqian@redhat.com>
> > Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner"
> > <david@fromorbit.com>, "linux-xfs"
> > <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin"
> > <npiggin@gmail.com>,
> > linux-fsdevel@vger.kernel.org
> > Sent: Sunday, October 2, 2016 9:42:18 PM
> > Subject: Re: [RFC][CFT] splice_read reworked
> > 
> > On Fri, Sep 30, 2016 at 01:42:17PM -0400, CAI Qian wrote:
> > 
> > > [ 1599.151286] ------------[ cut here ]------------
> > > [ 1599.156457] WARNING: CPU: 37 PID: 95143 at lib/iov_iter.c:316
> > > sanity+0x75/0x80
> > 
> > [snip]
> > 
> > > [ 1599.344171]  [<ffffffff813e9b45>] sanity+0x75/0x80
> > > [ 1599.349518]  [<ffffffff813ec739>] copy_page_to_iter+0xf9/0x1e0
> > > [ 1599.356027]  [<ffffffff8120691f>] shmem_file_read_iter+0x9f/0x340
> > > [ 1599.362829]  [<ffffffff812bbeb9>] generic_file_splice_read+0xb9/0x1b0
> > > [ 1599.370015]  [<ffffffff812bc206>] do_splice_to+0x76/0x90
> > > [ 1599.375941]  [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
> > > [ 1599.382935]  [<ffffffff812bba80>] ? generic_pipe_buf_nosteal+0x10/0x10
> > > [ 1599.390220]  [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
> > > [ 1599.396537]  [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
> > > [ 1599.402563]  [<ffffffff81282973>] SyS_sendfile64+0x73/0xd0
> > > [ 1599.408685]  [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
> > > [ 1599.414820]  [<ffffffff817d927f>] entry_SYSCALL64_slow_path+0x25/0x25
> > 
> > IOW, sendfile from shmem...  How easily is that reproduced (IOW, did you
> > get any more of those)?
> > 
> It is pretty reproducible so far by just running the trinity from a docker
> container backed by overlayfs/xfs.
> 
> # su - test
> $ trinity
Also, AFAICT, this is NOT reproducible on v4.8 mainline, but only with this
splice_read reworked branch of vfs tree.
   CAI Qian

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-03 20:32                                   ` CAI Qian
@ 2016-10-03 20:35                                     ` Al Viro
  2016-10-04 13:29                                       ` CAI Qian
  0 siblings, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-10-03 20:35 UTC (permalink / raw)
  To: CAI Qian
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Mon, Oct 03, 2016 at 04:32:19PM -0400, CAI Qian wrote:
> > It is pretty reproducible so far by just running the trinity from a docker
> > container backed by overlayfs/xfs.
> > 
> > # su - test
> > $ trinity
> Also, AFAICT, this is NOT reproducible on v4.8 mainline, but only with this
> splice_read reworked branch of vfs tree.

I would be very surprised if mainline had somehow managed to trip sanity
checks added in vfs tree ;-)

Is there any way to record the sequence of syscalls leading to that?

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-03 15:20                                   ` CAI Qian
@ 2016-10-03 21:12                                     ` Dave Chinner
  2016-10-04 13:57                                       ` CAI Qian
  0 siblings, 1 reply; 104+ messages in thread
From: Dave Chinner @ 2016-10-03 21:12 UTC (permalink / raw)
  To: CAI Qian
  Cc: Al Viro, Linus Torvalds, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Mon, Oct 03, 2016 at 11:20:50AM -0400, CAI Qian wrote:
> > container backed by overlayfs/xfs.
> Another warning has fired once so far.  Not sure if it is related.
> 
> [  447.961826] ------------[ cut here ]------------
> [  447.967020] WARNING: CPU: 39 PID: 27352 at fs/xfs/xfs_file.c:626 xfs_file_dio_aio_write+0x3dc/0x4b0 [xfs]
> [  447.977736] Modules linked in: ieee802154_socket ieee802154 af_key vmw_vsock_vmci_transport vsock vmw_vmci bluetooth rfkill can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr i2c_i801 i2c_smbus ipmi_ssif mei_me sg mei shpchp lpc_ich wmi ipmi_si ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm mdio ahci ptp libahci pps_core libata i2c_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
> [  448.086775] CPU: 39 PID: 27352 Comm: trinity-c39 Not tainted 4.8.0-rc8-splice+ #1
> [  448.095126] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
> [  448.106483]  0000000000000286 00000000389140f2 ffff880404833c48 ffffffff813d2eac
> [  448.114776]  0000000000000000 0000000000000000 ffff880404833c88 ffffffff8109cf11
> [  448.123067]  00000272389140f2 ffff880404833d80 ffff880404833dd8 ffff8803bfba88e8
> [  448.131356] Call Trace:
> [  448.134088]  [<ffffffff813d2eac>] dump_stack+0x85/0xc9
> [  448.139821]  [<ffffffff8109cf11>] __warn+0xd1/0xf0
> [  448.145167]  [<ffffffff8109d04d>] warn_slowpath_null+0x1d/0x20
> [  448.151705]  [<ffffffffa044165c>] xfs_file_dio_aio_write+0x3dc/0x4b0 [xfs]
> [  448.159394]  [<ffffffffa0441b10>] xfs_file_write_iter+0x90/0x130 [xfs]
> [  448.166679]  [<ffffffff81280eee>] do_iter_readv_writev+0xae/0x130
> [  448.173479]  [<ffffffff81281992>] do_readv_writev+0x1a2/0x230
> [  448.179906]  [<ffffffffa0441a80>] ? xfs_file_buffered_aio_write+0x350/0x350 [xfs]
> [  448.188256]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
> [  448.195347]  [<ffffffff810fce1d>] ? trace_hardirqs_on+0xd/0x10
> [  448.201855]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
> [  448.208944]  [<ffffffff81281c6c>] vfs_writev+0x3c/0x50
> [  448.214675]  [<ffffffff81281e22>] do_pwritev+0xa2/0xc0
> [  448.220407]  [<ffffffff81282f11>] SyS_pwritev+0x11/0x20
> [  448.226237]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [  448.232358]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [  448.239560] ---[ end trace 1c54e743f1fa4f5e ]---

This usually happens when an application mixes mmap access and
direct IO to the same file. The warning fires when the direct IO
cannot invalidate the cached range after writeback (e.g. writeback
raced with mmap app faulting and dirtying the page again), and hence
results in the page cache containing stale data.  This warning fires
when that happens, indicating to developers who get a bug report
about data corruption that it's the userspace application that is
the problem, not the filesystem. i.e. the application is doing
something we explicitly document they should not do:

$ man 2 open
....
  O_DIRECT
....
       Applications should avoid mixing O_DIRECT and normal I/O to
       the same file, and especially to overlapping byte regions in
       the  same  file.   Even  when  the filesystem  correctly
       handles the coherency issues in this situation, overall I/O
       throughput is likely to be slower than using either mode
       alone.  Likewise, applications should avoid mixing mmap(2) of
       files with direct I/O to the same files.

Splice should not have this problem if the IO path locking is
correct, as both direct IO and splice IO use the same inode lock for
exclusion. i.e. splice write should not be running at the same time
as a direct IO read or write....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-03 20:35                                     ` Al Viro
@ 2016-10-04 13:29                                       ` CAI Qian
  2016-10-04 14:28                                         ` Al Viro
  0 siblings, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-10-04 13:29 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel



----- Original Message -----
> From: "Al Viro" <viro@ZenIV.linux.org.uk>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Monday, October 3, 2016 4:35:40 PM
> Subject: Re: [RFC][CFT] splice_read reworked
> 
> On Mon, Oct 03, 2016 at 04:32:19PM -0400, CAI Qian wrote:
> > > It is pretty reproducible so far by just running the trinity from a
> > > docker
> > > container backed by overlayfs/xfs.
> > > 
> > > # su - test
> > > $ trinity
> > Also, AFAICT, this is NOT reproducible on v4.8 mainline, but only with this
> > splice_read reworked branch of vfs tree.
> 
> I would be very surprised if mainline had somehow managed to trip sanity
> checks added in vfs tree ;-)
> 
> Is there any way to record the sequence of syscalls leading to that?
> 
Yes, though it is a bit of a long shot.

http://people.redhat.com/qcai/tmp/trinity-child113.log

This one triggered the warning at lib/iov_iter.c:316 sanity+0x6b/0x6f
three times in a row.

[ 2200.510753] ------------[ cut here ]------------
[ 2200.515929] WARNING: CPU: 9 PID: 116624 at lib/iov_iter.c:316 sanity+0x6b/0x6f
[ 2200.523999] Modules linked in: 8021q garp mrp fuse dlci vmw_vsock_vmci_transport vsock vmw_vmci af_key ieee802154_socket ieee802154 hidp cmtp kernelcapi bnep rfcomm bluetooth rfkill can_bcm can_raw can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr i2c_i801 i2c_smbus mei_me ipmi_ssif sg lpc_ich mei shpchp wmi ipmi_si ipmi_msghandler acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm ahci mdio libahci ptp libata pps_core i2c_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[ 2200.644251] CPU: 9 PID: 116624 Comm: trinity-c113 Not tainted 4.8.0-rc8-splice+ #1
[ 2200.652708] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 2200.664062]  0000000000000286 00000000bad46fa7 ffff8803d1ca7b30 ffffffff813d2eac
[ 2200.672368]  0000000000000000 0000000000000000 ffff8803d1ca7b70 ffffffff8109cf11
[ 2200.680660]  0000013c2e32bdc8 ffffea000eea7540 0000000000001000 ffff88030e9a0000
[ 2200.688954] Call Trace:
[ 2200.691686]  [<ffffffff813d2eac>] dump_stack+0x85/0xc9
[ 2200.697433]  [<ffffffff8109cf11>] __warn+0xd1/0xf0
[ 2200.702777]  [<ffffffff8109d04d>] warn_slowpath_null+0x1d/0x20
[ 2200.709285]  [<ffffffff81418c93>] sanity+0x6b/0x6f
[ 2200.714630]  [<ffffffff813e9586>] copy_page_to_iter+0xf6/0x1e0
[ 2200.721139]  [<ffffffff811e3906>] generic_file_read_iter+0x406/0x800
[ 2200.728231]  [<ffffffff810f8afd>] ? down_read_nested+0x4d/0x80
[ 2200.734798]  [<ffffffffa02c46ae>] ? xfs_ilock+0x1ae/0x260 [xfs]
[ 2200.741433]  [<ffffffffa02b3f2f>] xfs_file_buffered_aio_read+0x6f/0x1b0 [xfs]
[ 2200.749412]  [<ffffffffa02b46e8>] xfs_file_read_iter+0x68/0xc0 [xfs]
[ 2200.756504]  [<ffffffff812bb359>] generic_file_splice_read+0xb9/0x1b0
[ 2200.763691]  [<ffffffff812bb913>] do_splice_to+0x73/0x90
[ 2200.769618]  [<ffffffff812bba1b>] splice_direct_to_actor+0xeb/0x220
[ 2200.776610]  [<ffffffff812baee0>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 2200.783893]  [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
[ 2200.790207]  [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
[ 2200.796231]  [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
[ 2200.802351]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2200.808471]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2200.815723] ---[ end trace e02dda43787dce2a ]---
[ 2200.821003] ------------[ cut here ]------------
[ 2200.826168] WARNING: CPU: 9 PID: 116624 at lib/iov_iter.c:316 sanity+0x6b/0x6f
[ 2200.834765] Modules linked in: 8021q garp mrp fuse dlci vmw_vsock_vmci_transport vsock vmw_vmci af_key ieee802154_socket ieee802154 hidp cmtp kernelcapi bnep rfcomm bluetooth rfkill can_bcm can_raw can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr i2c_i801 i2c_smbus mei_me ipmi_ssif sg lpc_ich mei shpchp wmi ipmi_si ipmi_msghandler acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm ahci mdio libahci ptp libata pps_core i2c_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[ 2200.951286] CPU: 9 PID: 116624 Comm: trinity-c113 Tainted: G        W       4.8.0-rc8-splice+ #1
[ 2200.961088] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 2200.972443]  0000000000000286 00000000bad46fa7 ffff8803d1ca7b30 ffffffff813d2eac
[ 2200.980747]  0000000000000000 0000000000000000 ffff8803d1ca7b70 ffffffff8109cf11
[ 2200.989078]  0000013c00000000 ffffea000b711880 0000000000001000 ffff88030e9a0000
[ 2200.997375] Call Trace:
[ 2201.000107]  [<ffffffff813d2eac>] dump_stack+0x85/0xc9
[ 2201.005842]  [<ffffffff8109cf11>] __warn+0xd1/0xf0
[ 2201.011199]  [<ffffffff8109d04d>] warn_slowpath_null+0x1d/0x20
[ 2201.017708]  [<ffffffff81418c93>] sanity+0x6b/0x6f
[ 2201.023053]  [<ffffffff813e9586>] copy_page_to_iter+0xf6/0x1e0
[ 2201.029562]  [<ffffffff811e3906>] generic_file_read_iter+0x406/0x800
[ 2201.036654]  [<ffffffff810f8afd>] ? down_read_nested+0x4d/0x80
[ 2201.043213]  [<ffffffffa02c46ae>] ? xfs_ilock+0x1ae/0x260 [xfs]
[ 2201.049849]  [<ffffffffa02b3f2f>] xfs_file_buffered_aio_read+0x6f/0x1b0 [xfs]
[ 2201.057828]  [<ffffffffa02b46e8>] xfs_file_read_iter+0x68/0xc0 [xfs]
[ 2201.064919]  [<ffffffff812bb359>] generic_file_splice_read+0xb9/0x1b0
[ 2201.072108]  [<ffffffff812bb913>] do_splice_to+0x73/0x90
[ 2201.078034]  [<ffffffff812bba1b>] splice_direct_to_actor+0xeb/0x220
[ 2201.085026]  [<ffffffff812baee0>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 2201.092309]  [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
[ 2201.098623]  [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
[ 2201.104646]  [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
[ 2201.110768]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2201.116890]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2201.124136] ---[ end trace e02dda43787dce2b ]---
[ 2201.192680] ------------[ cut here ]------------
[ 2201.203826] WARNING: CPU: 9 PID: 116624 at lib/iov_iter.c:316 sanity+0x6b/0x6f
[ 2201.211899] Modules linked in: 8021q garp mrp fuse dlci vmw_vsock_vmci_transport vsock vmw_vmci af_key ieee802154_socket ieee802154 hidp cmtp kernelcapi bnep rfcomm bluetooth rfkill can_bcm can_raw can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr i2c_i801 i2c_smbus mei_me ipmi_ssif sg lpc_ich mei shpchp wmi ipmi_si ipmi_msghandler acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm ahci mdio libahci ptp libata pps_core i2c_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[ 2201.329000] CPU: 9 PID: 116624 Comm: trinity-c113 Tainted: G        W       4.8.0-rc8-splice+ #1
[ 2201.338805] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 2201.350160]  0000000000000286 00000000bad46fa7 ffff8803d1ca7b30 ffffffff813d2eac
[ 2201.358455]  0000000000000000 0000000000000000 ffff8803d1ca7b70 ffffffff8109cf11
[ 2201.366747]  0000013c00000000 ffffea000be93cc0 0000000000001000 ffff88030e9a0000
[ 2201.375035] Call Trace:
[ 2201.377767]  [<ffffffff813d2eac>] dump_stack+0x85/0xc9
[ 2201.383499]  [<ffffffff8109cf11>] __warn+0xd1/0xf0
[ 2201.388843]  [<ffffffff8109d04d>] warn_slowpath_null+0x1d/0x20
[ 2201.395351]  [<ffffffff81418c93>] sanity+0x6b/0x6f
[ 2201.400695]  [<ffffffff813e9586>] copy_page_to_iter+0xf6/0x1e0
[ 2201.407204]  [<ffffffff811e3906>] generic_file_read_iter+0x406/0x800
[ 2201.414294]  [<ffffffff810f8afd>] ? down_read_nested+0x4d/0x80
[ 2201.420844]  [<ffffffffa02c46ae>] ? xfs_ilock+0x1ae/0x260 [xfs]
[ 2201.427463]  [<ffffffffa02b3f2f>] xfs_file_buffered_aio_read+0x6f/0x1b0 [xfs]
[ 2201.435451]  [<ffffffffa02b46e8>] xfs_file_read_iter+0x68/0xc0 [xfs]
[ 2201.442542]  [<ffffffff812bb359>] generic_file_splice_read+0xb9/0x1b0
[ 2201.449728]  [<ffffffff812bb913>] do_splice_to+0x73/0x90
[ 2201.455655]  [<ffffffff812bba1b>] splice_direct_to_actor+0xeb/0x220
[ 2201.462645]  [<ffffffff812baee0>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 2201.469928]  [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
[ 2201.476242]  [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
[ 2201.482264]  [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
[ 2201.488383]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2201.494504]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2201.501736] ---[ end trace e02dda43787dce2c ]---

   CAI Qian

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-03 21:12                                     ` Dave Chinner
@ 2016-10-04 13:57                                       ` CAI Qian
  0 siblings, 0 replies; 104+ messages in thread
From: CAI Qian @ 2016-10-04 13:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Al Viro, Linus Torvalds, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel


> This usually happens when an application mixes mmap access and
> direct IO to the same file. The warning fires when the direct IO
> cannot invalidate the cached range after writeback (e.g. writeback
> raced with mmap app faulting and dirtying the page again), and hence
> results in the page cache containing stale data.  This warning fires
> when that happens, indicating to developers who get a bug report
> about data corruption that it's the userspace application that is
> the problem, not the filesystem. i.e. the application is doing
> something we explicitly document they should not do:
> 
> $ man 2 open
> ....
>   O_DIRECT
> ....
>        Applications should avoid mixing O_DIRECT and normal I/O to
>        the same file, and especially to overlapping byte regions in
>        the  same  file.   Even  when  the filesystem  correctly
>        handles the coherency issues in this situation, overall I/O
>        throughput is likely to be slower than using either mode
>        alone.  Likewise, applications should avoid mixing mmap(2) of
>        files with direct I/O to the same files.
> 
> Splice should not have this problem if the IO path locking is
> correct, as both direct IO and splice IO use the same inode lock for
> exclusion. i.e. splice write should not be running at the same time
> as a direct IO read or write....
OK, so I assume that trinity is doing something that a proper userspace
application won't be doing which is fine, and there is nothing to worry
about from the kernel's perspective.

I just want to make sure there is no security implication here that a
non-privileged user could corrupt other users' data etc.
   CAI Qian


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-04 13:29                                       ` CAI Qian
@ 2016-10-04 14:28                                         ` Al Viro
  2016-10-04 16:21                                           ` CAI Qian
  0 siblings, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-10-04 14:28 UTC (permalink / raw)
  To: CAI Qian
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Tue, Oct 04, 2016 at 09:29:35AM -0400, CAI Qian wrote:

> > Is there any way to record the sequence of syscalls leading to that?
> > 
> Yes, though it is a bit of a long shot.
> 
> http://people.redhat.com/qcai/tmp/trinity-child113.log

;-/

Not enough information, unfortunately (descriptor in question opened
outside of that log, sendfile(out_fd=578, in_fd=578, offset=0x7f8318a07000,
count=0x3ffc00) doesn't tell what *offset was before the call) ;-/

Anyway, I've found and fixed a bug in pipe_advance(), which might or might
not help with those.  Could you try vfs.git#work.splice_read (or #for-next)
and see if these persist?
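
To show the kind of information that is missing from the log, here is a
hedged sketch of a userspace wrapper that records the value stored at
*offset before and after the call.  logged_sendfile() is a hypothetical
helper, not something trinity provides; only sendfile(2) itself is a real
interface here.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/types.h>

/*
 * Hypothetical wrapper: call sendfile(2) and log the value stored at
 * *offset before and after the call, not just the pointer.
 * A NULL offset is logged as -1 on both sides.
 */
static ssize_t logged_sendfile(int out_fd, int in_fd, off_t *offset,
			       size_t count, FILE *log)
{
	long long before = offset ? (long long)*offset : -1;
	long long after;
	ssize_t ret;
	int saved_errno;

	assert(log != NULL);

	ret = sendfile(out_fd, in_fd, offset, count);
	saved_errno = errno;	/* fprintf() may clobber errno */
	after = offset ? (long long)*offset : -1;

	fprintf(log,
		"sendfile(out_fd=%d, in_fd=%d, *offset=%lld -> %lld, count=%zu) = %zd\n",
		out_fd, in_fd, before, after, count, ret);

	errno = saved_errno;
	return ret;
}
```

A log line of this shape would say where in the file the read side started,
which the raw pointer value (offset=0x7f8318a07000) cannot.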

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-04 14:28                                         ` Al Viro
@ 2016-10-04 16:21                                           ` CAI Qian
  2016-10-04 20:12                                             ` Al Viro
  0 siblings, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-10-04 16:21 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel


> Not enough information, unfortunately (descriptor in question opened
> outside of that log, sendfile(out_fd=578, in_fd=578, offset=0x7f8318a07000,
> count=0x3ffc00) doesn't tell what *offset was before the call) ;-/
> 
> Anyway, I've found and fixed a bug in pipe_advance(), which might or might
> not help with those.  Could you try vfs.git#work.splice_read (or #for-next)
> and see if these persist?
I am afraid that this can also be reproduced on the latest #for-next.  The warning
always showed up at the end of the trinity run.  I captured more information this time.

http://people.redhat.com/qcai/tmp/trinity-child150.log
http://people.redhat.com/qcai/tmp/tri-full.log (big file so may just grep "child150")
http://people.redhat.com/qcai/tmp/trinity.log

[ 2187.697999] ------------[ cut here ]------------
[ 2187.703181] WARNING: CPU: 34 PID: 67630 at lib/iov_iter.c:316 sanity+0x6b/0x6f
[ 2187.713890] Modules linked in: fuse vmac tcp_diag udp_diag inet_diag ieee802154_socket ieee802154 af_key vmw_vsock_vmci_transport vsock vmw_vmci bluetooth rfkill can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr ipmi_ssif i2c_i801 i2c_smbus mei_me sg lpc_ich mei shpchp wmi ipmi_si ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm ahci libahci mdio ptp libata i2c_core pps_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[ 2187.828488] CPU: 29 PID: 67630 Comm: trinity-c150 Not tainted 4.8.0-rc8-fornext+ #1
[ 2187.837034] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 2187.848392]  0000000000000286 00000000a4c9de22 ffff8803f0d5bb30 ffffffff813d30ac
[ 2187.856687]  0000000000000000 0000000000000000 ffff8803f0d5bb70 ffffffff8109cf31
[ 2187.864983]  0000013c1923e8c0 ffffea000db71000 0000000000001000 ffff88044b127200
[ 2187.873282] Call Trace:
[ 2187.876017]  [<ffffffff813d30ac>] dump_stack+0x85/0xc9
[ 2187.881756]  [<ffffffff8109cf31>] __warn+0xd1/0xf0
[ 2187.887104]  [<ffffffff8109d06d>] warn_slowpath_null+0x1d/0x20
[ 2187.893616]  [<ffffffff81418ec8>] sanity+0x6b/0x6f
[ 2187.898967]  [<ffffffff813e97a6>] copy_page_to_iter+0xf6/0x1e0
[ 2187.905478]  [<ffffffff811e3926>] generic_file_read_iter+0x406/0x800
[ 2187.912570]  [<ffffffff810f8b1d>] ? down_read_nested+0x4d/0x80
[ 2187.919123]  [<ffffffffa029b74e>] ? xfs_ilock+0x1ae/0x260 [xfs]
[ 2187.925746]  [<ffffffffa028af2f>] xfs_file_buffered_aio_read+0x6f/0x1b0 [xfs]
[ 2187.933756]  [<ffffffffa028b6e8>] xfs_file_read_iter+0x68/0xc0 [xfs]
[ 2187.940847]  [<ffffffff812bb559>] generic_file_splice_read+0xb9/0x1b0
[ 2187.948034]  [<ffffffff812bbb13>] do_splice_to+0x73/0x90
[ 2187.953962]  [<ffffffff812bbc1b>] splice_direct_to_actor+0xeb/0x220
[ 2187.960955]  [<ffffffff812bb0e0>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 2187.968243]  [<ffffffff812bbdd9>] do_splice_direct+0x89/0xd0
[ 2187.974561]  [<ffffffff8128263e>] do_sendfile+0x1ce/0x3b0
[ 2187.980580]  [<ffffffff812831ef>] SyS_sendfile64+0x6f/0xd0
[ 2187.986698]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2187.992823]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2188.000349] ---[ end trace a3a1d0412c1a1214 ]---
[ 2188.006348] ------------[ cut here ]------------
[ 2188.011842] WARNING: CPU: 26 PID: 67630 at lib/iov_iter.c:316 sanity+0x6b/0x6f
[ 2188.019914] Modules linked in: fuse vmac tcp_diag udp_diag inet_diag ieee802154_socket ieee802154 af_key vmw_vsock_vmci_transport vsock vmw_vmci bluetooth rfkill can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr ipmi_ssif i2c_i801 i2c_smbus mei_me sg lpc_ich mei shpchp wmi ipmi_si ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm ahci libahci mdio ptp libata i2c_core pps_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[ 2188.133408] CPU: 54 PID: 67630 Comm: trinity-c150 Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 2188.143310] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 2188.154667]  0000000000000286 00000000a4c9de22 ffff8803f0d5bb30 ffffffff813d30ac
[ 2188.162962]  0000000000000000 0000000000000000 ffff8803f0d5bb70 ffffffff8109cf31
[ 2188.171257]  0000013c1923e8c8 ffffea000dbbd700 0000000000001000 ffff88044b127200
[ 2188.179551] Call Trace:
[ 2188.182284]  [<ffffffff813d30ac>] dump_stack+0x85/0xc9
[ 2188.188022]  [<ffffffff8109cf31>] __warn+0xd1/0xf0
[ 2188.193368]  [<ffffffff8109d06d>] warn_slowpath_null+0x1d/0x20
[ 2188.199879]  [<ffffffff81418ec8>] sanity+0x6b/0x6f
[ 2188.205227]  [<ffffffff813e97a6>] copy_page_to_iter+0xf6/0x1e0
[ 2188.211738]  [<ffffffff811e3926>] generic_file_read_iter+0x406/0x800
[ 2188.218824]  [<ffffffff810f8b1d>] ? down_read_nested+0x4d/0x80
[ 2188.225363]  [<ffffffffa029b74e>] ? xfs_ilock+0x1ae/0x260 [xfs]
[ 2188.231988]  [<ffffffffa028af2f>] xfs_file_buffered_aio_read+0x6f/0x1b0 [xfs]
[ 2188.239967]  [<ffffffffa028b6e8>] xfs_file_read_iter+0x68/0xc0 [xfs]
[ 2188.247059]  [<ffffffff812bb559>] generic_file_splice_read+0xb9/0x1b0
[ 2188.254246]  [<ffffffff812bbb13>] do_splice_to+0x73/0x90
[ 2188.260174]  [<ffffffff812bbc1b>] splice_direct_to_actor+0xeb/0x220
[ 2188.267168]  [<ffffffff812bb0e0>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 2188.274453]  [<ffffffff812bbdd9>] do_splice_direct+0x89/0xd0
[ 2188.280771]  [<ffffffff8128263e>] do_sendfile+0x1ce/0x3b0
[ 2188.286796]  [<ffffffff812831ef>] SyS_sendfile64+0x6f/0xd0
[ 2188.292918]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2188.299040]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2188.313523] ---[ end trace a3a1d0412c1a1215 ]---
[ 2188.458941] ------------[ cut here ]------------
[ 2188.464181] WARNING: CPU: 10 PID: 67630 at lib/iov_iter.c:316 sanity+0x6b/0x6f
[ 2188.472261] Modules linked in: fuse vmac tcp_diag udp_diag inet_diag ieee802154_socket ieee802154 af_key vmw_vsock_vmci_transport vsock vmw_vmci bluetooth rfkill can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr ipmi_ssif i2c_i801 i2c_smbus mei_me sg lpc_ich mei shpchp wmi ipmi_si ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm ahci libahci mdio ptp libata i2c_core pps_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[ 2188.585528] CPU: 38 PID: 67630 Comm: trinity-c150 Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 2188.595431] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 2188.606786]  0000000000000286 00000000a4c9de22 ffff8803f0d5bb30 ffffffff813d30ac
[ 2188.615082]  0000000000000000 0000000000000000 ffff8803f0d5bb70 ffffffff8109cf31
[ 2188.623379]  0000013c11bafb58 ffffea000ee78980 0000000000001000 ffff88044b127200
[ 2188.631675] Call Trace:
[ 2188.634410]  [<ffffffff813d30ac>] dump_stack+0x85/0xc9
[ 2188.640148]  [<ffffffff8109cf31>] __warn+0xd1/0xf0
[ 2188.645497]  [<ffffffff8109d06d>] warn_slowpath_null+0x1d/0x20
[ 2188.652324]  [<ffffffff81418ec8>] sanity+0x6b/0x6f
[ 2188.657672]  [<ffffffff813e97a6>] copy_page_to_iter+0xf6/0x1e0
[ 2188.664185]  [<ffffffff811e3926>] generic_file_read_iter+0x406/0x800
[ 2188.671268]  [<ffffffff810f8b1d>] ? down_read_nested+0x4d/0x80
[ 2188.677825]  [<ffffffffa029b74e>] ? xfs_ilock+0x1ae/0x260 [xfs]
[ 2188.684450]  [<ffffffffa028af2f>] xfs_file_buffered_aio_read+0x6f/0x1b0 [xfs]
[ 2188.692433]  [<ffffffffa028b6e8>] xfs_file_read_iter+0x68/0xc0 [xfs]
[ 2188.699525]  [<ffffffff812bb559>] generic_file_splice_read+0xb9/0x1b0
[ 2188.706711]  [<ffffffff812bbb13>] do_splice_to+0x73/0x90
[ 2188.712638]  [<ffffffff812bbc1b>] splice_direct_to_actor+0xeb/0x220
[ 2188.719632]  [<ffffffff812bb0e0>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 2188.726916]  [<ffffffff812bbdd9>] do_splice_direct+0x89/0xd0
[ 2188.733231]  [<ffffffff8128263e>] do_sendfile+0x1ce/0x3b0
[ 2188.739255]  [<ffffffff812831ef>] SyS_sendfile64+0x6f/0xd0
[ 2188.745377]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2188.751500]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2188.760216] ---[ end trace a3a1d0412c1a1216 ]---

^ permalink raw reply	[flat|nested] 104+ messages in thread

* local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-03 17:49                                   ` CAI Qian
@ 2016-10-04 17:39                                     ` CAI Qian
  2016-10-04 21:42                                       ` tj
  0 siblings, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-10-04 17:39 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel, tj


> ----- Original Message -----
> > From: "Al Viro" <viro@ZenIV.linux.org.uk>
> > To: "CAI Qian" <caiqian@redhat.com>
> > Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner"
> > <david@fromorbit.com>, "linux-xfs"
> > <linux-xfs@vger.kernel.org>, xfs@oss.sgi.com, "Jens Axboe"
> > <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> > linux-fsdevel@vger.kernel.org
> > Sent: Sunday, October 2, 2016 9:37:37 PM
> > Subject: Re: [RFC][CFT] splice_read reworked
> > 
> > On Fri, Sep 30, 2016 at 02:33:23PM -0400, CAI Qian wrote:
> > 
> > OK, the immediate trigger is
> > 	* sendfile() from something that uses seq_read to a regular file.
> > Does sb_start_write() around the call of do_splice_direct() (as always),
> > which ends up calling default_file_splice_read() (again, as usual), which
> > ends up calling ->read() of the source, i.e. seq_read().  No changes there.
> >  
> > 	* sb_start_write() can be called under ->i_mutex.  The latter is
> > on overlayfs inode, the former is done to upper layer in that overlayfs.
> > Nothing new, again.
> > 
> > 	* ->i_mutex can be taken under ->cred_guard_mutex.  Yes, it can -
> > in open_exec().  Again, no changes.
> > 
> > 	* ->cred_guard_mutex can be taken in ->show() of a seq_file,
> > namely /proc/*/auxv...  Argh, ->cred_guard_mutex whack-a-mole strikes
> > again...
> > 
> > OK, I think essentially the same warning had been triggerable since _way_
> > back.  All changes around splice have no effect on it.
> > 
> > Look: to get a deadlock we need
> > 	(1) sendfile from /proc/<pid>/auxv to a regular file on upper layer of
> > overlayfs requesting not to freeze the target.
> > 	(2) attempt to freeze it blocking until (1) is done.
> > 	(3) directory modification on overlayfs trying to request not to freeze
> > the upper layer and blocking until (2) is done.
> > 	(4) execve() in <pid> holding ->cred_guard_mutex, trying to open
> > something in overlayfs and getting blocked on directory lock, held by (3).
> > 
> > Now (1) gets around to reading from /proc/<pid>/auxv, which blocks on
> > ->cred_guard_mutex.  Mentioning of seq_read itself holding locks is
> > irrelevant; what matters is that ->read() grabs ->cred_guard_mutex.
> > 
> > We used to have similar problems in /proc/*/environ and /proc/*/mem; looks
> > like /proc/*/environ needs to get the treatment similar to e268337dfe26 and
> > b409e578d9a4.
> > 
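
The ordering chain in the quoted analysis can be sketched as a tiny lock-order
graph, in the spirit of what lockdep tracks.  This is only an illustrative
user-space model: the enum names, record_order() and would_deadlock() are
hypothetical, and freeze protection is treated as a plain lock, which glosses
over the fact that a freezer (step 2) is needed to turn that hold into a blocker.

```c
#include <assert.h>
#include <stdbool.h>

/* Lock classes named in the analysis (model names, not kernel symbols). */
enum lock_class { SB_WRITERS, I_MUTEX, CRED_GUARD_MUTEX, NR_CLASSES };

/* taken_under[a][b] is true if b has been acquired while a was held. */
static bool taken_under[NR_CLASSES][NR_CLASSES];

/* Record one observed ordering: 'acquired' was taken with 'held' held. */
static void record_order(enum lock_class held, enum lock_class acquired)
{
	assert(held < NR_CLASSES && acquired < NR_CLASSES);
	taken_under[held][acquired] = true;
}

/* Depth-first search: is 'target' reachable from 'from' via recorded edges? */
static bool reaches(enum lock_class from, enum lock_class target, bool *seen)
{
	if (from == target)
		return true;
	if (seen[from])
		return false;
	seen[from] = true;
	for (int next = 0; next < NR_CLASSES; next++)
		if (taken_under[from][next] && reaches(next, target, seen))
			return true;
	return false;
}

/*
 * Taking 'acquired' while holding 'held' closes a cycle if 'held' is
 * already reachable from 'acquired' through earlier orderings.
 */
static bool would_deadlock(enum lock_class held, enum lock_class acquired)
{
	bool seen[NR_CLASSES] = { false };

	return reaches(acquired, held, seen);
}
```

With "sb_start_write() under ->i_mutex" and "->i_mutex under
->cred_guard_mutex" recorded, the model flags taking ->cred_guard_mutex
inside sb_start_write(), which is what sendfile() from /proc/<pid>/auxv does.
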
> You are right. This is also reproducible on v4.8 mainline.
Not sure if this is related, but right after this lockdep warning happened and trinity,
run by a non-privileged user, finished inside the container, the host's systemctl
commands just hang or time out, which renders the whole system unusable.

# systemctl status docker
Failed to get properties: Connection timed out

# systemctl reboot (hang)

[ 5535.596651] INFO: task systemd-journal:1165 blocked for more than 120 seconds.
[ 5535.604728]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5535.611536] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5535.620285] systemd-journal D ffff880466167ca8 12672  1165      1 0x00000000
[ 5535.628182]  ffff880466167ca8 ffff880466167cd0 0000000000000000 ffff88086c6e2000
[ 5535.636504]  ffff88045deb0000 ffff880466168000 ffffffff81deb380 ffff88045deb0000
[ 5535.644817]  0000000000000246 00000000ffffffff ffff880466167cc0 ffffffff817cdaaf
[ 5535.653131] Call Trace:
[ 5535.655874]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5535.661425]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 5535.668617]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 5535.675237]  [<ffffffff81161d7e>] ? proc_cgroup_show+0x4e/0x300
[ 5535.681857]  [<ffffffff81252b01>] ? kmem_cache_alloc_trace+0x1d1/0x2e0
[ 5535.689162]  [<ffffffff81161d7e>] proc_cgroup_show+0x4e/0x300
[ 5535.695592]  [<ffffffff81302d40>] proc_single_show+0x50/0x90
[ 5535.701925]  [<ffffffff812ac983>] seq_read+0x113/0x3e0
[ 5535.707672]  [<ffffffff81280407>] __vfs_read+0x37/0x150
[ 5535.713521]  [<ffffffff81349ded>] ? security_file_permission+0x9d/0xc0
[ 5535.720819]  [<ffffffff812815ac>] vfs_read+0x8c/0x130
[ 5535.726472]  [<ffffffff81282ac8>] SyS_read+0x58/0xc0
[ 5535.732024]  [<ffffffff817d497c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
[ 5535.739221] INFO: lockdep is turned off.
[ 5535.743649] INFO: task kworker/3:1:52401 blocked for more than 120 seconds.
[ 5535.751429]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5535.758239] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5535.766989] kworker/3:1     D ffff8803b25bbca8 13368 52401      2 0x00000080
[ 5535.774904] Workqueue: cgroup_destroy css_release_work_fn
[ 5535.780940]  ffff8803b25bbca8 ffff8803b25bbcd0 0000000000000000 ffff88046ded2000
[ 5535.789254]  ffff88046af8a000 ffff8803b25bc000 ffffffff81deb380 ffff88046af8a000
[ 5535.797562]  0000000000000246 00000000ffffffff ffff8803b25bbcc0 ffffffff817cdaaf
[ 5535.805877] Call Trace:
[ 5535.808621]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5535.814177]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 5535.821379]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 5535.828001]  [<ffffffff811586af>] ? css_release_work_fn+0x2f/0x110
[ 5535.834911]  [<ffffffff811586af>] css_release_work_fn+0x2f/0x110
[ 5535.841629]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
[ 5535.848159]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
[ 5535.854876]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
[ 5535.861119]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
[ 5535.867847]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
[ 5535.873404]  [<ffffffff817d40ec>] ? _raw_spin_unlock_irq+0x2c/0x60
[ 5535.880320]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
[ 5535.886369]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230
[ 5535.893675] INFO: lockdep is turned off.
[ 5535.898085] INFO: task kworker/45:4:146035 blocked for more than 120 seconds.
[ 5535.906059]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5535.912865] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5535.921613] kworker/45:4    D ffff880853e9b950 14048 146035      2 0x00000080
[ 5535.929630] Workqueue: cgroup_destroy css_killed_work_fn
[ 5535.935582]  ffff880853e9b950 0000000000000000 0000000000000000 ffff88086c6da000
[ 5535.943882]  ffff88086c9e2000 ffff880853e9c000 ffff880853e9baa0 ffff88086c9e2000
[ 5535.952205]  ffff880853e9ba98 0000000000000001 ffff880853e9b968 ffffffff817cdaaf
[ 5535.960522] Call Trace:
[ 5535.963265]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5535.968817]  [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0
[ 5535.975346]  [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130
[ 5535.982256]  [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130
[ 5535.988972]  [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80
[ 5535.994804]  [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0
[ 5536.001227]  [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0
[ 5536.007648]  [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0
[ 5536.014654]  [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0
[ 5536.021657]  [<ffffffff810f57f5>] unregister_sched_domain_sysctl+0x15/0x40
[ 5536.029344]  [<ffffffff810d7704>] partition_sched_domains+0x44/0x450
[ 5536.036447]  [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0
[ 5536.043844]  [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0
[ 5536.051336]  [<ffffffff8116789d>] update_flag+0x11d/0x210
[ 5536.057373]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
[ 5536.064186]  [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60
[ 5536.070899]  [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10
[ 5536.077420]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
[ 5536.084234]  [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220
[ 5536.091049]  [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60
[ 5536.097571]  [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220
[ 5536.104207]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
[ 5536.110736]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
[ 5536.117461]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
[ 5536.123697]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
[ 5536.130426]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
[ 5536.135991]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
[ 5536.142041]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230
[ 5536.149345] INFO: lockdep is turned off.
[ 5585.148183] perf: interrupt took too long (3146 > 3136), lowering kernel.perf_event_max_sample_rate to 63000
[ 5658.479538] INFO: task systemd:1 blocked for more than 120 seconds.
[ 5658.486551]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5658.493352] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5658.502095] systemd         D ffff880468ccfca8 11952     1      0 0x00000000
[ 5658.509995]  ffff880468ccfca8 ffff880468ccfcd0 0000000000000000 ffff88046aa24000
[ 5658.518297]  ffff880468cd0000 ffff880468cd0000 ffffffff81deb380 ffff880468cd0000
[ 5658.526602]  0000000000000246 00000000ffffffff ffff880468ccfcc0 ffffffff817cdaaf
[ 5658.534909] Call Trace:
[ 5658.537645]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5658.543188]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 5658.550375]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 5658.556987]  [<ffffffff81161d7e>] ? proc_cgroup_show+0x4e/0x300
[ 5658.563600]  [<ffffffff81252b01>] ? kmem_cache_alloc_trace+0x1d1/0x2e0
[ 5658.570887]  [<ffffffff81161d7e>] proc_cgroup_show+0x4e/0x300
[ 5658.577304]  [<ffffffff81302d40>] proc_single_show+0x50/0x90
[ 5658.583620]  [<ffffffff812ac983>] seq_read+0x113/0x3e0
[ 5658.589355]  [<ffffffff81280407>] __vfs_read+0x37/0x150
[ 5658.595189]  [<ffffffff81349ded>] ? security_file_permission+0x9d/0xc0
[ 5658.602480]  [<ffffffff812815ac>] vfs_read+0x8c/0x130
[ 5658.608117]  [<ffffffff81282ac8>] SyS_read+0x58/0xc0
[ 5658.613661]  [<ffffffff817d497c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
[ 5658.620849] INFO: lockdep is turned off.
[ 5658.625282] INFO: task systemd-journal:1165 blocked for more than 120 seconds.
[ 5658.633346]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5658.640147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5658.648887] systemd-journal D ffff880466167ca8 12672  1165      1 0x00000000
[ 5658.656788]  ffff880466167ca8 ffff880466167cd0 0000000000000000 ffff88086c6e2000
[ 5658.665092]  ffff88045deb0000 ffff880466168000 ffffffff81deb380 ffff88045deb0000
[ 5658.673394]  0000000000000246 00000000ffffffff ffff880466167cc0 ffffffff817cdaaf
[ 5658.681690] Call Trace:
[ 5658.684419]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5658.689961]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 5658.697143]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 5658.703766]  [<ffffffff81161d7e>] ? proc_cgroup_show+0x4e/0x300
[ 5658.710373]  [<ffffffff81252b01>] ? kmem_cache_alloc_trace+0x1d1/0x2e0
[ 5658.717661]  [<ffffffff81161d7e>] proc_cgroup_show+0x4e/0x300
[ 5658.724067]  [<ffffffff81302d40>] proc_single_show+0x50/0x90
[ 5658.730386]  [<ffffffff812ac983>] seq_read+0x113/0x3e0
[ 5658.736123]  [<ffffffff81280407>] __vfs_read+0x37/0x150
[ 5658.741957]  [<ffffffff81349ded>] ? security_file_permission+0x9d/0xc0
[ 5658.749244]  [<ffffffff812815ac>] vfs_read+0x8c/0x130
[ 5658.754884]  [<ffffffff81282ac8>] SyS_read+0x58/0xc0
[ 5658.760417]  [<ffffffff817d497c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
[ 5658.767607] INFO: lockdep is turned off.
[ 5658.772016] INFO: task kworker/3:1:52401 blocked for more than 120 seconds.
[ 5658.779789]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5658.786582] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5658.795322] kworker/3:1     D ffff8803b25bbca8 13368 52401      2 0x00000080
[ 5658.803224] Workqueue: cgroup_destroy css_release_work_fn
[ 5658.809261]  ffff8803b25bbca8 ffff8803b25bbcd0 0000000000000000 ffff88046ded2000
[ 5658.817567]  ffff88046af8a000 ffff8803b25bc000 ffffffff81deb380 ffff88046af8a000
[ 5658.825871]  0000000000000246 00000000ffffffff ffff8803b25bbcc0 ffffffff817cdaaf
[ 5658.834173] Call Trace:
[ 5658.836904]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5658.842447]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 5658.849638]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 5658.856246]  [<ffffffff811586af>] ? css_release_work_fn+0x2f/0x110
[ 5658.863146]  [<ffffffff811586af>] css_release_work_fn+0x2f/0x110
[ 5658.869858]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
[ 5658.876370]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
[ 5658.883067]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
[ 5658.889287]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
[ 5658.895991]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
[ 5658.901538]  [<ffffffff817d40ec>] ? _raw_spin_unlock_irq+0x2c/0x60
[ 5658.908438]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
[ 5658.914466]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230
[ 5658.921745] INFO: lockdep is turned off.
[ 5658.926133] INFO: task kworker/45:4:146035 blocked for more than 120 seconds.
[ 5658.934099]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5658.940902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5658.949636] kworker/45:4    D ffff880853e9b950 14048 146035      2 0x00000080
[ 5658.957632] Workqueue: cgroup_destroy css_killed_work_fn
[ 5658.963574]  ffff880853e9b950 0000000000000000 0000000000000000 ffff88086c6da000
[ 5658.971877]  ffff88086c9e2000 ffff880853e9c000 ffff880853e9baa0 ffff88086c9e2000
[ 5658.980179]  ffff880853e9ba98 0000000000000001 ffff880853e9b968 ffffffff817cdaaf
[ 5658.988498] Call Trace:
[ 5658.991225]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5658.996768]  [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0
[ 5659.003271]  [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130
[ 5659.010161]  [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130
[ 5659.016871]  [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80
[ 5659.022706]  [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0
[ 5659.029120]  [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0
[ 5659.035535]  [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0
[ 5659.042529]  [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0
[ 5659.049528]  [<ffffffff810f57f5>] unregister_sched_domain_sysctl+0x15/0x40
[ 5659.057203]  [<ffffffff810d7704>] partition_sched_domains+0x44/0x450
[ 5659.064297]  [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0
[ 5659.071673]  [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0
[ 5659.079144]  [<ffffffff8116789d>] update_flag+0x11d/0x210
[ 5659.085172]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
[ 5659.091964]  [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60
[ 5659.098668]  [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10
[ 5659.105179]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
[ 5659.111982]  [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220
[ 5659.118783]  [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60
[ 5659.125296]  [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220
[ 5659.131906]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
[ 5659.138417]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
[ 5659.145124]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
[ 5659.151345]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
[ 5659.158044]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
[ 5659.163586]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
[ 5659.169605]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230
[ 5659.176892] INFO: lockdep is turned off.
[ 5781.364367] INFO: task systemd:1 blocked for more than 120 seconds.
[ 5781.371373]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5781.378177] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5781.386918] systemd         D ffff880468ccfca8 11952     1      0 0x00000000
[ 5781.394818]  ffff880468ccfca8 ffff880468ccfcd0 0000000000000000 ffff88046aa24000
[ 5781.403121]  ffff880468cd0000 ffff880468cd0000 ffffffff81deb380 ffff880468cd0000
[ 5781.411421]  0000000000000246 00000000ffffffff ffff880468ccfcc0 ffffffff817cdaaf
[ 5781.419725] Call Trace:
[ 5781.422460]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5781.428003]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 5781.435192]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 5781.441801]  [<ffffffff81161d7e>] ? proc_cgroup_show+0x4e/0x300
[ 5781.448404]  [<ffffffff81252b01>] ? kmem_cache_alloc_trace+0x1d1/0x2e0
[ 5781.455691]  [<ffffffff81161d7e>] proc_cgroup_show+0x4e/0x300
[ 5781.462109]  [<ffffffff81302d40>] proc_single_show+0x50/0x90
[ 5781.468428]  [<ffffffff812ac983>] seq_read+0x113/0x3e0
[ 5781.474165]  [<ffffffff81280407>] __vfs_read+0x37/0x150
[ 5781.479991]  [<ffffffff81349ded>] ? security_file_permission+0x9d/0xc0
[ 5781.487277]  [<ffffffff812815ac>] vfs_read+0x8c/0x130
[ 5781.492914]  [<ffffffff81282ac8>] SyS_read+0x58/0xc0
[ 5781.498455]  [<ffffffff817d497c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
[ 5781.505646] INFO: lockdep is turned off.
[ 5781.510085] INFO: task systemd-journal:1165 blocked for more than 120 seconds.
[ 5781.518146]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5781.524946] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5781.533686] systemd-journal D ffff880466167ca8 12672  1165      1 0x00000000
[ 5781.541581]  ffff880466167ca8 ffff880466167cd0 0000000000000000 ffff88086c6e2000
[ 5781.549880]  ffff88045deb0000 ffff880466168000 ffffffff81deb380 ffff88045deb0000
[ 5781.558186]  0000000000000246 00000000ffffffff ffff880466167cc0 ffffffff817cdaaf
[ 5781.566493] Call Trace:
[ 5781.569222]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5781.574764]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 5781.581953]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 5781.588559]  [<ffffffff81161d7e>] ? proc_cgroup_show+0x4e/0x300
[ 5781.595166]  [<ffffffff81252b01>] ? kmem_cache_alloc_trace+0x1d1/0x2e0
[ 5781.602451]  [<ffffffff81161d7e>] proc_cgroup_show+0x4e/0x300
[ 5781.608864]  [<ffffffff81302d40>] proc_single_show+0x50/0x90
[ 5781.615182]  [<ffffffff812ac983>] seq_read+0x113/0x3e0
[ 5781.620916]  [<ffffffff81280407>] __vfs_read+0x37/0x150
[ 5781.626749]  [<ffffffff81349ded>] ? security_file_permission+0x9d/0xc0
[ 5781.634035]  [<ffffffff812815ac>] vfs_read+0x8c/0x130
[ 5781.639673]  [<ffffffff81282ac8>] SyS_read+0x58/0xc0
[ 5781.645215]  [<ffffffff817d497c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
[ 5781.652403] INFO: lockdep is turned off.
[ 5781.656811] INFO: task kworker/3:1:52401 blocked for more than 120 seconds.
[ 5781.664583]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5781.671383] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5781.680121] kworker/3:1     D ffff8803b25bbca8 13368 52401      2 0x00000080
[ 5781.688021] Workqueue: cgroup_destroy css_release_work_fn
[ 5781.694057]  ffff8803b25bbca8 ffff8803b25bbcd0 0000000000000000 ffff88046ded2000
[ 5781.702356]  ffff88046af8a000 ffff8803b25bc000 ffffffff81deb380 ffff88046af8a000
[ 5781.710656]  0000000000000246 00000000ffffffff ffff8803b25bbcc0 ffffffff817cdaaf
[ 5781.718954] Call Trace:
[ 5781.721684]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5781.727224]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 5781.734414]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 5781.741021]  [<ffffffff811586af>] ? css_release_work_fn+0x2f/0x110
[ 5781.747919]  [<ffffffff811586af>] css_release_work_fn+0x2f/0x110
[ 5781.754626]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
[ 5781.761137]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
[ 5781.767841]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
[ 5781.774061]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
[ 5781.780765]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
[ 5781.786304]  [<ffffffff817d40ec>] ? _raw_spin_unlock_irq+0x2c/0x60
[ 5781.793203]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
[ 5781.799229]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230
[ 5781.806514] INFO: lockdep is turned off.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-04 16:21                                           ` CAI Qian
@ 2016-10-04 20:12                                             ` Al Viro
  2016-10-05 14:30                                               ` CAI Qian
  0 siblings, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-10-04 20:12 UTC (permalink / raw)
  To: CAI Qian
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Tue, Oct 04, 2016 at 12:21:28PM -0400, CAI Qian wrote:
> 
> > Not enough information, unfortunately (descriptor in question opened
> > outside of that log, sendfile(out_fd=578, in_fd=578, offset=0x7f8318a07000,
> > count=0x3ffc00) doesn't tell what *offset was before the call) ;-/
> > 
> > Anyway, I've found and fixed a bug in pipe_advance(), which might or might
> > not help with those.  Could you try vfs.git#work.splice_read (or #for-next)
> > and see if these persist?
> I am afraid that this can also be reproduced in the latest #for-next. The warning
> always showed up at the end of the trinity run. I captured more information this time.

OK, let's try to get more information about what's going on (this is on top
of either for-next or work.splice_read):

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index c97d661..a9cb9ff 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -313,6 +313,15 @@ static bool sanity(const struct iov_iter *i)
 	}
 	return true;
 Bad:
+	printk(KERN_ERR "idx = %d, offset = %zd\n", i->idx, i->iov_offset);
+	printk(KERN_ERR "curbuf = %d, nrbufs = %d, buffers = %d\n",
+			pipe->curbuf, pipe->nrbufs, pipe->buffers);
+	for (idx = 0; idx < pipe->buffers; idx++)
+		printk(KERN_ERR "[%p %p %d %d]\n",
+			pipe->bufs[idx].ops,
+			pipe->bufs[idx].page,
+			pipe->bufs[idx].offset,
+			pipe->bufs[idx].len);
 	WARN_ON(1);
 	return false;
 }
@@ -339,8 +348,11 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
 	if (unlikely(!bytes))
 		return 0;
 
-	if (!sanity(i))
+	if (!sanity(i)) {
+		printk(KERN_ERR "page = %p, offset = %zd, size = %zd\n",
+			page, offset, bytes);
 		return 0;
+	}
 
 	off = i->iov_offset;
 	idx = i->idx;
@@ -518,6 +530,8 @@ static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
 		addr += chunk;
 	}
 	i->count -= bytes;
+	if (!sanity(i))
+		printk(KERN_ERR "buggered after copy_to_iter\n");
 	return bytes;
 }
 
@@ -629,6 +643,8 @@ static size_t pipe_zero(size_t bytes, struct iov_iter *i)
 		n -= chunk;
 	}
 	i->count -= bytes;
+	if (!sanity(i))
+		printk(KERN_ERR "buggered after zero_iter\n");
 	return bytes;
 }
 
@@ -673,6 +689,8 @@ static void pipe_advance(struct iov_iter *i, size_t size)
 	struct pipe_buffer *buf;
 	int idx = i->idx;
 	size_t off = i->iov_offset;
+	struct iov_iter orig = *i;
+	size_t orig_size = size;
 	
 	if (unlikely(i->count < size))
 		size = i->count;
@@ -702,6 +720,9 @@ static void pipe_advance(struct iov_iter *i, size_t size)
 			pipe->nrbufs--;
 		}
 	}
+	if (!sanity(i))
+		printk(KERN_ERR "buggered pipe_advance by %zd from [%d.%zd]\n",
+			orig_size, orig.idx, orig.iov_offset);
 }
 
 void iov_iter_advance(struct iov_iter *i, size_t size)
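
For readers without the tree at hand, the invariant that the extra printks
help diagnose can be modelled in user space roughly as follows.  This is a
sketch from my reading of sanity(), not a copy of it: the struct and function
names are hypothetical, the field names mirror the ones printed in the patch,
and the ring size is fixed at 16 as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBUFS 16	/* assumed ring size */
static_assert((NBUFS & (NBUFS - 1)) == 0, "ring size must be a power of two");

struct buf_model {
	unsigned int offset, len;	/* occupied range inside the page */
};

struct pipe_model {
	unsigned int curbuf;		/* first occupied slot */
	unsigned int nrbufs;		/* number of occupied slots */
	struct buf_model bufs[NBUFS];
};

struct iter_model {
	unsigned int idx;		/* slot the iterator writes to next */
	size_t iov_offset;		/* position in that slot, 0 if fresh */
};

/* True if the iterator points exactly at the tail of the pipe's data. */
static bool sanity_model(const struct pipe_model *pipe,
			 const struct iter_model *i)
{
	unsigned int next = pipe->curbuf + pipe->nrbufs;

	if (i->iov_offset) {
		const struct buf_model *p;

		if (!pipe->nrbufs)
			return false;	/* pipe must be non-empty */
		if (i->idx != ((next - 1) & (NBUFS - 1)))
			return false;	/* must sit in the last buffer */
		p = &pipe->bufs[i->idx];
		return p->offset + p->len == i->iov_offset;
	}
	/* Otherwise it must sit in the first free slot after the data. */
	return i->idx == (next & (NBUFS - 1));
}
```

Any path that moves idx/iov_offset without keeping curbuf/nrbufs in step (or
the other way round) makes this check fail, which is why the patch re-checks
after each primitive that advances the iterator.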

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-04 17:39                                     ` local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked) CAI Qian
@ 2016-10-04 21:42                                       ` tj
  2016-10-05 14:09                                         ` CAI Qian
  0 siblings, 1 reply; 104+ messages in thread
From: tj @ 2016-10-04 21:42 UTC (permalink / raw)
  To: CAI Qian
  Cc: Al Viro, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

Hello, CAI.

On Tue, Oct 04, 2016 at 01:39:11PM -0400, CAI Qian wrote:
...
> Not sure if this is related, but right after this lockdep warning happened and trinity,
> run by a non-privileged user, finished inside the container, the host's systemctl
> commands just hang or time out, which renders the whole system unusable.
> 
> # systemctl status docker
> Failed to get properties: Connection timed out
> 
> # systemctl reboot (hang)
> 
...
> [ 5535.893675] INFO: lockdep is turned off.
> [ 5535.898085] INFO: task kworker/45:4:146035 blocked for more than 120 seconds.
> [ 5535.906059]       Tainted: G        W       4.8.0-rc8-fornext+ #1
> [ 5535.912865] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 5535.921613] kworker/45:4    D ffff880853e9b950 14048 146035      2 0x00000080
> [ 5535.929630] Workqueue: cgroup_destroy css_killed_work_fn
> [ 5535.935582]  ffff880853e9b950 0000000000000000 0000000000000000 ffff88086c6da000
> [ 5535.943882]  ffff88086c9e2000 ffff880853e9c000 ffff880853e9baa0 ffff88086c9e2000
> [ 5535.952205]  ffff880853e9ba98 0000000000000001 ffff880853e9b968 ffffffff817cdaaf
> [ 5535.960522] Call Trace:
> [ 5535.963265]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 5535.968817]  [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0
> [ 5535.975346]  [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130
> [ 5535.982256]  [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130
> [ 5535.988972]  [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80
> [ 5535.994804]  [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0
> [ 5536.001227]  [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0
> [ 5536.007648]  [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0
> [ 5536.014654]  [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0
> [ 5536.021657]  [<ffffffff810f57f5>] unregister_sched_domain_sysctl+0x15/0x40
> [ 5536.029344]  [<ffffffff810d7704>] partition_sched_domains+0x44/0x450
> [ 5536.036447]  [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0
> [ 5536.043844]  [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0
> [ 5536.051336]  [<ffffffff8116789d>] update_flag+0x11d/0x210
> [ 5536.057373]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
> [ 5536.064186]  [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60
> [ 5536.070899]  [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10
> [ 5536.077420]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
> [ 5536.084234]  [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220
> [ 5536.091049]  [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60
> [ 5536.097571]  [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220
> [ 5536.104207]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
> [ 5536.110736]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
> [ 5536.117461]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
> [ 5536.123697]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
> [ 5536.130426]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
> [ 5536.135991]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
> [ 5536.142041]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230

This one seems to be the offender.  cgroup is trying to offline a
cpuset css, which takes place under cgroup_mutex.  The offlining ends
up trying to drain active usages of a sysctl table, which apparently is
not happening.  Did something hang or crash while trying to generate
sysctl content?
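
To illustrate what "drain active usages" means here, a minimal user-space
model of the use-count handshake follows.  The names are loosely modelled on
the sysctl code but everything below is a simplified assumption, not the
kernel implementation; in particular the real unregister path sleeps on a
completion instead of returning false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-table bookkeeping. */
struct table_model {
	int used;		/* readers currently inside the handlers */
	bool unregistering;	/* set once removal has started */
};

/* A reader enters; it is refused once removal has begun. */
static bool use_table(struct table_model *t)
{
	if (t->unregistering)
		return false;
	t->used++;
	return true;
}

/* A reader leaves. */
static void unuse_table(struct table_model *t)
{
	assert(t->used > 0);
	t->used--;
}

/*
 * Begin removal.  Returns true if removal can finish right away, false if
 * it has to keep waiting for the remaining readers to leave.
 */
static bool start_unregistering(struct table_model *t)
{
	t->unregistering = true;
	return t->used == 0;
}

/* Removal may complete only once every reader has left. */
static bool drained(const struct table_model *t)
{
	return t->unregistering && t->used == 0;
}
```

If a single reader never calls unuse_table() (because it hung or crashed
mid-handler), drained() never becomes true, and whoever is waiting for it
keeps holding whatever it took on the way in, cgroup_mutex in this trace.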

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-04 21:42                                       ` tj
@ 2016-10-05 14:09                                         ` CAI Qian
  2016-10-05 15:30                                           ` tj
  0 siblings, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-10-05 14:09 UTC (permalink / raw)
  To: tj
  Cc: Al Viro, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> From: "tj" <tj@kernel.org>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Al Viro" <viro@ZenIV.linux.org.uk>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner"
> <david@fromorbit.com>, "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin"
> <npiggin@gmail.com>, linux-fsdevel@vger.kernel.org
> Sent: Tuesday, October 4, 2016 5:42:19 PM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> Hello, CAI.
> 
> On Tue, Oct 04, 2016 at 01:39:11PM -0400, CAI Qian wrote:
> ...
> > Not sure if related, but right after this lockdep happened and trinity
> > running by a
> > non-privileged user finished inside the container. The host's systemctl
> > command just
> > hang or timeout which renders the whole system unusable.
> > 
> > # systemctl status docker
> > Failed to get properties: Connection timed out
> > 
> > # systemctl reboot (hang)
> > 
> ...
> > [ 5535.893675] INFO: lockdep is turned off.
> > [ 5535.898085] INFO: task kworker/45:4:146035 blocked for more than 120
> > seconds.
> > [ 5535.906059]       Tainted: G        W       4.8.0-rc8-fornext+ #1
> > [ 5535.912865] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > this message.
> > [ 5535.921613] kworker/45:4    D ffff880853e9b950 14048 146035      2
> > 0x00000080
> > [ 5535.929630] Workqueue: cgroup_destroy css_killed_work_fn
> > [ 5535.935582]  ffff880853e9b950 0000000000000000 0000000000000000
> > ffff88086c6da000
> > [ 5535.943882]  ffff88086c9e2000 ffff880853e9c000 ffff880853e9baa0
> > ffff88086c9e2000
> > [ 5535.952205]  ffff880853e9ba98 0000000000000001 ffff880853e9b968
> > ffffffff817cdaaf
> > [ 5535.960522] Call Trace:
> > [ 5535.963265]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> > [ 5535.968817]  [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0
> > [ 5535.975346]  [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130
> > [ 5535.982256]  [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130
> > [ 5535.988972]  [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80
> > [ 5535.994804]  [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0
> > [ 5536.001227]  [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0
> > [ 5536.007648]  [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0
> > [ 5536.014654]  [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0
> > [ 5536.021657]  [<ffffffff810f57f5>]
> > unregister_sched_domain_sysctl+0x15/0x40
> > [ 5536.029344]  [<ffffffff810d7704>] partition_sched_domains+0x44/0x450
> > [ 5536.036447]  [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0
> > [ 5536.043844]  [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0
> > [ 5536.051336]  [<ffffffff8116789d>] update_flag+0x11d/0x210
> > [ 5536.057373]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
> > [ 5536.064186]  [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60
> > [ 5536.070899]  [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10
> > [ 5536.077420]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
> > [ 5536.084234]  [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220
> > [ 5536.091049]  [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60
> > [ 5536.097571]  [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220
> > [ 5536.104207]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
> > [ 5536.110736]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
> > [ 5536.117461]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
> > [ 5536.123697]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
> > [ 5536.130426]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
> > [ 5536.135991]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
> > [ 5536.142041]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230
> 
> This one seems to be the offender.  cgroup is trying to offline a
> cpuset css, which takes place under cgroup_mutex.  The offlining ends
> up trying to drain active usages of a sysctl table which apparently is
> not happening.  Did something hang or crash while trying to generate
> sysctl content?
Hmm, I am not sure, since trinity was running as a non-privileged
user, which can only read content from /proc or /sys.
    CAI Qian

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-04 20:12                                             ` Al Viro
@ 2016-10-05 14:30                                               ` CAI Qian
  2016-10-05 16:07                                                 ` Al Viro
  0 siblings, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-10-05 14:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel



----- Original Message -----
> From: "Al Viro" <viro@ZenIV.linux.org.uk>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Tuesday, October 4, 2016 4:12:33 PM
> Subject: Re: [RFC][CFT] splice_read reworked
> 
> On Tue, Oct 04, 2016 at 12:21:28PM -0400, CAI Qian wrote:
> > 
> > > Not enough information, unfortunately (descriptor in question opened
> > > outside of that log, sendfile(out_fd=578, in_fd=578,
> > > offset=0x7f8318a07000,
> > > count=0x3ffc00) doesn't tell what *offset was before the call) ;-/
> > > 
> > > Anyway, I've found and fixed a bug in pipe_advance(), which might or
> > > might
> > > not help with those.  Could you try vfs.git#work.splice_read (or
> > > #for-next)
> > > and see if these persist?
> > I am afraid that this can also reproduced in the latest #for-next . The
> > warning
> > always showed up at the end of trinity run. I captured more information
> > this time.
> 
> OK, let's try to get more information about what's going on (this is on top
> of either for-next or work.splice_read):
Here you go,

http://people.redhat.com/qcai/tmp/trinity-child89.log


[  856.537452] idx = 0, offset = 12
[  856.541066] curbuf = 0, nrbufs = 1, buffers = 1
[  856.546149] [ffffffff81836660 ffffea001e2e1ec0 0 12]
[  856.551750] ------------[ cut here ]------------
[  856.556921] WARNING: CPU: 24 PID: 13756 at lib/iov_iter.c:325 sanity+0xdb/0xe2
[  856.565000] Modules linked in: ieee802154_socket ieee802154 af_key vmw_vsock_vmci_transport vsock vmw_vmci bluetooth rfkill can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr mei_me i2c_i801 ipmi_ssif sg i2c_smbus mei shpchp lpc_ich wmi ipmi_si ipmi_msghandler acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm ixgbe ahci drm mdio libahci ptp libata pps_core i2c_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[  856.683348] CPU: 27 PID: 13756 Comm: trinity-c89 Not tainted 4.8.0-rc8-fornext-debug+ #2
[  856.692380] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[  856.703736]  0000000000000286 00000000cf291d96 ffff8803c355fae0 ffffffff813d30ac
[  856.712034]  0000000000000000 0000000000000000 ffff8803c355fb20 ffffffff8109cf31
[  856.720329]  00000145c355fb00 ffff8804586e3200 0000000000000001 0000000000000000
[  856.728627] Call Trace:
[  856.731362]  [<ffffffff813d30ac>] dump_stack+0x85/0xc9
[  856.737099]  [<ffffffff8109cf31>] __warn+0xd1/0xf0
[  856.742444]  [<ffffffff8109d06d>] warn_slowpath_null+0x1d/0x20
[  856.748953]  [<ffffffff81418ff8>] sanity+0xdb/0xe2
[  856.754299]  [<ffffffff813e9676>] iov_iter_advance+0x1d6/0x3c0
[  856.760810]  [<ffffffff812bc7d3>] default_file_splice_read+0x223/0x2c0
[  856.768099]  [<ffffffff812503bb>] ? __slab_free+0x9b/0x270
[  856.774222]  [<ffffffff811222d8>] ? __call_rcu+0xd8/0x380
[  856.780258]  [<ffffffff810cbaa9>] ? __might_sleep+0x49/0x80
[  856.786480]  [<ffffffff81349ded>] ? security_file_permission+0x9d/0xc0
[  856.793777]  [<ffffffff812bbb13>] do_splice_to+0x73/0x90
[  856.799703]  [<ffffffff812bbc1b>] splice_direct_to_actor+0xeb/0x220
[  856.806696]  [<ffffffff812bb0e0>] ? generic_pipe_buf_nosteal+0x10/0x10
[  856.813982]  [<ffffffff812bbdd9>] do_splice_direct+0x89/0xd0
[  856.820299]  [<ffffffff8128263e>] do_sendfile+0x1ce/0x3b0
[  856.826323]  [<ffffffff812831ef>] SyS_sendfile64+0x6f/0xd0
[  856.832445]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[  856.838568]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[  856.845810] ---[ end trace 702eb33216129766 ]---
[  856.851032] buggered pipe_advance by 12 from [0.0]

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-05 14:09                                         ` CAI Qian
@ 2016-10-05 15:30                                           ` tj
  2016-10-05 15:54                                             ` CAI Qian
  0 siblings, 1 reply; 104+ messages in thread
From: tj @ 2016-10-05 15:30 UTC (permalink / raw)
  To: CAI Qian
  Cc: Al Viro, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

Hello, CAI.

On Wed, Oct 05, 2016 at 10:09:39AM -0400, CAI Qian wrote:
> > This one seems to be the offender.  cgroup is trying to offline a
> > cpuset css, which takes place under cgroup_mutex.  The offlining ends
> > up trying to drain active usages of a sysctl table which apparently is
> > not happening.  Did something hang or crash while trying to generate
> > sysctl content?
>
> Hmm, I am not sure, since trinity was running as a non-privileged
> user, which can only read content from /proc or /sys.

So, userland, privileged or not, can't cause this.  The ref is held
only while the kernel code is operating to generate content or
iterating, which shouldn't be affected by userland actions.  This is
caused by kernel code hanging or crashing while holding a ref.
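
To make the reasoning concrete, here is a rough single-threaded model of
the drain, not the actual sysctl code: struct table_model and the model_*
helpers are names made up for illustration.  It only shows that one
reference that is never dropped keeps the unregister side waiting forever.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-table bookkeeping; not the kernel struct. */
struct table_model {
	int used;            /* readers currently generating content */
	bool unregistering;  /* set once removal has started */
};

/* A reader takes a reference; refused once removal has begun. */
bool model_use(struct table_model *t)
{
	if (t->unregistering)
		return false;
	t->used++;
	return true;
}

/* The reader drops its reference when it is done. */
void model_unuse(struct table_model *t)
{
	assert(t->used > 0);
	t->used--;
}

/* Removal starts; true if it can finish at once, false if it has to wait. */
bool model_start_unregister(struct table_model *t)
{
	t->unregistering = true;
	return t->used == 0;
}

/* The condition the blocked waiter needs: removal started, no users left. */
bool model_drained(const struct table_model *t)
{
	return t->unregistering && t->used == 0;
}
```

If a reader gets stuck between model_use() and model_unuse(),
model_drained() never becomes true, which would match a worker sitting in
wait_for_completion() as in the hung-task trace.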

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-05 15:30                                           ` tj
@ 2016-10-05 15:54                                             ` CAI Qian
  2016-10-05 18:57                                               ` CAI Qian
  0 siblings, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-10-05 15:54 UTC (permalink / raw)
  To: tj
  Cc: Al Viro, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> From: "tj" <tj@kernel.org>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Al Viro" <viro@ZenIV.linux.org.uk>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner"
> <david@fromorbit.com>, "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin"
> <npiggin@gmail.com>, linux-fsdevel@vger.kernel.org
> Sent: Wednesday, October 5, 2016 11:30:14 AM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> Hello, CAI.
> 
> On Wed, Oct 05, 2016 at 10:09:39AM -0400, CAI Qian wrote:
> > > This one seems to be the offender.  cgroup is trying to offline a
> > > cpuset css, which takes place under cgroup_mutex.  The offlining ends
> > > up trying to drain active usages of a sysctl table which apparently is
> > > not happening.  Did something hang or crash while trying to generate
> > > sysctl content?
> >
> > Hmm, I am not sure, since trinity was running as a non-privileged
> > user, which can only read content from /proc or /sys.
> 
> So, userland, privileged or not, can't cause this.  The ref is held
> only while the kernel code is operating to generate content or
> iterating, which shouldn't be affected by userland actions.  This is
> caused by kernel code hanging or crashing while holding a ref.
Right, trinity calls many different random syscalls with random options on
those /proc/ and /sys/ files and generates lots of different errnos. It is
likely that some error path out there causes the hang or crash.
    CAI Qian

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-05 14:30                                               ` CAI Qian
@ 2016-10-05 16:07                                                 ` Al Viro
  0 siblings, 0 replies; 104+ messages in thread
From: Al Viro @ 2016-10-05 16:07 UTC (permalink / raw)
  To: CAI Qian
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Wed, Oct 05, 2016 at 10:30:46AM -0400, CAI Qian wrote:

> [  856.537452] idx = 0, offset = 12
> [  856.541066] curbuf = 0, nrbufs = 1, buffers = 1
					^^^^^^^^^^^^

Lovely - that's pretty much guaranteed to make sanity() spew false
positives.
        int delta = (pipe->curbuf + pipe->nrbufs - idx) & (pipe->buffers - 1);
        if (i->iov_offset) {
                struct pipe_buffer *p;
                if (unlikely(delta != 1) || unlikely(!pipe->nrbufs))
                        goto Bad;       // must be at the last buffer...
and at the last buffer it is - idx == (curbuf + nrbufs - 1) % pipe->buffers.
The test would've done the right thing if pipe->buffers had been at least 2,
but...  OK, the patch below ought to fix those; could you check if anything
remains with it?
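
For the record, a small userspace sketch of the arithmetic, assuming only
the three pipe fields involved (struct pipe_model and both function names
are made up for illustration, not kernel API).  With buffers == 1 the mask
is 0, so the old delta can never be 1, while comparing idx against the
masked last slot still works:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the pipe fields that sanity() looks at. */
struct pipe_model {
	unsigned int curbuf;   /* index of the first occupied slot */
	unsigned int nrbufs;   /* number of occupied slots */
	unsigned int buffers;  /* ring size, assumed to be a power of two */
};

/* Old test: "idx is the last buffer" expressed as delta == 1. */
bool at_last_buffer_old(const struct pipe_model *p, unsigned int idx)
{
	unsigned int delta;

	assert(p->buffers && !(p->buffers & (p->buffers - 1)));
	delta = (p->curbuf + p->nrbufs - idx) & (p->buffers - 1);
	return p->nrbufs && delta == 1;
}

/* Patched test: compare idx with the masked index of the last slot. */
bool at_last_buffer_new(const struct pipe_model *p, unsigned int idx)
{
	unsigned int next = p->curbuf + p->nrbufs;

	assert(p->buffers && !(p->buffers & (p->buffers - 1)));
	return p->nrbufs && idx == ((next - 1) & (p->buffers - 1));
}
```

Feeding it the values from the log (curbuf = 0, nrbufs = 1, buffers = 1,
idx = 0) makes the old check fail and the new one pass; for larger rings
the two agree.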

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index c97d661..0ce3411 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -298,21 +298,32 @@ static bool sanity(const struct iov_iter *i)
 {
 	struct pipe_inode_info *pipe = i->pipe;
 	int idx = i->idx;
-	int delta = (pipe->curbuf + pipe->nrbufs - idx) & (pipe->buffers - 1);
+	int next = pipe->curbuf + pipe->nrbufs;
 	if (i->iov_offset) {
 		struct pipe_buffer *p;
-		if (unlikely(delta != 1) || unlikely(!pipe->nrbufs))
+		if (unlikely(!pipe->nrbufs))
+			goto Bad;	// pipe must be non-empty
+		if (unlikely(idx != ((next - 1) & (pipe->buffers - 1))))
 			goto Bad;	// must be at the last buffer...
 
 		p = &pipe->bufs[idx];
 		if (unlikely(p->offset + p->len != i->iov_offset))
 			goto Bad;	// ... at the end of segment
 	} else {
-		if (delta)
+		if (idx != (next & (pipe->buffers - 1)))
 			goto Bad;	// must be right after the last buffer
 	}
 	return true;
 Bad:
+	printk(KERN_ERR "idx = %d, offset = %zd\n", i->idx, i->iov_offset);
+	printk(KERN_ERR "curbuf = %d, nrbufs = %d, buffers = %d\n",
+			pipe->curbuf, pipe->nrbufs, pipe->buffers);
+	for (idx = 0; idx < pipe->buffers; idx++)
+		printk(KERN_ERR "[%p %p %d %d]\n",
+			pipe->bufs[idx].ops,
+			pipe->bufs[idx].page,
+			pipe->bufs[idx].offset,
+			pipe->bufs[idx].len);
 	WARN_ON(1);
 	return false;
 }

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-05 15:54                                             ` CAI Qian
@ 2016-10-05 18:57                                               ` CAI Qian
  2016-10-05 20:05                                                 ` Al Viro
  0 siblings, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-10-05 18:57 UTC (permalink / raw)
  To: tj
  Cc: Al Viro, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> From: "CAI Qian" <caiqian@redhat.com>
> To: "tj" <tj@kernel.org>
> Cc: "Al Viro" <viro@ZenIV.linux.org.uk>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner"
> <david@fromorbit.com>, "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin"
> <npiggin@gmail.com>, linux-fsdevel@vger.kernel.org
> Sent: Wednesday, October 5, 2016 11:54:48 AM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> 
> 
> ----- Original Message -----
> > From: "tj" <tj@kernel.org>
> > To: "CAI Qian" <caiqian@redhat.com>
> > Cc: "Al Viro" <viro@ZenIV.linux.org.uk>, "Linus Torvalds"
> > <torvalds@linux-foundation.org>, "Dave Chinner"
> > <david@fromorbit.com>, "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens
> > Axboe" <axboe@kernel.dk>, "Nick Piggin"
> > <npiggin@gmail.com>, linux-fsdevel@vger.kernel.org
> > Sent: Wednesday, October 5, 2016 11:30:14 AM
> > Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT]
> > splice_read reworked)
> > 
> > Hello, CAI.
> > 
> > On Wed, Oct 05, 2016 at 10:09:39AM -0400, CAI Qian wrote:
> > > > This one seems to be the offender.  cgroup is trying to offline a
> > > > cpuset css, which takes place under cgroup_mutex.  The offlining ends
> > > > up trying to drain active usages of a sysctl table which apparently is
> > > > not happening.  Did something hang or crash while trying to generate
> > > > sysctl content?
> > >
> > > Hmm, I am not sure, since trinity was running as a non-privileged
> > > user, which can only read content from /proc or /sys.
> > 
> > So, userland, privileged or not, can't cause this.  The ref is held
> > only while the kernel code is operating to generate content or
> > iterating, which shouldn't be affected by userland actions.  This is
> > caused by kernel code hanging or crashing while holding a ref.
> Right, trinity calls many different random syscalls with random options on
> those /proc/ and /sys/ files and generates lots of different errnos. It is
> likely that some error path out there causes the hang or crash.
Tejun,

Not sure if this is related, but the procfs lockdep report below always shows
up before the cgroup hang unless it is masked by other lockdep issues. Also,
this hang is always reproducible.

[ 4787.875980] 
[ 4787.877645] ======================================================
[ 4787.884540] [ INFO: possible circular locking dependency detected ]
[ 4787.891533] 4.8.0-rc8-usrns-scale+ #8 Tainted: G        W      
[ 4787.898138] -------------------------------------------------------
[ 4787.905130] trinity-c116/106905 is trying to acquire lock:
[ 4787.911251]  (&p->lock){+.+.+.}, at: [<ffffffff812aca8c>] seq_read+0x4c/0x3e0
[ 4787.919264] 
[ 4787.919264] but task is already holding lock:
[ 4787.925773]  (sb_writers#8){.+.+.+}, at: [<ffffffff81284367>] __sb_start_write+0xb7/0xf0
[ 4787.934854] 
[ 4787.934854] which lock already depends on the new lock.
[ 4787.934854] 
[ 4787.943981] 
[ 4787.943981] the existing dependency chain (in reverse order) is:
[ 4787.952333] 
-> #3 (sb_writers#8){.+.+.+}:
[ 4787.957050]        [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4787.963960]        [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4787.970577]        [<ffffffff810f769a>] percpu_down_read+0x4a/0xa0
[ 4787.977487]        [<ffffffff81284367>] __sb_start_write+0xb7/0xf0
[ 4787.984395]        [<ffffffff812a8974>] mnt_want_write+0x24/0x50
[ 4787.991110]        [<ffffffffa05049af>] ovl_want_write+0x1f/0x30 [overlay]
[ 4787.998799]        [<ffffffffa05070c2>] ovl_do_remove+0x42/0x4a0 [overlay]
[ 4788.006483]        [<ffffffffa0507536>] ovl_rmdir+0x16/0x20 [overlay]
[ 4788.013682]        [<ffffffff8128d357>] vfs_rmdir+0xb7/0x130
[ 4788.020009]        [<ffffffff81292ed3>] do_rmdir+0x183/0x1f0
[ 4788.026335]        [<ffffffff81293cf2>] SyS_unlinkat+0x22/0x30
[ 4788.032853]        [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.039576]        [<ffffffff817d927f>] return_from_SYSCALL_64+0x0/0x7a
[ 4788.046962] 
-> #2 (&sb->s_type->i_mutex_key#16){++++++}:
[ 4788.053140]        [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4788.060049]        [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4788.066664]        [<ffffffff817d60e7>] down_read+0x47/0x70
[ 4788.072893]        [<ffffffff8128ce79>] lookup_slow+0xc9/0x200
[ 4788.079410]        [<ffffffff81290b9c>] walk_component+0x1ec/0x310
[ 4788.086315]        [<ffffffff81290e5f>] link_path_walk+0x19f/0x5f0
[ 4788.093219]        [<ffffffff8129151d>] path_openat+0xdd/0xb80
[ 4788.099748]        [<ffffffff81293511>] do_filp_open+0x91/0x100
[ 4788.106362]        [<ffffffff81286f56>] do_open_execat+0x76/0x180
[ 4788.113186]        [<ffffffff8128747b>] open_exec+0x2b/0x50
[ 4788.119404]        [<ffffffff812ec61d>] load_elf_binary+0x28d/0x1120
[ 4788.126511]        [<ffffffff81288487>] search_binary_handler+0x97/0x1c0
[ 4788.134002]        [<ffffffff81289619>] do_execveat_common.isra.36+0x6a9/0x9f0
[ 4788.142071]        [<ffffffff81289c4a>] SyS_execve+0x3a/0x50
[ 4788.148398]        [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.155110]        [<ffffffff817d927f>] return_from_SYSCALL_64+0x0/0x7a
[ 4788.162502] 
-> #1 (&sig->cred_guard_mutex){+.+.+.}:
[ 4788.168179]        [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4788.175085]        [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4788.181712]        [<ffffffff817d4557>] mutex_lock_killable_nested+0x87/0x500
[ 4788.189695]        [<ffffffff81099599>] mm_access+0x29/0xa0
[ 4788.195924]        [<ffffffff81302b6c>] proc_pid_auxv+0x1c/0x70
[ 4788.202540]        [<ffffffff813039d0>] proc_single_show+0x50/0x90
[ 4788.209445]        [<ffffffff812acb48>] seq_read+0x108/0x3e0
[ 4788.215774]        [<ffffffff8127fb07>] __vfs_read+0x37/0x150
[ 4788.222198]        [<ffffffff81280d35>] vfs_read+0x95/0x140
[ 4788.228425]        [<ffffffff81282268>] SyS_read+0x58/0xc0
[ 4788.234557]        [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.241268]        [<ffffffff817d927f>] return_from_SYSCALL_64+0x0/0x7a
[ 4788.248660] 
-> #0 (&p->lock){+.+.+.}:
[ 4788.252987]        [<ffffffff810fc062>] validate_chain.isra.37+0xe72/0x1150
[ 4788.260769]        [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4788.267676]        [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4788.274302]        [<ffffffff817d3807>] mutex_lock_nested+0x77/0x430
[ 4788.281406]        [<ffffffff812aca8c>] seq_read+0x4c/0x3e0
[ 4788.287633]        [<ffffffff81316b39>] kernfs_fop_read+0x129/0x1b0
[ 4788.294659]        [<ffffffff8127fca3>] do_loop_readv_writev+0x83/0xc0
[ 4788.301954]        [<ffffffff812811a8>] do_readv_writev+0x218/0x240
[ 4788.308959]        [<ffffffff81281209>] vfs_readv+0x39/0x50
[ 4788.315188]        [<ffffffff812bc6b1>] default_file_splice_read+0x1a1/0x2b0
[ 4788.323070]        [<ffffffff812bc206>] do_splice_to+0x76/0x90
[ 4788.329587]        [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
[ 4788.337173]        [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
[ 4788.344078]        [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
[ 4788.350694]        [<ffffffff812829c9>] SyS_sendfile64+0xc9/0xd0
[ 4788.357405]        [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.364119]        [<ffffffff817d927f>] return_from_SYSCALL_64+0x0/0x7a
[ 4788.371511] 
[ 4788.371511] other info that might help us debug this:
[ 4788.371511] 
[ 4788.380443] Chain exists of:
  &p->lock --> &sb->s_type->i_mutex_key#16 --> sb_writers#8

[ 4788.389881]  Possible unsafe locking scenario:
[ 4788.389881] 
[ 4788.396497]        CPU0                    CPU1
[ 4788.401549]        ----                    ----
[ 4788.406614]   lock(sb_writers#8);
[ 4788.410352]                                lock(&sb->s_type->i_mutex_key#16);
[ 4788.418354]                                lock(sb_writers#8);
[ 4788.424902]   lock(&p->lock);
[ 4788.428229] 
[ 4788.428229]  *** DEADLOCK ***
[ 4788.428229] 
[ 4788.434836] 1 lock held by trinity-c116/106905:
[ 4788.439888]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81284367>] __sb_start_write+0xb7/0xf0
[ 4788.449473] 
[ 4788.449473] stack backtrace:
[ 4788.454334] CPU: 16 PID: 106905 Comm: trinity-c116 Tainted: G        W       4.8.0-rc8-usrns-scale+ #8
[ 4788.464719] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 4788.476076]  0000000000000086 00000000cbfc6314 ffff8803ce78b760 ffffffff813d5e93
[ 4788.484371]  ffffffff82a3fbd0 ffffffff82a94890 ffff8803ce78b7a0 ffffffff810fa6ec
[ 4788.492663]  ffff8803ce78b7e0 ffff8802ead08000 0000000000000001 ffff8802ead08ca0
[ 4788.500966] Call Trace:
[ 4788.503694]  [<ffffffff813d5e93>] dump_stack+0x85/0xc2
[ 4788.509426]  [<ffffffff810fa6ec>] print_circular_bug+0x1ec/0x260
[ 4788.516128]  [<ffffffff810fc062>] validate_chain.isra.37+0xe72/0x1150
[ 4788.523319]  [<ffffffff811d4491>] ? ___perf_sw_event+0x171/0x290
[ 4788.530022]  [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4788.536335]  [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4788.542359]  [<ffffffff812aca8c>] ? seq_read+0x4c/0x3e0
[ 4788.548188]  [<ffffffff812aca8c>] ? seq_read+0x4c/0x3e0
[ 4788.554019]  [<ffffffff817d3807>] mutex_lock_nested+0x77/0x430
[ 4788.560528]  [<ffffffff812aca8c>] ? seq_read+0x4c/0x3e0
[ 4788.566358]  [<ffffffff812aca8c>] seq_read+0x4c/0x3e0
[ 4788.571995]  [<ffffffff81316a10>] ? kernfs_fop_open+0x3a0/0x3a0
[ 4788.578600]  [<ffffffff81316b39>] kernfs_fop_read+0x129/0x1b0
[ 4788.585012]  [<ffffffff81316a10>] ? kernfs_fop_open+0x3a0/0x3a0
[ 4788.591617]  [<ffffffff8127fca3>] do_loop_readv_writev+0x83/0xc0
[ 4788.598318]  [<ffffffff81316a10>] ? kernfs_fop_open+0x3a0/0x3a0
[ 4788.604924]  [<ffffffff812811a8>] do_readv_writev+0x218/0x240
[ 4788.611347]  [<ffffffff813e9535>] ? push_pipe+0xd5/0x190
[ 4788.617278]  [<ffffffff813ecec0>] ? iov_iter_get_pages_alloc+0x250/0x400
[ 4788.624746]  [<ffffffff81281209>] vfs_readv+0x39/0x50
[ 4788.630381]  [<ffffffff812bc6b1>] default_file_splice_read+0x1a1/0x2b0
[ 4788.637668]  [<ffffffff8134ae20>] ? security_file_permission+0xa0/0xc0
[ 4788.644954]  [<ffffffff812bc206>] do_splice_to+0x76/0x90
[ 4788.650880]  [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
[ 4788.657872]  [<ffffffff812bba80>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 4788.665157]  [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
[ 4788.671472]  [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
[ 4788.677499]  [<ffffffff812829c9>] SyS_sendfile64+0xc9/0xd0
[ 4788.683622]  [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.689744]  [<ffffffff817d927f>] entry_SYSCALL64_slow_path+0x25/0x25
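
The chain in the report above can be replayed with a toy lock-order graph.
This is only a sketch of the idea, not how lockdep is implemented: the enum
values, struct lock_graph and the lock_graph_* helpers are hypothetical
names.  Each edge means "the second lock was taken while the first was
held"; an edge that makes the held lock reachable from the wanted one
closes a cycle.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* The four lock classes named in the report. */
enum lock_class {
	P_LOCK,            /* &p->lock (seq_file) */
	CRED_GUARD_MUTEX,  /* &sig->cred_guard_mutex */
	I_MUTEX_KEY,       /* &sb->s_type->i_mutex_key#16 */
	SB_WRITERS,        /* sb_writers#8 */
	NR_LOCKS
};

struct lock_graph {
	bool edge[NR_LOCKS][NR_LOCKS]; /* edge[a][b]: b taken while holding a */
};

void lock_graph_init(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/* Depth-first search: is 'to' reachable from 'from' along recorded edges? */
bool lock_graph_reaches(const struct lock_graph *g, int from, int to,
			bool *seen)
{
	int next;

	if (from == to)
		return true;
	seen[from] = true;
	for (next = 0; next < NR_LOCKS; next++)
		if (g->edge[from][next] && !seen[next] &&
		    lock_graph_reaches(g, next, to, seen))
			return true;
	return false;
}

/* Record "held -> wanted"; return false if that would close a cycle. */
bool lock_graph_add(struct lock_graph *g, int held, int wanted)
{
	bool seen[NR_LOCKS] = { false };

	assert(held >= 0 && held < NR_LOCKS);
	assert(wanted >= 0 && wanted < NR_LOCKS);
	if (lock_graph_reaches(g, wanted, held, seen))
		return false;
	g->edge[held][wanted] = true;
	return true;
}
```

Adding the edges in the order #1, #2, #3 from the report succeeds; the
final sendfile step, holding sb_writers and asking for &p->lock, is the
one that gets rejected.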

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-05 18:57                                               ` CAI Qian
@ 2016-10-05 20:05                                                 ` Al Viro
  2016-10-06 12:20                                                   ` CAI Qian
  0 siblings, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-10-05 20:05 UTC (permalink / raw)
  To: CAI Qian
  Cc: tj, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Wed, Oct 05, 2016 at 02:57:04PM -0400, CAI Qian wrote:

> Not sure if this is related, but the procfs lockdep report below always shows
> up before the cgroup hang unless it is masked by other lockdep issues. Also,
> this hang is always reproducible.

Sigh...  Let's get the /proc/*/auxv out of the way - this should deal with it:

diff --git a/fs/proc/base.c b/fs/proc/base.c
index d588d14..489d2d6 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -400,23 +400,6 @@ static const struct file_operations proc_pid_cmdline_ops = {
 	.llseek	= generic_file_llseek,
 };
 
-static int proc_pid_auxv(struct seq_file *m, struct pid_namespace *ns,
-			 struct pid *pid, struct task_struct *task)
-{
-	struct mm_struct *mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
-	if (mm && !IS_ERR(mm)) {
-		unsigned int nwords = 0;
-		do {
-			nwords += 2;
-		} while (mm->saved_auxv[nwords - 2] != 0); /* AT_NULL */
-		seq_write(m, mm->saved_auxv, nwords * sizeof(mm->saved_auxv[0]));
-		mmput(mm);
-		return 0;
-	} else
-		return PTR_ERR(mm);
-}
-
-
 #ifdef CONFIG_KALLSYMS
 /*
  * Provides a wchan file via kallsyms in a proper one-value-per-file format.
@@ -1014,6 +997,30 @@ static const struct file_operations proc_environ_operations = {
 	.release	= mem_release,
 };
 
+static int auxv_open(struct inode *inode, struct file *file)
+{
+	return __mem_open(inode, file, PTRACE_MODE_READ_FSCREDS);
+}
+
+static ssize_t auxv_read(struct file *file, char __user *buf,
+			size_t count, loff_t *ppos)
+{
+	struct mm_struct *mm = file->private_data;
+	unsigned int nwords = 0;
+	do {
+		nwords += 2;
+	} while (mm->saved_auxv[nwords - 2] != 0); /* AT_NULL */
+	return simple_read_from_buffer(buf, count, ppos, mm->saved_auxv,
+				       nwords * sizeof(mm->saved_auxv[0]));
+}
+
+static const struct file_operations proc_auxv_operations = {
+	.open		= auxv_open,
+	.read		= auxv_read,
+	.llseek		= generic_file_llseek,
+	.release	= mem_release,
+};
+
 static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 			    loff_t *ppos)
 {
@@ -2822,7 +2829,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
 #endif
 	REG("environ",    S_IRUSR, proc_environ_operations),
-	ONE("auxv",       S_IRUSR, proc_pid_auxv),
+	REG("auxv",       S_IRUSR, proc_auxv_operations),
 	ONE("status",     S_IRUGO, proc_pid_status),
 	ONE("personality", S_IRUSR, proc_pid_personality),
 	ONE("limits",	  S_IRUGO, proc_pid_limits),
@@ -3210,7 +3217,7 @@ static const struct pid_entry tid_base_stuff[] = {
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
 #endif
 	REG("environ",   S_IRUSR, proc_environ_operations),
-	ONE("auxv",      S_IRUSR, proc_pid_auxv),
+	REG("auxv",      S_IRUSR, proc_auxv_operations),
 	ONE("status",    S_IRUGO, proc_pid_status),
 	ONE("personality", S_IRUSR, proc_pid_personality),
 	ONE("limits",	 S_IRUGO, proc_pid_limits),
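
The read side of the patch above can be tried out in userspace.  A sketch
under stated assumptions: auxv_bytes() and model_read_from_buffer() are
hypothetical names, the latter being a plain-memcpy stand-in for what
simple_read_from_buffer() does with the offset, and it returns -1 where
the kernel helper would return an errno.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Size in bytes of an auxv array, including the terminating AT_NULL pair. */
size_t auxv_bytes(const unsigned long *auxv)
{
	size_t nwords = 0;

	assert(auxv != NULL);
	do {
		nwords += 2;                   /* one (type, value) pair */
	} while (auxv[nwords - 2] != 0);       /* type 0 is AT_NULL */
	return nwords * sizeof(auxv[0]);
}

/* Copy at most 'count' bytes starting at *ppos, then advance *ppos. */
long model_read_from_buffer(void *to, size_t count, long *ppos,
			    const void *from, size_t available)
{
	long pos = *ppos;

	if (pos < 0)
		return -1;                     /* stand-in for an error code */
	if ((size_t)pos >= available || count == 0)
		return 0;                      /* nothing left: end of file */
	if (count > available - (size_t)pos)
		count = available - (size_t)pos;
	memcpy(to, (const char *)from + pos, count);
	*ppos = pos + (long)count;
	return (long)count;
}
```

As far as the patch shows, the mm is looked up once at open time via
__mem_open(), so the read path is just this bounded copy and no longer
goes through seq_read() and mm_access().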

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-05 20:05                                                 ` Al Viro
@ 2016-10-06 12:20                                                   ` CAI Qian
  2016-10-06 12:25                                                     ` CAI Qian
  2016-10-07  9:27                                                     ` local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked) Dave Chinner
  0 siblings, 2 replies; 104+ messages in thread
From: CAI Qian @ 2016-10-06 12:20 UTC (permalink / raw)
  To: Al Viro
  Cc: tj, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> From: "Al Viro" <viro@ZenIV.linux.org.uk>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "tj" <tj@kernel.org>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>,
> "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Wednesday, October 5, 2016 4:05:22 PM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> On Wed, Oct 05, 2016 at 02:57:04PM -0400, CAI Qian wrote:
> 
> > Not sure if this is related, but the procfs lockdep report below always
> > shows up before the cgroup hang unless it is masked by other lockdep
> > issues. Also, this hang is always reproducible.
> 
> Sigh...  Let's get the /proc/*/auxv out of the way - this should deal with
> it:
So I applied both this and the sanity patch, and both the original sanity and
the proc warnings went away. However, the cgroup hang can still be reproduced,
as can this new xfs internal error below:

[16921.141233] XFS (dm-0): Internal error XFS_WANT_CORRUPTED_RETURN at line 5619 of file fs/xfs/libxfs/xfs_bmap.c.  Caller xfs_bmap_shift_extents+0x1cc/0x3a0 [xfs]
[16921.157694] CPU: 9 PID: 52920 Comm: trinity-c108 Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[16921.167012] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[16921.178368]  0000000000000286 00000000c3833246 ffff8803d0a83b60 ffffffff813d2ecc
[16921.186658]  ffff88042a898000 0000000000000001 ffff8803d0a83b78 ffffffffa02f36eb
[16921.194946]  ffffffffa02b544c ffff8803d0a83c30 ffffffffa02a8e52 ffff88042a898040
[16921.203238] Call Trace:
[16921.205972]  [<ffffffff813d2ecc>] dump_stack+0x85/0xc9
[16921.211742]  [<ffffffffa02f36eb>] xfs_error_report+0x3b/0x40 [xfs]
[16921.218660]  [<ffffffffa02b544c>] ? xfs_bmap_shift_extents+0x1cc/0x3a0 [xfs]
[16921.226543]  [<ffffffffa02a8e52>] xfs_bmse_shift_one.constprop.20+0x332/0x370 [xfs]
[16921.235090]  [<ffffffff817cb73a>] ? kmemleak_alloc+0x4a/0xa0
[16921.241426]  [<ffffffffa02b544c>] xfs_bmap_shift_extents+0x1cc/0x3a0 [xfs]
[16921.249122]  [<ffffffffa03142aa>] ? xfs_trans_add_item+0x2a/0x60 [xfs]
[16921.256430]  [<ffffffffa02eb361>] xfs_shift_file_space+0x231/0x2f0 [xfs]
[16921.263931]  [<ffffffffa02ebe8c>] xfs_collapse_file_space+0x5c/0x180 [xfs]
[16921.271622]  [<ffffffffa02f69b8>] xfs_file_fallocate+0x158/0x360 [xfs]
[16921.278907]  [<ffffffff810f8eae>] ? update_fast_ctr+0x4e/0x70
[16921.285320]  [<ffffffff810f8f57>] ? percpu_down_read+0x57/0x90
[16921.291828]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
[16921.298337]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
[16921.304847]  [<ffffffff8127e000>] vfs_fallocate+0x140/0x230
[16921.311067]  [<ffffffff8127eee4>] SyS_fallocate+0x44/0x70
[16921.317091]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[16921.323212]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25

    CAI Qian

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-06 12:20                                                   ` CAI Qian
@ 2016-10-06 12:25                                                     ` CAI Qian
  2016-10-06 16:11                                                       ` CAI Qian
  2016-10-07  7:08                                                       ` Jan Kara
  2016-10-07  9:27                                                     ` local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked) Dave Chinner
  1 sibling, 2 replies; 104+ messages in thread
From: CAI Qian @ 2016-10-06 12:25 UTC (permalink / raw)
  To: Al Viro
  Cc: tj, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> From: "CAI Qian" <caiqian@redhat.com>
> To: "Al Viro" <viro@ZenIV.linux.org.uk>
> Cc: "tj" <tj@kernel.org>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>,
> "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Thursday, October 6, 2016 8:20:17 AM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> 
> 
> ----- Original Message -----
> > From: "Al Viro" <viro@ZenIV.linux.org.uk>
> > To: "CAI Qian" <caiqian@redhat.com>
> > Cc: "tj" <tj@kernel.org>, "Linus Torvalds" <torvalds@linux-foundation.org>,
> > "Dave Chinner" <david@fromorbit.com>,
> > "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>,
> > "Nick Piggin" <npiggin@gmail.com>,
> > linux-fsdevel@vger.kernel.org
> > Sent: Wednesday, October 5, 2016 4:05:22 PM
> > Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT]
> > splice_read reworked)
> > 
> > On Wed, Oct 05, 2016 at 02:57:04PM -0400, CAI Qian wrote:
> > 
> > > Not sure if this related, and there is always a lockdep regards procfs
> > > happened
> > > below unless masking by other lockdep issues before the cgroup hang.
> > > Also,
> > > this
> > > hang is always reproducible.
> > 
> > Sigh...  Let's get the /proc/*/auxv out of the way - this should deal with
> > it:
> So I applied both this and the sanity patch, and both original sanity and the
> proc warnings went away. However, the cgroup hang can still be reproduced as
> well as this new xfs internal error below,

Wait. There was also a lockdep report before the xfs internal error.

[ 5839.452325] ======================================================
[ 5839.459221] [ INFO: possible circular locking dependency detected ]
[ 5839.466215] 4.8.0-rc8-splice-fixw-proc+ #4 Not tainted
[ 5839.471945] -------------------------------------------------------
[ 5839.478937] trinity-c220/69531 is trying to acquire lock:
[ 5839.484961]  (&p->lock){+.+.+.}, at: [<ffffffff812ac69c>] seq_read+0x4c/0x3e0
[ 5839.492967] 
but task is already holding lock:
[ 5839.499476]  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[ 5839.508560] 
which lock already depends on the new lock.

[ 5839.517686] 
the existing dependency chain (in reverse order) is:
[ 5839.526036] 
-> #3 (sb_writers#8){.+.+.+}:
[ 5839.530751]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
[ 5839.537368]        [<ffffffff810f8f4a>] percpu_down_read+0x4a/0x90
[ 5839.544275]        [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[ 5839.551181]        [<ffffffff812a8544>] mnt_want_write+0x24/0x50
[ 5839.557892]        [<ffffffffa04a398f>] ovl_want_write+0x1f/0x30 [overlay]
[ 5839.565577]        [<ffffffffa04a6036>] ovl_do_remove+0x46/0x480 [overlay]
[ 5839.573259]        [<ffffffffa04a64a3>] ovl_unlink+0x13/0x20 [overlay]
[ 5839.580555]        [<ffffffff812918ea>] vfs_unlink+0xda/0x190
[ 5839.586979]        [<ffffffff81293698>] do_unlinkat+0x268/0x2b0
[ 5839.593599]        [<ffffffff8129419b>] SyS_unlinkat+0x1b/0x30
[ 5839.600120]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 5839.606836]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
[ 5839.614231] 
-> #2 (&sb->s_type->i_mutex_key#17){++++++}:
[ 5839.620399]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
[ 5839.627015]        [<ffffffff817d1b77>] down_read+0x47/0x70
[ 5839.633242]        [<ffffffff8128cfd2>] lookup_slow+0xc2/0x1f0
[ 5839.639762]        [<ffffffff8128f6f2>] walk_component+0x172/0x220
[ 5839.646668]        [<ffffffff81290fd6>] link_path_walk+0x1a6/0x620
[ 5839.653574]        [<ffffffff81291a81>] path_openat+0xe1/0xdb0
[ 5839.660092]        [<ffffffff812939e1>] do_filp_open+0x91/0x100
[ 5839.666707]        [<ffffffff81288e06>] do_open_execat+0x76/0x180
[ 5839.673517]        [<ffffffff81288f3b>] open_exec+0x2b/0x50
[ 5839.679743]        [<ffffffff812eccf3>] load_elf_binary+0x2a3/0x10a0
[ 5839.686844]        [<ffffffff81288917>] search_binary_handler+0x97/0x1d0
[ 5839.694331]        [<ffffffff81289ed8>] do_execveat_common.isra.35+0x678/0x9a0
[ 5839.702400]        [<ffffffff8128a4da>] SyS_execve+0x3a/0x50
[ 5839.708726]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 5839.715441]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
[ 5839.722833] 
-> #1 (&sig->cred_guard_mutex){+.+.+.}:
[ 5839.728510]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
[ 5839.735126]        [<ffffffff817cfc66>] mutex_lock_killable_nested+0x86/0x540
[ 5839.743097]        [<ffffffff81301e84>] lock_trace+0x24/0x60
[ 5839.749421]        [<ffffffff8130224d>] proc_pid_syscall+0x2d/0x110
[ 5839.756423]        [<ffffffff81302af0>] proc_single_show+0x50/0x90
[ 5839.763330]        [<ffffffff812ab867>] traverse+0xf7/0x210
[ 5839.769557]        [<ffffffff812ac9eb>] seq_read+0x39b/0x3e0
[ 5839.775884]        [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
[ 5839.783179]        [<ffffffff81281a03>] do_readv_writev+0x213/0x230
[ 5839.790181]        [<ffffffff81281a59>] vfs_readv+0x39/0x50
[ 5839.796406]        [<ffffffff81281c12>] do_preadv+0xa2/0xc0
[ 5839.802634]        [<ffffffff81282ec1>] SyS_preadv+0x11/0x20
[ 5839.808963]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 5839.815681]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
[ 5839.823075] 
-> #0 (&p->lock){+.+.+.}:
[ 5839.827395]        [<ffffffff810fe69c>] __lock_acquire+0x151c/0x1990
[ 5839.834500]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
[ 5839.841115]        [<ffffffff817cf3b6>] mutex_lock_nested+0x76/0x450
[ 5839.848219]        [<ffffffff812ac69c>] seq_read+0x4c/0x3e0
[ 5839.854448]        [<ffffffff8131566b>] kernfs_fop_read+0x12b/0x1b0
[ 5839.861451]        [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
[ 5839.868742]        [<ffffffff81281a03>] do_readv_writev+0x213/0x230
[ 5839.875744]        [<ffffffff81281a59>] vfs_readv+0x39/0x50
[ 5839.881971]        [<ffffffff812bc55a>] default_file_splice_read+0x1aa/0x2c0
[ 5839.889847]        [<ffffffff812bb913>] do_splice_to+0x73/0x90
[ 5839.896365]        [<ffffffff812bba1b>] splice_direct_to_actor+0xeb/0x220
[ 5839.903950]        [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
[ 5839.910857]        [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
[ 5839.917470]        [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
[ 5839.924184]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 5839.930898]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
[ 5839.938286] 
other info that might help us debug this:

[ 5839.947217] Chain exists of:
  &p->lock --> &sb->s_type->i_mutex_key#17 --> sb_writers#8

[ 5839.956615]  Possible unsafe locking scenario:

[ 5839.963218]        CPU0                    CPU1
[ 5839.968269]        ----                    ----
[ 5839.973321]   lock(sb_writers#8);
[ 5839.977046]                                lock(&sb->s_type->i_mutex_key#17);
[ 5839.985037]                                lock(sb_writers#8);
[ 5839.991573]   lock(&p->lock);
[ 5839.994900] 
 *** DEADLOCK ***

[ 5840.001503] 1 lock held by trinity-c220/69531:
[ 5840.006457]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[ 5840.016031] 
stack backtrace:
[ 5840.020891] CPU: 12 PID: 69531 Comm: trinity-c220 Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 5840.030306] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 5840.041660]  0000000000000086 00000000a1ef62f8 ffff8803ca52f7c0 ffffffff813d2ecc
[ 5840.049952]  ffffffff82a41160 ffffffff82a913e0 ffff8803ca52f800 ffffffff811dd630
[ 5840.058245]  ffff8803ca52f840 ffff880392c4ecc8 ffff880392c4e000 0000000000000001
[ 5840.066537] Call Trace:
[ 5840.069266]  [<ffffffff813d2ecc>] dump_stack+0x85/0xc9
[ 5840.075000]  [<ffffffff811dd630>] print_circular_bug+0x1f9/0x207
[ 5840.081701]  [<ffffffff810fe69c>] __lock_acquire+0x151c/0x1990
[ 5840.088208]  [<ffffffff810ff174>] lock_acquire+0xd4/0x240
[ 5840.094232]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
[ 5840.100061]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
[ 5840.105891]  [<ffffffff817cf3b6>] mutex_lock_nested+0x76/0x450
[ 5840.112397]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
[ 5840.118228]  [<ffffffff810fb3e9>] ? __lock_is_held+0x49/0x70
[ 5840.124540]  [<ffffffff812ac69c>] seq_read+0x4c/0x3e0
[ 5840.130175]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
[ 5840.137360]  [<ffffffff8131566b>] kernfs_fop_read+0x12b/0x1b0
[ 5840.143770]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
[ 5840.150956]  [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
[ 5840.157657]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
[ 5840.164843]  [<ffffffff81281a03>] do_readv_writev+0x213/0x230
[ 5840.171255]  [<ffffffff81418cf9>] ? __pipe_get_pages+0x24/0x9b
[ 5840.177762]  [<ffffffff813e6f0f>] ? iov_iter_get_pages_alloc+0x19f/0x360
[ 5840.185240]  [<ffffffff810fd5f2>] ? __lock_acquire+0x472/0x1990
[ 5840.191843]  [<ffffffff81281a59>] vfs_readv+0x39/0x50
[ 5840.197478]  [<ffffffff812bc55a>] default_file_splice_read+0x1aa/0x2c0
[ 5840.204763]  [<ffffffff810cba89>] ? __might_sleep+0x49/0x80
[ 5840.210980]  [<ffffffff81349c93>] ? security_file_permission+0xa3/0xc0
[ 5840.218264]  [<ffffffff812bb913>] do_splice_to+0x73/0x90
[ 5840.224190]  [<ffffffff812bba1b>] splice_direct_to_actor+0xeb/0x220
[ 5840.231182]  [<ffffffff812baee0>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 5840.238465]  [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
[ 5840.244778]  [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
[ 5840.250802]  [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
[ 5840.256922]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 5840.263042]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25

   CAI Qian

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-06 12:25                                                     ` CAI Qian
@ 2016-10-06 16:11                                                       ` CAI Qian
  2016-10-06 17:00                                                         ` Linus Torvalds
  2016-10-07  7:08                                                       ` Jan Kara
  1 sibling, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-10-06 16:11 UTC (permalink / raw)
  To: Al Viro
  Cc: tj, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel


> > > On Wed, Oct 05, 2016 at 02:57:04PM -0400, CAI Qian wrote:
> > > 
> > > > Not sure if this related, and there is always a lockdep regards procfs
> > > > happened
> > > > below unless masking by other lockdep issues before the cgroup hang.
> > > > Also,
> > > > this
> > > > hang is always reproducible.
> > > 
> > > Sigh...  Let's get the /proc/*/auxv out of the way - this should deal
> > > with
> > > it:
> > So I applied both this and the sanity patch, and both original sanity and
> > the
> > proc warnings went away. However, the cgroup hang can still be reproduced
> > as
> > well as this new xfs internal error below,
> 
> Wait. There is also a lockep happened before the xfs internal error as well.
A different lockdep report this time:

[ 4872.310639] =================================
[ 4872.315499] [ INFO: inconsistent lock state ]
[ 4872.320359] 4.8.0-rc8-splice-fixw-proc+ #4 Not tainted
[ 4872.326091] ---------------------------------
[ 4872.330950] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[ 4872.338235] kswapd1/437 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 4872.343965]  (&xfs_nondir_ilock_class){++++?.}, at: [<ffffffffa029968e>] xfs_ilock+0x18e/0x260 [xfs]
[ 4872.354236] {RECLAIM_FS-ON-W} state was registered at:
[ 4872.359969]   [<ffffffff810fcbd6>] mark_held_locks+0x66/0x90
[ 4872.366297]   [<ffffffff810fffd5>] lockdep_trace_alloc+0xc5/0x110
[ 4872.373107]   [<ffffffff81253ad3>] kmem_cache_alloc+0x33/0x2e0
[ 4872.379628]   [<ffffffffa02a8386>] kmem_zone_alloc+0x96/0x120 [xfs]
[ 4872.386654]   [<ffffffffa024967b>] xfs_bmbt_init_cursor+0x3b/0x160 [xfs]
[ 4872.394147]   [<ffffffffa0247f8f>] xfs_bunmapi+0x80f/0xb00 [xfs]
[ 4872.400202] kmemleak: Cannot allocate a kmemleak_object structure
[ 4872.400205] kmemleak: Kernel memory leak detector disabled
[ 4872.400337] kmemleak: Automatic memory scanning thread ended
[ 4872.400869] kmemleak: Kmemleak disabled without freeing internal data. Reclaim the memory with "echo clear > /sys/kernel/debug/kmemleak".
[ 4872.433878]   [<ffffffffa027ddc3>] xfs_bmap_punch_delalloc_range+0xe3/0x180 [xfs]
[ 4872.442253]   [<ffffffffa0294b39>] xfs_file_iomap_end+0x89/0xd0 [xfs]
[ 4872.449468]   [<ffffffff812f3da0>] iomap_apply+0xe0/0x130
[ 4872.455505]   [<ffffffff812f3e58>] iomap_file_buffered_write+0x68/0xa0
[ 4872.462798]   [<ffffffffa028a87f>] xfs_file_buffered_aio_write+0x14f/0x350 [xfs]
[ 4872.471079]   [<ffffffffa028ab6d>] xfs_file_write_iter+0xed/0x130 [xfs]
[ 4872.478485]   [<ffffffff81280eee>] do_iter_readv_writev+0xae/0x130
[ 4872.485393]   [<ffffffff81281992>] do_readv_writev+0x1a2/0x230
[ 4872.491911]   [<ffffffff81281c6c>] vfs_writev+0x3c/0x50
[ 4872.497752]   [<ffffffff81281ce4>] do_writev+0x64/0x100
[ 4872.503589]   [<ffffffff81282ea0>] SyS_writev+0x10/0x20
[ 4872.509428]   [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 4872.515656]   [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
[ 4872.522563] irq event stamp: 427
[ 4872.526160] hardirqs last  enabled at (427): [<ffffffff817cf21d>] mutex_trylock+0xdd/0x200
[ 4872.535393] hardirqs last disabled at (426): [<ffffffff817cf191>] mutex_trylock+0x51/0x200
[ 4872.544627] softirqs last  enabled at (424): [<ffffffff817d7b37>] __do_softirq+0x1f7/0x4b7
[ 4872.553862] softirqs last disabled at (417): [<ffffffff810a4a98>] irq_exit+0xc8/0xe0
[ 4872.562513] 
[ 4872.562513] other info that might help us debug this:
[ 4872.569797]  Possible unsafe locking scenario:
[ 4872.569797] 
[ 4872.576401]        CPU0
[ 4872.579127]        ----
[ 4872.581854]   lock(&xfs_nondir_ilock_class);
[ 4872.586637]   <Interrupt>
[ 4872.589558]     lock(&xfs_nondir_ilock_class);
[ 4872.594533] 
[ 4872.594533]  *** DEADLOCK ***
[ 4872.594533] 
[ 4872.601140] 3 locks held by kswapd1/437:
[ 4872.605515]  #0:  (shrinker_rwsem){++++..}, at: [<ffffffff811f78ad>] shrink_slab+0x9d/0x620
[ 4872.614889]  #1:  (&type->s_umount_key#48){++++++}, at: [<ffffffff8128550b>] trylock_super+0x1b/0x50
[ 4872.625145]  #2:  (&pag->pag_ici_reclaim_lock){+.+...}, at: [<ffffffffa028e7a7>] xfs_reclaim_inodes_ag+0xc7/0x4f0 [xfs]
[ 4872.637247] 
[ 4872.637247] stack backtrace:
[ 4872.642109] CPU: 49 PID: 437 Comm: kswapd1 Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 4872.650846] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 4872.662202]  0000000000000086 00000000eda15d18 ffff880462bd7798 ffffffff813d2ecc
[ 4872.670498]  ffff880462e56000 ffffffff82a66870 ffff880462bd77e8 ffffffff811dd9e1
[ 4872.678793]  0000000000000000 ffff880400000001 ffff880400000001 000000000000000a
[ 4872.687086] Call Trace:
[ 4872.689817]  [<ffffffff813d2ecc>] dump_stack+0x85/0xc9
[ 4872.695543]  [<ffffffff811dd9e1>] print_usage_bug+0x1eb/0x1fc
[ 4872.701954]  [<ffffffff810fc0b0>] ? check_usage_backwards+0x150/0x150
[ 4872.709141]  [<ffffffff810fcae4>] mark_lock+0x264/0x2f0
[ 4872.714968]  [<ffffffff810fd491>] __lock_acquire+0x311/0x1990
[ 4872.721379]  [<ffffffff810499db>] ? save_stack_trace+0x2b/0x50
[ 4872.727892]  [<ffffffff810fd5f2>] ? __lock_acquire+0x472/0x1990
[ 4872.734497]  [<ffffffff810ff174>] lock_acquire+0xd4/0x240
[ 4872.740535]  [<ffffffffa029968e>] ? xfs_ilock+0x18e/0x260 [xfs]
[ 4872.747155]  [<ffffffffa028dd93>] ? xfs_reclaim_inode+0x113/0x380 [xfs]
[ 4872.754538]  [<ffffffff810f8bfa>] down_write_nested+0x4a/0x80
[ 4872.760962]  [<ffffffffa029968e>] ? xfs_ilock+0x18e/0x260 [xfs]
[ 4872.767579]  [<ffffffffa029968e>] xfs_ilock+0x18e/0x260 [xfs]
[ 4872.774004]  [<ffffffffa028dd93>] xfs_reclaim_inode+0x113/0x380 [xfs]
[ 4872.781203]  [<ffffffffa028e9ab>] xfs_reclaim_inodes_ag+0x2cb/0x4f0 [xfs]
[ 4872.788780]  [<ffffffffa028e7d2>] ? xfs_reclaim_inodes_ag+0xf2/0x4f0 [xfs]
[ 4872.796453]  [<ffffffff817d40aa>] ? _raw_spin_unlock_irqrestore+0x6a/0x80
[ 4872.804026]  [<ffffffff817d408a>] ? _raw_spin_unlock_irqrestore+0x4a/0x80
[ 4872.811602]  [<ffffffff810d1a58>] ? try_to_wake_up+0x58/0x510
[ 4872.818014]  [<ffffffff810d1f25>] ? wake_up_process+0x15/0x20
[ 4872.824438]  [<ffffffffa0290523>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[ 4872.831835]  [<ffffffffa02a26d9>] xfs_fs_free_cached_objects+0x19/0x20 [xfs]
[ 4872.839702]  [<ffffffff812856c1>] super_cache_scan+0x181/0x190
[ 4872.846210]  [<ffffffff811f7a79>] shrink_slab+0x269/0x620
[ 4872.852233]  [<ffffffff811fcc88>] shrink_node+0x108/0x310
[ 4872.858256]  [<ffffffff811fe360>] kswapd+0x3d0/0x960
[ 4872.863796]  [<ffffffff811fdf90>] ? mem_cgroup_shrink_node+0x370/0x370
[ 4872.871081]  [<ffffffff810c3f5e>] kthread+0xfe/0x120
[ 4872.876618]  [<ffffffff817d40ec>] ? _raw_spin_unlock_irq+0x2c/0x60
[ 4872.883514]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
[ 4872.889539]  [<ffffffff810c3e60>] ? kthread_create_on_node+0x230/0x230

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-06 16:11                                                       ` CAI Qian
@ 2016-10-06 17:00                                                         ` Linus Torvalds
  2016-10-06 18:12                                                           ` CAI Qian
  2016-10-07  9:57                                                           ` Dave Chinner
  0 siblings, 2 replies; 104+ messages in thread
From: Linus Torvalds @ 2016-10-06 17:00 UTC (permalink / raw)
  To: CAI Qian
  Cc: Al Viro, tj, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Thu, Oct 6, 2016 at 9:11 AM, CAI Qian <caiqian@redhat.com> wrote:
>
>>
>> Wait. There is also a lockep happened before the xfs internal error as well.
> Some other lockdep this time,

This one looks just bogus.

> [ 4872.569797]  Possible unsafe locking scenario:
> [ 4872.569797]
> [ 4872.576401]        CPU0
> [ 4872.579127]        ----
> [ 4872.581854]   lock(&xfs_nondir_ilock_class);
> [ 4872.586637]   <Interrupt>
> [ 4872.589558]     lock(&xfs_nondir_ilock_class);

I'm not seeing that .lock taken in interrupt context.

I'm wondering how many of your reports are confused by earlier errors
that happened.

               Linus

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-06 17:00                                                         ` Linus Torvalds
@ 2016-10-06 18:12                                                           ` CAI Qian
  2016-10-07  9:57                                                           ` Dave Chinner
  1 sibling, 0 replies; 104+ messages in thread
From: CAI Qian @ 2016-10-06 18:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, tj, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel



----- Original Message -----
> From: "Linus Torvalds" <torvalds@linux-foundation.org>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Al Viro" <viro@zeniv.linux.org.uk>, "tj" <tj@kernel.org>, "Dave Chinner" <david@fromorbit.com>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>, "linux-fsdevel"
> <linux-fsdevel@vger.kernel.org>
> Sent: Thursday, October 6, 2016 1:00:08 PM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> On Thu, Oct 6, 2016 at 9:11 AM, CAI Qian <caiqian@redhat.com> wrote:
> >
> >>
> >> Wait. There is also a lockep happened before the xfs internal error as
> >> well.
> > Some other lockdep this time,
> 
> This one looks just bogus.
> 
> > [ 4872.569797]  Possible unsafe locking scenario:
> > [ 4872.569797]
> > [ 4872.576401]        CPU0
> > [ 4872.579127]        ----
> > [ 4872.581854]   lock(&xfs_nondir_ilock_class);
> > [ 4872.586637]   <Interrupt>
> > [ 4872.589558]     lock(&xfs_nondir_ilock_class);
> 
> I'm not seeing that .lock taken in interrupt context.
> 
> I'm wondering how many of your reports are confused by earlier errors
> that  happened.
Hmm, there was no previous error/lockdep/warnings on the console prior to
this AFAICT. It was a fresh trinity run after reboot.

The previous run, which triggered the seq_read/__sb_start_write lockdep and
then the xfs XFS_WANT_CORRUPTED_RETURN internal error reported in another
reply, also started from a fresh reboot.

After each of those individual runs, the cgroup hang is reliably triggered
by any systemctl command, "make install" of the kernel, etc.
   CAI Qian

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-06 12:25                                                     ` CAI Qian
  2016-10-06 16:11                                                       ` CAI Qian
@ 2016-10-07  7:08                                                       ` Jan Kara
  2016-10-07 14:43                                                         ` CAI Qian
  2016-10-21 15:38                                                         ` [4.9-rc1+] overlayfs lockdep CAI Qian
  1 sibling, 2 replies; 104+ messages in thread
From: Jan Kara @ 2016-10-07  7:08 UTC (permalink / raw)
  To: CAI Qian
  Cc: Al Viro, tj, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel, Miklos Szeredi


So I believe this may be just a problem in overlayfs lockdep annotation
(see below). Added Miklos to CC.

On Thu 06-10-16 08:25:59, CAI Qian wrote:
> > > > Not sure if this related, and there is always a lockdep regards procfs
> > > > happened
> > > > below unless masking by other lockdep issues before the cgroup hang.
> > > > Also,
> > > > this
> > > > hang is always reproducible.
> > > 
> > > Sigh...  Let's get the /proc/*/auxv out of the way - this should deal with
> > > it:
> > So I applied both this and the sanity patch, and both original sanity and the
> > proc warnings went away. However, the cgroup hang can still be reproduced as
> > well as this new xfs internal error below,
> 
> Wait. There is also a lockep happened before the xfs internal error as well.
> 
> [ 5839.452325] ======================================================
> [ 5839.459221] [ INFO: possible circular locking dependency detected ]
> [ 5839.466215] 4.8.0-rc8-splice-fixw-proc+ #4 Not tainted
> [ 5839.471945] -------------------------------------------------------
> [ 5839.478937] trinity-c220/69531 is trying to acquire lock:
> [ 5839.484961]  (&p->lock){+.+.+.}, at: [<ffffffff812ac69c>] seq_read+0x4c/0x3e0
> [ 5839.492967] 
> but task is already holding lock:
> [ 5839.499476]  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
> [ 5839.508560] 
> which lock already depends on the new lock.
> 
> [ 5839.517686] 
> the existing dependency chain (in reverse order) is:
> [ 5839.526036] 
> -> #3 (sb_writers#8){.+.+.+}:
> [ 5839.530751]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> [ 5839.537368]        [<ffffffff810f8f4a>] percpu_down_read+0x4a/0x90
> [ 5839.544275]        [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
> [ 5839.551181]        [<ffffffff812a8544>] mnt_want_write+0x24/0x50
> [ 5839.557892]        [<ffffffffa04a398f>] ovl_want_write+0x1f/0x30 [overlay]
> [ 5839.565577]        [<ffffffffa04a6036>] ovl_do_remove+0x46/0x480 [overlay]
> [ 5839.573259]        [<ffffffffa04a64a3>] ovl_unlink+0x13/0x20 [overlay]
> [ 5839.580555]        [<ffffffff812918ea>] vfs_unlink+0xda/0x190
> [ 5839.586979]        [<ffffffff81293698>] do_unlinkat+0x268/0x2b0
> [ 5839.593599]        [<ffffffff8129419b>] SyS_unlinkat+0x1b/0x30
> [ 5839.600120]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 5839.606836]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> [ 5839.614231] 

So here is IMO the real culprit: do_unlinkat() grabs fs freeze protection
through mnt_want_write(), we grab also i_rwsem in do_unlinkat() in
I_MUTEX_PARENT class a bit after that and further down in vfs_unlink() we
grab i_rwsem for the unlinked inode itself in default I_MUTEX class. Then
in ovl_want_write() we grab freeze protection again, but this time for the
upper filesystem. That establishes sb_writers (overlay) -> I_MUTEX_PARENT
(overlay) -> I_MUTEX (overlay) -> sb_writers (FS-A) lock ordering
(we maintain locking classes per fs type so that's why I'm showing fs type
in parenthesis).

Now this nesting is nasty because once you add locks that are not tracked
per fs type into the mix, you get cycles. In this case we've got
seq_file->lock and cred_guard_mutex into the mix - the splice path is
doing sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex (splicing
from seq_file into the real filesystem). Exec path further establishes
cred_guard_mutex -> I_MUTEX (overlay) which closes the full cycle:

sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex -> i_mutex
(overlay) -> sb_writers (FS-A)

If I analyzed the lockdep trace correctly, this looks like a real (although
remote) deadlock possibility. Miklos?
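
To make the cycle concrete, here is a minimal sketch (illustrative Python,
not kernel code) that records the three lock-order chains described above as
a directed graph and searches it for a cycle, which is in spirit the check
lockdep performs on lock classes. The chain labels and the helper names
build_graph() and find_cycle() are made up for this example; they are not
lockdep class names or kernel functions.

```python
# Lock-order chains as described in the analysis above. Each list means
# "the lock on the left is held while the lock on the right is taken".
# The labels are informal names for this sketch only.
CHAINS = {
    "unlink on overlay": [
        "sb_writers (overlay)",
        "I_MUTEX_PARENT (overlay)",
        "I_MUTEX (overlay)",
        "sb_writers (FS-A)",
    ],
    "splice from seq_file": [
        "sb_writers (FS-A)",
        "seq_file->lock",
        "cred_guard_mutex",
    ],
    "exec": ["cred_guard_mutex", "I_MUTEX (overlay)"],
}


def build_graph(chains):
    """Turn each chain into 'held -> wanted' edges of a directed graph."""
    graph = {}
    for locks in chains.values():
        for held, wanted in zip(locks, locks[1:]):
            graph.setdefault(held, [])
            graph.setdefault(wanted, [])
            if wanted not in graph[held]:
                graph[held].append(wanted)
    return graph


def find_cycle(graph):
    """Depth-first search; return one cycle as a list of locks, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {node: WHITE for node in graph}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == GREY:
                # Back edge: nxt is already on the current path.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in graph:
        if state[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

Calling find_cycle(build_graph(CHAINS)) returns a path that starts and ends
on the same lock and passes through sb_writers (FS-A), seq_file->lock,
cred_guard_mutex and I_MUTEX (overlay). Dropping the "exec" chain from
CHAINS leaves the graph acyclic, which matches the point that
cred_guard_mutex -> I_MUTEX (overlay) is the edge that closes the loop.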

								Honza

> -> #2 (&sb->s_type->i_mutex_key#17){++++++}:
> [ 5839.620399]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> [ 5839.627015]        [<ffffffff817d1b77>] down_read+0x47/0x70
> [ 5839.633242]        [<ffffffff8128cfd2>] lookup_slow+0xc2/0x1f0
> [ 5839.639762]        [<ffffffff8128f6f2>] walk_component+0x172/0x220
> [ 5839.646668]        [<ffffffff81290fd6>] link_path_walk+0x1a6/0x620
> [ 5839.653574]        [<ffffffff81291a81>] path_openat+0xe1/0xdb0
> [ 5839.660092]        [<ffffffff812939e1>] do_filp_open+0x91/0x100
> [ 5839.666707]        [<ffffffff81288e06>] do_open_execat+0x76/0x180
> [ 5839.673517]        [<ffffffff81288f3b>] open_exec+0x2b/0x50
> [ 5839.679743]        [<ffffffff812eccf3>] load_elf_binary+0x2a3/0x10a0
> [ 5839.686844]        [<ffffffff81288917>] search_binary_handler+0x97/0x1d0
> [ 5839.694331]        [<ffffffff81289ed8>] do_execveat_common.isra.35+0x678/0x9a0
> [ 5839.702400]        [<ffffffff8128a4da>] SyS_execve+0x3a/0x50
> [ 5839.708726]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 5839.715441]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> [ 5839.722833] 
> -> #1 (&sig->cred_guard_mutex){+.+.+.}:
> [ 5839.728510]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> [ 5839.735126]        [<ffffffff817cfc66>] mutex_lock_killable_nested+0x86/0x540
> [ 5839.743097]        [<ffffffff81301e84>] lock_trace+0x24/0x60
> [ 5839.749421]        [<ffffffff8130224d>] proc_pid_syscall+0x2d/0x110
> [ 5839.756423]        [<ffffffff81302af0>] proc_single_show+0x50/0x90
> [ 5839.763330]        [<ffffffff812ab867>] traverse+0xf7/0x210
> [ 5839.769557]        [<ffffffff812ac9eb>] seq_read+0x39b/0x3e0
> [ 5839.775884]        [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
> [ 5839.783179]        [<ffffffff81281a03>] do_readv_writev+0x213/0x230
> [ 5839.790181]        [<ffffffff81281a59>] vfs_readv+0x39/0x50
> [ 5839.796406]        [<ffffffff81281c12>] do_preadv+0xa2/0xc0
> [ 5839.802634]        [<ffffffff81282ec1>] SyS_preadv+0x11/0x20
> [ 5839.808963]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 5839.815681]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> [ 5839.823075] 
> -> #0 (&p->lock){+.+.+.}:
> [ 5839.827395]        [<ffffffff810fe69c>] __lock_acquire+0x151c/0x1990
> [ 5839.834500]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> [ 5839.841115]        [<ffffffff817cf3b6>] mutex_lock_nested+0x76/0x450
> [ 5839.848219]        [<ffffffff812ac69c>] seq_read+0x4c/0x3e0
> [ 5839.854448]        [<ffffffff8131566b>] kernfs_fop_read+0x12b/0x1b0
> [ 5839.861451]        [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
> [ 5839.868742]        [<ffffffff81281a03>] do_readv_writev+0x213/0x230
> [ 5839.875744]        [<ffffffff81281a59>] vfs_readv+0x39/0x50
> [ 5839.881971]        [<ffffffff812bc55a>] default_file_splice_read+0x1aa/0x2c0
> [ 5839.889847]        [<ffffffff812bb913>] do_splice_to+0x73/0x90
> [ 5839.896365]        [<ffffffff812bba1b>] splice_direct_to_actor+0xeb/0x220
> [ 5839.903950]        [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
> [ 5839.910857]        [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
> [ 5839.917470]        [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
> [ 5839.924184]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 5839.930898]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> [ 5839.938286] 
> other info that might help us debug this:
> 
> [ 5839.947217] Chain exists of:
>   &p->lock --> &sb->s_type->i_mutex_key#17 --> sb_writers#8
> 
> [ 5839.956615]  Possible unsafe locking scenario:
> 
> [ 5839.963218]        CPU0                    CPU1
> [ 5839.968269]        ----                    ----
> [ 5839.973321]   lock(sb_writers#8);
> [ 5839.977046]                                lock(&sb->s_type->i_mutex_key#17);
> [ 5839.985037]                                lock(sb_writers#8);
> [ 5839.991573]   lock(&p->lock);
> [ 5839.994900] 
>  *** DEADLOCK ***
> 
> [ 5840.001503] 1 lock held by trinity-c220/69531:
> [ 5840.006457]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
> [ 5840.016031] 
> stack backtrace:
> [ 5840.020891] CPU: 12 PID: 69531 Comm: trinity-c220 Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 5840.030306] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
> [ 5840.041660]  0000000000000086 00000000a1ef62f8 ffff8803ca52f7c0 ffffffff813d2ecc
> [ 5840.049952]  ffffffff82a41160 ffffffff82a913e0 ffff8803ca52f800 ffffffff811dd630
> [ 5840.058245]  ffff8803ca52f840 ffff880392c4ecc8 ffff880392c4e000 0000000000000001
> [ 5840.066537] Call Trace:
> [ 5840.069266]  [<ffffffff813d2ecc>] dump_stack+0x85/0xc9
> [ 5840.075000]  [<ffffffff811dd630>] print_circular_bug+0x1f9/0x207
> [ 5840.081701]  [<ffffffff810fe69c>] __lock_acquire+0x151c/0x1990
> [ 5840.088208]  [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> [ 5840.094232]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
> [ 5840.100061]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
> [ 5840.105891]  [<ffffffff817cf3b6>] mutex_lock_nested+0x76/0x450
> [ 5840.112397]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
> [ 5840.118228]  [<ffffffff810fb3e9>] ? __lock_is_held+0x49/0x70
> [ 5840.124540]  [<ffffffff812ac69c>] seq_read+0x4c/0x3e0
> [ 5840.130175]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
> [ 5840.137360]  [<ffffffff8131566b>] kernfs_fop_read+0x12b/0x1b0
> [ 5840.143770]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
> [ 5840.150956]  [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
> [ 5840.157657]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
> [ 5840.164843]  [<ffffffff81281a03>] do_readv_writev+0x213/0x230
> [ 5840.171255]  [<ffffffff81418cf9>] ? __pipe_get_pages+0x24/0x9b
> [ 5840.177762]  [<ffffffff813e6f0f>] ? iov_iter_get_pages_alloc+0x19f/0x360
> [ 5840.185240]  [<ffffffff810fd5f2>] ? __lock_acquire+0x472/0x1990
> [ 5840.191843]  [<ffffffff81281a59>] vfs_readv+0x39/0x50
> [ 5840.197478]  [<ffffffff812bc55a>] default_file_splice_read+0x1aa/0x2c0
> [ 5840.204763]  [<ffffffff810cba89>] ? __might_sleep+0x49/0x80
> [ 5840.210980]  [<ffffffff81349c93>] ? security_file_permission+0xa3/0xc0
> [ 5840.218264]  [<ffffffff812bb913>] do_splice_to+0x73/0x90
> [ 5840.224190]  [<ffffffff812bba1b>] splice_direct_to_actor+0xeb/0x220
> [ 5840.231182]  [<ffffffff812baee0>] ? generic_pipe_buf_nosteal+0x10/0x10
> [ 5840.238465]  [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
> [ 5840.244778]  [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
> [ 5840.250802]  [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
> [ 5840.256922]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 5840.263042]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> 
>    CAI Qian
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-06 12:20                                                   ` CAI Qian
  2016-10-06 12:25                                                     ` CAI Qian
@ 2016-10-07  9:27                                                     ` Dave Chinner
  1 sibling, 0 replies; 104+ messages in thread
From: Dave Chinner @ 2016-10-07  9:27 UTC (permalink / raw)
  To: CAI Qian
  Cc: Al Viro, tj, Linus Torvalds, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Thu, Oct 06, 2016 at 08:20:17AM -0400, CAI Qian wrote:
> 
> 
> ----- Original Message -----
> > From: "Al Viro" <viro@ZenIV.linux.org.uk>
> > To: "CAI Qian" <caiqian@redhat.com>
> > Cc: "tj" <tj@kernel.org>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>,
> > "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> > linux-fsdevel@vger.kernel.org
> > Sent: Wednesday, October 5, 2016 4:05:22 PM
> > Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> > 
> > On Wed, Oct 05, 2016 at 02:57:04PM -0400, CAI Qian wrote:
> > 
> > > Not sure if this related, and there is always a lockdep regards procfs
> > > happened
> > > below unless masking by other lockdep issues before the cgroup hang. Also,
> > > this
> > > hang is always reproducible.
> > 
> > Sigh...  Let's get the /proc/*/auxv out of the way - this should deal with
> > it:
> So I applied both this and the sanity patch, and both original sanity and the
> proc warnings went away. However, the cgroup hang can still be reproduced as
> well as this new xfs internal error below,
> 
> [16921.141233] XFS (dm-0): Internal error XFS_WANT_CORRUPTED_RETURN at line 5619 of file fs/xfs/libxfs/xfs_bmap.c.  Caller xfs_bmap_shift_extents+0x1cc/0x3a0 [xfs]
> [16921.157694] CPU: 9 PID: 52920 Comm: trinity-c108 Not tainted 4.8.0-rc8-splice-fixw-proc+ #4

It found a delayed allocation extent in the extent map after
flushing all the dirty data in the file. Something else has gone
wrong; this corruption detection is just the messenger. Maybe
memory corruption?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-06 17:00                                                         ` Linus Torvalds
  2016-10-06 18:12                                                           ` CAI Qian
@ 2016-10-07  9:57                                                           ` Dave Chinner
  2016-10-07 15:25                                                             ` Linus Torvalds
  1 sibling, 1 reply; 104+ messages in thread
From: Dave Chinner @ 2016-10-07  9:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: CAI Qian, Al Viro, tj, linux-xfs, Jens Axboe, Nick Piggin, linux-fsdevel

On Thu, Oct 06, 2016 at 10:00:08AM -0700, Linus Torvalds wrote:
> On Thu, Oct 6, 2016 at 9:11 AM, CAI Qian <caiqian@redhat.com> wrote:
> >
> >>
> >> Wait. There is also a lockep happened before the xfs internal error as well.
> > Some other lockdep this time,
> 
> This one looks just bogus.
> 
> > [ 4872.569797]  Possible unsafe locking scenario:
> > [ 4872.569797]
> > [ 4872.576401]        CPU0
> > [ 4872.579127]        ----
> > [ 4872.581854]   lock(&xfs_nondir_ilock_class);
> > [ 4872.586637]   <Interrupt>
> > [ 4872.589558]     lock(&xfs_nondir_ilock_class);
> 
> I'm not seeing that .lock taken in interrupt context.

It's a memory allocation vs reclaim context warning, not a lock
warning. That overloads the lock vs interrupt lockdep mechanism, so
if lockdep sees a context violation it is reported as an "interrupt
context" lock problem.

The allocation context in question is in a function that can be
called from both inside and outside a transaction context. When
outside a transaction, it's a GFP_KERNEL allocation; when inside,
it's a GFP_NOFS allocation.  However, both allocation contexts hold
the inode ilock over the allocation.

The inode shrinker (reclaim context) also happens to take the inode
ilock, and that's what lockdep is complaining about. i.e. it thinks
that this path ilock -> alloc(GFP_KERNEL) -> reclaim -> ilock can
deadlock. But it can't - the ilock held on the upper side belongs to
a referenced inode that can't be seen by reclaim, and the ilocks
taken by reclaim belong to inodes that can't be seen or referenced by
the VFS.

i.e. there are no dependencies between the ilocks on either side of
memory allocation, but there's no way of telling lockdep that short
of giving the inodes in reclaim a different lock class. We used to
do that, but that was a nasty hack and prevented lockdep from
verifying that the locking orders used on inodes and objects in
reclaim matched the locking orders of referenced inodes...
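
To make the class-versus-instance point concrete, here is a rough toy
model (plain Python, not kernel code). The Inode class, the
reclaim_would_recurse() helper and the inode names are all made up for
illustration; only the lock class name comes from the report. It only
shows that a checker keyed on lock class sees "the same lock" on both
sides of the allocation, while a per-instance view does not.

```python
# Toy model (not kernel code): compare a checker that keys on lock
# *class* with one that keys on the individual lock instance.

class Inode:
    def __init__(self, name, referenced):
        self.name = name
        # True: reachable through the VFS and therefore hidden from reclaim.
        self.referenced = referenced
        # Every non-directory inode shares one lock class in this model.
        self.lock_class = "xfs_nondir_ilock_class"


def reclaim_would_recurse(held, reclaim_candidates, by_class):
    """Return True if reclaim could take a lock 'equal' to one already held."""
    key = (lambda i: i.lock_class) if by_class else (lambda i: i.name)
    held_keys = {key(i) for i in held}
    return any(key(i) in held_keys for i in reclaim_candidates)


upper = Inode("inode-A", referenced=True)     # ilock held across the allocation
victim = Inode("inode-B", referenced=False)   # unreferenced, eligible for reclaim

# In this model reclaim only ever sees unreferenced inodes.
candidates = [i for i in (upper, victim) if not i.referenced]

print(reclaim_would_recurse([upper], candidates, by_class=True))   # prints True
print(reclaim_would_recurse([upper], candidates, by_class=False))  # prints False
```

The class view reports a possible recursion, the instance view does
not, which is the false positive described above.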

We've historically shut these false positives up by simply making
all the allocations in these dual context paths GFP_NOFS. However, I
recently got told not to do that by someone on the mm side because
it exacerbated deficiencies in memory reclaim when too many
allocations use GFP_NOFS.
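
A similarly rough sketch of why switching to GFP_NOFS silences the
report, again as a toy Python model rather than kernel code. The
allocation_warns() function and the string constants are hypothetical
stand-ins; the assumption is simply that a GFP_NOFS allocation never
enters filesystem reclaim, so the checker has no edge to record.

```python
# Toy model (not kernel code): a class-based checker only complains when
# the allocation is allowed to recurse into filesystem reclaim.

GFP_KERNEL = "GFP_KERNEL"   # may recurse into filesystem reclaim
GFP_NOFS = "GFP_NOFS"       # must not recurse into filesystem reclaim

ILOCK = "xfs_nondir_ilock_class"


def allocation_warns(held_classes, gfp, reclaim_classes=(ILOCK,)):
    """Return True if the toy checker would flag this allocation."""
    if gfp == GFP_NOFS:
        # Filesystem reclaim is never entered, so no dependency exists.
        return False
    # Otherwise any held class that reclaim also takes is flagged.
    return any(c in held_classes for c in reclaim_classes)


print(allocation_warns({ILOCK}, GFP_KERNEL))  # prints True
print(allocation_warns({ILOCK}, GFP_NOFS))    # prints False
```

The cost is the one noted above: every allocation converted this way
also loses the ability to make progress through filesystem reclaim.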

So it's not "fixed" and instead I'm ignoring it.  If you spend any
amount of time running lockdep on XFS you'll get as sick and tired
of playing this whack-a-lockdep-false-positive game as I am.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-07  7:08                                                       ` Jan Kara
@ 2016-10-07 14:43                                                         ` CAI Qian
  2016-10-07 15:27                                                           ` CAI Qian
  2016-10-09 21:51                                                           ` Dave Chinner
  2016-10-21 15:38                                                         ` [4.9-rc1+] overlayfs lockdep CAI Qian
  1 sibling, 2 replies; 104+ messages in thread
From: CAI Qian @ 2016-10-07 14:43 UTC (permalink / raw)
  To: Jan Kara
  Cc: Al Viro, tj, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel, Miklos Szeredi, Dave Jones



----- Original Message -----
> From: "Jan Kara" <jack@suse.cz>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Al Viro" <viro@ZenIV.linux.org.uk>, "tj" <tj@kernel.org>, "Linus Torvalds" <torvalds@linux-foundation.org>,
> "Dave Chinner" <david@fromorbit.com>, "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick
> Piggin" <npiggin@gmail.com>, linux-fsdevel@vger.kernel.org, "Miklos Szeredi" <miklos@szeredi.hu>
> Sent: Friday, October 7, 2016 3:08:38 AM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> 
> So I believe this may be just a problem in overlayfs lockdep annotation
> (see below). Added Miklos to CC.
> 
> On Thu 06-10-16 08:25:59, CAI Qian wrote:
> > > > > Not sure if this related, and there is always a lockdep regards
> > > > > procfs
> > > > > happened
> > > > > below unless masking by other lockdep issues before the cgroup hang.
> > > > > Also,
> > > > > this
> > > > > hang is always reproducible.
> > > > 
> > > > Sigh...  Let's get the /proc/*/auxv out of the way - this should deal
> > > > with
> > > > it:
> > > So I applied both this and the sanity patch, and both original sanity and
> > > the
> > > proc warnings went away. However, the cgroup hang can still be reproduced
> > > as
> > > well as this new xfs internal error below,
> > 
> > Wait. There is also a lockep happened before the xfs internal error as
> > well.
> > 
> > [ 5839.452325] ======================================================
> > [ 5839.459221] [ INFO: possible circular locking dependency detected ]
> > [ 5839.466215] 4.8.0-rc8-splice-fixw-proc+ #4 Not tainted
> > [ 5839.471945] -------------------------------------------------------
> > [ 5839.478937] trinity-c220/69531 is trying to acquire lock:
> > [ 5839.484961]  (&p->lock){+.+.+.}, at: [<ffffffff812ac69c>]
> > seq_read+0x4c/0x3e0
> > [ 5839.492967]
> > but task is already holding lock:
> > [ 5839.499476]  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>]
> > __sb_start_write+0xd1/0xf0
> > [ 5839.508560]
> > which lock already depends on the new lock.
> > 
> > [ 5839.517686]
> > the existing dependency chain (in reverse order) is:
> > [ 5839.526036]
> > -> #3 (sb_writers#8){.+.+.+}:
> > [ 5839.530751]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> > [ 5839.537368]        [<ffffffff810f8f4a>] percpu_down_read+0x4a/0x90
> > [ 5839.544275]        [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
> > [ 5839.551181]        [<ffffffff812a8544>] mnt_want_write+0x24/0x50
> > [ 5839.557892]        [<ffffffffa04a398f>] ovl_want_write+0x1f/0x30
> > [overlay]
> > [ 5839.565577]        [<ffffffffa04a6036>] ovl_do_remove+0x46/0x480
> > [overlay]
> > [ 5839.573259]        [<ffffffffa04a64a3>] ovl_unlink+0x13/0x20 [overlay]
> > [ 5839.580555]        [<ffffffff812918ea>] vfs_unlink+0xda/0x190
> > [ 5839.586979]        [<ffffffff81293698>] do_unlinkat+0x268/0x2b0
> > [ 5839.593599]        [<ffffffff8129419b>] SyS_unlinkat+0x1b/0x30
> > [ 5839.600120]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> > [ 5839.606836]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> > [ 5839.614231]
> 
> So here is IMO the real culprit: do_unlinkat() grabs fs freeze protection
> through mnt_want_write(), we grab also i_rwsem in do_unlinkat() in
> I_MUTEX_PARENT class a bit after that and further down in vfs_unlink() we
> grab i_rwsem for the unlinked inode itself in default I_MUTEX class. Then
> in ovl_want_write() we grab freeze protection again, but this time for the
> upper filesystem. That establishes sb_writers (overlay) -> I_MUTEX_PARENT
> (overlay) -> I_MUTEX (overlay) -> sb_writers (FS-A) lock ordering
> (we maintain locking classes per fs type so that's why I'm showing fs type
> in parenthesis).
> 
> Now this nesting is nasty because once you add locks that are not tracked
> per fs type into the mix, you get cycles. In this case we've got
> seq_file->lock and cred_guard_mutex into the mix - the splice path is
> doing sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex (splicing
> from seq_file into the real filesystem). Exec path further establishes
> cred_guard_mutex -> I_MUTEX (overlay) which closes the full cycle:
> 
> sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex -> i_mutex
> (overlay) -> sb_writers (FS-A)
> 
> If I analyzed the lockdep trace, this looks like a real (although remote)
> deadlock possibility. Miklos?
> 
> 								Honza
> 
> > -> #2 (&sb->s_type->i_mutex_key#17){++++++}:
> > [ 5839.620399]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> > [ 5839.627015]        [<ffffffff817d1b77>] down_read+0x47/0x70
> > [ 5839.633242]        [<ffffffff8128cfd2>] lookup_slow+0xc2/0x1f0
> > [ 5839.639762]        [<ffffffff8128f6f2>] walk_component+0x172/0x220
> > [ 5839.646668]        [<ffffffff81290fd6>] link_path_walk+0x1a6/0x620
> > [ 5839.653574]        [<ffffffff81291a81>] path_openat+0xe1/0xdb0
> > [ 5839.660092]        [<ffffffff812939e1>] do_filp_open+0x91/0x100
> > [ 5839.666707]        [<ffffffff81288e06>] do_open_execat+0x76/0x180
> > [ 5839.673517]        [<ffffffff81288f3b>] open_exec+0x2b/0x50
> > [ 5839.679743]        [<ffffffff812eccf3>] load_elf_binary+0x2a3/0x10a0
> > [ 5839.686844]        [<ffffffff81288917>] search_binary_handler+0x97/0x1d0
> > [ 5839.694331]        [<ffffffff81289ed8>]
> > do_execveat_common.isra.35+0x678/0x9a0
> > [ 5839.702400]        [<ffffffff8128a4da>] SyS_execve+0x3a/0x50
> > [ 5839.708726]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> > [ 5839.715441]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> > [ 5839.722833]
> > -> #1 (&sig->cred_guard_mutex){+.+.+.}:
> > [ 5839.728510]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> > [ 5839.735126]        [<ffffffff817cfc66>]
> > mutex_lock_killable_nested+0x86/0x540
> > [ 5839.743097]        [<ffffffff81301e84>] lock_trace+0x24/0x60
> > [ 5839.749421]        [<ffffffff8130224d>] proc_pid_syscall+0x2d/0x110
> > [ 5839.756423]        [<ffffffff81302af0>] proc_single_show+0x50/0x90
> > [ 5839.763330]        [<ffffffff812ab867>] traverse+0xf7/0x210
> > [ 5839.769557]        [<ffffffff812ac9eb>] seq_read+0x39b/0x3e0
> > [ 5839.775884]        [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
> > [ 5839.783179]        [<ffffffff81281a03>] do_readv_writev+0x213/0x230
> > [ 5839.790181]        [<ffffffff81281a59>] vfs_readv+0x39/0x50
> > [ 5839.796406]        [<ffffffff81281c12>] do_preadv+0xa2/0xc0
> > [ 5839.802634]        [<ffffffff81282ec1>] SyS_preadv+0x11/0x20
> > [ 5839.808963]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> > [ 5839.815681]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> > [ 5839.823075]
> > -> #0 (&p->lock){+.+.+.}:
> > [ 5839.827395]        [<ffffffff810fe69c>] __lock_acquire+0x151c/0x1990
> > [ 5839.834500]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> > [ 5839.841115]        [<ffffffff817cf3b6>] mutex_lock_nested+0x76/0x450
> > [ 5839.848219]        [<ffffffff812ac69c>] seq_read+0x4c/0x3e0
> > [ 5839.854448]        [<ffffffff8131566b>] kernfs_fop_read+0x12b/0x1b0
> > [ 5839.861451]        [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
> > [ 5839.868742]        [<ffffffff81281a03>] do_readv_writev+0x213/0x230
> > [ 5839.875744]        [<ffffffff81281a59>] vfs_readv+0x39/0x50
> > [ 5839.881971]        [<ffffffff812bc55a>]
> > default_file_splice_read+0x1aa/0x2c0
> > [ 5839.889847]        [<ffffffff812bb913>] do_splice_to+0x73/0x90
> > [ 5839.896365]        [<ffffffff812bba1b>]
> > splice_direct_to_actor+0xeb/0x220
> > [ 5839.903950]        [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
> > [ 5839.910857]        [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
> > [ 5839.917470]        [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
> > [ 5839.924184]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> > [ 5839.930898]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> > [ 5839.938286]
> > other info that might help us debug this:
> > 
> > [ 5839.947217] Chain exists of:
> >   &p->lock --> &sb->s_type->i_mutex_key#17 --> sb_writers#8
> > 
> > [ 5839.956615]  Possible unsafe locking scenario:
> > 
> > [ 5839.963218]        CPU0                    CPU1
> > [ 5839.968269]        ----                    ----
> > [ 5839.973321]   lock(sb_writers#8);
> > [ 5839.977046]
> > lock(&sb->s_type->i_mutex_key#17);
> > [ 5839.985037]                                lock(sb_writers#8);
> > [ 5839.991573]   lock(&p->lock);
> > [ 5839.994900]
> >  *** DEADLOCK ***
> > 
> > [ 5840.001503] 1 lock held by trinity-c220/69531:
> > [ 5840.006457]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>]
> > __sb_start_write+0xd1/0xf0
> > [ 5840.016031]
> > stack backtrace:
> > [ 5840.020891] CPU: 12 PID: 69531 Comm: trinity-c220 Not tainted
> > 4.8.0-rc8-splice-fixw-proc+ #4
> > [ 5840.030306] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS
> > GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
> > [ 5840.041660]  0000000000000086 00000000a1ef62f8 ffff8803ca52f7c0
> > ffffffff813d2ecc
> > [ 5840.049952]  ffffffff82a41160 ffffffff82a913e0 ffff8803ca52f800
> > ffffffff811dd630
> > [ 5840.058245]  ffff8803ca52f840 ffff880392c4ecc8 ffff880392c4e000
> > 0000000000000001
> > [ 5840.066537] Call Trace:
> > [ 5840.069266]  [<ffffffff813d2ecc>] dump_stack+0x85/0xc9
> > [ 5840.075000]  [<ffffffff811dd630>] print_circular_bug+0x1f9/0x207
> > [ 5840.081701]  [<ffffffff810fe69c>] __lock_acquire+0x151c/0x1990
> > [ 5840.088208]  [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> > [ 5840.094232]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
> > [ 5840.100061]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
> > [ 5840.105891]  [<ffffffff817cf3b6>] mutex_lock_nested+0x76/0x450
> > [ 5840.112397]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
> > [ 5840.118228]  [<ffffffff810fb3e9>] ? __lock_is_held+0x49/0x70
> > [ 5840.124540]  [<ffffffff812ac69c>] seq_read+0x4c/0x3e0
> > [ 5840.130175]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
> > [ 5840.137360]  [<ffffffff8131566b>] kernfs_fop_read+0x12b/0x1b0
> > [ 5840.143770]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
> > [ 5840.150956]  [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
> > [ 5840.157657]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
> > [ 5840.164843]  [<ffffffff81281a03>] do_readv_writev+0x213/0x230
> > [ 5840.171255]  [<ffffffff81418cf9>] ? __pipe_get_pages+0x24/0x9b
> > [ 5840.177762]  [<ffffffff813e6f0f>] ? iov_iter_get_pages_alloc+0x19f/0x360
> > [ 5840.185240]  [<ffffffff810fd5f2>] ? __lock_acquire+0x472/0x1990
> > [ 5840.191843]  [<ffffffff81281a59>] vfs_readv+0x39/0x50
> > [ 5840.197478]  [<ffffffff812bc55a>] default_file_splice_read+0x1aa/0x2c0
> > [ 5840.204763]  [<ffffffff810cba89>] ? __might_sleep+0x49/0x80
> > [ 5840.210980]  [<ffffffff81349c93>] ? security_file_permission+0xa3/0xc0
> > [ 5840.218264]  [<ffffffff812bb913>] do_splice_to+0x73/0x90
> > [ 5840.224190]  [<ffffffff812bba1b>] splice_direct_to_actor+0xeb/0x220
> > [ 5840.231182]  [<ffffffff812baee0>] ? generic_pipe_buf_nosteal+0x10/0x10
> > [ 5840.238465]  [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
> > [ 5840.244778]  [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
> > [ 5840.250802]  [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
> > [ 5840.256922]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> > [ 5840.263042]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
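
The ordering chain in Jan's quoted analysis above can be checked
mechanically. What follows is a minimal sketch in Python, not what
lockdep actually does internally: find_cycle() is a hypothetical
helper doing a plain depth-first search, and the edge list is
transcribed by hand from the four orderings named in the analysis
(splice path, seq_file read of /proc, exec path, overlayfs unlink).

```python
def find_cycle(edges):
    """Return one cycle as a list of nodes (first == last), or None."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    state = {}   # node -> "active" (on the current path) or "done"
    path = []

    def visit(node):
        state[node] = "active"
        path.append(node)
        for nxt in graph[node]:
            if state.get(nxt) == "active":
                # Back edge: the cycle is the path from nxt to here.
                return path[path.index(nxt):] + [nxt]
            if nxt not in state:
                found = visit(nxt)
                if found:
                    return found
        state[node] = "done"
        path.pop()
        return None

    for start in graph:
        if start not in state:
            found = visit(start)
            if found:
                return found
    return None


# "A -> B" means: B was acquired while A was held.
edges = [
    ("sb_writers (FS-A)", "seq_file->lock"),      # splice path
    ("seq_file->lock", "cred_guard_mutex"),       # seq_file read of /proc
    ("cred_guard_mutex", "i_mutex (overlay)"),    # exec path
    ("i_mutex (overlay)", "sb_writers (FS-A)"),   # overlayfs unlink
]

print(" -> ".join(find_cycle(edges)))
# Without the exec-path edge the chain no longer closes.
print(find_cycle([e for e in edges if e[0] != "cred_guard_mutex"]))
```

Dropping any single edge breaks the cycle in this model, which is why
the report needs all four code paths to have been exercised.
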
Hmm, this round of trinity triggered a different hang.

[ 2094.403119] INFO: task trinity-c0:3126 blocked for more than 120 seconds.
[ 2094.410705]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2094.417027] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2094.425770] trinity-c0      D ffff88044efc3d10 13472  3126   3124 0x00000084
[ 2094.433659]  ffff88044efc3d10 ffffffff00000000 ffff880400000000 ffff880822b5e000
[ 2094.441965]  ffff88044c8b8000 ffff88044efc4000 ffff880443755670 ffff880443755658
[ 2094.450272]  ffffffff00000000 ffff88044c8b8000 ffff88044efc3d28 ffffffff817cdaaf
[ 2094.458572] Call Trace:
[ 2094.461312]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2094.466858]  [<ffffffff817d2782>] rwsem_down_write_failed+0x242/0x4b0
[ 2094.474049]  [<ffffffff817d25ac>] ? rwsem_down_write_failed+0x6c/0x4b0
[ 2094.481352]  [<ffffffff810fd5f2>] ? __lock_acquire+0x472/0x1990
[ 2094.487964]  [<ffffffff813e27b7>] call_rwsem_down_write_failed+0x17/0x30
[ 2094.495450]  [<ffffffff817d1bff>] down_write+0x5f/0x80
[ 2094.501190]  [<ffffffff8127e301>] ? chown_common.isra.12+0x131/0x1e0
[ 2094.508284]  [<ffffffff8127e301>] chown_common.isra.12+0x131/0x1e0
[ 2094.515177]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
[ 2094.521692]  [<ffffffff810cc367>] ? preempt_count_add+0x47/0xc0
[ 2094.528304]  [<ffffffff812a665f>] ? mnt_clone_write+0x3f/0x70
[ 2094.534723]  [<ffffffff8127faef>] SyS_fchown+0x8f/0xa0
[ 2094.540463]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2094.546588]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2094.553784] 2 locks held by trinity-c0/3126:
[ 2094.558552]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[ 2094.568240]  #1:  (&sb->s_type->i_mutex_key#17){++++++}, at: [<ffffffff8127e301>] chown_common.isra.12+0x131/0x1e0
[ 2094.579864] INFO: task trinity-c1:3127 blocked for more than 120 seconds.
[ 2094.587442]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2094.593761] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2094.602503] trinity-c1      D ffff88045a1bbd10 13312  3127   3124 0x00000084
[ 2094.610402]  ffff88045a1bbd10 ffff880443769fe8 ffff880400000000 ffff88046cefe000
[ 2094.618710]  ffff88044c8ba000 ffff88045a1bc000 ffff880443769fd0 ffff88045a1bbd40
[ 2094.627015]  ffff880443769fe8 ffff88044376a158 ffff88045a1bbd28 ffffffff817cdaaf
[ 2094.635321] Call Trace:
[ 2094.638053]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2094.643597]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[ 2094.650726]  [<ffffffffa0322cca>] ? xfs_file_fsync+0xea/0x2e0 [xfs]
[ 2094.657727]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[ 2094.665119]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[ 2094.671457]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[ 2094.677987]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2094.684324]  [<ffffffffa0322cca>] xfs_file_fsync+0xea/0x2e0 [xfs]
[ 2094.691133]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
[ 2094.697354]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
[ 2094.702896]  [<ffffffff812bdf40>] SyS_fsync+0x10/0x20
[ 2094.708528]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2094.714652]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2094.721844] 1 lock held by trinity-c1/3127:
[ 2094.726515]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2094.737181] INFO: task trinity-c2:3128 blocked for more than 120 seconds.
[ 2094.744751]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2094.751068] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2094.759810] trinity-c2      D ffff8804574f3df8 13472  3128   3124 0x00000084
[ 2094.767692]  ffff8804574f3df8 0000000000000006 0000000000000000 ffff8804569a4000
[ 2094.776002]  ffff88044c8bc000 ffff8804574f4000 ffff8804622eb338 ffff88044c8bc000
[ 2094.784307]  0000000000000246 00000000ffffffff ffff8804574f3e10 ffffffff817cdaaf
[ 2094.792605] Call Trace:
[ 2094.795340]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2094.800886]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 2094.808078]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 2094.814688]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
[ 2094.820715]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[ 2094.826544]  [<ffffffff81297f53>] SyS_getdents+0x83/0x140
[ 2094.832573]  [<ffffffff81297cd0>] ? fillonedir+0x100/0x100
[ 2094.838699]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2094.844822]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2094.852013] 1 lock held by trinity-c2/3128:
[ 2094.856682]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[ 2094.865969] INFO: task trinity-c3:3129 blocked for more than 120 seconds.
[ 2094.873547]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2094.879864] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2094.888606] trinity-c3      D ffff880455ce3e08 13440  3129   3124 0x00000084
[ 2094.896495]  ffff880455ce3e08 0000000000000006 0000000000000000 ffff88045144e000
[ 2094.904803]  ffff88044c8be000 ffff880455ce4000 ffff8804622eb338 ffff88044c8be000
[ 2094.913111]  0000000000000246 00000000ffffffff ffff880455ce3e20 ffffffff817cdaaf
[ 2094.921418] Call Trace:
[ 2094.924152]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2094.929695]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 2094.936885]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 2094.943496]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
[ 2094.949526]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
[ 2094.956620]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[ 2094.962454]  [<ffffffff81298091>] SyS_getdents64+0x81/0x130
[ 2094.968675]  [<ffffffff81297a80>] ? iterate_dir+0x190/0x190
[ 2094.974895]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2094.981019]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2094.988204] 1 lock held by trinity-c3/3129:
[ 2094.992872]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[ 2095.002158] INFO: task trinity-c4:3130 blocked for more than 120 seconds.
[ 2095.009734]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2095.016052] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2095.024793] trinity-c4      D ffff880458997e28 13392  3130   3124 0x00000084
[ 2095.032690]  ffff880458997e28 0000000000000006 0000000000000000 ffff88046ca18000
[ 2095.040995]  ffff880458998000 ffff880458998000 ffff8804622eb338 ffff880458998000
[ 2095.049342]  0000000000000246 00000000ffffffff ffff880458997e40 ffffffff817cdaaf
[ 2095.057650] Call Trace:
[ 2095.060382]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2095.065926]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 2095.073118]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 2095.079728]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
[ 2095.085757]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[ 2095.091589]  [<ffffffff812811dd>] SyS_lseek+0x1d/0xb0
[ 2095.097229]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2095.103355]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2095.110547] 1 lock held by trinity-c4/3130:
[ 2095.115216]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[ 2095.124507] INFO: task trinity-c5:3131 blocked for more than 120 seconds.
[ 2095.132083]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2095.138402] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2095.147135] trinity-c5      D ffff88045a12bae0 13472  3131   3124 0x00000084
[ 2095.155034]  ffff88045a12bae0 ffff880443769fe8 ffff880400000000 ffff88046ca1a000
[ 2095.163339]  ffff88045899a000 ffff88045a12c000 ffff880443769fd0 ffff88045a12bb10
[ 2095.171645]  ffff880443769fe8 0000000000000000 ffff88045a12baf8 ffffffff817cdaaf
[ 2095.179952] Call Trace:
[ 2095.182684]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2095.188230]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[ 2095.195341]  [<ffffffffa03337d4>] ? xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
[ 2095.203310]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[ 2095.210696]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[ 2095.217029]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[ 2095.223558]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2095.229894]  [<ffffffffa03337d4>] xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
[ 2095.237682]  [<ffffffffa02ccfaf>] xfs_attr_get+0xdf/0x1b0 [xfs]
[ 2095.244312]  [<ffffffffa0341bfc>] xfs_xattr_get+0x4c/0x70 [xfs]
[ 2095.250924]  [<ffffffff812ad269>] generic_getxattr+0x59/0x70
[ 2095.257244]  [<ffffffff812acf9b>] vfs_getxattr+0x8b/0xb0
[ 2095.263177]  [<ffffffffa0435bd6>] ovl_xattr_get+0x46/0x60 [overlay]
[ 2095.270176]  [<ffffffffa04331aa>] ovl_other_xattr_get+0x1a/0x20 [overlay]
[ 2095.277756]  [<ffffffff812ad269>] generic_getxattr+0x59/0x70
[ 2095.284079]  [<ffffffff81345e9e>] cap_inode_need_killpriv+0x2e/0x40
[ 2095.291078]  [<ffffffff81349a33>] security_inode_need_killpriv+0x33/0x50
[ 2095.298560]  [<ffffffff812a2fb0>] dentry_needs_remove_privs+0x30/0x50
[ 2095.305743]  [<ffffffff8127ea21>] do_truncate+0x51/0xc0
[ 2095.311581]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
[ 2095.318094]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
[ 2095.324609]  [<ffffffff8127edde>] do_sys_ftruncate.constprop.15+0xfe/0x160
[ 2095.332286]  [<ffffffff8127ee7e>] SyS_ftruncate+0xe/0x10
[ 2095.338225]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2095.344339]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2095.351531] 2 locks held by trinity-c5/3131:
[ 2095.356297]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[ 2095.365983]  #1:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2095.376647] INFO: task trinity-c6:3132 blocked for more than 120 seconds.
[ 2095.384216]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2095.390535] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2095.399275] trinity-c6      D ffff88044da5fd30 13312  3132   3124 0x00000084
[ 2095.407177]  ffff88044da5fd30 ffffffff00000000 ffff880400000000 ffff880459858000
[ 2095.415485]  ffff88045899c000 ffff88044da60000 ffff880443755670 ffff880443755658
[ 2095.423789]  ffffffff00000000 ffff88045899c000 ffff88044da5fd48 ffffffff817cdaaf
[ 2095.432094] Call Trace:
[ 2095.434825]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2095.440372]  [<ffffffff817d2782>] rwsem_down_write_failed+0x242/0x4b0
[ 2095.447565]  [<ffffffff817d25ac>] ? rwsem_down_write_failed+0x6c/0x4b0
[ 2095.454854]  [<ffffffff813e27b7>] call_rwsem_down_write_failed+0x17/0x30
[ 2095.462337]  [<ffffffff817d1bff>] down_write+0x5f/0x80
[ 2095.468077]  [<ffffffff8127e413>] ? chmod_common+0x63/0x150
[ 2095.474300]  [<ffffffff8127e413>] chmod_common+0x63/0x150
[ 2095.480327]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
[ 2095.487421]  [<ffffffff810035cc>] ? syscall_trace_enter+0x1dc/0x390
[ 2095.494418]  [<ffffffff8127f5f2>] SyS_fchmod+0x52/0x80
[ 2095.500155]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2095.506270]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2095.513452] 2 locks held by trinity-c6/3132:
[ 2095.518217]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[ 2095.527895]  #1:  (&sb->s_type->i_mutex_key#17){++++++}, at: [<ffffffff8127e413>] chmod_common+0x63/0x150
[ 2095.538648] INFO: task trinity-c7:3133 blocked for more than 120 seconds.
[ 2095.546227]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2095.552544] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2095.561288] trinity-c7      D ffff88044d393d10 13472  3133   3124 0x00000084
[ 2095.569188]  ffff88044d393d10 ffff880443769fe8 ffff880400000000 ffff88086ce68000
[ 2095.577491]  ffff88045899e000 ffff88044d394000 ffff880443769fd0 ffff88044d393d40
[ 2095.585796]  ffff880443769fe8 ffff88044376a158 ffff88044d393d28 ffffffff817cdaaf
[ 2095.594103] Call Trace:
[ 2095.596836]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2095.602379]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[ 2095.609491]  [<ffffffffa0322cca>] ? xfs_file_fsync+0xea/0x2e0 [xfs]
[ 2095.616490]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[ 2095.623877]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[ 2095.630212]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[ 2095.636740]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2095.643076]  [<ffffffffa0322cca>] xfs_file_fsync+0xea/0x2e0 [xfs]
[ 2095.649889]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
[ 2095.656109]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
[ 2095.661653]  [<ffffffff812bdf40>] SyS_fsync+0x10/0x20
[ 2095.667291]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2095.673417]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2095.680610] 1 lock held by trinity-c7/3133:
[ 2095.685281]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2095.695947] INFO: task trinity-c8:3135 blocked for more than 120 seconds.
[ 2095.703530]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2095.709848] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2095.718590] trinity-c8      D ffff88044d3b3d10 12912  3135   3124 0x00000084
[ 2095.726470]  ffff88044d3b3d10 ffff880443769fe8 ffff880400000000 ffff88046ca30000
[ 2095.734775]  ffff88044d3a8000 ffff88044d3b4000 ffff880443769fd0 ffff88044d3b3d40
[ 2095.743083]  ffff880443769fe8 ffff88044376a158 ffff88044d3b3d28 ffffffff817cdaaf
[ 2095.751387] Call Trace:
[ 2095.754119]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2095.759662]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[ 2095.766772]  [<ffffffffa0322cca>] ? xfs_file_fsync+0xea/0x2e0 [xfs]
[ 2095.773763]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[ 2095.781148]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[ 2095.787482]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[ 2095.794013]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2095.800347]  [<ffffffffa0322cca>] xfs_file_fsync+0xea/0x2e0 [xfs]
[ 2095.807155]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
[ 2095.813377]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
[ 2095.818921]  [<ffffffff812bdf63>] SyS_fdatasync+0x13/0x20
[ 2095.824949]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2095.831074]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2095.838261] 1 lock held by trinity-c8/3135:
[ 2095.842930]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2095.853588] INFO: task trinity-c9:3136 blocked for more than 120 seconds.
[ 2095.861167]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2095.867485] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2095.876228] trinity-c9      D ffff88045b3679e0 13328  3136   3124 0x00000084
[ 2095.884111]  ffff88045b3679e0 ffff880443769fe8 ffff880400000000 ffff88086ce56000
[ 2095.892417]  ffff88044d3aa000 ffff88045b368000 ffff880443769fd0 ffff88045b367a10
[ 2095.900721]  ffff880443769fe8 ffff88044376a1e8 ffff88045b3679f8 ffffffff817cdaaf
[ 2095.909024] Call Trace:
[ 2095.911761]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2095.917305]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[ 2095.924414]  [<ffffffffa0333790>] ? xfs_ilock_data_map_shared+0x30/0x40 [xfs]
[ 2095.932383]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[ 2095.939768]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[ 2095.946104]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[ 2095.952632]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2095.958968]  [<ffffffffa0333790>] xfs_ilock_data_map_shared+0x30/0x40 [xfs]
[ 2095.966752]  [<ffffffffa03128c6>] __xfs_get_blocks+0x96/0x9d0 [xfs]
[ 2095.973753]  [<ffffffff8126462e>] ? mem_cgroup_event_ratelimit.isra.39+0x3e/0xb0
[ 2095.982012]  [<ffffffff8126e8e5>] ? mem_cgroup_commit_charge+0x95/0x110
[ 2095.989413]  [<ffffffffa0313214>] xfs_get_blocks+0x14/0x20 [xfs]
[ 2095.996122]  [<ffffffff812cca44>] do_mpage_readpage+0x474/0x800
[ 2096.002745]  [<ffffffffa0313200>] ? __xfs_get_blocks+0x9d0/0x9d0 [xfs]
[ 2096.010037]  [<ffffffff81402fd7>] ? debug_smp_processor_id+0x17/0x20
[ 2096.017136]  [<ffffffff811f3565>] ? __lru_cache_add+0x75/0xb0
[ 2096.023551]  [<ffffffff811f45fe>] ? lru_cache_add+0xe/0x10
[ 2096.029678]  [<ffffffff812ccf0d>] mpage_readpages+0x13d/0x1b0
[ 2096.036109]  [<ffffffffa0313200>] ? __xfs_get_blocks+0x9d0/0x9d0 [xfs]
[ 2096.043420]  [<ffffffffa0313200>] ? __xfs_get_blocks+0x9d0/0x9d0 [xfs]
[ 2096.050724]  [<ffffffffa0311f14>] xfs_vm_readpages+0x54/0x170 [xfs]
[ 2096.057724]  [<ffffffff811f1a1d>] __do_page_cache_readahead+0x2ad/0x370
[ 2096.065113]  [<ffffffff811f18ec>] ? __do_page_cache_readahead+0x17c/0x370
[ 2096.072693]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
[ 2096.079787]  [<ffffffff811f2014>] force_page_cache_readahead+0x94/0xf0
[ 2096.087077]  [<ffffffff811f2168>] SyS_readahead+0xa8/0xc0
[ 2096.093106]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2096.099234]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2096.106427] 1 lock held by trinity-c9/3136:
[ 2096.111097]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-07  9:57                                                           ` Dave Chinner
@ 2016-10-07 15:25                                                             ` Linus Torvalds
  0 siblings, 0 replies; 104+ messages in thread
From: Linus Torvalds @ 2016-10-07 15:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: CAI Qian, Al Viro, tj, linux-xfs, Jens Axboe, Nick Piggin, linux-fsdevel

On Fri, Oct 7, 2016 at 2:57 AM, Dave Chinner <david@fromorbit.com> wrote:
>
> So it's not "fixed" and instead I'm ignoring it.  If you spend any
> amount of time running lockdep on XFS you'll get as sick and tired
> of playing this whack-a-lockdep-false-positive game as I am.

Thanks for the background here. I'll try to remember it for the next
time this comes up. It doesn't help that lockdep reports are often a
bit cryptic to begin with (that "interrupt" thing certainly didn't
help).

             Linus

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-07 14:43                                                         ` CAI Qian
@ 2016-10-07 15:27                                                           ` CAI Qian
  2016-10-07 18:56                                                             ` CAI Qian
  2016-10-09 21:51                                                           ` Dave Chinner
  1 sibling, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-10-07 15:27 UTC (permalink / raw)
  To: Jan Kara, Miklos Szeredi, tj, Al Viro, Linus Torvalds, Dave Chinner
  Cc: linux-xfs, Jens Axboe, Nick Piggin, linux-fsdevel, Dave Jones



> Hmm, this round of trinity triggered a different hang.
This hang is reproducible so far with the command below on an overlayfs/xfs setup:

$ trinity -g vfs --arch 64 --disable-fds=sockets --disable-fds=perf --disable-fds=epoll \
  --disable-fds=eventfd --disable-fds=pseudo --disable-fds=timerfd --disable-fds=memfd \
  --disable-fds=drm
> 
> [ 2094.403119] INFO: task trinity-c0:3126 blocked for more than 120 seconds.
> [ 2094.410705]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2094.417027] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2094.425770] trinity-c0      D ffff88044efc3d10 13472  3126   3124
> 0x00000084
> [ 2094.433659]  ffff88044efc3d10 ffffffff00000000 ffff880400000000
> ffff880822b5e000
> [ 2094.441965]  ffff88044c8b8000 ffff88044efc4000 ffff880443755670
> ffff880443755658
> [ 2094.450272]  ffffffff00000000 ffff88044c8b8000 ffff88044efc3d28
> ffffffff817cdaaf
> [ 2094.458572] Call Trace:
> [ 2094.461312]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2094.466858]  [<ffffffff817d2782>] rwsem_down_write_failed+0x242/0x4b0
> [ 2094.474049]  [<ffffffff817d25ac>] ? rwsem_down_write_failed+0x6c/0x4b0
> [ 2094.481352]  [<ffffffff810fd5f2>] ? __lock_acquire+0x472/0x1990
> [ 2094.487964]  [<ffffffff813e27b7>] call_rwsem_down_write_failed+0x17/0x30
> [ 2094.495450]  [<ffffffff817d1bff>] down_write+0x5f/0x80
> [ 2094.501190]  [<ffffffff8127e301>] ? chown_common.isra.12+0x131/0x1e0
> [ 2094.508284]  [<ffffffff8127e301>] chown_common.isra.12+0x131/0x1e0
> [ 2094.515177]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
> [ 2094.521692]  [<ffffffff810cc367>] ? preempt_count_add+0x47/0xc0
> [ 2094.528304]  [<ffffffff812a665f>] ? mnt_clone_write+0x3f/0x70
> [ 2094.534723]  [<ffffffff8127faef>] SyS_fchown+0x8f/0xa0
> [ 2094.540463]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2094.546588]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2094.553784] 2 locks held by trinity-c0/3126:
> [ 2094.558552]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>]
> __sb_start_write+0xd1/0xf0
> [ 2094.568240]  #1:  (&sb->s_type->i_mutex_key#17){++++++}, at:
> [<ffffffff8127e301>] chown_common.isra.12+0x131/0x1e0
> [ 2094.579864] INFO: task trinity-c1:3127 blocked for more than 120 seconds.
> [ 2094.587442]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2094.593761] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2094.602503] trinity-c1      D ffff88045a1bbd10 13312  3127   3124
> 0x00000084
> [ 2094.610402]  ffff88045a1bbd10 ffff880443769fe8 ffff880400000000
> ffff88046cefe000
> [ 2094.618710]  ffff88044c8ba000 ffff88045a1bc000 ffff880443769fd0
> ffff88045a1bbd40
> [ 2094.627015]  ffff880443769fe8 ffff88044376a158 ffff88045a1bbd28
> ffffffff817cdaaf
> [ 2094.635321] Call Trace:
> [ 2094.638053]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2094.643597]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2094.650726]  [<ffffffffa0322cca>] ? xfs_file_fsync+0xea/0x2e0 [xfs]
> [ 2094.657727]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
> [ 2094.665119]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
> [ 2094.671457]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
> [ 2094.677987]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2094.684324]  [<ffffffffa0322cca>] xfs_file_fsync+0xea/0x2e0 [xfs]
> [ 2094.691133]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
> [ 2094.697354]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
> [ 2094.702896]  [<ffffffff812bdf40>] SyS_fsync+0x10/0x20
> [ 2094.708528]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2094.714652]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2094.721844] 1 lock held by trinity-c1/3127:
> [ 2094.726515]  #0:  (&xfs_nondir_ilock_class){++++..}, at:
> [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2094.737181] INFO: task trinity-c2:3128 blocked for more than 120 seconds.
> [ 2094.744751]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2094.751068] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2094.759810] trinity-c2      D ffff8804574f3df8 13472  3128   3124
> 0x00000084
> [ 2094.767692]  ffff8804574f3df8 0000000000000006 0000000000000000
> ffff8804569a4000
> [ 2094.776002]  ffff88044c8bc000 ffff8804574f4000 ffff8804622eb338
> ffff88044c8bc000
> [ 2094.784307]  0000000000000246 00000000ffffffff ffff8804574f3e10
> ffffffff817cdaaf
> [ 2094.792605] Call Trace:
> [ 2094.795340]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2094.800886]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
> [ 2094.808078]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
> [ 2094.814688]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
> [ 2094.820715]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
> [ 2094.826544]  [<ffffffff81297f53>] SyS_getdents+0x83/0x140
> [ 2094.832573]  [<ffffffff81297cd0>] ? fillonedir+0x100/0x100
> [ 2094.838699]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2094.844822]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2094.852013] 1 lock held by trinity-c2/3128:
> [ 2094.856682]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>]
> __fdget_pos+0x43/0x50
> [ 2094.865969] INFO: task trinity-c3:3129 blocked for more than 120 seconds.
> [ 2094.873547]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2094.879864] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2094.888606] trinity-c3      D ffff880455ce3e08 13440  3129   3124
> 0x00000084
> [ 2094.896495]  ffff880455ce3e08 0000000000000006 0000000000000000
> ffff88045144e000
> [ 2094.904803]  ffff88044c8be000 ffff880455ce4000 ffff8804622eb338
> ffff88044c8be000
> [ 2094.913111]  0000000000000246 00000000ffffffff ffff880455ce3e20
> ffffffff817cdaaf
> [ 2094.921418] Call Trace:
> [ 2094.924152]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2094.929695]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
> [ 2094.936885]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
> [ 2094.943496]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
> [ 2094.949526]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
> [ 2094.956620]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
> [ 2094.962454]  [<ffffffff81298091>] SyS_getdents64+0x81/0x130
> [ 2094.968675]  [<ffffffff81297a80>] ? iterate_dir+0x190/0x190
> [ 2094.974895]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2094.981019]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2094.988204] 1 lock held by trinity-c3/3129:
> [ 2094.992872]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>]
> __fdget_pos+0x43/0x50
> [ 2095.002158] INFO: task trinity-c4:3130 blocked for more than 120 seconds.
> [ 2095.009734]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2095.016052] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2095.024793] trinity-c4      D ffff880458997e28 13392  3130   3124
> 0x00000084
> [ 2095.032690]  ffff880458997e28 0000000000000006 0000000000000000
> ffff88046ca18000
> [ 2095.040995]  ffff880458998000 ffff880458998000 ffff8804622eb338
> ffff880458998000
> [ 2095.049342]  0000000000000246 00000000ffffffff ffff880458997e40
> ffffffff817cdaaf
> [ 2095.057650] Call Trace:
> [ 2095.060382]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2095.065926]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
> [ 2095.073118]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
> [ 2095.079728]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
> [ 2095.085757]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
> [ 2095.091589]  [<ffffffff812811dd>] SyS_lseek+0x1d/0xb0
> [ 2095.097229]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2095.103355]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2095.110547] 1 lock held by trinity-c4/3130:
> [ 2095.115216]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>]
> __fdget_pos+0x43/0x50
> [ 2095.124507] INFO: task trinity-c5:3131 blocked for more than 120 seconds.
> [ 2095.132083]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2095.138402] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2095.147135] trinity-c5      D ffff88045a12bae0 13472  3131   3124
> 0x00000084
> [ 2095.155034]  ffff88045a12bae0 ffff880443769fe8 ffff880400000000
> ffff88046ca1a000
> [ 2095.163339]  ffff88045899a000 ffff88045a12c000 ffff880443769fd0
> ffff88045a12bb10
> [ 2095.171645]  ffff880443769fe8 0000000000000000 ffff88045a12baf8
> ffffffff817cdaaf
> [ 2095.179952] Call Trace:
> [ 2095.182684]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2095.188230]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2095.195341]  [<ffffffffa03337d4>] ? xfs_ilock_attr_map_shared+0x34/0x40
> [xfs]
> [ 2095.203310]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
> [ 2095.210696]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
> [ 2095.217029]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.223558]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.229894]  [<ffffffffa03337d4>] xfs_ilock_attr_map_shared+0x34/0x40
> [xfs]
> [ 2095.237682]  [<ffffffffa02ccfaf>] xfs_attr_get+0xdf/0x1b0 [xfs]
> [ 2095.244312]  [<ffffffffa0341bfc>] xfs_xattr_get+0x4c/0x70 [xfs]
> [ 2095.250924]  [<ffffffff812ad269>] generic_getxattr+0x59/0x70
> [ 2095.257244]  [<ffffffff812acf9b>] vfs_getxattr+0x8b/0xb0
> [ 2095.263177]  [<ffffffffa0435bd6>] ovl_xattr_get+0x46/0x60 [overlay]
> [ 2095.270176]  [<ffffffffa04331aa>] ovl_other_xattr_get+0x1a/0x20 [overlay]
> [ 2095.277756]  [<ffffffff812ad269>] generic_getxattr+0x59/0x70
> [ 2095.284079]  [<ffffffff81345e9e>] cap_inode_need_killpriv+0x2e/0x40
> [ 2095.291078]  [<ffffffff81349a33>] security_inode_need_killpriv+0x33/0x50
> [ 2095.298560]  [<ffffffff812a2fb0>] dentry_needs_remove_privs+0x30/0x50
> [ 2095.305743]  [<ffffffff8127ea21>] do_truncate+0x51/0xc0
> [ 2095.311581]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
> [ 2095.318094]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
> [ 2095.324609]  [<ffffffff8127edde>] do_sys_ftruncate.constprop.15+0xfe/0x160
> [ 2095.332286]  [<ffffffff8127ee7e>] SyS_ftruncate+0xe/0x10
> [ 2095.338225]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2095.344339]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2095.351531] 2 locks held by trinity-c5/3131:
> [ 2095.356297]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>]
> __sb_start_write+0xd1/0xf0
> [ 2095.365983]  #1:  (&xfs_nondir_ilock_class){++++..}, at:
> [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.376647] INFO: task trinity-c6:3132 blocked for more than 120 seconds.
> [ 2095.384216]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2095.390535] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2095.399275] trinity-c6      D ffff88044da5fd30 13312  3132   3124
> 0x00000084
> [ 2095.407177]  ffff88044da5fd30 ffffffff00000000 ffff880400000000
> ffff880459858000
> [ 2095.415485]  ffff88045899c000 ffff88044da60000 ffff880443755670
> ffff880443755658
> [ 2095.423789]  ffffffff00000000 ffff88045899c000 ffff88044da5fd48
> ffffffff817cdaaf
> [ 2095.432094] Call Trace:
> [ 2095.434825]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2095.440372]  [<ffffffff817d2782>] rwsem_down_write_failed+0x242/0x4b0
> [ 2095.447565]  [<ffffffff817d25ac>] ? rwsem_down_write_failed+0x6c/0x4b0
> [ 2095.454854]  [<ffffffff813e27b7>] call_rwsem_down_write_failed+0x17/0x30
> [ 2095.462337]  [<ffffffff817d1bff>] down_write+0x5f/0x80
> [ 2095.468077]  [<ffffffff8127e413>] ? chmod_common+0x63/0x150
> [ 2095.474300]  [<ffffffff8127e413>] chmod_common+0x63/0x150
> [ 2095.480327]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
> [ 2095.487421]  [<ffffffff810035cc>] ? syscall_trace_enter+0x1dc/0x390
> [ 2095.494418]  [<ffffffff8127f5f2>] SyS_fchmod+0x52/0x80
> [ 2095.500155]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2095.506270]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2095.513452] 2 locks held by trinity-c6/3132:
> [ 2095.518217]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>]
> __sb_start_write+0xd1/0xf0
> [ 2095.527895]  #1:  (&sb->s_type->i_mutex_key#17){++++++}, at:
> [<ffffffff8127e413>] chmod_common+0x63/0x150
> [ 2095.538648] INFO: task trinity-c7:3133 blocked for more than 120 seconds.
> [ 2095.546227]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2095.552544] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2095.561288] trinity-c7      D ffff88044d393d10 13472  3133   3124
> 0x00000084
> [ 2095.569188]  ffff88044d393d10 ffff880443769fe8 ffff880400000000
> ffff88086ce68000
> [ 2095.577491]  ffff88045899e000 ffff88044d394000 ffff880443769fd0
> ffff88044d393d40
> [ 2095.585796]  ffff880443769fe8 ffff88044376a158 ffff88044d393d28
> ffffffff817cdaaf
> [ 2095.594103] Call Trace:
> [ 2095.596836]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2095.602379]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2095.609491]  [<ffffffffa0322cca>] ? xfs_file_fsync+0xea/0x2e0 [xfs]
> [ 2095.616490]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
> [ 2095.623877]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
> [ 2095.630212]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.636740]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.643076]  [<ffffffffa0322cca>] xfs_file_fsync+0xea/0x2e0 [xfs]
> [ 2095.649889]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
> [ 2095.656109]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
> [ 2095.661653]  [<ffffffff812bdf40>] SyS_fsync+0x10/0x20
> [ 2095.667291]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2095.673417]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2095.680610] 1 lock held by trinity-c7/3133:
> [ 2095.685281]  #0:  (&xfs_nondir_ilock_class){++++..}, at:
> [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.695947] INFO: task trinity-c8:3135 blocked for more than 120 seconds.
> [ 2095.703530]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2095.709848] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2095.718590] trinity-c8      D ffff88044d3b3d10 12912  3135   3124
> 0x00000084
> [ 2095.726470]  ffff88044d3b3d10 ffff880443769fe8 ffff880400000000
> ffff88046ca30000
> [ 2095.734775]  ffff88044d3a8000 ffff88044d3b4000 ffff880443769fd0
> ffff88044d3b3d40
> [ 2095.743083]  ffff880443769fe8 ffff88044376a158 ffff88044d3b3d28
> ffffffff817cdaaf
> [ 2095.751387] Call Trace:
> [ 2095.754119]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2095.759662]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2095.766772]  [<ffffffffa0322cca>] ? xfs_file_fsync+0xea/0x2e0 [xfs]
> [ 2095.773763]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
> [ 2095.781148]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
> [ 2095.787482]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.794013]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.800347]  [<ffffffffa0322cca>] xfs_file_fsync+0xea/0x2e0 [xfs]
> [ 2095.807155]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
> [ 2095.813377]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
> [ 2095.818921]  [<ffffffff812bdf63>] SyS_fdatasync+0x13/0x20
> [ 2095.824949]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2095.831074]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2095.838261] 1 lock held by trinity-c8/3135:
> [ 2095.842930]  #0:  (&xfs_nondir_ilock_class){++++..}, at:
> [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.853588] INFO: task trinity-c9:3136 blocked for more than 120 seconds.
> [ 2095.861167]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2095.867485] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2095.876228] trinity-c9      D ffff88045b3679e0 13328  3136   3124
> 0x00000084
> [ 2095.884111]  ffff88045b3679e0 ffff880443769fe8 ffff880400000000
> ffff88086ce56000
> [ 2095.892417]  ffff88044d3aa000 ffff88045b368000 ffff880443769fd0
> ffff88045b367a10
> [ 2095.900721]  ffff880443769fe8 ffff88044376a1e8 ffff88045b3679f8
> ffffffff817cdaaf
> [ 2095.909024] Call Trace:
> [ 2095.911761]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2095.917305]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2095.924414]  [<ffffffffa0333790>] ? xfs_ilock_data_map_shared+0x30/0x40
> [xfs]
> [ 2095.932383]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
> [ 2095.939768]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
> [ 2095.946104]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.952632]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.958968]  [<ffffffffa0333790>] xfs_ilock_data_map_shared+0x30/0x40
> [xfs]
> [ 2095.966752]  [<ffffffffa03128c6>] __xfs_get_blocks+0x96/0x9d0 [xfs]
> [ 2095.973753]  [<ffffffff8126462e>] ?
> mem_cgroup_event_ratelimit.isra.39+0x3e/0xb0
> [ 2095.982012]  [<ffffffff8126e8e5>] ? mem_cgroup_commit_charge+0x95/0x110
> [ 2095.989413]  [<ffffffffa0313214>] xfs_get_blocks+0x14/0x20 [xfs]
> [ 2095.996122]  [<ffffffff812cca44>] do_mpage_readpage+0x474/0x800
> [ 2096.002745]  [<ffffffffa0313200>] ? __xfs_get_blocks+0x9d0/0x9d0 [xfs]
> [ 2096.010037]  [<ffffffff81402fd7>] ? debug_smp_processor_id+0x17/0x20
> [ 2096.017136]  [<ffffffff811f3565>] ? __lru_cache_add+0x75/0xb0
> [ 2096.023551]  [<ffffffff811f45fe>] ? lru_cache_add+0xe/0x10
> [ 2096.029678]  [<ffffffff812ccf0d>] mpage_readpages+0x13d/0x1b0
> [ 2096.036109]  [<ffffffffa0313200>] ? __xfs_get_blocks+0x9d0/0x9d0 [xfs]
> [ 2096.043420]  [<ffffffffa0313200>] ? __xfs_get_blocks+0x9d0/0x9d0 [xfs]
> [ 2096.050724]  [<ffffffffa0311f14>] xfs_vm_readpages+0x54/0x170 [xfs]
> [ 2096.057724]  [<ffffffff811f1a1d>] __do_page_cache_readahead+0x2ad/0x370
> [ 2096.065113]  [<ffffffff811f18ec>] ? __do_page_cache_readahead+0x17c/0x370
> [ 2096.072693]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
> [ 2096.079787]  [<ffffffff811f2014>] force_page_cache_readahead+0x94/0xf0
> [ 2096.087077]  [<ffffffff811f2168>] SyS_readahead+0xa8/0xc0
> [ 2096.093106]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2096.099234]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2096.106427] 1 lock held by trinity-c9/3136:
> [ 2096.111097]  #0:  (&xfs_nondir_ilock_class){++++..}, at:
> [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
>

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-07 15:27                                                           ` CAI Qian
@ 2016-10-07 18:56                                                             ` CAI Qian
  2016-10-09 21:54                                                               ` Dave Chinner
  0 siblings, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-10-07 18:56 UTC (permalink / raw)
  To: Jan Kara, Miklos Szeredi, tj, Al Viro, Linus Torvalds, Dave Chinner
  Cc: linux-xfs, Jens Axboe, Nick Piggin, linux-fsdevel, Dave Jones



----- Original Message -----
> From: "CAI Qian" <caiqian@redhat.com>
> To: "Jan Kara" <jack@suse.cz>, "Miklos Szeredi" <miklos@szeredi.hu>, "tj" <tj@kernel.org>, "Al Viro"
> <viro@ZenIV.linux.org.uk>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>
> Cc: "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org, "Dave Jones" <davej@codemonkey.org.uk>
> Sent: Friday, October 7, 2016 11:27:55 AM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> 
> 
> > Hmm, this round of trinity triggered a different hang.
> This hang is reproducible so far with the command below on an overlayfs/xfs setup,
Another data point: this hang can also be reproduced using device-mapper thin
provisioning (thinp) as the Docker storage backend.
    CAI Qian

[12047.714409] INFO: task trinity-c0:3716 blocked for more than 120 seconds.
[12047.722033]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12047.728354] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12047.737107] trinity-c0      D ffff8804507dbd10 13552  3716   3713 0x00000084
[12047.744997]  ffff8804507dbd10 ffff8804240e9368 ffff880400000000 ffffffff81c0d540
[12047.753300]  ffff88044c430000 ffff8804507dc000 ffff8804240e9350 ffff8804507dbd40
[12047.761598]  ffff8804240e9368 ffff8804240e94d8 ffff8804507dbd28 ffffffff817cdaaf
[12047.769898] Call Trace:
[12047.772631]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12047.778174]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[12047.785303]  [<ffffffffa028ccca>] ? xfs_file_fsync+0xea/0x2e0 [xfs]
[12047.792309]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[12047.799695]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[12047.806029]  [<ffffffffa029d5fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[12047.812554]  [<ffffffffa029d5fa>] xfs_ilock+0xfa/0x260 [xfs]
[12047.818887]  [<ffffffffa028ccca>] xfs_file_fsync+0xea/0x2e0 [xfs]
[12047.825693]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
[12047.831915]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
[12047.837455]  [<ffffffff812bdf63>] SyS_fdatasync+0x13/0x20
[12047.843485]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12047.849609]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12047.856801] 1 lock held by trinity-c0/3716:
[12047.861470]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa029d5fa>] xfs_ilock+0xfa/0x260 [xfs]
[12047.872125] INFO: task trinity-c1:3717 blocked for more than 120 seconds.
[12047.879703]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12047.886011] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12047.894749] trinity-c1      D ffff8804507ffd10 13568  3717   3713 0x00000084
[12047.902645]  ffff8804507ffd10 ffff8804240e9368 ffff880400000000 ffff88046c9da000
[12047.910941]  ffff88044c434000 ffff880450800000 ffff8804240e9350 ffff8804507ffd40
[12047.919240]  ffff8804240e9368 ffff8804240e94d8 ffff8804507ffd28 ffffffff817cdaaf
[12047.927542] Call Trace:
[12047.930284]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12047.935826]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[12047.942933]  [<ffffffffa028ccca>] ? xfs_file_fsync+0xea/0x2e0 [xfs]
[12047.949930]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[12047.957315]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[12047.963647]  [<ffffffffa029d5fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[12047.970171]  [<ffffffffa029d5fa>] xfs_ilock+0xfa/0x260 [xfs]
[12047.976506]  [<ffffffffa028ccca>] xfs_file_fsync+0xea/0x2e0 [xfs]
[12047.983310]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
[12047.989529]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
[12047.995070]  [<ffffffff812bdf63>] SyS_fdatasync+0x13/0x20
[12048.001096]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12048.007217]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12048.014407] 1 lock held by trinity-c1/3717:
[12048.019085]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa029d5fa>] xfs_ilock+0xfa/0x260 [xfs]
[12048.029742] INFO: task trinity-c2:3718 blocked for more than 120 seconds.
[12048.037310]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12048.043626] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12048.052365] trinity-c2      D ffff8804586c7df8 13504  3718   3713 0x00000084
[12048.060261]  ffff8804586c7df8 0000000000000006 0000000000000000 ffff88046c9dc000
[12048.068565]  ffff88044c436000 ffff8804586c8000 ffff88044ec7e6f8 ffff88044c436000
[12048.076862]  0000000000000246 00000000ffffffff ffff8804586c7e10 ffffffff817cdaaf
[12048.085163] Call Trace:
[12048.087893]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12048.093434]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[12048.100627]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[12048.107237]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
[12048.113262]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[12048.119094]  [<ffffffff81297f53>] SyS_getdents+0x83/0x140
[12048.125120]  [<ffffffff81297cd0>] ? fillonedir+0x100/0x100
[12048.131243]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12048.137357]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12048.144546] 1 lock held by trinity-c2/3718:
[12048.149214]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[12048.158495] INFO: task trinity-c3:3719 blocked for more than 120 seconds.
[12048.166071]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12048.172388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12048.181120] trinity-c3      D ffff880450707c60 13552  3719   3713 0x00000084
[12048.189013]  ffff880450707c60 ffffffff00000000 ffff880400000000 ffff88046ca10000
[12048.197313]  ffff88044c432000 ffff880450708000 ffff8804240e9658 ffff8804240e9640
[12048.205612]  ffffffff00000000 ffff88044c432000 ffff880450707c78 ffffffff817cdaaf
[12048.213912] Call Trace:
[12048.216643]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12048.222183]  [<ffffffff817d2782>] rwsem_down_write_failed+0x242/0x4b0
[12048.229374]  [<ffffffff817d25ac>] ? rwsem_down_write_failed+0x6c/0x4b0
[12048.236662]  [<ffffffff813e27b7>] call_rwsem_down_write_failed+0x17/0x30
[12048.244144]  [<ffffffff817d1bff>] down_write+0x5f/0x80
[12048.249881]  [<ffffffff812ad021>] ? vfs_removexattr+0x61/0x120
[12048.256391]  [<ffffffff812ad021>] vfs_removexattr+0x61/0x120
[12048.262709]  [<ffffffff812ad135>] removexattr+0x55/0x80
[12048.268533]  [<ffffffff81402ff3>] ? __this_cpu_preempt_check+0x13/0x20
[12048.275811]  [<ffffffff810f8eae>] ? update_fast_ctr+0x4e/0x70
[12048.282225]  [<ffffffff810f8f57>] ? percpu_down_read+0x57/0x90
[12048.288728]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
[12048.295230]  [<ffffffff810cc367>] ? preempt_count_add+0x47/0xc0
[12048.301829]  [<ffffffff812a665f>] ? mnt_clone_write+0x3f/0x70
[12048.308242]  [<ffffffff812a8588>] ? __mnt_want_write_file+0x18/0x30
[12048.315238]  [<ffffffff812a85d0>] ? mnt_want_write_file+0x30/0x60
[12048.322039]  [<ffffffff812ae303>] SyS_fremovexattr+0x83/0xb0
[12048.328356]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12048.334478]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12048.341679] 2 locks held by trinity-c3/3719:
[12048.346454]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[12048.356042]  #1:  (&sb->s_type->i_mutex_key#14){+.+.+.}, at: [<ffffffff812ad021>] vfs_removexattr+0x61/0x120
[12048.367079] INFO: task trinity-c4:3720 blocked for more than 120 seconds.
[12048.374655]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12048.380972] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12048.389712] trinity-c4      D ffff88045072be08 13536  3720   3713 0x00000084
[12048.397606]  ffff88045072be08 0000000000000006 0000000000000000 ffff88046c9fe000
[12048.405902]  ffff880450720000 ffff88045072c000 ffff88044ec7e6f8 ffff880450720000
[12048.414205]  0000000000000246 00000000ffffffff ffff88045072be20 ffffffff817cdaaf
[12048.422505] Call Trace:
[12048.425235]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12048.430767]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[12048.437957]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[12048.444565]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
[12048.450591]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
[12048.457675]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[12048.463508]  [<ffffffff81298091>] SyS_getdents64+0x81/0x130
[12048.469720]  [<ffffffff81297a80>] ? iterate_dir+0x190/0x190
[12048.475939]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12048.482063]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12048.489243] 1 lock held by trinity-c4/3720:
[12048.493913]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[12048.503182] INFO: task trinity-c5:3721 blocked for more than 120 seconds.
[12048.510757]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12048.517071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12048.525812] trinity-c5      D ffff8804510a7e08 13552  3721   3713 0x00000084
[12048.533706]  ffff8804510a7e08 0000000000000006 0000000000000000 ffff88046c9fa000
[12048.542007]  ffff880450722000 ffff8804510a8000 ffff88044ec7e6f8 ffff880450722000
[12048.550310]  0000000000000246 00000000ffffffff ffff8804510a7e20 ffffffff817cdaaf
[12048.558610] Call Trace:
[12048.561339]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12048.566879]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[12048.574070]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[12048.580677]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
[12048.586703]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
[12048.593796]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[12048.599629]  [<ffffffff81298091>] SyS_getdents64+0x81/0x130
[12048.605849]  [<ffffffff81297a80>] ? iterate_dir+0x190/0x190
[12048.612069]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12048.618191]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12048.625382] 1 lock held by trinity-c5/3721:
[12048.630049]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[12048.639329] INFO: task trinity-c6:3722 blocked for more than 120 seconds.
[12048.646903]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12048.653219] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12048.661958] trinity-c6      D ffff88044f0ebc50 12224  3722   3713 0x00000084
[12048.669849]  ffff88044f0ebc50 ffff8804240e9368 ffff880400000000 ffff88046c9fc000
[12048.678149]  ffff880450724000 ffff88044f0ec000 ffff8804240e9350 ffff88044f0ebc80
[12048.686448]  ffff8804240e9368 ffff8804240e92c0 ffff88044f0ebc68 ffffffff817cdaaf
[12048.694750] Call Trace:
[12048.697478]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12048.703018]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[12048.710126]  [<ffffffffa029d7d4>] ? xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
[12048.718095]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[12048.725479]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[12048.731800]  [<ffffffffa029d5fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[12048.738337]  [<ffffffffa029d5fa>] xfs_ilock+0xfa/0x260 [xfs]
[12048.744669]  [<ffffffffa029d7d4>] xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
[12048.752457]  [<ffffffffa0280801>] xfs_attr_list_int+0x71/0x690 [xfs]
[12048.759555]  [<ffffffff810cba89>] ? __might_sleep+0x49/0x80
[12048.765792]  [<ffffffffa02abf2a>] xfs_vn_listxattr+0x7a/0xb0 [xfs]
[12048.772707]  [<ffffffffa02abcc0>] ? __xfs_xattr_put_listent+0xa0/0xa0 [xfs]
[12048.780480]  [<ffffffff812ad582>] vfs_listxattr+0x42/0x70
[12048.786517]  [<ffffffff812ad68e>] listxattr+0xde/0xf0
[12048.792156]  [<ffffffff812ae1f6>] SyS_flistxattr+0x56/0xa0
[12048.798271]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12048.804404]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12048.811595] 1 lock held by trinity-c6/3722:
[12048.816263]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa029d5fa>] xfs_ilock+0xfa/0x260 [xfs]
[12048.826935] INFO: task trinity-c7:3723 blocked for more than 120 seconds.
[12048.834516]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12048.840832] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12048.849572] trinity-c7      D ffff88044fc23c50 13552  3723   3713 0x00000084
[12048.857469]  ffff88044fc23c50 ffff8804240e9368 ffff880400000000 ffff88046c9f8000
[12048.865768]  ffff880450726000 ffff88044fc24000 ffff8804240e9350 ffff88044fc23c80
[12048.874067]  ffff8804240e9368 ffff8804240e92c0 ffff88044fc23c68 ffffffff817cdaaf
[12048.882370] Call Trace:
[12048.885100]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12048.890634]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[12048.897741]  [<ffffffffa029d7d4>] ? xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
[12048.905707]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[12048.913081]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[12048.919412]  [<ffffffffa029d5fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[12048.925937]  [<ffffffffa029d5fa>] xfs_ilock+0xfa/0x260 [xfs]
[12048.932267]  [<ffffffffa029d7d4>] xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
[12048.940053]  [<ffffffffa0280801>] xfs_attr_list_int+0x71/0x690 [xfs]
[12048.947146]  [<ffffffff810cba89>] ? __might_sleep+0x49/0x80
[12048.953374]  [<ffffffffa02abf2a>] xfs_vn_listxattr+0x7a/0xb0 [xfs]
[12048.960288]  [<ffffffffa02abcc0>] ? __xfs_xattr_put_listent+0xa0/0xa0 [xfs]
[12048.968060]  [<ffffffff812ad582>] vfs_listxattr+0x42/0x70
[12048.974088]  [<ffffffff812ad602>] listxattr+0x52/0xf0
[12048.979726]  [<ffffffff812ae1f6>] SyS_flistxattr+0x56/0xa0
[12048.985849]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12048.991973]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12048.999162] 1 lock held by trinity-c7/3723:
[12049.003831]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa029d5fa>] xfs_ilock+0xfa/0x260 [xfs]
[12049.014481] INFO: task trinity-c8:3724 blocked for more than 120 seconds.
[12049.022072]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12049.028389] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12049.037130] trinity-c8      D ffff88044fc3fc60 13504  3724   3713 0x00000084
[12049.045023]  ffff88044fc3fc60 ffffffff00000000 ffff880400000000 ffff88046ca14000
[12049.053324]  ffff88044e540000 ffff88044fc40000 ffff8804240e9368 ffff8804240e9350
[12049.061623]  ffffffff00000000 ffff88044e540000 ffff88044fc3fc78 ffffffff817cdaaf
[12049.069924] Call Trace:
[12049.072654]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12049.078208]  [<ffffffff817d2782>] rwsem_down_write_failed+0x242/0x4b0
[12049.085408]  [<ffffffff817d25ac>] ? rwsem_down_write_failed+0x6c/0x4b0
[12049.092734]  [<ffffffffa028d7cc>] ? xfs_update_prealloc_flags+0x6c/0x100 [xfs]
[12049.100798]  [<ffffffff813e27b7>] call_rwsem_down_write_failed+0x17/0x30
[12049.108290]  [<ffffffff810f8c15>] down_write_nested+0x65/0x80
[12049.114742]  [<ffffffffa029d68e>] ? xfs_ilock+0x18e/0x260 [xfs]
[12049.121377]  [<ffffffffa029d68e>] xfs_ilock+0x18e/0x260 [xfs]
[12049.127819]  [<ffffffffa028d7cc>] xfs_update_prealloc_flags+0x6c/0x100 [xfs]
[12049.135714]  [<ffffffffa028da8e>] xfs_file_fallocate+0x22e/0x360 [xfs]
[12049.143004]  [<ffffffff810f8eae>] ? update_fast_ctr+0x4e/0x70
[12049.149435]  [<ffffffff810f8f57>] ? percpu_down_read+0x57/0x90
[12049.155958]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
[12049.162492]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
[12049.169016]  [<ffffffff8127e000>] vfs_fallocate+0x140/0x230
[12049.175249]  [<ffffffff8127eee4>] SyS_fallocate+0x44/0x70
[12049.181288]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12049.187423]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12049.194667] 5 locks held by trinity-c8/3724:
[12049.199429]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[12049.209024]  #1:  (&(&ip->i_iolock)->mr_lock){++++++}, at: [<ffffffffa029d654>] xfs_ilock+0x154/0x260 [xfs]
[12049.219990]  #2:  (&(&ip->i_mmaplock)->mr_lock){+++++.}, at: [<ffffffffa029d674>] xfs_ilock+0x174/0x260 [xfs]
[12049.231128]  #3:  (sb_internal){.+.+.+}, at: [<ffffffff81284b8b>] __sb_start_write+0x7b/0xf0
[12049.240620]  #4:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa029d68e>] xfs_ilock+0x18e/0x260 [xfs]
[12049.251383] INFO: task trinity-c9:3725 blocked for more than 120 seconds.
[12049.258959]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12049.265287] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12049.274027] trinity-c9      D ffff88044f043d30 13552  3725   3713 0x00000084
[12049.281922]  ffff88044f043d30 ffffffff00000000 ffff880400000000 ffff88046ca14000
[12049.290238]  ffff88044e542000 ffff88044f044000 ffff8804240e9658 ffff8804240e9640
[12049.298539]  ffffffff00000000 ffff88044e542000 ffff88044f043d48 ffffffff817cdaaf
[12049.306840] Call Trace:
[12049.309569]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12049.315122]  [<ffffffff817d2782>] rwsem_down_write_failed+0x242/0x4b0
[12049.322327]  [<ffffffff817d25ac>] ? rwsem_down_write_failed+0x6c/0x4b0
[12049.329625]  [<ffffffff813e27b7>] call_rwsem_down_write_failed+0x17/0x30
[12049.337118]  [<ffffffff817d1bff>] down_write+0x5f/0x80
[12049.342864]  [<ffffffff8127e413>] ? chmod_common+0x63/0x150
[12049.349096]  [<ffffffff8127e413>] chmod_common+0x63/0x150
[12049.355131]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
[12049.362236]  [<ffffffff810035cc>] ? syscall_trace_enter+0x1dc/0x390
[12049.369243]  [<ffffffff8127f5f2>] SyS_fchmod+0x52/0x80
[12049.374988]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12049.381124]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12049.388324] 2 locks held by trinity-c9/3725:
[12049.393100]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[12049.402705]  #1:  (&sb->s_type->i_mutex_key#14){+.+.+.}, at: [<ffffffff8127e413>] chmod_common+0x63/0x150

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-07 14:43                                                         ` CAI Qian
  2016-10-07 15:27                                                           ` CAI Qian
@ 2016-10-09 21:51                                                           ` Dave Chinner
  1 sibling, 0 replies; 104+ messages in thread
From: Dave Chinner @ 2016-10-09 21:51 UTC (permalink / raw)
  To: CAI Qian
  Cc: Jan Kara, Al Viro, tj, Linus Torvalds, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel, Miklos Szeredi, Dave Jones

On Fri, Oct 07, 2016 at 10:43:18AM -0400, CAI Qian wrote:
> Hmm, this round of trinity triggered a different hang.
> 
> [ 2094.487964]  [<ffffffff813e27b7>] call_rwsem_down_write_failed+0x17/0x30
> [ 2094.495450]  [<ffffffff817d1bff>] down_write+0x5f/0x80
> [ 2094.508284]  [<ffffffff8127e301>] chown_common.isra.12+0x131/0x1e0
> [ 2094.553784] 2 locks held by trinity-c0/3126:
> [ 2094.558552]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
> [ 2094.568240]  #1:  (&sb->s_type->i_mutex_key#17){++++++}, at: [<ffffffff8127e301>] chown_common.isra.12+0x131/0x1e0

Waiting on i_mutex.

> [ 2094.643597]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2094.665119]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
> [ 2094.691133]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
> [ 2094.721844] 1 lock held by trinity-c1/3127:
> [ 2094.726515]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]

Waiting on i_ilock.

> [ 2094.808078]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
> [ 2094.820715]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
> [ 2094.826544]  [<ffffffff81297f53>] SyS_getdents+0x83/0x140
> [ 2094.856682]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50

concurrent readdir on the same directory fd, blocked on fd.

> [ 2094.936885]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
> [ 2094.956620]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
> [ 2094.962454]  [<ffffffff81298091>] SyS_getdents64+0x81/0x130
> [ 2094.988204] 1 lock held by trinity-c3/3129:
> [ 2094.992872]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50

Same.

> [ 2095.073118]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
> [ 2095.091589]  [<ffffffff812811dd>] SyS_lseek+0x1d/0xb0
> [ 2095.097229]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2095.110547] 1 lock held by trinity-c4/3130:
> [ 2095.115216]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50

Concurrent lseek on directory fd, blocked on fd.


> [ 2095.188230]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2095.223558]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.229894]  [<ffffffffa03337d4>] xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
> [ 2095.237682]  [<ffffffffa02ccfaf>] xfs_attr_get+0xdf/0x1b0 [xfs]
> [ 2095.244312]  [<ffffffffa0341bfc>] xfs_xattr_get+0x4c/0x70 [xfs]
> [ 2095.250924]  [<ffffffff812ad269>] generic_getxattr+0x59/0x70
> [ 2095.257244]  [<ffffffff812acf9b>] vfs_getxattr+0x8b/0xb0
> [ 2095.263177]  [<ffffffffa0435bd6>] ovl_xattr_get+0x46/0x60 [overlay]
> [ 2095.270176]  [<ffffffffa04331aa>] ovl_other_xattr_get+0x1a/0x20 [overlay]
> [ 2095.277756]  [<ffffffff812ad269>] generic_getxattr+0x59/0x70
> [ 2095.284079]  [<ffffffff81345e9e>] cap_inode_need_killpriv+0x2e/0x40
> [ 2095.291078]  [<ffffffff81349a33>] security_inode_need_killpriv+0x33/0x50
> [ 2095.298560]  [<ffffffff812a2fb0>] dentry_needs_remove_privs+0x30/0x50
> [ 2095.305743]  [<ffffffff8127ea21>] do_truncate+0x51/0xc0
> [ 2095.311581]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
> [ 2095.318094]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
> [ 2095.324609]  [<ffffffff8127edde>] do_sys_ftruncate.constprop.15+0xfe/0x160
> [ 2095.332286]  [<ffffffff8127ee7e>] SyS_ftruncate+0xe/0x10
> [ 2095.338225]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2095.344339]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2095.351531] 2 locks held by trinity-c5/3131:
> [ 2095.356297]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
> [ 2095.365983]  #1:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]

truncate on overlay, removing xattrs from XFS file, blocked on
i_ilock.

> [ 2095.440372]  [<ffffffff817d2782>] rwsem_down_write_failed+0x242/0x4b0
> [ 2095.474300]  [<ffffffff8127e413>] chmod_common+0x63/0x150
> [ 2095.513452] 2 locks held by trinity-c6/3132:
> [ 2095.518217]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
> [ 2095.527895]  #1:  (&sb->s_type->i_mutex_key#17){++++++}, at: [<ffffffff8127e413>] chmod_common+0x63/0x150

chmod, blocked on i_mutex.

> [ 2095.602379]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2095.616490]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
> [ 2095.623877]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
> [ 2095.649889]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
> [ 2095.680610] 1 lock held by trinity-c7/3133:
> [ 2095.685281]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]

fsync on file, blocked on i_ilock.

> [ 2095.759662]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2095.807155]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
> [ 2095.813377]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
> [ 2095.818921]  [<ffffffff812bdf63>] SyS_fdatasync+0x13/0x20
> [ 2095.838261] 1 lock held by trinity-c8/3135:
> [ 2095.842930]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]

ditto.

> [ 2095.917305]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2095.958968]  [<ffffffffa0333790>] xfs_ilock_data_map_shared+0x30/0x40 [xfs]
> [ 2095.966752]  [<ffffffffa03128c6>] __xfs_get_blocks+0x96/0x9d0 [xfs]
> [ 2095.989413]  [<ffffffffa0313214>] xfs_get_blocks+0x14/0x20 [xfs]
> [ 2095.996122]  [<ffffffff812cca44>] do_mpage_readpage+0x474/0x800
> [ 2096.029678]  [<ffffffff812ccf0d>] mpage_readpages+0x13d/0x1b0
> [ 2096.050724]  [<ffffffffa0311f14>] xfs_vm_readpages+0x54/0x170 [xfs]
> [ 2096.057724]  [<ffffffff811f1a1d>] __do_page_cache_readahead+0x2ad/0x370
> [ 2096.079787]  [<ffffffff811f2014>] force_page_cache_readahead+0x94/0xf0
> [ 2096.087077]  [<ffffffff811f2168>] SyS_readahead+0xa8/0xc0
> [ 2096.106427] 1 lock held by trinity-c9/3136:
> [ 2096.111097]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]

readahead blocking on i_ilock before reading in extents.

Nothing here indicates a deadlock. Everything is waiting for locks,
but nothing is holding locks in a way that indicates that progress
is not being made. This sort of thing can happen when slow storage
is massively overloaded - sysrq-w is really the only way to get a
better picture of what is happening here, but so far there's no
concrete evidence of a hang from this output.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 104+ messages in thread
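
The manual triage in the message above (matching each blocked task to the lock it is stuck on) can be sketched as a small script. This is only an illustration, not a tool used in the thread: the function names `parse_held_locks` and `group_by_last_lock` are hypothetical, and the regular expressions assume the "N lock(s) held by task/pid:" format seen in the quoted reports. It also assumes, as the annotations above suggest, that the last lock listed for a blocked task is usually the one it is waiting on.

```python
import re
from collections import defaultdict

# Patterns modelled on the lockdep "locks held by" output quoted in this thread.
HELD_HEADER = re.compile(r"(\d+) locks? held by (\S+)/(\d+):")
HELD_LOCK = re.compile(r"#\d+:\s+\((.+?)\)\{[^}]*\}, at:")

SAMPLE = """\
[12048.014407] 1 lock held by trinity-c1/3717:
[12048.019085]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa029d5fa>] xfs_ilock+0xfa/0x260 [xfs]
[12048.144546] 1 lock held by trinity-c2/3718:
[12048.149214]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[12048.341679] 2 locks held by trinity-c3/3719:
[12048.346454]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[12048.356042]  #1:  (&sb->s_type->i_mutex_key#14){+.+.+.}, at: [<ffffffff812ad021>] vfs_removexattr+0x61/0x120
"""


def parse_held_locks(log_text):
    """Return {"task/pid": [lock classes in listed order]} (hypothetical helper)."""
    tasks = {}
    current = None
    for line in log_text.splitlines():
        header = HELD_HEADER.search(line)
        if header:
            current = f"{header.group(2)}/{header.group(3)}"
            tasks[current] = []
            continue
        lock = HELD_LOCK.search(line)
        if lock and current is not None:
            tasks[current].append(lock.group(1))
        else:
            # Any other line ends the current "locks held by" block.
            current = None
    return tasks


def group_by_last_lock(tasks):
    """Group tasks by the last lock listed for them (assumed to be the contended one)."""
    groups = defaultdict(list)
    for task, locks in tasks.items():
        if locks:
            groups[locks[-1]].append(task)
    return dict(groups)


if __name__ == "__main__":
    for lock, waiters in group_by_last_lock(parse_held_locks(SAMPLE)).items():
        print(f"{lock}: {', '.join(waiters)}")
```

A grouping like this only shows which lock classes the waiters pile up on. It cannot show who owns the lock, which is why the message above asks for sysrq-w output before concluding anything.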

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-07 18:56                                                             ` CAI Qian
@ 2016-10-09 21:54                                                               ` Dave Chinner
  2016-10-10 14:10                                                                 ` CAI Qian
  0 siblings, 1 reply; 104+ messages in thread
From: Dave Chinner @ 2016-10-09 21:54 UTC (permalink / raw)
  To: CAI Qian
  Cc: Jan Kara, Miklos Szeredi, tj, Al Viro, Linus Torvalds, linux-xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel, Dave Jones

On Fri, Oct 07, 2016 at 02:56:22PM -0400, CAI Qian wrote:
> 
> 
> ----- Original Message -----
> > From: "CAI Qian" <caiqian@redhat.com>
> > To: "Jan Kara" <jack@suse.cz>, "Miklos Szeredi" <miklos@szeredi.hu>, "tj" <tj@kernel.org>, "Al Viro"
> > <viro@ZenIV.linux.org.uk>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>
> > Cc: "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> > linux-fsdevel@vger.kernel.org, "Dave Jones" <davej@codemonkey.org.uk>
> > Sent: Friday, October 7, 2016 11:27:55 AM
> > Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> > 
> > 
> > 
> > > Hmm, this round of trinity triggered a different hang.
> > This hang is reproducible so far with the command below on an overlayfs/xfs,
> Another data point is that this hang can also be reproduced using device-mapper thinp
> as the docker backend.

Again, no evidence that the system is actually hung. Waiting on
locks, yes, but nothing to indicate there is a deadlock in those
waiters.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-09 21:54                                                               ` Dave Chinner
@ 2016-10-10 14:10                                                                 ` CAI Qian
  2016-10-10 20:14                                                                   ` CAI Qian
  2016-10-10 21:57                                                                   ` Dave Chinner
  0 siblings, 2 replies; 104+ messages in thread
From: CAI Qian @ 2016-10-10 14:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Miklos Szeredi, tj, Al Viro, Linus Torvalds, linux-xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel, Dave Jones



----- Original Message -----
> From: "Dave Chinner" <david@fromorbit.com>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Jan Kara" <jack@suse.cz>, "Miklos Szeredi" <miklos@szeredi.hu>, "tj" <tj@kernel.org>, "Al Viro"
> <viro@ZenIV.linux.org.uk>, "Linus Torvalds" <torvalds@linux-foundation.org>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org, "Dave Jones" <davej@codemonkey.org.uk>
> Sent: Sunday, October 9, 2016 5:54:55 PM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> Again, no evidence that the system is actually hung. Waiting on
> locks, yes, but nothing to indicate there is a deadlock in those
> waiters.
Here you are,

http://people.redhat.com/qcai/tmp/dmesg

    CAI Qian

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-10 14:10                                                                 ` CAI Qian
@ 2016-10-10 20:14                                                                   ` CAI Qian
  2016-10-10 21:57                                                                   ` Dave Chinner
  1 sibling, 0 replies; 104+ messages in thread
From: CAI Qian @ 2016-10-10 20:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Miklos Szeredi, tj, Al Viro, Linus Torvalds, linux-xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel, Dave Jones


> Here you are,
> 
> http://people.redhat.com/qcai/tmp/dmesg
Also, this turned out to be a regression, and bisecting so far points to this commit,

commit 5d50ac70fe98518dbf620bfba8184254663125eb
Merge: 31c1feb 4e14e49
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed Nov 11 20:18:48 2015 -0800

    Merge tag 'xfs-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/g
    
    Pull xfs updates from Dave Chinner:
     "There is nothing really major here - the only significant addition is
      the per-mount operation statistics infrastructure.  Otherwises there's
      various ACL, xattr, DAX, AIO and logging fixes, and a smattering of
      small cleanups and fixes elsewhere.
    
      Summary:
    
       - per-mount operational statistics in sysfs
       - fixes for concurrent aio append write submission
       - various logging fixes
       - detection of zeroed logs and invalid log sequence numbers on v5 filesys
       - memory allocation failure message improvements
       - a bunch of xattr/ACL fixes
       - fdatasync optimisation
       - miscellaneous other fixes and cleanups"

    * tag 'xfs-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/d
      xfs: give all workqueues rescuer threads
      xfs: fix log recovery op header validation assert
      xfs: Fix error path in xfs_get_acl
      xfs: optimise away log forces on timestamp updates for fdatasync
      xfs: don't leak uuid table on rmmod
      xfs: invalidate cached acl if set via ioctl
      xfs: Plug memory leak in xfs_attrmulti_attr_set
      xfs: Validate the length of on-disk ACLs
      xfs: invalidate cached acl if set directly via xattr
      xfs: xfs_filemap_pmd_fault treats read faults as write faults
      xfs: add ->pfn_mkwrite support for DAX
      xfs: DAX does not use IO completion callbacks
      xfs: Don't use unwritten extents for DAX
      xfs: introduce BMAPI_ZERO for allocating zeroed extents
      xfs: fix inode size update overflow in xfs_map_direct()
      xfs: clear PF_NOFREEZE for xfsaild kthread
      xfs: fix an error code in xfs_fs_fill_super()
      xfs: stats are no longer dependent on CONFIG_PROC_FS
      xfs: simplify /proc teardown & error handling
      xfs: per-filesystem stats counter implementation
      ...

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-10 14:10                                                                 ` CAI Qian
  2016-10-10 20:14                                                                   ` CAI Qian
@ 2016-10-10 21:57                                                                   ` Dave Chinner
  2016-10-12 19:50                                                                     ` [bisected] " CAI Qian
  1 sibling, 1 reply; 104+ messages in thread
From: Dave Chinner @ 2016-10-10 21:57 UTC (permalink / raw)
  To: CAI Qian
  Cc: Jan Kara, Miklos Szeredi, tj, Al Viro, Linus Torvalds, linux-xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel, Dave Jones

On Mon, Oct 10, 2016 at 10:10:29AM -0400, CAI Qian wrote:
> 
> 
> ----- Original Message -----
> > From: "Dave Chinner" <david@fromorbit.com>
> > To: "CAI Qian" <caiqian@redhat.com>
> > Cc: "Jan Kara" <jack@suse.cz>, "Miklos Szeredi" <miklos@szeredi.hu>, "tj" <tj@kernel.org>, "Al Viro"
> > <viro@ZenIV.linux.org.uk>, "Linus Torvalds" <torvalds@linux-foundation.org>, "linux-xfs"
> > <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> > linux-fsdevel@vger.kernel.org, "Dave Jones" <davej@codemonkey.org.uk>
> > Sent: Sunday, October 9, 2016 5:54:55 PM
> > Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> > 
> > Again, no evidence that the system is actually hung. Waiting on
> > locks, yes, but nothing to indicate there is a deadlock in those
> > waiters.
> Here you are,
> 
> http://people.redhat.com/qcai/tmp/dmesg

It's a page lock order bug in the XFS seek hole/data implementation.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 104+ messages in thread
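
For readers unfamiliar with the term used in the message above, a lock order bug generally means two code paths take the same pair of locks in opposite orders, so each can end up waiting on the other. The sketch below is a generic illustration of how such an inversion can be detected by recording "held while acquiring" edges. It is not the XFS seek hole/data code, and the class `LockOrderGraph` and the lock names `lock_a` and `lock_b` are hypothetical; the thread does not spell out which locks are involved beyond the page lock.

```python
class LockOrderGraph:
    """Records 'held -> new' acquisition edges and flags order inversions."""

    def __init__(self):
        self.edges = {}

    def _reachable(self, start, goal):
        # Depth-first search over the recorded ordering edges.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges.get(node, ()))
        return False

    def acquire(self, held, new):
        """Record that `new` was taken while `held` was held.

        Returns True if some earlier path already ordered `new` before `held`,
        i.e. this acquisition closes a cycle in the ordering graph.
        """
        inversion = self._reachable(new, held)
        self.edges.setdefault(held, set()).add(new)
        return inversion


if __name__ == "__main__":
    graph = LockOrderGraph()
    # Path 1 takes lock_a, then lock_b.
    print(graph.acquire("lock_a", "lock_b"))   # False: first ordering seen
    # Path 2 takes lock_b, then lock_a: the opposite order.
    print(graph.acquire("lock_b", "lock_a"))   # True: inversion detected
```

An inversion like this may run for a long time without hanging, which would be consistent with a hang that only shows up after a fuzzer has been running for a while.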

* [bisected] Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-10 21:57                                                                   ` Dave Chinner
@ 2016-10-12 19:50                                                                     ` CAI Qian
  2016-10-12 20:59                                                                       ` Dave Chinner
  0 siblings, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-10-12 19:50 UTC (permalink / raw)
  To: Dave Chinner, Sage Weil, Brian Foster
  Cc: Jan Kara, Miklos Szeredi, tj, Al Viro, Linus Torvalds, linux-xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel, Dave Jones



----- Original Message -----
> From: "Dave Chinner" <david@fromorbit.com>
> Sent: Monday, October 10, 2016 5:57:14 PM
> 
> > http://people.redhat.com/qcai/tmp/dmesg
> 
> It's a page lock order bug in the XFS seek hole/data implementation.
So reverting this commit against the latest mainline allows trinity to run for
hours. Otherwise, it always hangs at fdatasync() within 30 minutes.

fc0561cefc04e7803c0f6501ca4f310a502f65b8
xfs: optimise away log forces on timestamp updates for fdatasync

PS: testing against the vfs tree's #work.splice_read with this commit
reverted now hangs at sync() instead, which has not been reproduced
against the mainline so far.
http://people.redhat.com/qcai/tmp/dmesg-sync

   CAI Qian

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [bisected] Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-12 19:50                                                                     ` [bisected] " CAI Qian
@ 2016-10-12 20:59                                                                       ` Dave Chinner
  2016-10-13 16:25                                                                         ` CAI Qian
  0 siblings, 1 reply; 104+ messages in thread
From: Dave Chinner @ 2016-10-12 20:59 UTC (permalink / raw)
  To: CAI Qian
  Cc: Sage Weil, Brian Foster, Jan Kara, Miklos Szeredi, tj, Al Viro,
	Linus Torvalds, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel, Dave Jones

On Wed, Oct 12, 2016 at 03:50:36PM -0400, CAI Qian wrote:
> 
> 
> ----- Original Message -----
> > From: "Dave Chinner" <david@fromorbit.com>
> > Sent: Monday, October 10, 2016 5:57:14 PM
> > 
> > > http://people.redhat.com/qcai/tmp/dmesg
> > 
> > It's a page lock order bug in the XFS seek hole/data implementation.
> So reverting this commit against the latest mainline allows trinity to run for
> hours. Otherwise, it always hangs at fdatasync() within 30 minutes.
> 
> fc0561cefc04e7803c0f6501ca4f310a502f65b8
> xfs: optimise away log forces on timestamp updates for fdatasync

Has nothing at all to do with the hang.

> PS: testing against the vfs tree's #work.splice_read with this commit
> reverted now hangs at sync() instead, which has not been reproduced
> against the mainline so far.
> http://people.redhat.com/qcai/tmp/dmesg-sync

It is the same page lock vs seek hole/data issue.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [bisected] Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-12 20:59                                                                       ` Dave Chinner
@ 2016-10-13 16:25                                                                         ` CAI Qian
  2016-10-13 20:49                                                                           ` Dave Chinner
  0 siblings, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-10-13 16:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Sage Weil, Brian Foster, Jan Kara, Miklos Szeredi, tj, Al Viro,
	Linus Torvalds, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel, Dave Jones



----- Original Message -----
> From: "Dave Chinner" <david@fromorbit.com>
> Sent: Wednesday, October 12, 2016 4:59:01 PM
> Subject: Re: [bisected] Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> On Wed, Oct 12, 2016 at 03:50:36PM -0400, CAI Qian wrote:
> > 
> > 
> > ----- Original Message -----
> > > From: "Dave Chinner" <david@fromorbit.com>
> > > Sent: Monday, October 10, 2016 5:57:14 PM
> > > 
> > > > http://people.redhat.com/qcai/tmp/dmesg
> > > 
> > > It's a page lock order bug in the XFS seek hole/data implementation.
> > So reverted this commit against the latest mainline allows trinity run
> > hours. Otherwise, it always hang at fdatasync() within 30 minutes.
> > 
> > fc0561cefc04e7803c0f6501ca4f310a502f65b8
> > xfs: optimise away log forces on timestamp updates for fdatasync
> 
> Has nothing at all to do with the hang.
> 
> > PS: tested against the vfs tree's #work.splice_read with this commit
> > reverted is now hanging at sync() instead which won't be  reproduced
> > against the mainline so far.
> > http://people.redhat.com/qcai/tmp/dmesg-sync
> 
> It is the same page lock vs seek hole/data issue.
FYI, CVE-2016-8660 was assigned for it.
   CAI Qian

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [bisected] Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-13 16:25                                                                         ` CAI Qian
@ 2016-10-13 20:49                                                                           ` Dave Chinner
  2016-10-13 20:56                                                                             ` CAI Qian
  0 siblings, 1 reply; 104+ messages in thread
From: Dave Chinner @ 2016-10-13 20:49 UTC (permalink / raw)
  To: CAI Qian
  Cc: Sage Weil, Brian Foster, Jan Kara, Miklos Szeredi, tj, Al Viro,
	Linus Torvalds, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel, Dave Jones

On Thu, Oct 13, 2016 at 12:25:30PM -0400, CAI Qian wrote:
> 
> 
> ----- Original Message -----
> > From: "Dave Chinner" <david@fromorbit.com>
> > Sent: Wednesday, October 12, 2016 4:59:01 PM
> > Subject: Re: [bisected] Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> > 
> > On Wed, Oct 12, 2016 at 03:50:36PM -0400, CAI Qian wrote:
> > > 
> > > 
> > > ----- Original Message -----
> > > > From: "Dave Chinner" <david@fromorbit.com>
> > > > Sent: Monday, October 10, 2016 5:57:14 PM
> > > > 
> > > > > http://people.redhat.com/qcai/tmp/dmesg
> > > > 
> > > > It's a page lock order bug in the XFS seek hole/data implementation.
> > > So reverted this commit against the latest mainline allows trinity run
> > > hours. Otherwise, it always hang at fdatasync() within 30 minutes.
> > > 
> > > fc0561cefc04e7803c0f6501ca4f310a502f65b8
> > > xfs: optimise away log forces on timestamp updates for fdatasync
> > 
> > Has nothing at all to do with the hang.
> > 
> > > PS: tested against the vfs tree's #work.splice_read with this commit
> > > reverted is now hanging at sync() instead which won't be  reproduced
> > > against the mainline so far.
> > > http://people.redhat.com/qcai/tmp/dmesg-sync
> > 
> > It is the same page lock vs seek hole/data issue.
> FYI, CVE-2016-8660 was assigned for it.

Why? This isn't a security issue - CVEs cost time and effort for
everyone to track and follow and raising them for issues like this
does not help anyone fix the actual problem.  It doesn't help us
track it, analyse it, communicate with the bug reporter, test it or
get the fix committed.  It's meaningless to the developers fixing
the code, it's meaningless to users, and it's meaningless to most
distros that are supporting XFS because the distro maintainers don't
watch the CVE lists for XFS bugs they need to backport and fix.

All this does is artificially inflate the supposed importance of the
bug. CVEs are for security or severe issues. This is neither serious
nor a security issue - please have the common courtesy to ask the
people with the knowledge to make such a determination (i.e. the
maintainers) before you waste the time of a /large number/ of people
by raising a useless CVE...

Yes, you found a bug. No, it's not a security bug. No, you should
not abuse the CVE process to apply pressure to get it fixed.
Please don't do this again.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [bisected] Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-13 20:49                                                                           ` Dave Chinner
@ 2016-10-13 20:56                                                                             ` CAI Qian
  0 siblings, 0 replies; 104+ messages in thread
From: CAI Qian @ 2016-10-13 20:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Sage Weil, Brian Foster, Jan Kara, Miklos Szeredi, tj, Al Viro,
	Linus Torvalds, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel, Dave Jones



----- Original Message -----
> From: "Dave Chinner" <david@fromorbit.com>
> Sent: Thursday, October 13, 2016 4:49:17 PM
> Subject: Re: [bisected] Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
>
> Why? This isn't a security issue - CVEs cost time and effort for
> everyone to track and follow and raising them for issues like this
> does not help anyone fix the actual problem.  It doesn't help us
> track it, analyse it, communicate with the bug reporter, test it or
> get the fix committed.  It's meaningless to the developers fixing
> the code, it's meaningless to users, and it's meaningless to most
> distros that are supporting XFS because the distro maintainers don't
> watch the CVE lists for XFS bugs they need to backport and fix.
> 
> All this does is artificially inflate the supposed importance of the
> bug. CVEs are for security or severe issues. This is neither serious
> nor a security issue - please have the common courtesy to ask the
> people with the knowledge to make such a determination (i.e. the
> maintainers) before you waste the time of a /large number/ of people
> by raising a useless CVE...
> 
> Yes, you found a bug. No, it's not a security bug. No, you should
> not abuse the CVE process to apply pressure to get it fixed.
> Please don't do this again.
As far as I can tell, this is a medium-severity security issue that a
non-privileged user can exploit to cause a system hang/deadlock.
Hence, it is a local DoS for other users of the system.
   CAI Qian

^ permalink raw reply	[flat|nested] 104+ messages in thread

* [4.9-rc1+] overlayfs lockdep
  2016-10-07  7:08                                                       ` Jan Kara
  2016-10-07 14:43                                                         ` CAI Qian
@ 2016-10-21 15:38                                                         ` CAI Qian
  2016-10-24 12:57                                                           ` Miklos Szeredi
  1 sibling, 1 reply; 104+ messages in thread
From: CAI Qian @ 2016-10-21 15:38 UTC (permalink / raw)
  To: Jan Kara, Miklos Szeredi; +Cc: Al Viro, Linus Torvalds, linux-fsdevel


----- Original Message -----
> From: "Jan Kara" <jack@suse.cz>
> Sent: Friday, October 7, 2016 3:08:38 AM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> 
> So I believe this may be just a problem in overlayfs lockdep annotation
> (see below). Added Miklos to CC.
> 
> > Wait. There is also a lockep happened before the xfs internal error as
> > well.
> > 
> > [ 5839.452325] ======================================================
> > [ 5839.459221] [ INFO: possible circular locking dependency detected ]
> > [ 5839.466215] 4.8.0-rc8-splice-fixw-proc+ #4 Not tainted
> > [ 5839.471945] -------------------------------------------------------
> > [ 5839.478937] trinity-c220/69531 is trying to acquire lock:
> > [ 5839.484961]  (&p->lock){+.+.+.}, at: [<ffffffff812ac69c>]
> > seq_read+0x4c/0x3e0
> > [ 5839.492967]
> > but task is already holding lock:
> > [ 5839.499476]  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>]
> > __sb_start_write+0xd1/0xf0
> > [ 5839.508560]
> > which lock already depends on the new lock.
> > 
> > [ 5839.517686]
> > the existing dependency chain (in reverse order) is:
> > [ 5839.526036]
> > -> #3 (sb_writers#8){.+.+.+}:
> > [ 5839.530751]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> > [ 5839.537368]        [<ffffffff810f8f4a>] percpu_down_read+0x4a/0x90
> > [ 5839.544275]        [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
> > [ 5839.551181]        [<ffffffff812a8544>] mnt_want_write+0x24/0x50
> > [ 5839.557892]        [<ffffffffa04a398f>] ovl_want_write+0x1f/0x30
> > [overlay]
> > [ 5839.565577]        [<ffffffffa04a6036>] ovl_do_remove+0x46/0x480
> > [overlay]
> > [ 5839.573259]        [<ffffffffa04a64a3>] ovl_unlink+0x13/0x20 [overlay]
> > [ 5839.580555]        [<ffffffff812918ea>] vfs_unlink+0xda/0x190
> > [ 5839.586979]        [<ffffffff81293698>] do_unlinkat+0x268/0x2b0
> > [ 5839.593599]        [<ffffffff8129419b>] SyS_unlinkat+0x1b/0x30
> > [ 5839.600120]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> > [ 5839.606836]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> > [ 5839.614231]
> 
> So here is IMO the real culprit: do_unlinkat() grabs fs freeze protection
> through mnt_want_write(), we grab also i_rwsem in do_unlinkat() in
> I_MUTEX_PARENT class a bit after that and further down in vfs_unlink() we
> grab i_rwsem for the unlinked inode itself in default I_MUTEX class. Then
> in ovl_want_write() we grab freeze protection again, but this time for the
> upper filesystem. That establishes sb_writers (overlay) -> I_MUTEX_PARENT
> (overlay) -> I_MUTEX (overlay) -> sb_writers (FS-A) lock ordering
> (we maintain locking classes per fs type so that's why I'm showing fs type
> in parenthesis).
> 
> Now this nesting is nasty because once you add locks that are not tracked
> per fs type into the mix, you get cycles. In this case we've got
> seq_file->lock and cred_guard_mutex into the mix - the splice path is
> doing sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex (splicing
> from seq_file into the real filesystem). Exec path further establishes
> cred_guard_mutex -> I_MUTEX (overlay) which closes the full cycle:
> 
> sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex -> i_mutex
> (overlay) -> sb_writers (FS-A)
> 
> If I analyzed the lockdep trace, this looks like a real (although remote)
> deadlock possibility. Miklos?

So this can still be reproduced in yesterday's mainline.

[40581.813575] [ INFO: possible circular locking dependency detected ]
[40581.813578] 4.9.0-rc1-lockfix-uncorev2+ #51 Tainted: G        W      
[40581.813581] -------------------------------------------------------
[40581.813582] trinity-c104/39795 is trying to acquire lock:
[40581.813587]  (&p->lock){+.+.+.}, at: [<ffffffff8191588c>] seq_read+0xec/0x1400
[40581.813603] 
[40581.813603] but task is already holding lock:
[40581.813605]  (sb_writers#8){.+.+.+}, at: [<ffffffff81889c6a>] do_sendfile+0x9ea/0x1270
[40581.813618] 
[40581.813618] which lock already depends on the new lock.
[40581.813618] 
[40581.813620] 
[40581.813620] the existing dependency chain (in reverse order) is:
[40581.813623] 
[40581.813623] -> #3 (sb_writers#8){.+.+.+}:
[40581.813636]        [<ffffffff8133dcda>] __lock_acquire+0x9aa/0x1710
[40581.813640]        [<ffffffff8133fd4e>] lock_acquire+0x24e/0x5d0
[40581.813644]        [<ffffffff8189037e>] __sb_start_write+0xae/0x360
[40581.813650]        [<ffffffff819066fa>] mnt_want_write+0x4a/0xc0
[40581.813661]        [<ffffffffa16cdfbd>] ovl_want_write+0x8d/0xf0 [overlay]
[40581.813668]        [<ffffffffa16d4dc7>] ovl_do_remove+0xe7/0x9a0 [overlay]
[40581.813675]        [<ffffffffa16d5696>] ovl_rmdir+0x16/0x20 [overlay]
[40581.813680]        [<ffffffff818af90f>] vfs_rmdir+0x1bf/0x3e0
[40581.813685]        [<ffffffff818c5965>] do_rmdir+0x2c5/0x430
[40581.813689]        [<ffffffff818c8242>] SyS_unlinkat+0x22/0x30
[40581.813696]        [<ffffffff8100924d>] do_syscall_64+0x19d/0x540
[40581.813704]        [<ffffffff82c8af24>] return_from_SYSCALL_64+0x0/0x7a
[40581.813707] 
[40581.813707] -> #2 (&sb->s_type->i_mutex_key#17){++++++}:
[40581.813720]        [<ffffffff8133dcda>] __lock_acquire+0x9aa/0x1710
[40581.813726]        [<ffffffff8133fd4e>] lock_acquire+0x24e/0x5d0
[40581.813736]        [<ffffffff82c84261>] down_read+0xa1/0x1c0
[40581.813740]        [<ffffffff818ae2db>] lookup_slow+0x17b/0x4f0
[40581.813744]        [<ffffffff818bb228>] walk_component+0x728/0x1d10
[40581.813750]        [<ffffffff818bcc1e>] link_path_walk+0x40e/0x1690
[40581.813758]        [<ffffffff818c0274>] path_openat+0x1c4/0x3870
[40581.813764]        [<ffffffff818c6d19>] do_filp_open+0x1a9/0x2e0
[40581.813772]        [<ffffffff8189832b>] do_open_execat+0xcb/0x420
[40581.813783]        [<ffffffff8189932b>] open_exec+0x2b/0x50
[40581.813793]        [<ffffffff819ea78c>] load_elf_binary+0x103c/0x3550
[40581.813807]        [<ffffffff8189a852>] search_binary_handler+0x162/0x480
[40581.813814]        [<ffffffff818a106a>] do_execveat_common.isra.24+0x138a/0x2570
[40581.813823]        [<ffffffff818a2efa>] SyS_execve+0x3a/0x50
[40581.813828]        [<ffffffff8100924d>] do_syscall_64+0x19d/0x540
[40581.813833]        [<ffffffff82c8af24>] return_from_SYSCALL_64+0x0/0x7a
[40581.813843] 
[40581.813843] -> #1 (&sig->cred_guard_mutex){+.+.+.}:
[40581.813861]        [<ffffffff8133dcda>] __lock_acquire+0x9aa/0x1710
[40581.813871]        [<ffffffff8133fd4e>] lock_acquire+0x24e/0x5d0
[40581.813885]        [<ffffffff82c7d1d3>] mutex_lock_killable_nested+0x103/0xb90
[40581.813895]        [<ffffffff81a3f7a6>] do_io_accounting+0x186/0xcf0
[40581.813902]        [<ffffffff81a40329>] proc_tgid_io_accounting+0x19/0x20
[40581.813908]        [<ffffffff81a41494>] proc_single_show+0x114/0x1d0
[40581.813917]        [<ffffffff81915ad4>] seq_read+0x334/0x1400
[40581.813921]        [<ffffffff81884da6>] __vfs_read+0x106/0x990
[40581.813927]        [<ffffffff81886038>] vfs_read+0x118/0x400
[40581.813931]        [<ffffffff8188aebf>] SyS_read+0xdf/0x1d0
[40581.813938]        [<ffffffff8100924d>] do_syscall_64+0x19d/0x540
[40581.813945]        [<ffffffff82c8af24>] return_from_SYSCALL_64+0x0/0x7a
[40581.813949] 
[40581.813949] -> #0 (&p->lock){+.+.+.}:
[40581.813961]        [<ffffffff81337938>] validate_chain.isra.31+0x2b28/0x4c00
[40581.813965]        [<ffffffff8133dcda>] __lock_acquire+0x9aa/0x1710
[40581.813972]        [<ffffffff8133fd4e>] lock_acquire+0x24e/0x5d0
[40581.813977]        [<ffffffff82c7f2f8>] mutex_lock_nested+0x108/0xa50
[40581.813983]        [<ffffffff8191588c>] seq_read+0xec/0x1400
[40581.813993]        [<ffffffff81a7bdde>] kernfs_fop_read+0x35e/0x640
[40581.813998]        [<ffffffff818812ef>] do_loop_readv_writev+0xdf/0x250
[40581.814003]        [<ffffffff81886fb5>] do_readv_writev+0x6a5/0xab0
[40581.814007]        [<ffffffff81887446>] vfs_readv+0x86/0xe0
[40581.814020]        [<ffffffff8194fdac>] default_file_splice_read+0x49c/0xbb0
[40581.814026]        [<ffffffff8194eb74>] do_splice_to+0x104/0x1a0
[40581.814033]        [<ffffffff8194ee80>] splice_direct_to_actor+0x270/0xa00
[40581.814039]        [<ffffffff8194f7a4>] do_splice_direct+0x194/0x300
[40581.814046]        [<ffffffff818896e9>] do_sendfile+0x469/0x1270
[40581.814051]        [<ffffffff8188bcb0>] SyS_sendfile64+0x140/0x150
[40581.814054]        [<ffffffff8100924d>] do_syscall_64+0x19d/0x540
[40581.814059]        [<ffffffff82c8af24>] return_from_SYSCALL_64+0x0/0x7a
[40581.814062] 
[40581.814062] other info that might help us debug this:
[40581.814062] 
[40581.814066] Chain exists of:
[40581.814066]   &p->lock --> &sb->s_type->i_mutex_key#17 --> sb_writers#8
[40581.814079] 
[40581.814079] 
[40581.814080]  Possible unsafe locking scenario:
[40581.814080] 
[40581.814081]        CPU0                    CPU1
[40581.814083]        ----                    ----
[40581.814085]   lock(sb_writers#8);
[40581.814091]                                lock(&sb->s_type->i_mutex_key#17);
[40581.814097]                                lock(sb_writers#8);
[40581.814101]   lock(&p->lock);
[40581.814104] 
[40581.814104]  *** DEADLOCK ***
[40581.814104] 
[40581.814106] 1 lock held by trinity-c104/39795:
[40581.814109]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81889c6a>] do_sendfile+0x9ea/0x1270
[40581.814118] 
[40581.814118] stack backtrace:
[40581.814121] CPU: 25 PID: 39795 Comm: trinity-c104 Tainted: G        W       4.9.0-rc1-lockfix-uncorev2+ #51
[40581.814123] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRRFSDP1.86B.0271.R00.1510301446 10/30/2015
[40581.814131]  ffff880825886da0 ffffffff81d37124 0000000041b58ab3 ffffffff83348dc7
[40581.814138]  ffffffff81d37064 0000000000000001 0000000000000000 ffff8807b4d5d5d8
[40581.814145]  00000000bfc018be ffff880825886d78 0000000000000001 0000000000000000
[40581.814146] Call Trace:
[40581.814155]  [<ffffffff81d37124>] dump_stack+0xc0/0x12c
[40581.814159]  [<ffffffff81d37064>] ? _atomic_dec_and_lock+0xc4/0xc4
[40581.814168]  [<ffffffff81332fa9>] print_circular_bug+0x3c9/0x5e0
[40581.814171]  [<ffffffff81332be0>] ? print_circular_bug_entry+0xd0/0xd0
[40581.814176]  [<ffffffff81337938>] validate_chain.isra.31+0x2b28/0x4c00
[40581.814182]  [<ffffffff81334e10>] ? check_irq_usage+0x300/0x300
[40581.814192]  [<ffffffff81334e10>] ? check_irq_usage+0x300/0x300
[40581.814196]  [<ffffffff81de21f3>] ? __this_cpu_preempt_check+0x13/0x20
[40581.814200]  [<ffffffff81336045>] ? validate_chain.isra.31+0x1235/0x4c00
[40581.814204]  [<ffffffff8133a4d0>] ? print_usage_bug+0x700/0x700
[40581.814208]  [<ffffffff812abdc0>] ? sched_clock_cpu+0x1b0/0x310
[40581.814214]  [<ffffffff8133a4d0>] ? print_usage_bug+0x700/0x700
[40581.814219]  [<ffffffff812abdc0>] ? sched_clock_cpu+0x1b0/0x310
[40581.814226]  [<ffffffff8133dcda>] __lock_acquire+0x9aa/0x1710
[40581.814232]  [<ffffffff8133fd4e>] lock_acquire+0x24e/0x5d0
[40581.814235]  [<ffffffff8191588c>] ? seq_read+0xec/0x1400
[40581.814240]  [<ffffffff8191588c>] ? seq_read+0xec/0x1400
[40581.814243]  [<ffffffff82c7f2f8>] mutex_lock_nested+0x108/0xa50
[40581.814246]  [<ffffffff8191588c>] ? seq_read+0xec/0x1400
[40581.814250]  [<ffffffff8191588c>] ? seq_read+0xec/0x1400
[40581.814256]  [<ffffffff817fedd6>] ? kasan_unpoison_shadow+0x36/0x50
[40581.814259]  [<ffffffff82c7f1f0>] ? mutex_lock_interruptible_nested+0xb40/0xb40
[40581.814264]  [<ffffffff8168ec2c>] ? get_page_from_freelist+0x175c/0x2ed0
[40581.814271]  [<ffffffff8168d4d0>] ? __isolate_free_page+0x7e0/0x7e0
[40581.814275]  [<ffffffff8133c3f9>] ? mark_held_locks+0x109/0x290
[40581.814278]  [<ffffffff8191588c>] seq_read+0xec/0x1400
[40581.814283]  [<ffffffff813ac01d>] ? rcu_lockdep_current_cpu_online+0x11d/0x1d0
[40581.814290]  [<ffffffff819157a0>] ? seq_hlist_start_percpu+0x4a0/0x4a0
[40581.814295]  [<ffffffff8198ef20>] ? __fsnotify_update_child_dentry_flags.part.0+0x2b0/0x2b0
[40581.814298]  [<ffffffff81de21f3>] ? __this_cpu_preempt_check+0x13/0x20
[40581.814300]  [<ffffffff81a7bdde>] kernfs_fop_read+0x35e/0x640
[40581.814305]  [<ffffffff81b49a55>] ? selinux_file_permission+0x3c5/0x550
[40581.814310]  [<ffffffff81a7ba80>] ? kernfs_fop_open+0xf40/0xf40
[40581.814312]  [<ffffffff818812ef>] do_loop_readv_writev+0xdf/0x250
[40581.814318]  [<ffffffff81886fb5>] do_readv_writev+0x6a5/0xab0
[40581.814324]  [<ffffffff81886910>] ? vfs_write+0x5f0/0x5f0
[40581.814328]  [<ffffffff81d8fbaf>] ? iov_iter_get_pages_alloc+0x53f/0x1990
[40581.814332]  [<ffffffff81d8f670>] ? iov_iter_npages+0xed0/0xed0
[40581.814336]  [<ffffffff8133c3f9>] ? mark_held_locks+0x109/0x290
[40581.814339]  [<ffffffff81de21f3>] ? __this_cpu_preempt_check+0x13/0x20
[40581.814344]  [<ffffffff8133caa0>] ? trace_hardirqs_on_caller+0x520/0x720
[40581.814347]  [<ffffffff81887446>] vfs_readv+0x86/0xe0
[40581.814352]  [<ffffffff8194fdac>] default_file_splice_read+0x49c/0xbb0
[40581.814361]  [<ffffffff8194f910>] ? do_splice_direct+0x300/0x300
[40581.814363]  [<ffffffff817fef3d>] ? kasan_kmalloc+0xad/0xe0
[40581.814366]  [<ffffffff818a6287>] ? alloc_pipe_info+0x1b7/0x410
[40581.814371]  [<ffffffff8133a4d0>] ? print_usage_bug+0x700/0x700
[40581.814373]  [<ffffffff8188bcb0>] ? SyS_sendfile64+0x140/0x150
[40581.814377]  [<ffffffff8100924d>] ? do_syscall_64+0x19d/0x540
[40581.814380]  [<ffffffff82c8af24>] ? entry_SYSCALL64_slow_path+0x25/0x25
[40581.814382]  [<ffffffff812abdc0>] ? sched_clock_cpu+0x1b0/0x310
[40581.814386]  [<ffffffff8133c3f9>] ? mark_held_locks+0x109/0x290
[40581.814390]  [<ffffffff8133caa0>] ? trace_hardirqs_on_caller+0x520/0x720
[40581.814395]  [<ffffffff8198ef20>] ? __fsnotify_update_child_dentry_flags.part.0+0x2b0/0x2b0
[40581.814398]  [<ffffffff81b49a55>] ? selinux_file_permission+0x3c5/0x550
[40581.814404]  [<ffffffff81b26e96>] ? security_file_permission+0x176/0x220
[40581.814408]  [<ffffffff81885c78>] ? rw_verify_area+0xd8/0x380
[40581.814411]  [<ffffffff8194eb74>] do_splice_to+0x104/0x1a0
[40581.814415]  [<ffffffff818a63b7>] ? alloc_pipe_info+0x2e7/0x410
[40581.814419]  [<ffffffff8194ee80>] splice_direct_to_actor+0x270/0xa00
[40581.814424]  [<ffffffff8194c5e0>] ? wakeup_pipe_readers+0x90/0x90
[40581.814429]  [<ffffffff8194ec10>] ? do_splice_to+0x1a0/0x1a0
[40581.814432]  [<ffffffff81885c78>] ? rw_verify_area+0xd8/0x380
[40581.814438]  [<ffffffff8194f7a4>] do_splice_direct+0x194/0x300
[40581.814443]  [<ffffffff8194f610>] ? splice_direct_to_actor+0xa00/0xa00
[40581.814450]  [<ffffffff81278cee>] ? preempt_count_sub+0x5e/0xe0
[40581.814452]  [<ffffffff81890415>] ? __sb_start_write+0x145/0x360
[40581.814457]  [<ffffffff818896e9>] do_sendfile+0x469/0x1270
[40581.814461]  [<ffffffff81889280>] ? do_compat_pwritev64.isra.16+0xd0/0xd0
[40581.814466]  [<ffffffff814cb287>] ? __audit_syscall_exit+0x637/0x960
[40581.814469]  [<ffffffff81006afb>] ? syscall_trace_enter+0x89b/0x1930
[40581.814473]  [<ffffffff817f7993>] ? kfree+0x3f3/0x620
[40581.814477]  [<ffffffff8188bcb0>] SyS_sendfile64+0x140/0x150
[40581.814479]  [<ffffffff8188bb70>] ? SyS_sendfile+0x140/0x140
[40581.814482]  [<ffffffff81de21f3>] ? __this_cpu_preempt_check+0x13/0x20
[40581.814485]  [<ffffffff8188bb70>] ? SyS_sendfile+0x140/0x140
[40581.814487]  [<ffffffff8100924d>] do_syscall_64+0x19d/0x540
[40581.814491]  [<ffffffff82c8af24>] entry_SYSCALL64_slow_path+0x25/0x25

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [4.9-rc1+] overlayfs lockdep
  2016-10-21 15:38                                                         ` [4.9-rc1+] overlayfs lockdep CAI Qian
@ 2016-10-24 12:57                                                           ` Miklos Szeredi
  0 siblings, 0 replies; 104+ messages in thread
From: Miklos Szeredi @ 2016-10-24 12:57 UTC (permalink / raw)
  To: CAI Qian; +Cc: Jan Kara, Al Viro, Linus Torvalds, linux-fsdevel

On Fri, Oct 21, 2016 at 5:38 PM, CAI Qian <caiqian@redhat.com> wrote:
>
> ----- Original Message -----
>> From: "Jan Kara" <jack@suse.cz>
>> Sent: Friday, October 7, 2016 3:08:38 AM
>> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
>>
>>
>> So I believe this may be just a problem in overlayfs lockdep annotation
>> (see below). Added Miklos to CC.
>>
>> > Wait. There is also a lockep happened before the xfs internal error as
>> > well.
>> >
>> > [ 5839.452325] ======================================================
>> > [ 5839.459221] [ INFO: possible circular locking dependency detected ]
>> > [ 5839.466215] 4.8.0-rc8-splice-fixw-proc+ #4 Not tainted
>> > [ 5839.471945] -------------------------------------------------------
>> > [ 5839.478937] trinity-c220/69531 is trying to acquire lock:
>> > [ 5839.484961]  (&p->lock){+.+.+.}, at: [<ffffffff812ac69c>]
>> > seq_read+0x4c/0x3e0
>> > [ 5839.492967]
>> > but task is already holding lock:
>> > [ 5839.499476]  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>]
>> > __sb_start_write+0xd1/0xf0
>> > [ 5839.508560]
>> > which lock already depends on the new lock.
>> >
>> > [ 5839.517686]
>> > the existing dependency chain (in reverse order) is:
>> > [ 5839.526036]
>> > -> #3 (sb_writers#8){.+.+.+}:
>> > [ 5839.530751]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
>> > [ 5839.537368]        [<ffffffff810f8f4a>] percpu_down_read+0x4a/0x90
>> > [ 5839.544275]        [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
>> > [ 5839.551181]        [<ffffffff812a8544>] mnt_want_write+0x24/0x50
>> > [ 5839.557892]        [<ffffffffa04a398f>] ovl_want_write+0x1f/0x30
>> > [overlay]
>> > [ 5839.565577]        [<ffffffffa04a6036>] ovl_do_remove+0x46/0x480
>> > [overlay]
>> > [ 5839.573259]        [<ffffffffa04a64a3>] ovl_unlink+0x13/0x20 [overlay]
>> > [ 5839.580555]        [<ffffffff812918ea>] vfs_unlink+0xda/0x190
>> > [ 5839.586979]        [<ffffffff81293698>] do_unlinkat+0x268/0x2b0
>> > [ 5839.593599]        [<ffffffff8129419b>] SyS_unlinkat+0x1b/0x30
>> > [ 5839.600120]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
>> > [ 5839.606836]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
>> > [ 5839.614231]
>>
>> So here is IMO the real culprit: do_unlinkat() grabs fs freeze protection
>> through mnt_want_write(), we grab also i_rwsem in do_unlinkat() in
>> I_MUTEX_PARENT class a bit after that and further down in vfs_unlink() we
>> grab i_rwsem for the unlinked inode itself in default I_MUTEX class. Then
>> in ovl_want_write() we grab freeze protection again, but this time for the
>> upper filesystem. That establishes sb_writers (overlay) -> I_MUTEX_PARENT
>> (overlay) -> I_MUTEX (overlay) -> sb_writers (FS-A) lock ordering
>> (we maintain locking classes per fs type so that's why I'm showing fs type
>> in parenthesis).
>>
>> Now this nesting is nasty because once you add locks that are not tracked
>> per fs type into the mix, you get cycles. In this case we've got
>> seq_file->lock and cred_guard_mutex into the mix - the splice path is
>> doing sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex (splicing
>> from seq_file into the real filesystem). Exec path further establishes
>> cred_guard_mutex -> I_MUTEX (overlay) which closes the full cycle:
>>
>> sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex -> i_mutex
>> (overlay) -> sb_writers (FS-A)
>>
>> If I analyzed the lockdep trace, this looks like a real (although remote)
>> deadlock possibility. Miklos?

Yeah, you can leave out seq_file->lock, the chain can be made up from
just 3 parts:

unlink: i_mutex(ov) -> sb_writers(fs-a)
splice: sb_writers(fs-a) -> cred_guard_mutex (through proc_tgid_io_accounting)
exec:   cred_guard_mutex -> i_mutex(ov)

None of those are incorrect, but the cred_guard_mutex usage is also
pretty weird: taken outside path lookup as well as inside ->read() in
proc.

Doesn't look like a serious worry in practice; I don't think anybody would
trigger the actual deadlock in 1000 years (an fs freeze is needed at
just the right moment in addition to the above, very unlikely chain).
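
For illustration only, the three-part chain above can be written down as a
lock-order graph and checked for a cycle. This is a minimal sketch, not how
lockdep itself is implemented: find_lock_cycle() and EDGES are hypothetical
names made up for the example, and the lock labels are just the shorthand
used in this mail.

```python
def find_lock_cycle(edges):
    """Return one cycle in a lock-order graph as a list of lock names, or None.

    Each edge (held, wanted) means "wanted was taken while held was held".
    """
    graph = {}
    for held, wanted in edges:
        graph.setdefault(held, []).append(wanted)
        graph.setdefault(wanted, [])

    WHITE, GREY, BLACK = 0, 1, 2          # unvisited / on current path / done
    colour = {node: WHITE for node in graph}
    stack = []                             # current depth-first path

    def visit(node):
        colour[node] = GREY
        stack.append(node)
        for nxt in graph[node]:
            if colour[nxt] == GREY:        # back edge: the path loops onto itself
                return stack[stack.index(nxt):] + [nxt]
            if colour[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        colour[node] = BLACK
        return None

    for node in graph:
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# The three orderings from the chain above (labels are shorthand, not kernel symbols).
EDGES = [
    ("i_mutex(ov)", "sb_writers(fs-a)"),        # unlink
    ("sb_writers(fs-a)", "cred_guard_mutex"),   # splice, through proc_tgid_io_accounting
    ("cred_guard_mutex", "i_mutex(ov)"),        # exec
]

cycle = find_lock_cycle(EDGES)
print(" -> ".join(cycle) if cycle else "no cycle")
```

Dropping any one of the three edges leaves the graph acyclic, which is another
way of saying that each ordering is fine on its own and only the combination
closes the loop.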

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-24  3:59                                   ` [PATCH 04/12] " Al Viro
  2016-09-26 13:35                                     ` Miklos Szeredi
@ 2016-12-17 19:54                                     ` Andreas Schwab
  2016-12-18 19:28                                       ` Linus Torvalds
  1 sibling, 1 reply; 104+ messages in thread
From: Andreas Schwab @ 2016-12-17 19:54 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

This breaks EPIPE handling inside splice when SIGPIPE is ignored:

Before:
$ { sleep 1; strace -e splice pv -q /dev/zero; } | :
splice(3, NULL, 1, NULL, 131072, SPLICE_F_MORE) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=23750, si_uid=17005} ---
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=23750, si_uid=17005} ---
+++ exited with 0 +++

After:
$ { sleep 1; strace -e splice pv -q /dev/zero; } | :
splice(3, NULL, 1, NULL, 131072, SPLICE_F_MORE) = 65536
splice(3, NULL, 1, NULL, 131072, SPLICE_F_MORE
[hangs]

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-17 19:54                                     ` Andreas Schwab
@ 2016-12-18 19:28                                       ` Linus Torvalds
  2016-12-18 19:57                                         ` Andreas Schwab
  2016-12-18 20:12                                         ` Al Viro
  0 siblings, 2 replies; 104+ messages in thread
From: Linus Torvalds @ 2016-12-18 19:28 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Al Viro, Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Sat, Dec 17, 2016 at 11:54 AM, Andreas Schwab <schwab@linux-m68k.org> wrote:
> This breaks EPIPE handling inside splice when SIGPIPE is ignored:
>
> Before:
> $ { sleep 1; strace -e splice pv -q /dev/zero; } | :

Where is that "splice" program from? Google isn't helpful, and fedora
doesn't seem to have it. I'm assuming it was posted in one of the
threads, but if so I've long since lost sight of it..

             Linus

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-18 19:28                                       ` Linus Torvalds
@ 2016-12-18 19:57                                         ` Andreas Schwab
  2016-12-18 20:12                                         ` Al Viro
  1 sibling, 0 replies; 104+ messages in thread
From: Andreas Schwab @ 2016-12-18 19:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Dec 18 2016, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Dec 17, 2016 at 11:54 AM, Andreas Schwab <schwab@linux-m68k.org> wrote:
>> This breaks EPIPE handling inside splice when SIGPIPE is ignored:
>>
>> Before:
>> $ { sleep 1; strace -e splice pv -q /dev/zero; } | :
>
> Where is that "splice" program from?

It's running pv (splice is the argument of strace -e).

http://ivarch.com/programs/pv.shtml

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-18 19:28                                       ` Linus Torvalds
  2016-12-18 19:57                                         ` Andreas Schwab
@ 2016-12-18 20:12                                         ` Al Viro
  2016-12-18 20:30                                           ` Al Viro
  1 sibling, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-12-18 20:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Schwab, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Sun, Dec 18, 2016 at 11:28:44AM -0800, Linus Torvalds wrote:
> On Sat, Dec 17, 2016 at 11:54 AM, Andreas Schwab <schwab@linux-m68k.org> wrote:
> > This breaks EPIPE handling inside splice when SIGPIPE is ignored:
> >
> > Before:
> > $ { sleep 1; strace -e splice pv -q /dev/zero; } | :
> 
> Where is that "splice" program from? Google isn't helpful, and fedora
> doesn't seem to have it. I'm assuming it was posted in one of the
> threads, but if so I've long since lost sight of it..

It's pv(1), actually.  I'm looking into that - debian-packaged pv reproduced
that crap.

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-18 20:12                                         ` Al Viro
@ 2016-12-18 20:30                                           ` Al Viro
  2016-12-18 22:10                                             ` Linus Torvalds
  0 siblings, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-12-18 20:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Schwab, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Sun, Dec 18, 2016 at 08:12:07PM +0000, Al Viro wrote:
> On Sun, Dec 18, 2016 at 11:28:44AM -0800, Linus Torvalds wrote:
> > On Sat, Dec 17, 2016 at 11:54 AM, Andreas Schwab <schwab@linux-m68k.org> wrote:
> > > This breaks EPIPE handling inside splice when SIGPIPE is ignored:
> > >
> > > Before:
> > > $ { sleep 1; strace -e splice pv -q /dev/zero; } | :
> > 
> > Where is that "splice" program from? Google isn't helpful, and fedora
> > doesn't seem to have it. I'm assuming it was posted in one of the
> > threads, but if so I've long since lost sight of it..
> 
> It's pv(1), actually.  I'm looking into that - debian-packaged pv reproduced
> that crap.

OK, I see what's going on - it's wait_for_space() lifted past the checks
for lack of readers.  The fix, AFAICS, is simply

diff --git a/fs/splice.c b/fs/splice.c
index 6a2b0db5..aeba2b7 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1082,6 +1082,10 @@ EXPORT_SYMBOL(do_splice_direct);
 
 static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
 {
+	if (unlikely(!pipe->readers)) {
+		send_sig(SIGPIPE, current, 0);
+		return -EPIPE;
+	}
 	while (pipe->nrbufs == pipe->buffers) {
 		if (flags & SPLICE_F_NONBLOCK)
 			return -EAGAIN;
@@ -1090,6 +1094,10 @@ static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
 		pipe->waiting_writers++;
 		pipe_wait(pipe);
 		pipe->waiting_writers--;
+		if (unlikely(!pipe->readers)) {
+			send_sig(SIGPIPE, current, 0);
+			return -EPIPE;
+		}
 	}
 	return 0;
 }

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-18 20:30                                           ` Al Viro
@ 2016-12-18 22:10                                             ` Linus Torvalds
  2016-12-18 22:18                                               ` Al Viro
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 104+ messages in thread
From: Linus Torvalds @ 2016-12-18 22:10 UTC (permalink / raw)
  To: Al Viro
  Cc: Andreas Schwab, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Sun, Dec 18, 2016 at 12:30 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> OK, I see what's going on - it's wait_for_space() lifted past the checks
> for lack of readers.  The fix, AFAICS, is simply

Ugh. Does it have to be duplicated?

How about just making the wait_for_space() loop be a for-loop, and writing it as

   for (;;) {
        if (unlikely(!pipe->readers)) {
                send_sig(SIGPIPE, current, 0);
                return -EPIPE;
        }
        if (pipe->nrbufs == pipe->buffers)
                return 0;
        if (flags & SPLICE_F_NONBLOCK)
                return -EAGAIN;
        if (signal_pending(current))
                return -ERESTARTSYS;
        pipe->waiting_writers++;
        pipe_wait(pipe);
        pipe->waiting_writers--;
   }

and just having it once?

Regardless - Andreas, can you verify that that fixes your issues? I'm
assuming you had some real load that made you notice this, not just the
dummy example.

            Linus

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-18 22:10                                             ` Linus Torvalds
@ 2016-12-18 22:18                                               ` Al Viro
  2016-12-18 22:22                                                 ` Linus Torvalds
  2016-12-18 22:49                                               ` Andreas Schwab
  2016-12-21 18:56                                               ` Andreas Schwab
  2 siblings, 1 reply; 104+ messages in thread
From: Al Viro @ 2016-12-18 22:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Schwab, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Sun, Dec 18, 2016 at 02:10:54PM -0800, Linus Torvalds wrote:
> On Sun, Dec 18, 2016 at 12:30 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > OK, I see what's going on - it's wait_for_space() lifted past the checks
> > for lack of readers.  The fix, AFAICS, is simply
> 
> Ugh. Does it have to be duplicated?
> 
> How about just making the wait_for_space() loop be a for-loop, and writing it as
> 
>    for (;;) {
>         if (unlikely(!pipe->readers)) {
>                 send_sig(SIGPIPE, current, 0);
>                 return -EPIPE;
>         }
>         if (pipe->nrbufs == pipe->buffers)

ITYM "!="...

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-18 22:18                                               ` Al Viro
@ 2016-12-18 22:22                                                 ` Linus Torvalds
  0 siblings, 0 replies; 104+ messages in thread
From: Linus Torvalds @ 2016-12-18 22:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Andreas Schwab, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Sun, Dec 18, 2016 at 2:18 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> ITYM "!="...

Right. A bit too much cut-and-pasting going on in my email ;)

              Linus

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-18 22:10                                             ` Linus Torvalds
  2016-12-18 22:18                                               ` Al Viro
@ 2016-12-18 22:49                                               ` Andreas Schwab
  2016-12-21 18:56                                               ` Andreas Schwab
  2 siblings, 0 replies; 104+ messages in thread
From: Andreas Schwab @ 2016-12-18 22:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Dec 18 2016, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Regardless - Andreas, can you verify that that fixes your issues? I'm
> assuming you had some real load that made you notice this, not just the
> dummy example..

This is from the testsuite of pv; I only noticed because it was hanging.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-18 22:10                                             ` Linus Torvalds
  2016-12-18 22:18                                               ` Al Viro
  2016-12-18 22:49                                               ` Andreas Schwab
@ 2016-12-21 18:56                                               ` Andreas Schwab
  2016-12-21 19:12                                                 ` Linus Torvalds
  2 siblings, 1 reply; 104+ messages in thread
From: Andreas Schwab @ 2016-12-21 18:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Dec 18 2016, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Regardless - Andreas, can you verify that that fixes your issues? I'm
> assuming you had some real load that made you notice this, not just the
> dummy example..

FWIW, I have verified that the testsuite of pv succeeds with this patch:

diff --git a/fs/splice.c b/fs/splice.c
index 5a7750bd2e..63b8f54485 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1086,7 +1086,13 @@ EXPORT_SYMBOL(do_splice_direct);
 
 static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
 {
-	while (pipe->nrbufs == pipe->buffers) {
+	for (;;) {
+		if (unlikely(!pipe->readers)) {
+			send_sig(SIGPIPE, current, 0);
+			return -EPIPE;
+		}
+		if (pipe->nrbufs != pipe->buffers)
+			return 0;
 		if (flags & SPLICE_F_NONBLOCK)
 			return -EAGAIN;
 		if (signal_pending(current))
@@ -1095,7 +1101,6 @@ static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
 		pipe_wait(pipe);
 		pipe->waiting_writers--;
 	}
-	return 0;
 }
 
 static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
-- 
2.11.0
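
The control flow of the reworked loop can be modelled in userspace to check
the order of the tests.  This is a rough sketch, not kernel code: struct
pipe_model, wait_fn and the two callbacks are hypothetical stand-ins,
pipe_wait() is replaced by a callback that mutates the model, -EINTR stands
in for the kernel-internal -ERESTARTSYS, and the SIGPIPE delivery and the
waiting_writers accounting are left out.

```c
#include <errno.h>

/* Hypothetical stand-in for the pipe_inode_info fields the loop reads. */
struct pipe_model {
	unsigned int readers;	/* number of readers still attached */
	unsigned int nrbufs;	/* buffers currently in use */
	unsigned int buffers;	/* total buffers in the ring */
};

/* Stand-in for pipe_wait(): whatever happens while the writer sleeps. */
typedef void (*wait_fn)(struct pipe_model *);

/*
 * Same order of checks as the patched loop: readers first, then free
 * space, then the non-blocking and pending-signal exits, then wait and
 * start over.  The model spins forever if the callback never changes
 * the pipe state, just as a real writer would keep sleeping.
 */
int wait_for_space_model(struct pipe_model *pipe, int nonblock,
			 int sig_pending, wait_fn wait)
{
	for (;;) {
		if (!pipe->readers)
			return -EPIPE;
		if (pipe->nrbufs != pipe->buffers)
			return 0;
		if (nonblock)
			return -EAGAIN;
		if (sig_pending)
			return -EINTR;	/* stand-in for -ERESTARTSYS */
		wait(pipe);
	}
}

/* While the writer sleeps, the last reader goes away. */
void reader_exits(struct pipe_model *pipe)
{
	pipe->readers = 0;
}

/* While the writer sleeps, a reader consumes one buffer. */
void reader_drains_one(struct pipe_model *pipe)
{
	if (pipe->nrbufs)
		pipe->nrbufs--;
}
```

With no readers attached the model returns -EPIPE before it ever looks at
the buffer count, and a reader that exits during the wait is noticed on the
next pass through the loop.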


Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-21 18:56                                               ` Andreas Schwab
@ 2016-12-21 19:12                                                 ` Linus Torvalds
  0 siblings, 0 replies; 104+ messages in thread
From: Linus Torvalds @ 2016-12-21 19:12 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Al Viro, Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Wed, Dec 21, 2016 at 10:56 AM, Andreas Schwab <schwab@linux-m68k.org> wrote:
>
> FWIW, I have verified that the testsuite of pv succeeds with this patch:

Ok, thanks, committed.

Al, looking at this area, I think there's some room for cleanups. In
particular, isn't the loop in opipe_prep() now just
"wait_for_space()"? I'm also thinking that we could perhaps remove the
SIGPIPE/EPIPE handling from splice_to_pipe()..

Hmm?

               Linus

end of thread, other threads:[~2016-12-21 19:12 UTC | newest]

Thread overview: 104+ messages
     [not found] <20160908235521.GL2356@ZenIV.linux.org.uk>
     [not found] ` <20160909015324.GD30056@dastard>
     [not found]   ` <CA+55aFzohsUXj_3BeFNr2t50Wm=G+7toRDEz=Tk7VJqP3n1hXQ@mail.gmail.com>
     [not found]     ` <CA+55aFxrqCng2Qxasc9pyMrKUGFjo==fEaFT1vkH9Lncte3RgQ@mail.gmail.com>
     [not found]       ` <20160909023452.GO2356@ZenIV.linux.org.uk>
     [not found]         ` <CA+55aFwHQMjO4-vtfB9-ytc=o+DRo-HXVGckvXLboUxgpwb7_g@mail.gmail.com>
     [not found]           ` <20160909221945.GQ2356@ZenIV.linux.org.uk>
     [not found]             ` <CA+55aFzTOOB6oEVaaGD0N7Uznk-W9+ULPwzsxS_L_oZqGVSeLA@mail.gmail.com>
     [not found]               ` <20160914031648.GB2356@ZenIV.linux.org.uk>
     [not found]                 ` <20160914133925.2fba4629@roar.ozlabs.ibm.com>
2016-09-18  5:33                   ` xfs_file_splice_read: possible circular locking dependency detected Al Viro
2016-09-19  3:08                     ` Nicholas Piggin
2016-09-19  6:11                       ` Al Viro
2016-09-19  7:26                         ` Nicholas Piggin
     [not found]                 ` <CA+55aFznQaOWoSMNphgGJJWZ=8-odrc0DAUMzfGPQe+_N4UgNA@mail.gmail.com>
     [not found]                   ` <20160914042559.GC2356@ZenIV.linux.org.uk>
     [not found]                     ` <20160917082007.GA6489@ZenIV.linux.org.uk>
     [not found]                       ` <20160917190023.GA8039@ZenIV.linux.org.uk>
2016-09-18 19:31                         ` skb_splice_bits() and large chunks in pipe (was " Al Viro
2016-09-18 20:12                           ` Linus Torvalds
2016-09-18 22:31                             ` Al Viro
2016-09-19  0:18                               ` Linus Torvalds
2016-09-19  0:22                               ` Al Viro
2016-09-20  9:51                                 ` Herbert Xu
2016-09-23 19:00                         ` [RFC][CFT] splice_read reworked Al Viro
2016-09-23 19:01                           ` [PATCH 01/11] fix memory leaks in tracing_buffers_splice_read() Al Viro
2016-09-23 19:02                           ` [PATCH 02/11] splice_to_pipe(): don't open-code wakeup_pipe_readers() Al Viro
2016-09-23 19:02                           ` [PATCH 03/11] splice: switch get_iovec_page_array() to iov_iter Al Viro
2016-09-23 19:03                           ` [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe() Al Viro
2016-09-23 19:45                             ` Linus Torvalds
2016-09-23 20:10                               ` Al Viro
2016-09-23 20:36                                 ` Linus Torvalds
2016-09-24  3:59                                   ` Al Viro
2016-09-24 17:29                                     ` Al Viro
2016-09-27 15:38                                       ` Nicholas Piggin
2016-09-27 15:53                                       ` Chuck Lever
2016-09-24  3:59                                   ` [PATCH 04/12] " Al Viro
2016-09-26 13:35                                     ` Miklos Szeredi
2016-09-27  4:14                                       ` Al Viro
2016-12-17 19:54                                     ` Andreas Schwab
2016-12-18 19:28                                       ` Linus Torvalds
2016-12-18 19:57                                         ` Andreas Schwab
2016-12-18 20:12                                         ` Al Viro
2016-12-18 20:30                                           ` Al Viro
2016-12-18 22:10                                             ` Linus Torvalds
2016-12-18 22:18                                               ` Al Viro
2016-12-18 22:22                                                 ` Linus Torvalds
2016-12-18 22:49                                               ` Andreas Schwab
2016-12-21 18:56                                               ` Andreas Schwab
2016-12-21 19:12                                                 ` Linus Torvalds
2016-09-24  4:00                                   ` [PATCH 06/12] new helper: add_to_pipe() Al Viro
2016-09-26 13:49                                     ` Miklos Szeredi
2016-09-24  4:01                                   ` [PATCH 10/12] new iov_iter flavour: pipe-backed Al Viro
2016-09-29 20:53                                     ` Miklos Szeredi
2016-09-29 22:50                                       ` Al Viro
2016-09-30  7:30                                         ` Miklos Szeredi
2016-10-03  3:34                                           ` [RFC] O_DIRECT vs EFAULT (was Re: [PATCH 10/12] new iov_iter flavour: pipe-backed) Al Viro
2016-10-03 17:07                                             ` Linus Torvalds
2016-10-03 18:54                                               ` Al Viro
2016-09-24  4:01                                   ` [PATCH 11/12] switch generic_file_splice_read() to use of ->read_iter() Al Viro
2016-09-24  4:02                                   ` [PATCH 12/12] switch default_file_splice_read() to use of pipe-backed iov_iter Al Viro
2016-09-23 19:03                           ` [PATCH 05/11] skb_splice_bits(): get rid of callback Al Viro
2016-09-23 19:04                           ` [PATCH 06/11] new helper: add_to_pipe() Al Viro
2016-09-23 19:04                           ` [PATCH 07/11] fuse_dev_splice_read(): switch to add_to_pipe() Al Viro
2016-09-23 19:06                           ` [PATCH 08/11] cifs: don't use memcpy() to copy struct iov_iter Al Viro
2016-09-23 19:08                           ` [PATCH 09/11] fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter() Al Viro
2016-09-26  9:31                             ` Miklos Szeredi
2016-09-23 19:09                           ` [PATCH 10/11] new iov_iter flavour: pipe-backed Al Viro
2016-09-23 19:10                           ` [PATCH 11/11] switch generic_file_splice_read() to use of ->read_iter() Al Viro
2016-09-30 13:32                           ` [RFC][CFT] splice_read reworked CAI Qian
2016-09-30 17:42                             ` CAI Qian
2016-09-30 18:33                               ` CAI Qian
2016-10-03  1:37                                 ` Al Viro
2016-10-03 17:49                                   ` CAI Qian
2016-10-04 17:39                                     ` local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked) CAI Qian
2016-10-04 21:42                                       ` tj
2016-10-05 14:09                                         ` CAI Qian
2016-10-05 15:30                                           ` tj
2016-10-05 15:54                                             ` CAI Qian
2016-10-05 18:57                                               ` CAI Qian
2016-10-05 20:05                                                 ` Al Viro
2016-10-06 12:20                                                   ` CAI Qian
2016-10-06 12:25                                                     ` CAI Qian
2016-10-06 16:11                                                       ` CAI Qian
2016-10-06 17:00                                                         ` Linus Torvalds
2016-10-06 18:12                                                           ` CAI Qian
2016-10-07  9:57                                                           ` Dave Chinner
2016-10-07 15:25                                                             ` Linus Torvalds
2016-10-07  7:08                                                       ` Jan Kara
2016-10-07 14:43                                                         ` CAI Qian
2016-10-07 15:27                                                           ` CAI Qian
2016-10-07 18:56                                                             ` CAI Qian
2016-10-09 21:54                                                               ` Dave Chinner
2016-10-10 14:10                                                                 ` CAI Qian
2016-10-10 20:14                                                                   ` CAI Qian
2016-10-10 21:57                                                                   ` Dave Chinner
2016-10-12 19:50                                                                     ` [bisected] " CAI Qian
2016-10-12 20:59                                                                       ` Dave Chinner
2016-10-13 16:25                                                                         ` CAI Qian
2016-10-13 20:49                                                                           ` Dave Chinner
2016-10-13 20:56                                                                             ` CAI Qian
2016-10-09 21:51                                                           ` Dave Chinner
2016-10-21 15:38                                                         ` [4.9-rc1+] overlayfs lockdep CAI Qian
2016-10-24 12:57                                                           ` Miklos Szeredi
2016-10-07  9:27                                                     ` local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked) Dave Chinner
2016-10-03  1:42                               ` [RFC][CFT] splice_read reworked Al Viro
2016-10-03 14:06                                 ` CAI Qian
2016-10-03 15:20                                   ` CAI Qian
2016-10-03 21:12                                     ` Dave Chinner
2016-10-04 13:57                                       ` CAI Qian
2016-10-03 20:32                                   ` CAI Qian
2016-10-03 20:35                                     ` Al Viro
2016-10-04 13:29                                       ` CAI Qian
2016-10-04 14:28                                         ` Al Viro
2016-10-04 16:21                                           ` CAI Qian
2016-10-04 20:12                                             ` Al Viro
2016-10-05 14:30                                               ` CAI Qian
2016-10-05 16:07                                                 ` Al Viro
