* Re: xfs_file_splice_read: possible circular locking dependency detected
       [not found] ` <1832555471.1341372.1472835736236.JavaMail.zimbra@redhat.com>
@ 2016-09-03  0:39   ` Dave Chinner
  2016-09-03  0:57     ` Linus Torvalds
                       ` (2 more replies)
  0 siblings, 3 replies; 152+ messages in thread
From: Dave Chinner @ 2016-09-03  0:39 UTC (permalink / raw)
  To: CAI Qian; +Cc: linux-xfs, Linus Torvalds, Al Viro, xfs

On Fri, Sep 02, 2016 at 01:02:16PM -0400, CAI Qian wrote:
> Splice seems to have started deadlocking with the reproducer,
> 
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/splice/splice01.c
> 
> This seems to have been introduced recently, after v4.8-rc3 or -rc4, so I suspect this xfs update is the one to blame,
> 
> 7d1ce606a37922879cbe40a6122047827105a332

Nope, this goes back to the splice rework back around ~3.16, IIRC.

> [ 1749.956818] 
> [ 1749.958492] ======================================================
> [ 1749.965386] [ INFO: possible circular locking dependency detected ]
> [ 1749.972381] 4.8.0-rc4+ #34 Not tainted
> [ 1749.976560] -------------------------------------------------------
> [ 1749.983554] splice01/35921 is trying to acquire lock:
> [ 1749.989188]  (&sb->s_type->i_mutex_key#14){+.+.+.}, at: [<ffffffffa083c1f7>] xfs_file_buffered_aio_write+0x127/0x840 [xfs]
> [ 1750.001644] 
> [ 1750.001644] but task is already holding lock:
> [ 1750.008151]  (&pipe->mutex/1){+.+.+.}, at: [<ffffffff8169e7c1>] pipe_lock+0x51/0x60
> [ 1750.016753] 
> [ 1750.016753] which lock already depends on the new lock.
> [ 1750.016753] 
> [ 1750.025880] 
> [ 1750.025880] the existing dependency chain (in reverse order) is:
> [ 1750.034229] 
> -> #2 (&pipe->mutex/1){+.+.+.}:
> [ 1750.039139]        [<ffffffff812af52a>] lock_acquire+0x1fa/0x440
> [ 1750.045857]        [<ffffffff8266448d>] mutex_lock_nested+0xdd/0x850
> [ 1750.052963]        [<ffffffff8169e7c1>] pipe_lock+0x51/0x60
> [ 1750.059190]        [<ffffffff8171ee25>] splice_to_pipe+0x75/0x9e0
> [ 1750.066001]        [<ffffffff81723991>] __generic_file_splice_read+0xa71/0xe90
> [ 1750.074071]        [<ffffffff81723e71>] generic_file_splice_read+0xc1/0x1f0
> [ 1750.081849]        [<ffffffffa0838628>] xfs_file_splice_read+0x368/0x7b0 [xfs]
> [ 1750.089940]        [<ffffffff8171fa7e>] do_splice_to+0xee/0x150
> [ 1750.096555]        [<ffffffff817262f4>] SyS_splice+0x1144/0x1c10
> [ 1750.103269]        [<ffffffff81007b66>] do_syscall_64+0x1a6/0x500
> [ 1750.110084]        [<ffffffff8266ea7f>] return_from_SYSCALL_64+0x0/0x7a

pipe_lock is taken below the filesystem IO path, while the filesystem
holds locks to protect against racing hole punch, etc...

> [ 1750.188328] 
> -> #0 (&sb->s_type->i_mutex_key#14){+.+.+.}:
> [ 1750.194508]        [<ffffffff812adbc3>] __lock_acquire+0x3043/0x3dd0
> [ 1750.201609]        [<ffffffff812af52a>] lock_acquire+0x1fa/0x440
> [ 1750.208321]        [<ffffffff82668cda>] down_write+0x5a/0xe0
> [ 1750.214645]        [<ffffffffa083c1f7>] xfs_file_buffered_aio_write+0x127/0x840 [xfs]
> [ 1750.223421]        [<ffffffffa083cb7d>] xfs_file_write_iter+0x26d/0x6d0 [xfs]
> [ 1750.231423]        [<ffffffff816859be>] vfs_iter_write+0x29e/0x550
> [ 1750.238330]        [<ffffffff81722729>] iter_file_splice_write+0x529/0xb70
> [ 1750.246012]        [<ffffffff817258d4>] SyS_splice+0x724/0x1c10
> [ 1750.252627]        [<ffffffff81007b66>] do_syscall_64+0x1a6/0x500
> [ 1750.259438]        [<ffffffff8266ea7f>] return_from_SYSCALL_64+0x0/0x7a

pipe_lock is taken above the filesystem IO path, and the filesystem then
tries to take locks to protect against racing hole punch, etc, so
lockdep goes boom.
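
Schematically, the two dependency chains reduce to a classic AB-BA
inversion (a paraphrase of the traces above, not actual code):

	splice read (chain #2):		splice write (chain #0):
		inode/IO locks			pipe_lock()
		pipe_lock()			inode/IO locks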

Fundamentally a splice infrastructure problem. If we let splice race
with hole punch and other fallocate() based extent manipulations to
avoid this lockdep warning, we allow potential for read or write to
regions of the file that have been freed. We can live with having
lockdep complain about this potential deadlock as it is unlikely to
ever occur in practice. The other option is simply not an acceptable
solution....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-03  0:39   ` xfs_file_splice_read: possible circular locking dependency detected Dave Chinner
@ 2016-09-03  0:57     ` Linus Torvalds
  2016-09-03  1:45       ` Al Viro
  2016-09-06 21:53     ` CAI Qian
  2016-09-08 15:29     ` CAI Qian
  2 siblings, 1 reply; 152+ messages in thread
From: Linus Torvalds @ 2016-09-03  0:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, xfs, CAI Qian, Al Viro

On Fri, Sep 2, 2016 at 5:39 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Fundamentally a splice infrastructure problem.

Yeah, I don't really like how we handle the pipe lock.

It *might* be possible to instead just increment the reference
counters as we build a kvec[] array of them, and simply do the write
without holding the pipe lock at all.
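
Something like this, perhaps (a sketch of the idea only - a bio_vec
array rather than a literal kvec[], with error handling and the pipe
ring bookkeeping omitted):

	struct bio_vec bvec[PIPE_DEF_BUFFERS];
	struct iov_iter from;
	size_t total = 0;
	int i;

	/* snapshot the pipe contents under the lock ... */
	pipe_lock(pipe);
	for (i = 0; i < pipe->nrbufs; i++) {
		struct pipe_buffer *buf =
			pipe->bufs + ((pipe->curbuf + i) & (pipe->buffers - 1));

		get_page(buf->page);	/* extra ref keeps the page alive */
		bvec[i] = (struct bio_vec){ buf->page, buf->len, buf->offset };
		total += buf->len;
	}
	pipe_unlock(pipe);

	/* ... then do the write without holding the pipe lock at all */
	iov_iter_bvec(&from, ITER_BVEC | WRITE, bvec, i, total);
	ret = vfs_iter_write(out, &from, ppos);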

That has other problems, ie concurrent splices from the same pipe would
possibly write the same data multiple times, though.

But yes, the fundamental problem is how splice wants to take the pipe
lock both early and late. Very annoying.

              Linus


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-03  0:57     ` Linus Torvalds
@ 2016-09-03  1:45       ` Al Viro
  2016-09-06 23:59         ` Dave Chinner
  0 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-09-03  1:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-xfs, CAI Qian, xfs

On Fri, Sep 02, 2016 at 05:57:04PM -0700, Linus Torvalds wrote:
> On Fri, Sep 2, 2016 at 5:39 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Fundamentally a splice infrastructure problem.
> 
> Yeah, I don't really like how we handle the pipe lock.
> 
> It *might* be possible to instead just increment the reference
> counters as we build a kvec[] array of them, and simply do the write
> without holding the pipe lock at all.
> 
> That has other problems, ie concurrent splices from the same pipe would
> possibly write the same data multiple times, though.
> 
> But yes, the fundamental problem is how splice wants to take the pipe
> lock both early and late. Very annoying.

We could, in principle, add another flavour of iov_iter, with bvec
array attached to it with copy_page_to_iter() sticking an extra ref to that
page into the array.  Then, under pipe lock, feed that thing to ->read_iter()
and do an equivalent of splice_to_pipe() that would take bvec array instead
of struct page */struct partial_page arrays.

Hell, we could even have copy_to_iter() for these puppies allocate a page,
stick it into the next bvec and copy into it.  Especially if we have those
bvecs zeroed, with copy_page_to_iter() leaving ->bvec pointing to the next
(unused) bvec and copy_to_iter() doing that only when a page had been
completely filled.  I.e.

copy_page_to_iter():
	if (!iter->nr_segs)
		return 0;
	/* never append to a partly-filled "copied data" bvec; skip past it */
	if (iter->bvec->bv_page) {
		iter->bvec++;
		if (!--iter->nr_segs)
			return 0;
	}
	/* zero-copy: stash an extra reference to the pagecache page */
	get_page(page);
	*iter->bvec = (struct bio_vec){ page, bytes, offset };
	iter->bvec++;
	iter->nr_segs--;
	return bytes;

copy_to_iter():
	wanted = bytes;
	while (bytes && iter->nr_segs) {
		/* lazily allocate a "copied data" page for this bvec */
		if (!iter->bvec->bv_page)
			iter->bvec->bv_page = alloc_page(GFP_KERNEL);
		/* bv_len doubles as the fill offset while the page is open */
		n = min(PAGE_SIZE - iter->bvec->bv_len, bytes);
		memcpy_to_page(iter->bvec->bv_page, iter->bvec->bv_len, addr, n);
		addr += n;
		bytes -= n;
		iter->bvec->bv_len += n;
		if (iter->bvec->bv_len == PAGE_SIZE) {	/* page full, move on */
			iter->bvec++;
			iter->nr_segs--;
		}
	}
	return wanted - bytes;

That should suffice for quite a few of the read_iter-using file_operations,
if not for all of them.  pipe lock is on the outside, same as for write
side *and* for default_file_splice_read().  And filesystem gets the
same locking it would for read(2)/readv(2)/etc...

Comments?


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-03  0:39   ` xfs_file_splice_read: possible circular locking dependency detected Dave Chinner
  2016-09-03  0:57     ` Linus Torvalds
@ 2016-09-06 21:53     ` CAI Qian
  2016-09-06 23:34       ` Dave Chinner
  2016-09-08 15:29     ` CAI Qian
  2 siblings, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-09-06 21:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Linus Torvalds, Al Viro, xfs



----- Original Message -----
> Fundamentally a splice infrastructure problem. If we let splice race
> with hole punch and other fallocate() based extent manipulations to
> avoid this lockdep warning, we allow potential for read or write to
> regions of the file that have been freed. We can live with having
> lockdep complain about this potential deadlock as it is unlikely to
> ever occur in practice. We can live with having lockdep complain about
> this potential deadlock as it is unlikely to ever occur in practice.
> The other option is simply not an acceptable
> solution....
The problem with living with this lockdep complaint is that once it
fires, it seems to prevent other complaints from showing up. For
example, during the bisecting I had to first apply commit dc3a04d to
fix an early rcu lockdep splat.
   CAI Qian


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-06 21:53     ` CAI Qian
@ 2016-09-06 23:34       ` Dave Chinner
  0 siblings, 0 replies; 152+ messages in thread
From: Dave Chinner @ 2016-09-06 23:34 UTC (permalink / raw)
  To: CAI Qian; +Cc: linux-xfs, Linus Torvalds, Al Viro, xfs

On Tue, Sep 06, 2016 at 05:53:59PM -0400, CAI Qian wrote:
> 
> 
> ----- Original Message -----
> > Fundamentally a splice infrastructure problem. If we let splice race
> > with hole punch and other fallocate() based extent manipulations to
> > avoid this lockdep warning, we allow potential for read or write to
> > regions of the file that have been freed. We can live with having
> > lockdep complain about this potential deadlock as it is unlikely to
> > ever occur in practice. The other option is simply not an acceptable
> > solution....
> The problem with living with this lockdep complaint is that once it
> fires, it seems to prevent other complaints from showing up. For
> example, during the bisecting I had to first apply commit dc3a04d to
> fix an early rcu lockdep splat.

Not my problem.

My primary responsibility is to maintain the filesystem integrity
and data safety for the hundreds of thousands (millions?) of XFS
users: it's their data, and I will always err on the side of safety
and integrity. As such I really don't care if there's collateral
damage to developer debug tools - user data integrity requirements
always come first...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-03  1:45       ` Al Viro
@ 2016-09-06 23:59         ` Dave Chinner
  2016-09-08 20:35           ` Al Viro
  0 siblings, 1 reply; 152+ messages in thread
From: Dave Chinner @ 2016-09-06 23:59 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-xfs, Linus Torvalds, CAI Qian, xfs

On Sat, Sep 03, 2016 at 02:45:14AM +0100, Al Viro wrote:
> On Fri, Sep 02, 2016 at 05:57:04PM -0700, Linus Torvalds wrote:
> > On Fri, Sep 2, 2016 at 5:39 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > Fundamentally a splice infrastructure problem.
> > 
> > Yeah, I don't really like how we handle the pipe lock.
> > 
> > It *might* be possible to instead just increment the reference
> > counters as we build a kvec[] array of them, and simply do the write
> > without holding the pipe lock at all.
> > 
> > That has other problems, ie concurrent splices from the same pipe would
> > possibly write the same data multiple times, though.
> > 
> > But yes, the fundamental problem is how splice wants to take the pipe
> > lock both early and late. Very annoying.
> 
> We could, in principle, add another flavour of iov_iter, with bvec
> array attached to it with copy_page_to_iter() sticking an extra ref to that
> page into the array.  Then, under pipe lock, feed that thing to ->read_iter()
> and do an equivalent of splice_to_pipe() that would take bvec array instead
> of struct page */struct partial_page arrays.

Not sure I quite follow - where do the pages come from? Do we
allocate new pages that get put into the bvec, then run the read
which copies data from the page cache page into them, then hand
those pages in the bvec to the pipe?

ISTR this read->splice_to_pipe path was once supposed to be a
zero-copy path - doesn't this make zero-copy impossible? Or was the
zero-copy splice read path done through some other path I've
forgotten about?

> Hell, we could even have copy_to_iter() for these puppies allocate a page,
> stick it into the next bvec and copy into it.  Especially if we have those
> bvecs zeroed, with copy_page_to_iter() leaving ->bvec pointing to the next
> (unused) bvec and copy_to_iter() doing that only when a page had been
> completely filled.  I.e.

This has the same "data copy in the splice read path" as the above
interface. However, I suspect that this interface could actually be
used for zero copy (by stealing pages from the page cache rather
than allocating new pages and copying), so it may be a better way to
proceed...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-03  0:39   ` xfs_file_splice_read: possible circular locking dependency detected Dave Chinner
  2016-09-03  0:57     ` Linus Torvalds
  2016-09-06 21:53     ` CAI Qian
@ 2016-09-08 15:29     ` CAI Qian
  2016-09-08 17:56       ` Al Viro
  2016-09-08 18:01       ` Linus Torvalds
  2 siblings, 2 replies; 152+ messages in thread
From: CAI Qian @ 2016-09-08 15:29 UTC (permalink / raw)
  To: Dave Chinner, Al Viro; +Cc: linux-xfs, Linus Torvalds, xfs



----- Original Message -----
> From: "Dave Chinner" <david@fromorbit.com>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "linux-xfs" <linux-xfs@vger.kernel.org>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Al Viro"
> <viro@zeniv.linux.org.uk>, xfs@oss.sgi.com
> Sent: Friday, September 2, 2016 8:39:19 PM
> Subject: Re: xfs_file_splice_read: possible circular locking dependency detected
> 
> On Fri, Sep 02, 2016 at 01:02:16PM -0400, CAI Qian wrote:
> > Splice seems to have started deadlocking with the reproducer,
> > 
> > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/splice/splice01.c
> > 
> > This seems to have been introduced recently, after v4.8-rc3 or -rc4, so I
> > suspect this xfs update is the one to blame,
> > 
> > 7d1ce606a37922879cbe40a6122047827105a332
> 
> Nope, this goes back to the splice rework back around ~3.16, IIRC.
Right. FYI, reverting the commit below fixes the regression,

8d02076 : ->splice_write() via ->write_iter()

   CAI Qian


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-08 15:29     ` CAI Qian
@ 2016-09-08 17:56       ` Al Viro
  2016-09-08 18:12         ` Linus Torvalds
  2016-09-08 18:01       ` Linus Torvalds
  1 sibling, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-09-08 17:56 UTC (permalink / raw)
  To: CAI Qian; +Cc: Dave Chinner, linux-xfs, Linus Torvalds, xfs

On Thu, Sep 08, 2016 at 11:29:11AM -0400, CAI Qian wrote:

> > Nope, this goes back to the splice rework back around ~3.16, IIRC.
> Right. FYI, reverting the commit below fixes the regression,
> 
> 8d02076 : ->splice_write() via ->write_iter()

... and brings back a lot of other crap.  The thing is, pipe lock should
be on the outside of everything fs might be taking, so that splice IO
is the same as normal IO as far as filesystem locking is concerned.  For
the write side it had been done in that commit, for the read side it's yet
to be done.


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-08 15:29     ` CAI Qian
  2016-09-08 17:56       ` Al Viro
@ 2016-09-08 18:01       ` Linus Torvalds
  2016-09-08 20:39         ` CAI Qian
  1 sibling, 1 reply; 152+ messages in thread
From: Linus Torvalds @ 2016-09-08 18:01 UTC (permalink / raw)
  To: CAI Qian; +Cc: Dave Chinner, Al Viro, linux-xfs, xfs

On Thu, Sep 8, 2016 at 8:29 AM, CAI Qian <caiqian@redhat.com> wrote:
> Right. FYI, reverting the commit below fixes the regression,
>
> 8d02076 : ->splice_write() via ->write_iter()

I guess you didn't actually revert that, because so much else has
changed. So you just tested the pre- and post- state of that commit?

It does look like that commit is just buggy, exactly because XFS was
the only user of generic_file_splice_write() in order to be able to
take the filesystem locks *before* taking the pipe lock.

Al? I wonder if we could just re-introduce xfs_file_splice_write()
(except with the modern iter-based interface)?

Looking at not holding the pipe lock, that really does seem very bad,
because it would require us to:

 (a) play the ref-count games with each page

 (b) make concurrent splice writers have very subtle semantics

I'm not sure (b) is a big issue, because concurrent splice writers
already have random ordering, but at least right now they'd have
non-overlapping data accesses rather than possibly splicing the same
data twice (and some data not at all).

The basic issue with splice and xfs is that right now our splice
interface *forces*:

 - generic_file_splice_read() when called by a filesystem will take
the filesystem locks *first*, and the pipe lock second (ie
xfs_file_splice_read)

 - iter_file_splice_write() forces the reverse ordering.

So it really is our splice helpers that do things in fundamentally the
wrong order.

Other filesystems don't seem to have that extra filesystem lock that shows this.

Ideas?

             Linus


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-08 17:56       ` Al Viro
@ 2016-09-08 18:12         ` Linus Torvalds
  2016-09-08 18:18           ` Linus Torvalds
                             ` (2 more replies)
  0 siblings, 3 replies; 152+ messages in thread
From: Linus Torvalds @ 2016-09-08 18:12 UTC (permalink / raw)
  To: Al Viro; +Cc: CAI Qian, Dave Chinner, linux-xfs, xfs

On Thu, Sep 8, 2016 at 10:56 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> ... and brings back a lot of other crap.  The thing is, pipe lock should
> be on the outside of everything fs might be taking, so that splice IO
> is the same as normal IO as far as filesystem locking is concerned.  For
> the write side it had been done in that commit, for the read side it's yet
> to be done.

Al, look at generic_file_splice_read().

The problem is that XFS takes its own

        xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
        ..
        xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);

around all the generic file accessors. So for example, it doesn't use
"generic_file_read_iter()", it does

STATIC ssize_t
xfs_file_buffered_aio_read(
        struct kiocb            *iocb,
        struct iov_iter         *to)
{
        struct xfs_inode        *ip = XFS_I(file_inode(iocb->ki_filp));
        ssize_t                 ret;

        trace_xfs_file_buffered_read(ip, iov_iter_count(to), iocb->ki_pos);

        xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
        ret = generic_file_read_iter(iocb, to);
        xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);

        return ret;
}

and the exact same pattern holds for generic_file_splice_read().

So the XFS model *requires* that XFS_IOLOCK be outside all the operations.

But then in iter_file_splice_write we have the other ordering.

Now, xfs could do a wrapper for ->splice_write() like it used to, and
have that same "take the xfs lock around the call to
iter_file_splice_write()" pattern. That was my first obvious patch.

I threw it out because that's garbage too: then we end up doing
->write_iter(), which takes the xfs_rw_ilock() again, and would
immediately deadlock *there* instead.
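
For reference, that discarded wrapper looked roughly like this (a
sketch, not the actual patch):

	STATIC ssize_t
	xfs_file_splice_write(
		struct pipe_inode_info	*pipe,
		struct file		*out,
		loff_t			*ppos,
		size_t			len,
		unsigned int		flags)
	{
		struct xfs_inode	*ip = XFS_I(file_inode(out));
		ssize_t			ret;

		xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
		/* deadlocks: ->write_iter() takes XFS_IOLOCK_EXCL again */
		ret = iter_file_splice_write(pipe, out, ppos, len, flags);
		xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);

		return ret;
	}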

So the problem really is that the vfs layer seems to simply not allow
the filesystem to do any locking around the generic page cache helper
functions. And XFS really wants to do that.

              Linus


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-08 18:12         ` Linus Torvalds
@ 2016-09-08 18:18           ` Linus Torvalds
  2016-09-08 20:44           ` Al Viro
  2016-09-08 21:38           ` Dave Chinner
  2 siblings, 0 replies; 152+ messages in thread
From: Linus Torvalds @ 2016-09-08 18:18 UTC (permalink / raw)
  To: Al Viro; +Cc: CAI Qian, Dave Chinner, linux-xfs, xfs

On Thu, Sep 8, 2016 at 11:12 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So the problem really is that the vfs layer seems to simply not allow
> the filesystem to do any locking around the generic page cache helper
> functions. And XFS really wants to do that.

Hmm.

I wonder if we could just take the pipe lock *much* earlier at the
splice() layer? Do it before any callbacks to the low-level
filesystems, not inside the "generic" splice helpers at all?

That would clean up a ton of crap.

The *one* reason that seems impossible right now is that we use
"pipe_wait()" in our splice ops. And "pipe_wait()" drops and
retakes the pipe lock over the waiting.

BUT.

What if we got rid of all the pipe-wait crap entirely, and just made
all the splice routines return EAGAIN instead of waiting? And then do
the pipe_wait() at the higher level, outside the filesystem callback
code, and outside the low-level generic helpers?
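
Roughly this shape (a sketch only, not a patch - it assumes the
low-level splice helpers grow a non-blocking mode that returns
-EAGAIN when the pipe is full):

	pipe_lock(pipe);
	for (;;) {
		ret = do_splice_to(in, &offset, pipe, len, flags);
		if (ret != -EAGAIN)
			break;
		pipe_wait(pipe);	/* drops and retakes pipe->mutex */
	}
	pipe_unlock(pipe);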

Maybe that pipe_wait() movement doesn't really work for some reason
that I didn't look at, but that would really help make the locking
enormously simpler. And then the pipe lock would *obviously* be the
outermost lock, and we'd get rid of all the issues with filesystem
lock ordering.

               Linus


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-06 23:59         ` Dave Chinner
@ 2016-09-08 20:35           ` Al Viro
  0 siblings, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-08 20:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Linus Torvalds, CAI Qian, xfs

On Wed, Sep 07, 2016 at 09:59:17AM +1000, Dave Chinner wrote:

> Not sure I quite follow - where do the pages come from? Do we
> allocate new pages that get put into the bvec, then run the read
> which copies data from the page cache page into them, then hand
> those pages in the bvec to the pipe?

Nope.  generic_file_read_iter() (do_generic_file_read(), in the end)
finds them in page cache, or allocates and sticks them into pagecache,
makes sure that they are uptodate, etc.   And passes them to
copy_page_to_iter(), which would, for this iov_iter flavour, just grab
a reference to page and stash it into bvec.  There's your zero-copy,
exactly as it works now.  Only __generic_file_splice_read() open-codes
everything ->read_iter() would do, sans the locks filesystem would need.

> This has the same "data copy in the splice read path" as the above
> interface. However, I suspect that this interface could actually be
> used for zero copy (by stealing pages from the page cache rather
> than allocating new pages and copying), so it may be a better way to
> proceed...

For copy_page_to_iter() we have a page; for copy_to_iter() the data comes
from hell knows what - kmalloc'ed array into which we'd decrypted something,
results of sprintf() into on-stack array, etc.  So the counterparts of
copy_to_iter() callers must be non-zerocopy.  copy_page_to_iter() is the
potential zerocopy path and we do get zerocopy there that way.


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-08 18:01       ` Linus Torvalds
@ 2016-09-08 20:39         ` CAI Qian
  2016-09-08 21:19           ` Dave Chinner
  0 siblings, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-09-08 20:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-xfs, Al Viro, xfs



----- Original Message -----
> From: "Linus Torvalds" <torvalds@linux-foundation.org>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Dave Chinner" <david@fromorbit.com>, "Al Viro" <viro@zeniv.linux.org.uk>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, xfs@oss.sgi.com
> Sent: Thursday, September 8, 2016 2:01:23 PM
> Subject: Re: xfs_file_splice_read: possible circular locking dependency detected
> 
> On Thu, Sep 8, 2016 at 8:29 AM, CAI Qian <caiqian@redhat.com> wrote:
> > Right. FYI, revert the commit below fixes the regression,
> >
> > 8d02076 : ->splice_write() via ->write_iter()
> 
> I guess you didn't actually revert that, because so much else has
> changed. So you just tested the pre- and post- state of that commit?
Right, I just reverted that commit while it was at HEAD. It is not
going to be a straightforward revert against the current origin HEAD:
there have been a few commits on top already, so some additional work
is needed to bake a proper revert.

Everything else looks straightforward (the PAGE_CACHE_* conversion,
inode_lock* conversion, file_remove_privs() conversion). The only
tricky thing seems to be that generic_write_sync() now takes a struct
kiocb * instead of a struct file *, so generic_file_splice_write() and
probably xfs_file_splice_write() would need to change to use a kiocb
as well.
   CAI Qian


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-08 18:12         ` Linus Torvalds
  2016-09-08 18:18           ` Linus Torvalds
@ 2016-09-08 20:44           ` Al Viro
  2016-09-08 20:57             ` Al Viro
  2016-09-08 21:23             ` Al Viro
  2016-09-08 21:38           ` Dave Chinner
  2 siblings, 2 replies; 152+ messages in thread
From: Al Viro @ 2016-09-08 20:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: CAI Qian, Dave Chinner, linux-xfs, xfs

On Thu, Sep 08, 2016 at 11:12:33AM -0700, Linus Torvalds wrote:

> So the XFS model *requires* that XFS_IOLOCK be outside all the operations.
> 
> But then in iter_file_splice_write we have the other ordering.
> 
> Now, xfs could do a wrapper for ->splice_write() like it used to, and
> have that same "take the xfs lock around the call to
> iter_file_splice_write()" pattern. That was my first obvious patch.
> 
> I threw it out because that's garbage too: then we end up doing
> ->write_iter(), which takes the xfs_rw_ilock() again, and would
> immediately deadlock *there* instead.
> 
> So the problem really is that the vfs layer seems to simply not allow
> the filesystem to do any locking around the generic page cache helper
> functions. And XFS really wants to do that.

Why *have* ->splice_read() there at all?  Let's use its ->read_iter(), where
it will take its lock as it always did for read.

All we need is a variant of __generic_file_splice_read() that would pass
a new kind of iov_iter down to filesystem's own ->read_iter().  And let that
guy do whatever locking it wants.  It will end up doing a sequence of
copy_page_to_iter() and, possibly, copy_to_iter() (XFS one would only do the
former).  So let's add an iov_iter flavour that would simply grab a reference
to page passed to copy_page_to_iter() and allocate-and-copy for copy_to_iter().
As the result, you'll get an array of <page, offset, count> triples - same
as you would from the existing __generic_file_splice_read().  Pages already
uptodate, with all readahead logics done as usual, etc.

What else do we need?  Just feed the resulting triples into the pipe and
that's it.  Sure, they can get stale by truncate/punchhole/whatnot.  So
they can right after we return from xfs_file_splice_read()...
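
In other words, something along these lines (a sketch; iov_iter_pipe()
is a hypothetical constructor for that new flavour, and the final
"feed the gathered buffers into the pipe" step is elided):

	ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
					 struct pipe_inode_info *pipe,
					 size_t len, unsigned int flags)
	{
		struct iov_iter to;
		struct kiocb kiocb;
		ssize_t ret;

		iov_iter_pipe(&to, READ, pipe, len);	/* hypothetical */
		init_sync_kiocb(&kiocb, in);
		kiocb.ki_pos = *ppos;
		/* the filesystem's ->read_iter() does its own locking,
		 * exactly as it would for read(2) */
		ret = in->f_op->read_iter(&kiocb, &to);
		if (ret > 0)
			*ppos = kiocb.ki_pos;
		return ret;
	}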

Moreover, I don't see why we need to hold the pipe lock across the
actual call of ->read_iter().  Right now we only grab it for the "feed
into pipe buffers" part.  Objections?


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-08 20:44           ` Al Viro
@ 2016-09-08 20:57             ` Al Viro
  2016-09-08 21:23             ` Al Viro
  1 sibling, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-08 20:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-xfs, CAI Qian, xfs

On Thu, Sep 08, 2016 at 09:44:50PM +0100, Al Viro wrote:
> On Thu, Sep 08, 2016 at 11:12:33AM -0700, Linus Torvalds wrote:
> 
> > So the XFS model *requires* that XFS_IOLOCK be outside all the operations.
> > 
> > But then in iter_file_splice_write we have the other ordering.
> > 
> > Now, xfs could do a wrapper for ->splice_write() like it used to, and
> > have that same "take the xfs lock around the call to
> > iter_file_splice_write()" pattern. That was my first obvious patch.
> > 
> > I threw it out because that's garbage too: then we end up doing
> > ->write_iter(), which takes the xfs_rw_ilock() again, and would
> > immediately deadlock *there* instead.
> > 
> > So the problem really is that the vfs layer seems to simply not allow
> > the filesystem to do any locking around the generic page cache helper
> > functions. And XFS really wants to do that.
> 
> Why *have* ->splice_read() there at all?  Let's use its ->read_iter(), where
> it will take its lock as it always did for read.
> 
> All we need is a variant of __generic_file_splice_read() that would pass
> a new kind of iov_iter down to filesystem's own ->read_iter().  And let that
> guy do whatever locking it wants.  It will end up doing a sequence of
> copy_page_to_iter() and, possibly, copy_to_iter() (XFS one would only do the
> former).  So let's add an iov_iter flavour that would simply grab a reference
> to page passed to copy_page_to_iter() and allocate-and-copy for copy_to_iter().
> As the result, you'll get an array of <page, offset, count> triples - same
> as you would from the existing __generic_file_splice_read().  Pages already
> uptodate, with all readahead logics done as usual, etc.
> 
> What else do we need?  Just feed the resulting triples into the pipe and
> that's it.  Sure, they can get stale by truncate/punchhole/whatnot.  So
> they can right after we return from xfs_file_splice_read()...
> 
> Moreover, I don't see why we need to hold the pipe lock across the
> actual call of ->read_iter().  Right now we only grab it for the "feed
> into pipe buffers" part.  Objections?

PS: take a look at how much of do_generic_file_read() logics is kinda-sorta
open-coded in __generic_file_splice_read(); readahead stuff, etc.  It also
assumes that filesystem needs no extra locking for playing with pagecache,
etc.  That's exactly why XFS ends up having to do a wrapper for that sucker
and why we get all this headache.  Why bother, when we already have a method
that turns "read that much from this offset in this file" into a sequence of
"take that many bytes from that offset in this page" and "take that many
bytes from that buffer"?  It doesn't even need to be a ->splice_read() instance
- just a function called by do_splice_to() if ->read_iter() is present.


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-08 20:39         ` CAI Qian
@ 2016-09-08 21:19           ` Dave Chinner
  2016-09-08 21:30             ` Al Viro
  0 siblings, 1 reply; 152+ messages in thread
From: Dave Chinner @ 2016-09-08 21:19 UTC (permalink / raw)
  To: CAI Qian; +Cc: linux-xfs, Linus Torvalds, Al Viro, xfs

On Thu, Sep 08, 2016 at 04:39:34PM -0400, CAI Qian wrote:
> 
> 
> ----- Original Message -----
> > From: "Linus Torvalds" <torvalds@linux-foundation.org>
> > To: "CAI Qian" <caiqian@redhat.com>
> > Cc: "Dave Chinner" <david@fromorbit.com>, "Al Viro" <viro@zeniv.linux.org.uk>, "linux-xfs"
> > <linux-xfs@vger.kernel.org>, xfs@oss.sgi.com
> > Sent: Thursday, September 8, 2016 2:01:23 PM
> > Subject: Re: xfs_file_splice_read: possible circular locking dependency detected
> > 
> > On Thu, Sep 8, 2016 at 8:29 AM, CAI Qian <caiqian@redhat.com> wrote:
> > > Right. FYI, reverting the commit below fixes the regression,
> > >
> > > 8d02076 : ->splice_write() via ->write_iter()
> > 
> > I guess you didn't actually revert that, because so much else has
> > changed. So you just tested the pre- and post- state of that commit?
> Right, I just reverted that commit while it was at HEAD. It is not
> going to be a straightforward revert against the current origin HEAD:
> there have been a few commits on top already, so some additional work
> is needed to bake a proper revert.
> 
> Everything else looks straightforward (the PAGE_CACHE_* conversion,
> inode_lock* conversion, file_remove_privs() conversion). The only
> tricky thing seems to be that generic_write_sync() now takes a struct
> kiocb * instead of a struct file *, so generic_file_splice_write() and
> probably xfs_file_splice_write() would need to change to use a kiocb
> as well.

Don't bother. You'll just hit a different lockdep issue - a locking
order problem on the write side. I tried to get that fixed years
ago:

https://lkml.org/lkml/2011/7/18/4
http://oss.sgi.com/archives/xfs/2011-08/msg00122.html
http://oss.sgi.com/archives/xfs/2012-11/msg00671.html

That specific problem was fixed by the above write_iter
infrastructure fixes, but that introduced the read side problem.  i.e.
splice has /always/ had locking order issues that XFS exposed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-08 20:44           ` Al Viro
  2016-09-08 20:57             ` Al Viro
@ 2016-09-08 21:23             ` Al Viro
  1 sibling, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-08 21:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: CAI Qian, Dave Chinner, linux-xfs, xfs

On Thu, Sep 08, 2016 at 09:44:50PM +0100, Al Viro wrote:

> Moreover, I don't see why we need to hold pipe lock the actual call of
> ->read_iter().  Right now we only grab it for "feed into pipe buffers"
> part.  Objections?

Actually, screw the "array of bvec"; we'd need to mark the ones that are
pagecache-backed somehow to tell which methods should be used.  Let's
add a variant of iov_iter that would be backed by a pipe_buffer array;
copy_page_to_iter() fills the next slot with an extra reference to the
page we'd been given, using page_cache_pipe_buf_ops for ->ops.
copy_to_iter() adds to the last slot if it has default_pipe_buf_ops for
->ops and still has space in it, or allocates a new page, sticks it into
the next slot, copies data into it and sets default_pipe_buf_ops for
->ops.
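
In slot-filling terms, copy_page_to_iter() would then be just (a
sketch; 'bufs' and 'idx' stand for the array and next-free index the
iov_iter would carry):

	struct pipe_buffer *buf = bufs + idx++;

	get_page(page);			/* extra reference - zero-copy */
	buf->page = page;
	buf->offset = offset;
	buf->len = bytes;
	buf->ops = &page_cache_pipe_buf_ops;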

Then all we need is a variant of splice_to_pipe()/splice_{grow,shrink}_spd()
that would work with an array of pipe_buffer instead of page/partial_page
array pair.


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-08 21:19           ` Dave Chinner
@ 2016-09-08 21:30             ` Al Viro
  0 siblings, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-08 21:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: CAI Qian, Linus Torvalds, linux-xfs, xfs

On Fri, Sep 09, 2016 at 07:19:31AM +1000, Dave Chinner wrote:

> Don't bother. You'll just hit a different lockdep issue - a locking
> order problem on the write side. I tried to get that fixed years
> ago:
> 
> https://lkml.org/lkml/2011/7/18/4
> http://oss.sgi.com/archives/xfs/2011-08/msg00122.html
> http://oss.sgi.com/archives/xfs/2012-11/msg00671.html
> 
> That specific problem was fixed by the above write_iter
> infrastructure fixes, but that introduced the read side problem.  i.e.
> splice has /always/ had locking order issues that XFS exposed.

Yep.  I'll try to slap together something testable for the variant I'd
outlined; I really think that the root of the problems here is that we have
parallel logics in ->read_iter and ->splice_read.  That just might
get rid of special-casing DAX in there, while we are at it...

Looking at the DAX side of things, we need iov_iter_zero() to grok those
as well (in addition to copy_page_to_iter() and copy_to_iter()).


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-08 18:12         ` Linus Torvalds
  2016-09-08 18:18           ` Linus Torvalds
  2016-09-08 20:44           ` Al Viro
@ 2016-09-08 21:38           ` Dave Chinner
  2016-09-08 23:55             ` Al Viro
  2 siblings, 1 reply; 152+ messages in thread
From: Dave Chinner @ 2016-09-08 21:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, CAI Qian, linux-xfs, xfs

On Thu, Sep 08, 2016 at 11:12:33AM -0700, Linus Torvalds wrote:
> On Thu, Sep 8, 2016 at 10:56 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > ... and brings back a lot of other crap.  The thing is, pipe lock should
> > be on the outside of everything fs might be taking, so that splice IO
> > is the same as normal IO as far as filesystem locking is concerned.  For
> > the write side it had been done in that commit, for the read side it's yet
> > to be done.
> 
> Al, look at generic_file_splice_read().
> 
> The problem is that XFS takes its own
> 
>         xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
>         ..
>         xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
> 
> around all the generic file accessors. So for example, it doesn't use
> "generic_file_read_iter()", it does
> 
> STATIC ssize_t
> xfs_file_buffered_aio_read(
>         struct kiocb            *iocb,
>         struct iov_iter         *to)
> {
>         struct xfs_inode        *ip = XFS_I(file_inode(iocb->ki_filp));
>         ssize_t                 ret;
> 
>         trace_xfs_file_buffered_read(ip, iov_iter_count(to), iocb->ki_pos);
> 
>         xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
>         ret = generic_file_read_iter(iocb, to);
>         xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
> 
>         return ret;
> }
> 
> and the exact same pattern holds for generic_file_splice_read().
> 
> So the XFS model *requires* that XFS_IOLOCK be outside all the operations.
> 
> But then in iter_file_splice_write we have the other ordering.
> 
> Now, xfs could do a wrapper for ->splice_write() like it used to, and
> have that same "take the xfs lock around the call to
> iter_file_splice_write(). That was my first obvious patch.
> 
> I threw it out because that's garbage too: then we end up doing
> ->write_iter(), which takes the xfs_rw_ilock() again, and would
> immediately deadlock *there* instead.

That's what I first tried when this was first reported back in
3.18-rc0, and after a couple of other aborted attempts to work
around the pipe_lock I came to the same conclusion:

	"That smells like a splice architecture bug. splice write puts the
	pipe lock outside the inode locks, but splice read puts the pipes
	locks *inside* the inode locks. "

http://oss.sgi.com/archives/xfs/2014-10/msg00319.html

> So the problem really is that the vfs layer seems to simply not allow
> the filesystem to do any locking around the generic page cache helper
> functions. And XFS really wants to do that.

It's not an XFS specific problem: any filesystem that supports hole
punch and its fallocate() friends needs this high level splice IO
exclusion as well.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-08 21:38           ` Dave Chinner
@ 2016-09-08 23:55             ` Al Viro
  2016-09-09  1:53               ` Dave Chinner
  2016-09-09  2:19               ` Al Viro
  0 siblings, 2 replies; 152+ messages in thread
From: Al Viro @ 2016-09-08 23:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Linus Torvalds, CAI Qian, linux-xfs, xfs

On Fri, Sep 09, 2016 at 07:38:35AM +1000, Dave Chinner wrote:

> It's not an XFS specific problem: any filesystem that supports hole
> punch and its fallocate() friends needs this high level splice IO
> exclusion as well.

How is hole punch different from truncate()?  My reading of the situation
is that we don't need exclusion between that and insertion into the pipe;
only for the "gather uptodate page references" part.  If some page gets
evicted afterwards... how is that different from having that happen
right after we'd finished with ->splice_read()?  Am I missing something
subtle in there?

I'm still looking at the O_DIRECT paths in that stuff; we'll probably
need iov_iter_get_pages() for these suckers to allocate pages and stick
them into slots.  The tricky part is to get the semantics of iov_iter_advance()
right for them, but it does look feasible.

Again, what I propose is a new iov_iter flavour.  Backed by pipe_buffer array,
used only for reads (i.e. copy to, not copy from).  Three states for element:
pagecache one, copied data, empty.  Semantics:
	* copy_page_to_iter(): grab a reference to page and stick it into
the next element (making it a pagecache one) with offset and len coming
directly from arguments.
	* copy_to_iter(): if the last element is a 'copied data' with empty
space remaining - copy to the end.  Otherwise allocate a new page and stick
it into the next element (making it 'copied data'), then copy into it.  If 
still not all data copied, do the same for the next element, etc.  Of course,
if there's no elements left, we are done copying.
	* zero_iter(): ditto, with s/copy/fill with zeroes/
	* iov_iter_get_pages(): allocate pages, stick them into the next
slots (making those 'copied data').  That might need some changes, though -
I'm still looking through the users.  The tricky part is decision when to
update the lengths.
	* iov_iter_get_pages_alloc(): not sure, hadn't really looked yet.
	* iov_iter_alignment(): probably just returns 0.
	* iov_iter_advance(): probably like bvec variant.


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-08 23:55             ` Al Viro
@ 2016-09-09  1:53               ` Dave Chinner
  2016-09-09  2:22                 ` Linus Torvalds
  2016-09-09  2:26                 ` Al Viro
  2016-09-09  2:19               ` Al Viro
  1 sibling, 2 replies; 152+ messages in thread
From: Dave Chinner @ 2016-09-09  1:53 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, CAI Qian, linux-xfs, xfs

On Fri, Sep 09, 2016 at 12:55:21AM +0100, Al Viro wrote:
> On Fri, Sep 09, 2016 at 07:38:35AM +1000, Dave Chinner wrote:
> 
> > It's not an XFS specific problem: any filesystem that supports hole
> > punch and it's fallocate() friends needs this high level splice IO
> > exclusion as well.
> 
> How is hole punch different from truncate()?  My reading of the situation
> is that we don't need exclusion between that and insertion into the pipe;
> only for the "gather uptodate page references" part.  If some page gets
> evicted afterwards... how is that different from having that happen
> right after we'd finished with ->splice_read()?  Am I missing something
> subtle in there?

generic_file_splice_read() gathers pages into spd.pages[], taking a
reference to them. The pages are not locked.

truncate does things in this order:

	move EOF,
	invalidate page cache,
	free disk space

So if we race with a truncate, the pages in spd.pages[] that are
beyond the new EOF may or may not have been removed from the page
cache. The splice code handles this specific race condition by again
checking the uptodate page against the current EOF before updating
the spd to include it:


fill_it:
		/*
		 * i_size must be checked after PageUptodate.
		 */
		isize = i_size_read(mapping->host);
		end_index = (isize - 1) >> PAGE_SHIFT;
		if (unlikely(!isize || index > end_index))
			break;

At this point, if the page is inside isize we know it has good data
in it, and we can hand it off to whoever.

The problem with hole punch or an extent shift is that the size does
not change and so the invalidated page is still within the valid
range of the file. Hence if we race with invalidation here, it does
not get caught and what we put into the buffer does not reflect
the data in the file at the time the pipe buffer is built.
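
Schematically (the window is between the page cache lookup and the
pipe buffer being consumed; function names are illustrative only):

	splice reader				hole punch
	-------------				----------
	page = find_get_page(...)
	  /* page uptodate, inside EOF */
						invalidate pagecache range
						free disk blocks
	isize check passes
	  /* i_size unchanged! */
	spd.pages[i] = page
	  /* stale data goes into the pipe */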

This isn't specific to splice - it's the same issue for all page
cache lookup and validation checks. This issue is one of the reasons
why XFS has a MMAPLOCK similar to the IOLOCK - we can't take the
IOLOCK in the page fault path, but we still need to protect page
faults against racing page invalidations within EOF from operations
like hole punch.

> Again, what I propose is a new iov_iter flavour.  Backed by pipe_buffer array,
> used only for reads (i.e. copy to, not copy from).  Three states for element:
> pagecache one, copied data, empty.  Semantics:
> 	* copy_page_to_iter(): grab a reference to page and stick it into
> the next element (making it a pagecache one) with offset and len coming
> directly from arguments.
> 	* copy_to_iter(): if the last element is a 'copied data' with empty
> space remaining - copy to the end.  Otherwise allocate a new page and stick
> it into the next element (making it 'copied data'), then copy into it.  If 
> still not all data copied, do the same for the next element, etc.  Of course,
> if there's no elements left, we are done copying.
> 	* zero_iter(): ditto, with s/copy/fill with zeroes/
> 	* iov_iter_get_pages(): allocate pages, stick them into the next
> slots (making those 'copied data').  That might need some changes, though -
> I'm still looking through the users.  The tricky part is decision when to
> update the lengths.
> 	* iov_iter_get_pages_alloc(): not sure, hadn't really looked yet.
> 	* iov_iter_alignment(): probably just returns 0.
> 	* iov_iter_advance(): probably like bvec variant.
> 

Sounds reasonable, but the iter stuff makes my head hurt so I
haven't thought about it that deeply yet.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-08 23:55             ` Al Viro
  2016-09-09  1:53               ` Dave Chinner
@ 2016-09-09  2:19               ` Al Viro
  1 sibling, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-09  2:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Linus Torvalds, CAI Qian, linux-xfs, xfs

On Fri, Sep 09, 2016 at 12:55:21AM +0100, Al Viro wrote:

> Again, what I propose is a new iov_iter flavour.  Backed by pipe_buffer array,
> used only for reads (i.e. copy to, not copy from).  Three states for element:
> pagecache one, copied data, empty.  Semantics:
> 	* copy_page_to_iter(): grab a reference to page and stick it into
> the next element (making it a pagecache one) with offset and len coming
> directly from arguments.
	Start with checking if we are asking for the next chunk of the
same page and simply adjust ->length if so.
	
> 	* copy_to_iter(): if the last element is a 'copied data' with empty
> space remaining - copy to the end.  Otherwise allocate a new page and stick
> it into the next element (making it 'copied data'), then copy into it.  If 
> still not all data copied, do the same for the next element, etc.  Of course,
> if there's no elements left, we are done copying.
> 	* zero_iter(): ditto, with s/copy/fill with zeroes/
> 	* iov_iter_get_pages(): allocate pages, stick them into the next
> slots (making those 'copied data').  That might need some changes, though -
> I'm still looking through the users.  The tricky part is decision when to
> update the lengths.
	... setting lengths to PAGE_SIZE.
	* a new primitive to be used instead of iov_iter_advance() in
success case of ->direct_IO() from generic_file_read_iter() - equivalent
to iov_iter_advance() for all existing iov_iter flavours.  For this one:
iov_iter_advance() + truncate ->length on the element we'd ended up on +
free pages on all subsequent elements, converting them to "empty".

> 	* iov_iter_get_pages_alloc(): not sure, hadn't really looked yet.

Usual "allocate array, then as in iov_iter_get_pages()"

> 	* iov_iter_alignment(): probably just returns 0.
> 	* iov_iter_advance(): probably like bvec variant.
	Probably needs to scream bloody murder if we are seeking _not_ to
the end of the last element.

	* ->count handling: capacity.  IOW, number of unused elements times
PAGE_SIZE + if the current element is 'copied data' the amount of data left
in this one.  That will need a careful review - any ->read_iter() making
assumptions about iov_iter_count() after copy_page_to_iter() might need
to be adjusted (i.e. we can't assume that iov_iter_count() decreases exactly
by the amount returned by copy_page_to_iter()).

	* copying from such iov_iter: BUG()
	* fault-in: nop
	* iov_iter_npages(): elements left
	* dup_iter(): BUG() for now
	* csum_and_copy_to_iter(): similar to copy_to_iter(), with obvious
modifications for actually calculating csum.  Or just BUG() - we are not
likely to use it at the moment.

	Looks like that should suffice...  And it looks like that ought to
take shmem ->splice_read() out as well, so that's one fewer caller of
splice_to_pipe() to switch...


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-09  1:53               ` Dave Chinner
@ 2016-09-09  2:22                 ` Linus Torvalds
  2016-09-09  2:26                   ` Linus Torvalds
  2016-09-09  2:31                   ` xfs_file_splice_read: possible circular locking dependency detected Al Viro
  2016-09-09  2:26                 ` Al Viro
  1 sibling, 2 replies; 152+ messages in thread
From: Linus Torvalds @ 2016-09-09  2:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al Viro, CAI Qian, linux-xfs, xfs

On Thu, Sep 8, 2016 at 6:53 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> So if we race with a truncate, the pages in spd.pages[] that are
> beyond the new EOF may or may not have been removed from the page
> cache.

So I'm not sure why we'd need to care?

The thing is, if the splicer and the hole puncher aren't synchronized,
then there is by definition no "before/after" point.

The splice data may be "stale" in the sense that we look at the page
after the hole punch has happened and the page no longer has a
->mapping associated with it, but it is equally valid to treat that as
just a case of "the read happened before the hole punch".

Put another way: it's not wrong to use the ostensibly "stale" data, it
just means that the splice acts as if the IO had happened before the
data became stale.

The whole point of "splice" is for the pipe to act as an in-kernel
buffer. So a splice does not *synchronize* the two end-points, quite
the reverse: it is meant to act as a "read + write" with the pipe
itself being the buffer in between (and because it's an in-kernel
buffer rather than a user space buffer like a real read()+write() pair
would be, it means that we then *can* do things like zero-copy, but
realistically it really aims for "one-copy" rather than "two-copy").

So if the splice buffer contains stale values, then that's exactly
similar to a user space application having done a "read()" of old
data, then the file is truncated (or hole punched), and then the
application does a "write()" on that data. The target clearly sees
*different* data than is on the filesystem at that point, but since
"complete synchronization" has never been a guarantee of splice() in
the first place, that's just not a downside.

If an application expects to have "splice()" give some kind of data
consistency guarantees wrt people writing to the file (or with
truncate or hole punching), then the application would have to
implement that serialization itself. Splice in itself does not do
serialization, it does data copying.

            Linus


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-09  1:53               ` Dave Chinner
  2016-09-09  2:22                 ` Linus Torvalds
@ 2016-09-09  2:26                 ` Al Viro
  1 sibling, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-09  2:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Linus Torvalds, CAI Qian, linux-xfs, xfs

On Fri, Sep 09, 2016 at 11:53:24AM +1000, Dave Chinner wrote:
 
> This isn't specific to splice - it's the same issue for all page
> cache lookup and validation checks. This issue is one of the reasons
> why XFS has a MMAPLOCK similar to the IOLOCK - we can't take the
> IOLOCK in the page fault path, but we still need to protect page
> faults against racing page invalidations within EOF from operations
> like hole punch.

Point taken.  The window is between grabbing the pages and ->readpage()
calls, though, so converting to ->read_iter() ought to deal with the
entire class of problems...
 
[snip]

> Sounds reasonable, but the iter stuff makes my head hurt so I
> haven't thought about it that deeply yet.

O_DIRECT requires a bit of care, but it seems to be doable.


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-09  2:22                 ` Linus Torvalds
@ 2016-09-09  2:26                   ` Linus Torvalds
  2016-09-09  2:34                     ` Al Viro
  2016-09-09  2:31                   ` xfs_file_splice_read: possible circular locking dependency detected Al Viro
  1 sibling, 1 reply; 152+ messages in thread
From: Linus Torvalds @ 2016-09-09  2:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al Viro, CAI Qian, linux-xfs, xfs

On Thu, Sep 8, 2016 at 7:22 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Sep 8, 2016 at 6:53 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> So if we race with a truncate, the pages in spd.pages[] that are
>> beyond the new EOF may or may not have been removed from the page
>> cache.
>
> So I'm not sure why we'd need to care?

Side note, just to clarify: I'm not actually convinced that turning
things into page/offset/len tuples is the right thing to do.

I still suspect that the reference count updates on each page may not
be a good idea.  I suspect we'd easily be better off trying to do
everything under the pipe lock exactly so that we can *avoid* having
to do per-page "increment ref-count, then decrement it again". But the
locking would have to be changed radically for us to be able to do
that (and the only sane model is, I think, to make pipe_lock be the
outermost lock, and outside *every* downcall)

               Linus


* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-09  2:22                 ` Linus Torvalds
  2016-09-09  2:26                   ` Linus Torvalds
@ 2016-09-09  2:31                   ` Al Viro
  2016-09-09  2:39                     ` Linus Torvalds
  1 sibling, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-09-09  2:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Chinner, CAI Qian, linux-xfs, xfs


> So I'm not sure why we'd need to care?
> 
> The thing is, if the splicer and the hole puncher aren't synchronized,
> then there is by definition no "before/after" point.
> 
> The splice data may be "stale" in the sense that we look at the page
> after the hole punch has happened and the page no longer has a
> ->mapping associated with it, but it is equally valid to treat that as
> just a case of "the read happened before the hole punch".
> 
> Put another way: it's not wrong to use the ostensibly "stale" data, it
> just means that the splice acts as if the IO had happened before the
> data became stale.

We care because __generic_file_splice_read() is playing fast and loose with
pagecache.  It gathers pointers to pages and *then* issues ->readpage() on
them.  Without any protection against hole-punching.  That's actually one
more example of the reasons why we really ought to just call ->read_iter()
and let the regular fs logic take care of exclusion.  pipe lock is needed
only to pass the pages we'd grabbed (from page cache) or allocated (for
copy_to_iter() callers, like e.g. DAX) into the pipe itself.
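
For concreteness, the gather-then-read sequence being criticized looks
roughly like this (heavily condensed sketch, locking and error handling
stripped - not the literal fs/splice.c code):

        /* sketch of __generic_file_splice_read()'s page gathering */
        spd.nr_pages = find_get_pages_contig(mapping, index, nr_pages,
                                             spd.pages);
        /* ... allocate and add pages for any holes in the range ... */
        for (page_nr = 0; page_nr < spd.nr_pages; page_nr++) {
                struct page *page = spd.pages[page_nr];

                if (!PageUptodate(page)) {
                        lock_page(page);
                        /* nothing prevents a hole punch between the
                         * lookup above and this ->readpage() call */
                        error = mapping->a_ops->readpage(in, page);
                        ...
                }
        }
        /* the pages then go into the pipe via splice_to_pipe() */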

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-09  2:26                   ` Linus Torvalds
@ 2016-09-09  2:34                     ` Al Viro
  2016-09-09  2:50                       ` Linus Torvalds
  0 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-09-09  2:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Chinner, CAI Qian, linux-xfs, xfs

On Thu, Sep 08, 2016 at 07:26:44PM -0700, Linus Torvalds wrote:
> On Thu, Sep 8, 2016 at 7:22 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Thu, Sep 8, 2016 at 6:53 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> So if we race with a truncate, the pages in spd.pages[] that are
> >> beyond the new EOF may or may not have been removed from the page
> >> cache.
> >
> > So I'm not sure why we'd need to care?
> 
> Side note, just to clarify: I'm not actually convinced that turning
> things into page/offset/len tuples is the right thing to do.
> 
> I still suspect that the reference count updates on each page may not
> be a good idea.  I suspect we'd easily be better off trying to do
> everything under the pipe lock exactly so that we can *avoid* having
> to do per-page "increment ref-count, then decrement it again". But the
> locking would have to be changed radically for us to be able to do
> that (and the only sane model is, I think, to make pipe_lock be the
> outermost lock, and outside *every* downcall)

IDGI.  Suppose we do splice from file to pipe.  Everything had been in
page cache, so we want to end up with pipe_buffers containing references
to those page cache pages.  How do you propose to do that without having
grabbed references to them?  What's to keep them from being freed by the
time we get to reading from the pipe?

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-09  2:31                   ` xfs_file_splice_read: possible circular locking dependency detected Al Viro
@ 2016-09-09  2:39                     ` Linus Torvalds
  0 siblings, 0 replies; 152+ messages in thread
From: Linus Torvalds @ 2016-09-09  2:39 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Chinner, CAI Qian, linux-xfs, xfs

On Thu, Sep 8, 2016 at 7:31 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> We care because __generic_file_splice_read() is playing fast and loose with
> pagecache.  It gathers pointers to pages and *then* issues ->readpage() on
> them.  Without any protection against hole-punching.

Ugh. It should just lock them when it gathers the pointers.

And in fact they *are* locked for the add_to_page_cache_lru() case,
but the splice code explicitly unlocks them in order to then
unconditionally lock them *again* in the IO path.

Oh, that's just crazy. And stupid.

You're right, that code just has to be killed. It's too wrong to live.

If you can replace it with the generic read iterator, then that does
indeed just fix things. So color me convinced.

              Linus

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-09  2:34                     ` Al Viro
@ 2016-09-09  2:50                       ` Linus Torvalds
  2016-09-09 22:19                         ` Al Viro
  0 siblings, 1 reply; 152+ messages in thread
From: Linus Torvalds @ 2016-09-09  2:50 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Chinner, CAI Qian, linux-xfs, xfs

On Thu, Sep 8, 2016 at 7:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> IDGI.  Suppose we do splice from file to pipe.  Everything had been in
> page cache, so we want to end up with pipe_buffers containing references
> to those page cache pages.  How do you propose to do that without having
> grabbed references to them?  What's to keep them from being freed by the
> time we get to reading from the pipe?

So that's obviously what we already do. That is, after all, why splice
doesn't actually keep track of "pages", it keeps track of "struct
pipe_buffer". So each page has not just offset/len associated with it,
but also a get/release/verify operation block and some flags with them
(it might not be a page-cache page, so in some cases it might be a skb
or something that needs different release semantics).
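
(For reference, the structures in question look roughly like this in
<linux/pipe_fs_i.h> - the "verify" operation is spelled ->confirm()
there:)

        struct pipe_buffer {
                struct page *page;      /* page cache page, skb fragment, ... */
                unsigned int offset, len;
                const struct pipe_buf_operations *ops;
                unsigned int flags;
                unsigned long private;  /* owner-specific cookie */
        };

        struct pipe_buf_operations {
                int can_merge;
                int (*confirm)(struct pipe_inode_info *, struct pipe_buffer *);
                void (*release)(struct pipe_inode_info *, struct pipe_buffer *);
                int (*steal)(struct pipe_inode_info *, struct pipe_buffer *);
                void (*get)(struct pipe_inode_info *, struct pipe_buffer *);
        };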

And if you can make the iterator basically interface with turning the
page/offset/len directly into a "struct pipe_buffer" and not do any
extra reference operations, then it actually would work very well.

But the way I read your description of what you'd do I just expected
you to have an extra "get/put" ref at the iterator level.

Maybe I misunderstood. I'd love to see a rough patch.

                           Linus

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-09  2:50                       ` Linus Torvalds
@ 2016-09-09 22:19                         ` Al Viro
  2016-09-10  2:06                           ` Linus Torvalds
  0 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-09-09 22:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Chinner, CAI Qian, linux-xfs, xfs

On Thu, Sep 08, 2016 at 07:50:05PM -0700, Linus Torvalds wrote:
> On Thu, Sep 8, 2016 at 7:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > IDGI.  Suppose we do splice from file to pipe.  Everything had been in
> > page cache, so we want to end up with pipe_buffers containing references
> > to those page cache pages.  How do you propose to do that without having
> > grabbed references to them?  What's to keep them from being freed by the
> > time we get to reading from the pipe?
> 
> So that's obviously what we already do. That is, after all, why splice
> doesn't actually keep track of "pages", it keeps track of "struct
> pipe_buffer". So each page has not just offset/len associated with it,
> but also a get/release/verify operation block and some flags with them
> (it might not be a page-cache page, so in some cases it might be a skb
> or something that needs different release semantics).
> 
> And if you can make the iterator basically interface with turning the
> page/offset/len directly into a "struct pipe_buffer" and not do any
> extra reference operations, then it actually would work very well.
> 
> But the way I read your description of what you'd do I just expected
> you to have an extra "get/put" ref at the iterator level.

Umm...  Looks like I misunderstood you, then.  Yes, it ends up with
get/get/put, the last two close in time.  Do you expect that to be a serious
overhead?  atomic_inc + atomic_dec_and_test + branch not taken shouldn't be
_that_ hot, and I would rather avoid complicating do_generic_file_read()
and friends with "iov_iter (somehow) told us not to put this page as we
normally would".  Can be done that way, but let's not bother until it really
shows in profiles.

> Maybe I misunderstood. I'd love to see a rough patch.

Cooking it...  The thing I really hate about the use of pipe_buffer is that
we want to keep the "use on-stack array for default pipe size" trick, and
pipe_buffer is fatter than I'd like.  Instead of pointer + two numbers +
something to indicate whether it's picked from page cache or something we'd
allocated we get pointer + int + int + pointer + int + long, which turns
into 5 words on 64bit.  With a 16-element array of those on the stack frame,
it's not nice - more than half a kilobyte of stack space with ->read_iter() yet to
be called...  bvec would be better (60% savings boils down to 384 bytes
shaved off that thing), but we'd need to play games with encoding the "is
it page cache or not" bit somewhere in it.

BTW, AFAICS that thing can be used to replace _all_ non-default filesystem
instances - lustre, nfs, gfs2, ocfs2, xfs, shmem, even coda.  What remains:
	* fuse_dev_splice_read()
	* relay_file_splice_read()
	* tracing_splice_read_pipe()
	* tracing_buffers_splice_read()
	* sock_splice_read()
TBH, the last one makes me nervous -
        /* Drop the socket lock, otherwise we have reverse
         * locking dependencies between sk_lock and i_mutex
         * here as compared to sendfile(). We enter here
         * with the socket lock held, and splice_to_pipe() will
         * grab the pipe inode lock. For sendfile() emulation,
         * we call into ->sendpage() with the i_mutex lock held
         * and networking will grab the socket lock.
         */
        release_sock(sk);
        ret = splice_to_pipe(pipe, spd);
        lock_sock(sk);
in skb_socket_splice() and
        mutex_unlock(&u->readlock);
        ret = splice_to_pipe(pipe, spd);
        mutex_lock(&u->readlock);
in skb_unix_socket_splice() smell like yet another indication that we are
taking the locks in the wrong order.  OTOH, lifting the pipe lock all the way out
of that, especially the last one, really smells like asking for deadlocks.
It is a separate issue, but it'll also need looking into...

I wonder if relay_file_read() would be better off converted to ->read_iter()
and unified with relay_file_splice_read() - we do copy_to_user() from
vmap'ed area, but we have the array of underlying struct page *, so it could
switch to copy_page_to_iter(), at which point ->splice_read() would _probably_
be OK with switch to ->read_iter() use.  The tricky part is their use games
with relay_consume_bytes() in relay_pipe_buf_release().  Who maintains that
thing (kernel/relay.c) these days?  git log for it looks like it's been
pretty much abandoned...

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-09 22:19                         ` Al Viro
@ 2016-09-10  2:06                           ` Linus Torvalds
  2016-09-14  3:16                             ` Al Viro
  0 siblings, 1 reply; 152+ messages in thread
From: Linus Torvalds @ 2016-09-10  2:06 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Chinner, CAI Qian, linux-xfs, xfs

On Fri, Sep 9, 2016 at 3:19 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Cooking it...  The thing I really hate about the use of pipe_buffer is that
> we want to keep the "use on-stack array for default pipe size" trick, and
> pipe_buffer is fatter than I'd like.  Instead of pointer + two numbers +
> something to indicate whether it's picked from page cache or something we'd
> allocated we get pointer + int + int + pointer + int + long, which turns
> into 5 words on 64bit.  With a 16-element array of those on the stack frame,
> it's not nice - more than half a kilobyte of stack space with ->read_iter() yet to
> be called...  bvec would be better (60% savings boils down to 384 bytes
> shaved off that thing), but we'd need to play games with encoding the "is
> it page cache or not" bit somewhere in it.

No, please don't play games like that.

I think you'd be better off with just a really small on-stack case
(like maybe 2-3 entries), and just allocate anything bigger
dynamically. Or you could even see how bad it is if you just
force-limit it to max 4 entries or something like that and just do
partial writes.

From when I looked at things (admittedly a *long* time ago), the
buffer sizes for things like read/write system calls were *very*
skewed.

There's a lot of small stuff, then there is the stuff that actually
honors st.st_blksize (normally one page), and then there is the big
buffers stuff.

And the thing is, the big buffers are almost never worth it. It's
often better to have a tight loop over smaller data than bouncing lots
of data into buffers and then out of buffers.

So I suspect all the "let's do many pages in one go" stuff is actually
not worth it. Especially since the pipes will basically force a wait
event when the pipe buffers fill up anyway.

So feel free to try maxing out using only a small handful of
pipe_buffer entries. Returning partial IO from splice() is fine.
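
(Userland is supposed to cope with short splice() returns regardless; a
minimal copy loop - a generic example, not from any particular program -
looks like:)

        #define _GNU_SOURCE
        #include <errno.h>
        #include <fcntl.h>
        #include <unistd.h>

        /* file -> pipe; a short return just means another pass */
        static ssize_t splice_all(int in_fd, int pipe_wr, size_t len)
        {
                ssize_t done = 0;

                while (len) {
                        ssize_t n = splice(in_fd, NULL, pipe_wr, NULL,
                                           len, 0);
                        if (n < 0) {
                                if (errno == EINTR)
                                        continue;
                                return done ? done : -1;
                        }
                        if (n == 0)
                                break;  /* EOF */
                        done += n;
                        len -= n;
                }
                return done;
        }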

                   Linus

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-10  2:06                           ` Linus Torvalds
@ 2016-09-14  3:16                             ` Al Viro
  2016-09-14  3:39                               ` Nicholas Piggin
  2016-09-14  3:49                               ` Linus Torvalds
  0 siblings, 2 replies; 152+ messages in thread
From: Al Viro @ 2016-09-14  3:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin

[Jens and Nick Cc'd]

On Fri, Sep 09, 2016 at 07:06:29PM -0700, Linus Torvalds wrote:

> I think you'd be better off with just a really small on-stack case
> (like maybe 2-3 entries), and just allocate anything bigger
> dynamically. Or you could even see how bad it is if you just
> force-limit it to max 4 entries or something like that and just do
> partial writes.

Umm...  Right now it tries to allocate as much as the output pipe could
possibly hold.  With default being 16 buffers, you'll end up with doing
dynamic allocation in all cases (it doesn't even look at the amount of
data we want to transfer).

The situation with splice_pipe_desc looks very odd:

	* all but one instance are on stack frames of some ->splice_read()
or something called by it (exception is in vmsplice)

	* all but one instance (a different one - see below) go through
splice_grow_spd / splice_to_pipe / splice_shrink_spd sequence and
nothing else sees them.  The exception is skb_splice_bits() and there we
have MAX_SKB_FRAGS for size, don't bother with grow/shrink and the only
thing done to that spd is splice_to_pipe() (from the callback passed to
skb_splice_bits()).

	* only one ->splice_read() instance does _not_ create
splice_pipe_descriptor.  It's fuse_dev_splice_read(), and it pays for that
by open-coding splice_to_pipe().  The only reason for open-coding is that
we don't have a "stronger SPLICE_F_NONBLOCK" that would fail if the data
wouldn't fit.  SPLICE_F_NONBLOCK stuffs as much as possible and buggers off
without waiting, fuse_dev_splice_read() wants all or nothing (and no waiting).

	* incidentally, we *can't* add new flags - splice(2)/tee(2)/vmsplice(2)
quietly ignore all bits they do not recognize.  In fact, splice(2) ends up
passing them (unsanitized) to ->splice_read and ->splice_write instances.

	* for splice(2) the IO size is limited by nominal capacity of output
pipe.  Looks fairly arbitrary (the limit is the same whether the pipe is
full or empty), but I wouldn't be surprised if userland programmers would
get unhappy if they have to take more iterations through their loops.

	* the other caller of ->splice_read() is splice_direct_to_actor() and
that can be called on a fairly deep stack.  However, there we loop ourselves
and smaller chunk size is not a problem.

	* in case of skb_splice_bits(), we probably want a grow/shrink pair
as well, with well below MAX_SKB_FRAGS for a default - what's the typical
number of fragments per skb?

> So feel free to try maxing out using only a small handful of
> pipe_buffer entries. Returning partial IO from splice() is fine.

	Are you sure that nobody's growing the output pipe buffer before
doing splice() into it as a way to reduce the amount of iterations?

	FWIW, I would love to replace these array of page * + array of
<offset,len,private> triples with array of pipe_buffer; for one thing,
this ridiculous ->spd_release() goes away (we simply call ->ops->release()
on all unwanted buffers), which gets rid of wonders like
static void buffer_spd_release(struct splice_pipe_desc *spd, unsigned int i)
{
        struct buffer_ref *ref =
                (struct buffer_ref *)spd->partial[i].private;

        if (--ref->ref)
                return;

        ring_buffer_free_read_page(ref->buffer, ref->page);
        kfree(ref);
        spd->partial[i].private = 0;
}
static void buffer_pipe_buf_release(struct pipe_inode_info *pipe,
                                    struct pipe_buffer *buf)
{
        struct buffer_ref *ref = (struct buffer_ref *)buf->private;

        if (--ref->ref)
                return;

        ring_buffer_free_read_page(ref->buffer, ref->page);
        kfree(ref);
        buf->private = 0;
}

pairs that need to be kept in sync, etc.

One inconvenience created by that is stuff like
        spd.nr_pages = find_get_pages_contig(mapping, index, nr_pages, spd.pages);
in there; granted, this one will go away with __generic_file_splice_read(),
but e.g. get_iovec_page_array() is using get_user_pages_fast(), which wants
to put pages next to each other.  That one is from vmsplice_to_pipe() guts,
and I've no idea what the normal use patterns are.  OTOH, how much overhead
would we get from repeated calls of get_user_pages_fast() for e.g. 16 pages
or so, compared to larger chunks?  It is on a shallow stack, so it's not
as if we couldn't afford a 16-element array of struct page * in there...
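
For reference, the thing being discussed above - roughly as declared in
include/linux/splice.h at this point:

        struct partial_page {
                unsigned int offset;
                unsigned int len;
                unsigned long private;
        };

        struct splice_pipe_desc {
                struct page **pages;            /* page map */
                struct partial_page *partial;   /* pages[] may not be contiguous */
                int nr_pages;                   /* number of populated pages in map */
                unsigned int nr_pages_max;      /* size of pages[] and partial[] */
                unsigned int flags;             /* splice flags */
                const struct pipe_buf_operations *ops;
                void (*spd_release)(struct splice_pipe_desc *, unsigned int);
        };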

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-14  3:16                             ` Al Viro
@ 2016-09-14  3:39                               ` Nicholas Piggin
  2016-09-14  4:01                                 ` Linus Torvalds
  2016-09-18  5:33                                 ` Al Viro
  2016-09-14  3:49                               ` Linus Torvalds
  1 sibling, 2 replies; 152+ messages in thread
From: Nicholas Piggin @ 2016-09-14  3:39 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe

On Wed, 14 Sep 2016 04:16:48 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> [Jens and Nick Cc'd]
> 
> On Fri, Sep 09, 2016 at 07:06:29PM -0700, Linus Torvalds wrote:
> 
> > I think you'd be better off with just a really small on-stack case
> > (like maybe 2-3 entries), and just allocate anything bigger
> > dynamically. Or you could even see how bad it is if you just
> > force-limit it to max 4 entries or something like that and just do
> > partial writes.  
> 
> Umm...  Right now it tries to allocate as much as the output pipe could
> possibly hold.  With default being 16 buffers, you'll end up with doing
> dynamic allocation in all cases (it doesn't even look at the amount of
> data we want to transfer).
> 
> The situation with splice_pipe_desc looks very odd:
> 
> 	* all but one instance are on stack frames of some ->splice_read()
> or something called by it (exception is in vmsplice)
> 
> 	* all but one instance (a different one - see below) go through
> splice_grow_spd / splice_to_pipe / splice_shrink_spd sequence and
> nothing else sees them.  The exception is skb_splice_bits() and there we
> have MAX_SKB_FRAGS for size, don't bother with grow/shrink and the only
> thing done to that spd is splice_to_pipe() (from the callback passed to
> skb_splice_bits()).
> 
> 	* only one ->splice_read() instance does _not_ create
> splice_pipe_descriptor.  It's fuse_dev_splice_read(), and it pays for that
> by open-coding splice_to_pipe().  The only reason for open-coding is that
> we don't have a "stronger SPLICE_F_NONBLOCK" that would fail if the data
> wouldn't fit.  SPLICE_F_NONBLOCK stuffs as much as possible and buggers off
> without waiting, fuse_dev_splice_read() wants all or nothing (and no waiting).
> 
> 	* incidentally, we *can't* add new flags - splice(2)/tee(2)/vmsplice(2)
> quietly ignore all bits they do not recognize.  In fact, splice(2) ends up
> passing them (unsanitized) to ->splice_read and ->splice_write instances.
> 
> 	* for splice(2) the IO size is limited by nominal capacity of output
> pipe.  Looks fairly arbitrary (the limit is the same whether the pipe is
> full or empty), but I wouldn't be surprised if userland programmers would
> get unhappy if they have to take more iterations through their loops.
> 
> 	* the other caller of ->splice_read() is splice_direct_to_actor() and
> that can be called on a fairly deep stack.  However, there we loop ourselves
> and smaller chunk size is not a problem.
> 
> 	* in case of skb_splice_bits(), we probably want a grow/shrink pair
> as well, with well below MAX_SKB_FRAGS for a default - what's the typical
> number of fragments per skb?
> 
> > So feel free to try maxing out using only a small handful of
> > pipe_buffer entries. Returning partial IO from splice() is fine.  
> 
> 	Are you sure that nobody's growing the output pipe buffer before
> doing splice() into it as a way to reduce the amount of iterations?
> 
> 	FWIW, I would love to replace these array of page * + array of
> <offset,len,private> triples with array of pipe_buffer; for one thing,
> this ridiculous ->spd_release() goes away (we simply call ->ops->release()
> on all unwanted buffers), which gets rid of wonders like
> static void buffer_spd_release(struct splice_pipe_desc *spd, unsigned int i)
> {
>         struct buffer_ref *ref =
>                 (struct buffer_ref *)spd->partial[i].private;
> 
>         if (--ref->ref)
>                 return;
> 
>         ring_buffer_free_read_page(ref->buffer, ref->page);
>         kfree(ref);
>         spd->partial[i].private = 0;
> }
> static void buffer_pipe_buf_release(struct pipe_inode_info *pipe,
>                                     struct pipe_buffer *buf)
> {
>         struct buffer_ref *ref = (struct buffer_ref *)buf->private;
> 
>         if (--ref->ref)
>                 return;
> 
>         ring_buffer_free_read_page(ref->buffer, ref->page);
>         kfree(ref);
>         buf->private = 0;
> }
> 
> pairs that need to be kept in sync, etc.
> 
> One inconvenience created by that is stuff like
>         spd.nr_pages = find_get_pages_contig(mapping, index, nr_pages, spd.pages);
> in there; granted, this one will go away with __generic_file_splice_read(),
> but e.g. get_iovec_page_array() is using get_user_pages_fast(), which wants
> to put pages next to each other.  That one is from vmsplice_to_pipe() guts,
> and I've no idea what the normal use patterns are.  OTOH, how much overhead
> would we get from repeated calls of get_user_pages_fast() for e.g. 16 pages
> or so, compared to larger chunks?  It is on a shallow stack, so it's not
> as if we couldn't afford a 16-element array of struct page * in there...

Should not be so bad, but I don't have hard numbers for you. PAGEVEC_SIZE
is 14, and that's a conceptually rather similar operation (walk the radix tree;
grab pages). OTOH many archs are heavier and do locking and vma walking etc.

Documentation/features/vm/pte_special/arch-support.txt

But even for those, at 16 entries, the bulk of the cost *should* be hitting
struct page cachelines and refcounting. The rest should mostly stay in cache.

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-14  3:16                             ` Al Viro
  2016-09-14  3:39                               ` Nicholas Piggin
@ 2016-09-14  3:49                               ` Linus Torvalds
  2016-09-14  4:26                                 ` Al Viro
  1 sibling, 1 reply; 152+ messages in thread
From: Linus Torvalds @ 2016-09-14  3:49 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin

On Tue, Sep 13, 2016 at 8:16 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>> So feel free to try maxing out using only a small handful of
>> pipe_buffer entries. Returning partial IO from splice() is fine.
>
>         Are you sure that nobody's growing the output pipe buffer before
> doing splice() into it as a way to reduce the amount of iterations?

Do we care?

There's a lot of people who use large buffers. That doesn't
necessarily mean that it is the right thing to do. A small buffer that
we can allocate on-stack might well be better even if it causes more
iterations.

I'd also like to simplify the splice code if at all possible.
Particularly as there really aren't necessarily all that many actual
users of it. So if we can say "screw that" and just allocate a small
buffer on stack, and people end up iterating a bit more, so what? The
point of splice is to avoid the data copies and VM games, not to make
big buffers.

             Linus

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-14  3:39                               ` Nicholas Piggin
@ 2016-09-14  4:01                                 ` Linus Torvalds
  2016-09-18  5:33                                 ` Al Viro
  1 sibling, 0 replies; 152+ messages in thread
From: Linus Torvalds @ 2016-09-14  4:01 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Al Viro, Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe

On Tue, Sep 13, 2016 at 8:39 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
>
> But even for those, at 16 entries, the bulk of the cost *should* be hitting
> struct page cachelines and refcounting. The rest should mostly stay in cache.

Yes. And those costs will be exactly the same whether we do 16 entries
at a time or 4 loops of 4 entries.

There's something to be said for small temp buffers. They often have
better cache behavior thanks to re-use than having larger arrays.

But I still think that the biggest win could be from just trying to
cut down on code, if we can just say "we'll limit splice to N entries"
(where "N" is small enough that we really can do everything in a
simple stack allocation - I suspect 16 is already too big, and we
really should look at 4 or 8).

And if we actually get a report of a performance regression, we'd at
least hear who actually *uses* splice and notices.

I'm (sadly) still not at all convinced that "splice()" was ever a good
idea. I think it was a clever idea, and it is definitely much more
powerful conceptually than sendfile(), but I also suspect that it's
simply not used enough to be really worth the pain.

You can get great benchmark numbers with it. But whether it actually
matters in real life? I really don't know. But if we screw it up, and
make the buffers too small, and people actually complain and tell us
about what they are doing, that in itself would be a good datapoint.

So I wouldn't be too worried about just trying things out. We
certainly don't want to *break* anything, but at the same time I
really don't think we should be too nervous about it either.

Which is why I'd be more than happy to say "Just try limiting things
to a pretty small buffer and see if anybody even notices!"

                   Linus

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-14  3:49                               ` Linus Torvalds
@ 2016-09-14  4:26                                 ` Al Viro
  2016-09-17  8:20                                   ` Al Viro
  0 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-09-14  4:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin

On Tue, Sep 13, 2016 at 08:49:58PM -0700, Linus Torvalds wrote:
 
> I'd also like to simplify the splice code if at all possible.

Then pipe_buffer it is; it will take a bit of surgery, but I'd expect
the end result to be much simpler.  OK, so splice_pipe_desc switches
from the pages/partial_pages/ops/spd_release to pipe_bufs, and I'm
actually tempted to replace nr_pages with "the rest of ->pipe_bufs[] has
NULL ->page".  Then it becomes simply
struct splice_pipe_desc {
	struct pipe_buffer *bufs;
	int nbufs;
	unsigned flags;
}, perhaps with struct pipe_buffer _bufs[INLINE_SPLICE_BUFS]; in the end.
struct partial_page simply dies...

Next question: what to do with sanitizing flags in splice(2)/vmsplice(2)/tee(2)?
Right now we accept anything, and quietly ignore everything outside of lower
4 bits.  Should we start masking everything else out and/or warning about
anything unexpected?

	What I definitely want for splice_to_pipe() is an additional flag for
"fail unless there's enough space to copy everything".  Having fuse open-code
splice_to_pipe() with all its guts is just plain wrong.  I'm not saying that
it should be possible to set in splice(2) arguments; it's obviously an ABI
breakage, since currently we ignore all unknown bits.  The question is whether
we mask the unknown bits quietly; doing that with yelling might allow us to
make them available eventually.
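
Something like this, say (a sketch only - SPLICE_F_ALL is a made-up name
at this point, and whether to warn at all is exactly the question above):

        #define SPLICE_F_ALL    (SPLICE_F_MOVE | SPLICE_F_NONBLOCK | \
                                 SPLICE_F_MORE | SPLICE_F_GIFT)

        if (unlikely(flags & ~SPLICE_F_ALL)) {
                pr_warn_once("splice: unknown flags %#x ignored\n",
                             flags & ~SPLICE_F_ALL);
                flags &= SPLICE_F_ALL;
        }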

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-14  4:26                                 ` Al Viro
@ 2016-09-17  8:20                                   ` Al Viro
  2016-09-17 19:00                                     ` Al Viro
  0 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-09-17  8:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin

On Wed, Sep 14, 2016 at 05:25:59AM +0100, Al Viro wrote:
> On Tue, Sep 13, 2016 at 08:49:58PM -0700, Linus Torvalds wrote:
>  
> > I'd also like to simplify the splice code if at all possible.
> 
> Then pipe_buffer it is; it will take a bit of surgery, but I'd expect
> the end result to be much simpler.  OK, so splice_pipe_desc switches
> from the pages/partial_pages/ops/spd_release to pipe_bufs, and I'm
> actually tempted to replace nr_pages with "the rest of ->pipe_bufs[] has
> NULL ->page".  Then it becomes simply
> struct splice_pipe_desc {
> 	struct pipe_buffer *bufs;
> 	int nbufs;
> 	unsigned flags;
> }, perhaps with struct pipe_buffer _bufs[INLINE_SPLICE_BUFS]; in the end.
> struct partial_page simply dies...

Actually, we can do even better, and kill the sodding splice_pipe_desc
entirely, along with skb_splice_bits() callback.

1) make splice_to_pipe() return on pipe overflow, flags be damned.  And
lift pipe_lock()/looping/waking the readers up into callers.  Basically,
what you've suggested earlier in the thread.  There are 2 kinds of callers -
vmsplice_to_pipe() and assorted ->splice_read(), called from do_splice_to().
pipe_lock and loop is lifted into vmsplice_to_pipe() and into do_splice();
another caller of do_splice_to() already has a loop *and* couldn't wait
on the pipe anyway - it uses an internal one.

2) fuse_dev_splice_read() checks the amount of space in the pipe and
either buggers off or calls splice_to_pipe().

3) since the pipe is locked, skb_splice_bits() callbacks don't need to
unlock/relock any socket locks.  All those callbacks are simply
splice_to_pipe() and can be replaced with direct call of that sucker.

4) since the pipe is locked, there's no point feeding the bits in one go;
we might as well send them one by one.  That kills splice_to_pipe(),
splice_pipe_desc and these on-stack arrays, along with the questions about
their size.

5) that iov_iter flavour is backed by pipe.  {__,}generic_file_splice_read()
is gone - we simply set an iov_iter over our locked pipe and pass it to
->read_iter().  That serves as ->splice_read() where generic_file_splice_read()
used to be used, as well as nfs/ocfs2/gfs2/shmem instances.
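
The end result for (5) would be something like this (sketch;
iov_iter_pipe() is the new primitive, exact calling conventions subject
to change):

        ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
                                         struct pipe_inode_info *pipe,
                                         size_t len, unsigned int flags)
        {
                struct iov_iter to;
                struct kiocb kiocb;
                ssize_t ret;

                iov_iter_pipe(&to, READ, pipe, len); /* iter over the locked pipe */
                init_sync_kiocb(&kiocb, in);
                kiocb.ki_pos = *ppos;
                ret = in->f_op->read_iter(&kiocb, &to);
                if (ret > 0)
                        *ppos = kiocb.ki_pos;
                return ret;
        }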

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-17  8:20                                   ` Al Viro
@ 2016-09-17 19:00                                     ` Al Viro
  2016-09-17 20:15                                       ` Linus Torvalds
                                                         ` (2 more replies)
  0 siblings, 3 replies; 152+ messages in thread
From: Al Viro @ 2016-09-17 19:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin

On Sat, Sep 17, 2016 at 09:20:07AM +0100, Al Viro wrote:

> 5) that iov_iter flavour is backed by pipe.  {__,}generic_file_splice_read()
> is gone - we simply set an iov_iter over our locked pipe and pass it to
> ->read_iter().  That serves as ->splice_read() where generic_file_splice_read()
> used to be used, as well as nfs/ocfs2/gfs2/shmem instances.

6) The same happens to coda and lustre instances, taking a bunch of crud out
in case of lustre (IO_SPLICE handling parallel to IO_NORMAL and
->vui_io_subtype in general).  Moreover, skb_splice_bits() becomes very
similar to skb_copy_datagram_iter(), possibly allowing to replace at least
AF_UNIX ->splice_read() with the same generic ->read_iter()-based one - or
doing the same to _all_ socket ones.  Even more interesting is that
fuse_dev_splice_read() just might become replaceable with that, at the price
of some massage (and simplifications) of fuse_copy_page().  If _that_ works
out, we are in a situation where that thing is universal for everything
that has ->read_iter() in the first place.  Most of the stuff that has
only ->read() uses default_file_splice_read(); the only irregular instances
left are kernel/relay.c and kernel/trace/trace.c ones.  Incidentally, these
irregulars are precisely the ones that make use of buf->private.

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-17 19:00                                     ` Al Viro
@ 2016-09-17 20:15                                       ` Linus Torvalds
  2016-09-18 19:31                                       ` skb_splice_bits() and large chunks in pipe (was " Al Viro
  2016-09-23 19:00                                       ` [RFC][CFT] splice_read reworked Al Viro
  2 siblings, 0 replies; 152+ messages in thread
From: Linus Torvalds @ 2016-09-17 20:15 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin

On Sat, Sep 17, 2016 at 12:00 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> [ edited out steps 1-6]

So that all sounds very much like a big improvement. Not that I think
removing splice_pipe_desc is such a big deal per se, but on the whole
the less the actual low-level iterator does, and the more we do at a
higher level, the happier I am. The fact that you say you can remove
it does make it sound like you got rid of the right amount of
complexity, though. The reason that whole thing exists is exactly
because otherwise the splice callbacks would look too damn hairy for
words.

If we get rid of all the SPLICE_F_NONBLOCK, sighandling, pipe_wait()
and fasync crap at the low level, I'll already be much happier. I hate
how complex that code is, and how the filesystems call into it as a
helper etc. Doing just the iterator in the deep corners of splice
sounds like absolutely the right thing to do.

So reading your outline I say "wonderful".

Of course, maybe I'll change my mind when I actually see your patches ;^p

             Linus

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-14  3:39                               ` Nicholas Piggin
  2016-09-14  4:01                                 ` Linus Torvalds
@ 2016-09-18  5:33                                 ` Al Viro
  2016-09-19  3:08                                   ` Nicholas Piggin
  1 sibling, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-09-18  5:33 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, linux-fsdevel

[finally Cc'd to fsdevel - should've done that several iterations upthread]

On Wed, Sep 14, 2016 at 01:39:25PM +1000, Nicholas Piggin wrote:

> Should not be so bad, but I don't have hard numbers for you. PAGEVEC_SIZE
> is 14, and that's a conceptually rather similar operation (walk the radix tree;
> grab pages). OTOH many archs are heavier and do locking and vma walking etc.
> 
> Documentation/features/vm/pte_special/arch-support.txt
> 
> But even for those, at 16 entries, the bulk of the cost *should* be hitting
> struct page cachelines and refcounting. The rest should mostly stay in cache.

OK...  That's actually important only for vmsplice_to_pipe(), and a 16-page
array seems to be doing fine there.

Another question, now that you've finally resurfaced: could you reconstruct
the story with page-stealing and breakage(s) thereof that had lead to
commit 485ddb4b9741bafb70b22e5c1f9b4f37dc3e85bd
Author: Nick Piggin <npiggin@suse.de>
Date:   Tue Mar 27 08:55:08 2007 +0200

    1/2 splice: dont steal

I realize that it was 9 years ago, but anything resembling a braindump
would be very welcome.  Note that there are a couple of ->splice_write()
instances that _do_ use ->steal() (fuse_dev_splice_write() and virtio_console
port_fops_splice_write()) and I wonder if they suffer from the same problems;
your commit message is rather short on details, unfortunately.  FUSE one
is especially interesting...

^ permalink raw reply	[flat|nested] 152+ messages in thread

* skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-17 19:00                                     ` Al Viro
  2016-09-17 20:15                                       ` Linus Torvalds
@ 2016-09-18 19:31                                       ` Al Viro
  2016-09-18 20:12                                         ` Linus Torvalds
  2016-09-23 19:00                                       ` [RFC][CFT] splice_read reworked Al Viro
  2 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-09-18 19:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Nick Piggin, linux-fsdevel, netdev, Eric Dumazet

FWIW, I'm not sure if skb_splice_bits() can't land us in trouble; fragments
might come from compound pages and I'm not entirely convinced that we won't
end up with coalesced fragments putting more than PAGE_SIZE into a single
pipe_buffer.  And that could badly confuse a bunch of code.

Can that legitimately happen?  If so, we'll need to audit quite a few
->splice_write()-related codepaths; FUSE, in particular, is very likely
to be unhappy with that kind of stuff, and it's not the only place where
we might count upon never seeing e.g. longer than PAGE_SIZE chunks in
bio_vec.  It shouldn't be all that hard to fix, but if the whole thing
is simply impossible, I would rather avoid that round of RTFS at the moment...

Comments?

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-18 19:31                                       ` skb_splice_bits() and large chunks in pipe (was " Al Viro
@ 2016-09-18 20:12                                         ` Linus Torvalds
  2016-09-18 22:31                                           ` Al Viro
  0 siblings, 1 reply; 152+ messages in thread
From: Linus Torvalds @ 2016-09-18 20:12 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Nick Piggin, linux-fsdevel, Network Development,
	Eric Dumazet

On Sun, Sep 18, 2016 at 12:31 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> FWIW, I'm not sure if skb_splice_bits() can't land us in trouble; fragments
> might come from compound pages and I'm not entirely convinced that we won't
> end up with coalesced fragments putting more than PAGE_SIZE into a single
> pipe_buffer.  And that could badly confuse a bunch of code.

The pipe buffer code is actually *supposed* to handle any size
allocations at all. They should *not* be limited by pages, exactly
because the data can come from huge-pages or just multi-page
allocations. It's definitely possible with networking, and networking
is one of the *primary* targets of splice in many ways.

So if the splice code ends up being confused by "this is not just
inside a single page", then the splice code is buggy, I think.

Why would splice_write() cases be confused anyway? A filesystem needs
to be able to handle the case of "this needs to be split" regardless,
since even if the source buffer were to fit in a page, the offset
might obviously mean that the target won't fit in a page.

Now, if you decide that you want to make the iterator always split
those possibly big cases and never have big iovec entries, I guess
that would potentially be ok. But my initial reaction is that they are
perfectly normal and should be handled normally, and any code that
depends on a splice buffer fitting in one page is just buggy and
should be fixed.

                 Linus

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-18 20:12                                         ` Linus Torvalds
@ 2016-09-18 22:31                                           ` Al Viro
  2016-09-19  0:18                                             ` Linus Torvalds
  2016-09-19  0:22                                               ` Al Viro
  0 siblings, 2 replies; 152+ messages in thread
From: Al Viro @ 2016-09-18 22:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Nick Piggin, linux-fsdevel, Network Development,
	Eric Dumazet

On Sun, Sep 18, 2016 at 01:12:21PM -0700, Linus Torvalds wrote:

> So if the splice code ends up being confused by "this is not just
> inside a single page", then the splice code is buggy, I think.
> 
> Why would splice_write() cases be confused anyway? A filesystem needs
> to be able to handle the case of "this needs to be split" regardless,
> since even if the source buffer were to fit in a page, the offset
> might obviously mean that the target won't fit in a page.

What worries me is iov_iter_get_pages() and friends.  The calling conventions
are
	size = iov_iter_get_pages(iter, pages, maxlen, maxpages, &start);

They are convenient enough for most of the callers - we fill an array of
pages, the first (and, in the bvec case, only) one having start bytes skipped.

The thing is, the calculation of the number of pages returned is broken
in this case; normally it's DIV_ROUND_UP(start + n, PAGE_SIZE).  That,
of course, gets broken even by the offset being large enough.  We don't
have that many users of that thing (and iov_iter_get_pages_alloc()), but
it'll need careful review.  What's more, looking at those shows other
fun issues:
        sg_init_table(sgl->sg, npages + 1);

        for (i = 0, len = n; i < npages; i++) {
                int plen = min_t(int, len, PAGE_SIZE - off);

                sg_set_page(sgl->sg + i, sgl->pages[i], plen, off);

and that'll instantly blow up, due to PAGE_SIZE - off possibly becoming
negative.  That's af_alg_make_sg(), and it shouldn't see anything
coming from pipe buffers (right now the only way for that to happen is
iter_file_splice_write()), but the things like e.g. dio_refill_pages()
might, and they also get seriously confused by that.  Worse, some of those
callers have calling conventions that have similar problems of their own.
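
To make that concrete: with 4K pages, once a single bvec may cover a
compound page we can legitimately see e.g. off == 5000, and then

        int plen = min_t(int, len, PAGE_SIZE - off);    /* 4096 - 5000 == -904 */

yields a negative plen, which sg_set_page() et al. will happily convert
into a huge unsigned length.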

At the moment there are 11 callers (10 in mainline; one more added in
conversion of vmsplice_to_pipe() to new pipe locking, but it's irrelevant
anyway - it gets fed an iovec-backed iov_iter).  I'm looking through those
right now, hopefully will come up with something sane...

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-18 22:31                                           ` Al Viro
@ 2016-09-19  0:18                                             ` Linus Torvalds
  2016-09-19  0:22                                               ` Al Viro
  1 sibling, 0 replies; 152+ messages in thread
From: Linus Torvalds @ 2016-09-19  0:18 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Nick Piggin, linux-fsdevel, Network Development,
	Eric Dumazet

On Sun, Sep 18, 2016 at 3:31 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What worries me is iov_iter_get_pages() and friends.

So honestly, if it worries you, I'm not going to complain at all if
you decide that you'd rather translate the pipe_buffer[] array into a
kvec by always splitting at page boundaries.

Even with large packets in networking, it's not going to be a huge
deal. And maybe we *should* make it a rule that a "kvec" is always
composed of individual entries that fit entirely within a page.

In this code, being safe rather than clever would be a welcome and
surprising change, I guess.

             Linus

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-18 22:31                                           ` Al Viro
@ 2016-09-19  0:22                                               ` Al Viro
  2016-09-19  0:22                                               ` Al Viro
  1 sibling, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-19  0:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Nick Piggin, linux-fsdevel, Network Development,
	Eric Dumazet

On Sun, Sep 18, 2016 at 11:31:17PM +0100, Al Viro wrote:

> At the moment there are 11 callers (10 in mainline; one more added in
> conversion of vmsplice_to_pipe() to new pipe locking, but it's irrelevant
> anyway - it gets fed an iovec-backed iov_iter).  I'm looking through those
> right now, hopefully will come up with something sane...

FWIW, I wonder how many of those users are ready to cope with compound
pages in the first place; they end up passed to
	* skb_fill_page_desc().  Probably OK (as in all of them, modulo
calculating the number of pages and ranges for them).
	* shoved into scatterlist, which gets passed to virtqueue_add_sgs().
Need to check virtio to see what happens there.
	* shoved into nfs ->wb_page and fed into nfs_pageio_add_request() and
machinery behind it.  These, BTW, are reachable by pipe_buffer-derived ones
at the moment (splice to O_DIRECT nfs file).  The code looks like it's
playing fast and loose with ->wb_page - in some cases it's an NFS pagecache
one, in some - anything from userland, and there are places like
	inode = page_file_mapping(req->wb_page)->host;
which will do nasty things if they are ever reached by the second kind.
nfs_pgio_rpcsetup() looks like it won't be happy with compound pages, but
again, I'm not familiar enough with that code to tell if it's reachable
from nfs_pageio_add_request().
	* shoved into scatterlist, which gets fed into crypto/*.c machinery.
No way for a pipe_buffer stuff to get there, fortunately, because I would
be very surprised if it works correctly with compound pages and large
ranges in those.
	* shoved into lustre ->ldp_pages; almost certainly not ready for
compound pages.
	* fed to ceph_osd_data_pages_init(); again, practically certain not
to be ready.
	* put into dio_submit ->pages[], eventually fed to bio_add_page();
that might be fixable, but it would take some massage in fs/direct-io.c
	* fuse - probably OK, but that's only on a fairly cursory look.

It certainly won't be easy to verify in detail ;-/

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-18  5:33                                 ` Al Viro
@ 2016-09-19  3:08                                   ` Nicholas Piggin
  2016-09-19  6:11                                     ` Al Viro
  0 siblings, 1 reply; 152+ messages in thread
From: Nicholas Piggin @ 2016-09-19  3:08 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, linux-fsdevel

On Sun, 18 Sep 2016 06:33:52 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> [finally Cc'd to fsdevel - should've done that several iterations upthread]
> 
> On Wed, Sep 14, 2016 at 01:39:25PM +1000, Nicholas Piggin wrote:
> 
> > Should not be so bad, but I don't have hard numbers for you. PAGEVEC_SIZE
> > is 14, and that's a conceptually rather similar operation (walk the radix tree;
> > grab pages). OTOH many archs are heavier and do locking and vma walking etc.
> > 
> > Documentation/features/vm/pte_special/arch-support.txt
> > 
> > But even for those, at 16 entries, the bulk of the cost *should* be hitting
> > struct page cachelines and refcounting. The rest should mostly stay in cache.  
> 
> OK...  That's actually important only for vmsplice_to_pipe() and 16-page
> array seems to be doing fine there.
> 
> Another question, now that you've finally resurfaced: could you reconstruct
> the story with page-stealing and breakage(s) thereof that had lead to
> commit 485ddb4b9741bafb70b22e5c1f9b4f37dc3e85bd
> Author: Nick Piggin <npiggin@suse.de>
> Date:   Tue Mar 27 08:55:08 2007 +0200
> 
>     1/2 splice: dont steal
> 
> I realize that it had been 9 years ago, but anything resembling a braindump
> would be very welcome.  Note that there is a couple of ->splice_write()
> instances that _do_ use ->steal() (fuse_dev_splice_write() and virtio_console
> port_fops_splice_write()) and I wonder if they suffer from the same problems;
> your commit message is rather short on details, unfortunately.  FUSE one
> is especially interesting...

Without looking through all the patches again, I believe the issue was
just that filesystems were not expecting (or at least, not audited to
expect) pages being added to their pagecache in that particular state
(they'd expect to go through ->readpage or see !uptodate in prepare_write).

If some wanted to attach metadata to uptodate pages for example, this
may have caused a problem. It wasn't some big fundamental problem, just a
mechanical one.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-19  3:08                                   ` Nicholas Piggin
@ 2016-09-19  6:11                                     ` Al Viro
  2016-09-19  7:26                                       ` Nicholas Piggin
  0 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-09-19  6:11 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, linux-fsdevel

On Mon, Sep 19, 2016 at 01:08:30PM +1000, Nicholas Piggin wrote:

> Without looking through all the patches again, I believe the issue was
> just that filesystems were not expecting (or at least, not audited to
> expect) pages being added to their pagecache in that particular state
> (they'd expect to go through ->readpage or see !uptodate in prepare_write).
> 
> If some wanted to attach metadata to uptodate pages for example, this
> may have caused a problem. It wasn't some big fundamental problem, just a
> mechanical one.

Umm...  Why not make it non-uptodate/locked, try to replace the original
with it in pagecache and then do full-page ->write_begin immediately
followed by full-page ->write_end?  Looks like that ought to work in
all in-tree cases...
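
I.e. something like (pseudocode; error handling and the "replace failed"
path omitted):

        /* new page is !uptodate and locked, as if freshly allocated */
        lock_page(new);
        err = replace_page_cache_page(old, new, GFP_KERNEL);
        /* now let the fs see an ordinary full-page overwrite */
        err = a_ops->write_begin(file, mapping, pos, PAGE_SIZE, 0,
                                 &page, &fsdata);
        err = a_ops->write_end(file, mapping, pos, PAGE_SIZE, PAGE_SIZE,
                               page, fsdata);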

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-19  6:11                                     ` Al Viro
@ 2016-09-19  7:26                                       ` Nicholas Piggin
  0 siblings, 0 replies; 152+ messages in thread
From: Nicholas Piggin @ 2016-09-19  7:26 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, linux-fsdevel

On Mon, 19 Sep 2016 07:11:21 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Mon, Sep 19, 2016 at 01:08:30PM +1000, Nicholas Piggin wrote:
> 
> > Without looking through all the patches again, I believe the issue was
> > just that filesystems were not expecting (or at least, not audited to
> > expect) pages being added to their pagecache in that particular state
> > (they'd expect to go through ->readpage or see !uptodate in prepare_write).
> > 
> > If some wanted to attach metadata to uptodate pages for example, this
> > may have caused a problem. It wasn't some big fundamental problem, just a
> > mechanical one.  
> 
> Umm...  Why not make it non-uptodate/locked, try to replace the original
> with it in pagecache and then do full-page ->write_begin immediately
> followed by full-page ->write_end?  Looks like that ought to work in
> all in-tree cases...

That sounds like it probably should work for that case. IIRC, I was looking
at using a write_begin flag to signal the case of replacing the page, so
the fs could also handle the case of replacing existing pagecache.


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
  2016-09-19  0:22                                               ` Al Viro
  (?)
@ 2016-09-20  9:51                                               ` Herbert Xu
  -1 siblings, 0 replies; 152+ messages in thread
From: Herbert Xu @ 2016-09-20  9:51 UTC (permalink / raw)
  To: Al Viro; +Cc: torvalds, axboe, npiggin, linux-fsdevel, netdev, edumazet

Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>        * shoved into scatterlist, which gets fed into crypto/*.c machinery.
> No way for a pipe_buffer stuff to get there, fortunately, because I would
> be very surprised if it works correctly with compound pages and large
> ranges in those.

FWIW the crypto API has always been supposed to handle SG entries
that cross page boundaries.  There were a couple of bugs in this
area but AFAIK they've all been fixed.

Of course I cannot guarantee that every crypto driver also handles
it correctly, but at least we have a few test vectors which test
the page-crossing case specifically.
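
For reference, "crossing a page boundary" here means a single SG entry
covering more than one page - e.g. (illustrative fragment, not taken from
any driver):

	struct scatterlist sg;
	/* kmalloc() memory is physically contiguous, so a single
	   entry may legitimately span from one page into the next */
	char *buf = kmalloc(2 * PAGE_SIZE, GFP_KERNEL);

	sg_init_one(&sg, buf + PAGE_SIZE / 2, PAGE_SIZE);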

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 152+ messages in thread

* [RFC][CFT] splice_read reworked
  2016-09-17 19:00                                     ` Al Viro
  2016-09-17 20:15                                       ` Linus Torvalds
  2016-09-18 19:31                                       ` skb_splice_bits() and large chunks in pipe (was " Al Viro
@ 2016-09-23 19:00                                       ` Al Viro
  2016-09-23 19:01                                         ` [PATCH 01/11] fix memory leaks in tracing_buffers_splice_read() Al Viro
                                                           ` (11 more replies)
  2 siblings, 12 replies; 152+ messages in thread
From: Al Viro @ 2016-09-23 19:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

	The series is supposed to solve the locking order problems for
->splice_read() and get rid of code duplication between the read-side
methods.
	pipe_lock is lifted out of ->splice_read() instances, along with
waiting for empty space in the pipe, etc.; we do that stuff in the callers.
	A new variant of iov_iter is introduced - it's backed by a pipe:
copy_to_iter() results in allocating pages and copying into those, while
copy_page_to_iter() just sticks a reference to the given page into the
pipe.  Running out of space in the pipe yields a short read, just as a
fault in an iovec-backed iov_iter would.  Enough primitives are
implemented for normal ->read_iter() instances to work.
	generic_file_splice_read() switched to feeding such iov_iter to
->read_iter() instance.  That turns out to be enough to kill almost all
->splice_read() instances; the only ones _not_ using generic_file_splice_read()
or default_file_splice_read() (== no zero-copy fallback) are
fuse_dev_splice_read(), 3 instances in kernel/{relay.c,trace/trace.c} and
sock_splice_read().  It's almost certainly possible to convert the fuse one,
and the same might be doable for the socket one.  The relay and tracing
stuff is just plain weird; it might or might not be doable.
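
	In those terms the conversion is small; stripped of the size checks
and of the cleanup of unused pipe slots on error, the new
generic_file_splice_read() is essentially:

	ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
					 struct pipe_inode_info *pipe,
					 size_t len, unsigned int flags)
	{
		struct iov_iter to;
		struct kiocb kiocb;
		ssize_t ret;

		iov_iter_pipe(&to, ITER_PIPE | READ, pipe, len);
		init_sync_kiocb(&kiocb, in);
		kiocb.ki_pos = *ppos;
		ret = in->f_op->read_iter(&kiocb, &to);
		if (ret > 0) {
			*ppos = kiocb.ki_pos;
			file_accessed(in);
		}
		return ret;
	}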

	Something hopefully working is in
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.splice_read

Several commits in that branch (#1, #8 and #9) are trivial cleanups and fixes
for crap caught while doing the rest; they probably ought to be separated.

Shortlog:
Al Viro (11):
      fix memory leaks in tracing_buffers_splice_read()
      splice_to_pipe(): don't open-code wakeup_pipe_readers()
      splice: switch get_iovec_page_array() to iov_iter
      splice: lift pipe_lock out of splice_to_pipe()
      skb_splice_bits(): get rid of callback
      new helper: add_to_pipe()
      fuse_dev_splice_read(): switch to add_to_pipe()
      cifs: don't use memcpy() to copy struct iov_iter
      fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter()
      new iov_iter flavour: pipe-backed
      switch generic_file_splice_read() to use of ->read_iter()

Diffstat:
 drivers/staging/lustre/lustre/llite/file.c         |  70 +--
 .../staging/lustre/lustre/llite/llite_internal.h   |  15 +-
 drivers/staging/lustre/lustre/llite/vvp_internal.h |  14 -
 drivers/staging/lustre/lustre/llite/vvp_io.c       |  45 +-
 fs/cifs/file.c                                     |  14 +-
 fs/coda/file.c                                     |  23 +-
 fs/fuse/dev.c                                      |  48 +-
 fs/fuse/file.c                                     |  30 +-
 fs/gfs2/file.c                                     |  28 +-
 fs/nfs/file.c                                      |  25 +-
 fs/nfs/internal.h                                  |   2 -
 fs/nfs/nfs4file.c                                  |   2 +-
 fs/ocfs2/file.c                                    |  34 +-
 fs/ocfs2/ocfs2_trace.h                             |   2 -
 fs/splice.c                                        | 578 +++++++--------------
 fs/xfs/xfs_file.c                                  |  41 +-
 fs/xfs/xfs_trace.h                                 |   1 -
 include/linux/fs.h                                 |   2 -
 include/linux/skbuff.h                             |   8 +-
 include/linux/splice.h                             |   3 +
 include/linux/uio.h                                |  14 +-
 kernel/trace/trace.c                               |  14 +-
 lib/iov_iter.c                                     | 390 +++++++++++++-
 mm/shmem.c                                         | 115 +---
 net/core/skbuff.c                                  |  28 +-
 net/ipv4/tcp.c                                     |   3 +-
 net/kcm/kcmsock.c                                  |  16 +-
 net/unix/af_unix.c                                 |  17 +-
 28 files changed, 648 insertions(+), 934 deletions(-)

	It's not all I would like to do there (in particular, I haven't
done the fuse splice_read conversion to read_iter, even though it does appear
to be doable; that'll take copy_page_to_iter_nosteal() as a new primitive
plus a considerable amount of massage in fs/fuse/dev.c), but it should at least
	* make pipe lock the outermost
	* switch generic_file_splice_read() to ->read_iter(), making
it suitable for lustre/coda/gfs2/ocfs2/xfs/shmem without any wrappers
	* somewhat simplify socket ->splice_read() guts (not by much - to
start doing that right we'd need the same new primitive)
	* remove a considerable pile of code.
	* get rid of a bunch of splice_{grow,shrink}_spd/splice_to_pipe
callers; remaining ones are in default_file_splice_read() (trivially
killable by conversion to iov_iter_get_pages_alloc(), followed by the
same build-an-iovec-array + kernel_readv sequence we use now, plus
iov_iter_advance to the length returned by kernel_readv; see the sketch
after this list), kernel/relay and kernel/trace/trace.c
ones (should switch to add_to_pipe(), AFAICS) and skb_splice_bits()
(again, a matter of copy_page_to_iter_nosteal(), which will take out
spd_can_coalesce/spd_fill_page in there as well).  Once the remaining ones
are taken care of, splice_pipe_desc and friends will go away.
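
	For the default_file_splice_read() part, that conversion would look
roughly like this (an untested sketch: allocation-failure checks and the
error-path cleanup of pipe slots are omitted; kernel_readv() is the helper
already in fs/splice.c):

	static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
						struct pipe_inode_info *pipe,
						size_t len, unsigned int flags)
	{
		struct iov_iter to;
		struct page **pages;
		struct iovec *vec;
		size_t start, left, chunk;
		ssize_t got, res;
		int nr, i;

		iov_iter_pipe(&to, ITER_PIPE | READ, pipe, len);

		/* the pipe-backed iterator allocates the target pages itself */
		got = iov_iter_get_pages_alloc(&to, &pages, len, &start);
		if (got <= 0)
			return got;
		nr = DIV_ROUND_UP(got + start, PAGE_SIZE);

		/* build an iovec array over those pages */
		vec = kmalloc_array(nr, sizeof(struct iovec), GFP_KERNEL);
		for (i = 0, left = got; i < nr; i++, start = 0) {
			chunk = min_t(size_t, left, PAGE_SIZE - start);
			vec[i].iov_base = page_address(pages[i]) + start;
			vec[i].iov_len = chunk;
			left -= chunk;
		}

		/* same kernel_readv() the current code uses */
		res = kernel_readv(in, vec, nr, *ppos);
		if (res > 0) {
			/* consume what was actually read; everything
			   past that point gets truncated */
			iov_iter_advance(&to, res);
			*ppos += res;
		}

		kfree(vec);
		for (i = 0; i < nr; i++)
			put_page(pages[i]);
		kvfree(pages);
		return res;
	}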

	In its current form it survives LTP, xfstests and the overlayfs
testsuite; if anybody has additional tests for splice and friends, I would
like to hear about them.  It really needs more beating, though.

	Please, help with review and testing.

^ permalink raw reply	[flat|nested] 152+ messages in thread

* [PATCH 01/11] fix memory leaks in tracing_buffers_splice_read()
  2016-09-23 19:00                                       ` [RFC][CFT] splice_read reworked Al Viro
@ 2016-09-23 19:01                                         ` Al Viro
  2016-09-23 19:02                                         ` [PATCH 02/11] splice_to_pipe(): don't open-code wakeup_pipe_readers() Al Viro
                                                           ` (10 subsequent siblings)
  11 siblings, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-23 19:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 kernel/trace/trace.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index dade4c9..9016f98 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6163,9 +6163,6 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
 		return -EBUSY;
 #endif
 
-	if (splice_grow_spd(pipe, &spd))
-		return -ENOMEM;
-
 	if (*ppos & (PAGE_SIZE - 1))
 		return -EINVAL;
 
@@ -6175,6 +6172,9 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
 		len &= PAGE_MASK;
 	}
 
+	if (splice_grow_spd(pipe, &spd))
+		return -ENOMEM;
+
  again:
 	trace_access_lock(iter->cpu_file);
 	entries = ring_buffer_entries_cpu(iter->trace_buffer->buffer, iter->cpu_file);
@@ -6232,19 +6232,21 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
 	/* did we read anything? */
 	if (!spd.nr_pages) {
 		if (ret)
-			return ret;
+			goto out;
 
+		ret = -EAGAIN;
 		if ((file->f_flags & O_NONBLOCK) || (flags & SPLICE_F_NONBLOCK))
-			return -EAGAIN;
+			goto out;
 
 		ret = wait_on_pipe(iter, true);
 		if (ret)
-			return ret;
+			goto out;
 
 		goto again;
 	}
 
 	ret = splice_to_pipe(pipe, &spd);
+out:
 	splice_shrink_spd(&spd);
 
 	return ret;
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 152+ messages in thread

* [PATCH 02/11] splice_to_pipe(): don't open-code wakeup_pipe_readers()
  2016-09-23 19:00                                       ` [RFC][CFT] splice_read reworked Al Viro
  2016-09-23 19:01                                         ` [PATCH 01/11] fix memory leaks in tracing_buffers_splice_read() Al Viro
@ 2016-09-23 19:02                                         ` Al Viro
  2016-09-23 19:02                                           ` Al Viro
                                                           ` (9 subsequent siblings)
  11 siblings, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-23 19:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/splice.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index dd9bf7e..36e9353 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -242,10 +242,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		}
 
 		if (do_wakeup) {
-			smp_mb();
-			if (waitqueue_active(&pipe->wait))
-				wake_up_interruptible_sync(&pipe->wait);
-			kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
+			wakeup_pipe_readers(pipe);
 			do_wakeup = 0;
 		}
 
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 152+ messages in thread

* [PATCH 03/11] splice: switch get_iovec_page_array() to iov_iter
  2016-09-23 19:00                                       ` [RFC][CFT] splice_read reworked Al Viro
@ 2016-09-23 19:02                                           ` Al Viro
  2016-09-23 19:02                                         ` [PATCH 02/11] splice_to_pipe(): don't open-code wakeup_pipe_readers() Al Viro
                                                             ` (10 subsequent siblings)
  11 siblings, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-23 19:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/splice.c | 135 ++++++++++++++++--------------------------------------------
 1 file changed, 36 insertions(+), 99 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 36e9353..31c52e0 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1434,106 +1434,32 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 	return -EINVAL;
 }
 
-/*
- * Map an iov into an array of pages and offset/length tupples. With the
- * partial_page structure, we can map several non-contiguous ranges into
- * our ones pages[] map instead of splitting that operation into pieces.
- * Could easily be exported as a generic helper for other users, in which
- * case one would probably want to add a 'max_nr_pages' parameter as well.
- */
-static int get_iovec_page_array(const struct iovec __user *iov,
-				unsigned int nr_vecs, struct page **pages,
-				struct partial_page *partial, bool aligned,
+static int get_iovec_page_array(struct iov_iter *from,
+				struct page **pages,
+				struct partial_page *partial,
 				unsigned int pipe_buffers)
 {
-	int buffers = 0, error = 0;
-
-	while (nr_vecs) {
-		unsigned long off, npages;
-		struct iovec entry;
-		void __user *base;
-		size_t len;
-		int i;
-
-		error = -EFAULT;
-		if (copy_from_user(&entry, iov, sizeof(entry)))
-			break;
-
-		base = entry.iov_base;
-		len = entry.iov_len;
-
-		/*
-		 * Sanity check this iovec. 0 read succeeds.
-		 */
-		error = 0;
-		if (unlikely(!len))
-			break;
-		error = -EFAULT;
-		if (!access_ok(VERIFY_READ, base, len))
-			break;
-
-		/*
-		 * Get this base offset and number of pages, then map
-		 * in the user pages.
-		 */
-		off = (unsigned long) base & ~PAGE_MASK;
-
-		/*
-		 * If asked for alignment, the offset must be zero and the
-		 * length a multiple of the PAGE_SIZE.
-		 */
-		error = -EINVAL;
-		if (aligned && (off || len & ~PAGE_MASK))
-			break;
-
-		npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
-		if (npages > pipe_buffers - buffers)
-			npages = pipe_buffers - buffers;
-
-		error = get_user_pages_fast((unsigned long)base, npages,
-					0, &pages[buffers]);
-
-		if (unlikely(error <= 0))
-			break;
-
-		/*
-		 * Fill this contiguous range into the partial page map.
-		 */
-		for (i = 0; i < error; i++) {
-			const int plen = min_t(size_t, len, PAGE_SIZE - off);
-
-			partial[buffers].offset = off;
-			partial[buffers].len = plen;
-
-			off = 0;
-			len -= plen;
+	int buffers = 0;
+	while (iov_iter_count(from)) {
+		ssize_t copied;
+		size_t start;
+
+		copied = iov_iter_get_pages(from, pages + buffers, ~0UL,
+					pipe_buffers - buffers, &start);
+		if (copied <= 0)
+			return buffers ? buffers : copied;
+
+		iov_iter_advance(from, copied);
+		while (copied) {
+			int size = min_t(int, copied, PAGE_SIZE - start);
+			partial[buffers].offset = start;
+			partial[buffers].len = size;
+			copied -= size;
+			start = 0;
 			buffers++;
 		}
-
-		/*
-		 * We didn't complete this iov, stop here since it probably
-		 * means we have to move some of this into a pipe to
-		 * be able to continue.
-		 */
-		if (len)
-			break;
-
-		/*
-		 * Don't continue if we mapped fewer pages than we asked for,
-		 * or if we mapped the max number of pages that we have
-		 * room for.
-		 */
-		if (error < npages || buffers == pipe_buffers)
-			break;
-
-		nr_vecs--;
-		iov++;
 	}
-
-	if (buffers)
-		return buffers;
-
-	return error;
+	return buffers;
 }
 
 static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
@@ -1587,10 +1513,13 @@ static long vmsplice_to_user(struct file *file, const struct iovec __user *uiov,
  * as splice-from-memory, where the regular splice is splice-from-file (or
  * to file). In both cases the output is a pipe, naturally.
  */
-static long vmsplice_to_pipe(struct file *file, const struct iovec __user *iov,
+static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 			     unsigned long nr_segs, unsigned int flags)
 {
 	struct pipe_inode_info *pipe;
+	struct iovec iovstack[UIO_FASTIOV];
+	struct iovec *iov = iovstack;
+	struct iov_iter from;
 	struct page *pages[PIPE_DEF_BUFFERS];
 	struct partial_page partial[PIPE_DEF_BUFFERS];
 	struct splice_pipe_desc spd = {
@@ -1607,11 +1536,18 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *iov,
 	if (!pipe)
 		return -EBADF;
 
-	if (splice_grow_spd(pipe, &spd))
+	ret = import_iovec(WRITE, uiov, nr_segs,
+			   ARRAY_SIZE(iovstack), &iov, &from);
+	if (ret < 0)
+		return ret;
+
+	if (splice_grow_spd(pipe, &spd)) {
+		kfree(iov);
 		return -ENOMEM;
+	}
 
-	spd.nr_pages = get_iovec_page_array(iov, nr_segs, spd.pages,
-					    spd.partial, false,
+	spd.nr_pages = get_iovec_page_array(&from, spd.pages,
+					    spd.partial,
 					    spd.nr_pages_max);
 	if (spd.nr_pages <= 0)
 		ret = spd.nr_pages;
@@ -1619,6 +1555,7 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *iov,
 		ret = splice_to_pipe(pipe, &spd);
 
 	splice_shrink_spd(&spd);
+	kfree(iov);
 	return ret;
 }
 
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 152+ messages in thread

* [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-23 19:00                                       ` [RFC][CFT] splice_read reworked Al Viro
                                                           ` (2 preceding siblings ...)
  2016-09-23 19:02                                           ` Al Viro
@ 2016-09-23 19:03                                         ` Al Viro
  2016-09-23 19:45                                           ` Linus Torvalds
  2016-09-23 19:03                                           ` Al Viro
                                                           ` (7 subsequent siblings)
  11 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-09-23 19:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

* splice_to_pipe() stops at pipe overflow and does *not* take pipe_lock
* ->splice_read() instances do the same
* vmsplice_to_pipe() and do_splice() (ultimate callers of splice_to_pipe())
  arrange for waiting, looping, etc. themselves.

That should make pipe_lock the outermost one.

Unfortunately, existing rules for the amount passed by vmsplice_to_pipe()
and do_splice() are quite ugly _and_ userland code can be easily broken
by changing those.  It's not even "no more than the maximal capacity of
this pipe" - it's "once we'd fed pipe->nr_buffers pages into the pipe,
leave instead of waiting".  I would like to change it to something saner,
but that's for later.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/fuse/dev.c |   2 -
 fs/splice.c   | 171 ++++++++++++++++++++++++++++++++--------------------------
 2 files changed, 96 insertions(+), 77 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index a94d2ed..eaf56c6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1364,7 +1364,6 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 		goto out;
 
 	ret = 0;
-	pipe_lock(pipe);
 
 	if (!pipe->readers) {
 		send_sig(SIGPIPE, current, 0);
@@ -1400,7 +1399,6 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 	}
 
 out_unlock:
-	pipe_unlock(pipe);
 
 	if (do_wakeup) {
 		smp_mb();
diff --git a/fs/splice.c b/fs/splice.c
index 31c52e0..9ce6e62 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -183,79 +183,41 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		       struct splice_pipe_desc *spd)
 {
 	unsigned int spd_pages = spd->nr_pages;
-	int ret, do_wakeup, page_nr;
+	int ret = 0, page_nr = 0;
 
 	if (!spd_pages)
 		return 0;
 
-	ret = 0;
-	do_wakeup = 0;
-	page_nr = 0;
-
-	pipe_lock(pipe);
-
-	for (;;) {
-		if (!pipe->readers) {
-			send_sig(SIGPIPE, current, 0);
-			if (!ret)
-				ret = -EPIPE;
-			break;
-		}
-
-		if (pipe->nrbufs < pipe->buffers) {
-			int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
-			struct pipe_buffer *buf = pipe->bufs + newbuf;
-
-			buf->page = spd->pages[page_nr];
-			buf->offset = spd->partial[page_nr].offset;
-			buf->len = spd->partial[page_nr].len;
-			buf->private = spd->partial[page_nr].private;
-			buf->ops = spd->ops;
-			if (spd->flags & SPLICE_F_GIFT)
-				buf->flags |= PIPE_BUF_FLAG_GIFT;
-
-			pipe->nrbufs++;
-			page_nr++;
-			ret += buf->len;
-
-			if (pipe->files)
-				do_wakeup = 1;
+	if (unlikely(!pipe->readers)) {
+		send_sig(SIGPIPE, current, 0);
+		ret = -EPIPE;
+		goto out;
+	}
 
-			if (!--spd->nr_pages)
-				break;
-			if (pipe->nrbufs < pipe->buffers)
-				continue;
+	while (pipe->nrbufs < pipe->buffers) {
+		int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+		struct pipe_buffer *buf = pipe->bufs + newbuf;
 
-			break;
-		}
+		buf->page = spd->pages[page_nr];
+		buf->offset = spd->partial[page_nr].offset;
+		buf->len = spd->partial[page_nr].len;
+		buf->private = spd->partial[page_nr].private;
+		buf->ops = spd->ops;
+		if (spd->flags & SPLICE_F_GIFT)
+			buf->flags |= PIPE_BUF_FLAG_GIFT;
 
-		if (spd->flags & SPLICE_F_NONBLOCK) {
-			if (!ret)
-				ret = -EAGAIN;
-			break;
-		}
+		pipe->nrbufs++;
+		page_nr++;
+		ret += buf->len;
 
-		if (signal_pending(current)) {
-			if (!ret)
-				ret = -ERESTARTSYS;
+		if (!--spd->nr_pages)
 			break;
-		}
-
-		if (do_wakeup) {
-			wakeup_pipe_readers(pipe);
-			do_wakeup = 0;
-		}
-
-		pipe->waiting_writers++;
-		pipe_wait(pipe);
-		pipe->waiting_writers--;
 	}
 
-	pipe_unlock(pipe);
-
-	if (do_wakeup)
-		wakeup_pipe_readers(pipe);
+	if (!ret)
+		ret = -EAGAIN;
 
+out:
 	while (page_nr < spd_pages)
 		spd->spd_release(spd, page_nr++);
 
@@ -1339,6 +1301,27 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
 }
 EXPORT_SYMBOL(do_splice_direct);
 
+static bool splice_more(struct pipe_inode_info *pipe,
+			long *p, unsigned flags)
+{
+	if (pipe->nrbufs < pipe->buffers) // no overflows
+		return false;
+	if (flags & SPLICE_F_NONBLOCK) // not allowed to wait
+		return false;
+	if (*p < 0 && *p != -EAGAIN) // error happened
+		return false;
+	if (signal_pending(current)) { // interrupted
+		*p = -ERESTARTSYS;
+		return false;
+	}
+	if (*p > 0)
+		wakeup_pipe_readers(pipe);
+	pipe->waiting_writers++;
+	pipe_wait(pipe);
+	pipe->waiting_writers--;
+	return true;
+}
+
 static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			       struct pipe_inode_info *opipe,
 			       size_t len, unsigned int flags);
@@ -1410,6 +1393,8 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 	}
 
 	if (opipe) {
+		size_t total = 0;
+		int bogus_count;
 		if (off_out)
 			return -ESPIPE;
 		if (off_in) {
@@ -1421,8 +1406,25 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 			offset = in->f_pos;
 		}
 
-		ret = do_splice_to(in, &offset, opipe, len, flags);
-
+		ret = 0;
+		pipe_lock(opipe);
+		bogus_count = opipe->buffers;
+		do {
+			bogus_count += opipe->nrbufs;
+			ret = do_splice_to(in, &offset, opipe, len, flags);
+			if (ret > 0) {
+				total += ret;
+				len -= ret;
+			}
+			bogus_count -= opipe->nrbufs;
+			if (bogus_count <= 0)
+				break;
+		} while (len && splice_more(opipe, &ret, flags));
+		pipe_unlock(opipe);
+		if (total) {
+			wakeup_pipe_readers(opipe);
+			ret = total;
+		}
 		if (!off_in)
 			in->f_pos = offset;
 		else if (copy_to_user(off_in, &offset, sizeof(loff_t)))
@@ -1434,22 +1436,23 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 	return -EINVAL;
 }
 
-static int get_iovec_page_array(struct iov_iter *from,
+static int get_iovec_page_array(const struct iov_iter *from,
 				struct page **pages,
 				struct partial_page *partial,
 				unsigned int pipe_buffers)
 {
+	struct iov_iter i = *from;
 	int buffers = 0;
-	while (iov_iter_count(from)) {
+	while (iov_iter_count(&i)) {
 		ssize_t copied;
 		size_t start;
 
-		copied = iov_iter_get_pages(from, pages + buffers, ~0UL,
+		copied = iov_iter_get_pages(&i, pages + buffers, ~0UL,
 					pipe_buffers - buffers, &start);
 		if (copied <= 0)
 			return buffers ? buffers : copied;
 
-		iov_iter_advance(from, copied);
+		iov_iter_advance(&i, copied);
 		while (copied) {
 			int size = min_t(int, copied, PAGE_SIZE - start);
 			partial[buffers].offset = start;
@@ -1530,7 +1533,8 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 		.ops = &user_page_pipe_buf_ops,
 		.spd_release = spd_release_page,
 	};
-	long ret;
+	long ret, total = 0;
+	int bogus_count;
 
 	pipe = get_pipe_info(file);
 	if (!pipe)
@@ -1546,14 +1550,31 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 		return -ENOMEM;
 	}
 
-	spd.nr_pages = get_iovec_page_array(&from, spd.pages,
-					    spd.partial,
-					    spd.nr_pages_max);
-	if (spd.nr_pages <= 0)
-		ret = spd.nr_pages;
-	else
+	pipe_lock(pipe);
+	bogus_count = pipe->buffers;
+	do {
+		bogus_count += pipe->nrbufs;
+		spd.nr_pages = get_iovec_page_array(&from, spd.pages,
+						    spd.partial,
+						    spd.nr_pages_max);
+		if (spd.nr_pages <= 0) {
+			ret = spd.nr_pages;
+			break;
+		}
 		ret = splice_to_pipe(pipe, &spd);
-
+		if (ret > 0) {
+			total += ret;
+			iov_iter_advance(&from, ret);
+		}
+		bogus_count -= pipe->nrbufs;
+		if (bogus_count <= 0)
+			break;
+	} while (iov_iter_count(&from) && splice_more(pipe, &ret, flags));
+	pipe_unlock(pipe);
+	if (total) {
+		wakeup_pipe_readers(pipe);
+		ret = total;
+	}
 	splice_shrink_spd(&spd);
 	kfree(iov);
 	return ret;
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 152+ messages in thread

* [PATCH 05/11] skb_splice_bits(): get rid of callback
  2016-09-23 19:00                                       ` [RFC][CFT] splice_read reworked Al Viro
@ 2016-09-23 19:03                                           ` Al Viro
  2016-09-23 19:02                                         ` [PATCH 02/11] splice_to_pipe(): don't open-code wakeup_pipe_readers() Al Viro
                                                             ` (10 subsequent siblings)
  11 siblings, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-23 19:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

Since pipe_lock is now the outermost lock, we don't need to drop/regain
socket locks around the call of splice_to_pipe() from skb_splice_bits(),
which kills the need for a socket-specific callback; we can just
call splice_to_pipe() and be done with it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 include/linux/skbuff.h |  8 +-------
 net/core/skbuff.c      | 28 ++--------------------------
 net/ipv4/tcp.c         |  3 +--
 net/kcm/kcmsock.c      | 16 +---------------
 net/unix/af_unix.c     | 17 +----------------
 5 files changed, 6 insertions(+), 66 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0f665cb..f520251 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3021,15 +3021,9 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
 int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
 __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
 			      int len, __wsum csum);
-ssize_t skb_socket_splice(struct sock *sk,
-			  struct pipe_inode_info *pipe,
-			  struct splice_pipe_desc *spd);
 int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int len,
-		    unsigned int flags,
-		    ssize_t (*splice_cb)(struct sock *,
-					 struct pipe_inode_info *,
-					 struct splice_pipe_desc *));
+		    unsigned int flags);
 void skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to);
 unsigned int skb_zerocopy_headlen(const struct sk_buff *from);
 int skb_zerocopy(struct sk_buff *to, struct sk_buff *from,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3864b4b6..208a9bc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1962,37 +1962,13 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
 	return false;
 }
 
-ssize_t skb_socket_splice(struct sock *sk,
-			  struct pipe_inode_info *pipe,
-			  struct splice_pipe_desc *spd)
-{
-	int ret;
-
-	/* Drop the socket lock, otherwise we have reverse
-	 * locking dependencies between sk_lock and i_mutex
-	 * here as compared to sendfile(). We enter here
-	 * with the socket lock held, and splice_to_pipe() will
-	 * grab the pipe inode lock. For sendfile() emulation,
-	 * we call into ->sendpage() with the i_mutex lock held
-	 * and networking will grab the socket lock.
-	 */
-	release_sock(sk);
-	ret = splice_to_pipe(pipe, spd);
-	lock_sock(sk);
-
-	return ret;
-}
-
 /*
  * Map data from the skb to a pipe. Should handle both the linear part,
  * the fragments, and the frag list.
  */
 int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int tlen,
-		    unsigned int flags,
-		    ssize_t (*splice_cb)(struct sock *,
-					 struct pipe_inode_info *,
-					 struct splice_pipe_desc *))
+		    unsigned int flags)
 {
 	struct partial_page partial[MAX_SKB_FRAGS];
 	struct page *pages[MAX_SKB_FRAGS];
@@ -2009,7 +1985,7 @@ int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
 	__skb_splice_bits(skb, pipe, &offset, &tlen, &spd, sk);
 
 	if (spd.nr_pages)
-		ret = splice_cb(sk, pipe, &spd);
+		ret = splice_to_pipe(pipe, &spd);
 
 	return ret;
 }
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ffbb218..ddd2179 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -688,8 +688,7 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
 	int ret;
 
 	ret = skb_splice_bits(skb, skb->sk, offset, tss->pipe,
-			      min(rd_desc->count, len), tss->flags,
-			      skb_socket_splice);
+			      min(rd_desc->count, len), tss->flags);
 	if (ret > 0)
 		rd_desc->count -= ret;
 	return ret;
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index cb39e05..994baae 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -1461,19 +1461,6 @@ out:
 	return copied ? : err;
 }
 
-static ssize_t kcm_sock_splice(struct sock *sk,
-			       struct pipe_inode_info *pipe,
-			       struct splice_pipe_desc *spd)
-{
-	int ret;
-
-	release_sock(sk);
-	ret = splice_to_pipe(pipe, spd);
-	lock_sock(sk);
-
-	return ret;
-}
-
 static ssize_t kcm_splice_read(struct socket *sock, loff_t *ppos,
 			       struct pipe_inode_info *pipe, size_t len,
 			       unsigned int flags)
@@ -1503,8 +1490,7 @@ static ssize_t kcm_splice_read(struct socket *sock, loff_t *ppos,
 	if (len > rxm->full_len)
 		len = rxm->full_len;
 
-	copied = skb_splice_bits(skb, sk, rxm->offset, pipe, len, flags,
-				 kcm_sock_splice);
+	copied = skb_splice_bits(skb, sk, rxm->offset, pipe, len, flags);
 	if (copied < 0) {
 		err = copied;
 		goto err_out;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f1dffe8..e7707ca 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2488,28 +2488,13 @@ static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
 	return unix_stream_read_generic(&state);
 }
 
-static ssize_t skb_unix_socket_splice(struct sock *sk,
-				      struct pipe_inode_info *pipe,
-				      struct splice_pipe_desc *spd)
-{
-	int ret;
-	struct unix_sock *u = unix_sk(sk);
-
-	mutex_unlock(&u->readlock);
-	ret = splice_to_pipe(pipe, spd);
-	mutex_lock(&u->readlock);
-
-	return ret;
-}
-
 static int unix_stream_splice_actor(struct sk_buff *skb,
 				    int skip, int chunk,
 				    struct unix_stream_read_state *state)
 {
 	return skb_splice_bits(skb, state->socket->sk,
 			       UNIXCB(skb).consumed + skip,
-			       state->pipe, chunk, state->splice_flags,
-			       skb_unix_socket_splice);
+			       state->pipe, chunk, state->splice_flags);
 }
 
 static ssize_t unix_stream_splice_read(struct socket *sock,  loff_t *ppos,
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 152+ messages in thread

* [PATCH 06/11] new helper: add_to_pipe()
  2016-09-23 19:00                                       ` [RFC][CFT] splice_read reworked Al Viro
                                                           ` (4 preceding siblings ...)
  2016-09-23 19:03                                           ` Al Viro
@ 2016-09-23 19:04                                         ` Al Viro
  2016-09-23 19:04                                         ` [PATCH 07/11] fuse_dev_splice_read(): switch to add_to_pipe() Al Viro
                                                           ` (5 subsequent siblings)
  11 siblings, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-23 19:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

single-buffer analogue of splice_to_pipe(); vmsplice_to_pipe() switched
to that, leaving splice_to_pipe() only for ->splice_read() instances
(and that only until they are converted as well).
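
The caller-side pattern looks like this (as in the fuse conversion in the
next patch; 'page' and 'len' illustrative):

	struct pipe_buffer buf = {
		.ops	= &nosteal_pipe_buf_ops,
		.page	= page,
		.offset	= 0,
		.len	= len,
	};
	ssize_t ret;

	/* pipe_lock is held further up the call chain; on failure
	   add_to_pipe() releases the buffer and clears buf.ops */
	ret = add_to_pipe(pipe, &buf);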

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/splice.c            | 109 ++++++++++++++++++++++++++++---------------------
 include/linux/splice.h |   2 +
 2 files changed, 64 insertions(+), 47 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 9ce6e62..085ad37 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -203,8 +203,6 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		buf->len = spd->partial[page_nr].len;
 		buf->private = spd->partial[page_nr].private;
 		buf->ops = spd->ops;
-		if (spd->flags & SPLICE_F_GIFT)
-			buf->flags |= PIPE_BUF_FLAG_GIFT;
 
 		pipe->nrbufs++;
 		page_nr++;
@@ -225,6 +223,27 @@ out:
 }
 EXPORT_SYMBOL_GPL(splice_to_pipe);
 
+ssize_t add_to_pipe(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
+{
+	int ret;
+
+	if (unlikely(!pipe->readers)) {
+		send_sig(SIGPIPE, current, 0);
+		ret = -EPIPE;
+	} else if (pipe->nrbufs == pipe->buffers) {
+		ret = -EAGAIN;
+	} else {
+		int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+		pipe->bufs[newbuf] = *buf;
+		pipe->nrbufs++;
+		return buf->len;
+	}
+	buf->ops->release(pipe, buf);
+	buf->ops = NULL;
+	return ret;
+}
+EXPORT_SYMBOL(add_to_pipe);
+
 void spd_release_page(struct splice_pipe_desc *spd, unsigned int i)
 {
 	put_page(spd->pages[i]);
@@ -1436,33 +1455,50 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 	return -EINVAL;
 }
 
-static int get_iovec_page_array(const struct iov_iter *from,
-				struct page **pages,
-				struct partial_page *partial,
-				unsigned int pipe_buffers)
+static int iter_to_pipe(struct iov_iter *from,
+			struct pipe_inode_info *pipe,
+			unsigned flags)
 {
-	struct iov_iter i = *from;
-	int buffers = 0;
-	while (iov_iter_count(&i)) {
+	struct pipe_buffer buf = {
+		.ops = &user_page_pipe_buf_ops,
+		.flags = flags
+	};
+	size_t total = 0;
+	int ret = 0;
+	bool failed = false;
+
+	while (iov_iter_count(from) && !failed) {
+		struct page *pages[16];
 		ssize_t copied;
 		size_t start;
+		int n;
 
-		copied = iov_iter_get_pages(&i, pages + buffers, ~0UL,
-					pipe_buffers - buffers, &start);
-		if (copied <= 0)
-			return buffers ? buffers : copied;
+		copied = iov_iter_get_pages(from, pages, ~0UL, 16, &start);
+		if (copied <= 0) {
+			ret = copied;
+			break;
+		}
 
-		iov_iter_advance(&i, copied);
-		while (copied) {
+		for (n = 0; copied; n++, start = 0) {
 			int size = min_t(int, copied, PAGE_SIZE - start);
-			partial[buffers].offset = start;
-			partial[buffers].len = size;
+			if (!failed) {
+				buf.page = pages[n];
+				buf.offset = start;
+				buf.len = size;
+				ret = add_to_pipe(pipe, &buf);
+				if (unlikely(ret < 0)) {
+					failed = true;
+				} else {
+					iov_iter_advance(from, ret);
+					total += ret;
+				}
+			} else {
+				put_page(pages[n]);
+			}
 			copied -= size;
-			start = 0;
-			buffers++;
 		}
 	}
-	return buffers;
+	return total ? total : ret;
 }
 
 static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
@@ -1523,19 +1559,13 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 	struct iovec iovstack[UIO_FASTIOV];
 	struct iovec *iov = iovstack;
 	struct iov_iter from;
-	struct page *pages[PIPE_DEF_BUFFERS];
-	struct partial_page partial[PIPE_DEF_BUFFERS];
-	struct splice_pipe_desc spd = {
-		.pages = pages,
-		.partial = partial,
-		.nr_pages_max = PIPE_DEF_BUFFERS,
-		.flags = flags,
-		.ops = &user_page_pipe_buf_ops,
-		.spd_release = spd_release_page,
-	};
 	long ret, total = 0;
+	unsigned buf_flag = 0;
 	int bogus_count;
 
+	if (flags & SPLICE_F_GIFT)
+		buf_flag = PIPE_BUF_FLAG_GIFT;
+
 	pipe = get_pipe_info(file);
 	if (!pipe)
 		return -EBADF;
@@ -1545,27 +1575,13 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 	if (ret < 0)
 		return ret;
 
-	if (splice_grow_spd(pipe, &spd)) {
-		kfree(iov);
-		return -ENOMEM;
-	}
-
 	pipe_lock(pipe);
 	bogus_count = pipe->buffers;
 	do {
 		bogus_count += pipe->nrbufs;
-		spd.nr_pages = get_iovec_page_array(&from, spd.pages,
-						    spd.partial,
-						    spd.nr_pages_max);
-		if (spd.nr_pages <= 0) {
-			ret = spd.nr_pages;
-			break;
-		}
-		ret = splice_to_pipe(pipe, &spd);
-		if (ret > 0) {
+		ret = iter_to_pipe(&from, pipe, buf_flag);
+		if (ret > 0)
 			total += ret;
-			iov_iter_advance(&from, ret);
-		}
 		bogus_count -= pipe->nrbufs;
 		if (bogus_count <= 0)
 			break;
@@ -1575,7 +1591,6 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 		wakeup_pipe_readers(pipe);
 		ret = total;
 	}
-	splice_shrink_spd(&spd);
 	kfree(iov);
 	return ret;
 }
diff --git a/include/linux/splice.h b/include/linux/splice.h
index da2751d..58b300f 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -72,6 +72,8 @@ extern ssize_t __splice_from_pipe(struct pipe_inode_info *,
 				  struct splice_desc *, splice_actor *);
 extern ssize_t splice_to_pipe(struct pipe_inode_info *,
 			      struct splice_pipe_desc *);
+extern ssize_t add_to_pipe(struct pipe_inode_info *,
+			      struct pipe_buffer *);
 extern ssize_t splice_direct_to_actor(struct file *, struct splice_desc *,
 				      splice_direct_actor *);
 
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 152+ messages in thread

* [PATCH 07/11] fuse_dev_splice_read(): switch to add_to_pipe()
  2016-09-23 19:00                                       ` [RFC][CFT] splice_read reworked Al Viro
                                                           ` (5 preceding siblings ...)
  2016-09-23 19:04                                         ` [PATCH 06/11] new helper: add_to_pipe() Al Viro
@ 2016-09-23 19:04                                         ` Al Viro
  2016-09-23 19:06                                         ` [PATCH 08/11] cifs: don't use memcpy() to copy struct iov_iter Al Viro
                                                           ` (4 subsequent siblings)
  11 siblings, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-23 19:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/fuse/dev.c | 46 +++++++++-------------------------------------
 1 file changed, 9 insertions(+), 37 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index eaf56c6..0a6a808 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1342,9 +1342,8 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 				    struct pipe_inode_info *pipe,
 				    size_t len, unsigned int flags)
 {
-	int ret;
+	int total, ret;
 	int page_nr = 0;
-	int do_wakeup = 0;
 	struct pipe_buffer *bufs;
 	struct fuse_copy_state cs;
 	struct fuse_dev *fud = fuse_get_dev(in);
@@ -1363,50 +1362,23 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 	if (ret < 0)
 		goto out;
 
-	ret = 0;
-
-	if (!pipe->readers) {
-		send_sig(SIGPIPE, current, 0);
-		if (!ret)
-			ret = -EPIPE;
-		goto out_unlock;
-	}
-
 	if (pipe->nrbufs + cs.nr_segs > pipe->buffers) {
 		ret = -EIO;
-		goto out_unlock;
+		goto out;
 	}
 
-	while (page_nr < cs.nr_segs) {
-		int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
-		struct pipe_buffer *buf = pipe->bufs + newbuf;
-
-		buf->page = bufs[page_nr].page;
-		buf->offset = bufs[page_nr].offset;
-		buf->len = bufs[page_nr].len;
+	for (ret = total = 0; page_nr < cs.nr_segs; total += ret) {
 		/*
 		 * Need to be careful about this.  Having buf->ops in module
 		 * code can Oops if the buffer persists after module unload.
 		 */
-		buf->ops = &nosteal_pipe_buf_ops;
-
-		pipe->nrbufs++;
-		page_nr++;
-		ret += buf->len;
-
-		if (pipe->files)
-			do_wakeup = 1;
-	}
-
-out_unlock:
-
-	if (do_wakeup) {
-		smp_mb();
-		if (waitqueue_active(&pipe->wait))
-			wake_up_interruptible(&pipe->wait);
-		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
+		bufs[page_nr].ops = &nosteal_pipe_buf_ops;
+		ret = add_to_pipe(pipe, &bufs[page_nr++]);
+		if (unlikely(ret < 0))
+			break;
 	}
-
+	if (total)
+		ret = total;
 out:
 	for (; page_nr < cs.nr_segs; page_nr++)
 		put_page(bufs[page_nr].page);
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 152+ messages in thread

* [PATCH 08/11] cifs: don't use memcpy() to copy struct iov_iter
  2016-09-23 19:00                                       ` [RFC][CFT] splice_read reworked Al Viro
                                                           ` (6 preceding siblings ...)
  2016-09-23 19:04                                         ` [PATCH 07/11] fuse_dev_splice_read(): switch to add_to_pipe() Al Viro
@ 2016-09-23 19:06                                         ` Al Viro
  2016-09-23 19:08                                         ` [PATCH 09/11] fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter() Al Viro
                                                           ` (3 subsequent siblings)
  11 siblings, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-23 19:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

it's not the 70s anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
[obviously should be separated; trivial cleanup almost unrelated to series]
 fs/cifs/file.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 579e41b..42b99af 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2478,7 +2478,7 @@ cifs_write_from_iter(loff_t offset, size_t len, struct iov_iter *from,
 	size_t cur_len;
 	unsigned long nr_pages, num_pages, i;
 	struct cifs_writedata *wdata;
-	struct iov_iter saved_from;
+	struct iov_iter saved_from = *from;
 	loff_t saved_offset = offset;
 	pid_t pid;
 	struct TCP_Server_Info *server;
@@ -2489,7 +2489,6 @@ cifs_write_from_iter(loff_t offset, size_t len, struct iov_iter *from,
 		pid = current->tgid;
 
 	server = tlink_tcon(open_file->tlink)->ses->server;
-	memcpy(&saved_from, from, sizeof(struct iov_iter));
 
 	do {
 		unsigned int wsize, credits;
@@ -2551,8 +2550,7 @@ cifs_write_from_iter(loff_t offset, size_t len, struct iov_iter *from,
 			kref_put(&wdata->refcount,
 				 cifs_uncached_writedata_release);
 			if (rc == -EAGAIN) {
-				memcpy(from, &saved_from,
-				       sizeof(struct iov_iter));
+				*from = saved_from;
 				iov_iter_advance(from, offset - saved_offset);
 				continue;
 			}
@@ -2576,7 +2574,7 @@ ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from)
 	struct cifs_sb_info *cifs_sb;
 	struct cifs_writedata *wdata, *tmp;
 	struct list_head wdata_list;
-	struct iov_iter saved_from;
+	struct iov_iter saved_from = *from;
 	int rc;
 
 	/*
@@ -2597,8 +2595,6 @@ ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from)
 	if (!tcon->ses->server->ops->async_writev)
 		return -ENOSYS;
 
-	memcpy(&saved_from, from, sizeof(struct iov_iter));
-
 	rc = cifs_write_from_iter(iocb->ki_pos, iov_iter_count(from), from,
 				  open_file, cifs_sb, &wdata_list);
 
@@ -2631,13 +2627,11 @@ restart_loop:
 			/* resend call if it's a retryable error */
 			if (rc == -EAGAIN) {
 				struct list_head tmp_list;
-				struct iov_iter tmp_from;
+				struct iov_iter tmp_from = saved_from;
 
 				INIT_LIST_HEAD(&tmp_list);
 				list_del_init(&wdata->list);
 
-				memcpy(&tmp_from, &saved_from,
-				       sizeof(struct iov_iter));
 				iov_iter_advance(&tmp_from,
 						 wdata->offset - iocb->ki_pos);
 
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 152+ messages in thread

* [PATCH 09/11] fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter()
  2016-09-23 19:00                                       ` [RFC][CFT] splice_read reworked Al Viro
                                                           ` (7 preceding siblings ...)
  2016-09-23 19:06                                         ` [PATCH 08/11] cifs: don't use memcpy() to copy struct iov_iter Al Viro
@ 2016-09-23 19:08                                         ` Al Viro
  2016-09-26  9:31                                           ` Miklos Szeredi
  2016-09-23 19:09                                         ` [PATCH 10/11] new iov_iter flavour: pipe-backed Al Viro
                                                           ` (2 subsequent siblings)
  11 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-09-23 19:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
[another cleanup, will be moved out of that branch]
 fs/fuse/file.c | 30 +++++++-----------------------
 1 file changed, 7 insertions(+), 23 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3988b43..4c1db6c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2339,31 +2339,15 @@ static int fuse_ioctl_copy_user(struct page **pages, struct iovec *iov,
 
 	while (iov_iter_count(&ii)) {
 		struct page *page = pages[page_idx++];
-		size_t todo = min_t(size_t, PAGE_SIZE, iov_iter_count(&ii));
-		void *kaddr;
+		size_t copied;
 
-		kaddr = kmap(page);
-
-		while (todo) {
-			char __user *uaddr = ii.iov->iov_base + ii.iov_offset;
-			size_t iov_len = ii.iov->iov_len - ii.iov_offset;
-			size_t copy = min(todo, iov_len);
-			size_t left;
-
-			if (!to_user)
-				left = copy_from_user(kaddr, uaddr, copy);
-			else
-				left = copy_to_user(uaddr, kaddr, copy);
-
-			if (unlikely(left))
-				return -EFAULT;
-
-			iov_iter_advance(&ii, copy);
-			todo -= copy;
-			kaddr += copy;
-		}
+		if (!to_user)
+			copied = copy_page_from_iter(page, 0, PAGE_SIZE, &ii);
+		else
+			copied = copy_page_to_iter(page, 0, PAGE_SIZE, &ii);
 
-		kunmap(page);
+		if (unlikely(copied != PAGE_SIZE && iov_iter_count(&ii)))
+			return -EFAULT;
 	}
 
 	return 0;
-- 
2.9.3



* [PATCH 10/11] new iov_iter flavour: pipe-backed
  2016-09-23 19:00                                       ` [RFC][CFT] splice_read reworked Al Viro
                                                           ` (8 preceding siblings ...)
  2016-09-23 19:08                                         ` [PATCH 09/11] fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter() Al Viro
@ 2016-09-23 19:09                                         ` Al Viro
  2016-09-23 19:10                                         ` [PATCH 11/11] switch generic_file_splice_read() to use of ->read_iter() Al Viro
  2016-09-30 13:32                                         ` [RFC][CFT] splice_read reworked CAI Qian
  11 siblings, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-23 19:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

iov_iter variant for passing data into pipe.  copy_to_iter()
copies data into page(s) it has allocated and stuffs them into
the pipe; copy_page_to_iter() stuffs there a reference to the
page given to it.  Both will try to coalesce if possible.
iov_iter_zero() is similar to copy_to_iter(); iov_iter_get_pages()
and friends will do as copy_to_iter() would have and return the
pages where the data would've been copied.  iov_iter_advance()
will truncate everything past the spot it has advanced to.

New primitive: iov_iter_pipe(), used for initializing those.
pipe should be locked all along.

Running out of space acts as fault would for iovec-backed ones;
in other words, giving it to ->read_iter() may result in short
read if the pipe overflows, or -EFAULT if it happens with nothing
copied there.

In other words, ->read_iter() on those acts pretty much like
->splice_read().  Moreover, all generic_file_splice_read() users,
as well as many other ->splice_read() instances can be switched
to that scheme - that'll happen in the next commit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
[this certainly needs to be documented in more details]
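[until then, a minimal sketch of the intended calling pattern - illustrative
only, built from the primitives added below; per the rest of the series the
caller is assumed to hold pipe_lock() for the duration, and error handling
is elided (pipe/file/ppos/len as in generic_file_splice_read()):

	struct iov_iter to;
	struct kiocb kiocb;
	ssize_t ret;

	pipe_lock(pipe);		/* "pipe should be locked all along" */
	iov_iter_pipe(&to, ITER_PIPE | READ, pipe, len);
	init_sync_kiocb(&kiocb, file);
	kiocb.ki_pos = *ppos;
	ret = file->f_op->read_iter(&kiocb, &to); /* acts like ->splice_read() */
	pipe_unlock(pipe);

a short read means the pipe filled up; -EFAULT with nothing copied means it
was full to start with]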
 fs/splice.c            |   2 +-
 include/linux/splice.h |   1 +
 include/linux/uio.h    |  14 +-
 lib/iov_iter.c         | 390 ++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 401 insertions(+), 6 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 085ad37..0daa7d1 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -524,7 +524,7 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 }
 EXPORT_SYMBOL(generic_file_splice_read);
 
-static const struct pipe_buf_operations default_pipe_buf_ops = {
+const struct pipe_buf_operations default_pipe_buf_ops = {
 	.can_merge = 0,
 	.confirm = generic_pipe_buf_confirm,
 	.release = generic_pipe_buf_release,
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 58b300f..00a2116 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -85,4 +85,5 @@ extern void splice_shrink_spd(struct splice_pipe_desc *);
 extern void spd_release_page(struct splice_pipe_desc *, unsigned int);
 
 extern const struct pipe_buf_operations page_cache_pipe_buf_ops;
+extern const struct pipe_buf_operations default_pipe_buf_ops;
 #endif
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 1b5d1cd..c4fe1ab 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -13,6 +13,7 @@
 #include <uapi/linux/uio.h>
 
 struct page;
+struct pipe_inode_info;
 
 struct kvec {
 	void *iov_base; /* and that should *never* hold a userland pointer */
@@ -23,6 +24,7 @@ enum {
 	ITER_IOVEC = 0,
 	ITER_KVEC = 2,
 	ITER_BVEC = 4,
+	ITER_PIPE = 8,
 };
 
 struct iov_iter {
@@ -33,8 +35,12 @@ struct iov_iter {
 		const struct iovec *iov;
 		const struct kvec *kvec;
 		const struct bio_vec *bvec;
+		struct pipe_inode_info *pipe;
+	};
+	union {
+		unsigned long nr_segs;
+		int idx;
 	};
-	unsigned long nr_segs;
 };
 
 /*
@@ -64,7 +70,7 @@ static inline struct iovec iov_iter_iovec(const struct iov_iter *iter)
 }
 
 #define iov_for_each(iov, iter, start)				\
-	if (!((start).type & ITER_BVEC))			\
+	if (!((start).type & (ITER_BVEC | ITER_PIPE)))		\
 	for (iter = (start);					\
 	     (iter).count &&					\
 	     ((iov = iov_iter_iovec(&(iter))), 1);		\
@@ -94,6 +100,8 @@ void iov_iter_kvec(struct iov_iter *i, int direction, const struct kvec *kvec,
 			unsigned long nr_segs, size_t count);
 void iov_iter_bvec(struct iov_iter *i, int direction, const struct bio_vec *bvec,
 			unsigned long nr_segs, size_t count);
+void iov_iter_pipe(struct iov_iter *i, int direction, struct pipe_inode_info *pipe,
+			size_t count);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
@@ -109,7 +117,7 @@ static inline size_t iov_iter_count(struct iov_iter *i)
 
 static inline bool iter_is_iovec(struct iov_iter *i)
 {
-	return !(i->type & (ITER_BVEC | ITER_KVEC));
+	return !(i->type & (ITER_BVEC | ITER_KVEC | ITER_PIPE));
 }
 
 /*
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 9e8c738..02efc898 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -3,8 +3,11 @@
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
+#include <linux/splice.h>
 #include <net/checksum.h>
 
+#define PIPE_PARANOIA /* for now */
+
 #define iterate_iovec(i, n, __v, __p, skip, STEP) {	\
 	size_t left;					\
 	size_t wanted = n;				\
@@ -290,6 +293,82 @@ done:
 	return wanted - bytes;
 }
 
+#ifdef PIPE_PARANOIA
+static bool sanity(const struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	int idx = i->idx;
+	int delta = (pipe->curbuf + pipe->nrbufs - idx) & (pipe->buffers - 1);
+	if (i->iov_offset) {
+		struct pipe_buffer *p;
+		if (unlikely(delta != 1) || unlikely(!pipe->nrbufs))
+			goto Bad;	// must be at the last buffer...
+
+		p = &pipe->bufs[idx];
+		if (unlikely(p->offset + p->len != i->iov_offset))
+			goto Bad;	// ... at the end of segment
+	} else {
+		if (delta)
+			goto Bad;	// must be right after the last buffer
+	}
+	return true;
+Bad:
+	WARN_ON(1);
+	return false;
+}
+#else
+#define sanity(i) true
+#endif
+
+static inline int next_idx(int idx, struct pipe_inode_info *pipe)
+{
+	return (idx + 1) & (pipe->buffers - 1);
+}
+
+static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
+			 struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	struct pipe_buffer *buf;
+	size_t off;
+	int idx;
+
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	if (!sanity(i))
+		return 0;
+
+	off = i->iov_offset;
+	idx = i->idx;
+	buf = &pipe->bufs[idx];
+	if (off) {
+		if (offset == off && buf->page == page) {
+			/* merge with the last one */
+			buf->len += bytes;
+			i->iov_offset += bytes;
+			goto out;
+		}
+		idx = next_idx(idx, pipe);
+		buf = &pipe->bufs[idx];
+	}
+	if (idx == pipe->curbuf && pipe->nrbufs)
+		return 0;
+	pipe->nrbufs++;
+	buf->ops = &page_cache_pipe_buf_ops;
+	get_page(buf->page = page);
+	buf->offset = offset;
+	buf->len = bytes;
+	i->iov_offset = offset + bytes;
+	i->idx = idx;
+out:
+	i->count -= bytes;
+	return bytes;
+}
+
 /*
  * Fault in the first iovec of the given iov_iter, to a maximum length
  * of bytes. Returns 0 on success, or non-zero if the memory could not be
@@ -376,9 +455,98 @@ static void memzero_page(struct page *page, size_t offset, size_t len)
 	kunmap_atomic(addr);
 }
 
+static inline bool allocated(struct pipe_buffer *buf)
+{
+	return buf->ops == &default_pipe_buf_ops;
+}
+
+static inline void data_start(const struct iov_iter *i, int *idxp, size_t *offp)
+{
+	size_t off = i->iov_offset;
+	int idx = i->idx;
+	if (off && (!allocated(&i->pipe->bufs[idx]) || off == PAGE_SIZE)) {
+		idx = next_idx(idx, i->pipe);
+		off = 0;
+	}
+	*idxp = idx;
+	*offp = off;
+}
+
+static size_t push_pipe(struct iov_iter *i, size_t size,
+			int *idxp, size_t *offp)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	size_t off;
+	int idx;
+	ssize_t left;
+
+	if (unlikely(size > i->count))
+		size = i->count;
+	if (unlikely(!size))
+		return 0;
+
+	left = size;
+	data_start(i, &idx, &off);
+	*idxp = idx;
+	*offp = off;
+	if (off) {
+		left -= PAGE_SIZE - off;
+		if (left <= 0) {
+			pipe->bufs[idx].len += size;
+			return size;
+		}
+		pipe->bufs[idx].len = PAGE_SIZE;
+		idx = next_idx(idx, pipe);
+	}
+	while (idx != pipe->curbuf || !pipe->nrbufs) {
+		struct page *page = alloc_page(GFP_USER);
+		if (!page)
+			break;
+		pipe->nrbufs++;
+		pipe->bufs[idx].ops = &default_pipe_buf_ops;
+		pipe->bufs[idx].page = page;
+		pipe->bufs[idx].offset = 0;
+		if (left <= PAGE_SIZE) {
+			pipe->bufs[idx].len = left;
+			return size;
+		}
+		pipe->bufs[idx].len = PAGE_SIZE;
+		left -= PAGE_SIZE;
+		idx = next_idx(idx, pipe);
+	}
+	return size - left;
+}
+
+static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
+				struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	size_t n, off;
+	int idx;
+
+	if (!sanity(i))
+		return 0;
+
+	bytes = n = push_pipe(i, bytes, &idx, &off);
+	if (unlikely(!n))
+		return 0;
+	for ( ; n; idx = next_idx(idx, pipe), off = 0) {
+		size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
+		memcpy_to_page(pipe->bufs[idx].page, off, addr, chunk);
+		i->idx = idx;
+		i->iov_offset = off + chunk;
+		n -= chunk;
+		addr += chunk;
+	}
+	i->count -= bytes;
+	return bytes;
+}
+
 size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
 	const char *from = addr;
+	if (unlikely(i->type & ITER_PIPE))
+		return copy_pipe_to_iter(addr, bytes, i);
 	iterate_and_advance(i, bytes, v,
 		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
 			       v.iov_len),
@@ -394,6 +562,10 @@ EXPORT_SYMBOL(copy_to_iter);
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 {
 	char *to = addr;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
 	iterate_and_advance(i, bytes, v,
 		__copy_from_user((to += v.iov_len) - v.iov_len, v.iov_base,
 				 v.iov_len),
@@ -409,6 +581,10 @@ EXPORT_SYMBOL(copy_from_iter);
 size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 {
 	char *to = addr;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
 	iterate_and_advance(i, bytes, v,
 		__copy_from_user_nocache((to += v.iov_len) - v.iov_len,
 					 v.iov_base, v.iov_len),
@@ -429,14 +605,20 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 		size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
 		kunmap_atomic(kaddr);
 		return wanted;
-	} else
+	} else if (likely(!(i->type & ITER_PIPE)))
 		return copy_page_to_iter_iovec(page, offset, bytes, i);
+	else
+		return copy_page_to_iter_pipe(page, offset, bytes, i);
 }
 EXPORT_SYMBOL(copy_page_to_iter);
 
 size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
 	if (i->type & (ITER_BVEC|ITER_KVEC)) {
 		void *kaddr = kmap_atomic(page);
 		size_t wanted = copy_from_iter(kaddr + offset, bytes, i);
@@ -447,8 +629,34 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 }
 EXPORT_SYMBOL(copy_page_from_iter);
 
+static size_t pipe_zero(size_t bytes, struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	size_t n, off;
+	int idx;
+
+	if (!sanity(i))
+		return 0;
+
+	bytes = n = push_pipe(i, bytes, &idx, &off);
+	if (unlikely(!n))
+		return 0;
+
+	for ( ; n; idx = next_idx(idx, pipe), off = 0) {
+		size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
+		memzero_page(pipe->bufs[idx].page, off, chunk);
+		i->idx = idx;
+		i->iov_offset = off + chunk;
+		n -= chunk;
+	}
+	i->count -= bytes;
+	return bytes;
+}
+
 size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 {
+	if (unlikely(i->type & ITER_PIPE))
+		return pipe_zero(bytes, i);
 	iterate_and_advance(i, bytes, v,
 		__clear_user(v.iov_base, v.iov_len),
 		memzero_page(v.bv_page, v.bv_offset, v.bv_len),
@@ -463,6 +671,11 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 		struct iov_iter *i, unsigned long offset, size_t bytes)
 {
 	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
+	if (unlikely(i->type & ITER_PIPE)) {
+		kunmap_atomic(kaddr);
+		WARN_ON(1);
+		return 0;
+	}
 	iterate_all_kinds(i, bytes, v,
 		__copy_from_user_inatomic((p += v.iov_len) - v.iov_len,
 					  v.iov_base, v.iov_len),
@@ -475,8 +688,55 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 }
 EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
 
+static void pipe_advance(struct iov_iter *i, size_t size)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	struct pipe_buffer *buf;
+	size_t off;
+	int idx;
+	
+	if (unlikely(i->count < size))
+		size = i->count;
+
+	idx = i->idx;
+	off = i->iov_offset;
+	if (size || off) {
+		/* take it relative to the beginning of buffer */
+		size += off - pipe->bufs[idx].offset;
+		while (1) {
+			buf = &pipe->bufs[idx];
+			if (size > buf->len) {
+				size -= buf->len;
+				idx = next_idx(idx, pipe);
+				off = 0;
+			} else {
+				buf->len = size;
+				i->idx = idx;
+				i->iov_offset = off = buf->offset + size;
+				break;
+			}
+		}
+		idx = next_idx(idx, pipe);
+	}
+	if (pipe->nrbufs) {
+		int unused = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+		/* [curbuf,unused) is in use.  Free [idx,unused) */
+		while (idx != unused) {
+			buf = &pipe->bufs[idx];
+			buf->ops->release(pipe, buf);
+			buf->ops = NULL;
+			idx = next_idx(idx, pipe);
+			pipe->nrbufs--;
+		}
+	}
+}
+
 void iov_iter_advance(struct iov_iter *i, size_t size)
 {
+	if (unlikely(i->type & ITER_PIPE)) {
+		pipe_advance(i, size);
+		return;
+	}
 	iterate_and_advance(i, size, v, 0, 0, 0)
 }
 EXPORT_SYMBOL(iov_iter_advance);
@@ -486,6 +746,8 @@ EXPORT_SYMBOL(iov_iter_advance);
  */
 size_t iov_iter_single_seg_count(const struct iov_iter *i)
 {
+	if (unlikely(i->type & ITER_PIPE))
+		return i->count;	// it is a silly place, anyway
 	if (i->nr_segs == 1)
 		return i->count;
 	else if (i->type & ITER_BVEC)
@@ -521,6 +783,19 @@ void iov_iter_bvec(struct iov_iter *i, int direction,
 }
 EXPORT_SYMBOL(iov_iter_bvec);
 
+void iov_iter_pipe(struct iov_iter *i, int direction,
+			struct pipe_inode_info *pipe,
+			size_t count)
+{
+	BUG_ON(direction != ITER_PIPE);
+	i->type = direction;
+	i->pipe = pipe;
+	i->idx = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+	i->iov_offset = 0;
+	i->count = count;
+}
+EXPORT_SYMBOL(iov_iter_pipe);
+
 unsigned long iov_iter_alignment(const struct iov_iter *i)
 {
 	unsigned long res = 0;
@@ -529,6 +804,11 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
 	if (!size)
 		return 0;
 
+	if (unlikely(i->type & ITER_PIPE)) {
+		if (i->iov_offset && allocated(&i->pipe->bufs[i->idx]))
+			return size | i->iov_offset;
+		return size;
+	}
 	iterate_all_kinds(i, size, v,
 		(res |= (unsigned long)v.iov_base | v.iov_len, 0),
 		res |= v.bv_offset | v.bv_len,
@@ -545,6 +825,11 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
 	if (!size)
 		return 0;
 
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return ~0U;
+	}
+
 	iterate_all_kinds(i, size, v,
 		(res |= (!res ? 0 : (unsigned long)v.iov_base) |
 			(size != v.iov_len ? size : 0), 0),
@@ -557,6 +842,47 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_gap_alignment);
 
+static inline size_t __pipe_get_pages(struct iov_iter *i,
+				size_t maxsize,
+				struct page **pages,
+				int idx,
+				size_t *start)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	size_t n = push_pipe(i, maxsize, &idx, start);
+	if (!n)
+		return 0;
+
+	maxsize = n;
+	n += *start;
+	while (n >= PAGE_SIZE) {
+		*pages++ = pipe->bufs[idx].page;
+		idx = next_idx(idx, pipe);
+		n -= PAGE_SIZE;
+	}
+
+	return maxsize;
+}
+
+static ssize_t pipe_get_pages(struct iov_iter *i,
+		   struct page **pages, size_t maxsize, unsigned maxpages,
+		   size_t *start)
+{
+	unsigned npages;
+	size_t capacity;
+	int idx;
+
+	if (!sanity(i))
+		return 0;
+
+	data_start(i, &idx, start);
+	/* some of this one + all after this one */
+	npages = ((i->pipe->curbuf - idx - 1) & (i->pipe->buffers - 1)) + 1;
+	capacity = min(npages,maxpages) * PAGE_SIZE - *start;
+
+	return __pipe_get_pages(i, min(maxsize, capacity), pages, idx, start);
+}
+
 ssize_t iov_iter_get_pages(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
@@ -567,6 +893,8 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 	if (!maxsize)
 		return 0;
 
+	if (unlikely(i->type & ITER_PIPE))
+		return pipe_get_pages(i, pages, maxsize, maxpages, start);
 	iterate_all_kinds(i, maxsize, v, ({
 		unsigned long addr = (unsigned long)v.iov_base;
 		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
@@ -602,6 +930,37 @@ static struct page **get_pages_array(size_t n)
 	return p;
 }
 
+static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
+		   struct page ***pages, size_t maxsize,
+		   size_t *start)
+{
+	struct page **p;
+	size_t n;
+	int idx;
+	int npages;
+
+	if (!sanity(i))
+		return 0;
+
+	data_start(i, &idx, start);
+	/* some of this one + all after this one */
+	npages = ((i->pipe->curbuf - idx - 1) & (i->pipe->buffers - 1)) + 1;
+	n = npages * PAGE_SIZE - *start;
+	if (maxsize > n)
+		maxsize = n;
+	else
+		npages = DIV_ROUND_UP(maxsize + *start, PAGE_SIZE);
+	p = get_pages_array(npages);
+	if (!p)
+		return -ENOMEM;
+	n = __pipe_get_pages(i, maxsize, p, idx, start);
+	if (n)
+		*pages = p;
+	else
+		kvfree(p);
+	return n;
+}
+
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start)
@@ -614,6 +973,8 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 	if (!maxsize)
 		return 0;
 
+	if (unlikely(i->type & ITER_PIPE))
+		return pipe_get_pages_alloc(i, pages, maxsize, start);
 	iterate_all_kinds(i, maxsize, v, ({
 		unsigned long addr = (unsigned long)v.iov_base;
 		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
@@ -655,6 +1016,10 @@ size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 	__wsum sum, next;
 	size_t off = 0;
 	sum = *csum;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
 	iterate_and_advance(i, bytes, v, ({
 		int err = 0;
 		next = csum_and_copy_from_user(v.iov_base, 
@@ -693,6 +1058,10 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum,
 	__wsum sum, next;
 	size_t off = 0;
 	sum = *csum;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);	/* for now */
+		return 0;
+	}
 	iterate_and_advance(i, bytes, v, ({
 		int err = 0;
 		next = csum_and_copy_to_user((from += v.iov_len) - v.iov_len,
@@ -732,7 +1101,20 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 	if (!size)
 		return 0;
 
-	iterate_all_kinds(i, size, v, ({
+	if (unlikely(i->type & ITER_PIPE)) {
+		struct pipe_inode_info *pipe = i->pipe;
+		size_t off;
+		int idx;
+
+		if (!sanity(i))
+			return 0;
+
+		data_start(i, &idx, &off);
+		/* some of this one + all after this one */
+		npages = ((pipe->curbuf - idx - 1) & (pipe->buffers - 1)) + 1;
+		if (npages >= maxpages)
+			return maxpages;
+	} else iterate_all_kinds(i, size, v, ({
 		unsigned long p = (unsigned long)v.iov_base;
 		npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
 			- p / PAGE_SIZE;
@@ -757,6 +1139,10 @@ EXPORT_SYMBOL(iov_iter_npages);
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags)
 {
 	*new = *old;
+	if (unlikely(new->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return NULL;
+	}
 	if (new->type & ITER_BVEC)
 		return new->bvec = kmemdup(new->bvec,
 				    new->nr_segs * sizeof(struct bio_vec),
-- 
2.9.3



* [PATCH 11/11] switch generic_file_splice_read() to use of ->read_iter()
  2016-09-23 19:00                                       ` [RFC][CFT] splice_read reworked Al Viro
                                                           ` (9 preceding siblings ...)
  2016-09-23 19:09                                         ` [PATCH 10/11] new iov_iter flavour: pipe-backed Al Viro
@ 2016-09-23 19:10                                         ` Al Viro
  2016-09-30 13:32                                         ` [RFC][CFT] splice_read reworked CAI Qian
  11 siblings, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-23 19:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

... and kill the ->splice_read() instances that can be switched to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
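[note for reviewers: for a filesystem whose ->read_iter() already takes the
locks its old ->splice_read() wrapper took, the conversion is just the
method table change; a sketch with hypothetical names (foo_lock_shared()
and friends stand in for whatever the fs uses):

	static ssize_t foo_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
	{
		struct inode *inode = file_inode(iocb->ki_filp);
		ssize_t ret;

		foo_lock_shared(inode);		/* hypothetical per-fs lock */
		ret = generic_file_read_iter(iocb, to);
		foo_unlock_shared(inode);
		return ret;
	}

	const struct file_operations foo_file_operations = {
		.llseek		= generic_file_llseek,
		.read_iter	= foo_file_read_iter,
		.splice_read	= generic_file_splice_read,	/* via ->read_iter() */
	};

that is what the gfs2/nfs/ocfs2/xfs deletions below amount to]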
 drivers/staging/lustre/lustre/llite/file.c         |  70 ++----
 .../staging/lustre/lustre/llite/llite_internal.h   |  15 +-
 drivers/staging/lustre/lustre/llite/vvp_internal.h |  14 --
 drivers/staging/lustre/lustre/llite/vvp_io.c       |  45 +---
 fs/coda/file.c                                     |  23 +-
 fs/gfs2/file.c                                     |  28 +--
 fs/nfs/file.c                                      |  25 +--
 fs/nfs/internal.h                                  |   2 -
 fs/nfs/nfs4file.c                                  |   2 +-
 fs/ocfs2/file.c                                    |  34 +--
 fs/ocfs2/ocfs2_trace.h                             |   2 -
 fs/splice.c                                        | 238 +++------------------
 fs/xfs/xfs_file.c                                  |  41 +---
 fs/xfs/xfs_trace.h                                 |   1 -
 include/linux/fs.h                                 |   2 -
 mm/shmem.c                                         | 115 +---------
 16 files changed, 57 insertions(+), 600 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
index 57281b9..2567b09 100644
--- a/drivers/staging/lustre/lustre/llite/file.c
+++ b/drivers/staging/lustre/lustre/llite/file.c
@@ -1153,36 +1153,21 @@ restart:
 		int write_mutex_locked = 0;
 
 		vio->vui_fd  = LUSTRE_FPRIVATE(file);
-		vio->vui_io_subtype = args->via_io_subtype;
-
-		switch (vio->vui_io_subtype) {
-		case IO_NORMAL:
-			vio->vui_iter = args->u.normal.via_iter;
-			vio->vui_iocb = args->u.normal.via_iocb;
-			if ((iot == CIT_WRITE) &&
-			    !(vio->vui_fd->fd_flags & LL_FILE_GROUP_LOCKED)) {
-				if (mutex_lock_interruptible(&lli->
-							       lli_write_mutex)) {
-					result = -ERESTARTSYS;
-					goto out;
-				}
-				write_mutex_locked = 1;
+		vio->vui_iter = args->u.normal.via_iter;
+		vio->vui_iocb = args->u.normal.via_iocb;
+		if ((iot == CIT_WRITE) &&
+		    !(vio->vui_fd->fd_flags & LL_FILE_GROUP_LOCKED)) {
+			if (mutex_lock_interruptible(&lli->lli_write_mutex)) {
+				result = -ERESTARTSYS;
+				goto out;
 			}
-			down_read(&lli->lli_trunc_sem);
-			break;
-		case IO_SPLICE:
-			vio->u.splice.vui_pipe = args->u.splice.via_pipe;
-			vio->u.splice.vui_flags = args->u.splice.via_flags;
-			break;
-		default:
-			CERROR("Unknown IO type - %u\n", vio->vui_io_subtype);
-			LBUG();
+			write_mutex_locked = 1;
 		}
+		down_read(&lli->lli_trunc_sem);
 		ll_cl_add(file, env, io);
 		result = cl_io_loop(env, io);
 		ll_cl_remove(file, env);
-		if (args->via_io_subtype == IO_NORMAL)
-			up_read(&lli->lli_trunc_sem);
+		up_read(&lli->lli_trunc_sem);
 		if (write_mutex_locked)
 			mutex_unlock(&lli->lli_write_mutex);
 	} else {
@@ -1237,7 +1222,7 @@ static ssize_t ll_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (IS_ERR(env))
 		return PTR_ERR(env);
 
-	args = ll_env_args(env, IO_NORMAL);
+	args = ll_env_args(env);
 	args->u.normal.via_iter = to;
 	args->u.normal.via_iocb = iocb;
 
@@ -1261,7 +1246,7 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (IS_ERR(env))
 		return PTR_ERR(env);
 
-	args = ll_env_args(env, IO_NORMAL);
+	args = ll_env_args(env);
 	args->u.normal.via_iter = from;
 	args->u.normal.via_iocb = iocb;
 
@@ -1271,31 +1256,6 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	return result;
 }
 
-/*
- * Send file content (through pagecache) somewhere with helper
- */
-static ssize_t ll_file_splice_read(struct file *in_file, loff_t *ppos,
-				   struct pipe_inode_info *pipe, size_t count,
-				   unsigned int flags)
-{
-	struct lu_env      *env;
-	struct vvp_io_args *args;
-	ssize_t	     result;
-	int		 refcheck;
-
-	env = cl_env_get(&refcheck);
-	if (IS_ERR(env))
-		return PTR_ERR(env);
-
-	args = ll_env_args(env, IO_SPLICE);
-	args->u.splice.via_pipe = pipe;
-	args->u.splice.via_flags = flags;
-
-	result = ll_file_io_generic(env, args, in_file, CIT_READ, ppos, count);
-	cl_env_put(env, &refcheck);
-	return result;
-}
-
 static int ll_lov_recreate(struct inode *inode, struct ost_id *oi, u32 ost_idx)
 {
 	struct obd_export *exp = ll_i2dtexp(inode);
@@ -3173,7 +3133,7 @@ struct file_operations ll_file_operations = {
 	.release	= ll_file_release,
 	.mmap	   = ll_file_mmap,
 	.llseek	 = ll_file_seek,
-	.splice_read    = ll_file_splice_read,
+	.splice_read    = generic_file_splice_read,
 	.fsync	  = ll_fsync,
 	.flush	  = ll_flush
 };
@@ -3186,7 +3146,7 @@ struct file_operations ll_file_operations_flock = {
 	.release	= ll_file_release,
 	.mmap	   = ll_file_mmap,
 	.llseek	 = ll_file_seek,
-	.splice_read    = ll_file_splice_read,
+	.splice_read    = generic_file_splice_read,
 	.fsync	  = ll_fsync,
 	.flush	  = ll_flush,
 	.flock	  = ll_file_flock,
@@ -3202,7 +3162,7 @@ struct file_operations ll_file_operations_noflock = {
 	.release	= ll_file_release,
 	.mmap	   = ll_file_mmap,
 	.llseek	 = ll_file_seek,
-	.splice_read    = ll_file_splice_read,
+	.splice_read    = generic_file_splice_read,
 	.fsync	  = ll_fsync,
 	.flush	  = ll_flush,
 	.flock	  = ll_file_noflock,
diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
index 4d6d589..0e738c8 100644
--- a/drivers/staging/lustre/lustre/llite/llite_internal.h
+++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
@@ -800,17 +800,11 @@ void vvp_write_complete(struct vvp_object *club, struct vvp_page *page);
  */
 struct vvp_io_args {
 	/** normal/splice */
-	enum vvp_io_subtype via_io_subtype;
-
 	union {
 		struct {
 			struct kiocb      *via_iocb;
 			struct iov_iter   *via_iter;
 		} normal;
-		struct {
-			struct pipe_inode_info  *via_pipe;
-			unsigned int       via_flags;
-		} splice;
 	} u;
 };
 
@@ -838,14 +832,9 @@ static inline struct ll_thread_info *ll_env_info(const struct lu_env *env)
 	return lti;
 }
 
-static inline struct vvp_io_args *ll_env_args(const struct lu_env *env,
-					      enum vvp_io_subtype type)
+static inline struct vvp_io_args *ll_env_args(const struct lu_env *env)
 {
-	struct vvp_io_args *via = &ll_env_info(env)->lti_args;
-
-	via->via_io_subtype = type;
-
-	return via;
+	return &ll_env_info(env)->lti_args;
 }
 
 void ll_queue_done_writing(struct inode *inode, unsigned long flags);
diff --git a/drivers/staging/lustre/lustre/llite/vvp_internal.h b/drivers/staging/lustre/lustre/llite/vvp_internal.h
index 79fc428..2fa49cc 100644
--- a/drivers/staging/lustre/lustre/llite/vvp_internal.h
+++ b/drivers/staging/lustre/lustre/llite/vvp_internal.h
@@ -49,14 +49,6 @@ struct obd_device;
 struct obd_export;
 struct page;
 
-/* specific architecture can implement only part of this list */
-enum vvp_io_subtype {
-	/** normal IO */
-	IO_NORMAL,
-	/** io started from splice_{read|write} */
-	IO_SPLICE
-};
-
 /**
  * IO state private to IO state private to VVP layer.
  */
@@ -99,10 +91,6 @@ struct vvp_io {
 			bool		ft_flags_valid;
 		} fault;
 		struct {
-			struct pipe_inode_info	*vui_pipe;
-			unsigned int		 vui_flags;
-		} splice;
-		struct {
 			struct cl_page_list vui_queue;
 			unsigned long vui_written;
 			int vui_from;
@@ -110,8 +98,6 @@ struct vvp_io {
 		} write;
 	} u;
 
-	enum vvp_io_subtype	vui_io_subtype;
-
 	/**
 	 * Layout version when this IO is initialized
 	 */
diff --git a/drivers/staging/lustre/lustre/llite/vvp_io.c b/drivers/staging/lustre/lustre/llite/vvp_io.c
index 94916dc..4864600 100644
--- a/drivers/staging/lustre/lustre/llite/vvp_io.c
+++ b/drivers/staging/lustre/lustre/llite/vvp_io.c
@@ -55,18 +55,6 @@ static struct vvp_io *cl2vvp_io(const struct lu_env *env,
 }
 
 /**
- * True, if \a io is a normal io, False for splice_{read,write}
- */
-static int cl_is_normalio(const struct lu_env *env, const struct cl_io *io)
-{
-	struct vvp_io *vio = vvp_env_io(env);
-
-	LASSERT(io->ci_type == CIT_READ || io->ci_type == CIT_WRITE);
-
-	return vio->vui_io_subtype == IO_NORMAL;
-}
-
-/**
  * For swapping layout. The file's layout may have changed.
  * To avoid populating pages to a wrong stripe, we have to verify the
  * correctness of layout. It works because swapping layout processes
@@ -391,9 +379,6 @@ static int vvp_mmap_locks(const struct lu_env *env,
 
 	LASSERT(io->ci_type == CIT_READ || io->ci_type == CIT_WRITE);
 
-	if (!cl_is_normalio(env, io))
-		return 0;
-
 	if (!vio->vui_iter) /* nfs or loop back device write */
 		return 0;
 
@@ -462,15 +447,10 @@ static void vvp_io_advance(const struct lu_env *env,
 			   const struct cl_io_slice *ios,
 			   size_t nob)
 {
-	struct vvp_io    *vio = cl2vvp_io(env, ios);
-	struct cl_io     *io  = ios->cis_io;
 	struct cl_object *obj = ios->cis_io->ci_obj;
-
+	struct vvp_io	 *vio = cl2vvp_io(env, ios);
 	CLOBINVRNT(env, obj, vvp_object_invariant(obj));
 
-	if (!cl_is_normalio(env, io))
-		return;
-
 	iov_iter_reexpand(vio->vui_iter, vio->vui_tot_count  -= nob);
 }
 
@@ -479,7 +459,7 @@ static void vvp_io_update_iov(const struct lu_env *env,
 {
 	size_t size = io->u.ci_rw.crw_count;
 
-	if (!cl_is_normalio(env, io) || !vio->vui_iter)
+	if (!vio->vui_iter)
 		return;
 
 	iov_iter_truncate(vio->vui_iter, size);
@@ -716,25 +696,8 @@ static int vvp_io_read_start(const struct lu_env *env,
 
 	/* BUG: 5972 */
 	file_accessed(file);
-	switch (vio->vui_io_subtype) {
-	case IO_NORMAL:
-		LASSERT(vio->vui_iocb->ki_pos == pos);
-		result = generic_file_read_iter(vio->vui_iocb, vio->vui_iter);
-		break;
-	case IO_SPLICE:
-		result = generic_file_splice_read(file, &pos,
-						  vio->u.splice.vui_pipe, cnt,
-						  vio->u.splice.vui_flags);
-		/* LU-1109: do splice read stripe by stripe otherwise if it
-		 * may make nfsd stuck if this read occupied all internal pipe
-		 * buffers.
-		 */
-		io->ci_continue = 0;
-		break;
-	default:
-		CERROR("Wrong IO type %u\n", vio->vui_io_subtype);
-		LBUG();
-	}
+	LASSERT(vio->vui_iocb->ki_pos == pos);
+	result = generic_file_read_iter(vio->vui_iocb, vio->vui_iter);
 
 out:
 	if (result >= 0) {
diff --git a/fs/coda/file.c b/fs/coda/file.c
index f47c748..8415d4f 100644
--- a/fs/coda/file.c
+++ b/fs/coda/file.c
@@ -38,27 +38,6 @@ coda_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 }
 
 static ssize_t
-coda_file_splice_read(struct file *coda_file, loff_t *ppos,
-		      struct pipe_inode_info *pipe, size_t count,
-		      unsigned int flags)
-{
-	ssize_t (*splice_read)(struct file *, loff_t *,
-			       struct pipe_inode_info *, size_t, unsigned int);
-	struct coda_file_info *cfi;
-	struct file *host_file;
-
-	cfi = CODA_FTOC(coda_file);
-	BUG_ON(!cfi || cfi->cfi_magic != CODA_MAGIC);
-	host_file = cfi->cfi_container;
-
-	splice_read = host_file->f_op->splice_read;
-	if (!splice_read)
-		splice_read = default_file_splice_read;
-
-	return splice_read(host_file, ppos, pipe, count, flags);
-}
-
-static ssize_t
 coda_file_write_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct file *coda_file = iocb->ki_filp;
@@ -225,6 +204,6 @@ const struct file_operations coda_file_operations = {
 	.open		= coda_open,
 	.release	= coda_release,
 	.fsync		= coda_fsync,
-	.splice_read	= coda_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 };
 
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 320e65e..7016a6a7 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -954,30 +954,6 @@ out_uninit:
 	return ret;
 }
 
-static ssize_t gfs2_file_splice_read(struct file *in, loff_t *ppos,
-				     struct pipe_inode_info *pipe, size_t len,
-				     unsigned int flags)
-{
-	struct inode *inode = in->f_mapping->host;
-	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_holder gh;
-	int ret;
-
-	inode_lock(inode);
-
-	ret = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
-	if (ret) {
-		inode_unlock(inode);
-		return ret;
-	}
-
-	gfs2_glock_dq_uninit(&gh);
-	inode_unlock(inode);
-
-	return generic_file_splice_read(in, ppos, pipe, len, flags);
-}
-
-
 static ssize_t gfs2_file_splice_write(struct pipe_inode_info *pipe,
 				      struct file *out, loff_t *ppos,
 				      size_t len, unsigned int flags)
@@ -1140,7 +1116,7 @@ const struct file_operations gfs2_file_fops = {
 	.fsync		= gfs2_fsync,
 	.lock		= gfs2_lock,
 	.flock		= gfs2_flock,
-	.splice_read	= gfs2_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= gfs2_file_splice_write,
 	.setlease	= simple_nosetlease,
 	.fallocate	= gfs2_fallocate,
@@ -1168,7 +1144,7 @@ const struct file_operations gfs2_file_fops_nolock = {
 	.open		= gfs2_open,
 	.release	= gfs2_release,
 	.fsync		= gfs2_fsync,
-	.splice_read	= gfs2_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= gfs2_file_splice_write,
 	.setlease	= generic_setlease,
 	.fallocate	= gfs2_fallocate,
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 7d62097..5048585 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -182,29 +182,6 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
 }
 EXPORT_SYMBOL_GPL(nfs_file_read);
 
-ssize_t
-nfs_file_splice_read(struct file *filp, loff_t *ppos,
-		     struct pipe_inode_info *pipe, size_t count,
-		     unsigned int flags)
-{
-	struct inode *inode = file_inode(filp);
-	ssize_t res;
-
-	dprintk("NFS: splice_read(%pD2, %lu@%Lu)\n",
-		filp, (unsigned long) count, (unsigned long long) *ppos);
-
-	nfs_start_io_read(inode);
-	res = nfs_revalidate_mapping(inode, filp->f_mapping);
-	if (!res) {
-		res = generic_file_splice_read(filp, ppos, pipe, count, flags);
-		if (res > 0)
-			nfs_add_stats(inode, NFSIOS_NORMALREADBYTES, res);
-	}
-	nfs_end_io_read(inode);
-	return res;
-}
-EXPORT_SYMBOL_GPL(nfs_file_splice_read);
-
 int
 nfs_file_mmap(struct file * file, struct vm_area_struct * vma)
 {
@@ -868,7 +845,7 @@ const struct file_operations nfs_file_operations = {
 	.fsync		= nfs_file_fsync,
 	.lock		= nfs_lock,
 	.flock		= nfs_flock,
-	.splice_read	= nfs_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.check_flags	= nfs_check_flags,
 	.setlease	= simple_nosetlease,
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 74935a1..d7b062b 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -365,8 +365,6 @@ int nfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *)
 int nfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync);
 loff_t nfs_file_llseek(struct file *, loff_t, int);
 ssize_t nfs_file_read(struct kiocb *, struct iov_iter *);
-ssize_t nfs_file_splice_read(struct file *, loff_t *, struct pipe_inode_info *,
-			     size_t, unsigned int);
 int nfs_file_mmap(struct file *, struct vm_area_struct *);
 ssize_t nfs_file_write(struct kiocb *, struct iov_iter *);
 int nfs_file_release(struct inode *, struct file *);
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index d085ad7..89a7795 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -248,7 +248,7 @@ const struct file_operations nfs4_file_operations = {
 	.fsync		= nfs_file_fsync,
 	.lock		= nfs_lock,
 	.flock		= nfs_flock,
-	.splice_read	= nfs_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.check_flags	= nfs_check_flags,
 	.setlease	= simple_nosetlease,
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 4e7b0dc..6596e41 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2307,36 +2307,6 @@ out_mutex:
 	return ret;
 }
 
-static ssize_t ocfs2_file_splice_read(struct file *in,
-				      loff_t *ppos,
-				      struct pipe_inode_info *pipe,
-				      size_t len,
-				      unsigned int flags)
-{
-	int ret = 0, lock_level = 0;
-	struct inode *inode = file_inode(in);
-
-	trace_ocfs2_file_splice_read(inode, in, in->f_path.dentry,
-			(unsigned long long)OCFS2_I(inode)->ip_blkno,
-			in->f_path.dentry->d_name.len,
-			in->f_path.dentry->d_name.name, len);
-
-	/*
-	 * See the comment in ocfs2_file_read_iter()
-	 */
-	ret = ocfs2_inode_lock_atime(inode, in->f_path.mnt, &lock_level);
-	if (ret < 0) {
-		mlog_errno(ret);
-		goto bail;
-	}
-	ocfs2_inode_unlock(inode, lock_level);
-
-	ret = generic_file_splice_read(in, ppos, pipe, len, flags);
-
-bail:
-	return ret;
-}
-
 static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
 				   struct iov_iter *to)
 {
@@ -2495,7 +2465,7 @@ const struct file_operations ocfs2_fops = {
 #endif
 	.lock		= ocfs2_lock,
 	.flock		= ocfs2_flock,
-	.splice_read	= ocfs2_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
 };
@@ -2540,7 +2510,7 @@ const struct file_operations ocfs2_fops_no_plocks = {
 	.compat_ioctl   = ocfs2_compat_ioctl,
 #endif
 	.flock		= ocfs2_flock,
-	.splice_read	= ocfs2_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
 };
diff --git a/fs/ocfs2/ocfs2_trace.h b/fs/ocfs2/ocfs2_trace.h
index f8f5fc5..0b58abc 100644
--- a/fs/ocfs2/ocfs2_trace.h
+++ b/fs/ocfs2/ocfs2_trace.h
@@ -1314,8 +1314,6 @@ DEFINE_OCFS2_FILE_OPS(ocfs2_file_aio_write);
 
 DEFINE_OCFS2_FILE_OPS(ocfs2_file_splice_write);
 
-DEFINE_OCFS2_FILE_OPS(ocfs2_file_splice_read);
-
 DEFINE_OCFS2_FILE_OPS(ocfs2_file_aio_read);
 
 DEFINE_OCFS2_ULL_ULL_ULL_EVENT(ocfs2_truncate_file);
diff --git a/fs/splice.c b/fs/splice.c
index 0daa7d1..7b756d3 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -281,207 +281,6 @@ void splice_shrink_spd(struct splice_pipe_desc *spd)
 	kfree(spd->partial);
 }
 
-static int
-__generic_file_splice_read(struct file *in, loff_t *ppos,
-			   struct pipe_inode_info *pipe, size_t len,
-			   unsigned int flags)
-{
-	struct address_space *mapping = in->f_mapping;
-	unsigned int loff, nr_pages, req_pages;
-	struct page *pages[PIPE_DEF_BUFFERS];
-	struct partial_page partial[PIPE_DEF_BUFFERS];
-	struct page *page;
-	pgoff_t index, end_index;
-	loff_t isize;
-	int error, page_nr;
-	struct splice_pipe_desc spd = {
-		.pages = pages,
-		.partial = partial,
-		.nr_pages_max = PIPE_DEF_BUFFERS,
-		.flags = flags,
-		.ops = &page_cache_pipe_buf_ops,
-		.spd_release = spd_release_page,
-	};
-
-	if (splice_grow_spd(pipe, &spd))
-		return -ENOMEM;
-
-	index = *ppos >> PAGE_SHIFT;
-	loff = *ppos & ~PAGE_MASK;
-	req_pages = (len + loff + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	nr_pages = min(req_pages, spd.nr_pages_max);
-
-	/*
-	 * Lookup the (hopefully) full range of pages we need.
-	 */
-	spd.nr_pages = find_get_pages_contig(mapping, index, nr_pages, spd.pages);
-	index += spd.nr_pages;
-
-	/*
-	 * If find_get_pages_contig() returned fewer pages than we needed,
-	 * readahead/allocate the rest and fill in the holes.
-	 */
-	if (spd.nr_pages < nr_pages)
-		page_cache_sync_readahead(mapping, &in->f_ra, in,
-				index, req_pages - spd.nr_pages);
-
-	error = 0;
-	while (spd.nr_pages < nr_pages) {
-		/*
-		 * Page could be there, find_get_pages_contig() breaks on
-		 * the first hole.
-		 */
-		page = find_get_page(mapping, index);
-		if (!page) {
-			/*
-			 * page didn't exist, allocate one.
-			 */
-			page = page_cache_alloc_cold(mapping);
-			if (!page)
-				break;
-
-			error = add_to_page_cache_lru(page, mapping, index,
-				   mapping_gfp_constraint(mapping, GFP_KERNEL));
-			if (unlikely(error)) {
-				put_page(page);
-				if (error == -EEXIST)
-					continue;
-				break;
-			}
-			/*
-			 * add_to_page_cache() locks the page, unlock it
-			 * to avoid convoluting the logic below even more.
-			 */
-			unlock_page(page);
-		}
-
-		spd.pages[spd.nr_pages++] = page;
-		index++;
-	}
-
-	/*
-	 * Now loop over the map and see if we need to start IO on any
-	 * pages, fill in the partial map, etc.
-	 */
-	index = *ppos >> PAGE_SHIFT;
-	nr_pages = spd.nr_pages;
-	spd.nr_pages = 0;
-	for (page_nr = 0; page_nr < nr_pages; page_nr++) {
-		unsigned int this_len;
-
-		if (!len)
-			break;
-
-		/*
-		 * this_len is the max we'll use from this page
-		 */
-		this_len = min_t(unsigned long, len, PAGE_SIZE - loff);
-		page = spd.pages[page_nr];
-
-		if (PageReadahead(page))
-			page_cache_async_readahead(mapping, &in->f_ra, in,
-					page, index, req_pages - page_nr);
-
-		/*
-		 * If the page isn't uptodate, we may need to start io on it
-		 */
-		if (!PageUptodate(page)) {
-			lock_page(page);
-
-			/*
-			 * Page was truncated, or invalidated by the
-			 * filesystem.  Redo the find/create, but this time the
-			 * page is kept locked, so there's no chance of another
-			 * race with truncate/invalidate.
-			 */
-			if (!page->mapping) {
-				unlock_page(page);
-retry_lookup:
-				page = find_or_create_page(mapping, index,
-						mapping_gfp_mask(mapping));
-
-				if (!page) {
-					error = -ENOMEM;
-					break;
-				}
-				put_page(spd.pages[page_nr]);
-				spd.pages[page_nr] = page;
-			}
-			/*
-			 * page was already under io and is now done, great
-			 */
-			if (PageUptodate(page)) {
-				unlock_page(page);
-				goto fill_it;
-			}
-
-			/*
-			 * need to read in the page
-			 */
-			error = mapping->a_ops->readpage(in, page);
-			if (unlikely(error)) {
-				/*
-				 * Re-lookup the page
-				 */
-				if (error == AOP_TRUNCATED_PAGE)
-					goto retry_lookup;
-
-				break;
-			}
-		}
-fill_it:
-		/*
-		 * i_size must be checked after PageUptodate.
-		 */
-		isize = i_size_read(mapping->host);
-		end_index = (isize - 1) >> PAGE_SHIFT;
-		if (unlikely(!isize || index > end_index))
-			break;
-
-		/*
-		 * if this is the last page, see if we need to shrink
-		 * the length and stop
-		 */
-		if (end_index == index) {
-			unsigned int plen;
-
-			/*
-			 * max good bytes in this page
-			 */
-			plen = ((isize - 1) & ~PAGE_MASK) + 1;
-			if (plen <= loff)
-				break;
-
-			/*
-			 * force quit after adding this page
-			 */
-			this_len = min(this_len, plen - loff);
-			len = this_len;
-		}
-
-		spd.partial[page_nr].offset = loff;
-		spd.partial[page_nr].len = this_len;
-		len -= this_len;
-		loff = 0;
-		spd.nr_pages++;
-		index++;
-	}
-
-	/*
-	 * Release any pages at the end, if we quit early. 'page_nr' is how far
-	 * we got, 'nr_pages' is how many pages are in the map.
-	 */
-	while (page_nr < nr_pages)
-		put_page(spd.pages[page_nr++]);
-	in->f_ra.prev_pos = (loff_t)index << PAGE_SHIFT;
-
-	if (spd.nr_pages)
-		error = splice_to_pipe(pipe, &spd);
-
-	splice_shrink_spd(&spd);
-	return error;
-}
-
 /**
  * generic_file_splice_read - splice data from file to a pipe
  * @in:		file to splice from
@@ -492,19 +291,17 @@ fill_it:
  *
  * Description:
  *    Will read pages from given file and fill them into a pipe. Can be
- *    used as long as the address_space operations for the source implements
- *    a readpage() hook.
+ *    used as long as it has more or less sane ->read_iter().
  *
  */
 ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
 {
+	struct iov_iter to;
+	struct kiocb kiocb;
 	loff_t isize, left;
-	int ret;
-
-	if (IS_DAX(in->f_mapping->host))
-		return default_file_splice_read(in, ppos, pipe, len, flags);
+	int idx, ret;
 
 	isize = i_size_read(in->f_mapping->host);
 	if (unlikely(*ppos >= isize))
@@ -514,10 +311,30 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	if (unlikely(left < len))
 		len = left;
 
-	ret = __generic_file_splice_read(in, ppos, pipe, len, flags);
+	iov_iter_pipe(&to, ITER_PIPE | READ, pipe, len);
+	idx = to.idx;
+	init_sync_kiocb(&kiocb, in);
+	kiocb.ki_pos = *ppos;
+	ret = in->f_op->read_iter(&kiocb, &to);
 	if (ret > 0) {
-		*ppos += ret;
+		*ppos = kiocb.ki_pos;
 		file_accessed(in);
+	} else if (ret < 0) {
+		if (WARN_ON(to.idx != idx || to.iov_offset)) {
+			/*
+			 * a bogus ->read_iter() has copied something and still
+			 * returned an error instead of a short read.
+			 */
+			to.idx = idx;
+			to.iov_offset = 0;
+			iov_iter_advance(&to, 0); /* to free what was emitted */
+		}
+		/*
+		 * callers of ->splice_read() expect -EAGAIN on
+		 * "can't put anything in there", rather than -EFAULT.
+		 */
+		if (ret == -EFAULT)
+			ret = -EAGAIN;
 	}
 
 	return ret;
@@ -580,7 +397,7 @@ ssize_t kernel_write(struct file *file, const char *buf, size_t count,
 }
 EXPORT_SYMBOL(kernel_write);
 
-ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
+static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
 {
@@ -675,7 +492,6 @@ err:
 	res = error;
 	goto shrink_ret;
 }
-EXPORT_SYMBOL(default_file_splice_read);
 
 /*
  * Send 'sd->len' bytes to socket from 'sd->file' at position 'sd->pos'
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e612a02..92f16cf 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -399,45 +399,6 @@ xfs_file_read_iter(
 	return ret;
 }
 
-STATIC ssize_t
-xfs_file_splice_read(
-	struct file		*infilp,
-	loff_t			*ppos,
-	struct pipe_inode_info	*pipe,
-	size_t			count,
-	unsigned int		flags)
-{
-	struct xfs_inode	*ip = XFS_I(infilp->f_mapping->host);
-	ssize_t			ret;
-
-	XFS_STATS_INC(ip->i_mount, xs_read_calls);
-
-	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
-		return -EIO;
-
-	trace_xfs_file_splice_read(ip, count, *ppos);
-
-	/*
-	 * DAX inodes cannot use the page cache for splice, so we have to push
-	 * them through the VFS IO path. This means it goes through
-	 * ->read_iter, which for us takes the XFS_IOLOCK_SHARED. Hence we
-	 * cannot lock the splice operation at this level for DAX inodes.
-	 */
-	if (IS_DAX(VFS_I(ip))) {
-		ret = default_file_splice_read(infilp, ppos, pipe, count,
-					       flags);
-		goto out;
-	}
-
-	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
-	ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
-	xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
-out:
-	if (ret > 0)
-		XFS_STATS_ADD(ip->i_mount, xs_read_bytes, ret);
-	return ret;
-}
-
 /*
  * Zero any on disk space between the current EOF and the new, larger EOF.
  *
@@ -1652,7 +1613,7 @@ const struct file_operations xfs_file_operations = {
 	.llseek		= xfs_file_llseek,
 	.read_iter	= xfs_file_read_iter,
 	.write_iter	= xfs_file_write_iter,
-	.splice_read	= xfs_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.unlocked_ioctl	= xfs_file_ioctl,
 #ifdef CONFIG_COMPAT
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index d303a66..f31db44 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1170,7 +1170,6 @@ DEFINE_RW_EVENT(xfs_file_dax_read);
 DEFINE_RW_EVENT(xfs_file_buffered_write);
 DEFINE_RW_EVENT(xfs_file_direct_write);
 DEFINE_RW_EVENT(xfs_file_dax_write);
-DEFINE_RW_EVENT(xfs_file_splice_read);
 
 DECLARE_EVENT_CLASS(xfs_page_class,
 	TP_PROTO(struct inode *inode, struct page *page, unsigned long off,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 901e25d..b04883e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2794,8 +2794,6 @@ extern void block_sync_page(struct page *page);
 /* fs/splice.c */
 extern ssize_t generic_file_splice_read(struct file *, loff_t *,
 		struct pipe_inode_info *, size_t, unsigned int);
-extern ssize_t default_file_splice_read(struct file *, loff_t *,
-		struct pipe_inode_info *, size_t, unsigned int);
 extern ssize_t iter_file_splice_write(struct pipe_inode_info *,
 		struct file *, loff_t *, size_t, unsigned int);
 extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
diff --git a/mm/shmem.c b/mm/shmem.c
index fd8b2b5..84d7077 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2310,119 +2310,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return retval ? retval : error;
 }
 
-static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
-				struct pipe_inode_info *pipe, size_t len,
-				unsigned int flags)
-{
-	struct address_space *mapping = in->f_mapping;
-	struct inode *inode = mapping->host;
-	unsigned int loff, nr_pages, req_pages;
-	struct page *pages[PIPE_DEF_BUFFERS];
-	struct partial_page partial[PIPE_DEF_BUFFERS];
-	struct page *page;
-	pgoff_t index, end_index;
-	loff_t isize, left;
-	int error, page_nr;
-	struct splice_pipe_desc spd = {
-		.pages = pages,
-		.partial = partial,
-		.nr_pages_max = PIPE_DEF_BUFFERS,
-		.flags = flags,
-		.ops = &page_cache_pipe_buf_ops,
-		.spd_release = spd_release_page,
-	};
-
-	isize = i_size_read(inode);
-	if (unlikely(*ppos >= isize))
-		return 0;
-
-	left = isize - *ppos;
-	if (unlikely(left < len))
-		len = left;
-
-	if (splice_grow_spd(pipe, &spd))
-		return -ENOMEM;
-
-	index = *ppos >> PAGE_SHIFT;
-	loff = *ppos & ~PAGE_MASK;
-	req_pages = (len + loff + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	nr_pages = min(req_pages, spd.nr_pages_max);
-
-	spd.nr_pages = find_get_pages_contig(mapping, index,
-						nr_pages, spd.pages);
-	index += spd.nr_pages;
-	error = 0;
-
-	while (spd.nr_pages < nr_pages) {
-		error = shmem_getpage(inode, index, &page, SGP_CACHE);
-		if (error)
-			break;
-		unlock_page(page);
-		spd.pages[spd.nr_pages++] = page;
-		index++;
-	}
-
-	index = *ppos >> PAGE_SHIFT;
-	nr_pages = spd.nr_pages;
-	spd.nr_pages = 0;
-
-	for (page_nr = 0; page_nr < nr_pages; page_nr++) {
-		unsigned int this_len;
-
-		if (!len)
-			break;
-
-		this_len = min_t(unsigned long, len, PAGE_SIZE - loff);
-		page = spd.pages[page_nr];
-
-		if (!PageUptodate(page) || page->mapping != mapping) {
-			error = shmem_getpage(inode, index, &page, SGP_CACHE);
-			if (error)
-				break;
-			unlock_page(page);
-			put_page(spd.pages[page_nr]);
-			spd.pages[page_nr] = page;
-		}
-
-		isize = i_size_read(inode);
-		end_index = (isize - 1) >> PAGE_SHIFT;
-		if (unlikely(!isize || index > end_index))
-			break;
-
-		if (end_index == index) {
-			unsigned int plen;
-
-			plen = ((isize - 1) & ~PAGE_MASK) + 1;
-			if (plen <= loff)
-				break;
-
-			this_len = min(this_len, plen - loff);
-			len = this_len;
-		}
-
-		spd.partial[page_nr].offset = loff;
-		spd.partial[page_nr].len = this_len;
-		len -= this_len;
-		loff = 0;
-		spd.nr_pages++;
-		index++;
-	}
-
-	while (page_nr < nr_pages)
-		put_page(spd.pages[page_nr++]);
-
-	if (spd.nr_pages)
-		error = splice_to_pipe(pipe, &spd);
-
-	splice_shrink_spd(&spd);
-
-	if (error > 0) {
-		*ppos += error;
-		file_accessed(in);
-	}
-	return error;
-}
-
 /*
  * llseek SEEK_DATA or SEEK_HOLE through the radix_tree.
  */
@@ -3785,7 +3672,7 @@ static const struct file_operations shmem_file_operations = {
 	.read_iter	= shmem_file_read_iter,
 	.write_iter	= generic_file_write_iter,
 	.fsync		= noop_fsync,
-	.splice_read	= shmem_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= shmem_fallocate,
 #endif
-- 
2.9.3



* Re: [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-23 19:03                                         ` [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe() Al Viro
@ 2016-09-23 19:45                                           ` Linus Torvalds
  2016-09-23 20:10                                             ` Al Viro
  0 siblings, 1 reply; 152+ messages in thread
From: Linus Torvalds @ 2016-09-23 19:45 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Fri, Sep 23, 2016 at 12:03 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> @@ -1421,8 +1406,25 @@ static long do_splice(struct file *in, loff_t __user *off_in,
> +               ret = 0;
> +               pipe_lock(opipe);
> +               bogus_count = opipe->buffers;
> +               do {
> +                       bogus_count += opipe->nrbufs;
> +                       ret = do_splice_to(in, &offset, opipe, len, flags);
> +                       if (ret > 0) {
> +                               total += ret;
> +                               len -= ret;
> +                       }
> +                       bogus_count -= opipe->nrbufs;
> +                       if (bogus_count <= 0)
> +                               break;

I was like "oh, I'm sure this is some temporary hack, it will be gone
by the end of the series".

It wasn't gone by the end.

There's two copies of that pattern, and at the very least it needs a
big comment about what this pattern does and why.

But other than that reaction, I didn't get any hives from this. I
didn't *test* it, only looking at patches, but no red flags I could
notice.

               Linus


* Re: [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-23 19:45                                           ` Linus Torvalds
@ 2016-09-23 20:10                                             ` Al Viro
  2016-09-23 20:36                                               ` Linus Torvalds
  0 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-09-23 20:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Fri, Sep 23, 2016 at 12:45:53PM -0700, Linus Torvalds wrote:

> I was like "oh, I'm sure this is some temporary hack, it will be gone
> by the end of the series".
> 
> It wasn't gone by the end.
> 
> There's two copies of that pattern, and at the very least it needs a
> big comment about what this pattern does and why.

The thing is, I'm not sure what to do with it; it was brought to light by
the LTP vmsplice test, which asks to feed 128Kb into a pipe.  With the caller
itself on the other end of that pipe, SPLICE_F_NONBLOCK *not* given and
the pipe capacity being 64Kb.  Unfortunately, "quietly truncate the
length down to 64Kb" does *not* suffice - the damn thing starts not at
the page boundary, so we only copy about 62Kb until hitting the pipe
overflow (the pipe is initially empty).  The reason why it doesn't go
to sleep indefinitely on the mainline kernel is that mainline collects
up to pipe->buffers *pages*, before feeding them into the pipe.  And these
~62Kb are just that.  Note that had there been anything already in the
pipe, the same call would've gone to sleep (and in the end transferred the
same ~62Kb worth of data).

All of that is completely undocumented in vmsplice(2) (or anywhere else that
I'd been able to find) ;-/

OTOH, considering the quality of documentation, I'm somewhat tempted to go
for "sleep only if it had been completely full when we entered; once there's
some space feed as much as fits and be done with that".  OTTH, I'm not sure
that no userland cr^Hode will manage to be hurt by that variant...
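
(To put numbers on that: the default pipe has 16 one-page buffers, i.e.
16 * 4096 = 64Kb of capacity, but each slot maps exactly one user page.
If the 128Kb source starts, say, 2Kb into a page - the exact offset only
changes the tail - slot 0 carries 4096 - 2048 = 2048 bytes, the other 15
slots carry full pages, and the transfer stops at 2048 + 15 * 4096 = 63488
bytes: the "about 62Kb" above.)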


* Re: [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-23 20:10                                             ` Al Viro
@ 2016-09-23 20:36                                               ` Linus Torvalds
  2016-09-24  3:59                                                 ` Al Viro
                                                                   ` (5 more replies)
  0 siblings, 6 replies; 152+ messages in thread
From: Linus Torvalds @ 2016-09-23 20:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Fri, Sep 23, 2016 at 1:10 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> OTOH, considering the quality of documentation, I'm somewhat tempted to go
> for "sleep only if it had been completely full when we entered; once there's
> some space feed as much as fits and be done with that".  OTTH, I'm not sure
> that no userland cr^Hode will manage to be hurt by that variant...

Let's just try it.

If that then doesn't work, we can introduce your odd code (with a
*big* comment). Ok?

               Linus


* Re: [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-23 20:36                                               ` Linus Torvalds
@ 2016-09-24  3:59                                                 ` Al Viro
  2016-09-24 17:29                                                   ` Al Viro
  2016-09-24  3:59                                                 ` [PATCH 04/12] " Al Viro
                                                                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-09-24  3:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Fri, Sep 23, 2016 at 01:36:12PM -0700, Linus Torvalds wrote:
> On Fri, Sep 23, 2016 at 1:10 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > OTOH, considering the quality of documentation, I'm somewhat tempted to go
> > for "sleep only if it had been completely full when we entered; once there's
> > some space feed as much as fits and be done with that".  OTTH, I'm not sure
> > that no userland cr^Hode will manage to be hurt by that variant...
> 
> Let's just try it.
> 
> If that then doesn't work, we can introduce your odd code (with a
> *big* comment). Ok?

	FWIW, updated (with fixes) and force-pushed.  Added piece:
default_file_splice_read() converted to iov_iter.  Seems to work, after
fixing a braino in __pipe_get_pages().  Changed: #4 (sleep only in the
beginning, as described above), #6 (context changes from #4), #10 (missing
get_page() added in __pipe_get_pages()), #11 (removed pointless truncation
of len - ->read_iter() can bloody well handle that on its own) and added #12.
Stands at 28 files changed, 657 insertions(+), 1009 deletions(-) now...


* [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-23 20:36                                               ` Linus Torvalds
  2016-09-24  3:59                                                 ` Al Viro
@ 2016-09-24  3:59                                                 ` Al Viro
  2016-09-26 13:35                                                     ` Miklos Szeredi
  2016-12-17 19:54                                                   ` Andreas Schwab
  2016-09-24  4:00                                                 ` [PATCH 06/12] new helper: add_to_pipe() Al Viro
                                                                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 152+ messages in thread
From: Al Viro @ 2016-09-24  3:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

* splice_to_pipe() stops at pipe overflow and does *not* take pipe_lock
* ->splice_read() instances do the same
* vmsplice_to_pipe() and do_splice() (ultimate callers of splice_to_pipe())
  arrange for waiting, looping, etc. themselves.

That should make pipe_lock the outermost one.

Unfortunately, the existing rules for the amount passed by vmsplice_to_pipe()
and do_splice() are quite ugly _and_ userland code can easily be broken
by changing those.  It's not even "no more than the maximal capacity of
this pipe" - it's "once we'd fed pipe->buffers pages into the pipe,
leave instead of waiting".

Considering how poorly these rules are documented, let's try "wait for some
space to appear, unless given SPLICE_F_NONBLOCK, then push into pipe
and if we run into overflow, we are done".
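
The user-visible upshot, roughly (an illustration, not part of the patch;
/dev/zero is just a convenient endless source that goes through
default_file_splice_read()):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int pfd[2];
	int fd = open("/dev/zero", O_RDONLY);
	ssize_t n;

	if (fd < 0 || pipe(pfd) < 0)
		return 1;

	/* empty pipe: push as much as fits, stop at overflow */
	n = splice(fd, NULL, pfd[1], NULL, 128 * 1024, 0);
	printf("first splice: %zd\n", n);	/* 64Kb on a default pipe */

	/* pipe now full: with SPLICE_F_NONBLOCK we get -EAGAIN
	 * instead of sleeping for space */
	n = splice(fd, NULL, pfd[1], NULL, 128 * 1024, SPLICE_F_NONBLOCK);
	printf("second splice: %zd (errno %d)\n", n, n < 0 ? errno : 0);
	return 0;
}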

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/fuse/dev.c |   2 -
 fs/splice.c   | 138 +++++++++++++++++++++++++++-------------------------------
 2 files changed, 63 insertions(+), 77 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index a94d2ed..eaf56c6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1364,7 +1364,6 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 		goto out;
 
 	ret = 0;
-	pipe_lock(pipe);
 
 	if (!pipe->readers) {
 		send_sig(SIGPIPE, current, 0);
@@ -1400,7 +1399,6 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 	}
 
 out_unlock:
-	pipe_unlock(pipe);
 
 	if (do_wakeup) {
 		smp_mb();
diff --git a/fs/splice.c b/fs/splice.c
index 31c52e0..02daa61 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -183,79 +183,41 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		       struct splice_pipe_desc *spd)
 {
 	unsigned int spd_pages = spd->nr_pages;
-	int ret, do_wakeup, page_nr;
+	int ret = 0, page_nr = 0;
 
 	if (!spd_pages)
 		return 0;
 
-	ret = 0;
-	do_wakeup = 0;
-	page_nr = 0;
-
-	pipe_lock(pipe);
-
-	for (;;) {
-		if (!pipe->readers) {
-			send_sig(SIGPIPE, current, 0);
-			if (!ret)
-				ret = -EPIPE;
-			break;
-		}
-
-		if (pipe->nrbufs < pipe->buffers) {
-			int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
-			struct pipe_buffer *buf = pipe->bufs + newbuf;
-
-			buf->page = spd->pages[page_nr];
-			buf->offset = spd->partial[page_nr].offset;
-			buf->len = spd->partial[page_nr].len;
-			buf->private = spd->partial[page_nr].private;
-			buf->ops = spd->ops;
-			if (spd->flags & SPLICE_F_GIFT)
-				buf->flags |= PIPE_BUF_FLAG_GIFT;
-
-			pipe->nrbufs++;
-			page_nr++;
-			ret += buf->len;
-
-			if (pipe->files)
-				do_wakeup = 1;
+	if (unlikely(!pipe->readers)) {
+		send_sig(SIGPIPE, current, 0);
+		ret = -EPIPE;
+		goto out;
+	}
 
-			if (!--spd->nr_pages)
-				break;
-			if (pipe->nrbufs < pipe->buffers)
-				continue;
+	while (pipe->nrbufs < pipe->buffers) {
+		int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+		struct pipe_buffer *buf = pipe->bufs + newbuf;
 
-			break;
-		}
+		buf->page = spd->pages[page_nr];
+		buf->offset = spd->partial[page_nr].offset;
+		buf->len = spd->partial[page_nr].len;
+		buf->private = spd->partial[page_nr].private;
+		buf->ops = spd->ops;
+		if (spd->flags & SPLICE_F_GIFT)
+			buf->flags |= PIPE_BUF_FLAG_GIFT;
 
-		if (spd->flags & SPLICE_F_NONBLOCK) {
-			if (!ret)
-				ret = -EAGAIN;
-			break;
-		}
+		pipe->nrbufs++;
+		page_nr++;
+		ret += buf->len;
 
-		if (signal_pending(current)) {
-			if (!ret)
-				ret = -ERESTARTSYS;
+		if (!--spd->nr_pages)
 			break;
-		}
-
-		if (do_wakeup) {
-			wakeup_pipe_readers(pipe);
-			do_wakeup = 0;
-		}
-
-		pipe->waiting_writers++;
-		pipe_wait(pipe);
-		pipe->waiting_writers--;
 	}
 
-	pipe_unlock(pipe);
-
-	if (do_wakeup)
-		wakeup_pipe_readers(pipe);
+	if (!ret)
+		ret = -EAGAIN;
 
+out:
 	while (page_nr < spd_pages)
 		spd->spd_release(spd, page_nr++);
 
@@ -1339,6 +1301,20 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
 }
 EXPORT_SYMBOL(do_splice_direct);
 
+static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
+{
+	while (pipe->nrbufs == pipe->buffers) {
+		if (flags & SPLICE_F_NONBLOCK)
+			return -EAGAIN;
+		if (signal_pending(current))
+			return -ERESTARTSYS;
+		pipe->waiting_writers++;
+		pipe_wait(pipe);
+		pipe->waiting_writers--;
+	}
+	return 0;
+}
+
 static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			       struct pipe_inode_info *opipe,
 			       size_t len, unsigned int flags);
@@ -1421,8 +1397,13 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 			offset = in->f_pos;
 		}
 
-		ret = do_splice_to(in, &offset, opipe, len, flags);
-
+		pipe_lock(opipe);
+		ret = wait_for_space(opipe, flags);
+		if (!ret)
+			ret = do_splice_to(in, &offset, opipe, len, flags);
+		pipe_unlock(opipe);
+		if (ret > 0)
+			wakeup_pipe_readers(opipe);
 		if (!off_in)
 			in->f_pos = offset;
 		else if (copy_to_user(off_in, &offset, sizeof(loff_t)))
@@ -1434,22 +1415,23 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 	return -EINVAL;
 }
 
-static int get_iovec_page_array(struct iov_iter *from,
+static int get_iovec_page_array(const struct iov_iter *from,
 				struct page **pages,
 				struct partial_page *partial,
 				unsigned int pipe_buffers)
 {
+	struct iov_iter i = *from;
 	int buffers = 0;
-	while (iov_iter_count(from)) {
+	while (iov_iter_count(&i)) {
 		ssize_t copied;
 		size_t start;
 
-		copied = iov_iter_get_pages(from, pages + buffers, ~0UL,
+		copied = iov_iter_get_pages(&i, pages + buffers, ~0UL,
 					pipe_buffers - buffers, &start);
 		if (copied <= 0)
 			return buffers ? buffers : copied;
 
-		iov_iter_advance(from, copied);
+		iov_iter_advance(&i, copied);
 		while (copied) {
 			int size = min_t(int, copied, PAGE_SIZE - start);
 			partial[buffers].offset = start;
@@ -1546,14 +1528,20 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 		return -ENOMEM;
 	}
 
-	spd.nr_pages = get_iovec_page_array(&from, spd.pages,
-					    spd.partial,
-					    spd.nr_pages_max);
-	if (spd.nr_pages <= 0)
-		ret = spd.nr_pages;
-	else
-		ret = splice_to_pipe(pipe, &spd);
-
+	pipe_lock(pipe);
+	ret = wait_for_space(pipe, flags);
+	if (!ret) {
+		spd.nr_pages = get_iovec_page_array(&from, spd.pages,
+						    spd.partial,
+						    spd.nr_pages_max);
+		if (spd.nr_pages <= 0)
+			ret = spd.nr_pages;
+		else
+			ret = splice_to_pipe(pipe, &spd);
+		pipe_unlock(pipe);
+		if (ret > 0)
+			wakeup_pipe_readers(pipe);
+	}
 	splice_shrink_spd(&spd);
 	kfree(iov);
 	return ret;
-- 
2.9.3



* [PATCH 06/12] new helper: add_to_pipe()
  2016-09-23 20:36                                               ` Linus Torvalds
  2016-09-24  3:59                                                 ` Al Viro
  2016-09-24  3:59                                                 ` [PATCH 04/12] " Al Viro
@ 2016-09-24  4:00                                                 ` Al Viro
  2016-09-26 13:49                                                   ` Miklos Szeredi
  2016-09-24  4:01                                                 ` [PATCH 10/12] new iov_iter flavour: pipe-backed Al Viro
                                                                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-09-24  4:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

single-buffer analogue of splice_to_pipe(); vmsplice_to_pipe() switched
to that, leaving splice_to_pipe() only for ->splice_read() instances
(and that only until they are converted as well).
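
The calling convention, roughly (an illustrative fragment mirroring
iter_to_pipe() below, given a page/start/size to feed in; pipe_lock is
held by the caller, which has already waited for space):

	struct pipe_buffer buf = {
		.ops	= &user_page_pipe_buf_ops,
		.page	= page,
		.offset	= start,
		.len	= size,
	};
	ssize_t ret = add_to_pipe(pipe, &buf);
	/* returns buf.len on success; on -EPIPE/-EAGAIN the buffer
	 * has already been released by add_to_pipe() itself */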

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/splice.c            | 113 ++++++++++++++++++++++++++++---------------------
 include/linux/splice.h |   2 +
 2 files changed, 67 insertions(+), 48 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 02daa61..e13d935 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -203,8 +203,6 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		buf->len = spd->partial[page_nr].len;
 		buf->private = spd->partial[page_nr].private;
 		buf->ops = spd->ops;
-		if (spd->flags & SPLICE_F_GIFT)
-			buf->flags |= PIPE_BUF_FLAG_GIFT;
 
 		pipe->nrbufs++;
 		page_nr++;
@@ -225,6 +223,27 @@ out:
 }
 EXPORT_SYMBOL_GPL(splice_to_pipe);
 
+ssize_t add_to_pipe(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
+{
+	int ret;
+
+	if (unlikely(!pipe->readers)) {
+		send_sig(SIGPIPE, current, 0);
+		ret = -EPIPE;
+	} else if (pipe->nrbufs == pipe->buffers) {
+		ret = -EAGAIN;
+	} else {
+		int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+		pipe->bufs[newbuf] = *buf;
+		pipe->nrbufs++;
+		return buf->len;
+	}
+	buf->ops->release(pipe, buf);
+	buf->ops = NULL;
+	return ret;
+}
+EXPORT_SYMBOL(add_to_pipe);
+
 void spd_release_page(struct splice_pipe_desc *spd, unsigned int i)
 {
 	put_page(spd->pages[i]);
@@ -1415,33 +1434,50 @@ static long do_splice(struct file *in, loff_t __user *off_in,
 	return -EINVAL;
 }
 
-static int get_iovec_page_array(const struct iov_iter *from,
-				struct page **pages,
-				struct partial_page *partial,
-				unsigned int pipe_buffers)
+static int iter_to_pipe(struct iov_iter *from,
+			struct pipe_inode_info *pipe,
+			unsigned flags)
 {
-	struct iov_iter i = *from;
-	int buffers = 0;
-	while (iov_iter_count(&i)) {
+	struct pipe_buffer buf = {
+		.ops = &user_page_pipe_buf_ops,
+		.flags = flags
+	};
+	size_t total = 0;
+	int ret = 0;
+	bool failed = false;
+
+	while (iov_iter_count(from) && !failed) {
+		struct page *pages[16];
 		ssize_t copied;
 		size_t start;
+		int n;
 
-		copied = iov_iter_get_pages(&i, pages + buffers, ~0UL,
-					pipe_buffers - buffers, &start);
-		if (copied <= 0)
-			return buffers ? buffers : copied;
+		copied = iov_iter_get_pages(from, pages, ~0UL, 16, &start);
+		if (copied <= 0) {
+			ret = copied;
+			break;
+		}
 
-		iov_iter_advance(&i, copied);
-		while (copied) {
+		for (n = 0; copied; n++, start = 0) {
 			int size = min_t(int, copied, PAGE_SIZE - start);
-			partial[buffers].offset = start;
-			partial[buffers].len = size;
+			if (!failed) {
+				buf.page = pages[n];
+				buf.offset = start;
+				buf.len = size;
+				ret = add_to_pipe(pipe, &buf);
+				if (unlikely(ret < 0)) {
+					failed = true;
+				} else {
+					iov_iter_advance(from, ret);
+					total += ret;
+				}
+			} else {
+				put_page(pages[n]);
+			}
 			copied -= size;
-			start = 0;
-			buffers++;
 		}
 	}
-	return buffers;
+	return total ? total : ret;
 }
 
 static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
@@ -1502,17 +1538,11 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 	struct iovec iovstack[UIO_FASTIOV];
 	struct iovec *iov = iovstack;
 	struct iov_iter from;
-	struct page *pages[PIPE_DEF_BUFFERS];
-	struct partial_page partial[PIPE_DEF_BUFFERS];
-	struct splice_pipe_desc spd = {
-		.pages = pages,
-		.partial = partial,
-		.nr_pages_max = PIPE_DEF_BUFFERS,
-		.flags = flags,
-		.ops = &user_page_pipe_buf_ops,
-		.spd_release = spd_release_page,
-	};
 	long ret;
+	unsigned buf_flag = 0;
+
+	if (flags & SPLICE_F_GIFT)
+		buf_flag = PIPE_BUF_FLAG_GIFT;
 
 	pipe = get_pipe_info(file);
 	if (!pipe)
@@ -1523,26 +1553,13 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
 	if (ret < 0)
 		return ret;
 
-	if (splice_grow_spd(pipe, &spd)) {
-		kfree(iov);
-		return -ENOMEM;
-	}
-
 	pipe_lock(pipe);
 	ret = wait_for_space(pipe, flags);
-	if (!ret) {
-		spd.nr_pages = get_iovec_page_array(&from, spd.pages,
-						    spd.partial,
-						    spd.nr_pages_max);
-		if (spd.nr_pages <= 0)
-			ret = spd.nr_pages;
-		else
-			ret = splice_to_pipe(pipe, &spd);
-		pipe_unlock(pipe);
-		if (ret > 0)
-			wakeup_pipe_readers(pipe);
-	}
-	splice_shrink_spd(&spd);
+	if (!ret)
+		ret = iter_to_pipe(&from, pipe, buf_flag);
+	pipe_unlock(pipe);
+	if (ret > 0)
+		wakeup_pipe_readers(pipe);
 	kfree(iov);
 	return ret;
 }
diff --git a/include/linux/splice.h b/include/linux/splice.h
index da2751d..58b300f 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -72,6 +72,8 @@ extern ssize_t __splice_from_pipe(struct pipe_inode_info *,
 				  struct splice_desc *, splice_actor *);
 extern ssize_t splice_to_pipe(struct pipe_inode_info *,
 			      struct splice_pipe_desc *);
+extern ssize_t add_to_pipe(struct pipe_inode_info *,
+			      struct pipe_buffer *);
 extern ssize_t splice_direct_to_actor(struct file *, struct splice_desc *,
 				      splice_direct_actor *);
 
-- 
2.9.3



* [PATCH 10/12] new iov_iter flavour: pipe-backed
  2016-09-23 20:36                                               ` Linus Torvalds
                                                                   ` (2 preceding siblings ...)
  2016-09-24  4:00                                                 ` [PATCH 06/12] new helper: add_to_pipe() Al Viro
@ 2016-09-24  4:01                                                 ` Al Viro
  2016-09-29 20:53                                                   ` Miklos Szeredi
  2016-09-24  4:01                                                 ` [PATCH 11/12] switch generic_file_splice_read() to use of ->read_iter() Al Viro
  2016-09-24  4:02                                                 ` [PATCH 12/12] switch default_file_splice_read() to use of pipe-backed iov_iter Al Viro
  5 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-09-24  4:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

iov_iter variant for passing data into a pipe.  copy_to_iter()
copies data into page(s) it has allocated and stuffs them into
the pipe; copy_page_to_iter() stuffs there a reference to the
page given to it.  Both will try to coalesce if possible.
iov_iter_zero() is similar to copy_to_iter(); iov_iter_get_pages()
and friends will do as copy_to_iter() would have and return the
pages where the data would've been copied.  iov_iter_advance()
will truncate everything past the spot it has advanced to.

New primitive: iov_iter_pipe(), used for initializing those.
The pipe should be locked all along.

Running out of space acts as a fault would for iovec-backed ones;
in other words, giving it to ->read_iter() may result in a short
read if the pipe overflows, or -EFAULT if that happens with nothing
copied there.

In other words, ->read_iter() on those acts pretty much like
->splice_read().  Moreover, all generic_file_splice_read() users,
as well as many other ->splice_read() instances, can be switched
to that scheme - that'll happen in the next commit.
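
A ->splice_read() built on top of this boils down to (an illustrative
fragment; see the generic_file_splice_read() conversion in the next
patch - the pipe is locked by the caller, i.e. do_splice()):

	struct iov_iter to;
	struct kiocb kiocb;

	iov_iter_pipe(&to, ITER_PIPE | READ, pipe, len);
	init_sync_kiocb(&kiocb, in);
	kiocb.ki_pos = *ppos;
	ret = in->f_op->read_iter(&kiocb, &to);
	/* short read on pipe overflow, -EFAULT if nothing fit */
	if (ret > 0)
		*ppos = kiocb.ki_pos;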

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/splice.c            |   2 +-
 include/linux/splice.h |   1 +
 include/linux/uio.h    |  14 +-
 lib/iov_iter.c         | 390 ++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 401 insertions(+), 6 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index e13d935..589a1d5 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -524,7 +524,7 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 }
 EXPORT_SYMBOL(generic_file_splice_read);
 
-static const struct pipe_buf_operations default_pipe_buf_ops = {
+const struct pipe_buf_operations default_pipe_buf_ops = {
 	.can_merge = 0,
 	.confirm = generic_pipe_buf_confirm,
 	.release = generic_pipe_buf_release,
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 58b300f..00a2116 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -85,4 +85,5 @@ extern void splice_shrink_spd(struct splice_pipe_desc *);
 extern void spd_release_page(struct splice_pipe_desc *, unsigned int);
 
 extern const struct pipe_buf_operations page_cache_pipe_buf_ops;
+extern const struct pipe_buf_operations default_pipe_buf_ops;
 #endif
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 1b5d1cd..c4fe1ab 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -13,6 +13,7 @@
 #include <uapi/linux/uio.h>
 
 struct page;
+struct pipe_inode_info;
 
 struct kvec {
 	void *iov_base; /* and that should *never* hold a userland pointer */
@@ -23,6 +24,7 @@ enum {
 	ITER_IOVEC = 0,
 	ITER_KVEC = 2,
 	ITER_BVEC = 4,
+	ITER_PIPE = 8,
 };
 
 struct iov_iter {
@@ -33,8 +35,12 @@ struct iov_iter {
 		const struct iovec *iov;
 		const struct kvec *kvec;
 		const struct bio_vec *bvec;
+		struct pipe_inode_info *pipe;
+	};
+	union {
+		unsigned long nr_segs;
+		int idx;
 	};
-	unsigned long nr_segs;
 };
 
 /*
@@ -64,7 +70,7 @@ static inline struct iovec iov_iter_iovec(const struct iov_iter *iter)
 }
 
 #define iov_for_each(iov, iter, start)				\
-	if (!((start).type & ITER_BVEC))			\
+	if (!((start).type & (ITER_BVEC | ITER_PIPE)))		\
 	for (iter = (start);					\
 	     (iter).count &&					\
 	     ((iov = iov_iter_iovec(&(iter))), 1);		\
@@ -94,6 +100,8 @@ void iov_iter_kvec(struct iov_iter *i, int direction, const struct kvec *kvec,
 			unsigned long nr_segs, size_t count);
 void iov_iter_bvec(struct iov_iter *i, int direction, const struct bio_vec *bvec,
 			unsigned long nr_segs, size_t count);
+void iov_iter_pipe(struct iov_iter *i, int direction, struct pipe_inode_info *pipe,
+			size_t count);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
@@ -109,7 +117,7 @@ static inline size_t iov_iter_count(struct iov_iter *i)
 
 static inline bool iter_is_iovec(struct iov_iter *i)
 {
-	return !(i->type & (ITER_BVEC | ITER_KVEC));
+	return !(i->type & (ITER_BVEC | ITER_KVEC | ITER_PIPE));
 }
 
 /*
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 9e8c738..405fdd6 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -3,8 +3,11 @@
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
+#include <linux/splice.h>
 #include <net/checksum.h>
 
+#define PIPE_PARANOIA /* for now */
+
 #define iterate_iovec(i, n, __v, __p, skip, STEP) {	\
 	size_t left;					\
 	size_t wanted = n;				\
@@ -290,6 +293,82 @@ done:
 	return wanted - bytes;
 }
 
+#ifdef PIPE_PARANOIA
+static bool sanity(const struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	int idx = i->idx;
+	int delta = (pipe->curbuf + pipe->nrbufs - idx) & (pipe->buffers - 1);
+	if (i->iov_offset) {
+		struct pipe_buffer *p;
+		if (unlikely(delta != 1) || unlikely(!pipe->nrbufs))
+			goto Bad;	// must be at the last buffer...
+
+		p = &pipe->bufs[idx];
+		if (unlikely(p->offset + p->len != i->iov_offset))
+			goto Bad;	// ... at the end of segment
+	} else {
+		if (delta)
+			goto Bad;	// must be right after the last buffer
+	}
+	return true;
+Bad:
+	WARN_ON(1);
+	return false;
+}
+#else
+#define sanity(i) true
+#endif
+
+static inline int next_idx(int idx, struct pipe_inode_info *pipe)
+{
+	return (idx + 1) & (pipe->buffers - 1);
+}
+
+static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
+			 struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	struct pipe_buffer *buf;
+	size_t off;
+	int idx;
+
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	if (!sanity(i))
+		return 0;
+
+	off = i->iov_offset;
+	idx = i->idx;
+	buf = &pipe->bufs[idx];
+	if (off) {
+		if (offset == off && buf->page == page) {
+			/* merge with the last one */
+			buf->len += bytes;
+			i->iov_offset += bytes;
+			goto out;
+		}
+		idx = next_idx(idx, pipe);
+		buf = &pipe->bufs[idx];
+	}
+	if (idx == pipe->curbuf && pipe->nrbufs)
+		return 0;
+	pipe->nrbufs++;
+	buf->ops = &page_cache_pipe_buf_ops;
+	get_page(buf->page = page);
+	buf->offset = offset;
+	buf->len = bytes;
+	i->iov_offset = offset + bytes;
+	i->idx = idx;
+out:
+	i->count -= bytes;
+	return bytes;
+}
+
 /*
  * Fault in the first iovec of the given iov_iter, to a maximum length
  * of bytes. Returns 0 on success, or non-zero if the memory could not be
@@ -376,9 +455,98 @@ static void memzero_page(struct page *page, size_t offset, size_t len)
 	kunmap_atomic(addr);
 }
 
+static inline bool allocated(struct pipe_buffer *buf)
+{
+	return buf->ops == &default_pipe_buf_ops;
+}
+
+static inline void data_start(const struct iov_iter *i, int *idxp, size_t *offp)
+{
+	size_t off = i->iov_offset;
+	int idx = i->idx;
+	if (off && (!allocated(&i->pipe->bufs[idx]) || off == PAGE_SIZE)) {
+		idx = next_idx(idx, i->pipe);
+		off = 0;
+	}
+	*idxp = idx;
+	*offp = off;
+}
+
+static size_t push_pipe(struct iov_iter *i, size_t size,
+			int *idxp, size_t *offp)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	size_t off;
+	int idx;
+	ssize_t left;
+
+	if (unlikely(size > i->count))
+		size = i->count;
+	if (unlikely(!size))
+		return 0;
+
+	left = size;
+	data_start(i, &idx, &off);
+	*idxp = idx;
+	*offp = off;
+	if (off) {
+		left -= PAGE_SIZE - off;
+		if (left <= 0) {
+			pipe->bufs[idx].len += size;
+			return size;
+		}
+		pipe->bufs[idx].len = PAGE_SIZE;
+		idx = next_idx(idx, pipe);
+	}
+	while (idx != pipe->curbuf || !pipe->nrbufs) {
+		struct page *page = alloc_page(GFP_USER);
+		if (!page)
+			break;
+		pipe->nrbufs++;
+		pipe->bufs[idx].ops = &default_pipe_buf_ops;
+		pipe->bufs[idx].page = page;
+		pipe->bufs[idx].offset = 0;
+		if (left <= PAGE_SIZE) {
+			pipe->bufs[idx].len = left;
+			return size;
+		}
+		pipe->bufs[idx].len = PAGE_SIZE;
+		left -= PAGE_SIZE;
+		idx = next_idx(idx, pipe);
+	}
+	return size - left;
+}
+
+static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
+				struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	size_t n, off;
+	int idx;
+
+	if (!sanity(i))
+		return 0;
+
+	bytes = n = push_pipe(i, bytes, &idx, &off);
+	if (unlikely(!n))
+		return 0;
+	for ( ; n; idx = next_idx(idx, pipe), off = 0) {
+		size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
+		memcpy_to_page(pipe->bufs[idx].page, off, addr, chunk);
+		i->idx = idx;
+		i->iov_offset = off + chunk;
+		n -= chunk;
+		addr += chunk;
+	}
+	i->count -= bytes;
+	return bytes;
+}
+
 size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
 	const char *from = addr;
+	if (unlikely(i->type & ITER_PIPE))
+		return copy_pipe_to_iter(addr, bytes, i);
 	iterate_and_advance(i, bytes, v,
 		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
 			       v.iov_len),
@@ -394,6 +562,10 @@ EXPORT_SYMBOL(copy_to_iter);
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 {
 	char *to = addr;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
 	iterate_and_advance(i, bytes, v,
 		__copy_from_user((to += v.iov_len) - v.iov_len, v.iov_base,
 				 v.iov_len),
@@ -409,6 +581,10 @@ EXPORT_SYMBOL(copy_from_iter);
 size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 {
 	char *to = addr;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
 	iterate_and_advance(i, bytes, v,
 		__copy_from_user_nocache((to += v.iov_len) - v.iov_len,
 					 v.iov_base, v.iov_len),
@@ -429,14 +605,20 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 		size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
 		kunmap_atomic(kaddr);
 		return wanted;
-	} else
+	} else if (likely(!(i->type & ITER_PIPE)))
 		return copy_page_to_iter_iovec(page, offset, bytes, i);
+	else
+		return copy_page_to_iter_pipe(page, offset, bytes, i);
 }
 EXPORT_SYMBOL(copy_page_to_iter);
 
 size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
 	if (i->type & (ITER_BVEC|ITER_KVEC)) {
 		void *kaddr = kmap_atomic(page);
 		size_t wanted = copy_from_iter(kaddr + offset, bytes, i);
@@ -447,8 +629,34 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 }
 EXPORT_SYMBOL(copy_page_from_iter);
 
+static size_t pipe_zero(size_t bytes, struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	size_t n, off;
+	int idx;
+
+	if (!sanity(i))
+		return 0;
+
+	bytes = n = push_pipe(i, bytes, &idx, &off);
+	if (unlikely(!n))
+		return 0;
+
+	for ( ; n; idx = next_idx(idx, pipe), off = 0) {
+		size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
+		memzero_page(pipe->bufs[idx].page, off, chunk);
+		i->idx = idx;
+		i->iov_offset = off + chunk;
+		n -= chunk;
+	}
+	i->count -= bytes;
+	return bytes;
+}
+
 size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 {
+	if (unlikely(i->type & ITER_PIPE))
+		return pipe_zero(bytes, i);
 	iterate_and_advance(i, bytes, v,
 		__clear_user(v.iov_base, v.iov_len),
 		memzero_page(v.bv_page, v.bv_offset, v.bv_len),
@@ -463,6 +671,11 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 		struct iov_iter *i, unsigned long offset, size_t bytes)
 {
 	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
+	if (unlikely(i->type & ITER_PIPE)) {
+		kunmap_atomic(kaddr);
+		WARN_ON(1);
+		return 0;
+	}
 	iterate_all_kinds(i, bytes, v,
 		__copy_from_user_inatomic((p += v.iov_len) - v.iov_len,
 					  v.iov_base, v.iov_len),
@@ -475,8 +688,55 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 }
 EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
 
+static void pipe_advance(struct iov_iter *i, size_t size)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	struct pipe_buffer *buf;
+	size_t off;
+	int idx;
+	
+	if (unlikely(i->count < size))
+		size = i->count;
+
+	idx = i->idx;
+	off = i->iov_offset;
+	if (size || off) {
+		/* take it relative to the beginning of buffer */
+		size += off - pipe->bufs[idx].offset;
+		while (1) {
+			buf = &pipe->bufs[idx];
+			if (size > buf->len) {
+				size -= buf->len;
+				idx = next_idx(idx, pipe);
+				off = 0;
+			} else {
+				buf->len = size;
+				i->idx = idx;
+				i->iov_offset = off = buf->offset + size;
+				break;
+			}
+		}
+		idx = next_idx(idx, pipe);
+	}
+	if (pipe->nrbufs) {
+		int unused = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+		/* [curbuf,unused) is in use.  Free [idx,unused) */
+		while (idx != unused) {
+			buf = &pipe->bufs[idx];
+			buf->ops->release(pipe, buf);
+			buf->ops = NULL;
+			idx = next_idx(idx, pipe);
+			pipe->nrbufs--;
+		}
+	}
+}
+
 void iov_iter_advance(struct iov_iter *i, size_t size)
 {
+	if (unlikely(i->type & ITER_PIPE)) {
+		pipe_advance(i, size);
+		return;
+	}
 	iterate_and_advance(i, size, v, 0, 0, 0)
 }
 EXPORT_SYMBOL(iov_iter_advance);
@@ -486,6 +746,8 @@ EXPORT_SYMBOL(iov_iter_advance);
  */
 size_t iov_iter_single_seg_count(const struct iov_iter *i)
 {
+	if (unlikely(i->type & ITER_PIPE))
+		return i->count;	// it is a silly place, anyway
 	if (i->nr_segs == 1)
 		return i->count;
 	else if (i->type & ITER_BVEC)
@@ -521,6 +783,19 @@ void iov_iter_bvec(struct iov_iter *i, int direction,
 }
 EXPORT_SYMBOL(iov_iter_bvec);
 
+void iov_iter_pipe(struct iov_iter *i, int direction,
+			struct pipe_inode_info *pipe,
+			size_t count)
+{
+	BUG_ON(direction != ITER_PIPE);
+	i->type = direction;
+	i->pipe = pipe;
+	i->idx = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+	i->iov_offset = 0;
+	i->count = count;
+}
+EXPORT_SYMBOL(iov_iter_pipe);
+
 unsigned long iov_iter_alignment(const struct iov_iter *i)
 {
 	unsigned long res = 0;
@@ -529,6 +804,11 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
 	if (!size)
 		return 0;
 
+	if (unlikely(i->type & ITER_PIPE)) {
+		if (i->iov_offset && allocated(&i->pipe->bufs[i->idx]))
+			return size | i->iov_offset;
+		return size;
+	}
 	iterate_all_kinds(i, size, v,
 		(res |= (unsigned long)v.iov_base | v.iov_len, 0),
 		res |= v.bv_offset | v.bv_len,
@@ -545,6 +825,11 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
 	if (!size)
 		return 0;
 
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return ~0U;
+	}
+
 	iterate_all_kinds(i, size, v,
 		(res |= (!res ? 0 : (unsigned long)v.iov_base) |
 			(size != v.iov_len ? size : 0), 0),
@@ -557,6 +842,47 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_gap_alignment);
 
+static inline size_t __pipe_get_pages(struct iov_iter *i,
+				size_t maxsize,
+				struct page **pages,
+				int idx,
+				size_t *start)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	size_t n = push_pipe(i, maxsize, &idx, start);
+	if (!n)
+		return 0;
+
+	maxsize = n;
+	n += *start;
+	while (n >= PAGE_SIZE) {
+		get_page(*pages++ = pipe->bufs[idx].page);
+		idx = next_idx(idx, pipe);
+		n -= PAGE_SIZE;
+	}
+
+	return maxsize;
+}
+
+static ssize_t pipe_get_pages(struct iov_iter *i,
+		   struct page **pages, size_t maxsize, unsigned maxpages,
+		   size_t *start)
+{
+	unsigned npages;
+	size_t capacity;
+	int idx;
+
+	if (!sanity(i))
+		return 0;
+
+	data_start(i, &idx, start);
+	/* some of this one + all after this one */
+	npages = ((i->pipe->curbuf - idx - 1) & (i->pipe->buffers - 1)) + 1;
+	capacity = min(npages,maxpages) * PAGE_SIZE - *start;
+
+	return __pipe_get_pages(i, min(maxsize, capacity), pages, idx, start);
+}
+
 ssize_t iov_iter_get_pages(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
@@ -567,6 +893,8 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 	if (!maxsize)
 		return 0;
 
+	if (unlikely(i->type & ITER_PIPE))
+		return pipe_get_pages(i, pages, maxsize, maxpages, start);
 	iterate_all_kinds(i, maxsize, v, ({
 		unsigned long addr = (unsigned long)v.iov_base;
 		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
@@ -602,6 +930,37 @@ static struct page **get_pages_array(size_t n)
 	return p;
 }
 
+static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
+		   struct page ***pages, size_t maxsize,
+		   size_t *start)
+{
+	struct page **p;
+	size_t n;
+	int idx;
+	int npages;
+
+	if (!sanity(i))
+		return 0;
+
+	data_start(i, &idx, start);
+	/* some of this one + all after this one */
+	npages = ((i->pipe->curbuf - idx - 1) & (i->pipe->buffers - 1)) + 1;
+	n = npages * PAGE_SIZE - *start;
+	if (maxsize > n)
+		maxsize = n;
+	else
+		npages = DIV_ROUND_UP(maxsize + *start, PAGE_SIZE);
+	p = get_pages_array(npages);
+	if (!p)
+		return -ENOMEM;
+	n = __pipe_get_pages(i, maxsize, p, idx, start);
+	if (n)
+		*pages = p;
+	else
+		kvfree(p);
+	return n;
+}
+
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start)
@@ -614,6 +973,8 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 	if (!maxsize)
 		return 0;
 
+	if (unlikely(i->type & ITER_PIPE))
+		return pipe_get_pages_alloc(i, pages, maxsize, start);
 	iterate_all_kinds(i, maxsize, v, ({
 		unsigned long addr = (unsigned long)v.iov_base;
 		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
@@ -655,6 +1016,10 @@ size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 	__wsum sum, next;
 	size_t off = 0;
 	sum = *csum;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return 0;
+	}
 	iterate_and_advance(i, bytes, v, ({
 		int err = 0;
 		next = csum_and_copy_from_user(v.iov_base, 
@@ -693,6 +1058,10 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum,
 	__wsum sum, next;
 	size_t off = 0;
 	sum = *csum;
+	if (unlikely(i->type & ITER_PIPE)) {
+		WARN_ON(1);	/* for now */
+		return 0;
+	}
 	iterate_and_advance(i, bytes, v, ({
 		int err = 0;
 		next = csum_and_copy_to_user((from += v.iov_len) - v.iov_len,
@@ -732,7 +1101,20 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 	if (!size)
 		return 0;
 
-	iterate_all_kinds(i, size, v, ({
+	if (unlikely(i->type & ITER_PIPE)) {
+		struct pipe_inode_info *pipe = i->pipe;
+		size_t off;
+		int idx;
+
+		if (!sanity(i))
+			return 0;
+
+		data_start(i, &idx, &off);
+		/* some of this one + all after this one */
+		npages = ((pipe->curbuf - idx - 1) & (pipe->buffers - 1)) + 1;
+		if (npages >= maxpages)
+			return maxpages;
+	} else iterate_all_kinds(i, size, v, ({
 		unsigned long p = (unsigned long)v.iov_base;
 		npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
 			- p / PAGE_SIZE;
@@ -757,6 +1139,10 @@ EXPORT_SYMBOL(iov_iter_npages);
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags)
 {
 	*new = *old;
+	if (unlikely(new->type & ITER_PIPE)) {
+		WARN_ON(1);
+		return NULL;
+	}
 	if (new->type & ITER_BVEC)
 		return new->bvec = kmemdup(new->bvec,
 				    new->nr_segs * sizeof(struct bio_vec),
-- 
2.9.3



* [PATCH 11/12] switch generic_file_splice_read() to use of ->read_iter()
  2016-09-23 20:36                                               ` Linus Torvalds
                                                                   ` (3 preceding siblings ...)
  2016-09-24  4:01                                                 ` [PATCH 10/12] new iov_iter flavour: pipe-backed Al Viro
@ 2016-09-24  4:01                                                 ` Al Viro
  2016-09-24  4:02                                                 ` [PATCH 12/12] switch default_file_splice_read() to use of pipe-backed iov_iter Al Viro
  5 siblings, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-24  4:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

... and kill the ->splice_read() instances that can be switched to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 drivers/staging/lustre/lustre/llite/file.c         |  70 ++----
 .../staging/lustre/lustre/llite/llite_internal.h   |  15 +-
 drivers/staging/lustre/lustre/llite/vvp_internal.h |  14 --
 drivers/staging/lustre/lustre/llite/vvp_io.c       |  45 +---
 fs/coda/file.c                                     |  23 +-
 fs/gfs2/file.c                                     |  28 +--
 fs/nfs/file.c                                      |  25 +--
 fs/nfs/internal.h                                  |   2 -
 fs/nfs/nfs4file.c                                  |   2 +-
 fs/ocfs2/file.c                                    |  34 +--
 fs/ocfs2/ocfs2_trace.h                             |   2 -
 fs/splice.c                                        | 244 +++------------------
 fs/xfs/xfs_file.c                                  |  41 +---
 fs/xfs/xfs_trace.h                                 |   1 -
 include/linux/fs.h                                 |   2 -
 mm/shmem.c                                         | 115 +---------
 16 files changed, 58 insertions(+), 605 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
index 57281b9..2567b09 100644
--- a/drivers/staging/lustre/lustre/llite/file.c
+++ b/drivers/staging/lustre/lustre/llite/file.c
@@ -1153,36 +1153,21 @@ restart:
 		int write_mutex_locked = 0;
 
 		vio->vui_fd  = LUSTRE_FPRIVATE(file);
-		vio->vui_io_subtype = args->via_io_subtype;
-
-		switch (vio->vui_io_subtype) {
-		case IO_NORMAL:
-			vio->vui_iter = args->u.normal.via_iter;
-			vio->vui_iocb = args->u.normal.via_iocb;
-			if ((iot == CIT_WRITE) &&
-			    !(vio->vui_fd->fd_flags & LL_FILE_GROUP_LOCKED)) {
-				if (mutex_lock_interruptible(&lli->
-							       lli_write_mutex)) {
-					result = -ERESTARTSYS;
-					goto out;
-				}
-				write_mutex_locked = 1;
+		vio->vui_iter = args->u.normal.via_iter;
+		vio->vui_iocb = args->u.normal.via_iocb;
+		if ((iot == CIT_WRITE) &&
+		    !(vio->vui_fd->fd_flags & LL_FILE_GROUP_LOCKED)) {
+			if (mutex_lock_interruptible(&lli->lli_write_mutex)) {
+				result = -ERESTARTSYS;
+				goto out;
 			}
-			down_read(&lli->lli_trunc_sem);
-			break;
-		case IO_SPLICE:
-			vio->u.splice.vui_pipe = args->u.splice.via_pipe;
-			vio->u.splice.vui_flags = args->u.splice.via_flags;
-			break;
-		default:
-			CERROR("Unknown IO type - %u\n", vio->vui_io_subtype);
-			LBUG();
+			write_mutex_locked = 1;
 		}
+		down_read(&lli->lli_trunc_sem);
 		ll_cl_add(file, env, io);
 		result = cl_io_loop(env, io);
 		ll_cl_remove(file, env);
-		if (args->via_io_subtype == IO_NORMAL)
-			up_read(&lli->lli_trunc_sem);
+		up_read(&lli->lli_trunc_sem);
 		if (write_mutex_locked)
 			mutex_unlock(&lli->lli_write_mutex);
 	} else {
@@ -1237,7 +1222,7 @@ static ssize_t ll_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (IS_ERR(env))
 		return PTR_ERR(env);
 
-	args = ll_env_args(env, IO_NORMAL);
+	args = ll_env_args(env);
 	args->u.normal.via_iter = to;
 	args->u.normal.via_iocb = iocb;
 
@@ -1261,7 +1246,7 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (IS_ERR(env))
 		return PTR_ERR(env);
 
-	args = ll_env_args(env, IO_NORMAL);
+	args = ll_env_args(env);
 	args->u.normal.via_iter = from;
 	args->u.normal.via_iocb = iocb;
 
@@ -1271,31 +1256,6 @@ static ssize_t ll_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	return result;
 }
 
-/*
- * Send file content (through pagecache) somewhere with helper
- */
-static ssize_t ll_file_splice_read(struct file *in_file, loff_t *ppos,
-				   struct pipe_inode_info *pipe, size_t count,
-				   unsigned int flags)
-{
-	struct lu_env      *env;
-	struct vvp_io_args *args;
-	ssize_t	     result;
-	int		 refcheck;
-
-	env = cl_env_get(&refcheck);
-	if (IS_ERR(env))
-		return PTR_ERR(env);
-
-	args = ll_env_args(env, IO_SPLICE);
-	args->u.splice.via_pipe = pipe;
-	args->u.splice.via_flags = flags;
-
-	result = ll_file_io_generic(env, args, in_file, CIT_READ, ppos, count);
-	cl_env_put(env, &refcheck);
-	return result;
-}
-
 static int ll_lov_recreate(struct inode *inode, struct ost_id *oi, u32 ost_idx)
 {
 	struct obd_export *exp = ll_i2dtexp(inode);
@@ -3173,7 +3133,7 @@ struct file_operations ll_file_operations = {
 	.release	= ll_file_release,
 	.mmap	   = ll_file_mmap,
 	.llseek	 = ll_file_seek,
-	.splice_read    = ll_file_splice_read,
+	.splice_read    = generic_file_splice_read,
 	.fsync	  = ll_fsync,
 	.flush	  = ll_flush
 };
@@ -3186,7 +3146,7 @@ struct file_operations ll_file_operations_flock = {
 	.release	= ll_file_release,
 	.mmap	   = ll_file_mmap,
 	.llseek	 = ll_file_seek,
-	.splice_read    = ll_file_splice_read,
+	.splice_read    = generic_file_splice_read,
 	.fsync	  = ll_fsync,
 	.flush	  = ll_flush,
 	.flock	  = ll_file_flock,
@@ -3202,7 +3162,7 @@ struct file_operations ll_file_operations_noflock = {
 	.release	= ll_file_release,
 	.mmap	   = ll_file_mmap,
 	.llseek	 = ll_file_seek,
-	.splice_read    = ll_file_splice_read,
+	.splice_read    = generic_file_splice_read,
 	.fsync	  = ll_fsync,
 	.flush	  = ll_flush,
 	.flock	  = ll_file_noflock,
diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
index 4d6d589..0e738c8 100644
--- a/drivers/staging/lustre/lustre/llite/llite_internal.h
+++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
@@ -800,17 +800,11 @@ void vvp_write_complete(struct vvp_object *club, struct vvp_page *page);
  */
 struct vvp_io_args {
 	/** normal/splice */
-	enum vvp_io_subtype via_io_subtype;
-
 	union {
 		struct {
 			struct kiocb      *via_iocb;
 			struct iov_iter   *via_iter;
 		} normal;
-		struct {
-			struct pipe_inode_info  *via_pipe;
-			unsigned int       via_flags;
-		} splice;
 	} u;
 };
 
@@ -838,14 +832,9 @@ static inline struct ll_thread_info *ll_env_info(const struct lu_env *env)
 	return lti;
 }
 
-static inline struct vvp_io_args *ll_env_args(const struct lu_env *env,
-					      enum vvp_io_subtype type)
+static inline struct vvp_io_args *ll_env_args(const struct lu_env *env)
 {
-	struct vvp_io_args *via = &ll_env_info(env)->lti_args;
-
-	via->via_io_subtype = type;
-
-	return via;
+	return &ll_env_info(env)->lti_args;
 }
 
 void ll_queue_done_writing(struct inode *inode, unsigned long flags);
diff --git a/drivers/staging/lustre/lustre/llite/vvp_internal.h b/drivers/staging/lustre/lustre/llite/vvp_internal.h
index 79fc428..2fa49cc 100644
--- a/drivers/staging/lustre/lustre/llite/vvp_internal.h
+++ b/drivers/staging/lustre/lustre/llite/vvp_internal.h
@@ -49,14 +49,6 @@ struct obd_device;
 struct obd_export;
 struct page;
 
-/* specific architecture can implement only part of this list */
-enum vvp_io_subtype {
-	/** normal IO */
-	IO_NORMAL,
-	/** io started from splice_{read|write} */
-	IO_SPLICE
-};
-
 /**
  * IO state private to IO state private to VVP layer.
  */
@@ -99,10 +91,6 @@ struct vvp_io {
 			bool		ft_flags_valid;
 		} fault;
 		struct {
-			struct pipe_inode_info	*vui_pipe;
-			unsigned int		 vui_flags;
-		} splice;
-		struct {
 			struct cl_page_list vui_queue;
 			unsigned long vui_written;
 			int vui_from;
@@ -110,8 +98,6 @@ struct vvp_io {
 		} write;
 	} u;
 
-	enum vvp_io_subtype	vui_io_subtype;
-
 	/**
 	 * Layout version when this IO is initialized
 	 */
diff --git a/drivers/staging/lustre/lustre/llite/vvp_io.c b/drivers/staging/lustre/lustre/llite/vvp_io.c
index 94916dc..4864600 100644
--- a/drivers/staging/lustre/lustre/llite/vvp_io.c
+++ b/drivers/staging/lustre/lustre/llite/vvp_io.c
@@ -55,18 +55,6 @@ static struct vvp_io *cl2vvp_io(const struct lu_env *env,
 }
 
 /**
- * True, if \a io is a normal io, False for splice_{read,write}
- */
-static int cl_is_normalio(const struct lu_env *env, const struct cl_io *io)
-{
-	struct vvp_io *vio = vvp_env_io(env);
-
-	LASSERT(io->ci_type == CIT_READ || io->ci_type == CIT_WRITE);
-
-	return vio->vui_io_subtype == IO_NORMAL;
-}
-
-/**
  * For swapping layout. The file's layout may have changed.
  * To avoid populating pages to a wrong stripe, we have to verify the
  * correctness of layout. It works because swapping layout processes
@@ -391,9 +379,6 @@ static int vvp_mmap_locks(const struct lu_env *env,
 
 	LASSERT(io->ci_type == CIT_READ || io->ci_type == CIT_WRITE);
 
-	if (!cl_is_normalio(env, io))
-		return 0;
-
 	if (!vio->vui_iter) /* nfs or loop back device write */
 		return 0;
 
@@ -462,15 +447,10 @@ static void vvp_io_advance(const struct lu_env *env,
 			   const struct cl_io_slice *ios,
 			   size_t nob)
 {
-	struct vvp_io    *vio = cl2vvp_io(env, ios);
-	struct cl_io     *io  = ios->cis_io;
 	struct cl_object *obj = ios->cis_io->ci_obj;
-
+	struct vvp_io	 *vio = cl2vvp_io(env, ios);
 	CLOBINVRNT(env, obj, vvp_object_invariant(obj));
 
-	if (!cl_is_normalio(env, io))
-		return;
-
 	iov_iter_reexpand(vio->vui_iter, vio->vui_tot_count  -= nob);
 }
 
@@ -479,7 +459,7 @@ static void vvp_io_update_iov(const struct lu_env *env,
 {
 	size_t size = io->u.ci_rw.crw_count;
 
-	if (!cl_is_normalio(env, io) || !vio->vui_iter)
+	if (!vio->vui_iter)
 		return;
 
 	iov_iter_truncate(vio->vui_iter, size);
@@ -716,25 +696,8 @@ static int vvp_io_read_start(const struct lu_env *env,
 
 	/* BUG: 5972 */
 	file_accessed(file);
-	switch (vio->vui_io_subtype) {
-	case IO_NORMAL:
-		LASSERT(vio->vui_iocb->ki_pos == pos);
-		result = generic_file_read_iter(vio->vui_iocb, vio->vui_iter);
-		break;
-	case IO_SPLICE:
-		result = generic_file_splice_read(file, &pos,
-						  vio->u.splice.vui_pipe, cnt,
-						  vio->u.splice.vui_flags);
-		/* LU-1109: do splice read stripe by stripe otherwise if it
-		 * may make nfsd stuck if this read occupied all internal pipe
-		 * buffers.
-		 */
-		io->ci_continue = 0;
-		break;
-	default:
-		CERROR("Wrong IO type %u\n", vio->vui_io_subtype);
-		LBUG();
-	}
+	LASSERT(vio->vui_iocb->ki_pos == pos);
+	result = generic_file_read_iter(vio->vui_iocb, vio->vui_iter);
 
 out:
 	if (result >= 0) {
diff --git a/fs/coda/file.c b/fs/coda/file.c
index f47c748..8415d4f 100644
--- a/fs/coda/file.c
+++ b/fs/coda/file.c
@@ -38,27 +38,6 @@ coda_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 }
 
 static ssize_t
-coda_file_splice_read(struct file *coda_file, loff_t *ppos,
-		      struct pipe_inode_info *pipe, size_t count,
-		      unsigned int flags)
-{
-	ssize_t (*splice_read)(struct file *, loff_t *,
-			       struct pipe_inode_info *, size_t, unsigned int);
-	struct coda_file_info *cfi;
-	struct file *host_file;
-
-	cfi = CODA_FTOC(coda_file);
-	BUG_ON(!cfi || cfi->cfi_magic != CODA_MAGIC);
-	host_file = cfi->cfi_container;
-
-	splice_read = host_file->f_op->splice_read;
-	if (!splice_read)
-		splice_read = default_file_splice_read;
-
-	return splice_read(host_file, ppos, pipe, count, flags);
-}
-
-static ssize_t
 coda_file_write_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct file *coda_file = iocb->ki_filp;
@@ -225,6 +204,6 @@ const struct file_operations coda_file_operations = {
 	.open		= coda_open,
 	.release	= coda_release,
 	.fsync		= coda_fsync,
-	.splice_read	= coda_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 };
 
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 320e65e..7016a6a7 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -954,30 +954,6 @@ out_uninit:
 	return ret;
 }
 
-static ssize_t gfs2_file_splice_read(struct file *in, loff_t *ppos,
-				     struct pipe_inode_info *pipe, size_t len,
-				     unsigned int flags)
-{
-	struct inode *inode = in->f_mapping->host;
-	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_holder gh;
-	int ret;
-
-	inode_lock(inode);
-
-	ret = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
-	if (ret) {
-		inode_unlock(inode);
-		return ret;
-	}
-
-	gfs2_glock_dq_uninit(&gh);
-	inode_unlock(inode);
-
-	return generic_file_splice_read(in, ppos, pipe, len, flags);
-}
-
-
 static ssize_t gfs2_file_splice_write(struct pipe_inode_info *pipe,
 				      struct file *out, loff_t *ppos,
 				      size_t len, unsigned int flags)
@@ -1140,7 +1116,7 @@ const struct file_operations gfs2_file_fops = {
 	.fsync		= gfs2_fsync,
 	.lock		= gfs2_lock,
 	.flock		= gfs2_flock,
-	.splice_read	= gfs2_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= gfs2_file_splice_write,
 	.setlease	= simple_nosetlease,
 	.fallocate	= gfs2_fallocate,
@@ -1168,7 +1144,7 @@ const struct file_operations gfs2_file_fops_nolock = {
 	.open		= gfs2_open,
 	.release	= gfs2_release,
 	.fsync		= gfs2_fsync,
-	.splice_read	= gfs2_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= gfs2_file_splice_write,
 	.setlease	= generic_setlease,
 	.fallocate	= gfs2_fallocate,
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 7d62097..5048585 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -182,29 +182,6 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
 }
 EXPORT_SYMBOL_GPL(nfs_file_read);
 
-ssize_t
-nfs_file_splice_read(struct file *filp, loff_t *ppos,
-		     struct pipe_inode_info *pipe, size_t count,
-		     unsigned int flags)
-{
-	struct inode *inode = file_inode(filp);
-	ssize_t res;
-
-	dprintk("NFS: splice_read(%pD2, %lu@%Lu)\n",
-		filp, (unsigned long) count, (unsigned long long) *ppos);
-
-	nfs_start_io_read(inode);
-	res = nfs_revalidate_mapping(inode, filp->f_mapping);
-	if (!res) {
-		res = generic_file_splice_read(filp, ppos, pipe, count, flags);
-		if (res > 0)
-			nfs_add_stats(inode, NFSIOS_NORMALREADBYTES, res);
-	}
-	nfs_end_io_read(inode);
-	return res;
-}
-EXPORT_SYMBOL_GPL(nfs_file_splice_read);
-
 int
 nfs_file_mmap(struct file * file, struct vm_area_struct * vma)
 {
@@ -868,7 +845,7 @@ const struct file_operations nfs_file_operations = {
 	.fsync		= nfs_file_fsync,
 	.lock		= nfs_lock,
 	.flock		= nfs_flock,
-	.splice_read	= nfs_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.check_flags	= nfs_check_flags,
 	.setlease	= simple_nosetlease,
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 74935a1..d7b062b 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -365,8 +365,6 @@ int nfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *)
 int nfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync);
 loff_t nfs_file_llseek(struct file *, loff_t, int);
 ssize_t nfs_file_read(struct kiocb *, struct iov_iter *);
-ssize_t nfs_file_splice_read(struct file *, loff_t *, struct pipe_inode_info *,
-			     size_t, unsigned int);
 int nfs_file_mmap(struct file *, struct vm_area_struct *);
 ssize_t nfs_file_write(struct kiocb *, struct iov_iter *);
 int nfs_file_release(struct inode *, struct file *);
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index d085ad7..89a7795 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -248,7 +248,7 @@ const struct file_operations nfs4_file_operations = {
 	.fsync		= nfs_file_fsync,
 	.lock		= nfs_lock,
 	.flock		= nfs_flock,
-	.splice_read	= nfs_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.check_flags	= nfs_check_flags,
 	.setlease	= simple_nosetlease,
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 4e7b0dc..6596e41 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2307,36 +2307,6 @@ out_mutex:
 	return ret;
 }
 
-static ssize_t ocfs2_file_splice_read(struct file *in,
-				      loff_t *ppos,
-				      struct pipe_inode_info *pipe,
-				      size_t len,
-				      unsigned int flags)
-{
-	int ret = 0, lock_level = 0;
-	struct inode *inode = file_inode(in);
-
-	trace_ocfs2_file_splice_read(inode, in, in->f_path.dentry,
-			(unsigned long long)OCFS2_I(inode)->ip_blkno,
-			in->f_path.dentry->d_name.len,
-			in->f_path.dentry->d_name.name, len);
-
-	/*
-	 * See the comment in ocfs2_file_read_iter()
-	 */
-	ret = ocfs2_inode_lock_atime(inode, in->f_path.mnt, &lock_level);
-	if (ret < 0) {
-		mlog_errno(ret);
-		goto bail;
-	}
-	ocfs2_inode_unlock(inode, lock_level);
-
-	ret = generic_file_splice_read(in, ppos, pipe, len, flags);
-
-bail:
-	return ret;
-}
-
 static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
 				   struct iov_iter *to)
 {
@@ -2495,7 +2465,7 @@ const struct file_operations ocfs2_fops = {
 #endif
 	.lock		= ocfs2_lock,
 	.flock		= ocfs2_flock,
-	.splice_read	= ocfs2_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
 };
@@ -2540,7 +2510,7 @@ const struct file_operations ocfs2_fops_no_plocks = {
 	.compat_ioctl   = ocfs2_compat_ioctl,
 #endif
 	.flock		= ocfs2_flock,
-	.splice_read	= ocfs2_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
 };
diff --git a/fs/ocfs2/ocfs2_trace.h b/fs/ocfs2/ocfs2_trace.h
index f8f5fc5..0b58abc 100644
--- a/fs/ocfs2/ocfs2_trace.h
+++ b/fs/ocfs2/ocfs2_trace.h
@@ -1314,8 +1314,6 @@ DEFINE_OCFS2_FILE_OPS(ocfs2_file_aio_write);
 
 DEFINE_OCFS2_FILE_OPS(ocfs2_file_splice_write);
 
-DEFINE_OCFS2_FILE_OPS(ocfs2_file_splice_read);
-
 DEFINE_OCFS2_FILE_OPS(ocfs2_file_aio_read);
 
 DEFINE_OCFS2_ULL_ULL_ULL_EVENT(ocfs2_truncate_file);
diff --git a/fs/splice.c b/fs/splice.c
index 589a1d5..58c322a 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -281,207 +281,6 @@ void splice_shrink_spd(struct splice_pipe_desc *spd)
 	kfree(spd->partial);
 }
 
-static int
-__generic_file_splice_read(struct file *in, loff_t *ppos,
-			   struct pipe_inode_info *pipe, size_t len,
-			   unsigned int flags)
-{
-	struct address_space *mapping = in->f_mapping;
-	unsigned int loff, nr_pages, req_pages;
-	struct page *pages[PIPE_DEF_BUFFERS];
-	struct partial_page partial[PIPE_DEF_BUFFERS];
-	struct page *page;
-	pgoff_t index, end_index;
-	loff_t isize;
-	int error, page_nr;
-	struct splice_pipe_desc spd = {
-		.pages = pages,
-		.partial = partial,
-		.nr_pages_max = PIPE_DEF_BUFFERS,
-		.flags = flags,
-		.ops = &page_cache_pipe_buf_ops,
-		.spd_release = spd_release_page,
-	};
-
-	if (splice_grow_spd(pipe, &spd))
-		return -ENOMEM;
-
-	index = *ppos >> PAGE_SHIFT;
-	loff = *ppos & ~PAGE_MASK;
-	req_pages = (len + loff + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	nr_pages = min(req_pages, spd.nr_pages_max);
-
-	/*
-	 * Lookup the (hopefully) full range of pages we need.
-	 */
-	spd.nr_pages = find_get_pages_contig(mapping, index, nr_pages, spd.pages);
-	index += spd.nr_pages;
-
-	/*
-	 * If find_get_pages_contig() returned fewer pages than we needed,
-	 * readahead/allocate the rest and fill in the holes.
-	 */
-	if (spd.nr_pages < nr_pages)
-		page_cache_sync_readahead(mapping, &in->f_ra, in,
-				index, req_pages - spd.nr_pages);
-
-	error = 0;
-	while (spd.nr_pages < nr_pages) {
-		/*
-		 * Page could be there, find_get_pages_contig() breaks on
-		 * the first hole.
-		 */
-		page = find_get_page(mapping, index);
-		if (!page) {
-			/*
-			 * page didn't exist, allocate one.
-			 */
-			page = page_cache_alloc_cold(mapping);
-			if (!page)
-				break;
-
-			error = add_to_page_cache_lru(page, mapping, index,
-				   mapping_gfp_constraint(mapping, GFP_KERNEL));
-			if (unlikely(error)) {
-				put_page(page);
-				if (error == -EEXIST)
-					continue;
-				break;
-			}
-			/*
-			 * add_to_page_cache() locks the page, unlock it
-			 * to avoid convoluting the logic below even more.
-			 */
-			unlock_page(page);
-		}
-
-		spd.pages[spd.nr_pages++] = page;
-		index++;
-	}
-
-	/*
-	 * Now loop over the map and see if we need to start IO on any
-	 * pages, fill in the partial map, etc.
-	 */
-	index = *ppos >> PAGE_SHIFT;
-	nr_pages = spd.nr_pages;
-	spd.nr_pages = 0;
-	for (page_nr = 0; page_nr < nr_pages; page_nr++) {
-		unsigned int this_len;
-
-		if (!len)
-			break;
-
-		/*
-		 * this_len is the max we'll use from this page
-		 */
-		this_len = min_t(unsigned long, len, PAGE_SIZE - loff);
-		page = spd.pages[page_nr];
-
-		if (PageReadahead(page))
-			page_cache_async_readahead(mapping, &in->f_ra, in,
-					page, index, req_pages - page_nr);
-
-		/*
-		 * If the page isn't uptodate, we may need to start io on it
-		 */
-		if (!PageUptodate(page)) {
-			lock_page(page);
-
-			/*
-			 * Page was truncated, or invalidated by the
-			 * filesystem.  Redo the find/create, but this time the
-			 * page is kept locked, so there's no chance of another
-			 * race with truncate/invalidate.
-			 */
-			if (!page->mapping) {
-				unlock_page(page);
-retry_lookup:
-				page = find_or_create_page(mapping, index,
-						mapping_gfp_mask(mapping));
-
-				if (!page) {
-					error = -ENOMEM;
-					break;
-				}
-				put_page(spd.pages[page_nr]);
-				spd.pages[page_nr] = page;
-			}
-			/*
-			 * page was already under io and is now done, great
-			 */
-			if (PageUptodate(page)) {
-				unlock_page(page);
-				goto fill_it;
-			}
-
-			/*
-			 * need to read in the page
-			 */
-			error = mapping->a_ops->readpage(in, page);
-			if (unlikely(error)) {
-				/*
-				 * Re-lookup the page
-				 */
-				if (error == AOP_TRUNCATED_PAGE)
-					goto retry_lookup;
-
-				break;
-			}
-		}
-fill_it:
-		/*
-		 * i_size must be checked after PageUptodate.
-		 */
-		isize = i_size_read(mapping->host);
-		end_index = (isize - 1) >> PAGE_SHIFT;
-		if (unlikely(!isize || index > end_index))
-			break;
-
-		/*
-		 * if this is the last page, see if we need to shrink
-		 * the length and stop
-		 */
-		if (end_index == index) {
-			unsigned int plen;
-
-			/*
-			 * max good bytes in this page
-			 */
-			plen = ((isize - 1) & ~PAGE_MASK) + 1;
-			if (plen <= loff)
-				break;
-
-			/*
-			 * force quit after adding this page
-			 */
-			this_len = min(this_len, plen - loff);
-			len = this_len;
-		}
-
-		spd.partial[page_nr].offset = loff;
-		spd.partial[page_nr].len = this_len;
-		len -= this_len;
-		loff = 0;
-		spd.nr_pages++;
-		index++;
-	}
-
-	/*
-	 * Release any pages at the end, if we quit early. 'page_nr' is how far
-	 * we got, 'nr_pages' is how many pages are in the map.
-	 */
-	while (page_nr < nr_pages)
-		put_page(spd.pages[page_nr++]);
-	in->f_ra.prev_pos = (loff_t)index << PAGE_SHIFT;
-
-	if (spd.nr_pages)
-		error = splice_to_pipe(pipe, &spd);
-
-	splice_shrink_spd(&spd);
-	return error;
-}
-
 /**
  * generic_file_splice_read - splice data from file to a pipe
  * @in:		file to splice from
@@ -492,32 +291,46 @@ fill_it:
  *
  * Description:
  *    Will read pages from given file and fill them into a pipe. Can be
- *    used as long as the address_space operations for the source implements
- *    a readpage() hook.
+ *    used as long as it has more or less sane ->read_iter().
  *
  */
 ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
 {
-	loff_t isize, left;
-	int ret;
-
-	if (IS_DAX(in->f_mapping->host))
-		return default_file_splice_read(in, ppos, pipe, len, flags);
+	struct iov_iter to;
+	struct kiocb kiocb;
+	loff_t isize;
+	int idx, ret;
 
 	isize = i_size_read(in->f_mapping->host);
 	if (unlikely(*ppos >= isize))
 		return 0;
 
-	left = isize - *ppos;
-	if (unlikely(left < len))
-		len = left;
-
-	ret = __generic_file_splice_read(in, ppos, pipe, len, flags);
+	iov_iter_pipe(&to, ITER_PIPE | READ, pipe, len);
+	idx = to.idx;
+	init_sync_kiocb(&kiocb, in);
+	kiocb.ki_pos = *ppos;
+	ret = in->f_op->read_iter(&kiocb, &to);
 	if (ret > 0) {
-		*ppos += ret;
+		*ppos = kiocb.ki_pos;
 		file_accessed(in);
+	} else if (ret < 0) {
+		if (WARN_ON(to.idx != idx || to.iov_offset)) {
+			/*
+			 * a bogus ->read_iter() has copied something and still
+			 * returned an error instead of a short read.
+			 */
+			to.idx = idx;
+			to.iov_offset = 0;
+			iov_iter_advance(&to, 0); /* to free what was emitted */
+		}
+		/*
+		 * callers of ->splice_read() expect -EAGAIN on
+		 * "can't put anything in there", rather than -EFAULT.
+		 */
+		if (ret == -EFAULT)
+			ret = -EAGAIN;
 	}
 
 	return ret;
@@ -580,7 +393,7 @@ ssize_t kernel_write(struct file *file, const char *buf, size_t count,
 }
 EXPORT_SYMBOL(kernel_write);
 
-ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
+static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
 {
@@ -675,7 +488,6 @@ err:
 	res = error;
 	goto shrink_ret;
 }
-EXPORT_SYMBOL(default_file_splice_read);
 
 /*
  * Send 'sd->len' bytes to socket from 'sd->file' at position 'sd->pos'
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e612a02..92f16cf 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -399,45 +399,6 @@ xfs_file_read_iter(
 	return ret;
 }
 
-STATIC ssize_t
-xfs_file_splice_read(
-	struct file		*infilp,
-	loff_t			*ppos,
-	struct pipe_inode_info	*pipe,
-	size_t			count,
-	unsigned int		flags)
-{
-	struct xfs_inode	*ip = XFS_I(infilp->f_mapping->host);
-	ssize_t			ret;
-
-	XFS_STATS_INC(ip->i_mount, xs_read_calls);
-
-	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
-		return -EIO;
-
-	trace_xfs_file_splice_read(ip, count, *ppos);
-
-	/*
-	 * DAX inodes cannot ues the page cache for splice, so we have to push
-	 * them through the VFS IO path. This means it goes through
-	 * ->read_iter, which for us takes the XFS_IOLOCK_SHARED. Hence we
-	 * cannot lock the splice operation at this level for DAX inodes.
-	 */
-	if (IS_DAX(VFS_I(ip))) {
-		ret = default_file_splice_read(infilp, ppos, pipe, count,
-					       flags);
-		goto out;
-	}
-
-	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
-	ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
-	xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
-out:
-	if (ret > 0)
-		XFS_STATS_ADD(ip->i_mount, xs_read_bytes, ret);
-	return ret;
-}
-
 /*
  * Zero any on disk space between the current EOF and the new, larger EOF.
  *
@@ -1652,7 +1613,7 @@ const struct file_operations xfs_file_operations = {
 	.llseek		= xfs_file_llseek,
 	.read_iter	= xfs_file_read_iter,
 	.write_iter	= xfs_file_write_iter,
-	.splice_read	= xfs_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.unlocked_ioctl	= xfs_file_ioctl,
 #ifdef CONFIG_COMPAT
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index d303a66..f31db44 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1170,7 +1170,6 @@ DEFINE_RW_EVENT(xfs_file_dax_read);
 DEFINE_RW_EVENT(xfs_file_buffered_write);
 DEFINE_RW_EVENT(xfs_file_direct_write);
 DEFINE_RW_EVENT(xfs_file_dax_write);
-DEFINE_RW_EVENT(xfs_file_splice_read);
 
 DECLARE_EVENT_CLASS(xfs_page_class,
 	TP_PROTO(struct inode *inode, struct page *page, unsigned long off,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 901e25d..b04883e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2794,8 +2794,6 @@ extern void block_sync_page(struct page *page);
 /* fs/splice.c */
 extern ssize_t generic_file_splice_read(struct file *, loff_t *,
 		struct pipe_inode_info *, size_t, unsigned int);
-extern ssize_t default_file_splice_read(struct file *, loff_t *,
-		struct pipe_inode_info *, size_t, unsigned int);
 extern ssize_t iter_file_splice_write(struct pipe_inode_info *,
 		struct file *, loff_t *, size_t, unsigned int);
 extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
diff --git a/mm/shmem.c b/mm/shmem.c
index fd8b2b5..84d7077 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2310,119 +2310,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return retval ? retval : error;
 }
 
-static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
-				struct pipe_inode_info *pipe, size_t len,
-				unsigned int flags)
-{
-	struct address_space *mapping = in->f_mapping;
-	struct inode *inode = mapping->host;
-	unsigned int loff, nr_pages, req_pages;
-	struct page *pages[PIPE_DEF_BUFFERS];
-	struct partial_page partial[PIPE_DEF_BUFFERS];
-	struct page *page;
-	pgoff_t index, end_index;
-	loff_t isize, left;
-	int error, page_nr;
-	struct splice_pipe_desc spd = {
-		.pages = pages,
-		.partial = partial,
-		.nr_pages_max = PIPE_DEF_BUFFERS,
-		.flags = flags,
-		.ops = &page_cache_pipe_buf_ops,
-		.spd_release = spd_release_page,
-	};
-
-	isize = i_size_read(inode);
-	if (unlikely(*ppos >= isize))
-		return 0;
-
-	left = isize - *ppos;
-	if (unlikely(left < len))
-		len = left;
-
-	if (splice_grow_spd(pipe, &spd))
-		return -ENOMEM;
-
-	index = *ppos >> PAGE_SHIFT;
-	loff = *ppos & ~PAGE_MASK;
-	req_pages = (len + loff + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	nr_pages = min(req_pages, spd.nr_pages_max);
-
-	spd.nr_pages = find_get_pages_contig(mapping, index,
-						nr_pages, spd.pages);
-	index += spd.nr_pages;
-	error = 0;
-
-	while (spd.nr_pages < nr_pages) {
-		error = shmem_getpage(inode, index, &page, SGP_CACHE);
-		if (error)
-			break;
-		unlock_page(page);
-		spd.pages[spd.nr_pages++] = page;
-		index++;
-	}
-
-	index = *ppos >> PAGE_SHIFT;
-	nr_pages = spd.nr_pages;
-	spd.nr_pages = 0;
-
-	for (page_nr = 0; page_nr < nr_pages; page_nr++) {
-		unsigned int this_len;
-
-		if (!len)
-			break;
-
-		this_len = min_t(unsigned long, len, PAGE_SIZE - loff);
-		page = spd.pages[page_nr];
-
-		if (!PageUptodate(page) || page->mapping != mapping) {
-			error = shmem_getpage(inode, index, &page, SGP_CACHE);
-			if (error)
-				break;
-			unlock_page(page);
-			put_page(spd.pages[page_nr]);
-			spd.pages[page_nr] = page;
-		}
-
-		isize = i_size_read(inode);
-		end_index = (isize - 1) >> PAGE_SHIFT;
-		if (unlikely(!isize || index > end_index))
-			break;
-
-		if (end_index == index) {
-			unsigned int plen;
-
-			plen = ((isize - 1) & ~PAGE_MASK) + 1;
-			if (plen <= loff)
-				break;
-
-			this_len = min(this_len, plen - loff);
-			len = this_len;
-		}
-
-		spd.partial[page_nr].offset = loff;
-		spd.partial[page_nr].len = this_len;
-		len -= this_len;
-		loff = 0;
-		spd.nr_pages++;
-		index++;
-	}
-
-	while (page_nr < nr_pages)
-		put_page(spd.pages[page_nr++]);
-
-	if (spd.nr_pages)
-		error = splice_to_pipe(pipe, &spd);
-
-	splice_shrink_spd(&spd);
-
-	if (error > 0) {
-		*ppos += error;
-		file_accessed(in);
-	}
-	return error;
-}
-
 /*
  * llseek SEEK_DATA or SEEK_HOLE through the radix_tree.
  */
@@ -3785,7 +3672,7 @@ static const struct file_operations shmem_file_operations = {
 	.read_iter	= shmem_file_read_iter,
 	.write_iter	= generic_file_write_iter,
 	.fsync		= noop_fsync,
-	.splice_read	= shmem_file_splice_read,
+	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= shmem_fallocate,
 #endif
-- 
2.9.3
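
[An illustration of the end state, sketched rather than taken from the
tree: with this series, any filesystem whose ->read_iter() behaves
needs no ->splice_read() of its own.  A hypothetical "foo" filesystem
built on the generic paths would wire up as

	const struct file_operations foo_file_operations = {
		.llseek		= generic_file_llseek,
		.read_iter	= generic_file_read_iter,
		.write_iter	= generic_file_write_iter,
		.mmap		= generic_file_mmap,
		.splice_read	= generic_file_splice_read,
		.splice_write	= iter_file_splice_write,
	};

which is the shape the xfs and shmem hunks above converge on.]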


^ permalink raw reply related	[flat|nested] 152+ messages in thread

* [PATCH 12/12] switch default_file_splice_read() to use of pipe-backed iov_iter
  2016-09-23 20:36                                               ` Linus Torvalds
                                                                   ` (4 preceding siblings ...)
  2016-09-24  4:01                                                 ` [PATCH 11/12] switch generic_file_splice_read() to use of ->read_iter() Al Viro
@ 2016-09-24  4:02                                                 ` Al Viro
  5 siblings, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-24  4:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

we only use iov_iter_get_pages_alloc() and iov_iter_advance() -
the pages are filled by kernel_readv() via a kvec array (as we
have done all along), so the iov_iter here serves only as a way
of arranging for those pages to end up in the pipe.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/splice.c | 111 ++++++++++++++++++++++--------------------------------------
 1 file changed, 40 insertions(+), 71 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 58c322a..0df907b 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -361,7 +361,7 @@ const struct pipe_buf_operations nosteal_pipe_buf_ops = {
 };
 EXPORT_SYMBOL(nosteal_pipe_buf_ops);
 
-static ssize_t kernel_readv(struct file *file, const struct iovec *vec,
+static ssize_t kernel_readv(struct file *file, const struct kvec *vec,
 			    unsigned long vlen, loff_t offset)
 {
 	mm_segment_t old_fs;
@@ -397,96 +397,65 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
 {
+	struct kvec *vec, __vec[PIPE_DEF_BUFFERS];
+	struct iov_iter to;
+	struct page **pages;
 	unsigned int nr_pages;
-	unsigned int nr_freed;
-	size_t offset;
-	struct page *pages[PIPE_DEF_BUFFERS];
-	struct partial_page partial[PIPE_DEF_BUFFERS];
-	struct iovec *vec, __vec[PIPE_DEF_BUFFERS];
+	size_t offset, dummy, copied = 0;
 	ssize_t res;
-	size_t this_len;
-	int error;
 	int i;
-	struct splice_pipe_desc spd = {
-		.pages = pages,
-		.partial = partial,
-		.nr_pages_max = PIPE_DEF_BUFFERS,
-		.flags = flags,
-		.ops = &default_pipe_buf_ops,
-		.spd_release = spd_release_page,
-	};
 
-	if (splice_grow_spd(pipe, &spd))
+	if (pipe->nrbufs == pipe->buffers)
+		return -EAGAIN;
+
+	/*
+	 * Try to keep page boundaries matching to source pagecache ones -
+	 * it probably won't be much help, but...
+	 */
+	offset = *ppos & ~PAGE_MASK;
+
+	iov_iter_pipe(&to, ITER_PIPE | READ, pipe, len + offset);
+
+	res = iov_iter_get_pages_alloc(&to, &pages, len + offset, &dummy);
+	if (res <= 0)
 		return -ENOMEM;
 
-	res = -ENOMEM;
+	nr_pages = res / PAGE_SIZE;
+
 	vec = __vec;
-	if (spd.nr_pages_max > PIPE_DEF_BUFFERS) {
-		vec = kmalloc(spd.nr_pages_max * sizeof(struct iovec), GFP_KERNEL);
-		if (!vec)
-			goto shrink_ret;
+	if (nr_pages > PIPE_DEF_BUFFERS) {
+		vec = kmalloc(nr_pages * sizeof(struct kvec), GFP_KERNEL);
+		if (unlikely(!vec)) {
+			res = -ENOMEM;
+			goto out;
+		}
 	}
 
-	offset = *ppos & ~PAGE_MASK;
-	nr_pages = (len + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;
-
-	for (i = 0; i < nr_pages && i < spd.nr_pages_max && len; i++) {
-		struct page *page;
-
-		page = alloc_page(GFP_USER);
-		error = -ENOMEM;
-		if (!page)
-			goto err;
+	pipe->bufs[to.idx].offset = offset;
+	pipe->bufs[to.idx].len -= offset;
 
-		this_len = min_t(size_t, len, PAGE_SIZE - offset);
-		vec[i].iov_base = (void __user *) page_address(page);
+	for (i = 0; i < nr_pages; i++) {
+		size_t this_len = min_t(size_t, len, PAGE_SIZE - offset);
+		vec[i].iov_base = page_address(pages[i]) + offset;
 		vec[i].iov_len = this_len;
-		spd.pages[i] = page;
-		spd.nr_pages++;
 		len -= this_len;
 		offset = 0;
 	}
 
-	res = kernel_readv(in, vec, spd.nr_pages, *ppos);
-	if (res < 0) {
-		error = res;
-		goto err;
-	}
-
-	error = 0;
-	if (!res)
-		goto err;
-
-	nr_freed = 0;
-	for (i = 0; i < spd.nr_pages; i++) {
-		this_len = min_t(size_t, vec[i].iov_len, res);
-		spd.partial[i].offset = 0;
-		spd.partial[i].len = this_len;
-		if (!this_len) {
-			__free_page(spd.pages[i]);
-			spd.pages[i] = NULL;
-			nr_freed++;
-		}
-		res -= this_len;
-	}
-	spd.nr_pages -= nr_freed;
-
-	res = splice_to_pipe(pipe, &spd);
-	if (res > 0)
+	res = kernel_readv(in, vec, nr_pages, *ppos);
+	if (res > 0) {
+		copied = res;
 		*ppos += res;
+	}
 
-shrink_ret:
 	if (vec != __vec)
 		kfree(vec);
-	splice_shrink_spd(&spd);
+out:
+	for (i = 0; i < nr_pages; i++)
+		put_page(pages[i]);
+	kvfree(pages);
+	iov_iter_advance(&to, copied);	/* truncates and discards */
 	return res;
-
-err:
-	for (i = 0; i < spd.nr_pages; i++)
-		__free_page(spd.pages[i]);
-
-	res = error;
-	goto shrink_ret;
 }
 
 /*
-- 
2.9.3
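
[For context: kernel_readv() itself is untouched here beyond the kvec
in its signature; from memory its body looks roughly like the below -
a sketch, not a verbatim quote.

	static ssize_t kernel_readv(struct file *file, const struct kvec *vec,
				    unsigned long vlen, loff_t offset)
	{
		mm_segment_t old_fs;
		loff_t pos = offset;
		ssize_t res;

		old_fs = get_fs();
		set_fs(get_ds());
		/* the cast passes uaccess checks only because of set_fs() */
		res = vfs_readv(file, (const struct iovec __user *)vec,
				vlen, &pos, 0);
		set_fs(old_fs);

		return res;
	}
]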


^ permalink raw reply related	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-24  3:59                                                 ` Al Viro
@ 2016-09-24 17:29                                                   ` Al Viro
  2016-09-27 15:38                                                     ` Nicholas Piggin
  2016-09-27 15:53                                                       ` Chuck Lever
  0 siblings, 2 replies; 152+ messages in thread
From: Al Viro @ 2016-09-24 17:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Sat, Sep 24, 2016 at 04:59:08AM +0100, Al Viro wrote:

> 	FWIW, updated (with fixes) and force-pushed.  Added piece:
> default_file_splice_read() converted to iov_iter.  Seems to work, after
> fixing a braino in __pipe_get_pages().  Changed: #4 (sleep only in the
> beginning, as described above), #6 (context changes from #4), #10 (missing
> get_page() added in __pipe_get_pages()), #11 (removed pointless truncation
> of len - ->read_iter() can bloody well handle that on its own) and added #12.
> Stands at 28 files changed, 657 insertions(+), 1009 deletions(-) now...

	I think I see how to get full zero-copy (including the write side
of things).  Just add a "from" side for ITER_PIPE iov_iter (advance,
get_pages, get_pages_alloc, npages and alignment will need to behave
differently for "to" and "from" ones) and pull the following trick:
have fault_in_readable return NULL instead of 0, ERR_PTR(-EFAULT) instead
of -EFAULT *and* return a struct page if it was asked for a full-page
range on a page that could be successfully stolen (only "from pipe" iov_iter
would go for the last one, of course).  Then we make generic_perform_write()
shove the return value of fault-in into 'page'.  ->write_begin() is given
&page as an argument, to return the resulting page via that.  All instances
currently just store into that pointer, completely ignoring the prior value.
And they'll keep working just fine.

	Let's make sure that all method call sites outside of
generic_perform_write() (there's only one such, actually) have NULL
stored in there prior to the call.  Now we can start switching the
instances to zero-copy support - all it takes is replacing
grab_cache_page_write_begin() with "if *page is non-NULL, try to
shove it (locked, non-uptodate) into pagecache; if that succeeds grab a
reference to our page and we are done, if it fails - fall back to
grab_cache_page_write_begin()".  Then do get_block, etc., or whatever that
->write_begin() instance would normally do, just remember not to zero anything
if the page had been passed to us by caller.

	Now all we need is to make sure that iov_iter_copy_from_user_atomic()
for those guys recognizes the case of full-page copy when source and target
are the same page and quietly returns PAGE_SIZE.  Voila - we can make
iter_file_splice_write() pass pipe-backed iov_iter instead of bvec-backed
one *and* get write-side zero-copy for all filesystems with ->write_begin()
taught to handle that (see above).  Since the filesystems with unmodified
->write_begin() will act correctly (just do the copying), we don't have
to make that a flagday change; ->write_begin() instances can be switched
one by one.  Similar treatment of iomap_write_begin()/iomap_write_actor()
would cover iomap-using ->write_iter() instances.

	It's clearly not something I want to touch until -rc1, but it looks
feasible for the next cycle, and if done right it promises to unify the
plain and splice sides of fuse_dev_...() stuff, simplifying the hell out
of them without losing zero-copy there.  And if everything really goes
right, we might be able to get rid of net/* ->splice_read() and ->sendpage()
methods as well...
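
[To make the proposed calling convention concrete - everything below is
hypothetical, sketching the scheme described above rather than any
existing code.  In generic_perform_write() the fault-in result would be
threaded straight into ->write_begin():

	page = iov_iter_fault_in_readable(i, bytes);
	if (IS_ERR(page)) {
		status = PTR_ERR(page);		/* -EFAULT */
		break;
	}
	/* NULL means "plain copy"; otherwise a stolen, locked,
	 * non-uptodate page that may be inserted as-is */
	status = a_ops->write_begin(file, mapping, pos, bytes, flags,
				    &page, &fsdata);

and a converted ->write_begin() instance would start with something like

	if (*pagep && !add_to_page_cache_locked(*pagep, mapping, index,
						mapping_gfp_mask(mapping))) {
		get_page(*pagep);
		/* zero-copy: the caller's data is already in the page,
		 * so skip the usual zeroing */
	} else {
		*pagep = grab_cache_page_write_begin(mapping, index, flags);
		if (!*pagep)
			return -ENOMEM;
	}
]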

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 09/11] fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter()
  2016-09-23 19:08                                         ` [PATCH 09/11] fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter() Al Viro
@ 2016-09-26  9:31                                           ` Miklos Szeredi
  0 siblings, 0 replies; 152+ messages in thread
From: Miklos Szeredi @ 2016-09-26  9:31 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Fri, Sep 23, 2016 at 9:08 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> [another cleanup, will be moved out of that branch]

Picked up and pushed to fuse.git #for-next

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-24  3:59                                                 ` [PATCH 04/12] " Al Viro
@ 2016-09-26 13:35                                                     ` Miklos Szeredi
  2016-12-17 19:54                                                   ` Andreas Schwab
  1 sibling, 0 replies; 152+ messages in thread
From: Miklos Szeredi @ 2016-09-26 13:35 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Sat, Sep 24, 2016 at 5:59 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> * splice_to_pipe() stops at pipe overflow and does *not* take pipe_lock
> * ->splice_read() instances do the same
> * vmsplice_to_pipe() and do_splice() (ultimate callers of splice_to_pipe())
>   arrange for waiting, looping, etc. themselves.
>
> That should make pipe_lock the outermost one.
>
> Unfortunately, existing rules for the amount passed by vmsplice_to_pipe()
> and do_splice() are quite ugly _and_ userland code can be easily broken
> by changing those.  It's not even "no more than the maximal capacity of
> this pipe" - it's "once we'd fed pipe->nr_buffers pages into the pipe,
> leave instead of waiting".
>
> Considering how poorly these rules are documented, let's try "wait for some
> space to appear, unless given SPLICE_F_NONBLOCK, then push into pipe
> and if we run into overflow, we are done".
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/fuse/dev.c |   2 -
>  fs/splice.c   | 138 +++++++++++++++++++++++++++-------------------------------
>  2 files changed, 63 insertions(+), 77 deletions(-)
>

[...]

> @@ -1546,14 +1528,20 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
>                 return -ENOMEM;
>         }
>
> -       spd.nr_pages = get_iovec_page_array(&from, spd.pages,
> -                                           spd.partial,
> -                                           spd.nr_pages_max);
> -       if (spd.nr_pages <= 0)
> -               ret = spd.nr_pages;
> -       else
> -               ret = splice_to_pipe(pipe, &spd);
> -
> +       pipe_lock(pipe);
> +       ret = wait_for_space(pipe, flags);
> +       if (!ret) {
> +               spd.nr_pages = get_iovec_page_array(&from, spd.pages,
> +                                                   spd.partial,
> +                                                   spd.nr_pages_max);
> +               if (spd.nr_pages <= 0)
> +                       ret = spd.nr_pages;
> +               else
> +                       ret = splice_to_pipe(pipe, &spd);
> +               pipe_unlock(pipe);
> +               if (ret > 0)
> +                       wakeup_pipe_readers(pipe);
> +       }

Unbalanced pipe_lock()?
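
A balanced arrangement - sketch only - would hoist the unlock out of
the conditional, so the error path of wait_for_space() drops the lock
too:

	pipe_lock(pipe);
	ret = wait_for_space(pipe, flags);
	if (!ret) {
		spd.nr_pages = get_iovec_page_array(&from, spd.pages,
						    spd.partial,
						    spd.nr_pages_max);
		if (spd.nr_pages <= 0)
			ret = spd.nr_pages;
		else
			ret = splice_to_pipe(pipe, &spd);
	}
	pipe_unlock(pipe);
	if (ret > 0)
		wakeup_pipe_readers(pipe);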

Also, while it doesn't hurt, the constification of the "from" argument
of get_iovec_page_array() looks like mere noise in this patch.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 06/12] new helper: add_to_pipe()
  2016-09-24  4:00                                                 ` [PATCH 06/12] new helper: add_to_pipe() Al Viro
@ 2016-09-26 13:49                                                   ` Miklos Szeredi
  0 siblings, 0 replies; 152+ messages in thread
From: Miklos Szeredi @ 2016-09-26 13:49 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Sat, Sep 24, 2016 at 6:00 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> single-buffer analogue of splice_to_pipe(); vmsplice_to_pipe() switched
> to that, leaving splice_to_pipe() only for ->splice_read() instances
> (and that only until they are converted as well).
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/splice.c            | 113 ++++++++++++++++++++++++++++---------------------
>  include/linux/splice.h |   2 +
>  2 files changed, 67 insertions(+), 48 deletions(-)
>

[...]

> @@ -1523,26 +1553,13 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *uiov,
>         if (ret < 0)
>                 return ret;
>
> -       if (splice_grow_spd(pipe, &spd)) {
> -               kfree(iov);
> -               return -ENOMEM;
> -       }
> -
>         pipe_lock(pipe);
>         ret = wait_for_space(pipe, flags);
> -       if (!ret) {
> -               spd.nr_pages = get_iovec_page_array(&from, spd.pages,
> -                                                   spd.partial,
> -                                                   spd.nr_pages_max);
> -               if (spd.nr_pages <= 0)
> -                       ret = spd.nr_pages;
> -               else
> -                       ret = splice_to_pipe(pipe, &spd);
> -               pipe_unlock(pipe);
> -               if (ret > 0)
> -                       wakeup_pipe_readers(pipe);
> -       }
> -       splice_shrink_spd(&spd);
> +       if (!ret)
> +               ret = iter_to_pipe(&from, pipe, buf_flag);
> +       pipe_unlock(pipe);

Ah, here it is :)

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-26 13:35                                                     ` Miklos Szeredi
@ 2016-09-27  4:14                                                       ` Al Viro
  -1 siblings, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-27  4:14 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Mon, Sep 26, 2016 at 03:35:12PM +0200, Miklos Szeredi wrote:
> > -       if (spd.nr_pages <= 0)
> > -               ret = spd.nr_pages;
> > -       else
> > -               ret = splice_to_pipe(pipe, &spd);
> > -
> > +       pipe_lock(pipe);
> > +       ret = wait_for_space(pipe, flags);
> > +       if (!ret) {
> > +               spd.nr_pages = get_iovec_page_array(&from, spd.pages,
> > +                                                   spd.partial,
> > +                                                   spd.nr_pages_max);
> > +               if (spd.nr_pages <= 0)
> > +                       ret = spd.nr_pages;
> > +               else
> > +                       ret = splice_to_pipe(pipe, &spd);
> > +               pipe_unlock(pipe);
		    ^^^^^^^^^^^^^^^^
> > +               if (ret > 0)
> > +                       wakeup_pipe_readers(pipe);
> > +       }
> 
> Unbalanced pipe_lock()?

Reordering braindamage; fixed.

> Also, while it doesn't hurt, the constification of the "from" argument
> of get_iovec_page_array() looks only noise in this patch.

Rudiment of earlier variant, when we did a non-trivial loop in the caller.
Not needed anymore, removed.

Fixed variant force-pushed to the same branch

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-24 17:29                                                   ` Al Viro
@ 2016-09-27 15:38                                                     ` Nicholas Piggin
  2016-09-27 15:53                                                       ` Chuck Lever
  1 sibling, 0 replies; 152+ messages in thread
From: Nicholas Piggin @ 2016-09-27 15:38 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, linux-fsdevel

On Sat, 24 Sep 2016 18:29:01 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Sat, Sep 24, 2016 at 04:59:08AM +0100, Al Viro wrote:
> 
> > 	FWIW, updated (with fixes) and force-pushed.  Added piece:
> > default_file_splice_read() converted to iov_iter.  Seems to work, after
> > fixing a braino in __pipe_get_pages().  Changed: #4 (sleep only in the
> > beginning, as described above), #6 (context changes from #4), #10 (missing
> > get_page() added in __pipe_get_pages()), #11 (removed pointless truncation
> > of len - ->read_iter() can bloody well handle that on its own) and added #12.
> > Stands at 28 files changed, 657 insertions(+), 1009 deletions(-) now...  
> 
> 	I think I see how to get full zero-copy (including the write side
> of things).  Just add a "from" side for ITER_PIPE iov_iter (advance,
> get_pages, get_pages_alloc, npages and alignment will need to behave
> differently for "to" and "from" ones) and pull the following trick:
> have fault_in_readable return NULL instead of 0, ERR_PTR(-EFAULT) instead
> of -EFAULT *and* return a struct page if it was asked for a full-page
> range on a page that could be successfully stolen (only "from pipe" iov_iter
> would go for the last one, of course).  Then we make generic_perform_write()
> shove the return value of fault-in into 'page'.  ->write_begin() is given
> &page as an argument, to return the resulting page via that.  All instances
> currently just store into that pointer, completely ignoring the prior value.
> And they'll keep working just fine.
> 
> 	Let's make sure that all method call sites outside of
> generic_perform_write() (there's only one such, actually) have NULL
> stored in there prior to the call.  Now we can start switching the
> instances to zero-copy support - all it takes is replacing
> grab_cache_page_write_begin() with "if *page is non-NULL, try to
> shove it (locked, non-uptodate) into pagecache; if that succeeds grab a
> reference to our page and we are done, if it fails - fall back to
> grab_cache_page_write_begin()".  Then do get_block, etc., or whatever that
> ->write_begin() instance would normally do, just remember not to zero anything  
> if the page had been passed to us by caller.

Interesting stuff. It should also be possible for a filesystem to replace
existing pagecache pages in a zero-copy overwrite using the migration
APIs, with just a little bit of work.

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-24 17:29                                                   ` Al Viro
@ 2016-09-27 15:53                                                       ` Chuck Lever
  2016-09-27 15:53                                                       ` Chuck Lever
  1 sibling, 0 replies; 152+ messages in thread
From: Chuck Lever @ 2016-09-27 15:53 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel


> On Sep 24, 2016, at 1:29 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> On Sat, Sep 24, 2016 at 04:59:08AM +0100, Al Viro wrote:
> 
>> 	FWIW, updated (with fixes) and force-pushed.  Added piece:
>> default_file_splice_read() converted to iov_iter.  Seems to work, after
>> fixing a braino in __pipe_get_pages().  Changed: #4 (sleep only in the
>> beginning, as described above), #6 (context changes from #4), #10 (missing
>> get_page() added in __pipe_get_pages()), #11 (removed pointless truncation
>> of len - ->read_iter() can bloody well handle that on its own) and added #12.
>> Stands at 28 files changed, 657 insertions(+), 1009 deletions(-) now...
> 
> 	I think I see how to get full zero-copy (including the write side
> of things).  Just add a "from" side for ITER_PIPE iov_iter (advance,
> get_pages, get_pages_alloc, npages and alignment will need to behave
> differently for "to" and "from" ones) and pull the following trick:
> have fault_in_readable return NULL instead of 0, ERR_PTR(-EFAULT) instead
> of -EFAULT *and* return a struct page if it was asked for a full-page
> range on a page that could be successfully stolen (only "from pipe" iov_iter
> would go for the last one, of course).  Then we make generic_perform_write()
> shove the return value of fault-in into 'page'.  ->write_begin() is given
> &page as an argument, to return the resulting page via that.  All instances
> currently just store into that pointer, completely ignoring the prior value.
> And they'll keep working just fine.
> 
> 	Let's make sure that all method call sites outside of
> generic_perform_write() (there's only one such, actually) have NULL
> stored in there prior to the call.  Now we can start switching the
> instances to zero-copy support - all it takes is replacing
> grab_cache_page_write_begin() with "if *page is non-NULL, try to
> shove it (locked, non-uptodate) into pagecache; if that succeeds grab a
> reference to our page and we are done, if it fails - fall back to
> grab_cache_page_write_begin()".  Then do get_block, etc., or whatever that
> ->write_begin() instance would normally do, just remember not to zero anything
> if the page had been passed to us by caller.
> 
> 	Now all we need is to make sure that iov_iter_copy_from_user_atomic()
> for those guys recognizes the case of full-page copy when source and target
> are the same page and quietly returns PAGE_SIZE.  Voila - we can make
> iter_file_splice_write() pass pipe-backed iov_iter instead of bvec-backed
> one *and* get write-side zero-copy for all filesystems with ->write_begin()
> taught to handle that (see above).  Since the filesystems with unmodified
> ->write_begin() will act correctly (just do the copying), we don't have
> to make that a flagday change; ->write_begin() instances can be switched
> one by one.  Similar treatment of iomap_write_begin()/iomap_write_actor()
> would cover iomap-using ->write_iter() instances.
> 
> 	It's clearly not something I want to touch until -rc1, but it looks
> feasible for the next cycle, and if done right it promises to unify the
> plain and splice sides of fuse_dev_...() stuff, simplifying the hell out
> of them without losing zero-copy there.  And if everything really goes
> right, we might be able to get rid of net/* ->splice_read() and ->sendpage()
> methods as well...

Kernel NFS server already uses splice for its read path, but the
write path appears to require a full data copy of incoming payloads.
Would be awesome to see write-side support for zero-copy.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 10/12] new iov_iter flavour: pipe-backed
  2016-09-24  4:01                                                 ` [PATCH 10/12] new iov_iter flavour: pipe-backed Al Viro
@ 2016-09-29 20:53                                                   ` Miklos Szeredi
  2016-09-29 22:50                                                       ` Al Viro
  0 siblings, 1 reply; 152+ messages in thread
From: Miklos Szeredi @ 2016-09-29 20:53 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Sat, Sep 24, 2016 at 6:01 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> iov_iter variant for passing data into pipe.  copy_to_iter()
> copies data into page(s) it has allocated and stuffs them into
> the pipe; copy_page_to_iter() stuffs there a reference to the
> page given to it.  Both will try to coalesce if possible.
> iov_iter_zero() is similar to copy_to_iter(); iov_iter_get_pages()
> and friends will do as copy_to_iter() would have and return the
> pages where the data would've been copied.  iov_iter_advance()
> will truncate everything past the spot it has advanced to.
>
> New primitive: iov_iter_pipe(), used for initializing those.
> pipe should be locked all along.
>
> Running out of space acts as fault would for iovec-backed ones;
> in other words, giving it to ->read_iter() may result in short
> read if the pipe overflows, or -EFAULT if it happens with nothing
> copied there.

This is the hardest part of the whole set.  I've been trying to
understand it, but the modular arithmetic makes it really tricky to
read.  Couldn't we have more small inline helpers like next_idx()?
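
next_idx() itself is trivial - from memory, roughly:

	static inline int next_idx(int idx, struct pipe_inode_info *pipe)
	{
		return (idx + 1) & (pipe->buffers - 1);
	}

It's the remaining open-coded "& (pipe->buffers - 1)" wrap-arounds that
are hard to follow; more helpers of that shape would help.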

Specific comments inline.

[...]

> +static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
> +                        struct iov_iter *i)
> +{
> +       struct pipe_inode_info *pipe = i->pipe;
> +       struct pipe_buffer *buf;
> +       size_t off;
> +       int idx;
> +
> +       if (unlikely(bytes > i->count))
> +               bytes = i->count;
> +
> +       if (unlikely(!bytes))
> +               return 0;
> +
> +       if (!sanity(i))
> +               return 0;
> +
> +       off = i->iov_offset;
> +       idx = i->idx;
> +       buf = &pipe->bufs[idx];
> +       if (off) {
> +               if (offset == off && buf->page == page) {
> +                       /* merge with the last one */
> +                       buf->len += bytes;
> +                       i->iov_offset += bytes;
> +                       goto out;
> +               }
> +               idx = next_idx(idx, pipe);
> +               buf = &pipe->bufs[idx];
> +       }
> +       if (idx == pipe->curbuf && pipe->nrbufs)
> +               return 0;

The EFAULT logic seems to be missing across the board.  And callers
don't expect a zero return value.  Most will loop indefinitely.

[...]

> +static size_t push_pipe(struct iov_iter *i, size_t size,
> +                       int *idxp, size_t *offp)
> +{
> +       struct pipe_inode_info *pipe = i->pipe;
> +       size_t off;
> +       int idx;
> +       ssize_t left;
> +
> +       if (unlikely(size > i->count))
> +               size = i->count;
> +       if (unlikely(!size))
> +               return 0;
> +
> +       left = size;
> +       data_start(i, &idx, &off);
> +       *idxp = idx;
> +       *offp = off;
> +       if (off) {
> +               left -= PAGE_SIZE - off;
> +               if (left <= 0) {
> +                       pipe->bufs[idx].len += size;
> +                       return size;
> +               }
> +               pipe->bufs[idx].len = PAGE_SIZE;
> +               idx = next_idx(idx, pipe);
> +       }
> +       while (idx != pipe->curbuf || !pipe->nrbufs) {
> +               struct page *page = alloc_page(GFP_USER);
> +               if (!page)
> +                       break;

Again, unexpected zero return if this is the first page.  Should
return -ENOMEM?  Some callers only expect -EFAULT, though.

[...]

> +static void pipe_advance(struct iov_iter *i, size_t size)
> +{
> +       struct pipe_inode_info *pipe = i->pipe;
> +       struct pipe_buffer *buf;
> +       size_t off;
> +       int idx;
> +
> +       if (unlikely(i->count < size))
> +               size = i->count;
> +
> +       idx = i->idx;
> +       off = i->iov_offset;
> +       if (size || off) {
> +               /* take it relative to the beginning of buffer */
> +               size += off - pipe->bufs[idx].offset;
> +               while (1) {
> +                       buf = &pipe->bufs[idx];
> +                       if (size > buf->len) {
> +                               size -= buf->len;
> +                               idx = next_idx(idx, pipe);
> +                               off = 0;

off is unused and reassigned before breaking out of the loop.

[...]

> @@ -732,7 +1101,20 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
>         if (!size)
>                 return 0;
>
> -       iterate_all_kinds(i, size, v, ({
> +       if (unlikely(i->type & ITER_PIPE)) {
> +               struct pipe_inode_info *pipe = i->pipe;
> +               size_t off;
> +               int idx;
> +
> +               if (!sanity(i))
> +                       return 0;
> +
> +               data_start(i, &idx, &off);
> +               /* some of this one + all after this one */
> +               npages = ((pipe->curbuf - idx - 1) & (pipe->buffers - 1)) + 1;

It's supposed to take i->count into account, no?  And that calculation
will result in really funny things if the pipe is full.  And we can't
return -EFAULT here, since that's not expected by callers...

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 10/12] new iov_iter flavour: pipe-backed
  2016-09-29 20:53                                                   ` Miklos Szeredi
@ 2016-09-29 22:50                                                       ` Al Viro
  0 siblings, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-09-29 22:50 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Thu, Sep 29, 2016 at 10:53:55PM +0200, Miklos Szeredi wrote:

> The EFAULT logic seems to be missing across the board.  And callers
> don't expect a zero return value.  Most will loop indefinitely.

Nope.  copy_page_to_iter() *never* returns -EFAULT.  Including the iovec
one - check copy_page_to_iter_iovec().  Any caller that does not expect
a zero return value from that primitive is a bug, triggerable as soon as
you feed it an iovec with NULL ->iov_base.
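
The canonical caller pattern treats a zero return as a short read;
schematically (not a verbatim quote of do_generic_file_read()):

	ret = copy_page_to_iter(page, offset, nr, iter);
	written += ret;
	if (ret < nr) {
		/* short read; -EFAULT only if nothing was copied at all */
		error = written ? 0 : -EFAULT;
		break;
	}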

> Again, unexpected zero return if this is the first page.  Should
> return -ENOMEM?  Some callers only expect -EFAULT, though.

For copy_to_iter() and zero_iter() it's definitely "return zero".  For
get_pages...  Hell knows; those probably ought to return -EFAULT, but
I'll need to look some more at the callers.  It should end up triggering
a short read as the end result (or, as usual, EFAULT on zero-length read).

> > +               /* take it relative to the beginning of buffer */
> > +               size += off - pipe->bufs[idx].offset;
> > +               while (1) {
> > +                       buf = &pipe->bufs[idx];
> > +                       if (size > buf->len) {
> > +                               size -= buf->len;
> > +                               idx = next_idx(idx, pipe);
> > +                               off = 0;
> 
> off is unused and reassigned before breaking out of the loop.

True.

> [...]
> 
> > +       if (unlikely(i->type & ITER_PIPE)) {
> > +               struct pipe_inode_info *pipe = i->pipe;
> > +               size_t off;
> > +               int idx;
> > +
> > +               if (!sanity(i))
> > +                       return 0;
> > +
> > +               data_start(i, &idx, &off);
> > +               /* some of this one + all after this one */
> > +               npages = ((pipe->curbuf - idx - 1) & (pipe->buffers - 1)) + 1;
> 
> It's supposed to take i->count into account, no?  And that calculation
> will result in really funny things if the pipe is full.  And we can't
> return -EFAULT here, since that's not expected by callers...

It should look at i->count, in principle.  OTOH, overestimating the amount
is not really a problem for possible users of such iov_iter.  I'll look
into that.

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 10/12] new iov_iter flavour: pipe-backed
  2016-09-29 22:50                                                       ` Al Viro
  (?)
@ 2016-09-30  7:30                                                       ` Miklos Szeredi
  2016-10-03  3:34                                                         ` [RFC] O_DIRECT vs EFAULT (was Re: [PATCH 10/12] new iov_iter flavour: pipe-backed) Al Viro
  -1 siblings, 1 reply; 152+ messages in thread
From: Miklos Szeredi @ 2016-09-30  7:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Fri, Sep 30, 2016 at 12:50 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Sep 29, 2016 at 10:53:55PM +0200, Miklos Szeredi wrote:
>
>> The EFAULT logic seems to be missing across the board.  And callers
>> don't expect a zero return value.  Most will loop indefinitely.
>
> Nope.  copy_page_to_iter() *never* returns -EFAULT.  Including the iovec
> one - check copy_page_to_iter_iovec().  Any caller that does not expect
> a zero return value from that primitive is a bug, triggerable as soon as
> you feed it an iovec with NULL ->iov_base.

Right.

I was actually looking at iov_iter_get_pages() callers...

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-09-23 19:00                                       ` [RFC][CFT] splice_read reworked Al Viro
                                                           ` (10 preceding siblings ...)
  2016-09-23 19:10                                         ` [PATCH 11/11] switch generic_file_splice_read() to use of ->read_iter() Al Viro
@ 2016-09-30 13:32                                         ` CAI Qian
  2016-09-30 17:42                                           ` CAI Qian
  11 siblings, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-09-30 13:32 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> From: "Al Viro" <viro@ZenIV.linux.org.uk>
> To: "Linus Torvalds" <torvalds@linux-foundation.org>
> Cc: "Dave Chinner" <david@fromorbit.com>, "CAI Qian" <caiqian@redhat.com>, "linux-xfs" <linux-xfs@vger.kernel.org>,
> xfs@oss.sgi.com, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>, linux-fsdevel@vger.kernel.org
> Sent: Friday, September 23, 2016 3:00:32 PM
> Subject: [RFC][CFT] splice_read reworked
> 
> The series is supposed to solve the locking order problems for
> ->splice_read() and get rid of code duplication between the read-side
> methods.
> 	pipe_lock is lifted out of ->splice_read() instances, along with
> waiting for empty space in pipe, etc. - we do that stuff in callers.
> 	A new variant of iov_iter is introduced - it's backed by a pipe,
> copy_to_iter() results in allocating pages and copying into those,
> copy_page_to_iter() just sticks a reference to that page into pipe.
> Running out of space in pipe yields a short read, as a fault in iovec-backed
> iov_iter would have.  Enough primitives are implemented for normal
> ->read_iter() instances to work.
> 	generic_file_splice_read() switched to feeding such iov_iter to
> ->read_iter() instance.  That turns out to be enough to kill almost all
> ->splice_read() instances; the only ones _not_ using
> generic_file_splice_read()
> or default_file_splice_read() (== no zero-copy fallback) are
> fuse_dev_splice_read(), 3 instances in kernel/{relay.c,trace/trace.c} and
> sock_splice_read().  It's almost certainly possible to convert fuse one
> and the same might be possible to do to socket one.  relay and tracing
> stuff is just plain weird; might or might not be doable.
> 
> 	Something hopefully working is in
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.splice_read
Tested-by: CAI Qian <caiqian@redhat.com>
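
For reference, a rough sketch of the call shape the description above
implies - my paraphrase of the cover letter, not the actual code in
#work.splice_read, so names and details may differ:

static ssize_t pipe_backed_splice_read(struct file *in, loff_t *ppos,
				       struct pipe_inode_info *pipe,
				       size_t len, unsigned int flags)
{
	struct iov_iter to;
	struct kiocb kiocb;
	ssize_t ret;

	/* pipe-backed iov_iter: copy_page_to_iter() sticks page
	 * references straight into the pipe; pipe_lock and waiting
	 * for pipe space are now the caller's business */
	iov_iter_pipe(&to, ITER_PIPE | READ, pipe, len);
	init_sync_kiocb(&kiocb, in);
	kiocb.ki_pos = *ppos;
	ret = in->f_op->read_iter(&kiocb, &to);
	if (ret > 0)
		*ppos = kiocb.ki_pos;
	return ret;
}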

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-09-30 13:32                                         ` [RFC][CFT] splice_read reworked CAI Qian
@ 2016-09-30 17:42                                           ` CAI Qian
  2016-09-30 18:33                                               ` CAI Qian
  2016-10-03  1:42                                             ` [RFC][CFT] splice_read reworked Al Viro
  0 siblings, 2 replies; 152+ messages in thread
From: CAI Qian @ 2016-09-30 17:42 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> From: "CAI Qian" <caiqian@redhat.com>
> To: "Al Viro" <viro@ZenIV.linux.org.uk>
> Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, xfs@oss.sgi.com, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Friday, September 30, 2016 9:32:53 AM
> Subject: Re: [RFC][CFT] splice_read reworked
> 
> 
> 
> ----- Original Message -----
> > From: "Al Viro" <viro@ZenIV.linux.org.uk>
> > To: "Linus Torvalds" <torvalds@linux-foundation.org>
> > Cc: "Dave Chinner" <david@fromorbit.com>, "CAI Qian" <caiqian@redhat.com>,
> > "linux-xfs" <linux-xfs@vger.kernel.org>,
> > xfs@oss.sgi.com, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin"
> > <npiggin@gmail.com>, linux-fsdevel@vger.kernel.org
> > Sent: Friday, September 23, 2016 3:00:32 PM
> > Subject: [RFC][CFT] splice_read reworked
> > 
> > The series is supposed to solve the locking order problems for
> > ->splice_read() and get rid of code duplication between the read-side
> > methods.
> > 	pipe_lock is lifted out of ->splice_read() instances, along with
> > waiting for empty space in pipe, etc. - we do that stuff in callers.
> > 	A new variant of iov_iter is introduced - it's backed by a pipe,
> > copy_to_iter() results in allocating pages and copying into those,
> > copy_page_to_iter() just sticks a reference to that page into pipe.
> > Running out of space in pipe yields a short read, as a fault in
> > iovec-backed
> > iov_iter would have.  Enough primitives are implemented for normal
> > ->read_iter() instances to work.
> > 	generic_file_splice_read() switched to feeding such iov_iter to
> > ->read_iter() instance.  That turns out to be enough to kill almost all
> > ->splice_read() instances; the only ones _not_ using
> > generic_file_splice_read()
> > or default_file_splice_read() (== no zero-copy fallback) are
> > fuse_dev_splice_read(), 3 instances in kernel/{relay.c,trace/trace.c} and
> > sock_splice_read().  It's almost certainly possible to convert fuse one
> > and the same might be possible to do to socket one.  relay and tracing
> > stuff is just plain weird; might or might not be doable.
> > 
> > 	Something hopefully working is in
> > git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git
> > #work.splice_read
> Tested-by: CAI Qian <caiqian@redhat.com>

Except...

One warning just popped up while running trinity.

[ 1599.151286] ------------[ cut here ]------------
[ 1599.156457] WARNING: CPU: 37 PID: 95143 at lib/iov_iter.c:316 sanity+0x75/0x80
[ 1599.164818] Modules linked in: af_key ieee802154_socket ieee802154 vmw_vsock_vmci_transport vsock vmw_vmci hidp cmtp kernelcapi bnep rfcomm bluetooth rfkill can_bcm can_raw can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr mei_me sg i2c_i801 mei shpchp lpc_ich i2c_smbus ipmi_ssif wmi ipmi_si ipmi_msghandler acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm ahci mdio libahci ptp libata i2c_core pps_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[ 1599.278669] CPU: 50 PID: 95143 Comm: trinity-c142 Not tainted 4.8.0-rc8-usrns-scale+ #8
[ 1599.287604] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 1599.298962]  0000000000000286 000000007794c41e ffff8803c6c7fbb0 ffffffff813d5e93
[ 1599.307259]  0000000000000000 0000000000000000 ffff8803c6c7fbf0 ffffffff8109c87b
[ 1599.315553]  0000013c00000000 0000000000000efe ffffea001de95240 ffff8802e1aca600
[ 1599.323847] Call Trace:
[ 1599.326580]  [<ffffffff813d5e93>] dump_stack+0x85/0xc2
[ 1599.332315]  [<ffffffff8109c87b>] __warn+0xcb/0xf0
[ 1599.337660]  [<ffffffff8109c9ad>] warn_slowpath_null+0x1d/0x20
[ 1599.344171]  [<ffffffff813e9b45>] sanity+0x75/0x80
[ 1599.349518]  [<ffffffff813ec739>] copy_page_to_iter+0xf9/0x1e0
[ 1599.356027]  [<ffffffff8120691f>] shmem_file_read_iter+0x9f/0x340
[ 1599.362829]  [<ffffffff812bbeb9>] generic_file_splice_read+0xb9/0x1b0
[ 1599.370015]  [<ffffffff812bc206>] do_splice_to+0x76/0x90
[ 1599.375941]  [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
[ 1599.382935]  [<ffffffff812bba80>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 1599.390220]  [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
[ 1599.396537]  [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
[ 1599.402563]  [<ffffffff81282973>] SyS_sendfile64+0x73/0xd0
[ 1599.408685]  [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 1599.414820]  [<ffffffff817d927f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 1599.422087] ---[ end trace a3fb2953df356f80 ]---

    CAI Qian

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-09-30 17:42                                           ` CAI Qian
@ 2016-09-30 18:33                                               ` CAI Qian
  2016-10-03  1:42                                             ` [RFC][CFT] splice_read reworked Al Viro
  1 sibling, 0 replies; 152+ messages in thread
From: CAI Qian @ 2016-09-30 18:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> One warning just popped up while running trinity.
Another run triggered a lockdep warning with splice in the trace:

[ 4787.875980] 
[ 4787.877645] ======================================================
[ 4787.884540] [ INFO: possible circular locking dependency detected ]
[ 4787.891533] 4.8.0-rc8-usrns-scale+ #8 Tainted: G        W      
[ 4787.898138] -------------------------------------------------------
[ 4787.905130] trinity-c116/106905 is trying to acquire lock:
[ 4787.911251]  (&p->lock){+.+.+.}, at: [<ffffffff812aca8c>] seq_read+0x4c/0x3e0
[ 4787.919264] 
[ 4787.919264] but task is already holding lock:
[ 4787.925773]  (sb_writers#8){.+.+.+}, at: [<ffffffff81284367>] __sb_start_write+0xb7/0xf0
[ 4787.934854] 
[ 4787.934854] which lock already depends on the new lock.
[ 4787.934854] 
[ 4787.943981] 
[ 4787.943981] the existing dependency chain (in reverse order) is:
[ 4787.952333] 
-> #3 (sb_writers#8){.+.+.+}:
[ 4787.957050]        [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4787.963960]        [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4787.970577]        [<ffffffff810f769a>] percpu_down_read+0x4a/0xa0
[ 4787.977487]        [<ffffffff81284367>] __sb_start_write+0xb7/0xf0
[ 4787.984395]        [<ffffffff812a8974>] mnt_want_write+0x24/0x50
[ 4787.991110]        [<ffffffffa05049af>] ovl_want_write+0x1f/0x30 [overlay]
[ 4787.998799]        [<ffffffffa05070c2>] ovl_do_remove+0x42/0x4a0 [overlay]
[ 4788.006483]        [<ffffffffa0507536>] ovl_rmdir+0x16/0x20 [overlay]
[ 4788.013682]        [<ffffffff8128d357>] vfs_rmdir+0xb7/0x130
[ 4788.020009]        [<ffffffff81292ed3>] do_rmdir+0x183/0x1f0
[ 4788.026335]        [<ffffffff81293cf2>] SyS_unlinkat+0x22/0x30
[ 4788.032853]        [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.039576]        [<ffffffff817d927f>] return_from_SYSCALL_64+0x0/0x7a
[ 4788.046962] 
-> #2 (&sb->s_type->i_mutex_key#16){++++++}:
[ 4788.053140]        [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4788.060049]        [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4788.066664]        [<ffffffff817d60e7>] down_read+0x47/0x70
[ 4788.072893]        [<ffffffff8128ce79>] lookup_slow+0xc9/0x200
[ 4788.079410]        [<ffffffff81290b9c>] walk_component+0x1ec/0x310
[ 4788.086315]        [<ffffffff81290e5f>] link_path_walk+0x19f/0x5f0
[ 4788.093219]        [<ffffffff8129151d>] path_openat+0xdd/0xb80
[ 4788.099748]        [<ffffffff81293511>] do_filp_open+0x91/0x100
[ 4788.106362]        [<ffffffff81286f56>] do_open_execat+0x76/0x180
[ 4788.113186]        [<ffffffff8128747b>] open_exec+0x2b/0x50
[ 4788.119404]        [<ffffffff812ec61d>] load_elf_binary+0x28d/0x1120
[ 4788.126511]        [<ffffffff81288487>] search_binary_handler+0x97/0x1c0
[ 4788.134002]        [<ffffffff81289619>] do_execveat_common.isra.36+0x6a9/0x9f0
[ 4788.142071]        [<ffffffff81289c4a>] SyS_execve+0x3a/0x50
[ 4788.148398]        [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.155110]        [<ffffffff817d927f>] return_from_SYSCALL_64+0x0/0x7a
[ 4788.162502] 
-> #1 (&sig->cred_guard_mutex){+.+.+.}:
[ 4788.168179]        [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4788.175085]        [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4788.181712]        [<ffffffff817d4557>] mutex_lock_killable_nested+0x87/0x500
[ 4788.189695]        [<ffffffff81099599>] mm_access+0x29/0xa0
[ 4788.195924]        [<ffffffff81302b6c>] proc_pid_auxv+0x1c/0x70
[ 4788.202540]        [<ffffffff813039d0>] proc_single_show+0x50/0x90
[ 4788.209445]        [<ffffffff812acb48>] seq_read+0x108/0x3e0
[ 4788.215774]        [<ffffffff8127fb07>] __vfs_read+0x37/0x150
[ 4788.222198]        [<ffffffff81280d35>] vfs_read+0x95/0x140
[ 4788.228425]        [<ffffffff81282268>] SyS_read+0x58/0xc0
[ 4788.234557]        [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.241268]        [<ffffffff817d927f>] return_from_SYSCALL_64+0x0/0x7a
[ 4788.248660] 
-> #0 (&p->lock){+.+.+.}:
[ 4788.252987]        [<ffffffff810fc062>] validate_chain.isra.37+0xe72/0x1150
[ 4788.260769]        [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4788.267676]        [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4788.274302]        [<ffffffff817d3807>] mutex_lock_nested+0x77/0x430
[ 4788.281406]        [<ffffffff812aca8c>] seq_read+0x4c/0x3e0
[ 4788.287633]        [<ffffffff81316b39>] kernfs_fop_read+0x129/0x1b0
[ 4788.294659]        [<ffffffff8127fca3>] do_loop_readv_writev+0x83/0xc0
[ 4788.301954]        [<ffffffff812811a8>] do_readv_writev+0x218/0x240
[ 4788.308959]        [<ffffffff81281209>] vfs_readv+0x39/0x50
[ 4788.315188]        [<ffffffff812bc6b1>] default_file_splice_read+0x1a1/0x2b0
[ 4788.323070]        [<ffffffff812bc206>] do_splice_to+0x76/0x90
[ 4788.329587]        [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
[ 4788.337173]        [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
[ 4788.344078]        [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
[ 4788.350694]        [<ffffffff812829c9>] SyS_sendfile64+0xc9/0xd0
[ 4788.357405]        [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.364119]        [<ffffffff817d927f>] return_from_SYSCALL_64+0x0/0x7a
[ 4788.371511] 
[ 4788.371511] other info that might help us debug this:
[ 4788.371511] 
[ 4788.380443] Chain exists of:
  &p->lock --> &sb->s_type->i_mutex_key#16 --> sb_writers#8

[ 4788.389881]  Possible unsafe locking scenario:
[ 4788.389881] 
[ 4788.396497]        CPU0                    CPU1
[ 4788.401549]        ----                    ----
[ 4788.406614]   lock(sb_writers#8);
[ 4788.410352]                                lock(&sb->s_type->i_mutex_key#16);
[ 4788.418354]                                lock(sb_writers#8);
[ 4788.424902]   lock(&p->lock);
[ 4788.428229] 
[ 4788.428229]  *** DEADLOCK ***
[ 4788.428229] 
[ 4788.434836] 1 lock held by trinity-c116/106905:
[ 4788.439888]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81284367>] __sb_start_write+0xb7/0xf0
[ 4788.449473] 
[ 4788.449473] stack backtrace:
[ 4788.454334] CPU: 16 PID: 106905 Comm: trinity-c116 Tainted: G        W       4.8.0-rc8-usrns-scale+ #8
[ 4788.464719] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 4788.476076]  0000000000000086 00000000cbfc6314 ffff8803ce78b760 ffffffff813d5e93
[ 4788.484371]  ffffffff82a3fbd0 ffffffff82a94890 ffff8803ce78b7a0 ffffffff810fa6ec
[ 4788.492663]  ffff8803ce78b7e0 ffff8802ead08000 0000000000000001 ffff8802ead08ca0
[ 4788.500966] Call Trace:
[ 4788.503694]  [<ffffffff813d5e93>] dump_stack+0x85/0xc2
[ 4788.509426]  [<ffffffff810fa6ec>] print_circular_bug+0x1ec/0x260
[ 4788.516128]  [<ffffffff810fc062>] validate_chain.isra.37+0xe72/0x1150
[ 4788.523319]  [<ffffffff811d4491>] ? ___perf_sw_event+0x171/0x290
[ 4788.530022]  [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4788.536335]  [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4788.542359]  [<ffffffff812aca8c>] ? seq_read+0x4c/0x3e0
[ 4788.548188]  [<ffffffff812aca8c>] ? seq_read+0x4c/0x3e0
[ 4788.554019]  [<ffffffff817d3807>] mutex_lock_nested+0x77/0x430
[ 4788.560528]  [<ffffffff812aca8c>] ? seq_read+0x4c/0x3e0
[ 4788.566358]  [<ffffffff812aca8c>] seq_read+0x4c/0x3e0
[ 4788.571995]  [<ffffffff81316a10>] ? kernfs_fop_open+0x3a0/0x3a0
[ 4788.578600]  [<ffffffff81316b39>] kernfs_fop_read+0x129/0x1b0
[ 4788.585012]  [<ffffffff81316a10>] ? kernfs_fop_open+0x3a0/0x3a0
[ 4788.591617]  [<ffffffff8127fca3>] do_loop_readv_writev+0x83/0xc0
[ 4788.598318]  [<ffffffff81316a10>] ? kernfs_fop_open+0x3a0/0x3a0
[ 4788.604924]  [<ffffffff812811a8>] do_readv_writev+0x218/0x240
[ 4788.611347]  [<ffffffff813e9535>] ? push_pipe+0xd5/0x190
[ 4788.617278]  [<ffffffff813ecec0>] ? iov_iter_get_pages_alloc+0x250/0x400
[ 4788.624746]  [<ffffffff81281209>] vfs_readv+0x39/0x50
[ 4788.630381]  [<ffffffff812bc6b1>] default_file_splice_read+0x1a1/0x2b0
[ 4788.637668]  [<ffffffff8134ae20>] ? security_file_permission+0xa0/0xc0
[ 4788.644954]  [<ffffffff812bc206>] do_splice_to+0x76/0x90
[ 4788.650880]  [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
[ 4788.657872]  [<ffffffff812bba80>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 4788.665157]  [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
[ 4788.671472]  [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
[ 4788.677499]  [<ffffffff812829c9>] SyS_sendfile64+0xc9/0xd0
[ 4788.683622]  [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.689744]  [<ffffffff817d927f>] entry_SYSCALL64_slow_path+0x25/0x25

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-09-30 18:33                                               ` CAI Qian
  (?)
@ 2016-10-03  1:37                                               ` Al Viro
  2016-10-03 17:49                                                 ` CAI Qian
  -1 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-10-03  1:37 UTC (permalink / raw)
  To: CAI Qian
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Fri, Sep 30, 2016 at 02:33:23PM -0400, CAI Qian wrote:

OK, the immediate trigger is
	* sendfile() from something that uses seq_read to a regular file.
Does sb_start_write() around the call of do_splice_direct() (as always),
which ends up calling default_file_splice_read() (again, as usual), which
ends up calling ->read() of the source, i.e. seq_read().  No changes there.
 
	* sb_start_write() can be called under ->i_mutex.  The latter is
on an overlayfs inode, the former is done to the upper layer of that overlayfs.
Nothing new, again.

	* ->i_mutex can be taken under ->cred_guard_mutex.  Yes, it can -
in open_exec().  Again, no changes.

	* ->cred_guard_mutex can be taken in ->show() of a seq_file,
namely /proc/*/auxv...  Argh, ->cred_guard_mutex whack-a-mole strikes
again...

OK, I think essentially the same warning had been triggerable since _way_
back.  All changes around splice have no effect on it.

Look: to get a deadlock we need
	(1) sendfile from /proc/<pid>/auxv to a regular file on upper layer of
overlayfs requesting not to freeze the target.
	(2) attempt to freeze it blocking until (1) is done.
	(3) directory modification on overlayfs trying to request not to freeze
the upper layer and blocking until (2) is done.
	(4) execve() in <pid> holding ->cred_guard_mutex, trying to open
something in overlayfs and getting blocked on directory lock, held by (3).

Now (1) gets around to reading from /proc/<pid>/auxv, which blocks on
->cred_guard_mutex.  The mention of seq_read itself holding locks is irrelevant;
what matters is that ->read() grabs ->cred_guard_mutex.
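
To make (1) concrete, a minimal userspace sketch of that leg (paths
hypothetical; (2)-(4) still have to race with it for an actual deadlock):

#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>

int main(void)
{
	int src = open("/proc/self/auxv", O_RDONLY);
	int dst = open("/mnt/overlay/out", O_WRONLY | O_CREAT, 0644);

	if (src < 0 || dst < 0)
		return 1;
	/* do_splice_direct() takes freeze protection on the target;
	 * default_file_splice_read() then calls the source's ->read(),
	 * i.e. seq_read(), whose ->show() wants ->cred_guard_mutex */
	if (sendfile(dst, src, NULL, 4096) < 0)
		perror("sendfile");
	return 0;
}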

We used to have similar problems in /proc/*/environ and /proc/*/mem; looks
like /proc/*/environ needs to get the treatment similar to e268337dfe26 and
b409e578d9a4.

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-09-30 17:42                                           ` CAI Qian
  2016-09-30 18:33                                               ` CAI Qian
@ 2016-10-03  1:42                                             ` Al Viro
  2016-10-03 14:06                                               ` CAI Qian
  1 sibling, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-10-03  1:42 UTC (permalink / raw)
  To: CAI Qian
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Fri, Sep 30, 2016 at 01:42:17PM -0400, CAI Qian wrote:

> [ 1599.151286] ------------[ cut here ]------------
> [ 1599.156457] WARNING: CPU: 37 PID: 95143 at lib/iov_iter.c:316 sanity+0x75/0x80

[snip]

> [ 1599.344171]  [<ffffffff813e9b45>] sanity+0x75/0x80
> [ 1599.349518]  [<ffffffff813ec739>] copy_page_to_iter+0xf9/0x1e0
> [ 1599.356027]  [<ffffffff8120691f>] shmem_file_read_iter+0x9f/0x340
> [ 1599.362829]  [<ffffffff812bbeb9>] generic_file_splice_read+0xb9/0x1b0
> [ 1599.370015]  [<ffffffff812bc206>] do_splice_to+0x76/0x90
> [ 1599.375941]  [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
> [ 1599.382935]  [<ffffffff812bba80>] ? generic_pipe_buf_nosteal+0x10/0x10
> [ 1599.390220]  [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
> [ 1599.396537]  [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
> [ 1599.402563]  [<ffffffff81282973>] SyS_sendfile64+0x73/0xd0
> [ 1599.408685]  [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
> [ 1599.414820]  [<ffffffff817d927f>] entry_SYSCALL64_slow_path+0x25/0x25

IOW, sendfile from shmem...  How easily is that reproduced (IOW, did you
get any more of those)?
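
FWIW, the shape I mean, reduced to a hypothetical minimum (memfd standing
in for whatever shmem file trinity actually hit):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <unistd.h>

int main(void)
{
	int src = memfd_create("shm", 0);	/* shmem-backed source */
	int dst = open("/tmp/out", O_WRONLY | O_CREAT, 0644);
	char buf[4096] = { 0 };

	write(src, buf, sizeof(buf));
	lseek(src, 0, SEEK_SET);
	/* generic_file_splice_read() feeds a pipe-backed iov_iter to
	 * shmem_file_read_iter(); copy_page_to_iter()'s sanity check
	 * is what fired in the report above */
	sendfile(dst, src, NULL, sizeof(buf));
	return 0;
}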

^ permalink raw reply	[flat|nested] 152+ messages in thread

* [RFC] O_DIRECT vs EFAULT (was Re: [PATCH 10/12] new iov_iter flavour: pipe-backed)
  2016-09-30  7:30                                                       ` Miklos Szeredi
@ 2016-10-03  3:34                                                         ` Al Viro
  2016-10-03 17:07                                                           ` Linus Torvalds
  0 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-10-03  3:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, CAI Qian, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Fri, Sep 30, 2016 at 09:30:21AM +0200, Miklos Szeredi wrote:
> On Fri, Sep 30, 2016 at 12:50 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Thu, Sep 29, 2016 at 10:53:55PM +0200, Miklos Szeredi wrote:
> >
> >> The EFAULT logic seems to be missing across the board.  And callers
> >> don't expect a zero return value.  Most will loop indefinitely.
> >
> > Nope.  copy_page_to_iter() *never* returns -EFAULT.  Including the iovec
> > one - check copy_page_to_iter_iovec().  Any caller that does not expect
> > a zero return value from that primitive is a bug, triggerable as soon as
> > you feed it an iovec with NULL ->iov_base.
> 
> Right.
> 
> I was actually looking at iov_iter_get_pages() callers...

FWIW, that's interesting - O_DIRECT readv()/writev() reacts to a fault anywhere
as "nothing done, return -EFAULT now", rather than a short read/write.
Despite that, some IO is actually done.  Note, BTW, that we are not even
consistent between the filesystems - local block ones do IO and give -EFAULT,
while NFS, Lustre and FUSE do short read/write, reporting -EFAULT only upon
shortening to nothing.  So does ceph, except that shortening might be for
more than one page.

Considering how weak POSIX is in that area, we are probably not violating
anything, but... it would be more convenient if we treated those as
short read/write, same way for all filesystems.

Linus, do you have any objections against such behaviour change?  AFAICS,
all it takes is this:

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 7c3ce73..3a8ebda 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -246,6 +246,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
 		if ((dio->op == REQ_OP_READ) &&
 		    ((offset + transferred) > dio->i_size))
 			transferred = dio->i_size - offset;
+		if (ret == -EFAULT)
+			ret = 0;
 	}
 
 	if (ret == 0)

^ permalink raw reply related	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-03  1:42                                             ` [RFC][CFT] splice_read reworked Al Viro
@ 2016-10-03 14:06                                               ` CAI Qian
  2016-10-03 15:20                                                 ` CAI Qian
  2016-10-03 20:32                                                 ` CAI Qian
  0 siblings, 2 replies; 152+ messages in thread
From: CAI Qian @ 2016-10-03 14:06 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel



----- Original Message -----
> From: "Al Viro" <viro@ZenIV.linux.org.uk>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Sunday, October 2, 2016 9:42:18 PM
> Subject: Re: [RFC][CFT] splice_read reworked
> 
> On Fri, Sep 30, 2016 at 01:42:17PM -0400, CAI Qian wrote:
> 
> > [ 1599.151286] ------------[ cut here ]------------
> > [ 1599.156457] WARNING: CPU: 37 PID: 95143 at lib/iov_iter.c:316
> > sanity+0x75/0x80
> 
> [snip]
> 
> > [ 1599.344171]  [<ffffffff813e9b45>] sanity+0x75/0x80
> > [ 1599.349518]  [<ffffffff813ec739>] copy_page_to_iter+0xf9/0x1e0
> > [ 1599.356027]  [<ffffffff8120691f>] shmem_file_read_iter+0x9f/0x340
> > [ 1599.362829]  [<ffffffff812bbeb9>] generic_file_splice_read+0xb9/0x1b0
> > [ 1599.370015]  [<ffffffff812bc206>] do_splice_to+0x76/0x90
> > [ 1599.375941]  [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
> > [ 1599.382935]  [<ffffffff812bba80>] ? generic_pipe_buf_nosteal+0x10/0x10
> > [ 1599.390220]  [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
> > [ 1599.396537]  [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
> > [ 1599.402563]  [<ffffffff81282973>] SyS_sendfile64+0x73/0xd0
> > [ 1599.408685]  [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
> > [ 1599.414820]  [<ffffffff817d927f>] entry_SYSCALL64_slow_path+0x25/0x25
> 
> IOW, sendfile from shmem...  How easily is that reproduced (IOW, did you
> get any more of those)?
> 
It is pretty reproducible so far by just running trinity from a docker
container backed by overlayfs/xfs.

# su - test
$ trinity

   CAI Qian

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-03 14:06                                               ` CAI Qian
@ 2016-10-03 15:20                                                 ` CAI Qian
  2016-10-03 21:12                                                   ` Dave Chinner
  2016-10-03 20:32                                                 ` CAI Qian
  1 sibling, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-10-03 15:20 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel



----- Original Message -----
> From: "CAI Qian" <caiqian@redhat.com>
> To: "Al Viro" <viro@ZenIV.linux.org.uk>
> Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Monday, October 3, 2016 10:06:27 AM
> Subject: Re: [RFC][CFT] splice_read reworked
> 
> 
> 
> ----- Original Message -----
> > From: "Al Viro" <viro@ZenIV.linux.org.uk>
> > To: "CAI Qian" <caiqian@redhat.com>
> > Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner"
> > <david@fromorbit.com>, "linux-xfs"
> > <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin"
> > <npiggin@gmail.com>,
> > linux-fsdevel@vger.kernel.org
> > Sent: Sunday, October 2, 2016 9:42:18 PM
> > Subject: Re: [RFC][CFT] splice_read reworked
> > 
> > On Fri, Sep 30, 2016 at 01:42:17PM -0400, CAI Qian wrote:
> > 
> > > [ 1599.151286] ------------[ cut here ]------------
> > > [ 1599.156457] WARNING: CPU: 37 PID: 95143 at lib/iov_iter.c:316
> > > sanity+0x75/0x80
> > 
> > [snip]
> > 
> > > [ 1599.344171]  [<ffffffff813e9b45>] sanity+0x75/0x80
> > > [ 1599.349518]  [<ffffffff813ec739>] copy_page_to_iter+0xf9/0x1e0
> > > [ 1599.356027]  [<ffffffff8120691f>] shmem_file_read_iter+0x9f/0x340
> > > [ 1599.362829]  [<ffffffff812bbeb9>] generic_file_splice_read+0xb9/0x1b0
> > > [ 1599.370015]  [<ffffffff812bc206>] do_splice_to+0x76/0x90
> > > [ 1599.375941]  [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
> > > [ 1599.382935]  [<ffffffff812bba80>] ? generic_pipe_buf_nosteal+0x10/0x10
> > > [ 1599.390220]  [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
> > > [ 1599.396537]  [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
> > > [ 1599.402563]  [<ffffffff81282973>] SyS_sendfile64+0x73/0xd0
> > > [ 1599.408685]  [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
> > > [ 1599.414820]  [<ffffffff817d927f>] entry_SYSCALL64_slow_path+0x25/0x25
> > 
> > IOW, sendfile from shmem...  How easily is that reproduced (IOW, did you
> > get any more of those)?
> > 
> It is pretty reproducible so far by just running trinity from a docker
> container backed by overlayfs/xfs.
There is another warning that happened once so far. Not sure if related.

[  447.961826] ------------[ cut here ]------------
[  447.967020] WARNING: CPU: 39 PID: 27352 at fs/xfs/xfs_file.c:626 xfs_file_dio_aio_write+0x3dc/0x4b0 [xfs]
[  447.977736] Modules linked in: ieee802154_socket ieee802154 af_key vmw_vsock_vmci_transport vsock vmw_vmci bluetooth rfkill can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr i2c_i801 i2c_smbus ipmi_ssif mei_me sg mei shpchp lpc_ich wmi ipmi_si ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm mdio ahci ptp libahci pps_core libata i2c_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[  448.086775] CPU: 39 PID: 27352 Comm: trinity-c39 Not tainted 4.8.0-rc8-splice+ #1
[  448.095126] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[  448.106483]  0000000000000286 00000000389140f2 ffff880404833c48 ffffffff813d2eac
[  448.114776]  0000000000000000 0000000000000000 ffff880404833c88 ffffffff8109cf11
[  448.123067]  00000272389140f2 ffff880404833d80 ffff880404833dd8 ffff8803bfba88e8
[  448.131356] Call Trace:
[  448.134088]  [<ffffffff813d2eac>] dump_stack+0x85/0xc9
[  448.139821]  [<ffffffff8109cf11>] __warn+0xd1/0xf0
[  448.145167]  [<ffffffff8109d04d>] warn_slowpath_null+0x1d/0x20
[  448.151705]  [<ffffffffa044165c>] xfs_file_dio_aio_write+0x3dc/0x4b0 [xfs]
[  448.159394]  [<ffffffffa0441b10>] xfs_file_write_iter+0x90/0x130 [xfs]
[  448.166679]  [<ffffffff81280eee>] do_iter_readv_writev+0xae/0x130
[  448.173479]  [<ffffffff81281992>] do_readv_writev+0x1a2/0x230
[  448.179906]  [<ffffffffa0441a80>] ? xfs_file_buffered_aio_write+0x350/0x350 [xfs]
[  448.188256]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
[  448.195347]  [<ffffffff810fce1d>] ? trace_hardirqs_on+0xd/0x10
[  448.201855]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
[  448.208944]  [<ffffffff81281c6c>] vfs_writev+0x3c/0x50
[  448.214675]  [<ffffffff81281e22>] do_pwritev+0xa2/0xc0
[  448.220407]  [<ffffffff81282f11>] SyS_pwritev+0x11/0x20
[  448.226237]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[  448.232358]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[  448.239560] ---[ end trace 1c54e743f1fa4f5e ]---

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC] O_DIRECT vs EFAULT (was Re: [PATCH 10/12] new iov_iter flavour: pipe-backed)
  2016-10-03  3:34                                                         ` [RFC] O_DIRECT vs EFAULT (was Re: [PATCH 10/12] new iov_iter flavour: pipe-backed) Al Viro
@ 2016-10-03 17:07                                                           ` Linus Torvalds
  2016-10-03 18:54                                                             ` Al Viro
  0 siblings, 1 reply; 152+ messages in thread
From: Linus Torvalds @ 2016-10-03 17:07 UTC (permalink / raw)
  To: Al Viro
  Cc: Miklos Szeredi, Dave Chinner, CAI Qian, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Sun, Oct 2, 2016 at 8:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Linus, do you have any objections against such behaviour change?  AFAICS,
> all it takes is this:
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 7c3ce73..3a8ebda 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -246,6 +246,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>                 if ((dio->op == REQ_OP_READ) &&
>                     ((offset + transferred) > dio->i_size))
>                         transferred = dio->i_size - offset;
> +               if (ret == -EFAULT)
> +                       ret = 0;

I don't think that's right. To me it looks like the short read case
might have changed "transferred" back to zero, in which case we do
*not* want to skip the EFAULT.

But if there's some reason that can't happen (ie "dio->i_size" is
guaranteed to be larger than "offset"), then with a comment to that
effect it's ok.

Otherwise I think it would need to be something like

        /* If we were partially successful, ignore later EFAULT */
        if (transferred && ret == -EFAULT)
                ret = 0;

or something. Yes?

                Linus

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-03  1:37                                               ` Al Viro
@ 2016-10-03 17:49                                                 ` CAI Qian
  2016-10-04 17:39                                                   ` local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked) CAI Qian
  0 siblings, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-10-03 17:49 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> From: "Al Viro" <viro@ZenIV.linux.org.uk>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, xfs@oss.sgi.com, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Sunday, October 2, 2016 9:37:37 PM
> Subject: Re: [RFC][CFT] splice_read reworked
> 
> On Fri, Sep 30, 2016 at 02:33:23PM -0400, CAI Qian wrote:
> 
> OK, the immediate trigger is
> 	* sendfile() from something that uses seq_read to a regular file.
> Does sb_start_write() around the call of do_splice_direct() (as always),
> which ends up calling default_file_splice_read() (again, as usual), which
> ends up calling ->read() of the source, i.e. seq_read().  No changes there.
>  
> 	* sb_start_write() can be called under ->i_mutex.  The latter is
> on overlayfs inode, the former is done to upper layer in that overlayfs.
> Nothing new, again.
> 
> 	* ->i_mutex can be taken under ->cred_guard_mutex.  Yes, it can -
> in open_exec().  Again, no changes.
> 
> 	* ->cred_guard_mutex can be taken in ->show() of a seq_file,
> namely /proc/*/auxv...  Argh, ->cred_guard_mutex whack-a-mole strikes
> again...
> 
> OK, I think essentially the same warning had been triggerable since _way_
> back.  All changes around splice have no effect on it.
> 
> Look: to get a deadlock we need
> 	(1) sendfile from /proc/<pid>/auxv to a regular file on upper layer of
> overlayfs requesting not to freeze the target.
> 	(2) attempt to freeze it blocking until (1) is done.
> 	(3) directory modification on overlayfs trying to request not to freeze
> the upper layer and blocking until (2) is done.
> 	(4) execve() in <pid> holding ->cred_guard_mutex, trying to open
> something in overlayfs and getting blocked on directory lock, held by (3).
> 
> Now (1) gets around to reading from /proc/<pid>/auxv, which blocks on
> ->cred_guard_mutex.  The mention of seq_read itself holding locks is
> irrelevant;
> what matters is that ->read() grabs ->cred_guard_mutex.
> 
> We used to have similar problems in /proc/*/environ and /proc/*/mem; looks
> like /proc/*/environ needs to get the treatment similar to e268337dfe26 and
> b409e578d9a4.
> 
You are right. This is also reproducible on v4.8 mainline.
    CAI Qian

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC] O_DIRECT vs EFAULT (was Re: [PATCH 10/12] new iov_iter flavour: pipe-backed)
  2016-10-03 17:07                                                           ` Linus Torvalds
@ 2016-10-03 18:54                                                             ` Al Viro
  0 siblings, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-10-03 18:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, CAI Qian, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Mon, Oct 03, 2016 at 10:07:39AM -0700, Linus Torvalds wrote:
> On Sun, Oct 2, 2016 at 8:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Linus, do you have any objections against such behaviour change?  AFAICS,
> > all it takes is this:
> >
> > diff --git a/fs/direct-io.c b/fs/direct-io.c
> > index 7c3ce73..3a8ebda 100644
> > --- a/fs/direct-io.c
> > +++ b/fs/direct-io.c
> > @@ -246,6 +246,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
> >                 if ((dio->op == REQ_OP_READ) &&
> >                     ((offset + transferred) > dio->i_size))
> >                         transferred = dio->i_size - offset;
> > +               if (ret == -EFAULT)
> > +                       ret = 0;
> 
> I don't think that's right. To me it looks like the short read case
> might have changed "transferred" back to zero, in which case we do
> *not* want to skip the EFAULT.

There's this in do_blockdev_direct_IO():
        /* Once we sampled i_size check for reads beyond EOF */
        dio->i_size = i_size_read(inode);
        if (iov_iter_rw(iter) == READ && offset >= dio->i_size) {
                if (dio->flags & DIO_LOCKING)
                        mutex_unlock(&inode->i_mutex);
                kmem_cache_free(dio_cache, dio);
                retval = 0;
                goto out;
        }
so that shouldn't happen.  That said,

> But if there's some reason that can't happen (ie "dio->i_size" is
> guaranteed to be larger than "offset"), then with a comment to that
> effect it's ok.
> 
> Otherwise I think it would need to be something like
> 
>         /* If we were partially successful, ignore later EFAULT */
>         if (transferred && ret == -EFAULT)
>                 ret = 0;

... it's certainly less brittle that way.  I'd probably still put it under
the same if (dio->result) and write it as
	if (unlikely(ret == -EFAULT) && transferred)
though.
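
i.e., putting the pieces together - a sketch of the resulting
dio_complete() hunk, not a tested patch:

	if (dio->result) {
		transferred = dio->result;

		/* Check for short read case */
		if ((dio->op == REQ_OP_READ) &&
		    ((offset + transferred) > dio->i_size))
			transferred = dio->i_size - offset;

		/* if we were partially successful, ignore later -EFAULT */
		if (unlikely(ret == -EFAULT) && transferred)
			ret = 0;
	}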

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-03 14:06                                               ` CAI Qian
  2016-10-03 15:20                                                 ` CAI Qian
@ 2016-10-03 20:32                                                 ` CAI Qian
  2016-10-03 20:35                                                   ` Al Viro
  1 sibling, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-10-03 20:32 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel



----- Original Message -----
> From: "CAI Qian" <caiqian@redhat.com>
> To: "Al Viro" <viro@ZenIV.linux.org.uk>
> Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Monday, October 3, 2016 10:06:27 AM
> Subject: Re: [RFC][CFT] splice_read reworked
> 
> 
> 
> ----- Original Message -----
> > From: "Al Viro" <viro@ZenIV.linux.org.uk>
> > To: "CAI Qian" <caiqian@redhat.com>
> > Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner"
> > <david@fromorbit.com>, "linux-xfs"
> > <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin"
> > <npiggin@gmail.com>,
> > linux-fsdevel@vger.kernel.org
> > Sent: Sunday, October 2, 2016 9:42:18 PM
> > Subject: Re: [RFC][CFT] splice_read reworked
> > 
> > On Fri, Sep 30, 2016 at 01:42:17PM -0400, CAI Qian wrote:
> > 
> > > [ 1599.151286] ------------[ cut here ]------------
> > > [ 1599.156457] WARNING: CPU: 37 PID: 95143 at lib/iov_iter.c:316
> > > sanity+0x75/0x80
> > 
> > [snip]
> > 
> > > [ 1599.344171]  [<ffffffff813e9b45>] sanity+0x75/0x80
> > > [ 1599.349518]  [<ffffffff813ec739>] copy_page_to_iter+0xf9/0x1e0
> > > [ 1599.356027]  [<ffffffff8120691f>] shmem_file_read_iter+0x9f/0x340
> > > [ 1599.362829]  [<ffffffff812bbeb9>] generic_file_splice_read+0xb9/0x1b0
> > > [ 1599.370015]  [<ffffffff812bc206>] do_splice_to+0x76/0x90
> > > [ 1599.375941]  [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
> > > [ 1599.382935]  [<ffffffff812bba80>] ? generic_pipe_buf_nosteal+0x10/0x10
> > > [ 1599.390220]  [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
> > > [ 1599.396537]  [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
> > > [ 1599.402563]  [<ffffffff81282973>] SyS_sendfile64+0x73/0xd0
> > > [ 1599.408685]  [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
> > > [ 1599.414820]  [<ffffffff817d927f>] entry_SYSCALL64_slow_path+0x25/0x25
> > 
> > IOW, sendfile from shmem...  How easily is that reproduced (IOW, did you
> > get any more of those)?
> > 
> It is pretty reproducible so far by just running trinity from a docker
> container backed by overlayfs/xfs.
> 
> # su - test
> $ trinity
Also, AFAICT, this is NOT reproducible on v4.8 mainline, but only with this
splice_read reworked branch of the vfs tree.
   CAI Qian

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-03 20:32                                                 ` CAI Qian
@ 2016-10-03 20:35                                                   ` Al Viro
  2016-10-04 13:29                                                     ` CAI Qian
  0 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-10-03 20:35 UTC (permalink / raw)
  To: CAI Qian
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Mon, Oct 03, 2016 at 04:32:19PM -0400, CAI Qian wrote:
> > It is pretty reproducible so far by just running trinity from a docker
> > container backed by overlayfs/xfs.
> > 
> > # su - test
> > $ trinity
> Also, AFAICT, this is NOT reproducible on v4.8 mainline, but only with this
> splice_read reworked branch of the vfs tree.

I would be very surprised if mainline had somehow managed to trip sanity
checks added in vfs tree ;-)

Is there any way to record the sequence of syscalls leading to that?

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-03 15:20                                                 ` CAI Qian
@ 2016-10-03 21:12                                                   ` Dave Chinner
  2016-10-04 13:57                                                     ` CAI Qian
  0 siblings, 1 reply; 152+ messages in thread
From: Dave Chinner @ 2016-10-03 21:12 UTC (permalink / raw)
  To: CAI Qian
  Cc: Al Viro, Linus Torvalds, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Mon, Oct 03, 2016 at 11:20:50AM -0400, CAI Qian wrote:
> > container backed by overlayfs/xfs.
> There is another warning that happened once so far. Not sure if related.
> 
> [  447.961826] ------------[ cut here ]------------
> [  447.967020] WARNING: CPU: 39 PID: 27352 at fs/xfs/xfs_file.c:626 xfs_file_dio_aio_write+0x3dc/0x4b0 [xfs]
> [  447.977736] Modules linked in: ieee802154_socket ieee802154 af_key vmw_vsock_vmci_transport vsock vmw_vmci bluetooth rfkill can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr i2c_i801 i2c_smbus ipmi_ssif mei_me sg mei shpchp lpc_ich wmi ipmi_si ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm mdio ahci ptp libahci pps_core libata i2c_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
> [  448.086775] CPU: 39 PID: 27352 Comm: trinity-c39 Not tainted 4.8.0-rc8-splice+ #1
> [  448.095126] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
> [  448.106483]  0000000000000286 00000000389140f2 ffff880404833c48 ffffffff813d2eac
> [  448.114776]  0000000000000000 0000000000000000 ffff880404833c88 ffffffff8109cf11
> [  448.123067]  00000272389140f2 ffff880404833d80 ffff880404833dd8 ffff8803bfba88e8
> [  448.131356] Call Trace:
> [  448.134088]  [<ffffffff813d2eac>] dump_stack+0x85/0xc9
> [  448.139821]  [<ffffffff8109cf11>] __warn+0xd1/0xf0
> [  448.145167]  [<ffffffff8109d04d>] warn_slowpath_null+0x1d/0x20
> [  448.151705]  [<ffffffffa044165c>] xfs_file_dio_aio_write+0x3dc/0x4b0 [xfs]
> [  448.159394]  [<ffffffffa0441b10>] xfs_file_write_iter+0x90/0x130 [xfs]
> [  448.166679]  [<ffffffff81280eee>] do_iter_readv_writev+0xae/0x130
> [  448.173479]  [<ffffffff81281992>] do_readv_writev+0x1a2/0x230
> [  448.179906]  [<ffffffffa0441a80>] ? xfs_file_buffered_aio_write+0x350/0x350 [xfs]
> [  448.188256]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
> [  448.195347]  [<ffffffff810fce1d>] ? trace_hardirqs_on+0xd/0x10
> [  448.201855]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
> [  448.208944]  [<ffffffff81281c6c>] vfs_writev+0x3c/0x50
> [  448.214675]  [<ffffffff81281e22>] do_pwritev+0xa2/0xc0
> [  448.220407]  [<ffffffff81282f11>] SyS_pwritev+0x11/0x20
> [  448.226237]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [  448.232358]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [  448.239560] ---[ end trace 1c54e743f1fa4f5e ]---

This usually happens when an application mixes mmap access and
direct IO to the same file. The warning fires when the direct IO
cannot invalidate the cached range after writeback (e.g. writeback
raced with mmap app faulting and dirtying the page again), and hence
results in the page cache containing stale data.  This warning fires
when that happens, indicating to developers who get a bug report
about data corruption that it's the userspace application that is
the problem, not the filesystem. i.e the application is doing
something we explicitly document they should not do:

$ man 2 open
....
  O_DIRECT
....
       Applications should avoid mixing O_DIRECT and normal I/O to
       the same file, and especially to overlapping byte regions in
       the  same  file.   Even  when  the filesystem  correctly
       handles the coherency issues in this situation, overall I/O
       throughput is likely to be slower than using either mode
       alone.  Likewise, applications should avoid mixing mmap(2) of
       files with direct I/O to the same files.

Splice should not have this problem if the IO path locking is
correct, as both direct IO and splice IO use the same inode lock for
exclusion. i.e. splice write should not be running at the same time
as a direct IO read or write....
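
For the record, the mmap-vs-O_DIRECT anti-pattern the warning is about
looks like this hypothetical fragment (illustrative path and sizes):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* same file opened twice: buffered (for mmap) and O_DIRECT */
	int fd  = open("/mnt/xfs/file", O_RDWR);
	int dio = open("/mnt/xfs/file", O_RDWR | O_DIRECT);
	char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	void *buf;

	posix_memalign(&buf, 4096, 4096);
	map[0] = 1;			/* dirty the page via mmap */
	/* direct write to the same range: if invalidating that cached
	 * page fails (e.g. it got redirtied), the warning fires */
	pwrite(dio, buf, 4096, 0);
	return 0;
}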

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-03 20:35                                                   ` Al Viro
@ 2016-10-04 13:29                                                     ` CAI Qian
  2016-10-04 14:28                                                       ` Al Viro
  0 siblings, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-10-04 13:29 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel



----- Original Message -----
> From: "Al Viro" <viro@ZenIV.linux.org.uk>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Monday, October 3, 2016 4:35:40 PM
> Subject: Re: [RFC][CFT] splice_read reworked
> 
> On Mon, Oct 03, 2016 at 04:32:19PM -0400, CAI Qian wrote:
> > > It is pretty reproducible so far by just running trinity from a
> > > docker
> > > container backed by overlayfs/xfs.
> > > 
> > > # su - test
> > > $ trinity
> > Also, AFAICT, this is NOT reproducible on v4.8 mainline, but only with this
> > splice_read reworked branch of the vfs tree.
> 
> I would be very surprised if mainline had somehow managed to trip sanity
> checks added in vfs tree ;-)
> 
> Is there any way to record the sequence of syscalls leading to that?
> 
Yes, a bit of a long shot though.

http://people.redhat.com/qcai/tmp/trinity-child113.log

This one triggered the warning at lib/iov_iter.c:316 sanity+0x6b/0x6f
three times at once.
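
For context, the invariant that check enforces is that a pipe-backed
iov_iter points exactly at the end of the pipe's last occupied
buffer. Roughly paraphrased from the 4.8-era lib/iov_iter.c (a
simplified sketch using the kernel-internal pipe types, not the
exact code):

/* simplified paraphrase of sanity() in lib/iov_iter.c */
static bool pipe_iter_sane(const struct iov_iter *i)
{
	struct pipe_inode_info *pipe = i->pipe;
	int mask = pipe->buffers - 1;
	int last = (pipe->curbuf + pipe->nrbufs - 1) & mask;

	if (i->iov_offset) {
		const struct pipe_buffer *p = &pipe->bufs[last];

		if (!pipe->nrbufs)	/* pipe must be non-empty */
			return false;
		if (i->idx != last)	/* must be at the last buffer... */
			return false;
		if (p->offset + p->len != i->iov_offset)
			return false;	/* ... at the end of its data */
	} else if (i->idx != ((last + 1) & mask)) {
		return false;		/* right after the last buffer */
	}
	return true;
}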

[ 2200.510753] ------------[ cut here ]------------
[ 2200.515929] WARNING: CPU: 9 PID: 116624 at lib/iov_iter.c:316 sanity+0x6b/0x6f
[ 2200.523999] Modules linked in: 8021q garp mrp fuse dlci vmw_vsock_vmci_transport vsock vmw_vmci af_key ieee802154_socket ieee802154 hidp cmtp kernelcapi bnep rfcomm bluetooth rfkill can_bcm can_raw can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr i2c_i801 i2c_smbus mei_me ipmi_ssif sg lpc_ich mei shpchp wmi ipmi_si ipmi_msghandler acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm ahci mdio libahci ptp libata pps_core i2c_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[ 2200.644251] CPU: 9 PID: 116624 Comm: trinity-c113 Not tainted 4.8.0-rc8-splice+ #1
[ 2200.652708] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 2200.664062]  0000000000000286 00000000bad46fa7 ffff8803d1ca7b30 ffffffff813d2eac
[ 2200.672368]  0000000000000000 0000000000000000 ffff8803d1ca7b70 ffffffff8109cf11
[ 2200.680660]  0000013c2e32bdc8 ffffea000eea7540 0000000000001000 ffff88030e9a0000
[ 2200.688954] Call Trace:
[ 2200.691686]  [<ffffffff813d2eac>] dump_stack+0x85/0xc9
[ 2200.697433]  [<ffffffff8109cf11>] __warn+0xd1/0xf0
[ 2200.702777]  [<ffffffff8109d04d>] warn_slowpath_null+0x1d/0x20
[ 2200.709285]  [<ffffffff81418c93>] sanity+0x6b/0x6f
[ 2200.714630]  [<ffffffff813e9586>] copy_page_to_iter+0xf6/0x1e0
[ 2200.721139]  [<ffffffff811e3906>] generic_file_read_iter+0x406/0x800
[ 2200.728231]  [<ffffffff810f8afd>] ? down_read_nested+0x4d/0x80
[ 2200.734798]  [<ffffffffa02c46ae>] ? xfs_ilock+0x1ae/0x260 [xfs]
[ 2200.741433]  [<ffffffffa02b3f2f>] xfs_file_buffered_aio_read+0x6f/0x1b0 [xfs]
[ 2200.749412]  [<ffffffffa02b46e8>] xfs_file_read_iter+0x68/0xc0 [xfs]
[ 2200.756504]  [<ffffffff812bb359>] generic_file_splice_read+0xb9/0x1b0
[ 2200.763691]  [<ffffffff812bb913>] do_splice_to+0x73/0x90
[ 2200.769618]  [<ffffffff812bba1b>] splice_direct_to_actor+0xeb/0x220
[ 2200.776610]  [<ffffffff812baee0>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 2200.783893]  [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
[ 2200.790207]  [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
[ 2200.796231]  [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
[ 2200.802351]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2200.808471]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2200.815723] ---[ end trace e02dda43787dce2a ]---
[ 2200.821003] ------------[ cut here ]------------
[ 2200.826168] WARNING: CPU: 9 PID: 116624 at lib/iov_iter.c:316 sanity+0x6b/0x6f
[ 2200.834765] Modules linked in: 8021q garp mrp fuse dlci vmw_vsock_vmci_transport vsock vmw_vmci af_key ieee802154_socket ieee802154 hidp cmtp kernelcapi bnep rfcomm bluetooth rfkill can_bcm can_raw can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr i2c_i801 i2c_smbus mei_me ipmi_ssif sg lpc_ich mei shpchp wmi ipmi_si ipmi_msghandler acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm ahci mdio libahci ptp libata pps_core i2c_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[ 2200.951286] CPU: 9 PID: 116624 Comm: trinity-c113 Tainted: G        W       4.8.0-rc8-splice+ #1
[ 2200.961088] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 2200.972443]  0000000000000286 00000000bad46fa7 ffff8803d1ca7b30 ffffffff813d2eac
[ 2200.980747]  0000000000000000 0000000000000000 ffff8803d1ca7b70 ffffffff8109cf11
[ 2200.989078]  0000013c00000000 ffffea000b711880 0000000000001000 ffff88030e9a0000
[ 2200.997375] Call Trace:
[ 2201.000107]  [<ffffffff813d2eac>] dump_stack+0x85/0xc9
[ 2201.005842]  [<ffffffff8109cf11>] __warn+0xd1/0xf0
[ 2201.011199]  [<ffffffff8109d04d>] warn_slowpath_null+0x1d/0x20
[ 2201.017708]  [<ffffffff81418c93>] sanity+0x6b/0x6f
[ 2201.023053]  [<ffffffff813e9586>] copy_page_to_iter+0xf6/0x1e0
[ 2201.029562]  [<ffffffff811e3906>] generic_file_read_iter+0x406/0x800
[ 2201.036654]  [<ffffffff810f8afd>] ? down_read_nested+0x4d/0x80
[ 2201.043213]  [<ffffffffa02c46ae>] ? xfs_ilock+0x1ae/0x260 [xfs]
[ 2201.049849]  [<ffffffffa02b3f2f>] xfs_file_buffered_aio_read+0x6f/0x1b0 [xfs]
[ 2201.057828]  [<ffffffffa02b46e8>] xfs_file_read_iter+0x68/0xc0 [xfs]
[ 2201.064919]  [<ffffffff812bb359>] generic_file_splice_read+0xb9/0x1b0
[ 2201.072108]  [<ffffffff812bb913>] do_splice_to+0x73/0x90
[ 2201.078034]  [<ffffffff812bba1b>] splice_direct_to_actor+0xeb/0x220
[ 2201.085026]  [<ffffffff812baee0>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 2201.092309]  [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
[ 2201.098623]  [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
[ 2201.104646]  [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
[ 2201.110768]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2201.116890]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2201.124136] ---[ end trace e02dda43787dce2b ]---
[ 2201.192680] ------------[ cut here ]------------
[ 2201.203826] WARNING: CPU: 9 PID: 116624 at lib/iov_iter.c:316 sanity+0x6b/0x6f
[ 2201.211899] Modules linked in: 8021q garp mrp fuse dlci vmw_vsock_vmci_transport vsock vmw_vmci af_key ieee802154_socket ieee802154 hidp cmtp kernelcapi bnep rfcomm bluetooth rfkill can_bcm can_raw can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr i2c_i801 i2c_smbus mei_me ipmi_ssif sg lpc_ich mei shpchp wmi ipmi_si ipmi_msghandler acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm ahci mdio libahci ptp libata pps_core i2c_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[ 2201.329000] CPU: 9 PID: 116624 Comm: trinity-c113 Tainted: G        W       4.8.0-rc8-splice+ #1
[ 2201.338805] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 2201.350160]  0000000000000286 00000000bad46fa7 ffff8803d1ca7b30 ffffffff813d2eac
[ 2201.358455]  0000000000000000 0000000000000000 ffff8803d1ca7b70 ffffffff8109cf11
[ 2201.366747]  0000013c00000000 ffffea000be93cc0 0000000000001000 ffff88030e9a0000
[ 2201.375035] Call Trace:
[ 2201.377767]  [<ffffffff813d2eac>] dump_stack+0x85/0xc9
[ 2201.383499]  [<ffffffff8109cf11>] __warn+0xd1/0xf0
[ 2201.388843]  [<ffffffff8109d04d>] warn_slowpath_null+0x1d/0x20
[ 2201.395351]  [<ffffffff81418c93>] sanity+0x6b/0x6f
[ 2201.400695]  [<ffffffff813e9586>] copy_page_to_iter+0xf6/0x1e0
[ 2201.407204]  [<ffffffff811e3906>] generic_file_read_iter+0x406/0x800
[ 2201.414294]  [<ffffffff810f8afd>] ? down_read_nested+0x4d/0x80
[ 2201.420844]  [<ffffffffa02c46ae>] ? xfs_ilock+0x1ae/0x260 [xfs]
[ 2201.427463]  [<ffffffffa02b3f2f>] xfs_file_buffered_aio_read+0x6f/0x1b0 [xfs]
[ 2201.435451]  [<ffffffffa02b46e8>] xfs_file_read_iter+0x68/0xc0 [xfs]
[ 2201.442542]  [<ffffffff812bb359>] generic_file_splice_read+0xb9/0x1b0
[ 2201.449728]  [<ffffffff812bb913>] do_splice_to+0x73/0x90
[ 2201.455655]  [<ffffffff812bba1b>] splice_direct_to_actor+0xeb/0x220
[ 2201.462645]  [<ffffffff812baee0>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 2201.469928]  [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
[ 2201.476242]  [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
[ 2201.482264]  [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
[ 2201.488383]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2201.494504]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2201.501736] ---[ end trace e02dda43787dce2c ]---

   CAI Qian

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-03 21:12                                                   ` Dave Chinner
@ 2016-10-04 13:57                                                     ` CAI Qian
  0 siblings, 0 replies; 152+ messages in thread
From: CAI Qian @ 2016-10-04 13:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Al Viro, Linus Torvalds, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel


> This usually happens when an application mixes mmap access and
> direct IO to the same file. The warning fires when the direct IO
> cannot invalidate the cached range after writeback (e.g. writeback
> raced with the mmap'd app faulting and dirtying the page again),
> leaving the page cache containing stale data. It indicates to
> developers who get a bug report about data corruption that it's the
> userspace application that is the problem, not the filesystem,
> i.e. the application is doing something we explicitly document it
> should not do:
> 
> $ man 2 open
> ....
>   O_DIRECT
> ....
>        Applications should avoid mixing O_DIRECT and normal I/O to
>        the same file, and especially to overlapping byte regions in
>        the  same  file.   Even  when  the filesystem  correctly
>        handles the coherency issues in this situation, overall I/O
>        throughput is likely to be slower than using either mode
>        alone.  Likewise, applications should avoid mixing mmap(2) of
>        files with direct I/O to the same files.
> 
> Splice should not have this problem if the IO path locking is
> correct, as both direct IO and splice IO use the same inode lock for
> exclusion. i.e. splice write should not be running at the same time
> as a direct IO read or write....
OK, so I assume that trinity is doing something that a proper userspace
application wouldn't do, which is fine, and there is nothing to worry
about from the kernel's perspective.

I just want to make sure there is no security implication here, e.g. a
non-privileged user being able to corrupt other users' data.

   CAI Qian


^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-04 13:29                                                     ` CAI Qian
@ 2016-10-04 14:28                                                       ` Al Viro
  2016-10-04 16:21                                                         ` CAI Qian
  0 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-10-04 14:28 UTC (permalink / raw)
  To: CAI Qian
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Tue, Oct 04, 2016 at 09:29:35AM -0400, CAI Qian wrote:

> > Is there any way to record the sequence of syscalls leading to that?
> > 
> Yes, a bit long shot though.
> 
> http://people.redhat.com/qcai/tmp/trinity-child113.log

;-/

Not enough information, unfortunately (descriptor in question opened
outside of that log, sendfile(out_fd=578, in_fd=578, offset=0x7f8318a07000,
count=0x3ffc00) doesn't tell what *offset was before the call) ;-/
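
Background on why the pre-call value matters: the logged offset is
the pointer, not the value behind it. With a non-NULL offset
argument, sendfile(2) reads from *offset, writes the updated
position back through the pointer, and leaves in_fd's own file
position untouched. A minimal sketch of the shape of the logged
call (the start value is exactly the unknown here):

#include <sys/types.h>
#include <sys/sendfile.h>

static ssize_t splice_self(int fd, off_t start)
{
	off_t off = start;	/* what *offset held before the call */

	/* same shape as the logged call: out_fd == in_fd == fd */
	return sendfile(fd, fd, &off, 0x3ffc00);
}

Without that initial *offset, the source range of the transfer
cannot be reconstructed from the log alone.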

Anyway, I've found and fixed a bug in pipe_advance(), which might or might
not help with those.  Could you try vfs.git#work.splice_read (or #for-next)
and see if these persist?

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-04 14:28                                                       ` Al Viro
@ 2016-10-04 16:21                                                         ` CAI Qian
  2016-10-04 20:12                                                           ` Al Viro
  0 siblings, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-10-04 16:21 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel


> Not enough information, unfortunately (descriptor in question opened
> outside of that log, sendfile(out_fd=578, in_fd=578, offset=0x7f8318a07000,
> count=0x3ffc00) doesn't tell what *offset was before the call) ;-/
> 
> Anyway, I've found and fixed a bug in pipe_advance(), which might or might
> not help with those.  Could you try vfs.git#work.splice_read (or #for-next)
> and see if these persist?
I am afraid that this can also be reproduced in the latest #for-next. The warning
always showed up at the end of the trinity run. I captured more information this time.

http://people.redhat.com/qcai/tmp/trinity-child150.log
http://people.redhat.com/qcai/tmp/tri-full.log (big file so may just grep "child150")
http://people.redhat.com/qcai/tmp/trinity.log

[ 2187.697999] ------------[ cut here ]------------
[ 2187.703181] WARNING: CPU: 34 PID: 67630 at lib/iov_iter.c:316 sanity+0x6b/0x6f
[ 2187.713890] Modules linked in: fuse vmac tcp_diag udp_diag inet_diag ieee802154_socket ieee802154 af_key vmw_vsock_vmci_transport vsock vmw_vmci bluetooth rfkill can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr ipmi_ssif i2c_i801 i2c_smbus mei_me sg lpc_ich mei shpchp wmi ipmi_si ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm ahci libahci mdio ptp libata i2c_core pps_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[ 2187.828488] CPU: 29 PID: 67630 Comm: trinity-c150 Not tainted 4.8.0-rc8-fornext+ #1
[ 2187.837034] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 2187.848392]  0000000000000286 00000000a4c9de22 ffff8803f0d5bb30 ffffffff813d30ac
[ 2187.856687]  0000000000000000 0000000000000000 ffff8803f0d5bb70 ffffffff8109cf31
[ 2187.864983]  0000013c1923e8c0 ffffea000db71000 0000000000001000 ffff88044b127200
[ 2187.873282] Call Trace:
[ 2187.876017]  [<ffffffff813d30ac>] dump_stack+0x85/0xc9
[ 2187.881756]  [<ffffffff8109cf31>] __warn+0xd1/0xf0
[ 2187.887104]  [<ffffffff8109d06d>] warn_slowpath_null+0x1d/0x20
[ 2187.893616]  [<ffffffff81418ec8>] sanity+0x6b/0x6f
[ 2187.898967]  [<ffffffff813e97a6>] copy_page_to_iter+0xf6/0x1e0
[ 2187.905478]  [<ffffffff811e3926>] generic_file_read_iter+0x406/0x800
[ 2187.912570]  [<ffffffff810f8b1d>] ? down_read_nested+0x4d/0x80
[ 2187.919123]  [<ffffffffa029b74e>] ? xfs_ilock+0x1ae/0x260 [xfs]
[ 2187.925746]  [<ffffffffa028af2f>] xfs_file_buffered_aio_read+0x6f/0x1b0 [xfs]
[ 2187.933756]  [<ffffffffa028b6e8>] xfs_file_read_iter+0x68/0xc0 [xfs]
[ 2187.940847]  [<ffffffff812bb559>] generic_file_splice_read+0xb9/0x1b0
[ 2187.948034]  [<ffffffff812bbb13>] do_splice_to+0x73/0x90
[ 2187.953962]  [<ffffffff812bbc1b>] splice_direct_to_actor+0xeb/0x220
[ 2187.960955]  [<ffffffff812bb0e0>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 2187.968243]  [<ffffffff812bbdd9>] do_splice_direct+0x89/0xd0
[ 2187.974561]  [<ffffffff8128263e>] do_sendfile+0x1ce/0x3b0
[ 2187.980580]  [<ffffffff812831ef>] SyS_sendfile64+0x6f/0xd0
[ 2187.986698]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2187.992823]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2188.000349] ---[ end trace a3a1d0412c1a1214 ]---
[ 2188.006348] ------------[ cut here ]------------
[ 2188.011842] WARNING: CPU: 26 PID: 67630 at lib/iov_iter.c:316 sanity+0x6b/0x6f
[ 2188.019914] Modules linked in: fuse vmac tcp_diag udp_diag inet_diag ieee802154_socket ieee802154 af_key vmw_vsock_vmci_transport vsock vmw_vmci bluetooth rfkill can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr ipmi_ssif i2c_i801 i2c_smbus mei_me sg lpc_ich mei shpchp wmi ipmi_si ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm ahci libahci mdio ptp libata i2c_core pps_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[ 2188.133408] CPU: 54 PID: 67630 Comm: trinity-c150 Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 2188.143310] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 2188.154667]  0000000000000286 00000000a4c9de22 ffff8803f0d5bb30 ffffffff813d30ac
[ 2188.162962]  0000000000000000 0000000000000000 ffff8803f0d5bb70 ffffffff8109cf31
[ 2188.171257]  0000013c1923e8c8 ffffea000dbbd700 0000000000001000 ffff88044b127200
[ 2188.179551] Call Trace:
[ 2188.182284]  [<ffffffff813d30ac>] dump_stack+0x85/0xc9
[ 2188.188022]  [<ffffffff8109cf31>] __warn+0xd1/0xf0
[ 2188.193368]  [<ffffffff8109d06d>] warn_slowpath_null+0x1d/0x20
[ 2188.199879]  [<ffffffff81418ec8>] sanity+0x6b/0x6f
[ 2188.205227]  [<ffffffff813e97a6>] copy_page_to_iter+0xf6/0x1e0
[ 2188.211738]  [<ffffffff811e3926>] generic_file_read_iter+0x406/0x800
[ 2188.218824]  [<ffffffff810f8b1d>] ? down_read_nested+0x4d/0x80
[ 2188.225363]  [<ffffffffa029b74e>] ? xfs_ilock+0x1ae/0x260 [xfs]
[ 2188.231988]  [<ffffffffa028af2f>] xfs_file_buffered_aio_read+0x6f/0x1b0 [xfs]
[ 2188.239967]  [<ffffffffa028b6e8>] xfs_file_read_iter+0x68/0xc0 [xfs]
[ 2188.247059]  [<ffffffff812bb559>] generic_file_splice_read+0xb9/0x1b0
[ 2188.254246]  [<ffffffff812bbb13>] do_splice_to+0x73/0x90
[ 2188.260174]  [<ffffffff812bbc1b>] splice_direct_to_actor+0xeb/0x220
[ 2188.267168]  [<ffffffff812bb0e0>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 2188.274453]  [<ffffffff812bbdd9>] do_splice_direct+0x89/0xd0
[ 2188.280771]  [<ffffffff8128263e>] do_sendfile+0x1ce/0x3b0
[ 2188.286796]  [<ffffffff812831ef>] SyS_sendfile64+0x6f/0xd0
[ 2188.292918]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2188.299040]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2188.313523] ---[ end trace a3a1d0412c1a1215 ]---
[ 2188.458941] ------------[ cut here ]------------
[ 2188.464181] WARNING: CPU: 10 PID: 67630 at lib/iov_iter.c:316 sanity+0x6b/0x6f
[ 2188.472261] Modules linked in: fuse vmac tcp_diag udp_diag inet_diag ieee802154_socket ieee802154 af_key vmw_vsock_vmci_transport vsock vmw_vmci bluetooth rfkill can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr ipmi_ssif i2c_i801 i2c_smbus mei_me sg lpc_ich mei shpchp wmi ipmi_si ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ttm ixgbe drm ahci libahci mdio ptp libata i2c_core pps_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[ 2188.585528] CPU: 38 PID: 67630 Comm: trinity-c150 Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 2188.595431] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 2188.606786]  0000000000000286 00000000a4c9de22 ffff8803f0d5bb30 ffffffff813d30ac
[ 2188.615082]  0000000000000000 0000000000000000 ffff8803f0d5bb70 ffffffff8109cf31
[ 2188.623379]  0000013c11bafb58 ffffea000ee78980 0000000000001000 ffff88044b127200
[ 2188.631675] Call Trace:
[ 2188.634410]  [<ffffffff813d30ac>] dump_stack+0x85/0xc9
[ 2188.640148]  [<ffffffff8109cf31>] __warn+0xd1/0xf0
[ 2188.645497]  [<ffffffff8109d06d>] warn_slowpath_null+0x1d/0x20
[ 2188.652324]  [<ffffffff81418ec8>] sanity+0x6b/0x6f
[ 2188.657672]  [<ffffffff813e97a6>] copy_page_to_iter+0xf6/0x1e0
[ 2188.664185]  [<ffffffff811e3926>] generic_file_read_iter+0x406/0x800
[ 2188.671268]  [<ffffffff810f8b1d>] ? down_read_nested+0x4d/0x80
[ 2188.677825]  [<ffffffffa029b74e>] ? xfs_ilock+0x1ae/0x260 [xfs]
[ 2188.684450]  [<ffffffffa028af2f>] xfs_file_buffered_aio_read+0x6f/0x1b0 [xfs]
[ 2188.692433]  [<ffffffffa028b6e8>] xfs_file_read_iter+0x68/0xc0 [xfs]
[ 2188.699525]  [<ffffffff812bb559>] generic_file_splice_read+0xb9/0x1b0
[ 2188.706711]  [<ffffffff812bbb13>] do_splice_to+0x73/0x90
[ 2188.712638]  [<ffffffff812bbc1b>] splice_direct_to_actor+0xeb/0x220
[ 2188.719632]  [<ffffffff812bb0e0>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 2188.726916]  [<ffffffff812bbdd9>] do_splice_direct+0x89/0xd0
[ 2188.733231]  [<ffffffff8128263e>] do_sendfile+0x1ce/0x3b0
[ 2188.739255]  [<ffffffff812831ef>] SyS_sendfile64+0x6f/0xd0
[ 2188.745377]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2188.751500]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2188.760216] ---[ end trace a3a1d0412c1a1216 ]---

^ permalink raw reply	[flat|nested] 152+ messages in thread

* local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-03 17:49                                                 ` CAI Qian
@ 2016-10-04 17:39                                                   ` CAI Qian
  2016-10-04 21:42                                                     ` tj
  0 siblings, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-10-04 17:39 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel, tj


> ----- Original Message -----
> > From: "Al Viro" <viro@ZenIV.linux.org.uk>
> > To: "CAI Qian" <caiqian@redhat.com>
> > Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner"
> > <david@fromorbit.com>, "linux-xfs"
> > <linux-xfs@vger.kernel.org>, xfs@oss.sgi.com, "Jens Axboe"
> > <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> > linux-fsdevel@vger.kernel.org
> > Sent: Sunday, October 2, 2016 9:37:37 PM
> > Subject: Re: [RFC][CFT] splice_read reworked
> > 
> > On Fri, Sep 30, 2016 at 02:33:23PM -0400, CAI Qian wrote:
> > 
> > OK, the immediate trigger is
> > 	* sendfile() from something that uses seq_read to a regular file.
> > Does sb_start_write() around the call of do_splice_direct() (as always),
> > which ends up calling default_file_splice_read() (again, as usual), which
> > ends up calling ->read() of the source, i.e. seq_read().  No changes there.
> >  
> > 	* sb_start_write() can be called under ->i_mutex.  The latter is
> > on overlayfs inode, the former is done to upper layer in that overlayfs.
> > Nothing new, again.
> > 
> > 	* ->i_mutex can be taken under ->cred_guard_mutex.  Yes, it can -
> > in open_exec().  Again, no changes.
> > 
> > 	* ->cred_guard_mutex can be taken in ->show() of a seq_file,
> > namely /proc/*/auxv...  Argh, ->cred_guard_mutex whack-a-mole strikes
> > again...
> > 
> > OK, I think essentially the same warning had been triggerable since _way_
> > back.  All changes around splice have no effect on it.
> > 
> > Look: to get a deadlock we need
> > 	(1) sendfile from /proc/<pid>/auxv to a regular file on upper layer of
> > overlayfs requesting not to freeze the target.
> > 	(2) attempt to freeze it blocking until (1) is done.
> > 	(3) directory modification on overlayfs trying to request not to freeze
> > the upper layer and blocking until (2) is done.
> > 	(4) execve() in <pid> holding ->cred_guard_mutex, trying to open
> > something in overlayfs and getting blocked on directory lock, held by (3).
> > 
> > Now (1) gets around to reading from /proc/<pid>/auxv, which blocks on
> > ->cred_guard_mutex.  Mentioning of seq_read itself holding locks is
> > irrelevant;
> > what matters is that ->read() grabs ->cred_guard_mutex.
> > 
> > We used to have similar problems in /proc/*/environ and /proc/*/mem; looks
> > like /proc/*/environ needs to get the treatment similar to e268337dfe26 and
> > b409e578d9a4.
> > 
> You are right. This is also reproducible on v4.8 mainline.
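
For reference, step (1) of the chain above in code form; this is a
hypothetical sketch with made-up paths (any seq_file-backed proc
file sent to a regular file on the overlayfs upper layer should
have the same shape):

#include <fcntl.h>
#include <sys/types.h>
#include <sys/sendfile.h>
#include <unistd.h>

int main(void)
{
	/* source: seq_file whose ->show() takes ->cred_guard_mutex */
	int in = open("/proc/self/auxv", O_RDONLY);
	/* destination: regular file on an overlayfs mount */
	int out = open("/merged/some-file", O_WRONLY | O_CREAT, 0644);
	off_t off = 0;

	if (in < 0 || out < 0)
		return 1;
	/* do_splice_direct() -> default_file_splice_read() -> seq_read() */
	sendfile(out, in, &off, 4096);
	return 0;
}
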
Not sure if related, but right after this lockdep splat happened and the
trinity run by a non-privileged user finished inside the container, the
host's systemctl command just hangs or times out, which renders the whole
system unusable.

# systemctl status docker
Failed to get properties: Connection timed out

# systemctl reboot (hang)

[ 5535.596651] INFO: task systemd-journal:1165 blocked for more than 120 seconds.
[ 5535.604728]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5535.611536] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5535.620285] systemd-journal D ffff880466167ca8 12672  1165      1 0x00000000
[ 5535.628182]  ffff880466167ca8 ffff880466167cd0 0000000000000000 ffff88086c6e2000
[ 5535.636504]  ffff88045deb0000 ffff880466168000 ffffffff81deb380 ffff88045deb0000
[ 5535.644817]  0000000000000246 00000000ffffffff ffff880466167cc0 ffffffff817cdaaf
[ 5535.653131] Call Trace:
[ 5535.655874]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5535.661425]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 5535.668617]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 5535.675237]  [<ffffffff81161d7e>] ? proc_cgroup_show+0x4e/0x300
[ 5535.681857]  [<ffffffff81252b01>] ? kmem_cache_alloc_trace+0x1d1/0x2e0
[ 5535.689162]  [<ffffffff81161d7e>] proc_cgroup_show+0x4e/0x300
[ 5535.695592]  [<ffffffff81302d40>] proc_single_show+0x50/0x90
[ 5535.701925]  [<ffffffff812ac983>] seq_read+0x113/0x3e0
[ 5535.707672]  [<ffffffff81280407>] __vfs_read+0x37/0x150
[ 5535.713521]  [<ffffffff81349ded>] ? security_file_permission+0x9d/0xc0
[ 5535.720819]  [<ffffffff812815ac>] vfs_read+0x8c/0x130
[ 5535.726472]  [<ffffffff81282ac8>] SyS_read+0x58/0xc0
[ 5535.732024]  [<ffffffff817d497c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
[ 5535.739221] INFO: lockdep is turned off.
[ 5535.743649] INFO: task kworker/3:1:52401 blocked for more than 120 seconds.
[ 5535.751429]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5535.758239] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5535.766989] kworker/3:1     D ffff8803b25bbca8 13368 52401      2 0x00000080
[ 5535.774904] Workqueue: cgroup_destroy css_release_work_fn
[ 5535.780940]  ffff8803b25bbca8 ffff8803b25bbcd0 0000000000000000 ffff88046ded2000
[ 5535.789254]  ffff88046af8a000 ffff8803b25bc000 ffffffff81deb380 ffff88046af8a000
[ 5535.797562]  0000000000000246 00000000ffffffff ffff8803b25bbcc0 ffffffff817cdaaf
[ 5535.805877] Call Trace:
[ 5535.808621]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5535.814177]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 5535.821379]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 5535.828001]  [<ffffffff811586af>] ? css_release_work_fn+0x2f/0x110
[ 5535.834911]  [<ffffffff811586af>] css_release_work_fn+0x2f/0x110
[ 5535.841629]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
[ 5535.848159]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
[ 5535.854876]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
[ 5535.861119]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
[ 5535.867847]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
[ 5535.873404]  [<ffffffff817d40ec>] ? _raw_spin_unlock_irq+0x2c/0x60
[ 5535.880320]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
[ 5535.886369]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230
[ 5535.893675] INFO: lockdep is turned off.
[ 5535.898085] INFO: task kworker/45:4:146035 blocked for more than 120 seconds.
[ 5535.906059]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5535.912865] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5535.921613] kworker/45:4    D ffff880853e9b950 14048 146035      2 0x00000080
[ 5535.929630] Workqueue: cgroup_destroy css_killed_work_fn
[ 5535.935582]  ffff880853e9b950 0000000000000000 0000000000000000 ffff88086c6da000
[ 5535.943882]  ffff88086c9e2000 ffff880853e9c000 ffff880853e9baa0 ffff88086c9e2000
[ 5535.952205]  ffff880853e9ba98 0000000000000001 ffff880853e9b968 ffffffff817cdaaf
[ 5535.960522] Call Trace:
[ 5535.963265]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5535.968817]  [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0
[ 5535.975346]  [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130
[ 5535.982256]  [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130
[ 5535.988972]  [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80
[ 5535.994804]  [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0
[ 5536.001227]  [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0
[ 5536.007648]  [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0
[ 5536.014654]  [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0
[ 5536.021657]  [<ffffffff810f57f5>] unregister_sched_domain_sysctl+0x15/0x40
[ 5536.029344]  [<ffffffff810d7704>] partition_sched_domains+0x44/0x450
[ 5536.036447]  [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0
[ 5536.043844]  [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0
[ 5536.051336]  [<ffffffff8116789d>] update_flag+0x11d/0x210
[ 5536.057373]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
[ 5536.064186]  [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60
[ 5536.070899]  [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10
[ 5536.077420]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
[ 5536.084234]  [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220
[ 5536.091049]  [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60
[ 5536.097571]  [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220
[ 5536.104207]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
[ 5536.110736]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
[ 5536.117461]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
[ 5536.123697]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
[ 5536.130426]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
[ 5536.135991]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
[ 5536.142041]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230
[ 5536.149345] INFO: lockdep is turned off.
[ 5585.148183] perf: interrupt took too long (3146 > 3136), lowering kernel.perf_event_max_sample_rate to 63000
[ 5658.479538] INFO: task systemd:1 blocked for more than 120 seconds.
[ 5658.486551]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5658.493352] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5658.502095] systemd         D ffff880468ccfca8 11952     1      0 0x00000000
[ 5658.509995]  ffff880468ccfca8 ffff880468ccfcd0 0000000000000000 ffff88046aa24000
[ 5658.518297]  ffff880468cd0000 ffff880468cd0000 ffffffff81deb380 ffff880468cd0000
[ 5658.526602]  0000000000000246 00000000ffffffff ffff880468ccfcc0 ffffffff817cdaaf
[ 5658.534909] Call Trace:
[ 5658.537645]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5658.543188]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 5658.550375]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 5658.556987]  [<ffffffff81161d7e>] ? proc_cgroup_show+0x4e/0x300
[ 5658.563600]  [<ffffffff81252b01>] ? kmem_cache_alloc_trace+0x1d1/0x2e0
[ 5658.570887]  [<ffffffff81161d7e>] proc_cgroup_show+0x4e/0x300
[ 5658.577304]  [<ffffffff81302d40>] proc_single_show+0x50/0x90
[ 5658.583620]  [<ffffffff812ac983>] seq_read+0x113/0x3e0
[ 5658.589355]  [<ffffffff81280407>] __vfs_read+0x37/0x150
[ 5658.595189]  [<ffffffff81349ded>] ? security_file_permission+0x9d/0xc0
[ 5658.602480]  [<ffffffff812815ac>] vfs_read+0x8c/0x130
[ 5658.608117]  [<ffffffff81282ac8>] SyS_read+0x58/0xc0
[ 5658.613661]  [<ffffffff817d497c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
[ 5658.620849] INFO: lockdep is turned off.
[ 5658.625282] INFO: task systemd-journal:1165 blocked for more than 120 seconds.
[ 5658.633346]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5658.640147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5658.648887] systemd-journal D ffff880466167ca8 12672  1165      1 0x00000000
[ 5658.656788]  ffff880466167ca8 ffff880466167cd0 0000000000000000 ffff88086c6e2000
[ 5658.665092]  ffff88045deb0000 ffff880466168000 ffffffff81deb380 ffff88045deb0000
[ 5658.673394]  0000000000000246 00000000ffffffff ffff880466167cc0 ffffffff817cdaaf
[ 5658.681690] Call Trace:
[ 5658.684419]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5658.689961]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 5658.697143]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 5658.703766]  [<ffffffff81161d7e>] ? proc_cgroup_show+0x4e/0x300
[ 5658.710373]  [<ffffffff81252b01>] ? kmem_cache_alloc_trace+0x1d1/0x2e0
[ 5658.717661]  [<ffffffff81161d7e>] proc_cgroup_show+0x4e/0x300
[ 5658.724067]  [<ffffffff81302d40>] proc_single_show+0x50/0x90
[ 5658.730386]  [<ffffffff812ac983>] seq_read+0x113/0x3e0
[ 5658.736123]  [<ffffffff81280407>] __vfs_read+0x37/0x150
[ 5658.741957]  [<ffffffff81349ded>] ? security_file_permission+0x9d/0xc0
[ 5658.749244]  [<ffffffff812815ac>] vfs_read+0x8c/0x130
[ 5658.754884]  [<ffffffff81282ac8>] SyS_read+0x58/0xc0
[ 5658.760417]  [<ffffffff817d497c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
[ 5658.767607] INFO: lockdep is turned off.
[ 5658.772016] INFO: task kworker/3:1:52401 blocked for more than 120 seconds.
[ 5658.779789]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5658.786582] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5658.795322] kworker/3:1     D ffff8803b25bbca8 13368 52401      2 0x00000080
[ 5658.803224] Workqueue: cgroup_destroy css_release_work_fn
[ 5658.809261]  ffff8803b25bbca8 ffff8803b25bbcd0 0000000000000000 ffff88046ded2000
[ 5658.817567]  ffff88046af8a000 ffff8803b25bc000 ffffffff81deb380 ffff88046af8a000
[ 5658.825871]  0000000000000246 00000000ffffffff ffff8803b25bbcc0 ffffffff817cdaaf
[ 5658.834173] Call Trace:
[ 5658.836904]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5658.842447]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 5658.849638]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 5658.856246]  [<ffffffff811586af>] ? css_release_work_fn+0x2f/0x110
[ 5658.863146]  [<ffffffff811586af>] css_release_work_fn+0x2f/0x110
[ 5658.869858]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
[ 5658.876370]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
[ 5658.883067]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
[ 5658.889287]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
[ 5658.895991]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
[ 5658.901538]  [<ffffffff817d40ec>] ? _raw_spin_unlock_irq+0x2c/0x60
[ 5658.908438]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
[ 5658.914466]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230
[ 5658.921745] INFO: lockdep is turned off.
[ 5658.926133] INFO: task kworker/45:4:146035 blocked for more than 120 seconds.
[ 5658.934099]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5658.940902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5658.949636] kworker/45:4    D ffff880853e9b950 14048 146035      2 0x00000080
[ 5658.957632] Workqueue: cgroup_destroy css_killed_work_fn
[ 5658.963574]  ffff880853e9b950 0000000000000000 0000000000000000 ffff88086c6da000
[ 5658.971877]  ffff88086c9e2000 ffff880853e9c000 ffff880853e9baa0 ffff88086c9e2000
[ 5658.980179]  ffff880853e9ba98 0000000000000001 ffff880853e9b968 ffffffff817cdaaf
[ 5658.988498] Call Trace:
[ 5658.991225]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5658.996768]  [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0
[ 5659.003271]  [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130
[ 5659.010161]  [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130
[ 5659.016871]  [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80
[ 5659.022706]  [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0
[ 5659.029120]  [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0
[ 5659.035535]  [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0
[ 5659.042529]  [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0
[ 5659.049528]  [<ffffffff810f57f5>] unregister_sched_domain_sysctl+0x15/0x40
[ 5659.057203]  [<ffffffff810d7704>] partition_sched_domains+0x44/0x450
[ 5659.064297]  [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0
[ 5659.071673]  [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0
[ 5659.079144]  [<ffffffff8116789d>] update_flag+0x11d/0x210
[ 5659.085172]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
[ 5659.091964]  [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60
[ 5659.098668]  [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10
[ 5659.105179]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
[ 5659.111982]  [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220
[ 5659.118783]  [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60
[ 5659.125296]  [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220
[ 5659.131906]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
[ 5659.138417]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
[ 5659.145124]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
[ 5659.151345]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
[ 5659.158044]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
[ 5659.163586]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
[ 5659.169605]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230
[ 5659.176892] INFO: lockdep is turned off.
[ 5781.364367] INFO: task systemd:1 blocked for more than 120 seconds.
[ 5781.371373]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5781.378177] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5781.386918] systemd         D ffff880468ccfca8 11952     1      0 0x00000000
[ 5781.394818]  ffff880468ccfca8 ffff880468ccfcd0 0000000000000000 ffff88046aa24000
[ 5781.403121]  ffff880468cd0000 ffff880468cd0000 ffffffff81deb380 ffff880468cd0000
[ 5781.411421]  0000000000000246 00000000ffffffff ffff880468ccfcc0 ffffffff817cdaaf
[ 5781.419725] Call Trace:
[ 5781.422460]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5781.428003]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 5781.435192]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 5781.441801]  [<ffffffff81161d7e>] ? proc_cgroup_show+0x4e/0x300
[ 5781.448404]  [<ffffffff81252b01>] ? kmem_cache_alloc_trace+0x1d1/0x2e0
[ 5781.455691]  [<ffffffff81161d7e>] proc_cgroup_show+0x4e/0x300
[ 5781.462109]  [<ffffffff81302d40>] proc_single_show+0x50/0x90
[ 5781.468428]  [<ffffffff812ac983>] seq_read+0x113/0x3e0
[ 5781.474165]  [<ffffffff81280407>] __vfs_read+0x37/0x150
[ 5781.479991]  [<ffffffff81349ded>] ? security_file_permission+0x9d/0xc0
[ 5781.487277]  [<ffffffff812815ac>] vfs_read+0x8c/0x130
[ 5781.492914]  [<ffffffff81282ac8>] SyS_read+0x58/0xc0
[ 5781.498455]  [<ffffffff817d497c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
[ 5781.505646] INFO: lockdep is turned off.
[ 5781.510085] INFO: task systemd-journal:1165 blocked for more than 120 seconds.
[ 5781.518146]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5781.524946] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5781.533686] systemd-journal D ffff880466167ca8 12672  1165      1 0x00000000
[ 5781.541581]  ffff880466167ca8 ffff880466167cd0 0000000000000000 ffff88086c6e2000
[ 5781.549880]  ffff88045deb0000 ffff880466168000 ffffffff81deb380 ffff88045deb0000
[ 5781.558186]  0000000000000246 00000000ffffffff ffff880466167cc0 ffffffff817cdaaf
[ 5781.566493] Call Trace:
[ 5781.569222]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5781.574764]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 5781.581953]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 5781.588559]  [<ffffffff81161d7e>] ? proc_cgroup_show+0x4e/0x300
[ 5781.595166]  [<ffffffff81252b01>] ? kmem_cache_alloc_trace+0x1d1/0x2e0
[ 5781.602451]  [<ffffffff81161d7e>] proc_cgroup_show+0x4e/0x300
[ 5781.608864]  [<ffffffff81302d40>] proc_single_show+0x50/0x90
[ 5781.615182]  [<ffffffff812ac983>] seq_read+0x113/0x3e0
[ 5781.620916]  [<ffffffff81280407>] __vfs_read+0x37/0x150
[ 5781.626749]  [<ffffffff81349ded>] ? security_file_permission+0x9d/0xc0
[ 5781.634035]  [<ffffffff812815ac>] vfs_read+0x8c/0x130
[ 5781.639673]  [<ffffffff81282ac8>] SyS_read+0x58/0xc0
[ 5781.645215]  [<ffffffff817d497c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
[ 5781.652403] INFO: lockdep is turned off.
[ 5781.656811] INFO: task kworker/3:1:52401 blocked for more than 120 seconds.
[ 5781.664583]       Tainted: G        W       4.8.0-rc8-fornext+ #1
[ 5781.671383] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5781.680121] kworker/3:1     D ffff8803b25bbca8 13368 52401      2 0x00000080
[ 5781.688021] Workqueue: cgroup_destroy css_release_work_fn
[ 5781.694057]  ffff8803b25bbca8 ffff8803b25bbcd0 0000000000000000 ffff88046ded2000
[ 5781.702356]  ffff88046af8a000 ffff8803b25bc000 ffffffff81deb380 ffff88046af8a000
[ 5781.710656]  0000000000000246 00000000ffffffff ffff8803b25bbcc0 ffffffff817cdaaf
[ 5781.718954] Call Trace:
[ 5781.721684]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5781.727224]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 5781.734414]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 5781.741021]  [<ffffffff811586af>] ? css_release_work_fn+0x2f/0x110
[ 5781.747919]  [<ffffffff811586af>] css_release_work_fn+0x2f/0x110
[ 5781.754626]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
[ 5781.761137]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
[ 5781.767841]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
[ 5781.774061]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
[ 5781.780765]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
[ 5781.786304]  [<ffffffff817d40ec>] ? _raw_spin_unlock_irq+0x2c/0x60
[ 5781.793203]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
[ 5781.799229]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230
[ 5781.806514] INFO: lockdep is turned off.

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-04 16:21                                                         ` CAI Qian
@ 2016-10-04 20:12                                                           ` Al Viro
  2016-10-05 14:30                                                             ` CAI Qian
  0 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-10-04 20:12 UTC (permalink / raw)
  To: CAI Qian
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Tue, Oct 04, 2016 at 12:21:28PM -0400, CAI Qian wrote:
> 
> > Not enough information, unfortunately (descriptor in question opened
> > outside of that log, sendfile(out_fd=578, in_fd=578, offset=0x7f8318a07000,
> > count=0x3ffc00) doesn't tell what *offset was before the call) ;-/
> > 
> > Anyway, I've found and fixed a bug in pipe_advance(), which might or might
> > not help with those.  Could you try vfs.git#work.splice_read (or #for-next)
> > and see if these persist?
> I am afraid that this can also be reproduced in the latest #for-next. The warning
> always showed up at the end of the trinity run. I captured more information this time.

OK, let's try to get more information about what's going on (this is on top
of either for-next or work.splice_read):

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index c97d661..a9cb9ff 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -313,6 +313,15 @@ static bool sanity(const struct iov_iter *i)
 	}
 	return true;
 Bad:
+	printk(KERN_ERR "idx = %d, offset = %zd\n", i->idx, i->iov_offset);
+	printk(KERN_ERR "curbuf = %d, nrbufs = %d, buffers = %d\n",
+			pipe->curbuf, pipe->nrbufs, pipe->buffers);
+	for (idx = 0; idx < pipe->buffers; idx++)
+		printk(KERN_ERR "[%p %p %d %d]\n",
+			pipe->bufs[idx].ops,
+			pipe->bufs[idx].page,
+			pipe->bufs[idx].offset,
+			pipe->bufs[idx].len);
 	WARN_ON(1);
 	return false;
 }
@@ -339,8 +348,11 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
 	if (unlikely(!bytes))
 		return 0;
 
-	if (!sanity(i))
+	if (!sanity(i)) {
+		printk(KERN_ERR "page = %p, offset = %zd, size = %zd\n",
+			page, offset, bytes);
 		return 0;
+	}
 
 	off = i->iov_offset;
 	idx = i->idx;
@@ -518,6 +530,8 @@ static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
 		addr += chunk;
 	}
 	i->count -= bytes;
+	if (!sanity(i))
+		printk(KERN_ERR "buggered after copy_to_iter\n");
 	return bytes;
 }
 
@@ -629,6 +643,8 @@ static size_t pipe_zero(size_t bytes, struct iov_iter *i)
 		n -= chunk;
 	}
 	i->count -= bytes;
+	if (!sanity(i))
+		printk(KERN_ERR "buggered after zero_iter\n");
 	return bytes;
 }
 
@@ -673,6 +689,8 @@ static void pipe_advance(struct iov_iter *i, size_t size)
 	struct pipe_buffer *buf;
 	int idx = i->idx;
 	size_t off = i->iov_offset;
+	struct iov_iter orig = *i;
+	size_t orig_size = size;
 	
 	if (unlikely(i->count < size))
 		size = i->count;
@@ -702,6 +720,9 @@ static void pipe_advance(struct iov_iter *i, size_t size)
 			pipe->nrbufs--;
 		}
 	}
+	if (!sanity(i))
+		printk(KERN_ERR "buggered pipe_advance by %zd from [%d.%zd]",
+			orig_size, orig.idx, orig.iov_offset);
 }
 
 void iov_iter_advance(struct iov_iter *i, size_t size)

^ permalink raw reply related	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-04 17:39                                                   ` local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked) CAI Qian
@ 2016-10-04 21:42                                                     ` tj
  2016-10-05 14:09                                                       ` CAI Qian
  2016-10-27 12:52                                                       ` local DoS - systemd hang or timeout with cgroup traces CAI Qian
  0 siblings, 2 replies; 152+ messages in thread
From: tj @ 2016-10-04 21:42 UTC (permalink / raw)
  To: CAI Qian
  Cc: Al Viro, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

Hello, CAI.

On Tue, Oct 04, 2016 at 01:39:11PM -0400, CAI Qian wrote:
...
> Not sure if related, but right after this lockdep splat happened and the
> trinity run by a non-privileged user finished inside the container, the
> host's systemctl command just hangs or times out, which renders the whole
> system unusable.
> 
> # systemctl status docker
> Failed to get properties: Connection timed out
> 
> # systemctl reboot (hang)
> 
...
> [ 5535.893675] INFO: lockdep is turned off.
> [ 5535.898085] INFO: task kworker/45:4:146035 blocked for more than 120 seconds.
> [ 5535.906059]       Tainted: G        W       4.8.0-rc8-fornext+ #1
> [ 5535.912865] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 5535.921613] kworker/45:4    D ffff880853e9b950 14048 146035      2 0x00000080
> [ 5535.929630] Workqueue: cgroup_destroy css_killed_work_fn
> [ 5535.935582]  ffff880853e9b950 0000000000000000 0000000000000000 ffff88086c6da000
> [ 5535.943882]  ffff88086c9e2000 ffff880853e9c000 ffff880853e9baa0 ffff88086c9e2000
> [ 5535.952205]  ffff880853e9ba98 0000000000000001 ffff880853e9b968 ffffffff817cdaaf
> [ 5535.960522] Call Trace:
> [ 5535.963265]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 5535.968817]  [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0
> [ 5535.975346]  [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130
> [ 5535.982256]  [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130
> [ 5535.988972]  [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80
> [ 5535.994804]  [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0
> [ 5536.001227]  [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0
> [ 5536.007648]  [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0
> [ 5536.014654]  [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0
> [ 5536.021657]  [<ffffffff810f57f5>] unregister_sched_domain_sysctl+0x15/0x40
> [ 5536.029344]  [<ffffffff810d7704>] partition_sched_domains+0x44/0x450
> [ 5536.036447]  [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0
> [ 5536.043844]  [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0
> [ 5536.051336]  [<ffffffff8116789d>] update_flag+0x11d/0x210
> [ 5536.057373]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
> [ 5536.064186]  [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60
> [ 5536.070899]  [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10
> [ 5536.077420]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
> [ 5536.084234]  [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220
> [ 5536.091049]  [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60
> [ 5536.097571]  [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220
> [ 5536.104207]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
> [ 5536.110736]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
> [ 5536.117461]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
> [ 5536.123697]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
> [ 5536.130426]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
> [ 5536.135991]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
> [ 5536.142041]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230

This one seems to be the offender.  cgroup is trying to offline a
cpuset css, which takes place under cgroup_mutex.  The offlining ends
up trying to drain active usages of a sysctl table which apparently is
not happening.  Did something hang or crash while trying to generate
sysctl content?
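
For readers unfamiliar with the mechanism: the unregister path
cannot return while the table still has active readers, so it parks
on a completion until the use count drains. A rough sketch of that
pattern (simplified from the fs/proc/proc_sysctl.c logic, not the
exact code):

/* sketch of the drain-and-wait pattern the worker is stuck in */
struct ctl_header_sketch {
	int used;			/* active readers of the table */
	struct completion *unregistering;
};

static void start_unregistering_sketch(struct ctl_header_sketch *h)
{
	if (h->used) {
		DECLARE_COMPLETION_ONSTACK(drained);

		h->unregistering = &drained;	/* readers complete() this */
		wait_for_completion(&drained);	/* blocks until used == 0 */
	}
}

If a reader never finishes, for instance because it is itself stuck
behind the cgroup_mutex that the offlining worker holds, that
wait_for_completion() never returns and the cgroup_destroy
workqueue hangs, which would match the trace above.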

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-04 21:42                                                     ` tj
@ 2016-10-05 14:09                                                       ` CAI Qian
  2016-10-05 15:30                                                         ` tj
  2016-10-27 12:52                                                       ` local DoS - systemd hang or timeout with cgroup traces CAI Qian
  1 sibling, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-10-05 14:09 UTC (permalink / raw)
  To: tj
  Cc: Al Viro, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> From: "tj" <tj@kernel.org>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Al Viro" <viro@ZenIV.linux.org.uk>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner"
> <david@fromorbit.com>, "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin"
> <npiggin@gmail.com>, linux-fsdevel@vger.kernel.org
> Sent: Tuesday, October 4, 2016 5:42:19 PM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> Hello, CAI.
> 
> On Tue, Oct 04, 2016 at 01:39:11PM -0400, CAI Qian wrote:
> ...
> > Not sure if related, but right after this lockdep splat happened and the
> > trinity run by a non-privileged user finished inside the container, the
> > host's systemctl command just hangs or times out, which renders the whole
> > system unusable.
> > 
> > # systemctl status docker
> > Failed to get properties: Connection timed out
> > 
> > # systemctl reboot (hang)
> > 
> ...
> > [ 5535.893675] INFO: lockdep is turned off.
> > [ 5535.898085] INFO: task kworker/45:4:146035 blocked for more than 120
> > seconds.
> > [ 5535.906059]       Tainted: G        W       4.8.0-rc8-fornext+ #1
> > [ 5535.912865] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > this message.
> > [ 5535.921613] kworker/45:4    D ffff880853e9b950 14048 146035      2
> > 0x00000080
> > [ 5535.929630] Workqueue: cgroup_destroy css_killed_work_fn
> > [ 5535.935582]  ffff880853e9b950 0000000000000000 0000000000000000
> > ffff88086c6da000
> > [ 5535.943882]  ffff88086c9e2000 ffff880853e9c000 ffff880853e9baa0
> > ffff88086c9e2000
> > [ 5535.952205]  ffff880853e9ba98 0000000000000001 ffff880853e9b968
> > ffffffff817cdaaf
> > [ 5535.960522] Call Trace:
> > [ 5535.963265]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> > [ 5535.968817]  [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0
> > [ 5535.975346]  [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130
> > [ 5535.982256]  [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130
> > [ 5535.988972]  [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80
> > [ 5535.994804]  [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0
> > [ 5536.001227]  [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0
> > [ 5536.007648]  [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0
> > [ 5536.014654]  [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0
> > [ 5536.021657]  [<ffffffff810f57f5>]
> > unregister_sched_domain_sysctl+0x15/0x40
> > [ 5536.029344]  [<ffffffff810d7704>] partition_sched_domains+0x44/0x450
> > [ 5536.036447]  [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0
> > [ 5536.043844]  [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0
> > [ 5536.051336]  [<ffffffff8116789d>] update_flag+0x11d/0x210
> > [ 5536.057373]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
> > [ 5536.064186]  [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60
> > [ 5536.070899]  [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10
> > [ 5536.077420]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
> > [ 5536.084234]  [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220
> > [ 5536.091049]  [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60
> > [ 5536.097571]  [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220
> > [ 5536.104207]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
> > [ 5536.110736]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
> > [ 5536.117461]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
> > [ 5536.123697]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
> > [ 5536.130426]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
> > [ 5536.135991]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
> > [ 5536.142041]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230
> 
> This one seems to be the offender.  cgroup is trying to offline a
> cpuset css, which takes place under cgroup_mutex.  The offlining ends
> up trying to drain active usages of a sysctl table which apparently is
> not happening.  Did something hang or crash while trying to generate
> sysctl content?
Hmm, I am not sure, since the trinity was running from a non-privileged
user which can only read content from /proc or /sys.
    CAI Qian

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-04 20:12                                                           ` Al Viro
@ 2016-10-05 14:30                                                             ` CAI Qian
  2016-10-05 16:07                                                               ` Al Viro
  0 siblings, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-10-05 14:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel



----- Original Message -----
> From: "Al Viro" <viro@ZenIV.linux.org.uk>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Tuesday, October 4, 2016 4:12:33 PM
> Subject: Re: [RFC][CFT] splice_read reworked
> 
> On Tue, Oct 04, 2016 at 12:21:28PM -0400, CAI Qian wrote:
> > 
> > > Not enough information, unfortunately (descriptor in question opened
> > > outside of that log, sendfile(out_fd=578, in_fd=578, offset=0x7f8318a07000,
> > > count=0x3ffc00) doesn't tell what *offset was before the call) ;-/
> > > 
> > > Anyway, I've found and fixed a bug in pipe_advance(), which might or
> > > might not help with those.  Could you try vfs.git#work.splice_read (or
> > > #for-next) and see if these persist?
> > I am afraid that this can also be reproduced in the latest #for-next. The
> > warning always showed up at the end of a trinity run. I captured more
> > information this time.
> 
> OK, let's try to get more information about what's going on (this is on top
> of either for-next or work.splice_read):
Here you go,

http://people.redhat.com/qcai/tmp/trinity-child89.log


[  856.537452] idx = 0, offset = 12
[  856.541066] curbuf = 0, nrbufs = 1, buffers = 1
[  856.546149] [ffffffff81836660 ffffea001e2e1ec0 0 12]
[  856.551750] ------------[ cut here ]------------
[  856.556921] WARNING: CPU: 24 PID: 13756 at lib/iov_iter.c:325 sanity+0xdb/0xe2
[  856.565000] Modules linked in: ieee802154_socket ieee802154 af_key vmw_vsock_vmci_transport vsock vmw_vmci bluetooth rfkill can pptp gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe pppox ppp_generic slhc nfnetlink scsi_transport_iscsi atm sctp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support pcspkr mei_me i2c_i801 ipmi_ssif sg i2c_smbus mei shpchp lpc_ich wmi ipmi_si ipmi_msghandler acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm ixgbe ahci drm mdio libahci ptp libata pps_core i2c_core dca fjes dm_mirror dm_region_hash dm_log dm_mod
[  856.683348] CPU: 27 PID: 13756 Comm: trinity-c89 Not tainted 4.8.0-rc8-fornext-debug+ #2
[  856.692380] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[  856.703736]  0000000000000286 00000000cf291d96 ffff8803c355fae0 ffffffff813d30ac
[  856.712034]  0000000000000000 0000000000000000 ffff8803c355fb20 ffffffff8109cf31
[  856.720329]  00000145c355fb00 ffff8804586e3200 0000000000000001 0000000000000000
[  856.728627] Call Trace:
[  856.731362]  [<ffffffff813d30ac>] dump_stack+0x85/0xc9
[  856.737099]  [<ffffffff8109cf31>] __warn+0xd1/0xf0
[  856.742444]  [<ffffffff8109d06d>] warn_slowpath_null+0x1d/0x20
[  856.748953]  [<ffffffff81418ff8>] sanity+0xdb/0xe2
[  856.754299]  [<ffffffff813e9676>] iov_iter_advance+0x1d6/0x3c0
[  856.760810]  [<ffffffff812bc7d3>] default_file_splice_read+0x223/0x2c0
[  856.768099]  [<ffffffff812503bb>] ? __slab_free+0x9b/0x270
[  856.774222]  [<ffffffff811222d8>] ? __call_rcu+0xd8/0x380
[  856.780258]  [<ffffffff810cbaa9>] ? __might_sleep+0x49/0x80
[  856.786480]  [<ffffffff81349ded>] ? security_file_permission+0x9d/0xc0
[  856.793777]  [<ffffffff812bbb13>] do_splice_to+0x73/0x90
[  856.799703]  [<ffffffff812bbc1b>] splice_direct_to_actor+0xeb/0x220
[  856.806696]  [<ffffffff812bb0e0>] ? generic_pipe_buf_nosteal+0x10/0x10
[  856.813982]  [<ffffffff812bbdd9>] do_splice_direct+0x89/0xd0
[  856.820299]  [<ffffffff8128263e>] do_sendfile+0x1ce/0x3b0
[  856.826323]  [<ffffffff812831ef>] SyS_sendfile64+0x6f/0xd0
[  856.832445]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[  856.838568]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[  856.845810] ---[ end trace 702eb33216129766 ]---
[  856.851032] buggered pipe_advance by 12 from [0.0]

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-05 14:09                                                       ` CAI Qian
@ 2016-10-05 15:30                                                         ` tj
  2016-10-05 15:54                                                           ` CAI Qian
  0 siblings, 1 reply; 152+ messages in thread
From: tj @ 2016-10-05 15:30 UTC (permalink / raw)
  To: CAI Qian
  Cc: Al Viro, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

Hello, CAI.

On Wed, Oct 05, 2016 at 10:09:39AM -0400, CAI Qian wrote:
> > This one seems to be the offender.  cgroup is trying to offline a
> > cpuset css, which takes place under cgroup_mutex.  The offlining ends
> > up trying to drain active usages of a sysctl table which apparently is
> > not happening.  Did something hang or crash while trying to generate
> > sysctl content?
>
> > Hmm, I am not sure, since trinity was running as a non-privileged user,
> > which can only read content from /proc or /sys.

So, userland, privileged or not, can't cause this.  The ref is held
only while the kernel code is operating to generate content or
iterating, which shouldn't be affected by userland actions.  This is
caused by kernel code hanging or crashing while holding a ref.
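
For reference, the drain in question works roughly like this (a simplified
sketch loosely modeled on fs/proc/proc_sysctl.c; the real code also juggles
sysctl_lock around the wait, which is elided here):

	struct ctl_table_header {
		int used;			/* active readers/iterators */
		struct completion *unregistering;
	};

	/* unregister side: wait for all active users to drain */
	static void start_unregistering(struct ctl_table_header *p)
	{
		if (p->used) {
			struct completion wait;
			init_completion(&wait);
			p->unregistering = &wait;
			wait_for_completion(&wait);	/* never returns if a ref leaks */
		}
	}

	/* read side: drop the use count, wake the waiter on the last put */
	static void unuse_table(struct ctl_table_header *p)
	{
		if (!--p->used && p->unregistering)
			complete(p->unregistering);
	}

So a single kernel path that bumped ->used and then hung or crashed is
enough to wedge the unregister in wait_for_completion() above, which is
exactly where the drop_sysctl_table() trace is stuck.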

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-05 15:30                                                         ` tj
@ 2016-10-05 15:54                                                           ` CAI Qian
  2016-10-05 18:57                                                             ` CAI Qian
  0 siblings, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-10-05 15:54 UTC (permalink / raw)
  To: tj
  Cc: Al Viro, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> From: "tj" <tj@kernel.org>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Al Viro" <viro@ZenIV.linux.org.uk>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner"
> <david@fromorbit.com>, "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin"
> <npiggin@gmail.com>, linux-fsdevel@vger.kernel.org
> Sent: Wednesday, October 5, 2016 11:30:14 AM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> Hello, CAI.
> 
> On Wed, Oct 05, 2016 at 10:09:39AM -0400, CAI Qian wrote:
> > > This one seems to be the offender.  cgroup is trying to offline a
> > > cpuset css, which takes place under cgroup_mutex.  The offlining ends
> > > up trying to drain active usages of a sysctl table which apparently is
> > > not happening.  Did something hang or crash while trying to generate
> > > sysctl content?
> >
> > Hmm, I am not sure, since trinity was running as a non-privileged user,
> > which can only read content from /proc or /sys.
> 
> So, userland, privileged or not, can't cause this.  The ref is held
> only while the kernel code is operating to generate content or
> iterating, which shouldn't be affected by userland actions.  This is
> caused by kernel code hanging or crashing while holding a ref.
Right, trinity calls many different random syscalls with random options on
those /proc/ and /sys/ files and generates lots of different errnos. It is
likely some error path out there causes a hang or crash.
    CAI Qian

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [RFC][CFT] splice_read reworked
  2016-10-05 14:30                                                             ` CAI Qian
@ 2016-10-05 16:07                                                               ` Al Viro
  0 siblings, 0 replies; 152+ messages in thread
From: Al Viro @ 2016-10-05 16:07 UTC (permalink / raw)
  To: CAI Qian
  Cc: Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Wed, Oct 05, 2016 at 10:30:46AM -0400, CAI Qian wrote:

> [  856.537452] idx = 0, offset = 12
> [  856.541066] curbuf = 0, nrbufs = 1, buffers = 1
					^^^^^^^^^^^^

Lovely - that's pretty much guaranteed to make sanity() spew false
positives.
        int delta = (pipe->curbuf + pipe->nrbufs - idx) & (pipe->buffers - 1);
        if (i->iov_offset) {
                struct pipe_buffer *p;
                if (unlikely(delta != 1) || unlikely(!pipe->nrbufs))
                        goto Bad;       // must be at the last buffer...
and at the last buffer it is - idx == (curbuf + nrbufs - 1) % pipe->buffers.
The test would've done the right thing if pipe->buffers had been at least 2,
but...  OK, the patch below ought to fix those; could you check if anything
remains with it?

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index c97d661..0ce3411 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -298,21 +298,32 @@ static bool sanity(const struct iov_iter *i)
 {
 	struct pipe_inode_info *pipe = i->pipe;
 	int idx = i->idx;
-	int delta = (pipe->curbuf + pipe->nrbufs - idx) & (pipe->buffers - 1);
+	int next = pipe->curbuf + pipe->nrbufs;
 	if (i->iov_offset) {
 		struct pipe_buffer *p;
-		if (unlikely(delta != 1) || unlikely(!pipe->nrbufs))
+		if (unlikely(!pipe->nrbufs))
+			goto Bad;	// pipe must be non-empty
+		if (unlikely(idx != ((next - 1) & (pipe->buffers - 1))))
 			goto Bad;	// must be at the last buffer...
 
 		p = &pipe->bufs[idx];
 		if (unlikely(p->offset + p->len != i->iov_offset))
 			goto Bad;	// ... at the end of segment
 	} else {
-		if (delta)
+		if (idx != (next & (pipe->buffers - 1)))
 			goto Bad;	// must be right after the last buffer
 	}
 	return true;
 Bad:
+	printk(KERN_ERR "idx = %d, offset = %zd\n", i->idx, i->iov_offset);
+	printk(KERN_ERR "curbuf = %d, nrbufs = %d, buffers = %d\n",
+			pipe->curbuf, pipe->nrbufs, pipe->buffers);
+	for (idx = 0; idx < pipe->buffers; idx++)
+		printk(KERN_ERR "[%p %p %d %d]\n",
+			pipe->bufs[idx].ops,
+			pipe->bufs[idx].page,
+			pipe->bufs[idx].offset,
+			pipe->bufs[idx].len);
 	WARN_ON(1);
 	return false;
 }
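
FWIW, the failure mode is easy to see in isolation.  A minimal sketch of
the old check vs. the new one for a one-buffer pipe (illustrative values
plugged in, not kernel code):

	int buffers = 1, curbuf = 0, nrbufs = 1, idx = 0;
	int next = curbuf + nrbufs;

	/* old check: wants delta == 1, but the mask (buffers - 1) is 0,
	 * so delta collapses to 0 and the last buffer is flagged Bad */
	int delta = (curbuf + nrbufs - idx) & (buffers - 1);	/* == 0 */

	/* new check: compute the last-buffer index before comparing;
	 * (next - 1) & (buffers - 1) == 0 == idx, so it passes */
	int last = (next - 1) & (buffers - 1);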

^ permalink raw reply related	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-05 15:54                                                           ` CAI Qian
@ 2016-10-05 18:57                                                             ` CAI Qian
  2016-10-05 20:05                                                               ` Al Viro
  0 siblings, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-10-05 18:57 UTC (permalink / raw)
  To: tj
  Cc: Al Viro, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> From: "CAI Qian" <caiqian@redhat.com>
> To: "tj" <tj@kernel.org>
> Cc: "Al Viro" <viro@ZenIV.linux.org.uk>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner"
> <david@fromorbit.com>, "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin"
> <npiggin@gmail.com>, linux-fsdevel@vger.kernel.org
> Sent: Wednesday, October 5, 2016 11:54:48 AM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> 
> 
> ----- Original Message -----
> > From: "tj" <tj@kernel.org>
> > To: "CAI Qian" <caiqian@redhat.com>
> > Cc: "Al Viro" <viro@ZenIV.linux.org.uk>, "Linus Torvalds"
> > <torvalds@linux-foundation.org>, "Dave Chinner"
> > <david@fromorbit.com>, "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens
> > Axboe" <axboe@kernel.dk>, "Nick Piggin"
> > <npiggin@gmail.com>, linux-fsdevel@vger.kernel.org
> > Sent: Wednesday, October 5, 2016 11:30:14 AM
> > Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT]
> > splice_read reworked)
> > 
> > Hello, CAI.
> > 
> > On Wed, Oct 05, 2016 at 10:09:39AM -0400, CAI Qian wrote:
> > > > This one seems to be the offender.  cgroup is trying to offline a
> > > > cpuset css, which takes place under cgroup_mutex.  The offlining ends
> > > > up trying to drain active usages of a sysctl table which apparently is
> > > > not happening.  Did something hang or crash while trying to generate
> > > > sysctl content?
> > >
> > > Hmm, I am not sure, since trinity was running as a non-privileged user,
> > > which can only read content from /proc or /sys.
> > 
> > So, userland, privileged or not, can't cause this.  The ref is held
> > only while the kernel code is operating to generate content or
> > iterating, which shouldn't be affected by userland actions.  This is
> > caused by kernel code hanging or crashing while holding a ref.
> Right, trinity calls many different random syscalls with random options on
> those /proc/ and /sys/ files and generates lots of different errnos. It is
> likely some error path out there causes a hang or crash.
Tejun,

Not sure if this is related, but there is always a lockdep splat regarding
procfs that shows up below, unless it is masked by other lockdep issues
before the cgroup hang. Also, this hang is always reproducible.

[ 4787.875980] 
[ 4787.877645] ======================================================
[ 4787.884540] [ INFO: possible circular locking dependency detected ]
[ 4787.891533] 4.8.0-rc8-usrns-scale+ #8 Tainted: G        W      
[ 4787.898138] -------------------------------------------------------
[ 4787.905130] trinity-c116/106905 is trying to acquire lock:
[ 4787.911251]  (&p->lock){+.+.+.}, at: [<ffffffff812aca8c>] seq_read+0x4c/0x3e0
[ 4787.919264] 
[ 4787.919264] but task is already holding lock:
[ 4787.925773]  (sb_writers#8){.+.+.+}, at: [<ffffffff81284367>] __sb_start_write+0xb7/0xf0
[ 4787.934854] 
[ 4787.934854] which lock already depends on the new lock.
[ 4787.934854] 
[ 4787.943981] 
[ 4787.943981] the existing dependency chain (in reverse order) is:
[ 4787.952333] 
-> #3 (sb_writers#8){.+.+.+}:
[ 4787.957050]        [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4787.963960]        [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4787.970577]        [<ffffffff810f769a>] percpu_down_read+0x4a/0xa0
[ 4787.977487]        [<ffffffff81284367>] __sb_start_write+0xb7/0xf0
[ 4787.984395]        [<ffffffff812a8974>] mnt_want_write+0x24/0x50
[ 4787.991110]        [<ffffffffa05049af>] ovl_want_write+0x1f/0x30 [overlay]
[ 4787.998799]        [<ffffffffa05070c2>] ovl_do_remove+0x42/0x4a0 [overlay]
[ 4788.006483]        [<ffffffffa0507536>] ovl_rmdir+0x16/0x20 [overlay]
[ 4788.013682]        [<ffffffff8128d357>] vfs_rmdir+0xb7/0x130
[ 4788.020009]        [<ffffffff81292ed3>] do_rmdir+0x183/0x1f0
[ 4788.026335]        [<ffffffff81293cf2>] SyS_unlinkat+0x22/0x30
[ 4788.032853]        [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.039576]        [<ffffffff817d927f>] return_from_SYSCALL_64+0x0/0x7a
[ 4788.046962] 
-> #2 (&sb->s_type->i_mutex_key#16){++++++}:
[ 4788.053140]        [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4788.060049]        [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4788.066664]        [<ffffffff817d60e7>] down_read+0x47/0x70
[ 4788.072893]        [<ffffffff8128ce79>] lookup_slow+0xc9/0x200
[ 4788.079410]        [<ffffffff81290b9c>] walk_component+0x1ec/0x310
[ 4788.086315]        [<ffffffff81290e5f>] link_path_walk+0x19f/0x5f0
[ 4788.093219]        [<ffffffff8129151d>] path_openat+0xdd/0xb80
[ 4788.099748]        [<ffffffff81293511>] do_filp_open+0x91/0x100
[ 4788.106362]        [<ffffffff81286f56>] do_open_execat+0x76/0x180
[ 4788.113186]        [<ffffffff8128747b>] open_exec+0x2b/0x50
[ 4788.119404]        [<ffffffff812ec61d>] load_elf_binary+0x28d/0x1120
[ 4788.126511]        [<ffffffff81288487>] search_binary_handler+0x97/0x1c0
[ 4788.134002]        [<ffffffff81289619>] do_execveat_common.isra.36+0x6a9/0x9f0
[ 4788.142071]        [<ffffffff81289c4a>] SyS_execve+0x3a/0x50
[ 4788.148398]        [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.155110]        [<ffffffff817d927f>] return_from_SYSCALL_64+0x0/0x7a
[ 4788.162502] 
-> #1 (&sig->cred_guard_mutex){+.+.+.}:
[ 4788.168179]        [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4788.175085]        [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4788.181712]        [<ffffffff817d4557>] mutex_lock_killable_nested+0x87/0x500
[ 4788.189695]        [<ffffffff81099599>] mm_access+0x29/0xa0
[ 4788.195924]        [<ffffffff81302b6c>] proc_pid_auxv+0x1c/0x70
[ 4788.202540]        [<ffffffff813039d0>] proc_single_show+0x50/0x90
[ 4788.209445]        [<ffffffff812acb48>] seq_read+0x108/0x3e0
[ 4788.215774]        [<ffffffff8127fb07>] __vfs_read+0x37/0x150
[ 4788.222198]        [<ffffffff81280d35>] vfs_read+0x95/0x140
[ 4788.228425]        [<ffffffff81282268>] SyS_read+0x58/0xc0
[ 4788.234557]        [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.241268]        [<ffffffff817d927f>] return_from_SYSCALL_64+0x0/0x7a
[ 4788.248660] 
-> #0 (&p->lock){+.+.+.}:
[ 4788.252987]        [<ffffffff810fc062>] validate_chain.isra.37+0xe72/0x1150
[ 4788.260769]        [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4788.267676]        [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4788.274302]        [<ffffffff817d3807>] mutex_lock_nested+0x77/0x430
[ 4788.281406]        [<ffffffff812aca8c>] seq_read+0x4c/0x3e0
[ 4788.287633]        [<ffffffff81316b39>] kernfs_fop_read+0x129/0x1b0
[ 4788.294659]        [<ffffffff8127fca3>] do_loop_readv_writev+0x83/0xc0
[ 4788.301954]        [<ffffffff812811a8>] do_readv_writev+0x218/0x240
[ 4788.308959]        [<ffffffff81281209>] vfs_readv+0x39/0x50
[ 4788.315188]        [<ffffffff812bc6b1>] default_file_splice_read+0x1a1/0x2b0
[ 4788.323070]        [<ffffffff812bc206>] do_splice_to+0x76/0x90
[ 4788.329587]        [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
[ 4788.337173]        [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
[ 4788.344078]        [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
[ 4788.350694]        [<ffffffff812829c9>] SyS_sendfile64+0xc9/0xd0
[ 4788.357405]        [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.364119]        [<ffffffff817d927f>] return_from_SYSCALL_64+0x0/0x7a
[ 4788.371511] 
[ 4788.371511] other info that might help us debug this:
[ 4788.371511] 
[ 4788.380443] Chain exists of:
  &p->lock --> &sb->s_type->i_mutex_key#16 --> sb_writers#8

[ 4788.389881]  Possible unsafe locking scenario:
[ 4788.389881] 
[ 4788.396497]        CPU0                    CPU1
[ 4788.401549]        ----                    ----
[ 4788.406614]   lock(sb_writers#8);
[ 4788.410352]                                lock(&sb->s_type->i_mutex_key#16);
[ 4788.418354]                                lock(sb_writers#8);
[ 4788.424902]   lock(&p->lock);
[ 4788.428229] 
[ 4788.428229]  *** DEADLOCK ***
[ 4788.428229] 
[ 4788.434836] 1 lock held by trinity-c116/106905:
[ 4788.439888]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81284367>] __sb_start_write+0xb7/0xf0
[ 4788.449473] 
[ 4788.449473] stack backtrace:
[ 4788.454334] CPU: 16 PID: 106905 Comm: trinity-c116 Tainted: G        W       4.8.0-rc8-usrns-scale+ #8
[ 4788.464719] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 4788.476076]  0000000000000086 00000000cbfc6314 ffff8803ce78b760 ffffffff813d5e93
[ 4788.484371]  ffffffff82a3fbd0 ffffffff82a94890 ffff8803ce78b7a0 ffffffff810fa6ec
[ 4788.492663]  ffff8803ce78b7e0 ffff8802ead08000 0000000000000001 ffff8802ead08ca0
[ 4788.500966] Call Trace:
[ 4788.503694]  [<ffffffff813d5e93>] dump_stack+0x85/0xc2
[ 4788.509426]  [<ffffffff810fa6ec>] print_circular_bug+0x1ec/0x260
[ 4788.516128]  [<ffffffff810fc062>] validate_chain.isra.37+0xe72/0x1150
[ 4788.523319]  [<ffffffff811d4491>] ? ___perf_sw_event+0x171/0x290
[ 4788.530022]  [<ffffffff810fd711>] __lock_acquire+0x3f1/0x7f0
[ 4788.536335]  [<ffffffff810fe166>] lock_acquire+0xd6/0x240
[ 4788.542359]  [<ffffffff812aca8c>] ? seq_read+0x4c/0x3e0
[ 4788.548188]  [<ffffffff812aca8c>] ? seq_read+0x4c/0x3e0
[ 4788.554019]  [<ffffffff817d3807>] mutex_lock_nested+0x77/0x430
[ 4788.560528]  [<ffffffff812aca8c>] ? seq_read+0x4c/0x3e0
[ 4788.566358]  [<ffffffff812aca8c>] seq_read+0x4c/0x3e0
[ 4788.571995]  [<ffffffff81316a10>] ? kernfs_fop_open+0x3a0/0x3a0
[ 4788.578600]  [<ffffffff81316b39>] kernfs_fop_read+0x129/0x1b0
[ 4788.585012]  [<ffffffff81316a10>] ? kernfs_fop_open+0x3a0/0x3a0
[ 4788.591617]  [<ffffffff8127fca3>] do_loop_readv_writev+0x83/0xc0
[ 4788.598318]  [<ffffffff81316a10>] ? kernfs_fop_open+0x3a0/0x3a0
[ 4788.604924]  [<ffffffff812811a8>] do_readv_writev+0x218/0x240
[ 4788.611347]  [<ffffffff813e9535>] ? push_pipe+0xd5/0x190
[ 4788.617278]  [<ffffffff813ecec0>] ? iov_iter_get_pages_alloc+0x250/0x400
[ 4788.624746]  [<ffffffff81281209>] vfs_readv+0x39/0x50
[ 4788.630381]  [<ffffffff812bc6b1>] default_file_splice_read+0x1a1/0x2b0
[ 4788.637668]  [<ffffffff8134ae20>] ? security_file_permission+0xa0/0xc0
[ 4788.644954]  [<ffffffff812bc206>] do_splice_to+0x76/0x90
[ 4788.650880]  [<ffffffff812bc2db>] splice_direct_to_actor+0xbb/0x220
[ 4788.657872]  [<ffffffff812bba80>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 4788.665157]  [<ffffffff812bc4d8>] do_splice_direct+0x98/0xd0
[ 4788.671472]  [<ffffffff81281dd1>] do_sendfile+0x1d1/0x3b0
[ 4788.677499]  [<ffffffff812829c9>] SyS_sendfile64+0xc9/0xd0
[ 4788.683622]  [<ffffffff81003f8c>] do_syscall_64+0x6c/0x1e0
[ 4788.689744]  [<ffffffff817d927f>] entry_SYSCALL64_slow_path+0x25/0x25

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-05 18:57                                                             ` CAI Qian
@ 2016-10-05 20:05                                                               ` Al Viro
  2016-10-06 12:20                                                                 ` CAI Qian
  0 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-10-05 20:05 UTC (permalink / raw)
  To: CAI Qian
  Cc: tj, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Wed, Oct 05, 2016 at 02:57:04PM -0400, CAI Qian wrote:

> Not sure if this is related, but there is always a lockdep splat regarding
> procfs that shows up below, unless it is masked by other lockdep issues
> before the cgroup hang. Also, this hang is always reproducible.

Sigh...  Let's get the /proc/*/auxv out of the way - this should deal with it:

diff --git a/fs/proc/base.c b/fs/proc/base.c
index d588d14..489d2d6 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -400,23 +400,6 @@ static const struct file_operations proc_pid_cmdline_ops = {
 	.llseek	= generic_file_llseek,
 };
 
-static int proc_pid_auxv(struct seq_file *m, struct pid_namespace *ns,
-			 struct pid *pid, struct task_struct *task)
-{
-	struct mm_struct *mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
-	if (mm && !IS_ERR(mm)) {
-		unsigned int nwords = 0;
-		do {
-			nwords += 2;
-		} while (mm->saved_auxv[nwords - 2] != 0); /* AT_NULL */
-		seq_write(m, mm->saved_auxv, nwords * sizeof(mm->saved_auxv[0]));
-		mmput(mm);
-		return 0;
-	} else
-		return PTR_ERR(mm);
-}
-
-
 #ifdef CONFIG_KALLSYMS
 /*
  * Provides a wchan file via kallsyms in a proper one-value-per-file format.
@@ -1014,6 +997,30 @@ static const struct file_operations proc_environ_operations = {
 	.release	= mem_release,
 };
 
+static int auxv_open(struct inode *inode, struct file *file)
+{
+	return __mem_open(inode, file, PTRACE_MODE_READ_FSCREDS);
+}
+
+static ssize_t auxv_read(struct file *file, char __user *buf,
+			size_t count, loff_t *ppos)
+{
+	struct mm_struct *mm = file->private_data;
+	unsigned int nwords = 0;
+	do {
+		nwords += 2;
+	} while (mm->saved_auxv[nwords - 2] != 0); /* AT_NULL */
+	return simple_read_from_buffer(buf, count, ppos, mm->saved_auxv,
+				       nwords * sizeof(mm->saved_auxv[0]));
+}
+
+static const struct file_operations proc_auxv_operations = {
+	.open		= auxv_open,
+	.read		= auxv_read,
+	.llseek		= generic_file_llseek,
+	.release	= mem_release,
+};
+
 static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 			    loff_t *ppos)
 {
@@ -2822,7 +2829,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
 #endif
 	REG("environ",    S_IRUSR, proc_environ_operations),
-	ONE("auxv",       S_IRUSR, proc_pid_auxv),
+	REG("auxv",       S_IRUSR, proc_auxv_operations),
 	ONE("status",     S_IRUGO, proc_pid_status),
 	ONE("personality", S_IRUSR, proc_pid_personality),
 	ONE("limits",	  S_IRUGO, proc_pid_limits),
@@ -3210,7 +3217,7 @@ static const struct pid_entry tid_base_stuff[] = {
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
 #endif
 	REG("environ",   S_IRUSR, proc_environ_operations),
-	ONE("auxv",      S_IRUSR, proc_pid_auxv),
+	REG("auxv",      S_IRUSR, proc_auxv_operations),
 	ONE("status",    S_IRUGO, proc_pid_status),
 	ONE("personality", S_IRUSR, proc_pid_personality),
 	ONE("limits",	 S_IRUGO, proc_pid_limits),
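
The point of the conversion, in call-chain form (a sketch of the paths
involved, not the exact locking sites): the old ONE()/seq_file route did,
on every read,

	seq_read()			/* takes &p->lock */
	  -> proc_single_show()
	    -> proc_pid_auxv()
	      -> mm_access()		/* takes cred_guard_mutex */

so splicing from /proc/*/auxv nested cred_guard_mutex under both sb_writers
and &p->lock.  With the REG() version, mm_access() runs once in auxv_open()
and auxv_read() just copies from mm->saved_auxv, so the read path no longer
touches &p->lock or cred_guard_mutex at all.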

^ permalink raw reply related	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-05 20:05                                                               ` Al Viro
@ 2016-10-06 12:20                                                                 ` CAI Qian
  2016-10-06 12:25                                                                   ` CAI Qian
  2016-10-07  9:27                                                                   ` local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked) Dave Chinner
  0 siblings, 2 replies; 152+ messages in thread
From: CAI Qian @ 2016-10-06 12:20 UTC (permalink / raw)
  To: Al Viro
  Cc: tj, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> From: "Al Viro" <viro@ZenIV.linux.org.uk>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "tj" <tj@kernel.org>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>,
> "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Wednesday, October 5, 2016 4:05:22 PM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> On Wed, Oct 05, 2016 at 02:57:04PM -0400, CAI Qian wrote:
> 
> > Not sure if this is related, but there is always a lockdep splat
> > regarding procfs that shows up below, unless it is masked by other
> > lockdep issues before the cgroup hang. Also, this hang is always
> > reproducible.
> 
> Sigh...  Let's get the /proc/*/auxv out of the way - this should deal with
> it:
So I applied both this and the sanity patch, and both the original sanity
and proc warnings went away. However, the cgroup hang can still be
reproduced, as well as this new xfs internal error below,

[16921.141233] XFS (dm-0): Internal error XFS_WANT_CORRUPTED_RETURN at line 5619 of file fs/xfs/libxfs/xfs_bmap.c.  Caller xfs_bmap_shift_extents+0x1cc/0x3a0 [xfs]
[16921.157694] CPU: 9 PID: 52920 Comm: trinity-c108 Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[16921.167012] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[16921.178368]  0000000000000286 00000000c3833246 ffff8803d0a83b60 ffffffff813d2ecc
[16921.186658]  ffff88042a898000 0000000000000001 ffff8803d0a83b78 ffffffffa02f36eb
[16921.194946]  ffffffffa02b544c ffff8803d0a83c30 ffffffffa02a8e52 ffff88042a898040
[16921.203238] Call Trace:
[16921.205972]  [<ffffffff813d2ecc>] dump_stack+0x85/0xc9
[16921.211742]  [<ffffffffa02f36eb>] xfs_error_report+0x3b/0x40 [xfs]
[16921.218660]  [<ffffffffa02b544c>] ? xfs_bmap_shift_extents+0x1cc/0x3a0 [xfs]
[16921.226543]  [<ffffffffa02a8e52>] xfs_bmse_shift_one.constprop.20+0x332/0x370 [xfs]
[16921.235090]  [<ffffffff817cb73a>] ? kmemleak_alloc+0x4a/0xa0
[16921.241426]  [<ffffffffa02b544c>] xfs_bmap_shift_extents+0x1cc/0x3a0 [xfs]
[16921.249122]  [<ffffffffa03142aa>] ? xfs_trans_add_item+0x2a/0x60 [xfs]
[16921.256430]  [<ffffffffa02eb361>] xfs_shift_file_space+0x231/0x2f0 [xfs]
[16921.263931]  [<ffffffffa02ebe8c>] xfs_collapse_file_space+0x5c/0x180 [xfs]
[16921.271622]  [<ffffffffa02f69b8>] xfs_file_fallocate+0x158/0x360 [xfs]
[16921.278907]  [<ffffffff810f8eae>] ? update_fast_ctr+0x4e/0x70
[16921.285320]  [<ffffffff810f8f57>] ? percpu_down_read+0x57/0x90
[16921.291828]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
[16921.298337]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
[16921.304847]  [<ffffffff8127e000>] vfs_fallocate+0x140/0x230
[16921.311067]  [<ffffffff8127eee4>] SyS_fallocate+0x44/0x70
[16921.317091]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[16921.323212]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25

    CAI Qian

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-06 12:20                                                                 ` CAI Qian
@ 2016-10-06 12:25                                                                   ` CAI Qian
  2016-10-06 16:11                                                                     ` CAI Qian
  2016-10-07  7:08                                                                     ` Jan Kara
  2016-10-07  9:27                                                                   ` local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked) Dave Chinner
  1 sibling, 2 replies; 152+ messages in thread
From: CAI Qian @ 2016-10-06 12:25 UTC (permalink / raw)
  To: Al Viro
  Cc: tj, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel



----- Original Message -----
> From: "CAI Qian" <caiqian@redhat.com>
> To: "Al Viro" <viro@ZenIV.linux.org.uk>
> Cc: "tj" <tj@kernel.org>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>,
> "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org
> Sent: Thursday, October 6, 2016 8:20:17 AM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> 
> 
> ----- Original Message -----
> > From: "Al Viro" <viro@ZenIV.linux.org.uk>
> > To: "CAI Qian" <caiqian@redhat.com>
> > Cc: "tj" <tj@kernel.org>, "Linus Torvalds" <torvalds@linux-foundation.org>,
> > "Dave Chinner" <david@fromorbit.com>,
> > "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>,
> > "Nick Piggin" <npiggin@gmail.com>,
> > linux-fsdevel@vger.kernel.org
> > Sent: Wednesday, October 5, 2016 4:05:22 PM
> > Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT]
> > splice_read reworked)
> > 
> > On Wed, Oct 05, 2016 at 02:57:04PM -0400, CAI Qian wrote:
> > 
> > > Not sure if this is related, but there is always a lockdep splat
> > > regarding procfs that shows up below, unless it is masked by other
> > > lockdep issues before the cgroup hang. Also, this hang is always
> > > reproducible.
> > 
> > Sigh...  Let's get the /proc/*/auxv out of the way - this should deal with
> > it:
> So I applied both this and the sanity patch, and both the original sanity
> and proc warnings went away. However, the cgroup hang can still be
> reproduced, as well as this new xfs internal error below,

Wait. There was also a lockdep splat that happened before the xfs internal error.

[ 5839.452325] ======================================================
[ 5839.459221] [ INFO: possible circular locking dependency detected ]
[ 5839.466215] 4.8.0-rc8-splice-fixw-proc+ #4 Not tainted
[ 5839.471945] -------------------------------------------------------
[ 5839.478937] trinity-c220/69531 is trying to acquire lock:
[ 5839.484961]  (&p->lock){+.+.+.}, at: [<ffffffff812ac69c>] seq_read+0x4c/0x3e0
[ 5839.492967] 
but task is already holding lock:
[ 5839.499476]  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[ 5839.508560] 
which lock already depends on the new lock.

[ 5839.517686] 
the existing dependency chain (in reverse order) is:
[ 5839.526036] 
-> #3 (sb_writers#8){.+.+.+}:
[ 5839.530751]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
[ 5839.537368]        [<ffffffff810f8f4a>] percpu_down_read+0x4a/0x90
[ 5839.544275]        [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[ 5839.551181]        [<ffffffff812a8544>] mnt_want_write+0x24/0x50
[ 5839.557892]        [<ffffffffa04a398f>] ovl_want_write+0x1f/0x30 [overlay]
[ 5839.565577]        [<ffffffffa04a6036>] ovl_do_remove+0x46/0x480 [overlay]
[ 5839.573259]        [<ffffffffa04a64a3>] ovl_unlink+0x13/0x20 [overlay]
[ 5839.580555]        [<ffffffff812918ea>] vfs_unlink+0xda/0x190
[ 5839.586979]        [<ffffffff81293698>] do_unlinkat+0x268/0x2b0
[ 5839.593599]        [<ffffffff8129419b>] SyS_unlinkat+0x1b/0x30
[ 5839.600120]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 5839.606836]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
[ 5839.614231] 
-> #2 (&sb->s_type->i_mutex_key#17){++++++}:
[ 5839.620399]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
[ 5839.627015]        [<ffffffff817d1b77>] down_read+0x47/0x70
[ 5839.633242]        [<ffffffff8128cfd2>] lookup_slow+0xc2/0x1f0
[ 5839.639762]        [<ffffffff8128f6f2>] walk_component+0x172/0x220
[ 5839.646668]        [<ffffffff81290fd6>] link_path_walk+0x1a6/0x620
[ 5839.653574]        [<ffffffff81291a81>] path_openat+0xe1/0xdb0
[ 5839.660092]        [<ffffffff812939e1>] do_filp_open+0x91/0x100
[ 5839.666707]        [<ffffffff81288e06>] do_open_execat+0x76/0x180
[ 5839.673517]        [<ffffffff81288f3b>] open_exec+0x2b/0x50
[ 5839.679743]        [<ffffffff812eccf3>] load_elf_binary+0x2a3/0x10a0
[ 5839.686844]        [<ffffffff81288917>] search_binary_handler+0x97/0x1d0
[ 5839.694331]        [<ffffffff81289ed8>] do_execveat_common.isra.35+0x678/0x9a0
[ 5839.702400]        [<ffffffff8128a4da>] SyS_execve+0x3a/0x50
[ 5839.708726]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 5839.715441]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
[ 5839.722833] 
-> #1 (&sig->cred_guard_mutex){+.+.+.}:
[ 5839.728510]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
[ 5839.735126]        [<ffffffff817cfc66>] mutex_lock_killable_nested+0x86/0x540
[ 5839.743097]        [<ffffffff81301e84>] lock_trace+0x24/0x60
[ 5839.749421]        [<ffffffff8130224d>] proc_pid_syscall+0x2d/0x110
[ 5839.756423]        [<ffffffff81302af0>] proc_single_show+0x50/0x90
[ 5839.763330]        [<ffffffff812ab867>] traverse+0xf7/0x210
[ 5839.769557]        [<ffffffff812ac9eb>] seq_read+0x39b/0x3e0
[ 5839.775884]        [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
[ 5839.783179]        [<ffffffff81281a03>] do_readv_writev+0x213/0x230
[ 5839.790181]        [<ffffffff81281a59>] vfs_readv+0x39/0x50
[ 5839.796406]        [<ffffffff81281c12>] do_preadv+0xa2/0xc0
[ 5839.802634]        [<ffffffff81282ec1>] SyS_preadv+0x11/0x20
[ 5839.808963]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 5839.815681]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
[ 5839.823075] 
-> #0 (&p->lock){+.+.+.}:
[ 5839.827395]        [<ffffffff810fe69c>] __lock_acquire+0x151c/0x1990
[ 5839.834500]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
[ 5839.841115]        [<ffffffff817cf3b6>] mutex_lock_nested+0x76/0x450
[ 5839.848219]        [<ffffffff812ac69c>] seq_read+0x4c/0x3e0
[ 5839.854448]        [<ffffffff8131566b>] kernfs_fop_read+0x12b/0x1b0
[ 5839.861451]        [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
[ 5839.868742]        [<ffffffff81281a03>] do_readv_writev+0x213/0x230
[ 5839.875744]        [<ffffffff81281a59>] vfs_readv+0x39/0x50
[ 5839.881971]        [<ffffffff812bc55a>] default_file_splice_read+0x1aa/0x2c0
[ 5839.889847]        [<ffffffff812bb913>] do_splice_to+0x73/0x90
[ 5839.896365]        [<ffffffff812bba1b>] splice_direct_to_actor+0xeb/0x220
[ 5839.903950]        [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
[ 5839.910857]        [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
[ 5839.917470]        [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
[ 5839.924184]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 5839.930898]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
[ 5839.938286] 
other info that might help us debug this:

[ 5839.947217] Chain exists of:
  &p->lock --> &sb->s_type->i_mutex_key#17 --> sb_writers#8

[ 5839.956615]  Possible unsafe locking scenario:

[ 5839.963218]        CPU0                    CPU1
[ 5839.968269]        ----                    ----
[ 5839.973321]   lock(sb_writers#8);
[ 5839.977046]                                lock(&sb->s_type->i_mutex_key#17);
[ 5839.985037]                                lock(sb_writers#8);
[ 5839.991573]   lock(&p->lock);
[ 5839.994900] 
 *** DEADLOCK ***

[ 5840.001503] 1 lock held by trinity-c220/69531:
[ 5840.006457]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[ 5840.016031] 
stack backtrace:
[ 5840.020891] CPU: 12 PID: 69531 Comm: trinity-c220 Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 5840.030306] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 5840.041660]  0000000000000086 00000000a1ef62f8 ffff8803ca52f7c0 ffffffff813d2ecc
[ 5840.049952]  ffffffff82a41160 ffffffff82a913e0 ffff8803ca52f800 ffffffff811dd630
[ 5840.058245]  ffff8803ca52f840 ffff880392c4ecc8 ffff880392c4e000 0000000000000001
[ 5840.066537] Call Trace:
[ 5840.069266]  [<ffffffff813d2ecc>] dump_stack+0x85/0xc9
[ 5840.075000]  [<ffffffff811dd630>] print_circular_bug+0x1f9/0x207
[ 5840.081701]  [<ffffffff810fe69c>] __lock_acquire+0x151c/0x1990
[ 5840.088208]  [<ffffffff810ff174>] lock_acquire+0xd4/0x240
[ 5840.094232]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
[ 5840.100061]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
[ 5840.105891]  [<ffffffff817cf3b6>] mutex_lock_nested+0x76/0x450
[ 5840.112397]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
[ 5840.118228]  [<ffffffff810fb3e9>] ? __lock_is_held+0x49/0x70
[ 5840.124540]  [<ffffffff812ac69c>] seq_read+0x4c/0x3e0
[ 5840.130175]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
[ 5840.137360]  [<ffffffff8131566b>] kernfs_fop_read+0x12b/0x1b0
[ 5840.143770]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
[ 5840.150956]  [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
[ 5840.157657]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
[ 5840.164843]  [<ffffffff81281a03>] do_readv_writev+0x213/0x230
[ 5840.171255]  [<ffffffff81418cf9>] ? __pipe_get_pages+0x24/0x9b
[ 5840.177762]  [<ffffffff813e6f0f>] ? iov_iter_get_pages_alloc+0x19f/0x360
[ 5840.185240]  [<ffffffff810fd5f2>] ? __lock_acquire+0x472/0x1990
[ 5840.191843]  [<ffffffff81281a59>] vfs_readv+0x39/0x50
[ 5840.197478]  [<ffffffff812bc55a>] default_file_splice_read+0x1aa/0x2c0
[ 5840.204763]  [<ffffffff810cba89>] ? __might_sleep+0x49/0x80
[ 5840.210980]  [<ffffffff81349c93>] ? security_file_permission+0xa3/0xc0
[ 5840.218264]  [<ffffffff812bb913>] do_splice_to+0x73/0x90
[ 5840.224190]  [<ffffffff812bba1b>] splice_direct_to_actor+0xeb/0x220
[ 5840.231182]  [<ffffffff812baee0>] ? generic_pipe_buf_nosteal+0x10/0x10
[ 5840.238465]  [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
[ 5840.244778]  [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
[ 5840.250802]  [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
[ 5840.256922]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 5840.263042]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25

   CAI Qian

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-06 12:25                                                                   ` CAI Qian
@ 2016-10-06 16:11                                                                     ` CAI Qian
  2016-10-06 17:00                                                                       ` Linus Torvalds
  2016-10-07  7:08                                                                     ` Jan Kara
  1 sibling, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-10-06 16:11 UTC (permalink / raw)
  To: Al Viro
  Cc: tj, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel


> > > On Wed, Oct 05, 2016 at 02:57:04PM -0400, CAI Qian wrote:
> > > 
> > > > Not sure if this is related, but there is always a lockdep splat
> > > > regarding procfs that shows up below, unless it is masked by other
> > > > lockdep issues before the cgroup hang. Also, this hang is always
> > > > reproducible.
> > > 
> > > Sigh...  Let's get the /proc/*/auxv out of the way - this should deal
> > > with
> > > it:
> > So I applied both this and the sanity patch, and both the original sanity
> > and proc warnings went away. However, the cgroup hang can still be
> > reproduced, as well as this new xfs internal error below,
> 
> Wait. There was also a lockdep splat that happened before the xfs internal error.
Another lockdep splat this time,

[ 4872.310639] =================================
[ 4872.315499] [ INFO: inconsistent lock state ]
[ 4872.320359] 4.8.0-rc8-splice-fixw-proc+ #4 Not tainted
[ 4872.326091] ---------------------------------
[ 4872.330950] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[ 4872.338235] kswapd1/437 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 4872.343965]  (&xfs_nondir_ilock_class){++++?.}, at: [<ffffffffa029968e>] xfs_ilock+0x18e/0x260 [xfs]
[ 4872.354236] {RECLAIM_FS-ON-W} state was registered at:
[ 4872.359969]   [<ffffffff810fcbd6>] mark_held_locks+0x66/0x90
[ 4872.366297]   [<ffffffff810fffd5>] lockdep_trace_alloc+0xc5/0x110
[ 4872.373107]   [<ffffffff81253ad3>] kmem_cache_alloc+0x33/0x2e0
[ 4872.379628]   [<ffffffffa02a8386>] kmem_zone_alloc+0x96/0x120 [xfs]
[ 4872.386654]   [<ffffffffa024967b>] xfs_bmbt_init_cursor+0x3b/0x160 [xfs]
[ 4872.394147]   [<ffffffffa0247f8f>] xfs_bunmapi+0x80f/0xb00 [xfs]
[ 4872.400202] kmemleak: Cannot allocate a kmemleak_object structure
[ 4872.400205] kmemleak: Kernel memory leak detector disabled
[ 4872.400337] kmemleak: Automatic memory scanning thread ended
[ 4872.400869] kmemleak: Kmemleak disabled without freeing internal data. Reclaim the memory with "echo clear > /sys/kernel/debug/kmemleak".
[ 4872.433878]   [<ffffffffa027ddc3>] xfs_bmap_punch_delalloc_range+0xe3/0x180 [xfs]
[ 4872.442253]   [<ffffffffa0294b39>] xfs_file_iomap_end+0x89/0xd0 [xfs]
[ 4872.449468]   [<ffffffff812f3da0>] iomap_apply+0xe0/0x130
[ 4872.455505]   [<ffffffff812f3e58>] iomap_file_buffered_write+0x68/0xa0
[ 4872.462798]   [<ffffffffa028a87f>] xfs_file_buffered_aio_write+0x14f/0x350 [xfs]
[ 4872.471079]   [<ffffffffa028ab6d>] xfs_file_write_iter+0xed/0x130 [xfs]
[ 4872.478485]   [<ffffffff81280eee>] do_iter_readv_writev+0xae/0x130
[ 4872.485393]   [<ffffffff81281992>] do_readv_writev+0x1a2/0x230
[ 4872.491911]   [<ffffffff81281c6c>] vfs_writev+0x3c/0x50
[ 4872.497752]   [<ffffffff81281ce4>] do_writev+0x64/0x100
[ 4872.503589]   [<ffffffff81282ea0>] SyS_writev+0x10/0x20
[ 4872.509428]   [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 4872.515656]   [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
[ 4872.522563] irq event stamp: 427
[ 4872.526160] hardirqs last  enabled at (427): [<ffffffff817cf21d>] mutex_trylock+0xdd/0x200
[ 4872.535393] hardirqs last disabled at (426): [<ffffffff817cf191>] mutex_trylock+0x51/0x200
[ 4872.544627] softirqs last  enabled at (424): [<ffffffff817d7b37>] __do_softirq+0x1f7/0x4b7
[ 4872.553862] softirqs last disabled at (417): [<ffffffff810a4a98>] irq_exit+0xc8/0xe0
[ 4872.562513] 
[ 4872.562513] other info that might help us debug this:
[ 4872.569797]  Possible unsafe locking scenario:
[ 4872.569797] 
[ 4872.576401]        CPU0
[ 4872.579127]        ----
[ 4872.581854]   lock(&xfs_nondir_ilock_class);
[ 4872.586637]   <Interrupt>
[ 4872.589558]     lock(&xfs_nondir_ilock_class);
[ 4872.594533] 
[ 4872.594533]  *** DEADLOCK ***
[ 4872.594533] 
[ 4872.601140] 3 locks held by kswapd1/437:
[ 4872.605515]  #0:  (shrinker_rwsem){++++..}, at: [<ffffffff811f78ad>] shrink_slab+0x9d/0x620
[ 4872.614889]  #1:  (&type->s_umount_key#48){++++++}, at: [<ffffffff8128550b>] trylock_super+0x1b/0x50
[ 4872.625145]  #2:  (&pag->pag_ici_reclaim_lock){+.+...}, at: [<ffffffffa028e7a7>] xfs_reclaim_inodes_ag+0xc7/0x4f0 [xfs]
[ 4872.637247] 
[ 4872.637247] stack backtrace:
[ 4872.642109] CPU: 49 PID: 437 Comm: kswapd1 Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 4872.650846] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
[ 4872.662202]  0000000000000086 00000000eda15d18 ffff880462bd7798 ffffffff813d2ecc
[ 4872.670498]  ffff880462e56000 ffffffff82a66870 ffff880462bd77e8 ffffffff811dd9e1
[ 4872.678793]  0000000000000000 ffff880400000001 ffff880400000001 000000000000000a
[ 4872.687086] Call Trace:
[ 4872.689817]  [<ffffffff813d2ecc>] dump_stack+0x85/0xc9
[ 4872.695543]  [<ffffffff811dd9e1>] print_usage_bug+0x1eb/0x1fc
[ 4872.701954]  [<ffffffff810fc0b0>] ? check_usage_backwards+0x150/0x150
[ 4872.709141]  [<ffffffff810fcae4>] mark_lock+0x264/0x2f0
[ 4872.714968]  [<ffffffff810fd491>] __lock_acquire+0x311/0x1990
[ 4872.721379]  [<ffffffff810499db>] ? save_stack_trace+0x2b/0x50
[ 4872.727892]  [<ffffffff810fd5f2>] ? __lock_acquire+0x472/0x1990
[ 4872.734497]  [<ffffffff810ff174>] lock_acquire+0xd4/0x240
[ 4872.740535]  [<ffffffffa029968e>] ? xfs_ilock+0x18e/0x260 [xfs]
[ 4872.747155]  [<ffffffffa028dd93>] ? xfs_reclaim_inode+0x113/0x380 [xfs]
[ 4872.754538]  [<ffffffff810f8bfa>] down_write_nested+0x4a/0x80
[ 4872.760962]  [<ffffffffa029968e>] ? xfs_ilock+0x18e/0x260 [xfs]
[ 4872.767579]  [<ffffffffa029968e>] xfs_ilock+0x18e/0x260 [xfs]
[ 4872.774004]  [<ffffffffa028dd93>] xfs_reclaim_inode+0x113/0x380 [xfs]
[ 4872.781203]  [<ffffffffa028e9ab>] xfs_reclaim_inodes_ag+0x2cb/0x4f0 [xfs]
[ 4872.788780]  [<ffffffffa028e7d2>] ? xfs_reclaim_inodes_ag+0xf2/0x4f0 [xfs]
[ 4872.796453]  [<ffffffff817d40aa>] ? _raw_spin_unlock_irqrestore+0x6a/0x80
[ 4872.804026]  [<ffffffff817d408a>] ? _raw_spin_unlock_irqrestore+0x4a/0x80
[ 4872.811602]  [<ffffffff810d1a58>] ? try_to_wake_up+0x58/0x510
[ 4872.818014]  [<ffffffff810d1f25>] ? wake_up_process+0x15/0x20
[ 4872.824438]  [<ffffffffa0290523>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[ 4872.831835]  [<ffffffffa02a26d9>] xfs_fs_free_cached_objects+0x19/0x20 [xfs]
[ 4872.839702]  [<ffffffff812856c1>] super_cache_scan+0x181/0x190
[ 4872.846210]  [<ffffffff811f7a79>] shrink_slab+0x269/0x620
[ 4872.852233]  [<ffffffff811fcc88>] shrink_node+0x108/0x310
[ 4872.858256]  [<ffffffff811fe360>] kswapd+0x3d0/0x960
[ 4872.863796]  [<ffffffff811fdf90>] ? mem_cgroup_shrink_node+0x370/0x370
[ 4872.871081]  [<ffffffff810c3f5e>] kthread+0xfe/0x120
[ 4872.876618]  [<ffffffff817d40ec>] ? _raw_spin_unlock_irq+0x2c/0x60
[ 4872.883514]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
[ 4872.889539]  [<ffffffff810c3e60>] ? kthread_create_on_node+0x230/0x230

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-06 16:11                                                                     ` CAI Qian
@ 2016-10-06 17:00                                                                       ` Linus Torvalds
  2016-10-06 18:12                                                                         ` CAI Qian
  2016-10-07  9:57                                                                         ` Dave Chinner
  0 siblings, 2 replies; 152+ messages in thread
From: Linus Torvalds @ 2016-10-06 17:00 UTC (permalink / raw)
  To: CAI Qian
  Cc: Al Viro, tj, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Thu, Oct 6, 2016 at 9:11 AM, CAI Qian <caiqian@redhat.com> wrote:
>
>>
>> Wait. There was also a lockdep splat that happened before the xfs internal error.
> Another lockdep splat this time,

This one looks just bogus.

> [ 4872.569797]  Possible unsafe locking scenario:
> [ 4872.569797]
> [ 4872.576401]        CPU0
> [ 4872.579127]        ----
> [ 4872.581854]   lock(&xfs_nondir_ilock_class);
> [ 4872.586637]   <Interrupt>
> [ 4872.589558]     lock(&xfs_nondir_ilock_class);

I'm not seeing that lock taken in interrupt context.

I'm wondering how many of your reports are confused by earlier errors
that happened.

               Linus

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-06 17:00                                                                       ` Linus Torvalds
@ 2016-10-06 18:12                                                                         ` CAI Qian
  2016-10-07  9:57                                                                         ` Dave Chinner
  1 sibling, 0 replies; 152+ messages in thread
From: CAI Qian @ 2016-10-06 18:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, tj, Dave Chinner, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel



----- Original Message -----
> From: "Linus Torvalds" <torvalds@linux-foundation.org>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Al Viro" <viro@zeniv.linux.org.uk>, "tj" <tj@kernel.org>, "Dave Chinner" <david@fromorbit.com>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>, "linux-fsdevel"
> <linux-fsdevel@vger.kernel.org>
> Sent: Thursday, October 6, 2016 1:00:08 PM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> On Thu, Oct 6, 2016 at 9:11 AM, CAI Qian <caiqian@redhat.com> wrote:
> >
> >>
> >> Wait. There was also a lockdep splat that happened before the xfs
> >> internal error.
> > Another lockdep splat this time,
> 
> This one looks just bogus.
> 
> > [ 4872.569797]  Possible unsafe locking scenario:
> > [ 4872.569797]
> > [ 4872.576401]        CPU0
> > [ 4872.579127]        ----
> > [ 4872.581854]   lock(&xfs_nondir_ilock_class);
> > [ 4872.586637]   <Interrupt>
> > [ 4872.589558]     lock(&xfs_nondir_ilock_class);
> 
> I'm not seeing that lock taken in interrupt context.
> 
> I'm wondering how many of your reports are confused by earlier errors
> that happened.
Hmm, there were no previous errors/lockdep splats/warnings on the console
prior to this, AFAICT. It was a fresh trinity run after a reboot.

The previous run, which triggered the seq_read/__sb_start_write lockdep
splat and then the xfs XFS_WANT_CORRUPTED_RETURN internal error highlighted
in another reply, was also started from a fresh reboot.

After all of those individual runs, the cgroup hang can reliably be
triggered by any systemctl command, a "make install" of the kernel, etc.
   CAI Qian

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-06 12:25                                                                   ` CAI Qian
  2016-10-06 16:11                                                                     ` CAI Qian
@ 2016-10-07  7:08                                                                     ` Jan Kara
  2016-10-07 14:43                                                                       ` CAI Qian
  2016-10-21 15:38                                                                       ` [4.9-rc1+] overlayfs lockdep CAI Qian
  1 sibling, 2 replies; 152+ messages in thread
From: Jan Kara @ 2016-10-07  7:08 UTC (permalink / raw)
  To: CAI Qian
  Cc: Al Viro, tj, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel, Miklos Szeredi


So I believe this may be just a problem in the overlayfs lockdep
annotations (see below). Added Miklos to CC.

On Thu 06-10-16 08:25:59, CAI Qian wrote:
> > > > Not sure if this is related, but there is always a lockdep splat
> > > > regarding procfs that shows up below, unless it is masked by other
> > > > lockdep issues before the cgroup hang. Also, this hang is always
> > > > reproducible.
> > > 
> > > Sigh...  Let's get the /proc/*/auxv out of the way - this should deal with
> > > it:
> > So I applied both this and the sanity patch, and both the original sanity
> > and proc warnings went away. However, the cgroup hang can still be
> > reproduced, as well as this new xfs internal error below,
> 
> Wait. There was also a lockdep splat that happened before the xfs internal error.
> 
> [ 5839.452325] ======================================================
> [ 5839.459221] [ INFO: possible circular locking dependency detected ]
> [ 5839.466215] 4.8.0-rc8-splice-fixw-proc+ #4 Not tainted
> [ 5839.471945] -------------------------------------------------------
> [ 5839.478937] trinity-c220/69531 is trying to acquire lock:
> [ 5839.484961]  (&p->lock){+.+.+.}, at: [<ffffffff812ac69c>] seq_read+0x4c/0x3e0
> [ 5839.492967] 
> but task is already holding lock:
> [ 5839.499476]  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
> [ 5839.508560] 
> which lock already depends on the new lock.
> 
> [ 5839.517686] 
> the existing dependency chain (in reverse order) is:
> [ 5839.526036] 
> -> #3 (sb_writers#8){.+.+.+}:
> [ 5839.530751]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> [ 5839.537368]        [<ffffffff810f8f4a>] percpu_down_read+0x4a/0x90
> [ 5839.544275]        [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
> [ 5839.551181]        [<ffffffff812a8544>] mnt_want_write+0x24/0x50
> [ 5839.557892]        [<ffffffffa04a398f>] ovl_want_write+0x1f/0x30 [overlay]
> [ 5839.565577]        [<ffffffffa04a6036>] ovl_do_remove+0x46/0x480 [overlay]
> [ 5839.573259]        [<ffffffffa04a64a3>] ovl_unlink+0x13/0x20 [overlay]
> [ 5839.580555]        [<ffffffff812918ea>] vfs_unlink+0xda/0x190
> [ 5839.586979]        [<ffffffff81293698>] do_unlinkat+0x268/0x2b0
> [ 5839.593599]        [<ffffffff8129419b>] SyS_unlinkat+0x1b/0x30
> [ 5839.600120]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 5839.606836]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> [ 5839.614231] 

So here is IMO the real culprit: do_unlinkat() grabs fs freeze protection
through mnt_want_write(); a bit after that we also grab i_rwsem in
do_unlinkat() in the I_MUTEX_PARENT class, and further down in vfs_unlink()
we grab i_rwsem for the unlinked inode itself in the default I_MUTEX class.
Then in ovl_want_write() we grab freeze protection again, but this time for
the upper filesystem. That establishes the sb_writers (overlay) ->
I_MUTEX_PARENT (overlay) -> I_MUTEX (overlay) -> sb_writers (FS-A) lock
ordering (we maintain locking classes per fs type, so that's why I'm
showing the fs type in parentheses).

Now this nesting is nasty because once you add locks that are not tracked
per fs type into the mix, you get cycles. In this case we've got
seq_file->lock and cred_guard_mutex in the mix - the splice path is doing
sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex (splicing from a
seq_file into the real filesystem). The exec path further establishes
cred_guard_mutex -> I_MUTEX (overlay), which closes the full cycle:

sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex -> i_mutex
(overlay) -> sb_writers (FS-A)
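
Spelled out per task (an illustrative sketch of the four edges only; these
are not the exact call sites, and each line takes the left lock before the
right one):

	/* #3 unlinkat on overlay  */ inode_lock(ovl_inode);         sb_start_write(FS-A);
	/* #2 execve of ovl binary */ mutex_lock(&cred_guard_mutex); inode_lock(ovl_inode);
	/* #1 preadv of proc file  */ mutex_lock(&p->lock);          mutex_lock(&cred_guard_mutex);
	/* #0 sendfile into FS-A   */ sb_start_write(FS-A);          mutex_lock(&p->lock);

Four tasks hitting these paths concurrently can each end up holding the
left lock while waiting for the right one, which is exactly the cycle
above.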

If I analyzed the lockdep trace correctly, this looks like a real (although
remote) deadlock possibility. Miklos?

								Honza

> -> #2 (&sb->s_type->i_mutex_key#17){++++++}:
> [ 5839.620399]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> [ 5839.627015]        [<ffffffff817d1b77>] down_read+0x47/0x70
> [ 5839.633242]        [<ffffffff8128cfd2>] lookup_slow+0xc2/0x1f0
> [ 5839.639762]        [<ffffffff8128f6f2>] walk_component+0x172/0x220
> [ 5839.646668]        [<ffffffff81290fd6>] link_path_walk+0x1a6/0x620
> [ 5839.653574]        [<ffffffff81291a81>] path_openat+0xe1/0xdb0
> [ 5839.660092]        [<ffffffff812939e1>] do_filp_open+0x91/0x100
> [ 5839.666707]        [<ffffffff81288e06>] do_open_execat+0x76/0x180
> [ 5839.673517]        [<ffffffff81288f3b>] open_exec+0x2b/0x50
> [ 5839.679743]        [<ffffffff812eccf3>] load_elf_binary+0x2a3/0x10a0
> [ 5839.686844]        [<ffffffff81288917>] search_binary_handler+0x97/0x1d0
> [ 5839.694331]        [<ffffffff81289ed8>] do_execveat_common.isra.35+0x678/0x9a0
> [ 5839.702400]        [<ffffffff8128a4da>] SyS_execve+0x3a/0x50
> [ 5839.708726]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 5839.715441]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> [ 5839.722833] 
> -> #1 (&sig->cred_guard_mutex){+.+.+.}:
> [ 5839.728510]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> [ 5839.735126]        [<ffffffff817cfc66>] mutex_lock_killable_nested+0x86/0x540
> [ 5839.743097]        [<ffffffff81301e84>] lock_trace+0x24/0x60
> [ 5839.749421]        [<ffffffff8130224d>] proc_pid_syscall+0x2d/0x110
> [ 5839.756423]        [<ffffffff81302af0>] proc_single_show+0x50/0x90
> [ 5839.763330]        [<ffffffff812ab867>] traverse+0xf7/0x210
> [ 5839.769557]        [<ffffffff812ac9eb>] seq_read+0x39b/0x3e0
> [ 5839.775884]        [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
> [ 5839.783179]        [<ffffffff81281a03>] do_readv_writev+0x213/0x230
> [ 5839.790181]        [<ffffffff81281a59>] vfs_readv+0x39/0x50
> [ 5839.796406]        [<ffffffff81281c12>] do_preadv+0xa2/0xc0
> [ 5839.802634]        [<ffffffff81282ec1>] SyS_preadv+0x11/0x20
> [ 5839.808963]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 5839.815681]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> [ 5839.823075] 
> -> #0 (&p->lock){+.+.+.}:
> [ 5839.827395]        [<ffffffff810fe69c>] __lock_acquire+0x151c/0x1990
> [ 5839.834500]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> [ 5839.841115]        [<ffffffff817cf3b6>] mutex_lock_nested+0x76/0x450
> [ 5839.848219]        [<ffffffff812ac69c>] seq_read+0x4c/0x3e0
> [ 5839.854448]        [<ffffffff8131566b>] kernfs_fop_read+0x12b/0x1b0
> [ 5839.861451]        [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
> [ 5839.868742]        [<ffffffff81281a03>] do_readv_writev+0x213/0x230
> [ 5839.875744]        [<ffffffff81281a59>] vfs_readv+0x39/0x50
> [ 5839.881971]        [<ffffffff812bc55a>] default_file_splice_read+0x1aa/0x2c0
> [ 5839.889847]        [<ffffffff812bb913>] do_splice_to+0x73/0x90
> [ 5839.896365]        [<ffffffff812bba1b>] splice_direct_to_actor+0xeb/0x220
> [ 5839.903950]        [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
> [ 5839.910857]        [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
> [ 5839.917470]        [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
> [ 5839.924184]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 5839.930898]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> [ 5839.938286] 
> other info that might help us debug this:
> 
> [ 5839.947217] Chain exists of:
>   &p->lock --> &sb->s_type->i_mutex_key#17 --> sb_writers#8
> 
> [ 5839.956615]  Possible unsafe locking scenario:
> 
> [ 5839.963218]        CPU0                    CPU1
> [ 5839.968269]        ----                    ----
> [ 5839.973321]   lock(sb_writers#8);
> [ 5839.977046]                                lock(&sb->s_type->i_mutex_key#17);
> [ 5839.985037]                                lock(sb_writers#8);
> [ 5839.991573]   lock(&p->lock);
> [ 5839.994900] 
>  *** DEADLOCK ***
> 
> [ 5840.001503] 1 lock held by trinity-c220/69531:
> [ 5840.006457]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
> [ 5840.016031] 
> stack backtrace:
> [ 5840.020891] CPU: 12 PID: 69531 Comm: trinity-c220 Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 5840.030306] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
> [ 5840.041660]  0000000000000086 00000000a1ef62f8 ffff8803ca52f7c0 ffffffff813d2ecc
> [ 5840.049952]  ffffffff82a41160 ffffffff82a913e0 ffff8803ca52f800 ffffffff811dd630
> [ 5840.058245]  ffff8803ca52f840 ffff880392c4ecc8 ffff880392c4e000 0000000000000001
> [ 5840.066537] Call Trace:
> [ 5840.069266]  [<ffffffff813d2ecc>] dump_stack+0x85/0xc9
> [ 5840.075000]  [<ffffffff811dd630>] print_circular_bug+0x1f9/0x207
> [ 5840.081701]  [<ffffffff810fe69c>] __lock_acquire+0x151c/0x1990
> [ 5840.088208]  [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> [ 5840.094232]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
> [ 5840.100061]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
> [ 5840.105891]  [<ffffffff817cf3b6>] mutex_lock_nested+0x76/0x450
> [ 5840.112397]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
> [ 5840.118228]  [<ffffffff810fb3e9>] ? __lock_is_held+0x49/0x70
> [ 5840.124540]  [<ffffffff812ac69c>] seq_read+0x4c/0x3e0
> [ 5840.130175]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
> [ 5840.137360]  [<ffffffff8131566b>] kernfs_fop_read+0x12b/0x1b0
> [ 5840.143770]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
> [ 5840.150956]  [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
> [ 5840.157657]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
> [ 5840.164843]  [<ffffffff81281a03>] do_readv_writev+0x213/0x230
> [ 5840.171255]  [<ffffffff81418cf9>] ? __pipe_get_pages+0x24/0x9b
> [ 5840.177762]  [<ffffffff813e6f0f>] ? iov_iter_get_pages_alloc+0x19f/0x360
> [ 5840.185240]  [<ffffffff810fd5f2>] ? __lock_acquire+0x472/0x1990
> [ 5840.191843]  [<ffffffff81281a59>] vfs_readv+0x39/0x50
> [ 5840.197478]  [<ffffffff812bc55a>] default_file_splice_read+0x1aa/0x2c0
> [ 5840.204763]  [<ffffffff810cba89>] ? __might_sleep+0x49/0x80
> [ 5840.210980]  [<ffffffff81349c93>] ? security_file_permission+0xa3/0xc0
> [ 5840.218264]  [<ffffffff812bb913>] do_splice_to+0x73/0x90
> [ 5840.224190]  [<ffffffff812bba1b>] splice_direct_to_actor+0xeb/0x220
> [ 5840.231182]  [<ffffffff812baee0>] ? generic_pipe_buf_nosteal+0x10/0x10
> [ 5840.238465]  [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
> [ 5840.244778]  [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
> [ 5840.250802]  [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
> [ 5840.256922]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 5840.263042]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> 
>    CAI Qian
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-06 12:20                                                                 ` CAI Qian
  2016-10-06 12:25                                                                   ` CAI Qian
@ 2016-10-07  9:27                                                                   ` Dave Chinner
  1 sibling, 0 replies; 152+ messages in thread
From: Dave Chinner @ 2016-10-07  9:27 UTC (permalink / raw)
  To: CAI Qian
  Cc: Al Viro, tj, Linus Torvalds, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel

On Thu, Oct 06, 2016 at 08:20:17AM -0400, CAI Qian wrote:
> 
> 
> ----- Original Message -----
> > From: "Al Viro" <viro@ZenIV.linux.org.uk>
> > To: "CAI Qian" <caiqian@redhat.com>
> > Cc: "tj" <tj@kernel.org>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>,
> > "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> > linux-fsdevel@vger.kernel.org
> > Sent: Wednesday, October 5, 2016 4:05:22 PM
> > Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> > 
> > On Wed, Oct 05, 2016 at 02:57:04PM -0400, CAI Qian wrote:
> > 
> > > Not sure if this related, and there is always a lockdep regards procfs
> > > happened
> > > below unless masking by other lockdep issues before the cgroup hang. Also,
> > > this
> > > hang is always reproducible.
> > 
> > Sigh...  Let's get the /proc/*/auxv out of the way - this should deal with
> > it:
> So I applied both this and the sanity patch, and both original sanity and the
> proc warnings went away. However, the cgroup hang can still be reproduced as
> well as this new xfs internal error below,
> 
> [16921.141233] XFS (dm-0): Internal error XFS_WANT_CORRUPTED_RETURN at line 5619 of file fs/xfs/libxfs/xfs_bmap.c.  Caller xfs_bmap_shift_extents+0x1cc/0x3a0 [xfs]
> [16921.157694] CPU: 9 PID: 52920 Comm: trinity-c108 Not tainted 4.8.0-rc8-splice-fixw-proc+ #4

It found a delayed allocation extent in the extent map after
flushing all the dirty data in the file. Something else has gone
wrong; this corruption detection is just the messenger. Maybe
memory corruption?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-06 17:00                                                                       ` Linus Torvalds
  2016-10-06 18:12                                                                         ` CAI Qian
@ 2016-10-07  9:57                                                                         ` Dave Chinner
  2016-10-07 15:25                                                                           ` Linus Torvalds
  1 sibling, 1 reply; 152+ messages in thread
From: Dave Chinner @ 2016-10-07  9:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: CAI Qian, Al Viro, tj, linux-xfs, Jens Axboe, Nick Piggin, linux-fsdevel

On Thu, Oct 06, 2016 at 10:00:08AM -0700, Linus Torvalds wrote:
> On Thu, Oct 6, 2016 at 9:11 AM, CAI Qian <caiqian@redhat.com> wrote:
> >
> >>
> >> Wait. There is also a lockep happened before the xfs internal error as well.
> > Some other lockdep this time,
> 
> This one looks just bogus.
> 
> > [ 4872.569797]  Possible unsafe locking scenario:
> > [ 4872.569797]
> > [ 4872.576401]        CPU0
> > [ 4872.579127]        ----
> > [ 4872.581854]   lock(&xfs_nondir_ilock_class);
> > [ 4872.586637]   <Interrupt>
> > [ 4872.589558]     lock(&xfs_nondir_ilock_class);
> 
> I'm not seeing that .lock taken in interrupt context.

It's a memory allocation vs reclaim context warning, not a lock
warning. Reclaim context tracking overloads the lock vs interrupt
lockdep mechanism, so if lockdep sees a context violation it is
reported as an "interrupt context" lock problem.

The allocation context in question is in a function that can be
called from both inside and outside a transaction context. When
outside a transaction it's a GFP_KERNEL allocation; when inside,
it's a GFP_NOFS allocation. However, both contexts hold the
inode ilock over the allocation.
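
As a sketch of the pattern (illustrative only, not the actual XFS call
site; kmem_alloc()/KM_* are the XFS allocation wrappers of this era, and
in_transaction is a stand-in predicate, not a real variable):

	xfs_ilock(ip, XFS_ILOCK_EXCL);
	/*
	 * Same allocation site, two callers: inside a transaction we
	 * must not recurse into the filesystem (KM_NOFS maps to
	 * GFP_NOFS), outside we can allow full direct reclaim
	 * (KM_SLEEP maps to GFP_KERNEL).
	 */
	buf = kmem_alloc(size, in_transaction ? KM_NOFS : KM_SLEEP);
	...
	xfs_iunlock(ip, XFS_ILOCK_EXCL);

It's the GFP_KERNEL side of this that can enter direct reclaim while
the ilock is held.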

The inode shrinker (reclaim context) also happens to take the inode
ilock, and that's what lockdep is complaining about. I.e. it thinks
that the path ilock -> alloc(GFP_KERNEL) -> reclaim -> ilock can
deadlock. But it can't - the ilock held on the upper side belongs to
a referenced inode that can't be seen by reclaim, and the ilocks
taken by reclaim are on inodes that can't be seen or referenced by
the VFS.

I.e. there are no dependencies between the ilocks on either side of
the memory allocation, but there's no way of telling lockdep that
short of giving the inodes in reclaim a different lock class. We
used to do that, but it was a nasty hack and prevented lockdep from
verifying that the locking orders used on inodes and objects in
reclaim matched the locking orders of referenced inodes...
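
For reference, that old hack was along these lines - a separate lockdep
class key applied to inodes entering reclaim. A from-memory sketch only;
the real key name differed:

	/* hypothetical class key for reclaimable inodes */
	static struct lock_class_key xfs_ilock_reclaimable;

	/*
	 * When an inode transitions to the reclaimable state, move its
	 * ilock into its own class so lockdep stops linking it with
	 * the ilocks of referenced inodes.
	 */
	lockdep_set_class_and_name(&ip->i_lock.mr_lock,
				   &xfs_ilock_reclaimable,
				   "xfs_ilock_reclaimable");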

We've historically shut these false positives up by simply making
all the allocations in these dual-context paths GFP_NOFS. However, I
recently got told not to do that by someone on the mm side because
it exacerbates deficiencies in memory reclaim when too many
allocations use GFP_NOFS.

So it's not "fixed" and instead I'm ignoring it.  If you spend any
amount of time running lockdep on XFS you'll get as sick and tired
of playing this whack-a-lockdep-false-positive game as I am.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-07  7:08                                                                     ` Jan Kara
@ 2016-10-07 14:43                                                                       ` CAI Qian
  2016-10-07 15:27                                                                         ` CAI Qian
  2016-10-09 21:51                                                                         ` Dave Chinner
  2016-10-21 15:38                                                                       ` [4.9-rc1+] overlayfs lockdep CAI Qian
  1 sibling, 2 replies; 152+ messages in thread
From: CAI Qian @ 2016-10-07 14:43 UTC (permalink / raw)
  To: Jan Kara
  Cc: Al Viro, tj, Linus Torvalds, Dave Chinner, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel, Miklos Szeredi, Dave Jones



----- Original Message -----
> From: "Jan Kara" <jack@suse.cz>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Al Viro" <viro@ZenIV.linux.org.uk>, "tj" <tj@kernel.org>, "Linus Torvalds" <torvalds@linux-foundation.org>,
> "Dave Chinner" <david@fromorbit.com>, "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick
> Piggin" <npiggin@gmail.com>, linux-fsdevel@vger.kernel.org, "Miklos Szeredi" <miklos@szeredi.hu>
> Sent: Friday, October 7, 2016 3:08:38 AM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> 
> So I believe this may be just a problem in overlayfs lockdep annotation
> (see below). Added Miklos to CC.
> 
> On Thu 06-10-16 08:25:59, CAI Qian wrote:
> > > > > Not sure if this related, and there is always a lockdep regards
> > > > > procfs
> > > > > happened
> > > > > below unless masking by other lockdep issues before the cgroup hang.
> > > > > Also,
> > > > > this
> > > > > hang is always reproducible.
> > > > 
> > > > Sigh...  Let's get the /proc/*/auxv out of the way - this should deal
> > > > with
> > > > it:
> > > So I applied both this and the sanity patch, and both original sanity and
> > > the
> > > proc warnings went away. However, the cgroup hang can still be reproduced
> > > as
> > > well as this new xfs internal error below,
> > 
> > Wait. There is also a lockep happened before the xfs internal error as
> > well.
> > 
> > [ 5839.452325] ======================================================
> > [ 5839.459221] [ INFO: possible circular locking dependency detected ]
> > [ 5839.466215] 4.8.0-rc8-splice-fixw-proc+ #4 Not tainted
> > [ 5839.471945] -------------------------------------------------------
> > [ 5839.478937] trinity-c220/69531 is trying to acquire lock:
> > [ 5839.484961]  (&p->lock){+.+.+.}, at: [<ffffffff812ac69c>]
> > seq_read+0x4c/0x3e0
> > [ 5839.492967]
> > but task is already holding lock:
> > [ 5839.499476]  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>]
> > __sb_start_write+0xd1/0xf0
> > [ 5839.508560]
> > which lock already depends on the new lock.
> > 
> > [ 5839.517686]
> > the existing dependency chain (in reverse order) is:
> > [ 5839.526036]
> > -> #3 (sb_writers#8){.+.+.+}:
> > [ 5839.530751]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> > [ 5839.537368]        [<ffffffff810f8f4a>] percpu_down_read+0x4a/0x90
> > [ 5839.544275]        [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
> > [ 5839.551181]        [<ffffffff812a8544>] mnt_want_write+0x24/0x50
> > [ 5839.557892]        [<ffffffffa04a398f>] ovl_want_write+0x1f/0x30
> > [overlay]
> > [ 5839.565577]        [<ffffffffa04a6036>] ovl_do_remove+0x46/0x480
> > [overlay]
> > [ 5839.573259]        [<ffffffffa04a64a3>] ovl_unlink+0x13/0x20 [overlay]
> > [ 5839.580555]        [<ffffffff812918ea>] vfs_unlink+0xda/0x190
> > [ 5839.586979]        [<ffffffff81293698>] do_unlinkat+0x268/0x2b0
> > [ 5839.593599]        [<ffffffff8129419b>] SyS_unlinkat+0x1b/0x30
> > [ 5839.600120]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> > [ 5839.606836]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> > [ 5839.614231]
> 
> So here is IMO the real culprit: do_unlinkat() grabs fs freeze protection
> through mnt_want_write(); a bit after that we also grab i_rwsem in
> do_unlinkat() in the I_MUTEX_PARENT class, and further down in
> vfs_unlink() we grab i_rwsem for the unlinked inode itself in the default
> I_MUTEX class. Then in ovl_want_write() we grab freeze protection again,
> but this time for the upper filesystem. That establishes the sb_writers
> (overlay) -> I_MUTEX_PARENT (overlay) -> I_MUTEX (overlay) -> sb_writers
> (FS-A) lock ordering (we maintain locking classes per fs type, which is
> why I'm showing the fs type in parentheses).
> 
> Now this nesting is nasty because once you add locks that are not tracked
> per fs type into the mix, you get cycles. In this case we've got
> seq_file->lock and cred_guard_mutex in the mix - the splice path is doing
> sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex (splicing from a
> seq_file into the real filesystem). The exec path further establishes
> cred_guard_mutex -> I_MUTEX (overlay), which closes the full cycle:
> 
> sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex -> i_mutex
> (overlay) -> sb_writers (FS-A)
> 
> If I analyzed the lockdep trace correctly, this looks like a real
> (although remote) deadlock possibility. Miklos?
> 
> 								Honza
> 
> > -> #2 (&sb->s_type->i_mutex_key#17){++++++}:
> > [ 5839.620399]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> > [ 5839.627015]        [<ffffffff817d1b77>] down_read+0x47/0x70
> > [ 5839.633242]        [<ffffffff8128cfd2>] lookup_slow+0xc2/0x1f0
> > [ 5839.639762]        [<ffffffff8128f6f2>] walk_component+0x172/0x220
> > [ 5839.646668]        [<ffffffff81290fd6>] link_path_walk+0x1a6/0x620
> > [ 5839.653574]        [<ffffffff81291a81>] path_openat+0xe1/0xdb0
> > [ 5839.660092]        [<ffffffff812939e1>] do_filp_open+0x91/0x100
> > [ 5839.666707]        [<ffffffff81288e06>] do_open_execat+0x76/0x180
> > [ 5839.673517]        [<ffffffff81288f3b>] open_exec+0x2b/0x50
> > [ 5839.679743]        [<ffffffff812eccf3>] load_elf_binary+0x2a3/0x10a0
> > [ 5839.686844]        [<ffffffff81288917>] search_binary_handler+0x97/0x1d0
> > [ 5839.694331]        [<ffffffff81289ed8>]
> > do_execveat_common.isra.35+0x678/0x9a0
> > [ 5839.702400]        [<ffffffff8128a4da>] SyS_execve+0x3a/0x50
> > [ 5839.708726]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> > [ 5839.715441]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> > [ 5839.722833]
> > -> #1 (&sig->cred_guard_mutex){+.+.+.}:
> > [ 5839.728510]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> > [ 5839.735126]        [<ffffffff817cfc66>]
> > mutex_lock_killable_nested+0x86/0x540
> > [ 5839.743097]        [<ffffffff81301e84>] lock_trace+0x24/0x60
> > [ 5839.749421]        [<ffffffff8130224d>] proc_pid_syscall+0x2d/0x110
> > [ 5839.756423]        [<ffffffff81302af0>] proc_single_show+0x50/0x90
> > [ 5839.763330]        [<ffffffff812ab867>] traverse+0xf7/0x210
> > [ 5839.769557]        [<ffffffff812ac9eb>] seq_read+0x39b/0x3e0
> > [ 5839.775884]        [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
> > [ 5839.783179]        [<ffffffff81281a03>] do_readv_writev+0x213/0x230
> > [ 5839.790181]        [<ffffffff81281a59>] vfs_readv+0x39/0x50
> > [ 5839.796406]        [<ffffffff81281c12>] do_preadv+0xa2/0xc0
> > [ 5839.802634]        [<ffffffff81282ec1>] SyS_preadv+0x11/0x20
> > [ 5839.808963]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> > [ 5839.815681]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> > [ 5839.823075]
> > -> #0 (&p->lock){+.+.+.}:
> > [ 5839.827395]        [<ffffffff810fe69c>] __lock_acquire+0x151c/0x1990
> > [ 5839.834500]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> > [ 5839.841115]        [<ffffffff817cf3b6>] mutex_lock_nested+0x76/0x450
> > [ 5839.848219]        [<ffffffff812ac69c>] seq_read+0x4c/0x3e0
> > [ 5839.854448]        [<ffffffff8131566b>] kernfs_fop_read+0x12b/0x1b0
> > [ 5839.861451]        [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
> > [ 5839.868742]        [<ffffffff81281a03>] do_readv_writev+0x213/0x230
> > [ 5839.875744]        [<ffffffff81281a59>] vfs_readv+0x39/0x50
> > [ 5839.881971]        [<ffffffff812bc55a>]
> > default_file_splice_read+0x1aa/0x2c0
> > [ 5839.889847]        [<ffffffff812bb913>] do_splice_to+0x73/0x90
> > [ 5839.896365]        [<ffffffff812bba1b>]
> > splice_direct_to_actor+0xeb/0x220
> > [ 5839.903950]        [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
> > [ 5839.910857]        [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
> > [ 5839.917470]        [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
> > [ 5839.924184]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> > [ 5839.930898]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> > [ 5839.938286]
> > other info that might help us debug this:
> > 
> > [ 5839.947217] Chain exists of:
> >   &p->lock --> &sb->s_type->i_mutex_key#17 --> sb_writers#8
> > 
> > [ 5839.956615]  Possible unsafe locking scenario:
> > 
> > [ 5839.963218]        CPU0                    CPU1
> > [ 5839.968269]        ----                    ----
> > [ 5839.973321]   lock(sb_writers#8);
> > [ 5839.977046]
> > lock(&sb->s_type->i_mutex_key#17);
> > [ 5839.985037]                                lock(sb_writers#8);
> > [ 5839.991573]   lock(&p->lock);
> > [ 5839.994900]
> >  *** DEADLOCK ***
> > 
> > [ 5840.001503] 1 lock held by trinity-c220/69531:
> > [ 5840.006457]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>]
> > __sb_start_write+0xd1/0xf0
> > [ 5840.016031]
> > stack backtrace:
> > [ 5840.020891] CPU: 12 PID: 69531 Comm: trinity-c220 Not tainted
> > 4.8.0-rc8-splice-fixw-proc+ #4
> > [ 5840.030306] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS
> > GRNDSDP1.86B.0044.R00.1501191641 01/19/2015
> > [ 5840.041660]  0000000000000086 00000000a1ef62f8 ffff8803ca52f7c0
> > ffffffff813d2ecc
> > [ 5840.049952]  ffffffff82a41160 ffffffff82a913e0 ffff8803ca52f800
> > ffffffff811dd630
> > [ 5840.058245]  ffff8803ca52f840 ffff880392c4ecc8 ffff880392c4e000
> > 0000000000000001
> > [ 5840.066537] Call Trace:
> > [ 5840.069266]  [<ffffffff813d2ecc>] dump_stack+0x85/0xc9
> > [ 5840.075000]  [<ffffffff811dd630>] print_circular_bug+0x1f9/0x207
> > [ 5840.081701]  [<ffffffff810fe69c>] __lock_acquire+0x151c/0x1990
> > [ 5840.088208]  [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> > [ 5840.094232]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
> > [ 5840.100061]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
> > [ 5840.105891]  [<ffffffff817cf3b6>] mutex_lock_nested+0x76/0x450
> > [ 5840.112397]  [<ffffffff812ac69c>] ? seq_read+0x4c/0x3e0
> > [ 5840.118228]  [<ffffffff810fb3e9>] ? __lock_is_held+0x49/0x70
> > [ 5840.124540]  [<ffffffff812ac69c>] seq_read+0x4c/0x3e0
> > [ 5840.130175]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
> > [ 5840.137360]  [<ffffffff8131566b>] kernfs_fop_read+0x12b/0x1b0
> > [ 5840.143770]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
> > [ 5840.150956]  [<ffffffff81280573>] do_loop_readv_writev+0x83/0xc0
> > [ 5840.157657]  [<ffffffff81315540>] ? kernfs_vma_page_mkwrite+0x90/0x90
> > [ 5840.164843]  [<ffffffff81281a03>] do_readv_writev+0x213/0x230
> > [ 5840.171255]  [<ffffffff81418cf9>] ? __pipe_get_pages+0x24/0x9b
> > [ 5840.177762]  [<ffffffff813e6f0f>] ? iov_iter_get_pages_alloc+0x19f/0x360
> > [ 5840.185240]  [<ffffffff810fd5f2>] ? __lock_acquire+0x472/0x1990
> > [ 5840.191843]  [<ffffffff81281a59>] vfs_readv+0x39/0x50
> > [ 5840.197478]  [<ffffffff812bc55a>] default_file_splice_read+0x1aa/0x2c0
> > [ 5840.204763]  [<ffffffff810cba89>] ? __might_sleep+0x49/0x80
> > [ 5840.210980]  [<ffffffff81349c93>] ? security_file_permission+0xa3/0xc0
> > [ 5840.218264]  [<ffffffff812bb913>] do_splice_to+0x73/0x90
> > [ 5840.224190]  [<ffffffff812bba1b>] splice_direct_to_actor+0xeb/0x220
> > [ 5840.231182]  [<ffffffff812baee0>] ? generic_pipe_buf_nosteal+0x10/0x10
> > [ 5840.238465]  [<ffffffff812bbbd9>] do_splice_direct+0x89/0xd0
> > [ 5840.244778]  [<ffffffff8128261e>] do_sendfile+0x1ce/0x3b0
> > [ 5840.250802]  [<ffffffff812831df>] SyS_sendfile64+0x6f/0xd0
> > [ 5840.256922]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> > [ 5840.263042]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
Hmm, this round of trinity triggered a different hang.

[ 2094.403119] INFO: task trinity-c0:3126 blocked for more than 120 seconds.
[ 2094.410705]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2094.417027] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2094.425770] trinity-c0      D ffff88044efc3d10 13472  3126   3124 0x00000084
[ 2094.433659]  ffff88044efc3d10 ffffffff00000000 ffff880400000000 ffff880822b5e000
[ 2094.441965]  ffff88044c8b8000 ffff88044efc4000 ffff880443755670 ffff880443755658
[ 2094.450272]  ffffffff00000000 ffff88044c8b8000 ffff88044efc3d28 ffffffff817cdaaf
[ 2094.458572] Call Trace:
[ 2094.461312]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2094.466858]  [<ffffffff817d2782>] rwsem_down_write_failed+0x242/0x4b0
[ 2094.474049]  [<ffffffff817d25ac>] ? rwsem_down_write_failed+0x6c/0x4b0
[ 2094.481352]  [<ffffffff810fd5f2>] ? __lock_acquire+0x472/0x1990
[ 2094.487964]  [<ffffffff813e27b7>] call_rwsem_down_write_failed+0x17/0x30
[ 2094.495450]  [<ffffffff817d1bff>] down_write+0x5f/0x80
[ 2094.501190]  [<ffffffff8127e301>] ? chown_common.isra.12+0x131/0x1e0
[ 2094.508284]  [<ffffffff8127e301>] chown_common.isra.12+0x131/0x1e0
[ 2094.515177]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
[ 2094.521692]  [<ffffffff810cc367>] ? preempt_count_add+0x47/0xc0
[ 2094.528304]  [<ffffffff812a665f>] ? mnt_clone_write+0x3f/0x70
[ 2094.534723]  [<ffffffff8127faef>] SyS_fchown+0x8f/0xa0
[ 2094.540463]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2094.546588]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2094.553784] 2 locks held by trinity-c0/3126:
[ 2094.558552]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[ 2094.568240]  #1:  (&sb->s_type->i_mutex_key#17){++++++}, at: [<ffffffff8127e301>] chown_common.isra.12+0x131/0x1e0
[ 2094.579864] INFO: task trinity-c1:3127 blocked for more than 120 seconds.
[ 2094.587442]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2094.593761] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2094.602503] trinity-c1      D ffff88045a1bbd10 13312  3127   3124 0x00000084
[ 2094.610402]  ffff88045a1bbd10 ffff880443769fe8 ffff880400000000 ffff88046cefe000
[ 2094.618710]  ffff88044c8ba000 ffff88045a1bc000 ffff880443769fd0 ffff88045a1bbd40
[ 2094.627015]  ffff880443769fe8 ffff88044376a158 ffff88045a1bbd28 ffffffff817cdaaf
[ 2094.635321] Call Trace:
[ 2094.638053]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2094.643597]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[ 2094.650726]  [<ffffffffa0322cca>] ? xfs_file_fsync+0xea/0x2e0 [xfs]
[ 2094.657727]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[ 2094.665119]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[ 2094.671457]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[ 2094.677987]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2094.684324]  [<ffffffffa0322cca>] xfs_file_fsync+0xea/0x2e0 [xfs]
[ 2094.691133]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
[ 2094.697354]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
[ 2094.702896]  [<ffffffff812bdf40>] SyS_fsync+0x10/0x20
[ 2094.708528]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2094.714652]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2094.721844] 1 lock held by trinity-c1/3127:
[ 2094.726515]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2094.737181] INFO: task trinity-c2:3128 blocked for more than 120 seconds.
[ 2094.744751]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2094.751068] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2094.759810] trinity-c2      D ffff8804574f3df8 13472  3128   3124 0x00000084
[ 2094.767692]  ffff8804574f3df8 0000000000000006 0000000000000000 ffff8804569a4000
[ 2094.776002]  ffff88044c8bc000 ffff8804574f4000 ffff8804622eb338 ffff88044c8bc000
[ 2094.784307]  0000000000000246 00000000ffffffff ffff8804574f3e10 ffffffff817cdaaf
[ 2094.792605] Call Trace:
[ 2094.795340]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2094.800886]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 2094.808078]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 2094.814688]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
[ 2094.820715]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[ 2094.826544]  [<ffffffff81297f53>] SyS_getdents+0x83/0x140
[ 2094.832573]  [<ffffffff81297cd0>] ? fillonedir+0x100/0x100
[ 2094.838699]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2094.844822]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2094.852013] 1 lock held by trinity-c2/3128:
[ 2094.856682]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[ 2094.865969] INFO: task trinity-c3:3129 blocked for more than 120 seconds.
[ 2094.873547]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2094.879864] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2094.888606] trinity-c3      D ffff880455ce3e08 13440  3129   3124 0x00000084
[ 2094.896495]  ffff880455ce3e08 0000000000000006 0000000000000000 ffff88045144e000
[ 2094.904803]  ffff88044c8be000 ffff880455ce4000 ffff8804622eb338 ffff88044c8be000
[ 2094.913111]  0000000000000246 00000000ffffffff ffff880455ce3e20 ffffffff817cdaaf
[ 2094.921418] Call Trace:
[ 2094.924152]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2094.929695]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 2094.936885]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 2094.943496]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
[ 2094.949526]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
[ 2094.956620]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[ 2094.962454]  [<ffffffff81298091>] SyS_getdents64+0x81/0x130
[ 2094.968675]  [<ffffffff81297a80>] ? iterate_dir+0x190/0x190
[ 2094.974895]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2094.981019]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2094.988204] 1 lock held by trinity-c3/3129:
[ 2094.992872]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[ 2095.002158] INFO: task trinity-c4:3130 blocked for more than 120 seconds.
[ 2095.009734]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2095.016052] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2095.024793] trinity-c4      D ffff880458997e28 13392  3130   3124 0x00000084
[ 2095.032690]  ffff880458997e28 0000000000000006 0000000000000000 ffff88046ca18000
[ 2095.040995]  ffff880458998000 ffff880458998000 ffff8804622eb338 ffff880458998000
[ 2095.049342]  0000000000000246 00000000ffffffff ffff880458997e40 ffffffff817cdaaf
[ 2095.057650] Call Trace:
[ 2095.060382]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2095.065926]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[ 2095.073118]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[ 2095.079728]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
[ 2095.085757]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[ 2095.091589]  [<ffffffff812811dd>] SyS_lseek+0x1d/0xb0
[ 2095.097229]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2095.103355]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2095.110547] 1 lock held by trinity-c4/3130:
[ 2095.115216]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[ 2095.124507] INFO: task trinity-c5:3131 blocked for more than 120 seconds.
[ 2095.132083]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2095.138402] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2095.147135] trinity-c5      D ffff88045a12bae0 13472  3131   3124 0x00000084
[ 2095.155034]  ffff88045a12bae0 ffff880443769fe8 ffff880400000000 ffff88046ca1a000
[ 2095.163339]  ffff88045899a000 ffff88045a12c000 ffff880443769fd0 ffff88045a12bb10
[ 2095.171645]  ffff880443769fe8 0000000000000000 ffff88045a12baf8 ffffffff817cdaaf
[ 2095.179952] Call Trace:
[ 2095.182684]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2095.188230]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[ 2095.195341]  [<ffffffffa03337d4>] ? xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
[ 2095.203310]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[ 2095.210696]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[ 2095.217029]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[ 2095.223558]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2095.229894]  [<ffffffffa03337d4>] xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
[ 2095.237682]  [<ffffffffa02ccfaf>] xfs_attr_get+0xdf/0x1b0 [xfs]
[ 2095.244312]  [<ffffffffa0341bfc>] xfs_xattr_get+0x4c/0x70 [xfs]
[ 2095.250924]  [<ffffffff812ad269>] generic_getxattr+0x59/0x70
[ 2095.257244]  [<ffffffff812acf9b>] vfs_getxattr+0x8b/0xb0
[ 2095.263177]  [<ffffffffa0435bd6>] ovl_xattr_get+0x46/0x60 [overlay]
[ 2095.270176]  [<ffffffffa04331aa>] ovl_other_xattr_get+0x1a/0x20 [overlay]
[ 2095.277756]  [<ffffffff812ad269>] generic_getxattr+0x59/0x70
[ 2095.284079]  [<ffffffff81345e9e>] cap_inode_need_killpriv+0x2e/0x40
[ 2095.291078]  [<ffffffff81349a33>] security_inode_need_killpriv+0x33/0x50
[ 2095.298560]  [<ffffffff812a2fb0>] dentry_needs_remove_privs+0x30/0x50
[ 2095.305743]  [<ffffffff8127ea21>] do_truncate+0x51/0xc0
[ 2095.311581]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
[ 2095.318094]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
[ 2095.324609]  [<ffffffff8127edde>] do_sys_ftruncate.constprop.15+0xfe/0x160
[ 2095.332286]  [<ffffffff8127ee7e>] SyS_ftruncate+0xe/0x10
[ 2095.338225]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2095.344339]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2095.351531] 2 locks held by trinity-c5/3131:
[ 2095.356297]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[ 2095.365983]  #1:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2095.376647] INFO: task trinity-c6:3132 blocked for more than 120 seconds.
[ 2095.384216]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2095.390535] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2095.399275] trinity-c6      D ffff88044da5fd30 13312  3132   3124 0x00000084
[ 2095.407177]  ffff88044da5fd30 ffffffff00000000 ffff880400000000 ffff880459858000
[ 2095.415485]  ffff88045899c000 ffff88044da60000 ffff880443755670 ffff880443755658
[ 2095.423789]  ffffffff00000000 ffff88045899c000 ffff88044da5fd48 ffffffff817cdaaf
[ 2095.432094] Call Trace:
[ 2095.434825]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2095.440372]  [<ffffffff817d2782>] rwsem_down_write_failed+0x242/0x4b0
[ 2095.447565]  [<ffffffff817d25ac>] ? rwsem_down_write_failed+0x6c/0x4b0
[ 2095.454854]  [<ffffffff813e27b7>] call_rwsem_down_write_failed+0x17/0x30
[ 2095.462337]  [<ffffffff817d1bff>] down_write+0x5f/0x80
[ 2095.468077]  [<ffffffff8127e413>] ? chmod_common+0x63/0x150
[ 2095.474300]  [<ffffffff8127e413>] chmod_common+0x63/0x150
[ 2095.480327]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
[ 2095.487421]  [<ffffffff810035cc>] ? syscall_trace_enter+0x1dc/0x390
[ 2095.494418]  [<ffffffff8127f5f2>] SyS_fchmod+0x52/0x80
[ 2095.500155]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2095.506270]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2095.513452] 2 locks held by trinity-c6/3132:
[ 2095.518217]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[ 2095.527895]  #1:  (&sb->s_type->i_mutex_key#17){++++++}, at: [<ffffffff8127e413>] chmod_common+0x63/0x150
[ 2095.538648] INFO: task trinity-c7:3133 blocked for more than 120 seconds.
[ 2095.546227]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2095.552544] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2095.561288] trinity-c7      D ffff88044d393d10 13472  3133   3124 0x00000084
[ 2095.569188]  ffff88044d393d10 ffff880443769fe8 ffff880400000000 ffff88086ce68000
[ 2095.577491]  ffff88045899e000 ffff88044d394000 ffff880443769fd0 ffff88044d393d40
[ 2095.585796]  ffff880443769fe8 ffff88044376a158 ffff88044d393d28 ffffffff817cdaaf
[ 2095.594103] Call Trace:
[ 2095.596836]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2095.602379]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[ 2095.609491]  [<ffffffffa0322cca>] ? xfs_file_fsync+0xea/0x2e0 [xfs]
[ 2095.616490]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[ 2095.623877]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[ 2095.630212]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[ 2095.636740]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2095.643076]  [<ffffffffa0322cca>] xfs_file_fsync+0xea/0x2e0 [xfs]
[ 2095.649889]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
[ 2095.656109]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
[ 2095.661653]  [<ffffffff812bdf40>] SyS_fsync+0x10/0x20
[ 2095.667291]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2095.673417]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2095.680610] 1 lock held by trinity-c7/3133:
[ 2095.685281]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2095.695947] INFO: task trinity-c8:3135 blocked for more than 120 seconds.
[ 2095.703530]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2095.709848] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2095.718590] trinity-c8      D ffff88044d3b3d10 12912  3135   3124 0x00000084
[ 2095.726470]  ffff88044d3b3d10 ffff880443769fe8 ffff880400000000 ffff88046ca30000
[ 2095.734775]  ffff88044d3a8000 ffff88044d3b4000 ffff880443769fd0 ffff88044d3b3d40
[ 2095.743083]  ffff880443769fe8 ffff88044376a158 ffff88044d3b3d28 ffffffff817cdaaf
[ 2095.751387] Call Trace:
[ 2095.754119]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2095.759662]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[ 2095.766772]  [<ffffffffa0322cca>] ? xfs_file_fsync+0xea/0x2e0 [xfs]
[ 2095.773763]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[ 2095.781148]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[ 2095.787482]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[ 2095.794013]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2095.800347]  [<ffffffffa0322cca>] xfs_file_fsync+0xea/0x2e0 [xfs]
[ 2095.807155]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
[ 2095.813377]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
[ 2095.818921]  [<ffffffff812bdf63>] SyS_fdatasync+0x13/0x20
[ 2095.824949]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2095.831074]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2095.838261] 1 lock held by trinity-c8/3135:
[ 2095.842930]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2095.853588] INFO: task trinity-c9:3136 blocked for more than 120 seconds.
[ 2095.861167]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[ 2095.867485] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2095.876228] trinity-c9      D ffff88045b3679e0 13328  3136   3124 0x00000084
[ 2095.884111]  ffff88045b3679e0 ffff880443769fe8 ffff880400000000 ffff88086ce56000
[ 2095.892417]  ffff88044d3aa000 ffff88045b368000 ffff880443769fd0 ffff88045b367a10
[ 2095.900721]  ffff880443769fe8 ffff88044376a1e8 ffff88045b3679f8 ffffffff817cdaaf
[ 2095.909024] Call Trace:
[ 2095.911761]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 2095.917305]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[ 2095.924414]  [<ffffffffa0333790>] ? xfs_ilock_data_map_shared+0x30/0x40 [xfs]
[ 2095.932383]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[ 2095.939768]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[ 2095.946104]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[ 2095.952632]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
[ 2095.958968]  [<ffffffffa0333790>] xfs_ilock_data_map_shared+0x30/0x40 [xfs]
[ 2095.966752]  [<ffffffffa03128c6>] __xfs_get_blocks+0x96/0x9d0 [xfs]
[ 2095.973753]  [<ffffffff8126462e>] ? mem_cgroup_event_ratelimit.isra.39+0x3e/0xb0
[ 2095.982012]  [<ffffffff8126e8e5>] ? mem_cgroup_commit_charge+0x95/0x110
[ 2095.989413]  [<ffffffffa0313214>] xfs_get_blocks+0x14/0x20 [xfs]
[ 2095.996122]  [<ffffffff812cca44>] do_mpage_readpage+0x474/0x800
[ 2096.002745]  [<ffffffffa0313200>] ? __xfs_get_blocks+0x9d0/0x9d0 [xfs]
[ 2096.010037]  [<ffffffff81402fd7>] ? debug_smp_processor_id+0x17/0x20
[ 2096.017136]  [<ffffffff811f3565>] ? __lru_cache_add+0x75/0xb0
[ 2096.023551]  [<ffffffff811f45fe>] ? lru_cache_add+0xe/0x10
[ 2096.029678]  [<ffffffff812ccf0d>] mpage_readpages+0x13d/0x1b0
[ 2096.036109]  [<ffffffffa0313200>] ? __xfs_get_blocks+0x9d0/0x9d0 [xfs]
[ 2096.043420]  [<ffffffffa0313200>] ? __xfs_get_blocks+0x9d0/0x9d0 [xfs]
[ 2096.050724]  [<ffffffffa0311f14>] xfs_vm_readpages+0x54/0x170 [xfs]
[ 2096.057724]  [<ffffffff811f1a1d>] __do_page_cache_readahead+0x2ad/0x370
[ 2096.065113]  [<ffffffff811f18ec>] ? __do_page_cache_readahead+0x17c/0x370
[ 2096.072693]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
[ 2096.079787]  [<ffffffff811f2014>] force_page_cache_readahead+0x94/0xf0
[ 2096.087077]  [<ffffffff811f2168>] SyS_readahead+0xa8/0xc0
[ 2096.093106]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[ 2096.099234]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2096.106427] 1 lock held by trinity-c9/3136:
[ 2096.111097]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-07  9:57                                                                         ` Dave Chinner
@ 2016-10-07 15:25                                                                           ` Linus Torvalds
  0 siblings, 0 replies; 152+ messages in thread
From: Linus Torvalds @ 2016-10-07 15:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: CAI Qian, Al Viro, tj, linux-xfs, Jens Axboe, Nick Piggin, linux-fsdevel

On Fri, Oct 7, 2016 at 2:57 AM, Dave Chinner <david@fromorbit.com> wrote:
>
> So it's not "fixed" and instead I'm ignoring it.  If you spend any
> amount of time running lockdep on XFS you'll get as sick and tired
> of playing this whack-a-lockdep-false-positive game as I am.

Thanks for the background here. I'll try to remember it for the next
time this comes up; it doesn't help that lockdep reports are often a
bit cryptic to begin with (that "interrupt" thing certainly didn't
help).

             Linus

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-07 14:43                                                                       ` CAI Qian
@ 2016-10-07 15:27                                                                         ` CAI Qian
  2016-10-07 18:56                                                                           ` CAI Qian
  2016-10-09 21:51                                                                         ` Dave Chinner
  1 sibling, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-10-07 15:27 UTC (permalink / raw)
  To: Jan Kara, Miklos Szeredi, tj, Al Viro, Linus Torvalds, Dave Chinner
  Cc: linux-xfs, Jens Axboe, Nick Piggin, linux-fsdevel, Dave Jones



> Hmm, this round of trinity triggered a different hang.
So far this hang is reproducible with the command below on an overlayfs/xfs setup:

$ trinity -g vfs --arch 64 --disable-fds=sockets --disable-fds=perf --disable-fds=epoll
  --disable-fds=eventfd --disable-fds=pseudo --disable-fds=timerfd --disable-fds=memfd
  --disable-fds=drm
> 
> [ 2094.403119] INFO: task trinity-c0:3126 blocked for more than 120 seconds.
> [ 2094.410705]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2094.417027] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2094.425770] trinity-c0      D ffff88044efc3d10 13472  3126   3124
> 0x00000084
> [ 2094.433659]  ffff88044efc3d10 ffffffff00000000 ffff880400000000
> ffff880822b5e000
> [ 2094.441965]  ffff88044c8b8000 ffff88044efc4000 ffff880443755670
> ffff880443755658
> [ 2094.450272]  ffffffff00000000 ffff88044c8b8000 ffff88044efc3d28
> ffffffff817cdaaf
> [ 2094.458572] Call Trace:
> [ 2094.461312]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2094.466858]  [<ffffffff817d2782>] rwsem_down_write_failed+0x242/0x4b0
> [ 2094.474049]  [<ffffffff817d25ac>] ? rwsem_down_write_failed+0x6c/0x4b0
> [ 2094.481352]  [<ffffffff810fd5f2>] ? __lock_acquire+0x472/0x1990
> [ 2094.487964]  [<ffffffff813e27b7>] call_rwsem_down_write_failed+0x17/0x30
> [ 2094.495450]  [<ffffffff817d1bff>] down_write+0x5f/0x80
> [ 2094.501190]  [<ffffffff8127e301>] ? chown_common.isra.12+0x131/0x1e0
> [ 2094.508284]  [<ffffffff8127e301>] chown_common.isra.12+0x131/0x1e0
> [ 2094.515177]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
> [ 2094.521692]  [<ffffffff810cc367>] ? preempt_count_add+0x47/0xc0
> [ 2094.528304]  [<ffffffff812a665f>] ? mnt_clone_write+0x3f/0x70
> [ 2094.534723]  [<ffffffff8127faef>] SyS_fchown+0x8f/0xa0
> [ 2094.540463]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2094.546588]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2094.553784] 2 locks held by trinity-c0/3126:
> [ 2094.558552]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>]
> __sb_start_write+0xd1/0xf0
> [ 2094.568240]  #1:  (&sb->s_type->i_mutex_key#17){++++++}, at:
> [<ffffffff8127e301>] chown_common.isra.12+0x131/0x1e0
> [ 2094.579864] INFO: task trinity-c1:3127 blocked for more than 120 seconds.
> [ 2094.587442]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2094.593761] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2094.602503] trinity-c1      D ffff88045a1bbd10 13312  3127   3124
> 0x00000084
> [ 2094.610402]  ffff88045a1bbd10 ffff880443769fe8 ffff880400000000
> ffff88046cefe000
> [ 2094.618710]  ffff88044c8ba000 ffff88045a1bc000 ffff880443769fd0
> ffff88045a1bbd40
> [ 2094.627015]  ffff880443769fe8 ffff88044376a158 ffff88045a1bbd28
> ffffffff817cdaaf
> [ 2094.635321] Call Trace:
> [ 2094.638053]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2094.643597]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2094.650726]  [<ffffffffa0322cca>] ? xfs_file_fsync+0xea/0x2e0 [xfs]
> [ 2094.657727]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
> [ 2094.665119]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
> [ 2094.671457]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
> [ 2094.677987]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2094.684324]  [<ffffffffa0322cca>] xfs_file_fsync+0xea/0x2e0 [xfs]
> [ 2094.691133]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
> [ 2094.697354]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
> [ 2094.702896]  [<ffffffff812bdf40>] SyS_fsync+0x10/0x20
> [ 2094.708528]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2094.714652]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2094.721844] 1 lock held by trinity-c1/3127:
> [ 2094.726515]  #0:  (&xfs_nondir_ilock_class){++++..}, at:
> [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2094.737181] INFO: task trinity-c2:3128 blocked for more than 120 seconds.
> [ 2094.744751]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2094.751068] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2094.759810] trinity-c2      D ffff8804574f3df8 13472  3128   3124
> 0x00000084
> [ 2094.767692]  ffff8804574f3df8 0000000000000006 0000000000000000
> ffff8804569a4000
> [ 2094.776002]  ffff88044c8bc000 ffff8804574f4000 ffff8804622eb338
> ffff88044c8bc000
> [ 2094.784307]  0000000000000246 00000000ffffffff ffff8804574f3e10
> ffffffff817cdaaf
> [ 2094.792605] Call Trace:
> [ 2094.795340]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2094.800886]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
> [ 2094.808078]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
> [ 2094.814688]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
> [ 2094.820715]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
> [ 2094.826544]  [<ffffffff81297f53>] SyS_getdents+0x83/0x140
> [ 2094.832573]  [<ffffffff81297cd0>] ? fillonedir+0x100/0x100
> [ 2094.838699]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2094.844822]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2094.852013] 1 lock held by trinity-c2/3128:
> [ 2094.856682]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>]
> __fdget_pos+0x43/0x50
> [ 2094.865969] INFO: task trinity-c3:3129 blocked for more than 120 seconds.
> [ 2094.873547]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2094.879864] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2094.888606] trinity-c3      D ffff880455ce3e08 13440  3129   3124
> 0x00000084
> [ 2094.896495]  ffff880455ce3e08 0000000000000006 0000000000000000
> ffff88045144e000
> [ 2094.904803]  ffff88044c8be000 ffff880455ce4000 ffff8804622eb338
> ffff88044c8be000
> [ 2094.913111]  0000000000000246 00000000ffffffff ffff880455ce3e20
> ffffffff817cdaaf
> [ 2094.921418] Call Trace:
> [ 2094.924152]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2094.929695]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
> [ 2094.936885]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
> [ 2094.943496]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
> [ 2094.949526]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
> [ 2094.956620]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
> [ 2094.962454]  [<ffffffff81298091>] SyS_getdents64+0x81/0x130
> [ 2094.968675]  [<ffffffff81297a80>] ? iterate_dir+0x190/0x190
> [ 2094.974895]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2094.981019]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2094.988204] 1 lock held by trinity-c3/3129:
> [ 2094.992872]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>]
> __fdget_pos+0x43/0x50
> [ 2095.002158] INFO: task trinity-c4:3130 blocked for more than 120 seconds.
> [ 2095.009734]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2095.016052] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2095.024793] trinity-c4      D ffff880458997e28 13392  3130   3124
> 0x00000084
> [ 2095.032690]  ffff880458997e28 0000000000000006 0000000000000000
> ffff88046ca18000
> [ 2095.040995]  ffff880458998000 ffff880458998000 ffff8804622eb338
> ffff880458998000
> [ 2095.049342]  0000000000000246 00000000ffffffff ffff880458997e40
> ffffffff817cdaaf
> [ 2095.057650] Call Trace:
> [ 2095.060382]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2095.065926]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
> [ 2095.073118]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
> [ 2095.079728]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
> [ 2095.085757]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
> [ 2095.091589]  [<ffffffff812811dd>] SyS_lseek+0x1d/0xb0
> [ 2095.097229]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2095.103355]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2095.110547] 1 lock held by trinity-c4/3130:
> [ 2095.115216]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>]
> __fdget_pos+0x43/0x50
> [ 2095.124507] INFO: task trinity-c5:3131 blocked for more than 120 seconds.
> [ 2095.132083]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2095.138402] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2095.147135] trinity-c5      D ffff88045a12bae0 13472  3131   3124
> 0x00000084
> [ 2095.155034]  ffff88045a12bae0 ffff880443769fe8 ffff880400000000
> ffff88046ca1a000
> [ 2095.163339]  ffff88045899a000 ffff88045a12c000 ffff880443769fd0
> ffff88045a12bb10
> [ 2095.171645]  ffff880443769fe8 0000000000000000 ffff88045a12baf8
> ffffffff817cdaaf
> [ 2095.179952] Call Trace:
> [ 2095.182684]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2095.188230]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2095.195341]  [<ffffffffa03337d4>] ? xfs_ilock_attr_map_shared+0x34/0x40
> [xfs]
> [ 2095.203310]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
> [ 2095.210696]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
> [ 2095.217029]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.223558]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.229894]  [<ffffffffa03337d4>] xfs_ilock_attr_map_shared+0x34/0x40
> [xfs]
> [ 2095.237682]  [<ffffffffa02ccfaf>] xfs_attr_get+0xdf/0x1b0 [xfs]
> [ 2095.244312]  [<ffffffffa0341bfc>] xfs_xattr_get+0x4c/0x70 [xfs]
> [ 2095.250924]  [<ffffffff812ad269>] generic_getxattr+0x59/0x70
> [ 2095.257244]  [<ffffffff812acf9b>] vfs_getxattr+0x8b/0xb0
> [ 2095.263177]  [<ffffffffa0435bd6>] ovl_xattr_get+0x46/0x60 [overlay]
> [ 2095.270176]  [<ffffffffa04331aa>] ovl_other_xattr_get+0x1a/0x20 [overlay]
> [ 2095.277756]  [<ffffffff812ad269>] generic_getxattr+0x59/0x70
> [ 2095.284079]  [<ffffffff81345e9e>] cap_inode_need_killpriv+0x2e/0x40
> [ 2095.291078]  [<ffffffff81349a33>] security_inode_need_killpriv+0x33/0x50
> [ 2095.298560]  [<ffffffff812a2fb0>] dentry_needs_remove_privs+0x30/0x50
> [ 2095.305743]  [<ffffffff8127ea21>] do_truncate+0x51/0xc0
> [ 2095.311581]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
> [ 2095.318094]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
> [ 2095.324609]  [<ffffffff8127edde>] do_sys_ftruncate.constprop.15+0xfe/0x160
> [ 2095.332286]  [<ffffffff8127ee7e>] SyS_ftruncate+0xe/0x10
> [ 2095.338225]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2095.344339]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2095.351531] 2 locks held by trinity-c5/3131:
> [ 2095.356297]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>]
> __sb_start_write+0xd1/0xf0
> [ 2095.365983]  #1:  (&xfs_nondir_ilock_class){++++..}, at:
> [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.376647] INFO: task trinity-c6:3132 blocked for more than 120 seconds.
> [ 2095.384216]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2095.390535] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2095.399275] trinity-c6      D ffff88044da5fd30 13312  3132   3124
> 0x00000084
> [ 2095.407177]  ffff88044da5fd30 ffffffff00000000 ffff880400000000
> ffff880459858000
> [ 2095.415485]  ffff88045899c000 ffff88044da60000 ffff880443755670
> ffff880443755658
> [ 2095.423789]  ffffffff00000000 ffff88045899c000 ffff88044da5fd48
> ffffffff817cdaaf
> [ 2095.432094] Call Trace:
> [ 2095.434825]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2095.440372]  [<ffffffff817d2782>] rwsem_down_write_failed+0x242/0x4b0
> [ 2095.447565]  [<ffffffff817d25ac>] ? rwsem_down_write_failed+0x6c/0x4b0
> [ 2095.454854]  [<ffffffff813e27b7>] call_rwsem_down_write_failed+0x17/0x30
> [ 2095.462337]  [<ffffffff817d1bff>] down_write+0x5f/0x80
> [ 2095.468077]  [<ffffffff8127e413>] ? chmod_common+0x63/0x150
> [ 2095.474300]  [<ffffffff8127e413>] chmod_common+0x63/0x150
> [ 2095.480327]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
> [ 2095.487421]  [<ffffffff810035cc>] ? syscall_trace_enter+0x1dc/0x390
> [ 2095.494418]  [<ffffffff8127f5f2>] SyS_fchmod+0x52/0x80
> [ 2095.500155]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2095.506270]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2095.513452] 2 locks held by trinity-c6/3132:
> [ 2095.518217]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>]
> __sb_start_write+0xd1/0xf0
> [ 2095.527895]  #1:  (&sb->s_type->i_mutex_key#17){++++++}, at:
> [<ffffffff8127e413>] chmod_common+0x63/0x150
> [ 2095.538648] INFO: task trinity-c7:3133 blocked for more than 120 seconds.
> [ 2095.546227]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2095.552544] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 2095.561288] trinity-c7      D ffff88044d393d10 13472  3133   3124
> 0x00000084
> [ 2095.569188]  ffff88044d393d10 ffff880443769fe8 ffff880400000000
> ffff88086ce68000
> [ 2095.577491]  ffff88045899e000 ffff88044d394000 ffff880443769fd0
> ffff88044d393d40
> [ 2095.585796]  ffff880443769fe8 ffff88044376a158 ffff88044d393d28
> ffffffff817cdaaf
> [ 2095.594103] Call Trace:
> [ 2095.596836]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2095.602379]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2095.609491]  [<ffffffffa0322cca>] ? xfs_file_fsync+0xea/0x2e0 [xfs]
> [ 2095.616490]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
> [ 2095.623877]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
> [ 2095.630212]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.636740]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.643076]  [<ffffffffa0322cca>] xfs_file_fsync+0xea/0x2e0 [xfs]
> [ 2095.649889]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
> [ 2095.656109]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
> [ 2095.661653]  [<ffffffff812bdf40>] SyS_fsync+0x10/0x20
> [ 2095.667291]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2095.673417]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2095.680610] 1 lock held by trinity-c7/3133:
> [ 2095.685281]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.695947] INFO: task trinity-c8:3135 blocked for more than 120 seconds.
> [ 2095.703530]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2095.709848] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2095.718590] trinity-c8      D ffff88044d3b3d10 12912  3135   3124 0x00000084
> [ 2095.726470]  ffff88044d3b3d10 ffff880443769fe8 ffff880400000000 ffff88046ca30000
> [ 2095.734775]  ffff88044d3a8000 ffff88044d3b4000 ffff880443769fd0 ffff88044d3b3d40
> [ 2095.743083]  ffff880443769fe8 ffff88044376a158 ffff88044d3b3d28 ffffffff817cdaaf
> [ 2095.751387] Call Trace:
> [ 2095.754119]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2095.759662]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2095.766772]  [<ffffffffa0322cca>] ? xfs_file_fsync+0xea/0x2e0 [xfs]
> [ 2095.773763]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
> [ 2095.781148]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
> [ 2095.787482]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.794013]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.800347]  [<ffffffffa0322cca>] xfs_file_fsync+0xea/0x2e0 [xfs]
> [ 2095.807155]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
> [ 2095.813377]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
> [ 2095.818921]  [<ffffffff812bdf63>] SyS_fdatasync+0x13/0x20
> [ 2095.824949]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2095.831074]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2095.838261] 1 lock held by trinity-c8/3135:
> [ 2095.842930]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.853588] INFO: task trinity-c9:3136 blocked for more than 120 seconds.
> [ 2095.861167]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
> [ 2095.867485] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2095.876228] trinity-c9      D ffff88045b3679e0 13328  3136   3124 0x00000084
> [ 2095.884111]  ffff88045b3679e0 ffff880443769fe8 ffff880400000000 ffff88086ce56000
> [ 2095.892417]  ffff88044d3aa000 ffff88045b368000 ffff880443769fd0 ffff88045b367a10
> [ 2095.900721]  ffff880443769fe8 ffff88044376a1e8 ffff88045b3679f8 ffffffff817cdaaf
> [ 2095.909024] Call Trace:
> [ 2095.911761]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> [ 2095.917305]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2095.924414]  [<ffffffffa0333790>] ? xfs_ilock_data_map_shared+0x30/0x40 [xfs]
> [ 2095.932383]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
> [ 2095.939768]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
> [ 2095.946104]  [<ffffffffa03335fa>] ? xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.952632]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.958968]  [<ffffffffa0333790>] xfs_ilock_data_map_shared+0x30/0x40 [xfs]
> [ 2095.966752]  [<ffffffffa03128c6>] __xfs_get_blocks+0x96/0x9d0 [xfs]
> [ 2095.973753]  [<ffffffff8126462e>] ? mem_cgroup_event_ratelimit.isra.39+0x3e/0xb0
> [ 2095.982012]  [<ffffffff8126e8e5>] ? mem_cgroup_commit_charge+0x95/0x110
> [ 2095.989413]  [<ffffffffa0313214>] xfs_get_blocks+0x14/0x20 [xfs]
> [ 2095.996122]  [<ffffffff812cca44>] do_mpage_readpage+0x474/0x800
> [ 2096.002745]  [<ffffffffa0313200>] ? __xfs_get_blocks+0x9d0/0x9d0 [xfs]
> [ 2096.010037]  [<ffffffff81402fd7>] ? debug_smp_processor_id+0x17/0x20
> [ 2096.017136]  [<ffffffff811f3565>] ? __lru_cache_add+0x75/0xb0
> [ 2096.023551]  [<ffffffff811f45fe>] ? lru_cache_add+0xe/0x10
> [ 2096.029678]  [<ffffffff812ccf0d>] mpage_readpages+0x13d/0x1b0
> [ 2096.036109]  [<ffffffffa0313200>] ? __xfs_get_blocks+0x9d0/0x9d0 [xfs]
> [ 2096.043420]  [<ffffffffa0313200>] ? __xfs_get_blocks+0x9d0/0x9d0 [xfs]
> [ 2096.050724]  [<ffffffffa0311f14>] xfs_vm_readpages+0x54/0x170 [xfs]
> [ 2096.057724]  [<ffffffff811f1a1d>] __do_page_cache_readahead+0x2ad/0x370
> [ 2096.065113]  [<ffffffff811f18ec>] ? __do_page_cache_readahead+0x17c/0x370
> [ 2096.072693]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
> [ 2096.079787]  [<ffffffff811f2014>] force_page_cache_readahead+0x94/0xf0
> [ 2096.087077]  [<ffffffff811f2168>] SyS_readahead+0xa8/0xc0
> [ 2096.093106]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2096.099234]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2096.106427] 1 lock held by trinity-c9/3136:
> [ 2096.111097]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
>

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-07 15:27                                                                         ` CAI Qian
@ 2016-10-07 18:56                                                                           ` CAI Qian
  2016-10-09 21:54                                                                             ` Dave Chinner
  0 siblings, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-10-07 18:56 UTC (permalink / raw)
  To: Jan Kara, Miklos Szeredi, tj, Al Viro, Linus Torvalds, Dave Chinner
  Cc: linux-xfs, Jens Axboe, Nick Piggin, linux-fsdevel, Dave Jones



----- Original Message -----
> From: "CAI Qian" <caiqian@redhat.com>
> To: "Jan Kara" <jack@suse.cz>, "Miklos Szeredi" <miklos@szeredi.hu>, "tj" <tj@kernel.org>, "Al Viro"
> <viro@ZenIV.linux.org.uk>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>
> Cc: "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org, "Dave Jones" <davej@codemonkey.org.uk>
> Sent: Friday, October 7, 2016 11:27:55 AM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> 
> 
> > Hmm, this round of trinity triggered a different hang.
> This hang is reproducible so far with the command below on an overlayfs/xfs,
Another data point is that this hang can also be reproduced using device-mapper thinp
as the docker backend.
    CAI Qian

[12047.714409] INFO: task trinity-c0:3716 blocked for more than 120 seconds.
[12047.722033]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12047.728354] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12047.737107] trinity-c0      D ffff8804507dbd10 13552  3716   3713 0x00000084
[12047.744997]  ffff8804507dbd10 ffff8804240e9368 ffff880400000000 ffffffff81c0d540
[12047.753300]  ffff88044c430000 ffff8804507dc000 ffff8804240e9350 ffff8804507dbd40
[12047.761598]  ffff8804240e9368 ffff8804240e94d8 ffff8804507dbd28 ffffffff817cdaaf
[12047.769898] Call Trace:
[12047.772631]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12047.778174]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[12047.785303]  [<ffffffffa028ccca>] ? xfs_file_fsync+0xea/0x2e0 [xfs]
[12047.792309]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[12047.799695]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[12047.806029]  [<ffffffffa029d5fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[12047.812554]  [<ffffffffa029d5fa>] xfs_ilock+0xfa/0x260 [xfs]
[12047.818887]  [<ffffffffa028ccca>] xfs_file_fsync+0xea/0x2e0 [xfs]
[12047.825693]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
[12047.831915]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
[12047.837455]  [<ffffffff812bdf63>] SyS_fdatasync+0x13/0x20
[12047.843485]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12047.849609]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12047.856801] 1 lock held by trinity-c0/3716:
[12047.861470]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa029d5fa>] xfs_ilock+0xfa/0x260 [xfs]
[12047.872125] INFO: task trinity-c1:3717 blocked for more than 120 seconds.
[12047.879703]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12047.886011] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12047.894749] trinity-c1      D ffff8804507ffd10 13568  3717   3713 0x00000084
[12047.902645]  ffff8804507ffd10 ffff8804240e9368 ffff880400000000 ffff88046c9da000
[12047.910941]  ffff88044c434000 ffff880450800000 ffff8804240e9350 ffff8804507ffd40
[12047.919240]  ffff8804240e9368 ffff8804240e94d8 ffff8804507ffd28 ffffffff817cdaaf
[12047.927542] Call Trace:
[12047.930284]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12047.935826]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[12047.942933]  [<ffffffffa028ccca>] ? xfs_file_fsync+0xea/0x2e0 [xfs]
[12047.949930]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[12047.957315]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[12047.963647]  [<ffffffffa029d5fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[12047.970171]  [<ffffffffa029d5fa>] xfs_ilock+0xfa/0x260 [xfs]
[12047.976506]  [<ffffffffa028ccca>] xfs_file_fsync+0xea/0x2e0 [xfs]
[12047.983310]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
[12047.989529]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
[12047.995070]  [<ffffffff812bdf63>] SyS_fdatasync+0x13/0x20
[12048.001096]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12048.007217]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12048.014407] 1 lock held by trinity-c1/3717:
[12048.019085]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa029d5fa>] xfs_ilock+0xfa/0x260 [xfs]
[12048.029742] INFO: task trinity-c2:3718 blocked for more than 120 seconds.
[12048.037310]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12048.043626] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12048.052365] trinity-c2      D ffff8804586c7df8 13504  3718   3713 0x00000084
[12048.060261]  ffff8804586c7df8 0000000000000006 0000000000000000 ffff88046c9dc000
[12048.068565]  ffff88044c436000 ffff8804586c8000 ffff88044ec7e6f8 ffff88044c436000
[12048.076862]  0000000000000246 00000000ffffffff ffff8804586c7e10 ffffffff817cdaaf
[12048.085163] Call Trace:
[12048.087893]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12048.093434]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[12048.100627]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[12048.107237]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
[12048.113262]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[12048.119094]  [<ffffffff81297f53>] SyS_getdents+0x83/0x140
[12048.125120]  [<ffffffff81297cd0>] ? fillonedir+0x100/0x100
[12048.131243]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12048.137357]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12048.144546] 1 lock held by trinity-c2/3718:
[12048.149214]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[12048.158495] INFO: task trinity-c3:3719 blocked for more than 120 seconds.
[12048.166071]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12048.172388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12048.181120] trinity-c3      D ffff880450707c60 13552  3719   3713 0x00000084
[12048.189013]  ffff880450707c60 ffffffff00000000 ffff880400000000 ffff88046ca10000
[12048.197313]  ffff88044c432000 ffff880450708000 ffff8804240e9658 ffff8804240e9640
[12048.205612]  ffffffff00000000 ffff88044c432000 ffff880450707c78 ffffffff817cdaaf
[12048.213912] Call Trace:
[12048.216643]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12048.222183]  [<ffffffff817d2782>] rwsem_down_write_failed+0x242/0x4b0
[12048.229374]  [<ffffffff817d25ac>] ? rwsem_down_write_failed+0x6c/0x4b0
[12048.236662]  [<ffffffff813e27b7>] call_rwsem_down_write_failed+0x17/0x30
[12048.244144]  [<ffffffff817d1bff>] down_write+0x5f/0x80
[12048.249881]  [<ffffffff812ad021>] ? vfs_removexattr+0x61/0x120
[12048.256391]  [<ffffffff812ad021>] vfs_removexattr+0x61/0x120
[12048.262709]  [<ffffffff812ad135>] removexattr+0x55/0x80
[12048.268533]  [<ffffffff81402ff3>] ? __this_cpu_preempt_check+0x13/0x20
[12048.275811]  [<ffffffff810f8eae>] ? update_fast_ctr+0x4e/0x70
[12048.282225]  [<ffffffff810f8f57>] ? percpu_down_read+0x57/0x90
[12048.288728]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
[12048.295230]  [<ffffffff810cc367>] ? preempt_count_add+0x47/0xc0
[12048.301829]  [<ffffffff812a665f>] ? mnt_clone_write+0x3f/0x70
[12048.308242]  [<ffffffff812a8588>] ? __mnt_want_write_file+0x18/0x30
[12048.315238]  [<ffffffff812a85d0>] ? mnt_want_write_file+0x30/0x60
[12048.322039]  [<ffffffff812ae303>] SyS_fremovexattr+0x83/0xb0
[12048.328356]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12048.334478]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12048.341679] 2 locks held by trinity-c3/3719:
[12048.346454]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[12048.356042]  #1:  (&sb->s_type->i_mutex_key#14){+.+.+.}, at: [<ffffffff812ad021>] vfs_removexattr+0x61/0x120
[12048.367079] INFO: task trinity-c4:3720 blocked for more than 120 seconds.
[12048.374655]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12048.380972] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12048.389712] trinity-c4      D ffff88045072be08 13536  3720   3713 0x00000084
[12048.397606]  ffff88045072be08 0000000000000006 0000000000000000 ffff88046c9fe000
[12048.405902]  ffff880450720000 ffff88045072c000 ffff88044ec7e6f8 ffff880450720000
[12048.414205]  0000000000000246 00000000ffffffff ffff88045072be20 ffffffff817cdaaf
[12048.422505] Call Trace:
[12048.425235]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12048.430767]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[12048.437957]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[12048.444565]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
[12048.450591]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
[12048.457675]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[12048.463508]  [<ffffffff81298091>] SyS_getdents64+0x81/0x130
[12048.469720]  [<ffffffff81297a80>] ? iterate_dir+0x190/0x190
[12048.475939]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12048.482063]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12048.489243] 1 lock held by trinity-c4/3720:
[12048.493913]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[12048.503182] INFO: task trinity-c5:3721 blocked for more than 120 seconds.
[12048.510757]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12048.517071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12048.525812] trinity-c5      D ffff8804510a7e08 13552  3721   3713 0x00000084
[12048.533706]  ffff8804510a7e08 0000000000000006 0000000000000000 ffff88046c9fa000
[12048.542007]  ffff880450722000 ffff8804510a8000 ffff88044ec7e6f8 ffff880450722000
[12048.550310]  0000000000000246 00000000ffffffff ffff8804510a7e20 ffffffff817cdaaf
[12048.558610] Call Trace:
[12048.561339]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12048.566879]  [<ffffffff817cdf18>] schedule_preempt_disabled+0x18/0x30
[12048.574070]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
[12048.580677]  [<ffffffff812a5313>] ? __fdget_pos+0x43/0x50
[12048.586703]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
[12048.593796]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[12048.599629]  [<ffffffff81298091>] SyS_getdents64+0x81/0x130
[12048.605849]  [<ffffffff81297a80>] ? iterate_dir+0x190/0x190
[12048.612069]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12048.618191]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12048.625382] 1 lock held by trinity-c5/3721:
[12048.630049]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50
[12048.639329] INFO: task trinity-c6:3722 blocked for more than 120 seconds.
[12048.646903]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12048.653219] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12048.661958] trinity-c6      D ffff88044f0ebc50 12224  3722   3713 0x00000084
[12048.669849]  ffff88044f0ebc50 ffff8804240e9368 ffff880400000000 ffff88046c9fc000
[12048.678149]  ffff880450724000 ffff88044f0ec000 ffff8804240e9350 ffff88044f0ebc80
[12048.686448]  ffff8804240e9368 ffff8804240e92c0 ffff88044f0ebc68 ffffffff817cdaaf
[12048.694750] Call Trace:
[12048.697478]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12048.703018]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[12048.710126]  [<ffffffffa029d7d4>] ? xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
[12048.718095]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[12048.725479]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[12048.731800]  [<ffffffffa029d5fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[12048.738337]  [<ffffffffa029d5fa>] xfs_ilock+0xfa/0x260 [xfs]
[12048.744669]  [<ffffffffa029d7d4>] xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
[12048.752457]  [<ffffffffa0280801>] xfs_attr_list_int+0x71/0x690 [xfs]
[12048.759555]  [<ffffffff810cba89>] ? __might_sleep+0x49/0x80
[12048.765792]  [<ffffffffa02abf2a>] xfs_vn_listxattr+0x7a/0xb0 [xfs]
[12048.772707]  [<ffffffffa02abcc0>] ? __xfs_xattr_put_listent+0xa0/0xa0 [xfs]
[12048.780480]  [<ffffffff812ad582>] vfs_listxattr+0x42/0x70
[12048.786517]  [<ffffffff812ad68e>] listxattr+0xde/0xf0
[12048.792156]  [<ffffffff812ae1f6>] SyS_flistxattr+0x56/0xa0
[12048.798271]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12048.804404]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12048.811595] 1 lock held by trinity-c6/3722:
[12048.816263]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa029d5fa>] xfs_ilock+0xfa/0x260 [xfs]
[12048.826935] INFO: task trinity-c7:3723 blocked for more than 120 seconds.
[12048.834516]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12048.840832] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12048.849572] trinity-c7      D ffff88044fc23c50 13552  3723   3713 0x00000084
[12048.857469]  ffff88044fc23c50 ffff8804240e9368 ffff880400000000 ffff88046c9f8000
[12048.865768]  ffff880450726000 ffff88044fc24000 ffff8804240e9350 ffff88044fc23c80
[12048.874067]  ffff8804240e9368 ffff8804240e92c0 ffff88044fc23c68 ffffffff817cdaaf
[12048.882370] Call Trace:
[12048.885100]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12048.890634]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
[12048.897741]  [<ffffffffa029d7d4>] ? xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
[12048.905707]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
[12048.913081]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
[12048.919412]  [<ffffffffa029d5fa>] ? xfs_ilock+0xfa/0x260 [xfs]
[12048.925937]  [<ffffffffa029d5fa>] xfs_ilock+0xfa/0x260 [xfs]
[12048.932267]  [<ffffffffa029d7d4>] xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
[12048.940053]  [<ffffffffa0280801>] xfs_attr_list_int+0x71/0x690 [xfs]
[12048.947146]  [<ffffffff810cba89>] ? __might_sleep+0x49/0x80
[12048.953374]  [<ffffffffa02abf2a>] xfs_vn_listxattr+0x7a/0xb0 [xfs]
[12048.960288]  [<ffffffffa02abcc0>] ? __xfs_xattr_put_listent+0xa0/0xa0 [xfs]
[12048.968060]  [<ffffffff812ad582>] vfs_listxattr+0x42/0x70
[12048.974088]  [<ffffffff812ad602>] listxattr+0x52/0xf0
[12048.979726]  [<ffffffff812ae1f6>] SyS_flistxattr+0x56/0xa0
[12048.985849]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12048.991973]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12048.999162] 1 lock held by trinity-c7/3723:
[12049.003831]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa029d5fa>] xfs_ilock+0xfa/0x260 [xfs]
[12049.014481] INFO: task trinity-c8:3724 blocked for more than 120 seconds.
[12049.022072]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12049.028389] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12049.037130] trinity-c8      D ffff88044fc3fc60 13504  3724   3713 0x00000084
[12049.045023]  ffff88044fc3fc60 ffffffff00000000 ffff880400000000 ffff88046ca14000
[12049.053324]  ffff88044e540000 ffff88044fc40000 ffff8804240e9368 ffff8804240e9350
[12049.061623]  ffffffff00000000 ffff88044e540000 ffff88044fc3fc78 ffffffff817cdaaf
[12049.069924] Call Trace:
[12049.072654]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12049.078208]  [<ffffffff817d2782>] rwsem_down_write_failed+0x242/0x4b0
[12049.085408]  [<ffffffff817d25ac>] ? rwsem_down_write_failed+0x6c/0x4b0
[12049.092734]  [<ffffffffa028d7cc>] ? xfs_update_prealloc_flags+0x6c/0x100 [xfs]
[12049.100798]  [<ffffffff813e27b7>] call_rwsem_down_write_failed+0x17/0x30
[12049.108290]  [<ffffffff810f8c15>] down_write_nested+0x65/0x80
[12049.114742]  [<ffffffffa029d68e>] ? xfs_ilock+0x18e/0x260 [xfs]
[12049.121377]  [<ffffffffa029d68e>] xfs_ilock+0x18e/0x260 [xfs]
[12049.127819]  [<ffffffffa028d7cc>] xfs_update_prealloc_flags+0x6c/0x100 [xfs]
[12049.135714]  [<ffffffffa028da8e>] xfs_file_fallocate+0x22e/0x360 [xfs]
[12049.143004]  [<ffffffff810f8eae>] ? update_fast_ctr+0x4e/0x70
[12049.149435]  [<ffffffff810f8f57>] ? percpu_down_read+0x57/0x90
[12049.155958]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
[12049.162492]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
[12049.169016]  [<ffffffff8127e000>] vfs_fallocate+0x140/0x230
[12049.175249]  [<ffffffff8127eee4>] SyS_fallocate+0x44/0x70
[12049.181288]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12049.187423]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12049.194667] 5 locks held by trinity-c8/3724:
[12049.199429]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[12049.209024]  #1:  (&(&ip->i_iolock)->mr_lock){++++++}, at: [<ffffffffa029d654>] xfs_ilock+0x154/0x260 [xfs]
[12049.219990]  #2:  (&(&ip->i_mmaplock)->mr_lock){+++++.}, at: [<ffffffffa029d674>] xfs_ilock+0x174/0x260 [xfs]
[12049.231128]  #3:  (sb_internal){.+.+.+}, at: [<ffffffff81284b8b>] __sb_start_write+0x7b/0xf0
[12049.240620]  #4:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa029d68e>] xfs_ilock+0x18e/0x260 [xfs]
[12049.251383] INFO: task trinity-c9:3725 blocked for more than 120 seconds.
[12049.258959]       Not tainted 4.8.0-rc8-splice-fixw-proc+ #4
[12049.265287] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12049.274027] trinity-c9      D ffff88044f043d30 13552  3725   3713 0x00000084
[12049.281922]  ffff88044f043d30 ffffffff00000000 ffff880400000000 ffff88046ca14000
[12049.290238]  ffff88044e542000 ffff88044f044000 ffff8804240e9658 ffff8804240e9640
[12049.298539]  ffffffff00000000 ffff88044e542000 ffff88044f043d48 ffffffff817cdaaf
[12049.306840] Call Trace:
[12049.309569]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[12049.315122]  [<ffffffff817d2782>] rwsem_down_write_failed+0x242/0x4b0
[12049.322327]  [<ffffffff817d25ac>] ? rwsem_down_write_failed+0x6c/0x4b0
[12049.329625]  [<ffffffff813e27b7>] call_rwsem_down_write_failed+0x17/0x30
[12049.337118]  [<ffffffff817d1bff>] down_write+0x5f/0x80
[12049.342864]  [<ffffffff8127e413>] ? chmod_common+0x63/0x150
[12049.349096]  [<ffffffff8127e413>] chmod_common+0x63/0x150
[12049.355131]  [<ffffffff8117729f>] ? __audit_syscall_entry+0xaf/0x100
[12049.362236]  [<ffffffff810035cc>] ? syscall_trace_enter+0x1dc/0x390
[12049.369243]  [<ffffffff8127f5f2>] SyS_fchmod+0x52/0x80
[12049.374988]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
[12049.381124]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
[12049.388324] 2 locks held by trinity-c9/3725:
[12049.393100]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
[12049.402705]  #1:  (&sb->s_type->i_mutex_key#14){+.+.+.}, at: [<ffffffff8127e413>] chmod_common+0x63/0x150

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-07 14:43                                                                       ` CAI Qian
  2016-10-07 15:27                                                                         ` CAI Qian
@ 2016-10-09 21:51                                                                         ` Dave Chinner
  1 sibling, 0 replies; 152+ messages in thread
From: Dave Chinner @ 2016-10-09 21:51 UTC (permalink / raw)
  To: CAI Qian
  Cc: Jan Kara, Al Viro, tj, Linus Torvalds, linux-xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel, Miklos Szeredi, Dave Jones

On Fri, Oct 07, 2016 at 10:43:18AM -0400, CAI Qian wrote:
> Hmm, this round of trinity triggered a different hang.
> 
> [ 2094.487964]  [<ffffffff813e27b7>] call_rwsem_down_write_failed+0x17/0x30
> [ 2094.495450]  [<ffffffff817d1bff>] down_write+0x5f/0x80
> [ 2094.508284]  [<ffffffff8127e301>] chown_common.isra.12+0x131/0x1e0
> [ 2094.553784] 2 locks held by trinity-c0/3126:
> [ 2094.558552]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
> [ 2094.568240]  #1:  (&sb->s_type->i_mutex_key#17){++++++}, at: [<ffffffff8127e301>] chown_common.isra.12+0x131/0x1e0

Waiting on i_mutex.

> [ 2094.643597]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2094.665119]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
> [ 2094.691133]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
> [ 2094.721844] 1 lock held by trinity-c1/3127:
> [ 2094.726515]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]

Waiting on i_ilock.

> [ 2094.808078]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
> [ 2094.820715]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
> [ 2094.826544]  [<ffffffff81297f53>] SyS_getdents+0x83/0x140
> [ 2094.856682]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50

Concurrent readdir on the same directory fd, blocked on the fd's f_pos_lock.
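
For reference, what trinity is doing here amounts to two threads sharing
one directory fd; they serialize on that fd's f_pos_lock. A minimal
userspace sketch (illustrative only, not taken from the trinity logs):

#include <fcntl.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

static int dfd;	/* one directory fd shared by both threads */

static void *reader(void *arg)
{
	char buf[4096];

	/* each getdents64() call takes f->f_pos_lock in the kernel, so
	 * the two threads simply serialize against each other */
	while (syscall(SYS_getdents64, dfd, buf, sizeof(buf)) > 0)
		;
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	dfd = open("/tmp", O_RDONLY | O_DIRECTORY);	/* any directory */
	if (dfd < 0)
		return 1;
	pthread_create(&t1, NULL, reader, NULL);
	pthread_create(&t2, NULL, reader, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}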

> [ 2094.936885]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
> [ 2094.956620]  [<ffffffff812a5313>] __fdget_pos+0x43/0x50
> [ 2094.962454]  [<ffffffff81298091>] SyS_getdents64+0x81/0x130
> [ 2094.988204] 1 lock held by trinity-c3/3129:
> [ 2094.992872]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50

Same.

> [ 2095.073118]  [<ffffffff817cf4df>] mutex_lock_nested+0x19f/0x450
> [ 2095.091589]  [<ffffffff812811dd>] SyS_lseek+0x1d/0xb0
> [ 2095.097229]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2095.110547] 1 lock held by trinity-c4/3130:
> [ 2095.115216]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff812a5313>] __fdget_pos+0x43/0x50

Concurrent lseek on directory fd, blocked on fd.


> [ 2095.188230]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2095.223558]  [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]
> [ 2095.229894]  [<ffffffffa03337d4>] xfs_ilock_attr_map_shared+0x34/0x40 [xfs]
> [ 2095.237682]  [<ffffffffa02ccfaf>] xfs_attr_get+0xdf/0x1b0 [xfs]
> [ 2095.244312]  [<ffffffffa0341bfc>] xfs_xattr_get+0x4c/0x70 [xfs]
> [ 2095.250924]  [<ffffffff812ad269>] generic_getxattr+0x59/0x70
> [ 2095.257244]  [<ffffffff812acf9b>] vfs_getxattr+0x8b/0xb0
> [ 2095.263177]  [<ffffffffa0435bd6>] ovl_xattr_get+0x46/0x60 [overlay]
> [ 2095.270176]  [<ffffffffa04331aa>] ovl_other_xattr_get+0x1a/0x20 [overlay]
> [ 2095.277756]  [<ffffffff812ad269>] generic_getxattr+0x59/0x70
> [ 2095.284079]  [<ffffffff81345e9e>] cap_inode_need_killpriv+0x2e/0x40
> [ 2095.291078]  [<ffffffff81349a33>] security_inode_need_killpriv+0x33/0x50
> [ 2095.298560]  [<ffffffff812a2fb0>] dentry_needs_remove_privs+0x30/0x50
> [ 2095.305743]  [<ffffffff8127ea21>] do_truncate+0x51/0xc0
> [ 2095.311581]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
> [ 2095.318094]  [<ffffffff81284be1>] ? __sb_start_write+0xd1/0xf0
> [ 2095.324609]  [<ffffffff8127edde>] do_sys_ftruncate.constprop.15+0xfe/0x160
> [ 2095.332286]  [<ffffffff8127ee7e>] SyS_ftruncate+0xe/0x10
> [ 2095.338225]  [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> [ 2095.344339]  [<ffffffff817d4a3f>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2095.351531] 2 locks held by trinity-c5/3131:
> [ 2095.356297]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
> [ 2095.365983]  #1:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]

Truncate on overlay, removing xattrs from the XFS file, blocked on
i_ilock.
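
For reference, this chain is reachable from a plain ftruncate(2) on a file
over overlayfs; a minimal sketch (the mount point and file name are
illustrative):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical overlayfs mount point */
	int fd = open("/mnt/overlay/testfile", O_WRONLY);

	if (fd < 0)
		return 1;
	/* do_truncate() asks whether setuid/caps must be dropped, which
	 * walks into ovl_xattr_get() -> xfs_attr_get() and ends up
	 * waiting on the XFS i_ilock shown in the trace above */
	ftruncate(fd, 0);
	close(fd);
	return 0;
}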

> [ 2095.440372]  [<ffffffff817d2782>] rwsem_down_write_failed+0x242/0x4b0
> [ 2095.474300]  [<ffffffff8127e413>] chmod_common+0x63/0x150
> [ 2095.513452] 2 locks held by trinity-c6/3132:
> [ 2095.518217]  #0:  (sb_writers#14){.+.+.+}, at: [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
> [ 2095.527895]  #1:  (&sb->s_type->i_mutex_key#17){++++++}, at: [<ffffffff8127e413>] chmod_common+0x63/0x150

Chmod, blocked on i_mutex.

> [ 2095.602379]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2095.616490]  [<ffffffff813e2788>] call_rwsem_down_read_failed+0x18/0x30
> [ 2095.623877]  [<ffffffff810f8b0b>] down_read_nested+0x5b/0x80
> [ 2095.649889]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
> [ 2095.680610] 1 lock held by trinity-c7/3133:
> [ 2095.685281]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]

Fsync on file, blocked on i_ilock.
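
The userspace trigger here is just fsync(2); a minimal sketch (file name
is illustrative):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/test/file", O_WRONLY);	/* hypothetical path */

	if (fd < 0)
		return 1;
	/* vfs_fsync_range() -> xfs_file_fsync() takes the inode's i_ilock
	 * shared, which is where trinity-c7 is shown waiting above */
	fsync(fd);
	close(fd);
	return 0;
}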

> [ 2095.759662]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2095.807155]  [<ffffffff812bdbbd>] vfs_fsync_range+0x3d/0xb0
> [ 2095.813377]  [<ffffffff812bdc8d>] do_fsync+0x3d/0x70
> [ 2095.818921]  [<ffffffff812bdf63>] SyS_fdatasync+0x13/0x20
> [ 2095.838261] 1 lock held by trinity-c8/3135:
> [ 2095.842930]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]

Ditto.

> [ 2095.917305]  [<ffffffff817d24b7>] rwsem_down_read_failed+0x107/0x190
> [ 2095.958968]  [<ffffffffa0333790>] xfs_ilock_data_map_shared+0x30/0x40 [xfs]
> [ 2095.966752]  [<ffffffffa03128c6>] __xfs_get_blocks+0x96/0x9d0 [xfs]
> [ 2095.989413]  [<ffffffffa0313214>] xfs_get_blocks+0x14/0x20 [xfs]
> [ 2095.996122]  [<ffffffff812cca44>] do_mpage_readpage+0x474/0x800
> [ 2096.029678]  [<ffffffff812ccf0d>] mpage_readpages+0x13d/0x1b0
> [ 2096.050724]  [<ffffffffa0311f14>] xfs_vm_readpages+0x54/0x170 [xfs]
> [ 2096.057724]  [<ffffffff811f1a1d>] __do_page_cache_readahead+0x2ad/0x370
> [ 2096.079787]  [<ffffffff811f2014>] force_page_cache_readahead+0x94/0xf0
> [ 2096.087077]  [<ffffffff811f2168>] SyS_readahead+0xa8/0xc0
> [ 2096.106427] 1 lock held by trinity-c9/3136:
> [ 2096.111097]  #0:  (&xfs_nondir_ilock_class){++++..}, at: [<ffffffffa03335fa>] xfs_ilock+0xfa/0x260 [xfs]

Readahead blocking on i_ilock before reading in extents.
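
That path is driven by the readahead(2) syscall; a minimal sketch (file
name is illustrative):

#define _GNU_SOURCE	/* readahead() */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/test/file", O_RDONLY);	/* hypothetical path */

	if (fd < 0)
		return 1;
	/* force_page_cache_readahead() -> xfs_vm_readpages() needs the
	 * extent map, hence the xfs_ilock_data_map_shared() call above */
	readahead(fd, 0, 1 << 20);
	close(fd);
	return 0;
}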

Nothing here indicates a deadlock. Everything is waiting for locks,
but nothing is holding locks in a way that indicates that progress
is not being made. This sort of thing can happen when slow storage
is massively overloaded - sysrq-w is really the only way to get a
better picture of what is happening here, but so far there's no
concrete evidence of a hang from this output.
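
For completeness, sysrq-w can also be fired without console access,
assuming sysrq is enabled via /proc/sys/kernel/sysrq; a minimal sketch:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* 'w' dumps all tasks in uninterruptible (blocked) state */
	int fd = open("/proc/sysrq-trigger", O_WRONLY);

	if (fd < 0)
		return 1;
	write(fd, "w", 1);
	close(fd);
	return 0;
}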

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-07 18:56                                                                           ` CAI Qian
@ 2016-10-09 21:54                                                                             ` Dave Chinner
  2016-10-10 14:10                                                                               ` CAI Qian
  0 siblings, 1 reply; 152+ messages in thread
From: Dave Chinner @ 2016-10-09 21:54 UTC (permalink / raw)
  To: CAI Qian
  Cc: Jan Kara, Miklos Szeredi, tj, Al Viro, Linus Torvalds, linux-xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel, Dave Jones

On Fri, Oct 07, 2016 at 02:56:22PM -0400, CAI Qian wrote:
> 
> 
> ----- Original Message -----
> > From: "CAI Qian" <caiqian@redhat.com>
> > To: "Jan Kara" <jack@suse.cz>, "Miklos Szeredi" <miklos@szeredi.hu>, "tj" <tj@kernel.org>, "Al Viro"
> > <viro@ZenIV.linux.org.uk>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>
> > Cc: "linux-xfs" <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> > linux-fsdevel@vger.kernel.org, "Dave Jones" <davej@codemonkey.org.uk>
> > Sent: Friday, October 7, 2016 11:27:55 AM
> > Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> > 
> > 
> > 
> > > Hmm, this round of trinity triggered a different hang.
> > This hang is reproducible so far with the command below on an overlayfs/xfs,
> Another data point is that this hang can also be reproduced using device-mapper thinp
> as the docker backend.

Again, no evidence that the system is actually hung. Waiting on
locks, yes, but nothing to indicate there is a deadlock in those
waiters.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-09 21:54                                                                             ` Dave Chinner
@ 2016-10-10 14:10                                                                               ` CAI Qian
  2016-10-10 20:14                                                                                 ` CAI Qian
  2016-10-10 21:57                                                                                 ` Dave Chinner
  0 siblings, 2 replies; 152+ messages in thread
From: CAI Qian @ 2016-10-10 14:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Miklos Szeredi, tj, Al Viro, Linus Torvalds, linux-xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel, Dave Jones



----- Original Message -----
> From: "Dave Chinner" <david@fromorbit.com>
> To: "CAI Qian" <caiqian@redhat.com>
> Cc: "Jan Kara" <jack@suse.cz>, "Miklos Szeredi" <miklos@szeredi.hu>, "tj" <tj@kernel.org>, "Al Viro"
> <viro@ZenIV.linux.org.uk>, "Linus Torvalds" <torvalds@linux-foundation.org>, "linux-xfs"
> <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> linux-fsdevel@vger.kernel.org, "Dave Jones" <davej@codemonkey.org.uk>
> Sent: Sunday, October 9, 2016 5:54:55 PM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> Again, no evidence that the system is actually hung. Waiting on
> locks, yes, but nothing to indicate there is a deadlock in those
> waiters.
Here you are,

http://people.redhat.com/qcai/tmp/dmesg

    CAI Qian

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-10 14:10                                                                               ` CAI Qian
@ 2016-10-10 20:14                                                                                 ` CAI Qian
  2016-10-10 21:57                                                                                 ` Dave Chinner
  1 sibling, 0 replies; 152+ messages in thread
From: CAI Qian @ 2016-10-10 20:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Miklos Szeredi, tj, Al Viro, Linus Torvalds, linux-xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel, Dave Jones


> Here you are,
> 
> http://people.redhat.com/qcai/tmp/dmesg
Also, this turned out to be a regression, and bisecting so far has pointed to this commit,

commit 5d50ac70fe98518dbf620bfba8184254663125eb
Merge: 31c1feb 4e14e49
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed Nov 11 20:18:48 2015 -0800

    Merge tag 'xfs-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/g
    
    Pull xfs updates from Dave Chinner:
     "There is nothing really major here - the only significant addition is
      the per-mount operation statistics infrastructure.  Otherwise there's
      various ACL, xattr, DAX, AIO and logging fixes, and a smattering of
      small cleanups and fixes elsewhere.
    
      Summary:
    
       - per-mount operational statistics in sysfs
       - fixes for concurrent aio append write submission
       - various logging fixes
       - detection of zeroed logs and invalid log sequence numbers on v5 filesys
       - memory allocation failure message improvements
       - a bunch of xattr/ACL fixes
       - fdatasync optimisation
       - miscellaneous other fixes and cleanups"

    * tag 'xfs-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/d
      xfs: give all workqueues rescuer threads
      xfs: fix log recovery op header validation assert
      xfs: Fix error path in xfs_get_acl
      xfs: optimise away log forces on timestamp updates for fdatasync
      xfs: don't leak uuid table on rmmod
      xfs: invalidate cached acl if set via ioctl
      xfs: Plug memory leak in xfs_attrmulti_attr_set
      xfs: Validate the length of on-disk ACLs
      xfs: invalidate cached acl if set directly via xattr
      xfs: xfs_filemap_pmd_fault treats read faults as write faults
      xfs: add ->pfn_mkwrite support for DAX
      xfs: DAX does not use IO completion callbacks
      xfs: Don't use unwritten extents for DAX
      xfs: introduce BMAPI_ZERO for allocating zeroed extents
      xfs: fix inode size update overflow in xfs_map_direct()
      xfs: clear PF_NOFREEZE for xfsaild kthread
      xfs: fix an error code in xfs_fs_fill_super()
      xfs: stats are no longer dependent on CONFIG_PROC_FS
      xfs: simplify /proc teardown & error handling
      xfs: per-filesystem stats counter implementation
      ...

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-10 14:10                                                                               ` CAI Qian
  2016-10-10 20:14                                                                                 ` CAI Qian
@ 2016-10-10 21:57                                                                                 ` Dave Chinner
  2016-10-12 19:50                                                                                   ` [bisected] " CAI Qian
  1 sibling, 1 reply; 152+ messages in thread
From: Dave Chinner @ 2016-10-10 21:57 UTC (permalink / raw)
  To: CAI Qian
  Cc: Jan Kara, Miklos Szeredi, tj, Al Viro, Linus Torvalds, linux-xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel, Dave Jones

On Mon, Oct 10, 2016 at 10:10:29AM -0400, CAI Qian wrote:
> 
> 
> ----- Original Message -----
> > From: "Dave Chinner" <david@fromorbit.com>
> > To: "CAI Qian" <caiqian@redhat.com>
> > Cc: "Jan Kara" <jack@suse.cz>, "Miklos Szeredi" <miklos@szeredi.hu>, "tj" <tj@kernel.org>, "Al Viro"
> > <viro@ZenIV.linux.org.uk>, "Linus Torvalds" <torvalds@linux-foundation.org>, "linux-xfs"
> > <linux-xfs@vger.kernel.org>, "Jens Axboe" <axboe@kernel.dk>, "Nick Piggin" <npiggin@gmail.com>,
> > linux-fsdevel@vger.kernel.org, "Dave Jones" <davej@codemonkey.org.uk>
> > Sent: Sunday, October 9, 2016 5:54:55 PM
> > Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> > 
> > Again, no evidence that the system is actually hung. Waiting on
> > locks, yes, but nothing to indicate there is a deadlock in those
> > waiters.
> Here you are,
> 
> http://people.redhat.com/qcai/tmp/dmesg

It's a page lock order bug in the XFS seek hole/data implementation.
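
For reference, that implementation backs lseek(2)'s SEEK_HOLE/SEEK_DATA;
a minimal sketch of the userspace side (file name is illustrative):

#define _GNU_SOURCE	/* SEEK_DATA / SEEK_HOLE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/test/file", O_RDONLY);	/* hypothetical path */
	off_t data, hole;

	if (fd < 0)
		return 1;
	data = lseek(fd, 0, SEEK_DATA);		/* start of first data extent */
	hole = lseek(fd, data, SEEK_HOLE);	/* end of that extent */
	printf("data at %lld, hole at %lld\n", (long long)data, (long long)hole);
	close(fd);
	return 0;
}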

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 152+ messages in thread

* [bisected] Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-10 21:57                                                                                 ` Dave Chinner
@ 2016-10-12 19:50                                                                                   ` CAI Qian
  2016-10-12 20:59                                                                                     ` Dave Chinner
  0 siblings, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-10-12 19:50 UTC (permalink / raw)
  To: Dave Chinner, Sage Weil, Brian Foster
  Cc: Jan Kara, Miklos Szeredi, tj, Al Viro, Linus Torvalds, linux-xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel, Dave Jones



----- Original Message -----
> From: "Dave Chinner" <david@fromorbit.com>
> Sent: Monday, October 10, 2016 5:57:14 PM
> 
> > http://people.redhat.com/qcai/tmp/dmesg
> 
> It's a page lock order bug in the XFS seek hole/data implementation.
So reverting this commit against the latest mainline allows trinity to
run for hours. Otherwise, it always hangs at fdatasync() within 30 minutes.

fc0561cefc04e7803c0f6501ca4f310a502f65b8
xfs: optimise away log forces on timestamp updates for fdatasync

PS: testing against the vfs tree's #work.splice_read with this commit
reverted now hangs at sync() instead, which has not been reproduced
against the mainline so far.
http://people.redhat.com/qcai/tmp/dmesg-sync

   CAI Qian

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [bisected] Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-12 19:50                                                                                   ` [bisected] " CAI Qian
@ 2016-10-12 20:59                                                                                     ` Dave Chinner
  2016-10-13 16:25                                                                                       ` CAI Qian
  0 siblings, 1 reply; 152+ messages in thread
From: Dave Chinner @ 2016-10-12 20:59 UTC (permalink / raw)
  To: CAI Qian
  Cc: Sage Weil, Brian Foster, Jan Kara, Miklos Szeredi, tj, Al Viro,
	Linus Torvalds, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel, Dave Jones

On Wed, Oct 12, 2016 at 03:50:36PM -0400, CAI Qian wrote:
> 
> 
> ----- Original Message -----
> > From: "Dave Chinner" <david@fromorbit.com>
> > Sent: Monday, October 10, 2016 5:57:14 PM
> > 
> > > http://people.redhat.com/qcai/tmp/dmesg
> > 
> > It's a page lock order bug in the XFS seek hole/data implementation.
> So reverting this commit against the latest mainline allows trinity to
> run for hours. Otherwise, it always hangs at fdatasync() within 30 minutes.
> 
> fc0561cefc04e7803c0f6501ca4f310a502f65b8
> xfs: optimise away log forces on timestamp updates for fdatasync

Has nothing at all to do with the hang.

> PS: testing against the vfs tree's #work.splice_read with this commit
> reverted now hangs at sync() instead, which has not been reproduced
> against the mainline so far.
> http://people.redhat.com/qcai/tmp/dmesg-sync

It is the same page lock vs seek hole/data issue.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [bisected] Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-12 20:59                                                                                     ` Dave Chinner
@ 2016-10-13 16:25                                                                                       ` CAI Qian
  2016-10-13 20:49                                                                                         ` Dave Chinner
  0 siblings, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-10-13 16:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Sage Weil, Brian Foster, Jan Kara, Miklos Szeredi, tj, Al Viro,
	Linus Torvalds, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel, Dave Jones



----- Original Message -----
> From: "Dave Chinner" <david@fromorbit.com>
> Sent: Wednesday, October 12, 2016 4:59:01 PM
> Subject: Re: [bisected] Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> On Wed, Oct 12, 2016 at 03:50:36PM -0400, CAI Qian wrote:
> > 
> > 
> > ----- Original Message -----
> > > From: "Dave Chinner" <david@fromorbit.com>
> > > Sent: Monday, October 10, 2016 5:57:14 PM
> > > 
> > > > http://people.redhat.com/qcai/tmp/dmesg
> > > 
> > > It's a page lock order bug in the XFS seek hole/data implementation.
> > So reverting this commit against the latest mainline allows trinity to
> > run for hours. Otherwise, it always hangs at fdatasync() within 30 minutes.
> > 
> > fc0561cefc04e7803c0f6501ca4f310a502f65b8
> > xfs: optimise away log forces on timestamp updates for fdatasync
> 
> Has nothing at all to do with the hang.
> 
> > PS: testing against the vfs tree's #work.splice_read with this commit
> > reverted now hangs at sync() instead, which has not been reproduced
> > against the mainline so far.
> > http://people.redhat.com/qcai/tmp/dmesg-sync
> 
> It is the same page lock vs seek hole/data issue.
FYI, CVE-2016-8660 was assigned for it.
   CAI Qian

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [bisected] Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-13 16:25                                                                                       ` CAI Qian
@ 2016-10-13 20:49                                                                                         ` Dave Chinner
  2016-10-13 20:56                                                                                           ` CAI Qian
  0 siblings, 1 reply; 152+ messages in thread
From: Dave Chinner @ 2016-10-13 20:49 UTC (permalink / raw)
  To: CAI Qian
  Cc: Sage Weil, Brian Foster, Jan Kara, Miklos Szeredi, tj, Al Viro,
	Linus Torvalds, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel, Dave Jones

On Thu, Oct 13, 2016 at 12:25:30PM -0400, CAI Qian wrote:
> 
> 
> ----- Original Message -----
> > From: "Dave Chinner" <david@fromorbit.com>
> > Sent: Wednesday, October 12, 2016 4:59:01 PM
> > Subject: Re: [bisected] Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> > 
> > On Wed, Oct 12, 2016 at 03:50:36PM -0400, CAI Qian wrote:
> > > 
> > > 
> > > ----- Original Message -----
> > > > From: "Dave Chinner" <david@fromorbit.com>
> > > > Sent: Monday, October 10, 2016 5:57:14 PM
> > > > 
> > > > > http://people.redhat.com/qcai/tmp/dmesg
> > > > 
> > > > It's a page lock order bug in the XFS seek hole/data implementation.
> > > So reverting this commit against the latest mainline allows trinity to
> > > run for hours. Otherwise, it always hangs at fdatasync() within 30 minutes.
> > > 
> > > fc0561cefc04e7803c0f6501ca4f310a502f65b8
> > > xfs: optimise away log forces on timestamp updates for fdatasync
> > 
> > Has nothing at all to do with the hang.
> > 
> > > PS: testing against the vfs tree's #work.splice_read with this commit
> > > reverted now hangs at sync() instead, which has not been reproduced
> > > against the mainline so far.
> > > http://people.redhat.com/qcai/tmp/dmesg-sync
> > 
> > It is the same page lock vs seek hole/data issue.
> FYI, CVE-2016-8660 was assigned for it.

Why? This isn't a security issue - CVEs cost time and effort for
everyone to track and follow, and raising them for issues like this
does not help anyone fix the actual problem.  It doesn't help us
track it, analyse it, communicate with the bug reporter, test it or
get the fix committed.  It's meaningless to the developers fixing
the code, it's meaningless to users, and it's meaningless to most
distros that are supporting XFS because the distro maintainers don't
watch the CVE lists for XFS bugs they need to backport and fix.

All this does is artificially inflate the supposed importance of the
bug. CVEs are for security or severe issues. This is neither serious
nor a security issue - please have the common courtesy to ask the
people with the knowledge to make such a determination (i.e. the
maintainers) before you waste the time of a /large number/ of people
by raising a useless CVE...

Yes, you found a bug. No, it's not a security bug. No, you should
not abuse the CVE process to apply pressure to get it fixed.
Please don't do this again.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [bisected] Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
  2016-10-13 20:49                                                                                         ` Dave Chinner
@ 2016-10-13 20:56                                                                                           ` CAI Qian
  0 siblings, 0 replies; 152+ messages in thread
From: CAI Qian @ 2016-10-13 20:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Sage Weil, Brian Foster, Jan Kara, Miklos Szeredi, tj, Al Viro,
	Linus Torvalds, linux-xfs, Jens Axboe, Nick Piggin,
	linux-fsdevel, Dave Jones



----- Original Message -----
> From: "Dave Chinner" <david@fromorbit.com>
> Sent: Thursday, October 13, 2016 4:49:17 PM
> Subject: Re: [bisected] Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
>
> Why? This isn't a security issue - CVEs cost time and effort for
> everyone to track and follow and raising them for issues like this
> does not help anyone fix the actual problem.  It doesn't help us
> track it, analyse it, communicate with the bug reporter, test it or
> get the fix committed.  It's meaningless to the developers fixing
> the code, it's meaningless to users, and it's meaningless to most
> distros that are supporting XFS because the distro maintainers don't
> watch the CVE lists for XFS bugs they need to backport and fix.
> 
> All this does is artificially inflate the supposed importance of the
> bug. CVEs are for security or severe issues. This is neither serious
> or a security issue - please have the common courtesy to ask the
> people with the knowledge to make such a determination (i.e. the
> maintainers) before you waste the time of a /large number/ of people
> by raising a useless CVE...
> 
> Yes, you found a bug. No, it's not a security bug. No, you should
> not abusing of the CVE process to apply pressure to get it fixed.
> Please don't do this again.
As far as I can tell, this is a medium-severity security issue that a
non-privileged user can exploit to cause a system hang/deadlock.
Hence, it is a local DoS for other users of the system.
   CAI Qian

^ permalink raw reply	[flat|nested] 152+ messages in thread

* [4.9-rc1+] overlayfs lockdep
  2016-10-07  7:08                                                                     ` Jan Kara
  2016-10-07 14:43                                                                       ` CAI Qian
@ 2016-10-21 15:38                                                                       ` CAI Qian
  2016-10-24 12:57                                                                         ` Miklos Szeredi
  1 sibling, 1 reply; 152+ messages in thread
From: CAI Qian @ 2016-10-21 15:38 UTC (permalink / raw)
  To: Jan Kara, Miklos Szeredi; +Cc: Al Viro, Linus Torvalds, linux-fsdevel


----- Original Message -----
> From: "Jan Kara" <jack@suse.cz>
> Sent: Friday, October 7, 2016 3:08:38 AM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> 
> So I believe this may be just a problem in overlayfs lockdep annotation
> (see below). Added Miklos to CC.
> 
> > Wait. There is also a lockep happened before the xfs internal error as
> > well.
> > 
> > [ 5839.452325] ======================================================
> > [ 5839.459221] [ INFO: possible circular locking dependency detected ]
> > [ 5839.466215] 4.8.0-rc8-splice-fixw-proc+ #4 Not tainted
> > [ 5839.471945] -------------------------------------------------------
> > [ 5839.478937] trinity-c220/69531 is trying to acquire lock:
> > [ 5839.484961]  (&p->lock){+.+.+.}, at: [<ffffffff812ac69c>]
> > seq_read+0x4c/0x3e0
> > [ 5839.492967]
> > but task is already holding lock:
> > [ 5839.499476]  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>]
> > __sb_start_write+0xd1/0xf0
> > [ 5839.508560]
> > which lock already depends on the new lock.
> > 
> > [ 5839.517686]
> > the existing dependency chain (in reverse order) is:
> > [ 5839.526036]
> > -> #3 (sb_writers#8){.+.+.+}:
> > [ 5839.530751]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
> > [ 5839.537368]        [<ffffffff810f8f4a>] percpu_down_read+0x4a/0x90
> > [ 5839.544275]        [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
> > [ 5839.551181]        [<ffffffff812a8544>] mnt_want_write+0x24/0x50
> > [ 5839.557892]        [<ffffffffa04a398f>] ovl_want_write+0x1f/0x30
> > [overlay]
> > [ 5839.565577]        [<ffffffffa04a6036>] ovl_do_remove+0x46/0x480
> > [overlay]
> > [ 5839.573259]        [<ffffffffa04a64a3>] ovl_unlink+0x13/0x20 [overlay]
> > [ 5839.580555]        [<ffffffff812918ea>] vfs_unlink+0xda/0x190
> > [ 5839.586979]        [<ffffffff81293698>] do_unlinkat+0x268/0x2b0
> > [ 5839.593599]        [<ffffffff8129419b>] SyS_unlinkat+0x1b/0x30
> > [ 5839.600120]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
> > [ 5839.606836]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
> > [ 5839.614231]
> 
> So here is IMO the real culprit: do_unlinkat() grabs fs freeze protection
> through mnt_want_write(), then we also grab i_rwsem in do_unlinkat() in the
> I_MUTEX_PARENT class a bit after that, and further down in vfs_unlink() we
> grab i_rwsem for the unlinked inode itself in default I_MUTEX class. Then
> in ovl_want_write() we grab freeze protection again, but this time for the
> upper filesystem. That establishes sb_writers (overlay) -> I_MUTEX_PARENT
> (overlay) -> I_MUTEX (overlay) -> sb_writers (FS-A) lock ordering
> (we maintain locking classes per fs type so that's why I'm showing fs type
> in parentheses).
> 
> Now this nesting is nasty because once you add locks that are not tracked
> per fs type into the mix, you get cycles. In this case we've got
> seq_file->lock and cred_guard_mutex into the mix - the splice path is
> doing sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex (splicing
> from seq_file into the real filesystem). Exec path further establishes
> cred_guard_mutex -> I_MUTEX (overlay) which closes the full cycle:
> 
> sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex -> i_mutex
> (overlay) -> sb_writers (FS-A)
> 
> If I analyzed the lockdep trace correctly, this looks like a real (although remote)
> deadlock possibility. Miklos?

So this can still be reproduced in yesterday's mainline.

[40581.813575] [ INFO: possible circular locking dependency detected ]
[40581.813578] 4.9.0-rc1-lockfix-uncorev2+ #51 Tainted: G        W      
[40581.813581] -------------------------------------------------------
[40581.813582] trinity-c104/39795 is trying to acquire lock:
[40581.813587]  (&p->lock){+.+.+.}, at: [<ffffffff8191588c>] seq_read+0xec/0x1400
[40581.813603] 
[40581.813603] but task is already holding lock:
[40581.813605]  (sb_writers#8){.+.+.+}, at: [<ffffffff81889c6a>] do_sendfile+0x9ea/0x1270
[40581.813618] 
[40581.813618] which lock already depends on the new lock.
[40581.813618] 
[40581.813620] 
[40581.813620] the existing dependency chain (in reverse order) is:
[40581.813623] 
[40581.813623] -> #3 (sb_writers#8){.+.+.+}:
[40581.813636]        [<ffffffff8133dcda>] __lock_acquire+0x9aa/0x1710
[40581.813640]        [<ffffffff8133fd4e>] lock_acquire+0x24e/0x5d0
[40581.813644]        [<ffffffff8189037e>] __sb_start_write+0xae/0x360
[40581.813650]        [<ffffffff819066fa>] mnt_want_write+0x4a/0xc0
[40581.813661]        [<ffffffffa16cdfbd>] ovl_want_write+0x8d/0xf0 [overlay]
[40581.813668]        [<ffffffffa16d4dc7>] ovl_do_remove+0xe7/0x9a0 [overlay]
[40581.813675]        [<ffffffffa16d5696>] ovl_rmdir+0x16/0x20 [overlay]
[40581.813680]        [<ffffffff818af90f>] vfs_rmdir+0x1bf/0x3e0
[40581.813685]        [<ffffffff818c5965>] do_rmdir+0x2c5/0x430
[40581.813689]        [<ffffffff818c8242>] SyS_unlinkat+0x22/0x30
[40581.813696]        [<ffffffff8100924d>] do_syscall_64+0x19d/0x540
[40581.813704]        [<ffffffff82c8af24>] return_from_SYSCALL_64+0x0/0x7a
[40581.813707] 
[40581.813707] -> #2 (&sb->s_type->i_mutex_key#17){++++++}:
[40581.813720]        [<ffffffff8133dcda>] __lock_acquire+0x9aa/0x1710
[40581.813726]        [<ffffffff8133fd4e>] lock_acquire+0x24e/0x5d0
[40581.813736]        [<ffffffff82c84261>] down_read+0xa1/0x1c0
[40581.813740]        [<ffffffff818ae2db>] lookup_slow+0x17b/0x4f0
[40581.813744]        [<ffffffff818bb228>] walk_component+0x728/0x1d10
[40581.813750]        [<ffffffff818bcc1e>] link_path_walk+0x40e/0x1690
[40581.813758]        [<ffffffff818c0274>] path_openat+0x1c4/0x3870
[40581.813764]        [<ffffffff818c6d19>] do_filp_open+0x1a9/0x2e0
[40581.813772]        [<ffffffff8189832b>] do_open_execat+0xcb/0x420
[40581.813783]        [<ffffffff8189932b>] open_exec+0x2b/0x50
[40581.813793]        [<ffffffff819ea78c>] load_elf_binary+0x103c/0x3550
[40581.813807]        [<ffffffff8189a852>] search_binary_handler+0x162/0x480
[40581.813814]        [<ffffffff818a106a>] do_execveat_common.isra.24+0x138a/0x2570
[40581.813823]        [<ffffffff818a2efa>] SyS_execve+0x3a/0x50
[40581.813828]        [<ffffffff8100924d>] do_syscall_64+0x19d/0x540
[40581.813833]        [<ffffffff82c8af24>] return_from_SYSCALL_64+0x0/0x7a
[40581.813843] 
[40581.813843] -> #1 (&sig->cred_guard_mutex){+.+.+.}:
[40581.813861]        [<ffffffff8133dcda>] __lock_acquire+0x9aa/0x1710
[40581.813871]        [<ffffffff8133fd4e>] lock_acquire+0x24e/0x5d0
[40581.813885]        [<ffffffff82c7d1d3>] mutex_lock_killable_nested+0x103/0xb90
[40581.813895]        [<ffffffff81a3f7a6>] do_io_accounting+0x186/0xcf0
[40581.813902]        [<ffffffff81a40329>] proc_tgid_io_accounting+0x19/0x20
[40581.813908]        [<ffffffff81a41494>] proc_single_show+0x114/0x1d0
[40581.813917]        [<ffffffff81915ad4>] seq_read+0x334/0x1400
[40581.813921]        [<ffffffff81884da6>] __vfs_read+0x106/0x990
[40581.813927]        [<ffffffff81886038>] vfs_read+0x118/0x400
[40581.813931]        [<ffffffff8188aebf>] SyS_read+0xdf/0x1d0
[40581.813938]        [<ffffffff8100924d>] do_syscall_64+0x19d/0x540
[40581.813945]        [<ffffffff82c8af24>] return_from_SYSCALL_64+0x0/0x7a
[40581.813949] 
[40581.813949] -> #0 (&p->lock){+.+.+.}:
[40581.813961]        [<ffffffff81337938>] validate_chain.isra.31+0x2b28/0x4c00
[40581.813965]        [<ffffffff8133dcda>] __lock_acquire+0x9aa/0x1710
[40581.813972]        [<ffffffff8133fd4e>] lock_acquire+0x24e/0x5d0
[40581.813977]        [<ffffffff82c7f2f8>] mutex_lock_nested+0x108/0xa50
[40581.813983]        [<ffffffff8191588c>] seq_read+0xec/0x1400
[40581.813993]        [<ffffffff81a7bdde>] kernfs_fop_read+0x35e/0x640
[40581.813998]        [<ffffffff818812ef>] do_loop_readv_writev+0xdf/0x250
[40581.814003]        [<ffffffff81886fb5>] do_readv_writev+0x6a5/0xab0
[40581.814007]        [<ffffffff81887446>] vfs_readv+0x86/0xe0
[40581.814020]        [<ffffffff8194fdac>] default_file_splice_read+0x49c/0xbb0
[40581.814026]        [<ffffffff8194eb74>] do_splice_to+0x104/0x1a0
[40581.814033]        [<ffffffff8194ee80>] splice_direct_to_actor+0x270/0xa00
[40581.814039]        [<ffffffff8194f7a4>] do_splice_direct+0x194/0x300
[40581.814046]        [<ffffffff818896e9>] do_sendfile+0x469/0x1270
[40581.814051]        [<ffffffff8188bcb0>] SyS_sendfile64+0x140/0x150
[40581.814054]        [<ffffffff8100924d>] do_syscall_64+0x19d/0x540
[40581.814059]        [<ffffffff82c8af24>] return_from_SYSCALL_64+0x0/0x7a
[40581.814062] 
[40581.814062] other info that might help us debug this:
[40581.814062] 
[40581.814066] Chain exists of:
[40581.814066]   &p->lock --> &sb->s_type->i_mutex_key#17 --> sb_writers#8
[40581.814079] 
[40581.814079] 
[40581.814080]  Possible unsafe locking scenario:
[40581.814080] 
[40581.814081]        CPU0                    CPU1
[40581.814083]        ----                    ----
[40581.814085]   lock(sb_writers#8);
[40581.814091]                                lock(&sb->s_type->i_mutex_key#17);
[40581.814097]                                lock(sb_writers#8);
[40581.814101]   lock(&p->lock);
[40581.814104] 
[40581.814104]  *** DEADLOCK ***
[40581.814104] 
[40581.814106] 1 lock held by trinity-c104/39795:
[40581.814109]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff81889c6a>] do_sendfile+0x9ea/0x1270
[40581.814118] 
[40581.814118] stack backtrace:
[40581.814121] CPU: 25 PID: 39795 Comm: trinity-c104 Tainted: G        W       4.9.0-rc1-lockfix-uncorev2+ #51
[40581.814123] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRRFSDP1.86B.0271.R00.1510301446 10/30/2015
[40581.814131]  ffff880825886da0 ffffffff81d37124 0000000041b58ab3 ffffffff83348dc7
[40581.814138]  ffffffff81d37064 0000000000000001 0000000000000000 ffff8807b4d5d5d8
[40581.814145]  00000000bfc018be ffff880825886d78 0000000000000001 0000000000000000
[40581.814146] Call Trace:
[40581.814155]  [<ffffffff81d37124>] dump_stack+0xc0/0x12c
[40581.814159]  [<ffffffff81d37064>] ? _atomic_dec_and_lock+0xc4/0xc4
[40581.814168]  [<ffffffff81332fa9>] print_circular_bug+0x3c9/0x5e0
[40581.814171]  [<ffffffff81332be0>] ? print_circular_bug_entry+0xd0/0xd0
[40581.814176]  [<ffffffff81337938>] validate_chain.isra.31+0x2b28/0x4c00
[40581.814182]  [<ffffffff81334e10>] ? check_irq_usage+0x300/0x300
[40581.814192]  [<ffffffff81334e10>] ? check_irq_usage+0x300/0x300
[40581.814196]  [<ffffffff81de21f3>] ? __this_cpu_preempt_check+0x13/0x20
[40581.814200]  [<ffffffff81336045>] ? validate_chain.isra.31+0x1235/0x4c00
[40581.814204]  [<ffffffff8133a4d0>] ? print_usage_bug+0x700/0x700
[40581.814208]  [<ffffffff812abdc0>] ? sched_clock_cpu+0x1b0/0x310
[40581.814214]  [<ffffffff8133a4d0>] ? print_usage_bug+0x700/0x700
[40581.814219]  [<ffffffff812abdc0>] ? sched_clock_cpu+0x1b0/0x310
[40581.814226]  [<ffffffff8133dcda>] __lock_acquire+0x9aa/0x1710
[40581.814232]  [<ffffffff8133fd4e>] lock_acquire+0x24e/0x5d0
[40581.814235]  [<ffffffff8191588c>] ? seq_read+0xec/0x1400
[40581.814240]  [<ffffffff8191588c>] ? seq_read+0xec/0x1400
[40581.814243]  [<ffffffff82c7f2f8>] mutex_lock_nested+0x108/0xa50
[40581.814246]  [<ffffffff8191588c>] ? seq_read+0xec/0x1400
[40581.814250]  [<ffffffff8191588c>] ? seq_read+0xec/0x1400
[40581.814256]  [<ffffffff817fedd6>] ? kasan_unpoison_shadow+0x36/0x50
[40581.814259]  [<ffffffff82c7f1f0>] ? mutex_lock_interruptible_nested+0xb40/0xb40
[40581.814264]  [<ffffffff8168ec2c>] ? get_page_from_freelist+0x175c/0x2ed0
[40581.814271]  [<ffffffff8168d4d0>] ? __isolate_free_page+0x7e0/0x7e0
[40581.814275]  [<ffffffff8133c3f9>] ? mark_held_locks+0x109/0x290
[40581.814278]  [<ffffffff8191588c>] seq_read+0xec/0x1400
[40581.814283]  [<ffffffff813ac01d>] ? rcu_lockdep_current_cpu_online+0x11d/0x1d0
[40581.814290]  [<ffffffff819157a0>] ? seq_hlist_start_percpu+0x4a0/0x4a0
[40581.814295]  [<ffffffff8198ef20>] ? __fsnotify_update_child_dentry_flags.part.0+0x2b0/0x2b0
[40581.814298]  [<ffffffff81de21f3>] ? __this_cpu_preempt_check+0x13/0x20
[40581.814300]  [<ffffffff81a7bdde>] kernfs_fop_read+0x35e/0x640
[40581.814305]  [<ffffffff81b49a55>] ? selinux_file_permission+0x3c5/0x550
[40581.814310]  [<ffffffff81a7ba80>] ? kernfs_fop_open+0xf40/0xf40
[40581.814312]  [<ffffffff818812ef>] do_loop_readv_writev+0xdf/0x250
[40581.814318]  [<ffffffff81886fb5>] do_readv_writev+0x6a5/0xab0
[40581.814324]  [<ffffffff81886910>] ? vfs_write+0x5f0/0x5f0
[40581.814328]  [<ffffffff81d8fbaf>] ? iov_iter_get_pages_alloc+0x53f/0x1990
[40581.814332]  [<ffffffff81d8f670>] ? iov_iter_npages+0xed0/0xed0
[40581.814336]  [<ffffffff8133c3f9>] ? mark_held_locks+0x109/0x290
[40581.814339]  [<ffffffff81de21f3>] ? __this_cpu_preempt_check+0x13/0x20
[40581.814344]  [<ffffffff8133caa0>] ? trace_hardirqs_on_caller+0x520/0x720
[40581.814347]  [<ffffffff81887446>] vfs_readv+0x86/0xe0
[40581.814352]  [<ffffffff8194fdac>] default_file_splice_read+0x49c/0xbb0
[40581.814361]  [<ffffffff8194f910>] ? do_splice_direct+0x300/0x300
[40581.814363]  [<ffffffff817fef3d>] ? kasan_kmalloc+0xad/0xe0
[40581.814366]  [<ffffffff818a6287>] ? alloc_pipe_info+0x1b7/0x410
[40581.814371]  [<ffffffff8133a4d0>] ? print_usage_bug+0x700/0x700
[40581.814373]  [<ffffffff8188bcb0>] ? SyS_sendfile64+0x140/0x150
[40581.814377]  [<ffffffff8100924d>] ? do_syscall_64+0x19d/0x540
[40581.814380]  [<ffffffff82c8af24>] ? entry_SYSCALL64_slow_path+0x25/0x25
[40581.814382]  [<ffffffff812abdc0>] ? sched_clock_cpu+0x1b0/0x310
[40581.814386]  [<ffffffff8133c3f9>] ? mark_held_locks+0x109/0x290
[40581.814390]  [<ffffffff8133caa0>] ? trace_hardirqs_on_caller+0x520/0x720
[40581.814395]  [<ffffffff8198ef20>] ? __fsnotify_update_child_dentry_flags.part.0+0x2b0/0x2b0
[40581.814398]  [<ffffffff81b49a55>] ? selinux_file_permission+0x3c5/0x550
[40581.814404]  [<ffffffff81b26e96>] ? security_file_permission+0x176/0x220
[40581.814408]  [<ffffffff81885c78>] ? rw_verify_area+0xd8/0x380
[40581.814411]  [<ffffffff8194eb74>] do_splice_to+0x104/0x1a0
[40581.814415]  [<ffffffff818a63b7>] ? alloc_pipe_info+0x2e7/0x410
[40581.814419]  [<ffffffff8194ee80>] splice_direct_to_actor+0x270/0xa00
[40581.814424]  [<ffffffff8194c5e0>] ? wakeup_pipe_readers+0x90/0x90
[40581.814429]  [<ffffffff8194ec10>] ? do_splice_to+0x1a0/0x1a0
[40581.814432]  [<ffffffff81885c78>] ? rw_verify_area+0xd8/0x380
[40581.814438]  [<ffffffff8194f7a4>] do_splice_direct+0x194/0x300
[40581.814443]  [<ffffffff8194f610>] ? splice_direct_to_actor+0xa00/0xa00
[40581.814450]  [<ffffffff81278cee>] ? preempt_count_sub+0x5e/0xe0
[40581.814452]  [<ffffffff81890415>] ? __sb_start_write+0x145/0x360
[40581.814457]  [<ffffffff818896e9>] do_sendfile+0x469/0x1270
[40581.814461]  [<ffffffff81889280>] ? do_compat_pwritev64.isra.16+0xd0/0xd0
[40581.814466]  [<ffffffff814cb287>] ? __audit_syscall_exit+0x637/0x960
[40581.814469]  [<ffffffff81006afb>] ? syscall_trace_enter+0x89b/0x1930
[40581.814473]  [<ffffffff817f7993>] ? kfree+0x3f3/0x620
[40581.814477]  [<ffffffff8188bcb0>] SyS_sendfile64+0x140/0x150
[40581.814479]  [<ffffffff8188bb70>] ? SyS_sendfile+0x140/0x140
[40581.814482]  [<ffffffff81de21f3>] ? __this_cpu_preempt_check+0x13/0x20
[40581.814485]  [<ffffffff8188bb70>] ? SyS_sendfile+0x140/0x140
[40581.814487]  [<ffffffff8100924d>] do_syscall_64+0x19d/0x540
[40581.814491]  [<ffffffff82c8af24>] entry_SYSCALL64_slow_path+0x25/0x25

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [4.9-rc1+] overlayfs lockdep
  2016-10-21 15:38                                                                       ` [4.9-rc1+] overlayfs lockdep CAI Qian
@ 2016-10-24 12:57                                                                         ` Miklos Szeredi
  0 siblings, 0 replies; 152+ messages in thread
From: Miklos Szeredi @ 2016-10-24 12:57 UTC (permalink / raw)
  To: CAI Qian; +Cc: Jan Kara, Al Viro, Linus Torvalds, linux-fsdevel

On Fri, Oct 21, 2016 at 5:38 PM, CAI Qian <caiqian@redhat.com> wrote:
>
> ----- Original Message -----
>> From: "Jan Kara" <jack@suse.cz>
>> Sent: Friday, October 7, 2016 3:08:38 AM
>> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
>>
>>
>> So I believe this may be just a problem in overlayfs lockdep annotation
>> (see below). Added Miklos to CC.
>>
>> > Wait. There is also a lockep happened before the xfs internal error as
>> > well.
>> >
>> > [ 5839.452325] ======================================================
>> > [ 5839.459221] [ INFO: possible circular locking dependency detected ]
>> > [ 5839.466215] 4.8.0-rc8-splice-fixw-proc+ #4 Not tainted
>> > [ 5839.471945] -------------------------------------------------------
>> > [ 5839.478937] trinity-c220/69531 is trying to acquire lock:
>> > [ 5839.484961]  (&p->lock){+.+.+.}, at: [<ffffffff812ac69c>]
>> > seq_read+0x4c/0x3e0
>> > [ 5839.492967]
>> > but task is already holding lock:
>> > [ 5839.499476]  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>]
>> > __sb_start_write+0xd1/0xf0
>> > [ 5839.508560]
>> > which lock already depends on the new lock.
>> >
>> > [ 5839.517686]
>> > the existing dependency chain (in reverse order) is:
>> > [ 5839.526036]
>> > -> #3 (sb_writers#8){.+.+.+}:
>> > [ 5839.530751]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
>> > [ 5839.537368]        [<ffffffff810f8f4a>] percpu_down_read+0x4a/0x90
>> > [ 5839.544275]        [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
>> > [ 5839.551181]        [<ffffffff812a8544>] mnt_want_write+0x24/0x50
>> > [ 5839.557892]        [<ffffffffa04a398f>] ovl_want_write+0x1f/0x30
>> > [overlay]
>> > [ 5839.565577]        [<ffffffffa04a6036>] ovl_do_remove+0x46/0x480
>> > [overlay]
>> > [ 5839.573259]        [<ffffffffa04a64a3>] ovl_unlink+0x13/0x20 [overlay]
>> > [ 5839.580555]        [<ffffffff812918ea>] vfs_unlink+0xda/0x190
>> > [ 5839.586979]        [<ffffffff81293698>] do_unlinkat+0x268/0x2b0
>> > [ 5839.593599]        [<ffffffff8129419b>] SyS_unlinkat+0x1b/0x30
>> > [ 5839.600120]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
>> > [ 5839.606836]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
>> > [ 5839.614231]
>>
>> So here is IMO the real culprit: do_unlinkat() grabs fs freeze protection
>> through mnt_want_write(), we also grab i_rwsem in do_unlinkat() in the
>> I_MUTEX_PARENT class a bit after that, and further down in vfs_unlink() we
>> grab i_rwsem for the unlinked inode itself in the default I_MUTEX class. Then
>> in ovl_want_write() we grab freeze protection again, but this time for the
>> upper filesystem. That establishes sb_writers (overlay) -> I_MUTEX_PARENT
>> (overlay) -> I_MUTEX (overlay) -> sb_writers (FS-A) lock ordering
>> (we maintain locking classes per fs type so that's why I'm showing fs type
>> in parentheses).
>>
>> Now this nesting is nasty because once you add locks that are not tracked
>> per fs type into the mix, you get cycles. In this case we've got
>> seq_file->lock and cred_guard_mutex in the mix - the splice path is
>> doing sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex (splicing
>> from seq_file into the real filesystem). Exec path further establishes
>> cred_guard_mutex -> I_MUTEX (overlay) which closes the full cycle:
>>
>> sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex -> i_mutex
>> (overlay) -> sb_writers (FS-A)
>>
>> If I analyzed the lockdep trace correctly, this looks like a real
>> (although remote) deadlock possibility. Miklos?

Yeah, you can leave out seq_file->lock, the chain can be made up of
just 3 parts:

unlink : i_mutex(ov) -> sb_writers(fs-a)
splice: sb_writers(fs-a) -> cred_guard_mutex (through proc_tgid_io_accounting)
exec:  cred_guard_mutex -> i_mutex(ov)
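
As a rough userspace model of that cycle (plain pthread mutexes standing
in for the lock classes - a sketch, not kernel code; the thread bodies
are just labels for the three paths above):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t i_mutex_ovl     = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t sb_writers_fs_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t cred_guard      = PTHREAD_MUTEX_INITIALIZER;

static void *unlink_path(void *arg)	/* unlink on overlayfs */
{
	pthread_mutex_lock(&i_mutex_ovl);	/* vfs takes i_mutex first */
	pthread_mutex_lock(&sb_writers_fs_a);	/* then write access to upper fs */
	pthread_mutex_unlock(&sb_writers_fs_a);
	pthread_mutex_unlock(&i_mutex_ovl);
	return NULL;
}

static void *splice_path(void *arg)	/* sendfile from a proc file */
{
	pthread_mutex_lock(&sb_writers_fs_a);	/* freeze protection on target */
	pthread_mutex_lock(&cred_guard);	/* then proc ->read() side */
	pthread_mutex_unlock(&cred_guard);
	pthread_mutex_unlock(&sb_writers_fs_a);
	return NULL;
}

static void *exec_path(void *arg)	/* execve of a file on overlayfs */
{
	pthread_mutex_lock(&cred_guard);	/* taken early in execve */
	pthread_mutex_lock(&i_mutex_ovl);	/* then open_exec() path lookup */
	pthread_mutex_unlock(&i_mutex_ovl);
	pthread_mutex_unlock(&cred_guard);
	return NULL;
}

int main(void)
{
	pthread_t t[3];
	int i;

	pthread_create(&t[0], NULL, unlink_path, NULL);
	pthread_create(&t[1], NULL, splice_path, NULL);
	pthread_create(&t[2], NULL, exec_path, NULL);
	for (i = 0; i < 3; i++)
		pthread_join(t[i], NULL);
	puts("no deadlock this time");
	return 0;
}

Each thread takes its two locks in a fixed order, but the three orders
form a ring, so an unlucky interleaving leaves all three stuck.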

None of those are incorrect, but the cred_guard_mutex usage is also
pretty weird: taken outside path lookup as well as inside ->read() in
proc.

Doesn't look like a serious worry in practice; I don't think anybody would
trigger the actual deadlock in a thousand years (an fs freeze is needed at
just the right moment in addition to the above - a very unlikely chain).

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 152+ messages in thread

* local DoS - systemd hang or timeout with cgroup traces
  2016-10-04 21:42                                                     ` tj
  2016-10-05 14:09                                                       ` CAI Qian
@ 2016-10-27 12:52                                                       ` CAI Qian
  1 sibling, 0 replies; 152+ messages in thread
From: CAI Qian @ 2016-10-27 12:52 UTC (permalink / raw)
  To: tj; +Cc: cgroups, Johannes Weiner, linux-kernel

So this can still be reproduced in 4.9-rc2 by running trinity as a non-root
user for 30 minutes on this machine, on either ext4 or xfs. Below is the
trace on ext4 and the sysrq-w report.

http://people.redhat.com/qcai/tmp/dmesg-ext4-cgroup-hang

    CAI Qian

----- Original Message -----
> From: "tj" <tj@kernel.org>
> Sent: Tuesday, October 4, 2016 5:42:19 PM
> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
> 
> ...
> > Not sure if related, but right after this lockdep happened, trinity
> > running by a non-privileged user finished inside the container. The
> > host's systemctl command just hangs or times out, which renders the
> > whole system unusable.
> > 
> > # systemctl status docker
> > Failed to get properties: Connection timed out
> > 
> > # systemctl reboot (hang)
> > 
> ...
> > [ 5535.893675] INFO: lockdep is turned off.
> > [ 5535.898085] INFO: task kworker/45:4:146035 blocked for more than 120
> > seconds.
> > [ 5535.906059]       Tainted: G        W       4.8.0-rc8-fornext+ #1
> > [ 5535.912865] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > this message.
> > [ 5535.921613] kworker/45:4    D ffff880853e9b950 14048 146035      2
> > 0x00000080
> > [ 5535.929630] Workqueue: cgroup_destroy css_killed_work_fn
> > [ 5535.935582]  ffff880853e9b950 0000000000000000 0000000000000000
> > ffff88086c6da000
> > [ 5535.943882]  ffff88086c9e2000 ffff880853e9c000 ffff880853e9baa0
> > ffff88086c9e2000
> > [ 5535.952205]  ffff880853e9ba98 0000000000000001 ffff880853e9b968
> > ffffffff817cdaaf
> > [ 5535.960522] Call Trace:
> > [ 5535.963265]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
> > [ 5535.968817]  [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0
> > [ 5535.975346]  [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130
> > [ 5535.982256]  [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130
> > [ 5535.988972]  [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80
> > [ 5535.994804]  [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0
> > [ 5536.001227]  [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0
> > [ 5536.007648]  [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0
> > [ 5536.014654]  [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0
> > [ 5536.021657]  [<ffffffff810f57f5>]
> > unregister_sched_domain_sysctl+0x15/0x40
> > [ 5536.029344]  [<ffffffff810d7704>] partition_sched_domains+0x44/0x450
> > [ 5536.036447]  [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0
> > [ 5536.043844]  [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0
> > [ 5536.051336]  [<ffffffff8116789d>] update_flag+0x11d/0x210
> > [ 5536.057373]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
> > [ 5536.064186]  [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60
> > [ 5536.070899]  [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10
> > [ 5536.077420]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
> > [ 5536.084234]  [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220
> > [ 5536.091049]  [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60
> > [ 5536.097571]  [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220
> > [ 5536.104207]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
> > [ 5536.110736]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
> > [ 5536.117461]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
> > [ 5536.123697]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
> > [ 5536.130426]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
> > [ 5536.135991]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
> > [ 5536.142041]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230
> 
> This one seems to be the offender.  cgroup is trying to offline a
> cpuset css, which takes place under cgroup_mutex.  The offlining ends
> up trying to drain active usages of a sysctl table, which apparently is
> not happening.  Did something hang or crash while trying to generate
> sysctl content?
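
(For reference: that drain follows the usual use-count plus completion
pattern. The sketch below is a simplification under that assumption -
not the actual fs/proc/proc_sysctl.c code - showing why a user that
never finishes leaves the waiter, and whatever it holds, stuck.)

#include <linux/spinlock.h>
#include <linux/completion.h>

static DEFINE_SPINLOCK(table_lock);

struct table_header {
	int nuse;			/* in-flight users of the table */
	int unregistering;		/* set once teardown starts */
	struct completion *drain;	/* signalled when nuse reaches 0 */
};

/* reader side: a user of the table is done with it */
static void stop_using(struct table_header *h)
{
	spin_lock(&table_lock);
	if (!--h->nuse && h->unregistering)
		complete(h->drain);
	spin_unlock(&table_lock);
}

/* teardown side: blocks until the last in-flight user drops off; if
 * one user never finishes, this waits forever - and in the trace
 * above the caller is holding cgroup_mutex while it waits */
static void start_unregistering(struct table_header *h)
{
	DECLARE_COMPLETION_ONSTACK(wait);

	spin_lock(&table_lock);
	h->unregistering = 1;
	if (h->nuse) {
		h->drain = &wait;
		spin_unlock(&table_lock);
		wait_for_completion(&wait);
		spin_lock(&table_lock);
	}
	spin_unlock(&table_lock);
}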

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-09-24  3:59                                                 ` [PATCH 04/12] " Al Viro
  2016-09-26 13:35                                                     ` Miklos Szeredi
@ 2016-12-17 19:54                                                   ` Andreas Schwab
  2016-12-18 19:28                                                     ` Linus Torvalds
  1 sibling, 1 reply; 152+ messages in thread
From: Andreas Schwab @ 2016-12-17 19:54 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

This breaks EPIPE handling inside splice when SIGPIPE is ignored:

Before:
$ { sleep 1; strace -e splice pv -q /dev/zero; } | :
splice(3, NULL, 1, NULL, 131072, SPLICE_F_MORE) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=23750, si_uid=17005} ---
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=23750, si_uid=17005} ---
+++ exited with 0 +++

After:
$ { sleep 1; strace -e splice pv -q /dev/zero; } | :
splice(3, NULL, 1, NULL, 131072, SPLICE_F_MORE) = 65536
splice(3, NULL, 1, NULL, 131072, SPLICE_F_MORE
[hangs]
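
(The same check can be made stand-alone - a hypothetical reproducer,
not from pv's testsuite: ignore SIGPIPE, close the read side of a
pipe, splice into it, and expect -1/EPIPE rather than a hang.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int zero = open("/dev/zero", O_RDONLY);
	int p[2];
	ssize_t n;

	signal(SIGPIPE, SIG_IGN);	/* like pv with SIGPIPE ignored */
	pipe(p);
	close(p[0]);			/* no readers left on the pipe */
	n = splice(zero, NULL, p[1], NULL, 131072, 0);
	printf("splice: %zd (%m)\n", n);	/* want: -1 (Broken pipe) */
	return 0;
}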

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-17 19:54                                                   ` Andreas Schwab
@ 2016-12-18 19:28                                                     ` Linus Torvalds
  2016-12-18 19:57                                                       ` Andreas Schwab
  2016-12-18 20:12                                                       ` Al Viro
  0 siblings, 2 replies; 152+ messages in thread
From: Linus Torvalds @ 2016-12-18 19:28 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Al Viro, Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Sat, Dec 17, 2016 at 11:54 AM, Andreas Schwab <schwab@linux-m68k.org> wrote:
> This breaks EPIPE handling inside splice when SIGPIPE is ignored:
>
> Before:
> $ { sleep 1; strace -e splice pv -q /dev/zero; } | :

Where is that "splice" program from? Google isn't helpful, and fedora
doesn't seem to have it. I'm assuming it was posted in one of the
threads, but if so I've long since lost sight of it..

             Linus

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-18 19:28                                                     ` Linus Torvalds
@ 2016-12-18 19:57                                                       ` Andreas Schwab
  2016-12-18 20:12                                                       ` Al Viro
  1 sibling, 0 replies; 152+ messages in thread
From: Andreas Schwab @ 2016-12-18 19:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Dez 18 2016, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Dec 17, 2016 at 11:54 AM, Andreas Schwab <schwab@linux-m68k.org> wrote:
>> This breaks EPIPE handling inside splice when SIGPIPE is ignored:
>>
>> Before:
>> $ { sleep 1; strace -e splice pv -q /dev/zero; } | :
>
> Where is that "splice" program from?

It's running pv (splice is the argument of strace -e).

http://ivarch.com/programs/pv.shtml

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-18 19:28                                                     ` Linus Torvalds
  2016-12-18 19:57                                                       ` Andreas Schwab
@ 2016-12-18 20:12                                                       ` Al Viro
  2016-12-18 20:30                                                         ` Al Viro
  1 sibling, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-12-18 20:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Schwab, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Sun, Dec 18, 2016 at 11:28:44AM -0800, Linus Torvalds wrote:
> On Sat, Dec 17, 2016 at 11:54 AM, Andreas Schwab <schwab@linux-m68k.org> wrote:
> > This breaks EPIPE handling inside splice when SIGPIPE is ignored:
> >
> > Before:
> > $ { sleep 1; strace -e splice pv -q /dev/zero; } | :
> 
> Where is that "splice" program from? Google isn't helpful, and fedora
> doesn't seem to have it. I'm assuming it was posted in one of the
> threads, but if so I've long since lost sight of it..

It's pv(1), actually.  I'm looking into that - debian-packaged pv reproduced
that crap.

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-18 20:12                                                       ` Al Viro
@ 2016-12-18 20:30                                                         ` Al Viro
  2016-12-18 22:10                                                           ` Linus Torvalds
  0 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-12-18 20:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Schwab, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Sun, Dec 18, 2016 at 08:12:07PM +0000, Al Viro wrote:
> On Sun, Dec 18, 2016 at 11:28:44AM -0800, Linus Torvalds wrote:
> > On Sat, Dec 17, 2016 at 11:54 AM, Andreas Schwab <schwab@linux-m68k.org> wrote:
> > > This breaks EPIPE handling inside splice when SIGPIPE is ignored:
> > >
> > > Before:
> > > $ { sleep 1; strace -e splice pv -q /dev/zero; } | :
> > 
> > Where is that "splice" program from? Google isn't helpful, and fedora
> > doesn't seem to have it. I'm assuming it was posted in one of the
> > threads, but if so I've long since lost sight of it..
> 
> It's pv(1), actually.  I'm looking into that - debian-packaged pv reproduced
> that crap.

OK, I see what's going on - it's wait_for_space() lifted past the checks
for lack of readers.  The fix, AFAICS, is simply

diff --git a/fs/splice.c b/fs/splice.c
index 6a2b0db5..aeba2b7 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1082,6 +1082,10 @@ EXPORT_SYMBOL(do_splice_direct);
 
 static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
 {
+	if (unlikely(!pipe->readers)) {
+		send_sig(SIGPIPE, current, 0);
+		return -EPIPE;
+	}
 	while (pipe->nrbufs == pipe->buffers) {
 		if (flags & SPLICE_F_NONBLOCK)
 			return -EAGAIN;
@@ -1090,6 +1094,10 @@ static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
 		pipe->waiting_writers++;
 		pipe_wait(pipe);
 		pipe->waiting_writers--;
+		if (unlikely(!pipe->readers)) {
+			send_sig(SIGPIPE, current, 0);
+			return -EPIPE;
+		}
 	}
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-18 20:30                                                         ` Al Viro
@ 2016-12-18 22:10                                                           ` Linus Torvalds
  2016-12-18 22:18                                                             ` Al Viro
                                                                               ` (2 more replies)
  0 siblings, 3 replies; 152+ messages in thread
From: Linus Torvalds @ 2016-12-18 22:10 UTC (permalink / raw)
  To: Al Viro
  Cc: Andreas Schwab, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Sun, Dec 18, 2016 at 12:30 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> OK, I see what's going on - it's wait_for_space() lifted past the checks
> for lack of readers.  The fix, AFAICS, is simply

Ugh. Does it have to be duplicated?

How about just making the wait_for_space() loop be a for-loop, and writing it as

   for (;;) {
        if (unlikely(!pipe->readers)) {
                send_sig(SIGPIPE, current, 0);
                return -EPIPE;
        }
        if (pipe->nrbufs == pipe->buffers)
                return 0;
        if (flags & SPLICE_F_NONBLOCK)
                return -EAGAIN;
        if (signal_pending(current))
                return -ERESTARTSYS;
        pipe->waiting_writers++;
        pipe_wait(pipe);
        pipe->waiting_writers--;
   }

and just having it once?

Regardless - Andreas, can you verify that that fixes your issues? I'm
assuming you had some real load that made you notice this, not just he
dummy example..

            Linus

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-18 22:10                                                           ` Linus Torvalds
@ 2016-12-18 22:18                                                             ` Al Viro
  2016-12-18 22:22                                                               ` Linus Torvalds
  2016-12-18 22:49                                                             ` Andreas Schwab
  2016-12-21 18:56                                                             ` Andreas Schwab
  2 siblings, 1 reply; 152+ messages in thread
From: Al Viro @ 2016-12-18 22:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Schwab, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Sun, Dec 18, 2016 at 02:10:54PM -0800, Linus Torvalds wrote:
> On Sun, Dec 18, 2016 at 12:30 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > OK, I see what's going on - it's wait_for_space() lifted past the checks
> > for lack of readers.  The fix, AFAICS, is simply
> 
> Ugh. Does it have to be duplicated?
> 
> How about just making the wait_for_space() loop be a for-loop, and writing it as
> 
>    for (;;) {
>         if (unlikely(!pipe->readers)) {
>                 send_sig(SIGPIPE, current, 0);
>                 return -EPIPE;
>         }
>         if (pipe->nrbufs == pipe->buffers)

ITYM "!="...

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-18 22:18                                                             ` Al Viro
@ 2016-12-18 22:22                                                               ` Linus Torvalds
  0 siblings, 0 replies; 152+ messages in thread
From: Linus Torvalds @ 2016-12-18 22:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Andreas Schwab, Dave Chinner, CAI Qian, linux-xfs, xfs,
	Jens Axboe, Nick Piggin, linux-fsdevel

On Sun, Dec 18, 2016 at 2:18 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> ITYM "!="...

Right. A bit too much cut-and-pasting going on in my email ;)

              Linus

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-18 22:10                                                           ` Linus Torvalds
  2016-12-18 22:18                                                             ` Al Viro
@ 2016-12-18 22:49                                                             ` Andreas Schwab
  2016-12-21 18:56                                                             ` Andreas Schwab
  2 siblings, 0 replies; 152+ messages in thread
From: Andreas Schwab @ 2016-12-18 22:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Dez 18 2016, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Regardless - Andreas, can you verify that that fixes your issues? I'm
> assuming you had some real load that made you notice this, not just the
> dummy example..

This is from the testsuite of pv, I only noticed because it was hanging.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-18 22:10                                                           ` Linus Torvalds
  2016-12-18 22:18                                                             ` Al Viro
  2016-12-18 22:49                                                             ` Andreas Schwab
@ 2016-12-21 18:56                                                             ` Andreas Schwab
  2016-12-21 19:12                                                               ` Linus Torvalds
  2 siblings, 1 reply; 152+ messages in thread
From: Andreas Schwab @ 2016-12-21 18:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Dez 18 2016, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Regardless - Andreas, can you verify that that fixes your issues? I'm
> assuming you had some real load that made you notice this, not just the
> dummy example..

FWIW, I have verified that the testsuite of pv succeeds with this patch:

diff --git a/fs/splice.c b/fs/splice.c
index 5a7750bd2e..63b8f54485 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1086,7 +1086,13 @@ EXPORT_SYMBOL(do_splice_direct);
 
 static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
 {
-	while (pipe->nrbufs == pipe->buffers) {
+	for (;;) {
+		if (unlikely(!pipe->readers)) {
+			send_sig(SIGPIPE, current, 0);
+			return -EPIPE;
+		}
+		if (pipe->nrbufs != pipe->buffers)
+			return 0;
 		if (flags & SPLICE_F_NONBLOCK)
 			return -EAGAIN;
 		if (signal_pending(current))
@@ -1095,7 +1101,6 @@ static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
 		pipe_wait(pipe);
 		pipe->waiting_writers--;
 	}
-	return 0;
 }
 
 static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
-- 
2.11.0


Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply related	[flat|nested] 152+ messages in thread

* Re: [PATCH 04/12] splice: lift pipe_lock out of splice_to_pipe()
  2016-12-21 18:56                                                             ` Andreas Schwab
@ 2016-12-21 19:12                                                               ` Linus Torvalds
  0 siblings, 0 replies; 152+ messages in thread
From: Linus Torvalds @ 2016-12-21 19:12 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Al Viro, Dave Chinner, CAI Qian, linux-xfs, xfs, Jens Axboe,
	Nick Piggin, linux-fsdevel

On Wed, Dec 21, 2016 at 10:56 AM, Andreas Schwab <schwab@linux-m68k.org> wrote:
>
> FWIW, I have verified that the testsuite of pv succeeds with this patch:

Ok, thanks, committed.

Al, looking at this area, I think there's some room for cleanups. In
particular, isn't the loop in opipe_prep() now just
"wait_for_space()"? I'm also thinking that we could perhaps remove the
SIGPIPE/EPIPE handling from splice_to_pipe()..

Hmm?
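
If so, the first of those would presumably collapse to something like
this (a sketch against the wait_for_space() above, not a tested patch):

static int opipe_prep(struct pipe_inode_info *pipe, unsigned flags)
{
	int ret;

	/* wait_for_space() now does the readers/SIGPIPE check itself */
	pipe_lock(pipe);
	ret = wait_for_space(pipe, flags);
	pipe_unlock(pipe);
	return ret;
}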

               Linus

^ permalink raw reply	[flat|nested] 152+ messages in thread

end of thread, other threads:[~2016-12-21 19:12 UTC | newest]

Thread overview: 152+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <723420070.1340881.1472835555274.JavaMail.zimbra@redhat.com>
     [not found] ` <1832555471.1341372.1472835736236.JavaMail.zimbra@redhat.com>
2016-09-03  0:39   ` xfs_file_splice_read: possible circular locking dependency detected Dave Chinner
2016-09-03  0:57     ` Linus Torvalds
2016-09-03  1:45       ` Al Viro
2016-09-06 23:59         ` Dave Chinner
2016-09-08 20:35           ` Al Viro
2016-09-06 21:53     ` CAI Qian
2016-09-06 23:34       ` Dave Chinner
2016-09-08 15:29     ` CAI Qian
2016-09-08 17:56       ` Al Viro
2016-09-08 18:12         ` Linus Torvalds
2016-09-08 18:18           ` Linus Torvalds
2016-09-08 20:44           ` Al Viro
2016-09-08 20:57             ` Al Viro
2016-09-08 21:23             ` Al Viro
2016-09-08 21:38           ` Dave Chinner
2016-09-08 23:55             ` Al Viro
2016-09-09  1:53               ` Dave Chinner
2016-09-09  2:22                 ` Linus Torvalds
2016-09-09  2:26                   ` Linus Torvalds
2016-09-09  2:34                     ` Al Viro
2016-09-09  2:50                       ` Linus Torvalds
2016-09-09 22:19                         ` Al Viro
2016-09-10  2:06                           ` Linus Torvalds
2016-09-14  3:16                             ` Al Viro
2016-09-14  3:39                               ` Nicholas Piggin
2016-09-14  4:01                                 ` Linus Torvalds
2016-09-18  5:33                                 ` Al Viro
2016-09-19  3:08                                   ` Nicholas Piggin
2016-09-19  6:11                                     ` Al Viro
2016-09-19  7:26                                       ` Nicholas Piggin
2016-09-14  3:49                               ` Linus Torvalds
2016-09-14  4:26                                 ` Al Viro
2016-09-17  8:20                                   ` Al Viro
2016-09-17 19:00                                     ` Al Viro
2016-09-17 20:15                                       ` Linus Torvalds
2016-09-18 19:31                                       ` skb_splice_bits() and large chunks in pipe (was " Al Viro
2016-09-18 20:12                                         ` Linus Torvalds
2016-09-18 22:31                                           ` Al Viro
2016-09-19  0:18                                             ` Linus Torvalds
2016-09-19  0:22                                             ` Al Viro
2016-09-19  0:22                                               ` Al Viro
2016-09-20  9:51                                               ` Herbert Xu
2016-09-23 19:00                                       ` [RFC][CFT] splice_read reworked Al Viro
2016-09-23 19:01                                         ` [PATCH 01/11] fix memory leaks in tracing_buffers_splice_read() Al Viro
2016-09-23 19:02                                         ` [PATCH 02/11] splice_to_pipe(): don't open-code wakeup_pipe_readers() Al Viro
2016-09-23 19:02                                         ` [PATCH 03/11] splice: switch get_iovec_page_array() to iov_iter Al Viro
2016-09-23 19:02                                           ` Al Viro
2016-09-23 19:03                                         ` [PATCH 04/11] splice: lift pipe_lock out of splice_to_pipe() Al Viro
2016-09-23 19:45                                           ` Linus Torvalds
2016-09-23 20:10                                             ` Al Viro
2016-09-23 20:36                                               ` Linus Torvalds
2016-09-24  3:59                                                 ` Al Viro
2016-09-24 17:29                                                   ` Al Viro
2016-09-27 15:38                                                     ` Nicholas Piggin
2016-09-27 15:53                                                     ` Chuck Lever
2016-09-27 15:53                                                       ` Chuck Lever
2016-09-24  3:59                                                 ` [PATCH 04/12] " Al Viro
2016-09-26 13:35                                                   ` Miklos Szeredi
2016-09-26 13:35                                                     ` Miklos Szeredi
2016-09-27  4:14                                                     ` Al Viro
2016-09-27  4:14                                                       ` Al Viro
2016-12-17 19:54                                                   ` Andreas Schwab
2016-12-18 19:28                                                     ` Linus Torvalds
2016-12-18 19:57                                                       ` Andreas Schwab
2016-12-18 20:12                                                       ` Al Viro
2016-12-18 20:30                                                         ` Al Viro
2016-12-18 22:10                                                           ` Linus Torvalds
2016-12-18 22:18                                                             ` Al Viro
2016-12-18 22:22                                                               ` Linus Torvalds
2016-12-18 22:49                                                             ` Andreas Schwab
2016-12-21 18:56                                                             ` Andreas Schwab
2016-12-21 19:12                                                               ` Linus Torvalds
2016-09-24  4:00                                                 ` [PATCH 06/12] new helper: add_to_pipe() Al Viro
2016-09-26 13:49                                                   ` Miklos Szeredi
2016-09-24  4:01                                                 ` [PATCH 10/12] new iov_iter flavour: pipe-backed Al Viro
2016-09-29 20:53                                                   ` Miklos Szeredi
2016-09-29 22:50                                                     ` Al Viro
2016-09-29 22:50                                                       ` Al Viro
2016-09-30  7:30                                                       ` Miklos Szeredi
2016-10-03  3:34                                                         ` [RFC] O_DIRECT vs EFAULT (was Re: [PATCH 10/12] new iov_iter flavour: pipe-backed) Al Viro
2016-10-03 17:07                                                           ` Linus Torvalds
2016-10-03 18:54                                                             ` Al Viro
2016-09-24  4:01                                                 ` [PATCH 11/12] switch generic_file_splice_read() to use of ->read_iter() Al Viro
2016-09-24  4:02                                                 ` [PATCH 12/12] switch default_file_splice_read() to use of pipe-backed iov_iter Al Viro
2016-09-23 19:03                                         ` [PATCH 05/11] skb_splice_bits(): get rid of callback Al Viro
2016-09-23 19:03                                           ` Al Viro
2016-09-23 19:04                                         ` [PATCH 06/11] new helper: add_to_pipe() Al Viro
2016-09-23 19:04                                         ` [PATCH 07/11] fuse_dev_splice_read(): switch to add_to_pipe() Al Viro
2016-09-23 19:06                                         ` [PATCH 08/11] cifs: don't use memcpy() to copy struct iov_iter Al Viro
2016-09-23 19:08                                         ` [PATCH 09/11] fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter() Al Viro
2016-09-26  9:31                                           ` Miklos Szeredi
2016-09-23 19:09                                         ` [PATCH 10/11] new iov_iter flavour: pipe-backed Al Viro
2016-09-23 19:10                                         ` [PATCH 11/11] switch generic_file_splice_read() to use of ->read_iter() Al Viro
2016-09-30 13:32                                         ` [RFC][CFT] splice_read reworked CAI Qian
2016-09-30 17:42                                           ` CAI Qian
2016-09-30 18:33                                             ` CAI Qian
2016-09-30 18:33                                               ` CAI Qian
2016-10-03  1:37                                               ` Al Viro
2016-10-03 17:49                                                 ` CAI Qian
2016-10-04 17:39                                                   ` local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked) CAI Qian
2016-10-04 21:42                                                     ` tj
2016-10-05 14:09                                                       ` CAI Qian
2016-10-05 15:30                                                         ` tj
2016-10-05 15:54                                                           ` CAI Qian
2016-10-05 18:57                                                             ` CAI Qian
2016-10-05 20:05                                                               ` Al Viro
2016-10-06 12:20                                                                 ` CAI Qian
2016-10-06 12:25                                                                   ` CAI Qian
2016-10-06 16:11                                                                     ` CAI Qian
2016-10-06 17:00                                                                       ` Linus Torvalds
2016-10-06 18:12                                                                         ` CAI Qian
2016-10-07  9:57                                                                         ` Dave Chinner
2016-10-07 15:25                                                                           ` Linus Torvalds
2016-10-07  7:08                                                                     ` Jan Kara
2016-10-07 14:43                                                                       ` CAI Qian
2016-10-07 15:27                                                                         ` CAI Qian
2016-10-07 18:56                                                                           ` CAI Qian
2016-10-09 21:54                                                                             ` Dave Chinner
2016-10-10 14:10                                                                               ` CAI Qian
2016-10-10 20:14                                                                                 ` CAI Qian
2016-10-10 21:57                                                                                 ` Dave Chinner
2016-10-12 19:50                                                                                   ` [bisected] " CAI Qian
2016-10-12 20:59                                                                                     ` Dave Chinner
2016-10-13 16:25                                                                                       ` CAI Qian
2016-10-13 20:49                                                                                         ` Dave Chinner
2016-10-13 20:56                                                                                           ` CAI Qian
2016-10-09 21:51                                                                         ` Dave Chinner
2016-10-21 15:38                                                                       ` [4.9-rc1+] overlayfs lockdep CAI Qian
2016-10-24 12:57                                                                         ` Miklos Szeredi
2016-10-07  9:27                                                                   ` local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked) Dave Chinner
2016-10-27 12:52                                                       ` local DoS - systemd hang or timeout with cgroup traces CAI Qian
2016-10-03  1:42                                             ` [RFC][CFT] splice_read reworked Al Viro
2016-10-03 14:06                                               ` CAI Qian
2016-10-03 15:20                                                 ` CAI Qian
2016-10-03 21:12                                                   ` Dave Chinner
2016-10-04 13:57                                                     ` CAI Qian
2016-10-03 20:32                                                 ` CAI Qian
2016-10-03 20:35                                                   ` Al Viro
2016-10-04 13:29                                                     ` CAI Qian
2016-10-04 14:28                                                       ` Al Viro
2016-10-04 16:21                                                         ` CAI Qian
2016-10-04 20:12                                                           ` Al Viro
2016-10-05 14:30                                                             ` CAI Qian
2016-10-05 16:07                                                               ` Al Viro
2016-09-09  2:31                   ` xfs_file_splice_read: possible circular locking dependency detected Al Viro
2016-09-09  2:39                     ` Linus Torvalds
2016-09-09  2:26                 ` Al Viro
2016-09-09  2:19               ` Al Viro
2016-09-08 18:01       ` Linus Torvalds
2016-09-08 20:39         ` CAI Qian
2016-09-08 21:19           ` Dave Chinner
2016-09-08 21:30             ` Al Viro
