linux-kernel.vger.kernel.org archive mirror
* iov_iter_pipe warning.
@ 2017-03-21 20:59 Dave Jones
  2017-04-05 22:02 ` Dave Jones
  2017-04-10 19:28 ` Al Viro
  0 siblings, 2 replies; 73+ messages in thread
From: Dave Jones @ 2017-03-21 20:59 UTC (permalink / raw)
  To: Linux Kernel; +Cc: Al Viro

WARNING: CPU: 0 PID: 9290 at lib/iov_iter.c:836 iov_iter_pipe+0x71/0x80
CPU: 0 PID: 9290 Comm: trinity-c7 Not tainted 4.11.0-rc3-think+ #3 
Call Trace:
 dump_stack+0x68/0x93
 __warn+0xcb/0xf0
 warn_slowpath_null+0x1d/0x20
 iov_iter_pipe+0x71/0x80
 generic_file_splice_read+0x37/0x140
 do_splice_to+0x79/0x90
 splice_direct_to_actor+0xb8/0x220
 ? generic_pipe_buf_nosteal+0x10/0x10
 do_splice_direct+0x9e/0xd0
 do_sendfile+0x1d7/0x3c0
 SyS_sendfile64+0x73/0xe0
 do_syscall_64+0x66/0x1d0
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f86804bf0f9
RSP: 002b:00007ffd5e613038 EFLAGS: 00000246
 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007f86804bf0f9
RDX: 00007f867e92a000 RSI: 000000000000018b RDI: 000000000000011b
RBP: 00007f8680b5d000 R08: 0000000000000000 R09: ffffffffffffff5b
R10: 0000000000221200 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f8680b5d048 R14: 00007f8680b95ad8 R15: 00007f8680b5d000
---[ end trace e50d88214985fd43 ]---


 836         WARN_ON(pipe->nrbufs == pipe->buffers);

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-03-21 20:59 iov_iter_pipe warning Dave Jones
@ 2017-04-05 22:02 ` Dave Jones
  2017-04-10 19:28 ` Al Viro
  1 sibling, 0 replies; 73+ messages in thread
From: Dave Jones @ 2017-04-05 22:02 UTC (permalink / raw)
  To: Linux Kernel, Al Viro

On Tue, Mar 21, 2017 at 04:59:01PM -0400, Dave Jones wrote:
 > WARNING: CPU: 0 PID: 9290 at lib/iov_iter.c:836 iov_iter_pipe+0x71/0x80
 > CPU: 0 PID: 9290 Comm: trinity-c7 Not tainted 4.11.0-rc3-think+ #3 
 > Call Trace:
 >  dump_stack+0x68/0x93
 >  __warn+0xcb/0xf0
 >  warn_slowpath_null+0x1d/0x20
 >  iov_iter_pipe+0x71/0x80
 >  generic_file_splice_read+0x37/0x140
 >  do_splice_to+0x79/0x90
 >  splice_direct_to_actor+0xb8/0x220
 >  ? generic_pipe_buf_nosteal+0x10/0x10
 >  do_splice_direct+0x9e/0xd0
 >  do_sendfile+0x1d7/0x3c0
 >  SyS_sendfile64+0x73/0xe0
 >  do_syscall_64+0x66/0x1d0
 >  entry_SYSCALL64_slow_path+0x25/0x25
 > RIP: 0033:0x7f86804bf0f9
 > RSP: 002b:00007ffd5e613038 EFLAGS: 00000246
 >  ORIG_RAX: 0000000000000028
 > RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007f86804bf0f9
 > RDX: 00007f867e92a000 RSI: 000000000000018b RDI: 000000000000011b
 > RBP: 00007f8680b5d000 R08: 0000000000000000 R09: ffffffffffffff5b
 > R10: 0000000000221200 R11: 0000000000000246 R12: 0000000000000002
 > R13: 00007f8680b5d048 R14: 00007f8680b95ad8 R15: 00007f8680b5d000
 > ---[ end trace e50d88214985fd43 ]---
 > 
 > 
 >  836         WARN_ON(pipe->nrbufs == pipe->buffers);

Still hitting this regularly. Takes a couple hours usually, but it seems to be
the only thing that falls out of trinity runs lately.

	Dave


* Re: iov_iter_pipe warning.
  2017-03-21 20:59 iov_iter_pipe warning Dave Jones
  2017-04-05 22:02 ` Dave Jones
@ 2017-04-10 19:28 ` Al Viro
  2017-04-10 19:42   ` Dave Jones
  1 sibling, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-04-10 19:28 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Tue, Mar 21, 2017 at 04:59:01PM -0400, Dave Jones wrote:
> WARNING: CPU: 0 PID: 9290 at lib/iov_iter.c:836 iov_iter_pipe+0x71/0x80
> CPU: 0 PID: 9290 Comm: trinity-c7 Not tainted 4.11.0-rc3-think+ #3 
> Call Trace:
>  dump_stack+0x68/0x93
>  __warn+0xcb/0xf0
>  warn_slowpath_null+0x1d/0x20
>  iov_iter_pipe+0x71/0x80
>  generic_file_splice_read+0x37/0x140
>  do_splice_to+0x79/0x90
>  splice_direct_to_actor+0xb8/0x220
>  ? generic_pipe_buf_nosteal+0x10/0x10
>  do_splice_direct+0x9e/0xd0
>  do_sendfile+0x1d7/0x3c0
>  SyS_sendfile64+0x73/0xe0
>  do_syscall_64+0x66/0x1d0

Cute...  We have ->splice_read() called in attempt to shove something into
a full internal pipe.  How do you achieve that situation?  actor() callback
claiming to have drained more than it actually had to?

Do you have a reliable reproducer?


* Re: iov_iter_pipe warning.
  2017-04-10 19:28 ` Al Viro
@ 2017-04-10 19:42   ` Dave Jones
  2017-04-10 19:57     ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-10 19:42 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Mon, Apr 10, 2017 at 08:28:00PM +0100, Al Viro wrote:
 > On Tue, Mar 21, 2017 at 04:59:01PM -0400, Dave Jones wrote:
 > > WARNING: CPU: 0 PID: 9290 at lib/iov_iter.c:836 iov_iter_pipe+0x71/0x80
 > > CPU: 0 PID: 9290 Comm: trinity-c7 Not tainted 4.11.0-rc3-think+ #3 
 > > Call Trace:
 > >  dump_stack+0x68/0x93
 > >  __warn+0xcb/0xf0
 > >  warn_slowpath_null+0x1d/0x20
 > >  iov_iter_pipe+0x71/0x80
 > >  generic_file_splice_read+0x37/0x140
 > >  do_splice_to+0x79/0x90
 > >  splice_direct_to_actor+0xb8/0x220
 > >  ? generic_pipe_buf_nosteal+0x10/0x10
 > >  do_splice_direct+0x9e/0xd0
 > >  do_sendfile+0x1d7/0x3c0
 > >  SyS_sendfile64+0x73/0xe0
 > >  do_syscall_64+0x66/0x1d0
 > 
 > Cute...  We have ->splice_read() called in attempt to shove something into
 > a full internal pipe.  How do you achieve that situation?  actor() callback
 > claiming to have drained more than it actually had to?
 > 
 > Do you have a reliable reproducer?

Not reliable, but I'll see if I can find some time to narrow it down
this week.  I've been working on better logging of "wtf just happened"
the last few weeks, so it should get easier once I finish that work.

	Dave


* Re: iov_iter_pipe warning.
  2017-04-10 19:42   ` Dave Jones
@ 2017-04-10 19:57     ` Al Viro
  2017-04-10 23:48       ` Dave Jones
  0 siblings, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-04-10 19:57 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Mon, Apr 10, 2017 at 03:42:06PM -0400, Dave Jones wrote:
>  > Do you have a reliable reproducer?
> 
> Not reliable, but I'll see if I can find some time to narrow it down
> this week.  I've been working on better logging of "wtf just happened"
> the last few weeks, so it should get easier once I finish that work.

I would suggest
	a) slapping WARN_ON(pipe->nr_bufs); right before the loop
in splice_direct_to_actor().  Internal pipe should be empty when we
enter this function.
	b) the same WARN_ON() in the very end of the loop body.

We should have started with empty pipe.  We'd called ->splice_read()
and it had returned a positive number (in read_len).  That should be
the amount we'd pushed in there.  Then we call actor(), with
sd->total_len set to read_len.  Its return value is
	* positive (or we would've buggered off)
	* no less than read_len (ditto)
so it should have drained the pipe entirely, leaving it empty again.

Finding it not just non-empty, but full means that something's very
wrong.  The actor here is essentially ->splice_write(), and I'm really
curious which file is the target.  Actually, could you turn those
WARN_ON() into
	if (WARN_ON(pipe->nr_bufs))
		printk(KERN_ERR "->splice_write = %p",
			sd->u.file->f_op->splice_write);
and see which function it is?


* Re: iov_iter_pipe warning.
  2017-04-10 19:57     ` Al Viro
@ 2017-04-10 23:48       ` Dave Jones
  2017-04-11  0:22         ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-10 23:48 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Mon, Apr 10, 2017 at 08:57:11PM +0100, Al Viro wrote:
 > On Mon, Apr 10, 2017 at 03:42:06PM -0400, Dave Jones wrote:
 > >  > Do you have a reliable reproducer?
 > > 
 > > Not reliable, but I'll see if I can find some time to narrow it down
 > > this week.  I've been working on better logging of "wtf just happened"
 > > the last few weeks, so it should get easier once I finish that work.
 > 
 > I would suggest
 > 	a) slapping WARN_ON(pipe->nr_bufs); right before the loop
 > in splice_direct_to_actor().  Internal pipe should be empty when we
 > enter this function.
 > 	b) the same WARN_ON() in the very end of the loop body.
 > 
 > We should have started with empty pipe.  We'd called ->splice_read()
 > and it had returned a positive number (in read_len).  That should be
 > the amount we'd pushed in there.  Then we call actor(), with
 > sd->total_len set to read_len.  Its return value is
 > 	* positive (or we would've buggered off)
 > 	* no less than read_len (ditto)
 > so it should have drained the pipe entirely, leaving it empty again.
 > 
 > Finding it not just non-empty, but full means that something's very
 > wrong.  The actor here is essentially ->splice_write(), and I'm really
 > curious which file is the target.  Actually, could you turn those
 > WARN_ON() into
 > 	if (WARN_ON(pipe->nr_bufs))
 > 		printk(KERN_ERR "->splice_write = %p",
 > 			sd->u.file->f_op->splice_write);
 > and see which function it is?

s/nr_bufs/nrbufs/ aside, I tried this, and it didn't trigger, even
though I hit the iov_iter_pipe WARN again.

	Dave


* Re: iov_iter_pipe warning.
  2017-04-10 23:48       ` Dave Jones
@ 2017-04-11  0:22         ` Al Viro
  2017-04-11  3:05           ` Dave Jones
  0 siblings, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-04-11  0:22 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Mon, Apr 10, 2017 at 07:48:30PM -0400, Dave Jones wrote:
>  > 	if (WARN_ON(pipe->nr_bufs))
>  > 		printk(KERN_ERR "->splice_write = %p",
>  > 			sd->u.file->f_op->splice_write);
>  > and see which function it is?
> 
> s/nr_bufs/nrbufs/ aside, I tried this, and it didn't trigger, even
> though I hit the iov_iter_pipe WARN again.

Huh?  So you have WARN_ON(pipe->nrbufs == pipe->buffers) trigger despite
WARN_ON(pipe->nrbufs) *not* triggering a bit earlier?  Nuts...

Let's do this:
	* in generic_file_splice_read(): WARN_ON(pipe->nrbufs == pipe->buffers)
*and* WARN_ON(!pipe->buffers) in the very beginning.
	* in do_splice_to(): ditto.
	* in splice_direct_to_actor(): if (WARN_ON... from upthread (with the
typo fix, of course).
	* in iov_iter_pipe() turn the test into
	if (WARN_ON(pipe->nrbufs == pipe->buffers))
		WARN_ON(pipe != current->splice_pipe);


* Re: iov_iter_pipe warning.
  2017-04-11  0:22         ` Al Viro
@ 2017-04-11  3:05           ` Dave Jones
  2017-04-11  3:28             ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-11  3:05 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Tue, Apr 11, 2017 at 01:22:15AM +0100, Al Viro wrote:

 > 	* in do_splice_to(): WARN_ON(pipe->nrbufs == pipe->buffers)

Hit this one.

 > 	* in generic_file_splice_read(): WARN_ON(pipe->nrbufs == pipe->buffers)

followed by this one.

Then the usual iov_iter trace.

	Dave


* Re: iov_iter_pipe warning.
  2017-04-11  3:05           ` Dave Jones
@ 2017-04-11  3:28             ` Al Viro
  2017-04-11 20:53               ` Dave Jones
  0 siblings, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-04-11  3:28 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Mon, Apr 10, 2017 at 11:05:32PM -0400, Dave Jones wrote:
> On Tue, Apr 11, 2017 at 01:22:15AM +0100, Al Viro wrote:
> 
>  > 	* in do_splice_to(): WARN_ON(pipe->nrbufs == pipe->buffers)
> 
> Hit this one.

But not WARN_ON(pipe->nrbufs) in its caller *or* WARN_ON(!pipe->buffers)
in do_splice_to() itself?

How the devil can that be possible?

Again, to make sure we are on the same page: in
	if (WARN_ON(pipe->nrbufs)) {
		printk(KERN_ERR "->splice_write = %p",
			sd->u.file->f_op->splice_write);
	}
        while (len) {
                size_t read_len;
                loff_t pos = sd->pos, prev_pos = pos;

                ret = do_splice_to(in, &pos, pipe, len, flags);
		...
		... (not a single continue in sight)
		...
		if (WARN_ON(pipe->nrbufs)) {
			printk(KERN_ERR "->splice_write = %p",
				sd->u.file->f_op->splice_write);
		}
	}
neither of those WARN_ON() triggers.  In do_splice_to()
	WARN_ON(pipe->nrbufs == pipe->buffers);
does trigger, but
	WARN_ON(!pipe->buffers);
does not.  And pipe is equal to current->splice_pipe, so nobody else could
see it, let alone be messing with it.

How can that be possible?  Non-triggering WARN_ON() in caller of do_splice_to()
mean that pipe->nrbufs is zero.  Triggering WARN_ON() in do_splice_to() means
that it's equal to pipe->buffers, but WARN_ON(!pipe->buffers) manages to avoid
being triggered?  Can you confirm all that?  Because if that's the case,
the next possibility is random memory corruption and/or pipe_info dangling
pointers/use-after-free/etc.


* Re: iov_iter_pipe warning.
  2017-04-11  3:28             ` Al Viro
@ 2017-04-11 20:53               ` Dave Jones
  2017-04-11 21:12                 ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-11 20:53 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Tue, Apr 11, 2017 at 04:28:39AM +0100, Al Viro wrote:
 > On Mon, Apr 10, 2017 at 11:05:32PM -0400, Dave Jones wrote:
 > > On Tue, Apr 11, 2017 at 01:22:15AM +0100, Al Viro wrote:
 > > 
 > >  > 	* in do_splice_to(): WARN_ON(pipe->nrbufs == pipe->buffers)
 > > 
 > > Hit this one.
 > 
 > But not WARN_ON(pipe->nrbufs) in its caller *or* WARN_ON(!pipe->buffers)
 > in do_splice_to() itself?
 > 
 > How the devil can that be possible?
 > 
 > Again, to make sure we are on the same page: in
 > 	if (WARN_ON(pipe->nrbufs)) {
 > 		printk(KERN_ERR "->splice_write = %p",
 > 			sd->u.file->f_op->splice_write);
 > 	}
 >         while (len) {
 >                 size_t read_len;
 >                 loff_t pos = sd->pos, prev_pos = pos;
 > 
 >                 ret = do_splice_to(in, &pos, pipe, len, flags);
 > 		...
 > 		... (not a single continue in sight)
 > 		...
 > 		if (WARN_ON(pipe->nrbufs)) {
 > 			printk(KERN_ERR "->splice_write = %p",
 > 				sd->u.file->f_op->splice_write);
 > 		}
 > 	}

Ah, missed adding this 2nd WARN_ON.

 > neither of those WARN_ON() triggers.  In do_splice_to()
 > 	WARN_ON(pipe->nrbufs == pipe->buffers);
 > does trigger, but
 > 	WARN_ON(!pipe->buffers);
 > does not.  And pipe is equal to current->splice_pipe, so nobody else could
 > see it, let alone be messing with it.
 > 
 > How can that be possible?  Non-triggering WARN_ON() in caller of do_splice_to()
 > mean that pipe->nrbufs is zero.  Triggering WARN_ON() in do_splice_to() means
 > that it's equal to pipe->buffers, but WARN_ON(!pipe->buffers) manages to avoid
 > being triggered?  Can you confirm all that?
 
Aside from the above, yeah, same.

 > Because if that's the case,
 > the next possibility is random memory corruption and/or pipe_info dangling
 > pointers/use-after-free/etc.

I've been tied up with other stuff today, so while I was preoccupied, I
did a run with KASAN to see if anything fell out.  That seems to slow
things down enough that I don't trigger anything. Been running all day
without incident.

I'll turn it back off, and retry with the missing WARN from above added.

	Dave


* Re: iov_iter_pipe warning.
  2017-04-11 20:53               ` Dave Jones
@ 2017-04-11 21:12                 ` Al Viro
  2017-04-11 22:25                   ` Dave Jones
  0 siblings, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-04-11 21:12 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Tue, Apr 11, 2017 at 04:53:36PM -0400, Dave Jones wrote:
>  > 		if (WARN_ON(pipe->nrbufs)) {
>  > 			printk(KERN_ERR "->splice_write = %p",
>  > 				sd->u.file->f_op->splice_write);
>  > 		}
>  > 	}
> 
> Ah, missed adding this 2nd WARN_ON.

Good - it means that we don't have to chase memory corruption yet.  That
smells like some ->splice_write() claiming it has copied more than it
has drained from the pipe.

> I'll turn it back off, and retry with the missing WARN from above added.

Please, do.  And it would be nice if you printed sd->u.file->f_op as well
(in the same printk).


* Re: iov_iter_pipe warning.
  2017-04-11 21:12                 ` Al Viro
@ 2017-04-11 22:25                   ` Dave Jones
  2017-04-11 23:28                     ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-11 22:25 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Tue, Apr 11, 2017 at 10:12:16PM +0100, Al Viro wrote:
 > On Tue, Apr 11, 2017 at 04:53:36PM -0400, Dave Jones wrote:
 > >  > 		if (WARN_ON(pipe->nrbufs)) {
 > >  > 			printk(KERN_ERR "->splice_write = %p",
 > >  > 				sd->u.file->f_op->splice_write);
 > >  > 		}
 > >  > 	}
 > > 
 > > Ah, missed adding this 2nd WARN_ON.
 > 
 > Good - it means that we don't have to chase memory corruption yet.  That
 > smells like some ->splice_write() claiming it has copied more than it
 > has drained from the pipe.
 > 
 > > I'll turn it back off, and retry with the missing WARN from above added.
 > 
 > Please, do.  And it would be nice if you printed sd->u.file->f_op as well
 > (in the same printk).

Will add that for the next run. In the meantime..

[ 2675.049082] WARNING: CPU: 0 PID: 10149 at fs/splice.c:1020 splice_direct_to_actor+0x20c/0x2b0
[ 2675.061581] CPU: 0 PID: 10149 Comm: trinity-c2 Not tainted 4.11.0-rc6-think+ #6 
[ 2675.086528] Call Trace:
[ 2675.098901]  dump_stack+0x68/0x93
[ 2675.111242]  __warn+0xcb/0xf0
[ 2675.123467]  warn_slowpath_null+0x1d/0x20
[ 2675.135602]  splice_direct_to_actor+0x20c/0x2b0
[ 2675.147636]  ? generic_pipe_buf_nosteal+0x10/0x10
[ 2675.159675]  do_splice_direct+0x9e/0xd0
[ 2675.171837]  do_sendfile+0x1d7/0x3c0
[ 2675.183762]  SyS_sendfile64+0x73/0xe0
[ 2675.195944]  do_syscall_64+0x66/0x1d0
[ 2675.208090]  entry_SYSCALL64_slow_path+0x25/0x25
[ 2675.220256] RIP: 0033:0x7fcf6cdb80f9
[ 2675.232338] RSP: 002b:00007fff12252418 EFLAGS: 00000246
[ 2675.244447]  ORIG_RAX: 0000000000000028
[ 2675.256474] RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007fcf6cdb80f9
[ 2675.268528] RDX: 00007fcf6af23000 RSI: 0000000000000187 RDI: 0000000000000158
[ 2675.280680] RBP: 00007fcf6d479000 R08: 9696969696969696 R09: adadadadadadadad
[ 2675.292870] R10: 0000000000000508 R11: 0000000000000246 R12: 0000000000000002
[ 2675.305049] R13: 00007fcf6d479048 R14: 00007fcf6d48ead8 R15: 00007fcf6d479000
[ 2675.317455] ---[ end trace 9e4e62cfabcb9081 ]---
[ 2675.329626] ->splice_write = ffffffff812b3130

$ grep ffffffff812b3130 /proc/kallsyms 
ffffffff812b3130 T generic_splice_sendpage

This one spat out all by itself.

	Dave


* Re: iov_iter_pipe warning.
  2017-04-11 22:25                   ` Dave Jones
@ 2017-04-11 23:28                     ` Al Viro
  2017-04-11 23:34                       ` Dave Jones
  2017-04-11 23:45                       ` Dave Jones
  0 siblings, 2 replies; 73+ messages in thread
From: Al Viro @ 2017-04-11 23:28 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Tue, Apr 11, 2017 at 06:25:02PM -0400, Dave Jones wrote:

> ffffffff812b3130 T generic_splice_sendpage
> 
> This one spat out all by itself.

No need to print ->f_op for that one - can be only socket_file_ops.  Now,
the address family of that socket would be interesting...

How about adding to that printk (under if (WARN_ON()) something like
	file = sd->u.file;
	if (file->f_op->splice_write == generic_splice_sendpage) {
		struct socket *sock = file->private_data;
		printk(KERN_ERR "socket [%d, %p]\n", sock->type, sock->ops);
	}
	printk(KERN_ERR "in->f_op = %p\n", in->f_op);

That said, we seem to have
	* a pipe with some amount of data in it
	* generic_splice_sendpage() called on that pipe, with len equal to
the amount in the pipe.  Hopefully.
	* generic_splice_sendpage() returning the value equal to len...
	* ... and not draining the pipe entirely.

generic_splice_sendpage() is calling this:

ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
                           splice_actor *actor)
{
        int ret;

        splice_from_pipe_begin(sd); 
        do {
                cond_resched();
                ret = splice_from_pipe_next(pipe, sd);
                if (ret > 0)
                        ret = splice_from_pipe_feed(pipe, sd, actor);
        } while (ret > 0);
        splice_from_pipe_end(pipe, sd);

        return sd->num_spliced ? sd->num_spliced : ret;
}

It has returned a positive number.  That must have been sd->num_spliced.
splice_from_pipe_begin() sets it to 0 and the only place where it is
updated is
                ret = actor(pipe, buf, sd);
                if (ret <= 0)
                        return ret;

                buf->offset += ret;
                buf->len -= ret;

                sd->num_spliced += ret;
                sd->len -= ret;
                sd->pos += ret;
                sd->total_len -= ret;

                if (!buf->len) {
                        pipe_buf_release(pipe, buf);
                        pipe->curbuf = (pipe->curbuf + 1) & (pipe->buffers - 1);
                        pipe->nrbufs--;

in splice_from_pipe_feed().  Whatever actor() is doing, the amount we
drain from the pipe is equal to the amount we add to ->num_spliced.

In other words, sending part looks reasonably solid.  Another thing that
might have happened is
                ret = do_splice_to(in, &pos, pipe, len, flags);
                if (unlikely(ret <= 0))
                        goto out_release;
returning less than it has actually dumped into the pipe in some situations.

Which means default_file_splice_read() called on an empty pipe and
returning less than it has put there.  The thing is, the last thing
that function does is
        iov_iter_advance(&to, copied);  /* truncates and discards */
	return res;
and we would have to have copied > res > 0 for that scenario to happen...

Interesting...  How about
	if (res > 0 && pipe == current->splice_pipe) {
		int idx = pipe->curbuf;
		int n = pipe->nrbufs;
		size_t size = 0;
		while (n--) {
			size += pipe->bufs[idx++].len;
			if (idx == pipe->buffers)
				idx = 0;
		}
		WARN_ON(len != res);
	}
just before the return in default_file_splice_read()?  WARN_ON_ONCE,
perhaps, to avoid cascades...


* Re: iov_iter_pipe warning.
  2017-04-11 23:28                     ` Al Viro
@ 2017-04-11 23:34                       ` Dave Jones
  2017-04-11 23:48                         ` Al Viro
  2017-04-11 23:45                       ` Dave Jones
  1 sibling, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-11 23:34 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Wed, Apr 12, 2017 at 12:28:42AM +0100, Al Viro wrote:
 > On Tue, Apr 11, 2017 at 06:25:02PM -0400, Dave Jones wrote:
 > 
 > > ffffffff812b3130 T generic_splice_sendpage
 > > 
 > > This one spat out all by itself.
 > 
 > No need to print ->f_op for that one - can be only socket_file_ops.  Now,
 > the address family of that socket would be interesting...

Turned out to be..

 ->splice_write = ffffffff812b2b70 sd->u.file->f_op=ffffffffa02e0980
$ grep ffffffffa02e0980 /proc/kallsyms 
ffffffffa02e0980 r nfs4_file_operations	[nfsv4]

The dir the test is running from is /home on NFS, so that makes sense.

 > How about adding to that printk (under if (WARN_ON()) something like
 > 	file = sd->u.file;
 > 	if (file->f_op->splice_write == generic_splice_sendpage) {
 > 		struct socket *sock = file->private_data;
 > 		printk(KERN_ERR "socket [%d, %p]\n", sock->type, sock->ops);
 > 	}
 > 	printk(KERN_ERR "in->f_op = %p\n", in->f_op);

Ack.

 > Interesting...  How about
 > 	if (res > 0 && pipe == current->splice_pipe) {
 > 		int idx = pipe->curbuf;
 > 		int n = pipe->nrbufs;
 > 		size_t size = 0;
 > 		while (n--) {
 > 			size += pipe->bufs[idx++].len;
 > 			if (idx == pipe->buffers)
 > 				idx = 0;
 > 		}
 > 		WARN_ON(len != res);
 > 	}
 > just before the return in default_file_splice_read()?  WARN_ON_ONCE,
 > perhaps, to avoid cascades...

Sure, up next. Gimme an hour, it seems to be reproducing fairly quickly
tonight.

	Dave.


* Re: iov_iter_pipe warning.
  2017-04-11 23:28                     ` Al Viro
  2017-04-11 23:34                       ` Dave Jones
@ 2017-04-11 23:45                       ` Dave Jones
  2017-04-11 23:51                         ` Al Viro
  1 sibling, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-11 23:45 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Wed, Apr 12, 2017 at 12:28:42AM +0100, Al Viro wrote:
 > On Tue, Apr 11, 2017 at 06:25:02PM -0400, Dave Jones wrote:
 > 
 > > ffffffff812b3130 T generic_splice_sendpage
 > > 
 > > This one spat out all by itself.
 > 
 > No need to print ->f_op for that one - can be only socket_file_ops.  Now,
 > the address family of that socket would be interesting...
 > 
 > How about adding to that printk (under if (WARN_ON()) something like
 > 	file = sd->u.file;
 > 	if (file->f_op->splice_write == generic_splice_sendpage) {
 > 		struct socket *sock = file->private_data;
 > 		printk(KERN_ERR "socket [%d, %p]\n", sock->type, sock->ops);
 > 	}
 > 	printk(KERN_ERR "in->f_op = %p\n", in->f_op);

Ugh, this explodes with a million errors when I try to compile it.
It's missing the socket definition, and including <linux/net.h> causes
another cascade of errors about linkage.h and nonsense.

Send me a diff, I'll check in later.

	Dave


* Re: iov_iter_pipe warning.
  2017-04-11 23:34                       ` Dave Jones
@ 2017-04-11 23:48                         ` Al Viro
  0 siblings, 0 replies; 73+ messages in thread
From: Al Viro @ 2017-04-11 23:48 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Tue, Apr 11, 2017 at 07:34:37PM -0400, Dave Jones wrote:
> On Wed, Apr 12, 2017 at 12:28:42AM +0100, Al Viro wrote:
>  > On Tue, Apr 11, 2017 at 06:25:02PM -0400, Dave Jones wrote:
>  > 
>  > > ffffffff812b3130 T generic_splice_sendpage
>  > > 
>  > > This one spat out all by itself.
>  > 
>  > No need to print ->f_op for that one - can be only socket_file_ops.  Now,
>  > the address family of that socket would be interesting...
> 
> Turned out to be..
> 
>  ->splice_write = ffffffff812b2b70 sd->u.file->f_op=ffffffffa02e0980
> $ grep ffffffffa02e0980 /proc/kallsyms 
> ffffffffa02e0980 r nfs4_file_operations	[nfsv4]

Lovely...  So now we get it not only on splice to socket, but on
splice to a regular file on NFS as well?  That makes a lying ->splice_read()
more likely...

>  > Interesting...  How about
>  > 	if (res > 0 && pipe == current->splice_pipe) {
>  > 		int idx = pipe->curbuf;
>  > 		int n = pipe->nrbufs;
>  > 		size_t size = 0;
>  > 		while (n--) {
>  > 			size += pipe->bufs[idx++].len;
>  > 			if (idx == pipe->buffers)
>  > 				idx = 0;
>  > 		}
>  > 		WARN_ON(len != res);
>  > 	}
>  > just before the return in default_file_splice_read()?  WARN_ON_ONCE,
>  > perhaps, to avoid cascades...
> 
> Sure, up next. Gimme an hour, it seems to be reproducing fairly quickly
> tonight.

Makes sense - now it screams on a non-empty pipe at the beginning of the
loop body; originally it was only for a _full_ pipe (i.e. for leftovers
from several iterations filling the entire thing up), which should've
been harder to trigger...


* Re: iov_iter_pipe warning.
  2017-04-11 23:45                       ` Dave Jones
@ 2017-04-11 23:51                         ` Al Viro
  2017-04-11 23:56                           ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-04-11 23:51 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Tue, Apr 11, 2017 at 07:45:58PM -0400, Dave Jones wrote:
>  > 	if (file->f_op->splice_write == generic_splice_sendpage) {
>  > 		struct socket *sock = file->private_data;
>  > 		printk(KERN_ERR "socket [%d, %p]\n", sock->type, sock->ops);
>  > 	}
>  > 	printk(KERN_ERR "in->f_op = %p\n", in->f_op);
> 
> Ugh, this explodes with a million errors when I try to compile it. 
> It misses socket definition, and include <linux/net.h> causes another
> cascade of errors about linkage.h and nonsense.

Ignore the socket part - you've already triggered it with NFS file as
destination, so this is not particularly interesting.  I would still like
to see in->f_op and even more - the checks in default_file_splice_read().


* Re: iov_iter_pipe warning.
  2017-04-11 23:51                         ` Al Viro
@ 2017-04-11 23:56                           ` Al Viro
  2017-04-12  0:06                             ` Dave Jones
  0 siblings, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-04-11 23:56 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Wed, Apr 12, 2017 at 12:51:58AM +0100, Al Viro wrote:
> On Tue, Apr 11, 2017 at 07:45:58PM -0400, Dave Jones wrote:
> >  > 	if (file->f_op->splice_write == generic_splice_sendpage) {
> >  > 		struct socket *sock = file->private_data;
> >  > 		printk(KERN_ERR "socket [%d, %p]\n", sock->type, sock->ops);
> >  > 	}
> >  > 	printk(KERN_ERR "in->f_op = %p\n", in->f_op);
> > 
> > Ugh, this explodes with a million errors when I try to compile it. 
> > It misses socket definition, and include <linux/net.h> causes another
> > cascade of errors about linkage.h and nonsense.
> 
> Ignore the socket part - you've already triggered it with NFS file as
> destination, so this is not particularly interesting.  I would still like
> to see in->f_op and even more - the checks in default_file_splice_read().

... and the latter had a braino - WARN_ON(size != ret), not len != ret.
Diff follows:

diff --git a/fs/splice.c b/fs/splice.c
index 006ba50f4ece..43dd9b3140ee 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -448,6 +448,18 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 		put_page(pages[i]);
 	kvfree(pages);
 	iov_iter_advance(&to, copied);	/* truncates and discards */
+	if (res > 0 && pipe == current->splice_pipe) {
+		int idx = pipe->curbuf;
+		int n = pipe->nrbufs;
+		size_t size = 0;
+		while (n--) {
+			size += pipe->bufs[idx++].len;
+			if (idx == pipe->buffers)
+				idx = 0;
+		}
+		WARN_ON(size != res);
+	}
+
 	return res;
 }
 
@@ -970,6 +982,11 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 	while (len) {
 		size_t read_len;
 		loff_t pos = sd->pos, prev_pos = pos;
+		if (WARN_ON(pipe->buffers)) {
+			printk(KERN_ERR "in->f_op = %p, ->splice_write = %p\n",
+				in->f_op,
+				sd->u.file->f_op->splice_write);
+		}
 
 		ret = do_splice_to(in, &pos, pipe, len, flags);
 		if (unlikely(ret <= 0))


* Re: iov_iter_pipe warning.
  2017-04-11 23:56                           ` Al Viro
@ 2017-04-12  0:06                             ` Dave Jones
  2017-04-12  0:17                               ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-12  0:06 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Wed, Apr 12, 2017 at 12:56:41AM +0100, Al Viro wrote:
 > On Wed, Apr 12, 2017 at 12:51:58AM +0100, Al Viro wrote:
 > > On Tue, Apr 11, 2017 at 07:45:58PM -0400, Dave Jones wrote:
 > > >  > 	if (file->f_op->splice_write == generic_splice_sendpage) {
 > > >  > 		struct socket *sock = file->private_data;
 > > >  > 		printk(KERN_ERR "socket [%d, %p]\n", sock->type, sock->ops);
 > > >  > 	}
 > > >  > 	printk(KERN_ERR "in->f_op = %p\n", in->f_op);
 > > > 
 > > > Ugh, this explodes with a million errors when I try to compile it. 
 > > > It misses socket definition, and include <linux/net.h> causes another
 > > > cascade of errors about linkage.h and nonsense.
 > > 
 > > Ignore the socket part - you've already triggered it with NFS file as
 > > destination, so this is not particularly interesting.  I would still like
 > > to see in->f_op and even more - the checks in default_file_splice_read().
 > 
 > ... and the latter had a braino - WARN_ON(size != ret), not len != ret.
 > Diff follows:
 
super fast repro..

[   51.795286] WARNING: CPU: 1 PID: 2057 at fs/splice.c:985 splice_direct_to_actor+0x13f/0x280
[   51.806721] CPU: 1 PID: 2057 Comm: trinity-c3 Not tainted 4.11.0-rc6-think+ #9 
[   51.814567] ------------[ cut here ]------------
[   51.814573] WARNING: CPU: 2 PID: 2018 at fs/splice.c:985 splice_direct_to_actor+0x13f/0x280
[   51.852613] Call Trace:
[   51.864076]  dump_stack+0x68/0x93
[   51.875475]  __warn+0xcb/0xf0
[   51.886831]  warn_slowpath_null+0x1d/0x20
[   51.898162]  splice_direct_to_actor+0x13f/0x280
[   51.909509]  ? generic_pipe_buf_nosteal+0x10/0x10
[   51.920949]  do_splice_direct+0x9e/0xd0
[   51.932166]  do_sendfile+0x1d7/0x3c0
[   51.943349]  SyS_sendfile64+0x73/0xe0
[   51.954495]  do_syscall_64+0x66/0x1d0
[   51.965630]  entry_SYSCALL64_slow_path+0x25/0x25
[   51.976718] RIP: 0033:0x7f3e6ecc80f9
[   51.987732] RSP: 002b:00007ffcb8b38728 EFLAGS: 00000246
[   51.998705]  ORIG_RAX: 0000000000000028
[   52.009546] RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007f3e6ecc80f9
[   52.020507] RDX: 00007f3e6f264000 RSI: 000000000000011a RDI: 000000000000019b
[   52.031427] RBP: 00007f3e6f382000 R08: 0000000000000010 R09: 0000000000000000
[   52.042263] R10: 0000000000000005 R11: 0000000000000246 R12: 0000000000000002
[   52.053040] R13: 00007f3e6f382048 R14: 00007f3e6f39ead8 R15: 00007f3e6f382000
[   52.063787] CPU: 2 PID: 2018 Comm: trinity-c6 Not tainted 4.11.0-rc6-think+ #9 
[   52.063997] ---[ end trace 51a5bc02dc45a59d ]---
[   52.063998] in->f_op = ffffffff81c26480, ->splice_write =           (null)
[   52.106748] Call Trace:
[   52.117338]  dump_stack+0x68/0x93
[   52.127832]  __warn+0xcb/0xf0
[   52.138224]  warn_slowpath_null+0x1d/0x20
[   52.148737]  splice_direct_to_actor+0x13f/0x280
[   52.159331]  ? generic_pipe_buf_nosteal+0x10/0x10
[   52.169897]  do_splice_direct+0x9e/0xd0
[   52.180412]  do_sendfile+0x1d7/0x3c0
[   52.190821]  SyS_sendfile64+0x73/0xe0
[   52.201210]  do_syscall_64+0x66/0x1d0
[   52.211503]  entry_SYSCALL64_slow_path+0x25/0x25
[   52.221740] RIP: 0033:0x7f3e6ecc80f9
[   52.231955] RSP: 002b:00007ffcb8b38728 EFLAGS: 00000246
[   52.242137]  ORIG_RAX: 0000000000000028
[   52.252235] RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007f3e6ecc80f9
[   52.262453] RDX: 00007f3e6f263000 RSI: 000000000000011d RDI: 000000000000011d
[   52.272637] RBP: 00007f3e6f36d000 R08: 00000000000000c7 R09: ffffffffffffffef
[   52.282775] R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000000002
[   52.292889] R13: 00007f3e6f36d048 R14: 00007f3e6f39ead8 R15: 00007f3e6f36d000
[   52.304196] ---[ end trace 51a5bc02dc45a59e ]---
[   52.314808] in->f_op = ffffffff81c26480, ->splice_write = ffffffff812b2b20

$ grep  ffffffff812b2b20 /proc/kallsyms 
ffffffff812b2b20 T iter_file_splice_write
$ grep ffffffff81c26480 /proc/kallsyms 
ffffffff81c26480 r shmem_file_operations


* Re: iov_iter_pipe warning.
  2017-04-12  0:06                             ` Dave Jones
@ 2017-04-12  0:17                               ` Al Viro
  2017-04-12  0:58                                 ` Dave Jones
  0 siblings, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-04-12  0:17 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Tue, Apr 11, 2017 at 08:06:07PM -0400, Dave Jones wrote:
> On Wed, Apr 12, 2017 at 12:56:41AM +0100, Al Viro wrote:
>  > On Wed, Apr 12, 2017 at 12:51:58AM +0100, Al Viro wrote:
>  > > On Tue, Apr 11, 2017 at 07:45:58PM -0400, Dave Jones wrote:
>  > > >  > 	if (file->f_op->splice_write == generic_splice_sendpage) {
>  > > >  > 		struct socket *sock = file->private_data;
>  > > >  > 		printk(KERN_ERR "socket [%d, %p]\n", sock->type, sock->ops);
>  > > >  > 	}
>  > > >  > 	printk(KERN_ERR "in->f_op = %p\n", in->f_op);
>  > > > 
>  > > > Ugh, this explodes with a million errors when I try to compile it. 
 >  > > > It's missing the socket definition, and including <linux/net.h> causes
 >  > > > another cascade of errors about linkage.h and nonsense.
>  > > 
>  > > Ignore the socket part - you've already triggered it with NFS file as
>  > > destination, so this is not particularly interesting.  I would still like
>  > > to see in->f_op and even more - the checks in default_file_splice_read().
>  > 
>  > ... and the latter had a braino - WARN_ON(size != ret), not len != ret.
>  > Diff follows:
>  
> super fast repro..

Alas, that's just another braino - it checks for non-zero ->buffers (always
true) rather than non-zero ->nrbufs (non-empty pipe).  Sorry.  Fixed diff
follows:

diff --git a/fs/splice.c b/fs/splice.c
index 006ba50f4ece..43dd9b3140ee 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -448,6 +448,18 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 		put_page(pages[i]);
 	kvfree(pages);
 	iov_iter_advance(&to, copied);	/* truncates and discards */
+	if (res > 0 && pipe == current->splice_pipe) {
+		int idx = pipe->curbuf;
+		int n = pipe->nrbufs;
+		size_t size = 0;
+		while (n--) {
+			size += pipe->bufs[idx++].len;
+			if (idx == pipe->buffers)
+				idx = 0;
+		}
+		WARN_ON(size != res);
+	}
+
 	return res;
 }
 
@@ -970,6 +982,11 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 	while (len) {
 		size_t read_len;
 		loff_t pos = sd->pos, prev_pos = pos;
+		if (WARN_ON(pipe->nrbufs)) {
+			printk(KERN_ERR "in->f_op = %p, ->splice_write = %p\n",
+				in->f_op,
+				sd->u.file->f_op->splice_write);
+		}
 
 		ret = do_splice_to(in, &pos, pipe, len, flags);
 		if (unlikely(ret <= 0))


* Re: iov_iter_pipe warning.
  2017-04-12  0:17                               ` Al Viro
@ 2017-04-12  0:58                                 ` Dave Jones
  2017-04-12  1:15                                   ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-12  0:58 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Wed, Apr 12, 2017 at 01:17:46AM +0100, Al Viro wrote:
 > On Tue, Apr 11, 2017 at 08:06:07PM -0400, Dave Jones wrote:
 > > On Wed, Apr 12, 2017 at 12:56:41AM +0100, Al Viro wrote:
 > >  > On Wed, Apr 12, 2017 at 12:51:58AM +0100, Al Viro wrote:
 > >  > > On Tue, Apr 11, 2017 at 07:45:58PM -0400, Dave Jones wrote:
 > >  > > >  > 	if (file->f_op->splice_write == generic_splice_sendpage) {
 > >  > > >  > 		struct socket *sock = file->private_data;
 > >  > > >  > 		printk(KERN_ERR "socket [%d, %p]\n", sock->type, sock->ops);
 > >  > > >  > 	}
 > >  > > >  > 	printk(KERN_ERR "in->f_op = %p\n", in->f_op);
 > >  > > > 
 > >  > > > Ugh, this explodes with a million errors when I try to compile it. 
 >  > >  > > > It's missing the socket definition, and including <linux/net.h>
 >  > >  > > > causes another cascade of errors about linkage.h and nonsense.
 > >  > > 
 > >  > > Ignore the socket part - you've already triggered it with NFS file as
 > >  > > destination, so this is not particularly interesting.  I would still like
 > >  > > to see in->f_op and even more - the checks in default_file_splice_read().
 > >  > 
 > >  > ... and the latter had a braino - WARN_ON(size != ret), not len != ret.
 > >  > Diff follows:
 > >  
 > > super fast repro..
 > 
 > Alas, that's just another braino - it checks for non-zero ->buffers (always
 > true) rather than non-zero ->nrbufs (non-empty pipe).  Sorry.  Fixed diff
 > follows:
 
[  462.415864] WARNING: CPU: 0 PID: 21500 at fs/splice.c:985 splice_direct_to_actor+0x13f/0x280
[  462.428443] CPU: 0 PID: 21500 Comm: trinity-c4 Not tainted 4.11.0-rc6-think+ #10 
[  462.453818] Call Trace:
[  462.466415]  dump_stack+0x68/0x93
[  462.478920]  __warn+0xcb/0xf0
[  462.491367]  warn_slowpath_null+0x1d/0x20
[  462.503823]  splice_direct_to_actor+0x13f/0x280
[  462.516236]  ? generic_pipe_buf_nosteal+0x10/0x10
[  462.528606]  do_splice_direct+0x9e/0xd0
[  462.540825]  do_sendfile+0x1d7/0x3c0
[  462.552910]  SyS_sendfile64+0x73/0xe0
[  462.564989]  do_syscall_64+0x66/0x1d0
[  462.576949]  entry_SYSCALL64_slow_path+0x25/0x25
[  462.588872] RIP: 0033:0x7febc78b30f9
[  462.600809] RSP: 002b:00007ffd767b6398 EFLAGS: 00000246
[  462.612790]  ORIG_RAX: 0000000000000028
[  462.624684] RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007febc78b30f9
[  462.636737] RDX: 00007febc5c1e000 RSI: 0000000000000130 RDI: 0000000000000130
[  462.648756] RBP: 00007febc7f66000 R08: 0000000000000ff1 R09: fffffffffffffffd
[  462.660709] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000002
[  462.672627] R13: 00007febc7f66048 R14: 00007febc7f89ad8 R15: 00007febc7f66000
[  462.684586] ---[ end trace d002d06d5879c8a9 ]---
[  462.696557] in->f_op = ffffffffa02df980, ->splice_write = ffffffff812b2b20
$ grep ffffffffa02df980 /proc/kallsyms 
ffffffffa02df980 r nfs4_file_operations	[nfsv4]
$ grep ffffffff812b2b20 /proc/kallsyms 
ffffffff812b2b20 T iter_file_splice_write


* Re: iov_iter_pipe warning.
  2017-04-12  0:58                                 ` Dave Jones
@ 2017-04-12  1:15                                   ` Al Viro
  2017-04-12  2:29                                     ` Dave Jones
  0 siblings, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-04-12  1:15 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Tue, Apr 11, 2017 at 08:58:53PM -0400, Dave Jones wrote:
> [  462.696557] in->f_op = ffffffffa02df980, ->splice_write = ffffffff812b2b20
> $ grep ffffffffa02df980 /proc/kallsyms 
> ffffffffa02df980 r nfs4_file_operations	[nfsv4]
> $ grep ffffffff812b2b20 /proc/kallsyms 
> ffffffff812b2b20 T iter_file_splice_write

Let's try to figure out whether it's read or write side going wrong.

diff --git a/fs/splice.c b/fs/splice.c
index 006ba50f4ece..0a7c0bd3e164 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -970,10 +970,26 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 	while (len) {
 		size_t read_len;
 		loff_t pos = sd->pos, prev_pos = pos;
+		if (WARN_ON(pipe->nrbufs)) {
+			printk(KERN_ERR "in->f_op = %p, ->splice_write = %p\n",
+				in->f_op,
+				sd->u.file->f_op->splice_write);
+		}
 
 		ret = do_splice_to(in, &pos, pipe, len, flags);
 		if (unlikely(ret <= 0))
 			goto out_release;
+		{
+			int idx = pipe->curbuf;
+			int n = pipe->nrbufs;
+			size_t size = 0;
+			while (n--) {
+				size += pipe->bufs[idx++].len;
+				if (idx == pipe->buffers)
+					idx = 0;
+			}
+			WARN_ON(size != ret);
+		}
 
 		read_len = ret;
 		sd->total_len = read_len;

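[Editor's note: the check being added above walks the pipe's ring of buffers,
sums the bytes they hold, and compares that against what do_splice_to()
claimed to have read.  Here is a minimal userspace sketch of that accounting;
struct pipe_model and its fields only borrow the names curbuf/nrbufs/buffers
from the kernel's struct pipe_inode_info for illustration and are not the
kernel type.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct pipe_inode_info: a ring of buffers where
 * curbuf is the first occupied slot and nrbufs how many are occupied. */
struct pipe_model {
	int curbuf;			/* first occupied slot */
	int nrbufs;			/* number of occupied slots */
	int buffers;			/* ring capacity */
	struct { size_t len; } bufs[16];
};

/* Same walk as the debugging patch: sum the lengths of the occupied
 * slots, wrapping the index when it reaches the ring capacity. */
static size_t pipe_data_size(const struct pipe_model *pipe)
{
	int idx = pipe->curbuf;
	int n = pipe->nrbufs;
	size_t size = 0;

	while (n--) {
		size += pipe->bufs[idx++].len;
		if (idx == pipe->buffers)
			idx = 0;
	}
	return size;
}
```

Note the wraparound: with curbuf near the end of the ring, the walk has to
fold back to slot 0, which is why the patch compares idx against
pipe->buffers rather than just iterating linearly.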

* Re: iov_iter_pipe warning.
  2017-04-12  1:15                                   ` Al Viro
@ 2017-04-12  2:29                                     ` Dave Jones
  2017-04-12  2:58                                       ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-12  2:29 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Wed, Apr 12, 2017 at 02:15:32AM +0100, Al Viro wrote:
 > On Tue, Apr 11, 2017 at 08:58:53PM -0400, Dave Jones wrote:
 > > [  462.696557] in->f_op = ffffffffa02df980, ->splice_write = ffffffff812b2b20
 > > $ grep ffffffffa02df980 /proc/kallsyms 
 > > ffffffffa02df980 r nfs4_file_operations	[nfsv4]
 > > $ grep ffffffff812b2b20 /proc/kallsyms 
 > > ffffffff812b2b20 T iter_file_splice_write
 > 
 > Let's try to figure out whether it's read or write side going wrong.
 > 
 > diff --git a/fs/splice.c b/fs/splice.c
 > index 006ba50f4ece..0a7c0bd3e164 100644
 > --- a/fs/splice.c
 > +++ b/fs/splice.c
 > @@ -970,10 +970,26 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 >  	while (len) {
 >  		size_t read_len;
 >  		loff_t pos = sd->pos, prev_pos = pos;
 > +		if (WARN_ON(pipe->nrbufs)) {
 > +			printk(KERN_ERR "in->f_op = %p, ->splice_write = %p\n",
 > +				in->f_op,
 > +				sd->u.file->f_op->splice_write);
 > +		}
 >  
 >  		ret = do_splice_to(in, &pos, pipe, len, flags);
 >  		if (unlikely(ret <= 0))
 >  			goto out_release;
 > +		{
 > +			int idx = pipe->curbuf;
 > +			int n = pipe->nrbufs;
 > +			size_t size = 0;
 > +			while (n--) {
 > +				size += pipe->bufs[idx++].len;
 > +				if (idx == pipe->buffers)
 > +					idx = 0;
 > +			}
 > +			WARN_ON(size != ret);
 > +		}
 >  
 >  		read_len = ret;
 >  		sd->total_len = read_len;

WARNING: CPU: 0 PID: 21500 at fs/splice.c:985 splice_direct_to_actor+0x13f/0x280
CPU: 0 PID: 21500 Comm: trinity-c4 Not tainted 4.11.0-rc6-think+ #10 
Call Trace:
 dump_stack+0x68/0x93
 __warn+0xcb/0xf0
 warn_slowpath_null+0x1d/0x20
 splice_direct_to_actor+0x13f/0x280
 ? generic_pipe_buf_nosteal+0x10/0x10
 do_splice_direct+0x9e/0xd0
 do_sendfile+0x1d7/0x3c0
 SyS_sendfile64+0x73/0xe0
 do_syscall_64+0x66/0x1d0
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7febc78b30f9
RSP: 002b:00007ffd767b6398 EFLAGS: 00000246
  ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007febc78b30f9
RDX: 00007febc5c1e000 RSI: 0000000000000130 RDI: 0000000000000130
RBP: 00007febc7f66000 R08: 0000000000000ff1 R09: fffffffffffffffd
R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000002
R13: 00007febc7f66048 R14: 00007febc7f89ad8 R15: 00007febc7f66000
---[ end trace d002d06d5879c8a9 ]---
in->f_op = ffffffffa02df980, ->splice_write = ffffffff812b2b20

This time around it locked up completely. Weird.

	Dave


* Re: iov_iter_pipe warning.
  2017-04-12  2:29                                     ` Dave Jones
@ 2017-04-12  2:58                                       ` Al Viro
  2017-04-12 14:35                                         ` Dave Jones
  0 siblings, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-04-12  2:58 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Tue, Apr 11, 2017 at 10:29:11PM -0400, Dave Jones wrote:
> On Wed, Apr 12, 2017 at 02:15:32AM +0100, Al Viro wrote:
>  > On Tue, Apr 11, 2017 at 08:58:53PM -0400, Dave Jones wrote:
>  > > [  462.696557] in->f_op = ffffffffa02df980, ->splice_write = ffffffff812b2b20
>  > > $ grep ffffffffa02df980 /proc/kallsyms 
>  > > ffffffffa02df980 r nfs4_file_operations	[nfsv4]
>  > > $ grep ffffffff812b2b20 /proc/kallsyms 
>  > > ffffffff812b2b20 T iter_file_splice_write
>  > 
>  > Let's try to figure out whether it's read or write side going wrong.
>  > 
>  > diff --git a/fs/splice.c b/fs/splice.c
>  > index 006ba50f4ece..0a7c0bd3e164 100644
>  > --- a/fs/splice.c
>  > +++ b/fs/splice.c
>  > @@ -970,10 +970,26 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
>  >  	while (len) {
>  >  		size_t read_len;
>  >  		loff_t pos = sd->pos, prev_pos = pos;
>  > +		if (WARN_ON(pipe->nrbufs)) {
>  > +			printk(KERN_ERR "in->f_op = %p, ->splice_write = %p\n",
>  > +				in->f_op,
>  > +				sd->u.file->f_op->splice_write);
>  > +		}
>  >  
>  >  		ret = do_splice_to(in, &pos, pipe, len, flags);
>  >  		if (unlikely(ret <= 0))
>  >  			goto out_release;
>  > +		{
>  > +			int idx = pipe->curbuf;
>  > +			int n = pipe->nrbufs;
>  > +			size_t size = 0;
>  > +			while (n--) {
>  > +				size += pipe->bufs[idx++].len;
>  > +				if (idx == pipe->buffers)
>  > +					idx = 0;
>  > +			}
>  > +			WARN_ON(size != ret);
>  > +		}
>  >  
>  >  		read_len = ret;
>  >  		sd->total_len = read_len;
> 
> WARNING: CPU: 0 PID: 21500 at fs/splice.c:985 splice_direct_to_actor+0x13f/0x280
> CPU: 0 PID: 21500 Comm: trinity-c4 Not tainted 4.11.0-rc6-think+ #10 
> Call Trace:
>  dump_stack+0x68/0x93
>  __warn+0xcb/0xf0
>  warn_slowpath_null+0x1d/0x20
>  splice_direct_to_actor+0x13f/0x280
>  ? generic_pipe_buf_nosteal+0x10/0x10
>  do_splice_direct+0x9e/0xd0
>  do_sendfile+0x1d7/0x3c0
>  SyS_sendfile64+0x73/0xe0
>  do_syscall_64+0x66/0x1d0
>  entry_SYSCALL64_slow_path+0x25/0x25
> RIP: 0033:0x7febc78b30f9
> RSP: 002b:00007ffd767b6398 EFLAGS: 00000246
>   ORIG_RAX: 0000000000000028
> RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007febc78b30f9
> RDX: 00007febc5c1e000 RSI: 0000000000000130 RDI: 0000000000000130
> RBP: 00007febc7f66000 R08: 0000000000000ff1 R09: fffffffffffffffd
> R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000002
> R13: 00007febc7f66048 R14: 00007febc7f89ad8 R15: 00007febc7f66000
> ---[ end trace d002d06d5879c8a9 ]---
> in->f_op = ffffffffa02df980, ->splice_write = ffffffff812b2b20
> 
> This time around it locked up completely. Weird.

No idea about the lockup, but this definitely looks like the read side being
bogus.  Let's try this:

diff --git a/fs/splice.c b/fs/splice.c
index 006ba50f4ece..73ff284fb20f 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -284,6 +284,41 @@ void splice_shrink_spd(struct splice_pipe_desc *spd)
 	kfree(spd->partial);
 }
 
+static void test_it(struct pipe_inode_info *pipe, size_t len, long ret)
+{
+	int idx = pipe->curbuf;
+	int n = pipe->nrbufs;
+	size_t size = 0;
+	while (n--) {
+		size += pipe->bufs[idx++].len;
+		if (idx == pipe->buffers)
+			idx = 0;
+	}
+	if (WARN_ON(size != ret)) {
+		char c = '[';
+		printk(KERN_ERR "asked to read %zu, claims to have read %ld",
+			len, ret);
+		printk(KERN_CONT "actual size of data in pipe %zd ", size);
+		for (n = pipe->nrbufs, idx = pipe->curbuf; n--; ) {
+			printk(KERN_CONT "%c%d:%u", c, idx,
+				pipe->bufs[idx].len);
+			c = ',';
+			if (++idx == pipe->buffers)
+				idx = 0;
+		}
+		if (c == ' ')
+			printk(KERN_CONT "]");
+	}
+}
+
+static inline void insane_splice_read(struct pipe_inode_info *pipe,
+					size_t len, long ret)
+{
+	if (ret <= 0 || pipe != current->splice_pipe)
+		return;
+	test_it(pipe, len, ret);
+}
+
 /**
  * generic_file_splice_read - splice data from file to a pipe
  * @in:		file to splice from
@@ -313,6 +348,7 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	if (ret > 0) {
 		*ppos = kiocb.ki_pos;
 		file_accessed(in);
+		insane_splice_read(pipe, len, ret);
 	} else if (ret < 0) {
 		to.idx = idx;
 		to.iov_offset = 0;
@@ -394,7 +430,7 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 	struct page **pages;
 	unsigned int nr_pages;
 	size_t offset, dummy, copied = 0;
-	ssize_t res;
+	ssize_t res, old_len = len;
 	int i;
 
 	if (pipe->nrbufs == pipe->buffers)
@@ -448,6 +484,7 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 		put_page(pages[i]);
 	kvfree(pages);
 	iov_iter_advance(&to, copied);	/* truncates and discards */
+	insane_splice_read(pipe, old_len, res);
 	return res;
 }
 
@@ -970,6 +1007,11 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 	while (len) {
 		size_t read_len;
 		loff_t pos = sd->pos, prev_pos = pos;
+		if (WARN_ON(pipe->nrbufs)) {
+			printk(KERN_ERR "in->f_op = %p, ->splice_write = %p\n",
+				in->f_op,
+				sd->u.file->f_op->splice_write);
+		}
 
 		ret = do_splice_to(in, &pos, pipe, len, flags);
 		if (unlikely(ret <= 0))


* Re: iov_iter_pipe warning.
  2017-04-12  2:58                                       ` Al Viro
@ 2017-04-12 14:35                                         ` Dave Jones
  2017-04-12 15:26                                           ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-12 14:35 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Wed, Apr 12, 2017 at 03:58:42AM +0100, Al Viro wrote:
 > On Tue, Apr 11, 2017 at 10:29:11PM -0400, Dave Jones wrote:
 > > On Wed, Apr 12, 2017 at 02:15:32AM +0100, Al Viro wrote:
 > >  > On Tue, Apr 11, 2017 at 08:58:53PM -0400, Dave Jones wrote:
 > >  > > [  462.696557] in->f_op = ffffffffa02df980, ->splice_write = ffffffff812b2b20
 > >  > > $ grep ffffffffa02df980 /proc/kallsyms 
 > >  > > ffffffffa02df980 r nfs4_file_operations	[nfsv4]
 > >  > > $ grep ffffffff812b2b20 /proc/kallsyms 
 > >  > > ffffffff812b2b20 T iter_file_splice_write
 > >  > 
 > >  > Let's try to figure out whether it's read or write side going wrong.

[ 4139.722363] WARNING: CPU: 0 PID: 30536 at fs/splice.c:297 test_it+0x7b/0x100
[ 4139.734795] CPU: 0 PID: 30536 Comm: trinity-c2 Not tainted 4.11.0-rc6-think+ #12 
[ 4139.759574] Call Trace:
[ 4139.771840]  dump_stack+0x68/0x93
[ 4139.784062]  __warn+0xcb/0xf0
[ 4139.796127]  warn_slowpath_null+0x1d/0x20
[ 4139.808062]  test_it+0x7b/0x100
[ 4139.819903]  generic_file_splice_read+0x19b/0x1b0
[ 4139.831728]  do_splice_to+0x79/0x90
[ 4139.843547]  splice_direct_to_actor+0xc6/0x280
[ 4139.855329]  ? generic_pipe_buf_nosteal+0x10/0x10
[ 4139.867021]  do_splice_direct+0x9e/0xd0
[ 4139.878681]  do_sendfile+0x1d7/0x3c0
[ 4139.890349]  SyS_sendfile64+0x73/0xe0
[ 4139.902019]  do_syscall_64+0x66/0x1d0
[ 4139.913592]  entry_SYSCALL64_slow_path+0x25/0x25
[ 4139.925225] RIP: 0033:0x7f279ed980f9
[ 4139.936749] RSP: 002b:00007ffe5f42b858 EFLAGS: 00000246
[ 4139.948230]  ORIG_RAX: 0000000000000028
[ 4139.959677] RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007f279ed980f9
[ 4139.971180] RDX: 00007f279cb03000 RSI: 0000000000000185 RDI: 000000000000011a
[ 4139.982653] RBP: 00007f279f459000 R08: 0000005353535353 R09: ffffffff8100ffff
[ 4139.994080] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000002
[ 4140.005511] R13: 00007f279f459048 R14: 00007f279f46ead8 R15: 00007f279f459000
[ 4140.016837]  ? intel_pmu_lbr_read+0x4bf/0x5f0
[ 4140.028389] ---[ end trace d074a8823fe244d3 ]---
[ 4140.040002] asked to read 8, claims to have read 4
[ 4140.051634] actual size of data in pipe 8 
[ 4140.063234] [0:8
[ 4140.074673] ------------[ cut here ]------------
[ 4140.085985] WARNING: CPU: 3 PID: 30536 at fs/splice.c:1010 splice_direct_to_actor+0x13f/0x280
[ 4140.097406] CPU: 3 PID: 30536 Comm: trinity-c2 Tainted: G        W       4.11.0-rc6-think+ #12 
[ 4140.120382] Call Trace:
[ 4140.131729]  dump_stack+0x68/0x93
[ 4140.143030]  __warn+0xcb/0xf0
[ 4140.154376]  warn_slowpath_null+0x1d/0x20
[ 4140.165606]  splice_direct_to_actor+0x13f/0x280
[ 4140.176794]  ? generic_pipe_buf_nosteal+0x10/0x10
[ 4140.187934]  do_splice_direct+0x9e/0xd0
[ 4140.199033]  do_sendfile+0x1d7/0x3c0
[ 4140.210084]  SyS_sendfile64+0x73/0xe0
[ 4140.221122]  do_syscall_64+0x66/0x1d0
[ 4140.232137]  entry_SYSCALL64_slow_path+0x25/0x25
[ 4140.243192] RIP: 0033:0x7f279ed980f9
[ 4140.254194] RSP: 002b:00007ffe5f42b858 EFLAGS: 00000246
[ 4140.265156]  ORIG_RAX: 0000000000000028
[ 4140.276084] RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007f279ed980f9
[ 4140.287186] RDX: 00007f279cb03000 RSI: 0000000000000185 RDI: 000000000000011a
[ 4140.298229] RBP: 00007f279f459000 R08: 0000005353535353 R09: ffffffff8100ffff
[ 4140.309236] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000002
[ 4140.320077] R13: 00007f279f459048 R14: 00007f279f46ead8 R15: 00007f279f459000
[ 4140.330891]  ? intel_pmu_lbr_read+0x4bf/0x5f0
[ 4140.342955] ---[ end trace d074a8823fe244d4 ]---
[ 4140.353868] in->f_op = ffffffffa02dc980, ->splice_write = ffffffff812b2c20


* Re: iov_iter_pipe warning.
  2017-04-12 14:35                                         ` Dave Jones
@ 2017-04-12 15:26                                           ` Al Viro
  2017-04-12 16:27                                             ` Dave Jones
  0 siblings, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-04-12 15:26 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Wed, Apr 12, 2017 at 10:35:19AM -0400, Dave Jones wrote:

> [ 4140.040002] asked to read 8, claims to have read 4
> [ 4140.051634] actual size of data in pipe 8 
> [ 4140.063234] [0:8

> [ 4140.342955] ---[ end trace d074a8823fe244d4 ]---
> [ 4140.353868] in->f_op = ffffffffa02dc980, ->splice_write = ffffffff812b2c20

IOW, we just had someone's ->read_iter() return 4 after having deposited 8
bytes.  The next question is which file_operations those were, whether it
was O_DIRECT or not, and where in the file we had been reading from...

diff --git a/fs/splice.c b/fs/splice.c
index 006ba50f4ece..0e67ddf8618d 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -284,6 +284,43 @@ void splice_shrink_spd(struct splice_pipe_desc *spd)
 	kfree(spd->partial);
 }
 
+static bool test_it(struct pipe_inode_info *pipe, size_t len, long ret)
+{
+	int idx = pipe->curbuf;
+	int n = pipe->nrbufs;
+	size_t size = 0;
+	while (n--) {
+		size += pipe->bufs[idx++].len;
+		if (idx == pipe->buffers)
+			idx = 0;
+	}
+	if (WARN_ON(size != ret)) {
+		char c = '[';
+		printk(KERN_ERR "asked to read %zu, claims to have read %ld",
+			len, ret);
+		printk(KERN_CONT "actual size of data in pipe %zd ", size);
+		for (n = pipe->nrbufs, idx = pipe->curbuf; n--; ) {
+			printk(KERN_CONT "%c%d:%u", c, idx,
+				pipe->bufs[idx].len);
+			c = ',';
+			if (++idx == pipe->buffers)
+				idx = 0;
+		}
+		if (c != '[')
+			printk(KERN_CONT "]");
+		return true;
+	}
+	return false;
+}
+
+static inline bool insane_splice_read(struct pipe_inode_info *pipe,
+					size_t len, long ret)
+{
+	if (ret <= 0 || pipe != current->splice_pipe)
+		return false;
+	return test_it(pipe, len, ret);
+}
+
 /**
  * generic_file_splice_read - splice data from file to a pipe
  * @in:		file to splice from
@@ -311,8 +348,14 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	kiocb.ki_pos = *ppos;
 	ret = call_read_iter(in, &kiocb, &to);
 	if (ret > 0) {
-		*ppos = kiocb.ki_pos;
 		file_accessed(in);
+		if (unlikely(insane_splice_read(pipe, len, ret))) {
+			printk(KERN_ERR "f_op: %p, f_flags: %d, pos: %lld/%lld, size: %lld",
+				in->f_op, in->f_flags, (long long)*ppos,
+				(long long)kiocb.ki_pos,
+				(long long)i_size_read(file_inode(in)));
+		}
+		*ppos = kiocb.ki_pos;
 	} else if (ret < 0) {
 		to.idx = idx;
 		to.iov_offset = 0;
@@ -394,7 +437,7 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 	struct page **pages;
 	unsigned int nr_pages;
 	size_t offset, dummy, copied = 0;
-	ssize_t res;
+	ssize_t res, old_len = len;
 	int i;
 
 	if (pipe->nrbufs == pipe->buffers)
@@ -448,6 +491,7 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 		put_page(pages[i]);
 	kvfree(pages);
 	iov_iter_advance(&to, copied);	/* truncates and discards */
+	insane_splice_read(pipe, old_len, res);
 	return res;
 }
 
@@ -970,6 +1014,11 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 	while (len) {
 		size_t read_len;
 		loff_t pos = sd->pos, prev_pos = pos;
+		if (WARN_ON(pipe->nrbufs)) {
+			printk(KERN_ERR "in->f_op = %p, ->splice_write = %p\n",
+				in->f_op,
+				sd->u.file->f_op->splice_write);
+		}
 
 		ret = do_splice_to(in, &pos, pipe, len, flags);
 		if (unlikely(ret <= 0))


* Re: iov_iter_pipe warning.
  2017-04-12 15:26                                           ` Al Viro
@ 2017-04-12 16:27                                             ` Dave Jones
  2017-04-12 17:07                                               ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-12 16:27 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Wed, Apr 12, 2017 at 04:26:00PM +0100, Al Viro wrote:
 > On Wed, Apr 12, 2017 at 10:35:19AM -0400, Dave Jones wrote:
 > 
 > > [ 4140.040002] asked to read 8, claims to have read 4
 > > [ 4140.051634] actual size of data in pipe 8 
 > > [ 4140.063234] [0:8
 > 
 > > [ 4140.342955] ---[ end trace d074a8823fe244d4 ]---
 > > [ 4140.353868] in->f_op = ffffffffa02dc980, ->splice_write = ffffffff812b2c20
 > 
 > IOW, we just had someone's ->read_iter() return 4 after having deposited 8
 >  > bytes.  The next question is which file_operations those were, whether it
 >  > was O_DIRECT or not, and where in the file we had been reading from...

[ 1010.008906] WARNING: CPU: 2 PID: 15216 at fs/splice.c:297 test_it+0x7d/0x120
[ 1010.021094] CPU: 2 PID: 15216 Comm: trinity-c5 Not tainted 4.11.0-rc6-think+ #13 
[ 1010.045849] Call Trace:
[ 1010.058136]  dump_stack+0x68/0x93
[ 1010.070379]  __warn+0xcb/0xf0
[ 1010.082473]  warn_slowpath_null+0x1d/0x20
[ 1010.094489]  test_it+0x7d/0x120
[ 1010.106443]  generic_file_splice_read+0x19a/0x1e0
[ 1010.118337]  do_splice_to+0x79/0x90
[ 1010.130161]  splice_direct_to_actor+0xc6/0x280
[ 1010.142032]  ? generic_pipe_buf_nosteal+0x10/0x10
[ 1010.153871]  do_splice_direct+0x9e/0xd0
[ 1010.165687]  do_sendfile+0x1d7/0x3c0
[ 1010.177442]  SyS_sendfile64+0x73/0xe0
[ 1010.189128]  do_syscall_64+0x66/0x1d0
[ 1010.200743]  entry_SYSCALL64_slow_path+0x25/0x25
[ 1010.212287] RIP: 0033:0x7f1bda69d0f9
[ 1010.223883] RSP: 002b:00007ffede478db8 EFLAGS: 00000246
[ 1010.235516]  ORIG_RAX: 0000000000000028
[ 1010.247115] RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007f1bda69d0f9
[ 1010.258850] RDX: 00007f1bdac39000 RSI: 0000000000000186 RDI: 0000000000000186
[ 1010.270550] RBP: 00007f1bdad49000 R08: ffffe5d6e7f7c7f4 R09: 0000000092d54252
[ 1010.282174] R10: 0000000000200000 R11: 0000000000000246 R12: 0000000000000002
[ 1010.293781] R13: 00007f1bdad49048 R14: 00007f1bdad73ad8 R15: 00007f1bdad49000
[ 1010.305695] ---[ end trace 048fab9469c0948e ]---
[ 1010.317696] asked to read 2097152, claims to have read 7025
[ 1010.329471] actual size of data in pipe 65536 
[ 1010.341162] [0:4096
[ 1010.353232] ,1:4096
[ 1010.364402] ,2:4096
[ 1010.375608] ,3:4096
[ 1010.386346] ,4:4096
[ 1010.397027] ,5:4096
[ 1010.407611] ,6:4096
[ 1010.418010] ,7:4096
[ 1010.428533] ,8:4096
[ 1010.438885] ,9:4096
[ 1010.449269] ,10:4096
[ 1010.459462] ,11:4096
[ 1010.469519] ,12:4096
[ 1010.479326] ,13:4096
[ 1010.489093] ,14:4096
[ 1010.498711] ,15:4096
[ 1010.508217] ]
[ 1010.517570] f_op: ffffffffa0242980, f_flags: 311298, pos: 11/7036, size: 7036


* Re: iov_iter_pipe warning.
  2017-04-12 16:27                                             ` Dave Jones
@ 2017-04-12 17:07                                               ` Al Viro
  2017-04-12 19:03                                                 ` Dave Jones
  0 siblings, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-04-12 17:07 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Wed, Apr 12, 2017 at 12:27:09PM -0400, Dave Jones wrote:

> [ 1010.317696] asked to read 2097152, claims to have read 7025
> [ 1010.329471] actual size of data in pipe 65536 
> [ 1010.341162] [0:4096
> [ 1010.353232] ,1:4096
> [ 1010.364402] ,2:4096
> [ 1010.375608] ,3:4096
> [ 1010.386346] ,4:4096
> [ 1010.397027] ,5:4096
> [ 1010.407611] ,6:4096
> [ 1010.418010] ,7:4096
> [ 1010.428533] ,8:4096
> [ 1010.438885] ,9:4096
> [ 1010.449269] ,10:4096
> [ 1010.459462] ,11:4096
> [ 1010.469519] ,12:4096
> [ 1010.479326] ,13:4096
> [ 1010.489093] ,14:4096
> [ 1010.498711] ,15:4096
> [ 1010.508217] ]
> [ 1010.517570] f_op: ffffffffa0242980, f_flags: 311298, pos: 11/7036, size: 7036

	OK, I see what's going on.  Could you check if the following stops
the warnings?  It's not the final variant of the fix - there's no need to
copy the entire iov_iter; it's just that the primitive needed to deal with
that in a cleaner way is still not in mainline - davem has pulled it into
net.git, but that was after the latest pull from net.git into mainline.

	For now it should at least tell whether there's something else
going on, though:

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index aab32fc3d6a8..d1633753a1a8 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -568,6 +568,7 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter)
 	struct nfs_lock_context *l_ctx;
 	ssize_t result = -EINVAL;
 	size_t count = iov_iter_count(iter);
+	struct iov_iter data;
 	nfs_add_stats(mapping->host, NFSIOS_DIRECTREADBYTES, count);
 
 	dfprintk(FILE, "NFS: direct read(%pD2, %zd@%Ld)\n",
@@ -600,14 +601,17 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter)
 	nfs_start_io_direct(inode);
 
 	NFS_I(inode)->read_io += count;
-	result = nfs_direct_read_schedule_iovec(dreq, iter, iocb->ki_pos);
+	data = *iter;
+	result = nfs_direct_read_schedule_iovec(dreq, &data, iocb->ki_pos);
 
 	nfs_end_io_direct(inode);
 
 	if (!result) {
 		result = nfs_direct_wait(dreq);
-		if (result > 0)
+		if (result > 0) {
+			iov_iter_advance(iter, result);
 			iocb->ki_pos += result;
+		}
 	}
 
 out_release:


* Re: iov_iter_pipe warning.
  2017-04-12 17:07                                               ` Al Viro
@ 2017-04-12 19:03                                                 ` Dave Jones
  2017-04-21 17:54                                                   ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-12 19:03 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Wed, Apr 12, 2017 at 06:07:23PM +0100, Al Viro wrote:
 > On Wed, Apr 12, 2017 at 12:27:09PM -0400, Dave Jones wrote:
 > 
 > > [ 1010.317696] asked to read 2097152, claims to have read 7025
 > > [ 1010.329471] actual size of data in pipe 65536 
 > > [ 1010.341162] [0:4096
 > > [ 1010.353232] ,1:4096
 > > [ 1010.364402] ,2:4096
 > > [ 1010.375608] ,3:4096
 > > [ 1010.386346] ,4:4096
 > > [ 1010.397027] ,5:4096
 > > [ 1010.407611] ,6:4096
 > > [ 1010.418010] ,7:4096
 > > [ 1010.428533] ,8:4096
 > > [ 1010.438885] ,9:4096
 > > [ 1010.449269] ,10:4096
 > > [ 1010.459462] ,11:4096
 > > [ 1010.469519] ,12:4096
 > > [ 1010.479326] ,13:4096
 > > [ 1010.489093] ,14:4096
 > > [ 1010.498711] ,15:4096
 > > [ 1010.508217] ]
 > > [ 1010.517570] f_op: ffffffffa0242980, f_flags: 311298, pos: 11/7036, size: 7036
 > 
 > 	OK, I see what's going on.  Could you check if the following stops
 > the warnings?  It's not the final variant of fix - there's no need to copy
 > the entire iov_iter, it's just that the primitive needed to deal with that
 > in cleaner way is still not in mainline - davem has pulled it into
 > net.git, but that was after the latest pull from net.git into mainline.
 > 
 > 	For now it should at least tell whether there's something else
 > going on, though:

Well it's been running an hour without incident, which looks promising.
I'll leave it run, but I'd say you're on the right track given how quick
it reproduced so far.

	Dave


* Re: iov_iter_pipe warning.
  2017-04-12 19:03                                                 ` Dave Jones
@ 2017-04-21 17:54                                                   ` Al Viro
  2017-04-27  4:19                                                     ` Dave Jones
  2017-04-28 15:29                                                     ` Dave Jones
  0 siblings, 2 replies; 73+ messages in thread
From: Al Viro @ 2017-04-21 17:54 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Wed, Apr 12, 2017 at 03:03:18PM -0400, Dave Jones wrote:

> Well it's been running an hour without incident, which looks promising.
> I'll leave it run, but I'd say you're on the right track given how quick
> it reproduced so far.

Could you try this and see if it works?  What happens is that unlike
e.g. generic_file_read_iter/generic_file_write_iter, NFS O_DIRECT handling
does not make sure that the iov_iter has been advanced by the amount
actually transferred - it is left advanced by the amount *requested*.

mm/filemap.c code gets around that by taking a copy of iov_iter, feeding
it to ->direct_IO() and then advancing the original by the amount actually
done.  That's what the previous patch had duplicated for NFS, but we have
a cleaner way to do that now - both for NFS and in mm/filemap.c.  Namely,
use iov_iter_revert().  For NFS it means making nfs_direct_..._schedule_iovec()
return how much it has actually requested from the server and having their
callers do iov_iter_revert() after nfs_direct_wait() has reported how much
has actually come through.  I have similar patches for mm/filemap.c avoiding
the games with a copy of the iov_iter there, but those are not fixes per se,
so they are separate.  This one just deals with NFS.

fix nfs O_DIRECT advancing iov_iter too much
    
It leaves the iterator advanced by the amount of IO it has requested
instead of the amount actually transferred.  Among other things,
that confuses the hell out of generic_file_splice_read().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index aab32fc3d6a8..c1b5fed7c863 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -537,7 +537,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 
 	if (put_dreq(dreq))
 		nfs_direct_complete(dreq);
-	return 0;
+	return requested_bytes;
 }
 
 /**
@@ -566,7 +566,7 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter)
 	struct inode *inode = mapping->host;
 	struct nfs_direct_req *dreq;
 	struct nfs_lock_context *l_ctx;
-	ssize_t result = -EINVAL;
+	ssize_t result = -EINVAL, requested;
 	size_t count = iov_iter_count(iter);
 	nfs_add_stats(mapping->host, NFSIOS_DIRECTREADBYTES, count);
 
@@ -600,14 +600,19 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter)
 	nfs_start_io_direct(inode);
 
 	NFS_I(inode)->read_io += count;
-	result = nfs_direct_read_schedule_iovec(dreq, iter, iocb->ki_pos);
+	requested = nfs_direct_read_schedule_iovec(dreq, iter, iocb->ki_pos);
 
 	nfs_end_io_direct(inode);
 
-	if (!result) {
+	if (requested > 0) {
 		result = nfs_direct_wait(dreq);
-		if (result > 0)
+		if (result > 0) {
+			requested -= result;
 			iocb->ki_pos += result;
+		}
+		iov_iter_revert(iter, requested);
+	} else {
+		result = requested;
 	}
 
 out_release:
@@ -954,7 +959,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 
 	if (put_dreq(dreq))
 		nfs_direct_write_complete(dreq);
-	return 0;
+	return requested_bytes;
 }
 
 /**
@@ -979,7 +984,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
  */
 ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 {
-	ssize_t result = -EINVAL;
+	ssize_t result = -EINVAL, requested;
 	size_t count;
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
@@ -1022,7 +1027,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 
 	nfs_start_io_direct(inode);
 
-	result = nfs_direct_write_schedule_iovec(dreq, iter, pos);
+	requested = nfs_direct_write_schedule_iovec(dreq, iter, pos);
 
 	if (mapping->nrpages) {
 		invalidate_inode_pages2_range(mapping,
@@ -1031,13 +1036,17 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 
 	nfs_end_io_direct(inode);
 
-	if (!result) {
+	if (requested > 0) {
 		result = nfs_direct_wait(dreq);
 		if (result > 0) {
+			requested -= result;
 			iocb->ki_pos = pos + result;
 			/* XXX: should check the generic_write_sync retval */
 			generic_write_sync(iocb, result);
 		}
+		iov_iter_revert(iter, requested);
+	} else {
+		result = requested;
 	}
 out_release:
 	nfs_direct_req_release(dreq);


* Re: iov_iter_pipe warning.
  2017-04-21 17:54                                                   ` Al Viro
@ 2017-04-27  4:19                                                     ` Dave Jones
  2017-04-27 16:34                                                       ` Dave Jones
  2017-04-28 15:29                                                     ` Dave Jones
  1 sibling, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-27  4:19 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Fri, Apr 21, 2017 at 06:54:30PM +0100, Al Viro wrote:
 > On Wed, Apr 12, 2017 at 03:03:18PM -0400, Dave Jones wrote:
 > 
 > > Well it's been running an hour without incident, which looks promising.
 > > I'll leave it run, but I'd say you're on the right track given how quick
 > > it reproduced so far.
 > 
 > Could you try this and see if it works?  What happens is that unlike
 > e.g. generic_file_read_iter/generic_file_write_iter, NFS O_DIRECT handling
 > does not make sure that iov_iter had been advanced by the amount
 > actually transferred - it is left advanced by the amount *requested*.

Sorry for the delay on this, been sick.   Just gave this a run for 12 hours.
Looks good to me.

	Dave


* Re: iov_iter_pipe warning.
  2017-04-27  4:19                                                     ` Dave Jones
@ 2017-04-27 16:34                                                       ` Dave Jones
  2017-04-27 17:39                                                         ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-27 16:34 UTC (permalink / raw)
  To: Al Viro, Linux Kernel

On Thu, Apr 27, 2017 at 12:19:18AM -0400, Dave Jones wrote:
 > On Fri, Apr 21, 2017 at 06:54:30PM +0100, Al Viro wrote:
 >  > On Wed, Apr 12, 2017 at 03:03:18PM -0400, Dave Jones wrote:
 >  > 
 >  > > Well it's been running an hour without incident, which looks promising.
 >  > > I'll leave it run, but I'd say you're on the right track given how quick
 >  > > it reproduced so far.
 >  > 
 >  > Could you try this and see if it works?  What happens is that unlike
 >  > e.g. generic_file_read_iter/generic_file_write_iter, NFS O_DIRECT handling
 >  > does not make sure that iov_iter had been advanced by the amount
 >  > actually transferred - it is left advanced by the amount *requested*.
 > 
 > Sorry for delay on this, been sick.   Just gave this a run for 12 hours.
 > Looks good to me.

Actually.. client seems fine, but I've noticed these on the server now..

[977286.117268] RPC request reserved 116 but used 268
[1918138.126285] RPC request reserved 200 but used 268
[2327777.483077] RPC request reserved 200 but used 268
[2327800.909007] RPC request reserved 200 but used 268

related ?

	Dave


* Re: iov_iter_pipe warning.
  2017-04-27 16:34                                                       ` Dave Jones
@ 2017-04-27 17:39                                                         ` Al Viro
  0 siblings, 0 replies; 73+ messages in thread
From: Al Viro @ 2017-04-27 17:39 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Thu, Apr 27, 2017 at 12:34:44PM -0400, Dave Jones wrote:

> [977286.117268] RPC request reserved 116 but used 268
> [1918138.126285] RPC request reserved 200 but used 268
> [2327777.483077] RPC request reserved 200 but used 268
> [2327800.909007] RPC request reserved 200 but used 268
> 
> related ?

Rather unlikely...  AFAICS, that's nfsd miscalculating the response
size and generating a longer response than it has reserved.  The warning
comes from svc_xprt_release().  Out of its callers, svc_recv() is
impossible (it zeroes rqstp->rq_res.len before calling the sucker, so
there's no way for it to be found too large), which leaves svc_drop()
and svc_send().  The last one is more likely, AFAICS, and there the
length is calculated by
        /* calculate over-all length */
        xb = &rqstp->rq_res;
        xb->len = xb->head[0].iov_len +
                xb->page_len +
                xb->tail[0].iov_len;

Might be interesting to slap WARN_ON(xb->len > rqstp->rq_reserved); there
and see if it triggers.  Or something like
	if (WARN_ON(rqstp->rq_res.head[0].iov_len +
		rqstp->rq_res.page_len +
		rqstp->rq_res.tail[0].iov_len > rqstp->rq_reserved)) {
		/* print something useful about the request and response */
	}
right before the call of ->xpo_release_rqst() in there - I haven't looked
at that code for a long time, but it smells like dumping the request is
better done before the skbs containing it get dropped...


* Re: iov_iter_pipe warning.
  2017-04-21 17:54                                                   ` Al Viro
  2017-04-27  4:19                                                     ` Dave Jones
@ 2017-04-28 15:29                                                     ` Dave Jones
  2017-04-28 16:43                                                       ` Al Viro
  1 sibling, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-28 15:29 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Fri, Apr 21, 2017 at 06:54:30PM +0100, Al Viro wrote:
 > On Wed, Apr 12, 2017 at 03:03:18PM -0400, Dave Jones wrote:
 > 
 > > Well it's been running an hour without incident, which looks promising.
 > > I'll leave it run, but I'd say you're on the right track given how quick
 > > it reproduced so far.
 > 
 > Could you try this and see if it works?  What happens is that unlike
 > e.g. generic_file_read_iter/generic_file_write_iter, NFS O_DIRECT handling
 > does not make sure that iov_iter had been advanced by the amount
 > actually transferred - it is left advanced by the amount *requested*.

Crap. So I never ran it long enough it seems. I can still hit that trace.
I re-added one of your earlier debug patches, and got this..

WARNING: CPU: 1 PID: 20100 at fs/splice.c:297 test_it+0x7d/0x120
CPU: 1 PID: 20100 Comm: trinity-c49 Not tainted 4.11.0-rc8-think+ #3 
Call Trace:
 dump_stack+0x68/0x93
 __warn+0xcb/0xf0
 warn_slowpath_null+0x1d/0x20
 test_it+0x7d/0x120
 generic_file_splice_read+0x19a/0x1e0
 do_splice_to+0x79/0x90
 splice_direct_to_actor+0xc6/0x280
 ? generic_pipe_buf_nosteal+0x10/0x10
 do_splice_direct+0x9e/0xd0
 do_sendfile+0x1d7/0x3c0
 SyS_sendfile64+0xc9/0xe0
 do_syscall_64+0x66/0x1d0
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f8b83b59099
RSP: 002b:00007ffd7967da68 EFLAGS: 00000246
 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007f8b83b59099
RDX: 0000000000000000 RSI: 0000000000000131 RDI: 0000000000000195
RBP: 00007f8b840d1000 R08: 0000000000000073 R09: fffffffffffffffc
R10: 000000000000040f R11: 0000000000000246 R12: 0000000000000002
R13: 00007f8b840d1048 R14: 00007f8b8422fad8 R15: 00007f8b840d1000
---[ end trace 70d344adaede0734 ]---
asked to read 1039, claims to have read 62
actual size of data in pipe 977 
[0:977
]
f_op: ffffffffa02dd980, f_flags: 313346, pos: 4478229749/4478229811, size: 4478229811
------------[ cut here ]------------
WARNING: CPU: 1 PID: 20100 at fs/splice.c:1017 splice_direct_to_actor+0x13f/0x280
CPU: 1 PID: 20100 Comm: trinity-c49 Tainted: G        W       4.11.0-rc8-think+ #3 
Call Trace:
 dump_stack+0x68/0x93
 __warn+0xcb/0xf0
 warn_slowpath_null+0x1d/0x20
 splice_direct_to_actor+0x13f/0x280
 ? generic_pipe_buf_nosteal+0x10/0x10
 do_splice_direct+0x9e/0xd0
 do_sendfile+0x1d7/0x3c0
 SyS_sendfile64+0xc9/0xe0
 do_syscall_64+0x66/0x1d0
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f8b83b59099
RSP: 002b:00007ffd7967da68 EFLAGS: 00000246
 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007f8b83b59099
RDX: 0000000000000000 RSI: 0000000000000131 RDI: 0000000000000195
RBP: 00007f8b840d1000 R08: 0000000000000073 R09: fffffffffffffffc
R10: 000000000000040f R11: 0000000000000246 R12: 0000000000000002
R13: 00007f8b840d1048 R14: 00007f8b8422fad8 R15: 00007f8b840d1000
---[ end trace 70d344adaede0735 ]---
in->f_op = ffffffffa02dd980, ->splice_write = ffffffff812b1db0


$ grep ffffffffa02dd980 /proc/kallsyms 
ffffffffa02dd980 r nfs4_file_operations	[nfsv4]
$ grep ffffffff812b1db0 /proc/kallsyms 
ffffffff812b1db0 T iter_file_splice_write


* Re: iov_iter_pipe warning.
  2017-04-28 15:29                                                     ` Dave Jones
@ 2017-04-28 16:43                                                       ` Al Viro
  2017-04-28 16:50                                                         ` Dave Jones
  0 siblings, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-04-28 16:43 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Fri, Apr 28, 2017 at 11:29:55AM -0400, Dave Jones wrote:
> On Fri, Apr 21, 2017 at 06:54:30PM +0100, Al Viro wrote:
>  > On Wed, Apr 12, 2017 at 03:03:18PM -0400, Dave Jones wrote:
>  > 
>  > > Well it's been running an hour without incident, which looks promising.
>  > > I'll leave it run, but I'd say you're on the right track given how quick
>  > > it reproduced so far.
>  > 
>  > Could you try this and see if it works?  What happens is that unlike
>  > e.g. generic_file_read_iter/generic_file_write_iter, NFS O_DIRECT handling
>  > does not make sure that iov_iter had been advanced by the amount
>  > actually transferred - it is left advanced by the amount *requested*.
> 
> Crap. So I never ran it long enough it seems. I can still hit that trace.
> I re-added one of your earlier debug patches, and got this..

Could you send me the diff against something from mainline (identified by
sha1, ideally)?


* Re: iov_iter_pipe warning.
  2017-04-28 16:43                                                       ` Al Viro
@ 2017-04-28 16:50                                                         ` Dave Jones
  2017-04-28 17:20                                                           ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-28 16:50 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Fri, Apr 28, 2017 at 05:43:13PM +0100, Al Viro wrote:
 > On Fri, Apr 28, 2017 at 11:29:55AM -0400, Dave Jones wrote:
 > > On Fri, Apr 21, 2017 at 06:54:30PM +0100, Al Viro wrote:
 > >  > On Wed, Apr 12, 2017 at 03:03:18PM -0400, Dave Jones wrote:
 > >  > 
 > >  > > Well it's been running an hour without incident, which looks promising.
 > >  > > I'll leave it run, but I'd say you're on the right track given how quick
 > >  > > it reproduced so far.
 > >  > 
 > >  > Could you try this and see if it works?  What happens is that unlike
 > >  > e.g. generic_file_read_iter/generic_file_write_iter, NFS O_DIRECT handling
 > >  > does not make sure that iov_iter had been advanced by the amount
 > >  > actually transferred - it is left advanced by the amount *requested*.
 > > 
 > > Crap. So I never ran it long enough it seems. I can still hit that trace.
 > > I re-added one of your earlier debug patches, and got this..
 > 
 > Could you send me the diff against something from mainline (identified by
 > sha1, ideally)?

currently running v4.11-rc8-75-gf83246089ca0

sunrpc bit is for the other unrelated problem I'm chasing.

note also, I saw the backtrace without the fs/splice.c changes.



diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index aab32fc3d6a8..c1b5fed7c863 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -537,7 +537,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 
 	if (put_dreq(dreq))
 		nfs_direct_complete(dreq);
-	return 0;
+	return requested_bytes;
 }
 
 /**
@@ -566,7 +566,7 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter)
 	struct inode *inode = mapping->host;
 	struct nfs_direct_req *dreq;
 	struct nfs_lock_context *l_ctx;
-	ssize_t result = -EINVAL;
+	ssize_t result = -EINVAL, requested;
 	size_t count = iov_iter_count(iter);
 	nfs_add_stats(mapping->host, NFSIOS_DIRECTREADBYTES, count);
 
@@ -600,14 +600,19 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter)
 	nfs_start_io_direct(inode);
 
 	NFS_I(inode)->read_io += count;
-	result = nfs_direct_read_schedule_iovec(dreq, iter, iocb->ki_pos);
+	requested = nfs_direct_read_schedule_iovec(dreq, iter, iocb->ki_pos);
 
 	nfs_end_io_direct(inode);
 
-	if (!result) {
+	if (requested > 0) {
 		result = nfs_direct_wait(dreq);
-		if (result > 0)
+		if (result > 0) {
+			requested -= result;
 			iocb->ki_pos += result;
+		}
+		iov_iter_revert(iter, requested);
+	} else {
+		result = requested;
 	}
 
 out_release:
@@ -954,7 +959,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 
 	if (put_dreq(dreq))
 		nfs_direct_write_complete(dreq);
-	return 0;
+	return requested_bytes;
 }
 
 /**
@@ -979,7 +984,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
  */
 ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 {
-	ssize_t result = -EINVAL;
+	ssize_t result = -EINVAL, requested;
 	size_t count;
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
@@ -1022,7 +1027,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 
 	nfs_start_io_direct(inode);
 
-	result = nfs_direct_write_schedule_iovec(dreq, iter, pos);
+	requested = nfs_direct_write_schedule_iovec(dreq, iter, pos);
 
 	if (mapping->nrpages) {
 		invalidate_inode_pages2_range(mapping,
@@ -1031,13 +1036,17 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 
 	nfs_end_io_direct(inode);
 
-	if (!result) {
+	if (requested > 0) {
 		result = nfs_direct_wait(dreq);
 		if (result > 0) {
+			requested -= result;
 			iocb->ki_pos = pos + result;
 			/* XXX: should check the generic_write_sync retval */
 			generic_write_sync(iocb, result);
 		}
+		iov_iter_revert(iter, requested);
+	} else {
+		result = requested;
 	}
 out_release:
 	nfs_direct_req_release(dreq);
diff --git a/fs/splice.c b/fs/splice.c
index 006ba50f4ece..0e67ddf8618d 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -284,6 +284,43 @@ void splice_shrink_spd(struct splice_pipe_desc *spd)
 	kfree(spd->partial);
 }
 
+static bool test_it(struct pipe_inode_info *pipe, size_t len, long ret)
+{
+	int idx = pipe->curbuf;
+	int n = pipe->nrbufs;
+	size_t size = 0;
+	while (n--) {
+		size += pipe->bufs[idx++].len;
+		if (idx == pipe->buffers)
+			idx = 0;
+	}
+	if (WARN_ON(size != ret)) {
+		char c = '[';
+		printk(KERN_ERR "asked to read %zu, claims to have read %ld",
+			len, ret);
+		printk(KERN_CONT "actual size of data in pipe %zd ", size);
+		for (n = pipe->nrbufs, idx = pipe->curbuf; n--; ) {
+			printk(KERN_CONT "%c%d:%u", c, idx,
+				pipe->bufs[idx].len);
+			c = ',';
+			if (++idx == pipe->buffers)
+				idx = 0;
+		}
+		if (c != '[')
+			printk(KERN_CONT "]");
+		return true;
+	}
+	return false;
+}
+
+static inline bool insane_splice_read(struct pipe_inode_info *pipe,
+					size_t len, long ret)
+{
+	if (ret <= 0 || pipe != current->splice_pipe)
+		return false;
+	return test_it(pipe, len, ret);
+}
+
 /**
  * generic_file_splice_read - splice data from file to a pipe
  * @in:		file to splice from
@@ -311,8 +348,14 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	kiocb.ki_pos = *ppos;
 	ret = call_read_iter(in, &kiocb, &to);
 	if (ret > 0) {
-		*ppos = kiocb.ki_pos;
 		file_accessed(in);
+		if (unlikely(insane_splice_read(pipe, len, ret))) {
+			printk(KERN_ERR "f_op: %p, f_flags: %d, pos: %lld/%lld, size: %lld",
+				in->f_op, in->f_flags, (long long)*ppos,
+				(long long)kiocb.ki_pos,
+				(long long)i_size_read(file_inode(in)));
+		}
+		*ppos = kiocb.ki_pos;
 	} else if (ret < 0) {
 		to.idx = idx;
 		to.iov_offset = 0;
@@ -394,7 +437,7 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 	struct page **pages;
 	unsigned int nr_pages;
 	size_t offset, dummy, copied = 0;
-	ssize_t res;
+	ssize_t res, old_len = len;
 	int i;
 
 	if (pipe->nrbufs == pipe->buffers)
@@ -448,6 +491,7 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 		put_page(pages[i]);
 	kvfree(pages);
 	iov_iter_advance(&to, copied);	/* truncates and discards */
+	insane_splice_read(pipe, old_len, res);
 	return res;
 }
 
@@ -970,6 +1014,11 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 	while (len) {
 		size_t read_len;
 		loff_t pos = sd->pos, prev_pos = pos;
+		if (WARN_ON(pipe->nrbufs)) {
+			printk(KERN_ERR "in->f_op = %p, ->splice_write = %p\n",
+				in->f_op,
+				sd->u.file->f_op->splice_write);
+		}
 
 		ret = do_splice_to(in, &pos, pipe, len, flags);
 		if (unlikely(ret <= 0))
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 7bfe1fb42add..eb5297573a8a 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -927,6 +927,15 @@ int svc_send(struct svc_rqst *rqstp)
 	if (!xprt)
 		goto out;
 
+        if (WARN_ON(rqstp->rq_res.head[0].iov_len +
+                rqstp->rq_res.page_len +
+                rqstp->rq_res.tail[0].iov_len > rqstp->rq_reserved)) {
+
+		printk("dbg: rqstp->rq_res.head[0].iov_len:%ld\n", rqstp->rq_res.head[0].iov_len);
+		printk("dbg: rqstp->rq_res.page_len:%d\n", rqstp->rq_res.page_len);
+		printk("dbg: rqstp->rq_reserved:%d\n", rqstp->rq_reserved);
+        }
+
 	/* release the receive skb before sending the reply */
 	rqstp->rq_xprt->xpt_ops->xpo_release_rqst(rqstp);
 
@@ -936,6 +945,8 @@ int svc_send(struct svc_rqst *rqstp)
 		xb->page_len +
 		xb->tail[0].iov_len;
 
+	WARN_ON(xb->len > rqstp->rq_reserved);
+
 	/* Grab mutex to serialize outgoing data. */
 	mutex_lock(&xprt->xpt_mutex);
 	if (test_bit(XPT_DEAD, &xprt->xpt_flags)
@@ -944,6 +955,7 @@ int svc_send(struct svc_rqst *rqstp)
 	else
 		len = xprt->xpt_ops->xpo_sendto(rqstp);
 	mutex_unlock(&xprt->xpt_mutex);
+
 	rpc_wake_up(&xprt->xpt_bc_pending);
 	svc_xprt_release(rqstp);
 


* Re: iov_iter_pipe warning.
  2017-04-28 16:50                                                         ` Dave Jones
@ 2017-04-28 17:20                                                           ` Al Viro
  2017-04-28 18:25                                                             ` Al Viro
  2017-08-07 20:18                                                             ` iov_iter_pipe warning Dave Jones
  0 siblings, 2 replies; 73+ messages in thread
From: Al Viro @ 2017-04-28 17:20 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Fri, Apr 28, 2017 at 12:50:24PM -0400, Dave Jones wrote:
> currently running v4.11-rc8-75-gf83246089ca0
> 
> sunrpc bit is for the other unrelated problem I'm chasing.
> 
> note also, I saw the backtrace without the fs/splice.c changes.

	Interesting...  Could you add this and see if that triggers?

diff --git a/fs/splice.c b/fs/splice.c
index 540c4a44756c..12a12d9c313f 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -306,6 +306,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	kiocb.ki_pos = *ppos;
 	ret = call_read_iter(in, &kiocb, &to);
 	if (ret > 0) {
+		if (WARN_ON(iov_iter_count(&to) != len - ret))
+			printk(KERN_ERR "ops %p: was %zd, left %zd, returned %d\n",
+				in->f_op, len, iov_iter_count(&to), ret);
 		*ppos = kiocb.ki_pos;
 		file_accessed(in);
 	} else if (ret < 0) {


* Re: iov_iter_pipe warning.
  2017-04-28 17:20                                                           ` Al Viro
@ 2017-04-28 18:25                                                             ` Al Viro
  2017-04-29  1:58                                                               ` Dave Jones
  2017-08-07 20:18                                                             ` iov_iter_pipe warning Dave Jones
  1 sibling, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-04-28 18:25 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Fri, Apr 28, 2017 at 06:20:25PM +0100, Al Viro wrote:
> On Fri, Apr 28, 2017 at 12:50:24PM -0400, Dave Jones wrote:
> > currently running v4.11-rc8-75-gf83246089ca0
> > 
> > sunrpc bit is for the other unrelated problem I'm chasing.
> > 
> > note also, I saw the backtrace without the fs/splice.c changes.
> 
> 	Interesting...  Could you add this and see if that triggers?

Gyah...  It's a bloody dumb braino in iov_iter_revert() for pipe-backed
ones.  Sorry, the oneliner below should fix it.

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f7c93568ec99..4952311422c1 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -798,7 +798,7 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 		while (1) {
 			size_t n = off - pipe->bufs[idx].offset;
 			if (unroll < n) {
-				off -= (n - unroll);
+				off -= unroll;
 				break;
 			}
 			unroll -= n;


* Re: iov_iter_pipe warning.
  2017-04-28 18:25                                                             ` Al Viro
@ 2017-04-29  1:58                                                               ` Dave Jones
  2017-04-29  2:47                                                                 ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-29  1:58 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Fri, Apr 28, 2017 at 07:25:12PM +0100, Al Viro wrote:
 > On Fri, Apr 28, 2017 at 06:20:25PM +0100, Al Viro wrote:
 > > On Fri, Apr 28, 2017 at 12:50:24PM -0400, Dave Jones wrote:
 > > > currently running v4.11-rc8-75-gf83246089ca0
 > > > 
 > > > sunrpc bit is for the other unrelated problem I'm chasing.
 > > > 
 > > > note also, I saw the backtrace without the fs/splice.c changes.
 > > 
 > > 	Interesting...  Could you add this and see if that triggers?
 > 
 > Gyah...  It's a bloody dumb braino in iov_iter_revert() for pipe-backed
 > ones.  Sorry, the oneliner below should fix it.

5 hrs in, looking good so far.

	Dave


* Re: iov_iter_pipe warning.
  2017-04-29  1:58                                                               ` Dave Jones
@ 2017-04-29  2:47                                                                 ` Al Viro
  2017-04-29 15:51                                                                   ` Dave Jones
  0 siblings, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-04-29  2:47 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

On Fri, Apr 28, 2017 at 09:58:47PM -0400, Dave Jones wrote:
> On Fri, Apr 28, 2017 at 07:25:12PM +0100, Al Viro wrote:
>  > On Fri, Apr 28, 2017 at 06:20:25PM +0100, Al Viro wrote:
>  > > On Fri, Apr 28, 2017 at 12:50:24PM -0400, Dave Jones wrote:
>  > > > currently running v4.11-rc8-75-gf83246089ca0
>  > > > 
>  > > > sunrpc bit is for the other unrelated problem I'm chasing.
>  > > > 
>  > > > note also, I saw the backtrace without the fs/splice.c changes.
>  > > 
>  > > 	Interesting...  Could you add this and see if that triggers?
>  > 
>  > Gyah...  It's a bloody dumb braino in iov_iter_revert() for pipe-backed
>  > ones.  Sorry, the oneliner below should fix it.
> 
> 5 hrs in, looking good so far.

Mind giving your Tested-by on the fix?


* Re: iov_iter_pipe warning.
  2017-04-29  2:47                                                                 ` Al Viro
@ 2017-04-29 15:51                                                                   ` Dave Jones
  2017-04-29 20:46                                                                     ` [git pull] vfs.git fix (Re: iov_iter_pipe warning.) Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-04-29 15:51 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Sat, Apr 29, 2017 at 03:47:36AM +0100, Al Viro wrote:
 > On Fri, Apr 28, 2017 at 09:58:47PM -0400, Dave Jones wrote:
 > > On Fri, Apr 28, 2017 at 07:25:12PM +0100, Al Viro wrote:
 > >  > On Fri, Apr 28, 2017 at 06:20:25PM +0100, Al Viro wrote:
 > >  > > On Fri, Apr 28, 2017 at 12:50:24PM -0400, Dave Jones wrote:
 > >  > > > currently running v4.11-rc8-75-gf83246089ca0
 > >  > > > 
 > >  > > > sunrpc bit is for the other unrelated problem I'm chasing.
 > >  > > > 
 > >  > > > note also, I saw the backtrace without the fs/splice.c changes.
 > >  > > 
 > >  > > 	Interesting...  Could you add this and see if that triggers?
 > >  > 
 > >  > Gyah...  It's a bloody dumb braino in iov_iter_revert() for pipe-backed
 > >  > ones.  Sorry, the oneliner below should fix it.
 > > 
 > > 5 hrs in, looking good so far.
 > 
 > Mind your Tested-by on the fix?

Sure.

Tested-by: Dave Jones <davej@codemonkey.org.uk>


* [git pull] vfs.git fix (Re: iov_iter_pipe warning.)
  2017-04-29 15:51                                                                   ` Dave Jones
@ 2017-04-29 20:46                                                                     ` Al Viro
  0 siblings, 0 replies; 73+ messages in thread
From: Al Viro @ 2017-04-29 20:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, linux-kernel

On Sat, Apr 29, 2017 at 11:51:40AM -0400, Dave Jones wrote:

>  > >  > Gyah...  It's a bloody dumb braino in iov_iter_revert() for pipe-backed
>  > >  > ones.  Sorry, the oneliner below should fix it.
>  > > 
>  > > 5 hrs in, looking good so far.
>  > 
>  > Mind adding your Tested-by to the fix?
> 
> Sure.
> 
> Tested-by: Dave Jones <davej@codemonkey.org.uk>

The following changes since commit 1741937d475d91ed95abb37f07e8571e23b9a7fe:

  uapi: change the type of struct statx_timestamp.tv_nsec to unsigned (2017-04-26 21:19:05 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus

for you to fetch changes up to 4fa55cefee1bbecadb4c9f47d40a92f65dc44351:

  fix a braino in ITER_PIPE iov_iter_revert() (2017-04-29 16:42:30 -0400)

----------------------------------------------------------------
Al Viro (1):
      fix a braino in ITER_PIPE iov_iter_revert()

 lib/iov_iter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-04-28 17:20                                                           ` Al Viro
  2017-04-28 18:25                                                             ` Al Viro
@ 2017-08-07 20:18                                                             ` Dave Jones
  2017-08-28 20:31                                                               ` Dave Jones
  1 sibling, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-08-07 20:18 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel

On Fri, Apr 28, 2017 at 06:20:25PM +0100, Al Viro wrote:
 > On Fri, Apr 28, 2017 at 12:50:24PM -0400, Dave Jones wrote:
 > > currently running v4.11-rc8-75-gf83246089ca0
 > > 
 > > sunrpc bit is for the other unrelated problem I'm chasing.
 > > 
 > > note also, I saw the backtrace without the fs/splice.c changes.
 > 
 > 	Interesting...  Could you add this and see if that triggers?
 > 
 > diff --git a/fs/splice.c b/fs/splice.c
 > index 540c4a44756c..12a12d9c313f 100644
 > --- a/fs/splice.c
 > +++ b/fs/splice.c
 > @@ -306,6 +306,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 >  	kiocb.ki_pos = *ppos;
 >  	ret = call_read_iter(in, &kiocb, &to);
 >  	if (ret > 0) {
 > +		if (WARN_ON(iov_iter_count(&to) != len - ret))
 > +			printk(KERN_ERR "ops %p: was %zd, left %zd, returned %d\n",
 > +				in->f_op, len, iov_iter_count(&to), ret);
 >  		*ppos = kiocb.ki_pos;
 >  		file_accessed(in);
 >  	} else if (ret < 0) {

Hey Al,
 Due to a git stash screw up on my part, I've had this leftover WARN_ON
in my tree for the last couple months. (That screw-up might turn out to be
serendipitous if this is a real bug..)

Today I decided to change things up and beat up on xfs for a change, and
was able to trigger this again.

Is this check no longer valid, or am I triggering the same bug we chased
down in nfs, but now in xfs?  (None of the other detritus from that debugging
back in April made it, just those three lines above).

	Dave

WARNING: CPU: 1 PID: 18377 at fs/splice.c:309 generic_file_splice_read+0x3e4/0x430                                                                                 
CPU: 1 PID: 18377 Comm: trinity-c17 Not tainted 4.13.0-rc4-think+ #1      
task: ffff88045d2855c0 task.stack: ffff88045ca28000                       
RIP: 0010:generic_file_splice_read+0x3e4/0x430                            
RSP: 0018:ffff88045ca2f900 EFLAGS: 00010206                               
RAX: 000000000000001f RBX: ffff88045c36e200 RCX: 0000000000000000         
RDX: 0000000000000fe1 RSI: dffffc0000000000 RDI: ffff88045ca2f960         
RBP: ffff88045ca2fa38 R08: ffff88046b26b880 R09: 000000000000001f         
R10: ffff88045ca2f540 R11: 0000000000000000 R12: ffff88045ca2f9b0         
R13: ffff88045ca2fa10 R14: 1ffff1008b945f26 R15: ffff88045c36e228         
FS:  00007f5580594700(0000) GS:ffff88046b200000(0000) knlGS:0000000000000000                                                                                       
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033                         
CR2: 00007f5580594698 CR3: 000000045d3ef000 CR4: 00000000001406e0         
Call Trace:                  
 ? pipe_to_user+0xa0/0xa0    
 ? lockdep_init_map+0xb2/0x2b0                                            
 ? rw_verify_area+0x9d/0x150 
 do_splice_to+0xab/0xc0      
 splice_direct_to_actor+0x1ac/0x480                                       
 ? generic_pipe_buf_nosteal+0x10/0x10                                     
 ? do_splice_to+0xc0/0xc0    
 ? rw_verify_area+0x9d/0x150 
 do_splice_direct+0x1b9/0x230
 ? splice_direct_to_actor+0x480/0x480                                     
 ? retint_kernel+0x10/0x10   
 ? rw_verify_area+0x9d/0x150 
 do_sendfile+0x428/0x840     
 ? do_compat_pwritev64+0xb0/0xb0                                          
 ? copy_user_generic_unrolled+0x83/0xb0                                   
 SyS_sendfile64+0xa4/0x120   
 ? SyS_sendfile+0x150/0x150  
 ? mark_held_locks+0x23/0xb0 
 ? do_syscall_64+0xc0/0x3e0  
 ? SyS_sendfile+0x150/0x150  
 do_syscall_64+0x1bc/0x3e0   
 ? syscall_return_slowpath+0x240/0x240                                    
 ? mark_held_locks+0x23/0xb0 
 ? return_from_SYSCALL_64+0x2d/0x7a                                       
 ? trace_hardirqs_on_caller+0x182/0x260                                   
 ? trace_hardirqs_on_thunk+0x1a/0x1c                                      
 entry_SYSCALL64_slow_path+0x25/0x25                                      
RIP: 0033:0x7f557febf219     
RSP: 002b:00007ffc25086db8 EFLAGS: 00000246                               
 ORIG_RAX: 0000000000000028  
RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007f557febf219         
RDX: 00007f557e559000 RSI: 0000000000000187 RDI: 0000000000000199         
RBP: 00007ffc25086e60 R08: 0000000000000100 R09: 0000000000006262         
R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000002         
R13: 00007f5580516058 R14: 00007f5580594698 R15: 00007f5580516000         
---[ end trace e2f2217aba545e92 ]---                                      
ops ffffffffa09e4920: was 4096, left 0, returned 31                       

$ grep ffffffffa09e4920 /proc/kallsyms             
ffffffffa09e4920 r xfs_file_operations  [xfs]                                            

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-08-07 20:18                                                             ` iov_iter_pipe warning Dave Jones
@ 2017-08-28 20:31                                                               ` Dave Jones
  2017-08-29  4:25                                                                 ` Darrick J. Wong
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-08-28 20:31 UTC (permalink / raw)
  To: Al Viro, Linux Kernel; +Cc: linux-xfs

On Mon, Aug 07, 2017 at 04:18:18PM -0400, Dave Jones wrote:
 > On Fri, Apr 28, 2017 at 06:20:25PM +0100, Al Viro wrote:
 >  > On Fri, Apr 28, 2017 at 12:50:24PM -0400, Dave Jones wrote:
 >  > > currently running v4.11-rc8-75-gf83246089ca0
 >  > > 
 >  > > sunrpc bit is for the other unrelated problem I'm chasing.
 >  > > 
 >  > > note also, I saw the backtrace without the fs/splice.c changes.
 >  > 
 >  > 	Interesting...  Could you add this and see if that triggers?
 >  > 
 >  > diff --git a/fs/splice.c b/fs/splice.c
 >  > index 540c4a44756c..12a12d9c313f 100644
 >  > --- a/fs/splice.c
 >  > +++ b/fs/splice.c
 >  > @@ -306,6 +306,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 >  >  	kiocb.ki_pos = *ppos;
 >  >  	ret = call_read_iter(in, &kiocb, &to);
 >  >  	if (ret > 0) {
 >  > +		if (WARN_ON(iov_iter_count(&to) != len - ret))
 >  > +			printk(KERN_ERR "ops %p: was %zd, left %zd, returned %d\n",
 >  > +				in->f_op, len, iov_iter_count(&to), ret);
 >  >  		*ppos = kiocb.ki_pos;
 >  >  		file_accessed(in);
 >  >  	} else if (ret < 0) {
 > 
 > Hey Al,
 >  Due to a git stash screw up on my part, I've had this leftover WARN_ON
 > in my tree for the last couple months. (That screw-up might turn out to be
 > serendipitous if this is a real bug..)
 > 
 > Today I decided to change things up and beat up on xfs for a change, and
 > was able to trigger this again.
 > 
 > Is this check no longer valid, or am I triggering the same bug we chased
 > down in nfs, but now in xfs?  (None of the other detritus from that debugging
 > back in April made it, just those three lines above).

Revisiting this. I went back and dug out some of the other debug diffs [1]
from that old thread.

I can easily trigger this spew on xfs.


WARNING: CPU: 1 PID: 2251 at fs/splice.c:292 test_it+0xd4/0x1d0
CPU: 1 PID: 2251 Comm: trinity-c42 Not tainted 4.13.0-rc7-think+ #1 
task: ffff880459173a40 task.stack: ffff88044f7d0000
RIP: 0010:test_it+0xd4/0x1d0
RSP: 0018:ffff88044f7d7878 EFLAGS: 00010283
RAX: 0000000000000000 RBX: ffff88044f44b968 RCX: ffffffff81511ea0
RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff88044f44ba68
RBP: ffff88044f7d78c8 R08: ffff88046b218ec0 R09: 0000000000000000
R10: ffff88044f7d7518 R11: 0000000000000000 R12: 0000000000001000
R13: 0000000000000001 R14: 00000000ffffffff R15: 0000000000000001
FS:  00007fdbc09b2700(0000) GS:ffff88046b200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000459e1d000 CR4: 00000000001406e0
Call Trace:
 generic_file_splice_read+0x414/0x4e0
 ? opipe_prep.part.14+0x180/0x180
 ? lockdep_init_map+0xb2/0x2b0
 ? rw_verify_area+0x65/0x150
 do_splice_to+0xab/0xc0
 splice_direct_to_actor+0x1f5/0x540
 ? generic_pipe_buf_nosteal+0x10/0x10
 ? do_splice_to+0xc0/0xc0
 ? rw_verify_area+0x9d/0x150
 do_splice_direct+0x1b9/0x230
 ? splice_direct_to_actor+0x540/0x540
 ? __sb_start_write+0x164/0x1c0
 ? do_sendfile+0x7b3/0x840
 do_sendfile+0x428/0x840
 ? do_compat_pwritev64+0xb0/0xb0
 ? __might_sleep+0x72/0xe0
 ? kasan_check_write+0x14/0x20
 SyS_sendfile64+0xa4/0x120
 ? SyS_sendfile+0x150/0x150
 ? mark_held_locks+0x23/0xb0
 ? do_syscall_64+0xc0/0x3e0
 ? SyS_sendfile+0x150/0x150
 do_syscall_64+0x1bc/0x3e0
 ? syscall_return_slowpath+0x240/0x240
 ? mark_held_locks+0x23/0xb0
 ? return_from_SYSCALL_64+0x2d/0x7a
 ? trace_hardirqs_on_caller+0x182/0x260
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7fdbc02dd219
RSP: 002b:00007ffc5024fa48 EFLAGS: 00000246
 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007fdbc02dd219
RDX: 00007fdbbe348000 RSI: 0000000000000011 RDI: 0000000000000015
RBP: 00007ffc5024faf0 R08: 000000000000006d R09: 0094e82f2c730a50
R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000002
R13: 00007fdbc0885058 R14: 00007fdbc09b2698 R15: 00007fdbc0885000
---[ end trace a5847ef0f7be7e20 ]---
asked to read 4096, claims to have read 1
actual size of data in pipe 4096 
[0:4096]
f_op: ffffffffa058c920, f_flags: 49154, pos: 0/1, size: 0


I'm still trying to narrow down an exact reproducer, but it seems having
trinity do a combination of sendfile & writev, with pipes and regular
files as fd's is the best repro.

Is this a real problem, or am I chasing ghosts ?  That it doesn't happen
on ext4 or btrfs is making me wonder...

	Dave


[1] https://lkml.org/lkml/2017/4/11/921

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-08-28 20:31                                                               ` Dave Jones
@ 2017-08-29  4:25                                                                 ` Darrick J. Wong
  2017-08-30 17:05                                                                   ` Dave Jones
  2017-09-06 20:03                                                                   ` Dave Jones
  0 siblings, 2 replies; 73+ messages in thread
From: Darrick J. Wong @ 2017-08-29  4:25 UTC (permalink / raw)
  To: Dave Jones, Al Viro, Linux Kernel, linux-xfs

On Mon, Aug 28, 2017 at 04:31:30PM -0400, Dave Jones wrote:
> On Mon, Aug 07, 2017 at 04:18:18PM -0400, Dave Jones wrote:
>  > On Fri, Apr 28, 2017 at 06:20:25PM +0100, Al Viro wrote:
>  >  > On Fri, Apr 28, 2017 at 12:50:24PM -0400, Dave Jones wrote:
>  >  > > currently running v4.11-rc8-75-gf83246089ca0
>  >  > > 
>  >  > > sunrpc bit is for the other unrelated problem I'm chasing.
>  >  > > 
>  >  > > note also, I saw the backtrace without the fs/splice.c changes.
>  >  > 
>  >  > 	Interesting...  Could you add this and see if that triggers?
>  >  > 
>  >  > diff --git a/fs/splice.c b/fs/splice.c
>  >  > index 540c4a44756c..12a12d9c313f 100644
>  >  > --- a/fs/splice.c
>  >  > +++ b/fs/splice.c
>  >  > @@ -306,6 +306,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
>  >  >  	kiocb.ki_pos = *ppos;
>  >  >  	ret = call_read_iter(in, &kiocb, &to);
>  >  >  	if (ret > 0) {
>  >  > +		if (WARN_ON(iov_iter_count(&to) != len - ret))
>  >  > +			printk(KERN_ERR "ops %p: was %zd, left %zd, returned %d\n",
>  >  > +				in->f_op, len, iov_iter_count(&to), ret);
>  >  >  		*ppos = kiocb.ki_pos;
>  >  >  		file_accessed(in);
>  >  >  	} else if (ret < 0) {
>  > 
>  > Hey Al,
>  >  Due to a git stash screw up on my part, I've had this leftover WARN_ON
>  > in my tree for the last couple months. (That screw-up might turn out to be
>  > serendipitous if this is a real bug..)
>  > 
>  > Today I decided to change things up and beat up on xfs for a change, and
>  > was able to trigger this again.
>  > 
>  > Is this check no longer valid, or am I triggering the same bug we chased
>  > down in nfs, but now in xfs?  (None of the other detritus from that debugging
>  > back in April made it, just those three lines above).
> 
> Revisiting this. I went back and dug out some of the other debug diffs [1]
> from that old thread.
> 
> I can easily trigger this spew on xfs.
> 
> 
> WARNING: CPU: 1 PID: 2251 at fs/splice.c:292 test_it+0xd4/0x1d0
> CPU: 1 PID: 2251 Comm: trinity-c42 Not tainted 4.13.0-rc7-think+ #1 
> task: ffff880459173a40 task.stack: ffff88044f7d0000
> RIP: 0010:test_it+0xd4/0x1d0
> RSP: 0018:ffff88044f7d7878 EFLAGS: 00010283
> RAX: 0000000000000000 RBX: ffff88044f44b968 RCX: ffffffff81511ea0
> RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff88044f44ba68
> RBP: ffff88044f7d78c8 R08: ffff88046b218ec0 R09: 0000000000000000
> R10: ffff88044f7d7518 R11: 0000000000000000 R12: 0000000000001000
> R13: 0000000000000001 R14: 00000000ffffffff R15: 0000000000000001
> FS:  00007fdbc09b2700(0000) GS:ffff88046b200000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 0000000459e1d000 CR4: 00000000001406e0
> Call Trace:
>  generic_file_splice_read+0x414/0x4e0
>  ? opipe_prep.part.14+0x180/0x180
>  ? lockdep_init_map+0xb2/0x2b0
>  ? rw_verify_area+0x65/0x150
>  do_splice_to+0xab/0xc0
>  splice_direct_to_actor+0x1f5/0x540
>  ? generic_pipe_buf_nosteal+0x10/0x10
>  ? do_splice_to+0xc0/0xc0
>  ? rw_verify_area+0x9d/0x150
>  do_splice_direct+0x1b9/0x230
>  ? splice_direct_to_actor+0x540/0x540
>  ? __sb_start_write+0x164/0x1c0
>  ? do_sendfile+0x7b3/0x840
>  do_sendfile+0x428/0x840
>  ? do_compat_pwritev64+0xb0/0xb0
>  ? __might_sleep+0x72/0xe0
>  ? kasan_check_write+0x14/0x20
>  SyS_sendfile64+0xa4/0x120
>  ? SyS_sendfile+0x150/0x150
>  ? mark_held_locks+0x23/0xb0
>  ? do_syscall_64+0xc0/0x3e0
>  ? SyS_sendfile+0x150/0x150
>  do_syscall_64+0x1bc/0x3e0
>  ? syscall_return_slowpath+0x240/0x240
>  ? mark_held_locks+0x23/0xb0
>  ? return_from_SYSCALL_64+0x2d/0x7a
>  ? trace_hardirqs_on_caller+0x182/0x260
>  ? trace_hardirqs_on_thunk+0x1a/0x1c
>  entry_SYSCALL64_slow_path+0x25/0x25
> RIP: 0033:0x7fdbc02dd219
> RSP: 002b:00007ffc5024fa48 EFLAGS: 00000246
>  ORIG_RAX: 0000000000000028
> RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007fdbc02dd219
> RDX: 00007fdbbe348000 RSI: 0000000000000011 RDI: 0000000000000015
> RBP: 00007ffc5024faf0 R08: 000000000000006d R09: 0094e82f2c730a50
> R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000002
> R13: 00007fdbc0885058 R14: 00007fdbc09b2698 R15: 00007fdbc0885000
> ---[ end trace a5847ef0f7be7e20 ]---
> asked to read 4096, claims to have read 1
> actual size of data in pipe 4096 
> [0:4096]
> f_op: ffffffffa058c920, f_flags: 49154, pos: 0/1, size: 0
> 
> 
> I'm still trying to narrow down an exact reproducer, but it seems having
> trinity do a combination of sendfile & writev, with pipes and regular
> files as fd's is the best repro.
> 
> Is this a real problem, or am I chasing ghosts ?  That it doesn't happen
> on ext4 or btrfs is making me wonder...

<shrug> I haven't heard of any problems w/ directio xfs lately, but OTOH
I think it's the only filesystem that uses iomap_dio_rw, which would
explain why ext4/btrfs don't have this problem.

Granted that's idle speculation; is there a reproducer/xfstest for this?

--D

> 
> 	Dave
> 
> 
> [1] https://lkml.org/lkml/2017/4/11/921
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-08-29  4:25                                                                 ` Darrick J. Wong
@ 2017-08-30 17:05                                                                   ` Dave Jones
  2017-08-30 17:13                                                                     ` Darrick J. Wong
  2017-09-06 20:03                                                                   ` Dave Jones
  1 sibling, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-08-30 17:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Al Viro, Linux Kernel, linux-xfs

On Mon, Aug 28, 2017 at 09:25:42PM -0700, Darrick J. Wong wrote:
 > On Mon, Aug 28, 2017 at 04:31:30PM -0400, Dave Jones wrote:
 > > On Mon, Aug 07, 2017 at 04:18:18PM -0400, Dave Jones wrote:
 > >  > On Fri, Apr 28, 2017 at 06:20:25PM +0100, Al Viro wrote:
 > >  >  > On Fri, Apr 28, 2017 at 12:50:24PM -0400, Dave Jones wrote:
 > >  >  > 
 > >  >  > diff --git a/fs/splice.c b/fs/splice.c
 > >  >  > index 540c4a44756c..12a12d9c313f 100644
 > >  >  > --- a/fs/splice.c
 > >  >  > +++ b/fs/splice.c
 > >  >  > @@ -306,6 +306,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 > >  >  >  	kiocb.ki_pos = *ppos;
 > >  >  >  	ret = call_read_iter(in, &kiocb, &to);
 > >  >  >  	if (ret > 0) {
 > >  >  > +		if (WARN_ON(iov_iter_count(&to) != len - ret))
 > >  >  > +			printk(KERN_ERR "ops %p: was %zd, left %zd, returned %d\n",
 > >  >  > +				in->f_op, len, iov_iter_count(&to), ret);
 > >  >  >  		*ppos = kiocb.ki_pos;
 > >  >  >  		file_accessed(in);
 > >  >  >  	} else if (ret < 0) {
 > >  > 
 > >  > Hey Al,
 > >  >  Due to a git stash screw up on my part, I've had this leftover WARN_ON
 > >  > in my tree for the last couple months. (That screw-up might turn out to be
 > >  > serendipitous if this is a real bug..)
 > >  > 
 > >  > Today I decided to change things up and beat up on xfs for a change, and
 > >  > was able to trigger this again.
 > >  > 
 > >  > Is this check no longer valid, or am I triggering the same bug we chased
 > >  > down in nfs, but now in xfs?  (None of the other detritus from that debugging
 > >  > back in April made it, just those three lines above).
 > > 
 > > Revisiting this. I went back and dug out some of the other debug diffs [1]
 > > from that old thread.
 > > 
 > > I can easily trigger this spew on xfs.
 > > 
 > > ...
 > >
 > > asked to read 4096, claims to have read 1
 > > actual size of data in pipe 4096 
 > > [0:4096]
 > > f_op: ffffffffa058c920, f_flags: 49154, pos: 0/1, size: 0
 > > 
 > > 
 > > I'm still trying to narrow down an exact reproducer, but it seems having
 > > trinity do a combination of sendfile & writev, with pipes and regular
 > > files as fd's is the best repro.
 > > 
 > > Is this a real problem, or am I chasing ghosts ?  That it doesn't happen
 > > on ext4 or btrfs is making me wonder...
 > 
 > <shrug> I haven't heard of any problems w/ directio xfs lately, but OTOH
 > I think it's the only filesystem that uses iomap_dio_rw, which would
 > explain why ext4/btrfs don't have this problem.
 > 
 > Granted that's idle speculation; is there a reproducer/xfstest for this?

I reverted the debug patches mentioned above, and ran trinity for a while again,
and got this, which smells suspiciously related:

WARNING: CPU: 1 PID: 10380 at fs/iomap.c:993 iomap_dio_rw+0x825/0x840
CPU: 1 PID: 10380 Comm: trinity-c30 Not tainted 4.13.0-rc7-think+ #3 
task: ffff8804613a5740 task.stack: ffff880432120000
RIP: 0010:iomap_dio_rw+0x825/0x840
RSP: 0018:ffff880432127890 EFLAGS: 00010286
RAX: 00000000fffffff0 RBX: ffff88046a64d0e8 RCX: 0000000000000000
RDX: ffffed0086424e9b RSI: 0000000000000000 RDI: ffffed0086424f03
RBP: ffff880432127a70 R08: ffff88046b239840 R09: 0000000000000001
R10: ffff880432126f50 R11: 0000000000000000 R12: ffff880432127c40
R13: 0000000000000e0a R14: 1ffff10086424f20 R15: ffff880432127ca0
FS:  00007f4cda32f700(0000) GS:ffff88046b200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f181e02f000 CR3: 000000043d32a000 CR4: 00000000001406e0
Call Trace:
 ? iomap_seek_data+0xc0/0xc0
 ? down_read_non_owner+0x40/0x40
 ? xfs_ilock+0x3f2/0x490 [xfs]
 ? touch_atime+0x9c/0x180
 ? __atime_needs_update+0x440/0x440
 xfs_file_dio_aio_read+0x12d/0x390 [xfs]
 ? xfs_file_dio_aio_read+0x12d/0x390 [xfs]
 ? xfs_file_fallocate+0x660/0x660 [xfs]
 ? cyc2ns_read_end+0x10/0x10
 xfs_file_read_iter+0x1bb/0x1d0 [xfs]
 __vfs_read+0x332/0x440
 ? default_llseek+0x140/0x140
 ? cyc2ns_read_end+0x10/0x10
 ? __fget_light+0x1ae/0x230
 ? rcu_is_watching+0x8d/0xd0
 ? exit_to_usermode_loop+0x1b0/0x1b0
 ? rw_verify_area+0x9d/0x150
 vfs_read+0xc8/0x1c0
 SyS_pread64+0x11a/0x140
 ? SyS_write+0x160/0x160
 ? do_syscall_64+0xc0/0x3e0
 ? SyS_write+0x160/0x160
 do_syscall_64+0x1bc/0x3e0
 ? syscall_return_slowpath+0x240/0x240
 ? cpumask_check.part.2+0x10/0x10
 ? cpumask_check.part.2+0x10/0x10
 ? mark_held_locks+0x23/0xb0
 ? return_from_SYSCALL_64+0x2d/0x7a
 ? trace_hardirqs_on_caller+0x182/0x260
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f4cd9c5a219
RSP: 002b:00007ffce0d90888 EFLAGS: 00000246
 ORIG_RAX: 0000000000000011
RAX: ffffffffffffffda RBX: 0000000000000011 RCX: 00007f4cd9c5a219
RDX: 0000000000000e0a RSI: 00007f4cd7ec5000 RDI: 000000000000000f
RBP: 00007ffce0d90930 R08: fffffffffffffffd R09: 000000000000001b
R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f4cda256058 R14: 00007f4cda32f698 R15: 00007f4cda256000
Code: 83 bd 78 08 00 00 00 0f 85 8e fd ff ff 4c 89 ef e8 d1 46 fa ff 85 c0 89 c2 0f 89 7c fd ff ff 31 c0 f0 0f b1 53 28 e9 70 fd ff ff <0f> ff e9 82 fc ff ff 48 c7 c1 f4 ff ff ff e9 0d f9 ff ff e8 93 
---[ end trace 2f6189daedf1df6e ]---



That's this..

 987         ret = filemap_write_and_wait_range(mapping, start, end);
 988         if (ret)
 989                 goto out_free_dio;
 990 
 991         ret = invalidate_inode_pages2_range(mapping,
 992                         start >> PAGE_SHIFT, end >> PAGE_SHIFT);
 993         WARN_ON_ONCE(ret);


Plot thickens..

	Dave

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-08-30 17:05                                                                   ` Dave Jones
@ 2017-08-30 17:13                                                                     ` Darrick J. Wong
  2017-08-30 17:17                                                                       ` Dave Jones
  0 siblings, 1 reply; 73+ messages in thread
From: Darrick J. Wong @ 2017-08-30 17:13 UTC (permalink / raw)
  To: Dave Jones, Al Viro, Linux Kernel, linux-xfs

On Wed, Aug 30, 2017 at 01:05:23PM -0400, Dave Jones wrote:
> On Mon, Aug 28, 2017 at 09:25:42PM -0700, Darrick J. Wong wrote:
>  > On Mon, Aug 28, 2017 at 04:31:30PM -0400, Dave Jones wrote:
>  > > On Mon, Aug 07, 2017 at 04:18:18PM -0400, Dave Jones wrote:
>  > >  > On Fri, Apr 28, 2017 at 06:20:25PM +0100, Al Viro wrote:
>  > >  >  > On Fri, Apr 28, 2017 at 12:50:24PM -0400, Dave Jones wrote:
>  > >  >  > 
>  > >  >  > diff --git a/fs/splice.c b/fs/splice.c
>  > >  >  > index 540c4a44756c..12a12d9c313f 100644
>  > >  >  > --- a/fs/splice.c
>  > >  >  > +++ b/fs/splice.c
>  > >  >  > @@ -306,6 +306,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
>  > >  >  >  	kiocb.ki_pos = *ppos;
>  > >  >  >  	ret = call_read_iter(in, &kiocb, &to);
>  > >  >  >  	if (ret > 0) {
>  > >  >  > +		if (WARN_ON(iov_iter_count(&to) != len - ret))
>  > >  >  > +			printk(KERN_ERR "ops %p: was %zd, left %zd, returned %d\n",
>  > >  >  > +				in->f_op, len, iov_iter_count(&to), ret);
>  > >  >  >  		*ppos = kiocb.ki_pos;
>  > >  >  >  		file_accessed(in);
>  > >  >  >  	} else if (ret < 0) {
>  > >  > 
>  > >  > Hey Al,
>  > >  >  Due to a git stash screw up on my part, I've had this leftover WARN_ON
>  > >  > in my tree for the last couple months. (That screw-up might turn out to be
>  > >  > serendipitous if this is a real bug..)
>  > >  > 
>  > >  > Today I decided to change things up and beat up on xfs for a change, and
>  > >  > was able to trigger this again.
>  > >  > 
>  > >  > Is this check no longer valid, or am I triggering the same bug we chased
>  > >  > down in nfs, but now in xfs?  (None of the other detritus from that debugging
>  > >  > back in April made it, just those three lines above).
>  > > 
>  > > Revisiting this. I went back and dug out some of the other debug diffs [1]
>  > > from that old thread.
>  > > 
>  > > I can easily trigger this spew on xfs.
>  > > 
>  > > ...
>  > >
>  > > asked to read 4096, claims to have read 1
>  > > actual size of data in pipe 4096 
>  > > [0:4096]
>  > > f_op: ffffffffa058c920, f_flags: 49154, pos: 0/1, size: 0
>  > > 
>  > > 
>  > > I'm still trying to narrow down an exact reproducer, but it seems having
>  > > trinity do a combination of sendfile & writev, with pipes and regular
>  > > files as fd's is the best repro.
>  > > 
>  > > Is this a real problem, or am I chasing ghosts ?  That it doesn't happen
>  > > on ext4 or btrfs is making me wonder...
>  > 
>  > <shrug> I haven't heard of any problems w/ directio xfs lately, but OTOH
>  > I think it's the only filesystem that uses iomap_dio_rw, which would
>  > explain why ext4/btrfs don't have this problem.
>  > 
>  > Granted that's idle speculation; is there a reproducer/xfstest for this?
> 
> I reverted the debug patches mentioned above, and ran trinity for a while again,
> and got this, which smells suspiciously related:
> 
> WARNING: CPU: 1 PID: 10380 at fs/iomap.c:993 iomap_dio_rw+0x825/0x840
> CPU: 1 PID: 10380 Comm: trinity-c30 Not tainted 4.13.0-rc7-think+ #3 
> task: ffff8804613a5740 task.stack: ffff880432120000
> RIP: 0010:iomap_dio_rw+0x825/0x840
> RSP: 0018:ffff880432127890 EFLAGS: 00010286
> RAX: 00000000fffffff0 RBX: ffff88046a64d0e8 RCX: 0000000000000000
> RDX: ffffed0086424e9b RSI: 0000000000000000 RDI: ffffed0086424f03
> RBP: ffff880432127a70 R08: ffff88046b239840 R09: 0000000000000001
> R10: ffff880432126f50 R11: 0000000000000000 R12: ffff880432127c40
> R13: 0000000000000e0a R14: 1ffff10086424f20 R15: ffff880432127ca0
> FS:  00007f4cda32f700(0000) GS:ffff88046b200000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f181e02f000 CR3: 000000043d32a000 CR4: 00000000001406e0
> Call Trace:
>  ? iomap_seek_data+0xc0/0xc0
>  ? down_read_non_owner+0x40/0x40
>  ? xfs_ilock+0x3f2/0x490 [xfs]
>  ? touch_atime+0x9c/0x180
>  ? __atime_needs_update+0x440/0x440
>  xfs_file_dio_aio_read+0x12d/0x390 [xfs]
>  ? xfs_file_dio_aio_read+0x12d/0x390 [xfs]
>  ? xfs_file_fallocate+0x660/0x660 [xfs]
>  ? cyc2ns_read_end+0x10/0x10
>  xfs_file_read_iter+0x1bb/0x1d0 [xfs]
>  __vfs_read+0x332/0x440
>  ? default_llseek+0x140/0x140
>  ? cyc2ns_read_end+0x10/0x10
>  ? __fget_light+0x1ae/0x230
>  ? rcu_is_watching+0x8d/0xd0
>  ? exit_to_usermode_loop+0x1b0/0x1b0
>  ? rw_verify_area+0x9d/0x150
>  vfs_read+0xc8/0x1c0
>  SyS_pread64+0x11a/0x140
>  ? SyS_write+0x160/0x160
>  ? do_syscall_64+0xc0/0x3e0
>  ? SyS_write+0x160/0x160
>  do_syscall_64+0x1bc/0x3e0
>  ? syscall_return_slowpath+0x240/0x240
>  ? cpumask_check.part.2+0x10/0x10
>  ? cpumask_check.part.2+0x10/0x10
>  ? mark_held_locks+0x23/0xb0
>  ? return_from_SYSCALL_64+0x2d/0x7a
>  ? trace_hardirqs_on_caller+0x182/0x260
>  ? trace_hardirqs_on_thunk+0x1a/0x1c
>  entry_SYSCALL64_slow_path+0x25/0x25
> RIP: 0033:0x7f4cd9c5a219
> RSP: 002b:00007ffce0d90888 EFLAGS: 00000246
>  ORIG_RAX: 0000000000000011
> RAX: ffffffffffffffda RBX: 0000000000000011 RCX: 00007f4cd9c5a219
> RDX: 0000000000000e0a RSI: 00007f4cd7ec5000 RDI: 000000000000000f
> RBP: 00007ffce0d90930 R08: fffffffffffffffd R09: 000000000000001b
> R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000002
> R13: 00007f4cda256058 R14: 00007f4cda32f698 R15: 00007f4cda256000
> Code: 83 bd 78 08 00 00 00 0f 85 8e fd ff ff 4c 89 ef e8 d1 46 fa ff 85 c0 89 c2 0f 89 7c fd ff ff 31 c0 f0 0f b1 53 28 e9 70 fd ff ff <0f> ff e9 82 fc ff ff 48 c7 c1 f4 ff ff ff e9 0d f9 ff ff e8 93 
> ---[ end trace 2f6189daedf1df6e ]---
> 
> 
> 
> That's this..
> 
>  987         ret = filemap_write_and_wait_range(mapping, start, end);
>  988         if (ret)
>  989                 goto out_free_dio;
>  990 
>  991         ret = invalidate_inode_pages2_range(mapping,
>  992                         start >> PAGE_SHIFT, end >> PAGE_SHIFT);
>  993         WARN_ON_ONCE(ret);
> 
> 
> Plot thickens..

Hm, that's the WARN_ON that comes from a failed pagecache invalidation
prior to a dio operation, which implies that something's mixing buffered
and dio?

Given that it's syzkaller it wouldn't surprise me to hear that it's
doing that... :)

--D

> 
> 	Dave

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-08-30 17:13                                                                     ` Darrick J. Wong
@ 2017-08-30 17:17                                                                       ` Dave Jones
  0 siblings, 0 replies; 73+ messages in thread
From: Dave Jones @ 2017-08-30 17:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Al Viro, Linux Kernel, linux-xfs

On Wed, Aug 30, 2017 at 10:13:43AM -0700, Darrick J. Wong wrote:


 > > I reverted the debug patches mentioned above, and ran trinity for a while again,
 > > and got this, which smells suspiciously related:
 > > 
 > > WARNING: CPU: 1 PID: 10380 at fs/iomap.c:993 iomap_dio_rw+0x825/0x840
 > > RAX: 00000000fffffff0 RBX: ffff88046a64d0e8 RCX: 0000000000000000
 > > 
 > > 
 > > 
 > > That's this..
 > > 
 > >  987         ret = filemap_write_and_wait_range(mapping, start, end);
 > >  988         if (ret)
 > >  989                 goto out_free_dio;
 > >  990 
 > >  991         ret = invalidate_inode_pages2_range(mapping,
 > >  992                         start >> PAGE_SHIFT, end >> PAGE_SHIFT);
 > >  993         WARN_ON_ONCE(ret);
 > > 
 > > 
 > > Plot thickens..
 > 
 > Hm, that's the WARN_ON that comes from a failed pagecache invalidation
 > prior to a dio operation, which implies that something's mixing buffered
 > and dio?

Plausible. Judging by RAX, we got -EBUSY

 > Given that it's syzkaller it wouldn't surprise me to hear that it's
 > doing that... :)

s/syzkaller/trinity/, but yes.

	Dave

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-08-29  4:25                                                                 ` Darrick J. Wong
  2017-08-30 17:05                                                                   ` Dave Jones
@ 2017-09-06 20:03                                                                   ` Dave Jones
  2017-09-06 23:46                                                                     ` Dave Chinner
  1 sibling, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-09-06 20:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Al Viro, Linux Kernel, linux-xfs

On Mon, Aug 28, 2017 at 09:25:42PM -0700, Darrick J. Wong wrote:
 > On Mon, Aug 28, 2017 at 04:31:30PM -0400, Dave Jones wrote:
 > > On Mon, Aug 07, 2017 at 04:18:18PM -0400, Dave Jones wrote:
 > >  > On Fri, Apr 28, 2017 at 06:20:25PM +0100, Al Viro wrote:
 > >  >  > On Fri, Apr 28, 2017 at 12:50:24PM -0400, Dave Jones wrote:
 > >  >  > > currently running v4.11-rc8-75-gf83246089ca0
 > >  >  > > 
 > >  >  > > sunrpc bit is for the other unrelated problem I'm chasing.
 > >  >  > > 
 > >  >  > > note also, I saw the backtrace without the fs/splice.c changes.
 > >  >  > 
 > >  >  > 	Interesting...  Could you add this and see if that triggers?
 > >  >  > 
 > >  >  > diff --git a/fs/splice.c b/fs/splice.c
 > >  >  > index 540c4a44756c..12a12d9c313f 100644
 > >  >  > --- a/fs/splice.c
 > >  >  > +++ b/fs/splice.c
 > >  >  > @@ -306,6 +306,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 > >  >  >  	kiocb.ki_pos = *ppos;
 > >  >  >  	ret = call_read_iter(in, &kiocb, &to);
 > >  >  >  	if (ret > 0) {
 > >  >  > +		if (WARN_ON(iov_iter_count(&to) != len - ret))
 > >  >  > +			printk(KERN_ERR "ops %p: was %zd, left %zd, returned %d\n",
 > >  >  > +				in->f_op, len, iov_iter_count(&to), ret);
 > >  >  >  		*ppos = kiocb.ki_pos;
 > >  >  >  		file_accessed(in);
 > >  >  >  	} else if (ret < 0) {
 > >  > 
 > >  > Hey Al,
 > >  >  Due to a git stash screw-up on my part, I've had this leftover WARN_ON
 > >  > in my tree for the last couple months. (That screw-up might turn out to be
 > >  > serendipitous if this is a real bug..)
 > >  > 
 > >  > Today I decided to change things up and beat up on xfs for a change, and
 > >  > was able to trigger this again.
 > >  > 
 > >  > Is this check no longer valid, or am I triggering the same bug we chased
 > >  > down in nfs, but now in xfs?  (None of the other detritus from that debugging
 > >  > back in April made it, just those three lines above).
 > > 
 > > Revisiting this. I went back and dug out some of the other debug diffs [1]
 > > from that old thread.
 > > 
 > > I can easily trigger this spew on xfs.
 > > 
 > > 
 > > WARNING: CPU: 1 PID: 2251 at fs/splice.c:292 test_it+0xd4/0x1d0
 > > CPU: 1 PID: 2251 Comm: trinity-c42 Not tainted 4.13.0-rc7-think+ #1 
 > > task: ffff880459173a40 task.stack: ffff88044f7d0000
 > > RIP: 0010:test_it+0xd4/0x1d0
 > > RSP: 0018:ffff88044f7d7878 EFLAGS: 00010283
 > > RAX: 0000000000000000 RBX: ffff88044f44b968 RCX: ffffffff81511ea0
 > > RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff88044f44ba68
 > > RBP: ffff88044f7d78c8 R08: ffff88046b218ec0 R09: 0000000000000000
 > > R10: ffff88044f7d7518 R11: 0000000000000000 R12: 0000000000001000
 > > R13: 0000000000000001 R14: 00000000ffffffff R15: 0000000000000001
 > > FS:  00007fdbc09b2700(0000) GS:ffff88046b200000(0000) knlGS:0000000000000000
 > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 > > CR2: 0000000000000000 CR3: 0000000459e1d000 CR4: 00000000001406e0
 > > Call Trace:
 > >  generic_file_splice_read+0x414/0x4e0
 > >  ? opipe_prep.part.14+0x180/0x180
 > >  ? lockdep_init_map+0xb2/0x2b0
 > >  ? rw_verify_area+0x65/0x150
 > >  do_splice_to+0xab/0xc0
 > >  splice_direct_to_actor+0x1f5/0x540
 > >  ? generic_pipe_buf_nosteal+0x10/0x10
 > >  ? do_splice_to+0xc0/0xc0
 > >  ? rw_verify_area+0x9d/0x150
 > >  do_splice_direct+0x1b9/0x230
 > >  ? splice_direct_to_actor+0x540/0x540
 > >  ? __sb_start_write+0x164/0x1c0
 > >  ? do_sendfile+0x7b3/0x840
 > >  do_sendfile+0x428/0x840
 > >  ? do_compat_pwritev64+0xb0/0xb0
 > >  ? __might_sleep+0x72/0xe0
 > >  ? kasan_check_write+0x14/0x20
 > >  SyS_sendfile64+0xa4/0x120
 > >  ? SyS_sendfile+0x150/0x150
 > >  ? mark_held_locks+0x23/0xb0
 > >  ? do_syscall_64+0xc0/0x3e0
 > >  ? SyS_sendfile+0x150/0x150
 > >  do_syscall_64+0x1bc/0x3e0
 > >  ? syscall_return_slowpath+0x240/0x240
 > >  ? mark_held_locks+0x23/0xb0
 > >  ? return_from_SYSCALL_64+0x2d/0x7a
 > >  ? trace_hardirqs_on_caller+0x182/0x260
 > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
 > >  entry_SYSCALL64_slow_path+0x25/0x25
 > > RIP: 0033:0x7fdbc02dd219
 > > RSP: 002b:00007ffc5024fa48 EFLAGS: 00000246
 > >  ORIG_RAX: 0000000000000028
 > > RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007fdbc02dd219
 > > RDX: 00007fdbbe348000 RSI: 0000000000000011 RDI: 0000000000000015
 > > RBP: 00007ffc5024faf0 R08: 000000000000006d R09: 0094e82f2c730a50
 > > R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000002
 > > R13: 00007fdbc0885058 R14: 00007fdbc09b2698 R15: 00007fdbc0885000
 > > ---[ end trace a5847ef0f7be7e20 ]---
 > > asked to read 4096, claims to have read 1
 > > actual size of data in pipe 4096 
 > > [0:4096]
 > > f_op: ffffffffa058c920, f_flags: 49154, pos: 0/1, size: 0
 > > 
 > > 
 > > I'm still trying to narrow down an exact reproducer, but it seems having
 > > trinity do a combination of sendfile & writev, with pipes and regular
 > > files as fd's is the best repro.
 > > 
 > > Is this a real problem, or am I chasing ghosts ?  That it doesn't happen
 > > on ext4 or btrfs is making me wonder...
 > 
 > <shrug> I haven't heard of any problems w/ directio xfs lately, but OTOH
 > I think it's the only filesystem that uses iomap_dio_rw, which would
 > explain why ext4/btrfs don't have this problem.

Another warning, from likely the same root cause.

WARNING: CPU: 3 PID: 572 at lib/iov_iter.c:962 iov_iter_pipe+0xe2/0xf0
CPU: 3 PID: 572 Comm: trinity-c45 Not tainted 4.13.0-think+ #5 
task: ffff88042e5f8040 task.stack: ffff880454cb0000
RIP: 0010:iov_iter_pipe+0xe2/0xf0
RSP: 0018:ffff880454cb7950 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff880454cb79c0 RCX: ffffffff818fc581
RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff88043f07fa80
RBP: ffff88043f07f968 R08: 0000000000000000 R09: 0000000000000000
R10: ffff880454cb7468 R11: 0000000000000000 R12: 000000000003fff8
R13: 0000000000000010 R14: ffff88043f07fa80 R15: ffff88043f07fa78
FS:  00007fac217a1700(0000) GS:ffff88046b400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000043f37e001 CR4: 00000000001606e0
Call Trace:
 generic_file_splice_read+0x8d/0x340
 ? pipe_unlock+0x15/0x30
 ? splice_from_pipe+0x10e/0x160
 ? pipe_to_user+0x80/0x80
 ? splice_shrink_spd+0x40/0x40
 ? init_wait_entry+0x100/0x100
 ? rw_verify_area+0x87/0x130
 splice_direct_to_actor+0x186/0x420
 ? generic_pipe_buf_nosteal+0x10/0x10
 ? do_splice_to+0xc0/0xc0
 ? __fget_light+0x181/0x200
 do_splice_direct+0x173/0x1e0
 ? splice_direct_to_actor+0x420/0x420
 ? mark_held_locks+0x1c/0x90
 ? retint_kernel+0x10/0x10
 do_sendfile+0x3a2/0x6d0
 ? do_compat_pwritev64+0xa0/0xa0
 ? copy_user_generic_unrolled+0x83/0xb0
 SyS_sendfile64+0xa4/0x130
 ? SyS_sendfile+0x140/0x140
 ? mark_held_locks+0x1c/0x90
 ? do_syscall_64+0xae/0x3e0
 ? SyS_sendfile+0x140/0x140
 do_syscall_64+0x182/0x3e0
 ? syscall_return_slowpath+0x250/0x250
 ? rcu_read_lock_sched_held+0x90/0xa0
 ? __context_tracking_exit.part.5+0x226/0x290
 ? mark_held_locks+0x1c/0x90
 ? return_from_SYSCALL_64+0x2d/0x7a
 ? trace_hardirqs_on_caller+0x17a/0x250
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7fac210cc219
RSP: 002b:00007ffd7fe8ab38 EFLAGS: 00000246
 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007fac210cc219
RDX: 00007fac215de000 RSI: 000000000000001b RDI: 000000000000000b
RBP: 00007ffd7fe8abe0 R08: 00000000c8c8c8c8 R09: 00000000000000c9
R10: 0000000000040000 R11: 0000000000000246 R12: 0000000000000002
R13: 00007fac2165f058 R14: 00007fac217a1698 R15: 00007fac2165f000
Code: 10 48 c7 43 08 00 00 00 00 e8 bb 79 b1 ff 4c 89 63 10 48 8d 7b 24 e8 ae 78 b1 ff 89 6b 24 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 0b <0f> ff e9 65 ff ff ff 0f 1f 80 00 00 00 00 41 57 41 56 41 55 49 
---[ end trace 4507dec15e49aca7 ]---


 962         WARN_ON(pipe->nrbufs == pipe->buffers);


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-06 20:03                                                                   ` Dave Jones
@ 2017-09-06 23:46                                                                     ` Dave Chinner
  2017-09-07  3:48                                                                       ` Dave Jones
  2017-09-08  1:04                                                                       ` Al Viro
  0 siblings, 2 replies; 73+ messages in thread
From: Dave Chinner @ 2017-09-06 23:46 UTC (permalink / raw)
  To: Dave Jones, Darrick J. Wong, Al Viro, Linux Kernel, linux-xfs

On Wed, Sep 06, 2017 at 04:03:37PM -0400, Dave Jones wrote:
> On Mon, Aug 28, 2017 at 09:25:42PM -0700, Darrick J. Wong wrote:
>  > On Mon, Aug 28, 2017 at 04:31:30PM -0400, Dave Jones wrote:
>  > > I'm still trying to narrow down an exact reproducer, but it seems having
>  > > trinity do a combination of sendfile & writev, with pipes and regular
>  > > files as fd's is the best repro.
>  > > 
>  > > Is this a real problem, or am I chasing ghosts ?  That it doesn't happen
>  > > on ext4 or btrfs is making me wonder...
>  > 
>  > <shrug> I haven't heard of any problems w/ directio xfs lately, but OTOH
>  > I think it's the only filesystem that uses iomap_dio_rw, which would
>  > explain why ext4/btrfs don't have this problem.
> 
> Another warning, from likely the same root cause.
> 
> WARNING: CPU: 3 PID: 572 at lib/iov_iter.c:962 iov_iter_pipe+0xe2/0xf0

	WARN_ON(pipe->nrbufs == pipe->buffers);

 *      @nrbufs: the number of non-empty pipe buffers in this pipe
 *      @buffers: total number of buffers (should be a power of 2)

So that's warning that the pipe buffer is already full before we
try to read from the filesystem?

That doesn't seem like an XFS problem - it indicates the pipe we are
filling in generic_file_splice_read() is not being emptied by
whatever we are splicing the file data to....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-06 23:46                                                                     ` Dave Chinner
@ 2017-09-07  3:48                                                                       ` Dave Jones
  2017-09-07  4:33                                                                         ` Al Viro
  2017-09-08  1:04                                                                       ` Al Viro
  1 sibling, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-09-07  3:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, Al Viro, Linux Kernel, linux-xfs

On Thu, Sep 07, 2017 at 09:46:17AM +1000, Dave Chinner wrote:
 > On Wed, Sep 06, 2017 at 04:03:37PM -0400, Dave Jones wrote:
 > > On Mon, Aug 28, 2017 at 09:25:42PM -0700, Darrick J. Wong wrote:
 > >  > On Mon, Aug 28, 2017 at 04:31:30PM -0400, Dave Jones wrote:
 > >  > > I'm still trying to narrow down an exact reproducer, but it seems having
 > >  > > trinity do a combination of sendfile & writev, with pipes and regular
 > >  > > files as fd's is the best repro.
 > >  > > 
 > >  > > Is this a real problem, or am I chasing ghosts ?  That it doesn't happen
 > >  > > on ext4 or btrfs is making me wonder...
 > >  > 
 > >  > <shrug> I haven't heard of any problems w/ directio xfs lately, but OTOH
 > >  > I think it's the only filesystem that uses iomap_dio_rw, which would
 > >  > explain why ext4/btrfs don't have this problem.
 > > 
 > > Another warning, from likely the same root cause.
 > > 
 > > WARNING: CPU: 3 PID: 572 at lib/iov_iter.c:962 iov_iter_pipe+0xe2/0xf0
 > 
 > 	WARN_ON(pipe->nrbufs == pipe->buffers);
 > 
 >  *      @nrbufs: the number of non-empty pipe buffers in this pipe
 >  *      @buffers: total number of buffers (should be a power of 2)
 > 
 > So that's warning that the pipe buffer is already full before we
 > try to read from the filesystem?
 > 
 > That doesn't seem like an XFS problem - it indicates the pipe we are
 > filling in generic_file_splice_read() is not being emptied by
 > whatever we are splicing the file data to....

The puzzling part is this runs for a day on ext4 or btrfs, whereas I can
make xfs fall over pretty quickly.  As Darrick pointed out though, this
could be due to xfs being the only user of iomap_dio_rw.

I'm juggling a few other things right now, so probably not going to
have much time to dig further on this until after plumbers + 1 wk.

	Dave

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-07  3:48                                                                       ` Dave Jones
@ 2017-09-07  4:33                                                                         ` Al Viro
  0 siblings, 0 replies; 73+ messages in thread
From: Al Viro @ 2017-09-07  4:33 UTC (permalink / raw)
  To: Dave Jones, Dave Chinner, Darrick J. Wong, Linux Kernel, linux-xfs

On Wed, Sep 06, 2017 at 11:48:35PM -0400, Dave Jones wrote:

>  > That doesn't seem like an XFS problem - it indicates the pipe we are
>  > filling in generic_file_splice_read() is not being emptied by
>  > whatever we are splicing the file data to....
> 
> The puzzling part is this runs for a day on ext4 or btrfs, whereas I can
> make xfs fall over pretty quickly.  As Darrick pointed out though, this
> could be due to xfs being the only user of iomap_dio_rw.
> 
> I'm juggling a few other things right now, so probably not going to
> have much time to dig further on this until after plumbers + 1 wk.

I'll look into that tomorrow...

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-06 23:46                                                                     ` Dave Chinner
  2017-09-07  3:48                                                                       ` Dave Jones
@ 2017-09-08  1:04                                                                       ` Al Viro
  2017-09-10  1:07                                                                         ` Dave Jones
  1 sibling, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-09-08  1:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Dave Jones, Darrick J. Wong, Linux Kernel, linux-xfs

On Thu, Sep 07, 2017 at 09:46:17AM +1000, Dave Chinner wrote:
> On Wed, Sep 06, 2017 at 04:03:37PM -0400, Dave Jones wrote:
> > On Mon, Aug 28, 2017 at 09:25:42PM -0700, Darrick J. Wong wrote:
> >  > On Mon, Aug 28, 2017 at 04:31:30PM -0400, Dave Jones wrote:
> >  > > I'm still trying to narrow down an exact reproducer, but it seems having
> >  > > trinity do a combination of sendfile & writev, with pipes and regular
> >  > > files as fd's is the best repro.
> >  > > 
> >  > > Is this a real problem, or am I chasing ghosts ?  That it doesn't happen
> >  > > on ext4 or btrfs is making me wonder...
> >  > 
> >  > <shrug> I haven't heard of any problems w/ directio xfs lately, but OTOH
> >  > I think it's the only filesystem that uses iomap_dio_rw, which would
> >  > explain why ext4/btrfs don't have this problem.
> > 
> > Another warning, from likely the same root cause.
> > 
> > WARNING: CPU: 3 PID: 572 at lib/iov_iter.c:962 iov_iter_pipe+0xe2/0xf0
> 
> 	WARN_ON(pipe->nrbufs == pipe->buffers);
> 
>  *      @nrbufs: the number of non-empty pipe buffers in this pipe
>  *      @buffers: total number of buffers (should be a power of 2)
> 
> So that's warning that the pipe buffer is already full before we
> try to read from the filesystem?
> 
> That doesn't seem like an XFS problem - it indicates the pipe we are
> filling in generic_file_splice_read() is not being emptied by
> whatever we are splicing the file data to....

Or that XFS in some conditions shoves into the pipe more than it reports,
so not all of that gets emptied, filling the sucker up after a sufficient
number of iterations...

There's at least one suspicious place in iomap_dio_actor() -
                if (!(dio->flags & IOMAP_DIO_WRITE)) {
                        iov_iter_zero(length, dio->submit.iter);
                        dio->size += length;
                        return length;
                }
which assumes that iov_iter_zero() always succeeds.  That's very
much _not_ true - neither for iovec-backed, nor for pipe-backed.
Orangefs read_one_page() is fine (it calls that sucker for a bvec-backed
iov_iter it's just created), but iomap_dio_actor() is not.

I'm not saying that it will suffice, but we definitely need this:

diff --git a/fs/iomap.c b/fs/iomap.c
index 269b24a01f32..4a671263475f 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -843,7 +843,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 		/*FALLTHRU*/
 	case IOMAP_UNWRITTEN:
 		if (!(dio->flags & IOMAP_DIO_WRITE)) {
-			iov_iter_zero(length, dio->submit.iter);
+			length = iov_iter_zero(length, dio->submit.iter);
 			dio->size += length;
 			return length;
 		}

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-08  1:04                                                                       ` Al Viro
@ 2017-09-10  1:07                                                                         ` Dave Jones
  2017-09-10  2:57                                                                           ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-09-10  1:07 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Chinner, Darrick J. Wong, Linux Kernel, linux-xfs

On Fri, Sep 08, 2017 at 02:04:41AM +0100, Al Viro wrote:
 
 > There's at least one suspicious place in iomap_dio_actor() -
 >                 if (!(dio->flags & IOMAP_DIO_WRITE)) {
 >                         iov_iter_zero(length, dio->submit.iter);
 >                         dio->size += length;
 >                         return length;
 >                 }
 > which assumes that iov_iter_zero() always succeeds.  That's very
 > much _not_ true - neither for iovec-backed, nor for pipe-backed.
 > Orangefs read_one_page() is fine (it calls that sucker for bvec-backed
 > iov_iter it's just created), but iomap_dio_actor() is not.
 > 
 > I'm not saying that it will suffice, but we definitely need this:
 > 
 > diff --git a/fs/iomap.c b/fs/iomap.c
 > index 269b24a01f32..4a671263475f 100644
 > --- a/fs/iomap.c
 > +++ b/fs/iomap.c
 > @@ -843,7 +843,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 >  		/*FALLTHRU*/
 >  	case IOMAP_UNWRITTEN:
 >  		if (!(dio->flags & IOMAP_DIO_WRITE)) {
 > -			iov_iter_zero(length, dio->submit.iter);
 > +			length = iov_iter_zero(length, dio->submit.iter);
 >  			dio->size += length;
 >  			return length;

With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
which doesn't end well...


WARNING: CPU: 3 PID: 11443 at fs/iomap.c:993 iomap_dio_rw+0x825/0x840
CPU: 3 PID: 11443 Comm: trinity-c39 Not tainted 4.13.0-think+ #9 
task: ffff880461080040 task.stack: ffff88043d720000
RIP: 0010:iomap_dio_rw+0x825/0x840
RSP: 0018:ffff88043d727730 EFLAGS: 00010286
RAX: 00000000fffffff0 RBX: ffff88044f036428 RCX: 0000000000000000
RDX: ffffed0087ae4e67 RSI: 0000000000000000 RDI: ffffed0087ae4ed7
RBP: ffff88043d727910 R08: ffff88046b4176c0 R09: 0000000000000000
R10: ffff88043d726d20 R11: 0000000000000001 R12: ffff88043d727a90
R13: 00000000027253f7 R14: 1ffff10087ae4ef4 R15: ffff88043d727c10
FS:  00007f5d8613e700(0000) GS:ffff88046b400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5d84503000 CR3: 00000004594e1000 CR4: 00000000001606e0
Call Trace:
 ? iomap_seek_data+0xb0/0xb0
 ? down_read_nested+0xd3/0x160
 ? down_read_non_owner+0x40/0x40
 ? xfs_ilock+0x3cb/0x460 [xfs]
 ? sched_clock_cpu+0x14/0xf0
 ? __lock_is_held+0x51/0xc0
 ? xfs_file_dio_aio_read+0x123/0x350 [xfs]
 xfs_file_dio_aio_read+0x123/0x350 [xfs]
 ? xfs_file_fallocate+0x550/0x550 [xfs]
 ? lock_release+0xa00/0xa00
 ? ___might_sleep.part.70+0x118/0x320
 xfs_file_read_iter+0x1b1/0x1d0 [xfs]
 do_iter_readv_writev+0x2ea/0x330
 ? vfs_dedupe_file_range+0x400/0x400
 do_iter_read+0x149/0x280
 vfs_readv+0x107/0x180
 ? vfs_iter_read+0x60/0x60
 ? fget_raw+0x10/0x10
 ? native_sched_clock+0xf9/0x1a0
 ? __fdget_pos+0xd6/0x110
 ? __fdget_pos+0xd6/0x110
 ? __fdget_raw+0x10/0x10
 ? do_readv+0xc0/0x1b0
 do_readv+0xc0/0x1b0
 ? vfs_readv+0x180/0x180
 ? mark_held_locks+0x1c/0x90
 ? do_syscall_64+0xae/0x3e0
 ? compat_rw_copy_check_uvector+0x1b0/0x1b0
 do_syscall_64+0x182/0x3e0
 ? syscall_return_slowpath+0x250/0x250
 ? rcu_read_lock_sched_held+0x90/0xa0
 ? mark_held_locks+0x1c/0x90
 ? return_from_SYSCALL_64+0x2d/0x7a
 ? trace_hardirqs_on_caller+0x17a/0x250
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f5d85a69219
RSP: 002b:00007ffdf090afd8 EFLAGS: 00000246
 ORIG_RAX: 0000000000000013
RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f5d85a69219
RDX: 00000000000000ae RSI: 0000565183cd5490 RDI: 0000000000000056
RBP: 00007ffdf090b080 R08: 0141082b00011c63 R09: 0000000000000000
R10: 00000000ffffe000 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f5d86026058 R14: 00007f5d8613e698 R15: 00007f5d86026000

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-10  1:07                                                                         ` Dave Jones
@ 2017-09-10  2:57                                                                           ` Al Viro
  2017-09-10 16:07                                                                             ` Dave Jones
  2017-09-10 21:11                                                                             ` Dave Chinner
  0 siblings, 2 replies; 73+ messages in thread
From: Al Viro @ 2017-09-10  2:57 UTC (permalink / raw)
  To: Dave Jones, Dave Chinner, Darrick J. Wong, Linux Kernel, linux-xfs

On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:

> With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
> which doesn't end well...

Different issue, and I'm not sure why that WARN_ON() is there in the
first place.  Note that in a similar situation generic_file_direct_write()
simply buggers off and lets the caller do buffered write...

iov_iter_pipe() warning is a sign of ->read_iter() on pipe-backed iov_iter
putting into the pipe more than it claims to have done.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-10  2:57                                                                           ` Al Viro
@ 2017-09-10 16:07                                                                             ` Dave Jones
  2017-09-10 20:05                                                                               ` Al Viro
  2017-09-10 21:11                                                                             ` Dave Chinner
  1 sibling, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-09-10 16:07 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Chinner, Darrick J. Wong, Linux Kernel, linux-xfs

On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
 > On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
 > 
 > > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
 > > which doesn't end well...
 > 
 > Different issue, and I'm not sure why that WARN_ON() is there in the
 > first place.  Note that in a similar situation generic_file_direct_write()
 > simply buggers off and lets the caller do buffered write...
 > 
 > iov_iter_pipe() warning is a sign of ->read_iter() on pipe-backed iov_iter
 > putting into the pipe more than it claims to have done.

(from a rerun after hitting that EBUSY warn; hence the taint)

WARNING: CPU: 0 PID: 14154 at fs/iomap.c:1055 iomap_dio_rw+0x78e/0x840
CPU: 0 PID: 14154 Comm: trinity-c33 Tainted: G        W       4.13.0-think+ #9 
task: ffff8801027e3e40 task.stack: ffff8801632d8000
RIP: 0010:iomap_dio_rw+0x78e/0x840
RSP: 0018:ffff8801632df370 EFLAGS: 00010286
RAX: 00000000fffffff0 RBX: ffff880428666428 RCX: ffffffffffffffea
RDX: ffffed002c65bdef RSI: 0000000000000000 RDI: ffffed002c65be5f
RBP: ffff8801632df550 R08: ffff88046ae176c0 R09: 0000000000000000
R10: ffff8801632de960 R11: 0000000000000001 R12: ffff8801632df7f0
R13: ffffffffffffffea R14: 1ffff1002c65be7c R15: ffff8801632df988
FS:  00007f3da2100700(0000) GS:ffff88046ae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000002f6223001 CR4: 00000000001606f0
DR0: 00007f3da1f3d000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Call Trace:
 ? iomap_seek_data+0xb0/0xb0
 ? find_inode_fast+0xd0/0xd0
 ? xfs_file_aio_write_checks+0x295/0x320 [xfs]
 ? __lock_is_held+0x51/0xc0
 ? xfs_file_dio_aio_write+0x286/0x7e0 [xfs]
 ? rcu_read_lock_sched_held+0x90/0xa0
 xfs_file_dio_aio_write+0x286/0x7e0 [xfs]
 ? xfs_file_aio_write_checks+0x320/0x320 [xfs]
 ? unwind_get_return_address+0x2f/0x50
 ? __save_stack_trace+0x92/0x100
 ? memcmp+0x45/0x70
 ? depot_save_stack+0x12e/0x480
 ? save_stack+0x89/0xb0
 ? save_stack+0x32/0xb0
 ? kasan_kmalloc+0xa0/0xd0
 ? __kmalloc+0x157/0x360
 ? iter_file_splice_write+0x154/0x760
 ? direct_splice_actor+0x86/0xa0
 ? splice_direct_to_actor+0x1c4/0x420
 ? do_splice_direct+0x173/0x1e0
 ? do_sendfile+0x3a2/0x6d0
 ? SyS_sendfile64+0xa4/0x130
 ? do_syscall_64+0x182/0x3e0
 ? entry_SYSCALL64_slow_path+0x25/0x25
 ? match_held_lock+0xa6/0x410
 ? iter_file_splice_write+0x154/0x760
 xfs_file_write_iter+0x227/0x280 [xfs]
 do_iter_readv_writev+0x267/0x330
 ? vfs_dedupe_file_range+0x400/0x400
 do_iter_write+0xd7/0x280
 ? splice_from_pipe_next.part.9+0x28/0x160
 iter_file_splice_write+0x4d5/0x760
 ? page_cache_pipe_buf_steal+0x2b0/0x2b0
 ? generic_file_splice_read+0x2e1/0x340
 ? pipe_to_user+0x80/0x80
 direct_splice_actor+0x86/0xa0
 splice_direct_to_actor+0x1c4/0x420
 ? generic_pipe_buf_nosteal+0x10/0x10
 ? do_splice_to+0xc0/0xc0
 do_splice_direct+0x173/0x1e0
 ? splice_direct_to_actor+0x420/0x420
 ? rcu_read_lock_sched_held+0x90/0xa0
 ? rcu_sync_lockdep_assert+0x43/0x70
 ? __sb_start_write+0x179/0x1e0
 do_sendfile+0x3a2/0x6d0
 ? do_compat_pwritev64+0xa0/0xa0
 ? __lock_is_held+0x2e/0xc0
 SyS_sendfile64+0xa4/0x130
 ? SyS_sendfile+0x140/0x140
 ? mark_held_locks+0x1c/0x90
 ? do_syscall_64+0xae/0x3e0
 ? SyS_sendfile+0x140/0x140
 do_syscall_64+0x182/0x3e0
 ? syscall_return_slowpath+0x250/0x250
 ? rcu_read_lock_sched_held+0x90/0xa0
 ? __context_tracking_exit.part.4+0x223/0x290
 ? mark_held_locks+0x1c/0x90
 ? return_from_SYSCALL_64+0x2d/0x7a
 ? trace_hardirqs_on_caller+0x17a/0x250
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f3da1a2b219
RSP: 002b:00007ffdd1642f38 EFLAGS: 00000246
 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 0000000000000028 RCX: 00007f3da1a2b219
RDX: 00007f3da1f3d000 RSI: 000000000000005f RDI: 0000000000000060
RBP: 00007ffdd1642fe0 R08: 30503123188dbe3f R09: ffffffffe7e7e7e7
R10: 000000000000f000 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f3da2012058 R14: 00007f3da2100698 R15: 00007f3da2012000

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-10 16:07                                                                             ` Dave Jones
@ 2017-09-10 20:05                                                                               ` Al Viro
  2017-09-10 20:07                                                                                 ` Dave Jones
  0 siblings, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-09-10 20:05 UTC (permalink / raw)
  To: Dave Jones, Dave Chinner, Darrick J. Wong, Linux Kernel, linux-xfs

On Sun, Sep 10, 2017 at 12:07:10PM -0400, Dave Jones wrote:
> On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
>  > On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
>  > 
>  > > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
>  > > which doesn't end well...
>  > 
>  > Different issue, and I'm not sure why that WARN_ON() is there in the
>  > first place.  Note that in a similar situation generic_file_direct_write()
>  > simply buggers off and lets the caller do buffered write...
>  > 
>  > iov_iter_pipe() warning is a sign of ->read_iter() on pipe-backed iov_iter
>  > putting into the pipe more than it claims to have done.
> 
> (from a rerun after hitting that EBUSY warn; hence the taint)
> 
> WARNING: CPU: 0 PID: 14154 at fs/iomap.c:1055 iomap_dio_rw+0x78e/0x840

... and that's another invalidate_inode_pages2_range() in the same
sucker.  Again, compare with generic_file_direct_write()...

I don't believe that this one has anything splice-specific to do with it.
And its only relation to iov_iter_pipe() splat is that it's in the same
fs/iomap.c...

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-10 20:05                                                                               ` Al Viro
@ 2017-09-10 20:07                                                                                 ` Dave Jones
  2017-09-10 20:33                                                                                   ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Jones @ 2017-09-10 20:07 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Chinner, Darrick J. Wong, Linux Kernel, linux-xfs

On Sun, Sep 10, 2017 at 09:05:48PM +0100, Al Viro wrote:
 > On Sun, Sep 10, 2017 at 12:07:10PM -0400, Dave Jones wrote:
 > > On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
 > >  > On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
 > >  > 
 > >  > > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
 > >  > > which doesn't end well...
 > >  > 
 > >  > Different issue, and I'm not sure why that WARN_ON() is there in the
 > >  > first place.  Note that in a similar situation generic_file_direct_write()
 > >  > simply buggers off and lets the caller do buffered write...
 > >  > 
 > >  > iov_iter_pipe() warning is a sign of ->read_iter() on pipe-backed iov_iter
 > >  > putting into the pipe more than it claims to have done.
 > > 
 > > (from a rerun after hitting that EBUSY warn; hence the taint)
 > > 
 > > WARNING: CPU: 0 PID: 14154 at fs/iomap.c:1055 iomap_dio_rw+0x78e/0x840
 > 
 > ... and that's another invalidate_inode_pages2_range() in the same
 > sucker.  Again, compare with generic_file_direct_write()...
 > 
 > I don't believe that this one has anything splice-specific to do with it.
 > And its only relation to iov_iter_pipe() splat is that it's in the same
 > fs/iomap.c...

The interesting part is that I'm hitting these two over and over now
rather than the iov_iter_pipe warning.  Could just be unlucky
randomness though..

	Dave

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-10 20:07                                                                                 ` Dave Jones
@ 2017-09-10 20:33                                                                                   ` Al Viro
  0 siblings, 0 replies; 73+ messages in thread
From: Al Viro @ 2017-09-10 20:33 UTC (permalink / raw)
  To: Dave Jones, Dave Chinner, Darrick J. Wong, Linux Kernel,
	linux-xfs, Christoph Hellwig

On Sun, Sep 10, 2017 at 04:07:24PM -0400, Dave Jones wrote:
> On Sun, Sep 10, 2017 at 09:05:48PM +0100, Al Viro wrote:
>  > On Sun, Sep 10, 2017 at 12:07:10PM -0400, Dave Jones wrote:
>  > > On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
>  > >  > On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
>  > >  > 
>  > >  > > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
>  > >  > > which doesn't end well...
>  > >  > 
>  > >  > Different issue, and I'm not sure why that WARN_ON() is there in the
>  > >  > first place.  Note that in a similar situation generic_file_direct_write()
>  > >  > simply buggers off and lets the caller do buffered write...
>  > >  > 
>  > >  > iov_iter_pipe() warning is a sign of ->read_iter() on pipe-backed iov_iter
>  > >  > putting into the pipe more than it claims to have done.
>  > > 
>  > > (from a rerun after hitting that EBUSY warn; hence the taint)
>  > > 
>  > > WARNING: CPU: 0 PID: 14154 at fs/iomap.c:1055 iomap_dio_rw+0x78e/0x840
>  > 
>  > ... and that's another invalidate_inode_pages2_range() in the same
>  > sucker.  Again, compare with generic_file_direct_write()...
>  > 
>  > I don't believe that this one has anything splice-specific to do with it.
>  > And its only relation to iov_iter_pipe() splat is that it's in the same
>  > fs/iomap.c...
> 
> The interesting part is that I'm hitting these two over and over now
> rather than the iov_iter_pipe warning.  Could just be unlucky
> randomness though..

Well, if you are still running the same reproducer and it used to hit the "read
from hole longer than the amount of space left in pipe" case, fixing the other
bug would have led to a lot more data shoved through the pipe without choking.
So the write side would be exercised more than before...

Hell knows; the question I have right now is what the devil those WARN_ON_ONCE()
are doing there.  Again, the same conditions are possible on other filesystems, only
there we don't yell; an invalidation failure before starting an O_DIRECT write is
handled by a quiet fallback to buffered IO, and one after the write is simply
ignored.

Doing those WARN_ON_ONCE() is an explicit choice in "iomap: implement direct I/O",
so it's a question to Christoph, AFAICS...
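
(The "choking" referred to earlier is the pipe running out of buffer slots -
the condition the original WARN_ON(pipe->nrbufs == pipe->buffers) checks on
the kernel side.  From user space, a full pipe is what a non-blocking writer
observes as EAGAIN; a small illustrative sketch, hypothetical helper name:)

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Fill a non-blocking pipe with page-sized writes until it reports
 * EAGAIN, returning the number of bytes it accepted.  A full pipe
 * (every pipe buffer slot in use) is exactly what a non-blocking
 * writer sees as EAGAIN. */
ssize_t fill_pipe(int *capacity_out)
{
	int fds[2];
	if (pipe2(fds, O_NONBLOCK) < 0)
		return -1;

	/* Total pipe capacity in bytes, as reported by the kernel. */
	*capacity_out = fcntl(fds[1], F_GETPIPE_SZ);

	char page[4096];
	memset(page, 'x', sizeof page);

	ssize_t total = 0;
	for (;;) {
		ssize_t n = write(fds[1], page, sizeof page);
		if (n < 0) {
			if (errno != EAGAIN)
				total = -1;	/* unexpected failure */
			break;
		}
		total += n;
	}
	close(fds[0]);
	close(fds[1]);
	return total;
}
```

With page-sized writes, each write consumes one pipe buffer, so the total
accepted should equal the advertised capacity.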

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-10  2:57                                                                           ` Al Viro
  2017-09-10 16:07                                                                             ` Dave Jones
@ 2017-09-10 21:11                                                                             ` Dave Chinner
  2017-09-10 21:19                                                                               ` Al Viro
  1 sibling, 1 reply; 73+ messages in thread
From: Dave Chinner @ 2017-09-10 21:11 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Darrick J. Wong, Linux Kernel, linux-xfs

On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
> On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
> 
> > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
> > which doesn't end well...
> 
> Different issue, and I'm not sure why that WARN_ON() is there in the
> first place.  Note that in a similar situation generic_file_direct_write()
> simply buggers off and lets the caller do buffered write...

XFS does not fall back to buffered IO when direct IO fails.  A
direct IO failure is indicative of a problem that needs to be fixed,
not papered over with a "let's hope we can hide this" fallback path.
Especially in this case - EBUSY usually comes from the app doing
something we /know/ is dangerous, and its occurrence is completely
timing dependent - if the timing is slightly different, we miss
detection and that can lead to silent data corruption. Hence if we
detect it, and our coherency guards can't fix up the coherency
problem, we issue a warning and fail the IO.

The warning is mostly there for us developers and it's been there
for years - it's so we don't end up chasing ghosts when we see that
warning in the logs. The usual vector is an app that mixes
concurrent DIO with mmap access to the same file, which we
explicitly say "don't do this because data corruption" in the
open(2) man page....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-10 21:11                                                                             ` Dave Chinner
@ 2017-09-10 21:19                                                                               ` Al Viro
  2017-09-10 22:08                                                                                 ` Dave Chinner
  0 siblings, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-09-10 21:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Dave Jones, Darrick J. Wong, Linux Kernel, linux-xfs

On Mon, Sep 11, 2017 at 07:11:10AM +1000, Dave Chinner wrote:
> On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
> > On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
> > 
> > > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
> > > which doesn't end well...
> > 
> > Different issue, and I'm not sure why that WARN_ON() is there in the
> > first place.  Note that in a similar situation generic_file_direct_write()
> > simply buggers off and lets the caller do buffered write...
> 
> XFS does not fall back to buffered IO when direct IO fails.  A
> direct IO failure is indicative of a problem that needs to be fixed,
> not use a "let's hope we can hide this" fallback path. Especially in
> this case - EBUSY usually comes from the app is doing something we
> /know/ is dangerous and it's occurrence to completely timing
> dependent - if the timing is slightly different, we miss detection
> and that can lead to silent data corruption.

In this case the app is a fuzzer, which is bloody well supposed to poke
into all kinds of odd usage patterns, though...

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-10 21:19                                                                               ` Al Viro
@ 2017-09-10 22:08                                                                                 ` Dave Chinner
  2017-09-10 23:07                                                                                   ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Chinner @ 2017-09-10 22:08 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Darrick J. Wong, Linux Kernel, linux-xfs

On Sun, Sep 10, 2017 at 10:19:07PM +0100, Al Viro wrote:
> On Mon, Sep 11, 2017 at 07:11:10AM +1000, Dave Chinner wrote:
> > On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
> > > On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
> > > 
> > > > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
> > > > which doesn't end well...
> > > 
> > > Different issue, and I'm not sure why that WARN_ON() is there in the
> > > first place.  Note that in a similar situation generic_file_direct_write()
> > > simply buggers off and lets the caller do buffered write...
> > 
> > XFS does not fall back to buffered IO when direct IO fails.  A
> > direct IO failure is indicative of a problem that needs to be fixed,
> > not use a "let's hope we can hide this" fallback path. Especially in
> > this case - EBUSY usually comes from the app is doing something we
> > /know/ is dangerous and it's occurrence to completely timing
> > dependent - if the timing is slightly different, we miss detection
> > and that can lead to silent data corruption.
> 
> In this case app is a fuzzer, which is bloody well supposed to poke
> into all kinds of odd usage patterns, though...

Yup, and we have quite a few tests in xfstests that specifically
exercise this same dark corner. We filter out these warnings from
the xfstests that exercise this case, though, because we know they
are going to be emitted and so aren't a sign of test failures...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-10 22:08                                                                                 ` Dave Chinner
@ 2017-09-10 23:07                                                                                   ` Al Viro
  2017-09-10 23:15                                                                                     ` Al Viro
                                                                                                       ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Al Viro @ 2017-09-10 23:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Dave Jones, Darrick J. Wong, Linux Kernel, linux-xfs

On Mon, Sep 11, 2017 at 08:08:14AM +1000, Dave Chinner wrote:
> On Sun, Sep 10, 2017 at 10:19:07PM +0100, Al Viro wrote:
> > On Mon, Sep 11, 2017 at 07:11:10AM +1000, Dave Chinner wrote:
> > > On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
> > > > On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
> > > > 
> > > > > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
> > > > > which doesn't end well...
> > > > 
> > > > Different issue, and I'm not sure why that WARN_ON() is there in the
> > > > first place.  Note that in a similar situation generic_file_direct_write()
> > > > simply buggers off and lets the caller do buffered write...
> > > 
> > > XFS does not fall back to buffered IO when direct IO fails.  A
> > > direct IO failure is indicative of a problem that needs to be fixed,
> > > not use a "let's hope we can hide this" fallback path. Especially in
> > > this case - EBUSY usually comes from the app is doing something we
> > > /know/ is dangerous and it's occurrence to completely timing
> > > dependent - if the timing is slightly different, we miss detection
> > > and that can lead to silent data corruption.
> > 
> > In this case app is a fuzzer, which is bloody well supposed to poke
> > into all kinds of odd usage patterns, though...
> 
> Yup, and we have quite a few tests in xfstests that specifically
> exercise this same dark corner. We filter out these warnings from
> the xfstests that exercise this case, though, because we know they
> are going to be emitted and so aren't a sign of test failures...

BTW, another problem I see there is that iomap_dio_actor() should *NOT*
assume that do-while loop in there will always manage to shove 'length'
bytes out in case of success.  That is simply not true for pipe-backed
destination.  And I'm not sure if outright failures halfway through
are handled correctly.  What does it need a copy of dio->submit.iter for,
anyway?  Why not work with dio->submit.iter directly?
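
The "won't always manage to shove 'length' bytes" point is observable from
user space too: splicing more bytes than an undrained pipe can hold returns
a short count rather than the requested length.  A sketch with a hypothetical
helper name:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* A pipe-backed destination can accept less than the caller asked for:
 * splice(2) of more bytes than the pipe can hold returns a short count,
 * not 'ask'.  This is the case a loop that assumes it always transfers
 * the full length gets wrong.  Returns splice's (short) result. */
ssize_t short_splice(const char *path, size_t ask)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	int fds[2];
	if (pipe(fds) < 0) {
		close(fd);
		return -1;
	}
	/* Nobody drains fds[0], so at most one pipe-full can move. */
	ssize_t n = splice(fd, NULL, fds[1], NULL, ask, SPLICE_F_NONBLOCK);
	close(fd);
	close(fds[0]);
	close(fds[1]);
	return n;
}
```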

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-10 23:07                                                                                   ` Al Viro
@ 2017-09-10 23:15                                                                                     ` Al Viro
  2017-09-11  0:31                                                                                     ` Dave Chinner
  2017-09-11 12:07                                                                                     ` Christoph Hellwig
  2 siblings, 0 replies; 73+ messages in thread
From: Al Viro @ 2017-09-10 23:15 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dave Jones, Darrick J. Wong, Linux Kernel, linux-xfs, Christoph Hellwig

On Mon, Sep 11, 2017 at 12:07:23AM +0100, Al Viro wrote:

> BTW, another problem I see there is that iomap_dio_actor() should *NOT*
> assume that do-while loop in there will always manage to shove 'length'
> bytes out in case of success.  That is simply not true for pipe-backed
> destination.  And I'm not sure if outright failures halfway through
> are handled correctly.  What does it need a copy of dio->submit.iter for,
> anyway?  Why not work with dio->submit.iter directly?

I mean, if it's just a matter of iov_iter_truncate() to be undone in
the end, that's not hard to do - iov_iter_reexpand() is there.  Or is there
something more subtle in the play?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-10 23:07                                                                                   ` Al Viro
  2017-09-10 23:15                                                                                     ` Al Viro
@ 2017-09-11  0:31                                                                                     ` Dave Chinner
  2017-09-11  3:32                                                                                       ` Al Viro
  2017-09-11 12:07                                                                                     ` Christoph Hellwig
  2 siblings, 1 reply; 73+ messages in thread
From: Dave Chinner @ 2017-09-11  0:31 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Darrick J. Wong, Linux Kernel, linux-xfs

On Mon, Sep 11, 2017 at 12:07:23AM +0100, Al Viro wrote:
> On Mon, Sep 11, 2017 at 08:08:14AM +1000, Dave Chinner wrote:
> > On Sun, Sep 10, 2017 at 10:19:07PM +0100, Al Viro wrote:
> > > On Mon, Sep 11, 2017 at 07:11:10AM +1000, Dave Chinner wrote:
> > > > On Sun, Sep 10, 2017 at 03:57:21AM +0100, Al Viro wrote:
> > > > > On Sat, Sep 09, 2017 at 09:07:56PM -0400, Dave Jones wrote:
> > > > > 
> > > > > > With this in place, I'm still seeing -EBUSY from invalidate_inode_pages2_range
> > > > > > which doesn't end well...
> > > > > 
> > > > > Different issue, and I'm not sure why that WARN_ON() is there in the
> > > > > first place.  Note that in a similar situation generic_file_direct_write()
> > > > > simply buggers off and lets the caller do buffered write...
> > > > 
> > > > XFS does not fall back to buffered IO when direct IO fails.  A
> > > > direct IO failure is indicative of a problem that needs to be fixed,
> > > > not use a "let's hope we can hide this" fallback path. Especially in
> > > > this case - EBUSY usually comes from the app is doing something we
> > > > /know/ is dangerous and it's occurrence to completely timing
> > > > dependent - if the timing is slightly different, we miss detection
> > > > and that can lead to silent data corruption.
> > > 
> > > In this case app is a fuzzer, which is bloody well supposed to poke
> > > into all kinds of odd usage patterns, though...
> > 
> > Yup, and we have quite a few tests in xfstests that specifically
> > exercise this same dark corner. We filter out these warnings from
> > the xfstests that exercise this case, though, because we know they
> > are going to be emitted and so aren't a sign of test failures...
> 
> BTW, another problem I see there is that iomap_dio_actor() should *NOT*
> assume that do-while loop in there will always manage to shove 'length'
> bytes out in case of success.  That is simply not true for pipe-backed
> destination.

splice does not go down the direct IO path, so iomap_dio_actor()
should never be handed a pipe as the destination for the IO data.
Indeed, splice read has to supply the pages to be put into the pipe,
which the DIO path does not do - it requires pages to be supplied to
it. So I'm not sure why we'd care about pipe destination limitations
in the DIO path?

> And I'm not sure if outright failures halfway through
> are handled correctly.  What does it need a copy of dio->submit.iter for,
> anyway?  Why not work with dio->submit.iter directly?

No idea - that's a question for Christoph...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-11  0:31                                                                                     ` Dave Chinner
@ 2017-09-11  3:32                                                                                       ` Al Viro
  2017-09-11  6:44                                                                                         ` Dave Chinner
  0 siblings, 1 reply; 73+ messages in thread
From: Al Viro @ 2017-09-11  3:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Dave Jones, Darrick J. Wong, Linux Kernel, linux-xfs

On Mon, Sep 11, 2017 at 10:31:13AM +1000, Dave Chinner wrote:

> splice does not go down the direct IO path, so iomap_dio_actor()
> should never be handled a pipe as the destination for the IO data.
> Indeed, splice read has to supply the pages to be put into the pipe,
> which the DIO path does not do - it requires pages be supplied to
> it. So I'm not sure why we'd care about pipe destination limitations
> in the DIO path?

splice doesn't give a rat's arse for direct IO; it's up to filesystem.
generic_file_splice_read() simply sets up a pipe-backed iov_iter and
calls ->read_iter(), period.

iov_iter_get_pages() for pipe-backed destination does page allocation
and inserts freshly allocated pages into pipe.  copy_to_iter() does
the same + copies data; copy_page_to_iter() grabs an extra reference
to page and inserts it into pipe, not that O_DIRECT ->read_iter()
had been likely to use the last one.

Normally O_DIRECT would work just fine - pages get allocated, references
to them put into pipe cyclic buffer *and* into a number of bio, bio
would get submitted and once the IO is completed we unlock the pipe,
making those pages available for readers.

With minimal care it works just fine - all you really need is
	* cope with failing copy_to_... / iov_iter_get_pages().
Short read if we'd already gotten something, -EFAULT otherwise.
That goes for pipe-backed same as for iovec-backed - any ->read_iter()
that fails to handle that is already in trouble.
	* make sure that iov_iter_get_pages()/iov_iter_get_pages_alloc()
is followed by iov_iter_advance() for the amount you've actually filled,
before any subsequent copy_to_iter()/copy_page_to_iter() or return
from ->read_iter(), whichever comes first.  That includes the situation
when you actually hadn't filled anything at all - just remember to
do iov_iter_advance(to, 0) in that case.  That's about the only
extra requirement imposed by pipes and it's not hard to satisfy.
Combination of iov_iter_advance() with iov_iter_revert() works as
usual.

Normally a filesystem doesn't need to care about splice at all -
just use generic_file_splice_read() and be done with that.
It will use the normal ->read_iter(), with whatever locking, etc.,
your filesystem would do on a normal read.
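
The behaviour described above is easy to observe from user space: splicing
from a regular file into a pipe goes through generic_file_splice_read(),
which sets up a pipe-backed iov_iter and calls the filesystem's ordinary
->read_iter().  A small sketch (hypothetical helper name, buffered IO only):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* splice(2) from a regular file into a pipe exercises the filesystem's
 * normal ->read_iter() against a pipe-backed iov_iter.  The data is then
 * read back out of the pipe into buf; returns the byte count, -1 on error. */
ssize_t splice_file_to_buf(const char *path, char *buf, size_t len)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	int fds[2];
	if (pipe(fds) < 0) {
		close(fd);
		return -1;
	}

	/* ->read_iter() fills pipe buffers here; a short count is legal
	 * and must match what was actually put into the pipe. */
	ssize_t spliced = splice(fd, NULL, fds[1], NULL, len, 0);
	close(fd);
	close(fds[1]);

	ssize_t got = -1;
	if (spliced >= 0)
		got = read(fds[0], buf, len);
	close(fds[0]);
	return got;
}
```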

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-11  3:32                                                                                       ` Al Viro
@ 2017-09-11  6:44                                                                                         ` Dave Chinner
  2017-09-11 20:07                                                                                           ` Al Viro
  0 siblings, 1 reply; 73+ messages in thread
From: Dave Chinner @ 2017-09-11  6:44 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Darrick J. Wong, Linux Kernel, linux-xfs

On Mon, Sep 11, 2017 at 04:32:22AM +0100, Al Viro wrote:
> On Mon, Sep 11, 2017 at 10:31:13AM +1000, Dave Chinner wrote:
> 
> > splice does not go down the direct IO path, so iomap_dio_actor()
> > should never be handled a pipe as the destination for the IO data.
> > Indeed, splice read has to supply the pages to be put into the pipe,
> > which the DIO path does not do - it requires pages be supplied to
> > it. So I'm not sure why we'd care about pipe destination limitations
> > in the DIO path?
> 
> splice doesn't give a rat's arse for direct IO; it's up to filesystem.

[....]

It's news to me that splice works on direct IO - I thought it still
required page-cache-based IO for file data for this stuff to work.
I must have missed the memo saying that splice interfaces now work
on O_DIRECT fds; there's certainly no documentation or comments in
the code I could have read to find this out myself...

As it is, we have very little test coverage for splice interfaces,
and all that I am aware of assumes that sendfile/splice only works
for buffered IO. So I'm not surprised there are bugs in this code,
it's likely to be completely untested.

I'm guessing the warnings are being thrown because sendfile's
source and/or destination is opened O_DIRECT and something else has
also mmap()d the same files and is doing concurrent sendfile/splice
and page faults. Hell, it could even be sendfile to/from the same
file mixing buffered and direct IO, or perhaps vmsplice of a mapped
range of the same file it's using as the O_DIRECT destination fd.
None of which are sane things to do and fall under the "not
supported" category....

> iov_iter_get_pages() for pipe-backed destination does page allocation
> and inserts freshly allocated pages into pipe.

Oh, it's hidden more layers down than the code implied I needed to
look.

i.e. there's no obvious clue in the function names that there is
allocation happening in these paths (get_pipe_pages ->
__get_pipe_pages -> push_pipe -> page allocation). The function
names imply they're getting a reference to existing pages (like
get_user_pages()), and the fact they do allocation is inconsistent
with their naming.  Worse, when push_pipe() fails to allocate pages,
the error __get_pipe_pages() returns is -EFAULT, which further hides
the fact that push_pipe() does memory allocation that can fail....

And then there's the companion interface that implies page
allocation: pipe_get_pages_alloc(). Which brings me back to there
being no obvious clue while reading the code from the top down that
pages are being allocated in push_pipe()....

Comments and documentation for this code would help, but I can't
find any of that, either. Hence I assumed naming followed familiar
patterns and so mistook these interfaces being one that does page
allocation and the other for getting references to pre-existing
pages.....

[snip]

> Normally a filesystem doesn't need to care about splice at all -
> just use generic_file_splice_read() and be done with that.
> It will use the normal ->read_iter(), with whatever locking, etc.,
> your filesystem would do on a normal read.

Yup, that's my point - this is exactly what XFS does, and so I had
no clue that the generic splice code had been changed to accept and
use O_DIRECT semantics because no filesystem code was changed to
enable it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-10 23:07                                                                                   ` Al Viro
  2017-09-10 23:15                                                                                     ` Al Viro
  2017-09-11  0:31                                                                                     ` Dave Chinner
@ 2017-09-11 12:07                                                                                     ` Christoph Hellwig
  2017-09-11 12:51                                                                                       ` Al Viro
  2 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2017-09-11 12:07 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Chinner, Dave Jones, Darrick J. Wong, Linux Kernel, linux-xfs

On Mon, Sep 11, 2017 at 12:07:23AM +0100, Al Viro wrote:
> BTW, another problem I see there is that iomap_dio_actor() should *NOT*
> assume that do-while loop in there will always manage to shove 'length'
> bytes out in case of success.  That is simply not true for pipe-backed
> destination.  And I'm not sure if outright failures halfway through
> are handled correctly.  What does it need a copy of dio->submit.iter for,
> anyway?  Why not work with dio->submit.iter directly?
> --

So that we only walk the pagetables and pin down the pages that
we can actually use in this iteration.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-11 12:07                                                                                     ` Christoph Hellwig
@ 2017-09-11 12:51                                                                                       ` Al Viro
  0 siblings, 0 replies; 73+ messages in thread
From: Al Viro @ 2017-09-11 12:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Dave Jones, Darrick J. Wong, Linux Kernel, linux-xfs

On Mon, Sep 11, 2017 at 05:07:57AM -0700, Christoph Hellwig wrote:
> On Mon, Sep 11, 2017 at 12:07:23AM +0100, Al Viro wrote:
> > BTW, another problem I see there is that iomap_dio_actor() should *NOT*
> > assume that do-while loop in there will always manage to shove 'length'
> > bytes out in case of success.  That is simply not true for pipe-backed
> > destination.  And I'm not sure if outright failures halfway through
> > are handled correctly.  What does it need a copy of dio->submit.iter for,
> > anyway?  Why not work with dio->submit.iter directly?
> > --
> 
> So that we only walk the pagetables and pin down the pages that
> we can actually use in this iteration.

Er...  So why not simply do iov_iter_reexpand() at the end of the segment with the
right argument?  IDGI...

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-11  6:44                                                                                         ` Dave Chinner
@ 2017-09-11 20:07                                                                                           ` Al Viro
  2017-09-11 20:17                                                                                             ` Al Viro
  2017-09-12  6:02                                                                                             ` Dave Chinner
  0 siblings, 2 replies; 73+ messages in thread
From: Al Viro @ 2017-09-11 20:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Dave Jones, Darrick J. Wong, Linux Kernel, linux-xfs

On Mon, Sep 11, 2017 at 04:44:40PM +1000, Dave Chinner wrote:

> > iov_iter_get_pages() for pipe-backed destination does page allocation
> > and inserts freshly allocated pages into pipe.
> 
> Oh, it's hidden more layers down than the code implied I needed to
> look.
> 
> i.e. there's no obvious clue in the function names that there is
> allocation happening in these paths (get_pipe_pages ->
> __get_pipe_pages -> push_pipe -> page allocation). The function
> names imply it's getting a reference to pages (like
> (get_user_pages()) and the fact it does allocation is inconsistent
> with it's naming.  Worse, when push_pipe() fails to allocate pages,
> the error __get_pipe_pages() returns is -EFAULT, which further hides
> the fact push_pipe() does memory allocation that can fail....
> 
> And then there's the companion interface that implies page
> allocation: pipe_get_pages_alloc(). Which brings me back to there
> being no obvious clue while reading the code from the top down that
> pages are being allocated in push_pipe()....
> 
> Comments and documentation for this code would help, but I can't
> find any of that, either. Hence I assumed naming followed familiar
> patterns and so mistook these interfaces being one that does page
> allocation and the other for getting references to pre-existing
> pages.....

_NONE_ of those is a public interface - they are all static, to start
with.

The whole damn point is to have normal ->read_iter() work for read-to-pipe
without changes.  That's why -EFAULT as error (rather than some other
mechanism for saying that pipe is full), etc.

Filesystem should *not* be changed to use that.  At all.  As far as it is
concerned,
	copy_to_iter()
	copy_page_to_iter()
	iov_iter_get_pages()
	iov_iter_get_pages_alloc()
	iov_iter_advance()
are black boxes.

Note that one of the bugs there applies to normal read() as well - if you
are reading from a hole in a file into an array with a read-only page in
the middle, you want a short read.  Ignoring the return value of iov_iter_zero()
is wrong for the iovec-backed case as well as for pipes.
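
That read-only-page-in-the-middle case can be exercised directly with
readv(2).  A sketch (hypothetical helper, buffered read from a regular
file): the kernel must either stop short at the writable/read-only boundary
or fail with EFAULT - it must never report the full count.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

/* readv(2) into an iovec whose second segment is a read-only page
 * cannot complete: the copy into the second segment faults, so the
 * kernel returns either the bytes already copied into the writable
 * first segment (a short read) or -1/EFAULT.  Returns readv's result
 * for a two-page read from the given file. */
ssize_t read_into_partially_readonly(const char *path)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	char *writable = malloc(4096);
	char *readonly = mmap(NULL, 4096, PROT_READ,
			      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (!writable || readonly == MAP_FAILED) {
		close(fd);
		free(writable);
		return -1;
	}

	struct iovec iov[2] = {
		{ .iov_base = writable, .iov_len = 4096 },
		{ .iov_base = readonly, .iov_len = 4096 },  /* copy faults */
	};
	ssize_t rv = readv(fd, iov, 2);

	munmap(readonly, 4096);
	free(writable);
	close(fd);
	return rv;
}
```
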

Another one manages to work for the iovec-backed case, albeit with rather odd
resulting semantics.  readv(2) is underspecified (to put it politely) enough
for compliance, but it's still bloody strange.  Namely, if you have a contiguous
50Mb chunk of file on disk and run into e.g. a failure to fault the destination
pages in halfway through that extent, you act as if *nothing* in the damn thing
had been read, nevermind that 25Mb had been actually already read and that had
there been a discontinuity 5Mb prior, the first 20Mb would've been reported
read just fine.

Strictly speaking that behaviour doesn't violate POSIX.  It is, however,
atrocious from the QoI standpoint, and for no good reason whatsoever.
It's quite easy to do better, and doing so would've eliminated the problems
in pipe-backed case as well (see below).  In addition to that, I would
consider teaching bio_iov_iter_get_pages() to take the maximal bio size
as an explicit argument.  That would've killed the need of copying the
iterator and calling iov_iter_advance() in iomap_dio_actor() at all.
Anyway, the minimal candidate fix follows; it won't do anything about
the WARN_ON() in there, seeing that those are deliberate "program is
doing something bogus" things, but it should eliminate all crap with
->splice_read() misreporting the amount of data it has copied.

diff --git a/fs/iomap.c b/fs/iomap.c
index 269b24a01f32..836fe27b00e2 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -832,6 +832,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 	struct bio *bio;
 	bool need_zeroout = false;
 	int nr_pages, ret;
+	size_t copied = 0;
 
 	if ((pos | length | align) & ((1 << blkbits) - 1))
 		return -EINVAL;
@@ -843,7 +844,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 		/*FALLTHRU*/
 	case IOMAP_UNWRITTEN:
 		if (!(dio->flags & IOMAP_DIO_WRITE)) {
-			iov_iter_zero(length, dio->submit.iter);
+			length = iov_iter_zero(length, dio->submit.iter);
 			dio->size += length;
 			return length;
 		}
@@ -880,6 +881,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 	}
 
 	do {
+		size_t n;
 		if (dio->error)
 			return 0;
 
@@ -897,17 +899,21 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 			return ret;
 		}
 
+		n = bio->bi_iter.bi_size;
 		if (dio->flags & IOMAP_DIO_WRITE) {
 			bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
-			task_io_account_write(bio->bi_iter.bi_size);
+			task_io_account_write(n);
 		} else {
 			bio_set_op_attrs(bio, REQ_OP_READ, 0);
 			if (dio->flags & IOMAP_DIO_DIRTY)
 				bio_set_pages_dirty(bio);
 		}
 
-		dio->size += bio->bi_iter.bi_size;
-		pos += bio->bi_iter.bi_size;
+		iov_iter_advance(dio->submit.iter, n);
+
+		dio->size += n;
+		pos += n;
+		copied += n;
 
 		nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
 
@@ -923,9 +929,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 		if (pad)
 			iomap_dio_zero(dio, iomap, pos, fs_block_size - pad);
 	}
-
-	iov_iter_advance(dio->submit.iter, length);
-	return length;
+	return copied;
 }
 
 ssize_t

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-11 20:07                                                                                           ` Al Viro
@ 2017-09-11 20:17                                                                                             ` Al Viro
  2017-09-12  6:02                                                                                             ` Dave Chinner
  1 sibling, 0 replies; 73+ messages in thread
From: Al Viro @ 2017-09-11 20:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Dave Jones, Darrick J. Wong, Linux Kernel, linux-xfs

On Mon, Sep 11, 2017 at 09:07:13PM +0100, Al Viro wrote:
> Strictly speaking that behaviour doesn't violate POSIX.  It is, however,
> atrocious from the QoI standpoint, and for no good reason whatsoever.
> It's quite easy to do better, and doing so would've eliminated the problems
> in pipe-backed case as well (see below).  In addition to that, I would
> consider teaching bio_iov_iter_get_pages() to take the maximal bio size
> as an explicit argument.  That would've killed the need of copying the
> iterator and calling iov_iter_advance() in iomap_dio_actor() at all.
> Anyway, the minimal candidate fix follows; it won't do anything about
> the WARN_ON() in there, seeing that those are deliberate "program is
> doing something bogus" things, but it should eliminate all crap with
> ->splice_read() misreporting the amount of data it has copied.

... and after minimal testing and fixing a braino in "found an IO error"
case, that's

diff --git a/fs/iomap.c b/fs/iomap.c
index 269b24a01f32..012e1f247e13 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -832,6 +832,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 	struct bio *bio;
 	bool need_zeroout = false;
 	int nr_pages, ret;
+	size_t copied = 0;
 
 	if ((pos | length | align) & ((1 << blkbits) - 1))
 		return -EINVAL;
@@ -843,7 +844,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 		/*FALLTHRU*/
 	case IOMAP_UNWRITTEN:
 		if (!(dio->flags & IOMAP_DIO_WRITE)) {
-			iov_iter_zero(length, dio->submit.iter);
+			length = iov_iter_zero(length, dio->submit.iter);
 			dio->size += length;
 			return length;
 		}
@@ -880,8 +881,11 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 	}
 
 	do {
-		if (dio->error)
+		size_t n;
+		if (dio->error) {
+			iov_iter_revert(dio->submit.iter, copied);
 			return 0;
+		}
 
 		bio = bio_alloc(GFP_KERNEL, nr_pages);
 		bio_set_dev(bio, iomap->bdev);
@@ -894,20 +898,24 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 		ret = bio_iov_iter_get_pages(bio, &iter);
 		if (unlikely(ret)) {
 			bio_put(bio);
-			return ret;
+			return copied ? copied : ret;
 		}
 
+		n = bio->bi_iter.bi_size;
 		if (dio->flags & IOMAP_DIO_WRITE) {
 			bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
-			task_io_account_write(bio->bi_iter.bi_size);
+			task_io_account_write(n);
 		} else {
 			bio_set_op_attrs(bio, REQ_OP_READ, 0);
 			if (dio->flags & IOMAP_DIO_DIRTY)
 				bio_set_pages_dirty(bio);
 		}
 
-		dio->size += bio->bi_iter.bi_size;
-		pos += bio->bi_iter.bi_size;
+		iov_iter_advance(dio->submit.iter, n);
+
+		dio->size += n;
+		pos += n;
+		copied += n;
 
 		nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
 
@@ -923,9 +931,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 		if (pad)
 			iomap_dio_zero(dio, iomap, pos, fs_block_size - pad);
 	}
-
-	iov_iter_advance(dio->submit.iter, length);
-	return length;
+	return copied;
 }
 
 ssize_t

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-11 20:07                                                                                           ` Al Viro
  2017-09-11 20:17                                                                                             ` Al Viro
@ 2017-09-12  6:02                                                                                             ` Dave Chinner
  2017-09-12 11:13                                                                                               ` Al Viro
  1 sibling, 1 reply; 73+ messages in thread
From: Dave Chinner @ 2017-09-12  6:02 UTC (permalink / raw)
  To: Al Viro; +Cc: Dave Jones, Darrick J. Wong, Linux Kernel, linux-xfs

On Mon, Sep 11, 2017 at 09:07:13PM +0100, Al Viro wrote:
> On Mon, Sep 11, 2017 at 04:44:40PM +1000, Dave Chinner wrote:
> 
> > > iov_iter_get_pages() for pipe-backed destination does page allocation
> > > and inserts freshly allocated pages into pipe.
> > 
> > Oh, it's hidden more layers down than the code implied I needed to
> > look.
> > 
> > i.e. there's no obvious clue in the function names that there is
> > allocation happening in these paths (get_pipe_pages ->
> > __get_pipe_pages -> push_pipe -> page allocation). The function
> > names imply it's getting a reference to pages (like
> > (get_user_pages()) and the fact it does allocation is inconsistent
> > with it's naming.  Worse, when push_pipe() fails to allocate pages,
> > the error __get_pipe_pages() returns is -EFAULT, which further hides
> > the fact push_pipe() does memory allocation that can fail....
> > 
> > And then there's the companion interface that implies page
> > allocation: pipe_get_pages_alloc(). Which brings me back to there
> > being no obvious clue while reading the code from the top down that
> > pages are being allocated in push_pipe()....
> > 
> > Comments and documentation for this code would help, but I can't
> > find any of that, either. Hence I assumed naming followed familiar
> > patterns and so mistook these interfaces being one that does page
> > allocation and the other for getting references to pre-existing
> > pages.....
> 
> _NONE_ of those is a public interface - they are all static, to start
> with.

It still requires comments to explain *why* the code is doing what
it's doing. You wrote the code, the whole explanation of it is in
your head. I can't see any of that, so when I read the code I sit
there thinking "why the fuck is it doing this?" because there's no
explanations of the WTF? moments in the code...

> The whole damn point is to have normal ->read_iter() work for read-to-pipe
> without changes.  That's why -EFAULT as error (rather than some other
> mechanism for saying that pipe is full), etc.

... like this one.

That needs a *fucking big comment* because it's not at all obvious
why ENOMEM conditions are being hidden with EFAULT.

Comments and documentation are not for the person who writes the
code - they are for the stupid morons like me that need all the
help they can get to understand complex code that does tricksy,
subtle, non-obvious shit to work correctly.

> Filesystem should *not* be changed to use that.  At all.  As far as it is
> concerned,
> 	copy_to_iter()
> 	copy_page_to_iter()
> 	iov_iter_get_pages()
> 	iov_iter_get_pages_alloc()
> 	iov_iter_advance()
> are black boxes.

The implementation may be a black box, but the operations the black
box is performing for the callers still need to be explained.

> Note that one of the bugs there applies to normal read() as well - if you
> are reading from a hole in file into an array with a read-only page in
> the middle, you want a short read.

And there's another WTF? moment.....

How do we get a read only page in the middle of an array of pages
we've been told to write data into? And why isn't that a bug in the
code that supplied us with those pages?

> Ignoring return value from iov_iter_zero()
> is wrong for iovec-backed case as well as for pipes.
> 
> Another one manages to work for iovec-backed case, albeit with rather odd
> resulting semantics.  readv(2) is underspecified (to put it politely) enough
> for compliance, but it's still bloody strange.  Namely, if you have a contiguous
> 50Mb chunk of file on disk and run into e.g. a failure to fault the destination
> pages in halfway through that extent, you act as if *nothing* in the damn thing
> had been read, nevermind that 25Mb had been actually already read and that had
> there been a discontinuity 5Mb prior, the first 20Mb would've been reported
> read just fine.
> 
> Strictly speaking that behaviour doesn't violate POSIX.

This is direct IO. POSIX compliant behaviour went out the window
long ago..... :/

> It is, however,
> atrocious from the QoI standpoint, and for no good reason whatsoever.
> It's quite easy to do better, and doing so would've eliminated the problems
> in pipe-backed case as well (see below).  In addition to that, I would
> consider teaching bio_iov_iter_get_pages() to take the maximal bio size
> as an explicit argument.  That would've killed the need of copying the
> iterator and calling iov_iter_advance() in iomap_dio_actor() at all.
> Anyway, the minimal candidate fix follows; it won't do anything about
> the WARN_ON() in there, seeing that those are deliberate "program is
> doing something bogus" things, but it should eliminate all crap with
> ->splice_read() misreporting the amount of data it has copied.

I'll run your updated patch through my testing, but seeing as I have
nothing that tests splice+direct IO I'm not going to be able to test
that right now. I have slightly more important things to that need
urgent attention than writing splice+DIO test cases....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: iov_iter_pipe warning.
  2017-09-12  6:02                                                                                             ` Dave Chinner
@ 2017-09-12 11:13                                                                                               ` Al Viro
  0 siblings, 0 replies; 73+ messages in thread
From: Al Viro @ 2017-09-12 11:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Dave Jones, Darrick J. Wong, Linux Kernel, linux-xfs

On Tue, Sep 12, 2017 at 04:02:14PM +1000, Dave Chinner wrote:

> > Note that one of the bugs there applies to normal read() as well - if you
> > are reading from a hole in file into an array with a read-only page in
> > the middle, you want a short read.
> 
> And there's another WTF? moment.....
> 
> How do we get a read only page in the middle of an array of pages
> we've been told to write data into? And why isn't that a bug in the
> code that supplied us with those pages?

Sorry, I'd been unclear - I'm talking about read(2) or readv(2) called by
userland with a read-only piece in the middle of a (user-supplied) buffer.
Either due to mprotect() or simply with one of the iovecs passed to readv(2)
having ->iov_base set to some read-only area.

It may be a bug in userland code, but when handling that error is as trivial
as "don't assume iov_iter_zero(to, n) will return n, use the actual return
value", resorting to "the userland code was probably buggy and it's O_DIRECT,
so we can do whatever we want" looks wrong.

^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2017-09-12 11:13 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-21 20:59 iov_iter_pipe warning Dave Jones
2017-04-05 22:02 ` Dave Jones
2017-04-10 19:28 ` Al Viro
2017-04-10 19:42   ` Dave Jones
2017-04-10 19:57     ` Al Viro
2017-04-10 23:48       ` Dave Jones
2017-04-11  0:22         ` Al Viro
2017-04-11  3:05           ` Dave Jones
2017-04-11  3:28             ` Al Viro
2017-04-11 20:53               ` Dave Jones
2017-04-11 21:12                 ` Al Viro
2017-04-11 22:25                   ` Dave Jones
2017-04-11 23:28                     ` Al Viro
2017-04-11 23:34                       ` Dave Jones
2017-04-11 23:48                         ` Al Viro
2017-04-11 23:45                       ` Dave Jones
2017-04-11 23:51                         ` Al Viro
2017-04-11 23:56                           ` Al Viro
2017-04-12  0:06                             ` Dave Jones
2017-04-12  0:17                               ` Al Viro
2017-04-12  0:58                                 ` Dave Jones
2017-04-12  1:15                                   ` Al Viro
2017-04-12  2:29                                     ` Dave Jones
2017-04-12  2:58                                       ` Al Viro
2017-04-12 14:35                                         ` Dave Jones
2017-04-12 15:26                                           ` Al Viro
2017-04-12 16:27                                             ` Dave Jones
2017-04-12 17:07                                               ` Al Viro
2017-04-12 19:03                                                 ` Dave Jones
2017-04-21 17:54                                                   ` Al Viro
2017-04-27  4:19                                                     ` Dave Jones
2017-04-27 16:34                                                       ` Dave Jones
2017-04-27 17:39                                                         ` Al Viro
2017-04-28 15:29                                                     ` Dave Jones
2017-04-28 16:43                                                       ` Al Viro
2017-04-28 16:50                                                         ` Dave Jones
2017-04-28 17:20                                                           ` Al Viro
2017-04-28 18:25                                                             ` Al Viro
2017-04-29  1:58                                                               ` Dave Jones
2017-04-29  2:47                                                                 ` Al Viro
2017-04-29 15:51                                                                   ` Dave Jones
2017-04-29 20:46                                                                     ` [git pull] vfs.git fix (Re: iov_iter_pipe warning.) Al Viro
2017-08-07 20:18                                                             ` iov_iter_pipe warning Dave Jones
2017-08-28 20:31                                                               ` Dave Jones
2017-08-29  4:25                                                                 ` Darrick J. Wong
2017-08-30 17:05                                                                   ` Dave Jones
2017-08-30 17:13                                                                     ` Darrick J. Wong
2017-08-30 17:17                                                                       ` Dave Jones
2017-09-06 20:03                                                                   ` Dave Jones
2017-09-06 23:46                                                                     ` Dave Chinner
2017-09-07  3:48                                                                       ` Dave Jones
2017-09-07  4:33                                                                         ` Al Viro
2017-09-08  1:04                                                                       ` Al Viro
2017-09-10  1:07                                                                         ` Dave Jones
2017-09-10  2:57                                                                           ` Al Viro
2017-09-10 16:07                                                                             ` Dave Jones
2017-09-10 20:05                                                                               ` Al Viro
2017-09-10 20:07                                                                                 ` Dave Jones
2017-09-10 20:33                                                                                   ` Al Viro
2017-09-10 21:11                                                                             ` Dave Chinner
2017-09-10 21:19                                                                               ` Al Viro
2017-09-10 22:08                                                                                 ` Dave Chinner
2017-09-10 23:07                                                                                   ` Al Viro
2017-09-10 23:15                                                                                     ` Al Viro
2017-09-11  0:31                                                                                     ` Dave Chinner
2017-09-11  3:32                                                                                       ` Al Viro
2017-09-11  6:44                                                                                         ` Dave Chinner
2017-09-11 20:07                                                                                           ` Al Viro
2017-09-11 20:17                                                                                             ` Al Viro
2017-09-12  6:02                                                                                             ` Dave Chinner
2017-09-12 11:13                                                                                               ` Al Viro
2017-09-11 12:07                                                                                     ` Christoph Hellwig
2017-09-11 12:51                                                                                       ` Al Viro
