linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [git pull] iov_iter fixes
@ 2021-09-09  4:22 Al Viro
  2021-09-09 19:37 ` Linus Torvalds
  2021-09-09 20:03 ` pr-tracker-bot
  0 siblings, 2 replies; 46+ messages in thread
From: Al Viro @ 2021-09-09  4:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

	Fixes for io-uring handling of iov_iter reexpands

The following changes since commit e73f0f0ee7541171d89f2e2491130c7771ba58d3:

  Linux 5.14-rc1 (2021-07-11 15:07:40 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git work.iov_iter

for you to fetch changes up to 89c2b3b74918200e46699338d7bcc19b1ea12110:

  io_uring: reexpand under-reexpanded iters (2021-09-03 19:31:33 -0400)

----------------------------------------------------------------
Pavel Begunkov (2):
      iov_iter: track truncated size
      io_uring: reexpand under-reexpanded iters

 fs/io_uring.c       | 2 ++
 include/linux/uio.h | 6 +++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-09  4:22 [git pull] iov_iter fixes Al Viro
@ 2021-09-09 19:37 ` Linus Torvalds
  2021-09-09 21:19   ` Jens Axboe
                     ` (2 more replies)
  2021-09-09 20:03 ` pr-tracker-bot
  1 sibling, 3 replies; 46+ messages in thread
From: Linus Torvalds @ 2021-09-09 19:37 UTC (permalink / raw)
  To: Al Viro, Jens Axboe, Pavel Begunkov
  Cc: Linux Kernel Mailing List, linux-fsdevel

On Wed, Sep 8, 2021 at 9:24 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>         Fixes for io-uring handling of iov_iter reexpands

Ugh.

I have pulled this, because I understand what it does and I agree it
fixes a bug, but it really feels very very hacky and wrong to me.

It really smells like io-uring is doing a "iov_iter_revert()" using a
number that it pulls incorrectly out of its arse.

So when io-uring does that

                iov_iter_revert(iter, io_size - iov_iter_count(iter));

what it *really* wants to do is just basically "iov_iter_reset(iter)".

And that's basically what that addition of that "iov_iter_reexpand()"
tries to effectively do.

Wouldn't it be better to have a function that does exactly that?

Alternatively (and I'm cc'ing Jens) is is not possible for the
io-uring code to know how many bytes it *actually* used, rather than
saying that "ok, the iter originally had X bytes, now it has Y bytes,
so it must have used X-Y bytes" which was actively wrong for the case
where something ended up truncating the IO for some reason.

Because I note that io-uring does that

        /* may have left rw->iter inconsistent on -EIOCBQUEUED */
        iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter));

in io_resubmit_prep() too, and that you guys missed that it's the
exact same issue, and needs that exact same iov_iter_reexpand().

That "req->result" is once again the *original* length, and the above
code once again mis-handles the case of "oh, the iov got truncated
because of some IO limit".

So I've pulled this, but I think it is

 (a) ugly nasty

 (b) incomplete and misses a case

and needs more thought. At the VERY least it needs that
iov_iter_reexpand() in io_resubmit_prep() too, I think.

I'd like the comments expanded too. In particular that

                /* some cases will consume bytes even on error returns */

really should expand on the "some cases" thing, and why such an error
isn't fatal buye should be retried asynchronously blindly like this?

Because I think _that_ is part of the fundamental issue here - the
io_uring code tries to just blindly re-submit the whole thing, and it
does it very badly and actually incorrectly.

Or am I missing something?

           Linus

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-09  4:22 [git pull] iov_iter fixes Al Viro
  2021-09-09 19:37 ` Linus Torvalds
@ 2021-09-09 20:03 ` pr-tracker-bot
  1 sibling, 0 replies; 46+ messages in thread
From: pr-tracker-bot @ 2021-09-09 20:03 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

The pull request you sent on Thu, 9 Sep 2021 04:22:22 +0000:

> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git work.iov_iter

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/7b7699c09f66f180b9a8a5010df352acb8683bfa

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-09 19:37 ` Linus Torvalds
@ 2021-09-09 21:19   ` Jens Axboe
  2021-09-09 21:39     ` Jens Axboe
                       ` (2 more replies)
  2021-09-09 22:54   ` Pavel Begunkov
  2021-09-09 23:14   ` Pavel Begunkov
  2 siblings, 3 replies; 46+ messages in thread
From: Jens Axboe @ 2021-09-09 21:19 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Pavel Begunkov
  Cc: Linux Kernel Mailing List, linux-fsdevel

On 9/9/21 1:37 PM, Linus Torvalds wrote:
> On Wed, Sep 8, 2021 at 9:24 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>>         Fixes for io-uring handling of iov_iter reexpands
> 
> Ugh.
> 
> I have pulled this, because I understand what it does and I agree it
> fixes a bug, but it really feels very very hacky and wrong to me.
> 
> It really smells like io-uring is doing a "iov_iter_revert()" using a
> number that it pulls incorrectly out of its arse.
> 
> So when io-uring does that
> 
>                 iov_iter_revert(iter, io_size - iov_iter_count(iter));
> 
> what it *really* wants to do is just basically "iov_iter_reset(iter)".
> 
> And that's basically what that addition of that "iov_iter_reexpand()"
> tries to effectively do.
> 
> Wouldn't it be better to have a function that does exactly that?

That might indeed be better. Alternatively, consumers that truncate
should expand. Part of the problem here is the inconsistency in how they
are consumed.

> Alternatively (and I'm cc'ing Jens) is is not possible for the
> io-uring code to know how many bytes it *actually* used, rather than
> saying that "ok, the iter originally had X bytes, now it has Y bytes,
> so it must have used X-Y bytes" which was actively wrong for the case
> where something ended up truncating the IO for some reason.

Not sure how we'd do that, outside of stupid tricks like copy the
iov_iter before we pass it down. But that's obviously not going to be
very efficient. Hence we're left with having some way to reset/reexpand,
even in the presence of someone having done truncate on it.

> Because I note that io-uring does that
> 
>         /* may have left rw->iter inconsistent on -EIOCBQUEUED */
>         iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter));
> 
> in io_resubmit_prep() too, and that you guys missed that it's the
> exact same issue, and needs that exact same iov_iter_reexpand().

I think you're right on that one, there's no difference between that use
case and the other two...

> That "req->result" is once again the *original* length, and the above
> code once again mis-handles the case of "oh, the iov got truncated
> because of some IO limit".
> 
> So I've pulled this, but I think it is
> 
>  (a) ugly nasty
> 
>  (b) incomplete and misses a case
> 
> and needs more thought. At the VERY least it needs that
> iov_iter_reexpand() in io_resubmit_prep() too, I think.
> 
> I'd like the comments expanded too. In particular that
> 
>                 /* some cases will consume bytes even on error returns */

That comment is from me, and it goes back a few years. IIRC, it was the
iomap or xfs code that I hit this with, but honestly I don't remember
all the details at this point. I can try and play with it and see if it
still reproduces.

> really should expand on the "some cases" thing, and why such an error
> isn't fatal buye should be retried asynchronously blindly like this?

That would certainly make it easier to handle, as we'd never need to
care at that point. Ideally, return 'bytes_consumed' or error. It might
have been a case of -EAGAIN after truncate, I'll have to dig a bit to
find it again. Outside of that error, we don't retry as there's no point
in doing so.

> Because I think _that_ is part of the fundamental issue here - the
> io_uring code tries to just blindly re-submit the whole thing, and it
> does it very badly and actually incorrectly.
> 
> Or am I missing something?

I think the key point here is re-figuring out where the
consumption-on-error comes from. If it just ends up being a truncated
iov, that's all good and fine. If not, that feels like a bug somewhere
else that needs fixing.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-09 21:19   ` Jens Axboe
@ 2021-09-09 21:39     ` Jens Axboe
  2021-09-09 21:56       ` Linus Torvalds
  2021-09-09 21:42     ` Dave Chinner
  2021-09-10  2:57     ` Al Viro
  2 siblings, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2021-09-09 21:39 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Pavel Begunkov
  Cc: Linux Kernel Mailing List, linux-fsdevel

On 9/9/21 3:19 PM, Jens Axboe wrote:
>> That "req->result" is once again the *original* length, and the above
>> code once again mis-handles the case of "oh, the iov got truncated
>> because of some IO limit".
>>
>> So I've pulled this, but I think it is
>>
>>  (a) ugly nasty
>>
>>  (b) incomplete and misses a case
>>
>> and needs more thought. At the VERY least it needs that
>> iov_iter_reexpand() in io_resubmit_prep() too, I think.
>>
>> I'd like the comments expanded too. In particular that
>>
>>                 /* some cases will consume bytes even on error returns */
> 
> That comment is from me, and it goes back a few years. IIRC, it was the
> iomap or xfs code that I hit this with, but honestly I don't remember
> all the details at this point. I can try and play with it and see if it
> still reproduces.

OK, one that I immediately found is just doing O_DIRECT to a block
device or file on XFS. As pages are mapped and added, the iov_iter is
advanced. If we then go and submit and get -EAGAIN, for example, then we
return with what we mapped already consumed.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-09 21:19   ` Jens Axboe
  2021-09-09 21:39     ` Jens Axboe
@ 2021-09-09 21:42     ` Dave Chinner
  2021-09-10  2:57     ` Al Viro
  2 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2021-09-09 21:42 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Al Viro, Pavel Begunkov,
	Linux Kernel Mailing List, linux-fsdevel

On Thu, Sep 09, 2021 at 03:19:56PM -0600, Jens Axboe wrote:
> On 9/9/21 1:37 PM, Linus Torvalds wrote:
> > I'd like the comments expanded too. In particular that
> > 
> >                 /* some cases will consume bytes even on error returns */
> 
> That comment is from me, and it goes back a few years. IIRC, it was the
> iomap or xfs code that I hit this with, but honestly I don't remember
> all the details at this point. I can try and play with it and see if it
> still reproduces.

You might well be thinking of the problem fixed by commit
883a790a8440 ("xfs: don't allow NOWAIT DIO across extent
boundaries").

This fix was indicative of a whole class of issues with IOCB_NOWAIT
being used for multi-IO operations at the filesystem level and being
applied to each sub-segment of the IO that was constructed, rather
than the IO as a whole. Hence a failure on the second or subsequent
segments could return -EAGAIN (and potentially other errors) to the
caller after the segments we successfully submitted consumed part of
the iov...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-09 21:39     ` Jens Axboe
@ 2021-09-09 21:56       ` Linus Torvalds
  2021-09-09 22:21         ` Jens Axboe
  0 siblings, 1 reply; 46+ messages in thread
From: Linus Torvalds @ 2021-09-09 21:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Al Viro, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Thu, Sep 9, 2021 at 2:39 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> OK, one that I immediately found is just doing O_DIRECT to a block
> device or file on XFS. As pages are mapped and added, the iov_iter is
> advanced. If we then go and submit and get -EAGAIN, for example, then we
> return with what we mapped already consumed.

Ok, that's annoying but understandable. Dave points to a commit that
removes one of the EAGAIN cases, but apparently not some others.

I do kind of wonder if you can't have the exact same case when *some*
of the IO succeeds, though.

IOW, can't we have  that

        ret = io_iter_do_read(req, iter);

return partial success - and if XFS does that "update iovec on
failure", I could easily see that same code - or something else -
having done the exact same thing.

Put another way: if the iovec isn't guaranteed to be coherent when an
actual error occurs, then why would it be guaranteed to be coherent
with a partial success value?

Because in most cases - I'd argue pretty much all - those "partial
success" cases are *exactly* the same as the error cases, it's just
that we had a loop and one or more iterations succeeded before it hit
the error case.

Hmm?

            Linus

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-09 21:56       ` Linus Torvalds
@ 2021-09-09 22:21         ` Jens Axboe
  2021-09-09 22:56           ` Linus Torvalds
  0 siblings, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2021-09-09 22:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/9/21 3:56 PM, Linus Torvalds wrote:
> On Thu, Sep 9, 2021 at 2:39 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> OK, one that I immediately found is just doing O_DIRECT to a block
>> device or file on XFS. As pages are mapped and added, the iov_iter is
>> advanced. If we then go and submit and get -EAGAIN, for example, then we
>> return with what we mapped already consumed.
> 
> Ok, that's annoying but understandable. Dave points to a commit that
> removes one of the EAGAIN cases, but apparently not some others.

That one just catches it upfront so we -EAGAIN immediately, which does
make it a lot easier to handle. But yes, that's an example.

The case I mention we basically always consume the whole iov, to the
extent that we can map it to a bio. But there's really no way around
that, we have to map it before we can attempt to do that IO.

> I do kind of wonder if you can't have the exact same case when *some*
> of the IO succeeds, though.
> 
> IOW, can't we have  that
> 
>         ret = io_iter_do_read(req, iter);
> 
> return partial success - and if XFS does that "update iovec on
> failure", I could easily see that same code - or something else -
> having done the exact same thing.
> 
> Put another way: if the iovec isn't guaranteed to be coherent when an
> actual error occurs, then why would it be guaranteed to be coherent
> with a partial success value?
> 
> Because in most cases - I'd argue pretty much all - those "partial
> success" cases are *exactly* the same as the error cases, it's just
> that we had a loop and one or more iterations succeeded before it hit
> the error case.

Right, which is why the reset would be nice, but reexpand + revert at
least works and accomplishes the same even if it doesn't look as pretty.
We do return how much IO was actually done from the various
->read/write_iter() obviously, and that cannot be incorrect. It's just
that the iov_iter doesn't necessarily agree with that view and more (or
all) may have been consumed regardless of the return value. The truncate
was really the part that made it impossible to handle.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-09 19:37 ` Linus Torvalds
  2021-09-09 21:19   ` Jens Axboe
@ 2021-09-09 22:54   ` Pavel Begunkov
  2021-09-09 22:57     ` Pavel Begunkov
  2021-09-09 23:14   ` Pavel Begunkov
  2 siblings, 1 reply; 46+ messages in thread
From: Pavel Begunkov @ 2021-09-09 22:54 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Jens Axboe
  Cc: Linux Kernel Mailing List, linux-fsdevel

On 9/9/21 8:37 PM, Linus Torvalds wrote:
> On Wed, Sep 8, 2021 at 9:24 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>>         Fixes for io-uring handling of iov_iter reexpands
> 
> Ugh.
> 
> I have pulled this, because I understand what it does and I agree it
> fixes a bug, but it really feels very very hacky and wrong to me.

Maybe was worded not too clearly, my apologies.


> It really smells like io-uring is doing a "iov_iter_revert()" using a
> number that it pulls incorrectly out of its arse.

It's not invented by io_uring,

filemap.c : generic_file_direct_[write,read]()

do the same thing. Also, the block layer was not re-expanding before
~5.12, so it looks it was possible to trigger a similar thing without
io_uring, but I haven't tried to reproduce. Was mentioned in the
cover-letter.

> So when io-uring does that
> 
>                 iov_iter_revert(iter, io_size - iov_iter_count(iter));
> 
> what it *really* wants to do is just basically "iov_iter_reset(iter)".
> 
> And that's basically what that addition of that "iov_iter_reexpand()"
> tries to effectively do.
> 
> Wouldn't it be better to have a function that does exactly that?
> 
> Alternatively (and I'm cc'ing Jens) is is not possible for the
> io-uring code to know how many bytes it *actually* used, rather than
> saying that "ok, the iter originally had X bytes, now it has Y bytes,
> so it must have used X-Y bytes" which was actively wrong for the case
> where something ended up truncating the IO for some reason.
> 
> Because I note that io-uring does that
> 
>         /* may have left rw->iter inconsistent on -EIOCBQUEUED */
>         iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter));
> 
> in io_resubmit_prep() too, and that you guys missed that it's the
> exact same issue, and needs that exact same iov_iter_reexpand().

Right. It was covered by v1-v2, which were failing requests with
additional fallback in v2 [1], but I dropped in v3 [2] because there
is a difference. Namely io_resubmit_prep() might be called deeply down
the stack, e.g. in the block layer.

It was intended to get fixed once the first part is merged, and I do
believe that was the right approach, because there were certain
communication delays. The first version was posted a month ago, but
we missed the merged window. It appeared to me that if we get anything
more complex 

 



do that at the bottom of stack.




 how deep in the stack we do that. It was indended to
be 


[1] https://lkml.org/lkml/2021/8/12/620
[2] https://lkml.org/lkml/2021/8/23/285

> 
> That "req->result" is once again the *original* length, and the above
> code once again mis-handles the case of "oh, the iov got truncated
> because of some IO limit".
> 
> So I've pulled this, but I think it is
> 
>  (a) ugly nasty
> 
>  (b) incomplete and misses a case
> 
> and needs more thought. At the VERY least it needs that
> iov_iter_reexpand() in io_resubmit_prep() too, I think.
> 
> I'd like the comments expanded too. In particular that
> 
>                 /* some cases will consume bytes even on error returns */
> 
> really should expand on the "some cases" thing, and why such an error
> isn't fatal buye should be retried asynchronously blindly like this?
> 
> Because I think _that_ is part of the fundamental issue here - the
> io_uring code tries to just blindly re-submit the whole thing, and it
> does it very badly and actually incorrectly.
> 
> Or am I missing something?
> 
>            Linus
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-09 22:21         ` Jens Axboe
@ 2021-09-09 22:56           ` Linus Torvalds
  2021-09-10  1:35             ` Jens Axboe
  0 siblings, 1 reply; 46+ messages in thread
From: Linus Torvalds @ 2021-09-09 22:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Al Viro, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Thu, Sep 9, 2021 at 3:21 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 9/9/21 3:56 PM, Linus Torvalds wrote:
> >
> > IOW, can't we have  that
> >
> >         ret = io_iter_do_read(req, iter);
> >
> > return partial success - and if XFS does that "update iovec on
> > failure", I could easily see that same code - or something else -
> > having done the exact same thing.
> >
> > Put another way: if the iovec isn't guaranteed to be coherent when an
> > actual error occurs, then why would it be guaranteed to be coherent
> > with a partial success value?
> >
> > Because in most cases - I'd argue pretty much all - those "partial
> > success" cases are *exactly* the same as the error cases, it's just
> > that we had a loop and one or more iterations succeeded before it hit
> > the error case.
>
> Right, which is why the reset would be nice, but reexpand + revert at
> least works and accomplishes the same even if it doesn't look as pretty.

You miss my point.

The partial success case seems to do the wrong thing.

Or am I misreading things? Lookie here, in io_read():

        ret = io_iter_do_read(req, iter);

let's say that something succeeds partially, does X bytes, and returns
a positive X.

The if-statements following it then do not trigger:

        if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) {
  .. not this case ..
        } else if (ret == -EIOCBQUEUED) {
  .. nor this ..
        } else if (ret <= 0 || ret == io_size || !force_nonblock ||
                   (req->flags & REQ_F_NOWAIT) || !(req->flags & REQ_F_ISREG)) {
  .. nor this ..
        }

so nothing has been done to the iovec at all.

Then it does

        ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);

using that iovec that has *not* been reset, even though it really
should have been reset to "X bytes read".

See what I'm trying to say?

                Linus

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-09 22:54   ` Pavel Begunkov
@ 2021-09-09 22:57     ` Pavel Begunkov
  0 siblings, 0 replies; 46+ messages in thread
From: Pavel Begunkov @ 2021-09-09 22:57 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Jens Axboe
  Cc: Linux Kernel Mailing List, linux-fsdevel

On 9/9/21 11:54 PM, Pavel Begunkov wrote:
> On 9/9/21 8:37 PM, Linus Torvalds wrote:
>> On Wed, Sep 8, 2021 at 9:24 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>>
>>>         Fixes for io-uring handling of iov_iter reexpands
>>
>> Ugh.
>>
>> I have pulled this, because I understand what it does and I agree it
>> fixes a bug, but it really feels very very hacky and wrong to me.
> 
> Maybe was worded not too clearly, my apologies.
> 
> 
>> It really smells like io-uring is doing a "iov_iter_revert()" using a
>> number that it pulls incorrectly out of its arse.
> 
> It's not invented by io_uring,
> 
> filemap.c : generic_file_direct_[write,read]()
> 
> do the same thing. Also, the block layer was not re-expanding before
> ~5.12, so it looks it was possible to trigger a similar thing without
> io_uring, but I haven't tried to reproduce. Was mentioned in the
> cover-letter.
> 
>> So when io-uring does that
>>
>>                 iov_iter_revert(iter, io_size - iov_iter_count(iter));
>>
>> what it *really* wants to do is just basically "iov_iter_reset(iter)".
>>
>> And that's basically what that addition of that "iov_iter_reexpand()"
>> tries to effectively do.
>>
>> Wouldn't it be better to have a function that does exactly that?
>>
>> Alternatively (and I'm cc'ing Jens) is is not possible for the
>> io-uring code to know how many bytes it *actually* used, rather than
>> saying that "ok, the iter originally had X bytes, now it has Y bytes,
>> so it must have used X-Y bytes" which was actively wrong for the case
>> where something ended up truncating the IO for some reason.
>>
>> Because I note that io-uring does that
>>
>>         /* may have left rw->iter inconsistent on -EIOCBQUEUED */
>>         iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter));
>>
>> in io_resubmit_prep() too, and that you guys missed that it's the
>> exact same issue, and needs that exact same iov_iter_reexpand().
> 
> Right. It was covered by v1-v2, which were failing requests with
> additional fallback in v2 [1], but I dropped in v3 [2] because there
> is a difference. Namely io_resubmit_prep() might be called deeply down
> the stack, e.g. in the block layer.
> 
> It was intended to get fixed once the first part is merged, and I do
> believe that was the right approach, because there were certain
> communication delays. The first version was posted a month ago, but
> we missed the merged window. It appeared to me that if we get anything
> more complex 

Dammit, apologies for the teared email.

... It was intended to get fixed once the first part is merged, and I do
believe that was the right approach, because there were certain
communication delays. The first version was posted a month ago, but
we missed the merged window. It appeared to me that if anything
more complex is posted, it would take another window to get it done.


> [1] https://lkml.org/lkml/2021/8/12/620
> [2] https://lkml.org/lkml/2021/8/23/285
> 
>>
>> That "req->result" is once again the *original* length, and the above
>> code once again mis-handles the case of "oh, the iov got truncated
>> because of some IO limit".
>>
>> So I've pulled this, but I think it is
>>
>>  (a) ugly nasty
>>
>>  (b) incomplete and misses a case
>>
>> and needs more thought. At the VERY least it needs that
>> iov_iter_reexpand() in io_resubmit_prep() too, I think.
>>
>> I'd like the comments expanded too. In particular that
>>
>>                 /* some cases will consume bytes even on error returns */
>>
>> really should expand on the "some cases" thing, and why such an error
>> isn't fatal buye should be retried asynchronously blindly like this?
>>
>> Because I think _that_ is part of the fundamental issue here - the
>> io_uring code tries to just blindly re-submit the whole thing, and it
>> does it very badly and actually incorrectly.
>>
>> Or am I missing something?
>>
>>            Linus
>>
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-09 19:37 ` Linus Torvalds
  2021-09-09 21:19   ` Jens Axboe
  2021-09-09 22:54   ` Pavel Begunkov
@ 2021-09-09 23:14   ` Pavel Begunkov
  2 siblings, 0 replies; 46+ messages in thread
From: Pavel Begunkov @ 2021-09-09 23:14 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Jens Axboe
  Cc: Linux Kernel Mailing List, linux-fsdevel

On 9/9/21 8:37 PM, Linus Torvalds wrote:
> On Wed, Sep 8, 2021 at 9:24 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>>         Fixes for io-uring handling of iov_iter reexpands
> 
> Ugh.
> 
> I have pulled this, because I understand what it does and I agree it
> fixes a bug, but it really feels very very hacky and wrong to me.
> 
> It really smells like io-uring is doing a "iov_iter_revert()" using a
> number that it pulls incorrectly out of its arse.
> 
> So when io-uring does that
> 
>                 iov_iter_revert(iter, io_size - iov_iter_count(iter));
> 
> what it *really* wants to do is just basically "iov_iter_reset(iter)".
> 
> And that's basically what that addition of that "iov_iter_reexpand()"
> tries to effectively do.
> 
> Wouldn't it be better to have a function that does exactly that?
> 
> Alternatively (and I'm cc'ing Jens) is is not possible for the
> io-uring code to know how many bytes it *actually* used, rather than
> saying that "ok, the iter originally had X bytes, now it has Y bytes,
> so it must have used X-Y bytes" which was actively wrong for the case
> where something ended up truncating the IO for some reason.
> 
> Because I note that io-uring does that
> 
>         /* may have left rw->iter inconsistent on -EIOCBQUEUED */
>         iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter));
> 
> in io_resubmit_prep() too, and that you guys missed that it's the
> exact same issue, and needs that exact same iov_iter_reexpand().
> 
> That "req->result" is once again the *original* length, and the above
> code once again mis-handles the case of "oh, the iov got truncated
> because of some IO limit".
> 
> So I've pulled this, but I think it is
> 
>  (a) ugly nasty

Should have mentioned, I agree that it's ghastly, as mentioned
in the cover-letter, but I just prefer to first fix the problem
ASAP, and then carry on with something more fundamental and right.


>  (b) incomplete and misses a case
> 
> and needs more thought. At the VERY least it needs that
> iov_iter_reexpand() in io_resubmit_prep() too, I think.
> 
> I'd like the comments expanded too. In particular that
> 
>                 /* some cases will consume bytes even on error returns */
> 
> really should expand on the "some cases" thing, and why such an error
> isn't fatal buye should be retried asynchronously blindly like this?
> 
> Because I think _that_ is part of the fundamental issue here - the
> io_uring code tries to just blindly re-submit the whole thing, and it
> does it very badly and actually incorrectly.
> 
> Or am I missing something?
> 
>            Linus
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-09 22:56           ` Linus Torvalds
@ 2021-09-10  1:35             ` Jens Axboe
  2021-09-10  2:43               ` Jens Axboe
  2021-09-10  2:48               ` Al Viro
  0 siblings, 2 replies; 46+ messages in thread
From: Jens Axboe @ 2021-09-10  1:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/9/21 4:56 PM, Linus Torvalds wrote:
> On Thu, Sep 9, 2021 at 3:21 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 9/9/21 3:56 PM, Linus Torvalds wrote:
>>>
>>> IOW, can't we have  that
>>>
>>>         ret = io_iter_do_read(req, iter);
>>>
>>> return partial success - and if XFS does that "update iovec on
>>> failure", I could easily see that same code - or something else -
>>> having done the exact same thing.
>>>
>>> Put another way: if the iovec isn't guaranteed to be coherent when an
>>> actual error occurs, then why would it be guaranteed to be coherent
>>> with a partial success value?
>>>
>>> Because in most cases - I'd argue pretty much all - those "partial
>>> success" cases are *exactly* the same as the error cases, it's just
>>> that we had a loop and one or more iterations succeeded before it hit
>>> the error case.
>>
>> Right, which is why the reset would be nice, but reexpand + revert at
>> least works and accomplishes the same even if it doesn't look as pretty.
> 
> You miss my point.
> 
> The partial success case seems to do the wrong thing.
> 
> Or am I misreading things? Lookie here, in io_read():
> 
>         ret = io_iter_do_read(req, iter);
> 
> let's say that something succeeds partially, does X bytes, and returns
> a positive X.
> 
> The if-statements following it then do not trigger:
> 
>         if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) {
>   .. not this case ..
>         } else if (ret == -EIOCBQUEUED) {
>   .. nor this ..
>         } else if (ret <= 0 || ret == io_size || !force_nonblock ||
>                    (req->flags & REQ_F_NOWAIT) || !(req->flags & REQ_F_ISREG)) {
>   .. nor this ..
>         }
> 
> so nothing has been done to the iovec at all.
> 
> Then it does
> 
>         ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
> 
> using that iovec that has *not* been reset, even though it really
> should have been reset to "X bytes read".
> 
> See what I'm trying to say?

Yep ok I follow you now. And yes, if we get a partial one but one that
has more consumed than what was returned, that would not work well. I'm
guessing that a) we've never seen that, or b) we always end up with
either correctly advanced OR fully advanced, and the fully advanced case
would then just return 0 next time and we'd just get a short IO back to
userspace.

The safer way here would likely be to import the iovec again. We're
still in the context of the original submission, and the sqe hasn't been
consumed in the ring yet, so that can be done safely.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10  1:35             ` Jens Axboe
@ 2021-09-10  2:43               ` Jens Axboe
  2021-09-10  2:48               ` Al Viro
  1 sibling, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2021-09-10  2:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/9/21 7:35 PM, Jens Axboe wrote:
> On 9/9/21 4:56 PM, Linus Torvalds wrote:
>> On Thu, Sep 9, 2021 at 3:21 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 9/9/21 3:56 PM, Linus Torvalds wrote:
>>>>
>>>> IOW, can't we have  that
>>>>
>>>>         ret = io_iter_do_read(req, iter);
>>>>
>>>> return partial success - and if XFS does that "update iovec on
>>>> failure", I could easily see that same code - or something else -
>>>> having done the exact same thing.
>>>>
>>>> Put another way: if the iovec isn't guaranteed to be coherent when an
>>>> actual error occurs, then why would it be guaranteed to be coherent
>>>> with a partial success value?
>>>>
>>>> Because in most cases - I'd argue pretty much all - those "partial
>>>> success" cases are *exactly* the same as the error cases, it's just
>>>> that we had a loop and one or more iterations succeeded before it hit
>>>> the error case.
>>>
>>> Right, which is why the reset would be nice, but reexpand + revert at
>>> least works and accomplishes the same even if it doesn't look as pretty.
>>
>> You miss my point.
>>
>> The partial success case seems to do the wrong thing.
>>
>> Or am I misreading things? Lookie here, in io_read():
>>
>>         ret = io_iter_do_read(req, iter);
>>
>> let's say that something succeeds partially, does X bytes, and returns
>> a positive X.
>>
>> The if-statements following it then do not trigger:
>>
>>         if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) {
>>   .. not this case ..
>>         } else if (ret == -EIOCBQUEUED) {
>>   .. nor this ..
>>         } else if (ret <= 0 || ret == io_size || !force_nonblock ||
>>                    (req->flags & REQ_F_NOWAIT) || !(req->flags & REQ_F_ISREG)) {
>>   .. nor this ..
>>         }
>>
>> so nothing has been done to the iovec at all.
>>
>> Then it does
>>
>>         ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
>>
>> using that iovec that has *not* been reset, even though it really
>> should have been reset to "X bytes read".
>>
>> See what I'm trying to say?
> 
> Yep ok I follow you now. And yes, if we get a partial one but one that
> has more consumed than what was returned, that would not work well. I'm
> guessing that a) we've never seen that, or b) we always end up with
> either correctly advanced OR fully advanced, and the fully advanced case
> would then just return 0 next time and we'd just get a short IO back to
> userspace.
> 
> The safer way here would likely be to import the iovec again. We're
> still in the context of the original submission, and the sqe hasn't been
> consumed in the ring yet, so that can be done safely.

Totally untested, but something like this could be a better solution.
If we're still in the original submit path, then re-import the iovec and
set the iter again before doing retry. If we do get a partial
read/write return, then advance the iter to avoid re-doing parts of the
IO.

If we're already in the io-wq retry path, short IO will just be ended
anyway. That's no different than today.

Will take a closer look at this tomorrow and run some testing, but I
think the idea is sound and it avoids any kind of guessing on what was
done or not. Just re-setup the iter/iov and advance if we got a positive
result on the previous attempt.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 855ea544807f..89c4c568d785 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2608,8 +2608,6 @@ static bool io_resubmit_prep(struct io_kiocb *req)
 
 	if (!rw)
 		return !io_req_prep_async(req);
-	/* may have left rw->iter inconsistent on -EIOCBQUEUED */
-	iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter));
 	return true;
 }
 
@@ -3431,6 +3429,28 @@ static bool need_read_all(struct io_kiocb *req)
 		S_ISBLK(file_inode(req->file)->i_mode);
 }
 
+static int io_prep_for_retry(int rw, struct io_kiocb *req, struct iovec **vecs,
+			     struct iov_iter *iter, ssize_t did_bytes)
+{
+	ssize_t ret;
+
+	/*
+	 * io-wq path cannot retry, as we cannot safely re-import vecs. It
+	 * would be perfectly legal for non-vectored IO, but we handle them
+	 * all the same.
+	 */
+	if (WARN_ON_ONCE(io_wq_current_is_worker()))
+		return did_bytes;
+
+	ret = io_import_iovec(rw, req, vecs, iter, false);
+	if (ret < 0)
+		return ret;
+	if (did_bytes > 0)
+		iov_iter_advance(iter, did_bytes);
+
+	return 0;
+}
+
 static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
@@ -3479,9 +3499,6 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		/* no retry on NONBLOCK nor RWF_NOWAIT */
 		if (req->flags & REQ_F_NOWAIT)
 			goto done;
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
 		ret = 0;
 	} else if (ret == -EIOCBQUEUED) {
 		goto out_free;
@@ -3491,6 +3508,13 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		goto done;
 	}
 
+	iovec = inline_vecs;
+	ret2 = io_prep_for_retry(READ, req, &iovec, iter, ret);
+	if (ret2 < 0) {
+		ret = ret2;
+		goto done;
+	}
+
 	ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
 	if (ret2)
 		return ret2;
@@ -3614,14 +3638,16 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	if (!force_nonblock || ret2 != -EAGAIN) {
 		/* IOPOLL retry should happen for io-wq threads */
 		if ((req->ctx->flags & IORING_SETUP_IOPOLL) && ret2 == -EAGAIN)
-			goto copy_iov;
+			goto copy_import;
 done:
 		kiocb_done(kiocb, ret2, issue_flags);
 	} else {
+copy_import:
+		iovec = inline_vecs;
+		ret = io_prep_for_retry(WRITE, req, &iovec, iter, ret2);
+		if (ret < 0)
+			goto out_free;
 copy_iov:
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
 		ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
 		return ret ?: -EAGAIN;
 	}

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10  1:35             ` Jens Axboe
  2021-09-10  2:43               ` Jens Axboe
@ 2021-09-10  2:48               ` Al Viro
  2021-09-10  3:06                 ` Jens Axboe
  2021-09-13 15:29                 ` David Laight
  1 sibling, 2 replies; 46+ messages in thread
From: Al Viro @ 2021-09-10  2:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Thu, Sep 09, 2021 at 07:35:13PM -0600, Jens Axboe wrote:

> Yep ok I follow you now. And yes, if we get a partial one but one that
> has more consumed than what was returned, that would not work well. I'm
> guessing that a) we've never seen that, or b) we always end up with
> either correctly advanced OR fully advanced, and the fully advanced case
> would then just return 0 next time and we'd just get a short IO back to
> userspace.
> 
> The safer way here would likely be to import the iovec again. We're
> still in the context of the original submission, and the sqe hasn't been
> consumed in the ring yet, so that can be done safely.

... until you end up with something assuming that you've got the same
iovec from userland the second time around.

IOW, generally it's a bad idea to do that kind of re-imports.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-09 21:19   ` Jens Axboe
  2021-09-09 21:39     ` Jens Axboe
  2021-09-09 21:42     ` Dave Chinner
@ 2021-09-10  2:57     ` Al Viro
  2021-09-10  3:05       ` Jens Axboe
  2 siblings, 1 reply; 46+ messages in thread
From: Al Viro @ 2021-09-10  2:57 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Thu, Sep 09, 2021 at 03:19:56PM -0600, Jens Axboe wrote:

> Not sure how we'd do that, outside of stupid tricks like copy the
> iov_iter before we pass it down. But that's obviously not going to be
> very efficient. Hence we're left with having some way to reset/reexpand,
> even in the presence of someone having done truncate on it.

"Obviously" why, exactly?  It's not that large a structure; it's not
the optimal variant, but I'd like to see profiling data before assuming
that it'll cause noticable slowdowns.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10  2:57     ` Al Viro
@ 2021-09-10  3:05       ` Jens Axboe
  2021-09-10  3:11         ` Al Viro
  0 siblings, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2021-09-10  3:05 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/9/21 8:57 PM, Al Viro wrote:
> On Thu, Sep 09, 2021 at 03:19:56PM -0600, Jens Axboe wrote:
> 
>> Not sure how we'd do that, outside of stupid tricks like copy the
>> iov_iter before we pass it down. But that's obviously not going to be
>> very efficient. Hence we're left with having some way to reset/reexpand,
>> even in the presence of someone having done truncate on it.
> 
> "Obviously" why, exactly?  It's not that large a structure; it's not
> the optimal variant, but I'd like to see profiling data before assuming
> that it'll cause noticable slowdowns.

It's 48 bytes, and we have to do it upfront. That means we'd be doing it
for _all_ requests, not just when we need to retry. As an example, current
benchmarks are at ~4M read requests per core. That'd add ~200MB/sec of
memory traffic just doing this copy.

Besides, I think that's moot as there's a better way.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10  2:48               ` Al Viro
@ 2021-09-10  3:06                 ` Jens Axboe
  2021-09-10  3:15                   ` Al Viro
  2021-09-13 15:29                 ` David Laight
  1 sibling, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2021-09-10  3:06 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/9/21 8:48 PM, Al Viro wrote:
> On Thu, Sep 09, 2021 at 07:35:13PM -0600, Jens Axboe wrote:
> 
>> Yep ok I follow you now. And yes, if we get a partial one but one that
>> has more consumed than what was returned, that would not work well. I'm
>> guessing that a) we've never seen that, or b) we always end up with
>> either correctly advanced OR fully advanced, and the fully advanced case
>> would then just return 0 next time and we'd just get a short IO back to
>> userspace.
>>
>> The safer way here would likely be to import the iovec again. We're
>> still in the context of the original submission, and the sqe hasn't been
>> consumed in the ring yet, so that can be done safely.
> 
> ... until you end up with something assuming that you've got the same
> iovec from userland the second time around.
> 
> IOW, generally it's a bad idea to do that kind of re-imports.

That's really no different than having one thread do the issue, and
another modify the iovec while it happens. It's only an issue if you
don't validate it, just like you did the first time you imported. No
assumptions need to be made here.

If it's no longer valid, it'll get failed, and it's really on the
application having buggy behavior. The iovec cannot be modified until
we've signaled that it's been consumed, which hasn't happened yet. No
different than if an application modifies it mid readv(2) syscall.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10  3:05       ` Jens Axboe
@ 2021-09-10  3:11         ` Al Viro
  2021-09-10  3:22           ` Jens Axboe
  0 siblings, 1 reply; 46+ messages in thread
From: Al Viro @ 2021-09-10  3:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Thu, Sep 09, 2021 at 09:05:13PM -0600, Jens Axboe wrote:
> On 9/9/21 8:57 PM, Al Viro wrote:
> > On Thu, Sep 09, 2021 at 03:19:56PM -0600, Jens Axboe wrote:
> > 
> >> Not sure how we'd do that, outside of stupid tricks like copy the
> >> iov_iter before we pass it down. But that's obviously not going to be
> >> very efficient. Hence we're left with having some way to reset/reexpand,
> >> even in the presence of someone having done truncate on it.
> > 
> > "Obviously" why, exactly?  It's not that large a structure; it's not
> > the optimal variant, but I'd like to see profiling data before assuming
> > that it'll cause noticable slowdowns.
> 
> It's 48 bytes, and we have to do it upfront. That means we'd be doing it
> for _all_ requests, not just when we need to retry. As an example, current
> benchmarks are at ~4M read requests per core. That'd add ~200MB/sec of
> memory traffic just doing this copy.

Umm...  How much of that will be handled by cache?

> Besides, I think that's moot as there's a better way.

I hope so, but I'm afraid that "let's reload from userland on e.g. short
reads" is not better - there's a plenty of interesting corner cases you
need to handle with that.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10  3:06                 ` Jens Axboe
@ 2021-09-10  3:15                   ` Al Viro
  2021-09-10  3:23                     ` Jens Axboe
  2021-09-10  3:24                     ` Al Viro
  0 siblings, 2 replies; 46+ messages in thread
From: Al Viro @ 2021-09-10  3:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Thu, Sep 09, 2021 at 09:06:58PM -0600, Jens Axboe wrote:
> On 9/9/21 8:48 PM, Al Viro wrote:
> > On Thu, Sep 09, 2021 at 07:35:13PM -0600, Jens Axboe wrote:
> > 
> >> Yep ok I follow you now. And yes, if we get a partial one but one that
> >> has more consumed than what was returned, that would not work well. I'm
> >> guessing that a) we've never seen that, or b) we always end up with
> >> either correctly advanced OR fully advanced, and the fully advanced case
> >> would then just return 0 next time and we'd just get a short IO back to
> >> userspace.
> >>
> >> The safer way here would likely be to import the iovec again. We're
> >> still in the context of the original submission, and the sqe hasn't been
> >> consumed in the ring yet, so that can be done safely.
> > 
> > ... until you end up with something assuming that you've got the same
> > iovec from userland the second time around.
> > 
> > IOW, generally it's a bad idea to do that kind of re-imports.
> 
> That's really no different than having one thread do the issue, and
> another modify the iovec while it happens. It's only an issue if you
> don't validate it, just like you did the first time you imported. No
> assumptions need to be made here.

	It's not "need to be made", it's "will be mistakenly made by
somebody several years down the road"...

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10  3:11         ` Al Viro
@ 2021-09-10  3:22           ` Jens Axboe
  2021-09-10  3:27             ` Al Viro
  0 siblings, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2021-09-10  3:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/9/21 9:11 PM, Al Viro wrote:
> On Thu, Sep 09, 2021 at 09:05:13PM -0600, Jens Axboe wrote:
>> On 9/9/21 8:57 PM, Al Viro wrote:
>>> On Thu, Sep 09, 2021 at 03:19:56PM -0600, Jens Axboe wrote:
>>>
>>>> Not sure how we'd do that, outside of stupid tricks like copy the
>>>> iov_iter before we pass it down. But that's obviously not going to be
>>>> very efficient. Hence we're left with having some way to reset/reexpand,
>>>> even in the presence of someone having done truncate on it.
>>>
>>> "Obviously" why, exactly?  It's not that large a structure; it's not
>>> the optimal variant, but I'd like to see profiling data before assuming
>>> that it'll cause noticable slowdowns.
>>
>> It's 48 bytes, and we have to do it upfront. That means we'd be doing it
>> for _all_ requests, not just when we need to retry. As an example, current
>> benchmarks are at ~4M read requests per core. That'd add ~200MB/sec of
>> memory traffic just doing this copy.
> 
> Umm...  How much of that will be handled by cache?

Depends? And what if the iovec itself has been modified in the middle?
We'd need to copy that whole thing too. It's just not workable as a
solution.

>> Besides, I think that's moot as there's a better way.
> 
> I hope so, but I'm afraid that "let's reload from userland on e.g. short
> reads" is not better - there's a plenty of interesting corner cases you
> need to handle with that.

As long as we're still in the context of the submission, it is tractable
provided we import it like we did originally.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10  3:15                   ` Al Viro
@ 2021-09-10  3:23                     ` Jens Axboe
  2021-09-10  3:24                     ` Al Viro
  1 sibling, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2021-09-10  3:23 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/9/21 9:15 PM, Al Viro wrote:
> On Thu, Sep 09, 2021 at 09:06:58PM -0600, Jens Axboe wrote:
>> On 9/9/21 8:48 PM, Al Viro wrote:
>>> On Thu, Sep 09, 2021 at 07:35:13PM -0600, Jens Axboe wrote:
>>>
>>>> Yep ok I follow you now. And yes, if we get a partial one but one that
>>>> has more consumed than what was returned, that would not work well. I'm
>>>> guessing that a) we've never seen that, or b) we always end up with
>>>> either correctly advanced OR fully advanced, and the fully advanced case
>>>> would then just return 0 next time and we'd just get a short IO back to
>>>> userspace.
>>>>
>>>> The safer way here would likely be to import the iovec again. We're
>>>> still in the context of the original submission, and the sqe hasn't been
>>>> consumed in the ring yet, so that can be done safely.
>>>
>>> ... until you end up with something assuming that you've got the same
>>> iovec from userland the second time around.
>>>
>>> IOW, generally it's a bad idea to do that kind of re-imports.
>>
>> That's really no different than having one thread do the issue, and
>> another modify the iovec while it happens. It's only an issue if you
>> don't validate it, just like you did the first time you imported. No
>> assumptions need to be made here.
> 
> 	It's not "need to be made", it's "will be mistakenly made by
> somebody several years down the road"...

If the application changes the iovec passed in while it hasn't been
consumed yet, it's buggy. Period. We obviously need to ensure that we
only do this re-import IFF we're in the same original submit context
still.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10  3:15                   ` Al Viro
  2021-09-10  3:23                     ` Jens Axboe
@ 2021-09-10  3:24                     ` Al Viro
  2021-09-10  3:28                       ` Jens Axboe
  1 sibling, 1 reply; 46+ messages in thread
From: Al Viro @ 2021-09-10  3:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Fri, Sep 10, 2021 at 03:15:35AM +0000, Al Viro wrote:
> On Thu, Sep 09, 2021 at 09:06:58PM -0600, Jens Axboe wrote:
> > On 9/9/21 8:48 PM, Al Viro wrote:
> > > On Thu, Sep 09, 2021 at 07:35:13PM -0600, Jens Axboe wrote:
> > > 
> > >> Yep ok I follow you now. And yes, if we get a partial one but one that
> > >> has more consumed than what was returned, that would not work well. I'm
> > >> guessing that a) we've never seen that, or b) we always end up with
> > >> either correctly advanced OR fully advanced, and the fully advanced case
> > >> would then just return 0 next time and we'd just get a short IO back to
> > >> userspace.
> > >>
> > >> The safer way here would likely be to import the iovec again. We're
> > >> still in the context of the original submission, and the sqe hasn't been
> > >> consumed in the ring yet, so that can be done safely.
> > > 
> > > ... until you end up with something assuming that you've got the same
> > > iovec from userland the second time around.
> > > 
> > > IOW, generally it's a bad idea to do that kind of re-imports.
> > 
> > That's really no different than having one thread do the issue, and
> > another modify the iovec while it happens. It's only an issue if you
> > don't validate it, just like you did the first time you imported. No
> > assumptions need to be made here.
> 
> 	It's not "need to be made", it's "will be mistakenly made by
> somebody several years down the road"...

E.g. somebody blindly assuming that the amount of data read the last
time around will not exceed the size of reimported iov_iter.  What I'm
saying is that there's a plenty of ways to fuck up in that direction,
and they will *not* be caught by normal fuzzers.

I'm not arguing in favour of an uncoditional copy, BTW - I would like to
see something resembling profiling data, but it's obviously not a pretty
solution.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10  3:22           ` Jens Axboe
@ 2021-09-10  3:27             ` Al Viro
  2021-09-10  3:30               ` Jens Axboe
  0 siblings, 1 reply; 46+ messages in thread
From: Al Viro @ 2021-09-10  3:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Thu, Sep 09, 2021 at 09:22:30PM -0600, Jens Axboe wrote:
> On 9/9/21 9:11 PM, Al Viro wrote:
> > On Thu, Sep 09, 2021 at 09:05:13PM -0600, Jens Axboe wrote:
> >> On 9/9/21 8:57 PM, Al Viro wrote:
> >>> On Thu, Sep 09, 2021 at 03:19:56PM -0600, Jens Axboe wrote:
> >>>
> >>>> Not sure how we'd do that, outside of stupid tricks like copy the
> >>>> iov_iter before we pass it down. But that's obviously not going to be
> >>>> very efficient. Hence we're left with having some way to reset/reexpand,
> >>>> even in the presence of someone having done truncate on it.
> >>>
> >>> "Obviously" why, exactly?  It's not that large a structure; it's not
> >>> the optimal variant, but I'd like to see profiling data before assuming
> >>> that it'll cause noticable slowdowns.
> >>
> >> It's 48 bytes, and we have to do it upfront. That means we'd be doing it
> >> for _all_ requests, not just when we need to retry. As an example, current
> >> benchmarks are at ~4M read requests per core. That'd add ~200MB/sec of
> >> memory traffic just doing this copy.
> > 
> > Umm...  How much of that will be handled by cache?
> 
> Depends? And what if the iovec itself has been modified in the middle?
> We'd need to copy that whole thing too. It's just not workable as a
> solution.

Huh?  Why the hell would we need to copy iovecs themselves?  They are never
modified by ->read_iter()/->write_iter().

That's the whole fucking point of iov_iter - the iovec itself is made
constant, with all movable parts taken to iov_iter.

Again, we should never, ever modify the iovec (or bvec, etc.) array in
->read_iter()/->write_iter()/->sendmsg()/etc. instances.  If you see such
behaviour anywhere, report it immediately.  Any such is a blatant bug.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10  3:24                     ` Al Viro
@ 2021-09-10  3:28                       ` Jens Axboe
  0 siblings, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2021-09-10  3:28 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/9/21 9:24 PM, Al Viro wrote:
> On Fri, Sep 10, 2021 at 03:15:35AM +0000, Al Viro wrote:
>> On Thu, Sep 09, 2021 at 09:06:58PM -0600, Jens Axboe wrote:
>>> On 9/9/21 8:48 PM, Al Viro wrote:
>>>> On Thu, Sep 09, 2021 at 07:35:13PM -0600, Jens Axboe wrote:
>>>>
>>>>> Yep ok I follow you now. And yes, if we get a partial one but one that
>>>>> has more consumed than what was returned, that would not work well. I'm
>>>>> guessing that a) we've never seen that, or b) we always end up with
>>>>> either correctly advanced OR fully advanced, and the fully advanced case
>>>>> would then just return 0 next time and we'd just get a short IO back to
>>>>> userspace.
>>>>>
>>>>> The safer way here would likely be to import the iovec again. We're
>>>>> still in the context of the original submission, and the sqe hasn't been
>>>>> consumed in the ring yet, so that can be done safely.
>>>>
>>>> ... until you end up with something assuming that you've got the same
>>>> iovec from userland the second time around.
>>>>
>>>> IOW, generally it's a bad idea to do that kind of re-imports.
>>>
>>> That's really no different than having one thread do the issue, and
>>> another modify the iovec while it happens. It's only an issue if you
>>> don't validate it, just like you did the first time you imported. No
>>> assumptions need to be made here.
>>
>> 	It's not "need to be made", it's "will be mistakenly made by
>> somebody several years down the road"...
> 
> E.g. somebody blindly assuming that the amount of data read the last
> time around will not exceed the size of reimported iov_iter.  What I'm
> saying is that there's a plenty of ways to fuck up in that direction,
> and they will *not* be caught by normal fuzzers.

If the plan pans out, it's literally doing the _exact_ same thing that
we did originally. No assumptions are made about the contents of the
iovecs originally passed in, none of that state is reused. It's an
identical import to what was originally done.

I'm not saying it's trivial, but as long as the context is correct, then
it really should be pretty straight forward...

> I'm not arguing in favour of an uncoditional copy, BTW - I would like
> to see something resembling profiling data, but it's obviously not a
> pretty solution.

I can tell you right now that it's unworkable, it'll be a very
noticeable slowdown. And it's very much a case of doing the slow path
for the extreme corner case of ever hitting this case. For most
workloads, you'll _never_ hit it. But we obviously have to be able to do
it, for the slower cases (like SCSI with low QD, it'd trigger pretty
easily).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10  3:27             ` Al Viro
@ 2021-09-10  3:30               ` Jens Axboe
  2021-09-10  3:36                 ` Al Viro
  0 siblings, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2021-09-10  3:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/9/21 9:27 PM, Al Viro wrote:
> On Thu, Sep 09, 2021 at 09:22:30PM -0600, Jens Axboe wrote:
>> On 9/9/21 9:11 PM, Al Viro wrote:
>>> On Thu, Sep 09, 2021 at 09:05:13PM -0600, Jens Axboe wrote:
>>>> On 9/9/21 8:57 PM, Al Viro wrote:
>>>>> On Thu, Sep 09, 2021 at 03:19:56PM -0600, Jens Axboe wrote:
>>>>>
>>>>>> Not sure how we'd do that, outside of stupid tricks like copy the
>>>>>> iov_iter before we pass it down. But that's obviously not going to be
>>>>>> very efficient. Hence we're left with having some way to reset/reexpand,
>>>>>> even in the presence of someone having done truncate on it.
>>>>>
>>>>> "Obviously" why, exactly?  It's not that large a structure; it's not
>>>>> the optimal variant, but I'd like to see profiling data before assuming
>>>>> that it'll cause noticable slowdowns.
>>>>
>>>> It's 48 bytes, and we have to do it upfront. That means we'd be doing it
>>>> for _all_ requests, not just when we need to retry. As an example, current
>>>> benchmarks are at ~4M read requests per core. That'd add ~200MB/sec of
>>>> memory traffic just doing this copy.
>>>
>>> Umm...  How much of that will be handled by cache?
>>
>> Depends? And what if the iovec itself has been modified in the middle?
>> We'd need to copy that whole thing too. It's just not workable as a
>> solution.
> 
> Huh?  Why the hell would we need to copy iovecs themselves?  They are
> never modified by ->read_iter()/->write_iter().
> 
> That's the whole fucking point of iov_iter - the iovec itself is made
> constant, with all movable parts taken to iov_iter.
> 
> Again, we should never, ever modify the iovec (or bvec, etc.) array in
> ->read_iter()/->write_iter()/->sendmsg()/etc. instances.  If you see
> such behaviour anywhere, report it immediately.  Any such is a blatant
> bug.

Yes that was wrong, the iovec is obviously const. But that really
doesn't change the original point, which was that copying the iov_iter
itself unconditionally would be miserable.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10  3:30               ` Jens Axboe
@ 2021-09-10  3:36                 ` Al Viro
  2021-09-10 13:57                   ` Jens Axboe
  0 siblings, 1 reply; 46+ messages in thread
From: Al Viro @ 2021-09-10  3:36 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Thu, Sep 09, 2021 at 09:30:03PM -0600, Jens Axboe wrote:

> > Again, we should never, ever modify the iovec (or bvec, etc.) array in
> > ->read_iter()/->write_iter()/->sendmsg()/etc. instances.  If you see
> > such behaviour anywhere, report it immediately.  Any such is a blatant
> > bug.
> 
> Yes that was wrong, the iovec is obviously const. But that really
> doesn't change the original point, which was that copying the iov_iter
> itself unconditionally would be miserable.

Might very well be true, but... won't your patch hit the reimport on
every short read?  And the cost of uaccess in there is *much* higher
than copying of 48 bytes into local variable...

Or am I misreading your patch?  Note that short reads on reaching
EOF are obviously normal - it's not a rare case at all.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10  3:36                 ` Al Viro
@ 2021-09-10 13:57                   ` Jens Axboe
  2021-09-10 14:42                     ` Al Viro
  2021-09-10 15:04                     ` Jens Axboe
  0 siblings, 2 replies; 46+ messages in thread
From: Jens Axboe @ 2021-09-10 13:57 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/9/21 9:36 PM, Al Viro wrote:
> On Thu, Sep 09, 2021 at 09:30:03PM -0600, Jens Axboe wrote:
> 
>>> Again, we should never, ever modify the iovec (or bvec, etc.) array in
>>> ->read_iter()/->write_iter()/->sendmsg()/etc. instances.  If you see
>>> such behaviour anywhere, report it immediately.  Any such is a blatant
>>> bug.
>>
>> Yes that was wrong, the iovec is obviously const. But that really
>> doesn't change the original point, which was that copying the iov_iter
>> itself unconditionally would be miserable.
> 
> Might very well be true, but... won't your patch hit the reimport on
> every short read?  And the cost of uaccess in there is *much* higher
> than copying of 48 bytes into local variable...
> 
> Or am I misreading your patch?  Note that short reads on reaching
> EOF are obviously normal - it's not a rare case at all.

It was just a quick hack, might very well be too eager to go through
those motions. But pondering this instead of sleeping, we don't need to
copy all of iov_iter in order to restore the state, and we can use the
same advance after restoring. So something like this may be more
palatable. Caveat - again untested, and I haven't tested the performance
impact of this at all.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 855ea544807f..4d6d4315deda 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2608,8 +2608,6 @@ static bool io_resubmit_prep(struct io_kiocb *req)
 
 	if (!rw)
 		return !io_req_prep_async(req);
-	/* may have left rw->iter inconsistent on -EIOCBQUEUED */
-	iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter));
 	return true;
 }
 
@@ -3431,14 +3429,45 @@ static bool need_read_all(struct io_kiocb *req)
 		S_ISBLK(file_inode(req->file)->i_mode);
 }
 
+/*
+ * Stash the items we need to restore an iov_iter after a partial or
+ * -EAGAIN'ed result.
+ */
+struct iov_store {
+	ssize_t io_size;
+	size_t iov_offset;
+	unsigned long nr_segs;
+	const void *ptr;
+};
+
+static void io_iter_reset(struct iov_iter *iter, struct iov_store *store,
+			  ssize_t did_bytes)
+{
+	iter->count = store->io_size;
+	iter->iov_offset = store->iov_offset;
+	iter->nr_segs = store->nr_segs;
+	iter->iov = store->ptr;
+	if (did_bytes > 0)
+		iov_iter_advance(iter, did_bytes);
+}
+
+static void io_iov_store(struct iov_store *store, struct iov_iter *iter)
+{
+	store->io_size = iov_iter_count(iter);
+	store->iov_offset = iter->iov_offset;
+	store->nr_segs = iter->nr_segs;
+	store->ptr = iter->iov;
+}
+
 static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t io_size, ret, ret2;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_store store;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
@@ -3448,8 +3477,8 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		if (ret < 0)
 			return ret;
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+	io_iov_store(&store, iter);
+	req->result = store.io_size;
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3463,7 +3492,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		return ret ?: -EAGAIN;
 	}
 
-	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), store.io_size);
 	if (unlikely(ret)) {
 		kfree(iovec);
 		return ret;
@@ -3479,18 +3508,17 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		/* no retry on NONBLOCK nor RWF_NOWAIT */
 		if (req->flags & REQ_F_NOWAIT)
 			goto done;
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
 		ret = 0;
 	} else if (ret == -EIOCBQUEUED) {
 		goto out_free;
-	} else if (ret <= 0 || ret == io_size || !force_nonblock ||
+	} else if (ret <= 0 || ret == store.io_size || !force_nonblock ||
 		   (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) {
 		/* read all, failed, already did sync or don't want to retry */
 		goto done;
 	}
 
+	io_iter_reset(iter, &store, ret);
+
 	ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
 	if (ret2)
 		return ret2;
@@ -3501,7 +3529,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	iter = &rw->iter;
 
 	do {
-		io_size -= ret;
+		store.io_size -= ret;
 		rw->bytes_done += ret;
 		/* if we can retry, do so with the callbacks armed */
 		if (!io_rw_should_retry(req)) {
@@ -3520,7 +3548,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 			return 0;
 		/* we got some bytes, but not all. retry. */
 		kiocb->ki_flags &= ~IOCB_WAITQ;
-	} while (ret > 0 && ret < io_size);
+	} while (ret > 0 && ret < store.io_size);
 done:
 	kiocb_done(kiocb, ret, issue_flags);
 out_free:
@@ -3543,8 +3571,9 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t ret, ret2, io_size;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_store store;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
@@ -3554,8 +3583,10 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		if (ret < 0)
 			return ret;
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+
+	io_iov_store(&store, iter);
+	req->result = store.io_size;
+	ret2 = 0;
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3572,7 +3603,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	    (req->flags & REQ_F_ISREG))
 		goto copy_iov;
 
-	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), store.io_size);
 	if (unlikely(ret))
 		goto out_free;
 
@@ -3619,9 +3650,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		kiocb_done(kiocb, ret2, issue_flags);
 	} else {
 copy_iov:
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
+		io_iter_reset(iter, &store, ret2);
 		ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
 		return ret ?: -EAGAIN;
 	}

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10 13:57                   ` Jens Axboe
@ 2021-09-10 14:42                     ` Al Viro
  2021-09-10 15:08                       ` Jens Axboe
  2021-09-10 15:04                     ` Jens Axboe
  1 sibling, 1 reply; 46+ messages in thread
From: Al Viro @ 2021-09-10 14:42 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Fri, Sep 10, 2021 at 07:57:49AM -0600, Jens Axboe wrote:

> It was just a quick hack, might very well be too eager to go through
> those motions. But pondering this instead of sleeping, we don't need to
> copy all of iov_iter in order to restore the state, and we can use the
> same advance after restoring. So something like this may be more
> palatable. Caveat - again untested, and I haven't tested the performance
> impact of this at all.

You actually can cut it down even more - nr_segs + iov remains constant
all along, so you could get away with just 3 words here...  I would be
surprised if extra memory traffic had shown up - it's well within the
noise from register spills, (un)inlining, etc.  We are talking about
3 (or 4, with your variant) extra words on one stack frame (and that'd
be further offset by removal of ->truncated); I'd still like to see the
profiling data, but concerns about extra memory traffic due to that
are, IMO, misplaced.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10 13:57                   ` Jens Axboe
  2021-09-10 14:42                     ` Al Viro
@ 2021-09-10 15:04                     ` Jens Axboe
  2021-09-10 16:06                       ` Jens Axboe
  1 sibling, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2021-09-10 15:04 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/10/21 7:57 AM, Jens Axboe wrote:
> On 9/9/21 9:36 PM, Al Viro wrote:
>> On Thu, Sep 09, 2021 at 09:30:03PM -0600, Jens Axboe wrote:
>>
>>>> Again, we should never, ever modify the iovec (or bvec, etc.) array in
>>>> ->read_iter()/->write_iter()/->sendmsg()/etc. instances.  If you see
>>>> such behaviour anywhere, report it immediately.  Any such is a blatant
>>>> bug.
>>>
>>> Yes that was wrong, the iovec is obviously const. But that really
>>> doesn't change the original point, which was that copying the iov_iter
>>> itself unconditionally would be miserable.
>>
>> Might very well be true, but... won't your patch hit the reimport on
>> every short read?  And the cost of uaccess in there is *much* higher
>> than copying of 48 bytes into local variable...
>>
>> Or am I misreading your patch?  Note that short reads on reaching
>> EOF are obviously normal - it's not a rare case at all.
> 
> It was just a quick hack, might very well be too eager to go through
> those motions. But pondering this instead of sleeping, we don't need to
> copy all of iov_iter in order to restore the state, and we can use the
> same advance after restoring. So something like this may be more
> palatable. Caveat - again untested, and I haven't tested the performance
> impact of this at all.

Passes basic testing for me. I added a sysctl switch for this so I can
compare performance, running my usual peak-perf-single-core benchmark.
That one does ~3.5M IOPS, using polled IO. There's always a slight
variability between boots and builds, hence the sysctl so I could
toggle this behavior on the fly.

Did a few runs, and the differences are very stable. With this enabled,
we spend about 0.15% more time in io_read(). That's only worth about
5K IOPS at 3.5M, not enough to notice as the variation for the 1 second
reporting window usually swings more than that:

Old behavior:
IOPS=3536512, IOS/call=32/31, inflight=(75)
IOPS=3541888, IOS/call=32/32, inflight=(64)
IOPS=3529056, IOS/call=32/31, inflight=(119)
IOPS=3521184, IOS/call=32/32, inflight=(96)
IOPS=3527456, IOS/call=32/31, inflight=(128)
IOPS=3525504, IOS/call=32/32, inflight=(128)
IOPS=3524288, IOS/call=32/32, inflight=(128)
IOPS=3536192, IOS/call=32/32, inflight=(96)
IOPS=3535840, IOS/call=32/32, inflight=(96)
IOPS=3533728, IOS/call=32/31, inflight=(128)
IOPS=3528384, IOS/call=32/32, inflight=(128)
IOPS=3518400, IOS/call=32/32, inflight=(64)

Turning it on:
IOPS=3533824, IOS/call=32/31, inflight=(64)
IOPS=3541408, IOS/call=32/32, inflight=(32)
IOPS=3533024, IOS/call=32/31, inflight=(64)
IOPS=3528672, IOS/call=32/32, inflight=(35)
IOPS=3522272, IOS/call=32/31, inflight=(107)
IOPS=3517632, IOS/call=32/32, inflight=(57)
IOPS=3516000, IOS/call=32/31, inflight=(96)
IOPS=3513568, IOS/call=32/32, inflight=(34)
IOPS=3525600, IOS/call=32/31, inflight=(96)
IOPS=3527136, IOS/call=32/31, inflight=(101)

I think that's tolerable, it was never going to be absolutely free.

What do you think of this approach? Parts of iov_iter are going to
remain constant, like iter_type and data_source. io_uring already copies
iter->count, so that just leaves restoring iov (and unionized friends),
nr_segs union, and iov_offset;

We could pretty this up and have the state part be explicit in iov_iter,
and have the store/restore parts end up in uio.h. That'd tie them closer
together, though I don't expect iov_iter changes to be an issue. It
would make it more maintainable, though. I'll try and hack up this
generic solution, see if that looks any better.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10 14:42                     ` Al Viro
@ 2021-09-10 15:08                       ` Jens Axboe
  2021-09-10 15:32                         ` Al Viro
  0 siblings, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2021-09-10 15:08 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/10/21 8:42 AM, Al Viro wrote:
> On Fri, Sep 10, 2021 at 07:57:49AM -0600, Jens Axboe wrote:
> 
>> It was just a quick hack, might very well be too eager to go through
>> those motions. But pondering this instead of sleeping, we don't need to
>> copy all of iov_iter in order to restore the state, and we can use the
>> same advance after restoring. So something like this may be more
>> palatable. Caveat - again untested, and I haven't tested the performance
>> impact of this at all.
> 
> You actually can cut it down even more - nr_segs + iov remains constant
> all along, so you could get away with just 3 words here...  I would be

Mmm, the iov pointer remains constant? Maybe I'm missing your point, but
the various advance functions are quite happy to increment iter->iov or
iter->bvec, so we need to restore them. From a quick look, looks like
iter->nr_segs is modified for advancing too.

What am I missing?

> surprised if extra memory traffic had shown up - it's well within the
> noise from register spills, (un)inlining, etc.  We are talking about
> 3 (or 4, with your variant) extra words on one stack frame (and that'd
> be further offset by removal of ->truncated); I'd still like to see the
> profiling data, but concerns about extra memory traffic due to that
> are, IMO, misplaced.

See other email that was just sent out, it is measurable but pretty
minimal. But that's also down to about 1/3rd of copying the whole
thing blindly, so definitely a better case.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10 15:08                       ` Jens Axboe
@ 2021-09-10 15:32                         ` Al Viro
  2021-09-10 15:36                           ` Jens Axboe
  0 siblings, 1 reply; 46+ messages in thread
From: Al Viro @ 2021-09-10 15:32 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Fri, Sep 10, 2021 at 09:08:02AM -0600, Jens Axboe wrote:

> > You actually can cut it down even more - nr_segs + iov remains constant
> > all along, so you could get away with just 3 words here...  I would be
> 
> Mmm, the iov pointer remains constant? Maybe I'm missing your point, but
> the various advance functions are quite happy to increment iter->iov or
> iter->bvec, so we need to restore them. From a quick look, looks like
> iter->nr_segs is modified for advancing too.
> 
> What am I missing?

i->iov + i->nr_segs does not change - the places incrementing the former
will decrement the latter by the same amount.  So it's enough to store
either of those - the other one can be recovered by subtracting the
saved value from the current i->iov + i->nr_segs.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10 15:32                         ` Al Viro
@ 2021-09-10 15:36                           ` Jens Axboe
  0 siblings, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2021-09-10 15:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/10/21 9:32 AM, Al Viro wrote:
> On Fri, Sep 10, 2021 at 09:08:02AM -0600, Jens Axboe wrote:
> 
>>> You actually can cut it down even more - nr_segs + iov remains constant
>>> all along, so you could get away with just 3 words here...  I would be
>>
>> Mmm, the iov pointer remains constant? Maybe I'm missing your point, but
>> the various advance functions are quite happy to increment iter->iov or
>> iter->bvec, so we need to restore them. From a quick look, looks like
>> iter->nr_segs is modified for advancing too.
>>
>> What am I missing?
> 
> i->iov + i->nr_segs does not change - the places incrementing the former
> will decrement the latter by the same amount.  So it's enough to store
> either of those - the other one can be recovered by subtracting the
> saved value from the current i->iov + i->nr_segs.

Ahh, clever. Yes that should work just fine. Let me test that and send
out a proposal. Thanks Al.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10 15:04                     ` Jens Axboe
@ 2021-09-10 16:06                       ` Jens Axboe
  2021-09-10 16:44                         ` Linus Torvalds
  2021-09-10 16:56                         ` Al Viro
  0 siblings, 2 replies; 46+ messages in thread
From: Jens Axboe @ 2021-09-10 16:06 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/10/21 9:04 AM, Jens Axboe wrote:
> We could pretty this up and have the state part be explicit in iov_iter,
> and have the store/restore parts end up in uio.h. That'd tie them closer
> together, though I don't expect iov_iter changes to be an issue. It
> would make it more maintainable, though. I'll try and hack up this
> generic solution, see if that looks any better.

Looks something like this. Not super pretty in terms of needing a define
for this, and maybe I'm missing something, but ideally we'd want it as
an anonymous struct that's defined inside iov_iter. Anyway, gets the
point across. Alternatively, since we're down to just a few members now,
we just duplicate them in each struct...

Would be split into two patches, one for the iov_state addition and
the save/restore helpers, and then one switching io_uring to use them.
Figured we'd need some agreement on this first...


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 855ea544807f..539c94299d64 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2608,8 +2608,6 @@ static bool io_resubmit_prep(struct io_kiocb *req)
 
 	if (!rw)
 		return !io_req_prep_async(req);
-	/* may have left rw->iter inconsistent on -EIOCBQUEUED */
-	iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter));
 	return true;
 }
 
@@ -3431,14 +3429,23 @@ static bool need_read_all(struct io_kiocb *req)
 		S_ISBLK(file_inode(req->file)->i_mode);
 }
 
+static void io_iter_restore(struct iov_iter *iter, struct iov_iter_state *state,
+			    ssize_t did_bytes)
+{
+	iov_iter_restore_state(iter, state);
+	if (did_bytes > 0)
+		iov_iter_advance(iter, did_bytes);
+}
+
 static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t io_size, ret, ret2;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_iter_state state;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
@@ -3448,8 +3455,8 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		if (ret < 0)
 			return ret;
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+	req->result = iov_iter_count(iter);
+	iov_iter_save_state(iter, &state);
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3463,7 +3470,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		return ret ?: -EAGAIN;
 	}
 
-	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), state.count);
 	if (unlikely(ret)) {
 		kfree(iovec);
 		return ret;
@@ -3479,18 +3486,17 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		/* no retry on NONBLOCK nor RWF_NOWAIT */
 		if (req->flags & REQ_F_NOWAIT)
 			goto done;
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
 		ret = 0;
 	} else if (ret == -EIOCBQUEUED) {
 		goto out_free;
-	} else if (ret <= 0 || ret == io_size || !force_nonblock ||
+	} else if (ret <= 0 || ret == state.count || !force_nonblock ||
 		   (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) {
 		/* read all, failed, already did sync or don't want to retry */
 		goto done;
 	}
 
+	io_iter_restore(iter, &state, ret);
+
 	ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
 	if (ret2)
 		return ret2;
@@ -3501,7 +3507,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	iter = &rw->iter;
 
 	do {
-		io_size -= ret;
+		state.count -= ret;
 		rw->bytes_done += ret;
 		/* if we can retry, do so with the callbacks armed */
 		if (!io_rw_should_retry(req)) {
@@ -3520,7 +3526,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 			return 0;
 		/* we got some bytes, but not all. retry. */
 		kiocb->ki_flags &= ~IOCB_WAITQ;
-	} while (ret > 0 && ret < io_size);
+	} while (ret > 0 && ret < state.count);
 done:
 	kiocb_done(kiocb, ret, issue_flags);
 out_free:
@@ -3543,8 +3549,9 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t ret, ret2, io_size;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_iter_state state;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
@@ -3554,8 +3561,9 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		if (ret < 0)
 			return ret;
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+	req->result = iov_iter_count(iter);
+	iov_iter_save_state(iter, &state);
+	ret2 = 0;
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3572,7 +3580,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	    (req->flags & REQ_F_ISREG))
 		goto copy_iov;
 
-	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), state.count);
 	if (unlikely(ret))
 		goto out_free;
 
@@ -3619,9 +3627,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		kiocb_done(kiocb, ret2, issue_flags);
 	} else {
 copy_iov:
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
+		io_iter_restore(iter, &state, ret2);
 		ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
 		return ret ?: -EAGAIN;
 	}
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 5265024e8b90..4f9d483096cd 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -27,11 +27,25 @@ enum iter_type {
 	ITER_DISCARD,
 };
 
+#define IOV_ITER_STATE					\
+	size_t iov_offset;				\
+	size_t count;					\
+	union {						\
+		unsigned long nr_segs;			\
+		struct {				\
+			unsigned int head;		\
+			unsigned int start_head;	\
+		};					\
+		loff_t xarray_start;			\
+	};						\
+
+struct iov_iter_state {
+	IOV_ITER_STATE;
+};
+
 struct iov_iter {
 	u8 iter_type;
 	bool data_source;
-	size_t iov_offset;
-	size_t count;
 	union {
 		const struct iovec *iov;
 		const struct kvec *kvec;
@@ -40,12 +54,10 @@ struct iov_iter {
 		struct pipe_inode_info *pipe;
 	};
 	union {
-		unsigned long nr_segs;
+		struct iov_iter_state state;
 		struct {
-			unsigned int head;
-			unsigned int start_head;
+			IOV_ITER_STATE;
 		};
-		loff_t xarray_start;
 	};
 	size_t truncated;
 };
@@ -55,6 +67,33 @@ static inline enum iter_type iov_iter_type(const struct iov_iter *i)
 	return i->iter_type;
 }
 
+static inline void iov_iter_save_state(struct iov_iter *iter,
+				       struct iov_iter_state *state)
+{
+	*state = iter->state;
+}
+
+static inline void iov_iter_restore_state(struct iov_iter *iter,
+					  struct iov_iter_state *state)
+{
+	iter->iov_offset = state->iov_offset;
+	iter->count = state->count;
+
+	switch (iov_iter_type(iter)) {
+	case ITER_IOVEC:
+	case ITER_KVEC:
+	case ITER_BVEC:
+		iter->iov -= state->nr_segs - iter->nr_segs;
+		fallthrough;
+	case ITER_PIPE:
+	case ITER_XARRAY:
+		iter->nr_segs = state->nr_segs;
+		break;
+	case ITER_DISCARD:
+		break;
+	}
+}
+
 static inline bool iter_is_iovec(const struct iov_iter *i)
 {
 	return iov_iter_type(i) == ITER_IOVEC;

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10 16:06                       ` Jens Axboe
@ 2021-09-10 16:44                         ` Linus Torvalds
  2021-09-10 16:56                         ` Al Viro
  1 sibling, 0 replies; 46+ messages in thread
From: Linus Torvalds @ 2021-09-10 16:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Al Viro, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Fri, Sep 10, 2021 at 9:06 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> +static void io_iter_restore(struct iov_iter *iter, struct iov_iter_state *state,
> +                           ssize_t did_bytes)
> +{
> +       iov_iter_restore_state(iter, state);
> +       if (did_bytes > 0)
> +               iov_iter_advance(iter, did_bytes);
> +}

This approach looks conceptually good to me.

Just name it "iov_iter_restore()", and (together with the actual
iov_iter_restore_state() - I don't think it makes much sense to inline
something like this that is by definition for the slow path when
something failed) move it to lib/iov_iter.c

If this allows us to remove the 'truncated' field from the iov_iter, I
think it's a win overall.

That said, I think your actual implementation of
iov_iter_restore_state() is buggy. It's not just those state bits you
need to restore, you do need to do all the "back out the i->iov/bvec
pointers" games too. All the stuff that iov_iter_revert() does.

Which means that I think your tricks to try to share the 'struct
iov_iter_state' with the 'struct iov_iter' using unions are just ugly
and pointless and make for more complex code. Because you can't just
save/restore the 'state part' of it all, you do have to do more than
that.

So instead of the union, just have the state in some sane (different)
form, and do the revert/advance thing taking different types of
iterators into account. This is not supposed to be
performance-critical code.

Alternatively, you'd need to make the state part be *both* the unions,
and restore the pointers that don't need restoring too. You end up
with pretty much all of iov_iter.

Al, what do you think?

               Linus

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10 16:06                       ` Jens Axboe
  2021-09-10 16:44                         ` Linus Torvalds
@ 2021-09-10 16:56                         ` Al Viro
  2021-09-10 16:58                           ` Linus Torvalds
  2021-09-10 17:04                           ` Jens Axboe
  1 sibling, 2 replies; 46+ messages in thread
From: Al Viro @ 2021-09-10 16:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Fri, Sep 10, 2021 at 10:06:25AM -0600, Jens Axboe wrote:

> Looks something like this. Not super pretty in terms of needing a define
> for this, and maybe I'm missing something, but ideally we'd want it as
> an anonymous struct that's defined inside iov_iter. Anyway, gets the
> point across. Alternatively, since we're down to just a few members now,
> we just duplicate them in each struct...
> 
> Would be split into two patches, one for the iov_state addition and
> the save/restore helpers, and then one switching io_uring to use them.
> Figured we'd need some agreement on this first...

> +#define IOV_ITER_STATE					\
> +	size_t iov_offset;				\
> +	size_t count;					\
> +	union {						\
> +		unsigned long nr_segs;			\
> +		struct {				\
> +			unsigned int head;		\
> +			unsigned int start_head;	\
> +		};					\
> +		loff_t xarray_start;			\
> +	};						\
> +
> +struct iov_iter_state {
> +	IOV_ITER_STATE;
> +};
> +
>  struct iov_iter {
>  	u8 iter_type;
>  	bool data_source;
> -	size_t iov_offset;
> -	size_t count;
>  	union {
>  		const struct iovec *iov;
>  		const struct kvec *kvec;
> @@ -40,12 +54,10 @@ struct iov_iter {
>  		struct pipe_inode_info *pipe;
>  	};
>  	union {
> -		unsigned long nr_segs;
> +		struct iov_iter_state state;
>  		struct {
> -			unsigned int head;
> -			unsigned int start_head;
> +			IOV_ITER_STATE;
>  		};
> -		loff_t xarray_start;
>  	};
>  	size_t truncated;
>  };

No.  This is impossible to read *and* wrong for flavours other than
iovec anyway.

Rules:
	count is flavour-independent
	iovec: iov, nr_segs, iov_offset.  nr_segs + iov is constant
	kvec: kvec, nr_segs, iov_offset.  nr_segs + kvec is constant
	bvec: bvec, nr_segs, iov_offset.  nr_segs + bvec is constant
	xarray: xarray, xarray_start, iov_offset.  xarray and xarray_start are constant.
	pipe: pipe, head, start_head, iov_offset.  pipe and start_head are constant,
						   iov_offset can be derived from the rest.
	discard: nothing.

What's more, for pipe (output-only) the situation is much trickier and
there this "reset + advance" won't work at all.  Simply not applicable.

What's the point of all those contortions, anyway?  You only need it for
iovec case; don't mix doing that and turning it into flavour-independent
primitive.

Especially since you turn around and access the fields of that sucker
(->count, that is) directly in your code.  Keep it simple and readable,
please.  We'll sort the sane flavour-independent API later.  And get
rid of ->truncate, while we are at it.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10 16:56                         ` Al Viro
@ 2021-09-10 16:58                           ` Linus Torvalds
  2021-09-10 17:26                             ` Jens Axboe
  2021-09-10 17:04                           ` Jens Axboe
  1 sibling, 1 reply; 46+ messages in thread
From: Linus Torvalds @ 2021-09-10 16:58 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Fri, Sep 10, 2021 at 9:56 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What's the point of all those contortions, anyway?  You only need it for
> iovec case; don't mix doing that and turning it into flavour-independent
> primitive.

Good point, making it specific to iovec only gets rid of a lot of
special cases and worries.

This is fairly specialized, no need to always cater to every possible case.

               Linus

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10 16:56                         ` Al Viro
  2021-09-10 16:58                           ` Linus Torvalds
@ 2021-09-10 17:04                           ` Jens Axboe
  1 sibling, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2021-09-10 17:04 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/10/21 10:56 AM, Al Viro wrote:
> On Fri, Sep 10, 2021 at 10:06:25AM -0600, Jens Axboe wrote:
> 
>> Looks something like this. Not super pretty in terms of needing a define
>> for this, and maybe I'm missing something, but ideally we'd want it as
>> an anonymous struct that's defined inside iov_iter. Anyway, gets the
>> point across. Alternatively, since we're down to just a few members now,
>> we just duplicate them in each struct...
>>
>> Would be split into two patches, one for the iov_state addition and
>> the save/restore helpers, and then one switching io_uring to use them.
>> Figured we'd need some agreement on this first...
> 
>> +#define IOV_ITER_STATE					\
>> +	size_t iov_offset;				\
>> +	size_t count;					\
>> +	union {						\
>> +		unsigned long nr_segs;			\
>> +		struct {				\
>> +			unsigned int head;		\
>> +			unsigned int start_head;	\
>> +		};					\
>> +		loff_t xarray_start;			\
>> +	};						\
>> +
>> +struct iov_iter_state {
>> +	IOV_ITER_STATE;
>> +};
>> +
>>  struct iov_iter {
>>  	u8 iter_type;
>>  	bool data_source;
>> -	size_t iov_offset;
>> -	size_t count;
>>  	union {
>>  		const struct iovec *iov;
>>  		const struct kvec *kvec;
>> @@ -40,12 +54,10 @@ struct iov_iter {
>>  		struct pipe_inode_info *pipe;
>>  	};
>>  	union {
>> -		unsigned long nr_segs;
>> +		struct iov_iter_state state;
>>  		struct {
>> -			unsigned int head;
>> -			unsigned int start_head;
>> +			IOV_ITER_STATE;
>>  		};
>> -		loff_t xarray_start;
>>  	};
>>  	size_t truncated;
>>  };
> 
> No.  This is impossible to read *and* wrong for flavours other than
> iovec anyway.
> 
> Rules:
> 	count is flavour-independent
> 	iovec: iov, nr_segs, iov_offset.  nr_segs + iov is constant
> 	kvec: kvec, nr_segs, iov_offset.  nr_segs + kvec is constant
> 	bvec: bvec, nr_segs, iov_offset.  nr_segs + bvec is constant
> 	xarray: xarray, xarray_start, iov_offset.  xarray and xarray_start are constant.
> 	pipe: pipe, head, start_head, iov_offset.  pipe and start_head are constant,
> 						   iov_offset can be derived from the rest.
> 	discard: nothing.
> 
> What's more, for pipe (output-only) the situation is much trickier and
> there this "reset + advance" won't work at all.  Simply not applicable.
> 
> What's the point of all those contortions, anyway?  You only need it for
> iovec case; don't mix doing that and turning it into flavour-independent
> primitive.

Yes that's a good point, BVEC as well fwiw. But those two are very
similar.

> Especially since you turn around and access the fields of that sucker
> (->count, that is) directly in your code.  Keep it simple and readable,
> please.  We'll sort the sane flavour-independent API later.  And get
> rid of ->truncate, while we are at it.

Alright, so how about I just make the state a bit dumber and only work
for iovec/bvec. That gets rid of the weirdo macro. Add a WARN_ON_ONCE()
for using restore on anything that isn't an IOVEC/BVEC.

Sound reasonable?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10 16:58                           ` Linus Torvalds
@ 2021-09-10 17:26                             ` Jens Axboe
  2021-09-10 17:31                               ` Linus Torvalds
  0 siblings, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2021-09-10 17:26 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro
  Cc: Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/10/21 10:58 AM, Linus Torvalds wrote:
> On Fri, Sep 10, 2021 at 9:56 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> What's the point of all those contortions, anyway?  You only need it for
>> iovec case; don't mix doing that and turning it into flavour-independent
>> primitive.
> 
> Good point, making it specific to iovec only gets rid of a lot of
> special cases and worries.
> 
> This is fairly specialized, no need to always cater to every possible case.

Alright, split into three patches:

https://git.kernel.dk/cgit/linux-block/log/?h=iov_iter

If this looks reasonable (Al/Linus), then I'll send it out officially with
git send-email as well.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10 17:26                             ` Jens Axboe
@ 2021-09-10 17:31                               ` Linus Torvalds
  2021-09-10 17:32                                 ` Jens Axboe
  2021-09-10 18:48                                 ` Al Viro
  0 siblings, 2 replies; 46+ messages in thread
From: Linus Torvalds @ 2021-09-10 17:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Al Viro, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Fri, Sep 10, 2021 at 10:26 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 9/10/21 10:58 AM, Linus Torvalds wrote:
> > On Fri, Sep 10, 2021 at 9:56 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >>
> >> What's the point of all those contortions, anyway?  You only need it for
> >> iovec case; don't mix doing that and turning it into flavour-independent
> >> primitive.
> >
> > Good point, making it specific to iovec only gets rid of a lot of
> > special cases and worries.
> >
> > This is fairly specialized, no need to always cater to every possible case.
>
> Alright, split into three patches:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=iov_iter

That looks sane to me.

Please add some comment about how that

        i->iov -= state->nr_segs - i->nr_segs;

actually is the right thing for all the three cases (iow how 'iov',
'kvec' and 'bvec' all end up having a union member that acts the same
way).

But yeah, I like how the io_uring.c code looks better this way too.

Al, what do you think?

              Linus

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10 17:31                               ` Linus Torvalds
@ 2021-09-10 17:32                                 ` Jens Axboe
  2021-09-10 18:48                                 ` Al Viro
  1 sibling, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2021-09-10 17:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/10/21 11:31 AM, Linus Torvalds wrote:
> On Fri, Sep 10, 2021 at 10:26 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 9/10/21 10:58 AM, Linus Torvalds wrote:
>>> On Fri, Sep 10, 2021 at 9:56 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>>>
>>>> What's the point of all those contortions, anyway?  You only need it for
>>>> iovec case; don't mix doing that and turning it into flavour-independent
>>>> primitive.
>>>
>>> Good point, making it specific to iovec only gets rid of a lot of
>>> special cases and worries.
>>>
>>> This is fairly specialized, no need to always cater to every possible case.
>>
>> Alright, split into three patches:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=iov_iter
> 
> That looks sane to me.
> 
> Please add some comment about how that
> 
>         i->iov -= state->nr_segs - i->nr_segs;
> 
> actually is the right thing for all the three cases (iow how 'iov',
> 'kvec' and 'bvec' all end up having a union member that acts the same
> way).

Good idea, I'll add that right now.

> But yeah, I like how the io_uring.c code looks better this way too.

Me too :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10 17:31                               ` Linus Torvalds
  2021-09-10 17:32                                 ` Jens Axboe
@ 2021-09-10 18:48                                 ` Al Viro
  2021-09-10 19:04                                   ` Linus Torvalds
  2021-09-10 19:10                                   ` Jens Axboe
  1 sibling, 2 replies; 46+ messages in thread
From: Al Viro @ 2021-09-10 18:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Fri, Sep 10, 2021 at 10:31:00AM -0700, Linus Torvalds wrote:
> On Fri, Sep 10, 2021 at 10:26 AM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > On 9/10/21 10:58 AM, Linus Torvalds wrote:
> > > On Fri, Sep 10, 2021 at 9:56 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >>
> > >> What's the point of all those contortions, anyway?  You only need it for
> > >> iovec case; don't mix doing that and turning it into flavour-independent
> > >> primitive.
> > >
> > > Good point, making it specific to iovec only gets rid of a lot of
> > > special cases and worries.
> > >
> > > This is fairly specialized, no need to always cater to every possible case.
> >
> > Alright, split into three patches:
> >
> > https://git.kernel.dk/cgit/linux-block/log/?h=iov_iter
> 
> That looks sane to me.
> 
> Please add some comment about how that
> 
>         i->iov -= state->nr_segs - i->nr_segs;
> 
> actually is the right thing for all the three cases (iow how 'iov',
> 'kvec' and 'bvec' all end up having a union member that acts the same
> way).
> 
> But yeah, I like how the io_uring.c code looks better this way too.
> 
> Al, what do you think?

I think that sizeof(struct bio_vec) != sizeof(struct iovec):
struct bio_vec {
        struct page     *bv_page;
	unsigned int    bv_len;
	unsigned int    bv_offset;
};
takes 3 words on 32bit boxen.
struct iovec
{
        void __user *iov_base;  /* BSD uses caddr_t (1003.1g requires void *) */
	__kernel_size_t iov_len; /* Must be size_t (1003.1g) */
};
takes 2 words on 32bit boxen.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10 18:48                                 ` Al Viro
@ 2021-09-10 19:04                                   ` Linus Torvalds
  2021-09-10 19:10                                     ` Linus Torvalds
  2021-09-10 19:10                                   ` Jens Axboe
  1 sibling, 1 reply; 46+ messages in thread
From: Linus Torvalds @ 2021-09-10 19:04 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Fri, Sep 10, 2021 at 11:50 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> I think that sizeof(struct bio_vec) != sizeof(struct iovec):

Ooh, very good catch.

That would cause some *very* odd and subtle errors, since it just
happens to work on 64-bit, and then causes very confusing pointer
arithmetic errors on 32-bit.

So yeah, that

        i->iov -= state->nr_segs - i->nr_segs;

doesn't work after all, comment or not.

So only 'struct iovec' and 'struct kvec' actually have the same format
and can be used interchangeably.

            Linus

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10 19:04                                   ` Linus Torvalds
@ 2021-09-10 19:10                                     ` Linus Torvalds
  0 siblings, 0 replies; 46+ messages in thread
From: Linus Torvalds @ 2021-09-10 19:10 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On Fri, Sep 10, 2021 at 12:04 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So only 'struct iovec' and 'struct kvec' actually have the same format
> and can be used interchangeably.

That was very badly and confusingly phrased. They obviously don't
actually have the same format, and cannot be used interchangeably in
general.

But the pointer arithmetic works the same for those two union members,
so for that very specific case (and _only_ that) you can treat them as
equivalent and use them interchangeably.

Al clearly understood that, but I just wanted to clarify my phrasing
for anybody else reading this thread. Please don't use the iov/kvec
members interchangeably in general.

              Linus

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [git pull] iov_iter fixes
  2021-09-10 18:48                                 ` Al Viro
  2021-09-10 19:04                                   ` Linus Torvalds
@ 2021-09-10 19:10                                   ` Jens Axboe
  1 sibling, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2021-09-10 19:10 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds
  Cc: Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

On 9/10/21 12:48 PM, Al Viro wrote:
> On Fri, Sep 10, 2021 at 10:31:00AM -0700, Linus Torvalds wrote:
>> On Fri, Sep 10, 2021 at 10:26 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 9/10/21 10:58 AM, Linus Torvalds wrote:
>>>> On Fri, Sep 10, 2021 at 9:56 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>>>>
>>>>> What's the point of all those contortions, anyway?  You only need it for
>>>>> iovec case; don't mix doing that and turning it into flavour-independent
>>>>> primitive.
>>>>
>>>> Good point, making it specific to iovec only gets rid of a lot of
>>>> special cases and worries.
>>>>
>>>> This is fairly specialized, no need to always cater to every possible case.
>>>
>>> Alright, split into three patches:
>>>
>>> https://git.kernel.dk/cgit/linux-block/log/?h=iov_iter
>>
>> That looks sane to me.
>>
>> Please add some comment about how that
>>
>>         i->iov -= state->nr_segs - i->nr_segs;
>>
>> actually is the right thing for all the three cases (iow how 'iov',
>> 'kvec' and 'bvec' all end up having a union member that acts the same
>> way).
>>
>> But yeah, I like how the io_uring.c code looks better this way too.
>>
>> Al, what do you think?
> 
> I think that sizeof(struct bio_vec) != sizeof(struct iovec):
> struct bio_vec {
>         struct page     *bv_page;
> 	unsigned int    bv_len;
> 	unsigned int    bv_offset;
> };
> takes 3 words on 32bit boxen.
> struct iovec
> {
>         void __user *iov_base;  /* BSD uses caddr_t (1003.1g requires void *) */
> 	__kernel_size_t iov_len; /* Must be size_t (1003.1g) */
> };
> takes 2 words on 32bit boxen.

Ouch yes. I guess we do have to special case BVEC for now, as I do actually
need that one internally. I'll add a BUILD_BUG_ON() for the other one while
at it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [git pull] iov_iter fixes
  2021-09-10  2:48               ` Al Viro
  2021-09-10  3:06                 ` Jens Axboe
@ 2021-09-13 15:29                 ` David Laight
  1 sibling, 0 replies; 46+ messages in thread
From: David Laight @ 2021-09-13 15:29 UTC (permalink / raw)
  To: 'Al Viro', Jens Axboe
  Cc: Linus Torvalds, Pavel Begunkov, Linux Kernel Mailing List, linux-fsdevel

From: Al Viro <viro@ftp.linux.org.uk>
> Sent: 10 September 2021 03:48
> 
> On Thu, Sep 09, 2021 at 07:35:13PM -0600, Jens Axboe wrote:
> 
> > Yep ok I follow you now. And yes, if we get a partial one but one that
> > has more consumed than what was returned, that would not work well. I'm
> > guessing that a) we've never seen that, or b) we always end up with
> > either correctly advanced OR fully advanced, and the fully advanced case
> > would then just return 0 next time and we'd just get a short IO back to
> > userspace.
> >
> > The safer way here would likely be to import the iovec again. We're
> > still in the context of the original submission, and the sqe hasn't been
> > consumed in the ring yet, so that can be done safely.
> 
> ... until you end up with something assuming that you've got the same
> iovec from userland the second time around.
> 
> IOW, generally it's a bad idea to do that kind of re-imports.


IIRC the canonical 'import' code is something like:

	struct iov iov[8], *cache = iov;
	struct iter;

	iov_iter_import(&iter, ... , &cache, 8);

	result = ....

	if (cache != iov)
		kfree(cache);

	return result;

The iov[] and 'cache' are always allocated on stack with 'iter'.

Now processing the 'iter' advances iter->iov.
So to reset you need the start point - which is either 'cache' or 'iov[]'.
The outer caller (typically) has this information.
But the inner functions don't.

Move both 'iov[]' and 'cache' into struct iter and they become
available to all the code.
It would also simplify the currently horrid boilerplate code
that is replicated for every user.

You might need a 'offset in current iter->iov[]' but nothing else.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2021-09-13 15:30 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-09  4:22 [git pull] iov_iter fixes Al Viro
2021-09-09 19:37 ` Linus Torvalds
2021-09-09 21:19   ` Jens Axboe
2021-09-09 21:39     ` Jens Axboe
2021-09-09 21:56       ` Linus Torvalds
2021-09-09 22:21         ` Jens Axboe
2021-09-09 22:56           ` Linus Torvalds
2021-09-10  1:35             ` Jens Axboe
2021-09-10  2:43               ` Jens Axboe
2021-09-10  2:48               ` Al Viro
2021-09-10  3:06                 ` Jens Axboe
2021-09-10  3:15                   ` Al Viro
2021-09-10  3:23                     ` Jens Axboe
2021-09-10  3:24                     ` Al Viro
2021-09-10  3:28                       ` Jens Axboe
2021-09-13 15:29                 ` David Laight
2021-09-09 21:42     ` Dave Chinner
2021-09-10  2:57     ` Al Viro
2021-09-10  3:05       ` Jens Axboe
2021-09-10  3:11         ` Al Viro
2021-09-10  3:22           ` Jens Axboe
2021-09-10  3:27             ` Al Viro
2021-09-10  3:30               ` Jens Axboe
2021-09-10  3:36                 ` Al Viro
2021-09-10 13:57                   ` Jens Axboe
2021-09-10 14:42                     ` Al Viro
2021-09-10 15:08                       ` Jens Axboe
2021-09-10 15:32                         ` Al Viro
2021-09-10 15:36                           ` Jens Axboe
2021-09-10 15:04                     ` Jens Axboe
2021-09-10 16:06                       ` Jens Axboe
2021-09-10 16:44                         ` Linus Torvalds
2021-09-10 16:56                         ` Al Viro
2021-09-10 16:58                           ` Linus Torvalds
2021-09-10 17:26                             ` Jens Axboe
2021-09-10 17:31                               ` Linus Torvalds
2021-09-10 17:32                                 ` Jens Axboe
2021-09-10 18:48                                 ` Al Viro
2021-09-10 19:04                                   ` Linus Torvalds
2021-09-10 19:10                                     ` Linus Torvalds
2021-09-10 19:10                                   ` Jens Axboe
2021-09-10 17:04                           ` Jens Axboe
2021-09-09 22:54   ` Pavel Begunkov
2021-09-09 22:57     ` Pavel Begunkov
2021-09-09 23:14   ` Pavel Begunkov
2021-09-09 20:03 ` pr-tracker-bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).