All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] write(2) semantics wrt return values and current position
@ 2015-04-06 16:02 Al Viro
  2015-04-06 18:13 ` Linus Torvalds
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Al Viro @ 2015-04-06 16:02 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Trond Myklebust, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Miklos Szeredi, Oleg Drokin

	There are several questions regarding the write(2) semantics, and
I'd like to see comments on those.  All of that is for regular files.

	1) should we ever update the current position when write returns
an error?  As it is, write(2) explicitly ignores any changes of position
when ->write() has returned an error, but some other callers of vfs_write()
are not so careful.

	2) should we ever update the current position when write() returns 0?
IOW, what effect should zero-length write() on O_APPEND file have upon its
current position?  POSIX seems to imply that it should do nothing, and
generally that's what happens, but e.g. ext4 *does* update position to
the EOF, whether we will write anything or not.  So does FUSE when server
requests to bypass the page cache.  AFAICS, lustre is the same way,
but I might be missing something; everything else definitely does not
update position in that case.  IMO the common behaviour is correct and
ext4 one is a bug.

	3) pwrite(2): POSIX seems to require ignoring the O_APPEND completely
for that syscall.  We definitely do not.  It's arguable whether this is
desired or not, but it's an existing behaviour that had been that way since
we'd got pwrite(2) in the kernel (2.1.60).  Probably too late to do anything
about that.

	4) at lower level, there's a nasty case when short (but non-empty)
O_DIRECT write followed by success of fallback to buffered write and a failure
of filemap_write_and_wait_range() yields a return of the amount written by
->direct_IO() *and* update of current position by that plus the amount
reported by buffered write.  IOW, we shift the offset by amount different
from (positive) value we'll be returning from write(2).  That's a direct
POSIX violation and I would expect the userland to be very surprised by
running into that.  IMO it's a bug and we would be better off by shifting
position by the amount we'll be returning.

	5) somewhat related: nfs_direct_IO() ends up calling
nfs_file_direct_write(), which calls generic_write_checks();
it's triggered by swap-over-NFS (normal O_DIRECT writes go directly to
nfs_file_direct_write()), and it ends up being subject to rlimit of
caller.  Which might be anyone who calls alloc_pages(), AFAICS.  Almost
certainly a bug.

	6) XFS seems to have fun bugs in O_DIRECT handling.  Consider
the following scenario:
	* O_DIRECT write() is called, we hit xfs_file_dio_aio_write().
	* we check alignment and make decision whether to do
xfs_rw_ilock exclusive (which will include i_mutex) or shared (which will
not).  Suppose it takes that shared.
	* we call xfs_file_aio_write_checks(), which, for starters, might
modify position (on O_APPEND) and size (on rlimit).  Which renders the
alignment checks useless, of course, but what's worse, it proceeds to
calling xfs_break_layouts(), which might drop and retake XFS part of what's
taken by xfs_rw_iolock().  Retake it exclusive, and update the iolock flag
passed to it by reference accordingly.  And when we return to
xfs_file_aio_write_checks(), and do xfs_rw_iunlock(), we'll end up dropping
exclusively taken XFS part of things *and* ->i_mutex we'd never taken.
	I might be misreading that code (it sure as hell wouldn't be
the first time when xfs_{rw_,}_ilock() is involved), but it looks dubious
to me...

	My preference would be to have new_sync_write() and vfs_iter_write()
to ignore iocb.ki_pos when ->write_iter() returns negative or zero (would
take care of (1) and (2)) and have __generic_file_write_iter() to do
->ki_pos update in sync with what it'll be returning (takes care of (4)).
(3) is probably too old to fix, (5) should have generic_write_checks() done
outside of fs/nfs/direct.c.  No idea on (6) and I would really like to hear
from XFS folks before doing anything to that one.

	Comments?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] write(2) semantics wrt return values and current position
  2015-04-06 16:02 [RFC] write(2) semantics wrt return values and current position Al Viro
@ 2015-04-06 18:13 ` Linus Torvalds
  2015-04-06 19:29   ` Al Viro
  2015-04-06 20:04 ` Drokin, Oleg
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2015-04-06 18:13 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Trond Myklebust, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Miklos Szeredi, Oleg Drokin

On Mon, Apr 6, 2015 at 9:02 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>         1) should we ever update the current position when write returns
> an error?  As it is, write(2) explicitly ignores any changes of position
> when ->write() has returned an error, but some other callers of vfs_write()
> are not so careful.

I think that question is the wrong way around.

If the write has ever been even partially successful, we should never
return an error. We should return the partial success.

So "error return" does imply "no position change", but not because we
shouldn't update the position for errors, but because we shouldn't
return error for anything that has updated the position.

NOTE! If we have magical non-POSIX files that update position even if
they otherwise return zero (don't ask me why, but we have various
non-posix behavior, maybe we have that case too), that's different.

Such a magical file might return an error (instead of zero) and update
position anyway. We have special files, they are ourside any POSIX
rules.

>         2) should we ever update the current position when write() returns 0?

I think the ext4 behavior is fine, although there's some noise in
POSIX about zero-sized writes being no-ops. I think the POSIX wording
is simply because some systems used to check for zero length before
even doing anything. In fact, I think Linux used to do that too, but
then we had special packetized formats that we wanted to syupport with
write() too (not just sendmsg), so that got removed.

I really don't think we should worry about it.

>         3) pwrite(2): POSIX seems to require ignoring the O_APPEND completely
> for that syscall.

That sounds insane too. But so is honoring it and ignoring the pwrite
offset. So I suspect the sane semantics for pwrite() with O_APPEND
should probably be to always just fail (the same way we fail pwrite on
nonseekable fd's), but if we have programs that use it, there's not a
lot we can do.

>         4) at lower level, there's a nasty case when short (but non-empty)
> O_DIRECT write followed by success of fallback to buffered write and a failure
> of filemap_write_and_wait_range() yields a return of the amount written by
> ->direct_IO() *and* update of current position by that plus the amount
> reported by buffered write.  IOW, we shift the offset by amount different
> from (positive) value we'll be returning from write(2).  That's a direct
> POSIX violation and I would expect the userland to be very surprised by
> running into that.  IMO it's a bug and we would be better off by shifting
> position by the amount we'll be returning.

That does sound like a bug. If we return a success value, and it's a
normal file (ie not the FAT "translate NL into NLCR" magic, or some
/proc seqfile etc), then I agree that the position should update by
the value we returned from write.

And there is no way I can see that a user program can depend on the
crazy behavior you describe.

>         5) somewhat related: nfs_direct_IO() ends up calling
> nfs_file_direct_write(), which calls generic_write_checks();
> it's triggered by swap-over-NFS (normal O_DIRECT writes go directly to
> nfs_file_direct_write()), and it ends up being subject to rlimit of
> caller.  Which might be anyone who calls alloc_pages(), AFAICS.  Almost
> certainly a bug.

Yeah, sounds bogus.

>         6) XFS seems to have fun bugs in O_DIRECT handling.  Consider
> the following scenario:
>         * O_DIRECT write() is called, we hit xfs_file_dio_aio_write().
>         * we check alignment and make decision whether to do
> xfs_rw_ilock exclusive (which will include i_mutex) or shared (which will
> not).  Suppose it takes that shared.
>         * we call xfs_file_aio_write_checks(), which, for starters, might
> modify position (on O_APPEND) and size (on rlimit).  Which renders the
> alignment checks useless, of course, but what's worse, it proceeds to
> calling xfs_break_layouts(), which might drop and retake XFS part of what's
> taken by xfs_rw_iolock().  Retake it exclusive, and update the iolock flag
> passed to it by reference accordingly.  And when we return to
> xfs_file_aio_write_checks(), and do xfs_rw_iunlock(), we'll end up dropping
> exclusively taken XFS part of things *and* ->i_mutex we'd never taken.
>         I might be misreading that code (it sure as hell wouldn't be
> the first time when xfs_{rw_,}_ilock() is involved), but it looks dubious
> to me...

I don't think aio_write() makes sense on an O_APPEND file (for the
same reason pwrite() doesn't), but we might be stuck with it.  People
who do that are insane and probably deserve whatever crazy semantics
they get (and if they rely on them, we shouldn't change them in the
name of "make things sane").

If the lack of proper locking causes coherence problems, that's a XFS
bug, of course.

>         My preference would be to have new_sync_write() and vfs_iter_write()
> to ignore iocb.ki_pos when ->write_iter() returns negative or zero (would
> take care of (1) and (2))

I guess that should be ok, because the (1) case is a "shouldn't happen
anyway", and the (2) case is debatable (and probably depends on
filesystem, so hopefully no user app can depend on it).

So I don't think the current behavior for 0 is necessarily wrong, but
I also don't care that strongly, so..

>     and have __generic_file_write_iter() to do
> ->ki_pos update in sync with what it'll be returning (takes care of (4)).

Yes.

> (3) is probably too old to fix, (5) should have generic_write_checks() done
> outside of fs/nfs/direct.c.  No idea on (6) and I would really like to hear
> from XFS folks before doing anything to that one.

Sounds ok by me.

                        Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] write(2) semantics wrt return values and current position
  2015-04-06 18:13 ` Linus Torvalds
@ 2015-04-06 19:29   ` Al Viro
  2015-04-06 19:50     ` Al Viro
  0 siblings, 1 reply; 15+ messages in thread
From: Al Viro @ 2015-04-06 19:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, Trond Myklebust, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Miklos Szeredi, Oleg Drokin

On Mon, Apr 06, 2015 at 11:13:14AM -0700, Linus Torvalds wrote:
> On Mon, Apr 6, 2015 at 9:02 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >         1) should we ever update the current position when write returns
> > an error?  As it is, write(2) explicitly ignores any changes of position
> > when ->write() has returned an error, but some other callers of vfs_write()
> > are not so careful.
> 
> I think that question is the wrong way around.
> 
> If the write has ever been even partially successful, we should never
> return an error. We should return the partial success.

Check what happens if generic_write_sync() (from generic_file_write_iter())
fails.  That's the kind of thing I'm worried about.  Sure, an error halfway
through => short write, no problem with that.  We do handle that.

> >         2) should we ever update the current position when write() returns 0?
> 
> I think the ext4 behavior is fine, although there's some noise in
> POSIX about zero-sized writes being no-ops. I think the POSIX wording
> is simply because some systems used to check for zero length before
> even doing anything. In fact, I think Linux used to do that too, but
> then we had special packetized formats that we wanted to syupport with
> write() too (not just sendmsg), so that got removed.
> 
> I really don't think we should worry about it.

No, it's just that it would be a lot more convenient to have it ki_pos
discarded (in new_sync_write() and vfs_iter_write()) when ->write_iter()
returns 0 or negative.  As it is, we do rather clumsy dances in
generic_write_checks() and it would be much nicer if we could simply
pass iocb and iter to it and have it update ->ki_pos (in O_APPEND case)
and do iov_iter_turncate().  And yes, it needs massage to get iocb to
all callers; the main obstacle used to be v9fs_file_write(), but that's
got dealt with in my tree.

> >         4) at lower level, there's a nasty case when short (but non-empty)
> > O_DIRECT write followed by success of fallback to buffered write and a failure
> > of filemap_write_and_wait_range() yields a return of the amount written by
> > ->direct_IO() *and* update of current position by that plus the amount
> > reported by buffered write.  IOW, we shift the offset by amount different
> > from (positive) value we'll be returning from write(2).  That's a direct
> > POSIX violation and I would expect the userland to be very surprised by
> > running into that.  IMO it's a bug and we would be better off by shifting
> > position by the amount we'll be returning.
> 
> That does sound like a bug. If we return a success value, and it's a
> normal file (ie not the FAT "translate NL into NLCR" magic, or some
> /proc seqfile etc), then I agree that the position should update by
> the value we returned from write.

ext* and friends.  It's in __generic_file_write_iter().

> >         6) XFS seems to have fun bugs in O_DIRECT handling.  Consider
> > the following scenario:
> >         * O_DIRECT write() is called, we hit xfs_file_dio_aio_write().
> >         * we check alignment and make decision whether to do
> > xfs_rw_ilock exclusive (which will include i_mutex) or shared (which will
> > not).  Suppose it takes that shared.
> >         * we call xfs_file_aio_write_checks(), which, for starters, might
> > modify position (on O_APPEND) and size (on rlimit).  Which renders the
> > alignment checks useless, of course, but what's worse, it proceeds to
> > calling xfs_break_layouts(), which might drop and retake XFS part of what's
> > taken by xfs_rw_iolock().  Retake it exclusive, and update the iolock flag
> > passed to it by reference accordingly.  And when we return to
> > xfs_file_aio_write_checks(), and do xfs_rw_iunlock(), we'll end up dropping
> > exclusively taken XFS part of things *and* ->i_mutex we'd never taken.
> >         I might be misreading that code (it sure as hell wouldn't be
> > the first time when xfs_{rw_,}_ilock() is involved), but it looks dubious
> > to me...
> 
> I don't think aio_write() makes sense on an O_APPEND file (for the
> same reason pwrite() doesn't), but we might be stuck with it.  People
> who do that are insane and probably deserve whatever crazy semantics
> they get (and if they rely on them, we shouldn't change them in the
> name of "make things sane").
> 
> If the lack of proper locking causes coherence problems, that's a XFS
> bug, of course.

It doesn't have to be O_DIRECT; setrlimit(2) from another thread is enough
to screw the alignment to hell and back and mutex_unlock() of something
we hadn't done mutex_lock() to is definitely a bug (don't need O_APPEND to
trigger that either; needs pNFS involved, AFAICS).  I'd really like comments
from Christoph and Dave on that on...

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] write(2) semantics wrt return values and current position
  2015-04-06 19:29   ` Al Viro
@ 2015-04-06 19:50     ` Al Viro
  0 siblings, 0 replies; 15+ messages in thread
From: Al Viro @ 2015-04-06 19:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, Trond Myklebust, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Miklos Szeredi, Oleg Drokin

On Mon, Apr 06, 2015 at 08:29:03PM +0100, Al Viro wrote:

> It doesn't have to be O_DIRECT; setrlimit(2) from another thread is enough

grrr.... s/O_DIRECT/O_APPEND/.  Badly edited sentence...  O_DIRECT _is_
needed to trigger that, O_APPEND isn't.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] write(2) semantics wrt return values and current position
  2015-04-06 16:02 [RFC] write(2) semantics wrt return values and current position Al Viro
  2015-04-06 18:13 ` Linus Torvalds
@ 2015-04-06 20:04 ` Drokin, Oleg
  2015-04-06 20:09   ` Al Viro
  2015-04-07 15:25 ` Christoph Hellwig
  2015-04-08 19:24 ` Al Viro
  3 siblings, 1 reply; 15+ messages in thread
From: Drokin, Oleg @ 2015-04-06 20:04 UTC (permalink / raw)
  To: Al Viro
  Cc: <linux-fsdevel@vger.kernel.org>,
	Linus Torvalds, Trond Myklebust, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Miklos Szeredi

Hello!

On Apr 6, 2015, at 12:02 PM, Al Viro wrote:
> 	2) should we ever update the current position when write() returns 0?
> IOW, what effect should zero-length write() on O_APPEND file have upon its
> current position?  POSIX seems to imply that it should do nothing, and
> generally that's what happens, but e.g. ext4 *does* update position to
> the EOF, whether we will write anything or not.  So does FUSE when server
> requests to bypass the page cache.  AFAICS, lustre is the same way,
> but I might be missing something; everything else definitely does not

Lustre is not the same way.
drivers/staging/lustre/lustre/llite/file.c::ll_file_io_generic() has this
        if (io-> ci_nob > 0) {
                result = io->ci_nob;
                *ppos = io->u.ci_wr.wr.crw_pos;
        }

and ppos is passed in as &iocb->ki_pos. ci_nob is number of bytes that we managed to read/write,
and if it was 0 (either due to 0 bytes io request or due to error from the get go)
we won't update it.

Bye,
    Oleg


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] write(2) semantics wrt return values and current position
  2015-04-06 20:04 ` Drokin, Oleg
@ 2015-04-06 20:09   ` Al Viro
  2015-04-06 20:39     ` Drokin, Oleg
  0 siblings, 1 reply; 15+ messages in thread
From: Al Viro @ 2015-04-06 20:09 UTC (permalink / raw)
  To: Drokin, Oleg
  Cc: <linux-fsdevel@vger.kernel.org>,
	Linus Torvalds, Trond Myklebust, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Miklos Szeredi

On Mon, Apr 06, 2015 at 08:04:01PM +0000, Drokin, Oleg wrote:
> Hello!
> 
> On Apr 6, 2015, at 12:02 PM, Al Viro wrote:
> > 	2) should we ever update the current position when write() returns 0?
> > IOW, what effect should zero-length write() on O_APPEND file have upon its
> > current position?  POSIX seems to imply that it should do nothing, and
> > generally that's what happens, but e.g. ext4 *does* update position to
> > the EOF, whether we will write anything or not.  So does FUSE when server
> > requests to bypass the page cache.  AFAICS, lustre is the same way,
> > but I might be missing something; everything else definitely does not
> 
> Lustre is not the same way.
> drivers/staging/lustre/lustre/llite/file.c::ll_file_io_generic() has this
>         if (io-> ci_nob > 0) {
>                 result = io->ci_nob;
>                 *ppos = io->u.ci_wr.wr.crw_pos;
>         }
> 
> and ppos is passed in as &iocb->ki_pos. ci_nob is number of bytes that we managed to read/write,
> and if it was 0 (either due to 0 bytes io request or due to error from the get go)
> we won't update it.

Eh? vvp_io_write_start():
                result = generic_file_write_iter(cio->cui_iocb, cio->cui_iter);

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] write(2) semantics wrt return values and current position
  2015-04-06 20:09   ` Al Viro
@ 2015-04-06 20:39     ` Drokin, Oleg
  0 siblings, 0 replies; 15+ messages in thread
From: Drokin, Oleg @ 2015-04-06 20:39 UTC (permalink / raw)
  To: Al Viro
  Cc: <linux-fsdevel@vger.kernel.org>,
	Linus Torvalds, Trond Myklebust, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Miklos Szeredi

On Apr 6, 2015, at 4:09 PM, Al Viro wrote:

> On Mon, Apr 06, 2015 at 08:04:01PM +0000, Drokin, Oleg wrote:
>> Hello!
>> 
>> On Apr 6, 2015, at 12:02 PM, Al Viro wrote:
>>> 	2) should we ever update the current position when write() returns 0?
>>> IOW, what effect should zero-length write() on O_APPEND file have upon its
>>> current position?  POSIX seems to imply that it should do nothing, and
>>> generally that's what happens, but e.g. ext4 *does* update position to
>>> the EOF, whether we will write anything or not.  So does FUSE when server
>>> requests to bypass the page cache.  AFAICS, lustre is the same way,
>>> but I might be missing something; everything else definitely does not
>> 
>> Lustre is not the same way.
>> drivers/staging/lustre/lustre/llite/file.c::ll_file_io_generic() has this
>>        if (io-> ci_nob > 0) {
>>                result = io->ci_nob;
>>                *ppos = io->u.ci_wr.wr.crw_pos;
>>        }
>> 
>> and ppos is passed in as &iocb->ki_pos. ci_nob is number of bytes that we managed to read/write,
>> and if it was 0 (either due to 0 bytes io request or due to error from the get go)
>> we won't update it.
> 
> Eh? vvp_io_write_start():
>                result = generic_file_write_iter(cio->cui_iocb, cio->cui_iter);

Oh, hm, I guess you mean
        if (cl_io_is_append(io)) {
                pos = io->u.ci_wr.wr.crw_pos = i_size_read(inode);
                cio->cui_iocb->ki_pos = pos;

missed that one.
It's a leftover from the times we used to call __generic_file_aio_write there and that took
&iocb->ki_pos argument.
Now that we pass entire iocb in, probably can just drop that ki_pos assignment, I am going to try that now.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] write(2) semantics wrt return values and current position
  2015-04-06 16:02 [RFC] write(2) semantics wrt return values and current position Al Viro
  2015-04-06 18:13 ` Linus Torvalds
  2015-04-06 20:04 ` Drokin, Oleg
@ 2015-04-07 15:25 ` Christoph Hellwig
  2015-04-08 19:24 ` Al Viro
  3 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2015-04-07 15:25 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Trond Myklebust,
	Christoph Hellwig, Dave Chinner, Theodore Ts'o,
	Miklos Szeredi, Oleg Drokin

On Mon, Apr 06, 2015 at 05:02:31PM +0100, Al Viro wrote:
> 	6) XFS seems to have fun bugs in O_DIRECT handling.  Consider
> the following scenario:
> 	* O_DIRECT write() is called, we hit xfs_file_dio_aio_write().
> 	* we check alignment and make decision whether to do
> xfs_rw_ilock exclusive (which will include i_mutex) or shared (which will
> not).  Suppose it takes that shared.
> 	* we call xfs_file_aio_write_checks(), which, for starters, might
> modify position (on O_APPEND) and size (on rlimit).  Which renders the
> alignment checks useless, of course, but what's worse, it proceeds to
> calling xfs_break_layouts(), which might drop and retake XFS part of what's
> taken by xfs_rw_iolock().  Retake it exclusive, and update the iolock flag
> passed to it by reference accordingly.  And when we return to
> xfs_file_aio_write_checks(), and do xfs_rw_iunlock(), we'll end up dropping
> exclusively taken XFS part of things *and* ->i_mutex we'd never taken.
> 	I might be misreading that code (it sure as hell wouldn't be
> the first time when xfs_{rw_,}_ilock() is involved), but it looks dubious
> to me...

It's not just dubious, it's broken.  I've forgotten to drop and retake
i_mutex there (depending on EXCL) flag.  It's been hitting me by making
another bug worse.  I've got an RFC patches for a few days, just need to
get around to send it out, it's proably 4.0 material.

And yes, alignment checks really should be past
xfs_file_aio_write_checks, or at least be re-checked there.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] write(2) semantics wrt return values and current position
  2015-04-06 16:02 [RFC] write(2) semantics wrt return values and current position Al Viro
                   ` (2 preceding siblings ...)
  2015-04-07 15:25 ` Christoph Hellwig
@ 2015-04-08 19:24 ` Al Viro
  2015-04-08 20:57   ` Al Viro
  3 siblings, 1 reply; 15+ messages in thread
From: Al Viro @ 2015-04-08 19:24 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Trond Myklebust, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Miklos Szeredi, Oleg Drokin, Joseph Qi,
	Joel Becker, Mark Fasheh

On Mon, Apr 06, 2015 at 05:02:31PM +0100, Al Viro wrote:

7) commit 3a83b342c87e6d21290de8dc76ec20a67821261d
Author: Joseph Qi <joseph.qi@huawei.com>
Date:   Mon Feb 16 16:00:09 2015 -0800

    ocfs2: complete the rest request through buffer io

appears to be very odd.  Look:

                written = generic_file_direct_write(iocb, from, *ppos);
-               if (written < 0) {
+               if (written < 0 || written == count) {
                        ret = written;
                        goto out_dio;
                }
+
+               /*
+                * for completing the rest of the request.
+                */
+               *ppos += written;

ppos here is &iocb->ki_pos.  Now, take a look at the end of
generic_file_direct_write():
        if (written > 0) {
                pos += written;
                iov_iter_advance(from, written);
                if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
                        i_size_write(inode, pos);
                        mark_inode_dirty(inode);
                }
                iocb->ki_pos = pos;
        }
out:
        return written;

In other words, after short write done by ->direct_IO(), we end up incrementing
position by *twice* the amount written by it.  And if it's short, but not
empty, we appear to be buggered...

Unless I hear "Al, you idiot, it's doing the right thing, here's what you've
missed: ...", I'm going to take that increment in ocfs2_file_write_iter()
out, and send it to Linus for 4.0 - it's post-3.19 regression.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] write(2) semantics wrt return values and current position
  2015-04-08 19:24 ` Al Viro
@ 2015-04-08 20:57   ` Al Viro
  2015-04-08 21:20     ` Al Viro
  0 siblings, 1 reply; 15+ messages in thread
From: Al Viro @ 2015-04-08 20:57 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Trond Myklebust, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Miklos Szeredi, Oleg Drokin, Joseph Qi,
	Joel Becker, Mark Fasheh

On Wed, Apr 08, 2015 at 08:24:50PM +0100, Al Viro wrote:

> In other words, after short write done by ->direct_IO(), we end up incrementing
> position by *twice* the amount written by it.  And if it's short, but not
> empty, we appear to be buggered...
> 
> Unless I hear "Al, you idiot, it's doing the right thing, here's what you've
> missed: ...", I'm going to take that increment in ocfs2_file_write_iter()
> out, and send it to Linus for 4.0 - it's post-3.19 regression.

... and having looked through old mail, there's _another_ breakage that
might have inspired that one; this one is mine - "ocfs2 syncs the wrong range".
It was syncing the wrong range with O_APPEND, all right, but after that
patch it was syncing the wrong range in _all_ cases.  What it should've
been doing instead is
                ret = filemap_fdatawrite_range(file->f_mapping,
					       iocb->ki_pos - written,
                                               iocb->ki_pos - 1);
...
                        ret = filemap_fdatawait_range(file->f_mapping,
						      iocb->ki_pos - written,
                                                      iocb->ki_pos - 1);

Joseph, my apologies for missing your mail back in January - you are absolutely
correct, *ppos (aka iocb->ki_pos) _is_ changed.  Unlike pos, it can be used
to get the right range reliably (as above; pos, back when it existed,
hadn't been affected by generic_write_checks() call in what used to be
ocfs2_file_aio_write()), but the actual calculation had been completely
wrong.

If that had contributed to confusion, my deep apologies...

Another fun issue in there: ocfs2_is_io_unaligned() is called before we
get around to checking rlimit and possibly truncating the range being
written.  IOW, result of that check is unreliable.

At which point in that function do we get the file size stabilized?
After ocfs2_rw_lock()?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] write(2) semantics wrt return values and current position
  2015-04-08 20:57   ` Al Viro
@ 2015-04-08 21:20     ` Al Viro
  2015-04-09  4:48       ` Junxiao Bi
  0 siblings, 1 reply; 15+ messages in thread
From: Al Viro @ 2015-04-08 21:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Trond Myklebust, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Miklos Szeredi, Oleg Drokin, Joseph Qi,
	Joel Becker, Mark Fasheh

On Wed, Apr 08, 2015 at 09:57:37PM +0100, Al Viro wrote:
> ... and having looked through old mail, there's _another_ breakage that
> might have inspired that one; this one is mine - "ocfs2 syncs the wrong range".
> It was syncing the wrong range with O_APPEND, all right, but after that
> patch it was syncing the wrong range in _all_ cases.  What it should've
> been doing instead is
>                 ret = filemap_fdatawrite_range(file->f_mapping,
> 					       iocb->ki_pos - written,
>                                                iocb->ki_pos - 1);
> ...
>                         ret = filemap_fdatawait_range(file->f_mapping,
> 						      iocb->ki_pos - written,
>                                                       iocb->ki_pos - 1);
> 
> Joseph, my apologies for missing your mail back in January - you are absolutely
> correct, *ppos (aka iocb->ki_pos) _is_ changed.  Unlike pos, it can be used
> to get the right range reliably (as above; pos, back when it existed,
> hadn't been affected by generic_write_checks() call in what used to be
> ocfs2_file_aio_write()), but the actual calculation had been completely
> wrong.
> 
> If that had contributed to confusion, my deep apologies...

Folks, could you please take a look through vfs.git#for-linus (the last
three commits in there) and see if you are OK with those?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] write(2) semantics wrt return values and current position
  2015-04-08 21:20     ` Al Viro
@ 2015-04-09  4:48       ` Junxiao Bi
  2015-04-09 11:23         ` Al Viro
  0 siblings, 1 reply; 15+ messages in thread
From: Junxiao Bi @ 2015-04-09  4:48 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Trond Myklebust, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Miklos Szeredi, Oleg Drokin, Joseph Qi,
	Joel Becker, Mark Fasheh

On 04/09/2015 05:20 AM, Al Viro wrote:
> On Wed, Apr 08, 2015 at 09:57:37PM +0100, Al Viro wrote:
>> ... and having looked through old mail, there's _another_ breakage that
>> might have inspired that one; this one is mine - "ocfs2 syncs the wrong range".
>> It was syncing the wrong range with O_APPEND, all right, but after that
>> patch it was syncing the wrong range in _all_ cases.  What it should've
>> been doing instead is
>>                 ret = filemap_fdatawrite_range(file->f_mapping,
>> 					       iocb->ki_pos - written,
>>                                                iocb->ki_pos - 1);
>> ...
>>                         ret = filemap_fdatawait_range(file->f_mapping,
>> 						      iocb->ki_pos - written,
>>                                                       iocb->ki_pos - 1);

Looks like if generic_file_direct_write() return -EIOCBQUEUED and
IS_SYNC(inode) is true, the sync range is also wrong.


2385         if (direct_io) {
2386                 loff_t endbyte;
2387                 ssize_t written_buffered;
2388                 written = generic_file_direct_write(iocb, from, *ppos);
2389                 if (written < 0 || written == count) {
2390                         ret = written;
2391                         goto out_dio;
2392                 }

2443         if (((file->f_flags & O_DSYNC) && !direct_io) ||
IS_SYNC(inode) ||
2444             ((file->f_flags & O_DIRECT) && !direct_io)) {

The other two patches looks good.


Thanks,
Junxiao.
>>
>> Joseph, my apologies for missing your mail back in January - you are absolutely
>> correct, *ppos (aka iocb->ki_pos) _is_ changed.  Unlike pos, it can be used
>> to get the right range reliably (as above; pos, back when it existed,
>> hadn't been affected by generic_write_checks() call in what used to be
>> ocfs2_file_aio_write()), but the actual calculation had been completely
>> wrong.
>>
>> If that had contributed to confusion, my deep apologies...
> 
> Folks, could you please take a look through vfs.git#for-linus (the last
> three commits in there) and see if you are OK with those?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] write(2) semantics wrt return values and current position
  2015-04-09  4:48       ` Junxiao Bi
@ 2015-04-09 11:23         ` Al Viro
  2015-04-09 11:42           ` Al Viro
  0 siblings, 1 reply; 15+ messages in thread
From: Al Viro @ 2015-04-09 11:23 UTC (permalink / raw)
  To: Junxiao Bi
  Cc: linux-fsdevel, Linus Torvalds, Trond Myklebust,
	Christoph Hellwig, Dave Chinner, Theodore Ts'o,
	Miklos Szeredi, Oleg Drokin, Joseph Qi, Joel Becker, Mark Fasheh

On Thu, Apr 09, 2015 at 12:48:44PM +0800, Junxiao Bi wrote:

> Looks like if generic_file_direct_write() return -EIOCBQUEUED and
> IS_SYNC(inode) is true, the sync range is also wrong.

*blink*

But in that case it shouldn't do any syncing at all...  Oh, right.
Unlike generic_file_write_iter(), if goes into the sync pathway
in that case (which was another long-standing bug there)...

Fixed and force-pushed.

FWIW, once we get to the situation when generic_write_checks() takes
iocb and iter, the next step will be mirroring O_DIRECT and O_APPEND
state in iocb->ki_flags; then ocfs2_file_write_iter() will be able to
use those instead of o_direct/o_append *and* start using
__generic_file_write_iter() instead of open-coding it - the problem
with "we can't rely on ->f_flags & O_DIRECT to tell if that should be
a direct write, need to look at our local flag" will go away...

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] write(2) semantics wrt return values and current position
  2015-04-09 11:23         ` Al Viro
@ 2015-04-09 11:42           ` Al Viro
  2015-04-10 14:31             ` Junxiao Bi
  0 siblings, 1 reply; 15+ messages in thread
From: Al Viro @ 2015-04-09 11:42 UTC (permalink / raw)
  To: Junxiao Bi
  Cc: linux-fsdevel, Linus Torvalds, Trond Myklebust,
	Christoph Hellwig, Dave Chinner, Theodore Ts'o,
	Miklos Szeredi, Oleg Drokin, Joseph Qi, Joel Becker, Mark Fasheh

On Thu, Apr 09, 2015 at 12:23:33PM +0100, Al Viro wrote:
> On Thu, Apr 09, 2015 at 12:48:44PM +0800, Junxiao Bi wrote:
> 
> > Looks like if generic_file_direct_write() return -EIOCBQUEUED and
> > IS_SYNC(inode) is true, the sync range is also wrong.
> 
> *blink*
> 
> But in that case it shouldn't do any syncing at all...  Oh, right.
> Unlike generic_file_write_iter(), if goes into the sync pathway
> in that case (which was another long-standing bug there)...
> 
> Fixed and force-pushed.

	BTW, what about the locks needed to stabilize the file size?
generic_write_checks() in there looks like it's called too late - we
can't tell a damn thing about alignment until we do it, so at the very
least we might need to recheck that.

	Can we move it _before_ ocfs2_prepare_inode_for_write()?
Having it done twice in case we end up on "can't really do O_DIRECT, unlock
and redo everything without asking for direct IO" path is not a problem -
we just need to save the original count and restore it before that
goto relock in there.

	Do we hold enough locks to make sure that file size won't change
under us?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] write(2) semantics wrt return values and current position
  2015-04-09 11:42           ` Al Viro
@ 2015-04-10 14:31             ` Junxiao Bi
  0 siblings, 0 replies; 15+ messages in thread
From: Junxiao Bi @ 2015-04-10 14:31 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Trond Myklebust,
	Christoph Hellwig, Dave Chinner, Theodore Ts'o,
	Miklos Szeredi, Oleg Drokin, Joseph Qi, Joel Becker, Mark Fasheh

On 04/09/2015 07:42 PM, Al Viro wrote:
> On Thu, Apr 09, 2015 at 12:23:33PM +0100, Al Viro wrote:
>> On Thu, Apr 09, 2015 at 12:48:44PM +0800, Junxiao Bi wrote:
>>
>>> Looks like if generic_file_direct_write() return -EIOCBQUEUED and
>>> IS_SYNC(inode) is true, the sync range is also wrong.
>> *blink*
>>
>> But in that case it shouldn't do any syncing at all...  Oh, right.
>> Unlike generic_file_write_iter(), if goes into the sync pathway
>> in that case (which was another long-standing bug there)...
>>
>> Fixed and force-pushed.
> 	BTW, what about the locks needed to stabilize the file size?
> generic_write_checks() in there looks like it's called too late - we
> can't tell a damn thing about alignment until we do it, so at the very
> least we might need to recheck that.
>
> 	Can we move it _before_ ocfs2_prepare_inode_for_write()?
> Having it done twice in case we end up on "can't really do O_DIRECT, unlock
> and redo everything without asking for direct IO" path is not a problem -
> we just need to save the original count and restore it before that
> goto relock in there.
>
> 	Do we hold enough locks to make sure that file size won't change
> under us?
I think holding ocfs2_rw_lock() for cluster and i_mutex for local will 
stop file size changing under us. So looks safe to do your changes there.

Thanks,
Junxiao.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-04-10 14:32 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-06 16:02 [RFC] write(2) semantics wrt return values and current position Al Viro
2015-04-06 18:13 ` Linus Torvalds
2015-04-06 19:29   ` Al Viro
2015-04-06 19:50     ` Al Viro
2015-04-06 20:04 ` Drokin, Oleg
2015-04-06 20:09   ` Al Viro
2015-04-06 20:39     ` Drokin, Oleg
2015-04-07 15:25 ` Christoph Hellwig
2015-04-08 19:24 ` Al Viro
2015-04-08 20:57   ` Al Viro
2015-04-08 21:20     ` Al Viro
2015-04-09  4:48       ` Junxiao Bi
2015-04-09 11:23         ` Al Viro
2015-04-09 11:42           ` Al Viro
2015-04-10 14:31             ` Junxiao Bi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.