[RFC] write(2) semantics wrt return values and current position

From: Al Viro @ 2015-04-06 16:02 UTC
  To: linux-fsdevel
  Cc: Linus Torvalds, Trond Myklebust, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Miklos Szeredi, Oleg Drokin

	There are several questions regarding the write(2) semantics, and
I'd like to see comments on those.  All of that is for regular files.

	1) should we ever update the current position when write returns
an error?  As it is, write(2) explicitly ignores any changes of position
when ->write() has returned an error, but some other callers of vfs_write()
are not so careful.

	2) should we ever update the current position when write() returns 0?
IOW, what effect should a zero-length write() on an O_APPEND file have upon its
current position?  POSIX seems to imply that it should do nothing, and
generally that's what happens, but e.g. ext4 *does* update the position to
EOF, whether we write anything or not.  So does FUSE when the server
asks to bypass the page cache.  AFAICS, lustre is the same way,
but I might be missing something; everything else definitely does not
update the position in that case.  IMO the common behaviour is correct and
the ext4 one is a bug.
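
	For illustration, the question in (2) can be probed from userland.
The following is only a rough sketch: the helper name and the /tmp path are
made up for this example, and the result depends on the filesystem backing
/tmp and on the kernel version.  It creates a 3-byte file, opens it with
O_APPEND (so the fresh descriptor starts at offset 0), issues a zero-length
write() and reports the offset seen afterwards.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical probe (not part of any existing test suite).
 * Returns the offset observed after a zero-length write() on an O_APPEND
 * descriptor: 0 if the position was left alone, 3 if it was moved to EOF,
 * -1 if the setup failed.
 */
long offset_after_empty_append_write(void)
{
	char path[] = "/tmp/zero-append-XXXXXX";
	long pos = -1;
	int fd;
	int tmp = mkstemp(path);

	if (tmp < 0)
		return -1;
	if (write(tmp, "abc", 3) != 3)
		goto out;
	/* fresh descriptor: offset 0, file size 3 */
	fd = open(path, O_WRONLY | O_APPEND);
	if (fd < 0)
		goto out;
	if (write(fd, "", 0) == 0)
		pos = (long)lseek(fd, 0, SEEK_CUR);
	close(fd);
out:
	close(tmp);
	unlink(path);
	return pos;
}
```

A return of 0 corresponds to what is called the common behaviour above; a
return of 3 corresponds to the ext4-style update to EOF.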

	3) pwrite(2): POSIX seems to require ignoring O_APPEND completely
for that syscall.  We definitely do not.  It's arguable whether this is
desired or not, but it's existing behaviour that has been that way since
we got pwrite(2) in the kernel (2.1.60).  Probably too late to do anything
about that.
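
	The behaviour in (3) is also visible from userland.  A minimal sketch,
assuming a writable /tmp; the helper name is invented for this example.  It
does a one-byte pwrite() at offset 0 through an O_APPEND descriptor on
a 3-byte file and returns the resulting file size.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical probe.  Returns the file size after pwrite(fd, "X", 1, 0)
 * on an O_APPEND descriptor for a 3-byte file:
 *   3  - the byte landed at offset 0 (O_APPEND ignored),
 *   4  - the byte was appended (O_APPEND honoured),
 *  -1  - setup failure.
 */
long size_after_pwrite_on_append(void)
{
	char path[] = "/tmp/pwrite-append-XXXXXX";
	struct stat st;
	long size = -1;
	int fd;
	int tmp = mkstemp(path);

	if (tmp < 0)
		return -1;
	if (write(tmp, "abc", 3) != 3)
		goto out;
	fd = open(path, O_WRONLY | O_APPEND);
	if (fd < 0)
		goto out;
	if (pwrite(fd, "X", 1, 0) == 1 && fstat(fd, &st) == 0)
		size = (long)st.st_size;
	close(fd);
out:
	close(tmp);
	unlink(path);
	return size;
}
```

A size of 3 would match the reading of POSIX given above; a size of 4 is the
append-anyway behaviour described for Linux.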

	4) at a lower level, there's a nasty case when a short (but non-empty)
O_DIRECT write is followed by a successful fallback to buffered write and then
a failure of filemap_write_and_wait_range().  That yields a return of the
amount written by ->direct_IO() *and* an update of the current position by
that plus the amount reported by the buffered write.  IOW, we shift the offset
by an amount different from the (positive) value we'll be returning from
write(2).  That's a direct POSIX violation and I would expect userland to be
very surprised by running into that.  IMO it's a bug and we would be better
off shifting the position by the amount we'll be returning.
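
	The accounting in (4) can be put as a toy model.  This is not kernel
code: the struct and function names are hypothetical, and the case without
a flush failure (return and position both advance by the sum) is an
assumption added to make the model complete.

```c
#include <assert.h>

/* Hypothetical model of one O_DIRECT write with buffered fallback. */
struct write_outcome {
	long ret;		/* value write(2) would return */
	long pos_advance;	/* how far the file position moves */
};

/*
 * Behaviour as described in (4): the position moves by direct + buffered
 * even when the flush fails and only the direct part is reported.
 */
struct write_outcome dio_fallback_described(long direct, long buffered,
					    int flush_failed)
{
	struct write_outcome o;

	assert(direct > 0 && buffered > 0);
	o.ret = flush_failed ? direct : direct + buffered;
	o.pos_advance = direct + buffered;
	return o;
}

/* Suggested behaviour: the position moves by exactly what is returned. */
struct write_outcome dio_fallback_suggested(long direct, long buffered,
					    int flush_failed)
{
	struct write_outcome o;

	assert(direct > 0 && buffered > 0);
	o.ret = flush_failed ? direct : direct + buffered;
	o.pos_advance = o.ret;
	return o;
}
```

With direct = 4096, buffered = 8192 and a failed flush, the first variant
returns 4096 while moving the position by 12288; the second keeps both at
4096.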

	5) somewhat related: nfs_direct_IO() ends up calling
nfs_file_direct_write(), which calls generic_write_checks();
it's triggered by swap-over-NFS (normal O_DIRECT writes go directly to
nfs_file_direct_write()), and it ends up being subject to the rlimit of
the caller.  Which might be anyone who calls alloc_pages(), AFAICS.  Almost
certainly a bug.

	6) XFS seems to have fun bugs in O_DIRECT handling.  Consider
the following scenario:
	* O_DIRECT write() is called, we hit xfs_file_dio_aio_write().
	* we check alignment and make decision whether to do
xfs_rw_ilock exclusive (which will include i_mutex) or shared (which will
not).  Suppose it takes that shared.
	* we call xfs_file_aio_write_checks(), which, for starters, might
modify position (on O_APPEND) and size (on rlimit).  Which renders the
alignment checks useless, of course, but what's worse, it proceeds to
calling xfs_break_layouts(), which might drop and retake the XFS part of what's
taken by xfs_rw_ilock().  It retakes it exclusive, and updates the iolock flag
passed to it by reference accordingly.  And when we return to
xfs_file_aio_write_checks(), and do xfs_rw_iunlock(), we'll end up dropping
the exclusively taken XFS part of things *and* ->i_mutex we'd never taken.
	I might be misreading that code (it sure as hell wouldn't be
the first time when xfs_{rw_,}ilock() is involved), but it looks dubious
to me...

	My preference would be to have new_sync_write() and vfs_iter_write()
ignore iocb.ki_pos when ->write_iter() returns negative or zero (that would
take care of (1) and (2)) and have __generic_file_write_iter() do the
->ki_pos update in sync with what it'll be returning (takes care of (4)).
(3) is probably too old to fix; for (5), generic_write_checks() should be done
outside of fs/nfs/direct.c.  No idea on (6) and I would really like to hear
from XFS folks before doing anything to that one.
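
	The rule proposed for (1) and (2) boils down to something like the
following userland model.  The types and the function name are hypothetical
stand-ins, not the actual struct file / struct kiocb or any existing kernel
helper; the point is only that the position is copied back when, and only
when, the write made progress.

```c
#include <assert.h>

/* Hypothetical stand-ins for the file and the in-flight iocb. */
struct model_file { long f_pos; };
struct model_iocb { long ki_pos; };

/*
 * Commit the iocb position to the file only if the write returned a
 * positive byte count; on error (negative) or zero, leave f_pos alone.
 * Returns ret unchanged.
 */
long model_commit_write(struct model_file *file,
			const struct model_iocb *iocb, long ret)
{
	assert(file && iocb);
	if (ret > 0)
		file->f_pos = iocb->ki_pos;
	return ret;
}
```

Under this rule, whatever a filesystem does to ki_pos on a failed or empty
write never becomes visible as the current position.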

	Comments?
