linux-cifs.vger.kernel.org archive mirror
* Out of order read() completion and buffer filling beyond returned amount
From: David Howells @ 2022-01-17  9:57 UTC
  To: Alexander Viro, Linus Torvalds
  Cc: dhowells, Anna Schumaker, Dave Wysochanski, Dominique Martinet,
	Jeff Layton, Latchesar Ionkov, Marc Dionne, Matthew Wilcox,
	Omar Sandoval, Shyam Prasad N, Steve French, Trond Myklebust,
	Peter Zijlstra, ceph-devel, linux-afs, linux-cachefs, linux-cifs,
	linux-fsdevel, linux-mm, linux-nfs, v9fs-developer, linux-kernel

Hi Al, Linus,

Do you have an opinion on whether it's permissible for a filesystem to write
into the read() buffer beyond the amount it claims to return, though still
within the specified size of the buffer?

I'm working on common DIO routines for 9p, afs, ceph and cifs in netfs lib,
and I can see that at least three of those four filesystems either can or must
split a read, possibly being required to distribute across multiple servers.

If a filesystem was to emit multiple read RPCs in parallel, there is the
possibility that they would complete out of order - particularly if they go to
multiple servers.

Would it be a violation of the way the read() family of syscalls work to write
the data into the buffers out of order, and then abandon the extra data
written at the end if one of the RPCs returned a short read?  We would have
clobbered some of the buffer that we haven't said we've modified.
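
A minimal userspace sketch of the scenario described above, not netfs code. It assumes a read split into contiguous sub-reads that are copied into the buffer in arrival order; `struct subreq`, `complete_subreq()` and `split_read()` are hypothetical names made up for illustration. The return value counts bytes only up to the first short sub-read, so anything a later sub-read already copied in lies beyond the returned amount.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical description of one piece (one RPC) of a split read. */
struct subreq {
	size_t offset;     /* where this piece lands in the user buffer */
	size_t len;        /* bytes requested for this piece */
	size_t got;        /* bytes actually returned (<= len) */
	const char *data;  /* payload returned by the server */
};

/* Copy one completed piece into the buffer, as a completion handler might. */
static void complete_subreq(char *buf, const struct subreq *sr)
{
	memcpy(buf + sr->offset, sr->data, sr->got);
}

/*
 * Apply the completions in the order given by order[], then work out the
 * byte count that could be reported: everything up to and including the
 * first short piece. srs[] is assumed sorted by offset and contiguous.
 */
static ssize_t split_read(char *buf, const struct subreq *srs,
			  const size_t *order, size_t n)
{
	size_t i, total = 0;

	for (i = 0; i < n; i++) {
		assert(order[i] < n);
		complete_subreq(buf, &srs[order[i]]);
	}

	for (i = 0; i < n; i++) {
		total += srs[i].got;
		if (srs[i].got < srs[i].len)
			break;	/* short piece: later data is abandoned */
	}
	return (ssize_t)total;
}
```

With two 4-byte pieces where the second completes first and the first comes back 2 bytes short, `split_read()` reports 2, yet bytes 4 to 7 of the buffer already hold the second piece's data.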

For buffered reads, it's not a problem as we can fill the pagecache out of
order with no issue.

David



* Re: Out of order read() completion and buffer filling beyond returned amount
From: Linus Torvalds @ 2022-01-17 10:19 UTC
  To: David Howells
  Cc: Alexander Viro, Anna Schumaker, Dave Wysochanski,
	Dominique Martinet, Jeff Layton, Latchesar Ionkov, Marc Dionne,
	Matthew Wilcox, Omar Sandoval, Shyam Prasad N, Steve French,
	Trond Myklebust, Peter Zijlstra, ceph-devel, linux-afs,
	linux-cachefs, CIFS, linux-fsdevel, Linux-MM, open list:NFS,
	SUNRPC, AND...,
	v9fs-developer, Linux Kernel Mailing List

On Mon, Jan 17, 2022 at 11:57 AM David Howells <dhowells@redhat.com> wrote:
>
> Do you have an opinion on whether it's permissible for a filesystem to write
> into the read() buffer beyond the amount it claims to return, though still
> within the specified size of the buffer?

I'm pretty sure that would seriously violate POSIX in the general
case, and maybe even break some programs that do fancy buffer
management (ie I could imagine some circular buffer thing that expects
any "unwritten" ('unread'?) parts to stay with the old contents)
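
One way to observe the property such a program would rely on is to pre-fill the buffer with a sentinel and check the tail after read() returns. The following is only a sketch; `tail_untouched()` is a hypothetical helper name, not an existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Fill buf with a sentinel byte, read() into it, and check whether the
 * bytes past the returned count still hold the sentinel.
 * Returns 1 if the tail is untouched, 0 if it was clobbered, and -1 if
 * read() itself failed.
 */
static int tail_untouched(int fd, char *buf, size_t len, char sentinel)
{
	ssize_t n;
	size_t i;

	assert(buf != NULL);
	memset(buf, sentinel, len);

	n = read(fd, buf, len);
	if (n < 0)
		return -1;

	for (i = (size_t)n; i < len; i++)
		if (buf[i] != sentinel)
			return 0;
	return 1;
}
```

On a pipe holding 3 bytes, a read into an 8-byte buffer returns 3 and the remaining 5 bytes keep the sentinel. A split DIO read that completes out of order is the case where this check could fail.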

That said, that's for generic 'read()' cases for things like tty's or
pipes etc that can return partial reads in the first place.

If it's a regular file, then any partial read *already* violates
POSIX, and nobody sane would do any such buffer management because
it's supposed to be a 'can't happen' thing.

And since you mention DIO, that's doubly true, and is already outside
basic POSIX, and has already violated things like "all or nothing"
rules for visibility of writes-vs-reads (which admittedly most Linux
filesystems have violated even outside of DIO, since the strictest
reading of the rules is incredibly nasty anyway). But filesystems
like XFS which took some of the strict rules more seriously already
ignored them for DIO, afaik.

So I suspect you're fine. Buffered reads might care more, but even
there the whole "you can't really validly have partial reads anyway"
thing is a bigger violation to begin with.

With DIO, I suspect nobody cares about _those_ kinds of semantic
details. People who use DIO tend to care primarily about performance -
it's why they use it, after all - and are probably more than happy to
be lax about other rules.

But maybe somebody would prefer to have a mount option to specify just
how out-of-spec things can be (ie like the traditional old nfs 'intr'
thing). If only for testing, and for 'in case some odd app breaks'

                Linus


* Re: Out of order read() completion and buffer filling beyond returned amount
From: Matthew Wilcox @ 2022-01-17 13:30 UTC
  To: Linus Torvalds
  Cc: David Howells, Alexander Viro, Anna Schumaker, Dave Wysochanski,
	Dominique Martinet, Jeff Layton, Latchesar Ionkov, Marc Dionne,
	Omar Sandoval, Shyam Prasad N, Steve French, Trond Myklebust,
	Peter Zijlstra, ceph-devel, linux-afs, linux-cachefs, CIFS,
	linux-fsdevel, Linux-MM, open list:NFS, SUNRPC, AND...,
	v9fs-developer, Linux Kernel Mailing List

On Mon, Jan 17, 2022 at 12:19:29PM +0200, Linus Torvalds wrote:
> On Mon, Jan 17, 2022 at 11:57 AM David Howells <dhowells@redhat.com> wrote:
> >
> > Do you have an opinion on whether it's permissible for a filesystem to write
> > into the read() buffer beyond the amount it claims to return, though still
> > within the specified size of the buffer?
> 
> I'm pretty sure that would seriously violate POSIX in the general
> case, and maybe even break some programs that do fancy buffer
> management (ie I could imagine some circular buffer thing that expects
> any "unwritten" ('unread'?) parts to stay with the old contents)
> 
> That said, that's for generic 'read()' cases for things like tty's or
> pipes etc that can return partial reads in the first place.
> 
> If it's a regular file, then any partial read *already* violates
> POSIX, and nobody sane would do any such buffer management because
> it's supposed to be a 'can't happen' thing.
> 
> And since you mention DIO, that's doubly true, and is already outside
> basic POSIX, and has already violated things like "all or nothing"
> rules for visibility of writes-vs-reads (which admittedly most Linux
> filesystems have violated even outside of DIO, since the strictest
> reading of the rules is incredibly nasty anyway). But filesystems
> like XFS which took some of the strict rules more seriously already
> ignored them for DIO, afaik.

I think for DIO, you're sacrificing the entire buffer with any filesystem.
If the underlying file is split across multiple drives, or is even
just fragmented on a single drive, we'll submit multiple BIOs which
will complete independently (even for SCSI which writes sequentially;
never mind NVMe which can DMA blocks asynchronously).  It might be
more apparent in a networking situation where errors are more common,
but it's always been a possibility since Linux introduced DIO.


* Re: Out of order read() completion and buffer filling beyond returned amount
From: Christoph Hellwig @ 2022-01-18  7:25 UTC
  To: Matthew Wilcox
  Cc: Linus Torvalds, David Howells, Alexander Viro, Anna Schumaker,
	Dave Wysochanski, Dominique Martinet, Jeff Layton,
	Latchesar Ionkov, Marc Dionne, Omar Sandoval, Shyam Prasad N,
	Steve French, Trond Myklebust, Peter Zijlstra, ceph-devel,
	linux-afs, linux-cachefs, CIFS, linux-fsdevel, Linux-MM,
	open list:NFS, SUNRPC, AND...,
	v9fs-developer, Linux Kernel Mailing List

On Mon, Jan 17, 2022 at 01:30:05PM +0000, Matthew Wilcox wrote:
> I think for DIO, you're sacrificing the entire buffer with any filesystem.
> If the underlying file is split across multiple drives, or is even
> just fragmented on a single drive, we'll submit multiple BIOs which
> will complete independently (even for SCSI which writes sequentially;
> never mind NVMe which can DMA blocks asynchronously).  It might be
> more apparent in a networking situation where errors are more common,
> but it's always been a possibility since Linux introduced DIO.

Yes.  Probably because of that we also never allow short reads or writes
due to I/O errors but always fail the whole I/O.
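
The "fail the whole I/O" rule can be sketched as a reduction over per-segment results. This is an illustration under assumptions, not the kernel's DIO code: `dio_all_or_nothing()` is a hypothetical name, and each segment result is assumed to be either a byte count or a negative errno value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * seg[i] is the outcome of segment i: bytes transferred, or a negative
 * errno value. If any segment failed, the whole I/O fails with that error
 * rather than being reported as a short transfer; otherwise the byte
 * counts are summed.
 */
static ssize_t dio_all_or_nothing(const ssize_t *seg, size_t n)
{
	ssize_t total = 0;
	size_t i;

	assert(seg != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (seg[i] < 0)
			return seg[i];	/* first error fails everything */
		total += seg[i];
	}
	return total;
}
```

Three successful 4096-byte segments give 12288. If the middle one fails with -EIO, the result is -EIO even though the other two transferred data into the buffer.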
