Re: [PATCH 3/5] fs: remove ki_nbytes

From: Robert Baldyga <r.baldyga@samsung.com>
To: Al Viro <viro@ZenIV.linux.org.uk>,
	Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>, Miklos Szeredi <mszeredi@suse.cz>,
	linux-aio@kvack.org, linux-fsdevel@vger.kernel.org,
	Felipe Balbi <balbi@ti.com>,
	linux-usb@vger.kernel.org
Subject: Re: [PATCH 3/5] fs: remove ki_nbytes
Date: Thu, 05 Feb 2015 09:24:32 +0100	[thread overview]
Message-ID: <54D328C0.1070400@samsung.com> (raw)
In-Reply-To: <20150204230733.GK29656@ZenIV.linux.org.uk>

On 02/05/2015 12:07 AM, Al Viro wrote:
> On Wed, Feb 04, 2015 at 03:30:32PM -0500, Alan Stern wrote:
>>> 	* this one.  Note that you are not guaranteed that ep_config() won't
>>> be called more than once - two threads might race in write(2), with the loser
>>> getting through mutex_lock_interruptible(&data->lock); in ep_config() only
>>> after the winner has already gotten through write(), switched ->f_op, returned
>>> to userland and started doing read()/write()/etc.  If nothing else,
>>> the contents of data->desc and data->hs_desc can be buggered by arbitrary
>>> data, no matter how bogus, right as the first thread is doing IO.
>>
>> Well, this one certainly can be fixed to avoid altering ->f_op, at the 
>> cost of adding an extra check at the start of each I/O operation.
>  
>>> Um...  readv() is also going through ->aio_read().
>>
>> Why does readv() do this but not read()?  Wouldn't it make more sense 
>> to have all the read* calls use the same internal interface?
> 
> Because there are two partially overlapping classes wrt vector IO semantics:
> 	1) datagram-style.  Vectored read/write is equivalent to simple
> read/write done on each vector component.  And IO boundaries matter - if
> your driver treats any write() as datagram that starts e.g. with
> fixed-sized table in the beginning + arbitrary amount of data following
> it, you will get very different results from write(fd, buf, 200) and
> writev(fd, (struct iovec[2]){{buf, 100}, {buf+100, 100}}, 2).  A _lot_ of
> drivers are like that - they supply ->read() and ->write() for single-range
> IO and VFS construct the rest of operations out of those.
> 
> 	2) stream-style.  Vectored read is guaranteed to behave the same
> way as simple read on a range with size being the sum of vector element
> sizes, except that the data ends in ranges covered by vector elements instead
> of a single array.  Vectored write is guaranteed to behave the same way
> as simple write from a buffer containing the concatenation of ranges covered
> by vector elements.  Boundaries between the elements do not matter at all.
> Regular files on storage filesystems are like that.  So are FIFOs and pipes
> and so are sockets.  Even for datagram protocols, boundaries between the
> vector elements are ignored; boundaries between syscalls provide the datagram
> boundaries, but you can e.g. do writev(udp_socket_fd, (struct iovec[3]){
> {const_header, sizeof(const_header)}, {&n, 4}, {s, strlen(s)}}, 3) and have
> only one UDP packet sent.  IOW, it's general-purpose scatter-gather for read
> and write.
> 
> 	The last example shows that (2) isn't a subset of (1) - it's not
> always possible to call ->write() in loop and get the right behaviour.
> For regular files (and pure stream sockets, etc.) it would work, but for
> stuff like UDP sockets it would break.  Moreover, even for regular files on
> storage filesystems it would be quite inefficient - we'd need to acquire and
> release a bunch of locks, poke through metadata, etc., for each segment.
> 
> 	As the result, there was a couple of new methods added, inventively
> called ->readv() and ->writev().  do_sync_read() was supposed to be used
> as ->read() instance - it's "feed a single-element vector to ->readv()" and
> similar for s/read/write/.
> 
> 	Note that both (1) and (2) satisfy the following sanity requirement -
> single-element readv() is always equivalent to single-element() read().  You
> could violate that, by providing completely unrelated ->read() and ->readv(),
> but very few drivers went for that - too insane.
> 
> 	Then, when AIO had been added, those had grown an argument pointing
> to iocb (instead of file and ppos - for those we use iocb->ki_filp and
> &iocb->ki_pos resp.) and they got renamed into ->aio_read() and ->aio_write().
> Note that non-vectored AIO uses the same methods - ->read() and ->write() had
> too many instances to convert and most of those would end up just using those
> two iocb fields instead of the old arguments - tons of churn for no good
> reason.  ->readv() and ->writev() had fewer instances (very common was the
> use of generic_file_aio_{read,write}()) and conversion was less painful.
> So there was no ->aio_read() and ->aio_write().  That, in principle, was a
> bit of constraint - you couldn't make single-element AIO_PREADV behave
> different from AIO_PREAD, but nobody had been insane enough to ask for that.
> 
> 	Moreover, keeping ->readv() and ->writev() was pointless. There is
> cheap way to tell whether ->aio_{read,write}() call is due to io_submit(2)
> or to readv()/writev() - is_sync_kiocb(iocb) tells which one it is, so if
> driver really wanted different semantics for sync vs. async, it could check
> that.
> 
> 	So we ended up with ->read/->write for sync non-vectored and
> ->aio_read()/->aio_write() for sync vectored *and* async anything.  Usually
> you provide one or the other - NULL ->aio_... means loop calling ->read/write
> on each element, NULL ->read/write (or do_sync_... for them - it's the same
> thing) means feeding sync iocb and single-element vector to ->aio_....
> You *can* provide both, but that's rare and almost always pointless.
> 
> 	These days ->aio_{read,write} is almost gone - replaced with
> ->{read,write}_iter().  That sucker is equivalent, except that you
> get a pointer struct iov_iter instead iovec, nr_seg, size triple.  And
> you use linux/uio.h primitives for accessing the data (it might be backed
> by something other than userland ranges).  The primitives in there are
> not harder to use than copy_..._user(), and you avoid any need to loop over
> iovec array, etc.  Those primitives generally don't give a damn about the
> range boundaries; the only exception is iov_iter_single_seg_count(), which
> tells how much data is left to be consumed in the current segment; very
> few things are using it.  is_sync_kiocb() is still available to tell io_submit
> from synchronous syscalls.
> 
> 	I don't believe that it's worth switching the (1) \setminus (2) to
> those; it's still too much churn, plus we'd need the loop over segments
> somewhere to keep the semantics.  It *can* be expressed by ->read_iter()
> and ->write_iter(), but it'll be tons of boilerplate code in a lot of
> drivers.  So ->read() and ->write() are there to stay.  However, I think
> that we can get to the situation when it's really either one or another -
> if you have ->read_iter()/->write_iter(), don't mess with custom ->read()
> and ->write().
> 
> 	Both function/f_fs.c and legacy/inode.c are in class (2) - they
> gather data from iovec into temp buffer in ->aio_write() and pass the buffer to 
> the code doing actual IO, and on ->aio_read() side they give a buffer to
> read into, then arrange for it to be scattered into our iovec upon IO
> completion.  And they are doing non-trivial async stuff, so I'd prefer to
> switch them completely to ->read_iter/->write_iter.  The only obstacle is
> read vs. single-element readv and write vs. single-element writev behaviour
> difference.

No, function/f_fs.c and legacy/inode.c are in class (1). They have
datagram semantics - each vector element is submitted in separate USB
request. Each single request is sent in separate USB data packet (for
bulk endpoints it can be more than one packet). In fact sync
read()/write() also will give different results while called once with
some block of data or in loop with the same block of data splitted into
a few parts.

> 
> 	For some files it really can't be helped - {ipath,qib}fs has completely
> unrelated sets of commands accepted by write and writev, including the
> single-element one.  And /dev/snd/pcm*, while somewhat milder, has non-trivial
> behaviour differences.  I think that trick suggested by hch (put several
> flags in iocb, encoding the reason for the call; that would allow to tell
> one from another) is the best way to deal with those.  But I would really
> prefer to avoid that; IMO those examples are simply bad userland APIs.
> legacy/inode.c is the third (and last) beast like those two.  And there the
> difference is small enough to simply change readv/writev to be like read/write
> in that respect, i.e. also halt the endpoint when called on the isochronous
> one with wrong direction.

It shouldn't be a big problem to halt endpoints in
ep_aio_write()/ep_aio_read() to have similar behaviour for
single-element readv()/writev() and read()/write() operations.

Robert Baldyga