Linux-ext4 Archive on lore.kernel.org
 help / color / Atom feed
* Splice & iomap dio problems
@ 2019-11-13 18:00 Jan Kara
  2019-11-13 18:44 ` Darrick J. Wong
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2019-11-13 18:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christoph Hellwig, Darrick J. Wong, Al Viro, Matthew Bobrowski,
	linux-ext4, Ted Tso

Hello,

I've spent today tracking down the syzkaller report of a WARN_ON hit in
iov_iter_pipe() [1]. The immediate problem is that syzkaller reproducer
(calling sendfile(2) from different threads at the same time a file to the
same file in rather evil way) results in splice code leaking pipe pages
(nrbufs doesn't return to 0 after read+write in the splice) and eventually
we run out of pipe pages and hit the warning in iov_iter_pipe(). The
problem is not specific to ext4, I can see in my tracing that when the
underlying filesystem is XFS, we can leak the pipe pages in the same way
(but for XFS somehow the problem doesn't happen as often).  Rather the
problem seems to be in how iomap direct IO code, pipe iter code, and splice
code interact.

So the problematic situation is when we do direct IO read into pipe pages
and the read hits EOF which is not on page boundary. Say the file has 4608
(4096+512) bytes, block size == page size == 4096. What happens is that iomap
code maps the extent, gets that the extent size is 8192 (mapping ignores
i_size). Then we call iomap_dio_bio_actor(), which creates its private
iter, truncates it to 8192, and calls bio_iov_iter_get_pages(). That
eventually results in preparing two pipe buffers with length 4096 to accept
the read. Then read completes, in iomap_dio_complete() we truncate the return
value from 8192 (which was the real amount of IO we performed) to 4608. Now
this amount (4608) gets passed through splice code to
iter_file_splice_write(), we write out that amount, but then when cleaning
up pipe buffers, the last pipe buffer has still 3584 unused so we leave
the pipe buffer allocated and effectively leak it.

Now I was also investigating why the old direct IO code doesn't leak pipe
buffers like this and the trick is done by the iov_iter_revert() call
generic_file_read_iter(). This results in setting iter position right to
the position where direct IO read reported it ended (4608) and truncating
pipe buffers after this point. So splice code then sees the second pipe
buffer has length only 512 which matches the amount it was asked to write
and so the pipe buffer gets freed after the write in
iter_file_splice_write().

The question is how to best fix this. The quick fix is to add
iov_iter_revert() call to iomap_dio_rw() so that in case of sync IO (we
always do only sync IO to pipes), we properly set iter position in case of
short read / write. But it looks somewhat hacky to me and this whole
interaction of iter and pipes looks fragile to me.

Another option I can see is to truncate the iter to min(i_size-pos, length) in
iomap_dio_bio_actor() which *should* do the trick AFAICT. But I'm not sure
if it won't break something else.

Any other ideas?

As a side note the logic copying iter in iomap_dio_bio_actor() looks
suspicious. We copy 'dio->submit.iter' to 'iter' but then in the loop we call
iov_iter_advance() on dio->submit.iter. So if bio_iov_iter_get_pages()
didn't return enough pages and we loop again, 'iter' will have stale
contents and things go sideways from there? What am I missing? And why do
we do that strange copying of iter instead of using iov_iter_truncate() and
iov_iter_reexpand() on the 'dio->submit.iter' directly?

								Honza

[1] https://lore.kernel.org/lkml/000000000000d60aa50596c63063@google.com

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Splice & iomap dio problems
  2019-11-13 18:00 Splice & iomap dio problems Jan Kara
@ 2019-11-13 18:44 ` Darrick J. Wong
  2019-11-19 16:32   ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Darrick J. Wong @ 2019-11-13 18:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Christoph Hellwig, Al Viro, Matthew Bobrowski,
	linux-ext4, Ted Tso

On Wed, Nov 13, 2019 at 07:00:32PM +0100, Jan Kara wrote:
> Hello,
> 
> I've spent today tracking down the syzkaller report of a WARN_ON hit in
> iov_iter_pipe() [1]. The immediate problem is that syzkaller reproducer
> (calling sendfile(2) from different threads at the same time a file to the
> same file in rather evil way) results in splice code leaking pipe pages
> (nrbufs doesn't return to 0 after read+write in the splice) and eventually
> we run out of pipe pages and hit the warning in iov_iter_pipe(). The
> problem is not specific to ext4, I can see in my tracing that when the
> underlying filesystem is XFS, we can leak the pipe pages in the same way
> (but for XFS somehow the problem doesn't happen as often).  Rather the
> problem seems to be in how iomap direct IO code, pipe iter code, and splice
> code interact.
> 
> So the problematic situation is when we do direct IO read into pipe pages
> and the read hits EOF which is not on page boundary. Say the file has 4608
> (4096+512) bytes, block size == page size == 4096. What happens is that iomap
> code maps the extent, gets that the extent size is 8192 (mapping ignores

I wonder, would this work properly if the read side returns a 4608-byte
mapping instead of an 8192-byte mapping?  It doesn't make a lot of sense
(to me, anyway) for a read mapping to go beyond EOF.

> i_size). Then we call iomap_dio_bio_actor(), which creates its private
> iter, truncates it to 8192, and calls bio_iov_iter_get_pages(). That
> eventually results in preparing two pipe buffers with length 4096 to accept
> the read. Then read completes, in iomap_dio_complete() we truncate the return
> value from 8192 (which was the real amount of IO we performed) to 4608. Now
> this amount (4608) gets passed through splice code to
> iter_file_splice_write(), we write out that amount, but then when cleaning
> up pipe buffers, the last pipe buffer has still 3584 unused so we leave
> the pipe buffer allocated and effectively leak it.
> 
> Now I was also investigating why the old direct IO code doesn't leak pipe
> buffers like this and the trick is done by the iov_iter_revert() call
> generic_file_read_iter(). This results in setting iter position right to
> the position where direct IO read reported it ended (4608) and truncating
> pipe buffers after this point. So splice code then sees the second pipe
> buffer has length only 512 which matches the amount it was asked to write
> and so the pipe buffer gets freed after the write in
> iter_file_splice_write().
> 
> The question is how to best fix this. The quick fix is to add
> iov_iter_revert() call to iomap_dio_rw() so that in case of sync IO (we
> always do only sync IO to pipes), we properly set iter position in case of
> short read / write. But it looks somewhat hacky to me and this whole
> interaction of iter and pipes looks fragile to me.
> 
> Another option I can see is to truncate the iter to min(i_size-pos, length) in
> iomap_dio_bio_actor() which *should* do the trick AFAICT. But I'm not sure
> if it won't break something else.

Do the truncation in ->iomap_begin on the read side, as I suggested above?

> Any other ideas?
> 
> As a side note the logic copying iter in iomap_dio_bio_actor() looks
> suspicious. We copy 'dio->submit.iter' to 'iter' but then in the loop we call
> iov_iter_advance() on dio->submit.iter. So if bio_iov_iter_get_pages()
> didn't return enough pages and we loop again, 'iter' will have stale
> contents and things go sideways from there? What am I missing? And why do
> we do that strange copying of iter instead of using iov_iter_truncate() and
> iov_iter_reexpand() on the 'dio->submit.iter' directly?

I'm similarly puzzled; I would've thought that we'd need to advance the
private @iter too.  Or just truncate and reexpand the dio->submit.iter
and not have the private one.

With any luck hch will have some ideas? :/

--D

> 
> 								Honza
> 
> [1] https://lore.kernel.org/lkml/000000000000d60aa50596c63063@google.com
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Splice & iomap dio problems
  2019-11-13 18:44 ` Darrick J. Wong
@ 2019-11-19 16:32   ` Jan Kara
  2019-11-19 16:34     ` Darrick J. Wong
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2019-11-19 16:32 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, linux-fsdevel, Christoph Hellwig, Al Viro,
	Matthew Bobrowski, linux-ext4, Ted Tso

On Wed 13-11-19 10:44:03, Darrick J. Wong wrote:
> On Wed, Nov 13, 2019 at 07:00:32PM +0100, Jan Kara wrote:
> > Hello,
> > 
> > I've spent today tracking down the syzkaller report of a WARN_ON hit in
> > iov_iter_pipe() [1]. The immediate problem is that syzkaller reproducer
> > (calling sendfile(2) from different threads at the same time a file to the
> > same file in rather evil way) results in splice code leaking pipe pages
> > (nrbufs doesn't return to 0 after read+write in the splice) and eventually
> > we run out of pipe pages and hit the warning in iov_iter_pipe(). The
> > problem is not specific to ext4, I can see in my tracing that when the
> > underlying filesystem is XFS, we can leak the pipe pages in the same way
> > (but for XFS somehow the problem doesn't happen as often).  Rather the
> > problem seems to be in how iomap direct IO code, pipe iter code, and splice
> > code interact.
> > 
> > So the problematic situation is when we do direct IO read into pipe pages
> > and the read hits EOF which is not on page boundary. Say the file has 4608
> > (4096+512) bytes, block size == page size == 4096. What happens is that iomap
> > code maps the extent, gets that the extent size is 8192 (mapping ignores
> 
> I wonder, would this work properly if the read side returns a 4608-byte
> mapping instead of an 8192-byte mapping?  It doesn't make a lot of sense
> (to me, anyway) for a read mapping to go beyond EOF.

The slight concern I have with this is that that would change e.g. the
behavior of IOMAP_REPORT. We could specialcase IOMAP_REPORT but then it
gets kind of ugly. And it seems kind of fuzzy when do we truncate the
extent with i_size and when not... Generally i_size is kind of a side-band
thing for block mapping operations so if we could leave it out of
->iomap_begin I'd find that nicer.

> > i_size). Then we call iomap_dio_bio_actor(), which creates its private
> > iter, truncates it to 8192, and calls bio_iov_iter_get_pages(). That
> > eventually results in preparing two pipe buffers with length 4096 to accept
> > the read. Then read completes, in iomap_dio_complete() we truncate the return
> > value from 8192 (which was the real amount of IO we performed) to 4608. Now
> > this amount (4608) gets passed through splice code to
> > iter_file_splice_write(), we write out that amount, but then when cleaning
> > up pipe buffers, the last pipe buffer has still 3584 unused so we leave
> > the pipe buffer allocated and effectively leak it.
> > 
> > Now I was also investigating why the old direct IO code doesn't leak pipe
> > buffers like this and the trick is done by the iov_iter_revert() call
> > generic_file_read_iter(). This results in setting iter position right to
> > the position where direct IO read reported it ended (4608) and truncating
> > pipe buffers after this point. So splice code then sees the second pipe
> > buffer has length only 512 which matches the amount it was asked to write
> > and so the pipe buffer gets freed after the write in
> > iter_file_splice_write().
> > 
> > The question is how to best fix this. The quick fix is to add
> > iov_iter_revert() call to iomap_dio_rw() so that in case of sync IO (we
> > always do only sync IO to pipes), we properly set iter position in case of
> > short read / write. But it looks somewhat hacky to me and this whole
> > interaction of iter and pipes looks fragile to me.
> > 
> > Another option I can see is to truncate the iter to min(i_size-pos, length) in
> > iomap_dio_bio_actor() which *should* do the trick AFAICT. But I'm not sure
> > if it won't break something else.
> 
> Do the truncation in ->iomap_begin on the read side, as I suggested above?

Yes, that would be equivalent for this case.

> > Any other ideas?
> > 
> > As a side note the logic copying iter in iomap_dio_bio_actor() looks
> > suspicious. We copy 'dio->submit.iter' to 'iter' but then in the loop we call
> > iov_iter_advance() on dio->submit.iter. So if bio_iov_iter_get_pages()
> > didn't return enough pages and we loop again, 'iter' will have stale
> > contents and things go sideways from there? What am I missing? And why do
> > we do that strange copying of iter instead of using iov_iter_truncate() and
> > iov_iter_reexpand() on the 'dio->submit.iter' directly?
> 
> I'm similarly puzzled; I would've thought that we'd need to advance the
> private @iter too.  Or just truncate and reexpand the dio->submit.iter
> and not have the private one.
> 
> With any luck hch will have some ideas? :/

Christoph seems to be busy with something else. So I'll just write patches,
run them through fstests and see if something blows up.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Splice & iomap dio problems
  2019-11-19 16:32   ` Jan Kara
@ 2019-11-19 16:34     ` Darrick J. Wong
  2019-11-19 16:48       ` Christoph Hellwig
  0 siblings, 1 reply; 5+ messages in thread
From: Darrick J. Wong @ 2019-11-19 16:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Christoph Hellwig, Al Viro, Matthew Bobrowski,
	linux-ext4, Ted Tso

On Tue, Nov 19, 2019 at 05:32:14PM +0100, Jan Kara wrote:
> On Wed 13-11-19 10:44:03, Darrick J. Wong wrote:
> > On Wed, Nov 13, 2019 at 07:00:32PM +0100, Jan Kara wrote:
> > > Hello,
> > > 
> > > I've spent today tracking down the syzkaller report of a WARN_ON hit in
> > > iov_iter_pipe() [1]. The immediate problem is that syzkaller reproducer
> > > (calling sendfile(2) from different threads at the same time a file to the
> > > same file in rather evil way) results in splice code leaking pipe pages
> > > (nrbufs doesn't return to 0 after read+write in the splice) and eventually
> > > we run out of pipe pages and hit the warning in iov_iter_pipe(). The
> > > problem is not specific to ext4, I can see in my tracing that when the
> > > underlying filesystem is XFS, we can leak the pipe pages in the same way
> > > (but for XFS somehow the problem doesn't happen as often).  Rather the
> > > problem seems to be in how iomap direct IO code, pipe iter code, and splice
> > > code interact.
> > > 
> > > So the problematic situation is when we do direct IO read into pipe pages
> > > and the read hits EOF which is not on page boundary. Say the file has 4608
> > > (4096+512) bytes, block size == page size == 4096. What happens is that iomap
> > > code maps the extent, gets that the extent size is 8192 (mapping ignores
> > 
> > I wonder, would this work properly if the read side returns a 4608-byte
> > mapping instead of an 8192-byte mapping?  It doesn't make a lot of sense
> > (to me, anyway) for a read mapping to go beyond EOF.
> 
> The slight concern I have with this is that that would change e.g. the
> behavior of IOMAP_REPORT. We could specialcase IOMAP_REPORT but then it
> gets kind of ugly. And it seems kind of fuzzy when do we truncate the
> extent with i_size and when not... Generally i_size is kind of a side-band
> thing for block mapping operations so if we could leave it out of
> ->iomap_begin I'd find that nicer.

<nod>

> > > i_size). Then we call iomap_dio_bio_actor(), which creates its private
> > > iter, truncates it to 8192, and calls bio_iov_iter_get_pages(). That
> > > eventually results in preparing two pipe buffers with length 4096 to accept
> > > the read. Then read completes, in iomap_dio_complete() we truncate the return
> > > value from 8192 (which was the real amount of IO we performed) to 4608. Now
> > > this amount (4608) gets passed through splice code to
> > > iter_file_splice_write(), we write out that amount, but then when cleaning
> > > up pipe buffers, the last pipe buffer has still 3584 unused so we leave
> > > the pipe buffer allocated and effectively leak it.
> > > 
> > > Now I was also investigating why the old direct IO code doesn't leak pipe
> > > buffers like this and the trick is done by the iov_iter_revert() call
> > > generic_file_read_iter(). This results in setting iter position right to
> > > the position where direct IO read reported it ended (4608) and truncating
> > > pipe buffers after this point. So splice code then sees the second pipe
> > > buffer has length only 512 which matches the amount it was asked to write
> > > and so the pipe buffer gets freed after the write in
> > > iter_file_splice_write().
> > > 
> > > The question is how to best fix this. The quick fix is to add
> > > iov_iter_revert() call to iomap_dio_rw() so that in case of sync IO (we
> > > always do only sync IO to pipes), we properly set iter position in case of
> > > short read / write. But it looks somewhat hacky to me and this whole
> > > interaction of iter and pipes looks fragile to me.
> > > 
> > > Another option I can see is to truncate the iter to min(i_size-pos, length) in
> > > iomap_dio_bio_actor() which *should* do the trick AFAICT. But I'm not sure
> > > if it won't break something else.
> > 
> > Do the truncation in ->iomap_begin on the read side, as I suggested above?
> 
> Yes, that would be equivalent for this case.
> 
> > > Any other ideas?
> > > 
> > > As a side note the logic copying iter in iomap_dio_bio_actor() looks
> > > suspicious. We copy 'dio->submit.iter' to 'iter' but then in the loop we call
> > > iov_iter_advance() on dio->submit.iter. So if bio_iov_iter_get_pages()
> > > didn't return enough pages and we loop again, 'iter' will have stale
> > > contents and things go sideways from there? What am I missing? And why do
> > > we do that strange copying of iter instead of using iov_iter_truncate() and
> > > iov_iter_reexpand() on the 'dio->submit.iter' directly?
> > 
> > I'm similarly puzzled; I would've thought that we'd need to advance the
> > private @iter too.  Or just truncate and reexpand the dio->submit.iter
> > and not have the private one.
> > 
> > With any luck hch will have some ideas? :/
> 
> Christoph seems to be busy with something else. So I'll just write patches,
> run them through fstests and see if something blows up.

Heheh.  Ok, sounds good!

--D

> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Splice & iomap dio problems
  2019-11-19 16:34     ` Darrick J. Wong
@ 2019-11-19 16:48       ` Christoph Hellwig
  0 siblings, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2019-11-19 16:48 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, linux-fsdevel, Christoph Hellwig, Al Viro,
	Matthew Bobrowski, linux-ext4, Ted Tso

On Tue, Nov 19, 2019 at 08:34:54AM -0800, Darrick J. Wong wrote:
> > The slight concern I have with this is that that would change e.g. the
> > behavior of IOMAP_REPORT. We could specialcase IOMAP_REPORT but then it
> > gets kind of ugly. And it seems kind of fuzzy when do we truncate the
> > extent with i_size and when not... Generally i_size is kind of a side-band
> > thing for block mapping operations so if we could leave it out of
> > ->iomap_begin I'd find that nicer.
> 
> <nod>

Yes.  I'd prefer if the caller deals with any i_size limiting and
not the iomap methods themselves.  For now I'm tempted to just go
with the iov_iter_revert scheme.  Note that I particularly like it,
but it matches the most common direct I/O implementation at least.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, back to index

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-13 18:00 Splice & iomap dio problems Jan Kara
2019-11-13 18:44 ` Darrick J. Wong
2019-11-19 16:32   ` Jan Kara
2019-11-19 16:34     ` Darrick J. Wong
2019-11-19 16:48       ` Christoph Hellwig

Linux-ext4 Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-ext4/0 linux-ext4/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-ext4 linux-ext4/ https://lore.kernel.org/linux-ext4 \
		linux-ext4@vger.kernel.org
	public-inbox-index linux-ext4

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-ext4


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git