Splice & iomap dio problems

* Splice & iomap dio problems
@ 2019-11-13 18:00 Jan Kara
  2019-11-13 18:44 ` Darrick J. Wong
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2019-11-13 18:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christoph Hellwig, Darrick J. Wong, Al Viro, Matthew Bobrowski,
	linux-ext4, Ted Tso

Hello,

I've spent today tracking down the syzkaller report of a WARN_ON hit in
iov_iter_pipe() [1]. The immediate problem is that syzkaller reproducer
(calling sendfile(2) from different threads at the same time a file to the
same file in rather evil way) results in splice code leaking pipe pages
(nrbufs doesn't return to 0 after read+write in the splice) and eventually
we run out of pipe pages and hit the warning in iov_iter_pipe(). The
problem is not specific to ext4, I can see in my tracing that when the
underlying filesystem is XFS, we can leak the pipe pages in the same way
(but for XFS somehow the problem doesn't happen as often).  Rather the
problem seems to be in how iomap direct IO code, pipe iter code, and splice
code interact.

So the problematic situation is when we do direct IO read into pipe pages
and the read hits EOF which is not on page boundary. Say the file has 4608
(4096+512) bytes, block size == page size == 4096. What happens is that iomap
code maps the extent, gets that the extent size is 8192 (mapping ignores
i_size). Then we call iomap_dio_bio_actor(), which creates its private
iter, truncates it to 8192, and calls bio_iov_iter_get_pages(). That
eventually results in preparing two pipe buffers with length 4096 to accept
the read. Then read completes, in iomap_dio_complete() we truncate the return
value from 8192 (which was the real amount of IO we performed) to 4608. Now
this amount (4608) gets passed through splice code to
iter_file_splice_write(), we write out that amount, but then when cleaning
up pipe buffers, the last pipe buffer has still 3584 unused so we leave
the pipe buffer allocated and effectively leak it.

Now I was also investigating why the old direct IO code doesn't leak pipe
buffers like this and the trick is done by the iov_iter_revert() call
generic_file_read_iter(). This results in setting iter position right to
the position where direct IO read reported it ended (4608) and truncating
pipe buffers after this point. So splice code then sees the second pipe
buffer has length only 512 which matches the amount it was asked to write
and so the pipe buffer gets freed after the write in
iter_file_splice_write().

The question is how to best fix this. The quick fix is to add
iov_iter_revert() call to iomap_dio_rw() so that in case of sync IO (we
always do only sync IO to pipes), we properly set iter position in case of
short read / write. But it looks somewhat hacky to me and this whole
interaction of iter and pipes looks fragile to me.

Another option I can see is to truncate the iter to min(i_size-pos, length) in
iomap_dio_bio_actor() which *should* do the trick AFAICT. But I'm not sure
if it won't break something else.

Any other ideas?

As a side note the logic copying iter in iomap_dio_bio_actor() looks
suspicious. We copy 'dio->submit.iter' to 'iter' but then in the loop we call
iov_iter_advance() on dio->submit.iter. So if bio_iov_iter_get_pages()
didn't return enough pages and we loop again, 'iter' will have stale
contents and things go sideways from there? What am I missing? And why do
we do that strange copying of iter instead of using iov_iter_truncate() and
iov_iter_reexpand() on the 'dio->submit.iter' directly?

								Honza

[1] https://lore.kernel.org/lkml/000000000000d60aa50596c63063@google.com

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread