Linux-XFS Archive on lore.kernel.org
* [PATCH 0/2] iomap: zero dirty pages over unwritten extents
@ 2020-10-12 14:03 Brian Foster
  2020-10-12 14:03 ` [PATCH 1/2] iomap: use page dirty state to seek data " Brian Foster
  2020-10-12 14:03 ` [PATCH 2/2] iomap: zero cached pages over unwritten extents on zero range Brian Foster
  0 siblings, 2 replies; 13+ messages in thread
From: Brian Foster @ 2020-10-12 14:03 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs

Hi all,

This is an alternate/iomap based variant of the XFS fix posted here:

https://lore.kernel.org/linux-xfs/20201007143509.669729-1-bfoster@redhat.com/

This addresses a post-eof stale data exposure problem that can occur
when truncate down races with writeback of the new EOF page and the new
EOF block happens to be unwritten. Instead of explicitly flushing the
new EOF page in XFS, however, this variant updates iomap zero range to
check for and zero preexisting dirty pages over unwritten extents. This
reuses the dirty page checking that seek data/hole already implements
for unwritten extents. Patch 1 is technically not required, but fixes an
odd behavior I noticed when playing around with zero range. Without it,
a zero range (with patch 2) over a large, cached (but clean), unwritten
mapping would unnecessarily zero pages that aren't otherwise dirty. I
don't think this is actually a problem in practice today as most large
zero range requests are truncates over post-eof space (which shouldn't
have page cache), but it seemed odd regardless.
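The race can be sketched as a toy userspace model (all names and state here are illustrative, not kernel code; a 4-byte "page" stands in for the new EOF page, with bytes [2,4) lying past the new EOF):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PGSZ 4
#define NEW_EOF 2   /* new EOF lands mid-page; bytes [2,4) are post-EOF */

struct state {
	bool unwritten;       /* extent state on disk */
	char disk[PGSZ];      /* on-disk block contents */
	bool cached, dirty;   /* page cache state */
	char page[PGSZ];      /* cached contents */
};

static void zero_tail(char *buf)
{
	memset(buf + NEW_EOF, 0, PGSZ - NEW_EOF);
}

/* iomap_truncate_page() model: 'fixed' selects pre/post patch-2 behavior */
static void truncate_page(struct state *s, bool fixed)
{
	if (s->unwritten && !(fixed && s->cached && s->dirty))
		return;            /* old code always skips unwritten */
	if (s->cached) {
		zero_tail(s->page);
		s->dirty = true;
	}
}

/* racing writeback: converts the unwritten extent and cleans the page */
static void writeback(struct state *s)
{
	if (s->cached && s->dirty) {
		memcpy(s->disk, s->page, PGSZ);
		s->unwritten = false;
		s->dirty = false;
	}
}

/* truncate_setsize() model: zeroes the tail in cache but does NOT dirty */
static void truncate_setsize(struct state *s)
{
	if (s->cached)
		zero_tail(s->page);
}

static void reclaim(struct state *s)
{
	s->cached = false;
}

/* mapped read faults the page back in from disk (zeroes if unwritten) */
static char read_byte(struct state *s, int off)
{
	if (!s->cached) {
		s->cached = true;
		if (s->unwritten)
			memset(s->page, 0, PGSZ);
		else
			memcpy(s->page, s->disk, PGSZ);
	}
	return s->page[off];
}

static char run(bool fixed)
{
	struct state s = { .unwritten = true, .cached = true, .dirty = true };

	memcpy(s.page, "ABCD", PGSZ);      /* dirty data awaiting conversion */
	truncate_page(&s, fixed);
	writeback(&s);                     /* races in before pagecache truncate */
	truncate_setsize(&s);
	reclaim(&s);
	return read_byte(&s, NEW_EOF + 1); /* mapped read past new EOF */
}
```

With the old behavior run(false) returns the stale byte 'D'; with patch 2's behavior run(true) returns zero, since the dirty page was zeroed in cache before writeback converted the extent.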

Thoughts, reviews, flames appreciated.

Brian

Brian Foster (2):
  iomap: use page dirty state to seek data over unwritten extents
  iomap: zero cached pages over unwritten extents on zero range

 fs/iomap/buffered-io.c | 39 +++++++++++++++++++++++++++++++++++++--
 fs/iomap/seek.c        |  7 ++++---
 include/linux/iomap.h  |  2 ++
 3 files changed, 43 insertions(+), 5 deletions(-)

-- 
2.25.4


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] iomap: use page dirty state to seek data over unwritten extents
  2020-10-12 14:03 [PATCH 0/2] iomap: zero dirty pages over unwritten extents Brian Foster
@ 2020-10-12 14:03 ` Brian Foster
  2020-10-13 12:30   ` Brian Foster
                     ` (2 more replies)
  2020-10-12 14:03 ` [PATCH 2/2] iomap: zero cached pages over unwritten extents on zero range Brian Foster
  1 sibling, 3 replies; 13+ messages in thread
From: Brian Foster @ 2020-10-12 14:03 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs

iomap seek hole/data currently uses page Uptodate state to track
data over unwritten extents. This is odd and unpredictable in that
the existence of clean pages changes behavior. For example:

  $ xfs_io -fc "falloc 0 32k" -c "seek -d 0" \
	    -c "pread 16k 4k" -c "seek -d 0" /mnt/file
  Whence  Result
  DATA    EOF
  ...
  Whence  Result
  DATA    16384

Instead, use page dirty state to locate data over unwritten extents.
This causes seek data to land on the first uptodate block of a dirty
page since we don't have per-block dirty state in iomap. This is
consistent with writeback, however, which converts all uptodate
blocks of a dirty page for similar reasons.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/seek.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
index 107ee80c3568..981a74c8d60f 100644
--- a/fs/iomap/seek.c
+++ b/fs/iomap/seek.c
@@ -40,7 +40,7 @@ page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
 	 * Just check the page unless we can and should check block ranges:
 	 */
 	if (bsize == PAGE_SIZE || !ops->is_partially_uptodate)
-		return PageUptodate(page) == seek_data;
+		return PageDirty(page) == seek_data;
 
 	lock_page(page);
 	if (unlikely(page->mapping != inode->i_mapping))
@@ -49,7 +49,8 @@ page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
 	for (off = 0; off < PAGE_SIZE; off += bsize) {
 		if (offset_in_page(*lastoff) >= off + bsize)
 			continue;
-		if (ops->is_partially_uptodate(page, off, bsize) == seek_data) {
+		if ((ops->is_partially_uptodate(page, off, bsize) &&
+		     PageDirty(page)) == seek_data) {
 			unlock_page(page);
 			return true;
 		}
-- 
2.25.4



* [PATCH 2/2] iomap: zero cached pages over unwritten extents on zero range
  2020-10-12 14:03 [PATCH 0/2] iomap: zero dirty pages over unwritten extents Brian Foster
  2020-10-12 14:03 ` [PATCH 1/2] iomap: use page dirty state to seek data " Brian Foster
@ 2020-10-12 14:03 ` Brian Foster
  2020-10-15  9:49   ` Christoph Hellwig
  1 sibling, 1 reply; 13+ messages in thread
From: Brian Foster @ 2020-10-12 14:03 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs

The iomap zero range mechanism currently skips unwritten mappings.
This is normally not a problem as most users synchronize in-core
state to the underlying block mapping by flushing pagecache prior to
calling into iomap. This is not always the case, however. For
example, XFS calls iomap_truncate_page() on truncate down before
flushing the new EOF page of the file. This means that if the new
EOF block is unwritten but covered by a dirty page (i.e. awaiting
unwritten conversion after writeback), iomap fails to zero that
page. The subsequent truncate_setsize() call does perform page
zeroing, but doesn't dirty the page. Therefore if the new EOF page
is written back after calling into iomap and before the pagecache
truncate, the post-EOF zeroing is lost on page reclaim. This exposes
stale post-EOF data on mapped reads.

To address this problem, update the iomap zero range mechanism to
explicitly zero ranges over unwritten extents where pagecache
happens to exist. This is similar to how iomap seek data works for
unwritten extents with cached data. In fact, we can reuse the same
mechanism to scan for pagecache over large unwritten mappings to
retain the same level of efficiency when zeroing large unwritten
(and non-dirty) ranges.

Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/buffered-io.c | 39 +++++++++++++++++++++++++++++++++++++--
 fs/iomap/seek.c        |  2 +-
 include/linux/iomap.h  |  2 ++
 3 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index bcfc288dba3f..a07703d686da 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -944,6 +944,30 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
 	return iomap_write_end(inode, pos, bytes, bytes, page, iomap, srcmap);
 }
 
+/*
+ * Seek data over an unwritten mapping and update the counters for the caller to
+ * perform zeroing, if necessary.
+ */
+static void
+iomap_zero_range_skip_uncached(struct inode *inode, loff_t *pos,
+		loff_t *count, loff_t *written)
+{
+	unsigned dirty_offset, bytes = 0;
+
+	dirty_offset = page_cache_seek_hole_data(inode, *pos, *count,
+				SEEK_DATA);
+	if (dirty_offset == -ENOENT)
+		bytes = *count;
+	else if (dirty_offset > *pos)
+		bytes = dirty_offset - *pos;
+
+	if (bytes) {
+		*pos += bytes;
+		*count -= bytes;
+		*written += bytes;
+	}
+}
+
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		void *data, struct iomap *iomap, struct iomap *srcmap)
@@ -952,13 +976,24 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 	loff_t written = 0;
 	int status;
 
-	/* already zeroed?  we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+	/* holes are already zeroed, we're done */
+	if (srcmap->type == IOMAP_HOLE)
 		return count;
 
 	do {
 		unsigned offset, bytes;
 
+		/*
+		 * Unwritten mappings are effectively zeroed on disk, but we
+		 * must zero any preexisting data pages over the range.
+		 */
+		if (srcmap->type == IOMAP_UNWRITTEN) {
+			iomap_zero_range_skip_uncached(inode, &pos, &count,
+				&written);
+			if (!count)
+				break;
+		}
+
 		offset = offset_in_page(pos);
 		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
 
diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
index 981a74c8d60f..6804c1d5808e 100644
--- a/fs/iomap/seek.c
+++ b/fs/iomap/seek.c
@@ -71,7 +71,7 @@ page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
  *
  * Returns the resulting offset on success, and -ENOENT otherwise.
  */
-static loff_t
+loff_t
 page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
 		int whence)
 {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4d1d3c3469e9..898c012f4f33 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -184,6 +184,8 @@ loff_t iomap_seek_data(struct inode *inode, loff_t offset,
 		const struct iomap_ops *ops);
 sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
 		const struct iomap_ops *ops);
+loff_t page_cache_seek_hole_data(struct inode *inode, loff_t offset,
+		loff_t length, int whence);
 
 /*
  * Structure for writeback I/O completions.
-- 
2.25.4



* Re: [PATCH 1/2] iomap: use page dirty state to seek data over unwritten extents
  2020-10-12 14:03 ` [PATCH 1/2] iomap: use page dirty state to seek data " Brian Foster
@ 2020-10-13 12:30   ` Brian Foster
  2020-10-13 22:53   ` Dave Chinner
  2020-10-15  9:47   ` Christoph Hellwig
  2 siblings, 0 replies; 13+ messages in thread
From: Brian Foster @ 2020-10-13 12:30 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs

On Mon, Oct 12, 2020 at 10:03:49AM -0400, Brian Foster wrote:
> iomap seek hole/data currently uses page Uptodate state to track
> data over unwritten extents. This is odd and unpredictable in that
> the existence of clean pages changes behavior. For example:
> 
>   $ xfs_io -fc "falloc 0 32k" -c "seek -d 0" \
> 	    -c "pread 16k 4k" -c "seek -d 0" /mnt/file
>   Whence  Result
>   DATA    EOF
>   ...
>   Whence  Result
>   DATA    16384
> 
> Instead, use page dirty state to locate data over unwritten extents.
> This causes seek data to land on the first uptodate block of a dirty
> page since we don't have per-block dirty state in iomap. This is
> consistent with writeback, however, which converts all uptodate
> blocks of a dirty page for similar reasons.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---

JFYI that I hit a generic/285 failure with this patch. I suspect this
needs to check for Dirty || Writeback, otherwise if we see the latter
the range is incorrectly treated as a hole.
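The per-block check with the Dirty || Writeback fix described above can be sketched as a toy predicate (names are illustrative, not the actual kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of page_seek_hole_data()'s per-block decision: a block only
 * counts as data if it is uptodate AND the page is dirty or currently
 * under writeback. Checking dirty alone misclassifies a page whose
 * writeback has started (dirty flag already cleared) as a hole.
 */
static bool block_matches(bool uptodate, bool dirty, bool writeback,
			  bool seek_data)
{
	bool has_data = uptodate && (dirty || writeback);

	return has_data == seek_data;
}
```

A clean uptodate page is treated as a hole (the intended change in patch 1), while a page in writeback is still reported as data, which avoids the generic/285 failure mentioned above.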

Brian

>  fs/iomap/seek.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
> index 107ee80c3568..981a74c8d60f 100644
> --- a/fs/iomap/seek.c
> +++ b/fs/iomap/seek.c
> @@ -40,7 +40,7 @@ page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
>  	 * Just check the page unless we can and should check block ranges:
>  	 */
>  	if (bsize == PAGE_SIZE || !ops->is_partially_uptodate)
> -		return PageUptodate(page) == seek_data;
> +		return PageDirty(page) == seek_data;
>  
>  	lock_page(page);
>  	if (unlikely(page->mapping != inode->i_mapping))
> @@ -49,7 +49,8 @@ page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
>  	for (off = 0; off < PAGE_SIZE; off += bsize) {
>  		if (offset_in_page(*lastoff) >= off + bsize)
>  			continue;
> -		if (ops->is_partially_uptodate(page, off, bsize) == seek_data) {
> +		if ((ops->is_partially_uptodate(page, off, bsize) &&
> +		     PageDirty(page)) == seek_data) {
>  			unlock_page(page);
>  			return true;
>  		}
> -- 
> 2.25.4
> 



* Re: [PATCH 1/2] iomap: use page dirty state to seek data over unwritten extents
  2020-10-12 14:03 ` [PATCH 1/2] iomap: use page dirty state to seek data " Brian Foster
  2020-10-13 12:30   ` Brian Foster
@ 2020-10-13 22:53   ` Dave Chinner
  2020-10-14 12:59     ` Brian Foster
  2020-10-15  9:47   ` Christoph Hellwig
  2 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2020-10-13 22:53 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs

On Mon, Oct 12, 2020 at 10:03:49AM -0400, Brian Foster wrote:
> iomap seek hole/data currently uses page Uptodate state to track
> data over unwritten extents. This is odd and unpredictable in that
> the existence of clean pages changes behavior. For example:
> 
>   $ xfs_io -fc "falloc 0 32k" -c "seek -d 0" \
> 	    -c "pread 16k 4k" -c "seek -d 0" /mnt/file
>   Whence  Result
>   DATA    EOF
>   ...
>   Whence  Result
>   DATA    16384

I don't think there is any way around this, because the page cache
lookup done by the seek hole/data code is an
unlocked operation and can race with other IO and operations. That
is, seek does not take IO serialisation locks at all so
read/write/page faults/fallocate/etc all run concurrently with it...

i.e. we get an iomap that is current at the time the iomap_begin()
call is made, but we don't hold any locks to stabilise that extent
range while we do a page cache traversal looking for cached data.
That means any region of the unwritten iomap can change state while
we are running the page cache seek.

We cannot determine the data contents without major overhead,
and if we are seeking over a large unwritten extent covered by clean
pages that then gets partially written synchronously by another
concurrent write IO then we might trip across clean uptodate pages
with real data in them by the time the page cache scan gets to it.

Hence the only thing we are looking at here is whether there is data
present in the cache or not. As such, I think assuming that only
dirty/writeback pages contain actual user data in a seek data/hole
operation is a fundamentally incorrect premise.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/2] iomap: use page dirty state to seek data over unwritten extents
  2020-10-13 22:53   ` Dave Chinner
@ 2020-10-14 12:59     ` Brian Foster
  2020-10-14 22:37       ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Brian Foster @ 2020-10-14 12:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-xfs

On Wed, Oct 14, 2020 at 09:53:44AM +1100, Dave Chinner wrote:
> On Mon, Oct 12, 2020 at 10:03:49AM -0400, Brian Foster wrote:
> > iomap seek hole/data currently uses page Uptodate state to track
> > data over unwritten extents. This is odd and unpredictable in that
> > the existence of clean pages changes behavior. For example:
> > 
> >   $ xfs_io -fc "falloc 0 32k" -c "seek -d 0" \
> > 	    -c "pread 16k 4k" -c "seek -d 0" /mnt/file
> >   Whence  Result
> >   DATA    EOF
> >   ...
> >   Whence  Result
> >   DATA    16384
> 
> I don't think there is any way around this, because the page cache
> lookup done by the seek hole/data code is an
> unlocked operation and can race with other IO and operations. That
> is, seek does not take IO serialisation locks at all so
> read/write/page faults/fallocate/etc all run concurrently with it...
> 
> i.e. we get an iomap that is current at the time the iomap_begin()
> call is made, but we don't hold any locks to stabilise that extent
> range while we do a page cache traversal looking for cached data.
> That means any region of the unwritten iomap can change state while
> we are running the page cache seek.
> 

Hm, Ok.. that makes sense..

> We cannot determine the data contents without major overhead,
> and if we are seeking over a large unwritten extent covered by clean
> pages that then gets partially written synchronously by another
> concurrent write IO then we might trip across clean uptodate pages
> with real data in them by the time the page cache scan gets to it.
> 
> Hence the only thing we are looking at here is whether there is data
> present in the cache or not. As such, I think assuming that only
> dirty/writeback pages contain actual user data in a seek data/hole
> operation is a fundamentally incorrect premise.
> 

... but afaict this kind of thing is already possible because nothing
stops a subsequently cleaned page (i.e., dirtied and written back) from
also being dropped from cache before the scan finds it. IOW, I don't
really see how this justifies using one page state check over another as
opposed to pointing out the whole page scanning thing itself seems to be
racy. Perhaps the reasoning wrt to seek is simply that we should either
see one state (hole) or the next (data) and we don't terribly care much
about seek being racy..?

My concern is more the issue described by patch 2. Note that patch 2
doesn't necessarily depend on this one. The tradeoff without patch 1 is
just that we'd explicitly zero and dirty any uptodate new EOF page as
opposed to a page that was already dirty (or writeback).

Truncate does hold iolock/mmaplock, but ISTM that is still not
sufficient because of the same page reclaim issue mentioned above. E.g.,
a truncate down lands on a dirty page over an unwritten block,
iomap_truncate_page() receives the unwritten mapping, page is flushed
and reclaimed (changing block state), iomap_truncate_page() (still using
the unwritten mapping) has nothing to do without a page and thus stale
data is exposed.

ISTM that either the filesystem needs to be more involved with the
stabilization of unwritten mappings in general or truncate page needs to
do something along the lines of block_truncate_page() (which we used
pre-iomap) and just explicitly zero/dirty the new page if the block is
otherwise mapped. Thoughts? Other ideas?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH 1/2] iomap: use page dirty state to seek data over unwritten extents
  2020-10-14 12:59     ` Brian Foster
@ 2020-10-14 22:37       ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2020-10-14 22:37 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs

On Wed, Oct 14, 2020 at 08:59:55AM -0400, Brian Foster wrote:
> On Wed, Oct 14, 2020 at 09:53:44AM +1100, Dave Chinner wrote:
> > On Mon, Oct 12, 2020 at 10:03:49AM -0400, Brian Foster wrote:
> > > iomap seek hole/data currently uses page Uptodate state to track
> > > data over unwritten extents. This is odd and unpredictable in that
> > > the existence of clean pages changes behavior. For example:
> > > 
> > >   $ xfs_io -fc "falloc 0 32k" -c "seek -d 0" \
> > > 	    -c "pread 16k 4k" -c "seek -d 0" /mnt/file
> > >   Whence  Result
> > >   DATA    EOF
> > >   ...
> > >   Whence  Result
> > >   DATA    16384
> > 
> > I don't think there is any way around this, because the page cache
> > lookup done by the seek hole/data code is an
> > unlocked operation and can race with other IO and operations. That
> > is, seek does not take IO serialisation locks at all so
> > read/write/page faults/fallocate/etc all run concurrently with it...
> > 
> > i.e. we get an iomap that is current at the time the iomap_begin()
> > call is made, but we don't hold any locks to stabilise that extent
> > range while we do a page cache traversal looking for cached data.
> > That means any region of the unwritten iomap can change state while
> > we are running the page cache seek.
> > 
> 
> Hm, Ok.. that makes sense..
> 
> > We cannot determine the data contents without major overhead,
> > and if we are seeking over a large unwritten extent covered by clean
> > pages that then gets partially written synchronously by another
> > concurrent write IO then we might trip across clean uptodate pages
> > with real data in them by the time the page cache scan gets to it.
> > 
> > Hence the only thing we are looking at here is whether there is data
> > present in the cache or not. As such, I think assuming that only
> > dirty/writeback pages contain actual user data in a seek data/hole
> > operation is a fundamentally incorrect premise.
> > 
> 
> ... but afaict this kind of thing is already possible because nothing
> stops a subsequently cleaned page (i.e., dirtied and written back) from
> also being dropped from cache before the scan finds it.  IOW, I don't
> really see how this justifies using one page state check over another as
> opposed to pointing out the whole page scanning thing itself seems to be
> racy. Perhaps the reasoning wrt to seek is simply that we should either
> see one state (hole) or the next (data) and we don't terribly care much
> about seek being racy..?

lseek() is inherently racy, and there's nothing we can do to prevent
that. i.e. even if we lock lseek up tight and guarantee the lookup
we do is absolutely correct, something can come in between dropping
the locks and returning that information to userspace that makes it
incorrect.

My point is that if we see a page that has uptodate data in it, then
userspace needs to be told that so it can determine if it is
actual data or just zeroed pages by reading it. If there's no page
in the cache, then seek hole/data simply doesn't know there was ever
data there, and there's nothing a page cache scan can do about that
sort of race. And there's nothing userspace can do about that sort
of race, either, because lseek() is inherently racy.

> My concern is more the issue described by patch 2. Note that patch 2
> doesn't necessarily depend on this one. The tradeoff without patch 1 is
> just that we'd explicitly zero and dirty any uptodate new EOF page as
> opposed to a page that was already dirty (or writeback).

I haven't looked at patch 2 - changes to seek hole/data need to be
correct from a stand alone perspective...

> Truncate does hold iolock/mmaplock, but ISTM that is still not
> sufficient because of the same page reclaim issue mentioned above. E.g.,
> a truncate down lands on a dirty page over an unwritten block,
> iomap_truncate_page() receives the unwritten mapping, page is flushed
> and reclaimed (changing block state), iomap_truncate_page() (still using
> the unwritten mapping) has nothing to do without a page and thus stale
> data is exposed.

So the problem here is only the new partial EOF block after the
truncate down completes? i.e. iomap_zero_range_actor() is failing to
zero the partial page that covers the new EOF if the block is
unwritten?

So, truncate_setsize() zeros the new partial EOF page, but it does
so without dirtying the page at all. Hence if the page is dirty in
memory over an unwritten extent when we run iomap_truncate_page(),
it's ready to be written out with the tail already zeroed in the
page cache. The only problem is that iomap_truncate_page() didn't
return "did_zeroing = true", and hence the followup "did_zeroing"
flush doesn't trigger.

And, FWIW, I suspect that followup flush is invalid for a truncate
down, too, because ip->i_d.di_size > new_isize and both
write_cache_pages and __filemap_fdatawait_range() do nothing if
start > end as per the call in xfs_setattr_size().

> ISTM that either the filesystem needs to be more involved with the
> stabilization of unwritten mappings in general or truncate page needs to
> do something along the lines of block_truncate_page() (which we used
> pre-iomap) and just explicitly zero/dirty the new page if the block is
> otherwise mapped. Thoughts? Other ideas?

I suspect that iomap_truncate_page() needs to attempt to invalidate
the range being truncated first. i.e. call
invalidate_inode_pages2_range() and if it gets EBUSY returned we
know that there is a dirty cached page over the offset we are
truncating, and we now know that we must zero the page range
regardless of whether it sits over an unwritten extent or not.  Then
we just have to make sure the xfs_setattr_size does the right thing
with did_zeroing on truncate down....
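The decision Dave sketches here, invalidate first and treat -EBUSY as "must zero", can be modeled as a small predicate (hypothetical helper, not the eventual kernel implementation):

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_EBUSY 16	/* matches <errno.h> EBUSY; inlined so the model stands alone */

/*
 * iomap_truncate_page() would call invalidate_inode_pages2_range()
 * first. -EBUSY means a dirty page is cached over the range, so the
 * range must be zeroed even if the extent is unwritten; otherwise an
 * unwritten extent already reads back as zeroes and can be skipped.
 */
static bool must_zero(int invalidate_ret, bool unwritten)
{
	if (invalidate_ret == -MODEL_EBUSY)
		return true;	/* dirty cached page over the offset */
	return !unwritten;	/* unwritten with no cached data: skip */
}
```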

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/2] iomap: use page dirty state to seek data over unwritten extents
  2020-10-12 14:03 ` [PATCH 1/2] iomap: use page dirty state to seek data " Brian Foster
  2020-10-13 12:30   ` Brian Foster
  2020-10-13 22:53   ` Dave Chinner
@ 2020-10-15  9:47   ` Christoph Hellwig
  2020-10-19 16:55     ` Brian Foster
  2 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2020-10-15  9:47 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs

I don't think we can solve this properly.  Due to the raciness we can
always err on one side.  The beauty of treating all the uptodate pages
as present data is that we err on the safe side, as applications
expect holes to never have data, while "data" could always be zeroed.


* Re: [PATCH 2/2] iomap: zero cached pages over unwritten extents on zero range
  2020-10-12 14:03 ` [PATCH 2/2] iomap: zero cached pages over unwritten extents on zero range Brian Foster
@ 2020-10-15  9:49   ` Christoph Hellwig
  2020-10-19 16:55     ` Brian Foster
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2020-10-15  9:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs

> +iomap_zero_range_skip_uncached(struct inode *inode, loff_t *pos,
> +		loff_t *count, loff_t *written)
> +{
> +	unsigned dirty_offset, bytes = 0;
> +
> +	dirty_offset = page_cache_seek_hole_data(inode, *pos, *count,
> +				SEEK_DATA);
> +	if (dirty_offset == -ENOENT)
> +		bytes = *count;
> +	else if (dirty_offset > *pos)
> +		bytes = dirty_offset - *pos;
> +
> +	if (bytes) {
> +		*pos += bytes;
> +		*count -= bytes;
> +		*written += bytes;
> +	}

I find the calling conventions weird.  why not return bytes and
keep the increments/decrements of the three variables in the caller?
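The suggested convention can be sketched in a userspace model, where seek_result stands in for what page_cache_seek_hole_data() would return (stub parameter, not the real call):

```c
#include <assert.h>

typedef long long model_loff_t;

#define MODEL_ENOENT 2

/*
 * Alternative calling convention: return the number of bytes to skip
 * and let the caller update pos/count/written itself.
 */
static model_loff_t zero_range_skip_uncached(model_loff_t pos,
					     model_loff_t count,
					     model_loff_t seek_result)
{
	if (seek_result == -MODEL_ENOENT)
		return count;			/* no cached data in range */
	if (seek_result > pos)
		return seek_result - pos;	/* skip up to first dirty page */
	return 0;				/* data starts right at pos */
}
```

The caller would then do `pos += bytes; count -= bytes; written += bytes;` in one place, which keeps the helper a pure computation.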


* Re: [PATCH 1/2] iomap: use page dirty state to seek data over unwritten extents
  2020-10-15  9:47   ` Christoph Hellwig
@ 2020-10-19 16:55     ` Brian Foster
  0 siblings, 0 replies; 13+ messages in thread
From: Brian Foster @ 2020-10-19 16:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs

On Thu, Oct 15, 2020 at 10:47:00AM +0100, Christoph Hellwig wrote:
> I don't think we can solve this properly.  Due to the raciness we can
> always err on one side.  The beauty of treating all the uptodate pages
> as present data is that we err on the safe side, as applications
> expect holes to never have data, while "data" could always be zeroed.
> 

I don't think that's quite accurate. Nothing prevents a dirty page from
being written back and reclaimed between acquiring the (unwritten)
mapping and doing the pagecache scan, so it's possible to present valid
data (written to the kernel prior to a seek) as a hole with the current
code.

Brian



* Re: [PATCH 2/2] iomap: zero cached pages over unwritten extents on zero range
  2020-10-15  9:49   ` Christoph Hellwig
@ 2020-10-19 16:55     ` Brian Foster
  2020-10-19 18:01       ` Brian Foster
  0 siblings, 1 reply; 13+ messages in thread
From: Brian Foster @ 2020-10-19 16:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs

On Thu, Oct 15, 2020 at 10:49:01AM +0100, Christoph Hellwig wrote:
> > +iomap_zero_range_skip_uncached(struct inode *inode, loff_t *pos,
> > +		loff_t *count, loff_t *written)
> > +{
> > +	unsigned dirty_offset, bytes = 0;
> > +
> > +	dirty_offset = page_cache_seek_hole_data(inode, *pos, *count,
> > +				SEEK_DATA);
> > +	if (dirty_offset == -ENOENT)
> > +		bytes = *count;
> > +	else if (dirty_offset > *pos)
> > +		bytes = dirty_offset - *pos;
> > +
> > +	if (bytes) {
> > +		*pos += bytes;
> > +		*count -= bytes;
> > +		*written += bytes;
> > +	}
> 
> I find the calling conventions weird.  why not return bytes and
> keep the increments/decrements of the three variables in the caller?
> 

No particular reason. IIRC I had it both ways and just landed on this.
I'd change it, but as mentioned in the patch 1 thread I don't think this
patch is sufficient (with or without patch 1) anyways because the page
can also have been reclaimed before we get here.

Brian



* Re: [PATCH 2/2] iomap: zero cached pages over unwritten extents on zero range
  2020-10-19 16:55     ` Brian Foster
@ 2020-10-19 18:01       ` Brian Foster
  2020-10-20 16:21         ` Brian Foster
  0 siblings, 1 reply; 13+ messages in thread
From: Brian Foster @ 2020-10-19 18:01 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs

On Mon, Oct 19, 2020 at 12:55:19PM -0400, Brian Foster wrote:
> On Thu, Oct 15, 2020 at 10:49:01AM +0100, Christoph Hellwig wrote:
> > > +iomap_zero_range_skip_uncached(struct inode *inode, loff_t *pos,
> > > +		loff_t *count, loff_t *written)
> > > +{
> > > +	unsigned dirty_offset, bytes = 0;
> > > +
> > > +	dirty_offset = page_cache_seek_hole_data(inode, *pos, *count,
> > > +				SEEK_DATA);
> > > +	if (dirty_offset == -ENOENT)
> > > +		bytes = *count;
> > > +	else if (dirty_offset > *pos)
> > > +		bytes = dirty_offset - *pos;
> > > +
> > > +	if (bytes) {
> > > +		*pos += bytes;
> > > +		*count -= bytes;
> > > +		*written += bytes;
> > > +	}
> > 
> > I find the calling conventions weird.  why not return bytes and
> > keep the increments/decrements of the three variables in the caller?
> > 
> 
> No particular reason. IIRC I had it both ways and just landed on this.
> I'd change it, but as mentioned in the patch 1 thread I don't think this
> patch is sufficient (with or without patch 1) anyways because the page
> can also have been reclaimed before we get here.
> 

Christoph,

What do you think about introducing behavior specific to
iomap_truncate_page() to unconditionally write zeroes over unwritten
extents? AFAICT that addresses the race and was historical XFS behavior
(via block_truncate_page()) before iomap, so is not without precedent.
What I'd probably do is bury the caller's did_zero parameter into a new
internal struct iomap_zero_data to pass down into
iomap_zero_range_actor(), then extend that structure with a
'zero_unwritten' field such that iomap_zero_range_actor() can do this:

        if (srcmap->type == IOMAP_HOLE ||
            (srcmap->type == IOMAP_UNWRITTEN && !zdata->zero_unwritten))
                return count;

iomap_truncate_page() would set that flag either via open coding
iomap_zero_range() or creating a new internal wrapper. Hm?
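The proposal can be sketched as a toy decision function (struct layout and names are from the proposal above, mocked up for userspace, not the eventual implementation):

```c
#include <assert.h>
#include <stdbool.h>

enum iomap_type { IOMAP_HOLE, IOMAP_UNWRITTEN, IOMAP_MAPPED };

/* proposed internal control structure passed down to the actor */
struct iomap_zero_data {
	bool *did_zero;
	bool zero_unwritten;	/* set only by iomap_truncate_page() */
};

/* would iomap_zero_range_actor() skip this extent without zeroing? */
static bool zero_actor_skips(enum iomap_type type,
			     const struct iomap_zero_data *zdata)
{
	return type == IOMAP_HOLE ||
	       (type == IOMAP_UNWRITTEN && !zdata->zero_unwritten);
}
```

Plain iomap_zero_range() callers keep the current skip-unwritten behavior, while iomap_truncate_page() sets zero_unwritten and forces explicit zeroing.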

Brian

> Brian
> 



* Re: [PATCH 2/2] iomap: zero cached pages over unwritten extents on zero range
  2020-10-19 18:01       ` Brian Foster
@ 2020-10-20 16:21         ` Brian Foster
  0 siblings, 0 replies; 13+ messages in thread
From: Brian Foster @ 2020-10-20 16:21 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs

On Mon, Oct 19, 2020 at 02:01:44PM -0400, Brian Foster wrote:
> On Mon, Oct 19, 2020 at 12:55:19PM -0400, Brian Foster wrote:
> > On Thu, Oct 15, 2020 at 10:49:01AM +0100, Christoph Hellwig wrote:
> > > > +iomap_zero_range_skip_uncached(struct inode *inode, loff_t *pos,
> > > > +		loff_t *count, loff_t *written)
> > > > +{
> > > > +	unsigned dirty_offset, bytes = 0;
> > > > +
> > > > +	dirty_offset = page_cache_seek_hole_data(inode, *pos, *count,
> > > > +				SEEK_DATA);
> > > > +	if (dirty_offset == -ENOENT)
> > > > +		bytes = *count;
> > > > +	else if (dirty_offset > *pos)
> > > > +		bytes = dirty_offset - *pos;
> > > > +
> > > > +	if (bytes) {
> > > > +		*pos += bytes;
> > > > +		*count -= bytes;
> > > > +		*written += bytes;
> > > > +	}
> > > 
> > > I find the calling conventions weird.  why not return bytes and
> > > keep the increments/decrements of the three variables in the caller?
> > > 
> > 
> > No particular reason. IIRC I had it both ways and just landed on this.
> > I'd change it, but as mentioned in the patch 1 thread I don't think this
> > patch is sufficient (with or without patch 1) anyways because the page
> > can also have been reclaimed before we get here.
> > 
> 
> Christoph,
> 
> What do you think about introducing behavior specific to
> iomap_truncate_page() to unconditionally write zeroes over unwritten
> extents? AFAICT that addresses the race and was historical XFS behavior
> (via block_truncate_page()) before iomap, so is not without precedent.
> What I'd probably do is bury the caller's did_zero parameter into a new
> internal struct iomap_zero_data to pass down into
> iomap_zero_range_actor(), then extend that structure with a
> 'zero_unwritten' field such that iomap_zero_range_actor() can do this:
> 

Ugh, so the above doesn't quite describe historical behavior.
block_truncate_page() converts an unwritten block if a page exists
(dirty or not), but bails out if a page doesn't exist. We could still do
the above, but for something smarter I think we need to check for a page
before we get the mapping, so we know whether we can safely skip an
unwritten block or need to write over it. If we instead check for a page
within the actor, we have no way of knowing whether a (possibly dirty)
page had been written back and/or reclaimed since ->iomap_begin(). If we
check for the page first, I think the iolock/mmaplock held in the
truncate path ensures that a page can't be added before we complete. We
could take that further and check for a dirty || writeback page, but
that's probably safer as a separate patch. See the (compile tested only)
diff below for an idea of what I was thinking.

Brian

--- 8< ---

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index bcfc288dba3f..2cdfcff02307 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1000,17 +1000,56 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 }
 EXPORT_SYMBOL_GPL(iomap_zero_range);
 
+struct iomap_trunc_priv {
+	bool *did_zero;
+	bool has_page;
+};
+
+static loff_t
+iomap_truncate_page_actor(struct inode *inode, loff_t pos, loff_t count,
+		void *data, struct iomap *iomap, struct iomap *srcmap)
+{
+	struct iomap_trunc_priv	*priv = data;
+	unsigned offset;
+	int status;
+
+	if (srcmap->type == IOMAP_HOLE)
+		return count;
+	if (srcmap->type == IOMAP_UNWRITTEN && !priv->has_page)
+		return count;
+
+	offset = offset_in_page(pos);
+	if (IS_DAX(inode))
+		status = dax_iomap_zero(pos, offset, count, iomap);
+	else
+		status = iomap_zero(inode, pos, offset, count, iomap, srcmap);
+	if (status < 0)
+		return status;
+
+	if (priv->did_zero)
+		*priv->did_zero = true;
+	return count;
+}
+
 int
 iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 		const struct iomap_ops *ops)
 {
+	struct iomap_trunc_priv priv = { .did_zero = did_zero };
 	unsigned int blocksize = i_blocksize(inode);
 	unsigned int off = pos & (blocksize - 1);
+	loff_t ret;
 
 	/* Block boundary? Nothing to do */
 	if (!off)
 		return 0;
-	return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops);
+
+	priv.has_page = filemap_range_has_page(inode->i_mapping, pos, pos);
+	ret = iomap_apply(inode, pos, blocksize - off, IOMAP_ZERO, ops, &priv,
+			  iomap_truncate_page_actor);
+	if (ret <= 0)
+		return ret;
+	return 0;
 }
 EXPORT_SYMBOL_GPL(iomap_truncate_page);
 



Thread overview: 13+ messages
2020-10-12 14:03 [PATCH 0/2] iomap: zero dirty pages over unwritten extents Brian Foster
2020-10-12 14:03 ` [PATCH 1/2] iomap: use page dirty state to seek data " Brian Foster
2020-10-13 12:30   ` Brian Foster
2020-10-13 22:53   ` Dave Chinner
2020-10-14 12:59     ` Brian Foster
2020-10-14 22:37       ` Dave Chinner
2020-10-15  9:47   ` Christoph Hellwig
2020-10-19 16:55     ` Brian Foster
2020-10-12 14:03 ` [PATCH 2/2] iomap: zero cached pages over unwritten extents on zero range Brian Foster
2020-10-15  9:49   ` Christoph Hellwig
2020-10-19 16:55     ` Brian Foster
2020-10-19 18:01       ` Brian Foster
2020-10-20 16:21         ` Brian Foster
