* Splitting a THP beyond EOF
@ 2020-10-20  1:43 Matthew Wilcox
  2020-10-20  4:59 ` Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Matthew Wilcox @ 2020-10-20  1:43 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-xfs

This is a weird one ... which is good because it means the obvious
ones have been fixed and now I'm just tripping over the weird cases.
And fortunately, xfstests exercises the weird cases.

1. The file is 0x3d000 bytes long.
2. A readahead allocates an order-2 THP for 0x3c000-0x3ffff
3. We simulate a read error for 0x3c000-0x3cfff
4. Userspace writes to 0x3d697 through 0x3dfaa
5. iomap_write_begin() gets the 0x3c page, sees it's THP and !Uptodate
   so it calls iomap_split_page() (passing page 0x3d)
6. iomap_split_page() calls split_huge_page()
7. split_huge_page() sees that page 0x3d is beyond EOF, so it removes it
   from i_pages
8. iomap_write_actor() copies the data into page 0x3d
9. The write is lost.

Trying to persuade XFS to update i_size before calling
iomap_file_buffered_write() seems like a bad idea.

Changing split_huge_page() to disregard i_size() is something I kind
of want to be able to do long-term in order to make hole-punch more
efficient, but that seems like a lot of work right now.

I think the easiest way to fix this is to decline to allocate readahead
pages beyond EOF.  That is, if we have a file which is, say, 61 pages
long, read the last 5 pages into an order-2 THP and an order-0 THP
instead of allocating an order-3 THP and zeroing the last three pages.

It's probably the right thing to do anyway -- we split THPs that overlap
the EOF on a truncate.  I'll start implementing this in the morning,
but I thought I'd share the problem & proposed solution in case anybody
has a better idea.



* Re: Splitting a THP beyond EOF
  2020-10-20  1:43 Splitting a THP beyond EOF Matthew Wilcox
@ 2020-10-20  4:59 ` Dave Chinner
  2020-10-20 11:21   ` Matthew Wilcox
  2020-10-20 14:26 ` Matthew Wilcox
  2020-10-20 14:32 ` Chris Mason
  2 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2020-10-20  4:59 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, linux-xfs

On Tue, Oct 20, 2020 at 02:43:57AM +0100, Matthew Wilcox wrote:
> This is a weird one ... which is good because it means the obvious
> ones have been fixed and now I'm just tripping over the weird cases.
> And fortunately, xfstests exercises the weird cases.
> 
> 1. The file is 0x3d000 bytes long.
> 2. A readahead allocates an order-2 THP for 0x3c000-0x3ffff
> 3. We simulate a read error for 0x3c000-0x3cfff
> 4. Userspace writes to 0x3d697 through 0x3dfaa

So this is a write() beyond EOF, yes?

If yes, then we first go through this path:

	xfs_file_buffered_aio_write()
	  xfs_file_aio_write_checks()
	    iomap_zero_range(isize, pos - isize)

To zero the region between the current EOF and where the new write
starts. i.e. from 0x3d000 to 0x3d696.

> 5. iomap_write_begin() gets the 0x3c page, sees it's THP and !Uptodate
>    so it calls iomap_split_page() (passing page 0x3d)

Splitting the page because it's !Uptodate seems rather drastic to
me.  Why does it need to split the page here?

Also, this concerns me: if we are exposing the cached EOF page via
mmap, it needs to contain only zeroes in the region beyond EOF so
that we don't expose stale data to userspace. Hence when a THP that
contains EOF is instantiated, we have to ensure that the region
beyond EOF is completely zeroed. It then follows that if we read all
the data in that THP up to EOF, then the page is actually up to
date...

And hence, I don't see why we'd need to split it just because we
got a small, unaligned write beyond EOF....

> 6. iomap_split_page() calls split_huge_page()
> 7. split_huge_page() sees that page 0x3d is beyond EOF, so it removes it
>    from i_pages
> 8. iomap_write_actor() copies the data into page 0x3d
> 9. The write is lost.
> 
> Trying to persuade XFS to update i_size before calling
> iomap_file_buffered_write() seems like a bad idea.

Agreed, I can't see anything good coming from doing that...

> Changing split_huge_page() to disregard i_size() is something I kind
> of want to be able to do long-term in order to make hole-punch more
> efficient, but that seems like a lot of work right now.
> 
> I think the easiest way to fix this is to decline to allocate readahead
> pages beyond EOF.  That is, if we have a file which is, say, 61 pages
> long, read the last 5 pages into an order-2 THP and an order-0 THP
> instead of allocating an order-3 THP and zeroing the last three pages.

I think you need to consider zeroing the THP range beyond EOF when
doing readahead....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: Splitting a THP beyond EOF
  2020-10-20  4:59 ` Dave Chinner
@ 2020-10-20 11:21   ` Matthew Wilcox
  2020-10-20 21:16     ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2020-10-20 11:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-xfs

On Tue, Oct 20, 2020 at 03:59:28PM +1100, Dave Chinner wrote:
> On Tue, Oct 20, 2020 at 02:43:57AM +0100, Matthew Wilcox wrote:
> > This is a weird one ... which is good because it means the obvious
> > ones have been fixed and now I'm just tripping over the weird cases.
> > And fortunately, xfstests exercises the weird cases.
> > 
> > 1. The file is 0x3d000 bytes long.
> > 2. A readahead allocates an order-2 THP for 0x3c000-0x3ffff
> > 3. We simulate a read error for 0x3c000-0x3cfff
> > 4. Userspace writes to 0x3d697 through 0x3dfaa
> 
> So this is a write() beyond EOF, yes?
> 
> If yes, then we first go through this path:
> 
> 	xfs_file_buffered_aio_write()
> 	  xfs_file_aio_write_checks()
> 	    iomap_zero_range(isize, pos - isize)
> 
> To zero the region between the current EOF and where the new write
> starts. i.e. from 0x3d000 to 0x3d696.

Yes.  That calls iomap_write_begin() which calls iomap_split_page()
which is where we run into trouble.  I elided the exact path from the
description of the problem.

> > 5. iomap_write_begin() gets the 0x3c page, sees it's THP and !Uptodate
> >    so it calls iomap_split_page() (passing page 0x3d)
> 
> Splitting the page because it's !Uptodate seems rather drastic to
> me.  Why does it need to split the page here?

Because we can't handle Dirty, !Uptodate THPs in the truncate path.
Previous discussion:
https://lore.kernel.org/linux-mm/20200821144021.GV17456@casper.infradead.org/

The current assumption is that a !Uptodate THP is due to a read error,
and so the sensible thing to do is split it and handle read errors at
a single-page level.

I've been playing around with creating THPs in the write path, and that
offers a different pathway to creating Dirty, !Uptodate THPs, so this
may also change at some point.  I'd like to get what I have merged and
then figure out how to make this better.

> Also, this concerns me: if we are exposing the cached EOF page via
> mmap, it needs to contain only zeroes in the region beyond EOF so
> that we don't expose stale data to userspace. Hence when a THP that
> contains EOF is instantiated, we have to ensure that the region
> beyond EOF is completely zeroed. It then follows that if we read all
> the data in that THP up to EOF, then the page is actually up to
> date...

We do that in iomap_readpage_actor().  Had the readahead I/O not "failed",
we'd've had an Uptodate THP which straddled EOF.  I dumped the page and
its uptodate bits are 0xfff0 (1kB block size filesystem, the three pages
0x3d-0x3f are uptodate because they were zeroed and 0x3c is !uptodate
because the I/O failed).




* Re: Splitting a THP beyond EOF
  2020-10-20  1:43 Splitting a THP beyond EOF Matthew Wilcox
  2020-10-20  4:59 ` Dave Chinner
@ 2020-10-20 14:26 ` Matthew Wilcox
  2020-10-20 14:32 ` Chris Mason
  2 siblings, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2020-10-20 14:26 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-xfs

On Tue, Oct 20, 2020 at 02:43:57AM +0100, Matthew Wilcox wrote:
> I think the easiest way to fix this is to decline to allocate readahead
> pages beyond EOF.  That is, if we have a file which is, say, 61 pages
> long, read the last 5 pages into an order-2 THP and an order-0 THP
> instead of allocating an order-3 THP and zeroing the last three pages.

Oh yeah, really easy.

+++ b/mm/readahead.c
@@ -481,6 +481,11 @@ void page_cache_ra_order(struct readahead_control *ractl,
                        if (order == 1)
                                order = 0;
                }
+               /* Don't allocate pages past EOF */
+               while (index + (1UL << order) - 1 > limit) {
+                       if (--order == 1)
+                               order = 0;
+               }
                err = ra_alloc_page(ractl, index, mark, order, gfp);
                if (err)
                        break;

I've added that to an earlier patch and I've pushed out commit
cd3fa4bc6516 as the head of
http://git.infradead.org/users/willy/pagecache.git/shortlog



* Re: Splitting a THP beyond EOF
  2020-10-20  1:43 Splitting a THP beyond EOF Matthew Wilcox
  2020-10-20  4:59 ` Dave Chinner
  2020-10-20 14:26 ` Matthew Wilcox
@ 2020-10-20 14:32 ` Chris Mason
  2020-10-21  0:23   ` Matthew Wilcox
  2 siblings, 1 reply; 12+ messages in thread
From: Chris Mason @ 2020-10-20 14:32 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, linux-xfs



On 19 Oct 2020, at 21:43, Matthew Wilcox wrote:

> This is a weird one ... which is good because it means the obvious
> ones have been fixed and now I'm just tripping over the weird cases.
> And fortunately, xfstests exercises the weird cases.
>
> 1. The file is 0x3d000 bytes long.
> 2. A readahead allocates an order-2 THP for 0x3c000-0x3ffff
> 3. We simulate a read error for 0x3c000-0x3cfff
> 4. Userspace writes to 0x3d697 through 0x3dfaa
> 5. iomap_write_begin() gets the 0x3c page, sees it's THP and !Uptodate
>    so it calls iomap_split_page() (passing page 0x3d)
> 6. iomap_split_page() calls split_huge_page()
> 7. split_huge_page() sees that page 0x3d is beyond EOF, so it removes it
>    from i_pages
> 8. iomap_write_actor() copies the data into page 0x3d

I’m guessing that iomap_write_begin() is still in charge of locking 
the pages, and that iomap_split_page()->split_huge_page() is just 
reusing that lock?

It sounds like you’re missing a flag to iomap_split_page() that says: 
I care about range A->B, even if it's beyond EOF.  IOW, 
iomap_write_begin()’s path should be in charge of doing the right 
thing for the write, without relying on the rest of the kernel to avoid 
upsetting it.

> 9. The write is lost.
>
> Trying to persuade XFS to update i_size before calling
> iomap_file_buffered_write() seems like a bad idea.
>
> Changing split_huge_page() to disregard i_size() is something I kind
> of want to be able to do long-term in order to make hole-punch more
> efficient, but that seems like a lot of work right now.
>

The problem with trusting i_size is that it changes at surprising times.
For this code inside split_huge_page(), end == i_size_read():

         for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
                 __split_huge_page_tail(head, i, lruvec, list);
                 /* Some pages can be beyond i_size: drop them from page cache */
                 if (head[i].index >= end) {
                         ClearPageDirty(head + i);

But, we actually change i_size after dropping all the page locks.  In 
xfs this is xfs_setattr_size()->truncate_setsize(), all of which means 
that dropping PageDirty seems unwise if this code is running 
concurrently with an expanding truncate.  If i_size jumps past the page 
where you’re clearing dirty, it probably won’t be good.  Ignore me 
if this is already handled differently, it just seems error prone in 
current Linus.

-chris



* Re: Splitting a THP beyond EOF
  2020-10-20 11:21   ` Matthew Wilcox
@ 2020-10-20 21:16     ` Dave Chinner
  2020-10-20 22:53       ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2020-10-20 21:16 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, linux-xfs

On Tue, Oct 20, 2020 at 12:21:38PM +0100, Matthew Wilcox wrote:
> On Tue, Oct 20, 2020 at 03:59:28PM +1100, Dave Chinner wrote:
> > On Tue, Oct 20, 2020 at 02:43:57AM +0100, Matthew Wilcox wrote:
> > > This is a weird one ... which is good because it means the obvious
> > > ones have been fixed and now I'm just tripping over the weird cases.
> > > And fortunately, xfstests exercises the weird cases.
> > > 
> > > 1. The file is 0x3d000 bytes long.
> > > 2. A readahead allocates an order-2 THP for 0x3c000-0x3ffff
> > > 3. We simulate a read error for 0x3c000-0x3cfff
> > > 4. Userspace writes to 0x3d697 through 0x3dfaa
> > 
> > So this is a write() beyond EOF, yes?
> > 
> > If yes, then we first go through this path:
> > 
> > 	xfs_file_buffered_aio_write()
> > 	  xfs_file_aio_write_checks()
> > 	    iomap_zero_range(isize, pos - isize)
> > 
> > To zero the region between the current EOF and where the new write
> > starts. i.e. from 0x3d000 to 0x3d696.
> 
> Yes.  That calls iomap_write_begin() which calls iomap_split_page()
> which is where we run into trouble.  I elided the exact path from the
> description of the problem.
> 
> > > 5. iomap_write_begin() gets the 0x3c page, sees it's THP and !Uptodate
> > >    so it calls iomap_split_page() (passing page 0x3d)
> > 
> > Splitting the page because it's !Uptodate seems rather drastic to
> > me.  Why does it need to split the page here?
> 
> Because we can't handle Dirty, !Uptodate THPs in the truncate path.
> Previous discussion:
> https://lore.kernel.org/linux-mm/20200821144021.GV17456@casper.infradead.org/

Maybe I'm just dense, but this doesn't explain the reason for
needing to split THPs during partial THP invalidation, nor the
reason why we need to split THPs when the write path sees a
partially up to date THP. iomap is supposed to be tracking the
sub-page regions that are not up to date, so why would we ever need
to split the page to get sub-page regions into the correct state?

i.e. why do THPs need to be handled any differently to a block size
< page size situation with a normal PAGE_SIZE page?

FWIW, didn't you change the dirty tracking to be done sub-page and
held in the iomap_page? If so, releasing the iomap_page on a partial
page invalidation looks ... problematic. i.e. not only are you
throwing away the per-block up-to-date state on a THP, you're also
throwing away the per-block dirty state.

i.e. we still need that per-block state to be maintained once the
THP has been split - the split pages should be set to the correct
state held on the THP as the split progresses. IOWs, I suspect that
split_huge_page() needs to call into iomap to determine the actual
state of each individual sub-page from the iomap_page state attached
to the huge page.

> The current assumption is that a !Uptodate THP is due to a read error,
> and so the sensible thing to do is split it and handle read errors at
> a single-page level.

Why? Apart from the range of the file covered by the page, how is
handling a read error at a single page level any different from
handling it at a THP level?

Alternatively, if there's a read error on THP-based readahead, then
why isn't the entire THP tossed away when a subsequent read sees
PageError() so it can then be re-read synchronously into the cache
using single pages?

> I've been playing around with creating THPs in the write path, and that
> offers a different pathway to creating Dirty, !Uptodate THPs, so this
> may also change at some point.  I'd like to get what I have merged and
> then figure out how to make this better.

Sure. However, again, splitting THPs in this situation makes no
sense to me. I can't see why we don't just treat them like a normal
page and the sub-page !uptodate range filling algorithms just do
their normal work on them...

> > Also, this concerns me: if we are exposing the cached EOF page via
> > mmap, it needs to contain only zeroes in the region beyond EOF so
> > that we don't expose stale data to userspace. Hence when a THP that
> > contains EOF is instantiated, we have to ensure that the region
> > > beyond EOF is completely zeroed. It then follows that if we read all
> > the data in that THP up to EOF, then the page is actually up to
> > date...
> 
> We do that in iomap_readpage_actor().  Had the readahead I/O not "failed",
> we'd've had an Uptodate THP which straddled EOF.

If the IO error in a THP is the trigger for bad things here, then
surely the correct thing to do is trash the THP on IO error, not
leave a landmine that every path has to handle specially...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: Splitting a THP beyond EOF
  2020-10-20 21:16     ` Dave Chinner
@ 2020-10-20 22:53       ` Matthew Wilcox
  2020-10-21 22:14         ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2020-10-20 22:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-xfs

On Wed, Oct 21, 2020 at 08:16:34AM +1100, Dave Chinner wrote:
> On Tue, Oct 20, 2020 at 12:21:38PM +0100, Matthew Wilcox wrote:
> > On Tue, Oct 20, 2020 at 03:59:28PM +1100, Dave Chinner wrote:
> > > On Tue, Oct 20, 2020 at 02:43:57AM +0100, Matthew Wilcox wrote:
> > > > This is a weird one ... which is good because it means the obvious
> > > > ones have been fixed and now I'm just tripping over the weird cases.
> > > > And fortunately, xfstests exercises the weird cases.
> > > > 
> > > > 1. The file is 0x3d000 bytes long.
> > > > 2. A readahead allocates an order-2 THP for 0x3c000-0x3ffff
> > > > 3. We simulate a read error for 0x3c000-0x3cfff
> > > > 4. Userspace writes to 0x3d697 through 0x3dfaa
> > > 
> > > So this is a write() beyond EOF, yes?
> > > 
> > > If yes, then we first go through this path:
> > > 
> > > 	xfs_file_buffered_aio_write()
> > > 	  xfs_file_aio_write_checks()
> > > 	    iomap_zero_range(isize, pos - isize)
> > > 
> > > To zero the region between the current EOF and where the new write
> > > starts. i.e. from 0x3d000 to 0x3d696.
> > 
> > Yes.  That calls iomap_write_begin() which calls iomap_split_page()
> > which is where we run into trouble.  I elided the exact path from the
> > description of the problem.
> > 
> > > > 5. iomap_write_begin() gets the 0x3c page, sees it's THP and !Uptodate
> > > >    so it calls iomap_split_page() (passing page 0x3d)
> > > 
> > > Splitting the page because it's !Uptodate seems rather drastic to
> > > me.  Why does it need to split the page here?
> > 
> > Because we can't handle Dirty, !Uptodate THPs in the truncate path.
> > Previous discussion:
> > https://lore.kernel.org/linux-mm/20200821144021.GV17456@casper.infradead.org/
> 
> Maybe I'm just dense, but this doesn't explain the reason for
> needing to split THPs during partial THP invalidation, nor the
> reason why we need to split THPs when the write path sees a
> partially up to date THP. iomap is supposed to be tracking the
> sub-page regions that are not up to date, so why would we ever need
> to split the page to get sub-page regions into the correct state?

True, we don't _have to_ split THP on holepunch/truncation/... but it's
a better implementation to free pages which cover blocks that no longer
have data associated with them.

> FWIW, didn't you change the dirty tracking to be done sub-page and
> held in the iomap_page? If so, releasing the iomap_page on a partial
> page invalidation looks ... problematic. i.e. not only are you
> throwing away the per-block up-to-date state on a THP, you're also
> throwing away the per-block dirty state.

That wasn't my patch.  Also, discarding the iomap_page causes the entire
page to be treated as a single unit.  So if we lose per-block state,
and the page is marked as dirty, then each subpage (that remains after
the holepunch) will be treated as dirty.

> i.e. we still need that per-block state to be maintained once the
> THP has been split - the split pages should be set to the correct
> state held on the THP as the split progresses. IOWs, I suspect that
> split_huge_page() needs to call into iomap to determine the actual
> state of each individual sub-page from the iomap_page state attached
> to the huge page.

That would be ideal, but we only have the ability to do that for uptodate
sub-page state right now.  It's all a bit messy, and it's on my list
of things to improve at some point.  But there's no bug that I know of
in this.

> > The current assumption is that a !Uptodate THP is due to a read error,
> > and so the sensible thing to do is split it and handle read errors at
> > a single-page level.
> 
> Why? Apart from the range of the file covered by the page, how is
> handling a read error at a single page level any different from
> handling it at a THP level?
> 
> Alternatively, if there's a read error on THP-based readahead, then
> why isn't the entire THP tossed away when a subsequent read sees
> PageError() so it can then be re-read synchronously into the cache
> using single pages?

Splitting the page instead of throwing it away makes sense once we can
transfer the Uptodate bits to each subpage.  If we don't have that,
it doesn't really matter which we do.

> > We do that in iomap_readpage_actor().  Had the readahead I/O not "failed",
> > we'd've had an Uptodate THP which straddled EOF.
> 
> If the IO error in a THP is the trigger for bad things here, then
> surely the correct thing to do is trash the THP on IO error, not
> leave a landmine that every path has to handle specially...

Can't split a page on I/O error -- split_huge_page() has to be called
from process context and we get notified about the error in BH context.




* Re: Splitting a THP beyond EOF
  2020-10-20 14:32 ` Chris Mason
@ 2020-10-21  0:23   ` Matthew Wilcox
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2020-10-21  0:23 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-mm, linux-fsdevel, linux-xfs

On Tue, Oct 20, 2020 at 10:32:59AM -0400, Chris Mason wrote:
> On 19 Oct 2020, at 21:43, Matthew Wilcox wrote:
> > This is a weird one ... which is good because it means the obvious
> > ones have been fixed and now I'm just tripping over the weird cases.
> > And fortunately, xfstests exercises the weird cases.
> > 
> > 1. The file is 0x3d000 bytes long.
> > 2. A readahead allocates an order-2 THP for 0x3c000-0x3ffff
> > 3. We simulate a read error for 0x3c000-0x3cfff
> > 4. Userspace writes to 0x3d697 through 0x3dfaa
> > 5. iomap_write_begin() gets the 0x3c page, sees it's THP and !Uptodate
> >    so it calls iomap_split_page() (passing page 0x3d)
> > 6. iomap_split_page() calls split_huge_page()
> > 7. split_huge_page() sees that page 0x3d is beyond EOF, so it removes it
> >    from i_pages
> > 8. iomap_write_actor() copies the data into page 0x3d
> 
> I’m guessing that iomap_write_begin() is still in charge of locking the
> pages, and that iomap_split_page()->split_huge_page() is just reusing that
> lock?

That's right -- iomap_write_begin() calls grab_cache_page_write_begin()
which acquires the page lock.

> It sounds like you’re missing a flag to iomap_split_page() that says: I care
> about range A->B, even if it's beyond EOF.  IOW, iomap_write_begin()’s path
> should be in charge of doing the right thing for the write, without relying
> on the rest of the kernel to avoid upsetting it.

Yeah, the problem is that split_huge_page() doesn't have that
functionality.  I'd like to add it, but Kirill's not particularly keen.
I'm also looking for a quick fix more than an intrusive change like
that ...  fortunately, I found one.  And it's even something that was
on my long-term todo list; I don't think we should be allocating THPs
to cache beyond the end of the file.  I mean, I could see the point in
allocating a 2MB THP to cache a 1.9MB file tail, but allocating a 64kB
page to cache a 3kB file tail is definitely wrong.

> > Changing split_huge_page() to disregard i_size() is something I kind
> > of want to be able to do long-term in order to make hole-punch more
> > efficient, but that seems like a lot of work right now.
> > 
> 
> The problem with trusting i_size is that it changes at surprising times.
> For this code inside split_huge_page(), end == i_size_read()
> 
>         for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
>                 __split_huge_page_tail(head, i, lruvec, list);
>                 /* Some pages can be beyond i_size: drop them from page cache */
>                 if (head[i].index >= end) {
>                         ClearPageDirty(head + i);
> 
> But, we actually change i_size after dropping all the page locks.  In xfs
> this is xfs_setattr_size()->truncate_setsize(), all of which means that
> dropping PageDirty seems unwise if this code is running concurrently with an
> expanding truncate.  If i_size jumps past the page where you’re clearing
> dirty, it probably won’t be good.  Ignore me if this is already handled
> differently, it just seems error prone in current Linus.

Oh, but the next line is __delete_from_page_cache().  So a concurrent
expanding truncate will never find this page, it's about to go back to
the page allocator.



* Re: Splitting a THP beyond EOF
  2020-10-20 22:53       ` Matthew Wilcox
@ 2020-10-21 22:14         ` Dave Chinner
  2020-10-21 23:04           ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2020-10-21 22:14 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, linux-xfs

On Tue, Oct 20, 2020 at 11:53:31PM +0100, Matthew Wilcox wrote:
> On Wed, Oct 21, 2020 at 08:16:34AM +1100, Dave Chinner wrote:
> > On Tue, Oct 20, 2020 at 12:21:38PM +0100, Matthew Wilcox wrote:
> > > On Tue, Oct 20, 2020 at 03:59:28PM +1100, Dave Chinner wrote:
> > > > On Tue, Oct 20, 2020 at 02:43:57AM +0100, Matthew Wilcox wrote:
> > > > > This is a weird one ... which is good because it means the obvious
> > > > > ones have been fixed and now I'm just tripping over the weird cases.
> > > > > And fortunately, xfstests exercises the weird cases.
> > > > > 
> > > > > 1. The file is 0x3d000 bytes long.
> > > > > 2. A readahead allocates an order-2 THP for 0x3c000-0x3ffff
> > > > > 3. We simulate a read error for 0x3c000-0x3cfff
> > > > > 4. Userspace writes to 0x3d697 through 0x3dfaa
> > > > 
> > > > So this is a write() beyond EOF, yes?
> > > > 
> > > > If yes, then we first go through this path:
> > > > 
> > > > 	xfs_file_buffered_aio_write()
> > > > 	  xfs_file_aio_write_checks()
> > > > 	    iomap_zero_range(isize, pos - isize)
> > > > 
> > > > To zero the region between the current EOF and where the new write
> > > > starts. i.e. from 0x3d000 to 0x3d696.
> > > 
> > > Yes.  That calls iomap_write_begin() which calls iomap_split_page()
> > > which is where we run into trouble.  I elided the exact path from the
> > > description of the problem.
> > > 
> > > > > 5. iomap_write_begin() gets the 0x3c page, sees it's THP and !Uptodate
> > > > >    so it calls iomap_split_page() (passing page 0x3d)
> > > > 
> > > > Splitting the page because it's !Uptodate seems rather drastic to
> > > > me.  Why does it need to split the page here?
> > > 
> > > Because we can't handle Dirty, !Uptodate THPs in the truncate path.
> > > Previous discussion:
> > > https://lore.kernel.org/linux-mm/20200821144021.GV17456@casper.infradead.org/
> > 
> > Maybe I'm just dense, but this doesn't explain the reason for
> > needing to split THPs during partial THP invalidation, nor the
> > reason why we need to split THPs when the write path sees a
> > partially up to date THP. iomap is supposed to be tracking the
> > sub-page regions that are not up to date, so why would we ever need
> > to split the page to get sub-page regions into the correct state?
> 
> True, we don't _have to_ split THP on holepunch/truncation/... but it's
> a better implementation to free pages which cover blocks that no longer
> have data associated with them.

"Better" is a very subjective measure. What numbers do you have
to back that up?

e.g. if we are just punching a 4kB hole in a range covered by a THP,
then breaking up the THP is, IMO, exactly the wrong thing to do.
Just zeroing it out via iomap_zero_range() has much lower overhead
and is far simpler than breaking up the THP just to remove a single
page from the range...

> > FWIW, didn't you change the dirty tracking to be done sub-page and
> > held in the iomap_page? If so, releasing the iomap_page on a partial
> > page invalidation looks ... problematic. i.e. not only are you
> > throwing away the per-block up-to-date state on a THP, you're also
> > throwing away the per-block dirty state.
> 
> That wasn't my patch.  Also, discarding the iomap_page causes the entire
> page to be treated as a single unit.  So if we lose per-block state,
> and the page is marked as dirty, then each subpage (that remains after
> the holepunch) will be treated as dirty.

That really sounds like something that your patchset needs to do,
though. You're jumping through hoops to handle untracked partial
THP page state, which adds complexity, but you can avoid all this with
a relatively simple change...

> > > The current assumption is that a !Uptodate THP is due to a read error,
> > > and so the sensible thing to do is split it and handle read errors at
> > > a single-page level.
> > 
> > Why? Apart from the range of the file covered by the page, how is
> > handling a read error at a single page level any different from
> > handling it at a THP level?
> > 
> > Alternatively, if there's a read error on THP-based readahead, then
> > why isn't the entire THP tossed away when a subsequent read sees
> > PageError() so it can then be re-read synchronously into the cache
> > using single pages?
> 
> Splitting the page instead of throwing it away makes sense once we can
> transfer the Uptodate bits to each subpage.  If we don't have that,
> it doesn't really matter which we do.

Sounds like more required functionality...

> > > We do that in iomap_readpage_actor().  Had the readahead I/O not "failed",
> > > we'd've had an Uptodate THP which straddled EOF.
> > 
> > If the IO error in a THP is the trigger for bad things here, then
> > surely the correct thing to do is trash the THP on IO error, not
> > leave a landmine that every path has to handle specially...
> 
> Can't split a page on I/O error -- split_huge_page() has to be called
> from process context and we get notified about the error in BH context.

That's exactly why we have workqueues on the write IO completion
path - much of the write completion work we do must execute in a
context where we can take sleeping locks, block, issue more IO, etc.

IOWs, if the THP read IO fails, punt it to a workqueue to do the
page split and update the page error states.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: Splitting a THP beyond EOF
  2020-10-21 22:14         ` Dave Chinner
@ 2020-10-21 23:04           ` Matthew Wilcox
  2020-10-27  5:31             ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2020-10-21 23:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-xfs

On Thu, Oct 22, 2020 at 09:14:35AM +1100, Dave Chinner wrote:
> On Tue, Oct 20, 2020 at 11:53:31PM +0100, Matthew Wilcox wrote:
> > True, we don't _have to_ split THP on holepunch/truncation/... but it's
> > a better implementation to free pages which cover blocks that no longer
> > have data associated with them.
> 
> "Better" is a very subjective measure. What numbers do you have
> to back that up?

None.  When we choose to use a THP, we're choosing to treat a chunk
of a file as a single unit for the purposes of tracking dirtiness,
age, membership of the workingset, etc.  We're trading off reduced
precision for reduced overhead; just like the CPU tracks dirtiness on
a cacheline basis instead of at byte level.

So at some level, we're making the assumption that this 128kB THP is
all one thing and it should be tracked together.  But the user has just
punched a hole in it.  I can think of no stronger signal to say "The
piece before this hole, the piece I just got rid of and the piece after
this are three separate pieces of the file".

If I could split them into pieces that weren't single pages, I would.
Zi Yan has a patch to do just that, and I'm very much looking forward
to that being merged.  But saying "Oh, this is quite small, I'll keep
the rest of the THP together" is conceptually wrong.

> > Splitting the page instead of throwing it away makes sense once we can
> > transfer the Uptodate bits to each subpage.  If we don't have that,
> > it doesn't really matter which we do.
> 
> Sounds like more required functionality...

I'm not saying that my patchset is the last word and there will be no
tweaking.  I'm saying I think it's good enough, an improvement on the
status quo, and it's better to merge it for 5.11 than to keep it out of
tree for another three months while we tinker with improving it.

Do you disagree?



* Re: Splitting a THP beyond EOF
  2020-10-21 23:04           ` Matthew Wilcox
@ 2020-10-27  5:31             ` Dave Chinner
  2020-10-27 12:14               ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2020-10-27  5:31 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, linux-xfs

On Thu, Oct 22, 2020 at 12:04:22AM +0100, Matthew Wilcox wrote:
> On Thu, Oct 22, 2020 at 09:14:35AM +1100, Dave Chinner wrote:
> > On Tue, Oct 20, 2020 at 11:53:31PM +0100, Matthew Wilcox wrote:
> > > True, we don't _have to_ split THP on holepunch/truncation/... but it's
> > > a better implementation to free pages which cover blocks that no longer
> > > have data associated with them.
> > 
> > "Better" is a very subjective measure. What numbers do you have
> > to back that up?
> 
> None.  When we choose to use a THP, we're choosing to treat a chunk
> of a file as a single unit for the purposes of tracking dirtiness,
> age, membership of the workingset, etc.  We're trading off reduced
> precision for reduced overhead; just like the CPU tracks dirtiness on
> a cacheline basis instead of at byte level.
> 
> So at some level, we're making the assumption that this 128kB THP is
> all one thing and it should be tracked together.  But the user has just
> punched a hole in it.  I can think of no stronger signal to say "The
> piece before this hole, the piece I just got rid of and the piece after
> this are three separate pieces of the file".

There's a difference between the physical layout of the file and
representing data efficiently in the page cache. Just because we can
use a THP to represent a single extent doesn't mean we should always
use that relationship, nor should small manipulations of on-disk
extent state require that page cache pages be split or gathered.

i.e. the whole point of the page cache is to decouple the physical
layout of the file from the user access mechanisms for performance
reasons, not tie them tightly together. I think that's the wrong
approach to be taking here - truncate/holepunch do not imply that
THPs need to be split unconditionally. Indeed, readahead doesn't
care that a THP might be split across multiple extents and require
multiple bios to bring that data into cache, so why should
truncate/holepunch type operations require the THP to be split to
reflect underlying disk layouts?

> If I could split them into pieces that weren't single pages, I would.
> Zi Yan has a patch to do just that, and I'm very much looking forward
> to that being merged.  But saying "Oh, this is quite small, I'll keep
> the rest of the THP together" is conceptually wrong.

Yet that's exactly what we do with block size < PAGE_SIZE
configurations, so I fail to see why it's conceptually wrong for
THPs to behave the same way as normal pages....

> > > Splitting the page instead of throwing it away makes sense once we can
> > > transfer the Uptodate bits to each subpage.  If we don't have that,
> > > it doesn't really matter which we do.
> > 
> > Sounds like more required functionality...
> 
> I'm not saying that my patchset is the last word and there will be no
> tweaking.  I'm saying I think it's good enough, an improvement on the
> status quo, and it's better to merge it for 5.11 than to keep it out of
> tree for another three months while we tinker with improving it.
> 
> Do you disagree?

In part. Concepts and algorithms need to be sound and agreed upon
before we merge patches, and right now I disagree with some of
the basic assumptions about how THP and filesystem layout operations
are being coupled. That part needs to be sorted before stuff gets
merged...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: Splitting a THP beyond EOF
  2020-10-27  5:31             ` Dave Chinner
@ 2020-10-27 12:14               ` Matthew Wilcox
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2020-10-27 12:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-xfs

On Tue, Oct 27, 2020 at 04:31:26PM +1100, Dave Chinner wrote:
> On Thu, Oct 22, 2020 at 12:04:22AM +0100, Matthew Wilcox wrote:
> > On Thu, Oct 22, 2020 at 09:14:35AM +1100, Dave Chinner wrote:
> > > On Tue, Oct 20, 2020 at 11:53:31PM +0100, Matthew Wilcox wrote:
> > > > True, we don't _have to_ split THP on holepunch/truncation/... but it's
> > > > a better implementation to free pages which cover blocks that no longer
> > > > have data associated with them.
> > > 
> > > "Better" is a very subjective measure. What numbers do you have
> > > to back that up?
> > 
> > None.  When we choose to use a THP, we're choosing to treat a chunk
> > of a file as a single unit for the purposes of tracking dirtiness,
> > age, membership of the workingset, etc.  We're trading off reduced
> > precision for reduced overhead; just like the CPU tracks dirtiness on
> > a cacheline basis instead of at byte level.
> > 
> > So at some level, we're making the assumption that this 128kB THP is
> > all one thing and it should be tracked together.  But the user has just
> > punched a hole in it.  I can think of no stronger signal to say "The
> > piece before this hole, the piece I just got rid of and the piece after
> > this are three separate pieces of the file".
> 
> There's a difference between the physical layout of the file and
> representing data efficiently in the page cache. Just because we can
> use a THP to represent a single extent doesn't mean we should always
> use that relationship, nor should we require that small
> manipulations of on-disk extent state require that page cache pages
> be split or gathered.
> 
> i.e. the whole point of the page cache is to decouple the physical
> layout of the file from the user access mechanisms for performance
> reasons, not tie them tightly together. I think that's the wrong
> approach to be taking here - truncate/holepunch do not imply that
> THPs need to be split unconditionally. Indeed, readahead doesn't
> care that a THP might be split across multiple extents and require
> multiple bios to bring that data into cache, so why should
> truncate/holepunch type operations require the THP to be split to
> reflect underlying disk layouts?

At the time we do readahead, we've inferred from the user's access
patterns that they're reading this file if not sequentially, then close
enough to sequentially that it makes sense to bring in more of the file.
On-media layout of the file is irrelevant, as you say.

Now the user has given us another hint about how they see the file.
A call to FALLOC_FL_PUNCH_HOLE is certainly an instruction to the
filesystem to change the layout, but it's also giving the page cache
information about how the file is being treated.  It tells us that
the portion of the file before the hole is different from the portion
of the file after the hole, and treating those two portions of the
file as being similar for the purposes of working set tracking is
going to lead to wrong decisions.

Let's take an example where an app uses 1kB fixed size records.  First it
does a linear scan (so readahead kicks in and we get all the way up to
allocating 256kB pages).  Then it decides some records are obsolete, so it
calls PUNCH_HOLE on the range 20kB to 27kB in the page, then PUNCH_HOLE
40kB-45kB and finally PUNCH_HOLE 150kB-160kB.  In my current scheme,
this splits the page into 4kB pages.  If the app then only operates on
the records after 160kB and before 20kB, the pages used to cache records
in the 24kB-40kB and 44kB-150kB ranges will naturally fall out of cache
and the memory will be used for other purposes.  With your scheme,
the 256kB page would be retained in cache as a single piece.

> > If I could split them into pieces that weren't single pages, I would.
> > Zi Yan has a patch to do just that, and I'm very much looking forward
> > to that being merged.  But saying "Oh, this is quite small, I'll keep
> > the rest of the THP together" is conceptually wrong.
> 
> Yet that's exactly what we do with block size < PAGE_SIZE
> configurations, so I fail to see why it's conceptually wrong for
> THPs to behave the same way as normal pages....

We don't have the ability to mmap files at smaller than PAGE_SIZE
granularity, so we can't do that.

> > I'm not saying that my patchset is the last word and there will be no
> > tweaking.  I'm saying I think it's good enough, an improvement on the
> > status quo, and it's better to merge it for 5.11 than to keep it out of
> > tree for another three months while we tinker with improving it.
> > 
> > Do you disagree?
> 
> In part. Concepts and algorithms need to be sound and agreed upon
> before we merge patches, and right now I disagree with some of
> the basic assumptions about how THP and filesystem layout operations
> are being coupled. That part needs to be sorted before stuff gets
> merged...

They're not being coupled.  I'm using the information the user is
giving the kernel to make better decisions about what to cache.

