linux-mm.kvack.org archive mirror
* Splitting an iomap_page
@ 2020-08-21 14:40 Matthew Wilcox
  2020-08-22  6:06 ` Christoph Hellwig
  2020-09-04  3:37 ` Matthew Wilcox
  0 siblings, 2 replies; 7+ messages in thread
From: Matthew Wilcox @ 2020-08-21 14:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: darrick.wong, david, yukuai3, linux-xfs, linux-fsdevel, linux-mm

I have only bad ideas for solving this problem, so I thought I'd share.

Operations like holepunch or truncate call into
truncate_inode_pages_range(), which just removes THPs that are
entirely within the punched range; pages which contain one or both
ends of the range need to be split.

What I have right now, and it works, calls do_invalidatepage() from
truncate_inode_pages_range().  The iomap implementation just calls
iomap_page_release().  We have the page locked, and we've waited for
writeback to finish, so there is no I/O outstanding against the page.
We may then get into one of the writepage methods without an iomap_page
being allocated, so we have to allocate it.  Patches here:

http://git.infradead.org/users/willy/pagecache.git/commitdiff/167f81e880ef00d83ab7db50d56ed85bfbae2601
http://git.infradead.org/users/willy/pagecache.git/commitdiff/82fe90cde95420c3cf155b54ed66c74d5fb6ffc5

If the page is Uptodate, this works great.  But if it's not, then we're
going to unnecessarily re-read data from storage -- indeed, we may as
well just dump the entire page if it's !Uptodate.  Of course, right now
the only way to get THPs is through readahead and that's going to always
read the entire page, so we're never going to see a !Uptodate THP.  But
in the future, maybe we shall?  I wouldn't like to rely on that without
pasting some big warnings for anyone who tries to change it.

Alternative 1: Allocate a new iop for each base page if needed.  This is
the obvious approach.  If the block size is >= PAGE_SIZE, we don't even
need to allocate a new iop; we can just mark the page as being Uptodate
if that range is.  The problem is that we might need to allocate a lot of
them -- 512 if we're splitting a 2MB page into 4kB chunks (which would
be 12kB -- plus a 2kB array to hold 512 pointers).  And at the point
where we know we need to allocate them, we're under a spin_lock_irq().
We could allocate them in advance, but then we might find out we aren't
going to split this page after all.

Alternative 2: Always allocate an iop for each base page in a THP.  We pay
the allocation price up front for every THP, even though the majority
will never be split.  It does save us from allocating any iop at all for
block size >= page size, but it's a lot of extra overhead to gather all
the Uptodate bits.  As above, it'd be 12kB of iops vs 80 bytes that we
currently allocate for a 2MB THP.  144 once we also track dirty bits.
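
(Back-of-envelope for where these numbers come from, assuming one
uptodate bit per block: a 2MB THP with 4kB blocks needs 512 bits =
64 bytes of bitmap plus roughly 16 bytes of counters and a lock,
which is the ~80 bytes; a second 512-bit bitmap for dirty tracking
adds another 64 bytes, hence ~144.  512 separate per-page iops at
roughly 24 bytes each is where the 12kB comes from.)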

Alternative 3: Allow pages to share an iop.  Do something like add a
pgoff_t base and a refcount_t to the iop and have each base page point
to the same iop, using the part of the bit array indexed by (page->index
- base) * blocks_per_page.  The read/write counts are harder to share,
and I'm not sure I see a way to do it.
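
For concreteness, the shared iop might look something like this (just
a sketch -- the field names are invented for illustration and nothing
here is from an actual patch):

	struct iomap_page {
		pgoff_t		base;		/* index of the THP's head page */
		refcount_t	refs;		/* one reference per base page */
		atomic_t	read_count;	/* hard to share, as noted above */
		atomic_t	write_count;	/* likewise */
		spinlock_t	uptodate_lock;
		unsigned long	uptodate[];	/* one bit per block in the THP */
	};

	/* First bit in the shared bitmap belonging to this base page: */
	static unsigned iop_first_block(struct page *page,
			struct iomap_page *iop, unsigned blocks_per_page)
	{
		return (page->index - iop->base) * blocks_per_page;
	}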

Alternative 4: Don't support partially-uptodate THPs.  We declare (and
enforce with strategic assertions) that all THPs must always be Uptodate
(or Error) once unlocked.  If your workload benefits from using THPs,
you want to do big I/Os anyway, so these "write 512 bytes at a time
using O_SYNC" workloads aren't going to use THPs.

Funnily, buffer_heads are easier here.  They just need to be reparented
to their new page.  Of course, they're allocated up front, so they're
essentially alternative 2.




* Re: Splitting an iomap_page
  2020-08-21 14:40 Splitting an iomap_page Matthew Wilcox
@ 2020-08-22  6:06 ` Christoph Hellwig
  2020-08-24 16:06   ` Darrick J. Wong
  2020-09-04  3:37 ` Matthew Wilcox
  1 sibling, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2020-08-22  6:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, darrick.wong, david, yukuai3, linux-xfs,
	linux-fsdevel, linux-mm

On Fri, Aug 21, 2020 at 03:40:21PM +0100, Matthew Wilcox wrote:
> I have only bad ideas for solving this problem, so I thought I'd share.
> 
> Operations like holepunch or truncate call into
> truncate_inode_pages_range(), which just removes THPs that are
> entirely within the punched range; pages which contain one or both
> ends of the range need to be split.
> 
> What I have right now, and it works, calls do_invalidatepage() from
> truncate_inode_pages_range().  The iomap implementation just calls
> iomap_page_release().  We have the page locked, and we've waited for
> writeback to finish, so there is no I/O outstanding against the page.
> We may then get into one of the writepage methods without an iomap_page
> being allocated, so we have to allocate it.  Patches here:
> 
> http://git.infradead.org/users/willy/pagecache.git/commitdiff/167f81e880ef00d83ab7db50d56ed85bfbae2601
> http://git.infradead.org/users/willy/pagecache.git/commitdiff/82fe90cde95420c3cf155b54ed66c74d5fb6ffc5
> 
> If the page is Uptodate, this works great.  But if it's not, then we're
> going to unnecessarily re-read data from storage -- indeed, we may as
> well just dump the entire page if it's !Uptodate.  Of course, right now
> the only way to get THPs is through readahead and that's going to always
> read the entire page, so we're never going to see a !Uptodate THP.  But
> in the future, maybe we shall?  I wouldn't like to rely on that without
> pasting some big warnings for anyone who tries to change it.
> 
> Alternative 1: Allocate a new iop for each base page if needed.  This is
> the obvious approach.  If the block size is >= PAGE_SIZE, we don't even
> need to allocate a new iop; we can just mark the page as being Uptodate
> if that range is.  The problem is that we might need to allocate a lot of
> them -- 512 if we're splitting a 2MB page into 4kB chunks (which would
> be 12kB -- plus a 2kB array to hold 512 pointers).  And at the point
> where we know we need to allocate them, we're under a spin_lock_irq().
> We could allocate them in advance, but then we might find out we aren't
> going to split this page after all.
> 
> Alternative 2: Always allocate an iop for each base page in a THP.  We pay
> the allocation price up front for every THP, even though the majority
> will never be split.  It does save us from allocating any iop at all for
> block size >= page size, but it's a lot of extra overhead to gather all
> the Uptodate bits.  As above, it'd be 12kB of iops vs 80 bytes that we
> currently allocate for a 2MB THP.  144 once we also track dirty bits.
> 
> Alternative 3: Allow pages to share an iop.  Do something like add a
> pgoff_t base and a refcount_t to the iop and have each base page point
> to the same iop, using the part of the bit array indexed by (page->index
> - base) * blocks_per_page.  The read/write counts are harder to share,
> and I'm not sure I see a way to do it.
> 
> Alternative 4: Don't support partially-uptodate THPs.  We declare (and
> enforce with strategic assertions) that all THPs must always be Uptodate
> (or Error) once unlocked.  If your workload benefits from using THPs,
> you want to do big I/Os anyway, so these "write 512 bytes at a time
> using O_SYNC" workloads aren't going to use THPs.
> 
> Funnily, buffer_heads are easier here.  They just need to be reparented
> to their new page.  Of course, they're allocated up front, so they're
> essentially alternative 2.

At least initially I'd go for 4.  And then see if someone screams loudly
enough to reconsider.  And if we really have to, I wonder if we can do
a variation of variant 1 where we avoid allocating under the
irq-disabled spinlock by a clever retry trick.
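
Something along these lines, maybe (completely untested sketch; "lock"
and page_needs_iop() are stand-ins for whatever the real code would
use):

	struct iomap_page *iop = NULL;

	for (;;) {
		spin_lock_irq(lock);
		/* Either the page no longer needs an iop, or we have one. */
		if (!page_needs_iop(page) || iop)
			break;
		/* Drop the lock, allocate with a sane GFP mask, retry. */
		spin_unlock_irq(lock);
		iop = kzalloc(sizeof(*iop), GFP_NOFS);
		if (!iop)
			return -ENOMEM;
	}
	if (page_needs_iop(page))
		attach_page_private(page, iop);
	else
		kfree(iop);
	/* ... carry on with the split, still holding the lock ... */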



* Re: Splitting an iomap_page
  2020-08-22  6:06 ` Christoph Hellwig
@ 2020-08-24 16:06   ` Darrick J. Wong
  0 siblings, 0 replies; 7+ messages in thread
From: Darrick J. Wong @ 2020-08-24 16:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, david, yukuai3, linux-xfs, linux-fsdevel, linux-mm

On Sat, Aug 22, 2020 at 07:06:18AM +0100, Christoph Hellwig wrote:
> On Fri, Aug 21, 2020 at 03:40:21PM +0100, Matthew Wilcox wrote:
> > I have only bad ideas for solving this problem, so I thought I'd share.
> > 
> > Operations like holepunch or truncate call into
> > truncate_inode_pages_range(), which just removes THPs that are
> > entirely within the punched range; pages which contain one or both
> > ends of the range need to be split.
> > 
> > What I have right now, and it works, calls do_invalidatepage() from
> > truncate_inode_pages_range().  The iomap implementation just calls
> > iomap_page_release().  We have the page locked, and we've waited for
> > writeback to finish, so there is no I/O outstanding against the page.
> > We may then get into one of the writepage methods without an iomap_page
> > being allocated, so we have to allocate it.  Patches here:
> > 
> > http://git.infradead.org/users/willy/pagecache.git/commitdiff/167f81e880ef00d83ab7db50d56ed85bfbae2601
> > http://git.infradead.org/users/willy/pagecache.git/commitdiff/82fe90cde95420c3cf155b54ed66c74d5fb6ffc5
> > 
> > If the page is Uptodate, this works great.  But if it's not, then we're
> > going to unnecessarily re-read data from storage -- indeed, we may as
> > well just dump the entire page if it's !Uptodate.  Of course, right now
> > the only way to get THPs is through readahead and that's going to always
> > read the entire page, so we're never going to see a !Uptodate THP.  But
> > in the future, maybe we shall?  I wouldn't like to rely on that without
> > pasting some big warnings for anyone who tries to change it.
> > 
> > Alternative 1: Allocate a new iop for each base page if needed.  This is
> > the obvious approach.  If the block size is >= PAGE_SIZE, we don't even
> > need to allocate a new iop; we can just mark the page as being Uptodate
> > if that range is.  The problem is that we might need to allocate a lot of
> > them -- 512 if we're splitting a 2MB page into 4kB chunks (which would
> > be 12kB -- plus a 2kB array to hold 512 pointers).  And at the point
> > where we know we need to allocate them, we're under a spin_lock_irq().
> > We could allocate them in advance, but then we might find out we aren't
> > going to split this page after all.
> > 
> > Alternative 2: Always allocate an iop for each base page in a THP.  We pay
> > the allocation price up front for every THP, even though the majority
> > will never be split.  It does save us from allocating any iop at all for
> > block size >= page size, but it's a lot of extra overhead to gather all
> > the Uptodate bits.  As above, it'd be 12kB of iops vs 80 bytes that we
> > currently allocate for a 2MB THP.  144 once we also track dirty bits.
> > 
> > Alternative 3: Allow pages to share an iop.  Do something like add a
> > pgoff_t base and a refcount_t to the iop and have each base page point
> > to the same iop, using the part of the bit array indexed by (page->index
> > - base) * blocks_per_page.  The read/write counts are harder to share,
> > and I'm not sure I see a way to do it.
> > 
> > Alternative 4: Don't support partially-uptodate THPs.  We declare (and
> > enforce with strategic assertions) that all THPs must always be Uptodate
> > (or Error) once unlocked.  If your workload benefits from using THPs,
> > you want to do big I/Os anyway, so these "write 512 bytes at a time
> > using O_SYNC" workloads aren't going to use THPs.
> > 
> > Funnily, buffer_heads are easier here.  They just need to be reparented
> > to their new page.  Of course, they're allocated up front, so they're
> > essentially alternative 2.
> 
> At least initially I'd go for 4.  And then see if someone screams loudly
> enough to reconsider.  And if we really have to, I wonder if we can do
> a variation of variant 1 where we avoid allocating under the
> irq-disabled spinlock by a clever retry trick.

/me doesn't have any objection to #4, and bets #1 and #3 will probably
lead to weird problems /somewhere/ ;)

--D



* Re: Splitting an iomap_page
  2020-08-21 14:40 Splitting an iomap_page Matthew Wilcox
  2020-08-22  6:06 ` Christoph Hellwig
@ 2020-09-04  3:37 ` Matthew Wilcox
  2020-09-07 11:33   ` Kirill A. Shutemov
  1 sibling, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2020-09-04  3:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: darrick.wong, david, yukuai3, linux-xfs, linux-fsdevel, linux-mm,
	Kirill A. Shutemov

On Fri, Aug 21, 2020 at 03:40:21PM +0100, Matthew Wilcox wrote:
> Operations like holepunch or truncate call into
> truncate_inode_pages_range(), which just removes THPs that are
> entirely within the punched range; pages which contain one or both
> ends of the range need to be split.
> 
> What I have right now, and it works, calls do_invalidatepage() from
> truncate_inode_pages_range().  The iomap implementation just calls
> iomap_page_release().  We have the page locked, and we've waited for
> writeback to finish, so there is no I/O outstanding against the page.
> We may then get into one of the writepage methods without an iomap_page
> being allocated, so we have to allocate it.  Patches here:
> 
> http://git.infradead.org/users/willy/pagecache.git/commitdiff/167f81e880ef00d83ab7db50d56ed85bfbae2601
> http://git.infradead.org/users/willy/pagecache.git/commitdiff/82fe90cde95420c3cf155b54ed66c74d5fb6ffc5
> 
> If the page is Uptodate, this works great.  But if it's not, then we're
> going to unnecessarily re-read data from storage -- indeed, we may as
> well just dump the entire page if it's !Uptodate.  Of course, right now
> the only way to get THPs is through readahead and that's going to always
> read the entire page, so we're never going to see a !Uptodate THP.  But
> in the future, maybe we shall?  I wouldn't like to rely on that without
> pasting some big warnings for anyone who tries to change it.
[first three bad solutions snipped]
> Alternative 4: Don't support partially-uptodate THPs.  We declare (and
> enforce with strategic assertions) that all THPs must always be Uptodate
> (or Error) once unlocked.  If your workload benefits from using THPs,
> you want to do big I/Os anyway, so these "write 512 bytes at a time
> using O_SYNC" workloads aren't going to use THPs.

This was the most popular alternative amongst those who cast a ballot.
Here's my response.  I'd really like to be able to call back from
split_huge_page() to the filesystem through ->is_partially_uptodate
in order to set each of the pages to uptodate (if indeed they are).
Unfortunately, we just disposed of the per-page data right before we
called split_huge_page(), so that doesn't work.

This solution allows PageError pages to stay in the cache (because
I/O completions run in contexts where we don't want to call
split_huge_page()).  The only way (I believe) to get a !Uptodate page
dirty is to perform a write to the page.  The patch below will split
the page in this case.  It also splits the page in the case where we
call ->readpage to attempt to bring the page uptodate (from, eg, the
filemap read path, or the pagefault path).

We lose the uptodate bits when we split the page, so there may be an
extra read.  That's not great.  But we do look up whether the current
subpage we're looking at is entirely uptodate, and if it is, we set the
page uptodate after it's split.

Kirill, do I have the handling of split_huge_page() failure correct?
It seems reasonable to me -- unlock the page and drop the reference,
hoping that somebody else will not have a reference to the page by the
next time we try to split it.  Or they will split it for us.  There's a
livelock opportunity here, but I'm not sure it's worse than the one in
a holepunch scenario.

This is against my current head of the THP patchset, so you'll see some
earlier bad ideas being deleted (like using thp_size() in iomap_readpage()
and the VM_WARN_ON_ONCE(!PageUptodate) in iomap_invalidatepage()).

This is all basically untested.  I have the xfstests suite running now,
but it never hit the VM_WARN_ON_ONCE in iomap_invalidatepage() so it's
probably not hitting any of this code.  Anybody want to write an
xfstest for me that uses unreliable storage for data and reliable
storage for metadata, and runs fsx like generic/127 does?

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 156cbed1cd2c..03f477f785fe 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -336,6 +336,39 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	return pos - orig_pos + plen;
 }
 
+static bool iomap_range_uptodate(struct inode *inode, struct page *page,
+		size_t start, size_t len)
+{
+	struct iomap_page *iop = to_iomap_page(page);
+	size_t i, first, last;
+
+	/* First and last blocks in range within page */
+	first = start >> inode->i_blkbits;
+	last = (start + len - 1) >> inode->i_blkbits;
+
+	if (!iop)
+		return false;
+
+	for (i = first; i <= last; i++)
+		if (!test_bit(i, iop->uptodate))
+			return false;
+	return true;
+}
+
+static bool iomap_split_page(struct inode *inode, struct page *page)
+{
+	struct page *head = thp_head(page);
+	bool uptodate = iomap_range_uptodate(inode, head,
+				(page - head) * PAGE_SIZE, PAGE_SIZE);
+
+	iomap_page_release(head);
+	if (split_huge_page(page) < 0)
+		return false;
+	if (uptodate)
+		SetPageUptodate(page);
+	return true;
+}
+
 int
 iomap_readpage(struct page *page, const struct iomap_ops *ops)
 {
@@ -344,11 +377,21 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
 	unsigned poff;
 	loff_t ret;
 
-	trace_iomap_readpage(page->mapping->host, 1);
+	if (PageTransCompound(page)) {
+		/*
+		 * The page wasn't exactly truncated, but we want to drop
+		 * our refcount so somebody else might be able to split it.
+		 */
+		if (!iomap_split_page(inode, page))
+			return AOP_TRUNCATED_PAGE;
+		if (PageUptodate(page))
+			return 0;
+	}
+	trace_iomap_readpage(inode, 1);
 
-	for (poff = 0; poff < thp_size(page); poff += ret) {
+	for (poff = 0; poff < PAGE_SIZE; poff += ret) {
 		ret = iomap_apply(inode, page_offset(page) + poff,
-				thp_size(page) - poff, 0, ops, &ctx,
+				PAGE_SIZE - poff, 0, ops, &ctx,
 				iomap_readpage_actor);
 		if (ret <= 0) {
 			WARN_ON_ONCE(ret == 0);
@@ -458,26 +501,15 @@ int
 iomap_is_partially_uptodate(struct page *page, unsigned long from,
 		unsigned long count)
 {
-	struct iomap_page *iop = to_iomap_page(page);
-	struct inode *inode = page->mapping->host;
-	unsigned len, first, last;
-	unsigned i;
-
-	/* Limit range to one page */
-	len = min_t(unsigned, thp_size(page) - from, count);
+	struct page *head = thp_head(page);
+	size_t len;
 
-	/* First and last blocks in range within page */
-	first = from >> inode->i_blkbits;
-	last = (from + len - 1) >> inode->i_blkbits;
+	/* 'from' is relative to page, but the bitmap is relative to head */
+	from += (page - head) * PAGE_SIZE;
+	/* Limit range to this page */
+	len = min(thp_size(head) - from, count);
 
-	if (iop) {
-		for (i = first; i <= last; i++)
-			if (!test_bit(i, iop->uptodate))
-				return 0;
-		return 1;
-	}
-
-	return 0;
+	return iomap_range_uptodate(head->mapping->host, head, from, len);
 }
 EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
 
@@ -517,7 +549,7 @@ iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
 
 	/* Punching a hole in a THP requires releasing the iop */
 	if (PageTransHuge(page)) {
-		VM_WARN_ON_ONCE(!PageUptodate(page));
+		VM_BUG_ON_PAGE(PageDirty(page), page);
 		iomap_page_release(page);
 	}
 }
@@ -645,12 +677,20 @@ static ssize_t iomap_write_begin(struct inode *inode, loff_t pos, loff_t len,
 			return status;
 	}
 
+retry:
 	page = grab_cache_page_write_begin(inode->i_mapping, pos >> PAGE_SHIFT,
 			AOP_FLAG_NOFS);
 	if (!page) {
 		status = -ENOMEM;
 		goto out_no_page;
 	}
+	if (PageTransCompound(page) && !PageUptodate(page)) {
+		if (!iomap_split_page(inode, page)) {
+			unlock_page(page);
+			put_page(page);
+			goto retry;
+		}
+	}
 	page = thp_head(page);
 	offset = offset_in_thp(page, pos);
 	if (len > thp_size(page) - offset)



* Re: Splitting an iomap_page
  2020-09-04  3:37 ` Matthew Wilcox
@ 2020-09-07 11:33   ` Kirill A. Shutemov
  2020-09-08 11:43     ` Matthew Wilcox
  0 siblings, 1 reply; 7+ messages in thread
From: Kirill A. Shutemov @ 2020-09-07 11:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, darrick.wong, david, yukuai3, linux-xfs,
	linux-fsdevel, linux-mm

On Fri, Sep 04, 2020 at 04:37:24AM +0100, Matthew Wilcox wrote:
> Kirill, do I have the handling of split_huge_page() failure correct?
> It seems reasonable to me -- unlock the page and drop the reference,
> hoping that somebody else will not have a reference to the page by the
> next time we try to split it.  Or they will split it for us.  There's a
> livelock opportunity here, but I'm not sure it's worse than the one in
> a holepunch scenario.

The worst case scenario is when the page is referenced (directly or
indirectly) by the caller. In this case we would end up in an endless
loop. I'm not sure how we can guarantee that this will never happen.

Maybe it's safer to return -EINTR if the split fails and let the
syscall (or whatever codepath brings us here) be restarted from
scratch? Yes, it's an abuse of -EINTR, but it should be fine.
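
I.e. something like this at the site that fails to split (sketch only):

	if (split_huge_page(page) < 0) {
		unlock_page(page);
		put_page(page);
		/* Abuse -EINTR to make the caller restart from scratch. */
		return -EINTR;
	}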

-- 
 Kirill A. Shutemov



* Re: Splitting an iomap_page
  2020-09-07 11:33   ` Kirill A. Shutemov
@ 2020-09-08 11:43     ` Matthew Wilcox
  2020-09-09 11:57       ` Kirill A. Shutemov
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2020-09-08 11:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Christoph Hellwig, darrick.wong, david, yukuai3, linux-xfs,
	linux-fsdevel, linux-mm

On Mon, Sep 07, 2020 at 02:33:24PM +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 04, 2020 at 04:37:24AM +0100, Matthew Wilcox wrote:
> > Kirill, do I have the handling of split_huge_page() failure correct?
> > It seems reasonable to me -- unlock the page and drop the reference,
> > hoping that somebody else will not have a reference to the page by the
> > next time we try to split it.  Or they will split it for us.  There's a
> > livelock opportunity here, but I'm not sure it's worse than the one in
> > a holepunch scenario.
> 
> The worst case scenario is when the page is referenced (directly or
> indirectly) by the caller. In this case we would end up in an endless
> loop. I'm not sure how we can guarantee that this will never happen.

I don't see a way for that to happen at the moment.  We're pretty
careful not to take references on multiple pages at once in these paths.
I've fixed the truncate paths to only take one reference per THP too.

I was thinking that if livelock becomes a problem, we could (ab)use the
THP destructor mechanism somewhat like this:

Add
	[TRANSHUGE_PAGE_SPLIT] = split_transhuge_page,
to the compound_page_dtors array.

New function split_huge_page_wait() which, if !can_split_huge_page(),
first checks if the dtor is already set to TRANSHUGE_PAGE_SPLIT.  If so,
it returns to its caller, reporting failure (so that it will put its
reference to the page).  Otherwise it sets the dtor to TRANSHUGE_PAGE_SPLIT
and sets the page refcount to 1.  It goes to sleep on the page.

split_transhuge_page() first has to check if the refcount went to 0
due to mapcount being reduced.  If so, it resets the refcount to 1 and
returns to the caller.  If not, it freezes the page and wakes the task
above which is sleeping in split_huge_page_wait().

It's complicated and I don't love it.  But it might solve livelock, should
we need to do it.  It wouldn't prevent us from an indefinite wait if the
caller of split_huge_page_wait() has more than one reference to this page.
That's better than a livelock though.



* Re: Splitting an iomap_page
  2020-09-08 11:43     ` Matthew Wilcox
@ 2020-09-09 11:57       ` Kirill A. Shutemov
  0 siblings, 0 replies; 7+ messages in thread
From: Kirill A. Shutemov @ 2020-09-09 11:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, darrick.wong, david, yukuai3, linux-xfs,
	linux-fsdevel, linux-mm

On Tue, Sep 08, 2020 at 12:43:28PM +0100, Matthew Wilcox wrote:
> On Mon, Sep 07, 2020 at 02:33:24PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Sep 04, 2020 at 04:37:24AM +0100, Matthew Wilcox wrote:
> > > Kirill, do I have the handling of split_huge_page() failure correct?
> > > It seems reasonable to me -- unlock the page and drop the reference,
> > > hoping that somebody else will not have a reference to the page by the
> > > next time we try to split it.  Or they will split it for us.  There's a
> > > livelock opportunity here, but I'm not sure it's worse than the one in
> > > a holepunch scenario.
> > 
> > The worst case scenario is when the page is referenced (directly or
> > indirectly) by the caller. In this case we would end up in an endless
> > loop. I'm not sure how we can guarantee that this will never happen.
> 
> I don't see a way for that to happen at the moment.  We're pretty
> careful not to take references on multiple pages at once in these paths.
> I've fixed the truncate paths to only take one reference per THP too.
> 
> I was thinking that if livelock becomes a problem, we could (ab)use the
> THP destructor mechanism somewhat like this:
> 
> Add
> 	[TRANSHUGE_PAGE_SPLIT] = split_transhuge_page,
> to the compound_page_dtors array.
> 
> New function split_huge_page_wait() which, if !can_split_huge_page(),
> first checks if the dtor is already set to TRANSHUGE_PAGE_SPLIT.  If so,
> it returns to its caller, reporting failure (so that it will put its
> reference to the page).  Otherwise it sets the dtor to TRANSHUGE_PAGE_SPLIT
> and sets the page refcount to 1.  It goes to sleep on the page.

The refcount has to be reduced by one, not set to one. Otherwise the page
will get split while somebody holds a pin. But that will not work if two
callsites use split_huge_page_wait(). Emm?..

> split_transhuge_page() first has to check if the refcount went to 0
> due to mapcount being reduced.  If so, it resets the refcount to 1 and
> returns to the caller.  If not, it freezes the page and wakes the task
> above which is sleeping in split_huge_page_wait().

What happens if there are still several GUP references to the page? We must
not split the page in this case.

Maybe I don't follow your idea. I dunno.

> It's complicated and I don't love it.  But it might solve livelock, should
> we need to do it.  It wouldn't prevent us from an indefinite wait if the
> caller of split_huge_page_wait() has more than one reference to this page.
> That's better than a livelock though.

-- 
 Kirill A. Shutemov



