linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] huge tmpfs: try to split_huge_page() when punching hole
@ 2020-02-27  4:06 Hugh Dickins
  2020-02-27  8:47 ` Kirill A. Shutemov
  0 siblings, 1 reply; 6+ messages in thread
From: Hugh Dickins @ 2020-02-27  4:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Yang Shi, Hugh Dickins, Alexander Duyck, Michael S. Tsirkin,
	David Hildenbrand, Kirill A. Shutemov, Matthew Wilcox,
	Andrea Arcangeli, linux-mm, linux-kernel

Yang Shi writes:

Currently, when truncating a shmem file, if the range is partly in a THP
(start or end is in the middle of THP), the pages actually will just get
cleared rather than being freed, unless the range covers the whole THP.
Even though all the subpages are truncated (randomly or sequentially),
the THP may still be kept in page cache.

This might be fine for some usecases which prefer preserving THP, but
balloon inflation is handled in base page size.  So when using shmem THP
as memory backend, QEMU inflation actually doesn't work as expected since
it doesn't free memory.  But the inflation usecase really needs to get
the memory freed.  (Anonymous THP will also not get freed right away,
but will be freed eventually when all subpages are unmapped: whereas
shmem THP still stays in page cache.)

Split THP right away when doing partial hole punch, and if split fails
just clear the page so that read of the punched area will return zeroes.

Hugh Dickins adds:

Our earlier "team of pages" huge tmpfs implementation worked in the way
that Yang Shi proposes; and we have been using this patch to continue to
split the huge page when hole-punched or truncated, since converting over
to the compound page implementation.  Although huge tmpfs gives out huge
pages when available, if the user specifically asks to truncate or punch
a hole (perhaps to free memory, perhaps to reduce the memcg charge), then
the filesystem should do so as best it can, splitting the huge page.

That is not always possible: any additional reference to the huge page
prevents split_huge_page() from succeeding, so the result can be flaky.
But in practice it works successfully enough that we've not seen any
problem from that.

Add shmem_punch_compound() to encapsulate the decision of when a split
is needed, and doing the split if so.  Using this simplifies the flow in
shmem_undo_range(); and the first (trylock) pass does not need to do any
page clearing on failure, because the second pass will either succeed or
do that clearing.  Following the example of zero_user_segment() when
clearing a partial page, add flush_dcache_page() and set_page_dirty()
when clearing a hole - though I'm not certain that either is needed.

But: split_huge_page() would be sure to fail if shmem_undo_range()'s
pagevec holds further references to the huge page.  The easiest way to
fix that is for find_get_entries() to return early, as soon as it has
put one compound head or tail into the pagevec.  At first this felt
like a hack; but on examination, this convention better suits all its
callers - or will do, if the slight one-page-per-pagevec slowdown in
shmem_unlock_mapping() and shmem_seek_hole_data() is transformed into
a 512-page-per-pagevec speedup by checking for compound pages there.

Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/filemap.c |   14 ++++++-
 mm/shmem.c   |   98 +++++++++++++++++++++----------------------------
 mm/swap.c    |    4 ++
 3 files changed, 60 insertions(+), 56 deletions(-)

--- 5.6-rc3/mm/filemap.c	2020-02-09 17:36:41.758976480 -0800
+++ linux/mm/filemap.c	2020-02-25 20:08:13.178755732 -0800
@@ -1697,6 +1697,11 @@ EXPORT_SYMBOL(pagecache_get_page);
  * Any shadow entries of evicted pages, or swap entries from
  * shmem/tmpfs, are included in the returned array.
  *
+ * If it finds a Transparent Huge Page, head or tail, find_get_entries()
+ * stops at that page: the caller is likely to have a better way to handle
+ * the compound page as a whole, and then skip its extent, than repeatedly
+ * calling find_get_entries() to return all its tails.
+ *
  * Return: the number of pages and shadow entries which were found.
  */
 unsigned find_get_entries(struct address_space *mapping,
@@ -1728,8 +1733,15 @@ unsigned find_get_entries(struct address
 		/* Has the page moved or been split? */
 		if (unlikely(page != xas_reload(&xas)))
 			goto put_page;
-		page = find_subpage(page, xas.xa_index);
 
+		/*
+		 * Terminate early on finding a THP, to allow the caller to
+		 * handle it all at once; but continue if this is hugetlbfs.
+		 */
+		if (PageTransHuge(page) && !PageHuge(page)) {
+			page = find_subpage(page, xas.xa_index);
+			nr_entries = ret + 1;
+		}
 export:
 		indices[ret] = xas.xa_index;
 		entries[ret] = page;
--- 5.6-rc3/mm/shmem.c	2020-02-09 17:36:41.798976778 -0800
+++ linux/mm/shmem.c	2020-02-25 20:08:13.182755758 -0800
@@ -789,6 +789,32 @@ void shmem_unlock_mapping(struct address
 }
 
 /*
+ * Check whether a hole-punch or truncation needs to split a huge page,
+ * returning true if no split was required, or the split has been successful.
+ *
+ * Eviction (or truncation to 0 size) should never need to split a huge page;
+ * but in rare cases might do so, if shmem_undo_range() failed to trylock on
+ * head, and then succeeded to trylock on tail.
+ *
+ * A split can only succeed when there are no additional references on the
+ * huge page: so the split below relies upon find_get_entries() having stopped
+ * when it found a subpage of the huge page, without getting further references.
+ */
+static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
+{
+	if (!PageTransCompound(page))
+		return true;
+
+	/* Just proceed to delete a huge page wholly within the range punched */
+	if (PageHead(page) &&
+	    page->index >= start && page->index + HPAGE_PMD_NR <= end)
+		return true;
+
+	/* Try to split huge page, so we can truly punch the hole or truncate */
+	return split_huge_page(page) >= 0;
+}
+
+/*
  * Remove range of pages and swap entries from page cache, and free them.
  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
  */
@@ -838,31 +864,11 @@ static void shmem_undo_range(struct inod
 			if (!trylock_page(page))
 				continue;
 
-			if (PageTransTail(page)) {
-				/* Middle of THP: zero out the page */
-				clear_highpage(page);
-				unlock_page(page);
-				continue;
-			} else if (PageTransHuge(page)) {
-				if (index == round_down(end, HPAGE_PMD_NR)) {
-					/*
-					 * Range ends in the middle of THP:
-					 * zero out the page
-					 */
-					clear_highpage(page);
-					unlock_page(page);
-					continue;
-				}
-				index += HPAGE_PMD_NR - 1;
-				i += HPAGE_PMD_NR - 1;
-			}
-
-			if (!unfalloc || !PageUptodate(page)) {
-				VM_BUG_ON_PAGE(PageTail(page), page);
-				if (page_mapping(page) == mapping) {
-					VM_BUG_ON_PAGE(PageWriteback(page), page);
+			if ((!unfalloc || !PageUptodate(page)) &&
+			    page_mapping(page) == mapping) {
+				VM_BUG_ON_PAGE(PageWriteback(page), page);
+				if (shmem_punch_compound(page, start, end))
 					truncate_inode_page(mapping, page);
-				}
 			}
 			unlock_page(page);
 		}
@@ -936,43 +942,25 @@ static void shmem_undo_range(struct inod
 
 			lock_page(page);
 
-			if (PageTransTail(page)) {
-				/* Middle of THP: zero out the page */
-				clear_highpage(page);
-				unlock_page(page);
-				/*
-				 * Partial thp truncate due 'start' in middle
-				 * of THP: don't need to look on these pages
-				 * again on !pvec.nr restart.
-				 */
-				if (index != round_down(end, HPAGE_PMD_NR))
-					start++;
-				continue;
-			} else if (PageTransHuge(page)) {
-				if (index == round_down(end, HPAGE_PMD_NR)) {
-					/*
-					 * Range ends in the middle of THP:
-					 * zero out the page
-					 */
-					clear_highpage(page);
-					unlock_page(page);
-					continue;
-				}
-				index += HPAGE_PMD_NR - 1;
-				i += HPAGE_PMD_NR - 1;
-			}
-
 			if (!unfalloc || !PageUptodate(page)) {
-				VM_BUG_ON_PAGE(PageTail(page), page);
-				if (page_mapping(page) == mapping) {
-					VM_BUG_ON_PAGE(PageWriteback(page), page);
-					truncate_inode_page(mapping, page);
-				} else {
+				if (page_mapping(page) != mapping) {
 					/* Page was replaced by swap: retry */
 					unlock_page(page);
 					index--;
 					break;
 				}
+				VM_BUG_ON_PAGE(PageWriteback(page), page);
+				if (shmem_punch_compound(page, start, end))
+					truncate_inode_page(mapping, page);
+				else {
+					/* Wipe the page and don't get stuck */
+					clear_highpage(page);
+					flush_dcache_page(page);
+					set_page_dirty(page);
+					if (index <
+					    round_up(start, HPAGE_PMD_NR))
+						start = index + 1;
+				}
 			}
 			unlock_page(page);
 		}
--- 5.6-rc3/mm/swap.c	2020-02-09 17:36:41.806976836 -0800
+++ linux/mm/swap.c	2020-02-25 20:08:13.182755758 -0800
@@ -1005,6 +1005,10 @@ EXPORT_SYMBOL(__pagevec_lru_add);
  * ascending indexes.  There may be holes in the indices due to
  * not-present entries.
  *
+ * Only one subpage of a Transparent Huge Page is returned in one call:
+ * allowing truncate_inode_pages_range() to evict the whole THP without
+ * cycling through a pagevec of extra references.
+ *
  * pagevec_lookup_entries() returns the number of entries which were
  * found.
  */

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] huge tmpfs: try to split_huge_page() when punching hole
  2020-02-27  4:06 [PATCH] huge tmpfs: try to split_huge_page() when punching hole Hugh Dickins
@ 2020-02-27  8:47 ` Kirill A. Shutemov
  2020-02-28  4:04   ` Hugh Dickins
  0 siblings, 1 reply; 6+ messages in thread
From: Kirill A. Shutemov @ 2020-02-27  8:47 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Yang Shi, Alexander Duyck, Michael S. Tsirkin,
	David Hildenbrand, Kirill A. Shutemov, Matthew Wilcox,
	Andrea Arcangeli, linux-mm, linux-kernel

On Wed, Feb 26, 2020 at 08:06:33PM -0800, Hugh Dickins wrote:
> Yang Shi writes:
> 
> Currently, when truncating a shmem file, if the range is partly in a THP
> (start or end is in the middle of THP), the pages actually will just get
> cleared rather than being freed, unless the range covers the whole THP.
> Even though all the subpages are truncated (randomly or sequentially),
> the THP may still be kept in page cache.
> 
> This might be fine for some usecases which prefer preserving THP, but
> balloon inflation is handled in base page size.  So when using shmem THP
> as memory backend, QEMU inflation actually doesn't work as expected since
> it doesn't free memory.  But the inflation usecase really needs to get
> the memory freed.  (Anonymous THP will also not get freed right away,
> but will be freed eventually when all subpages are unmapped: whereas
> shmem THP still stays in page cache.)
> 
> Split THP right away when doing partial hole punch, and if split fails
> just clear the page so that read of the punched area will return zeroes.
> 
> Hugh Dickins adds:
> 
> Our earlier "team of pages" huge tmpfs implementation worked in the way
> that Yang Shi proposes; and we have been using this patch to continue to
> split the huge page when hole-punched or truncated, since converting over
> to the compound page implementation.  Although huge tmpfs gives out huge
> pages when available, if the user specifically asks to truncate or punch
> a hole (perhaps to free memory, perhaps to reduce the memcg charge), then
> the filesystem should do so as best it can, splitting the huge page.

I'm still uncomfortable with proposition to use truncate or punch a hole
operations to manage memory footprint. These operations are about managing
storage footprint, not memory. This happens to be the same for tmpfs.

I wounder if we should consider limiting the behaviour to the operation
that explicitly combines memory and storage managing: MADV_REMOVE. This
way we can avoid future misunderstandings with THP backed by a real
filesystem.

>  }
>  
>  /*
> + * Check whether a hole-punch or truncation needs to split a huge page,
> + * returning true if no split was required, or the split has been successful.
> + *
> + * Eviction (or truncation to 0 size) should never need to split a huge page;
> + * but in rare cases might do so, if shmem_undo_range() failed to trylock on
> + * head, and then succeeded to trylock on tail.
> + *
> + * A split can only succeed when there are no additional references on the
> + * huge page: so the split below relies upon find_get_entries() having stopped
> + * when it found a subpage of the huge page, without getting further references.
> + */
> +static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
> +{
> +	if (!PageTransCompound(page))
> +		return true;
> +
> +	/* Just proceed to delete a huge page wholly within the range punched */
> +	if (PageHead(page) &&
> +	    page->index >= start && page->index + HPAGE_PMD_NR <= end)
> +		return true;
> +
> +	/* Try to split huge page, so we can truly punch the hole or truncate */
> +	return split_huge_page(page) >= 0;
> +}

I wanted to recommend taking into account khugepaged_max_ptes_none here,
but it will nullify usefulness of the feature for ballooning. Oh, well...

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] huge tmpfs: try to split_huge_page() when punching hole
  2020-02-27  8:47 ` Kirill A. Shutemov
@ 2020-02-28  4:04   ` Hugh Dickins
  2020-02-28  4:26     ` Matthew Wilcox
  0 siblings, 1 reply; 6+ messages in thread
From: Hugh Dickins @ 2020-02-28  4:04 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrew Morton, Yang Shi, Alexander Duyck,
	Michael S. Tsirkin, David Hildenbrand, Kirill A. Shutemov,
	Matthew Wilcox, Andrea Arcangeli, linux-mm, linux-kernel

On Thu, 27 Feb 2020, Kirill A. Shutemov wrote:
> On Wed, Feb 26, 2020 at 08:06:33PM -0800, Hugh Dickins wrote:
> > Yang Shi writes:
> > 
> > Currently, when truncating a shmem file, if the range is partly in a THP
> > (start or end is in the middle of THP), the pages actually will just get
> > cleared rather than being freed, unless the range covers the whole THP.
> > Even though all the subpages are truncated (randomly or sequentially),
> > the THP may still be kept in page cache.
> > 
> > This might be fine for some usecases which prefer preserving THP, but
> > balloon inflation is handled in base page size.  So when using shmem THP
> > as memory backend, QEMU inflation actually doesn't work as expected since
> > it doesn't free memory.  But the inflation usecase really needs to get
> > the memory freed.  (Anonymous THP will also not get freed right away,
> > but will be freed eventually when all subpages are unmapped: whereas
> > shmem THP still stays in page cache.)
> > 
> > Split THP right away when doing partial hole punch, and if split fails
> > just clear the page so that read of the punched area will return zeroes.
> > 
> > Hugh Dickins adds:
> > 
> > Our earlier "team of pages" huge tmpfs implementation worked in the way
> > that Yang Shi proposes; and we have been using this patch to continue to
> > split the huge page when hole-punched or truncated, since converting over
> > to the compound page implementation.  Although huge tmpfs gives out huge
> > pages when available, if the user specifically asks to truncate or punch
> > a hole (perhaps to free memory, perhaps to reduce the memcg charge), then
> > the filesystem should do so as best it can, splitting the huge page.
> 
> I'm still uncomfortable with proposition to use truncate or punch a hole
> operations to manage memory footprint. These operations are about managing
> storage footprint, not memory. This happens to be the same for tmpfs.

I'd slightly reword that as "These operations are mainly about managing
storage footprint. This happens to be the same as memory for tmpfs."
and then happily agree with it.

> 
> I wounder if we should consider limiting the behaviour to the operation
> that explicitly combines memory and storage managing: MADV_REMOVE.

I'd strongly oppose letting MADV_REMOVE diverge from FALLOC_FL_PUNCH_HOLE:
if it came down to that, I would prefer to revert this patch.

> This way we can avoid future misunderstandings with THP backed by a real
> filesystem.

It's good to consider the implications for hole-punch on a persistent
filesystem cached with THPs (or lower order compound pages); but I
disagree that they should behave differently from this patch.

The hole-punch is fundamentally directed at freeing up the storage, yes;
but its page cache must also be removed, otherwise you have the user
writing into cache which is not backed by storage, and potentially losing
the data later.  So a hole must be punched in the compound page in that
case too: in fact, it's then much more important that split_huge_page()
succeeds - not obvious what the fallback should be if it fails (perhaps
in that case the compound page must be kept, but all its pmds removed,
and info on holes kept in spare fields of the compound page, to prevent
writes and write faults without calling back into the filesystem:
soluble, but more work than tmpfs needs today)(and perhaps when that
extra work is done, we would choose to rely on it rather than
immediately splitting; but it will involve discounting the holes).

Hugh

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] huge tmpfs: try to split_huge_page() when punching hole
  2020-02-28  4:04   ` Hugh Dickins
@ 2020-02-28  4:26     ` Matthew Wilcox
  2020-02-28 12:54       ` Kirill A. Shutemov
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2020-02-28  4:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrew Morton, Yang Shi, Alexander Duyck,
	Michael S. Tsirkin, David Hildenbrand, Kirill A. Shutemov,
	Andrea Arcangeli, linux-mm, linux-kernel

On Thu, Feb 27, 2020 at 08:04:21PM -0800, Hugh Dickins wrote:
> It's good to consider the implications for hole-punch on a persistent
> filesystem cached with THPs (or lower order compound pages); but I
> disagree that they should behave differently from this patch.
> 
> The hole-punch is fundamentally directed at freeing up the storage, yes;
> but its page cache must also be removed, otherwise you have the user
> writing into cache which is not backed by storage, and potentially losing
> the data later.  So a hole must be punched in the compound page in that
> case too: in fact, it's then much more important that split_huge_page()
> succeeds - not obvious what the fallback should be if it fails (perhaps
> in that case the compound page must be kept, but all its pmds removed,
> and info on holes kept in spare fields of the compound page, to prevent
> writes and write faults without calling back into the filesystem:
> soluble, but more work than tmpfs needs today)(and perhaps when that
> extra work is done, we would choose to rely on it rather than
> immediately splitting; but it will involve discounting the holes).

Ooh, a topic that reasonable people can disagree on!

The current prototype I have will allocate (huge) pages and then
ask the filesystem to fill them.  The filesystem may well find that
the extent is a hole, and if it is, it will fill the page with zeroes.
Then, the application may write to those pages, and if it does, the
filesystem will be notified to create an on-disk extent for that write.

I haven't looked at the hole-punch path in detail, but presumably it
notifies the filesystem to create a hole extent and zeroes out the
pagecache for that range (possibly by removing entire pages, and with
memset for partial pages).  Then a subsequent write to the hole will
cause the filesystem to allocate a new non-hole extent, just like the
previous case.

I think it's reasonable for the page cache to interpret a hole-punch
request as being a hint that the hole is unlikely to be accessed again,
so allocating new smaller pages for that region of the file (or just
writing back & dropping the covering page altogether) would seem like
a reasonable implementation decision.

However, it also seems reasonable that just memset() of the affected
region and leaving the page intact would also be an acceptable
implementation.  As long as writes to the newly-created hole cause the
page to become dirtied and thus writeback to be in effect.  It probably
wouldn't be as good an implementation, but it shouldn't lose writes as
you suggest above.

I'm not sure I'd choose to split a large page into smaller pages.  I think
I'd prefer to allocate lower-order pages and memcpy() the data over.
Again, that's an implementation choice, and not something that should
be visible outside the implementation.

[1] http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-pagecache

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] huge tmpfs: try to split_huge_page() when punching hole
  2020-02-28  4:26     ` Matthew Wilcox
@ 2020-02-28 12:54       ` Kirill A. Shutemov
  2020-02-28 22:28         ` Yang Shi
  0 siblings, 1 reply; 6+ messages in thread
From: Kirill A. Shutemov @ 2020-02-28 12:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Andrew Morton, Yang Shi, Alexander Duyck,
	Michael S. Tsirkin, David Hildenbrand, Kirill A. Shutemov,
	Andrea Arcangeli, linux-mm, linux-kernel

On Thu, Feb 27, 2020 at 08:26:46PM -0800, Matthew Wilcox wrote:
> On Thu, Feb 27, 2020 at 08:04:21PM -0800, Hugh Dickins wrote:
> > It's good to consider the implications for hole-punch on a persistent
> > filesystem cached with THPs (or lower order compound pages); but I
> > disagree that they should behave differently from this patch.
> > 
> > The hole-punch is fundamentally directed at freeing up the storage, yes;
> > but its page cache must also be removed, otherwise you have the user
> > writing into cache which is not backed by storage, and potentially losing
> > the data later.  So a hole must be punched in the compound page in that
> > case too: in fact, it's then much more important that split_huge_page()
> > succeeds - not obvious what the fallback should be if it fails (perhaps
> > in that case the compound page must be kept, but all its pmds removed,
> > and info on holes kept in spare fields of the compound page, to prevent
> > writes and write faults without calling back into the filesystem:
> > soluble, but more work than tmpfs needs today)(and perhaps when that
> > extra work is done, we would choose to rely on it rather than
> > immediately splitting; but it will involve discounting the holes).
> 
> Ooh, a topic that reasonable people can disagree on!

Hugh wins me over on this.

Removing PMDs will not do much as we track dirty status on compound page
level.

I see two reasonable options for persistent filesystem to handle the
punch hole:

  - Keep the page and PMD mappings intact, but trigger writeback if page
    is dirty. After the page is clean we can safely punch hole in the
    storage. Following write access to the area would trigger
    page_mkwrite() which would allocate storage accordingly.

    This is reasonable behaviour if we allow to allocate THPs not fully
    covered by space allocated on disk.

  - Try to split the page or drop it completely from the page cache (after
    write back if need) before punching the hole. Fallback to the first
    scenario if we cannot split or get rid of the page.

I cannot say I strongly prefer one approach over another. The first one
fits better with THP attitude: pay for performance with memory (and
storage I guess). The second may work better if resources is limited.

> The current prototype I have will allocate (huge) pages and then
> ask the filesystem to fill them.  The filesystem may well find that
> the extent is a hole, and if it is, it will fill the page with zeroes.
> Then, the application may write to those pages, and if it does, the
> filesystem will be notified to create an on-disk extent for that write.
> 
> I haven't looked at the hole-punch path in detail, but presumably it
> notifies the filesystem to create a hole extent and zeroes out the
> pagecache for that range (possibly by removing entire pages, and with
> memset for partial pages).  Then a subsequent write to the hole will
> cause the filesystem to allocate a new non-hole extent, just like the
> previous case.
> 
> I think it's reasonable for the page cache to interpret a hole-punch
> request as being a hint that the hole is unlikely to be accessed again,
> so allocating new smaller pages for that region of the file (or just
> writing back & dropping the covering page altogether) would seem like
> a reasonable implementation decision.
> 
> However, it also seems reasonable that just memset() of the affected
> region and leaving the page intact would also be an acceptable
> implementation.  As long as writes to the newly-created hole cause the
> page to become dirtied and thus writeback to be in effect.  It probably
> wouldn't be as good an implementation, but it shouldn't lose writes as
> you suggest above.
> 
> I'm not sure I'd choose to split a large page into smaller pages.  I think
> I'd prefer to allocate lower-order pages and memcpy() the data over.
> Again, that's an implementation choice, and not something that should
> be visible outside the implementation.

Copying over the data has the same limitation as split: you need to get
refcounts to the well-known state (no extra pins) before you can proceed.
So it will not be never-fail operation.

> 
> [1] http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-pagecache

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] huge tmpfs: try to split_huge_page() when punching hole
  2020-02-28 12:54       ` Kirill A. Shutemov
@ 2020-02-28 22:28         ` Yang Shi
  0 siblings, 0 replies; 6+ messages in thread
From: Yang Shi @ 2020-02-28 22:28 UTC (permalink / raw)
  To: Kirill A. Shutemov, Matthew Wilcox
  Cc: Hugh Dickins, Andrew Morton, Alexander Duyck, Michael S. Tsirkin,
	David Hildenbrand, Kirill A. Shutemov, Andrea Arcangeli,
	linux-mm, linux-kernel



On 2/28/20 4:54 AM, Kirill A. Shutemov wrote:
> On Thu, Feb 27, 2020 at 08:26:46PM -0800, Matthew Wilcox wrote:
>> On Thu, Feb 27, 2020 at 08:04:21PM -0800, Hugh Dickins wrote:
>>> It's good to consider the implications for hole-punch on a persistent
>>> filesystem cached with THPs (or lower order compound pages); but I
>>> disagree that they should behave differently from this patch.
>>>
>>> The hole-punch is fundamentally directed at freeing up the storage, yes;
>>> but its page cache must also be removed, otherwise you have the user
>>> writing into cache which is not backed by storage, and potentially losing
>>> the data later.  So a hole must be punched in the compound page in that
>>> case too: in fact, it's then much more important that split_huge_page()
>>> succeeds - not obvious what the fallback should be if it fails (perhaps
>>> in that case the compound page must be kept, but all its pmds removed,
>>> and info on holes kept in spare fields of the compound page, to prevent
>>> writes and write faults without calling back into the filesystem:
>>> soluble, but more work than tmpfs needs today)(and perhaps when that
>>> extra work is done, we would choose to rely on it rather than
>>> immediately splitting; but it will involve discounting the holes).
>> Ooh, a topic that reasonable people can disagree on!
> Hugh wins me over on this.
>
> Removing PMDs will not do much as we track dirty status on compound page
> level.
>
> I see two reasonable options for persistent filesystem to handle the
> punch hole:
>
>    - Keep the page and PMD mappings intact, but trigger writeback if page
>      is dirty. After the page is clean we can safely punch hole in the
>      storage. Following write access to the area would trigger
>      page_mkwrite() which would allocate storage accordingly.
>
>      This is reasonable behaviour if we allow to allocate THPs not fully
>      covered by space allocated on disk.
>
>    - Try to split the page or drop it completely from the page cache (after
>      write back if need) before punching the hole. Fallback to the first
>      scenario if we cannot split or get rid of the page.
>
> I cannot say I strongly prefer one approach over another. The first one
> fits better with THP attitude: pay for performance with memory (and
> storage I guess). The second may work better if resources is limited.

I'm afraid any approach which would not drop page cache may get us end 
up being in the same situation as what Hugh or my patch is trying to 
solve. IMHO the latter one might be preferred.

>
>> The current prototype I have will allocate (huge) pages and then
>> ask the filesystem to fill them.  The filesystem may well find that
>> the extent is a hole, and if it is, it will fill the page with zeroes.
>> Then, the application may write to those pages, and if it does, the
>> filesystem will be notified to create an on-disk extent for that write.
>>
>> I haven't looked at the hole-punch path in detail, but presumably it
>> notifies the filesystem to create a hole extent and zeroes out the
>> pagecache for that range (possibly by removing entire pages, and with
>> memset for partial pages).  Then a subsequent write to the hole will
>> cause the filesystem to allocate a new non-hole extent, just like the
>> previous case.
>>
>> I think it's reasonable for the page cache to interpret a hole-punch
>> request as being a hint that the hole is unlikely to be accessed again,
>> so allocating new smaller pages for that region of the file (or just
>> writing back & dropping the covering page altogether) would seem like
>> a reasonable implementation decision.
>>
>> However, it also seems reasonable that just memset() of the affected
>> region and leaving the page intact would also be an acceptable
>> implementation.  As long as writes to the newly-created hole cause the
>> page to become dirtied and thus writeback to be in effect.  It probably
>> wouldn't be as good an implementation, but it shouldn't lose writes as
>> you suggest above.
>>
>> I'm not sure I'd choose to split a large page into smaller pages.  I think
>> I'd prefer to allocate lower-order pages and memcpy() the data over.
>> Again, that's an implementation choice, and not something that should
>> be visible outside the implementation.
> Copying over the data has the same limitation as split: you need to get
> refcounts to the well-known state (no extra pins) before you can proceed.
> So it will not be never-fail operation.
>
>> [1] http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-pagecache


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-02-28 22:28 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-27  4:06 [PATCH] huge tmpfs: try to split_huge_page() when punching hole Hugh Dickins
2020-02-27  8:47 ` Kirill A. Shutemov
2020-02-28  4:04   ` Hugh Dickins
2020-02-28  4:26     ` Matthew Wilcox
2020-02-28 12:54       ` Kirill A. Shutemov
2020-02-28 22:28         ` Yang Shi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).