* [RFC PATCH 0/6] variable-order, large folios for anonymous memory
@ 2023-03-17 10:57 ` Ryan Roberts
  0 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-17 10:57 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yin, Fengwei, Yu Zhao
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Hi All,

This is an RFC for an initial, _very_ limited stab at implementing support for
using variable-order, large folios for anonymous memory. It intends to be the
minimal change, upon which additions can be made incrementally. That said, with
just this change, I achieve a 4% performance improvement when compiling the
kernel (more on that later).

My motivation for posting the RFC now is twofold:

- Get feedback on the approach I'm taking before I go too far down the path;
  does this fit with the direction of the community? Are there any bear traps
  that I've not considered (due to my being fairly new to mm and not having a
  complete understanding of its entirety)?

- Seek support for a bug I'm encountering when MADV_FREE is attempting to
  split_folio() one of these new variable-order anon folios. I've been poring
  over the source and can't find the root cause. For now I have a workaround,
  but hopefully someone can give me some pointers as to where the problem is
  likely to be (see details below).

The patches apply on top of v6.3-rc1 + patches 1-31 of [4] (which needed one
minor conflict resolution). I also have a tree at [5].

See [1], [2], [3] for more background.

Approach
========

For now, I'm only modifying the allocation path (do_anonymous_page()); I'm not
touching the CoW path. First, the order of the folio to allocate for the given
fault is determined by the following constraints:

  - Folio must be naturally aligned within VA space
  - Folio must not breach boundaries of vma
  - Folio must be fully contained inside one pmd entry
  - Folio must not overlap any non-none ptes
  - Order must not be higher than a provided starting order

Where the "provided starting order" is currently hardcoded to 4, but the idea is
that this would eventually be a per-vma value that gets dynamically tuned.
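
A condensed sketch of the selection logic is below (the full
calc_anonymous_folio_order() is in patch 5; this version omits the "first_set"
short-cut that avoids re-scanning ptes):

	static void calc_anonymous_folio_order(struct vm_fault *vmf,
					       int *order_out,
					       unsigned long *addr_out)
	{
		struct vm_area_struct *vma = vmf->vma;
		/* Clamp to PMD size so the folio fits in one pmd entry. */
		int order = min(*order_out, PMD_SHIFT - PAGE_SHIFT);
		unsigned long addr;
		pte_t *pte;
		int nr;

		for (; order > 0; order--) {
			nr = 1 << order;

			/* Naturally align the folio within VA space. */
			addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
			pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);

			/* Folio must not breach the vma boundaries. */
			if (addr < vma->vm_start ||
			    addr + nr * PAGE_SIZE > vma->vm_end)
				continue;

			/* Folio must not overlap any non-none ptes. */
			if (check_all_ptes_none(pte, nr) == nr)
				break;
		}

		*order_out = order;
		*addr_out = order > 0 ? addr : vmf->address;
	}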

We then try to allocate a large folio of the determined order, and keep trying
to allocate with successively smaller orders until we succeed. Once the folio is
allocated we can take the PTL and re-check that all the covered PTE entries are
still none. If not, we decrement the order and start again.
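
The allocation fallback is implemented by try_vma_alloc_zeroed_movable_folio()
(patch 3); the PTL re-check and retry live in the do_anonymous_page() changes
in patch 5. In outline, the fallback loop is:

	static struct folio *try_vma_alloc_zeroed_movable_folio(
						struct vm_area_struct *vma,
						unsigned long vaddr, int order)
	{
		/* Don't let high-order attempts stall in reclaim. */
		gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN;
		struct folio *folio;

		for (; order > 0; order--) {
			folio = vma_alloc_zeroed_movable_folio(vma, vaddr,
							       gfp, order);
			if (folio)
				return folio;
		}

		/* Final order-0 attempt with the default flags. */
		return vma_alloc_zeroed_movable_folio(vma, vaddr, 0, 0);
	}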

Next, the folio is added to the rmap using a new API,
folio_add_new_anon_rmap_range(), which is similar to Yin, Fengwei's
folio_add_file_rmap_range() at [4]. Finally, the ptes are set using Matthew
Wilcox's new set_ptes() API, also at [4].

Folio/page refcounts and mapcounts are managed in the same way as Yin, Fengwei
does in folio_add_file_rmap_range(): a reference is taken on the folio for
each pte, _mapcount is incremented on each page by 1, and
folio->_nr_pages_mapped is set to the number of pages in the folio (since every
page is initially mapped).
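
Condensed into a single helper for illustration (the helper name below is
hypothetical; everything else is taken from patches 4 and 5), the accounting
for a new, fully-mapped anon folio amounts to:

	static void account_new_anon_folio(struct folio *folio,
					   struct vm_area_struct *vma)
	{
		int i, nr = folio_nr_pages(folio);

		/* One reference per pte; the allocation already holds one. */
		folio_ref_add(folio, nr - 1);

		/* Every page in the folio starts out mapped. */
		if (folio_test_large(folio))
			atomic_set(&folio->_nr_pages_mapped, nr);

		/* Per-page _mapcount goes from -1 (unmapped) to 0. */
		for (i = 0; i < nr; i++)
			atomic_set(&folio_page(folio, i)->_mapcount, 0);

		add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
		__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
	}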

My assumption is that the mm should be able to deal with these folios correctly
for CoW, reclaim, etc., although perhaps not as optimally as we would
(eventually) like.

Bug(s)
======

When I run this code without the last (workaround) patch, with DEBUG_VM et al.,
PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these relate
to invalid kernel addresses (which usually look like either NULL + a small
offset, or mostly zeros with a few mid-order bits set + a small offset) or to
lockdep complaining about a bad unlock balance. Call stacks are often in
madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
email example oopses out separately if anyone wants to review them.) My hunch
is that struct pages adjacent to the folio are being corrupted, but I don't
have hard evidence.

With the workaround patch applied, which prevents madvise_free_pte_range() from
attempting to split a large folio, I never see any issues. However, I'm not
putting the system under memory pressure, so I guess I might see the same types
of problem crop up under swap, etc.

I've reviewed most of the code within split_folio() and can't find any smoking
gun, but I wonder if there are implicit assumptions about the large folio being
PMD-sized that I'm now breaking?

The code in madvise_free_pte_range():

	if (folio_test_large(folio)) {
		if (folio_mapcount(folio) != 1)
			goto out;
		folio_get(folio);
		if (!folio_trylock(folio)) {
			folio_put(folio);
			goto out;
		}
		pte_unmap_unlock(orig_pte, ptl);
		if (split_folio(folio)) {
			folio_unlock(folio);
			folio_put(folio);
			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
			goto out;
		}
		...
	}

This will normally skip my large folios because they have a mapcount > 1 (the
mapcount is incremented for each pte, unlike PMD-mapped pages), but on occasion
it sees a mapcount of 1 and proceeds. So I guess it is racing against reclaim
or CoW in this case?
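
To make that concrete, this is roughly what the check sees for one of these
folios (illustrative only; it assumes an order-4 folio with all 16 ptes still
mapped):

	/*
	 * For an order-4 anon folio mapped by 16 ptes, each page's _mapcount
	 * is 0 (mapped once), so folio_mapcount() reports 16 and MADV_FREE
	 * skips the folio. Only once 15 of the 16 ptes have been unmapped
	 * (e.g. by reclaim or a racing zap) does the check see 1 and fall
	 * through to split_folio().
	 */
	if (folio_mapcount(folio) != 1)		/* 16 in the common case */
		goto out;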

I also see it's doing a dance to take the folio lock and drop the ptl. Perhaps
my large anon folio is not using the folio lock in the same way as a THP would,
and we are therefore not getting the expected serialization?

I'd really appreciate any suggestions for how to progress here!

Performance
===========

With the above bug worked around, I'm benchmarking kernel compilation, which is
known to be heavy on anonymous page faults. Overall, I see a 4% reduction in
wall-time. This is in line with my predictions based on earlier experiments
summarised at [1]. I believe there is scope for further improvement on the CoW
and reclaim paths. I'd also expect to see performance improvements due to
reduced TLB pressure on CPUs that support HPA (I'm running on Ampere Altra,
where HPA is not enabled).

All of the 4% is (obviously) in the kernel; overall kernel execution time has
reduced by 34%, more than halving the time spent servicing data faults and
significantly speeding up sys_exit_group().

Thanks,
Ryan

[1] https://lore.kernel.org/linux-mm/4c991dcb-c5bb-86bb-5a29-05df24429607@arm.com/
[2] https://lore.kernel.org/linux-mm/a7cd938e-a86f-e3af-f56c-433c92ac69c2@arm.com/
[3] https://lore.kernel.org/linux-mm/Y%2FblF0GIunm+pRIC@casper.infradead.org/
[4] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
[5] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc


Ryan Roberts (6):
  mm: Expose clear_huge_page() unconditionally
  mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
  mm: Introduce try_vma_alloc_zeroed_movable_folio()
  mm: Implement folio_add_new_anon_rmap_range()
  mm: Allocate large folios for anonymous memory
  WORKAROUND: Don't split large folios on madvise

 arch/alpha/include/asm/page.h   |   5 +-
 arch/arm64/include/asm/page.h   |   3 +-
 arch/arm64/mm/fault.c           |   7 +-
 arch/ia64/include/asm/page.h    |   5 +-
 arch/m68k/include/asm/page_no.h |   7 +-
 arch/s390/include/asm/page.h    |   5 +-
 arch/x86/include/asm/page.h     |   5 +-
 include/linux/highmem.h         |  23 +++--
 include/linux/mm.h              |   3 +-
 include/linux/rmap.h            |   2 +
 mm/madvise.c                    |   8 ++
 mm/memory.c                     | 167 ++++++++++++++++++++++++++++----
 mm/rmap.c                       |  43 ++++++++
 13 files changed, 239 insertions(+), 44 deletions(-)

--
2.25.1



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/6] mm: Expose clear_huge_page() unconditionally
  2023-03-17 10:57 ` Ryan Roberts
@ 2023-03-17 10:57   ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-17 10:57 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yin, Fengwei, Yu Zhao
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

In preparation for extending vma_alloc_zeroed_movable_folio() to
allocate an arbitrary-order folio, expose clear_huge_page()
unconditionally, so that it can be used to zero the allocated folio.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/mm.h | 3 ++-
 mm/memory.c        | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f79667824eb..cdb8c6031d0f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3538,10 +3538,11 @@ enum mf_action_page_type {
  */
 extern const struct attribute_group memory_failure_attr_group;

-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr_hint,
 			    unsigned int pages_per_huge_page);
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void copy_user_huge_page(struct page *dst, struct page *src,
 				unsigned long addr_hint,
 				struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index f456f3b5049c..c08645908ee2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5628,7 +5628,6 @@ void __might_fault(const char *file, int line)
 EXPORT_SYMBOL(__might_fault);
 #endif

-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 /*
  * Process all subpages of the specified huge page with the specified
  * operation.  The target subpage will be processed last to keep its
@@ -5716,6 +5715,8 @@ void clear_huge_page(struct page *page,
 	process_huge_page(addr_hint, pages_per_huge_page, clear_subpage, page);
 }

+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+
 static void copy_user_gigantic_page(struct page *dst, struct page *src,
 				    unsigned long addr,
 				    struct vm_area_struct *vma,
--
2.25.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 2/6] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
  2023-03-17 10:57 ` Ryan Roberts
@ 2023-03-17 10:57   ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-17 10:57 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yin, Fengwei, Yu Zhao
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Allow allocation of large folios with vma_alloc_zeroed_movable_folio().
This prepares the ground for large anonymous folios. The generic
implementation of vma_alloc_zeroed_movable_folio() now uses
clear_huge_page() to zero the allocated folio since it may now be a
non-0 order.

Currently the function is always called with order 0 and no extra gfp
flags, so no functional change intended.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/alpha/include/asm/page.h   |  5 +++--
 arch/arm64/include/asm/page.h   |  3 ++-
 arch/arm64/mm/fault.c           |  7 ++++---
 arch/ia64/include/asm/page.h    |  5 +++--
 arch/m68k/include/asm/page_no.h |  7 ++++---
 arch/s390/include/asm/page.h    |  5 +++--
 arch/x86/include/asm/page.h     |  5 +++--
 include/linux/highmem.h         | 23 +++++++++++++----------
 mm/memory.c                     |  5 +++--
 9 files changed, 38 insertions(+), 27 deletions(-)

diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
index 4db1ebc0ed99..6fc7fe91b6cb 100644
--- a/arch/alpha/include/asm/page.h
+++ b/arch/alpha/include/asm/page.h
@@ -17,8 +17,9 @@
 extern void clear_page(void *page);
 #define clear_user_page(page, vaddr, pg)	clear_page(page)

-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
+			order, vma, vaddr, false)

 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 2312e6ee595f..47710852f872 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -30,7 +30,8 @@ void copy_highpage(struct page *to, struct page *from);
 #define __HAVE_ARCH_COPY_HIGHPAGE

 struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
-						unsigned long vaddr);
+						unsigned long vaddr,
+						gfp_t gfp, int order);
 #define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio

 void tag_clear_highpage(struct page *to);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index f4cb0f85ccf4..3b4cc04f7a23 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -926,9 +926,10 @@ NOKPROBE_SYMBOL(do_debug_exception);
  * Used during anonymous page fault handling.
  */
 struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
-						unsigned long vaddr)
+						unsigned long vaddr,
+						gfp_t gfp, int order)
 {
-	gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
+	gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO | gfp;

 	/*
 	 * If the page is mapped with PROT_MTE, initialise the tags at the
@@ -938,7 +939,7 @@ struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
 	if (vma->vm_flags & VM_MTE)
 		flags |= __GFP_ZEROTAGS;

-	return vma_alloc_folio(flags, 0, vma, vaddr, false);
+	return vma_alloc_folio(flags, order, vma, vaddr, false);
 }

 void tag_clear_highpage(struct page *page)
diff --git a/arch/ia64/include/asm/page.h b/arch/ia64/include/asm/page.h
index 310b09c3342d..ebdf04274023 100644
--- a/arch/ia64/include/asm/page.h
+++ b/arch/ia64/include/asm/page.h
@@ -82,10 +82,11 @@ do {						\
 } while (0)


-#define vma_alloc_zeroed_movable_folio(vma, vaddr)			\
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order)		\
 ({									\
 	struct folio *folio = vma_alloc_folio(				\
-		GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false); \
+		GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp),		\
+		order, vma, vaddr, false);				\
 	if (folio)							\
 		flush_dcache_folio(folio);				\
 	folio;								\
diff --git a/arch/m68k/include/asm/page_no.h b/arch/m68k/include/asm/page_no.h
index 060e4c0e7605..4a2fe57fef5e 100644
--- a/arch/m68k/include/asm/page_no.h
+++ b/arch/m68k/include/asm/page_no.h
@@ -3,7 +3,7 @@
 #define _M68K_PAGE_NO_H

 #ifndef __ASSEMBLY__
-
+
 extern unsigned long memory_start;
 extern unsigned long memory_end;

@@ -13,8 +13,9 @@ extern unsigned long memory_end;
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
+			order, vma, vaddr, false)

 #define __pa(vaddr)		((unsigned long)(vaddr))
 #define __va(paddr)		((void *)((unsigned long)(paddr)))
diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index 8a2a3b5d1e29..b749564140f1 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -73,8 +73,9 @@ static inline void copy_page(void *to, void *from)
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
+			order, vma, vaddr, false)

 /*
  * These are used to make use of C type-checking..
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index d18e5c332cb9..34deab1a8dae 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -34,8 +34,9 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
 	copy_page(to, from);
 }

-#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
-	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
+#define vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order) \
+	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO | (gfp), \
+			order, vma, vaddr, false)

 #ifndef __pa
 #define __pa(x)		__phys_addr((unsigned long)(x))
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index b06254e76d99..e2127af4997b 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -209,26 +209,29 @@ static inline void clear_user_highpage(struct page *page, unsigned long vaddr)

 #ifndef vma_alloc_zeroed_movable_folio
 /**
- * vma_alloc_zeroed_movable_folio - Allocate a zeroed page for a VMA.
- * @vma: The VMA the page is to be allocated for.
- * @vaddr: The virtual address the page will be inserted into.
- *
- * This function will allocate a page suitable for inserting into this
- * VMA at this virtual address.  It may be allocated from highmem or
+ * vma_alloc_zeroed_movable_folio - Allocate a zeroed folio for a VMA.
+ * @vma: The start VMA the folio is to be allocated for.
+ * @vaddr: The virtual address the folio will be inserted into.
+ * @gfp: Additional gfp flags to mix in or 0.
+ * @order: The order of the folio (2^order pages).
+ *
+ * This function will allocate a folio suitable for inserting into this
+ * VMA starting at this virtual address.  It may be allocated from highmem or
  * the movable zone.  An architecture may provide its own implementation.
  *
- * Return: A folio containing one allocated and zeroed page or NULL if
+ * Return: A folio containing 2^order allocated and zeroed pages or NULL if
  * we are out of memory.
  */
 static inline
 struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
-				   unsigned long vaddr)
+				   unsigned long vaddr, gfp_t gfp, int order)
 {
 	struct folio *folio;

-	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vaddr, false);
+	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp,
+					order, vma, vaddr, false);
 	if (folio)
-		clear_user_highpage(&folio->page, vaddr);
+		clear_huge_page(&folio->page, vaddr, 1U << order);

 	return folio;
 }
diff --git a/mm/memory.c b/mm/memory.c
index c08645908ee2..8798da968686 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3061,7 +3061,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		goto oom;

 	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
-		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address,
+									0, 0);
 		if (!new_folio)
 			goto oom;
 	} else {
@@ -4049,7 +4050,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);
 	if (!folio)
 		goto oom;

--
2.25.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 3/6] mm: Introduce try_vma_alloc_zeroed_movable_folio()
  2023-03-17 10:57 ` Ryan Roberts
@ 2023-03-17 10:57   ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-17 10:57 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yin, Fengwei, Yu Zhao
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Like vma_alloc_zeroed_movable_folio(), except it will opportunistically
attempt to allocate high-order folios, retrying with lower orders all
the way to order-0, until success. The user must check what they got
with folio_order().

This will be used to opportunistically allocate large folios for
anonymous memory with a sensible fallback under pressure.

For attempts to allocate non-0 orders, we set __GFP_NORETRY to prevent
high latency due to reclaim, instead preferring to just try for a lower
order. The same approach is used by the readahead code when allocating
large folios.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8798da968686..c9e09415ee18 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3024,6 +3024,27 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 	count_vm_event(PGREUSE);
 }

+/*
+ * Opportunistically attempt to allocate high-order folios, retrying with lower
+ * orders all the way to order-0, until success. The user must check what they
+ * got with folio_order().
+ */
+static struct folio *try_vma_alloc_zeroed_movable_folio(
+						struct vm_area_struct *vma,
+						unsigned long vaddr, int order)
+{
+	struct folio *folio;
+	gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN;
+
+	for (; order > 0; order--) {
+		folio = vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order);
+		if (folio)
+			return folio;
+	}
+
+	return vma_alloc_zeroed_movable_folio(vma, vaddr, 0, 0);
+}
+
 /*
  * Handle the case of a page which we actually need to copy to a new page,
  * either due to COW or unsharing.
@@ -3061,8 +3082,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		goto oom;

 	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
-		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address,
-									0, 0);
+		new_folio = try_vma_alloc_zeroed_movable_folio(vma,
+							vmf->address, 0);
 		if (!new_folio)
 			goto oom;
 	} else {
@@ -4050,7 +4071,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);
+	folio = try_vma_alloc_zeroed_movable_folio(vma, vmf->address, 0);
 	if (!folio)
 		goto oom;

--
2.25.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 4/6] mm: Implement folio_add_new_anon_rmap_range()
  2023-03-17 10:57 ` Ryan Roberts
@ 2023-03-17 10:58   ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-17 10:58 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yin, Fengwei, Yu Zhao
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Like folio_add_new_anon_rmap() but batch-rmaps all the pages belonging
to a folio, for efficiency savings.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/rmap.h |  2 ++
 mm/rmap.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b87d01660412..d1d731650ce8 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address);
 void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
 		unsigned long address);
+void folio_add_new_anon_rmap_range(struct folio *folio,
+		struct vm_area_struct *vma, unsigned long address);
 void page_add_file_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
 void page_remove_rmap(struct page *, struct vm_area_struct *,
diff --git a/mm/rmap.c b/mm/rmap.c
index 8632e02661ac..05a0c0a700e7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1302,6 +1302,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
 }

+/**
+ * folio_add_new_anon_rmap_range - Add mapping to a new anonymous potentially
+ * large but definitely non-THP folio.
+ * @folio:      The folio to add the mapping to.
+ * @vma:        the vm area in which the mapping is added
+ * @address:    the user virtual address of the first page in the folio
+ *
+ * Like folio_add_new_anon_rmap() but must only be called for new *non-THP*
+ * folios. Like folio_add_new_anon_rmap(), the inc-and-test is bypassed and the
+ * folio does not have to be locked. All pages in the folio are individually
+ * accounted.
+ *
+ * As the folio is new, it's assumed to be mapped exclusively by a single
+ * process.
+ */
+void folio_add_new_anon_rmap_range(struct folio *folio,
+			struct vm_area_struct *vma, unsigned long address)
+{
+	int i;
+	int nr = folio_nr_pages(folio);
+	struct page *page = &folio->page;
+
+	VM_BUG_ON_VMA(address < vma->vm_start ||
+		      address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
+	__folio_set_swapbacked(folio);
+
+	if (folio_test_large(folio)) {
+		/* increment count (starts at 0) */
+		atomic_set(&folio->_nr_pages_mapped, nr);
+	}
+
+	for (i = 0; i < nr; i++) {
+		/* increment count (starts at -1) */
+		atomic_set(&page->_mapcount, 0);
+		__page_set_anon_rmap(folio, page, vma, address, 1);
+		page++;
+		address += PAGE_SIZE;
+	}
+
+	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
+
+}
+
 /**
  * page_add_file_rmap - add pte mapping to a file page
  * @page:	the page to add the mapping to
--
2.25.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 5/6] mm: Allocate large folios for anonymous memory
  2023-03-17 10:57 ` Ryan Roberts
@ 2023-03-17 10:58   ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-17 10:58 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yin, Fengwei, Yu Zhao
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Add the machinery to determine what order of folio to allocate within
do_anonymous_page() and deal with racing faults to the same region.

TODO: For now, the maximum order is set to 4. This should probably be
set per-vma based on factors, and adjusted dynamically.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 140 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 124 insertions(+), 16 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c9e09415ee18..3d01eab46d9c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4013,6 +4013,77 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	return ret;
 }

+/*
+ * Returns index of first pte that is not none, or nr if all are none.
+ */
+static int check_all_ptes_none(pte_t *pte, int nr)
+{
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		if (!pte_none(*pte++))
+			return i;
+	}
+
+	return nr;
+}
+
+static void calc_anonymous_folio_order(struct vm_fault *vmf,
+				       int *order_out,
+				       unsigned long *addr_out)
+{
+	/*
+	 * The aim here is to determine what size of folio we should allocate
+	 * for this fault. Factors include:
+	 * - Folio must be naturally aligned within VA space
+	 * - Folio must not breach boundaries of vma
+	 * - Folio must be fully contained inside one pmd entry
+	 * - Folio must not overlap any non-none ptes
+	 * - Order must not be higher than *order_out upon entry
+	 *
+	 * Note that the caller may or may not choose to lock the pte. If
+	 * unlocked, the calculation should be considered an estimate that will
+	 * need to be validated under the lock.
+	 */
+
+	struct vm_area_struct *vma = vmf->vma;
+	int nr;
+	int order = min(*order_out, PMD_SHIFT - PAGE_SHIFT);
+	unsigned long addr;
+	pte_t *pte;
+	pte_t *first_set = NULL;
+	int ret;
+
+	for (; order > 0; order--) {
+		nr = 1 << order;
+		addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
+		pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
+
+		/* Check vma bounds. */
+		if (addr < vma->vm_start ||
+		    addr + nr * PAGE_SIZE > vma->vm_end)
+			continue;
+
+		/* All ptes covered by order already known to be none. */
+		if (pte + nr <= first_set)
+			break;
+
+		/* Already found set pte in range covered by order. */
+		if (pte <= first_set)
+			continue;
+
+		/* Need to check if all the ptes are none. */
+		ret = check_all_ptes_none(pte, nr);
+		if (ret == nr)
+			break;
+
+		first_set = pte + ret;
+	}
+
+	*order_out = order;
+	*addr_out = order > 0 ? addr : vmf->address;
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -4024,6 +4095,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	struct folio *folio;
 	vm_fault_t ret = 0;
 	pte_t entry;
+	unsigned long addr;
+	int order = 4; // TODO: Policy for maximum folio order.
+	int pgcount;

 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -4065,24 +4139,41 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return handle_userfault(vmf, VM_UFFD_MISSING);
 		}
-		goto setpte;
+		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(vma, vmf->address, vmf->pte);
+		goto unlock;
 	}

-	/* Allocate our own private page. */
+retry:
+	/*
+	 * Estimate the folio order to allocate. We are not under the ptl here
+	 * so this estimate needs to be re-checked later once we have the lock.
+	 */
+	vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+	calc_anonymous_folio_order(vmf, &order, &addr);
+	pte_unmap(vmf->pte);
+
+	/* Allocate our own private folio. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = try_vma_alloc_zeroed_movable_folio(vma, vmf->address, 0);
+	folio = try_vma_alloc_zeroed_movable_folio(vma, addr, order);
 	if (!folio)
 		goto oom;

+	/* We may have been granted less than we asked for. */
+	order = folio_order(folio);
+	pgcount = folio_nr_pages(folio);
+
 	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
-	cgroup_throttle_swaprate(&folio->page, GFP_KERNEL);
+	folio_throttle_swaprate(folio, GFP_KERNEL);

 	/*
 	 * The memory barrier inside __folio_mark_uptodate makes sure that
-	 * preceding stores to the page contents become visible before
-	 * the set_pte_at() write.
+	 * preceding stores to the folio contents become visible before
+	 * the set_ptes() write.
 	 */
 	__folio_mark_uptodate(folio);

@@ -4091,11 +4182,26 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));

-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
-	if (!pte_none(*vmf->pte)) {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
-		goto release;
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
+
+	/*
+	 * Ensure our estimate above is still correct; we could have raced with
+	 * another thread to service a fault in the region.
+	 */
+	if (check_all_ptes_none(vmf->pte, pgcount) != pgcount) {
+		pte_t *pte = vmf->pte + ((vmf->address - addr) >> PAGE_SHIFT);
+
+		/* If faulting pte was allocated by another, exit early. */
+		if (!pte_none(*pte)) {
+			update_mmu_tlb(vma, vmf->address, pte);
+			goto release;
+		}
+
+		/* Else try again, with a lower order. */
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		folio_put(folio);
+		order--;
+		goto retry;
 	}

 	ret = check_stable_address_space(vma->vm_mm);
@@ -4109,14 +4215,16 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}

-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	folio_add_new_anon_rmap(folio, vma, vmf->address);
+	folio_ref_add(folio, pgcount - 1);
+
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
+	folio_add_new_anon_rmap_range(folio, vma, addr);
 	folio_add_lru_vma(folio, vma);
-setpte:
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+
+	set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);

 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, vmf->address, vmf->pte);
+	update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	return ret;
--
2.25.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 6/6] WORKAROUND: Don't split large folios on madvise
  2023-03-17 10:57 ` Ryan Roberts
@ 2023-03-17 10:58   ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-17 10:58 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yin, Fengwei, Yu Zhao
  Cc: Ryan Roberts, linux-mm, linux-arm-kernel

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/madvise.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 340125d08c03..8fb84da744e1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -447,6 +447,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		 * are sure it's worth. Split it if we are only owner.
 		 */
 		if (folio_test_large(folio)) {
+#if 0
 			if (folio_mapcount(folio) != 1)
 				break;
 			if (pageout_anon_only_filter && !folio_test_anon(folio))
@@ -469,6 +470,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			pte--;
 			addr -= PAGE_SIZE;
 			continue;
+#else
+			break;
+#endif
 		}

 		/*
@@ -664,6 +668,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		 * deactivate all pages.
 		 */
 		if (folio_test_large(folio)) {
+#if 0
 			if (folio_mapcount(folio) != 1)
 				goto out;
 			folio_get(folio);
@@ -684,6 +689,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			pte--;
 			addr -= PAGE_SIZE;
 			continue;
+#else
+			goto out;
+#endif
 		}

 		if (folio_test_swapcache(folio) || folio_test_dirty(folio)) {
--
2.25.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 4/6] mm: Implement folio_add_new_anon_rmap_range()
  2023-03-17 10:58   ` Ryan Roberts
@ 2023-03-22  6:59     ` Yin Fengwei
  -1 siblings, 0 replies; 30+ messages in thread
From: Yin Fengwei @ 2023-03-22  6:59 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel

On 3/17/23 18:58, Ryan Roberts wrote:
> Like folio_add_new_anon_rmap() but batch-rmaps all the pages belonging
> to a folio, for efficiency savings.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   include/linux/rmap.h |  2 ++
>   mm/rmap.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 45 insertions(+)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index b87d01660412..d1d731650ce8 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>   		unsigned long address);
>   void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>   		unsigned long address);
> +void folio_add_new_anon_rmap_range(struct folio *folio,
> +		struct vm_area_struct *vma, unsigned long address);
>   void page_add_file_rmap(struct page *, struct vm_area_struct *,
>   		bool compound);
>   void page_remove_rmap(struct page *, struct vm_area_struct *,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8632e02661ac..05a0c0a700e7 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1302,6 +1302,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>   	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>   }
> 
> +/**
> + * folio_add_new_anon_rmap_range - Add mapping to a new anonymous potentially
> + * large but definitely non-THP folio.
> + * @folio:      The folio to add the mapping to.
> + * @vma:        the vm area in which the mapping is added
> + * @address:    the user virtual address of the first page in the folio
> + *
> + * Like folio_add_new_anon_rmap() but must only be called for new *non-THP*
> + * folios. Like folio_add_new_anon_rmap(), the inc-and-test is bypassed and the
> + * folio does not have to be locked. All pages in the folio are individually
> + * accounted.
> + *
> + * As the folio is new, it's assumed to be mapped exclusively by a single
> + * process.
> + */
> +void folio_add_new_anon_rmap_range(struct folio *folio,
> +			struct vm_area_struct *vma, unsigned long address)
> +{
> +	int i;
> +	int nr = folio_nr_pages(folio);
> +	struct page *page = &folio->page;
> +
> +	VM_BUG_ON_VMA(address < vma->vm_start ||
> +		      address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
> +	__folio_set_swapbacked(folio);
> +
> +	if (folio_test_large(folio)) {
> +		/* increment count (starts at 0) */
> +		atomic_set(&folio->_nr_pages_mapped, nr);
> +	}
> +
> +	for (i = 0; i < nr; i++) {
> +		/* increment count (starts at -1) */
> +		atomic_set(&page->_mapcount, 0);
> +		__page_set_anon_rmap(folio, page, vma, address, 1);
> +		page++;
> +		address += PAGE_SIZE;
> +	}
> +
> +	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
It looks like you missed the __page_set_anon_rmap() call here.


Regards
Yin, Fengwei

> +
> +}
> +
>   /**
>    * page_add_file_rmap - add pte mapping to a file page
>    * @page:	the page to add the mapping to
> --
> 2.25.1
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 4/6] mm: Implement folio_add_new_anon_rmap_range()
  2023-03-17 10:58   ` Ryan Roberts
@ 2023-03-22  7:10     ` Yin Fengwei
  -1 siblings, 0 replies; 30+ messages in thread
From: Yin Fengwei @ 2023-03-22  7:10 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel

On 3/17/23 18:58, Ryan Roberts wrote:
> Like folio_add_new_anon_rmap() but batch-rmaps all the pages belonging
> to a folio, for efficiency savings.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   include/linux/rmap.h |  2 ++
>   mm/rmap.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 45 insertions(+)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index b87d01660412..d1d731650ce8 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>   		unsigned long address);
>   void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>   		unsigned long address);
> +void folio_add_new_anon_rmap_range(struct folio *folio,
> +		struct vm_area_struct *vma, unsigned long address);
>   void page_add_file_rmap(struct page *, struct vm_area_struct *,
>   		bool compound);
>   void page_remove_rmap(struct page *, struct vm_area_struct *,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8632e02661ac..05a0c0a700e7 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1302,6 +1302,49 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>   	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>   }
> 
> +/**
> + * folio_add_new_anon_rmap_range - Add mapping to a new anonymous potentially
> + * large but definitely non-THP folio.
> + * @folio:      The folio to add the mapping to.
> + * @vma:        the vm area in which the mapping is added
> + * @address:    the user virtual address of the first page in the folio
> + *
> + * Like folio_add_new_anon_rmap() but must only be called for new *non-THP*
> + * folios. Like folio_add_new_anon_rmap(), the inc-and-test is bypassed and the
> + * folio does not have to be locked. All pages in the folio are individually
> + * accounted.
> + *
> + * As the folio is new, it's assumed to be mapped exclusively by a single
> + * process.
> + */
> +void folio_add_new_anon_rmap_range(struct folio *folio,
> +			struct vm_area_struct *vma, unsigned long address)
> +{
> +	int i;
> +	int nr = folio_nr_pages(folio);
> +	struct page *page = &folio->page;
> +
> +	VM_BUG_ON_VMA(address < vma->vm_start ||
> +		      address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
> +	__folio_set_swapbacked(folio);
> +
> +	if (folio_test_large(folio)) {
> +		/* increment count (starts at 0) */
> +		atomic_set(&folio->_nr_pages_mapped, nr);
> +	}
> +
> +	for (i = 0; i < nr; i++) {
> +		/* increment count (starts at -1) */
> +		atomic_set(&page->_mapcount, 0);
> +		__page_set_anon_rmap(folio, page, vma, address, 1);
My bad. You did call it here.

Regards
Yin, Fengwei

> +		page++;
> +		address += PAGE_SIZE;
> +	}
> +
> +	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
> +
> +}
> +
>   /**
>    * page_add_file_rmap - add pte mapping to a file page
>    * @page:	the page to add the mapping to
> --
> 2.25.1
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 4/6] mm: Implement folio_add_new_anon_rmap_range()
  2023-03-22  7:10     ` Yin Fengwei
@ 2023-03-22  7:42       ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-22  7:42 UTC (permalink / raw)
  To: Yin Fengwei, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel

On 22/03/2023 07:10, Yin Fengwei wrote:
> On 3/17/23 18:58, Ryan Roberts wrote:
>> Like folio_add_new_anon_rmap() but batch-rmaps all the pages belonging
>> to a folio, for efficiency savings.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   include/linux/rmap.h |  2 ++
>>   mm/rmap.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 45 insertions(+)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index b87d01660412..d1d731650ce8 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct
>> vm_area_struct *,
>>           unsigned long address);
>>   void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>>           unsigned long address);
>> +void folio_add_new_anon_rmap_range(struct folio *folio,
>> +        struct vm_area_struct *vma, unsigned long address);
>>   void page_add_file_rmap(struct page *, struct vm_area_struct *,
>>           bool compound);
>>   void page_remove_rmap(struct page *, struct vm_area_struct *,
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 8632e02661ac..05a0c0a700e7 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1302,6 +1302,49 @@ void folio_add_new_anon_rmap(struct folio *folio,
>> struct vm_area_struct *vma,
>>       __page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>>   }
>>
>> +/**
>> + * folio_add_new_anon_rmap_range - Add mapping to a new anonymous potentially
>> + * large but definitely non-THP folio.
>> + * @folio:      The folio to add the mapping to.
>> + * @vma:        the vm area in which the mapping is added
>> + * @address:    the user virtual address of the first page in the folio
>> + *
>> + * Like folio_add_new_anon_rmap() but must only be called for new *non-THP*
>> + * folios. Like folio_add_new_anon_rmap(), the inc-and-test is bypassed and the
>> + * folio does not have to be locked. All pages in the folio are individually
>> + * accounted.
>> + *
>> + * As the folio is new, it's assumed to be mapped exclusively by a single
>> + * process.
>> + */
>> +void folio_add_new_anon_rmap_range(struct folio *folio,
>> +            struct vm_area_struct *vma, unsigned long address)
>> +{
>> +    int i;
>> +    int nr = folio_nr_pages(folio);
>> +    struct page *page = &folio->page;
>> +
>> +    VM_BUG_ON_VMA(address < vma->vm_start ||
>> +              address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>> +    __folio_set_swapbacked(folio);
>> +
>> +    if (folio_test_large(folio)) {
>> +        /* increment count (starts at 0) */
>> +        atomic_set(&folio->_nr_pages_mapped, nr);
>> +    }
>> +
>> +    for (i = 0; i < nr; i++) {
>> +        /* increment count (starts at -1) */
>> +        atomic_set(&page->_mapcount, 0);
>> +        __page_set_anon_rmap(folio, page, vma, address, 1);
> My bad. You did call it here.

Yes, calling it per subpage to ensure every subpage is marked AnonExclusive.
Although this does rely on calling it _first_ for the head page so that the
index is set correctly. I think that all works out though.

I did wonder if the order of the calls (__page_set_anon_rmap() vs
__lruvec_stat_mod_folio()) might matter - I've swapped them. But I haven't found
any evidence that it does from reviewing the code.
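
To make that ordering concrete, here is a tiny self-contained model of the
dependency (illustrative only; toy_folio/toy_set_anon_rmap are made-up names,
not the kernel helper): whichever call comes first is the one that turns the
folio anonymous and records the index from the address it is handed, so it has
to be the head page at the folio's start address; the remaining calls only mark
their own subpage.

/* Toy model of the head-first ordering above - not kernel code. */
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SHIFT	12
#define TOY_NR_PAGES	4

struct toy_folio {
	bool anon;			/* set by the first call only */
	unsigned long index;		/* derived from that first address */
	bool exclusive[TOY_NR_PAGES];	/* per-subpage "AnonExclusive" marker */
};

static void toy_set_anon_rmap(struct toy_folio *folio, int subpage,
			      unsigned long address)
{
	if (!folio->anon) {
		/* Only the first caller establishes folio-wide state. */
		folio->anon = true;
		folio->index = address >> TOY_PAGE_SHIFT;
	}
	folio->exclusive[subpage] = true;
}

int main(void)
{
	struct toy_folio folio = { 0 };
	unsigned long start = 0x40000;	/* naturally aligned folio start */
	int i;

	/* Head page first, then each tail, as in the loop above. */
	for (i = 0; i < TOY_NR_PAGES; i++)
		toy_set_anon_rmap(&folio, i,
				  start + ((unsigned long)i << TOY_PAGE_SHIFT));

	assert(folio.index == start >> TOY_PAGE_SHIFT);	/* index came from the head */
	for (i = 0; i < TOY_NR_PAGES; i++)
		assert(folio.exclusive[i]);		/* every subpage is marked */
	return 0;
}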

> 
> Regards
> Yin, Fengwei
> 
>> +        page++;
>> +        address += PAGE_SIZE;
>> +    }
>> +
>> +    __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
>> +
>> +}
>> +
>>   /**
>>    * page_add_file_rmap - add pte mapping to a file page
>>    * @page:    the page to add the mapping to
>> -- 
>> 2.25.1
>>
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 6/6] WORKAROUND: Don't split large folios on madvise
  2023-03-17 10:58   ` Ryan Roberts
@ 2023-03-22  8:19     ` Yin Fengwei
  -1 siblings, 0 replies; 30+ messages in thread
From: Yin Fengwei @ 2023-03-22  8:19 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel

On 3/17/23 18:58, Ryan Roberts wrote:
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   mm/madvise.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 340125d08c03..8fb84da744e1 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -447,6 +447,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>   		 * are sure it's worth. Split it if we are only owner.
>   		 */
>   		if (folio_test_large(folio)) {
> +#if 0
>   			if (folio_mapcount(folio) != 1)
>   				break;
>   			if (pageout_anon_only_filter && !folio_test_anon(folio))
> @@ -469,6 +470,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>   			pte--;
>   			addr -= PAGE_SIZE;
>   			continue;
> +#else
> +			break;
> +#endif
>   		}
> 
>   		/*
> @@ -664,6 +668,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>   		 * deactivate all pages.
>   		 */
>   		if (folio_test_large(folio)) {
> +#if 0
>   			if (folio_mapcount(folio) != 1)
>   				goto out;
>   			folio_get(folio);
> @@ -684,6 +689,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>   			pte--;
>   			addr -= PAGE_SIZE;
>   			continue;
> +#else
> +			goto out;
> +#endif
>   		}
 From this workaround change, you hit a case where a large folio has a
folio_mapcount() of 1? Can you share the kernel crash log with me? Thanks.


Regards
Yin, Fengwei

> 
>   		if (folio_test_swapcache(folio) || folio_test_dirty(folio)) {
> --
> 2.25.1
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 6/6] WORKAROUND: Don't split large folios on madvise
  2023-03-22  8:19     ` Yin Fengwei
@ 2023-03-22  8:59       ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-22  8:59 UTC (permalink / raw)
  To: Yin Fengwei, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel

On 22/03/2023 08:19, Yin Fengwei wrote:
> On 3/17/23 18:58, Ryan Roberts wrote:
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   mm/madvise.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 340125d08c03..8fb84da744e1 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -447,6 +447,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>            * are sure it's worth. Split it if we are only owner.
>>            */
>>           if (folio_test_large(folio)) {
>> +#if 0
>>               if (folio_mapcount(folio) != 1)
>>                   break;
>>               if (pageout_anon_only_filter && !folio_test_anon(folio))
>> @@ -469,6 +470,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>               pte--;
>>               addr -= PAGE_SIZE;
>>               continue;
>> +#else
>> +            break;
>> +#endif
>>           }
>>
>>           /*
>> @@ -664,6 +668,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>            * deactivate all pages.
>>            */
>>           if (folio_test_large(folio)) {
>> +#if 0
>>               if (folio_mapcount(folio) != 1)
>>                   goto out;
>>               folio_get(folio);
>> @@ -684,6 +689,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
>> long addr,
>>               pte--;
>>               addr -= PAGE_SIZE;
>>               continue;
>> +#else
>> +            goto out;
>> +#endif
>>           }
> From this workaround change, you hit a case where a large folio has a
> folio_mapcount() of 1? Can you share the kernel crash log with me? Thanks.

Yes I do. I'm not sure why the mapcount is decreasing. I thought perhaps it
could be due to CoW or explicit munmap, or something like that. I've been trying
to find the reason that the mapcount is being reduced by using the page_ref
tracepoints, but it's proving difficult.
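
(For reference, the tracing setup itself is simple; a rough userspace helper to
switch those events on might look like the below. This assumes
CONFIG_DEBUG_PAGE_REF=y and tracefs mounted at /sys/kernel/tracing - adjust the
paths if your setup uses /sys/kernel/debug/tracing instead.)

#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(void)
{
	/* Enable the whole page_ref event group, then turn tracing on. */
	if (write_str("/sys/kernel/tracing/events/page_ref/enable", "1"))
		return 1;
	if (write_str("/sys/kernel/tracing/tracing_on", "1"))
		return 1;
	puts("page_ref events enabled; read /sys/kernel/tracing/trace_pipe");
	return 0;
}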

The crash logs are somewhat random. But I'll send some to you separately.

Thanks for taking a look!

> 
> 
> Regards
> Yin, Fengwei
> 
>>
>>           if (folio_test_swapcache(folio) || folio_test_dirty(folio)) {
>> -- 
>> 2.25.1
>>
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/6] variable-order, large folios for anonymous memory
  2023-03-17 10:57 ` Ryan Roberts
@ 2023-03-22 12:03   ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-22 12:03 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox (Oracle), Yin, Fengwei, Yu Zhao
  Cc: linux-mm, linux-arm-kernel

Hi Matthew,

On 17/03/2023 10:57, Ryan Roberts wrote:
> Hi All,
> 
> [...]
> 
> Bug(s)
> ======
> 
> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these are
> relating to invalid kernel addresses (which usually look like either NULL +
> small offset or mostly zeros with a few mid-order bits set + a small offset) or
> lockdep complaining about a bad unlock balance. Call stacks are often in
> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
> email example oopses out separately if anyone wants to review them). My hunch is
> that struct pages adjacent to the folio are being corrupted, but don't have hard
> evidence.
> 
> When adding the workaround patch, which prevents madvise_free_pte_range() from
> attempting to split a large folio, I never see any issues. Although I'm not
> putting the system under memory pressure so guess I might see the same types of
> problem crop up under swap, etc.
> 
> I've reviewed most of the code within split_folio() and can't find any smoking
> gun, but I wonder if there are implicit assumptions about the large folio being
> PMD sized that I'm obviously breaking now?
> 
> The code in madvise_free_pte_range():
> 
> 	if (folio_test_large(folio)) {
> 		if (folio_mapcount(folio) != 1)
> 			goto out;
> 		folio_get(folio);
> 		if (!folio_trylock(folio)) {
> 			folio_put(folio);
> 			goto out;
> 		}
> 		pte_unmap_unlock(orig_pte, ptl);
> 		if (split_folio(folio)) {
> 			folio_unlock(folio);
> 			folio_put(folio);
> 			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> 			goto out;
> 		}
> 		...
> 	}

I've noticed that it's folio_split() with a folio order of 1 that causes my
problems. And I also see that the page cache code explicitly never
allocates order-1 folios:

void page_cache_ra_order(struct readahead_control *ractl,
		struct file_ra_state *ra, unsigned int new_order)
{
	...

	while (index <= limit) {
		unsigned int order = new_order;

		/* Align with smaller pages if needed */
		if (index & ((1UL << order) - 1)) {
			order = __ffs(index);
			if (order == 1)
				order = 0;
		}
		/* Don't allocate pages past EOF */
		while (index + (1UL << order) - 1 > limit) {
			if (--order == 1)
				order = 0;
		}
		err = ra_alloc_folio(ractl, index, mark, order, gfp);
		if (err)
			break;
		index += 1UL << order;
	}

	...
}

Matthew, what is the reason for this? I suspect it's guarding against the same
problem I'm seeing.

If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
any oops/panic/etc. I'd just like to understand the root cause.
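
(For reference, one possible shape for that clamp, sketched against the
calc_anonymous_folio_order() helper from patch 5/6 - illustrative and untested,
not necessarily the exact change I ran with:)

	for (; order > 0; order--) {
		/* Never attempt order-1; fall straight back to a single page. */
		if (order == 1) {
			order = 0;
			break;
		}

		nr = 1 << order;
		addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
		...
	}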

Thanks,
Ryan



> 
> Will normally skip my large folios because they have a mapcount > 1, due to
> incrementing mapcount for each pte, unlike PMD mapped pages. But on occasion it
> will see a mapcount of 1 and proceed. So I guess this is racing against reclaim
> or CoW in this case?
> 
> I also see its doing a dance to take the folio lock and drop the ptl. Perhaps my
> large anon folio is not using the folio lock in the same way as a THP would and
> we are therefore not getting the expected serialization?
> 
> I'd really appreciate any suggestions for how to progress here!
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/6] variable-order, large folios for anonymous memory
  2023-03-22 12:03   ` Ryan Roberts
@ 2023-03-22 13:36     ` Yin, Fengwei
  -1 siblings, 0 replies; 30+ messages in thread
From: Yin, Fengwei @ 2023-03-22 13:36 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel



On 3/22/2023 8:03 PM, Ryan Roberts wrote:
> Hi Matthew,
> 
> On 17/03/2023 10:57, Ryan Roberts wrote:
>> Hi All,
>>
>> [...]
>>
>> Bug(s)
>> ======
>>
>> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
>> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these are
>> relating to invalid kernel addresses (which usually look like either NULL +
>> small offset or mostly zeros with a few mid-order bits set + a small offset) or
>> lockdep complaining about a bad unlock balance. Call stacks are often in
>> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
>> email example oopses out separately if anyone wants to review them). My hunch is
>> that struct pages adjacent to the folio are being corrupted, but don't have hard
>> evidence.
>>
>> When adding the workaround patch, which prevents madvise_free_pte_range() from
>> attempting to split a large folio, I never see any issues. Although I'm not
>> putting the system under memory pressure so guess I might see the same types of
>> problem crop up under swap, etc.
>>
>> I've reviewed most of the code within split_folio() and can't find any smoking
>> gun, but I wonder if there are implicit assumptions about the large folio being
>> PMD sized that I'm obviously breaking now?
>>
>> The code in madvise_free_pte_range():
>>
>> 	if (folio_test_large(folio)) {
>> 		if (folio_mapcount(folio) != 1)
>> 			goto out;
>> 		folio_get(folio);
>> 		if (!folio_trylock(folio)) {
>> 			folio_put(folio);
>> 			goto out;
>> 		}
>> 		pte_unmap_unlock(orig_pte, ptl);
>> 		if (split_folio(folio)) {
>> 			folio_unlock(folio);
>> 			folio_put(folio);
>> 			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>> 			goto out;
>> 		}
>> 		...
>> 	}
> 
> I've noticed that it's folio_split() with a folio order of 1 that causes my
> problems. And I also see that the page cache code explicitly never
> allocates order-1 folios:
> 
> void page_cache_ra_order(struct readahead_control *ractl,
> 		struct file_ra_state *ra, unsigned int new_order)
> {
> 	...
> 
> 	while (index <= limit) {
> 		unsigned int order = new_order;
> 
> 		/* Align with smaller pages if needed */
> 		if (index & ((1UL << order) - 1)) {
> 			order = __ffs(index);
> 			if (order == 1)
> 				order = 0;
> 		}
> 		/* Don't allocate pages past EOF */
> 		while (index + (1UL << order) - 1 > limit) {
> 			if (--order == 1)
> 				order = 0;
> 		}
> 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
> 		if (err)
> 			break;
> 		index += 1UL << order;
> 	}
> 
> 	...
> }
> 
> Matthew, what is the reason for this? I suspect it's guarding against the same
> problem I'm seeing.
> 
> If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
> any oops/panic/etc. I'd just like to understand the root cause.
Checked the struct folio definition. The _deferred_list is in the third page
struct. My understanding is that to support folio split, the folio order must
be >= 2. Thanks.
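
A standalone illustration of the arithmetic (plain userspace C, not kernel
code; it only restates the point above that the _deferred_list slot lives in
the third struct page, i.e. struct-page index 2):

#include <stdio.h>

int main(void)
{
	unsigned int order;

	for (order = 0; order <= 4; order++) {
		unsigned int nr_pages = 1u << order;

		printf("order %u -> %2u struct pages -> _deferred_list slot %s\n",
		       order, nr_pages, nr_pages > 2 ? "present" : "absent");
	}
	return 0;
}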


Regards
Yin, Fengwei

> 
> Thanks,
> Ryan
> 
> 
> 
>>
>> Will normally skip my large folios because they have a mapcount > 1, due to
>> incrementing mapcount for each pte, unlike PMD mapped pages. But on occasion it
>> will see a mapcount of 1 and proceed. So I guess this is racing against reclaim
>> or CoW in this case?
>>
>> I also see its doing a dance to take the folio lock and drop the ptl. Perhaps my
>> large anon folio is not using the folio lock in the same way as a THP would and
>> we are therefore not getting the expected serialization?
>>
>> I'd really appreciate any suggestions for how to progress here!
>>
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/6] variable-order, large folios for anonymous memory
@ 2023-03-22 13:36     ` Yin, Fengwei
  0 siblings, 0 replies; 30+ messages in thread
From: Yin, Fengwei @ 2023-03-22 13:36 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel



On 3/22/2023 8:03 PM, Ryan Roberts wrote:
> Hi Matthew,
> 
> On 17/03/2023 10:57, Ryan Roberts wrote:
>> Hi All,
>>
>> [...]
>>
>> Bug(s)
>> ======
>>
>> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
>> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these are
>> relating to invalid kernel addresses (which usually look like either NULL +
>> small offset or mostly zeros with a few mid-order bits set + a small offset) or
>> lockdep complaining about a bad unlock balance. Call stacks are often in
>> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
>> email example oopses out separately if anyone wants to review them). My hunch is
>> that struct pages adjacent to the folio are being corrupted, but don't have hard
>> evidence.
>>
>> When adding the workaround patch, which prevents madvise_free_pte_range() from
>> attempting to split a large folio, I never see any issues. Although I'm not
>> putting the system under memory pressure so guess I might see the same types of
>> problem crop up under swap, etc.
>>
>> I've reviewed most of the code within split_folio() and can't find any smoking
>> gun, but I wonder if there are implicit assumptions about the large folio being
>> PMD sized that I'm obviously breaking now?
>>
>> The code in madvise_free_pte_range():
>>
>> 	if (folio_test_large(folio)) {
>> 		if (folio_mapcount(folio) != 1)
>> 			goto out;
>> 		folio_get(folio);
>> 		if (!folio_trylock(folio)) {
>> 			folio_put(folio);
>> 			goto out;
>> 		}
>> 		pte_unmap_unlock(orig_pte, ptl);
>> 		if (split_folio(folio)) {
>> 			folio_unlock(folio);
>> 			folio_put(folio);
>> 			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>> 			goto out;
>> 		}
>> 		...
>> 	}
> 
> I've noticed that its folio_split() with a folio order of 1 that causes my
> problems. And I also see that the page cache code always explicitly never
> allocates order-1 folios:
> 
> void page_cache_ra_order(struct readahead_control *ractl,
> 		struct file_ra_state *ra, unsigned int new_order)
> {
> 	...
> 
> 	while (index <= limit) {
> 		unsigned int order = new_order;
> 
> 		/* Align with smaller pages if needed */
> 		if (index & ((1UL << order) - 1)) {
> 			order = __ffs(index);
> 			if (order == 1)
> 				order = 0;
> 		}
> 		/* Don't allocate pages past EOF */
> 		while (index + (1UL << order) - 1 > limit) {
> 			if (--order == 1)
> 				order = 0;
> 		}
> 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
> 		if (err)
> 			break;
> 		index += 1UL << order;
> 	}
> 
> 	...
> }
> 
> Matthew, what is the reason for this? I suspect its guarding against the same
> problem I'm seeing.
> 
> If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
> any oops/panic/etc. I'd just like to understand the root cause.
Checked the struct folio definition. The _deferred_list is in third page struct.
My understanding is to support folio split, the folio order must >= 2. Thanks.


Regards
Yin, Fengwei

> 
> Thanks,
> Ryan
> 
> 
> 
>>
>> Will normally skip my large folios because they have a mapcount > 1, due to
>> incrementing mapcount for each pte, unlike PMD mapped pages. But on occasion it
>> will see a mapcount of 1 and proceed. So I guess this is racing against reclaim
>> or CoW in this case?
>>
>> I also see its doing a dance to take the folio lock and drop the ptl. Perhaps my
>> large anon folio is not using the folio lock in the same way as a THP would and
>> we are therefore not getting the expected serialization?
>>
>> I'd really appreciate any suggestions for how to progress here!
>>
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/6] variable-order, large folios for anonymous memory
  2023-03-22 13:36     ` Yin, Fengwei
@ 2023-03-22 14:25       ` Ryan Roberts
  -1 siblings, 0 replies; 30+ messages in thread
From: Ryan Roberts @ 2023-03-22 14:25 UTC (permalink / raw)
  To: Yin, Fengwei, Andrew Morton, Matthew Wilcox (Oracle), Yu Zhao
  Cc: linux-mm, linux-arm-kernel

On 22/03/2023 13:36, Yin, Fengwei wrote:
> 
> 
> On 3/22/2023 8:03 PM, Ryan Roberts wrote:
>> Hi Matthew,
>>
>> On 17/03/2023 10:57, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> [...]
>>>
>>> Bug(s)
>>> ======
>>>
>>> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
>>> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these are
>>> relating to invalid kernel addresses (which usually look like either NULL +
>>> small offset or mostly zeros with a few mid-order bits set + a small offset) or
>>> lockdep complaining about a bad unlock balance. Call stacks are often in
>>> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
>>> email example oopses out separately if anyone wants to review them). My hunch is
>>> that struct pages adjacent to the folio are being corrupted, but don't have hard
>>> evidence.
>>>
>>> When adding the workaround patch, which prevents madvise_free_pte_range() from
>>> attempting to split a large folio, I never see any issues. Although I'm not
>>> putting the system under memory pressure so guess I might see the same types of
>>> problem crop up under swap, etc.
>>>
>>> I've reviewed most of the code within split_folio() and can't find any smoking
>>> gun, but I wonder if there are implicit assumptions about the large folio being
>>> PMD sized that I'm obviously breaking now?
>>>
>>> The code in madvise_free_pte_range():
>>>
>>> 	if (folio_test_large(folio)) {
>>> 		if (folio_mapcount(folio) != 1)
>>> 			goto out;
>>> 		folio_get(folio);
>>> 		if (!folio_trylock(folio)) {
>>> 			folio_put(folio);
>>> 			goto out;
>>> 		}
>>> 		pte_unmap_unlock(orig_pte, ptl);
>>> 		if (split_folio(folio)) {
>>> 			folio_unlock(folio);
>>> 			folio_put(folio);
>>> 			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>>> 			goto out;
>>> 		}
>>> 		...
>>> 	}
>>
>> I've noticed that it's folio_split() with a folio order of 1 that causes my
>> problems. And I also see that the page cache code explicitly never allocates
>> order-1 folios:
>>
>> void page_cache_ra_order(struct readahead_control *ractl,
>> 		struct file_ra_state *ra, unsigned int new_order)
>> {
>> 	...
>>
>> 	while (index <= limit) {
>> 		unsigned int order = new_order;
>>
>> 		/* Align with smaller pages if needed */
>> 		if (index & ((1UL << order) - 1)) {
>> 			order = __ffs(index);
>> 			if (order == 1)
>> 				order = 0;
>> 		}
>> 		/* Don't allocate pages past EOF */
>> 		while (index + (1UL << order) - 1 > limit) {
>> 			if (--order == 1)
>> 				order = 0;
>> 		}
>> 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
>> 		if (err)
>> 			break;
>> 		index += 1UL << order;
>> 	}
>>
>> 	...
>> }
>>
>> Matthew, what is the reason for this? I suspect it's guarding against the same
>> problem I'm seeing.
>>
>> If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
>> any oops/panic/etc. I'd just like to understand the root cause.
> I checked the struct folio definition. The _deferred_list is in the third page
> struct. My understanding is that, to support folio split, the folio order must
> be >= 2. Thanks.

Yep, looks like we have found the root cause - thanks for your help!

I've updated calc_anonymous_folio_order() to only use a non-zero order if THP is
available, and in that case to never allocate order-1. I think that both fixes
the problem and handles the dependency we have on THP:

static void calc_anonymous_folio_order(struct vm_fault *vmf,
				       int *order_out,
				       unsigned long *addr_out)
{
	/*
	 * The aim here is to determine what size of folio we should allocate
	 * for this fault. Factors include:
	 * - Folio must be naturally aligned within VA space
	 * - Folio must not breach boundaries of vma
	 * - Folio must be fully contained inside one pmd entry
	 * - Folio must not overlap any non-none ptes
	 * - Order must not be higher than *order_out upon entry
	 *
	 * Additionally, we do not allow order-1 since this breaks assumptions
	 * elsewhere in the mm; THP pages must be at least order-2 (since they
	 * store state up to the 3rd struct page subpage), and these pages must
	 * be THP in order to correctly use pre-existing THP infrastructure such
	 * as folio_split().
	 *
	 * As a consequence of relying on the THP infrastructure, if the system
	 * does not support THP, we always fall back to order-0.
	 *
	 * Note that the caller may or may not choose to lock the pte. If
	 * unlocked, the calculation should be considered an estimate that will
	 * need to be validated under the lock.
	 */

	struct vm_area_struct *vma = vmf->vma;
	int nr;
	int order;
	unsigned long addr;
	pte_t *pte;
	pte_t *first_set = NULL;
	int ret;

	if (has_transparent_hugepage()) {
		order = min(*order_out, PMD_SHIFT - PAGE_SHIFT);

		for (; order > 1; order--) {
			nr = 1 << order;
			addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
			pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);

			/* Check vma bounds. */
			if (addr < vma->vm_start ||
			    addr + nr * PAGE_SIZE > vma->vm_end)
				continue;

			/* Ptes covered by order already known to be none. */
			if (pte + nr <= first_set)
				break;

			/* Already found set pte in range covered by order. */
			if (pte <= first_set)
				continue;

			/* Need to check if all the ptes are none. */
			ret = check_all_ptes_none(pte, nr);
			if (ret == nr)
				break;

			first_set = pte + ret;
		}

		if (order == 1)
			order = 0;
	} else {
		order = 0;
	}

	*order_out = order;
	*addr_out = order > 0 ? addr : vmf->address;
}
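
For reference, the intended use of the two out-parameters at the call site is
roughly as below. This is an illustrative sketch only, not the patch code: the
starting order of 4 matches the hard-coded value from the cover letter, and
vma_alloc_folio() stands in for the series' try-with-smaller-orders allocation
helper:

	int order = 4;		/* hard-coded starting order, for now */
	unsigned long addr;
	struct folio *folio;

	calc_anonymous_folio_order(vmf, &order, &addr);

	/* addr is the naturally aligned base the folio will map; order may
	 * have been clamped, possibly down to 0. */
	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, order, vmf->vma, addr, false);
	if (!folio) {
		/* retry with successively smaller orders, down to order-0 */
	}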



> 
> 
> Regards
> Yin, Fengwei
> 
>>
>> Thanks,
>> Ryan
>>
>>
>>
>>>
>>> This code will normally skip my large folios because they have a mapcount > 1,
>>> due to incrementing mapcount for each pte, unlike PMD-mapped pages. But on
>>> occasion it will see a mapcount of 1 and proceed. So I guess this is racing
>>> against reclaim or CoW in this case?
>>>
>>> I also see it's doing a dance to take the folio lock and drop the ptl. Perhaps my
>>> large anon folio is not using the folio lock in the same way as a THP would and
>>> we are therefore not getting the expected serialization?
>>>
>>> I'd really appreciate any suggestions for how to progress here!
>>>
>>



^ permalink raw reply	[flat|nested] 30+ messages in thread


end of thread, other threads:[~2023-03-22 14:26 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
2023-03-17 10:57 [RFC PATCH 0/6] variable-order, large folios for anonymous memory Ryan Roberts
2023-03-17 10:57 ` Ryan Roberts
2023-03-17 10:57 ` [RFC PATCH 1/6] mm: Expose clear_huge_page() unconditionally Ryan Roberts
2023-03-17 10:57   ` Ryan Roberts
2023-03-17 10:57 ` [RFC PATCH 2/6] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio() Ryan Roberts
2023-03-17 10:57   ` Ryan Roberts
2023-03-17 10:57 ` [RFC PATCH 3/6] mm: Introduce try_vma_alloc_zeroed_movable_folio() Ryan Roberts
2023-03-17 10:57   ` Ryan Roberts
2023-03-17 10:58 ` [RFC PATCH 4/6] mm: Implement folio_add_new_anon_rmap_range() Ryan Roberts
2023-03-17 10:58   ` Ryan Roberts
2023-03-22  6:59   ` Yin Fengwei
2023-03-22  6:59     ` Yin Fengwei
2023-03-22  7:10   ` Yin Fengwei
2023-03-22  7:10     ` Yin Fengwei
2023-03-22  7:42     ` Ryan Roberts
2023-03-22  7:42       ` Ryan Roberts
2023-03-17 10:58 ` [RFC PATCH 5/6] mm: Allocate large folios for anonymous memory Ryan Roberts
2023-03-17 10:58   ` Ryan Roberts
2023-03-17 10:58 ` [RFC PATCH 6/6] WORKAROUND: Don't split large folios on madvise Ryan Roberts
2023-03-17 10:58   ` Ryan Roberts
2023-03-22  8:19   ` Yin Fengwei
2023-03-22  8:19     ` Yin Fengwei
2023-03-22  8:59     ` Ryan Roberts
2023-03-22  8:59       ` Ryan Roberts
2023-03-22 12:03 ` [RFC PATCH 0/6] variable-order, large folios for anonymous memory Ryan Roberts
2023-03-22 12:03   ` Ryan Roberts
2023-03-22 13:36   ` Yin, Fengwei
2023-03-22 13:36     ` Yin, Fengwei
2023-03-22 14:25     ` Ryan Roberts
2023-03-22 14:25       ` Ryan Roberts
