From: Ryan Roberts <ryan.roberts@arm.com>
To: "Matthew Wilcox (Oracle)" <willy@infradead.org>, linux-mm@kvack.org
Subject: Re: [RFC PATCH 00/14] Rearrange batched folio freeing
Date: Mon, 4 Sep 2023 14:25:41 +0100
Message-ID: <eeaf36cf-8e29-4de2-9e5a-9ec2a5e30c61@arm.com>
In-Reply-To: <20230825135918.4164671-1-willy@infradead.org>

On 25/08/2023 14:59, Matthew Wilcox (Oracle) wrote:
> Other than the obvious "remove calls to compound_head" changes, the
> fundamental belief here is that iterating a linked list is much slower
> than iterating an array (5-15x slower in my testing).  There's also
> an associated belief that since we iterate the batch of folios three
> times, we do better when the array is small (ie 15 entries) than we do
> with a batch that is hundreds of entries long, which only gives us the
> opportunity for the first pages to fall out of cache by the time we get
> to the end.
> 
> The one place where that probably falls down is "Free folios in a batch
> in shrink_folio_list()" where we'll flush the TLB once per batch instead
> of at the end.  That's going to take some benchmarking.
> 
> Matthew Wilcox (Oracle) (14):
>   mm: Make folios_put() the basis of release_pages()
>   mm: Convert free_unref_page_list() to use folios
>   mm: Add free_unref_folios()
>   mm: Use folios_put() in __folio_batch_release()
>   memcg: Add mem_cgroup_uncharge_folios()
>   mm: Remove use of folio list from folios_put()
>   mm: Use free_unref_folios() in put_pages_list()
>   mm: use __page_cache_release() in folios_put()
>   mm: Handle large folios in free_unref_folios()
>   mm: Allow non-hugetlb large folios to be batch processed
>   mm: Free folios in a batch in shrink_folio_list()
>   mm: Free folios directly in move_folios_to_lru()
>   memcg: Remove mem_cgroup_uncharge_list()
>   mm: Remove free_unref_page_list()
> 
>  include/linux/memcontrol.h |  24 ++---
>  include/linux/mm.h         |  19 +---
>  mm/internal.h              |   4 +-
>  mm/memcontrol.c            |  16 ++--
>  mm/mlock.c                 |   3 +-
>  mm/page_alloc.c            |  74 ++++++++-------
>  mm/swap.c                  | 180 ++++++++++++++++++++-----------------
>  mm/vmscan.c                |  51 +++++------
>  8 files changed, 181 insertions(+), 190 deletions(-)
> 

Hi Matthew,

I've been doing some benchmarking of this series, as promised, but have hit an oops. It doesn't appear to be easily reproducible, and I'm struggling to figure out the root cause, so I thought I would share it in case you have any ideas.


--->8---

[  205.942771] ================================================================================
[  205.951201] UBSAN: array-index-out-of-bounds in mm/page_alloc.c:668:46
[  205.957716] index 10283 is out of range for type 'list_head [6]'
[  205.963774] ================================================================================
[  205.972200] Unable to handle kernel paging request at virtual address 006808190c06670f
[  205.980103] Mem abort info:
[  205.982884]   ESR = 0x0000000096000044
[  205.986620]   EC = 0x25: DABT (current EL), IL = 32 bits
[  205.991918]   SET = 0, FnV = 0
[  205.994959]   EA = 0, S1PTW = 0
[  205.998088]   FSC = 0x04: level 0 translation fault
[  206.002952] Data abort info:
[  206.005819]   ISV = 0, ISS = 0x00000044, ISS2 = 0x00000000
[  206.011291]   CM = 0, WnR = 1, TnD = 0, TagAccess = 0
[  206.016329]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  206.021627] [006808190c06670f] address between user and kernel address ranges
[  206.028749] Internal error: Oops: 0000000096000044 [#1] SMP
[  206.034310] Modules linked in: nfs lockd grace sunrpc fscache netfs nls_iso8859_1 scsi_dh_rdac scsi_dh_emc scsi_dh_alua drm xfs btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core mlx5_core mlxfw pci_hyperv_intf crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce tls nvme psample nvme_core aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[  206.075579] CPU: 43 PID: 159703 Comm: git Not tainted 6.5.0-rc4-ryarob01-all #1
[  206.082875] Hardware name: WIWYNN Mt.Jade Server System B81.03001.0005/Mt.Jade Motherboard, BIOS 1.08.20220218 (SCP: 1.08.20220218) 2022/02/18
[  206.095638] pstate: 004000c9 (nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  206.102587] pc : free_pcppages_bulk+0x330/0x7f8
[  206.107104] lr : free_pcppages_bulk+0x7a8/0x7f8
[  206.111622] sp : ffff8000aeef3680
[  206.114923] x29: ffff8000aeef3680 x28: 000000000000282b x27: 00000000000000fc
[  206.122046] x26: 000000008015a39a x25: ffff08181ef9e840 x24: ffff0818836caf80
[  206.129169] x23: 0000000000000001 x22: 0000000000000000 x21: ffff08181ef9e850
[  206.136292] x20: fffffc200368e680 x19: fffffc200368e6c0 x18: 0000000000000000
[  206.143414] x17: 3d3d3d3d3d3d3d3d x16: 3d3d3d3d3d3d3d3d x15: 3d3d3d3d3d3d3d3d
[  206.150537] x14: 3d3d3d3d3d3d3d3d x13: 3d3d3d3d3d3d3d3d x12: 3d3d3d3d3d3d3d3d
[  206.157659] x11: 3d3d3d3d3d3d3d3d x10: 3d3d3d3d3d3d3d3d x9 : fffffc200368e688
[  206.164782] x8 : fffffc200368e680 x7 : 205d343737333639 x6 : ffff08181dee0000
[  206.171904] x5 : ffff0818836caf80 x4 : 0000000000000000 x3 : 0000000000000001
[  206.179027] x2 : ffff0818836f3330 x1 : ffff0818836f3230 x0 : 006808190c066707
[  206.186149] Call trace:
[  206.188582]  free_pcppages_bulk+0x330/0x7f8
[  206.192752]  free_unref_page_commit+0x15c/0x250
[  206.197270]  free_unref_folios+0x37c/0x4a8
[  206.201353]  release_unref_folios+0xac/0xf8
[  206.205524]  folios_put+0xe0/0x1f0
[  206.208914]  __folio_batch_release+0x34/0x88
[  206.213170]  truncate_inode_pages_range+0x160/0x540
[  206.218035]  truncate_inode_pages_final+0x58/0x90
[  206.222726]  ext4_evict_inode+0x164/0x900
[  206.226722]  evict+0xac/0x160
[  206.229678]  iput+0x170/0x228
[  206.232633]  do_unlinkat+0x1d0/0x290
[  206.236196]  __arm64_sys_unlinkat+0x48/0x98
[  206.240367]  invoke_syscall+0x74/0xf8
[  206.244016]  el0_svc_common.constprop.0+0x58/0x130
[  206.248793]  do_el0_svc+0x40/0xa8
[  206.252095]  el0_svc+0x2c/0xb8
[  206.255137]  el0t_64_sync_handler+0xc0/0xc8
[  206.259308]  el0t_64_sync+0x1a8/0x1b0
[  206.262958] Code: 8b0110a1 8b050305 8b010301 f9408020 (f9000409) 
[  206.269039] ---[ end trace 0000000000000000 ]---

--->8---


UBSAN is complaining about migratetype being out of range here:

/* Used for pages not on another list */
static inline void add_to_free_list(struct page *page, struct zone *zone,
				    unsigned int order, int migratetype)
{
	struct free_area *area = &zone->free_area[order];

	list_add(&page->buddy_list, &area->free_list[migratetype]);
	area->nr_free++;
}

I think that is called from __free_one_page(), which is in turn called from free_pcppages_bulk() at the top of the stack trace; the 'list_head [6]' that UBSAN reports is free_area's free_list[MIGRATE_TYPES]. migratetype originates from get_pcppage_migratetype(page), which just reads page->index. But I can't see where that might be getting corrupted, or how your changes or mine could affect it.
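
For context, the pcp migratetype helpers involved are just trivial wrappers around page->index; a rough paraphrase of what's in mm/page_alloc.c (the exact form in your tree may differ slightly):

static inline int get_pcppage_migratetype(struct page *page)
{
	/* For pages on a pcp list, the migratetype is stashed in page->index */
	return page->index;
}

static inline void set_pcppage_migratetype(struct page *page, int migratetype)
{
	page->index = migratetype;
}

So a value like 10283 presumably means page->index was scribbled on after set_pcppage_migratetype(), or the page was put on the pcp list without ever going through it.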




Config: this particular kernel actually has my large anon folios (LAF) and mmu_gather series applied, as well as this series from you. The issue is happening for file-backed memory, though, so I'm hoping it's not related to LAF.

I have modified your series in a couple of relevant ways, though:

 - I'm using `pcp_allowed_order(order)` instead of a direct compare against PAGE_ALLOC_COSTLY_ORDER in free_unref_folios(), as I suggested in the review (see the sketch after the code below).

 - I've refactored folios_put() as you suggested in another thread, to make the implementation of my free_folio_regions_and_swap_cache() simpler:


static void release_unref_folios(struct folio_batch *folios)
{
	struct lruvec *lruvec = NULL;
	unsigned long flags = 0;
	int i;

	for (i = 0; i < folios->nr; i++)
		__page_cache_release(folios->folios[i], &lruvec, &flags);

	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);

	mem_cgroup_uncharge_folios(folios);
	free_unref_folios(folios);
}

/**
 * free_folio_regions_and_swap_cache - free swap cache and put refs
 * @folios: array of `struct folio_region`s to release
 * @nr: number of folio regions in array
 *
 * Each `struct folio_region` describes the start and end pfn of a region within
 * a single folio. The folio's swap cache is freed, then the folio reference
 * count is decremented once for each pfn in the region. If it falls to zero,
 * remove the folio from the LRU and free it.
 */
void free_folio_regions_and_swap_cache(struct folio_region *folios, int nr)
{
	struct folio_batch fbatch;
	unsigned int i;

	folio_batch_init(&fbatch);
	lru_add_drain();

	for (i = 0; i < nr; i++) {
		struct folio *folio = pfn_folio(folios[i].pfn_start);
		int refs = folios[i].pfn_end - folios[i].pfn_start;

		free_swap_cache(folio);

		if (is_huge_zero_page(&folio->page))
			continue;

		if (folio_is_zone_device(folio)) {
			if (put_devmap_managed_page_refs(&folio->page, refs))
				continue;
			if (folio_ref_sub_and_test(folio, refs))
				free_zone_device_page(&folio->page);
			continue;
		}

		if (!folio_ref_sub_and_test(folio, refs))
			continue;

		/* hugetlb has its own memcg */
		if (folio_test_hugetlb(folio)) {
			free_huge_folio(folio);
			continue;
		}

		if (folio_batch_add(&fbatch, folio) == 0)
			release_unref_folios(&fbatch);
	}

	if (fbatch.nr)
		release_unref_folios(&fbatch);
}
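
For reference, the folio_region layout assumed above is just a start/end pfn pair; this is my shorthand for the structure from the pfn_range_put() discussion, so take the exact name and fields with a pinch of salt:

struct folio_region {
	unsigned long pfn_start;	/* first pfn of the region (inclusive) */
	unsigned long pfn_end;		/* one past the last pfn of the region */
};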

/**
 * folios_put - Decrement the reference count on a batch of folios.
 * @folios: The folios.
 *
 * Like folio_put(), but for a batch of folios.  This is more efficient
 * than writing the loop yourself as it will optimise the locks which need
 * to be taken if the folios are freed.  The folios batch is returned
 * empty and ready to be reused for another batch; there is no need to
 * reinitialise it.
 *
 * Context: May be called in process or interrupt context, but not in NMI
 * context.  May be called while holding a spinlock.
 */
void folios_put(struct folio_batch *folios)
{
	int i, j;

	for (i = 0, j = 0; i < folios->nr; i++) {
		struct folio *folio = folios->folios[i];

		if (is_huge_zero_page(&folio->page))
			continue;

		if (folio_is_zone_device(folio)) {
			if (put_devmap_managed_page(&folio->page))
				continue;
			if (folio_put_testzero(folio))
				free_zone_device_page(&folio->page);
			continue;
		}

		if (!folio_put_testzero(folio))
			continue;

		/* hugetlb has its own memcg */
		if (folio_test_hugetlb(folio)) {
			free_huge_folio(folio);
			continue;
		}

		if (j != i)
			folios->folios[j] = folio;
		j++;
	}

	folios->nr = j;
	if (!j)
		return;
	release_unref_folios(folios);
}
EXPORT_SYMBOL(folios_put);
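
For completeness, the pcp_allowed_order() change mentioned in the first bullet above just reuses the existing helper from mm/page_alloc.c, which (assuming CONFIG_TRANSPARENT_HUGEPAGE) looks roughly like:

static inline bool pcp_allowed_order(unsigned int order)
{
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return true;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	if (order == pageblock_order)
		return true;
#endif
	return false;
}

i.e. as well as orders up to PAGE_ALLOC_COSTLY_ORDER, pageblock_order (PMD-size) folios are also allowed onto the pcp lists, which is the case free_unref_folios() needs to handle for THPs.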


Let me know if you have any thoughts.

Thanks,
Ryan



