From: Ryan Roberts <ryan.roberts@arm.com>
To: "Yin, Fengwei" <fengwei.yin@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Yu Zhao <yuzhao@google.com>
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Subject: Re: [RFC PATCH 0/6] variable-order, large folios for anonymous memory
Date: Wed, 22 Mar 2023 14:25:27 +0000
Message-ID: <a66d7ac1-59cd-2a95-dd48-79d569583aff@arm.com>
In-Reply-To: <7b63b922-9c19-b1ed-a6f6-72e1b25f38f6@intel.com>

On 22/03/2023 13:36, Yin, Fengwei wrote:
> 
> 
> On 3/22/2023 8:03 PM, Ryan Roberts wrote:
>> Hi Matthew,
>>
>> On 17/03/2023 10:57, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> [...]
>>>
>>> Bug(s)
>>> ======
>>>
>>> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
>>> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these
>>> relate to invalid kernel addresses (which usually look like either NULL +
>>> a small offset, or mostly zeros with a few mid-order bits set + a small
>>> offset) or lockdep complaining about a bad unlock balance. Call stacks are
>>> often in madvise_free_pte_range(), but I've seen them in filesystem code
>>> too. (I can email example oopses out separately if anyone wants to review
>>> them.) My hunch is that struct pages adjacent to the folio are being
>>> corrupted, but I don't have hard evidence.
>>>
>>> When adding the workaround patch, which prevents madvise_free_pte_range()
>>> from attempting to split a large folio, I never see any issues. That said,
>>> I'm not putting the system under memory pressure, so I guess the same
>>> types of problem might crop up under swap, etc.
>>>
>>> I've reviewed most of the code within split_folio() and can't find a
>>> smoking gun, but I wonder if there are implicit assumptions about the
>>> large folio being PMD-sized that I'm now breaking?
>>>
>>> The code in madvise_free_pte_range():
>>>
>>> 	if (folio_test_large(folio)) {
>>> 		if (folio_mapcount(folio) != 1)
>>> 			goto out;
>>> 		folio_get(folio);
>>> 		if (!folio_trylock(folio)) {
>>> 			folio_put(folio);
>>> 			goto out;
>>> 		}
>>> 		pte_unmap_unlock(orig_pte, ptl);
>>> 		if (split_folio(folio)) {
>>> 			folio_unlock(folio);
>>> 			folio_put(folio);
>>> 			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>>> 			goto out;
>>> 		}
>>> 		...
>>> 	}
>>
>> I've noticed that it's split_folio() on an order-1 folio that causes my
>> problems. I also see that the page cache code explicitly avoids allocating
>> order-1 folios:
>>
>> void page_cache_ra_order(struct readahead_control *ractl,
>> 		struct file_ra_state *ra, unsigned int new_order)
>> {
>> 	...
>>
>> 	while (index <= limit) {
>> 		unsigned int order = new_order;
>>
>> 		/* Align with smaller pages if needed */
>> 		if (index & ((1UL << order) - 1)) {
>> 			order = __ffs(index);
>> 			if (order == 1)
>> 				order = 0;
>> 		}
>> 		/* Don't allocate pages past EOF */
>> 		while (index + (1UL << order) - 1 > limit) {
>> 			if (--order == 1)
>> 				order = 0;
>> 		}
>> 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
>> 		if (err)
>> 			break;
>> 		index += 1UL << order;
>> 	}
>>
>> 	...
>> }
>>
>> Matthew, what is the reason for this? I suspect it's guarding against the
>> same problem I'm seeing.
>>
>> If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
>> any oops/panic/etc. I'd just like to understand the root cause.
> I checked the struct folio definition. The _deferred_list is in the third
> struct page. My understanding is that to support folio split, the folio
> order must be >= 2. Thanks.

Yep, looks like we have found the root cause - thanks for your help!
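
For anyone following along, here is roughly the relevant part of struct folio
as I read it in include/linux/mm_types.h (a simplified sketch - the real
definition wraps each group of fields in a union with a struct page, and I've
elided most fields):

	struct folio {
		/* Fields overlaying struct page 0: flags, lru, mapping, ... */
		/* Fields overlaying struct page 1: _folio_order, _entire_mapcount, ... */

		/* Fields overlaying struct page 2: */
		unsigned long _flags_2a;
		unsigned long _head_2a;
		struct list_head _deferred_list;	/* deferred split queue */
	};

So _deferred_list lives in the third struct page of the folio. An order-1
folio only has two struct pages, meaning the deferred split code would write
into whatever struct page happens to follow the folio - which would also
explain the adjacent struct page corruption I suspected above.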

I've updated calc_anonymous_folio_order() to only use a non-0 order if THP is
available, and in that case to never allocate order-1. I think this both fixes
the problem and manages the dependency we have on THP:

static void calc_anonymous_folio_order(struct vm_fault *vmf,
				       int *order_out,
				       unsigned long *addr_out)
{
	/*
	 * The aim here is to determine what size of folio we should allocate
	 * for this fault. Factors include:
	 * - Folio must be naturally aligned within VA space
	 * - Folio must not breach boundaries of vma
	 * - Folio must be fully contained inside one pmd entry
	 * - Folio must not overlap any non-none ptes
	 * - Order must not be higher than *order_out upon entry
	 *
	 * Additionally, we do not allow order-1 since this breaks assumptions
	 * elsewhere in the mm; THP pages must be at least order-2 (since they
	 * store state in fields overlaying up to the third struct page), and
	 * these pages must be THP in order to correctly use pre-existing THP
	 * infrastructure such as split_folio().
	 *
	 * As a consequence of relying on the THP infrastructure, if the system
	 * does not support THP, we always fallback to order-0.
	 *
	 * Note that the caller may or may not choose to lock the pte. If
	 * unlocked, the calculation should be considered an estimate that will
	 * need to be validated under the lock.
	 */

	struct vm_area_struct *vma = vmf->vma;
	int nr;
	int order;
	unsigned long addr;
	pte_t *pte;
	pte_t *first_set = NULL;
	int ret;

	if (has_transparent_hugepage()) {
		order = min(*order_out, PMD_SHIFT - PAGE_SHIFT);

		for (; order > 1; order--) {
			nr = 1 << order;
			addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
			pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);

			/* Check vma bounds. */
			if (addr < vma->vm_start ||
			    addr + nr * PAGE_SIZE > vma->vm_end)
				continue;

			/* Ptes covered by order already known to be none. */
			if (pte + nr <= first_set)
				break;

			/* Already found set pte in range covered by order. */
			if (pte <= first_set)
				continue;

			/* Need to check if all the ptes are none. */
			ret = check_all_ptes_none(pte, nr);
			if (ret == nr)
				break;

			first_set = pte + ret;
		}

		if (order == 1)
			order = 0;
	} else {
		order = 0;
	}

	*order_out = order;
	*addr_out = order > 0 ? addr : vmf->address;
}
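
(For reference, check_all_ptes_none() above returns the number of leading
none ptes in the range, so pte + ret points at the first non-none pte. A
minimal sketch, inferred from how the caller uses it - the exact helper is
in the patch alongside it:

	static int check_all_ptes_none(pte_t *pte, int nr)
	{
		int i;

		/* Return nr if all ptes are none, else index of first set pte. */
		for (i = 0; i < nr; i++) {
			if (!pte_none(*pte++))
				return i;
		}

		return nr;
	}

Note also that with 4K pages, PMD_SHIFT - PAGE_SHIFT is 9, so the starting
order is capped at 9, i.e. one PMD's worth of ptes.)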



> 
> 
> Regards
> Yin, Fengwei
> 
>>
>> Thanks,
>> Ryan
>>
>>
>>
>>>
>>> It will normally skip my large folios because they have a mapcount > 1,
>>> since mapcount is incremented for each pte, unlike PMD-mapped pages. But
>>> on occasion it sees a mapcount of 1 and proceeds. So I guess it is racing
>>> against reclaim or CoW in those cases?
>>>
>>> I also see it's doing a dance to take the folio lock and drop the ptl.
>>> Perhaps my large anon folio is not using the folio lock in the same way as
>>> a THP would, and we are therefore not getting the expected serialization?
>>>
>>> I'd really appreciate any suggestions for how to progress here!
>>>
>>



Thread overview: 30+ messages

2023-03-17 10:57 [RFC PATCH 0/6] variable-order, large folios for anonymous memory Ryan Roberts
2023-03-17 10:57 ` [RFC PATCH 1/6] mm: Expose clear_huge_page() unconditionally Ryan Roberts
2023-03-17 10:57 ` [RFC PATCH 2/6] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio() Ryan Roberts
2023-03-17 10:57 ` [RFC PATCH 3/6] mm: Introduce try_vma_alloc_zeroed_movable_folio() Ryan Roberts
2023-03-17 10:58 ` [RFC PATCH 4/6] mm: Implement folio_add_new_anon_rmap_range() Ryan Roberts
2023-03-22  6:59   ` Yin Fengwei
2023-03-22  7:10   ` Yin Fengwei
2023-03-22  7:42     ` Ryan Roberts
2023-03-17 10:58 ` [RFC PATCH 5/6] mm: Allocate large folios for anonymous memory Ryan Roberts
2023-03-17 10:58 ` [RFC PATCH 6/6] WORKAROUND: Don't split large folios on madvise Ryan Roberts
2023-03-22  8:19   ` Yin Fengwei
2023-03-22  8:59     ` Ryan Roberts
2023-03-22 12:03 ` [RFC PATCH 0/6] variable-order, large folios for anonymous memory Ryan Roberts
2023-03-22 13:36   ` Yin, Fengwei
2023-03-22 14:25     ` Ryan Roberts [this message]
