From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	"Yin, Fengwei" <fengwei.yin@intel.com>,
	Yu Zhao <yuzhao@google.com>
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Subject: Re: [RFC PATCH 0/6] variable-order, large folios for anonymous memory
Date: Wed, 22 Mar 2023 12:03:31 +0000	[thread overview]
Message-ID: <a2861b0a-be8f-0911-e551-af9d3b549f6d@arm.com> (raw)
In-Reply-To: <20230317105802.2634004-1-ryan.roberts@arm.com>

Hi Matthew,

On 17/03/2023 10:57, Ryan Roberts wrote:
> Hi All,
> 
> [...]
> 
> Bug(s)
> ======
> 
> When I run this code without the last (workaround) patch, but with DEBUG_VM et
> al., PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these
> relate to invalid kernel addresses (which usually look like either NULL + a
> small offset, or mostly zeros with a few mid-order bits set + a small offset)
> or to lockdep complaining about a bad unlock balance. Call stacks are often in
> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
> email example oopses separately if anyone wants to review them.) My hunch is
> that struct pages adjacent to the folio are being corrupted, but I don't have
> hard evidence.
> 
> When adding the workaround patch, which prevents madvise_free_pte_range() from
> attempting to split a large folio, I never see any issues. That said, I'm not
> putting the system under memory pressure, so I guess I might see the same types
> of problem crop up under swap, etc. (a sketch of the workaround follows the
> quoted code below).
> 
> I've reviewed most of the code within split_folio() and can't find any smoking
> gun, but I wonder if there are implicit assumptions about the large folio being
> PMD-sized that I'm now breaking?
> 
> The code in madvise_free_pte_range():
> 
> 	if (folio_test_large(folio)) {
> 		if (folio_mapcount(folio) != 1)
> 			goto out;
> 		folio_get(folio);
> 		if (!folio_trylock(folio)) {
> 			folio_put(folio);
> 			goto out;
> 		}
> 		pte_unmap_unlock(orig_pte, ptl);
> 		if (split_folio(folio)) {
> 			folio_unlock(folio);
> 			folio_put(folio);
> 			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> 			goto out;
> 		}
> 		...
> 	}
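
(For context, the workaround patch referenced above essentially turns the
block quoted above into an early bail-out. A minimal sketch of the idea --
illustrative only, not the literal diff from patch 6/6:)

	if (folio_test_large(folio)) {
		/* WORKAROUND sketch: don't attempt to split large folios
		 * in this path at all; skip them as if mapcount > 1. */
		goto out;
	}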

I've noticed that it's split_folio() on an order-1 folio that causes my
problems. I also see that the page cache code explicitly never allocates
order-1 folios:

void page_cache_ra_order(struct readahead_control *ractl,
		struct file_ra_state *ra, unsigned int new_order)
{
	...

	while (index <= limit) {
		unsigned int order = new_order;

		/* Align with smaller pages if needed */
		if (index & ((1UL << order) - 1)) {
			order = __ffs(index);
			if (order == 1)
				order = 0;
		}
		/* Don't allocate pages past EOF */
		while (index + (1UL << order) - 1 > limit) {
			if (--order == 1)
				order = 0;
		}
		err = ra_alloc_folio(ractl, index, mark, order, gfp);
		if (err)
			break;
		index += 1UL << order;
	}

	...
}
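
(Pulling the clamping logic out for clarity -- a standalone sketch rather
than kernel code, using __ffs() as in the kernel; for index <= limit it
should pick the same order as the loop above:)

	/* Reduce 'order' so that a folio of that order is naturally aligned
	 * at 'index' and does not extend past 'limit'. Never returns 1. */
	static unsigned int clamp_ra_order(unsigned long index,
					   unsigned long limit,
					   unsigned int order)
	{
		/* Align with smaller pages if index isn't order-aligned. */
		if (index & ((1UL << order) - 1))
			order = __ffs(index);

		/* Shrink until the folio no longer crosses EOF. */
		while (index + (1UL << order) - 1 > limit)
			order--;

		/* Skip order-1 entirely; fall back to a single page. */
		if (order == 1)
			order = 0;

		return order;
	}

So the only order that can never come out of this path is 1.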

Matthew, what is the reason for this? I suspect it's guarding against the same
problem I'm seeing.

If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
any oops/panic/etc. I'd just like to understand the root cause.
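
(The prevention is just the same clamp applied when choosing the anon folio
order. A minimal sketch -- the helper name is illustrative, not from the
posted series:)

	/* Hypothetical helper: never allocate order-1 anon folios; drop
	 * to a single page instead, mirroring page_cache_ra_order(). */
	static unsigned int anon_folio_order(unsigned int order)
	{
		return order == 1 ? 0 : order;
	}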

Thanks,
Ryan



> 
> Will normally skip my large folios because they have a mapcount > 1, since the
> mapcount is incremented for each pte, unlike PMD-mapped pages. But on occasion
> it will see a mapcount of 1 and proceed. So I guess it is racing against
> reclaim or CoW in those cases?
> 
> I also see it's doing a dance to take the folio lock and drop the ptl. Perhaps my
> large anon folio is not using the folio lock in the same way as a THP would and
> we are therefore not getting the expected serialization?
> 
> I'd really appreciate any suggestions for how to progress here!
> 


