From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
kvm@vger.kernel.org, linux-kselftest@vger.kernel.org,
David Hildenbrand <david@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
liubo <liubo254@huawei.com>, Peter Xu <peterx@redhat.com>,
Matthew Wilcox <willy@infradead.org>,
Hugh Dickins <hughd@google.com>, Jason Gunthorpe <jgg@ziepe.ca>,
John Hubbard <jhubbard@nvidia.com>, Mel Gorman <mgorman@suse.de>,
Shuah Khan <shuah@kernel.org>,
Paolo Bonzini <pbonzini@redhat.com>,
Mel Gorman <mgorman@techsingularity.net>
Subject: [PATCH v3 2/7] smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()
Date: Thu, 3 Aug 2023 16:32:03 +0200 [thread overview]
Message-ID: <20230803143208.383663-3-david@redhat.com> (raw)
In-Reply-To: <20230803143208.383663-1-david@redhat.com>
We shouldn't be using a GUP-internal helper if it can be avoided.
Similar to smaps_pte_entry() that uses vm_normal_page(), let's use
vm_normal_page_pmd() that similarly refuses to return the huge zeropage.
In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd():
(1) Will always return the head page, not a tail page of a THP.
If we'd ever call smaps_account with a tail page while setting "compound
= true", we could be in trouble, because smaps_account() would look at
the memmap of unrelated pages.
If we're unlucky, that memmap does not exist at all. Before we removed
PG_doublemap, we could have triggered something similar as in
commit 24d7275ce279 ("fs/proc: task_mmu.c: don't read mapcount for
migration entry").
This can theoretically happen ever since commit ff9f47f6f00c ("mm: proc:
smaps_rollup: do not stall write attempts on mmap_lock"):
(a) We're in show_smaps_rollup() and processed a VMA
(b) We release the mmap lock in show_smaps_rollup() because it is
contended
(c) We merged that VMA with another VMA
(d) We collapsed a THP in that merged VMA at that position
If the end address of the original VMA falls into the middle of a THP
area, we would call smap_gather_stats() with a start address that falls
into a PMD-mapped THP. It's probably very rare to trigger when not
really forced.
(2) Will succeed on a is_pci_p2pdma_page(), like vm_normal_page()
Treat such PMDs here just like smaps_pte_entry() would treat such PTEs.
If such pages would be anonymous, we most certainly would want to
account them.
(3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap()
As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE
pages. So just like smaps_pte_entry(), we'll now also ignore such PMD
entries.
Especially, follow_pmd_mask() never ends up calling
follow_trans_huge_pmd() on pmd_devmap(). Instead it calls
follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN
is set.
So skipping pmd_devmap() pages seems to be the right thing to do.
(4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page()
We won't be returning a memmap that should be ignored by core-mm, or
worse, a memmap that does not even exist. Note that while
walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't.
Most probably this case doesn't currently really happen on the PMD level,
otherwise we'd already be able to trigger kernel crashes when reading
smaps / smaps_rollup.
So most probably only (1) is relevant in practice as of now, but could only
cause trouble in extreme corner cases.
Let's move follow_trans_huge_pmd() to mm/internal.h to discourage future
reuse in wrong context.
Fixes: ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
fs/proc/task_mmu.c | 3 +--
include/linux/huge_mm.h | 3 ---
mm/internal.h | 7 +++++++
3 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index bf25178ae66a..7a7d6e2e6a14 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -571,8 +571,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
bool migration = false;
if (pmd_present(*pmd)) {
- /* FOLL_DUMP will return -EFAULT on huge zero page */
- page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+ page = vm_normal_page_pmd(vma, addr, *pmd);
} else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
swp_entry_t entry = pmd_to_swp_entry(*pmd);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 20284387b841..e718dbe928ba 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -25,9 +25,6 @@ static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
#endif
vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf);
-struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
- unsigned long addr, pmd_t *pmd,
- unsigned int flags);
bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr, unsigned long next);
int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd,
diff --git a/mm/internal.h b/mm/internal.h
index 5a03bc4782a2..c94eda536c4c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -941,6 +941,13 @@ int migrate_device_coherent_page(struct page *page);
struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
int __must_check try_grab_page(struct page *page, unsigned int flags);
+/*
+ * mm/huge_memory.c
+ */
+struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmd,
+ unsigned int flags);
+
enum {
/* mark page accessed */
FOLL_TOUCH = 1 << 16,
--
2.41.0
next prev parent reply other threads:[~2023-08-03 14:34 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-08-03 14:32 [PATCH v3 0/7] smaps / mm/gup: fix gup_can_follow_protnone fallout David Hildenbrand
2023-08-03 14:32 ` [PATCH v3 1/7] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT David Hildenbrand
2023-08-03 14:32 ` David Hildenbrand [this message]
2023-08-03 14:32 ` [PATCH v3 3/7] kvm: explicitly set FOLL_HONOR_NUMA_FAULT in hva_to_pfn_slow() David Hildenbrand
2023-08-03 14:32 ` [PATCH v3 4/7] mm/gup: don't implicitly set FOLL_HONOR_NUMA_FAULT David Hildenbrand
2023-08-03 14:32 ` [PATCH v3 5/7] pgtable: improve pte_protnone() comment David Hildenbrand
2023-08-03 14:32 ` [PATCH v3 6/7] selftest/mm: ksm_functional_tests: test in mmap_and_merge_range() if anything got merged David Hildenbrand
2023-08-03 19:05 ` Peter Xu
2023-08-04 17:56 ` David Hildenbrand
2023-08-03 14:32 ` [PATCH v3 7/7] selftest/mm: ksm_functional_tests: Add PROT_NONE test David Hildenbrand
2023-08-03 19:06 ` Peter Xu
2023-08-04 18:00 ` David Hildenbrand
2023-08-07 15:36 ` David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230803143208.383663-3-david@redhat.com \
--to=david@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=hughd@google.com \
--cc=jgg@ziepe.ca \
--cc=jhubbard@nvidia.com \
--cc=kvm@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=liubo254@huawei.com \
--cc=mgorman@suse.de \
--cc=mgorman@techsingularity.net \
--cc=pbonzini@redhat.com \
--cc=peterx@redhat.com \
--cc=shuah@kernel.org \
--cc=torvalds@linux-foundation.org \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.