From: Yang Shi <shy828301@gmail.com>
Date: Mon, 11 Jul 2022 14:03:51 -0700
Subject: Re: [mm-unstable v7 08/18] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage
In-Reply-To: <20220706235936.2197195-9-zokeefe@google.com>
To: "Zach O'Keefe" <zokeefe@google.com>
Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox, Michal Hocko,
 Pasha Tatashin, Peter Xu, Rongwei Wang, SeongJae Park, Song Liu,
 Vlastimil Babka, Zi Yan, Linux MM, Andrea Arcangeli, Andrew Morton,
 Arnd Bergmann, Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
 Hugh Dickins, Ivan Kokshaysky, "James E.J. Bottomley", Jens Axboe,
 "Kirill A. Shutemov", Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim,
 Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer

On Wed, Jul 6, 2022 at 5:06 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> When scanning an anon pmd to see if it's eligible for collapse, return
> SCAN_PMD_MAPPED if the pmd already maps a hugepage. Note that
> SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
> file-collapse path, since the latter might identify pte-mapped compound
> pages. This is required by MADV_COLLAPSE, which necessarily needs to
> know which hugepage-aligned/sized regions are already pmd-mapped.
>
> In order to determine if a pmd already maps a hugepage, refactor
> mm_find_pmd():
>
> Return mm_find_pmd() to its pre-commit f72e7dcdd252 ("mm: let mm_find_pmd
> fix buggy race with THP fault") behavior. ksm was the only caller that
> explicitly wanted a pte-mapping pmd, so open code the pte-mapping logic
> there (pmd_present() and pmd_trans_huge() checks).
>
> Undo the change in commit f72e7dcdd252 ("mm: let mm_find_pmd fix buggy
> race with THP fault") that open-coded split_huge_pmd_address()'s pmd
> lookup, and use mm_find_pmd() instead.
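A drive-by note for readers following the series: the MADV_COLLAPSE motivation above is easy to see from the userspace side. Below is a minimal sketch of the intended call, assuming the MADV_COLLAPSE advice value that a later patch in this series adds; the constant fallback and the try_collapse() helper are illustrative only, not part of this patch:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumed uapi value from later in this series */
#endif

/*
 * Ask the kernel to back [addr, addr + len) with pmd-mapped hugepages.
 * A range that is already pmd-mapped (what SCAN_PMD_MAPPED reports to
 * the scan path) should come back as success without re-collapsing.
 */
static int try_collapse(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_COLLAPSE) != 0) {
		fprintf(stderr, "MADV_COLLAPSE: %s\n", strerror(errno));
		return -1;
	}
	return 0;
}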
>
> Signed-off-by: Zach O'Keefe <zokeefe@google.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>  include/trace/events/huge_memory.h |  1 +
>  mm/huge_memory.c                   | 18 +--------
>  mm/internal.h                      |  2 +-
>  mm/khugepaged.c                    | 60 ++++++++++++++++++++++++------
>  mm/ksm.c                           | 10 +++++
>  mm/rmap.c                          | 15 +++-----
>  6 files changed, 67 insertions(+), 39 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index d651f3437367..55392bf30a03 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -11,6 +11,7 @@
>  	EM( SCAN_FAIL,			"failed")			\
>  	EM( SCAN_SUCCEED,		"succeeded")			\
>  	EM( SCAN_PMD_NULL,		"pmd_null")			\
> +	EM( SCAN_PMD_MAPPED,		"page_pmd_mapped")		\
>  	EM( SCAN_EXCEED_NONE_PTE,	"exceed_none_pte")		\
>  	EM( SCAN_EXCEED_SWAP_PTE,	"exceed_swap_pte")		\
>  	EM( SCAN_EXCEED_SHARED_PTE,	"exceed_shared_pte")		\
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 4fbe43dc1568..fb76db6c703e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2363,25 +2363,11 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
>  		bool freeze, struct folio *folio)
>  {
> -	pgd_t *pgd;
> -	p4d_t *p4d;
> -	pud_t *pud;
> -	pmd_t *pmd;
> +	pmd_t *pmd = mm_find_pmd(vma->vm_mm, address);
>
> -	pgd = pgd_offset(vma->vm_mm, address);
> -	if (!pgd_present(*pgd))
> +	if (!pmd)
>  		return;
>
> -	p4d = p4d_offset(pgd, address);
> -	if (!p4d_present(*p4d))
> -		return;
> -
> -	pud = pud_offset(p4d, address);
> -	if (!pud_present(*pud))
> -		return;
> -
> -	pmd = pmd_offset(pud, address);
> -
>  	__split_huge_pmd(vma, pmd, address, freeze, folio);
>  }
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 6e14749ad1e5..ef8c23fb678f 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -188,7 +188,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
>  /*
>   * in mm/rmap.c:
>   */
> -extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> +pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
>
>  /*
>   * in mm/page_alloc.c
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b0e20db3f805..c7a09cc9a0e8 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -28,6 +28,7 @@ enum scan_result {
>  	SCAN_FAIL,
>  	SCAN_SUCCEED,
>  	SCAN_PMD_NULL,
> +	SCAN_PMD_MAPPED,
>  	SCAN_EXCEED_NONE_PTE,
>  	SCAN_EXCEED_SWAP_PTE,
>  	SCAN_EXCEED_SHARED_PTE,
> @@ -871,6 +872,45 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>  	return SCAN_SUCCEED;
>  }
>
> +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> +				   unsigned long address,
> +				   pmd_t **pmd)
> +{
> +	pmd_t pmde;
> +
> +	*pmd = mm_find_pmd(mm, address);
> +	if (!*pmd)
> +		return SCAN_PMD_NULL;
> +
> +	pmde = pmd_read_atomic(*pmd);
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> +	barrier();
> +#endif
> +	if (!pmd_present(pmde))
> +		return SCAN_PMD_NULL;
> +	if (pmd_trans_huge(pmde))
> +		return SCAN_PMD_MAPPED;
> +	if (pmd_bad(pmde))
> +		return SCAN_PMD_NULL;
> +	return SCAN_SUCCEED;
> +}
> +
> +static int check_pmd_still_valid(struct mm_struct *mm,
> +				 unsigned long address,
> +				 pmd_t *pmd)
> +{
> +	pmd_t *new_pmd;
> +	int result = find_pmd_or_thp_or_none(mm, address, &new_pmd);
> +
> +	if (result != SCAN_SUCCEED)
> +		return result;
> +	if (new_pmd != pmd)
> +		return SCAN_FAIL;
> +	return SCAN_SUCCEED;
> +}
> +
>  /*
>   * Bring missing pages in from swap, to complete THP collapse.
>   * Only done if khugepaged_scan_pmd believes it is worthwhile.
> @@ -982,9 +1022,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  		goto out_nolock;
>  	}
>
> -	pmd = mm_find_pmd(mm, address);
> -	if (!pmd) {
> -		result = SCAN_PMD_NULL;
> +	result = find_pmd_or_thp_or_none(mm, address, &pmd);
> +	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
>  		goto out_nolock;
>  	}
> @@ -1012,7 +1051,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>  	/* check if the pmd is still valid */
> -	if (mm_find_pmd(mm, address) != pmd)
> +	result = check_pmd_still_valid(mm, address, pmd);
> +	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>
>  	anon_vma_lock_write(vma->anon_vma);
> @@ -1115,11 +1155,9 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> -	pmd = mm_find_pmd(mm, address);
> -	if (!pmd) {
> -		result = SCAN_PMD_NULL;
> +	result = find_pmd_or_thp_or_none(mm, address, &pmd);
> +	if (result != SCAN_SUCCEED)
>  		goto out;
> -	}
>
>  	memset(cc->node_load, 0, sizeof(cc->node_load));
>  	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> @@ -1373,8 +1411,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
>  	if (!PageHead(hpage))
>  		goto drop_hpage;
>
> -	pmd = mm_find_pmd(mm, haddr);
> -	if (!pmd)
> +	if (find_pmd_or_thp_or_none(mm, haddr, &pmd) != SCAN_SUCCEED)
>  		goto drop_hpage;
>
>  	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> @@ -1492,8 +1529,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>  		if (vma->vm_end < addr + HPAGE_PMD_SIZE)
>  			continue;
>  		mm = vma->vm_mm;
> -		pmd = mm_find_pmd(mm, addr);
> -		if (!pmd)
> +		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
>  			continue;
>  		/*
>  		 * We need exclusive mmap_lock to retract page table.
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 075123602bd0..3e0a0a42fa1f 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1136,6 +1136,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	pmd_t *pmd;
> +	pmd_t pmde;
>  	pte_t *ptep;
>  	pte_t newpte;
>  	spinlock_t *ptl;
> @@ -1150,6 +1151,15 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  	pmd = mm_find_pmd(mm, addr);
>  	if (!pmd)
>  		goto out;
> +	/*
> +	 * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
> +	 * without holding anon_vma lock for write. So when looking for a
> +	 * genuine pmde (in which to find pte), test present and !THP together.
> +	 */
> +	pmde = *pmd;
> +	barrier();
> +	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> +		goto out;
>
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, addr,
>  				addr + PAGE_SIZE);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index edc06c52bc82..af775855e58f 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -767,13 +767,17 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
>  	return vma_address(page, vma);
>  }
>
> +/*
> + * Returns the actual pmd_t* where we expect 'address' to be mapped from, or
> + * NULL if it doesn't exist. No guarantees / checks on what the pmd_t*
> + * represents.
> + */
>  pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>  {
>  	pgd_t *pgd;
>  	p4d_t *p4d;
>  	pud_t *pud;
>  	pmd_t *pmd = NULL;
> -	pmd_t pmde;
>
>  	pgd = pgd_offset(mm, address);
>  	if (!pgd_present(*pgd))
> @@ -788,15 +792,6 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>  		goto out;
>
>  	pmd = pmd_offset(pud, address);
> -	/*
> -	 * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
> -	 * without holding anon_vma lock for write. So when looking for a
> -	 * genuine pmde (in which to find pte), test present and !THP together.
> -	 */
> -	pmde = *pmd;
> -	barrier();
> -	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> -		pmd = NULL;
>  out:
>  	return pmd;
>  }
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>
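One extra note for anyone who ends up open coding this pattern elsewhere: with mm_find_pmd() back to returning the raw pmd slot, a caller that needs a pte table has to filter out non-present and huge entries itself, the way replace_page() now does. A minimal sketch of that caller-side check under the post-patch contract; the find_pte_pmd() helper below is hypothetical, not something this patch adds:

/*
 * Hypothetical helper, sketched from the checks this patch open codes
 * in ksm's replace_page(); illustrative only, not part of the patch.
 */
static pmd_t *find_pte_pmd(struct mm_struct *mm, unsigned long addr)
{
	pmd_t *pmd = mm_find_pmd(mm, addr);	/* raw walk: may be huge or !present */
	pmd_t pmde;

	if (!pmd)
		return NULL;

	/*
	 * Snapshot the entry once: THP code can do pmdp_huge_clear_flush()
	 * then set_pmd_at() without holding the anon_vma lock for write,
	 * so test present and !trans_huge against the same snapshot.
	 */
	pmde = *pmd;
	barrier();

	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
		return NULL;	/* not a pte-mapping pmd */
	return pmd;
}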