From: Yang Shi <shy828301@gmail.com>
Date: Thu, 23 Mar 2023 14:37:52 -0700
Subject: Re: [PATCH v10 1/3] mm/khugepaged: recover from poisoned anonymous memory
To: Jiaqi Yan
Cc: kirill.shutemov@linux.intel.com, kirill@shutemov.name,
	tongtiangen@huawei.com, tony.luck@intel.com, akpm@linux-foundation.org,
	naoya.horiguchi@nec.com, linmiaohe@huawei.com, linux-mm@kvack.org,
	osalvador@suse.de, wangkefeng.wang@huawei.com
In-Reply-To: <20230305065112.1932255-2-jiaqiyan@google.com>
References: <20230305065112.1932255-1-jiaqiyan@google.com> <20230305065112.1932255-2-jiaqiyan@google.com>

On Sat, Mar 4, 2023 at 10:51 PM Jiaqi
Yan wrote:
>
> Make __collapse_huge_page_copy return whether copying anonymous pages
> succeeded, and make collapse_huge_page handle the return status.
>
> Break existing PTE scan loop into two for-loops. The first loop copies
> source pages into target huge page, and can fail gracefully when running
> into memory errors in source pages. If copying all pages succeeds, the
> second loop releases and clears up these normal pages. Otherwise, the
> second loop rolls back the page table and page states by:
> - re-establishing the original PTEs-to-PMD connection.
> - releasing source pages back to their LRU list.
>
> Tested manually:
> 0. Enable khugepaged on system under test.
> 1. Start a two-thread application. Each thread allocates a chunk of
>    non-huge anonymous memory buffer.
> 2. Pick 4 random buffer locations (2 in each thread) and inject
>    uncorrectable memory errors at corresponding physical addresses.
> 3. Signal both threads to make their memory buffer collapsible, i.e.
>    calling madvise(MADV_HUGEPAGE).
> 4. Wait and check kernel log: khugepaged is able to recover from poisoned
>    pages and skips collapsing them.
> 5. Signal both threads to inspect their buffer contents and make sure no
>    data corruption.
>
> Signed-off-by: Jiaqi Yan
> ---
>  include/trace/events/huge_memory.h |   3 +-
>  mm/khugepaged.c                    | 148 ++++++++++++++++++++++++-----
>  2 files changed, 128 insertions(+), 23 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 3e6fb05852f9a..46cce509957ba 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -36,7 +36,8 @@
>  	EM( SCAN_ALLOC_HUGE_PAGE_FAIL,	"alloc_huge_page_failed")	\
>  	EM( SCAN_CGROUP_CHARGE_FAIL,	"ccgroup_charge_failed")	\
>  	EM( SCAN_TRUNCATED,		"truncated")			\
> -	EMe(SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
> +	EM( SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
> +	EMe(SCAN_COPY_MC,		"copy_poisoned_page")		\
>
>  #undef EM
>  #undef EMe
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 27956d4404134..c3c217f6ebc6e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -19,6 +19,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>  #include
> @@ -55,6 +56,7 @@ enum scan_result {
>  	SCAN_CGROUP_CHARGE_FAIL,
>  	SCAN_TRUNCATED,
>  	SCAN_PAGE_HAS_PRIVATE,
> +	SCAN_COPY_MC,
>  };
>
>  #define CREATE_TRACE_POINTS
> @@ -681,47 +683,47 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  	return result;
>  }
>
> -static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> -				      struct vm_area_struct *vma,
> -				      unsigned long address,
> -				      spinlock_t *ptl,
> -				      struct list_head *compound_pagelist)
> +static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> +						pmd_t *pmd,
> +						struct vm_area_struct *vma,
> +						unsigned long address,
> +						spinlock_t *pte_ptl,
> +						struct list_head *compound_pagelist)
>  {
>  	struct page *src_page, *tmp;
>  	pte_t *_pte;
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, page++, address += PAGE_SIZE) {
> -		pte_t pteval = *_pte;
> +	pte_t pteval;
> +	unsigned long _address;
>
> +	for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> +	     _pte++, _address += PAGE_SIZE) {
> +		pteval = *_pte;
>  		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> -			clear_user_highpage(page, address);
>  			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>  			if (is_zero_pfn(pte_pfn(pteval))) {
>  				/*
> -				 * ptl mostly unnecessary.
> +				 * pte_ptl mostly unnecessary.
>  				 */
> -				spin_lock(ptl);
> -				ptep_clear(vma->vm_mm, address, _pte);
> -				spin_unlock(ptl);
> +				spin_lock(pte_ptl);

Why did you have to rename ptl to pte_ptl? It seems unnecessary.

> +				pte_clear(vma->vm_mm, _address, _pte);
> +				spin_unlock(pte_ptl);
>  			}
>  		} else {
>  			src_page = pte_page(pteval);
> -			copy_user_highpage(page, src_page, address, vma);
>  			if (!PageCompound(src_page))
>  				release_pte_page(src_page);
>  			/*
> -			 * ptl mostly unnecessary, but preempt has to
> -			 * be disabled to update the per-cpu stats
> +			 * pte_ptl mostly unnecessary, but preempt has
> +			 * to be disabled to update the per-cpu stats
>  			 * inside page_remove_rmap().
>  			 */
> -			spin_lock(ptl);
> -			ptep_clear(vma->vm_mm, address, _pte);
> +			spin_lock(pte_ptl);
> +			ptep_clear(vma->vm_mm, _address, _pte);
>  			page_remove_rmap(src_page, vma, false);
> -			spin_unlock(ptl);
> +			spin_unlock(pte_ptl);
>  			free_page_and_swap_cache(src_page);
>  		}
>  	}
> -
>  	list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
>  		list_del(&src_page->lru);
>  		mod_node_page_state(page_pgdat(src_page),
> @@ -733,6 +735,104 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>  	}
>  }
>
> +static void __collapse_huge_page_copy_failed(pte_t *pte,
> +					     pmd_t *pmd,
> +					     pmd_t orig_pmd,
> +					     struct vm_area_struct *vma,
> +					     unsigned long address,
> +					     struct list_head *compound_pagelist)
> +{
> +	struct page *src_page, *tmp;
> +	pte_t *_pte;
> +	pte_t pteval;
> +	unsigned long _address;
> +	spinlock_t *pmd_ptl;
> +
> +	/*
> +	 * Re-establish the PMD to point to the original page table
> +	 * entry. Restoring PMD needs to be done prior to releasing
> +	 * pages. Since pages are still isolated and locked here,
> +	 * acquiring anon_vma_lock_write is unnecessary.
> +	 */
> +	pmd_ptl = pmd_lock(vma->vm_mm, pmd);
> +	pmd_populate(vma->vm_mm, pmd, pmd_pgtable(orig_pmd));
> +	spin_unlock(pmd_ptl);
> +	/*
> +	 * Release both raw and compound pages isolated
> +	 * in __collapse_huge_page_isolate.
> +	 */

It looks like the below code could be replaced by release_pte_pages() with
_pte advanced to (pte + HPAGE_PMD_NR - 1).
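Something like the below (an untested sketch on my side; it assumes
release_pte_pages() keeps its current behavior of releasing the non-compound
pages for PTEs in [pte, _pte) and then draining compound_pagelist, so passing
the end of the whole PTE range should cover all HPAGE_PMD_NR entries):

	/* Release raw pages for the whole PTE range, then the compound pages. */
	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);

That would replace both the open-coded for loop and the
list_for_each_entry_safe() below.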
> +	for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> +	     _pte++, _address += PAGE_SIZE) {
> +		pteval = *_pte;
> +		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval)))
> +			continue;
> +		src_page = pte_page(pteval);
> +		if (!PageCompound(src_page))
> +			release_pte_page(src_page);
> +	}
> +	list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
> +		list_del(&src_page->lru);
> +		release_pte_page(src_page);
> +	}
> +}
> +
> +/*
> + * __collapse_huge_page_copy - attempts to copy memory contents from raw
> + * pages to a hugepage. Cleans up the raw pages if copying succeeds;
> + * otherwise restores the original page table and releases isolated raw pages.
> + * Returns SCAN_SUCCEED if copying succeeds, otherwise returns SCAN_COPY_MC.
> + *
> + * @pte: starting of the PTEs to copy from
> + * @page: the new hugepage to copy contents to
> + * @pmd: pointer to the new hugepage's PMD
> + * @orig_pmd: the original raw pages' PMD
> + * @vma: the original raw pages' virtual memory area
> + * @address: starting address to copy
> + * @pte_ptl: lock on raw pages' PTEs
> + * @compound_pagelist: list that stores compound pages
> + */
> +static int __collapse_huge_page_copy(pte_t *pte,
> +				     struct page *page,
> +				     pmd_t *pmd,
> +				     pmd_t orig_pmd,
> +				     struct vm_area_struct *vma,
> +				     unsigned long address,
> +				     spinlock_t *pte_ptl,
> +				     struct list_head *compound_pagelist)
> +{
> +	struct page *src_page;
> +	pte_t *_pte;
> +	pte_t pteval;
> +	unsigned long _address;
> +	int result = SCAN_SUCCEED;
> +
> +	/*
> +	 * Copying pages' contents is subject to memory poison at any iteration.
> +	 */
> +	for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> +	     _pte++, page++, _address += PAGE_SIZE) {
> +		pteval = *_pte;
> +		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +			clear_user_highpage(page, _address);
> +			continue;
> +		}
> +		src_page = pte_page(pteval);
> +		if (copy_mc_user_highpage(page, src_page, _address, vma) > 0) {
> +			result = SCAN_COPY_MC;
> +			break;
> +		}
> +	}
> +
> +	if (likely(result == SCAN_SUCCEED))
> +		__collapse_huge_page_copy_succeeded(pte, pmd, vma, address,
> +						    pte_ptl, compound_pagelist);
> +	else
> +		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> +						 address, compound_pagelist);
> +
> +	return result;
> +}
> +
>  static void khugepaged_alloc_sleep(void)
>  {
>  	DEFINE_WAIT(wait);
> @@ -1106,9 +1206,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 */
>  	anon_vma_unlock_write(vma->anon_vma);
>
> -	__collapse_huge_page_copy(pte, hpage, vma, address, pte_ptl,
> -				  &compound_pagelist);
> +	result = __collapse_huge_page_copy(pte, hpage, pmd, _pmd,
> +					   vma, address, pte_ptl,
> +					   &compound_pagelist);
>  	pte_unmap(pte);
> +	if (unlikely(result != SCAN_SUCCEED))
> +		goto out_up_write;
> +
>  	/*
>  	 * spin_lock() below is not the equivalent of smp_wmb(), but
>  	 * the smp_wmb() inside __SetPageUptodate() can be reused to
> --
> 2.40.0.rc0.216.gc4246ad0f0-goog
>