From: Yang Shi
Date: Mon, 20 Mar 2023 17:12:33 -0700
Subject: Re: [PATCH v10 1/3] mm/khugepaged: recover from poisoned anonymous memory
To: Jiaqi Yan
Cc: kirill.shutemov@linux.intel.com, kirill@shutemov.name,
    tongtiangen@huawei.com, tony.luck@intel.com, naoya.horiguchi@nec.com,
    linmiaohe@huawei.com, linux-mm@kvack.org, osalvador@suse.de,
    wangkefeng.wang@huawei.com, akpm@linux-foundation.org
References: <20230305065112.1932255-1-jiaqiyan@google.com>
    <20230305065112.1932255-2-jiaqiyan@google.com>
List-ID: linux-mm@kvack.org
On Mon, Mar 20, 2023 at 7:42 AM Jiaqi Yan wrote:
>
> ping for review :)

Quite busy recently so I didn't spend too much time on the recent
revisions. Hopefully I can find some time this week.

>
> On Sat, Mar 4, 2023 at 10:51 PM Jiaqi Yan wrote:
> >
> > Make __collapse_huge_page_copy return whether copying anonymous pages
> > succeeded, and make collapse_huge_page handle the return status.
> >
> > Break existing PTE scan loop into two for-loops. The first loop copies
> > source pages into target huge page, and can fail gracefully when running
> > into memory errors in source pages. If copying all pages succeeds, the
> > second loop releases and clears up these normal pages. Otherwise, the
> > second loop rolls back the page table and page states by:
> > - re-establishing the original PTEs-to-PMD connection.
> > - releasing source pages back to their LRU list.
> >
> > Tested manually:
> > 0. Enable khugepaged on system under test.
> > 1. Start a two-thread application. Each thread allocates a chunk of
> >    non-huge anonymous memory buffer.
> > 2. Pick 4 random buffer locations (2 in each thread) and inject
> >    uncorrectable memory errors at corresponding physical addresses.
> > 3. Signal both threads to make their memory buffer collapsible, i.e.
> >    calling madvise(MADV_HUGEPAGE).
> > 4. Wait and check kernel log: khugepaged is able to recover from poisoned
> >    pages and skips collapsing them.
> > 5. Signal both threads to inspect their buffer contents and make sure no
> >    data corruption.
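For readers following along, the copy-then-release ordering described above can be sketched in plain userspace C. This is a toy model, not kernel code: `struct page`, `copy_mc()`, `collapse_copy()`, and the size constants are all invented stand-ins for the kernel's `copy_mc_user_highpage()` machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NR_PAGES 4
#define PAGE_SZ  8

/* Toy page: a "poisoned" flag stands in for a hardware memory error. */
struct page {
	char data[PAGE_SZ];
	bool poisoned;
};

/* Analog of copy_mc_user_highpage(): returns > 0 when the source is bad. */
static int copy_mc(struct page *dst, const struct page *src)
{
	if (src->poisoned)
		return PAGE_SZ;	/* bytes left uncopied */
	memcpy(dst->data, src->data, PAGE_SZ);
	return 0;
}

/*
 * Two-phase collapse: loop 1 only copies; loop 2 tears down the source
 * pages and runs only after every copy has succeeded. On failure the
 * sources are untouched, so "rolling back" amounts to leaving them in
 * place (the patch re-installs the original PMD for the same effect).
 */
static bool collapse_copy(struct page *dst, struct page *src, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)			/* loop 1: copy */
		if (copy_mc(&dst[i], &src[i]) > 0)
			return false;		/* sources left intact */
	for (i = 0; i < n; i++)			/* loop 2: release sources */
		memset(src[i].data, 0, PAGE_SZ);
	return true;
}
```

With one poisoned source page, `collapse_copy()` fails and every source page still holds its data; with no poison it succeeds and only then are the sources cleared.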
> >
> > Signed-off-by: Jiaqi Yan
> > ---
> >  include/trace/events/huge_memory.h |   3 +-
> >  mm/khugepaged.c                    | 148 ++++++++++++++++++++++++-----
> >  2 files changed, 128 insertions(+), 23 deletions(-)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index 3e6fb05852f9a..46cce509957ba 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -36,7 +36,8 @@
> >         EM( SCAN_ALLOC_HUGE_PAGE_FAIL,  "alloc_huge_page_failed")       \
> >         EM( SCAN_CGROUP_CHARGE_FAIL,    "ccgroup_charge_failed")        \
> >         EM( SCAN_TRUNCATED,             "truncated")                    \
> > -       EMe(SCAN_PAGE_HAS_PRIVATE,      "page_has_private")             \
> > +       EM( SCAN_PAGE_HAS_PRIVATE,      "page_has_private")             \
> > +       EMe(SCAN_COPY_MC,               "copy_poisoned_page")           \
> >
> >  #undef EM
> >  #undef EMe
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 27956d4404134..c3c217f6ebc6e 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -19,6 +19,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >
> >  #include
> >  #include
> > @@ -55,6 +56,7 @@ enum scan_result {
> >         SCAN_CGROUP_CHARGE_FAIL,
> >         SCAN_TRUNCATED,
> >         SCAN_PAGE_HAS_PRIVATE,
> > +       SCAN_COPY_MC,
> >  };
> >
> >  #define CREATE_TRACE_POINTS
> > @@ -681,47 +683,47 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >         return result;
> >  }
> >
> > -static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > -                                     struct vm_area_struct *vma,
> > -                                     unsigned long address,
> > -                                     spinlock_t *ptl,
> > -                                     struct list_head *compound_pagelist)
> > +static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> > +                                               pmd_t *pmd,
> > +                                               struct vm_area_struct *vma,
> > +                                               unsigned long address,
> > +                                               spinlock_t *pte_ptl,
> > +                                               struct list_head *compound_pagelist)
> >  {
> >         struct page *src_page, *tmp;
> >         pte_t *_pte;
> > -       for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > -            _pte++, page++, address += PAGE_SIZE) {
> > -               pte_t pteval = *_pte;
> > +       pte_t pteval;
> > +       unsigned long _address;
> >
> > +       for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> > +            _pte++, _address += PAGE_SIZE) {
> > +               pteval = *_pte;
> >                 if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > -                       clear_user_highpage(page, address);
> >                         add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> >                         if (is_zero_pfn(pte_pfn(pteval))) {
> >                                 /*
> > -                                * ptl mostly unnecessary.
> > +                                * pte_ptl mostly unnecessary.
> >                                  */
> > -                               spin_lock(ptl);
> > -                               ptep_clear(vma->vm_mm, address, _pte);
> > -                               spin_unlock(ptl);
> > +                               spin_lock(pte_ptl);
> > +                               pte_clear(vma->vm_mm, _address, _pte);
> > +                               spin_unlock(pte_ptl);
> >                         }
> >                 } else {
> >                         src_page = pte_page(pteval);
> > -                       copy_user_highpage(page, src_page, address, vma);
> >                         if (!PageCompound(src_page))
> >                                 release_pte_page(src_page);
> >                         /*
> > -                        * ptl mostly unnecessary, but preempt has to
> > -                        * be disabled to update the per-cpu stats
> > +                        * pte_ptl mostly unnecessary, but preempt has
> > +                        * to be disabled to update the per-cpu stats
> >                          * inside page_remove_rmap().
> >                          */
> > -                       spin_lock(ptl);
> > -                       ptep_clear(vma->vm_mm, address, _pte);
> > +                       spin_lock(pte_ptl);
> > +                       ptep_clear(vma->vm_mm, _address, _pte);
> >                         page_remove_rmap(src_page, vma, false);
> > -                       spin_unlock(ptl);
> > +                       spin_unlock(pte_ptl);
> >                         free_page_and_swap_cache(src_page);
> >                 }
> >         }
> > -
> >         list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
> >                 list_del(&src_page->lru);
> >                 mod_node_page_state(page_pgdat(src_page),
> > @@ -733,6 +735,104 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> >         }
> >  }
> >
> > +static void __collapse_huge_page_copy_failed(pte_t *pte,
> > +                                            pmd_t *pmd,
> > +                                            pmd_t orig_pmd,
> > +                                            struct vm_area_struct *vma,
> > +                                            unsigned long address,
> > +                                            struct list_head *compound_pagelist)
> > +{
> > +       struct page *src_page, *tmp;
> > +       pte_t *_pte;
> > +       pte_t pteval;
> > +       unsigned long _address;
> > +       spinlock_t *pmd_ptl;
> > +
> > +       /*
> > +        * Re-establish the PMD to point to the original page table
> > +        * entry. Restoring PMD needs to be done prior to releasing
> > +        * pages. Since pages are still isolated and locked here,
> > +        * acquiring anon_vma_lock_write is unnecessary.
> > +        */
> > +       pmd_ptl = pmd_lock(vma->vm_mm, pmd);
> > +       pmd_populate(vma->vm_mm, pmd, pmd_pgtable(orig_pmd));
> > +       spin_unlock(pmd_ptl);
> > +       /*
> > +        * Release both raw and compound pages isolated
> > +        * in __collapse_huge_page_isolate.
> > +        */
> > +       for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> > +            _pte++, _address += PAGE_SIZE) {
> > +               pteval = *_pte;
> > +               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval)))
> > +                       continue;
> > +               src_page = pte_page(pteval);
> > +               if (!PageCompound(src_page))
> > +                       release_pte_page(src_page);
> > +       }
> > +       list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
> > +               list_del(&src_page->lru);
> > +               release_pte_page(src_page);
> > +       }
> > +}
> > +
> > +/*
> > + * __collapse_huge_page_copy - attempts to copy memory contents from raw
> > + * pages to a hugepage. Cleans up the raw pages if copying succeeds;
> > + * otherwise restores the original page table and releases isolated raw pages.
> > + * Returns SCAN_SUCCEED if copying succeeds, otherwise returns SCAN_COPY_MC.
> > + *
> > + * @pte: starting of the PTEs to copy from
> > + * @page: the new hugepage to copy contents to
> > + * @pmd: pointer to the new hugepage's PMD
> > + * @orig_pmd: the original raw pages' PMD
> > + * @vma: the original raw pages' virtual memory area
> > + * @address: starting address to copy
> > + * @pte_ptl: lock on raw pages' PTEs
> > + * @compound_pagelist: list that stores compound pages
> > + */
> > +static int __collapse_huge_page_copy(pte_t *pte,
> > +                                    struct page *page,
> > +                                    pmd_t *pmd,
> > +                                    pmd_t orig_pmd,
> > +                                    struct vm_area_struct *vma,
> > +                                    unsigned long address,
> > +                                    spinlock_t *pte_ptl,
> > +                                    struct list_head *compound_pagelist)
> > +{
> > +       struct page *src_page;
> > +       pte_t *_pte;
> > +       pte_t pteval;
> > +       unsigned long _address;
> > +       int result = SCAN_SUCCEED;
> > +
> > +       /*
> > +        * Copying pages' contents is subject to memory poison at any iteration.
> > +        */
> > +       for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> > +            _pte++, page++, _address += PAGE_SIZE) {
> > +               pteval = *_pte;
> > +               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > +                       clear_user_highpage(page, _address);
> > +                       continue;
> > +               }
> > +               src_page = pte_page(pteval);
> > +               if (copy_mc_user_highpage(page, src_page, _address, vma) > 0) {
> > +                       result = SCAN_COPY_MC;
> > +                       break;
> > +               }
> > +       }
> > +
> > +       if (likely(result == SCAN_SUCCEED))
> > +               __collapse_huge_page_copy_succeeded(pte, pmd, vma, address,
> > +                                                   pte_ptl, compound_pagelist);
> > +       else
> > +               __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> > +                                                address, compound_pagelist);
> > +
> > +       return result;
> > +}
> > +
> >  static void khugepaged_alloc_sleep(void)
> >  {
> >         DEFINE_WAIT(wait);
> > @@ -1106,9 +1206,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >          */
> >         anon_vma_unlock_write(vma->anon_vma);
> >
> > -       __collapse_huge_page_copy(pte, hpage, vma, address, pte_ptl,
> > -                                 &compound_pagelist);
> > +       result = __collapse_huge_page_copy(pte, hpage, pmd, _pmd,
> > +                                          vma, address, pte_ptl,
> > +                                          &compound_pagelist);
> >         pte_unmap(pte);
> > +       if (unlikely(result != SCAN_SUCCEED))
> > +               goto out_up_write;
> > +
> >         /*
> >          * spin_lock() below is not the equivalent of smp_wmb(), but
> >          * the smp_wmb() inside __SetPageUptodate() can be reused to
> > --
> > 2.40.0.rc0.216.gc4246ad0f0-goog
> >
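Not a blocker, but the caller-side contract in the collapse_huge_page() hunk above can be shown in isolation. Below is a minimal userspace sketch of the status-propagation pattern (toy code: `copy_step()`, `collapse()`, and the two-value result enum are invented for illustration, not the kernel's definitions):

```c
#include <assert.h>

enum scan_result { SCAN_SUCCEED, SCAN_COPY_MC };

/* Toy copy step: fails when the source is flagged as poisoned. */
static enum scan_result copy_step(int poisoned)
{
	return poisoned ? SCAN_COPY_MC : SCAN_SUCCEED;
}

/*
 * Mirrors the collapse_huge_page() change: capture the copy status and
 * unwind early instead of assuming the copy always succeeds. Only on
 * success do we "commit" (stand-in for installing the huge PMD).
 */
static enum scan_result collapse(int poisoned, int *committed)
{
	enum scan_result result;

	*committed = 0;
	result = copy_step(poisoned);
	if (result != SCAN_SUCCEED)
		goto out;		/* analog of goto out_up_write */
	*committed = 1;			/* analog of installing the huge PMD */
out:
	return result;
}
```

The point of the pattern is that a SCAN_COPY_MC result leaves the "commit" step unexecuted, so the mapping the caller started with stays valid.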