Date: Thu, 19 Jan 2023 18:02:58 +0300
From: kirill.shutemov@linux.intel.com
To: Jiaqi Yan
Cc: kirill@shutemov.name, shy828301@gmail.com, tongtiangen@huawei.com,
	tony.luck@intel.com, akpm@linux-foundation.org,
	wangkefeng.wang@huawei.com, naoya.horiguchi@nec.com,
	linmiaohe@huawei.com, linux-mm@kvack.org, osalvador@suse.de
Subject: Re: [PATCH v9 1/2] mm/khugepaged: recover from poisoned anonymous memory
Message-ID: <20230119150258.npfadnefkpny5fd3@box.shutemov.name>
References: <20221205234059.42971-1-jiaqiyan@google.com>
 <20221205234059.42971-2-jiaqiyan@google.com>
In-Reply-To: <20221205234059.42971-2-jiaqiyan@google.com>

On Mon, Dec 05, 2022 at 03:40:58PM -0800, Jiaqi Yan wrote:
> Make __collapse_huge_page_copy return whether copying anonymous pages
> succeeded, and make collapse_huge_page handle the return status.
>
> Break existing PTE scan loop into two for-loops. The first loop copies
> source pages into target huge page, and can fail gracefully when running
> into memory errors in source pages. If copying all pages succeeds, the
> second loop releases and clears up these normal pages. Otherwise, the
> second loop rolls back the page table and page states by:
> - re-establishing the original PTEs-to-PMD connection.
> - releasing source pages back to their LRU list.
>
> Tested manually:
> 0. Enable khugepaged on system under test.
> 1. Start a two-thread application. Each thread allocates a chunk of
>    non-huge anonymous memory buffer.
> 2. Pick 4 random buffer locations (2 in each thread) and inject
>    uncorrectable memory errors at corresponding physical addresses.
> 3. Signal both threads to make their memory buffer collapsible, i.e.
>    calling madvise(MADV_HUGEPAGE).
> 4. Wait and check kernel log: khugepaged is able to recover from poisoned
>    pages and skips collapsing them.
> 5. Signal both threads to inspect their buffer contents and make sure no
>    data corruption.
>
> Signed-off-by: Jiaqi Yan
> ---
>  include/trace/events/huge_memory.h |   3 +-
>  mm/khugepaged.c                    | 179 ++++++++++++++++++++++-------
>  2 files changed, 139 insertions(+), 43 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 35d759d3b0104..5743ae970af31 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -36,7 +36,8 @@
>  	EM( SCAN_ALLOC_HUGE_PAGE_FAIL,	"alloc_huge_page_failed")	\
>  	EM( SCAN_CGROUP_CHARGE_FAIL,	"ccgroup_charge_failed")	\
>  	EM( SCAN_TRUNCATED,		"truncated")			\
> -	EMe(SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
> +	EM( SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
> +	EMe(SCAN_COPY_MC,		"copy_poisoned_page")		\
>
>  #undef EM
>  #undef EMe
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5a7d2d5093f9c..0f1b9e05e17ec 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -19,6 +19,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>  #include
> @@ -55,6 +56,7 @@ enum scan_result {
>  	SCAN_CGROUP_CHARGE_FAIL,
>  	SCAN_TRUNCATED,
>  	SCAN_PAGE_HAS_PRIVATE,
> +	SCAN_COPY_MC,
>  };
>
>  #define CREATE_TRACE_POINTS
> @@ -530,6 +532,27 @@ static bool is_refcount_suitable(struct page *page)
>  	return page_count(page) == expected_refcount;
>  }
>
> +/*
> + * Copies memory with #MC in source page (@from) handled. Returns number
> + * of bytes not copied if there was an exception; otherwise 0 for success.
> + * Note handling #MC requires arch opt-in.
> + */
> +static int copy_mc_page(struct page *to, struct page *from)
> +{
> +	char *vfrom, *vto;
> +	unsigned long ret;
> +
> +	vfrom = kmap_local_page(from);
> +	vto = kmap_local_page(to);
> +	ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> +	if (ret == 0)
> +		kmsan_copy_page_meta(to, from);
> +	kunmap_local(vto);
> +	kunmap_local(vfrom);
> +
> +	return ret;
> +}

It is very similar to copy_mc_user_highpage(), but uses
kmsan_copy_page_meta() instead of kmsan_unpoison_memory().
Could you explain the difference? I don't quite get it.

> +
>  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  					unsigned long address,
>  					pte_t *pte,
> @@ -670,56 +693,124 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  	return result;
>  }
>
> -static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> -				      struct vm_area_struct *vma,
> -				      unsigned long address,
> -				      spinlock_t *ptl,
> -				      struct list_head *compound_pagelist)
> +/*
> + * __collapse_huge_page_copy - attempts to copy memory contents from normal
> + * pages to a hugepage. Cleans up the normal pages if copying succeeds;
> + * otherwise restores the original page table and releases isolated normal pages.
> + * Returns SCAN_SUCCEED if copying succeeds, otherwise returns SCAN_COPY_MC.
> + *
> + * @pte: starting of the PTEs to copy from
> + * @page: the new hugepage to copy contents to
> + * @pmd: pointer to the new hugepage's PMD
> + * @rollback: the original normal pages' PMD
> + * @vma: the original normal pages' virtual memory area
> + * @address: starting address to copy
> + * @pte_ptl: lock on normal pages' PTEs
> + * @compound_pagelist: list that stores compound pages
> + */
> +static int __collapse_huge_page_copy(pte_t *pte,
> +				     struct page *page,
> +				     pmd_t *pmd,
> +				     pmd_t rollback,

I think 'orig_pmd' is a better name.
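(Back to the copy_mc_page() question above: for comparison, this is roughly
what copy_mc_user_highpage() looks like today -- a sketch from memory, so
please check include/linux/highmem.h for the exact definition and the
CONFIG_ARCH_HAS_COPY_MC guards.)

static inline int copy_mc_user_highpage(struct page *to, struct page *from,
					unsigned long vaddr, struct vm_area_struct *vma)
{
	unsigned long ret;
	char *vfrom, *vto;

	vfrom = kmap_local_page(from);
	vto = kmap_local_page(to);
	/* Machine-check-aware copy; returns bytes not copied on #MC. */
	ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
	if (!ret)
		/* Only the KMSAN handling differs from copy_mc_page() above. */
		kmsan_unpoison_memory(page_address(to), PAGE_SIZE);
	kunmap_local(vto);
	kunmap_local(vfrom);

	return ret;
}

If kmsan_copy_page_meta() is intentional here (copying the KMSAN metadata
from the source page rather than just unpoisoning the destination), a
comment in copy_mc_page() explaining that would help.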
> +				     struct vm_area_struct *vma,
> +				     unsigned long address,
> +				     spinlock_t *pte_ptl,
> +				     struct list_head *compound_pagelist)
>  {
>  	struct page *src_page, *tmp;
>  	pte_t *_pte;
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, page++, address += PAGE_SIZE) {
> -		pte_t pteval = *_pte;
> +	pte_t pteval;
> +	unsigned long _address;
> +	spinlock_t *pmd_ptl;
> +	int result = SCAN_SUCCEED;
>
> -		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> -			clear_user_highpage(page, address);
> -			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> -			if (is_zero_pfn(pte_pfn(pteval))) {
> +	/*
> +	 * Copying pages' contents is subject to memory poison at any iteration.
> +	 */
> +	for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> +	     _pte++, page++, _address += PAGE_SIZE) {
> +		pteval = *_pte;
> +
> +		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval)))
> +			clear_user_highpage(page, _address);
> +		else {
> +			src_page = pte_page(pteval);
> +			if (copy_mc_page(page, src_page) > 0) {
> +				result = SCAN_COPY_MC;
> +				break;
> +			}
> +		}
> +	}
> +
> +	if (likely(result == SCAN_SUCCEED)) {
> +		for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> +		     _pte++, _address += PAGE_SIZE) {
> +			pteval = *_pte;
> +			if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +				add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> +				if (is_zero_pfn(pte_pfn(pteval))) {
> +					/*
> +					 * pte_ptl mostly unnecessary.
> +					 */
> +					spin_lock(pte_ptl);
> +					pte_clear(vma->vm_mm, _address, _pte);
> +					spin_unlock(pte_ptl);
> +				}
> +			} else {
> +				src_page = pte_page(pteval);
> +				if (!PageCompound(src_page))
> +					release_pte_page(src_page);
>  				/*
> -				 * ptl mostly unnecessary.
> +				 * pte_ptl mostly unnecessary, but preempt has
> +				 * to be disabled to update the per-cpu stats
> +				 * inside page_remove_rmap().
>  				 */
> -				spin_lock(ptl);
> -				ptep_clear(vma->vm_mm, address, _pte);
> -				spin_unlock(ptl);
> +				spin_lock(pte_ptl);
> +				ptep_clear(vma->vm_mm, _address, _pte);
> +				page_remove_rmap(src_page, vma, false);
> +				spin_unlock(pte_ptl);
> +				free_page_and_swap_cache(src_page);
> +			}
> +		}
> +		list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
> +			list_del(&src_page->lru);
> +			mod_node_page_state(page_pgdat(src_page),
> +					    NR_ISOLATED_ANON + page_is_file_lru(src_page),
> +					    -compound_nr(src_page));
> +			unlock_page(src_page);
> +			free_swap_cache(src_page);
> +			putback_lru_page(src_page);
> +		}
> +	} else {
> +		/*
> +		 * Re-establish the regular PMD that points to the regular
> +		 * page table. Restoring PMD needs to be done prior to
> +		 * releasing pages. Since pages are still isolated and
> +		 * locked here, acquiring anon_vma_lock_write is unnecessary.
> +		 */
> +		pmd_ptl = pmd_lock(vma->vm_mm, pmd);
> +		pmd_populate(vma->vm_mm, pmd, pmd_pgtable(rollback));
> +		spin_unlock(pmd_ptl);
> +		/*
> +		 * Release both raw and compound pages isolated
> +		 * in __collapse_huge_page_isolate.
> +		 */
> +		for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> +		     _pte++, _address += PAGE_SIZE) {
> +			pteval = *_pte;
> +			if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval))) {
> +				src_page = pte_page(pteval);
> +				if (!PageCompound(src_page))
> +					release_pte_page(src_page);

The indentation levels are getting out of control here. Maybe some code
restructuring is required?
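For instance, the whole rollback path could move into its own helper so
that the success and failure cases each stay one level deep. A rough,
untested sketch below -- the helper name and its exact parameter list are
mine, not from the patch, and it just refactors the code quoted above:

/*
 * Hypothetical helper: undo a failed copy by restoring the original PMD
 * and releasing the pages isolated in __collapse_huge_page_isolate().
 */
static void __collapse_huge_page_copy_failed(pte_t *pte, pmd_t *pmd,
					     pmd_t orig_pmd,
					     struct vm_area_struct *vma,
					     unsigned long address,
					     struct list_head *compound_pagelist)
{
	struct page *src_page, *tmp;
	unsigned long _address;
	spinlock_t *pmd_ptl;
	pte_t *_pte;

	/*
	 * Re-establish the regular PMD before releasing the pages; the
	 * pages are still isolated and locked, so no anon_vma lock needed.
	 */
	pmd_ptl = pmd_lock(vma->vm_mm, pmd);
	pmd_populate(vma->vm_mm, pmd, pmd_pgtable(orig_pmd));
	spin_unlock(pmd_ptl);

	/* Release the raw pages isolated in __collapse_huge_page_isolate(). */
	for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
	     _pte++, _address += PAGE_SIZE) {
		pte_t pteval = *_pte;

		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval)))
			continue;
		src_page = pte_page(pteval);
		if (!PageCompound(src_page))
			release_pte_page(src_page);
	}

	/* And the compound pages collected on the list. */
	list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
		list_del(&src_page->lru);
		release_pte_page(src_page);
	}
}

The success path could get a matching helper, and __collapse_huge_page_copy()
would then just pick one of the two based on the copy result.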
>  			}
> -		} else {
> -			src_page = pte_page(pteval);
> -			copy_user_highpage(page, src_page, address, vma);
> -			if (!PageCompound(src_page))
> -				release_pte_page(src_page);
> -			/*
> -			 * ptl mostly unnecessary, but preempt has to
> -			 * be disabled to update the per-cpu stats
> -			 * inside page_remove_rmap().
> -			 */
> -			spin_lock(ptl);
> -			ptep_clear(vma->vm_mm, address, _pte);
> -			page_remove_rmap(src_page, vma, false);
> -			spin_unlock(ptl);
> -			free_page_and_swap_cache(src_page);
> +		}
> +		list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
> +			list_del(&src_page->lru);
> +			release_pte_page(src_page);
>  		}
>  	}
>
> -	list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
> -		list_del(&src_page->lru);
> -		mod_node_page_state(page_pgdat(src_page),
> -				    NR_ISOLATED_ANON + page_is_file_lru(src_page),
> -				    -compound_nr(src_page));
> -		unlock_page(src_page);
> -		free_swap_cache(src_page);
> -		putback_lru_page(src_page);
> -	}
> +	return result;
>  }
>
>  static void khugepaged_alloc_sleep(void)
> @@ -1079,9 +1170,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 */
>  	anon_vma_unlock_write(vma->anon_vma);
>
> -	__collapse_huge_page_copy(pte, hpage, vma, address, pte_ptl,
> -				  &compound_pagelist);
> +	result = __collapse_huge_page_copy(pte, hpage, pmd, _pmd,
> +					   vma, address, pte_ptl,
> +					   &compound_pagelist);
>  	pte_unmap(pte);
> +	if (unlikely(result != SCAN_SUCCEED))
> +		goto out_up_write;
> +
>  	/*
>  	 * spin_lock() below is not the equivalent of smp_wmb(), but
>  	 * the smp_wmb() inside __SetPageUptodate() can be reused to
> --
> 2.39.0.rc0.267.gcb52ba06e7-goog
>

--
  Kiryl Shutsemau / Kirill A. Shutemov