From: Jiaqi Yan
Date: Fri, 24 Mar 2023 15:54:32 -0700
Subject: Re: [PATCH v10 3/3] mm/khugepaged: recover from poisoned file-backed memory
To: Yang Shi
Cc: kirill.shutemov@linux.intel.com, kirill@shutemov.name,
    tongtiangen@huawei.com, tony.luck@intel.com, akpm@linux-foundation.org,
    naoya.horiguchi@nec.com, linmiaohe@huawei.com, linux-mm@kvack.org,
    osalvador@suse.de, wangkefeng.wang@huawei.com
References: <20230305065112.1932255-1-jiaqiyan@google.com>
    <20230305065112.1932255-4-jiaqiyan@google.com>

On Fri, Mar 24, 2023 at 2:15 PM Yang Shi wrote:
>
> On Sat, Mar 4, 2023 at 10:51 PM Jiaqi Yan wrote:
> >
> > Make collapse_file roll back when copying pages fails. More concretely:
> > - extract copying operations into a separate loop
> > - postpone the updates for nr_none until both scanning and copying
> >   succeeded
> > - postpone joining small xarray entries until both scanning and copying
> >   succeeded
> > - postpone the update operations to NR_XXX_THPS until both scanning and
> >   copying succeeded
> > - for non-SHMEM file, roll back filemap_nr_thps_inc if scan succeeded but
> >   copying failed
> >
> > Tested manually:
> > 0. Enable khugepaged on system under test. Mount tmpfs at /mnt/ramdisk.
> > 1. Start a two-thread application. Each thread allocates a chunk of
> >    non-huge memory buffer from /mnt/ramdisk.
> > 2. Pick 4 random buffer addresses (2 in each thread) and inject
> >    uncorrectable memory errors at the corresponding physical addresses.
> > 3. Signal both threads to make their memory buffer collapsible, i.e.
> >    calling madvise(MADV_HUGEPAGE).
> > 4. Wait and then check kernel log: khugepaged is able to recover from
> >    poisoned pages by skipping them.
> > 5. Signal both threads to inspect their buffer contents and make sure
> >    there is no data corruption.
> >
> > Signed-off-by: Jiaqi Yan
>
> Reviewed-by: Yang Shi
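For anyone wanting to reproduce the "Tested manually" plan above, a rough
single-threaded sketch is below. It is illustrative only, not the harness
actually used: it works on one buffer instead of two threads' buffers, and
it substitutes madvise(MADV_HWPOISON) -- a software injection hook that
needs CAP_SYS_ADMIN and CONFIG_MEMORY_FAILURE -- for step 2's injection at
physical addresses.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SZ (4UL << 20)      /* buffer backed by tmpfs at /mnt/ramdisk */
#define POISON_PG 17            /* arbitrary page to poison in the range */

int main(void)
{
        long pgsz = sysconf(_SC_PAGESIZE);
        int fd = open("/mnt/ramdisk/buf", O_CREAT | O_RDWR, 0600);

        if (fd < 0 || ftruncate(fd, BUF_SZ))
                return 1;
        char *buf = mmap(NULL, BUF_SZ, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED)
                return 1;

        memset(buf, 0x5a, BUF_SZ);              /* step 1: populate small pages */

        /* step 2 (stand-in): soft-inject poison into one page */
        if (madvise(buf + POISON_PG * pgsz, pgsz, MADV_HWPOISON))
                perror("MADV_HWPOISON");

        madvise(buf, BUF_SZ, MADV_HUGEPAGE);    /* step 3: make range collapsible */
        sleep(60);                              /* step 4: give khugepaged time; check dmesg */

        /* step 5: every non-poisoned byte must be intact */
        for (unsigned long i = 0; i < BUF_SZ; i++) {
                if (i / pgsz == POISON_PG)
                        continue;               /* reading the poisoned page would SIGBUS */
                if (buf[i] != 0x5a)
                        return 2;
        }
        return 0;
}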
>
> Just a nit below:
>
> > ---
> >  mm/khugepaged.c | 78 ++++++++++++++++++++++++++++++-------------------
> >  1 file changed, 48 insertions(+), 30 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index c3c217f6ebc6e..3ea2aa55c2c52 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1890,6 +1890,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >  {
> >         struct address_space *mapping = file->f_mapping;
> >         struct page *hpage;
> > +       struct page *page;
> > +       struct page *tmp;
> > +       struct folio *folio;
> >         pgoff_t index = 0, end = start + HPAGE_PMD_NR;
> >         LIST_HEAD(pagelist);
> >         XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
> > @@ -1934,8 +1937,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >
> >         xas_set(&xas, start);
> >         for (index = start; index < end; index++) {
> > -               struct page *page = xas_next(&xas);
> > -               struct folio *folio;
> > +               page = xas_next(&xas);
> >
> >                 VM_BUG_ON(index != xas.xa_index);
> >                 if (is_shmem) {
> > @@ -2117,10 +2119,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >         }
> >         nr = thp_nr_pages(hpage);
> >
> > -       if (is_shmem)
> > -               __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
> > -       else {
> > -               __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
> > +       if (!is_shmem) {
> >                 filemap_nr_thps_inc(mapping);
> >                 /*
> >                  * Paired with smp_mb() in do_dentry_open() to ensure
> > @@ -2131,21 +2130,10 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >                 smp_mb();
> >                 if (inode_is_open_for_write(mapping->host)) {
> >                         result = SCAN_FAIL;
> > -                       __mod_lruvec_page_state(hpage, NR_FILE_THPS, -nr);
> >                         filemap_nr_thps_dec(mapping);
> >                         goto xa_locked;

I notice the "goto xa_locked" statement can be removed now that the
code between here and xa_locked is all gone.

> >                 }
> >         }
> > -
> > -       if (nr_none) {
> > -               __mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
> > -               /* nr_none is always 0 for non-shmem. */
> > -               __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
> > -       }
> > -
> > -       /* Join all the small entries into a single multi-index entry */
> > -       xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> > -       xas_store(&xas, hpage);
> >  xa_locked:
> >         xas_unlock_irq(&xas);
> >  xa_unlocked:
> > @@ -2158,21 +2146,35 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >                 try_to_unmap_flush();
> >
> >         if (result == SCAN_SUCCEED) {
> > -               struct page *page, *tmp;
> > -               struct folio *folio;
> > -
> >                 /*
> >                  * Replacing old pages with new one has succeeded, now we
> > -                * need to copy the content and free the old pages.
> > +                * attempt to copy the contents.
> >                  */
> >                 index = start;
> > -               list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> > +               list_for_each_entry(page, &pagelist, lru) {
> >                         while (index < page->index) {
> >                                 clear_highpage(hpage + (index % HPAGE_PMD_NR));
> >                                 index++;
> >                         }
> > -                       copy_highpage(hpage + (page->index % HPAGE_PMD_NR),
> > -                                     page);
> > +                       if (copy_mc_highpage(hpage + (page->index % HPAGE_PMD_NR),
> > +                                            page) > 0) {
> > +                               result = SCAN_COPY_MC;
> > +                               break;
> > +                       }
> > +                       index++;
> > +               }
> > +               while (result == SCAN_SUCCEED && index < end) {
> > +                       clear_highpage(hpage + (index % HPAGE_PMD_NR));
> > +                       index++;
> > +               }
> > +       }
> > +
> > +       if (result == SCAN_SUCCEED) {
> > +               /*
> > +                * Copying old pages to huge one has succeeded, now we
> > +                * need to free the old pages.
> > +                */
> > +               list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> >                         list_del(&page->lru);
> >                         page->mapping = NULL;
> >                         page_ref_unfreeze(page, 1);
> > @@ -2180,12 +2182,23 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >                         ClearPageUnevictable(page);
> >                         unlock_page(page);
> >                         put_page(page);
> > -                       index++;
> >                 }
> > -               while (index < end) {
> > -                       clear_highpage(hpage + (index % HPAGE_PMD_NR));
> > -                       index++;
> > +
> > +               xas_lock_irq(&xas);
> > +               if (is_shmem)
> > +                       __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
> > +               else
> > +                       __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
> > +
> > +               if (nr_none) {
> > +                       __mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
> > +                       /* nr_none is always 0 for non-shmem. */
> > +                       __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
> >                 }
> > +               /* Join all the small entries into a single multi-index entry. */
> > +               xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> > +               xas_store(&xas, hpage);
> > +               xas_unlock_irq(&xas);
> >
> >                 folio = page_folio(hpage);
> >                 folio_mark_uptodate(folio);
> > @@ -2203,8 +2216,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >                 unlock_page(hpage);
> >                 hpage = NULL;
> >         } else {
> > -               struct page *page;
> > -
> >                 /* Something went wrong: roll back page cache changes */
> >                 xas_lock_irq(&xas);
> >                 if (nr_none) {
> > @@ -2238,6 +2249,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >                         xas_lock_irq(&xas);
> >                 }
> >                 VM_BUG_ON(nr_none);
> > +               /*
> > +                * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
> > +                * This undo is not needed unless failure is due to SCAN_COPY_MC.
> > +                */
> > +               if (!is_shmem && result == SCAN_COPY_MC)
> > +                       filemap_nr_thps_dec(mapping);
>
> We may need a memory barrier here. But missing the memory barrier is
> not a fatal issue either; the worst case is an unnecessary truncate from
> the open path if it sees an obsolete nr_thps counter. And it may be
> better to handle it in a follow-up patch by moving smp_mb() into the
> filemap_nr_thp_xxx functions.

Ah, for the same reason as the comment above: "Paired with smp_mb() in
do_dentry_open() to ensure i_writecount is up to date and the update to
nr_thps is visible."? In that case, let me add the smp_mb() like this:

  smp_mb();
  if (!is_shmem && result == SCAN_COPY_MC) {...}
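If the barrier does eventually migrate into the helpers as suggested, the
follow-up could be as small as the sketch below against the existing
helpers in include/linux/pagemap.h (illustrative only, not a posted patch):

static inline void filemap_nr_thps_inc(struct address_space *mapping)
{
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
        atomic_inc(&mapping->nr_thps);
        /* Paired with smp_mb() in do_dentry_open(): publish the new
         * nr_thps value before the open path checks i_writecount. */
        smp_mb();
#else
        WARN_ON_ONCE(1);
#endif
}

static inline void filemap_nr_thps_dec(struct address_space *mapping)
{
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
        atomic_dec(&mapping->nr_thps);
        smp_mb();       /* same pairing as the inc side */
#else
        WARN_ON_ONCE(1);
#endif
}

That would let collapse_file() rely on the helpers for ordering instead of
open-coding smp_mb() at each call site.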
> >
> > +
> >
> >                 xas_unlock_irq(&xas);
> >
> >                 hpage->mapping = NULL;
> > --
> > 2.40.0.rc0.216.gc4246ad0f0-goog
> >