From: Yang Shi
To: naoya.horiguchi@nec.com, hughd@google.com,
    kirill.shutemov@linux.intel.com, willy@infradead.org,
    peterx@redhat.com, osalvador@suse.de, akpm@linux-foundation.org
Cc: shy828301@gmail.com, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC v2 PATCH 0/5] Solve silent data loss caused by poisoned page cache (shmem/tmpfs)
Date: Wed, 22 Sep 2021 20:28:25 -0700
Message-Id: <20210923032830.314328-1-shy828301@gmail.com>

When discussing the patch that splits page cache THP in order to offline
the poisoned page, Naoya mentioned there is a bigger problem [1] that
prevents this from working, since the page cache page will be truncated if
uncorrectable errors happen. Looking into this more deeply, it turns out
that this approach (truncating the poisoned page) may incur silent data
loss for all non-readonly filesystems if the page is dirty. It may be worse
for in-memory filesystems, e.g. shmem/tmpfs, since the data blocks are
actually gone.

To solve this problem we could keep the poisoned dirty page in the page
cache and then notify the users on any later access, e.g. page fault,
read/write, etc. Clean pages can be truncated as is, since they can be
reread from disk later on.

The consequence is that filesystems may find a poisoned page and manipulate
it as a healthy page, since the filesystems don't check whether the page is
poisoned in any of the relevant paths except page fault. In general, we
need to make the filesystems aware of poisoned pages before we can keep the
poisoned page in the page cache to solve the data loss problem.

To make filesystems aware of poisoned pages we should consider:
- The page should not be written back: clearing the dirty flag prevents
  writeback.
- The page should not be dropped (it shows as a clean page) by drop caches
  or other callers: the refcount pin from hwpoison prevents invalidation
  (called by cache drop, inode cache shrinking, etc.), but it doesn't
  prevent invalidation in the DIO path.
- The page should still be able to be truncated/hole punched/unlinked: this
  works as it is.
- Users should be notified when the page is accessed, e.g. read/write, page
  fault and other paths (compression, encryption, etc.).

The scope of the last one is huge, since almost all filesystems need to do
it once a page is returned from a page cache lookup. There are a couple of
options to do it:

1. Check the hwpoison flag in every path, the most straightforward way.
2. Return NULL for a poisoned page from page cache lookup. Most callsites
   check whether NULL is returned, so this should need the least work, I
   think. But the error handling in filesystems just returns -ENOMEM, and
   that error code will obviously confuse users.
3. To improve #2, we could return an error pointer, e.g. ERR_PTR(-EIO), but
   this will involve a significant amount of code change as well, since all
   the paths need to check whether the pointer is an error or not, just
   like option #1.
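
To make the difference between the three options concrete, here is a
minimal userspace sketch. It is not kernel code and not taken from the
patches: struct toy_page, struct toy_cache, the lookup_*() helpers and the
read_option*() callers are all hypothetical names for illustration, and
ERR_PTR()/PTR_ERR()/IS_ERR() are simplified stand-ins for the kernel
helpers of the same names. Returning -ENOMEM for an empty slot is only
modelled on the caller behaviour described in option #2.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical toy types, not the kernel's struct page or address_space. */
struct toy_page {
	bool hwpoison;
	char data[8];
};

#define TOY_NR_PAGES 4

struct toy_cache {
	struct toy_page *slots[TOY_NR_PAGES];
};

/* Option #1: the lookup returns the page as is; callers check the flag. */
static struct toy_page *lookup_plain(struct toy_cache *cache, size_t index)
{
	assert(index < TOY_NR_PAGES);
	return cache->slots[index];
}

/* Option #2: the lookup hides a poisoned page behind NULL. */
static struct toy_page *lookup_null(struct toy_cache *cache, size_t index)
{
	struct toy_page *page = lookup_plain(cache, index);

	return (page && page->hwpoison) ? NULL : page;
}

/* Option #3: the lookup reports a poisoned page as ERR_PTR(-EIO). */
static struct toy_page *lookup_err_ptr(struct toy_cache *cache, size_t index)
{
	struct toy_page *page = lookup_plain(cache, index);

	return (page && page->hwpoison) ? ERR_PTR(-EIO) : page;
}

/* Caller for option #1: one extra flag check in this path. */
static int read_option1(struct toy_cache *cache, size_t index, char *out)
{
	struct toy_page *page = lookup_plain(cache, index);

	if (!page)
		return -ENOMEM;
	if (page->hwpoison)
		return -EIO;
	*out = page->data[0];
	return 0;
}

/* Caller for option #2: unchanged, but poison is reported as -ENOMEM. */
static int read_option2(struct toy_cache *cache, size_t index, char *out)
{
	struct toy_page *page = lookup_null(cache, index);

	if (!page)
		return -ENOMEM;
	*out = page->data[0];
	return 0;
}

/* Caller for option #3: must test IS_ERR() before touching the page. */
static int read_option3(struct toy_cache *cache, size_t index, char *out)
{
	struct toy_page *page = lookup_err_ptr(cache, index);

	if (IS_ERR(page))
		return (int)PTR_ERR(page);
	if (!page)
		return -ENOMEM;
	*out = page->data[0];
	return 0;
}
```

In this toy model the option #2 caller needs no change but cannot tell a
poisoned page from a missing one, while the option #1 and option #3 callers
both need one added check per path to be able to return -EIO.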

I prototyped both #1 and #3, and it seems #3 may require more changes than
#1. For #3, ERR_PTR will be returned, so all the callers need to check the
return value, otherwise an invalid pointer may be dereferenced. But not all
callers really care about the content of the page, for example partial
truncate, which just sets the truncated range in one page to 0. Such paths
need additional modification if ERR_PTR is returned. And if the callers
have their own way to handle the problematic pages, we need to add a new
FGP flag to tell the FGP functions to return the pointer to the page.

It may happen very rarely, but once it happens the consequence (data
corruption) could be very bad and very hard to debug. It seems this problem
was briefly discussed before, but no action was taken at that time. [2]

As the investigation above shows, it takes a huge amount of work to solve
the potential data loss for all filesystems. But it is much easier for
in-memory filesystems, and such filesystems actually suffer more than
others since even the data blocks are gone due to truncating. So this
patchset starts from shmem/tmpfs by taking option #1.

Patch #1: fix bugs in page fault and khugepaged.
Patch #2 and #3: refactor, cleanup and preparation.
Patch #4: keep the poisoned page in page cache and handle such case for all
          the paths.
Patch #5: the previous patches unblock page cache THP split, so this patch
          adds page cache THP split support.

Changelog v1 --> v2:
  * Incorporated the suggestion from Kirill to use a new page flag to
    indicate there is hwpoisoned subpage(s) in a THP. (patch #1)
  * Dropped patch #2 of v1.
  * Refactored the page refcount check logic of hwpoison per Naoya.
    (patch #2)
  * Removed unnecessary THP check per Naoya. (patch #3)
  * Incorporated the other comments for shmem from Naoya. (patch #4)

Yang Shi (5):
      mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
      mm: hwpoison: refactor refcount check handling
      mm: hwpoison: remove the unnecessary THP check
      mm: shmem: don't truncate page if memory failure happens
      mm: hwpoison: handle non-anonymous THP correctly

 include/linux/page-flags.h |  19 ++++++++++
 mm/filemap.c               |  15 ++++----
 mm/huge_memory.c           |   2 ++
 mm/memory-failure.c        | 130 ++++++++++++++++++++++++++++++++++++++++++--------------------------
 mm/memory.c                |   9 +++++
 mm/page_alloc.c            |   4 ++-
 mm/shmem.c                 |  31 +++++++++++++++--
 mm/userfaultfd.c           |   5 +++
 8 files changed, 156 insertions(+), 59 deletions(-)

[1] https://lore.kernel.org/linux-mm/CAHbLzkqNPBh_sK09qfr4yu4WTFOzRy+MKj+PA7iG-adzi9zGsg@mail.gmail.com/T/#m0e959283380156f1d064456af01ae51fdff91265
[2] https://lore.kernel.org/lkml/20210318183350.GT3420@casper.infradead.org/