From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 37214C432BE for ; Mon, 30 Aug 2021 23:44:25 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 12D5D6101C for ; Mon, 30 Aug 2021 23:44:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239064AbhH3XpQ (ORCPT ); Mon, 30 Aug 2021 19:45:16 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36246 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231601AbhH3XpO (ORCPT ); Mon, 30 Aug 2021 19:45:14 -0400 Received: from mail-ed1-x535.google.com (mail-ed1-x535.google.com [IPv6:2a00:1450:4864:20::535]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 47314C061575 for ; Mon, 30 Aug 2021 16:44:20 -0700 (PDT) Received: by mail-ed1-x535.google.com with SMTP id g22so24096378edy.12 for ; Mon, 30 Aug 2021 16:44:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=bpG4L9d+cRIIKTqDcLTt0bxF3iyS6yijTSug5gYz8jc=; b=kbTtePJcgJcz3U3Xmu/pEDRBjEgazEu8NMuoIpNUvyXSc7hrHmeRGBWh6eEw/gPwtA +ZWH9s6Uui7QwLpqmzi6B39kQ/O9SzJ0QnWz+mZmm23GTZhNHdqCv425db2LEF8tMjpp JzsKHijZN9syr7c3MUIRXKgEKQoHMUmySxj7xeJWQ4lync/4vr+H9LE2YCGpFdWWwzYH 2bhQ9wvpXQ6kxKLuPmi6dFIjHwZaGmx2pRJKK9sSB8hNKViqbdrpZdPcqzcQoNemtgnZ yNzi4d7UQtBCaVuBVcfOR3OvnSuZhJWD+sXWLT1965ft5KH5chguvd9AByAwYojYU2ry hLRg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=bpG4L9d+cRIIKTqDcLTt0bxF3iyS6yijTSug5gYz8jc=; b=l8I8It0VS+KAhPRRitbwE5mx8TICQjUYWLeDCf0K52b1SXh4KzNqlwoWSUCXb0966u cbuANHvWh5slBl9Rtkn2g03grsI6RQqDLufwbAO0Eo+KlrEbQys+HRlviXD431HqnkiI qUsxBWAsbwxoXMLAsF4Q70cjuY5cZWsiI1d44vxjHEMsCHvQuHpLCAZ1pvO7Ir4NuzEH UVxyxlR+MTVFhld5UQEFuAlyAccP8hcj6H53KGF4Ui10N443ZXz8zvXOWZAqVVzC07ce MPvhXrLCit+IjH0rUJnz0DlcVQMf4gYdDOkIAWdc78GXhmaXzhSNroaCEhH7MWqHEMuy noOg== X-Gm-Message-State: AOAM533OkTtriKUca4mOj/W42m0QIA+xppIEZJAk0KNk+bBx3yTbgh4p AzbiFownhV0asY+lqve9VZA6a/vxx1fzt0GahG8= X-Google-Smtp-Source: ABdhPJw+3MEgNduoDicMT0qQOJkhmJV2xyK4cE4U5oDEmVEpi90YNIG2XvnCgssov+UL1WUGvhDn9h4QXpLfywbaasc= X-Received: by 2002:a50:eb8a:: with SMTP id y10mr26907900edr.137.1630367058843; Mon, 30 Aug 2021 16:44:18 -0700 (PDT) MIME-Version: 1.0 References: <20210824221322.7663-1-shy828301@gmail.com> <20210826061724.GA2864786@hori.linux.bs1.fc.nec.co.jp> <20210827035739.GA3247360@hori.linux.bs1.fc.nec.co.jp> In-Reply-To: From: Yang Shi Date: Mon, 30 Aug 2021 16:44:06 -0700 Message-ID: Subject: Re: [PATCH] mm: hwpoison: deal with page cache THP To: =?UTF-8?B?SE9SSUdVQ0hJIE5BT1lBKOWggOWPoyDnm7TkuZ8p?= Cc: "osalvador@suse.de" , "hughd@google.com" , "kirill.shutemov@linux.intel.com" , "akpm@linux-foundation.org" , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Aug 26, 2021 at 10:02 PM Yang Shi wrote: > > On Thu, Aug 26, 2021 at 8:57 PM HORIGUCHI NAOYA(=E5=A0=80=E5=8F=A3=E3=80= =80=E7=9B=B4=E4=B9=9F) > wrote: > > > > On Thu, Aug 26, 2021 at 03:03:57PM -0700, Yang Shi wrote: > > > On Thu, Aug 26, 2021 at 1:03 PM Yang Shi wrote: > > > > > > > > On Wed, Aug 25, 2021 at 11:17 PM HORIGUCHI NAOYA(=E5=A0=80=E5=8F=A3= =E3=80=80=E7=9B=B4=E4=B9=9F) > > > > wrote: > > > > > > > > > > On Tue, Aug 24, 2021 at 03:13:22PM -0700, Yang Shi wrote: > > ... > > > > > > > > > > There was a discussion about another approach of keeping error pa= ges in page > > > > > cache for filesystem without backend storage. > > > > > https://lore.kernel.org/lkml/alpine.LSU.2.11.2103111312310.7859@e= ggly.anvils/ > > > > > This approach seems to me less complicated, but one concern is th= at this > > > > > change affects user-visible behavior of memory errors. Keeping e= rror pages > > > > > in page cache means that the errors are persistent until next sys= tem reboot, > > > > > so we might need to define the way to clear the errors to continu= e to use > > > > > the error file. Current implementation is just to send SIGBUS to= the > > > > > mapping processes (at least once), then forget about the error, s= o there is > > > > > no such issue. > > > > > > > > > > Another thought of possible solution might be to send SIGBUS imme= diately when > > > > > a memory error happens on a shmem thp. We can find all the mappin= g processes > > > > > before splitting shmem thp, so send SIGBUS first, then split it a= nd contain > > > > > the error page. This is not elegant (giving up any optional acti= ons) but > > > > > anyway we can avoid the silent data lost. > > > > > > > > Thanks a lot. I apologize I didn't notice you already posted a simi= lar > > > > patch before. > > > > > > > > Yes, I think I focused on the soft offline part too much and missed > > > > the uncorrected error part and I admit I did underestimate the > > > > problem. > > > > > > > > I think Hugh's suggestion makes sense if we treat tmpfs as a regula= r > > > > filesystem (just memory backed). AFAIK, some filesystem, e.g. btrfs= , > > > > may do checksum after reading from storage block then return an err= or > > > > if checksum is not right since it may indicate hardware failure on > > > > disk. Then the syscalls or page fault return error or SIGBUS. > > > > > > > > So in shmem/tmpfs case, if hwpoisoned page is met, just return erro= r > > > > (-EIO or whatever) for syscall or SIGBUS for page fault. It does al= ign > > > > with the behavior of other filesystems. It is definitely applicatio= ns' > > > > responsibility to check the return value of read/write syscalls. > > > > > > BTW, IIUC the dirty regular page cache (storage backed) would be left > > > in the page cache too, the clean page cache would be truncated since > > > they can be just reread from storage, right? > > > > A dirty page cache is also removed on error (me_pagecache_dirty() falls > > through me_pagecache_clean(), then truncate_error_page() is called). > > The main purpose of this is to separate off the error page from exising > > data structures to minimize the risk of later accesses (maybe by race o= r bug). > > But we can change this behavior for specific file systems by updating > > error_remove_page() callbacks in address_space_operation. > > Yeah, if fs's error_remove_page() is defined. It seems the filesystems > which have error_remove_page() defined just use generic_remove_page() > except hugetlbfs. And the generic implementation just clears the dirty > flag and removes the page from page cache. > > If error_remove_page() is not defined, the page would stay in page > cache since invalidate_inode_page() can't remove dirty page. > > > > > Honestly, it seems to me that how dirty data is lost does not depend on > > file system, and I'm still not sure that this is really a right approac= h > > for the current issue. > > IMHO the biggest problem is that applications may see > obsolete/inconsistent data silently, right? Actually keeping the > corrupted page in page cache should be able to notify applications > that they are accessing inconsistent data. The removal from page cache behavior may be much worse for shmem/tmpfs since it actually removes the whole data blocks for the file. The user will get all zero if the corrupted blocks are read without any notification. The more I stared at the code and had tests done, the more I think we should keep the corrupted page in page cache and notify the users. It seems easier for readonly filesystem. Just remove the page from page cache since it always could read data from disk. This is also the current behavior. For shmem, the page could be kept in page cache with dirty flag set since it won't be written back. For regular filesystems that could do writeback, things are a little bit more complicated since we need to prevent from writing back by clearing dirty flag. Other than writeback we also need to distinguish cache drop from truncation/hole punch/unlink. We don't want cache drop (e.g. echo 1 > /proc/sys/vm/drop_caches) drop corrupted page. But truncate/hole punch/unlink should be fine to remove the page since the underlying data blocks will be gone too. Thanks to the refcount pin done by memory failure, cache drop can't drop the page since it checks if the refcount is expected or not. Truncate/hole punch/unlink doesn't check refcount so they could proceed. But inode evict (slab shrinking path) may call truncate, so the corrupted page may still be removed from page cache when the underlying data blocks still exist IIUC. There might be other paths in filesystems to have page cache truncate but the underlying data blocks are still present. The read/write syscalls also need check hwpoisoned flag. I'm not sure if I miss other syscalls or not. I'm not a filesystem expert so I'm not sure if I'm missing something else or not. But I'm supposed the most should be covered. I'd like to start with shmem/tmpfs since it is relatively easier and this also could unblock shmem THP hwpoison support. Any comment is welcome. > > > > > Thanks, > > Naoya Horiguchi From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 56D94C432BE for ; Mon, 30 Aug 2021 23:44:22 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id BEA2461027 for ; Mon, 30 Aug 2021 23:44:21 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org BEA2461027 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 10DA58D0001; Mon, 30 Aug 2021 19:44:21 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 0BD966B0072; Mon, 30 Aug 2021 19:44:21 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E9F348D0001; Mon, 30 Aug 2021 19:44:20 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0244.hostedemail.com [216.40.44.244]) by kanga.kvack.org (Postfix) with ESMTP id D9E326B0071 for ; Mon, 30 Aug 2021 19:44:20 -0400 (EDT) Received: from smtpin13.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 9071F27700 for ; Mon, 30 Aug 2021 23:44:20 +0000 (UTC) X-FDA: 78533378280.13.FBBD3C6 Received: from mail-ed1-f48.google.com (mail-ed1-f48.google.com [209.85.208.48]) by imf16.hostedemail.com (Postfix) with ESMTP id 4CE4EF00008E for ; Mon, 30 Aug 2021 23:44:20 +0000 (UTC) Received: by mail-ed1-f48.google.com with SMTP id q3so24065302edt.5 for ; Mon, 30 Aug 2021 16:44:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=bpG4L9d+cRIIKTqDcLTt0bxF3iyS6yijTSug5gYz8jc=; b=kbTtePJcgJcz3U3Xmu/pEDRBjEgazEu8NMuoIpNUvyXSc7hrHmeRGBWh6eEw/gPwtA +ZWH9s6Uui7QwLpqmzi6B39kQ/O9SzJ0QnWz+mZmm23GTZhNHdqCv425db2LEF8tMjpp JzsKHijZN9syr7c3MUIRXKgEKQoHMUmySxj7xeJWQ4lync/4vr+H9LE2YCGpFdWWwzYH 2bhQ9wvpXQ6kxKLuPmi6dFIjHwZaGmx2pRJKK9sSB8hNKViqbdrpZdPcqzcQoNemtgnZ yNzi4d7UQtBCaVuBVcfOR3OvnSuZhJWD+sXWLT1965ft5KH5chguvd9AByAwYojYU2ry hLRg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=bpG4L9d+cRIIKTqDcLTt0bxF3iyS6yijTSug5gYz8jc=; b=uDRAOVuqoXHZvQJjf960LG8RC+r0AcbC2DxNe/a5KtL/+xYesvlPBRUz9k6DGa0il7 Uo9sipChcL8WqRjuD8AcQO2TjFMxvrlR148M/88cVaC6CHuX70/LNJBgsRKlf900dgXz 8s5DktChIZZ49eP3ScrXUUZhdP6zNZtKUHI9Wzuq56zFPS4//I6o8sjVy/S2OWLrlDFA 8SmIzqVm7Mu6rCEKI8Q1LJDmh8GfLQmiQryD3lRjz5Gju1Hq+PhrGTq0LiGkBfadUDYF ZN5oqHrfGyHnFfnXL16ehCxJ+bWjpqhNenuJ1BURcD7ckufLDjVL7s+uuqcNMNkR+VZk 1mMA== X-Gm-Message-State: AOAM533XgZcNX6yLUsSMX0weXiivSzOLsPutpnCyh/CBMkP6SnmU2osX 1l6LqRCKQgsP4iU4DoTUg2X88Eb7P87hivuq804= X-Google-Smtp-Source: ABdhPJw+3MEgNduoDicMT0qQOJkhmJV2xyK4cE4U5oDEmVEpi90YNIG2XvnCgssov+UL1WUGvhDn9h4QXpLfywbaasc= X-Received: by 2002:a50:eb8a:: with SMTP id y10mr26907900edr.137.1630367058843; Mon, 30 Aug 2021 16:44:18 -0700 (PDT) MIME-Version: 1.0 References: <20210824221322.7663-1-shy828301@gmail.com> <20210826061724.GA2864786@hori.linux.bs1.fc.nec.co.jp> <20210827035739.GA3247360@hori.linux.bs1.fc.nec.co.jp> In-Reply-To: From: Yang Shi Date: Mon, 30 Aug 2021 16:44:06 -0700 Message-ID: Subject: Re: [PATCH] mm: hwpoison: deal with page cache THP To: =?UTF-8?B?SE9SSUdVQ0hJIE5BT1lBKOWggOWPoyDnm7TkuZ8p?= Cc: "osalvador@suse.de" , "hughd@google.com" , "kirill.shutemov@linux.intel.com" , "akpm@linux-foundation.org" , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 4CE4EF00008E Authentication-Results: imf16.hostedemail.com; dkim=pass header.d=gmail.com header.s=20161025 header.b=kbTtePJc; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf16.hostedemail.com: domain of shy828301@gmail.com designates 209.85.208.48 as permitted sender) smtp.mailfrom=shy828301@gmail.com X-Rspamd-Server: rspam01 X-Stat-Signature: 9u7rese7qug1jnt4zdfkpwmtzpbj3tq7 X-HE-Tag: 1630367060-824994 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu, Aug 26, 2021 at 10:02 PM Yang Shi wrote: > > On Thu, Aug 26, 2021 at 8:57 PM HORIGUCHI NAOYA(=E5=A0=80=E5=8F=A3=E3=80= =80=E7=9B=B4=E4=B9=9F) > wrote: > > > > On Thu, Aug 26, 2021 at 03:03:57PM -0700, Yang Shi wrote: > > > On Thu, Aug 26, 2021 at 1:03 PM Yang Shi wrote: > > > > > > > > On Wed, Aug 25, 2021 at 11:17 PM HORIGUCHI NAOYA(=E5=A0=80=E5=8F=A3= =E3=80=80=E7=9B=B4=E4=B9=9F) > > > > wrote: > > > > > > > > > > On Tue, Aug 24, 2021 at 03:13:22PM -0700, Yang Shi wrote: > > ... > > > > > > > > > > There was a discussion about another approach of keeping error pa= ges in page > > > > > cache for filesystem without backend storage. > > > > > https://lore.kernel.org/lkml/alpine.LSU.2.11.2103111312310.7859@e= ggly.anvils/ > > > > > This approach seems to me less complicated, but one concern is th= at this > > > > > change affects user-visible behavior of memory errors. Keeping e= rror pages > > > > > in page cache means that the errors are persistent until next sys= tem reboot, > > > > > so we might need to define the way to clear the errors to continu= e to use > > > > > the error file. Current implementation is just to send SIGBUS to= the > > > > > mapping processes (at least once), then forget about the error, s= o there is > > > > > no such issue. > > > > > > > > > > Another thought of possible solution might be to send SIGBUS imme= diately when > > > > > a memory error happens on a shmem thp. We can find all the mappin= g processes > > > > > before splitting shmem thp, so send SIGBUS first, then split it a= nd contain > > > > > the error page. This is not elegant (giving up any optional acti= ons) but > > > > > anyway we can avoid the silent data lost. > > > > > > > > Thanks a lot. I apologize I didn't notice you already posted a simi= lar > > > > patch before. > > > > > > > > Yes, I think I focused on the soft offline part too much and missed > > > > the uncorrected error part and I admit I did underestimate the > > > > problem. > > > > > > > > I think Hugh's suggestion makes sense if we treat tmpfs as a regula= r > > > > filesystem (just memory backed). AFAIK, some filesystem, e.g. btrfs= , > > > > may do checksum after reading from storage block then return an err= or > > > > if checksum is not right since it may indicate hardware failure on > > > > disk. Then the syscalls or page fault return error or SIGBUS. > > > > > > > > So in shmem/tmpfs case, if hwpoisoned page is met, just return erro= r > > > > (-EIO or whatever) for syscall or SIGBUS for page fault. It does al= ign > > > > with the behavior of other filesystems. It is definitely applicatio= ns' > > > > responsibility to check the return value of read/write syscalls. > > > > > > BTW, IIUC the dirty regular page cache (storage backed) would be left > > > in the page cache too, the clean page cache would be truncated since > > > they can be just reread from storage, right? > > > > A dirty page cache is also removed on error (me_pagecache_dirty() falls > > through me_pagecache_clean(), then truncate_error_page() is called). > > The main purpose of this is to separate off the error page from exising > > data structures to minimize the risk of later accesses (maybe by race o= r bug). > > But we can change this behavior for specific file systems by updating > > error_remove_page() callbacks in address_space_operation. > > Yeah, if fs's error_remove_page() is defined. It seems the filesystems > which have error_remove_page() defined just use generic_remove_page() > except hugetlbfs. And the generic implementation just clears the dirty > flag and removes the page from page cache. > > If error_remove_page() is not defined, the page would stay in page > cache since invalidate_inode_page() can't remove dirty page. > > > > > Honestly, it seems to me that how dirty data is lost does not depend on > > file system, and I'm still not sure that this is really a right approac= h > > for the current issue. > > IMHO the biggest problem is that applications may see > obsolete/inconsistent data silently, right? Actually keeping the > corrupted page in page cache should be able to notify applications > that they are accessing inconsistent data. The removal from page cache behavior may be much worse for shmem/tmpfs since it actually removes the whole data blocks for the file. The user will get all zero if the corrupted blocks are read without any notification. The more I stared at the code and had tests done, the more I think we should keep the corrupted page in page cache and notify the users. It seems easier for readonly filesystem. Just remove the page from page cache since it always could read data from disk. This is also the current behavior. For shmem, the page could be kept in page cache with dirty flag set since it won't be written back. For regular filesystems that could do writeback, things are a little bit more complicated since we need to prevent from writing back by clearing dirty flag. Other than writeback we also need to distinguish cache drop from truncation/hole punch/unlink. We don't want cache drop (e.g. echo 1 > /proc/sys/vm/drop_caches) drop corrupted page. But truncate/hole punch/unlink should be fine to remove the page since the underlying data blocks will be gone too. Thanks to the refcount pin done by memory failure, cache drop can't drop the page since it checks if the refcount is expected or not. Truncate/hole punch/unlink doesn't check refcount so they could proceed. But inode evict (slab shrinking path) may call truncate, so the corrupted page may still be removed from page cache when the underlying data blocks still exist IIUC. There might be other paths in filesystems to have page cache truncate but the underlying data blocks are still present. The read/write syscalls also need check hwpoisoned flag. I'm not sure if I miss other syscalls or not. I'm not a filesystem expert so I'm not sure if I'm missing something else or not. But I'm supposed the most should be covered. I'd like to start with shmem/tmpfs since it is relatively easier and this also could unblock shmem THP hwpoison support. Any comment is welcome. > > > > > Thanks, > > Naoya Horiguchi