From: HORIGUCHI NAOYA(堀口　直也)
To: David Rientjes
CC: "Luck, Tony", Mike Kravetz, "Dr. David Alan Gilbert", Peter Xu,
    Andrea Arcangeli, Matthew Wilcox, Borislav Petkov,
    "linux-edac@vger.kernel.org", "linux-mm@kvack.org",
    "linux-nvdimm@lists.01.org", "Darrick J. Wong", Jane Chu
Subject: Re: [RFC] Make the memory failure blast radius more precise
Date: Thu, 25 Jun 2020 02:16:42 +0000
Message-ID: <20200625021641.GA21811@hori.linux.bs1.fc.nec.co.jp>
References: <20200623201745.GG21350@casper.infradead.org> <20200623220412.GA21232@agluck-desk2.amr.corp.intel.com>
X-Mailing-List: linux-edac@vger.kernel.org

On Tue, Jun 23, 2020 at 09:32:41PM -0700, David Rientjes wrote:
> On Tue, 23 Jun 2020, Luck, Tony wrote:
>
> > > Hardware actually tells us the blast radius of the error, but we ignore
> > > it and take out the entire page. We've had a customer request to know
> > > exactly how much of the page is damaged so they can avoid reconstructing
> > > an entire 2MB page if only a single cacheline is damaged.
> > >
> > > This is only a strawman that I did in an hour or two; I'd appreciate
> > > architectural-level feedback.
> > > Should I just convert memory_failure() to
> > > always take an address & granularity? Should I create a struct to pass
> > > around (page, phys, granularity) instead of reconstructing the missing
> > > pieces in half a dozen functions? Is this functionality welcome at all,
> > > or is the risk of upsetting applications which expect at least a page
> > > of granularity too high?
> >
> > What is the interface to these applications that want finer granularity?
> >
> > Current code does very poorly with hugetlbfs pages ... user loses the
> > whole 2MB or 1GB. That's just silly (though I've been told that it is
> > hard to fix because allowing a hugetlbfs page to be broken up at an arbitrary
> > time as the result of a machine check means that the kernel needs locking
> > around a bunch of fast paths that currently assume that a huge page will
> > stay being a huge page).
> >
>
> Thanks for bringing this up, Tony. Mike Kravetz pointed me to this thread
> (thanks Mike!) so let's add him in explicitly as well as Andrea, Peter,
> and David from Red Hat who we've been discussing an idea with that may
> introduce exactly this needed support but for different purposes :) The
> timing of this thread is _uncanny_.
>
> To improve the performance of userfaultfd for the purposes of post-copy
> live migration we need to reduce the granularity in which pages are
> migrated; we're looking at this from a 1GB gigantic page perspective but
> the same arguments can likely be had for 2MB hugepages as well. 1GB pages
> are too much of a bottleneck and, as you bring up, 1GB is simply too much
> memory to poison :) We don't have 1GB thp support so the big idea was to
> introduce thp-like DoubleMap support into hugetlbfs for the purposes of
> post-copy live migration and then I had the idea that this could be
> extended to memory failure as well.
>
> (We don't see the lack of 1GB thp here as a deficiency for anything other
> than these two issues, hugetlb provides strong guarantees.)
>
> I don't want to hijack Matthew's thread which is primarily about DAX, but
> did get intrigued by your concerns about hugetlbfs page poisoning. We can
> fork the thread off here to discuss only the hugetlb application of this
> if it makes sense to you or you'd like to collaborate on it as well.
>
> The DoubleMap support would allow us to map the 1GB gigantic pages with
> the PUD and the PMDs as well (and, further, the 2MB hugepages with the PMD
> and PTEs) so that we can copy fragments into PMDs or PTEs and we don't
> need to migrate the entire gigantic page. Any access triggers #PF through
> hugetlb_no_page() -> handle_userfault() which would trigger another
> UFFDIO_COPY and map another fragment.
>
> Assume a world where this DoubleMap support already exists for hugetlb
> pages today and all the invariants including page migration are fixed up
> (since a PTE can now map a hugetlb page and a PMD can now map a gigantic
> hugetlb page). It *seems* like we'd be able to reduce the blast radius
> here too on a hard memory failure: dissolve the gigantic page in place,
> SIGBUS/SIGKILL on the bad PMD or PTE, and avoid poisoning the head of the
> hugetlb page. We agree that poisoning this large amount of memory is not
> ideal :)
>
> Anyway, this was some brainstorming that I was doing with Mike and the
> others based on the idea of using DoubleMap support for post-copy live
> migration. If you would be interested or would like to collaborate on
> it, we'd love to talk.

Thanks for proposing. I think that DoubleMap support could be a good
solution generally (not only for the use case of post-copy live migration).
Splitting a pud/pmd entry into pmd/pte entries has a smaller impact than
migrating all healthy data somewhere else.
The implementation could be challenging, but not as much as thp
splitting, because we don't have to consider collapsing.

DAX mappings seem to have a similar issue. If we can share a pmd mapping
and a pte mapping to a dax file and convert the pmd mapping into pte
mappings, we could contain errors at a smaller granularity for pmem.

Thanks,
Naoya Horiguchi