Date: Tue, 23 Jun 2020 21:32:41 -0700 (PDT)
From: David Rientjes
To: "Luck, Tony", Mike Kravetz, "Dr. David Alan Gilbert", Peter Xu,
    Andrea Arcangeli
Cc: Matthew Wilcox, Borislav Petkov, Naoya Horiguchi,
    linux-edac@vger.kernel.org, linux-mm@kvack.org,
    linux-nvdimm@lists.01.org, "Darrick J. Wong", Jane Chu
Subject: Re: [RFC] Make the memory failure blast radius more precise
In-Reply-To: <20200623220412.GA21232@agluck-desk2.amr.corp.intel.com>
References: <20200623201745.GG21350@casper.infradead.org>
    <20200623220412.GA21232@agluck-desk2.amr.corp.intel.com>
X-Mailing-List: linux-edac@vger.kernel.org

On Tue, 23 Jun 2020, Luck, Tony wrote:

> > Hardware actually tells us the blast radius of the error, but we ignore
> > it and take out the entire page.  We've had a customer request to know
> > exactly how much of the page is damaged so they can avoid reconstructing
> > an entire 2MB page if only a single cacheline is damaged.
> >
> > This is only a strawman that I did in an hour or two; I'd appreciate
> > architectural-level feedback.  Should I just convert memory_failure() to
> > always take an address & granularity?  Should I create a struct to pass
> > around (page, phys, granularity) instead of reconstructing the missing
> > pieces in half a dozen functions?  Is this functionality welcome at all,
> > or is the risk of upsetting applications which expect at least a page
> > of granularity too high?
>
> What is the interface to these applications that want finer granularity?
>
> Current code does very poorly with hugetlbfs pages ... user loses the
> whole 2 MB or 1GB.  That's just silly (though I've been told that it is
> hard to fix because allowing a hugetlbfs page to be broken up at an arbitrary
> time as the result of a machine check means that the kernel needs locking
> around a bunch of fast paths that currently assume that a huge page will
> stay being a huge page).
>

Thanks for bringing this up, Tony.  Mike Kravetz pointed me to this thread
(thanks Mike!)
so let's add him in explicitly, as well as Andrea, Peter, and David from
Red Hat, with whom we've been discussing an idea that may introduce exactly
this needed support but for different purposes :)  The timing of this
thread is _uncanny_.

To improve the performance of userfaultfd for the purposes of post-copy
live migration we need to reduce the granularity at which pages are
migrated; we're looking at this from a 1GB gigantic page perspective but
the same arguments likely apply to 2MB hugepages as well.  1GB pages are
too much of a bottleneck and, as you bring up, 1GB is simply too much
memory to poison :)

We don't have 1GB thp support, so the big idea was to introduce thp-like
DoubleMap support into hugetlbfs for the purposes of post-copy live
migration, and then I had the idea that this could be extended to memory
failure as well.

(We don't see the lack of 1GB thp here as a deficiency for anything other
than these two issues; hugetlb provides strong guarantees.)

I don't want to hijack Matthew's thread, which is primarily about DAX, but
I did get intrigued by your concerns about hugetlbfs page poisoning.  We
can fork the thread off here to discuss only the hugetlb application of
this if it makes sense to you or you'd like to collaborate on it as well.

The DoubleMap support would allow us to map the 1GB gigantic pages with the
PUD and the PMDs as well (and, further, the 2MB hugepages with the PMD and
PTEs) so that we can copy fragments into PMDs or PTEs and we don't need to
migrate the entire gigantic page.  Any access triggers #PF through
hugetlb_no_page() -> handle_userfault(), which would trigger another
UFFDIO_COPY and map another fragment.

Assume a world where this DoubleMap support already exists for hugetlb
pages today and all the invariants including page migration are fixed up
(since a PTE can now map a hugetlb page and a PMD can now map a gigantic
hugetlb page).
It *seems* like we'd be able to reduce the blast radius here too on a hard
memory failure: dissolve the gigantic page in place, SIGBUS/SIGKILL on the
bad PMD or PTE, and avoid poisoning the head of the hugetlb page.  We
agree that poisoning this large amount of memory is not ideal :)

Anyway, this was some brainstorming that I was doing with Mike and the
others based on the idea of using DoubleMap support for post-copy live
migration.  If you would be interested or would like to collaborate on it,
we'd love to talk.
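
To make the blast-radius arithmetic concrete, here is a minimal user-space
sketch, assuming x86-64 sizes (4KB PTE, 2MB PMD, 1GB PUD).  It takes the
(phys, granularity) pair from Matthew's strawman and works out which
PMD-sized or PTE-sized fragments of a gigantic page would have to be
unmapped if the page were double-mapped at that level.  struct mf_range,
struct mf_blast and mf_blast_radius() are names made up for this sketch;
they are not kernel interfaces, and the DoubleMap support it presumes does
not exist for hugetlb.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 mapping sizes; other architectures differ. */
#define PTE_SIZE (1ULL << 12)	/* 4KB base page */
#define PMD_SIZE (1ULL << 21)	/* 2MB hugepage */
#define PUD_SIZE (1ULL << 30)	/* 1GB gigantic page */

/* Hypothetical: what the hardware reports about the failure. */
struct mf_range {
	uint64_t phys;		/* first damaged physical address */
	uint64_t granularity;	/* number of damaged bytes */
};

/* Hypothetical: fragments of the gigantic page that are lost. */
struct mf_blast {
	uint64_t first;		/* index of the first affected fragment */
	uint64_t count;		/* number of affected fragments */
};

/*
 * Given the physical base of a 1GB page and the fragment size it is
 * (hypothetically) double-mapped with, return the fragments that
 * overlap the damaged range.  Everything else could stay mapped.
 */
static struct mf_blast mf_blast_radius(struct mf_range r,
				       uint64_t page_base,
				       uint64_t frag_size)
{
	struct mf_blast b;
	uint64_t start, end;

	assert(r.granularity > 0);
	assert(r.phys >= page_base);

	start = r.phys - page_base;		/* offset of first bad byte */
	end = start + r.granularity - 1;	/* offset of last bad byte */
	assert(end < PUD_SIZE);

	b.first = start / frag_size;
	b.count = end / frag_size - b.first + 1;
	return b;
}
```

With a single damaged 64-byte cacheline, calling mf_blast_radius() with
PTE_SIZE reports one 4KB fragment out of 262144, and with PMD_SIZE one 2MB
fragment out of 512, rather than the whole 1GB page.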