From: Dan Williams
Date: Thu, 16 Sep 2021 13:33:42 -0700
Subject: Re: [RFT PATCH] x86/pat: Fix set_mce_nospec() for pmem
To: Borislav Petkov
Cc: Linux NVDIMM, Jane Chu, Luis Chamberlain, Tony Luck

On Wed, Sep 15, 2021 at 3:41 AM Borislav Petkov wrote:
>
> On Tue, Sep 14, 2021 at 11:08:00AM -0700, Dan Williams wrote:
> > Sure, I should probably include the following note in all patches
> > touching the DAX-memory_failure() path, because it is a frequently
> > asked question. The tl;dr is:
> >
> > Typical memory_failure() does not assume the physical page can be
> > recovered and put back into circulation; PMEM memory_failure()
> > allows for recovery of the page.
>
> Hmm, I think by "PMEM memory_failure()" you mean what pmem_do_write()
> does with that poison clearing, or?

I am specifically talking about the memory_failure_dev_pagemap() path
taken from memory_failure().

> But then I have no clue what the "DAX-memory_failure()" path is.

Sorry, I should not have lazily used {PMEM,DAX} memory_failure() to
refer to the exact routine in question: memory_failure_dev_pagemap().
That path is taken when memory_failure() sees that a pfn was mapped by
memremap_pages(). It determines that the pfn does not represent a
dynamic page-allocator page, but may instead refer to a static page of
storage from a device like /dev/pmem or /dev/dax.
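
For reference, the routing in memory_failure() looks roughly like this
today (paraphrased from mm/memory-failure.c, details trimmed):

        p = pfn_to_online_page(pfn);
        if (!p) {
                if (pfn_valid(pfn)) {
                        /* pages from memremap_pages() are never
                         * "online"; hand them to the device-aware
                         * handler instead of the page-allocator
                         * quarantine path */
                        pgmap = get_dev_pagemap(pfn, NULL);
                        if (pgmap)
                                return memory_failure_dev_pagemap(pfn,
                                                flags, pgmap);
                }
                ...
        }

pfn_to_online_page() returns NULL for ZONE_DEVICE pfns, which is what
diverts pmem/dax pages away from the page-offlining path.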
> > The longer description is:
> > Typical memory_failure() for anonymous or page-cache pages has the
> > flexibility to invalidate bad pages and trigger any users to
> > request a new page from the page allocator to replace the
> > quarantined one. DAX removes that flexibility. The page is a handle
> > for a fixed storage location, i.e. there is no mechanism to remap a
> > physical page to a different logical address. Software expects to
> > be able to repair an error in PMEM by reading around the poisoned
> > cache lines and writing zeros, via
> > fallocate(...FALLOC_FL_PUNCH_HOLE...), to overwrite the poison. The
> > page needs to remain accessible to enable recovery.
>
> Aha, so memory failure there is not offlining the 4K page but simply
> zeroing the poison.

The zeroing of the poison comes optionally later, if userspace
explicitly asks for it to be cleared.

> What happens if the same physical location triggers more read
> errors, i.e., the underlying hw is going bad? Don't you want to
> exclude that location from ever being written again?

Yes. The device driver avoids reconsumption of the same error by
recording the error in a "badblocks" structure (block/badblocks.c).
The driver consults its badblocks instance on every subsequent access
and preemptively returns EIO. The driver registers for machine-check
notifications and translates those events into badblocks entries, so
repeat consumption is avoided unless/until the badblocks entry can be
cleared along with the poison itself.

I'll note that, going forward, the filesystem wants to be more
involved in tracking and managing these errors, so the driver will
grow the ability to forward those notifications up the stack.

> Or maybe such recovery action doesn't make sense for pmem?

It does. In addition to synchronous machine-check notification of
poison, there are asynchronous scanning mechanisms that prepopulate
the badblocks entries to get ahead of consumption events.
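
For illustration, the gatekeeping in the pmem driver looks roughly
like this (paraphrased from drivers/nvdimm/pmem.c, details trimmed):

        static bool is_bad_pmem(struct badblocks *bb, sector_t sector,
                        unsigned int len)
        {
                if (bb->count) {
                        sector_t first_bad;
                        int num_bad;

                        /* non-zero when [sector, sector + len)
                         * overlaps a recorded error */
                        return !!badblocks_check(bb, sector, len / 512,
                                        &first_bad, &num_bad);
                }
                return false;
        }

pmem_do_read() fails such an access with BLK_STS_IOERR before touching
the media, while pmem_do_write() gets the chance to do the poison
clearing you mentioned.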
> > Short answer, PMEM never goes "offline" because it was never
> > "online" in the first place. Where "online" in this context
> > specifically refers to pfns that are under the watchful eye of the
> > core-mm page allocator.
>
> Ok, so pmem wants to be handled differently during memory failure.
> Looking at the patch again, you change the !unmap case to do
> _set_memory_uc().
>
> That bool unmap thing gets *not* set in whole_page():
>
> 	return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
>
> so I'm guessing that "Recoverable Address LSB (bits 5:0): The lowest
> valid recoverable address bit." thing is < 12 for pmem.
>
> But are you really saying that the hardware would never report a
> lower than 12 value for normal memory?

Ugh, no, I failed to update my mental model of set_mce_nospec() usage
after commit 17fae1294ad9 ("x86/{mce,mm}: Unmap the entire page if the
whole page is affected and poisoned"). I was just looking for "where
does set_mce_nospec() make the UC vs NP decision", and did not read
the new calling convention... more below.

> If it does, then that is wrong here.
>
> In any case, I'd prefer if the code would do an explicit check
> somewhere saying "is this pmem" in order to do that special action
> only for when it really is pmem and not rely on it implicitly.

After thinking this through a bit I don't think we want that, but I
did find a bug, and this patch needs a rewritten changelog to say why
pmem does not need to be explicitly called out in this path.

The reasoning is that set_mce_nospec() was introduced to achieve the
goal of preventing speculative consumption of poison *and* allowing
reads around poisoned cachelines for PMEM recovery. UC achieves that
goal for both PMEM and DRAM, unlike what this changelog alluded to,
which was NP for DRAM and UC for PMEM.

The new consideration of whole_page() errors means that the code in
nfit_handle_mce() is wrong to assume that the length of the error is
just one cacheline. When that bug is fixed, the badblocks range will
cover the entire page in whole_page() notification cases. Because the
entire page is dead there is no need for UC; all the cachelines are
gone, so there is nothing to read around in this page.

In short, DRAM and PMEM want to share the exact same policy here: use
NP for whole_page() errors and UC otherwise. Then it is just the small
matter of ignoring the memtype by using _set_memory_uc().
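
I.e. the end state is something like the following sketch (against the
current set_mce_nospec() calling convention, not the literal patch):

        static inline int set_mce_nospec(unsigned long pfn, bool unmap)
        {
                unsigned long decoy_addr;
                int rc;

                /* non-canonical alias of the 1:1 map, so that no
                 * speculatable pointer to the poison page is left
                 * sitting in registers */
                decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));

                if (unmap)
                        /* whole_page(): every cacheline is gone,
                         * nothing left to read around */
                        rc = set_memory_np(decoy_addr, 1);
                else
                        /* sub-page error: keep the page readable, but
                         * uncached; _set_memory_uc() skips the memtype
                         * reservation that a pmem range would fail */
                        rc = _set_memory_uc(decoy_addr, 1);
                if (rc)
                        pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n",
                                pfn);
                return rc;
        }

Thanks for triggering that deeper look.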