From: Dan Williams
Date: Wed, 3 Nov 2021 23:21:39 -0700
Subject: Re: [dm-devel] [PATCH 0/6] dax poison recovery with RWF_RECOVERY_DATA flag
To: Jane Chu
Cc: Christoph Hellwig, "Darrick J. Wong", david@fromorbit.com,
	vishal.l.verma@intel.com, dave.jiang@intel.com, agk@redhat.com,
	snitzer@redhat.com, dm-devel@redhat.com, ira.weiny@intel.com,
	willy@infradead.org, vgoyal@redhat.com, linux-fsdevel@vger.kernel.org,
	nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org
References: <20211021001059.438843-1-jane.chu@oracle.com>
	<2102a2e6-c543-2557-28a2-8b0bdc470855@oracle.com>
	<20211028002451.GB2237511@magnolia>
Wong" , "david@fromorbit.com" , "vishal.l.verma@intel.com" , "dave.jiang@intel.com" , "agk@redhat.com" , "snitzer@redhat.com" , "dm-devel@redhat.com" , "ira.weiny@intel.com" , "willy@infradead.org" , "vgoyal@redhat.com" , "linux-fsdevel@vger.kernel.org" , "nvdimm@lists.linux.dev" , "linux-kernel@vger.kernel.org" , "linux-xfs@vger.kernel.org" Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Nov 3, 2021 at 11:10 AM Jane Chu wrote: > > On 11/1/2021 11:18 PM, Christoph Hellwig wrote: > > On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote: > >> ...so would you happen to know if anyone's working on solving this > >> problem for us by putting the memory controller in charge of dealing > >> with media errors? > > > > The only one who could know is Intel.. > > > >> The trouble is, we really /do/ want to be able to (re)write the failed > >> area, and we probably want to try to read whatever we can. Those are > >> reads and writes, not {pre,f}allocation activities. This is where Dave > >> and I arrived at a month ago. > >> > >> Unless you'd be ok with a second IO path for recovery where we're > >> allowed to be slow? That would probably have the same user interface > >> flag, just a different path into the pmem driver. > > > > Which is fine with me. If you look at the API here we do have the > > RWF_ API, which them maps to the IOMAP API, which maps to the DAX_ > > API which then gets special casing over three methods. > > > > And while Pavel pointed out that he and Jens are now optimizing for > > single branches like this. I think this actually is silly and it is > > not my point. > > > > The point is that the DAX in-kernel API is a mess, and before we make > > it even worse we need to sort it first. What is directly relevant > > here is that the copy_from_iter and copy_to_iter APIs do not make > > sense. Most of the DAX API is based around getting a memory mapping > > using ->direct_access, it is just the read/write path which is a slow > > path that actually uses this. I have a very WIP patch series to try > > to sort this out here: > > > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize > > > > But back to this series. The basic DAX model is that the callers gets a > > memory mapping an just works on that, maybe calling a sync after a write > > in a few cases. So any kind of recovery really needs to be able to > > work with that model as going forward the copy_to/from_iter path will > > be used less and less. i.e. file systems can and should use > > direct_access directly instead of using the block layer implementation > > in the pmem driver. As an example the dm-writecache driver, the pending > > bcache nvdimm support and the (horribly and out of tree) nova file systems > > won't even use this path. We need to find a way to support recovery > > for them. And overloading it over the read/write path which is not > > the main path for DAX, but the absolutely fast path for 99% of the > > kernel users is a horrible idea. > > > > So how can we work around the horrible nvdimm design for data recovery > > in a way that: > > > > a) actually works with the intended direct memory map use case > > b) doesn't really affect the normal kernel too much > > > > ? > > > > This is clearer, I've looked at your 'dax-devirtualize' patch which > removes pmem_copy_to/from_iter, and as you mentioned before, > a separate API for poison-clearing is needed. 
> This is clearer.  I've looked at your 'dax-devirtualize' patch, which
> removes pmem_copy_to/from_iter, and as you mentioned before,
> a separate API for poison-clearing is needed.  So how about I go ahead
> and rebase my earlier patch
>
> https://lore.kernel.org/lkml/20210914233132.3680546-2-jane.chu@oracle.com/
>
> on 'dax-devirtualize' and provide dm support for clear-poison?
> That way, the non-dax 99% of the pwrite use cases aren't impacted at
> all, and we resolve the urgent pmem poison-clearing issue?
>
> Dan, are you okay with this?  I am getting pressure from our customers
> who are basically stuck at the moment.

The concern I have with dax_clear_poison() is that it precludes atomic
error clearing.  Also, as Boris and I discussed, poisoned pages should
be marked NP (not present) rather than UC (uncacheable) [1].  With those
two properties combined, I think that wants a custom pmem fault handler
that knows how to carefully write to pmem pages with poison present
(a rough sketch of that kind of write path follows below), rather than
an additional explicit dax operation.  That also meets Christoph's
requirement of "works with the intended direct memory map use case".

[1]: https://lore.kernel.org/r/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com
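To make the "careful write" concrete, here is a hypothetical sketch of
the operation such a fault handler would wrap.  It is not Dan's actual
proposal: pmem_recovery_write() and clear_poison_range() are invented
names for illustration, while badblocks_check(), memcpy_flushcache(),
SECTOR_SHIFT, and struct pmem_device's badblocks list are real
v5.15-era kernel interfaces.

	/*
	 * Hypothetical recovery write: clear poison over the affected
	 * range, then overwrite with the new data.  Sketch only.
	 */
	static size_t pmem_recovery_write(struct pmem_device *pmem,
			phys_addr_t pmem_off, void *pmem_addr,
			const void *buf, size_t len)
	{
		sector_t sector = pmem_off >> SECTOR_SHIFT;
		sector_t first_bad;
		int num_bad;

		/* Take the slow path only if the range overlaps known poison. */
		if (badblocks_check(&pmem->bb, sector, len >> SECTOR_SHIFT,
				    &first_bad, &num_bad)) {
			/*
			 * Ask the bus/firmware to clear the poisoned
			 * cachelines.  This is the non-atomic window the
			 * dax_clear_poison() concern is about: a racing
			 * fault could map the page mid-clear.  Keeping the
			 * page NP (not present) in a custom fault handler
			 * until both this clear and the overwrite below
			 * complete is what closes that window.
			 */
			clear_poison_range(pmem, first_bad, num_bad); /* invented */
		}

		/* The overwrite itself is then a normal flushing copy. */
		memcpy_flushcache(pmem_addr, buf, len);
		return len;
	}

The point of the fault-handler framing is that the clear and the
overwrite become one critical section behind the not-present mapping,
instead of a standalone clear-poison call that other mappers of the
page could race against.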