Date: Wed, 3 Nov 2021 09:58:28 -0700
From: Christoph Hellwig
To: Dan Williams
Cc: Christoph Hellwig, "Darrick J. Wong", Jane Chu, "david@fromorbit.com",
	"vishal.l.verma@intel.com", "dave.jiang@intel.com", "agk@redhat.com",
	"snitzer@redhat.com", "dm-devel@redhat.com", "ira.weiny@intel.com",
	"willy@infradead.org", "vgoyal@redhat.com",
	"linux-fsdevel@vger.kernel.org", "nvdimm@lists.linux.dev",
	"linux-kernel@vger.kernel.org", "linux-xfs@vger.kernel.org"
Subject: Re: [dm-devel] [PATCH 0/6] dax poison recovery with RWF_RECOVERY_DATA flag
References: <20211021001059.438843-1-jane.chu@oracle.com>
	<2102a2e6-c543-2557-28a2-8b0bdc470855@oracle.com>
	<20211028002451.GB2237511@magnolia>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Nov 02, 2021 at 12:57:10PM -0700, Dan Williams wrote:
> This goes back to one of the original DAX concerns of wanting a kernel
> library for coordinating PMEM mmap I/O vs leaving userspace to wrap
> PMEM semantics on top of a DAX mapping. The problem is that mmap-I/O
> has this error-handling-API issue whether it is a DAX mapping or not.

Semantics of writes through shared mmaps are a nightmare. Agreed,
including agreeing that this is neither new nor pmem specific. But
it also has absolutely nothing to do with the new RWF_ flag.

> CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE implies that processes will
> receive SIGBUS + BUS_MCEERR_A{R,O} when memory failure is signalled
> and then rely on readv(2)/writev(2) to recover. Do you see a readily
> available way to improve upon that model without CPU instruction
> changes? Even with CPU instructions changes, do you think it could
> improve much upon the model of interrupting the process when a load
> instruction aborts?

The "only" thing we need is something like the exception table we use
in the kernel for the uaccess helpers (and the new _nofault kernel
access helper). But I suspect refitting that into userspace
environments is probably non-trivial.

> I do agree with you that DAX needs to separate itself from block, but
> I don't think it follows that DAX also needs to separate itself from
> readv/writev for when a kernel slow-path needs to get involved because
> mmap I/O (just CPU instructions) does not have the proper semantics.
> Even if you got one of the ARCH_SUPPORTS_MEMORY_FAILURE to implement
> those semantics in new / augmented CPU instructions you will likely
> not get all of them to move and certainly not in any near term
> timeframe, so the kernel path will be around indefinitely.

I think you misunderstood me. I don't think pmem needs to be
decoupled from the read/write path. But I'm very skeptical of adding
a new flag to the common read/write path for the special workaround
that a plain old write will not actually clear errors, unlike every
other store interface.

> Meanwhile, I think RWF_RECOVER_DATA is generically useful for other
> storage besides PMEM and helps storage-drivers do better than large
> blast radius "I/O error" completions with no other recourse.

How?
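
To illustrate the exception-table point above: the closest thing
userspace has today is a SIGBUS handler that jumps to a fixup via
sigsetjmp()/siglongjmp(). What follows is only a rough sketch of that
pattern, not anything proposed in this thread. All function names
(install_sigbus_handler, copy_from_mapping_nofault,
demo_truncated_mapping) are hypothetical. The demo also does not touch
real poison: it provokes SIGBUS by reading a mapped page that lies
beyond end-of-file, which is assumed here to be a reasonable stand-in
for the BUS_MCEERR_AR case. The single global jump buffer means this is
not thread-safe as written.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helpers; none of these names exist in the kernel or libc. */
static sigjmp_buf fault_env;
static volatile sig_atomic_t fault_armed;

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	(void)sig; (void)info; (void)ctx;
	if (fault_armed)
		siglongjmp(fault_env, 1);	/* jump to the "fixup" */
	/* Not inside a guarded copy: fall back to the default action. */
	signal(SIGBUS, SIG_DFL);
}

int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}

/* Copy len bytes; returns 0 on success, -1 if the load raised SIGBUS. */
int copy_from_mapping_nofault(void *dst, const void *src, size_t len)
{
	const volatile unsigned char *s = src;
	unsigned char *d = dst;
	size_t i;

	if (sigsetjmp(fault_env, 1)) {	/* 1: also restore the signal mask */
		fault_armed = 0;
		return -1;
	}
	fault_armed = 1;
	for (i = 0; i < len; i++)	/* volatile loads stay inside the guard */
		d[i] = s[i];
	fault_armed = 0;
	return 0;
}

/* Returns 0 if the in-file page copies fine and the page past EOF faults. */
int demo_truncated_mapping(void)
{
	char path[] = "/tmp/sigbus-demo-XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	unsigned char buf[16];
	unsigned char *map;
	int fd, rc = -1;

	assert(page > 0);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);
	/* File is one page long, but two pages get mapped. */
	if (ftruncate(fd, page) != 0 || pwrite(fd, "hello", 5, 0) != 5)
		goto out;
	map = mmap(NULL, 2 * page, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;
	if (install_sigbus_handler() != 0)
		goto unmap;
	if (copy_from_mapping_nofault(buf, map, 5) == 0 &&
	    memcmp(buf, "hello", 5) == 0 &&
	    copy_from_mapping_nofault(buf, map + page, sizeof(buf)) == -1)
		rc = 0;
unmap:
	munmap(map, 2 * page);
out:
	close(fd);
	return rc;
}
```

A caller that gets -1 back would then go to the read/write path to
recover, which is the model Dan describes. The sketch also shows why
this is awkward compared to the kernel's exception table: the guard is
process-global state, every guarded region needs its own setjmp, and
library code cannot safely take over SIGBUS from the application.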