From: Dan Williams <firstname.lastname@example.org>
To: "Darrick J. Wong" <email@example.com>
Cc: "firstname.lastname@example.org" <email@example.com>,
    "firstname.lastname@example.org" <email@example.com>,
    "firstname.lastname@example.org" <email@example.com>,
    "firstname.lastname@example.org" <email@example.com>,
    "firstname.lastname@example.org" <email@example.com>,
    Dave Chinner <firstname.lastname@example.org>,
    "email@example.com" <firstname.lastname@example.org>,
    "email@example.com" <firstname.lastname@example.org>,
    "email@example.com" <firstname.lastname@example.org>,
    "email@example.com" <firstname.lastname@example.org>,
    "email@example.com" <firstname.lastname@example.org>,
    "email@example.com" <firstname.lastname@example.org>,
    "email@example.com" <firstname.lastname@example.org>,
    "email@example.com" <firstname.lastname@example.org>
Subject: Re: [Ocfs2-devel] Question about the "EXPERIMENTAL" tag for dax in XFS
Date: Mon, 1 Mar 2021 21:41:02 -0800
Message-ID: <CAPcyv4jXH0F+aii6ZtYQ3=Rx-mOWM7NFHC9wVxacWemail@example.com>
In-Reply-To: <20210302032805.GM7272@magnolia>

On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong <firstname.lastname@example.org> wrote:
>
> On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote:
> > On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner <email@example.com> wrote:
> > >
> > > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner <firstname.lastname@example.org> wrote:
> > > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner <email@example.com> wrote:
> > > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > > > it points to, check if it points to the PMEM that is being removed,
> > > > > grab the page it points to, map that to the relevant struct page,
> > > > > run collect_procs() on that page, then kill the user processes that
> > > > > map that page.
> > > > >
> > > > > So why can't we walk the ptes, check the physical pages that they
> > > > > map to and if they map to a pmem page we go poison that
> > > > > page and that kills any user process that maps it.
> > > > >
> > > > > i.e. I can't see how unexpected pmem device unplug is any different
> > > > > to an MCE delivering a hwpoison event to a DAX mapped page.
> > > >
> > > > I guess the tradeoff is walking a long list of inodes vs walking a
> > > > large array of pages.
> > >
> > > Not really. You're assuming all a filesystem has to do is invalidate
> > > everything if a device goes away, and that's not true. Finding if an
> > > inode has a mapping that spans a specific device in a multi-device
> > > filesystem can be a lot more complex than that. Just walking inodes
> > > is easy - determining which inodes need invalidation is the hard
> > > part.
> >
> > That inode-to-device level of specificity is not needed for the same
> > reason that drop_caches does not need to be specific. If the wrong
> > page is unmapped a re-fault will bring it back, and re-fault will fail
> > for the pages that are successfully removed.
> >
> > > That's where ->corrupt_range() comes in - the filesystem is already
> > > set up to do reverse mapping from physical range to inode(s)
> > > offsets...
> >
> > Sure, but what is the need to get to that level of specificity with
> > the filesystem for something that should rarely happen in the course
> > of normal operation outside of a mistake?
>
> I can't tell if we're conflating the "a bunch of your pmem went bad"
> case with the "all your dimms fell out of the machine" case.

From the pmem driver perspective it has the media scanning to find some
small handful of cachelines that have gone bad, and it has the driver
->remove() callback to tell it a bunch of pmem is now offline. The
NVDIMM device "range has gone bad" mechanism has no way to communicate
that multiple terabytes have gone bad at once.

In fact I think the distinction is important: ->remove() should not be
treated as ->corrupted_range(), because I expect the level of freakout
is much worse for a "your storage is offline" notification vs a "your
storage is corrupted" notification.

> If, say, a single cacheline's worth of pmem goes bad on a node with 2TB
> of pmem, I certainly want that level of specificity. Just notify the
> users of the dead piece, don't flush the whole machine down the drain.

Right, something like corrupted_range() is there to say, "keep going
upper layers, but note that this handful of sectors now has
indeterminate data and will return -EIO on access until repaired". The
repair for device-offline is device-online.

>
> > > > There's likely always more pages than inodes, but perhaps it's more
> > > > efficient to walk the 'struct page' array than sb->s_inodes?
> > >
> > > I really don't see why you seem to be telling us that invalidation is an
> > > either/or choice. There's more ways to convert physical block
> > > address -> inode file offset and mapping index than brute force
> > > inode cache walks....
> >
> > Yes, but I was trying to map it to an existing mechanism and the
> > internals of drop_pagecache_sb() are, in coarse terms, close to what
> > needs to happen here.
>
> Yes. XFS (with rmap enabled) can do all the iteration and walking in
> that function except for the invalidate_mapping_* call itself. The goal
> of this series is first to wire up a callback within both the block and
> pmem subsystems so that they can take notifications and reverse-map them
> through the storage stack until they reach an fs superblock.

I'm chuckling because this "reverse map all the way up the block layer"
is the opposite of what Dave said in his first reaction to my proposal,
"can't the mm map pfns to fs inode address_spaces?".

I think dax unmap is distinct from corrupted_range() precisely because
they are events happening in two different domains, block device
sectors vs dax device pfns.

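
To make the corrupted-range vs device-offline distinction above concrete, here is a minimal userspace sketch. It is a toy model, not kernel code: `struct toy_dev`, `toy_corrupted_range()`, `toy_repair()`, `toy_remove()`, `toy_online()` and `toy_read()` are hypothetical names invented for illustration, and returning -ENODEV for an offline device is an assumption. Only the "-EIO on access until repaired" behaviour and "the repair for device-offline is device-online" come from the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_BAD 16

/* Hypothetical toy types; these are not kernel structures. */
struct toy_range { uint64_t start, len; };

struct toy_dev {
    uint64_t nr_sectors;
    bool online;                        /* cleared by toy_remove() */
    struct toy_range bad[TOY_MAX_BAD];  /* corrupted sector ranges */
    size_t nr_bad;
};

void toy_dev_init(struct toy_dev *d, uint64_t nr_sectors)
{
    assert(d != NULL);
    d->nr_sectors = nr_sectors;
    d->online = true;
    d->nr_bad = 0;
}

/* "Your storage is corrupted": record a small range, keep going. */
int toy_corrupted_range(struct toy_dev *d, uint64_t start, uint64_t len)
{
    if (d->nr_bad == TOY_MAX_BAD)
        return -ENOSPC;
    d->bad[d->nr_bad++] = (struct toy_range){ start, len };
    return 0;
}

/* Repair: forget recorded ranges fully covered by [start, start + len). */
void toy_repair(struct toy_dev *d, uint64_t start, uint64_t len)
{
    size_t i = 0;

    while (i < d->nr_bad) {
        struct toy_range *r = &d->bad[i];

        if (r->start >= start && r->start + r->len <= start + len)
            *r = d->bad[--d->nr_bad];   /* swap-remove, recheck slot i */
        else
            i++;
    }
}

/* "Your storage is offline": nothing is readable until toy_online(). */
void toy_remove(struct toy_dev *d) { d->online = false; }
void toy_online(struct toy_dev *d) { d->online = true; }

int toy_read(const struct toy_dev *d, uint64_t start, uint64_t len)
{
    if (!d->online)
        return -ENODEV;     /* assumption: error code picked for the sketch */
    if (start + len > d->nr_sectors)
        return -EINVAL;
    for (size_t i = 0; i < d->nr_bad; i++) {
        const struct toy_range *r = &d->bad[i];

        /* Any overlap with a bad range fails the whole read. */
        if (start < r->start + r->len && r->start < start + len)
            return -EIO;
    }
    return 0;
}
```

In this model a corrupted range only fails reads that overlap it and is cleared by a targeted repair, while a remove fails every read regardless of the recorded ranges and is undone only by bringing the device back online.
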
Let's step back. I think a chain of ->corrupted_range() callbacks up
the block stack terminating in the filesystem with dax implications
tacked on is the wrong abstraction. Why not use the existing generic
object for communicating bad sector ranges, 'struct badblocks'?

Today whenever the pmem driver receives a new corrupted range
notification from the lower level nvdimm infrastructure
(nd_pmem_notify) it updates the 'badblocks' instance associated with
the pmem gendisk and then notifies userspace that there are new
badblocks. This seems like a perfect place to signal an upper level
stacked block device that may also be watching disk->bb. Then each
gendisk in a stacked topology is responsible for watching the badblock
notifications of the next level and storing a remapped instance of
those blocks until ultimately the filesystem mounted on the top-level
block device is responsible for registering for those top-level
disk->bb events.

The device-gone notification does not map cleanly onto 'struct
badblocks'.

If an upper level agent really cared about knowing about ->remove()
events before they happened it could maybe do something like:

    dev = disk_to_dev(bdev->bd_disk)->parent;
    bus_register_notifier(dev->bus, &disk_host_device_notifier_block);

...where it's trying to watch for events that will trigger the driver
->remove() callback on the device hosting a disk.

I still don't think that solves the need for a separate mechanism for
global dax_device pte invalidation.

I think that global dax_device invalidation needs new kernel
infrastructure to allow internal users, like dm-writecache and future
filesystems using dax for metadata, to take a fault when pmem is
offlined. They can't use the direct-map because the direct-map can't
fault, and they can't indefinitely pin metadata pages because that
blocks ->remove() from being guaranteed of forward progress.
Then an invalidation event is indeed a walk of address_space-like
objects where some are fs-inodes and some are kernel-mode dax-users,
and that remains independent from remove events and badblocks
notifications because they are independent objects and events.

In contrast, I think calling something like soft_offline_page() a pfn
at a time over terabytes will take forever, especially when that event
need not fire if the dax_device is not mounted.

> Once the information has reached XFS, it can use its own reverse
> mappings to figure out which pages of which inodes are now targeted.

It has its own sector-based reverse mappings; it does not have a pfn
reverse map.

> The future of DAX hw error handling can be that you throw the spitwad at
> us, and it's our problem to distill that into mm invalidation calls.
> XFS' reverse mapping data is indexed by storage location and isn't
> sharded by address_space, so (except for the DIMMs falling out), we
> don't need to walk the entire inode list or scan the entire mapping.

->remove() is effectively all the DIMMs falling out for all XFS knows.

> Between XFS and DAX and mm, the mm already has the invalidation calls,
> xfs already has the distiller, and so all we need is that first bit.
> The current mm code doesn't fully solve the problem, nor does it need
> to, since it handles DRAM errors acceptably* already.
>
> * Actually, the hwpoison code should _also_ be calling ->corrupted_range
> when DRAM goes bad so that we can detect metadata failures and either
> reload the buffer or (if it was dirty) shut down.
[..]
> > Going forward, for buses like CXL, there will be a managed physical
> > remove operation via PCIE native hotplug. The flow there is that the
> > PCIE hotplug driver will notify the OS of a pending removal, trigger
> > ->remove() on the pmem driver, and then notify the technician (slot
> > status LED) that the card is safe to pull.
>
> Well, that's a relief. Can we cancel longterm RDMA leases now too?
> <duck>

Yes, all problems can be solved with more blinky lights.

_______________________________________________
Ocfs2-devel mailing list
Ocfs2firstname.lastname@example.org
https://oss.oracle.com/mailman/listinfo/ocfs2-devel
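
The stacked 'struct badblocks' propagation proposed in the message above can be sketched in userspace C as follows. This is an illustrative toy under stated assumptions, not the kernel's badblocks API: `struct toy_disk`, `toy_disk_add_badblocks()` and `toy_fs_record()` are hypothetical names, and the stacking is assumed to be a single linear mapping (one window of the lower disk appears at a fixed offset in the upper disk). Real stacked devices can remap in more complicated ways.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_BB 8

struct toy_bb { uint64_t start, len; };     /* one bad sector range */

/* Hypothetical stand-in for a gendisk plus its badblocks list (disk->bb). */
struct toy_disk {
    struct toy_bb bb[TOY_MAX_BB];
    size_t nr_bb;
    struct toy_disk *upper;     /* disk stacked on top, NULL at the top */
    /*
     * Assumed linear stacking: our sectors [lo_start, lo_start + map_len)
     * appear in 'upper' starting at sector up_start.
     */
    uint64_t lo_start, map_len, up_start;
    /* Top-level disk only: the mounted filesystem's registration. */
    void (*fs_notify)(uint64_t start, uint64_t len);
};

/* Example filesystem listener: remembers the last event it saw. */
struct toy_bb toy_fs_last;
int toy_fs_events;

void toy_fs_record(uint64_t start, uint64_t len)
{
    toy_fs_last = (struct toy_bb){ start, len };
    toy_fs_events++;
}

/* Record a bad range on 'd' and pass a remapped copy up the stack. */
void toy_disk_add_badblocks(struct toy_disk *d, uint64_t start, uint64_t len)
{
    assert(d != NULL);
    if (len == 0 || d->nr_bb == TOY_MAX_BB)
        return;                 /* toy limit: silently drop when full */
    d->bb[d->nr_bb++] = (struct toy_bb){ start, len };

    if (!d->upper) {
        /* Top of the stack: tell the filesystem, if one registered. */
        if (d->fs_notify)
            d->fs_notify(start, len);
        return;
    }

    /* Clip to the window the upper disk can actually see. */
    uint64_t end = start + len;
    uint64_t win_end = d->lo_start + d->map_len;

    if (start < d->lo_start)
        start = d->lo_start;
    if (end > win_end)
        end = win_end;
    if (start >= end)
        return;                 /* range is invisible to the upper level */

    /* Translate into the upper disk's sector space and recurse. */
    toy_disk_add_badblocks(d->upper, start - d->lo_start + d->up_start,
                           end - start);
}
```

Each level keeps its own copy of the range in its own sector space, so the filesystem at the top only ever sees sectors it can feed to its reverse mapping. The sketch covers sector-domain bad ranges only; it says nothing about device removal or pfn-domain dax invalidation, which the message treats as separate events.
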