Linux-NVDIMM Archive on lore.kernel.org
From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	linux-xfs@vger.kernel.org, linux-nvdimm@lists.01.org,
	linux-kernel@vger.kernel.org, rgoldwyn@suse.de,
	gujx@cn.fujitsu.com, qi.fuli@fujitsu.com
Subject: Re: [RFC PATCH 0/7] xfs: add reflink & dedupe support for fsdax.
Date: Thu, 10 Oct 2019 18:30:20 +1100
Message-ID: <20191010073020.GI16973@dread.disaster.area> (raw)
In-Reply-To: <20191009171152.GF13108@magnolia>

On Wed, Oct 09, 2019 at 10:11:52AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 08, 2019 at 11:31:44PM -0700, Christoph Hellwig wrote:
> > Btw, I just had a chat with Dan last week on this.  And he pointed out
> > that while this series deals with the read/write path issues of 
> > reflink on DAX it doesn't deal with the mmap side issue that
> > page->mapping and page->index can point back to exactly one file.
> > 
> > I think we want a few xfstests that reflink a file and then use the
> > different links using mmap, as that should blow up pretty reliably.
> 
> Hmm, you're right, we don't actually have a test that checks the
> behavior of mwriting all copies of a shared block.  Ok, I'll go write
> one.

I've pointed this problem out to everyone who has asked me "what do
we need to do to support reflink on DAX". I've even walked a couple
of people right through the problem that needs to be solved and
discussed the potential solutions to it.

Problems that I think need addressing:

	- device dax and filesystem dax have fundamentally different
	  needs in this space, so they need to be separated and not
	  try to use the same solution.
	- dax_lock_entry() is used as a substitute for page_lock(),
	  but because it is not held on the page itself it can't be
	  extended to serialise access to the page across multiple
	  mappings that are unaware of each other.
	- the dax_lock_page/dax_unlock_page interface for hardware
	  memory errors needs to report to the filesystem for
	  processing and repair, not assume the page is user data
	  and that killing processes is the only possible recovery
	  mechanism.
	- dax_associate_entry/dax_disassociate_entry can only work
	  for a 1:1 page:mapping,index relationship. It needs to go
	  away and be replaced by a mechanism that allows
	  tracking multiple page mapping/index/state tuples. This
	  has much wider use than DAX (e.g. sharing page cache pages
	  between reflinked files)
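
To make the 1:1 limitation in the last point concrete, here is a
minimal userspace sketch in Python. It is a toy model, not kernel
code: the Page class and the associate()/disassociate() helpers are
hypothetical stand-ins that only mimic the idea of a page recording a
single (mapping, index) owner.

```python
class Page:
    """Toy stand-in for a page with room for exactly one owner tuple."""

    def __init__(self, pfn):
        self.pfn = pfn
        self.mapping = None  # owning file (hypothetical: just a name)
        self.index = None    # offset of the page within that file


def associate(page, mapping, index):
    """Record (mapping, index) as the page's owner; fail if already owned."""
    if page.mapping is not None:
        raise RuntimeError(
            f"pfn {page.pfn} already belongs to {page.mapping}:{page.index}"
        )
    page.mapping, page.index = mapping, index


def disassociate(page):
    """Clear the owner so the page can be associated again."""
    page.mapping = page.index = None


page = Page(pfn=100)
associate(page, "fileA", 0)        # first file maps the shared block
try:
    associate(page, "fileB", 7)    # reflinked copy maps the same block
except RuntimeError as err:
    print("conflict:", err)
```

With a single owner slot, the second file sharing the block has
nowhere to record its own mapping and index, which is why a structure
holding several tuples per page is needed.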

I've proposed shadow pages (based on a concept from Matthew Wilcox)
for each read-only reflink mapping with the real physical page being
owned by the filesystem and indexed by LBA in the filesystem buffer
cache. This would be based on whether the extent in the file the
page is mapped from has multiple references to it.

i.e. When a new page mapping occurs in a shared extent, we add the
page to the buffer cache (i.e. point a struct xfs_buf at it) if it
isn't already present, then allocate a shadow page, point it at the
master, set it up with the new mapping,index tuple and add it to the
mapping tree. Then we can treat it as a unique page even though it
points to the read-only master page.

When the page gets COWed, we toss away the shadow page and the
master can be reclaimed when the reference count goes to zero or the
extent is no longer shared.  Think of it kind of like the way we
multiply reference the zero page for holes in mmap()d dax regions,
except we can have millions of them and they are found by physical
buffer cache index lookups. 
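
The shadow page lifecycle described above can be sketched as a small
userspace model in Python. Everything here is an assumption made for
illustration: the class names, the dictionary standing in for the
buffer cache, and the single dictionary standing in for the per-file
mapping trees are hypothetical. Only reclaim on a zero reference count
is modelled, not the "extent is no longer shared" case.

```python
class MasterPage:
    """Read-only physical page owned by the filesystem, keyed by LBA."""

    def __init__(self, lba, data):
        self.lba = lba
        self.data = bytes(data)
        self.refcount = 0  # number of shadow pages pointing here


class ShadowPage:
    """Per-file stand-in carrying one (mapping, index) tuple."""

    def __init__(self, master, mapping, index):
        self.master = master
        self.mapping = mapping
        self.index = index


class SharedExtentCache:
    def __init__(self):
        self.by_lba = {}        # LBA -> MasterPage (the "buffer cache")
        self.mapping_tree = {}  # (mapping, index) -> ShadowPage

    def map_shared(self, mapping, index, lba, data):
        """Map a page of a shared extent into one file."""
        key = (mapping, index)
        if key in self.mapping_tree:
            return self.mapping_tree[key]
        master = self.by_lba.get(lba)
        if master is None:
            # First mapping of this block: add the master to the cache.
            master = MasterPage(lba, data)
            self.by_lba[lba] = master
        shadow = ShadowPage(master, mapping, index)
        master.refcount += 1
        self.mapping_tree[key] = shadow
        return shadow

    def cow(self, mapping, index):
        """Drop the shadow page and hand back a private writable copy."""
        shadow = self.mapping_tree.pop((mapping, index))
        master = shadow.master
        private = bytearray(master.data)
        master.refcount -= 1
        if master.refcount == 0:
            del self.by_lba[master.lba]  # master can be reclaimed
        return private


cache = SharedExtentCache()
a = cache.map_shared("fileA", 0, lba=4096, data=b"hello")
b = cache.map_shared("fileB", 3, lba=4096, data=b"hello")
private = cache.cow("fileB", 3)  # fileB writes; fileA still shares
```

Each file sees its own page object with its own mapping and index,
while the data lives once in the master found by physical lookup.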

This works for both DAX and non-DAX sharing of read-only shared
filesystem pages. i.e. it would form the basis of single-copy
read-only page cache pages for reflinked files.

There was quite a bit of talk at LSFMM 2018 about having a linked
list of mapping structures hanging off a struct page, one for each
mapping that references the page. Operations would then have to walk
all mappings that reference the page. This was useful to other
subsystems (HMM?) for some purpose I forget, but I'm not sure it's
particularly useful by itself for non-dax reflink purposes - I
suspect the filesystem would still need to track such pages itself
in its buffer cache so it can find the cached page to link new
reflink copies to the same page...
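
For comparison, the list-of-mappings idea can be modelled the same
way. This is a hedged userspace sketch in Python, not what was
proposed at LSFMM: SharedPage, link_copy() and walk_mappings() are
hypothetical names, and the dictionary keyed by LBA stands in for
whatever index the filesystem would keep.

```python
class SharedPage:
    """Toy page carrying a list of every (mapping, index) that uses it."""

    def __init__(self, lba):
        self.lba = lba
        self.mappings = []


def link_copy(pages_by_lba, lba, mapping, index):
    """Attach one more file mapping to the cached page for this block."""
    page = pages_by_lba.get(lba)
    if page is None:
        # Without a lookup keyed by physical location there is no way
        # to find an already cached page from a new reflink copy.
        page = SharedPage(lba)
        pages_by_lba[lba] = page
    page.mappings.append((mapping, index))
    return page


def walk_mappings(page, fn):
    """Apply an operation to every mapping that references the page."""
    for mapping, index in page.mappings:
        fn(mapping, index)


pages_by_lba = {}
p1 = link_copy(pages_by_lba, 8, "fileA", 0)
p2 = link_copy(pages_by_lba, 8, "fileB", 5)
seen = []
walk_mappings(p1, lambda mapping, index: seen.append((mapping, index)))
```

The list on the page answers "who maps this page", but link_copy()
still depends on the separate LBA index to find the page at all.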

ISTR a couple of other solutions were thrown around, but I don't
think anyone came up with a simple solution...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


Thread overview: 14+ messages
2019-07-31 11:49 Shiyang Ruan
2019-07-31 11:49 ` [RFC PATCH 1/7] dax: Introduce dax_copy_edges() for COW Shiyang Ruan
2019-07-31 11:49 ` [RFC PATCH 2/7] dax: copy data before write Shiyang Ruan
2019-07-31 11:49 ` [RFC PATCH 3/7] dax: replace mmap entry in case of CoW Shiyang Ruan
2019-07-31 11:49 ` [RFC PATCH 4/7] fs: dedup file range to use a compare function Shiyang Ruan
2019-07-31 11:49 ` [RFC PATCH 5/7] dax: memcpy before zeroing range Shiyang Ruan
2019-07-31 11:49 ` [RFC PATCH 6/7] xfs: Add COW handle for fsdax Shiyang Ruan
2019-07-31 11:49 ` [RFC PATCH 7/7] xfs: Add dedupe support " Shiyang Ruan
2019-07-31 20:33 ` [RFC PATCH 0/7] xfs: add reflink & " Goldwyn Rodrigues
2019-08-01  1:37   ` Shiyang Ruan
2019-08-05  0:21     ` Dave Chinner
2019-10-09  6:31 ` Christoph Hellwig
2019-10-09 17:11   ` Darrick J. Wong
2019-10-10  7:30     ` Dave Chinner [this message]
