All of lore.kernel.org
 help / color / mirror / Atom feed
* question re: xfs inode to inode copy implementation
@ 2015-04-21  1:06 xfs
  2015-04-21  4:28 ` Darrick J. Wong
  0 siblings, 1 reply; 10+ messages in thread
From: xfs @ 2015-04-21  1:06 UTC (permalink / raw)
  To: xfs; +Cc: vito.caputo

Hello list,

I'm prototyping something like reflinks in xfs and was wondering if
anyone could give me some pointers on the best way to duplicate the
blocks of the shared inode at the reflink inode, the copy which must
occur when breaking the link.

It would be nice to do the transfer via the page cache after allocating
the space at the desintation inode, but it doesn't seem like I can use
any of the kernel helpers for copying the data via the address_space
structs since I don't have a struct file on hand for the copy source.
I'm doing this in xfs_file_open() so the only struct file I have is the
file being opened for writing - the destination of the copy.

What I do have on hand is the shared inode and the destination inode
opened and ready to go, and the struct file for the destination.

My prototype already mostly works just using xfs_alloc_file_space() to
allocate the appropriate space in the destination inode, but I need to
get that allocated space populated from the shared inode's extents.

Any pointers appreciated, thanks!

Regards,
Vito Caputo


P.S. I've seen Dave Chinner's mention of reflink prototypes in XFS on
lwn but haven't been able to find any code, what's the status of that?

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: question re: xfs inode to inode copy implementation
  2015-04-21  1:06 question re: xfs inode to inode copy implementation xfs
@ 2015-04-21  4:28 ` Darrick J. Wong
  2015-04-21 22:27   ` Dave Chinner
  2015-04-30  0:53   ` xfs
  0 siblings, 2 replies; 10+ messages in thread
From: Darrick J. Wong @ 2015-04-21  4:28 UTC (permalink / raw)
  To: xfs, vito.caputo; +Cc: xfs

On Mon, Apr 20, 2015 at 08:06:46PM -0500, xfs@pengaru.com wrote:
> Hello list,
> 
> I'm prototyping something like reflinks in xfs and was wondering if
> anyone could give me some pointers on the best way to duplicate the

Heh, funny, I'm working on that too...

> blocks of the shared inode at the reflink inode, the copy which must
> occur when breaking the link.

...though I'm not sure what "the shared inode at the reflink inode" means.
Are there somehow three inodes involved with reflinking one file to another?

> It would be nice to do the transfer via the page cache after allocating
> the space at the desintation inode, but it doesn't seem like I can use
> any of the kernel helpers for copying the data via the address_space
> structs since I don't have a struct file on hand for the copy source.
> I'm doing this in xfs_file_open() so the only struct file I have is the
> file being opened for writing - the destination of the copy.

So you're cloning the entire file's contents (i.e. breaking the reflink) as
soon as the file is opened rw?

> What I do have on hand is the shared inode and the destination inode
> opened and ready to go, and the struct file for the destination.

The design I'm pursuing is different from yours, I think -- two files can use
the regular bmbt to point to the same physical blocks, and there's a per-ag
btree that tracks reference counts for physical extents.  What I'd like to do
for the CoW operation is to clone the page (somehow), change the bmbt mapping
to "delayed allocation", and let the dirty pages flush out like normal.

I haven't figured out /how/ to do this, mind you.  The rest of the bookkeeping
parts are already written, though.

With reflink enabled, xfsrepair theoretically can solve multiply claimed blocks
by simply adding the appropriate agblock:refcount entry to the refcount btree
and it's done.

> My prototype already mostly works just using xfs_alloc_file_space() to
> allocate the appropriate space in the destination inode, but I need to
> get that allocated space populated from the shared inode's extents.

I think you're asking how to copy extent map entries from one file to another?

--D

> 
> Any pointers appreciated, thanks!
> 
> Regards,
> Vito Caputo
> 
> 
> P.S. I've seen Dave Chinner's mention of reflink prototypes in XFS on
> lwn but haven't been able to find any code, what's the status of that?
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: question re: xfs inode to inode copy implementation
  2015-04-21  4:28 ` Darrick J. Wong
@ 2015-04-21 22:27   ` Dave Chinner
  2015-04-23  0:44     ` Darrick J. Wong
  2015-04-30  0:53   ` xfs
  1 sibling, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2015-04-21 22:27 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: vito.caputo, xfs, xfs

On Mon, Apr 20, 2015 at 09:28:20PM -0700, Darrick J. Wong wrote:
> On Mon, Apr 20, 2015 at 08:06:46PM -0500, xfs@pengaru.com wrote:
> > Hello list,
> > 
> > I'm prototyping something like reflinks in xfs and was wondering if
> > anyone could give me some pointers on the best way to duplicate the
> 
> Heh, funny, I'm working on that too...
> 
> > blocks of the shared inode at the reflink inode, the copy which must
> > occur when breaking the link.
> 
> ...though I'm not sure what "the shared inode at the reflink inode" means.
> Are there somehow three inodes involved with reflinking one file to another?
> 
> > It would be nice to do the transfer via the page cache after allocating
> > the space at the desintation inode, but it doesn't seem like I can use
> > any of the kernel helpers for copying the data via the address_space
> > structs since I don't have a struct file on hand for the copy source.
> > I'm doing this in xfs_file_open() so the only struct file I have is the
> > file being opened for writing - the destination of the copy.
> 
> So you're cloning the entire file's contents (i.e. breaking the reflink) as
> soon as the file is opened rw?
> 
> > What I do have on hand is the shared inode and the destination inode
> > opened and ready to go, and the struct file for the destination.
> 
> The design I'm pursuing is different from yours, I think -- two files can use
> the regular bmbt to point to the same physical blocks, and there's a per-ag
> btree that tracks reference counts for physical extents.  What I'd like to do
> for the CoW operation is to clone the page (somehow), change the bmbt mapping
> to "delayed allocation", and let the dirty pages flush out like normal.
> 
> I haven't figured out /how/ to do this, mind you.  The rest of the bookkeeping
> parts are already written, though.

My first thought on COW was to try to use the write path get_blocks
callback to do all this.  i.e. in __xfs_get_blocks() detect that it
is an overwrite of a shared extent, remove the shared extent
reference and then convert it to delayed alloc extent. (i.e.
xfs_iomap_overwrite_shared()). Then writeback will allocate new
blocks for the data.

The question, however, is how to do this in a manner such that
crashing between the breaking of the shared reference and data
writeback doesn't leave us with a hole instead of data. To deal with
that, I think that we're going to have to break shared extents
during writeback, not during the write. However, we are going to
need a delalloc reservation to do that.

So I suspect we need a new type of extent in the in-core extent tree
- a "delalloc overwrite" extent - so that when we map it in writeback
we can allocate the new extent, do the write to it, and then on IO
completion do the BMBT manipulation to break the shared reference
and insert the new extent. That solves the atomicity problem, and it
allows us to track COW data on a per-inode basis without having
to care about all the other reflink contexts to that same data.

> With reflink enabled, xfsrepair theoretically can solve multiply claimed blocks
> by simply adding the appropriate agblock:refcount entry to the refcount btree
> and it's done.

With rmap, XFS can solve multiply claimed blocks simply by looking
at who really owns the block in the rmap... :P

> > P.S. I've seen Dave Chinner's mention of reflink prototypes in XFS on
> > lwn but haven't been able to find any code, what's the status of that?

No code, because they are prototypes to determine if ideas are sane
and workable.  Similar to what Darrick is doing right now, and we've
talked about it on #xfs a fair bit. Darrick has more time to work on
this right now than I do, so he's the guy doing all the heavy
lifting at the moment...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: question re: xfs inode to inode copy implementation
  2015-04-21 22:27   ` Dave Chinner
@ 2015-04-23  0:44     ` Darrick J. Wong
  2015-04-23  1:13       ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2015-04-23  0:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: vito.caputo, xfs, xfs

On Wed, Apr 22, 2015 at 08:27:38AM +1000, Dave Chinner wrote:
> On Mon, Apr 20, 2015 at 09:28:20PM -0700, Darrick J. Wong wrote:
> > On Mon, Apr 20, 2015 at 08:06:46PM -0500, xfs@pengaru.com wrote:
> > > Hello list,
> > > 
> > > I'm prototyping something like reflinks in xfs and was wondering if
> > > anyone could give me some pointers on the best way to duplicate the
> > 
> > Heh, funny, I'm working on that too...
> > 
> > > blocks of the shared inode at the reflink inode, the copy which must
> > > occur when breaking the link.
> > 
> > ...though I'm not sure what "the shared inode at the reflink inode" means.
> > Are there somehow three inodes involved with reflinking one file to another?
> > 
> > > It would be nice to do the transfer via the page cache after allocating
> > > the space at the desintation inode, but it doesn't seem like I can use
> > > any of the kernel helpers for copying the data via the address_space
> > > structs since I don't have a struct file on hand for the copy source.
> > > I'm doing this in xfs_file_open() so the only struct file I have is the
> > > file being opened for writing - the destination of the copy.
> > 
> > So you're cloning the entire file's contents (i.e. breaking the reflink) as
> > soon as the file is opened rw?
> > 
> > > What I do have on hand is the shared inode and the destination inode
> > > opened and ready to go, and the struct file for the destination.
> > 
> > The design I'm pursuing is different from yours, I think -- two files can use
> > the regular bmbt to point to the same physical blocks, and there's a per-ag
> > btree that tracks reference counts for physical extents.  What I'd like to do
> > for the CoW operation is to clone the page (somehow), change the bmbt mapping
> > to "delayed allocation", and let the dirty pages flush out like normal.
> > 
> > I haven't figured out /how/ to do this, mind you.  The rest of the bookkeeping
> > parts are already written, though.
> 
> My first thought on COW was to try to use the write path get_blocks
> callback to do all this.  i.e. in __xfs_get_blocks() detect that it
> is an overwrite of a shared extent, remove the shared extent
> reference and then convert it to delayed alloc extent. (i.e.
> xfs_iomap_overwrite_shared()). Then writeback will allocate new
> blocks for the data.

<nod> That was my first thought, too.  I was rather hoping that I could just
update the incore BMBT to kick off delayed allocation and hope that it flushes
everything to disk before anything can blow up. (Ha...)  But alas, I hit the
same conclusion that you'd have to allocate the new block, write it, and only
then ought you update the BMBT.

> The question, however, is how to do this in a manner such that
> crashing between the breaking of the shared reference and data
> writeback doesn't leave us with a hole instead of data. To deal with
> that, I think that we're going to have to break shared extents
> during writeback, not during the write. However, we are going to
> need a delalloc reservation to do that.
> 
> So I suspect we need a new type of extent in the in-core extent tree
> - a "delalloc overwrite" extent - so that when we map it in writeback
> we can allocate the new extent, do the write to it, and then on IO
> completion do the BMBT manipulation to break the shared reference
> and insert the new extent. That solves the atomicity problem, and it
> allows us to track COW data on a per-inode basis without having
> to care about all the other reflink contexts to that same data.

I think that'll work... in xfs_vm_writepage (more probably xfs_map_blocks) if
the refcount > 2, allocate a new block, insert a new delalloc-overwrite in-core
extent with the new block number and set a flag in the ioend to remind
ourselves to update the bookkeeping later.  During xfs_end_io if that flag is
set, commit the new in-core extent to disk, decrement the refcounts, and
free the block if the refcount is 1.

For O_DIRECT I suppose we could use a similar mechanism -- you'd have to set up
the delalloc-overwrite extent in xfs_iomap_write_direct() and use
xfs_end_io_direct_write() to update the bmbt and decrement the refcounts in the
same way as above.

Hm.  Not sure what'll happen if the write buffer or the block size aren't a
page size.  Will have to go figure out what XFS does to fill in the rest of a
block if you try to directio-write to less than a block.  Hoping it's less
weird than other things I've seen.

(Does any of that make sense?)

> > With reflink enabled, xfsrepair theoretically can solve multiply claimed blocks
> > by simply adding the appropriate agblock:refcount entry to the refcount btree
> > and it's done.
> 
> With rmap, XFS can solve multiply claimed blocks simply by looking
> at who really owns the block in the rmap... :P

Yes, rmap will make reconstruction easier; when I wrote that I was thinking of
the !rmap case.  That said, it might turn out that rmap & reflink appear around
the same time?

<shrug> Guess I should get at least a PoC operational. :)

> > > P.S. I've seen Dave Chinner's mention of reflink prototypes in XFS on
> > > lwn but haven't been able to find any code, what's the status of that?
> 
> No code, because they are prototypes to determine if ideas are sane
> and workable.  Similar to what Darrick is doing right now, and we've
> talked about it on #xfs a fair bit. Darrick has more time to work on
> this right now than I do, so he's the guy doing all the heavy
> lifting at the moment...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: question re: xfs inode to inode copy implementation
  2015-04-23  0:44     ` Darrick J. Wong
@ 2015-04-23  1:13       ` Dave Chinner
  2015-04-23  6:40         ` Darrick J. Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2015-04-23  1:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: vito.caputo, xfs, xfs

On Wed, Apr 22, 2015 at 05:44:26PM -0700, Darrick J. Wong wrote:
> On Wed, Apr 22, 2015 at 08:27:38AM +1000, Dave Chinner wrote:
> > On Mon, Apr 20, 2015 at 09:28:20PM -0700, Darrick J. Wong wrote:
> > > On Mon, Apr 20, 2015 at 08:06:46PM -0500, xfs@pengaru.com wrote:
> > > > Hello list,
> > > > 
> > > > I'm prototyping something like reflinks in xfs and was wondering if
> > > > anyone could give me some pointers on the best way to duplicate the
> > > 
> > > Heh, funny, I'm working on that too...
> > > 
> > > > blocks of the shared inode at the reflink inode, the copy which must
> > > > occur when breaking the link.
> > > 
> > > ...though I'm not sure what "the shared inode at the reflink inode" means.
> > > Are there somehow three inodes involved with reflinking one file to another?
> > > 
> > > > It would be nice to do the transfer via the page cache after allocating
> > > > the space at the desintation inode, but it doesn't seem like I can use
> > > > any of the kernel helpers for copying the data via the address_space
> > > > structs since I don't have a struct file on hand for the copy source.
> > > > I'm doing this in xfs_file_open() so the only struct file I have is the
> > > > file being opened for writing - the destination of the copy.
> > > 
> > > So you're cloning the entire file's contents (i.e. breaking the reflink) as
> > > soon as the file is opened rw?
> > > 
> > > > What I do have on hand is the shared inode and the destination inode
> > > > opened and ready to go, and the struct file for the destination.
> > > 
> > > The design I'm pursuing is different from yours, I think -- two files can use
> > > the regular bmbt to point to the same physical blocks, and there's a per-ag
> > > btree that tracks reference counts for physical extents.  What I'd like to do
> > > for the CoW operation is to clone the page (somehow), change the bmbt mapping
> > > to "delayed allocation", and let the dirty pages flush out like normal.
> > > 
> > > I haven't figured out /how/ to do this, mind you.  The rest of the bookkeeping
> > > parts are already written, though.
> > 
> > My first thought on COW was to try to use the write path get_blocks
> > callback to do all this.  i.e. in __xfs_get_blocks() detect that it
> > is an overwrite of a shared extent, remove the shared extent
> > reference and then convert it to delayed alloc extent. (i.e.
> > xfs_iomap_overwrite_shared()). Then writeback will allocate new
> > blocks for the data.
> 
> <nod> That was my first thought, too.  I was rather hoping that I could just
> update the incore BMBT to kick off delayed allocation and hope that it flushes
> everything to disk before anything can blow up. (Ha...)  But alas, I hit the
> same conclusion that you'd have to allocate the new block, write it, and only
> then ought you update the BMBT.
> 
> > The question, however, is how to do this in a manner such that
> > crashing between the breaking of the shared reference and data
> > writeback doesn't leave us with a hole instead of data. To deal with
> > that, I think that we're going to have to break shared extents
> > during writeback, not during the write. However, we are going to
> > need a delalloc reservation to do that.
> > 
> > So I suspect we need a new type of extent in the in-core extent tree
> > - a "delalloc overwrite" extent - so that when we map it in writeback
> > we can allocate the new extent, do the write to it, and then on IO
> > completion do the BMBT manipulation to break the shared reference
> > and insert the new extent. That solves the atomicity problem, and it
> > allows us to track COW data on a per-inode basis without having
> > to care about all the other reflink contexts to that same data.
> 
> I think that'll work... in xfs_vm_writepage (more probably xfs_map_blocks) if
> the refcount > 2, allocate a new block, insert a new delalloc-overwrite in-core
> extent with the new block number and set a flag in the ioend to remind
> ourselves to update the bookkeeping later.  During xfs_end_io if that flag is
> set, commit the new in-core extent to disk, decrement the refcounts, and
> free the block if the refcount is 1.

If we are going to track the overwrite in-core, then we are probably
going to need some form of intent/done transaction structure so that
we don't leak the allocated block if we crash before the completion
runs and commits the extent swap. I'd prefer to do that than require
on-disk state to prevent free space leakage in this case.

We could, potentially, abuse the EFI for this. i.e. record an EFI
for the extent in the allocation transaction, then in the completion
record a matching EFD. That way recovery will free the allocated
extent if we don't complete it....

> For O_DIRECT I suppose we could use a similar mechanism -- you'd
> have to set up the delalloc-overwrite extent in
> xfs_iomap_write_direct() and use xfs_end_io_direct_write() to
> update the bmbt and decrement the refcounts in the same way as
> above.

Effectively.

> Hm.  Not sure what'll happen if the write buffer or the block size aren't a
> page size.  Will have to go figure out what XFS does to fill in the rest of a
> block if you try to directio-write to less than a block.  Hoping it's less
> weird than other things I've seen.

Oh, it's weird enough. We allow sector size alignment, but we
serialise all unaligned DIO writes because the sub-block zeroing is
a nightmare to co-ordinate properly. But, really, DIO to a reflink
file is not a performant operation, so maybe we should just punt all
writes to shared extent files to the buffered IO path and not have
to care about COW during DIO writes?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: question re: xfs inode to inode copy implementation
  2015-04-23  1:13       ` Dave Chinner
@ 2015-04-23  6:40         ` Darrick J. Wong
  2015-04-26 22:40           ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2015-04-23  6:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: vito.caputo, xfs, xfs

On Thu, Apr 23, 2015 at 11:13:45AM +1000, Dave Chinner wrote:
> On Wed, Apr 22, 2015 at 05:44:26PM -0700, Darrick J. Wong wrote:
> > On Wed, Apr 22, 2015 at 08:27:38AM +1000, Dave Chinner wrote:
> > > On Mon, Apr 20, 2015 at 09:28:20PM -0700, Darrick J. Wong wrote:
> > > > On Mon, Apr 20, 2015 at 08:06:46PM -0500, xfs@pengaru.com wrote:
> > > > > Hello list,
> > > > > 
> > > > > I'm prototyping something like reflinks in xfs and was wondering if
> > > > > anyone could give me some pointers on the best way to duplicate the
> > > > 
> > > > Heh, funny, I'm working on that too...
> > > > 
> > > > > blocks of the shared inode at the reflink inode, the copy which must
> > > > > occur when breaking the link.
> > > > 
> > > > ...though I'm not sure what "the shared inode at the reflink inode" means.
> > > > Are there somehow three inodes involved with reflinking one file to another?
> > > > 
> > > > > It would be nice to do the transfer via the page cache after allocating
> > > > > the space at the desintation inode, but it doesn't seem like I can use
> > > > > any of the kernel helpers for copying the data via the address_space
> > > > > structs since I don't have a struct file on hand for the copy source.
> > > > > I'm doing this in xfs_file_open() so the only struct file I have is the
> > > > > file being opened for writing - the destination of the copy.
> > > > 
> > > > So you're cloning the entire file's contents (i.e. breaking the reflink) as
> > > > soon as the file is opened rw?
> > > > 
> > > > > What I do have on hand is the shared inode and the destination inode
> > > > > opened and ready to go, and the struct file for the destination.
> > > > 
> > > > The design I'm pursuing is different from yours, I think -- two files can use
> > > > the regular bmbt to point to the same physical blocks, and there's a per-ag
> > > > btree that tracks reference counts for physical extents.  What I'd like to do
> > > > for the CoW operation is to clone the page (somehow), change the bmbt mapping
> > > > to "delayed allocation", and let the dirty pages flush out like normal.
> > > > 
> > > > I haven't figured out /how/ to do this, mind you.  The rest of the bookkeeping
> > > > parts are already written, though.
> > > 
> > > My first thought on COW was to try to use the write path get_blocks
> > > callback to do all this.  i.e. in __xfs_get_blocks() detect that it
> > > is an overwrite of a shared extent, remove the shared extent
> > > reference and then convert it to delayed alloc extent. (i.e.
> > > xfs_iomap_overwrite_shared()). Then writeback will allocate new
> > > blocks for the data.
> > 
> > <nod> That was my first thought, too.  I was rather hoping that I could just
> > update the incore BMBT to kick off delayed allocation and hope that it flushes
> > everything to disk before anything can blow up. (Ha...)  But alas, I hit the
> > same conclusion that you'd have to allocate the new block, write it, and only
> > then ought you update the BMBT.
> > 
> > > The question, however, is how to do this in a manner such that
> > > crashing between the breaking of the shared reference and data
> > > writeback doesn't leave us with a hole instead of data. To deal with
> > > that, I think that we're going to have to break shared extents
> > > during writeback, not during the write. However, we are going to
> > > need a delalloc reservation to do that.
> > > 
> > > So I suspect we need a new type of extent in the in-core extent tree
> > > - a "delalloc overwrite" extent - so that when we map it in writeback
> > > we can allocate the new extent, do the write to it, and then on IO
> > > completion do the BMBT manipulation to break the shared reference
> > > and insert the new extent. That solves the atomicity problem, and it
> > > allows us to track COW data on a per-inode basis without having
> > > to care about all the other reflink contexts to that same data.
> > 
> > I think that'll work... in xfs_vm_writepage (more probably xfs_map_blocks) if
> > the refcount > 2, allocate a new block, insert a new delalloc-overwrite in-core

Speaking of which, should I add a XFS_DIFLAG_ to indicate that a file has
(or has had) reflinked blocks?  In theory this would save us a trip through
the reflinkbt for "normal" files when the reflink feature is set, but
we'd then have to maintain it (and repair would have to check it).

> > extent with the new block number and set a flag in the ioend to remind
> > ourselves to update the bookkeeping later.  During xfs_end_io if that flag is
> > set, commit the new in-core extent to disk, decrement the refcounts, and
> > free the block if the refcount is 1.
> 
> If we are going to track the overwrite in-core, then we are probably
> going to need some form of intent/done transaction structure so that
> we don't leak the allocated block if we crash before the completion
> runs and commits the extent swap. I'd prefer to do that than require
> on-disk state to prevent free space leakage in this case.
>
> We could, potentially, abuse the EFI for this. i.e. record an EFI
> for the extent in the allocation transaction, then in the completion
> record a matching EFD. That way recovery will free the allocated
> extent if we don't complete it....

Clever!  I was looking around to see if XFS had something that could
take care of cleaning up orphans like that.

Rather nice that the usual outcome to "I think I want ____ data structure" is
that someone already thought of it. :)

> > For O_DIRECT I suppose we could use a similar mechanism -- you'd
> > have to set up the delalloc-overwrite extent in
> > xfs_iomap_write_direct() and use xfs_end_io_direct_write() to
> > update the bmbt and decrement the refcounts in the same way as
> > above.
> 
> Effectively.
> 
> > Hm.  Not sure what'll happen if the write buffer or the block size aren't a
> > page size.  Will have to go figure out what XFS does to fill in the rest of a
> > block if you try to directio-write to less than a block.  Hoping it's less
> > weird than other things I've seen.
> 
> Oh, it's weird enough. We allow sector size alignment, but we
> serialise all unaligned DIO writes because the sub-block zeroing is
> a nightmare to co-ordinate properly. But, really, DIO to a reflink
> file is not a performant operation, so maybe we should just punt all
> writes to shared extent files to the buffered IO path and not have
> to care about COW during DIO writes?

Sure, I'll punt to buffered mode for now, just to get something working.
I can always come back to this later if I dare.

--D

> 
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: question re: xfs inode to inode copy implementation
  2015-04-23  6:40         ` Darrick J. Wong
@ 2015-04-26 22:40           ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2015-04-26 22:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: vito.caputo, xfs, xfs

On Wed, Apr 22, 2015 at 11:40:16PM -0700, Darrick J. Wong wrote:
> On Thu, Apr 23, 2015 at 11:13:45AM +1000, Dave Chinner wrote:
> > On Wed, Apr 22, 2015 at 05:44:26PM -0700, Darrick J. Wong wrote:
> > > On Wed, Apr 22, 2015 at 08:27:38AM +1000, Dave Chinner wrote:
> > > > On Mon, Apr 20, 2015 at 09:28:20PM -0700, Darrick J. Wong wrote:
> > > > > On Mon, Apr 20, 2015 at 08:06:46PM -0500, xfs@pengaru.com wrote:
> > > > crashing between the breaking of the shared reference and data
> > > > writeback doesn't leave us with a hole instead of data. To deal with
> > > > that, I think that we're going to have to break shared extents
> > > > during writeback, not during the write. However, we are going to
> > > > need a delalloc reservation to do that.
> > > > 
> > > > So I suspect we need a new type of extent in the in-core extent tree
> > > > - a "delalloc overwrite" extent - so that when we map it in writeback
> > > > we can allocate the new extent, do the write to it, and then on IO
> > > > completion do the BMBT manipulation to break the shared reference
> > > > and insert the new extent. That solves the atomicity problem, and it
> > > > allows us to track COW data on a per-inode basis without having
> > > > to care about all the other reflink contexts to that same data.
> > > 
> > > I think that'll work... in xfs_vm_writepage (more probably xfs_map_blocks) if
> > > the refcount > 2, allocate a new block, insert a new delalloc-overwrite in-core
> 
> Speaking of which, should I add a XFS_DIFLAG_ to indicate that a file has
> (or has had) reflinked blocks?  In theory this would save us a trip through
> the reflinkbt for "normal" files when the reflink feature is set, but
> we'd then have to maintain it (and repair would have to check it).

Yes, we probably should have a flag like this. It makes it simple to
determine where to look for extent info, but it also gives us some
redundant information as to whether the inode should have shared
extents rather than have them trashed as duplicate references to an
unshared block in repair...

> > > extent with the new block number and set a flag in the ioend to remind
> > > ourselves to update the bookkeeping later.  During xfs_end_io if that flag is
> > > set, commit the new in-core extent to disk, decrement the refcounts, and
> > > free the block if the refcount is 1.
> > 
> > If we are going to track the overwrite in-core, then we are probably
> > going to need some form of intent/done transaction structure so that
> > we don't leak the allocated block if we crash before the completion
> > runs and commits the extent swap. I'd prefer to do that than require
> > on-disk state to prevent free space leakage in this case.
> >
> > We could, potentially, abuse the EFI for this. i.e. record an EFI
> > for the extent in the allocation transaction, then in the completion
> > record a matching EFD. That way recovery will free the allocated
> > extent if we don't complete it....
> 
> Clever!  I was looking around to see if XFS had something that could
> take care of cleaning up orphans like that.

It's not intended for this purpose, but I think it will work just
fine - as long as the extent is not actually added to the on-disk
bmbt by the allocation transaction.  The EFI is currently committed in
the same transaction that removes the extent from the BMBT to
indicate it is not referenced on disk and then it is freed in the
following transaction that also commits the EFD to indicate it is
referenced again (by the free space tree). The above would work the
opposite way around - EFI commited on removal from the free space
tree, EFD committed on addition to the BMBT...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: question re: xfs inode to inode copy implementation
  2015-04-21  4:28 ` Darrick J. Wong
  2015-04-21 22:27   ` Dave Chinner
@ 2015-04-30  0:53   ` xfs
  2015-04-30  6:26     ` Dave Chinner
  1 sibling, 1 reply; 10+ messages in thread
From: xfs @ 2015-04-30  0:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: vito.caputo, xfs, xfs

On Mon, Apr 20, 2015 at 09:28:20PM -0700, Darrick J. Wong wrote:
> On Mon, Apr 20, 2015 at 08:06:46PM -0500, xfs@pengaru.com wrote:
> > Hello list,
> > 
> > I'm prototyping something like reflinks in xfs and was wondering if
> > anyone could give me some pointers on the best way to duplicate the
> 
> Heh, funny, I'm working on that too...
> 
> > blocks of the shared inode at the reflink inode, the copy which must
> > occur when breaking the link.
> 
> ...though I'm not sure what "the shared inode at the reflink inode" means.
> Are there somehow three inodes involved with reflinking one file to another?

There's just two inodes, the original file's inode and the inode created for
the reflink.

> 
> > It would be nice to do the transfer via the page cache after allocating
> > the space at the desintation inode, but it doesn't seem like I can use
> > any of the kernel helpers for copying the data via the address_space
> > structs since I don't have a struct file on hand for the copy source.
> > I'm doing this in xfs_file_open() so the only struct file I have is the
> > file being opened for writing - the destination of the copy.
> 
> So you're cloning the entire file's contents (i.e. breaking the reflink) as
> soon as the file is opened rw?

Correct.

> 
> > What I do have on hand is the shared inode and the destination inode
> > opened and ready to go, and the struct file for the destination.
> 
> The design I'm pursuing is different from yours, I think -- two files can use
> the regular bmbt to point to the same physical blocks, and there's a per-ag
> btree that tracks reference counts for physical extents.  What I'd like to do
> for the CoW operation is to clone the page (somehow), change the bmbt mapping
> to "delayed allocation", and let the dirty pages flush out like normal.
> 
> I haven't figured out /how/ to do this, mind you.  The rest of the bookkeeping
> parts are already written, though.
> 
> With reflink enabled, xfsrepair theoretically can solve multiply claimed blocks
> by simply adding the appropriate agblock:refcount entry to the refcount btree
> and it's done.
> 
> > My prototype already mostly works just using xfs_alloc_file_space() to
> > allocate the appropriate space in the destination inode, but I need to
> > get that allocated space populated from the shared inode's extents.
> 
> I think you're asking how to copy extent map entries from one file to another?

What I really wanted was something analogous to do_splice_direct() but for
operating on the inodes/address_space structs.  I ended up just hacking
something together which does the copy ad-hoc directly via the address_space
structs and calling xfs_get_blocks() on buffer heads of the destination pages,
without any readahead or other optimizations, at least it reads from and
populates the page caches.

It looks like what you guys are working on is a more granular/block-level COW
reflink implementation, which I assumed would be significantly more challenging
and well beyond my ability to hack up quickly for experimentation.

Below I'll summarize what I've hacked together.  It's probably inappropriate to
refer to this as a reflink, I've been referring to it as a COW-file in the
code.

A COW-file is a new inode which refers to another (shared) inode for its data
until the COW-file is opened for writing.  At that point it clones the shared
inode's data as its own.

Here's the mid-level details of the hack:

1. Add two members to the inode in the available padding:
 be32  nlink_cow:	Number of COW-file links to the inode
 be64  inumber_cow:	Number of backing inode if inode is a COW-file

2. Introduc a new ioctl for creating a COW-file: 
 #define XFS_IOC_CREATE_COWFILE_AT     _IOW ('X', 126, struct xfs_create_cowfile_at)

 typedef struct xfs_create_cowfile_at
 {
 	int	dfd;			/* parent directory	*/
 	char	name[MAXNAMELEN];	/* name to create	*/
 	umode_t	mode;			/* mode			*/
 } xfs_create_cowfile_at_t;

3. Derive a xfs_create_cowfile() from xfs_create() and xfs_link():
 xfs_create_cowfile(
	xfs_inode_t	*dp,		/* parent directory */
	xfs_inode_t	*sip,		/* shared inode (ioctl callee) */
	struct xfs_name	*name,		/* name of cowfile to create in dp */
	umode_t		mode,		/* mode */
	xfs_dev_t	rdev,
	xfs_inode_t	**ipp)		/* new inode */

 - ipp = allocate inode
 - ipp->i_mapping->host = sip
 - bump sip->nlink_cow
 - ipp->inumber_cow = sip->di_ino
 - create name in dp referencing ipp

 ipp starts out with the shared inode hosting i_mapping

4. Modify xfs_setup_inode() to be inumber_cow-aware, opening the shared inode
   when set, and assigning to i_mapping->host.

5. Modify xfs_file_open() S_ISREG && inumber_cow && FMODE_WRITE:
 - clear inumber_cow
 - restore i_mapping->host to the inode being opened
 - invalidate_inode_pages2(i_mapping)
 - allocate all needed space in this inode
 - copy size from shared inode to this inode
 - copy all pages from the previously shared inode to this one

6. Modify xfs_vn_getattr() to copy stat->size from the shared inode if
   inumer_cow is set.

7. Modify unlink paths to be nlink_cow-aware

8. To simplify things, inodes that have nlink_cow no longer can be opened for
   writing, they've become immutable backing stores of sorts for other inodes.


The one major question mark I see in this achieving correctness is the live
manipulation of i_mapping->host.  It seems to work in my casual testing on
x86_64, this actually all works surprisingly well considering it's a fast and
nasty hack.  I abuse invalidate_inode_pages2() as if a truncate has occurred,
forcing subsequent page accesses to fault and revisit i_mapping->host.

The goal here was to achieve something overlayfs-like but with inodes capable
of being chowned/chmodded without triggering the copy_up, operations likely
necessary for supporting user namespaces in containers.  Additionally,
overlayfs has some serious correctness issues WRT multiply-opened files
spanning the lower and upper layers due to one of the subsequent opens being a
writer.  Since my hack creates distinct inodes from the start, no such issue
exists.

However, one of the attractive things about overlayfs is the page cache sharing
which my XFS hack doesn't enable due to the distinct struct addres_space's and
i_mapping->host host exchanging.  I had hoped to explore something KSM-like
with maybe some hints from XFS for these shared inode pagess saying "hey these
are read-only pages in a shared inode, try deduplicate them!" but KSM is purely
in vma land so that seems like a lot of work to make happen.

In any case, thanks for the responses and any further input you guys have.
Obviously a more granular btrfs-like reflink is preferable, and I'd welcome it.
It just seemed like doing something overlayfs-like would be a whole lot easier
to get working in a relatively short time.

Cheers,
Vito Caputo

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: question re: xfs inode to inode copy implementation
  2015-04-30  0:53   ` xfs
@ 2015-04-30  6:26     ` Dave Chinner
  2015-04-30  8:03       ` Darrick J. Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2015-04-30  6:26 UTC (permalink / raw)
  To: xfs; +Cc: vito.caputo, xfs, Darrick J. Wong

On Wed, Apr 29, 2015 at 07:53:30PM -0500, xfs@pengaru.com wrote:
> On Mon, Apr 20, 2015 at 09:28:20PM -0700, Darrick J. Wong wrote:
> > On Mon, Apr 20, 2015 at 08:06:46PM -0500, xfs@pengaru.com wrote:
> > > My prototype already mostly works just using xfs_alloc_file_space() to
> > > allocate the appropriate space in the destination inode, but I need to
> > > get that allocated space populated from the shared inode's extents.
> > 
> > I think you're asking how to copy extent map entries from one file to another?
> 
> What I really wanted was something analogous to do_splice_direct() but for
> operating on the inodes/address_space structs.  I ended up just hacking
> something together which does the copy ad-hoc directly via the address_space
> structs and calling xfs_get_blocks() on buffer heads of the destination pages,
> without any readahead or other optimizations, at least it reads from and
> populates the page caches.
> 
> It looks like what you guys are working on is a more granular/block-level COW
> reflink implementation, which I assumed would be significantly more challenging
> and well beyond my ability to hack up quickly for experimentation.
> 
> Below I'll summarize what I've hacked together.  It's probably inappropriate to
> refer to this as a reflink, I've been referring to it as a COW-file in the
> code.

It's the same thing, just using a different COW mechanism to break
the reflink.

> A COW-file is a new inode which refers to another (shared) inode for its data
> until the COW-file is opened for writing.  At that point it clones the shared
> inode's data as its own.
> 
> Here's the mid-level details of the hack:
> 
> 1. Add two members to the inode in the available padding:
>  be32  nlink_cow:	Number of COW-file links to the inode
>  be64  inumber_cow:	Number of backing inode if inode is a COW-file

FYI, inode number alone is not enough to uniquely identify an inode.
Needs the ino/gen pair as inode numbers can be reused.

> 2. Introduc a new ioctl for creating a COW-file: 
>  #define XFS_IOC_CREATE_COWFILE_AT     _IOW ('X', 126, struct xfs_create_cowfile_at)
> 
>  typedef struct xfs_create_cowfile_at
>  {
>  	int	dfd;			/* parent directory	*/
>  	char	name[MAXNAMELEN];	/* name to create	*/
>  	umode_t	mode;			/* mode			*/
>  } xfs_create_cowfile_at_t;

Let's not create yet another incompatible reflink API. Please use
OCFS2_IOC_REFLINK as the API and pull the ioctl and infrastructure
within fs/ocfs2/refcounttree.c up to the VFS, add a ->reflink method
to the operations structure and then hook ocfs2 and the new XFS code
up to that interface.  We don't want to duplicate all that code
unnecessarily in an incompatible fashion.

> 3. Derive a xfs_create_cowfile() from xfs_create() and xfs_link():
>  xfs_create_cowfile(
> 	xfs_inode_t	*dp,		/* parent directory */
> 	xfs_inode_t	*sip,		/* shared inode (ioctl callee) */
> 	struct xfs_name	*name,		/* name of cowfile to create in dp */
> 	umode_t		mode,		/* mode */
> 	xfs_dev_t	rdev,
> 	xfs_inode_t	**ipp)		/* new inode */
> 
>  - ipp = allocate inode
>  - ipp->i_mapping->host = sip
>  - bump sip->nlink_cow
>  - ipp->inumber_cow = sip->di_ino
>  - create name in dp referencing ipp
> 
>  ipp starts out with the shared inode hosting i_mapping

That's kinda messy. I'd like to leave sharing page cache completely
out of the picture first, and get all the inode and extent
manipulations correct first. most important is all the inode life
cycle and unlink/unshare operations sorted, especially w.r.t.
locking and transactions.

Once we have that sanely under control, page cache sharing is a
matter of reflecting the shared extents in the page cache via
multiple page references (i.e. one for each mapping tree the page is
inserted into) rather than playing games trying to share mappings.

> 4. Modify xfs_setup_inode() to be inumber_cow-aware, opening the shared inode
>    when set, and assigning to i_mapping->host.

So you've added an xfs_iget() call to xfs_setup_inode() if the inode
is a reflink inode? How does the locking work?

> 5. Modify xfs_file_open() S_ISREG && inumber_cow && FMODE_WRITE:
>  - clear inumber_cow
>  - restore i_mapping->host to the inode being opened
>  - invalidate_inode_pages2(i_mapping)
>  - allocate all needed space in this inode
>  - copy size from shared inode to this inode
>  - copy all pages from the previously shared inode to this one

So the copy is done on open(O_WR) of the file. files get opened in
write mode for more than just writing date to them. e.g. fchmod,
writing new attributes, etc. We don't want a data cow if this is the
first operation that is done to the inode after the reflink is
made...

> 6. Modify xfs_vn_getattr() to copy stat->size from the shared inode if
>    inumer_cow is set.

Why wouldn't you just put the size in the COW inode?

> 7. Modify unlink paths to be nlink_cow-aware

What happens when parent is unlinked? It stays on the unlinked list?
Or do we know have the possibility that unlink can have two inodes
to free in xfs-inactive()?  What's the implication for transaction
reservations? What's the locking order?

reflink doesn't have this "readonly parent" wart. If the parent is
modified, it breaks it's side of the reflink via COW. Simple,
symmetric handling of the problem - it doesn't matter who writes to
what, the unmodified inodes do not see any change.

> 8. To simplify things, inodes that have nlink_cow no longer can be opened for
>    writing, they've become immutable backing stores of sorts for other inodes.

So, someone marks a file as COW, and that makes it read only? What
happens if there is already an open fd with write permissions when
the COW ioctl is called? So, where does the immutable state of the
inode get enforced? What's preventing it from being unlinked?

> The one major question mark I see in this achieving correctness is the live
> manipulation of i_mapping->host.  It seems to work in my casual testing on
> x86_64, this actually all works surprisingly well considering it's a fast and
> nasty hack.  I abuse invalidate_inode_pages2() as if a truncate has occurred,
> forcing subsequent page accesses to fault and revisit i_mapping->host.

I don't think you can't change mapping->host, as there is no one
lock that protects access to it.

> The goal here was to achieve something overlayfs-like but with inodes capable
> of being chowned/chmodded without triggering the copy_up, operations likely
> necessary for supporting user namespaces in containers.

Yup, reflink gives us this.

> Additionally,
> overlayfs has some serious correctness issues WRT multiply-opened files
> spanning the lower and upper layers due to one of the subsequent opens being a
> writer.  Since my hack creates distinct inodes from the start, no such issue
> exists.

Right, but it's not something that overlayfs can use unconditionally
(reflink or your COW file) because overlayfs can have upper and
lower directories on different filesystems.  And, as has also been
pointed out, it doesn't work on lower filesystems that are
read-only.

As such, I'm not sure that overlayfs will ever support this change
of behaviour as it restricts overlayfs to a single writeable
filesystem. I know, most people only need overlayfs on a single
writeable filesystem, but....

> However, one of the attractive things about overlayfs is the page cache sharing
> which my XFS hack doesn't enable due to the distinct struct addres_space's and
> i_mapping->host host exchanging.  I had hoped to explore something KSM-like
> with maybe some hints from XFS for these shared inode pagess saying "hey these
> are read-only pages in a shared inode, try deduplicate them!" but KSM is purely
> in vma land so that seems like a lot of work to make happen.

As I mentioned before, page cache sharing needs to reflect shared
blocks on disk. Playing games with KSM or mappings isn't a viable
solution.

> In any case, thanks for the responses and any further input you guys have.
> Obviously a more granular btrfs-like reflink is preferable, and I'd welcome it.
> It just seemed like doing something overlayfs-like would be a whole lot easier
> to get working in a relatively short time.

Overlayfs is creating more problems than it is solving. Hacking
random one-off functionality into filesystems is not the way to fix
overlayfs problems. Let's do reflink properly, then we can modify
overlayfs to be able to use reflink when all it's working
directories are on the same filesystem.  The we can add page cache
sharing to reflink files, and all of a sudden, the overlayfs page
cache sharing problem is solved. Not to mention it is solved for all
the other people that want reflink files for their applications,
too.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: question re: xfs inode to inode copy implementation
  2015-04-30  6:26     ` Dave Chinner
@ 2015-04-30  8:03       ` Darrick J. Wong
  0 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2015-04-30  8:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: vito.caputo, xfs, xfs

On Thu, Apr 30, 2015 at 04:26:14PM +1000, Dave Chinner wrote:
> On Wed, Apr 29, 2015 at 07:53:30PM -0500, xfs@pengaru.com wrote:
> > On Mon, Apr 20, 2015 at 09:28:20PM -0700, Darrick J. Wong wrote:
> > > On Mon, Apr 20, 2015 at 08:06:46PM -0500, xfs@pengaru.com wrote:
> > > > My prototype already mostly works just using xfs_alloc_file_space() to
> > > > allocate the appropriate space in the destination inode, but I need to
> > > > get that allocated space populated from the shared inode's extents.
> > > 
> > > I think you're asking how to copy extent map entries from one file to another?
> > 
> > What I really wanted was something analogous to do_splice_direct() but for
> > operating on the inodes/address_space structs.  I ended up just hacking
> > something together which does the copy ad-hoc directly via the address_space
> > structs and calling xfs_get_blocks() on buffer heads of the destination pages,
> > without any readahead or other optimizations, at least it reads from and
> > populates the page caches.
> > 
> > It looks like what you guys are working on is a more granular/block-level COW
> > reflink implementation, which I assumed would be significantly more challenging
> > and well beyond my ability to hack up quickly for experimentation.
> > 
> > Below I'll summarize what I've hacked together.  It's probably inappropriate to
> > refer to this as a reflink, I've been referring to it as a COW-file in the
> > code.
> 
> It's the same thing, just using a different COW mechanism to break
> the reflink.
> 
> > A COW-file is a new inode which refers to another (shared) inode for its data
> > until the COW-file is opened for writing.  At that point it clones the shared
> > inode's data as its own.
> > 
> > Here's the mid-level details of the hack:
> > 
> > 1. Add two members to the inode in the available padding:
> >  be32  nlink_cow:	Number of COW-file links to the inode
> >  be64  inumber_cow:	Number of backing inode if inode is a COW-file
> 
> FYI, inode number alone is not enough to uniquely identify an inode.
> Needs the ino/gen pair as inode numbers can be reused.
> 
> > 2. Introduc a new ioctl for creating a COW-file: 
> >  #define XFS_IOC_CREATE_COWFILE_AT     _IOW ('X', 126, struct xfs_create_cowfile_at)
> > 
> >  typedef struct xfs_create_cowfile_at
> >  {
> >  	int	dfd;			/* parent directory	*/
> >  	char	name[MAXNAMELEN];	/* name to create	*/
> >  	umode_t	mode;			/* mode			*/
> >  } xfs_create_cowfile_at_t;
> 
> Let's not create yet another incompatible reflink API. Please use
> OCFS2_IOC_REFLINK as the API and pull the ioctl and infrastructure
> within fs/ocfs2/refcounttree.c up to the VFS, add a ->reflink method
> to the operations structure and then hook ocfs2 and the new XFS code
> up to that interface.  We don't want to duplicate all that code
> unnecessarily in an incompatible fashion.

I've been using BTRFS_IOC_CLONE{,_RANGE} so far, and eyeing zab's proposed
copy_file_range syscall.

> > 3. Derive a xfs_create_cowfile() from xfs_create() and xfs_link():
> >  xfs_create_cowfile(
> > 	xfs_inode_t	*dp,		/* parent directory */
> > 	xfs_inode_t	*sip,		/* shared inode (ioctl callee) */
> > 	struct xfs_name	*name,		/* name of cowfile to create in dp */
> > 	umode_t		mode,		/* mode */
> > 	xfs_dev_t	rdev,
> > 	xfs_inode_t	**ipp)		/* new inode */
> > 
> >  - ipp = allocate inode
> >  - ipp->i_mapping->host = sip
> >  - bump sip->nlink_cow
> >  - ipp->inumber_cow = sip->di_ino
> >  - create name in dp referencing ipp
> > 
> >  ipp starts out with the shared inode hosting i_mapping
> 
> That's kinda messy. I'd like to leave sharing page cache completely
> out of the picture first, and get all the inode and extent
> manipulations correct first. most important is all the inode life
> cycle and unlink/unshare operations sorted, especially w.r.t.
> locking and transactions.

Heh heh heh.  I found another one hidden one of those just this evening
(zeroing ranges for punch/zerorange).  In addition to buffered writes,
dio writes (which lazily fall back to buffered for now), rm, and block-punch.
There was also the little matter of making FIEMAP work.

> Once we have that sanely under control, page cache sharing is a
> matter of reflecting the shared extents in the page cache via
> multiple page references (i.e. one for each mapping tree the page is
> inserted into) rather than playing games trying to share mappings.
>
> > 4. Modify xfs_setup_inode() to be inumber_cow-aware, opening the shared inode
> >    when set, and assigning to i_mapping->host.
> 
> So you've added an xfs_iget() call to xfs_setup_inode() if the inode
> is a reflink inode? How does the locking work?
> 
> > 5. Modify xfs_file_open() S_ISREG && inumber_cow && FMODE_WRITE:
> >  - clear inumber_cow
> >  - restore i_mapping->host to the inode being opened
> >  - invalidate_inode_pages2(i_mapping)
> >  - allocate all needed space in this inode
> >  - copy size from shared inode to this inode
> >  - copy all pages from the previously shared inode to this one
> 
> So the copy is done on open(O_WR) of the file. files get opened in
> write mode for more than just writing date to them. e.g. fchmod,
> writing new attributes, etc. We don't want a data cow if this is the
> first operation that is done to the inode after the reflink is
> made...

That could be a lot of stuff to copy?

Anyway, I might have a RFC(rap) posting of reflink patches tomorrow.

I haven't added the "this is reflinked" inode flag optimization yet.  I bet the
performance hit is awful.  But, no more patching at 1am.

--D

> > 6. Modify xfs_vn_getattr() to copy stat->size from the shared inode if
> >    inumer_cow is set.
> 
> Why wouldn't you just put the size in the COW inode?
> 
> > 7. Modify unlink paths to be nlink_cow-aware
> 
> What happens when parent is unlinked? It stays on the unlinked list?
> Or do we know have the possibility that unlink can have two inodes
> to free in xfs-inactive()?  What's the implication for transaction
> reservations? What's the locking order?
> 
> reflink doesn't have this "readonly parent" wart. If the parent is
> modified, it breaks it's side of the reflink via COW. Simple,
> symmetric handling of the problem - it doesn't matter who writes to
> what, the unmodified inodes do not see any change.
> 
> > 8. To simplify things, inodes that have nlink_cow no longer can be opened for
> >    writing, they've become immutable backing stores of sorts for other inodes.
> 
> So, someone marks a file as COW, and that makes it read only? What
> happens if there is already an open fd with write permissions when
> the COW ioctl is called? So, where does the immutable state of the
> inode get enforced? What's preventing it from being unlinked?
> 
> > The one major question mark I see in this achieving correctness is the live
> > manipulation of i_mapping->host.  It seems to work in my casual testing on
> > x86_64, this actually all works surprisingly well considering it's a fast and
> > nasty hack.  I abuse invalidate_inode_pages2() as if a truncate has occurred,
> > forcing subsequent page accesses to fault and revisit i_mapping->host.
> 
> I don't think you can't change mapping->host, as there is no one
> lock that protects access to it.
> 
> > The goal here was to achieve something overlayfs-like but with inodes capable
> > of being chowned/chmodded without triggering the copy_up, operations likely
> > necessary for supporting user namespaces in containers.
> 
> Yup, reflink gives us this.
> 
> > Additionally,
> > overlayfs has some serious correctness issues WRT multiply-opened files
> > spanning the lower and upper layers due to one of the subsequent opens being a
> > writer.  Since my hack creates distinct inodes from the start, no such issue
> > exists.
> 
> Right, but it's not something that overlayfs can use unconditionally
> (reflink or your COW file) because overlayfs can have upper and
> lower directories on different filesystems.  And, as has also been
> pointed out, it doesn't work on lower filesystems that are
> read-only.
> 
> As such, I'm not sure that overlayfs will ever support this change
> of behaviour as it restricts overlayfs to a single writeable
> filesystem. I know, most people only need overlayfs on a single
> writeable filesystem, but....
> 
> > However, one of the attractive things about overlayfs is the page cache sharing
> > which my XFS hack doesn't enable due to the distinct struct addres_space's and
> > i_mapping->host host exchanging.  I had hoped to explore something KSM-like
> > with maybe some hints from XFS for these shared inode pagess saying "hey these
> > are read-only pages in a shared inode, try deduplicate them!" but KSM is purely
> > in vma land so that seems like a lot of work to make happen.
> 
> As I mentioned before, page cache sharing needs to reflect shared
> blocks on disk. Playing games with KSM or mappings isn't a viable
> solution.
> 
> > In any case, thanks for the responses and any further input you guys have.
> > Obviously a more granular btrfs-like reflink is preferable, and I'd welcome it.
> > It just seemed like doing something overlayfs-like would be a whole lot easier
> > to get working in a relatively short time.
> 
> Overlayfs is creating more problems than it is solving. Hacking
> random one-off functionality into filesystems is not the way to fix
> overlayfs problems. Let's do reflink properly, then we can modify
> overlayfs to be able to use reflink when all it's working
> directories are on the same filesystem.  The we can add page cache
> sharing to reflink files, and all of a sudden, the overlayfs page
> cache sharing problem is solved. Not to mention it is solved for all
> the other people that want reflink files for their applications,
> too.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2015-04-30  8:04 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-21  1:06 question re: xfs inode to inode copy implementation xfs
2015-04-21  4:28 ` Darrick J. Wong
2015-04-21 22:27   ` Dave Chinner
2015-04-23  0:44     ` Darrick J. Wong
2015-04-23  1:13       ` Dave Chinner
2015-04-23  6:40         ` Darrick J. Wong
2015-04-26 22:40           ` Dave Chinner
2015-04-30  0:53   ` xfs
2015-04-30  6:26     ` Dave Chinner
2015-04-30  8:03       ` Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.