* [RFC] Preparing for XFS reflink D-day
@ 2016-12-10  8:04 Amir Goldstein
  2016-12-10 19:42 ` Darrick J. Wong
  2016-12-12  1:59 ` Dave Chinner
  0 siblings, 2 replies; 11+ messages in thread
From: Amir Goldstein @ 2016-12-10  8:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

Dave,

I would like to have some system's storage pre-formatted
with rmapbt and reflink support without allowing reflink until
the day comes where the feature is declared stable.

I realize that rmapbt/reflink features are declared unstable and
bugs could certainly be lurking without doing any reflinks at all.
However, I estimate that the class of bugs introduced by heavily
reflinked file systems is going to take more time to tame.

Considering these options for said systems:
1. kernel v4.8.y or v4.9.y and mkfs.xfs -m rmapbt=1
2. kernel v4.9.y and mkfs.xfs -m rmapbt=1
3. kernel v4.9.y and mkfs.xfs -m rmapbt=1,reflink=1
    and new mount option -onoreflink
4. kernel v4.9.y and mkfs.xfs -m rmapbt=1,reflinkbt=1
    (separate rocompat features reflinkbt from reflink)

Options 1-2 would require adding support in xfs_admin to
enable reflink on an existing fs (by cloning the bmbt).

Option 3 would require adding a simple noreflink
mount option to disable reflink related ops.

Option 4 requires changing mkfs.xfs before 4.9 release
and possibly setting rocompat feature reflink on first file
reflink. There are several precedents to this sort of "set
on first use" feature in ext4, not sure if there are any in xfs.

The benefit of having this functionality is that others,
like me, could provide more testing for the refcount<=1
use case. I myself intend to test refcount>1 as well, but
the goal of getting refcount<=1 ready for production is
higher priority.

Another benefit from option #4 is that you may be able
to declare rmapbt=1,reflinkbt=1 stable and/or default
mkfs options prior to declaring reflink=1 stable.

Which, if any, of the options above would you be willing
to endorse?

Darrick,

I seem to recall you talking about enabling reflink on existing
fs sometime before, but I could not find that reference.
I suppose you had an idea of how this should be done?

Amir.


* D-Day of course stands for Darrick's-day ;-)


* Re: [RFC] Preparing for XFS reflink D-day
  2016-12-10  8:04 [RFC] Preparing for XFS reflink D-day Amir Goldstein
@ 2016-12-10 19:42 ` Darrick J. Wong
  2016-12-11  8:38   ` Amir Goldstein
  2016-12-12  1:59 ` Dave Chinner
  1 sibling, 1 reply; 11+ messages in thread
From: Darrick J. Wong @ 2016-12-10 19:42 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Dave Chinner, linux-xfs

On Sat, Dec 10, 2016 at 10:04:39AM +0200, Amir Goldstein wrote:
> Dave,
> 
> I would like to have some system's storage pre-formatted
> with rmapbt and reflink support without allowing reflink until
> the day comes where the feature is declared stable.

Heh heh heh.... ;)

> I realize that rmapbt/reflink features are declared unstable and
> bugs could certainly be lurking without doing any reflinks at all.
> However, I estimate that the class of bugs introduced by heavily
> reflinked file systems is going to take more time to tame.

Yes, probably.  It seems reasonably stable on a young FS, though we'll
see how gracefully it ages.  There's probably mistakes in the ENOSPC
handling since that seems to be everyone's Achilles heel.

> Considering these options for said systems:
> 1. kernel v4.8.y or v4.9.y and mkfs.xfs -m rmapbt=1
> 2. kernel v4.9.y and mkfs.xfs -m rmapbt=1
> 3. kernel v4.9.y and mkfs.xfs -m rmapbt=1,reflink=1
>     and new mount option -onoreflink
> 4. kernel v4.9.y and mkfs.xfs -m rmapbt=1,reflinkbt=1
>     (separate rocompat features reflinkbt from reflink)
> 
> Options 1-2 would require adding support in xfs_admin to
> enable reflink on an existing fs (by cloning the bmbt).

Not sure why you'd clone the bmbt...?

You'd simply use the rmap information to calculate a new refcountbt,
just like the offline repair already knows how to do.
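
To illustrate the idea only (this is not the xfs_repair implementation): if
reverse-mapping records tell you which owner maps each data extent, the
shared ranges fall out of a sweep over the extent edges, emitting a record
wherever two or more owners overlap.  The struct and function names below
are hypothetical and the records are simplified; they are not the on-disk
XFS formats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified reverse-mapping record for one AG. */
struct rmap_rec {
	uint32_t start;   /* first block of the extent */
	uint32_t len;     /* number of blocks, must be > 0 */
	uint64_t owner;   /* inode that maps these blocks */
};

/* Hypothetical refcount record: a run of blocks with 'count' owners. */
struct refc_rec {
	uint32_t start;
	uint32_t len;
	uint32_t count;
};

struct edge {
	uint64_t pos;     /* block number where the owner count changes */
	int delta;        /* +1 at an extent start, -1 just past its end */
};

static int edge_cmp(const void *a, const void *b)
{
	const struct edge *x = a, *y = b;

	if (x->pos != y->pos)
		return x->pos < y->pos ? -1 : 1;
	return 0;
}

/*
 * Emit one record per run of blocks that has two or more owners.
 * Blocks with a single owner produce no record at all.
 * Returns the number of records written, or -1 if memory runs out
 * or 'out' is too small.
 */
long build_refcounts(const struct rmap_rec *rmaps, size_t n,
		     struct refc_rec *out, size_t out_max)
{
	struct edge *edges;
	size_t i, nedges = 2 * n, nout = 0;
	long depth = 0;

	if (n == 0)
		return 0;
	edges = malloc(nedges * sizeof(*edges));
	if (!edges)
		return -1;
	for (i = 0; i < n; i++) {
		assert(rmaps[i].len > 0);
		edges[2 * i] = (struct edge){ (uint64_t)rmaps[i].start, +1 };
		edges[2 * i + 1] = (struct edge){
			(uint64_t)rmaps[i].start + rmaps[i].len, -1 };
	}
	qsort(edges, nedges, sizeof(*edges), edge_cmp);

	i = 0;
	while (i < nedges) {
		uint64_t pos = edges[i].pos;

		/* Apply every edge at this block before looking at depth. */
		while (i < nedges && edges[i].pos == pos)
			depth += edges[i++].delta;

		/* Blocks [pos, next edge) all have 'depth' owners. */
		if (i < nedges && depth >= 2) {
			uint32_t len = (uint32_t)(edges[i].pos - pos);
			struct refc_rec *last = nout ? &out[nout - 1] : NULL;

			if (last && (uint64_t)last->start + last->len == pos &&
			    last->count == (uint32_t)depth) {
				last->len += len;   /* merge adjacent equal runs */
			} else {
				if (nout == out_max) {
					free(edges);
					return -1;
				}
				out[nout++] = (struct refc_rec){
					(uint32_t)pos, len, (uint32_t)depth };
			}
		}
	}
	free(edges);
	return (long)nout;
}
```

With no overlapping extents the function returns zero records, which matches
the point made later in this mail that an unshared filesystem has an empty
refcount btree.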

Now obviously if you don't have rmap information then you have to walk
all the inode data forks in the system to get rmap information... we
don't share non-data blocks and never will, particularly since we've
stamped owner information into all the metadata headers.

(More on this later)

> Option 3 would require adding a simple noreflink
> mount option to disable reflink related ops.
> 
> Option 4 requires changing mkfs.xfs before 4.9 release
> and possibly setting rocompat feature reflink on first file
> reflink. There are several precedents to this sort of "set
> on first use" feature in ext4, not sure if there are any in xfs.

There are several of these in XFS, but I don't want to burn another
feature bit if I can avoid it.  Dave might have a different opinion
though?

> The benefit of having this functionality is that others,
> like me, could provide more testing for the refcount<=1
> use case. I myself intend to test refcount>1 as well, but
> the goal of getting refcount<=1 ready for production is
> higher priority.

If you're building your own kernels, you could just tweak
xfs_reflink_remap_range with something like:

if (!capable(CAP_SYS_ADMIN))
	return -EOPNOTSUPP;

so that only you (well, root) can make files share blocks.

> Another benefit from option #4 is that you may be able
> to declare rmapbt=1,reflinkbt=1 stable and/or default
> mkfs options prior to declaring reflink=1 stable.

Ew, more mkfs options to test. :(

(I'd call it refcountbt anyway.)

In any case there is no point to having a separate refcountbt option
because if nobody ever shares any blocks, each AG will have a single
refcountbt block with zero records that never gets touched.

> Which, if any, of the options above would you be willing
> to endorse?

Well, to paraphrase the ext4 manual,

"The recommended method for upgrading an [old] filesystem to [a new one]
is to back up the entire volume, reformat the storage device with [the
new mkfs options], and restore the entire volume onto the fresh
filesystem."

https://ext4.wiki.kernel.org/index.php/UpgradeToExt4

But I'd also say read on...

> Darrick,
> 
> I seem to recall you talking about enabling reflink on existing
> fs sometime before, but I could not find that reference.
> I suppose you had an idea of how this should be done?

Christoph posted the first patchset to enable at runtime:
http://oss.sgi.com/archives/xfs/2016-06/msg00053.html

ISTR Dave didn't really like the idea of a mount option.  I think it's
a little awkward to toggle fs features that way and would rather just
implement a SET_GEOMETRY ioctl that the administrator can call to flip
on certain features.

As for dynamically constructing a new rmapbt or a new refcountbt --
there's a few tricky bits that have to be dealt with before we start
turning on features.  The first is ensuring that the log size is
sufficient to handle the new options being turned on, the second is to
teach xfs_repair not to freak out if its precalculated notions of where
the root inode should be don't square with where it actually is
(provided the root inode looks ok), and the third is making sure there's
enough space in each AG to build the relevant data structures.  There
might be more; I haven't had time to investigate this.

xfs_repair already knows how to construct fresh rmap and refcount
btrees; it does this any time you run xfs_repair without -n.  I've done
evil things like manually flip on the two feature bits via xfs_db and
run xfs_repair to build the btrees.  It works, more or less, though
messing with your filesystem with the debugger is sketchy. ;)

As far as doing things online, we actually now have the raw pieces you'd
need to enable (some) features.  The upcoming online repair code can
(via questionable VFS interactions) freeze incoming IO so that we can
scan the whole FS to construct a new rmap btree.  It also can use
existing rmap information to construct a new refcount btree.

In theory we could allow people to turn things on dynamically provided
the FS meets all the requirements (log space, rootino doesn't move, free
AG space).  It'd be pretty easy to do this for reflink since the space
requirements are minimal, and much more risky to let people do that for
rmap.  We'd need thorough testing, too.
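
A minimal sketch of what such a pre-flight check might look like, assuming
the three requirements listed above are the whole story (they may not be).
Every name here is hypothetical; nothing below is an existing xfsprogs or
kernel interface.  The point is only that all conditions are verified for
every AG before anything on disk is modified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum precheck_result {
	PRECHECK_OK = 0,
	PRECHECK_LOG_TOO_SMALL,
	PRECHECK_ROOTINO_MOVES,
	PRECHECK_AG_FULL,
};

/* Hypothetical summary of the state a feature upgrade would depend on. */
struct fs_summary {
	uint64_t log_blocks;            /* current log size */
	uint64_t min_log_blocks;        /* minimum log size with the feature on */
	uint64_t rootino;               /* where the root inode is today */
	uint64_t expected_rootino;      /* where it would be expected afterwards */
	uint64_t blocks_needed_per_ag;  /* space the new btree needs in each AG */
	const uint64_t *ag_free_blocks; /* free blocks, one entry per AG */
	size_t ag_count;
};

/*
 * Check every requirement up front.  On PRECHECK_AG_FULL, *bad_ag (if
 * non-NULL) is set to the first AG that lacks space.
 */
enum precheck_result upgrade_precheck(const struct fs_summary *fs,
				      size_t *bad_ag)
{
	size_t i;

	assert(fs != NULL);
	if (fs->log_blocks < fs->min_log_blocks)
		return PRECHECK_LOG_TOO_SMALL;
	if (fs->rootino != fs->expected_rootino)
		return PRECHECK_ROOTINO_MOVES;
	for (i = 0; i < fs->ag_count; i++) {
		if (fs->ag_free_blocks[i] < fs->blocks_needed_per_ag) {
			if (bad_ag)
				*bad_ag = i;
			return PRECHECK_AG_FULL;
		}
	}
	return PRECHECK_OK;
}
```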

--D

PS: I resurrected spaceman and ported it to GETFSMAP.

> 
> Amir.
> 
> 
> * D-Day of course stands for Darrick's-day ;-)


* Re: [RFC] Preparing for XFS reflink D-day
  2016-12-10 19:42 ` Darrick J. Wong
@ 2016-12-11  8:38   ` Amir Goldstein
  2016-12-11 18:27     ` Darrick J. Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Amir Goldstein @ 2016-12-11  8:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, Christoph Hellwig

On Sat, Dec 10, 2016 at 9:42 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Sat, Dec 10, 2016 at 10:04:39AM +0200, Amir Goldstein wrote:
...
>
>> I realize that rmapbt/reflink features are declared unstable and
>> bugs could certainly be lurking without doing any reflinks at all.
>> However, I estimate that the class of bugs introduced by heavily
>> reflinked file systems is going to take more time to tame.
>
> Yes, probably.  It seems reasonably stable on a young FS, though we'll
> see how gracefully it ages.  There's probably mistakes in the ENOSPC
> handling since that seems to be everyone's Achilles heel.
>

So we seem to be in agreement on the requirement.

>> Considering these options for said systems:
>> 1. kernel v4.8.y or v4.9.y and mkfs.xfs -m rmapbt=1
>> 2. kernel v4.9.y and mkfs.xfs -m rmapbt=1
>> 3. kernel v4.9.y and mkfs.xfs -m rmapbt=1,reflink=1
>>     and new mount option -onoreflink
>> 4. kernel v4.9.y and mkfs.xfs -m rmapbt=1,reflinkbt=1
>>     (separate rocompat features reflinkbt from reflink)
>>
>> Options 1-2 would require adding support in xfs_admin to
>> enable reflink on an existing fs (by cloning the bmbt).
>
> Not sure why you'd clone the bmbt...?
>

Just because I don't know reflink well enough.
I mistakenly thought that refcount=1 extents are tracked in refcountbt.

> You'd simply use the rmap information to calculate a new refcountbt,
> just like the offline repair already knows how to do.
>

Good, so you are saying that the tool to enable refcount offline is already
available and I can basically choose option #2.
In that case, no further questions :-)

> Now obviously if you don't have rmap information then you have to walk
> all the inode data forks in the system to get rmap information... we
> don't share non-data blocks and never will, particularly since we've
> stamped owner information into all the metadata headers.
>

I don't even want to think about enabling rmapbt.


>>
>> Option 4 requires changing mkfs.xfs before 4.9 release
>> and possibly setting rocompat feature reflink on first file
>> reflink. There are several precedents to this sort of "set
>> on first use" feature in ext4, not sure if there are any in xfs.
>
> There are several of these in XFS, but I don't want to burn another
> feature bit if I can avoid it.  Dave might have a different opinion
> though?
>

Considering how easy it is to enable reflink offline (by running repair)
I myself see no reason for a new feature flag.

>> The benefit of having this functionality is that others,
>> like me, could provide more testing for the refcount<=1
>> use case. I myself intend to test refcount>1 as well, but
>> the goal of getting refcount<=1 ready for production is
>> higher priority.
>
> If you're building your own kernels, you could just tweak
> xfs_reflink_remap_range with something like:
>
> if (!capable(CAP_SYS_ADMIN))
>         return -EOPNOTSUPP;
>
> so that only you (well, root) can make files share blocks.
>

Sure, I know that :)
I am not the admin in this case though, I am the developer
who wants to prevent other developers and admins from
messing with reflink before it is ripe.
And let us not forget:
a76b5b0 fs: try to clone files first in vfs_copy_file_range
And what would happen when the nfsd on these systems tries to
copy a file range?


>
> Well, to paraphrase the ext4 manual,
>
> "The recommended method for upgrading an [old] filesystem to [a new one]
> is to back up the entire volume, reformat the storage device with [the
> new mkfs options], and restore the entire volume onto the fresh
> filesystem."
>

Words of wisdom, no doubt, but reality calls for adjustments sometimes.
For the case of systems that are going to be deployed in production
and would not tolerate long downtime, I would relax this recommendation
to:
- backup the entire volume
- make the upgrade
- followup with regression testing after the upgrade
- if anything goes wrong, take the system offline and restore from backup

This just moved the penalty of downtime to the unlikely() branch.

I realize that there are other options to avoid long downtime
(switch to new server/volume), but the case above is valid as well.



>
>> Darrick,
>>
>> I seem to recall you talking about enabling reflink on existing
>> fs sometime before, but I could not find that reference.
>> I suppose you had an idea of how this should be done?
>
> Christoph posted the first patchset to enable at runtime:
> http://oss.sgi.com/archives/xfs/2016-06/msg00053.html
>

Thanks for that pointer.
Christoph, do you still have a use case for turning on reflink?
Does it have to be "online" or is enabling offline good enough?

...

>
> In theory we could allow people to turn things on dynamically provided
> the FS meets all the requirements (log space, rootino doesn't move, free
> AG space).  It'd be pretty easy to do this for reflink since the space
> requirements are minimal, and much more risky to let people do that for
> rmap.  We'd need thorough testing, too.
>

:-/ Pre-allocating log space and AG space is an issue.
I can tweak mkfs.xfs to preallocate those for my use case,
but I am hoping that the need meets a bigger crowd and xfsprogs 4.9
would have a solution for that.

How about having mkfs.xfs 4.9 preallocate the space needed for
refcountbt if rmapbt=1? It's a bit of a hack, which Dave most probably
won't like, but it avoids the need to define a new refcountbt=1 flag
just for the preallocation.

Thoughts?

Amir.


P.S.: I have a lesson to share:
6 years ago I released ext3 snapshots feature
It was deployed in production after a relatively short beta period
and very little community testing/review.
Since then, it was deployed on many systems and not once
did it cause any data corruption.
From an engineering POV, I consider this a miracle, but to aid that
miracle I had a powerful tool at my disposal.
I implemented e2fsck -x flag, where if anything messed up
wrt refcounting, snapshots could be discarded and file system
would be brought back to health.
The tool proved itself useful in several cases (used with no
developer intervention).

The lesson is that if xfs_repair is able to de-refcount all blocks
(given sufficient disk space) and turn off the reflink feature and if
that functionality is well tested, then more users would have the
courage to enable reflink during its "beta" phase.


* Re: [RFC] Preparing for XFS reflink D-day
  2016-12-11  8:38   ` Amir Goldstein
@ 2016-12-11 18:27     ` Darrick J. Wong
  2016-12-11 19:23       ` Amir Goldstein
  0 siblings, 1 reply; 11+ messages in thread
From: Darrick J. Wong @ 2016-12-11 18:27 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Dave Chinner, linux-xfs, Christoph Hellwig

On Sun, Dec 11, 2016 at 10:38:21AM +0200, Amir Goldstein wrote:
> On Sat, Dec 10, 2016 at 9:42 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Sat, Dec 10, 2016 at 10:04:39AM +0200, Amir Goldstein wrote:
> ...
> >
> >> I realize that rmapbt/reflink features are declared unstable and
> >> bugs could certainly be lurking without doing any reflinks at all.
> >> However, I estimate that the class of bugs introduced by heavily
> >> reflinked file systems is going to take more time to tame.
> >
> > Yes, probably.  It seems reasonably stable on a young FS, though we'll
> > see how gracefully it ages.  There's probably mistakes in the ENOSPC
> > handling since that seems to be everyone's Achilles heel.
> >
> 
> So we seem to be in agreement on the requirement.

I'm willing to consider code to dynamically enable reflink, yes.

> >> Considering these options for said systems:
> >> 1. kernel v4.8.y or v4.9.y and mkfs.xfs -m rmapbt=1
> >> 2. kernel v4.9.y and mkfs.xfs -m rmapbt=1
> >> 3. kernel v4.9.y and mkfs.xfs -m rmapbt=1,reflink=1
> >>     and new mount option -onoreflink
> >> 4. kernel v4.9.y and mkfs.xfs -m rmapbt=1,reflinkbt=1
> >>     (separate rocompat features reflinkbt from reflink)
> >>
> >> Options 1-2 would require adding support in xfs_admin to
> >> enable reflink on an existing fs (by cloning the bmbt).
> >
> > Not sure why you'd clone the bmbt...?
> >
> 
> Just because I don't know reflink well enough.
> I mistakenly thought that refcount=1 extents are tracked in refcountbt.

Reference counts are tracked in the refcountbt.

(Inode fork) block maps are tracked in the bmbt.

> > You'd simply use the rmap information to calculate a new refcountbt,
> > just like the offline repair already knows how to do.
> >
> 
> Good, so you are saying that the tool to enable refcount offline is already
> available and I can basically choose option #2.
> In that case, no further questions :-)

Keep in mind that editing the filesystem with xfs_db and running
xfs_repair to fill in the gaps is totally unsupported behavior!

If you break it you get to keep all the pieces.

I'd much, much, much rather have a properly engineered and tested
upgrade path, which I guess we could do for reflink.

> > Now obviously if you don't have rmap information then you have to walk
> > all the inode data forks in the system to get rmap information... we
> > don't share non-data blocks and never will, particularly since we've
> > stamped owner information into all the metadata headers.
> >
> 
> I don't even want to think about enabling rmapbt.
> 
> 
> >>
> >> Option 4 requires changing mkfs.xfs before 4.9 release
> >> and possibly setting rocompat feature reflink on first file
> >> reflink. There are several precedents to this sort of "set
> >> on first use" feature in ext4, not sure if there are any in xfs.
> >
> > There are several of these in XFS, but I don't want to burn another
> > feature bit if I can avoid it.  Dave might have a different opinion
> > though?
> >
> 
> Considering how easy it is to enable reflink offline (by running repair)
> I myself see no reason for a new feature flag.
> 
> >> The benefit of having this functionality is that others,
> >> like me, could provide more testing for the refcount<=1
> >> use case. I myself intend to test refcount>1 as well, but
> >> the goal of getting refcount<=1 ready for production is
> >> higher priority.
> >
> > If you're building your own kernels, you could just tweak
> > xfs_reflink_remap_range with something like:
> >
> > if (!capable(CAP_SYS_ADMIN))
> >         return -EOPNOTSUPP;
> >
> > so that only you (well, root) can make files share blocks.
> >
> 
> Sure, I know that :)
> I am not the admin in this case though, I am the developer
> who wants to prevent other developers and admins from
> messing with reflink before it is ripe.
> And let us not forget:
> a76b5b0 fs: try to clone files first in vfs_copy_file_range
> And what would happen when the nfsd on these systems tries to
> copy a file range?

<shrug> vfs_copy_file_range -> xfs_clone_file_range ->
xfs_reflink_remap_range....

> 
> >
> > Well, to paraphrase the ext4 manual,
> >
> > "The recommended method for upgrading an [old] filesystem to [a new one]
> > is to back up the entire volume, reformat the storage device with [the
> > new mkfs options], and restore the entire volume onto the fresh
> > filesystem."
> >
> 
> Words of wisdom, no doubt, but reality calls for adjustments sometimes.
> For the case of systems that are going to be deployed in production
> and would not tolerate long downtime, I would relax this recommendation
> to:
> - backup the entire volume
> - make the upgrade
> - followup with regression testing after the upgrade
> - if anything goes wrong, take the system offline and restore from backup
> 
> This just moved the penalty of downtime to the unlikely() branch.
> 
> I realize that there are other options to avoid long downtime
> (switch to new server/volume), but the case above is valid as well.
> 
> 
> 
> >
> >> Darrick,
> >>
> >> I seem to recall you talking about enabling reflink on existing
> >> fs sometime before, but I could not find that reference.
> >> I suppose you had an idea of how this should be done?
> >
> > Christoph posted the first patchset to enable at runtime:
> > http://oss.sgi.com/archives/xfs/2016-06/msg00053.html
> >
> 
> Thanks for that pointer.
> Christoph, do you still have a use case for turning on reflink?
> Does it have to be "online" or is enabling offline good enough?

(I think Christoph found some other way around this.)

> 
> ...
> 
> >
> > In theory we could allow people to turn things on dynamically provided
> > the FS meets all the requirements (log space, rootino doesn't move, free
> > AG space).  It'd be pretty easy to do this for reflink since the space
> > requirements are minimal, and much more risky to let people do that for
> > rmap.  We'd need thorough testing, too.
> >
> 
> :-/ Pre-allocating log space and AG space is an issue.
> I can tweak mkfs.xfs to preallocate those for my use case,
> but I am hoping that the need meets a bigger crowd and xfsprogs 4.9
> would have a solution for that.

In general, mkfs seems to create a log that's more than large enough to
handle a dynamic increase in features.

> How about having mkfs.xfs 4.9 preallocate the space needed for
> refcountbt if rmapbt=1? It's a bit of a hack, which Dave most probably
> won't like, but it avoids the need to define a new refcountbt=1 flag
> just for the preallocation.

Chances are pretty good there's enough space unless your fs is totally
full, and if it's full then you might seriously consider a full
backup/restore cycle onto a bigger disk to reduce fragmentation.

> Thoughts?
> 
> Amir.
> 
> 
> P.S.: I have a lesson to share:
> 6 years ago I released ext3 snapshots feature
> It was deployed in production after a relatively short beta period
> and very little community testing/review.
> Since then, it was deployed on many systems and not once
> did it cause any data corruption.
> From an engineering POV, I consider this a miracle, but to aid that
> miracle I had a powerful tool at my disposal.
> I implemented e2fsck -x flag, where if anything messed up
> wrt refcounting, snapshots could be discarded and file system
> would be brought back to health.
> The tool proved itself useful in several cases (used with no
> developer intervention).
> 
> The lesson is that if xfs_repair is able to de-refcount all blocks
> (given sufficient disk space) and turn off the reflink feature and if
> that functionality is well tested, then more users would have the
> courage to enable reflink during its "beta" phase.

Sure, but IIRC you could nuke all the corrupt snapshots by deleting the
hidden snapshots file and releasing all the space it referenced back to
the filesystem, which makes it easy to zap all the snapshots if
something is amiss.

Un-sharing an fs full of reflinked files requires us to build code to
iterate every bmbt of every file (or to cross-reference every refcountbt
record against the rmapbt to find the sharers) and then relocate the
data, which is quite a bit more complex... and unnecessary since we can
rebuild all the broken refcount metadata anyway.

--D


* Re: [RFC] Preparing for XFS reflink D-day
  2016-12-11 18:27     ` Darrick J. Wong
@ 2016-12-11 19:23       ` Amir Goldstein
  2016-12-12  2:45         ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Amir Goldstein @ 2016-12-11 19:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, Christoph Hellwig

On Sun, Dec 11, 2016 at 8:27 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Sun, Dec 11, 2016 at 10:38:21AM +0200, Amir Goldstein wrote:
>> On Sat, Dec 10, 2016 at 9:42 PM, Darrick J. Wong
>> <darrick.wong@oracle.com> wrote:
>> > On Sat, Dec 10, 2016 at 10:04:39AM +0200, Amir Goldstein wrote:
>> ...
>> >
>> >> I realize that rmapbt/reflink features are declared unstable and
>> >> bugs could certainly be lurking without doing any reflinks at all.
>> >> However, I estimate that the class of bugs introduced by heavily
>> >> reflinked file systems is going to take more time to tame.
>> >
>> > Yes, probably.  It seems reasonably stable on a young FS, though we'll
>> > see how gracefully it ages.  There's probably mistakes in the ENOSPC
>> > handling since that seems to be everyone's Achilles heel.
>> >
>>
>> So we seem to be in agreement on the requirement.
>
> I'm willing to consider code to dynamically enable reflink, yes.
>

Well, if we can get a consensus on what should be supported
I can work on it and if you prefer to implement I will be happy to test.

>>
>> Good, so you are saying that the tool to enable refcount offline is already
>> available and I can basically choose option #2.
>> In that case, no further questions :-)
>
> Keep in mind that editing the filesystem with xfs_db and running
> xfs_repair to fill in the gaps is totally unsupported behavior!
>
> If you break it you get to keep all the pieces.
>
> I'd much, much, much rather have a properly engineered and tested
> upgrade path, which I guess we could do for reflink.
>

I'd much much much much rather that as well.

>> > If you're building your own kernels, you could just tweak
>> > xfs_reflink_remap_range with something like:
>> >
>> > if (!capable(CAP_SYS_ADMIN))
>> >         return -EOPNOTSUPP;
>> >
>> > so that only you (well, root) can make files share blocks.
>> >
>>
>> Sure, I know that :)
>> I am not the admin in this case though, I am the developer
>> who wants to prevent other developers and admins from
>> messing with reflink before it is ripe.
>> And let us not forget:
>> a76b5b0 fs: try to clone files first in vfs_copy_file_range
>> And what would happen when the nfsd on these systems tries to
>> copy a file range?
>
> <shrug> vfs_copy_file_range -> xfs_clone_file_range ->
> xfs_reflink_remap_range....
>

What I meant is that I could probably make sure there are no
obvious programs on our systems that issue a clone ioctl,
but nfsd, which runs as root, is going to be a source of
copy/clone requests from clients, so the
!capable(CAP_SYS_ADMIN) test is insufficient.
If I have to patch our systems, I will add -onoreflink.
>>
>> :-/ pre-allocate log space and AG space is an issue.
>> I can tweak mkfs.xfs to preallocate those for my use case,
>> but I am hoping that the need meets a bigger crowd and xfsprogs 4.9
>> would have a solution for that.
>
> In general, mkfs seems to create a log that's more than large enough to
> handle a dynamic increase in features.
>

So for large enough arrays I suppose that preallocating log space is not
an issue?

>> How about having mkfs.xfs 4.9 preallocate the space needed for
>> refcountbt if rmapbt=1? It's a bit of a hack, which Dave most probably
>> won't like, but it avoids the need to define a new refcountbt=1 flag
>> just for the preallocation.
>
> Chances are pretty good there's enough space unless your fs is totally
> full, and if it's full then you might seriously consider a full
> backup/restore cycle onto a bigger disk to reduce fragmentation.
>

I thought there was an issue with reserved space per AG and
that the amount of reserved space for btree blocks depends on the
features. If a single full AG is not an issue then never mind.

>>
>> The lesson is that if xfs_repair is able to de-refcount all blocks
>> (given sufficient disk space) and turn off the reflink feature and if
>> that functionality is well tested, then more users would have the
>> courage to enable reflink during its "beta" phase.
>
> Sure, but IIRC you could nuke all the corrupt snapshots by deleting the
> hidden snapshots file and releasing all the space it referenced back to
> the filesystem, which makes it easy to zap all the snapshots if
> something is amiss.
>
> Un-sharing an fs full of reflinked files requires us to build code to
> iterate every bmbt of every file (or to cross-reference every refcountbt
> record against the rmapbt to find the sharers) and then relocate the
> data, which is quite a bit more complex... and unnecessary since we can
> rebuild all the broken refcount metadata anyway.
>

You are right, of course, from technical POV, but psychologically, if people
know they have a safe way back to what they know and trust, it is easier
for them to make the leap...

Amir.


* Re: [RFC] Preparing for XFS reflink D-day
  2016-12-10  8:04 [RFC] Preparing for XFS reflink D-day Amir Goldstein
  2016-12-10 19:42 ` Darrick J. Wong
@ 2016-12-12  1:59 ` Dave Chinner
  2016-12-12  5:06   ` Amir Goldstein
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2016-12-12  1:59 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Darrick J. Wong, linux-xfs

On Sat, Dec 10, 2016 at 10:04:39AM +0200, Amir Goldstein wrote:
> Dave,
> 
> I would like to have some system's storage pre-formatted
> with rmapbt and reflink support without allowing reflink until
> the day comes where the feature is declared stable.

Amir, you should have realised by now that - as a matter of policy -
I simply say no to anything that is intended as a short-term
convenience for a special interest use case that has no long term
benefit to the wider community.

Your timeline for downstream customer feature delivery doesn't change
our upstream feature stabilisation and support plans.  If you want
to run your user base on reflink=1 filesystems on 4.9 kernels then
feel free to support them directly. That's entirely your own choice,
made entirely at your own risk as a downstream distributor.  We'll
triage and fix bugs as you report them and incorporate fixes and
improvements as relevant, but we're not going to do any more than
that.  "Use at your own risk" means exactly that.

FWIW, Christoph has taken this "downstream risk" path for his own
clients and customers that are using the reflink functionality in
their systems. He doesn't bother us with triaging or fixing issues
his customers hit; all we see from him is a constant stream of bug
fixes and improvements to the experimental features his customers
are using...

> Considering these options for said systems:
> 1. kernel v4.8.y or v4.9.y and mkfs.xfs -m rmapbt=1
> 2. kernel v4.9.y and mkfs.xfs -m rmapbt=1
> 3. kernel v4.9.y and mkfs.xfs -m rmapbt=1,reflink=1
>     and new mount option -onoreflink
> 4. kernel v4.9.y and mkfs.xfs -m rmapbt=1,reflinkbt=1
>     (separate rocompat features reflinkbt from reflink)
> 
> Options 1-2 would require adding support in xfs_admin to
> enable reflink on an existing fs

If we can properly design and implement the addition of the reflink
btree and reliably test it then this would be my preferred option.
However, I can see lots of intricate problems with adding reflink
after the fact.  e.g. if we've already got a full AG we won't be
able to have the refcount btree added to it dynamically, so how do
we prevent this sort of failure half way through the conversion?

And if we do fail half way through the conversion (for whatever
reason), how the hell do we clean up the mess reliably?

So while this seems simple in concept, I don't think the
implementation is going to be simple regardless of whether we do the
conversion online or offline. It'll be yet more experimental code
that will take a good length of time to test and stabilise, and has
the definite possibility of making it take longer to stabilise the
reflink feature..

From this perspective, it makes no sense for upstream to invest time
and effort into dynamic reflink enabling like this because it's just
a short term workaround to avoid waiting for the new feature to
stabilise. Our time is much better spent stabilising that feature to
reduce the amount of time it is considered EXPERIMENTAL.

> Option 3 would require adding a simple noreflink
> mount option to disable reflink related ops.

You can add it to your own kernels easily enough, but don't expect
us to carry one-off, special case mount options like this in the
upstream kernel.

> Option 4 requires changing mkfs.xfs before 4.9 release
> and possibly setting rocompat feature reflink on first file
> reflink. There are several precedents to this sort of  "set
> on first use" feature in ext4, not sure if there are any in xfs.

There's a few in XFS, historically speaking (attribute fork layout,
v1->v2 inodes, etc). These days, however, we tend to avoid silent
dynamic feature bit addition because of the "upgrade kernel, random
feature bit gets added silently, upgrade causes other problems,
downgrade kernel, old kernel can't mount fs anymore" type of
problem it can cause the wider userbase.

FWIW, setting a feature bit on first reflink will require kernel
changes, and the soonest you'd get them into the kernel is 4.11 if
all the issues and problems could be sorted before then. So this
doesn't help you at all for the 4.9 kernel. It also requires that
the refcountbt is being maintained for refcount=1 extents, otherwise
it introduces all the same problems as options 1-2. IMO, this is the
least appealing of all the options you presented.

> Another benefit from option #4 is that you may be able
> to declare rmapbt=1,reflinkbt=1 stable and/or default
> mkfs options prior to declaring reflink=1 stable.

Again, no. Once the kernel libxfs code is declared stable, we'll
merge that back into the next xfsprogs release which will also mark
that feature as stable. Madness lies in trying to support anything
else in the upstream code base.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Preparing for XFS reflink D-day
  2016-12-11 19:23       ` Amir Goldstein
@ 2016-12-12  2:45         ` Dave Chinner
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2016-12-12  2:45 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Darrick J. Wong, linux-xfs, Christoph Hellwig

On Sun, Dec 11, 2016 at 09:23:38PM +0200, Amir Goldstein wrote:
> > Un-sharing an fs full of reflinked files requires us to build code to
> > iterate every bmbt of every file (or to cross-reference every refcountbt
> > record against the rmapbt to find the sharers) and then relocate the
> > data, which is quite a bit more complex... and unnecessary since we can
> > rebuild all the broken refcount metadata anyway.
> 
> You are right, of course, from technical POV, but psychologically, if people
> know they have a safe way back to what they know and trust, it is easier
> for them to make the leap...

/me shakes his head.

If people don't feel safe running experimental code that might go
wrong, they're going to be absolutely thrilled to hear that when it
does go wrong, there's even more experimental code that will try to
fix it up again.

No.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Preparing for XFS reflink D-day
  2016-12-12  1:59 ` Dave Chinner
@ 2016-12-12  5:06   ` Amir Goldstein
  2016-12-12  7:44     ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Amir Goldstein @ 2016-12-12  5:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Mon, Dec 12, 2016 at 3:59 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Dec 10, 2016 at 10:04:39AM +0200, Amir Goldstein wrote:
>> Dave,
>>
>> I would like to have some system's storage pre-formatted
>> with rmapbt and reflink support without allowing reflink until
>> the day comes where the feature is declared stable.
>
> Amir, you should have realised by now that - as a matter of policy -
> I simply say no to anything that is intended as a short-term
> convenience for a special interest use case that has no long term
> benefit to the wider community.
>

Absolutely! I heard that loud and clear and I fully agree with you.
I am putting my use case out there in hope that others that share a
similar use case can chime in.


> Your timeline for downstream customer feature delivery doesn't change
> our upstream feature stabilisation and support plans.  If you want
> to run your user base on reflink=1 filesystems on 4.9 kernels then
> feel free to support them directly. That's entirely your own choice
> made entirely at your own risk as a downstream distributor.  We'll
> triage and fix bugs as you report them and incorporate fixes and
> improvements as relevant, but we're not going to do any more than
> that.  "Use at your own risk" means exactly that.
>

This makes sense. To be clear, my intention was not exactly to run
4.9 with reflink=1, but to run 4.9 with "reflink pre-formatted".
That means addressing exactly the issues you mentioned below
of preallocating the refcountbt space in all AGs.
Hence, my suggestion to split the feature refcountbt=1 from reflink=1
where the former means maintain the refcount=1 tree and the latter
to allow refcount>1.

But as Darrick correctly noted, there is no real point in maintaining
a refcount=1 tree that can be easily calculated when enabling refcount,
so it is sufficient to set a feature flag (or geometry property) to always
reserve space for the reflink feature.

This split of reflink to "ready" and "effective" should be simple enough
that I should be able to maintain it out of tree, but I will always prefer an
acceptable upstreamed solution - second best choice is out of tree
solution which you are willing to endorse as being the least crazy.

Naturally, in that case, I will have also provided a lot of testing to the
upgrade in the lab and down the road in production systems.

> FWIW, Christoph has taken this "downstream risk" path for his own
> clients and customers that are using the reflink functionality in
> their systems. He doesn't bother us with triaging or fixing issues
> his customers hit; all we see from him is a constant stream of bug
> fixes and improvements to the experimental features his customers
> are using...
>

If I have to go down that path I will, but only as last resort.

>> Considering these options for said systems:
>> 1. kernel v4.8.y or v4.9.y and mkfs.xfs -m rmapbt=1
>> 2. kernel v4.9.y and mkfs.xfs -m rmapbt=1
>> 3. kernel v4.9.y and mkfs.xfs -m rmapbt=1,reflink=1
>>     and new mount option -onoreflink
>> 4. kernel v4.9.y and mkfs.xfs -m rmapbt=1,reflinkbt=1
>>     (separate rocompat features reflinkbt from reflink)
>>
>> Options 1-2 would require adding support in xfs_admin to
>> enable reflink on an existing fs
>
> If we can properly design and implement the addition of the reflink
> btree and reliably test it then this would be my preferred option.

That's all I needed to hear.
If Darrick won't pick up that glove I will.
Testing is definitely on me.

> However, I can see lots of intricate problems with adding reflink
> after the fact.  e.g. if we've already got a full AG we won't be
> able to have the refcount btree added to it dynamically, so how do
> we prevent this sort of failure half way through the conversion?
>

All the more reason to preallocate that space anyway with mkfs.xfs
v4.9. Whether we should do that for the general case, and under what options,
is really up to you. But I am most definitely going to have to make those
adjustments for our systems, so when I get to it I will share whatever
I did.


>
>> Option 3 would require adding a simple noreflink
>> mount option to disable reflink related ops.
>
> You can add it to your own kernels easily enough, but don't expect
> us to carry one-off, special case mount options like this in the
> upstream kernel.
>

Of course. It's an easy patch to carry out of tree until D-day.
Probably the easiest option for me, so if the preferred solution
(option 2) doesn't go through, I will go with this one.

>> Option 4 requires changing mkfs.xfs before 4.9 release
>> and possibly setting rocompat feature reflink on first file
>> reflink. There are several precedents to this sort of  "set
>> on first use" feature in ext4, not sure if there are any in xfs.
>
> There's a few in XFS, historically speaking (attribute fork layout,
> v1->v2 inodes, etc). These days, however, we tend to avoid silent
> dynamic feature bit addition because of the "upgrade kernel, random
> feature bit gets added silently, upgrade causes other problems,
> downgrade kernel, old kernel can't mount fs anymore" type of
> problem it can cause the wider userbase.
>
> FWIW, setting a feature bit on first reflink will require kernel
> changes, and the soonest you'd get them into the kernel is 4.11 if
> all the issues and problems could be sorted before then. So this
> doesn't help you at all for the 4.9 kernel. It also requires that
> the refcountbt is being maintained for refcount=1 extents, otherwise
> it introduces all the same problems as options 1-2. IMO, this is the
> least appealing of all the options you presented.
>

I agree. option 4 is not appealing to me as I have no requirement
to enable reflink online. I proposed it only in case somebody else
does.

Thanks for clearing up my questions.
Amir.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Preparing for XFS reflink D-day
  2016-12-12  5:06   ` Amir Goldstein
@ 2016-12-12  7:44     ` Dave Chinner
  2016-12-12  8:10       ` Amir Goldstein
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2016-12-12  7:44 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Darrick J. Wong, linux-xfs

On Mon, Dec 12, 2016 at 07:06:37AM +0200, Amir Goldstein wrote:
> On Mon, Dec 12, 2016 at 3:59 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sat, Dec 10, 2016 at 10:04:39AM +0200, Amir Goldstein wrote:
> > our upstream feature stabilisation and support plans.  If you want
> > to run your user base on reflink=1 filesystems on 4.9 kernels then
> > feel free to support them directly. That's entirely your own choice
> > made entirely at your own risk as a downstream distributor.  We'll
> > triage and fix bugs as you report them and incorporate fixes and
> > improvements as relevant, but we're not going to do any more than
> > that.  "Use at your own risk" means exactly that.
> >
> 
> This makes sense. To be clear, my intention was not exactly to run
> 4.9 with reflink=1, but to run 4.9 with "reflink pre-formatted".
> That means addressing exactly the issues you mentioned below
> of preallocating the refcountbt space in all AGs.

This requires all sorts of changes to the code to make that work.
The ship has already sailed for 4.9 so stop thinking we're
going to add /anything/ to mkfs in 4.9.

And given that xfsprogs 4.10 needs to match the functionality we're
about to merge into the kernel for the 4.10 cycle, we're unlikely to
add support for on-disk changes that aren't supported by the 4.10
kernel - that dev cycle is already over, too.

The next 3 month dev cycle is for 4.11, and there's already all the
online scrub/repair code lined up for this release, so getting that
merged holds much greater priority than hacking reflink formats to
work around the EXPERIMENTAL tag.

IOWs, the /earliest/ anything like this could be done is 4.11, but
I'd be really hesitant to rush anything like this into 4.11 because
of all the stuff we already have in the pipeline.  And given that
we're currently looking at around the 4.12 release timeframe for
moving to full support for reflink, what does all this extra
"refcount-but-not-reflink" format shenanigans buy us? At best it's
going to be useful for a 3-6 month window, with very very limited
relevance or use to the rest of the XFS userbase?

When I look at what you are proposing from this perspective, the
cost-benefit analysis does not fall favourably on the side of making
these changes.

> > FWIW, Christoph has taken this "downstream risk" path for his own
> > clients and customers that are using the reflink functionality in
> > their systems. He doesn't bother us with triaging or fixing issues
> > his customers hit; all we see from him is a constant stream of bug
> > fixes and improvements to the experimental features his customers
> > are using...
> >
> 
> If I have to go down that path I will, but only as last resort.

That's the preferable first path - it's been widely used across the
vendor storage ecosystem for many years. It's proven to be a good
model over many years because it puts no additional support or
development load on upstream but provides a steady flow of
additional features and bug fixes back to us.

That's the exact opposite of what you are proposing: that
we supply you with the functionality you require, immediately,
without forward planning, without caring about impact on
established lines of development, etc.

There's just a little bit of difference here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Preparing for XFS reflink D-day
  2016-12-12  7:44     ` Dave Chinner
@ 2016-12-12  8:10       ` Amir Goldstein
  2016-12-13  0:56         ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Amir Goldstein @ 2016-12-12  8:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Mon, Dec 12, 2016 at 9:44 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Dec 12, 2016 at 07:06:37AM +0200, Amir Goldstein wrote:
...
>
> IOWs, the /earliest/ anything like this could be done is 4.11, but
> I'd be really hesitant to rush anything like this into 4.11 because
> of all the stuff we already have in the pipeline.  And given that
> we're currently looking at around the 4.12 release timeframe for
> moving to full support for reflink, what does all this extra
> "refcount-but-not-reflink" format shenanigans buy us? At best it's
> going to be useful for a 3-6 month window, with very very limited
> relevance or use to the rest of the XFS userbase?
>

If I truly believed that 4.12 is a realistic target, I wouldn't have bothered
at all. But to get there we need to have a sufficiently large beta group
of bleeding edge testers, don't you think?
In fact, I am hoping that overlayfs "clone up" is merged to 4.10, creating
a big incentive to CoreOS users to start experimenting with docker
with overlayfs over XFS reflink, so there may be hope for that beta group.

FYI, and unrelated, in the upcoming docker 1.13 release I implemented
support for container disk usage quota with overlayfs storage driver
over xfs using project quotas.
This is catching up with a feature that the btrfs/zfs/lvm storage drivers
already have.
So soon enough, docker users will have a new incentive to use xfs
as the base fs.

> When I look at what you are proposing from this perspective, the
> cost-benefit analysis does not fall favourably on the side of making
> these changes.
>

There is only one single benefit to what I am proposing
and that is for admins that want to install systems near the day that
reflink is declared stable and wait for the D-day before activation.
How many such admins are there, I have no idea. This is why I am
posting this requirement on a public mailing list - to find out.

Following your feedback, I suppose I am going to go for the
option of carrying a patch for -onoreflink.
And yes, take the support of experimental reflink (refcount<=1)
on myself.

Cheers,
Amir.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Preparing for XFS reflink D-day
  2016-12-12  8:10       ` Amir Goldstein
@ 2016-12-13  0:56         ` Dave Chinner
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2016-12-13  0:56 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Darrick J. Wong, linux-xfs

On Mon, Dec 12, 2016 at 10:10:57AM +0200, Amir Goldstein wrote:
> On Mon, Dec 12, 2016 at 9:44 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Dec 12, 2016 at 07:06:37AM +0200, Amir Goldstein wrote:
> ...
> >
> > IOWs, the /earliest/ anything like this could be done is 4.11, but
> > I'd be really hesitant to rush anything like this into 4.11 because
> > of all the stuff we already have in the pipeline.  And given that
> > we're currently looking at around the 4.12 release timeframe for
> > moving to full support for reflink, what does all this extra
> > "refcount-but-not-reflink" format shenanigans buy us? At best it's
> > going to be useful for a 3-6 month window, with very very limited
> > relevance or use to the rest of the XFS userbase?
> >
> 
> If I truly believed that 4.12 is a realistic target, I wouldn't have bothered
> at all. But to get there we need to have a sufficiently large beta group
> of bleeding edge testers, don't you think?

We've never had to worry about this in the past. We've got plenty of
people already running reflink enabled filesystems (my production
systems included) and experimenting with it. People from the
gluster, ceph, container storage infrastructure, etc areas have
already been testing and evaluating the reflink functionality in
XFS, even before it was merged. They've been asking for this
functionality for /years/ for doing things like VM image snapshots,
so we've got no shortage of people testing and using it already.

Keep in mind that there's been years of work behind reflink to get
where we are now, so there's been lots of things going on behind the
scenes that you simply don't know about. We've had a "sufficiently
large beta group" for months before the feature was merged....

> In fact, I am hoping that overlayfs "clone up" is merged to 4.10, creating
> a big incentive to CoreOS users to start experimenting with docker
> with overlayfs over XFS reflink, so there may be hope for that beta group.

What we don't want is /production users/ to be guinea pigs for a new
on-disk functionality. That's just asking for trouble, especially if
we find a bug in the on-disk format.  We've done just fine in the
past with a small group of very knowledgeable users testing new
functionality, so I see no reason to treat reflink differently and
thereby exposing a wide swath of unsuspecting users to excessive
risk unnecessarily.

.....

> FYI, and unrelated, in the upcoming docker 1.13 release I implemented
> support for container disk usage quota with overlayfs storage driver
> over xfs using project quotas.

Great to hear! It's only taken ~4 years since I first suggested this
container fs space management model for it to be implemented.... :P

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-12-13  0:56 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-10  8:04 [RFC] Preparing for XFS reflink D-day Amir Goldstein
2016-12-10 19:42 ` Darrick J. Wong
2016-12-11  8:38   ` Amir Goldstein
2016-12-11 18:27     ` Darrick J. Wong
2016-12-11 19:23       ` Amir Goldstein
2016-12-12  2:45         ` Dave Chinner
2016-12-12  1:59 ` Dave Chinner
2016-12-12  5:06   ` Amir Goldstein
2016-12-12  7:44     ` Dave Chinner
2016-12-12  8:10       ` Amir Goldstein
2016-12-13  0:56         ` Dave Chinner
