* Why do we need these mount options?
@ 2021-01-14  2:12 waxhead
  2021-01-14 16:37 ` David Sterba
  0 siblings, 1 reply; 14+ messages in thread
From: waxhead @ 2021-01-14  2:12 UTC (permalink / raw)
  To: linux-btrfs

Howdy,

I was looking through the mount options and, being a madman with strong
opinions, I can't help thinking that a lot of them do not really belong
as mount options at all, but should rather be properties set on the
subvolume - for example the toplevel subvolume.

And any options set on a child subvolume should override the parent
subvolume, the way I see it.

From a quick look, I don't see why these should be mount options
at all.

autodefrag / noautodefrag
commit
compress / compress-force
datacow / nodatacow
datasum / nodatasum
discard / nodiscard
inode_cache / noinode_cache
space_cache / nospace_cache
ssd / ssd_spread / nossd / nossd_spread
user_subvol_rm_allowed

Stuff like compress and nodatacow can be set with chattr, so there are,
as far as I am aware, three methods of setting compression, for example.

Either by mount options in fstab, by chattr or by btrfs property set
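
For example (the device UUID, the paths, and the zstd choice below are
just placeholders I made up), the same intent can be expressed three ways:

  # 1) mount option in /etc/fstab - applies to the whole mount
  UUID=xxxx  /data  btrfs  compress=zstd  0 0

  # 2) chattr - per file/directory, but it cannot pick the algorithm
  chattr +c /data/somedir

  # 3) btrfs property - per file/directory/subvolume path
  btrfs property set /data/somedir compression zstd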

I think it would be more consistent to have one interface for adjusting 
behavior.

As I asked before, the future plan to have different storage profiles on 
subvolumes seems to have been sneakily(?) removed from the wiki. If that 
is indeed a dropped goal I can see why it makes sense to keep the mount 
options; if not, I think the mount options should go in favor of btrfs 
property set.


* Re: Why do we need these mount options?
  2021-01-14  2:12 Why do we need these mount options? waxhead
@ 2021-01-14 16:37 ` David Sterba
  2021-01-15  0:02   ` waxhead
  2021-01-15  3:54   ` Zygo Blaxell
  0 siblings, 2 replies; 14+ messages in thread
From: David Sterba @ 2021-01-14 16:37 UTC (permalink / raw)
  To: waxhead; +Cc: linux-btrfs

Hi,

On Thu, Jan 14, 2021 at 03:12:26AM +0100, waxhead wrote:
> I was looking through the mount options and being a madman with strong 
> opinions I can't help thinking that a lot of them does not really belong 
> as mount options at all, but should rather be properties set on the 
> subvolume - for example the toplevel subvolume.

I agree that some of them should not be there, but mount options still
have their own use case. They can be set from the outside and are
supposed to affect the whole lifetime of the filesystem mount.

However, they've been used as default values for some operations, which
is something that points more to what you suggest. And the fact that
they're not persistent and need to be stored in /etc/fstab also weighs
in favor of storing them inside the fs.

> And any options set on a child subvolume should override the parent 
> subvolume the way I see it.

Yeah, that's one of the ways to do it, and I see it that way as well.
A property set closer to the object takes precedence, roughly:

mount < subvolume < directory < file

but the last time we had a discussion about that, the other opinion was
that mount options beat everything, perhaps because they can be set from
the outside and forced to override whatever is on the filesystem.
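
To illustrate the "closer wins" idea with what exists today (the
subvolume level is still mostly aspirational, and the paths below are
made up), roughly:

  # new files created in the subvolume's top directory inherit zstd
  btrfs property set /mnt/subvol compression zstd
  # new files created in this directory inherit lzo instead
  btrfs property set /mnt/subvol/scratch compression lzo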

> By having a quick look - I don't see why these should be mount options 
> at all.
> 
> autodefrag / noautodefrag
> commit
> compress / compress-force
> datacow / nodatacow
> datasum / nodatasum
> discard / nodiscard
> inode_cache / noinode_cache
> space_cache / nospace_cache
> ssd / ssd_spread / nossd / nossd_spread
> user_subvol_rm_allowed

So there are historical reasons and interface limitations that led to
the current state and multiple ways to do things.

Per-inode attributes were originally a private ioctl of ext2 that other
filesystems adopted for feature parity, and as the interface is bit-based,
no additional values can be set (e.g. a compression algorithm); there is
a limited number of bits, no precedence, and there are inter-flag
dependencies.
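
A quick way to see that limitation today (the path is hypothetical):

  chattr +c /mnt/somefile    # sets the single "compress" bit
  lsattr /mnt/somefile       # the bit cannot say which algorithm,
                             # a level, or "force"

so anything richer has to go through a different interface.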

> Stuff like compress and nodatacow can be set with chattr so there is as 
> far as I am aware three methods of setting compression for example.
> 
> Either by mount options in fstab, by chattr or by btrfs property set
> 
> I think it would be more consistent to have one interface for adjusting 
> behavior.

I agree with that, and there's a proposal to unify it all into properties
as the interface once and for all, accessible through extended
attributes. But there are many more ways to do that wrong than right, so
it hasn't been implemented so far.

A suggestion for an inode flag here and there comes up from time to time,
each fixing one problem. Repeating that would lead to the kind of mess
the existing mount options already demonstrate, so we've been there and
need to do it the right way.

> As I asked before, the future plan to have different storage profiles on 
> subvolumes seem to have been sneakily(?) removed from the wiki

I don't think the per-subvolume storage options were ever tracked on the
wiki; the closest match is per-subvolume mount options, which is still
there:

https://btrfs.wiki.kernel.org/index.php/Project_ideas#Per-subvolume_mount_options

> - if that is indeed a dropped goal I can see why it makes sense to
> keep the mount options, if not I think the mount options should go in
> favor of btrfs property set.


* Re: Why do we need these mount options?
  2021-01-14 16:37 ` David Sterba
@ 2021-01-15  0:02   ` waxhead
  2021-01-15 15:29     ` David Sterba
  2021-01-15  3:54   ` Zygo Blaxell
  1 sibling, 1 reply; 14+ messages in thread
From: waxhead @ 2021-01-15  0:02 UTC (permalink / raw)
  To: dsterba, linux-btrfs

David Sterba wrote:
> Hi,
> 
> On Thu, Jan 14, 2021 at 03:12:26AM +0100, waxhead wrote:
>> I was looking through the mount options and being a madman with strong
>> opinions I can't help thinking that a lot of them does not really belong
>> as mount options at all, but should rather be properties set on the
>> subvolume - for example the toplevel subvolume.
> 
> I agree that some of them should not be there but mount options still
> have their own usecase. They can be set from the outside and are
> supposed to affect the whole filesystem mount lifetime.

Yes, some of them. But not all; the ones I list, for example, can 
perfectly well be set on the toplevel subvolume.

> However, they've been used as default values for some operations, which
> is something that points more to what you suggest. And as they're not
> persistent and need to be stored in /etc/fstab is also weighing for
> storage inside the fs.
> 
>> And any options set on a child subvolume should override the parent
>> subvolume the way I see it.
> 
> Yeah, that's one of the ways how to do it and I see it that way as well.
> Property set closer to the object takes precedence, roughly
> 
> mount < subvolume < directory < file
> 
> but last time we had a discussion about that, the other opinion was
> that mount options beat everything, perhaps because they can be set from
> the outside and forced to override whatever is on the filesystem.
> 
Well, I agree with that. Mount options should beat everything, and
precisely because of that I think that some mount options should be
deprecated and instead be set per subvolume.

>> By having a quick look - I don't see why these should be mount options
>> at all.
>>
>> autodefrag / noautodefrag
>> commit
>> compress / compress-force
>> datacow / nodatacow
>> datasum / nodatasum
>> discard / nodiscard
>> inode_cache / noinode_cache
>> space_cache / nospace_cache
>> ssd / ssd_spread / nossd / nossd_spread
>> user_subvol_rm_allowed
> 
> So there are historical reasons and interface limitations that led to
> current state and multiple ways to do things.
> 
> Per-inode attributes were originally private ioctl of ext2 that other
> filesystems adopted due to feature parity, and as the interface was
> bit-based, no additional values could be set eg. compression, limited
> number of bits, no precedence, inter-flag dependencies.
> 
Ok thanks, I was not aware of that.

>> Stuff like compress and nodatacow can be set with chattr so there is as
>> far as I am aware three methods of setting compression for example.
>>
>> Either by mount options in fstab, by chattr or by btrfs property set
>>
>> I think it would be more consistent to have one interface for adjusting
>> behavior.
> 
> I agree with that and there's a proposal to unify that into the
> properties as interface once for all, accessible through the extended
> attributes. But there are much more ways how to do that wrong so it
> hasn't been implemented so far.
> 
Good to know. By the way, another nugget of entertainment is that with
btrfs property set the parameters come after the object. IMHO
command->params->target is usually the better way to go; the current
order seems a bit backwards.
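
I.e. the real syntax today puts the object first:

  btrfs property set /mnt/subvol compression zstd

while something along the lines of the (made-up) form

  btrfs property set compression=zstd /mnt/subvol

would read more like command->params->target to me.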

> A suggestion for an inode flag here and there comes from time to time,
> fixing one problem each time. Repeating that would lead to a mess that
> can be demonstrated on the existing mount options, so we've been there
> and need to do it the right way.
> 
>> As I asked before, the future plan to have different storage profiles on
>> subvolumes seem to have been sneakily(?) removed from the wiki
> 
> I don't think the per-subvolume storage options were ever tracked on
> wiki, the closest match is per-subvolume mount options that's still
> there
> 
> https://btrfs.wiki.kernel.org/index.php/Project_ideas#Per-subvolume_mount_options
> 
Well, how about this from our friends at archive.org?
http://web.archive.org/web/20200117205248/https://btrfs.wiki.kernel.org/index.php/Main_Page

Here it clearly states that object-level mirroring and striping is 
planned. Maybe I misinterpret this, but I understand it as (amongst 
other things) configurable storage profiles per subvolume.

>> - if that is indeed a dropped goal I can see why it makes sense to
>> keep the mount options, if not I think the mount options should go in
>> favor of btrfs property set.


* Re: Why do we need these mount options?
  2021-01-14 16:37 ` David Sterba
  2021-01-15  0:02   ` waxhead
@ 2021-01-15  3:54   ` Zygo Blaxell
  2021-01-15  9:32     ` waxhead
  2021-01-16  7:39     ` Andrei Borzenkov
  1 sibling, 2 replies; 14+ messages in thread
From: Zygo Blaxell @ 2021-01-15  3:54 UTC (permalink / raw)
  To: dsterba, waxhead, linux-btrfs

On Thu, Jan 14, 2021 at 05:37:29PM +0100, David Sterba wrote:
> Hi,
> 
> On Thu, Jan 14, 2021 at 03:12:26AM +0100, waxhead wrote:
> > I was looking through the mount options and being a madman with strong 
> > opinions I can't help thinking that a lot of them does not really belong 
> > as mount options at all, but should rather be properties set on the 
> > subvolume - for example the toplevel subvolume.
> 
> I agree that some of them should not be there but mount options still
> have their own usecase. They can be set from the outside and are
> supposed to affect the whole filesystem mount lifetime.
> 
> However, they've been used as default values for some operations, which
> is something that points more to what you suggest. And as they're not
> persistent and need to be stored in /etc/fstab is also weighing for
> storage inside the fs.
> 
> > And any options set on a child subvolume should override the parent 
> > subvolume the way I see it.
> 
> Yeah, that's one of the ways how to do it and I see it that way as well.
> Property set closer to the object takes precedence, roughly
> 
> mount < subvolume < directory < file

Wearing my grumpy sysadmin hat, I have occasionally wanted the mount
options to override the subvolume and inode properties.  Examples below.

> but last time we had a discussion about that, the other opinion was
> that mount options beat everything, perhaps because they can be set from
> the outside and forced to override whatever is on the filesystem.
> 
> > By having a quick look - I don't see why these should be mount options 
> > at all.
> > 
> > autodefrag / noautodefrag

That makes sense as an inode property--you only want autodefrag on a few
files and they're usually easy to spot.

> > inode_cache / noinode_cache

That one is already gone as of v5.11-rc1.

> > commit
> > space_cache / nospace_cache
> > ssd / ssd_spread / nossd / nossd_spread

How could those be anything other than filesystem-wide options?

> > discard / nodiscard

Maybe, but probably requires too much introspection in a fast path (we'd
have to add a check for the last owner of a deleted extent to see if it
had 'discard' set on some parent level).

On the other hand, I'm in favor of deprecating the whole discard option
and going with fstrim instead.  discard in its current form tends to
increase write wear rather than decrease it, especially on metadata-heavy
workloads.  discard is roughly equivalent to running fstrim thousands
of times a day, which is clearly bad for many (most?  all?) SSDs.

It might be possible to make the discard mount option's behavior more
sane (e.g. discard only full chunks, configurable minimum discard length,
discard only within data chunks, discard only once per hour, etc).
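
For example, instead of the discard mount option, just leave the default
nodiscard in place and batch the work (the mount point is a placeholder):

  # run manually, or from cron once an hour:
  # 0 * * * *  /sbin/fstrim /mnt
  fstrim -v /mnt

(util-linux also ships an fstrim.timer unit that does this weekly, if
that cadence is enough.)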

> > compress / compress-force
> > datacow / nodatacow
> > datasum / nodatasum

Here's where I prefer the mount option over the more local attributes,
because I'd like filesystem-level sysadmin overrides for those.
i.e. disallow all users, even privileged ones, from being able to create
files that don't have csums or compression on a filesystem.

It might be better to allow the per-inode/subvol properties to be set,
but have a mount option to override them.  I don't want to deal with
legacy applications that might throw an error if they get an error return
from a nodatacow chattr, so "silently drop the chattr" is better than
"prevent chattr with error."

I also want to store backup / live-mountable copies of filesystems that
have these attributes set, but that's a lot more complicated to implement.
The filesystem can't just ignore a nodatasum bit on an inode once it
has been set, and I can solve the problem above the filesystem by
modifying btrfs receive or other backup code to strip those bits and
store them somewhere else.

Something like this seems mandatory to have working crypto integrity in
the future, but silent data corruption on cheap SSDs is bad enough to
make it a requirement for plaintext storage now.

compress has some significant holes in its per-inode interface: no way
to specify compress level in an inode property, no way to specify force
on some files but not others.

Compress is far easier to override from a mount option--we just don't
look at the inode bits in needs_compress and everything else still works
the same as before.

> > user_subvol_rm_allowed

I'd like "user_subvol_create_disallowed" too.  Unprivileged users can
create subvols, and that breaks backups that rely on atomic btrfs
snapshots.  It could be a feature (it allows users to exclude parts of
their home directory from backups) but most users I've met who have
discovered this "feature" the hard way didn't enjoy it.

Historically I had other reasons to disallow subvol creates by
unprivileged users, but they are mostly removed in 4.18, now that 'rmdir'
works on an empty subvol.
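
For reference, the behavior I mean looks like this today (the path is a
placeholder):

  # as an unprivileged user on a default-mounted filesystem:
  btrfs subvolume create ~/not-in-backups   # succeeds
  btrfs subvolume delete ~/not-in-backups   # refused unless mounted
                                            # with user_subvol_rm_allowed
  rmdir ~/not-in-backups                    # works since 4.18 if empty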

> So there are historical reasons and interface limitations that led to
> current state and multiple ways to do things.
> 
> Per-inode attributes were originally private ioctl of ext2 that other
> filesystems adopted due to feature parity, and as the interface was
> bit-based, no additional values could be set eg. compression, limited
> number of bits, no precedence, inter-flag dependencies.
> 
> > Stuff like compress and nodatacow can be set with chattr so there is as 
> > far as I am aware three methods of setting compression for example.
> > 
> > Either by mount options in fstab, by chattr or by btrfs property set
> > 
> > I think it would be more consistent to have one interface for adjusting 
> > behavior.
> 
> I agree with that and there's a proposal to unify that into the
> properties as interface once for all, accessible through the extended
> attributes. But there are much more ways how to do that wrong so it
> hasn't been implemented so far.
> 
> A suggestion for an inode flag here and there comes from time to time,
> fixing one problem each time. Repeating that would lead to a mess that
> can be demonstrated on the existing mount options, so we've been there
> and need to do it the right way.
> 
> > As I asked before, the future plan to have different storage profiles on 
> > subvolumes seem to have been sneakily(?) removed from the wiki
> 
> I don't think the per-subvolume storage options were ever tracked on
> wiki, the closest match is per-subvolume mount options that's still
> there
> 
> https://btrfs.wiki.kernel.org/index.php/Project_ideas#Per-subvolume_mount_options
> 
> > - if that is indeed a dropped goal I can see why it makes sense to
> > keep the mount options, if not I think the mount options should go in
> > favor of btrfs property set.


* Re: Why do we need these mount options?
  2021-01-15  3:54   ` Zygo Blaxell
@ 2021-01-15  9:32     ` waxhead
  2021-01-16  0:42       ` Zygo Blaxell
  2021-01-16  7:39     ` Andrei Borzenkov
  1 sibling, 1 reply; 14+ messages in thread
From: waxhead @ 2021-01-15  9:32 UTC (permalink / raw)
  To: Zygo Blaxell, dsterba, linux-btrfs

Zygo Blaxell wrote:
> 
>>> commit
>>> space_cache / nospace_cache
>>> ssd / ssd_spread / nossd / nossd_spread
> 
> How could those be anything other than filesystem-wide options?
> 

Well, being me, I tend to live in a fantasy world where BTRFS has 
complete world domination and has become the VFS layer.
As I have nagged about before on this list, I really think that the 
only sensible way forward for BTRFS (or dare I say BTRFS2) would be to 
make it possible to assign "storage device groups" where you can make 
certain btrfs device ids belong to group a, b, c, etc.

And with that it would be possible to assign a weight to subvolumes so 
that they would be preferred to be stored on group a (SSDs perhaps), 
while other subvolumes would be stored mostly or exclusively on HDDs, 
fast HDDs, archival HDDs, etc. Maybe a bit over-enthusiastic thinking 
perhaps, but hopefully you see now why I think it is right that this is 
not filesystem-wide, but subvolume-based properties.

>>> discard / nodiscard
> 
> Maybe, but probably requires too much introspection in a fast path (we'd
> have to add a check for the last owner of a deleted extent to see if it
> had 'discard' set on some parent level).
> 
> On the other hand, I'm in favor of deprecating the whole discard option
> and going with fstrim instead.  discard in its current form tends to
> increase write wear rather than decrease it, especially on metadata-heavy
> workloads.  discard is roughly equivalent to running fstrim thousands
> of times a day, which is clearly bad for many (most?  all?) SSDs.
> 
> It might be possible to make the discard mount option's behavior more
> sane (e.g. discard only full chunks, configurable minimum discard length,
> discard only within data chunks, discard only once per hour, etc).
> 
Interesting. It might also make sense to use the free space cache and a 
slow LRU mechanism, e.g. "these chunks have not been in use for 64 
hours/days" or something similar.

>>> compress / compress-force
>>> datacow / nodatacow
>>> datasum / nodatasum
> 
> Here's where I prefer the mount option over the more local attributes,
> because I'd like filesystem-level sysadmin overrides for those.
> i.e. disallow all users, even privileged ones, from being able to create
> files that don't have csums or compression on a filesystem.
> 
Then how about a mount option that allows only root to do certain 
things? E.g. a security restriction.

> 
>>> user_subvol_rm_allowed
> 
> I'd like "user_subvol_create_disallowed" too.  Unprivileged users can
> create subvols, and that breaks backups that rely on atomic btrfs
> snapshots.  It could be a feature (it allows users to exclude parts of
> their home directory from backups) but most users I've met who have
> discovered this "feature" the hard way didn't enjoy it.
> 
> Historically I had other reasons to disallow subvol creates by
> unprivileged users, but they are mostly removed in 4.18, now that 'rmdir'
> works on an empty subvol.
> 
Again see above...


* Re: Why do we need these mount options?
  2021-01-15  0:02   ` waxhead
@ 2021-01-15 15:29     ` David Sterba
  2021-01-16  1:47       ` waxhead
  0 siblings, 1 reply; 14+ messages in thread
From: David Sterba @ 2021-01-15 15:29 UTC (permalink / raw)
  To: waxhead; +Cc: dsterba, linux-btrfs

On Fri, Jan 15, 2021 at 01:02:12AM +0100, waxhead wrote:
> > I don't think the per-subvolume storage options were ever tracked on
> > wiki, the closest match is per-subvolume mount options that's still
> > there
> > 
> > https://btrfs.wiki.kernel.org/index.php/Project_ideas#Per-subvolume_mount_options
> > 
> Well how about this from our friends archive.org ?
> http://web.archive.org/web/20200117205248/https://btrfs.wiki.kernel.org/index.php/Main_Page
> 
> Here it clearly states that object level mirroring and striping is 
> planned. Maybe I misinterpret this , but I understand this as (amongst 
> other things) configurable storage profiles per subvolume.

I see. The list on the main page is supposed to list features that we could
promise to be implemented "soon". For all the ideas there's the specific
project page, where it does not matter too much when they will be
implemented; it's kind of a pool.

In the wiki edit that removed the object-level storage item
(https://btrfs.wiki.kernel.org/index.php?title=Main_Page&diff=prev&oldid=33190)
I also removed:

* Online filesystem check
* Object-level mirroring and striping
* In-band deduplication (happens during writes)
* Hot data tracking and moving to faster devices (or provided on the generic VFS layer)

For each of these tasks there's nobody working on it, to my knowledge,
though there was some interest and maybe RFC patches in the past.

The object-level storage idea/task can be added to the Project_ideas
page, so it's not lost.


* Re: Why do we need these mount options?
  2021-01-15  9:32     ` waxhead
@ 2021-01-16  0:42       ` Zygo Blaxell
  2021-01-16  1:57         ` waxhead
  0 siblings, 1 reply; 14+ messages in thread
From: Zygo Blaxell @ 2021-01-16  0:42 UTC (permalink / raw)
  To: waxhead; +Cc: dsterba, linux-btrfs

On Fri, Jan 15, 2021 at 10:32:39AM +0100, waxhead wrote:
> Zygo Blaxell wrote:
> > 
> > > > commit
> > > > space_cache / nospace_cache
> > > > ssd / ssd_spread / nossd / nossd_spread
> > 
> > How could those be anything other than filesystem-wide options?
> > 
> 
> Well being me, I tend to live in a fantasy world where BTRFS have complete
> world domination and has become the VFS layer.
> As I have nagged about before on this list - I really think that the only
> sensible way forward for BTRFS (or dare I say BTRFS2) would be to make it
> possible to assign "storage device groups" where you can make certain btrfs
> device ids belong to group a,b,c, etc...
> 
> And with that it would be possible to assign a weight to subvolumes so that
> they would be preferred to be stored on group a (SSD's perhaps), while other
> subvolumes would be stored mostly or exclusively on HDD's, Fast HDD's,
> Archival HDD's etc... So maybe a bit over enthusiastic in thinking perhaps ,
> but hopefully you see now why I think it is right that this is not
> filesystem-wide, but subvolume-based properties.

Sure, that's all wonderful, but it has nothing to do with any of those
mount options.  ;)

commit sets a timer that forces a filesystem-wide sync() every now
and then.  space_cache picks one of the allocator implementations, also
for the entire filesystem.  ssd and related options affect the behavior
of the metadata allocator and superblocks.
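
They're all knobs on the mount itself, e.g. (device and numbers are
made up):

  mount -o commit=120,space_cache=v2,ssd /dev/sdX /mnt

and there's no per-subvolume place to hang any of them.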

> > > > discard / nodiscard
> > 
> > Maybe, but probably requires too much introspection in a fast path (we'd
> > have to add a check for the last owner of a deleted extent to see if it
> > had 'discard' set on some parent level).
> > 
> > On the other hand, I'm in favor of deprecating the whole discard option
> > and going with fstrim instead.  discard in its current form tends to
> > increase write wear rather than decrease it, especially on metadata-heavy
> > workloads.  discard is roughly equivalent to running fstrim thousands
> > of times a day, which is clearly bad for many (most?  all?) SSDs.
> > 
> > It might be possible to make the discard mount option's behavior more
> > sane (e.g. discard only full chunks, configurable minimum discard length,
> > discard only within data chunks, discard only once per hour, etc).
> > 
> Interesting, it might as well make sense to perhaps use the free space cache
> and a slow LRU mechanism e.g. "these chunks has not been in use for 64
> hours/days" or something similar.

That would add more writes, as the free space cache is an on-disk entity.
It might make sense to maintain a 'discard tree', which lists extents
that have been freed but not yet discarded or overwritten, to make fstrim
more efficient.  This wouldn't have to be very precise, just pointing to
general regions of the disk (maybe even entire block groups) so fstrim
doesn't issue discards to idle areas of the disk over and over.

Currently the discard extent list is stored in memory, so doing one
discard per T time units would use more memory.  This feature would be
like discard=async, but 1) it would hold on to the pinned extents for a
few hundred transactions instead of just one or two (subject to memory
availability), and 2) it would be able to reclaim space from the discard
list as free space, thus removing the need to issue a discard at all.

But that's really complicated, considering that a cron job that runs
fstrim once an hour can do the same thing without all the complexity.
On the other hand, I just ran fstrim on a test machine and it took
34 minutes, so maybe some complexity might be useful after all... :-O

> > > > compress / compress-force
> > > > datacow / nodatacow
> > > > datasum / nodatasum
> > 
> > Here's where I prefer the mount option over the more local attributes,
> > because I'd like filesystem-level sysadmin overrides for those.
> > i.e. disallow all users, even privileged ones, from being able to create
> > files that don't have csums or compression on a filesystem.
> > 
> Then how about a mount option that allow only root to do certain things?
> e.g. a security restriction.

No, I don't want root doing those things either.  Most of the applications
I want to bring to heel are already running as root.

Basically I want to say "every file on this filesystem shall be datacow
and datasum" and (short of altering the mount option) no other kind of
file can be created.

It might possibly make more sense to do this through tunefs--that way the
filesystem couldn't ever have nodatacow files (new kernels would refuse,
old kernels wouldn't be able to mount through a new feature flag).

> > > > user_subvol_rm_allowed
> > 
> > I'd like "user_subvol_create_disallowed" too.  Unprivileged users can
> > create subvols, and that breaks backups that rely on atomic btrfs
> > snapshots.  It could be a feature (it allows users to exclude parts of
> > their home directory from backups) but most users I've met who have
> > discovered this "feature" the hard way didn't enjoy it.
> > 
> > Historically I had other reasons to disallow subvol creates by
> > unprivileged users, but they are mostly removed in 4.18, now that 'rmdir'
> > works on an empty subvol.
> > 
> Again see above...

Here, unlike above, I was already asking precisely for subvol create to
be made root only.

That, or make snapshots recursive and atomic to avoid the accidental
user data loss/corruption case.


* Re: Why do we need these mount options?
  2021-01-15 15:29     ` David Sterba
@ 2021-01-16  1:47       ` waxhead
  0 siblings, 0 replies; 14+ messages in thread
From: waxhead @ 2021-01-16  1:47 UTC (permalink / raw)
  To: dsterba, linux-btrfs



David Sterba wrote:
> On Fri, Jan 15, 2021 at 01:02:12AM +0100, waxhead wrote:
>>> I don't think the per-subvolume storage options were ever tracked on
>>> wiki, the closest match is per-subvolume mount options that's still
>>> there
>>>
>>> https://btrfs.wiki.kernel.org/index.php/Project_ideas#Per-subvolume_mount_options
>>>
>> Well how about this from our friends archive.org ?
>> http://web.archive.org/web/20200117205248/https://btrfs.wiki.kernel.org/index.php/Main_Page
>>
>> Here it clearly states that object level mirroring and striping is
>> planned. Maybe I misinterpret this , but I understand this as (amongst
>> other things) configurable storage profiles per subvolume.
> 
> I see. The list on the main page is supposed to list features that we could
> promise to be implemented "soon". For all the ideas there's the specific
> project page, where it does not matter too much when they will be implemented; it's
> kind of a pool.
> 
> In the wiki edit that removed the object-level storage I also removed
> (https://btrfs.wiki.kernel.org/index.php?title=Main_Page&diff=prev&oldid=33190)
> 
> * Online filesystem check
> * Object-level mirroring and striping
> * In-band deduplication (happens during writes)
> * Hot data tracking and moving to faster devices (or provided on the generic VFS layer)
> 
> For each of the task there's nobody working on that, to my knowledge,
> though there was some interest and maybe RFC patches in the past.
> 
> The object-level storage idea/task can be added to the Project_ideas
> page, so it's not lost.
> 
Okeydok... good to know! :)


* Re: Why do we need these mount options?
  2021-01-16  0:42       ` Zygo Blaxell
@ 2021-01-16  1:57         ` waxhead
  2021-01-16  3:51           ` Zygo Blaxell
  0 siblings, 1 reply; 14+ messages in thread
From: waxhead @ 2021-01-16  1:57 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: dsterba, linux-btrfs



Zygo Blaxell wrote:
> On Fri, Jan 15, 2021 at 10:32:39AM +0100, waxhead wrote:
>> Zygo Blaxell wrote:
>>>
>>>>> commit
>>>>> space_cache / nospace_cache
>>>>> ssd / ssd_spread / nossd / nossd_spread
>>>
>>> How could those be anything other than filesystem-wide options?
>>>
>>
>> Well being me, I tend to live in a fantasy world where BTRFS have complete
>> world domination and has become the VFS layer.
>> As I have nagged about before on this list - I really think that the only
>> sensible way forward for BTRFS (or dare I say BTRFS2) would be to make it
>> possible to assign "storage device groups" where you can make certain btrfs
>> device ids belong to group a,b,c, etc...
>>
>> And with that it would be possible to assign a weight to subvolumes so that
>> they would be preferred to be stored on group a (SSD's perhaps), while other
>> subvolumes would be stored mostly or exclusively on HDD's, Fast HDD's,
>> Archival HDD's etc... So maybe a bit over enthusiastic in thinking perhaps ,
>> but hopefully you see now why I think it is right that this is not
>> filesystem-wide, but subvolume-based properties.
> 
> Sure, that's all wonderful, but it has nothing to do with any of those
> mount options.  ;)
> 
> commit sets a timer that forces a filesystem-wide sync() every now
> and then.  space_cache picks one of the allocator implementations, also
> for the entire filesystem.  ssd and related options affect the behavior
> of the metadata allocator and superblocks.
> 
Ok, I understand the space_cache, but commit (the way I understand it) 
could be useful to keep stuff in memory for longer for certain subvolumes.
The ssd options could also be useful per subvolume *if* - and that is a 
big if - BTRFS sometime in the far future allows for storage device 
groups. However, when that happens maybe everything is solid state, who 
knows ;)

>>>>> discard / nodiscard
>>>
>>> Maybe, but probably requires too much introspection in a fast path (we'd
>>> have to add a check for the last owner of a deleted extent to see if it
>>> had 'discard' set on some parent level).
>>>
>>> On the other hand, I'm in favor of deprecating the whole discard option
>>> and going with fstrim instead.  discard in its current form tends to
>>> increase write wear rather than decrease it, especially on metadata-heavy
>>> workloads.  discard is roughly equivalent to running fstrim thousands
>>> of times a day, which is clearly bad for many (most?  all?) SSDs.
>>>
>>> It might be possible to make the discard mount option's behavior more
>>> sane (e.g. discard only full chunks, configurable minimum discard length,
>>> discard only within data chunks, discard only once per hour, etc).
>>>
>> Interesting, it might as well make sense to perhaps use the free space cache
>> and a slow LRU mechanism e.g. "these chunks has not been in use for 64
>> hours/days" or something similar.
> 
> That would add more writes, as the free space cache is an on-disk entity.
> It might make sense to maintain a 'discard tree', which lists extents
> that have been freed but not yet discarded or overwritten, to make fstrim
> more efficient.  This wouldn't have to be very precise, just pointing to
> general regions of the disk (maybe even entire block groups) so fstrim
> doesn't issue discards to idle areas of the disk over and over.
> 
> Currently the discard extent list is stored in memory, so doing one
> discard per T time units would use more memory.  This feature would be
> like discard=async, but 1) it would hold on to the pinned extents for a
> few hundred transactions instead of just one or two (subject to memory
> availability), and 2) it would be able to reclaim space from the discard
> list as free space, thus removing the need to issue a discard at all.
> 
> But that's really complicated, considering that a cron job that runs
> fstrim once an hour can do the same thing without all the complexity.
> On the other hand, I just ran fstrim on a test machine and it took
> 34 minutes, so maybe some complexity might be useful after all... :-O
> 
Thanks for the education. Inspired by this I ran fstrim on my main 
machine with 7 older and smaller SSDs. I had not run fstrim on it for 
maybe a year, so it took 15-20 minutes. But I see your point. 
Minimizing writes to SSDs is a good thing, but if I am not mistaken, 
SMR disks can benefit from discard as well. I don't know the details, 
but maybe it would be more beneficial on spinning disks than on SSD 
storage?

>>>>> compress / compress-force
>>>>> datacow / nodatacow
>>>>> datasum / nodatasum
>>>
>>> Here's where I prefer the mount option over the more local attributes,
>>> because I'd like filesystem-level sysadmin overrides for those.
>>> i.e. disallow all users, even privileged ones, from being able to create
>>> files that don't have csums or compression on a filesystem.
>>>
>> Then how about a mount option that allow only root to do certain things?
>> e.g. a security restriction.
> 
> No, I don't want root doing those things either.  Most of the applications
> I want to bring to heel are already running as root.
> 
> Basically I want to say "every file on this filesystem shall be datacow
> and datasum" and (short of altering the mount option) no other kind of
> file can be created.
> 
> It might possibly make more sense to do this through tunefs--that way the
> filesystem couldn't ever have nodatacow files (new kernels would refuse,
> old kernels wouldn't be able to mount through a new feature flag).
> 
Allrighty point taken.

>>>>> user_subvol_rm_allowed
>>>
>>> I'd like "user_subvol_create_disallowed" too.  Unprivileged users can
>>> create subvols, and that breaks backups that rely on atomic btrfs
>>> snapshots.  It could be a feature (it allows users to exclude parts of
>>> their home directory from backups) but most users I've met who have
>>> discovered this "feature" the hard way didn't enjoy it.
>>>
>>> Historically I had other reasons to disallow subvol creates by
>>> unprivileged users, but they are mostly removed in 4.18, now that 'rmdir'
>>> works on an empty subvol.
>>>
>> Again see above...
> 
> Here, unlike above, I was already asking precisely for subvol create to
> be made root only.
> 
> That, or make snapshots recursive and atomic to avoid the accidental
> user data loss/corruption case.
> 
Allrighty again! :)


* Re: Why do we need these mount options?
  2021-01-16  1:57         ` waxhead
@ 2021-01-16  3:51           ` Zygo Blaxell
  0 siblings, 0 replies; 14+ messages in thread
From: Zygo Blaxell @ 2021-01-16  3:51 UTC (permalink / raw)
  To: waxhead; +Cc: dsterba, linux-btrfs

On Sat, Jan 16, 2021 at 02:57:05AM +0100, waxhead wrote:
> 
> 
> Zygo Blaxell wrote:
> > On Fri, Jan 15, 2021 at 10:32:39AM +0100, waxhead wrote:
> > > Zygo Blaxell wrote:
> > > > 
> > > > > > commit
> > > > > > space_cache / nospace_cache
> > > > > > ssd / ssd_spread / nossd / nossd_spread
> > > > 
> > > > How could those be anything other than filesystem-wide options?
> > > > 
> > > 
> > > Well being me, I tend to live in a fantasy world where BTRFS have complete
> > > world domination and has become the VFS layer.
> > > As I have nagged about before on this list - I really think that the only
> > > sensible way forward for BTRFS (or dare I say BTRFS2) would be to make it
> > > possible to assign "storage device groups" where you can make certain btrfs
> > > device ids belong to group a,b,c, etc...
> > > 
> > > And with that it would be possible to assign a weight to subvolumes so that
> > > they would be preferred to be stored on group a (SSD's perhaps), while other
> > > subvolumes would be stored mostly or exclusively on HDD's, Fast HDD's,
> > > Archival HDD's etc... So maybe a bit over enthusiastic in thinking perhaps ,
> > > but hopefully you see now why I think it is right that this is not
> > > filesystem-wide, but subvolume-based properties.
> > 
> > Sure, that's all wonderful, but it has nothing to do with any of those
> > mount options.  ;)
> > 
> > commit sets a timer that forces a filesystem-wide sync() every now
> > and then.  space_cache picks one of the allocator implementations, also
> > for the entire filesystem.  ssd and related options affect the behavior
> > of the metadata allocator and superblocks.
> > 
> Ok I understand the space_cache, but commit (the way I understand it) could
> be useful to keep stuff in memory for longer for certain subvolume.

That's probably something to implement mostly above the filesystem level,
as it's VFS that controls most of writeback, and maybe memcg could be
adapted to have per-cgroup writeback settings.

The btrfs transaction commit updates all of the filesystem trees at once
(they are all children of the root subtree's root node).  Changing that
would be a significant restructuring.

> ssd options could also be useful pr. subvolume *if* and that is a big if
> BTRFS sometime in the far future allows for storage device groups. However
> when that happens maybe everything is solid state , who knows ;)

First the ssd options have to do something that could be per-subvolume.
Right now they do almost nothing, and the little they do is per-device.

There are various proposals for HSM, metadata-on-SSD, hot data tracking
floating around.  If they ever arrive at btrfs, they definitely won't be
tacked onto the ssd mount option.

> > > > > > discard / nodiscard
> > > > 
> > > > Maybe, but probably requires too much introspection in a fast path (we'd
> > > > have to add a check for the last owner of a deleted extent to see if it
> > > > had 'discard' set on some parent level).
> > > > 
> > > > On the other hand, I'm in favor of deprecating the whole discard option
> > > > and going with fstrim instead.  discard in its current form tends to
> > > > increase write wear rather than decrease it, especially on metadata-heavy
> > > > workloads.  discard is roughly equivalent to running fstrim thousands
> > > > of times a day, which is clearly bad for many (most?  all?) SSDs.
> > > > 
> > > > It might be possible to make the discard mount option's behavior more
> > > > sane (e.g. discard only full chunks, configurable minimum discard length,
> > > > discard only within data chunks, discard only once per hour, etc).
> > > > 
> > > Interesting, it might as well make sense to perhaps use the free space cache
> > > and a slow LRU mechanism e.g. "these chunks has not been in use for 64
> > > hours/days" or something similar.
> > 
> > That would add more writes, as the free space cache is an on-disk entity.
> > It might make sense to maintain a 'discard tree', which lists extents
> > that have been freed but not yet discarded or overwritten, to make fstrim
> > more efficient.  This wouldn't have to be very precise, just pointing to
> > general regions of the disk (maybe even entire block groups) so fstrim
> > doesn't issue discards to idle areas of the disk over and over.
> > 
> > Currently the discard extent list is stored in memory, so doing one
> > discard per T time units would use more memory.  This feature would be
> > like discard=async, but 1) it would hold on to the pinned extents for a
> > few hundred transactions instead of just one or two (subject to memory
> > availability), and 2) it would be able to reclaim space from the discard
> > list as free space, thus removing the need to issue a discard at all.
> > 
> > But that's really complicated, considering that a cron job that runs
> > fstrim once an hour can do the same thing without all the complexity.
> > On the other hand, I just ran fstrim on a test machine and it took
> > 34 minutes, so maybe some complexity might be useful after all... :-O
> > 
> Thanks for the education. Inspired by this I ran fstrim on my main machine
> with 7 older and smaller SSD's. Have not run fstrim on it or maybe a year so
> it took 15-20 minutes to do. But I see your point. Minimizing writes to
> SSD's is a good thing, but if I am not mistaking SMR disks can benefit from
> discard as well, I don't know the details but maybe it would be more
> beneficial on real disks than SSD storage?

Overwriting data at a given LBA implies discarding the data currently in
the LBA, regardless of underlying storage.  discard is useful on SSD,
SMR, and also caching/thin LV layers if the LBAs are not going to be
overwritten for a long time (thin LVs take less space, caches have more
room for other data).  There is no need to send a separate discard command
if the LBAs in question are not part of a file at the time, and there is
a high probability of overwriting the same LBA with new data very soon.

There are multiple variants of the discard operation in devices.
One variant leaves the contents of the discarded blocks undefined.
That's a nice feature, since the device is free to ignore the discard
request if it might improve performance (i.e. there is no need to ensure
the discard has immediate persistent effect, so it is acceptable for
the drive to not flush discards to persistent storage right away, and
merge the discard into some later write command to save one write cycle).

Most of the SSDs I've tested do a different variant of discard, where the
discarded blocks are guaranteed to return zeros.  That forces the drive
to do at least one flash write for each discard.  On such drives it's
important to not send redundant discards from the filesystem, as those
discards are necessarily wasting write cycles.

On SMR drives, redundant discards are even worse than on SSD.  SMR
drives have address translation layers like SSD, but they are stored
on spinning disks.  It will take milliseconds to move heads to record
the discard in the translation layer, only to spend more milliseconds
to write the replacement data in the same location immediately after.
The disk platter has higher endurance than SSD, but can be orders of
magnitude slower if there are too many translation layer updates.

> > > > > > compress / compress-force
> > > > > > datacow / nodatacow
> > > > > > datasum / nodatasum
> > > > 
> > > > Here's where I prefer the mount option over the more local attributes,
> > > > because I'd like filesystem-level sysadmin overrides for those.
> > > > i.e. disallow all users, even privileged ones, from being able to create
> > > > files that don't have csums or compression on a filesystem.
> > > > 
> > > Then how about a mount option that allow only root to do certain things?
> > > e.g. a security restriction.
> > 
> > No, I don't want root doing those things either.  Most of the applications
> > I want to bring to heel are already running as root.
> > 
> > Basically I want to say "every file on this filesystem shall be datacow
> > and datasum" and (short of altering the mount option) no other kind of
> > file can be created.
> > 
> > It might possibly make more sense to do this through tunefs--that way the
> > filesystem couldn't ever have nodatacow files (new kernels would refuse,
> > old kernels wouldn't be able to mount through a new feature flag).
> > 
> Allrighty point taken.
> 
> > > > > > user_subvol_rm_allowed
> > > > 
> > > > I'd like "user_subvol_create_disallowed" too.  Unprivileged users can
> > > > create subvols, and that breaks backups that rely on atomic btrfs
> > > > snapshots.  It could be a feature (it allows users to exclude parts of
> > > > their home directory from backups) but most users I've met who have
> > > > discovered this "feature" the hard way didn't enjoy it.
> > > > 
> > > > Historically I had other reasons to disallow subvol creates by
> > > > unprivileged users, but they are mostly removed in 4.18, now that 'rmdir'
> > > > works on an empty subvol.
> > > > 
> > > Again see above...
> > 
> > Here, unlike above, I was already asking precisely for subvol create to
> > be made root only.
> > 
> > That, or make snapshots recursive and atomic to avoid the accidental
> > user data loss/corruption case.
> > 
> Allrighty again! :)


* Re: Why do we need these mount options?
  2021-01-15  3:54   ` Zygo Blaxell
  2021-01-15  9:32     ` waxhead
@ 2021-01-16  7:39     ` Andrei Borzenkov
  2021-01-16 15:19       ` Adam Borowski
  1 sibling, 1 reply; 14+ messages in thread
From: Andrei Borzenkov @ 2021-01-16  7:39 UTC (permalink / raw)
  To: Zygo Blaxell, dsterba, waxhead, linux-btrfs

15.01.2021 06:54, Zygo Blaxell wrote:
> 
> On the other hand, I'm in favor of deprecating the whole discard option
> and going with fstrim instead.  discard in its current form tends to
> increase write wear rather than decrease it, especially on metadata-heavy
> workloads.  discard is roughly equivalent to running fstrim thousands
> of times a day, which is clearly bad for many (most?  all?) SSDs.
> 

My (probably naive) understanding so far was that trim on an SSD marks
areas as "unused", which means the SSD needs to copy less residual data
from an erase block when reusing it - assuming the TRIM unit is
(significantly) smaller than the erase block.

I would appreciate it if you could elaborate on how trim results in more
writes on an SSD.

Or do you mean more writes from btrfs while performing discard?


* Re: Why do we need these mount options?
  2021-01-16  7:39     ` Andrei Borzenkov
@ 2021-01-16 15:19       ` Adam Borowski
  2021-01-16 17:21         ` Andrei Borzenkov
  0 siblings, 1 reply; 14+ messages in thread
From: Adam Borowski @ 2021-01-16 15:19 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: Zygo Blaxell, dsterba, waxhead, linux-btrfs

On Sat, Jan 16, 2021 at 10:39:51AM +0300, Andrei Borzenkov wrote:
> 15.01.2021 06:54, Zygo Blaxell wrote:
> > On the other hand, I'm in favor of deprecating the whole discard option
> > and going with fstrim instead.  discard in its current form tends to
> > increase write wear rather than decrease it, especially on metadata-heavy
> > workloads.  discard is roughly equivalent to running fstrim thousands
> > of times a day, which is clearly bad for many (most?  all?) SSDs.
> 
> My (probably naive) understanding so far was that trim on SSD marks
> areas as "unused" which means SSD need to copy less residual data from
> erase block when reusing it. Assuming TRIM unit is (significantly)
> smaller than erase block.
> 
> I would appreciate if you elaborate how trim results in more write on SSD?

The areas are not only marked as unused, but also zeroed.  To keep the
zeroing semantic, every discard must be persisted, thus requiring a write
to the SSD's metadata (not btrfs metadata) area.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ .--[ Makefile ]
⣾⠁⢠⠒⠀⣿⡁ # beware of races
⢿⡄⠘⠷⠚⠋⠀ all: pillage burn
⠈⠳⣄⠀⠀⠀⠀ `----


* Re: Why do we need these mount options?
  2021-01-16 15:19       ` Adam Borowski
@ 2021-01-16 17:21         ` Andrei Borzenkov
  2021-01-16 20:01           ` Zygo Blaxell
  0 siblings, 1 reply; 14+ messages in thread
From: Andrei Borzenkov @ 2021-01-16 17:21 UTC (permalink / raw)
  To: Adam Borowski; +Cc: Zygo Blaxell, dsterba, waxhead, linux-btrfs

16.01.2021 18:19, Adam Borowski wrote:
> On Sat, Jan 16, 2021 at 10:39:51AM +0300, Andrei Borzenkov wrote:
>> 15.01.2021 06:54, Zygo Blaxell wrote:
>>> On the other hand, I'm in favor of deprecating the whole discard option
>>> and going with fstrim instead.  discard in its current form tends to
>>> increase write wear rather than decrease it, especially on metadata-heavy
>>> workloads.  discard is roughly equivalent to running fstrim thousands
>>> of times a day, which is clearly bad for many (most?  all?) SSDs.
>>
>> My (probably naive) understanding so far was that trim on SSD marks
>> areas as "unused" which means SSD need to copy less residual data from
>> erase block when reusing it. Assuming TRIM unit is (significantly)
>> smaller than erase block.
>>
>> I would appreciate if you elaborate how trim results in more write on SSD?
> 
> The areas are not only marked as unused, but also zeroed.  To keep the
> zeroing semantic, every discard must be persisted, thus requiring a write
> to the SSD's metadata (not btrfs metadata) area.
> 

There is no requirement that TRIM do that. If a device sets the RZAT
(read zero after trim) supported bit, it should return zeroes for a
trimmed range, but there is no need to physically zero anything - simply
return zeroes for areas marked as unallocated. A discard must be
persisted in the allocation table, but then every write must be persisted
in the allocation table anyway.

Moreover, to actually zero on TRIM, either the trim request must be
issued for the full erase block or the device must perform garbage
collection.

Do you have any links that show that discards increase write load on
physical media? I am really curious.


* Re: Why do we need these mount options?
  2021-01-16 17:21         ` Andrei Borzenkov
@ 2021-01-16 20:01           ` Zygo Blaxell
  0 siblings, 0 replies; 14+ messages in thread
From: Zygo Blaxell @ 2021-01-16 20:01 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: Adam Borowski, dsterba, waxhead, linux-btrfs

On Sat, Jan 16, 2021 at 08:21:16PM +0300, Andrei Borzenkov wrote:
> 16.01.2021 18:19, Adam Borowski wrote:
> > On Sat, Jan 16, 2021 at 10:39:51AM +0300, Andrei Borzenkov wrote:
> >> 15.01.2021 06:54, Zygo Blaxell wrote:
> >>> On the other hand, I'm in favor of deprecating the whole discard option
> >>> and going with fstrim instead.  discard in its current form tends to
> >>> increase write wear rather than decrease it, especially on metadata-heavy
> >>> workloads.  discard is roughly equivalent to running fstrim thousands
> >>> of times a day, which is clearly bad for many (most?  all?) SSDs.
> >>
> >> My (probably naive) understanding so far was that trim on SSD marks
> >> areas as "unused" which means SSD need to copy less residual data from
> >> erase block when reusing it. Assuming TRIM unit is (significantly)
> >> smaller than erase block.
> >>
> >> I would appreciate if you elaborate how trim results in more write on SSD?
> > 
> > The areas are not only marked as unused, but also zeroed.  To keep the
> > zeroing semantic, every discard must be persisted, thus requiring a write
> > to the SSD's metadata (not btrfs metadata) area.
> > 
> 
> There is no requirement that TRIM did it. If device sets RZAT SUPPORTED
> bit, it should return zeroes for trimmed range, but there is no need to
> physically zero anything - simply return zeroes for areas marked as
> unallocated. Discard must be persisted in allocation table, but then
> every write must be persisted in allocation table anyway.

That is exactly the problem--the persistence is a write that counts
against total drive wear.  That is why TRIM variants that leave
the contents of the discarded LBAs undefined are better than those
which define the contents as zero.

The effect seems to be the equivalent of a small write, i.e. a 16K
write might be the same cost as any length of contiguous discard.
So it's OK to discard block-group-sized regions, but not OK to issue
one discard for every metadata free page hole.  Different drives have
different ratios between these costs, so parity might occur at 4K or
256K depending on the drive.

AIUI there is a minimum discard length filter implemented in btrfs
already, so maybe it just needs tuning?

> Moreover, to actually zero on TRIM either trim request must be issued
> for the full erase block or device must perform garbage collection.
> 
> Do you have any links that show that discards increase write load on
> physical media? I am really curious.

I have no links, it's a directly observed result.

It's fairly straightforward to replicate:  Set up a machine to do git
checkouts of each Linux kernel tag in random order, in a loop (maybe
multiple instances of this if needed to get the SSD device IO saturated).
While that happens, watch the percentage used endurance indicator reported
on the drives (smartctl -x).  Wait for the indicator to increment
twice, and measure the time between the first and second increment.
Use a low-cost consumer or OEM SSD so you get results in less than a
few hundred hours.  Then mount -o discard=async and wait for two more
increments.  Assuming the workload produces constant amounts of IO over
time, and the percentage used endurance indicator variable from SMART is
not a complete lie, the time between increments should roughly indicate
the wear rates of the different workloads.
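
A rough sketch of that loop (the device, paths, and sampling interval
are placeholders):

  cd /mnt/test/linux
  while true; do
      for tag in $(git tag | shuf); do
          git checkout -q "$tag"
      done
  done &
  # meanwhile, sample the wear indicator every ten minutes
  watch -n 600 'smartctl -x /dev/sdX | grep -i -A1 endurance'

Then compare how long the indicator takes to tick twice with and
without -o discard=async.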

In the field, we discovered this on CI builder workloads (lots of
tiny files created, destroyed, and created again in rapid succession).
They get almost double the SSD wear rate with discard on vs. discard off.
We have monitoring on the p-u-e-i variable, and use it to project the date
when 100% endurance will be reached.  If that date lands within the date
range when we want to be using the SSD, we get an alert.  When discard
is accidentally enabled on a CI server due to a configuration failure,
we get an alert about a week later, as it shortens our drives' projected
lifespan from more than 6 years to less than 4.

Other workloads are less sensitive to this.  If the workload has fewer
metadata updates, bigger files, and sequential writes, then discard
doesn't have a negative effect--though to be fair, it doesn't seem to
have a positive effect either, at least not by this measurement method.

