* Why do we need these mount options?
@ 2021-01-14  2:12 waxhead
  2021-01-14 16:37 ` David Sterba
  0 siblings, 1 reply; 14+ messages in thread
From: waxhead @ 2021-01-14  2:12 UTC (permalink / raw)
  To: linux-btrfs

Howdy,

I was looking through the mount options and, being a madman with strong
opinions, I can't help thinking that a lot of them do not really belong as
mount options at all, but should rather be properties set on the subvolume -
for example the toplevel subvolume. And any options set on a child subvolume
should override the parent subvolume, the way I see it.

From a quick look, I don't see why these should be mount options at all:

autodefrag / noautodefrag
commit
compress / compress-force
datacow / nodatacow
datasum / nodatasum
discard / nodiscard
inode_cache / noinode_cache
space_cache / nospace_cache
ssd / ssd_spread / nossd / nossd_spread
user_subvol_rm_allowed

Stuff like compress and nodatacow can be set with chattr, so there are, as
far as I am aware, three methods of setting compression for example: by
mount options in fstab, by chattr, or by btrfs property set.

I think it would be more consistent to have one interface for adjusting
behavior.

As I asked before, the future plan to have different storage profiles on
subvolumes seems to have been sneakily(?) removed from the wiki - if that
is indeed a dropped goal I can see why it makes sense to keep the mount
options; if not, I think the mount options should go in favor of btrfs
property set.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Why do we need these mount options?
  2021-01-14  2:12 Why do we need these mount options? waxhead
@ 2021-01-14 16:37 ` David Sterba
  2021-01-15  0:02   ` waxhead
  2021-01-15  3:54   ` Zygo Blaxell
  0 siblings, 2 replies; 14+ messages in thread
From: David Sterba @ 2021-01-14 16:37 UTC (permalink / raw)
  To: waxhead; +Cc: linux-btrfs

Hi,

On Thu, Jan 14, 2021 at 03:12:26AM +0100, waxhead wrote:
> I was looking through the mount options and, being a madman with strong
> opinions, I can't help thinking that a lot of them do not really belong
> as mount options at all, but should rather be properties set on the
> subvolume - for example the toplevel subvolume.

I agree that some of them should not be there, but mount options still
have their own usecase. They can be set from the outside and are supposed
to affect the whole filesystem for the mount's lifetime.

However, they've been used as default values for some operations, which is
something that points more to what you suggest. And the fact that they're
not persistent and need to be stored in /etc/fstab also weighs in favor of
storage inside the fs.

> And any options set on a child subvolume should override the parent
> subvolume, the way I see it.

Yeah, that's one of the ways to do it and I see it that way as well. A
property set closer to the object takes precedence, roughly

  mount < subvolume < directory < file

but last time we had a discussion about that, the other opinion was that
mount options beat everything, perhaps because they can be set from the
outside and forced to override whatever is on the filesystem.

> From a quick look, I don't see why these should be mount options at all:
>
> autodefrag / noautodefrag
> commit
> compress / compress-force
> datacow / nodatacow
> datasum / nodatasum
> discard / nodiscard
> inode_cache / noinode_cache
> space_cache / nospace_cache
> ssd / ssd_spread / nossd / nossd_spread
> user_subvol_rm_allowed

So there are historical reasons and interface limitations that led to the
current state and multiple ways to do things.

Per-inode attributes were originally a private ioctl of ext2 that other
filesystems adopted for feature parity, and as the interface was bit-based,
no additional values could be set (eg. for compression): limited number of
bits, no precedence, inter-flag dependencies.

> Stuff like compress and nodatacow can be set with chattr, so there are,
> as far as I am aware, three methods of setting compression for example:
> by mount options in fstab, by chattr, or by btrfs property set.
>
> I think it would be more consistent to have one interface for adjusting
> behavior.

I agree with that, and there's a proposal to unify that into properties as
the interface once and for all, accessible through the extended attributes.
But there are many more ways to do that wrong, so it hasn't been
implemented so far.

A suggestion for an inode flag here and there comes up from time to time,
fixing one problem each time. Repeating that would lead to a mess that can
be demonstrated on the existing mount options, so we've been there and need
to do it the right way.

> As I asked before, the future plan to have different storage profiles on
> subvolumes seems to have been sneakily(?) removed from the wiki

I don't think the per-subvolume storage options were ever tracked on the
wiki; the closest match is per-subvolume mount options, which is still
there:

https://btrfs.wiki.kernel.org/index.php/Project_ideas#Per-subvolume_mount_options

> - if that is indeed a dropped goal I can see why it makes sense to
> keep the mount options; if not, I think the mount options should go in
> favor of btrfs property set.
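The two precedence models discussed above - "the property set closest to the
object wins" versus "mount options beat everything" - can be sketched with a
small resolver. The level names and example values are illustrative only;
this is not how btrfs actually stores or resolves properties:

```python
# Sketch of "nearest property wins" resolution, as discussed above.
# Precedence, weakest to strongest: mount < subvolume < directory < file.
LEVELS = ["mount", "subvolume", "directory", "file"]

def resolve(prop, settings):
    """Return the effective value of `prop`: the value set at the level
    closest to the object, falling back toward the mount options."""
    for level in reversed(LEVELS):            # check file first, mount last
        value = settings.get(level, {}).get(prop)
        if value is not None:
            return value
    return None                               # unset everywhere

def resolve_mount_wins(prop, settings):
    """The competing model: a mount option, if set, beats everything."""
    forced = settings.get("mount", {}).get(prop)
    return forced if forced is not None else resolve(prop, settings)

settings = {
    "mount":     {"compression": "zstd"},
    "subvolume": {"compression": "none"},
    "file":      {"datacow": False},
}

print(resolve("compression", settings))             # -> none  (subvolume wins)
print(resolve_mount_wins("compression", settings))  # -> zstd  (mount wins)
print(resolve("datacow", settings))                 # -> False (set on the file)
```

The difference between the two functions is exactly the disagreement in the
thread: whether the mount level sits at the bottom or the top of the chain.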
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Why do we need these mount options?
  2021-01-14 16:37 ` David Sterba
@ 2021-01-15  0:02   ` waxhead
  2021-01-15 15:29     ` David Sterba
  2021-01-15  3:54   ` Zygo Blaxell
  1 sibling, 1 reply; 14+ messages in thread
From: waxhead @ 2021-01-15  0:02 UTC (permalink / raw)
  To: dsterba, linux-btrfs

David Sterba wrote:
> Hi,
>
> On Thu, Jan 14, 2021 at 03:12:26AM +0100, waxhead wrote:
>> I was looking through the mount options and, being a madman with strong
>> opinions, I can't help thinking that a lot of them do not really belong
>> as mount options at all, but should rather be properties set on the
>> subvolume - for example the toplevel subvolume.
>
> I agree that some of them should not be there, but mount options still
> have their own usecase. They can be set from the outside and are supposed
> to affect the whole filesystem for the mount's lifetime.

Yes, some of them. But not all; the ones I list, for example, can
perfectly well be set on the toplevel subvolume.

> However, they've been used as default values for some operations, which
> is something that points more to what you suggest. And the fact that
> they're not persistent and need to be stored in /etc/fstab also weighs
> in favor of storage inside the fs.
>
>> And any options set on a child subvolume should override the parent
>> subvolume, the way I see it.
>
> Yeah, that's one of the ways to do it and I see it that way as well. A
> property set closer to the object takes precedence, roughly
>
>   mount < subvolume < directory < file
>
> but last time we had a discussion about that, the other opinion was that
> mount options beat everything, perhaps because they can be set from the
> outside and forced to override whatever is on the filesystem.

Well, I agree with that. Mount options should beat everything, and
precisely because of that I think some mount options should be deprecated
and instead be set per subvolume.

>> From a quick look, I don't see why these should be mount options at all:
>>
>> autodefrag / noautodefrag
>> commit
>> compress / compress-force
>> datacow / nodatacow
>> datasum / nodatasum
>> discard / nodiscard
>> inode_cache / noinode_cache
>> space_cache / nospace_cache
>> ssd / ssd_spread / nossd / nossd_spread
>> user_subvol_rm_allowed
>
> So there are historical reasons and interface limitations that led to the
> current state and multiple ways to do things.
>
> Per-inode attributes were originally a private ioctl of ext2 that other
> filesystems adopted for feature parity, and as the interface was
> bit-based, no additional values could be set (eg. for compression):
> limited number of bits, no precedence, inter-flag dependencies.

Ok thanks, I was not aware of that.

>> Stuff like compress and nodatacow can be set with chattr, so there are,
>> as far as I am aware, three methods of setting compression for example:
>> by mount options in fstab, by chattr, or by btrfs property set.
>>
>> I think it would be more consistent to have one interface for adjusting
>> behavior.
>
> I agree with that, and there's a proposal to unify that into properties
> as the interface once and for all, accessible through the extended
> attributes. But there are many more ways to do that wrong, so it hasn't
> been implemented so far.

Good to know. And by the way, another nugget of entertainment is that with
btrfs property set the parameters come after the object. Usually
command->params->target is IMHO the better way to go. It seems a bit
backwards.

> A suggestion for an inode flag here and there comes up from time to time,
> fixing one problem each time. Repeating that would lead to a mess that
> can be demonstrated on the existing mount options, so we've been there
> and need to do it the right way.
>
>> As I asked before, the future plan to have different storage profiles on
>> subvolumes seems to have been sneakily(?) removed from the wiki
>
> I don't think the per-subvolume storage options were ever tracked on the
> wiki; the closest match is per-subvolume mount options, which is still
> there:
>
> https://btrfs.wiki.kernel.org/index.php/Project_ideas#Per-subvolume_mount_options

Well, how about this from our friends at archive.org?
http://web.archive.org/web/20200117205248/https://btrfs.wiki.kernel.org/index.php/Main_Page

Here it clearly states that object-level mirroring and striping is planned.
Maybe I misinterpret this, but I understand it as (amongst other things)
configurable storage profiles per subvolume.

>> - if that is indeed a dropped goal I can see why it makes sense to
>> keep the mount options; if not, I think the mount options should go in
>> favor of btrfs property set.

^ permalink raw reply	[flat|nested] 14+ messages in thread
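The limitation of the bit-based per-inode attribute interface discussed
above can be made concrete. The flag values below are the real `FS_*_FL`
constants from include/uapi/linux/fs.h (the ext2-derived `chattr` interface);
the demonstration itself is just bit arithmetic:

```python
# The ext2-style attribute interface is a single bitmask word
# (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS), so each attribute is one bit.
# Flag values match include/uapi/linux/fs.h.
FS_COMPR_FL  = 0x00000004   # "compress this file" (chattr +c)
FS_NOCOMP_FL = 0x00000400   # "don't compress this file"
FS_NOCOW_FL  = 0x00800000   # "no copy-on-write" (chattr +C)

flags = 0
flags |= FS_COMPR_FL        # one bit: on or off, nothing in between

# A boolean bit cannot carry a value: there is no room in the word for
# "compress with zstd at level 3" - that needs a property-style interface.

# And inter-flag dependencies must be policed by each filesystem, since
# the interface happily encodes contradictions:
flags |= FS_NOCOMP_FL       # both "compress" and "don't compress" set

print(hex(flags))           # -> 0x404
```

This is exactly the "limited number of bits, no precedence, inter-flag
dependencies" problem: the interface can only answer yes/no questions, and
it cannot stop a caller from setting mutually exclusive bits.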
* Re: Why do we need these mount options?
  2021-01-15  0:02 ` waxhead
@ 2021-01-15 15:29   ` David Sterba
  2021-01-16  1:47     ` waxhead
  0 siblings, 1 reply; 14+ messages in thread
From: David Sterba @ 2021-01-15 15:29 UTC (permalink / raw)
  To: waxhead; +Cc: dsterba, linux-btrfs

On Fri, Jan 15, 2021 at 01:02:12AM +0100, waxhead wrote:
> > I don't think the per-subvolume storage options were ever tracked on
> > the wiki; the closest match is per-subvolume mount options, which is
> > still there:
> >
> > https://btrfs.wiki.kernel.org/index.php/Project_ideas#Per-subvolume_mount_options
>
> Well, how about this from our friends at archive.org?
> http://web.archive.org/web/20200117205248/https://btrfs.wiki.kernel.org/index.php/Main_Page
>
> Here it clearly states that object-level mirroring and striping is
> planned. Maybe I misinterpret this, but I understand it as (amongst
> other things) configurable storage profiles per subvolume.

I see. The list on the main page is supposed to list features that we
could promise to be implemented "soon". For all the ideas there's the
specific project page, where it does not matter too much when it will be
implemented; it's kind of a pool.

In the wiki edit that removed the object-level storage I also removed
(https://btrfs.wiki.kernel.org/index.php?title=Main_Page&diff=prev&oldid=33190)

* Online filesystem check
* Object-level mirroring and striping
* In-band deduplication (happens during writes)
* Hot data tracking and moving to faster devices (or provided on the generic VFS layer)

For each of these tasks there's nobody working on it, to my knowledge,
though there was some interest and maybe RFC patches in the past.

The object-level storage idea/task can be added to the Project_ideas page,
so it's not lost.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Why do we need these mount options?
  2021-01-15 15:29 ` David Sterba
@ 2021-01-16  1:47   ` waxhead
  0 siblings, 0 replies; 14+ messages in thread
From: waxhead @ 2021-01-16  1:47 UTC (permalink / raw)
  To: dsterba, linux-btrfs

David Sterba wrote:
> On Fri, Jan 15, 2021 at 01:02:12AM +0100, waxhead wrote:
>>> I don't think the per-subvolume storage options were ever tracked on
>>> the wiki; the closest match is per-subvolume mount options, which is
>>> still there:
>>>
>>> https://btrfs.wiki.kernel.org/index.php/Project_ideas#Per-subvolume_mount_options
>>>
>> Well, how about this from our friends at archive.org?
>> http://web.archive.org/web/20200117205248/https://btrfs.wiki.kernel.org/index.php/Main_Page
>>
>> Here it clearly states that object-level mirroring and striping is
>> planned. Maybe I misinterpret this, but I understand it as (amongst
>> other things) configurable storage profiles per subvolume.
>
> I see. The list on the main page is supposed to list features that we
> could promise to be implemented "soon". For all the ideas there's the
> specific project page, where it does not matter too much when it will be
> implemented; it's kind of a pool.
>
> In the wiki edit that removed the object-level storage I also removed
> (https://btrfs.wiki.kernel.org/index.php?title=Main_Page&diff=prev&oldid=33190)
>
> * Online filesystem check
> * Object-level mirroring and striping
> * In-band deduplication (happens during writes)
> * Hot data tracking and moving to faster devices (or provided on the generic VFS layer)
>
> For each of these tasks there's nobody working on it, to my knowledge,
> though there was some interest and maybe RFC patches in the past.
>
> The object-level storage idea/task can be added to the Project_ideas
> page, so it's not lost.

Okeydok... good to know! :)

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Why do we need these mount options?
  2021-01-14 16:37 ` David Sterba
  2021-01-15  0:02   ` waxhead
@ 2021-01-15  3:54   ` Zygo Blaxell
  2021-01-15  9:32     ` waxhead
  2021-01-16  7:39     ` Andrei Borzenkov
  1 sibling, 2 replies; 14+ messages in thread
From: Zygo Blaxell @ 2021-01-15  3:54 UTC (permalink / raw)
  To: dsterba, waxhead, linux-btrfs

On Thu, Jan 14, 2021 at 05:37:29PM +0100, David Sterba wrote:
> Hi,
>
> On Thu, Jan 14, 2021 at 03:12:26AM +0100, waxhead wrote:
> > I was looking through the mount options and, being a madman with strong
> > opinions, I can't help thinking that a lot of them do not really belong
> > as mount options at all, but should rather be properties set on the
> > subvolume - for example the toplevel subvolume.
>
> I agree that some of them should not be there, but mount options still
> have their own usecase. They can be set from the outside and are supposed
> to affect the whole filesystem for the mount's lifetime.
>
> However, they've been used as default values for some operations, which
> is something that points more to what you suggest. And the fact that
> they're not persistent and need to be stored in /etc/fstab also weighs
> in favor of storage inside the fs.
>
> > And any options set on a child subvolume should override the parent
> > subvolume, the way I see it.
>
> Yeah, that's one of the ways to do it and I see it that way as well. A
> property set closer to the object takes precedence, roughly
>
>   mount < subvolume < directory < file

Wearing my grumpy sysadmin hat, I have occasionally wanted the mount
options to override the subvolume and inode operations. Examples below.

> but last time we had a discussion about that, the other opinion was that
> mount options beat everything, perhaps because they can be set from the
> outside and forced to override whatever is on the filesystem.
>
> > From a quick look, I don't see why these should be mount options at all:
> >
> > autodefrag / noautodefrag

That makes sense as an inode property--you only want autodefrag on a few
files and they're usually easy to spot.

> > inode_cache / noinode_cache

That one is already gone as of v5.11-rc1.

> > commit
> > space_cache / nospace_cache
> > ssd / ssd_spread / nossd / nossd_spread

How could those be anything other than filesystem-wide options?

> > discard / nodiscard

Maybe, but probably requires too much introspection in a fast path (we'd
have to add a check for the last owner of a deleted extent to see if it
had 'discard' set on some parent level).

On the other hand, I'm in favor of deprecating the whole discard option
and going with fstrim instead. discard in its current form tends to
increase write wear rather than decrease it, especially on metadata-heavy
workloads. discard is roughly equivalent to running fstrim thousands of
times a day, which is clearly bad for many (most? all?) SSDs.

It might be possible to make the discard mount option's behavior more
sane (e.g. discard only full chunks, configurable minimum discard length,
discard only within data chunks, discard only once per hour, etc).

> > compress / compress-force
> > datacow / nodatacow
> > datasum / nodatasum

Here's where I prefer the mount option over the more local attributes,
because I'd like filesystem-level sysadmin overrides for those, i.e.
disallow all users, even privileged ones, from being able to create files
that don't have csums or compression on a filesystem.

It might be better to allow the per-inode/subvol properties to be set,
but have a mount option to override them. I don't want to deal with
legacy applications that might throw an error if they get an error return
from a nodatacow chattr, so "silently drop the chattr" is better than
"prevent chattr with error."

I also want to store backup / live-mountable copies of filesystems that
have these attributes set, but that's a lot more complicated to implement.
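The "silently drop the chattr" override described above can be sketched as
a filter applied when resolving effective inode attributes. The mount
option names ("force_datacow", "force_datasum") and the attribute-set
representation are made up for illustration; btrfs has no such options:

```python
# Sketch: mount-level overrides that silently ignore per-inode opt-outs,
# as suggested above. Option and attribute names are hypothetical.
MOUNT_OVERRIDES = {"force_datacow", "force_datasum"}

def effective_attrs(inode_attrs, mount_opts):
    """Per-inode attributes apply unless a mount-level override forces the
    safe default. The chattr bit stays on disk but is ignored ("silently
    drop"), rather than being rejected with an error that could break
    legacy applications."""
    attrs = set(inode_attrs)
    if "force_datacow" in mount_opts:
        attrs.discard("nodatacow")
    if "force_datasum" in mount_opts:
        attrs.discard("nodatasum")
    return attrs

# A legacy application sets nodatacow; with the override mounted, the file
# is still CoW and checksummed, and the application sees no error.
print(effective_attrs({"nodatacow", "noatime"}, {"force_datacow"}))  # -> {'noatime'}
```

The key design point is that the override lives entirely in the resolution
step: nothing is rewritten on disk, so unmounting without the override
restores the old behavior.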
The filesystem can't just ignore a nodatasum bit on an inode once it has
been set, and I can solve the problem above the filesystem by modifying
btrfs receive or other backup code to strip those bits and store them
somewhere else. Something like this seems mandatory to have working crypto
integrity in the future, but silent data corruption on cheap SSDs is bad
enough to make it a requirement for plaintext storage now.

compress has some significant holes in its per-inode interface: no way to
specify compress level in an inode property, no way to specify force on
some files but not others. Compress is far easier to override from a mount
option--we just don't look at the inode bits in needs_compress and
everything else still works the same as before.

> > user_subvol_rm_allowed

I'd like "user_subvol_create_disallowed" too. Unprivileged users can
create subvols, and that breaks backups that rely on atomic btrfs
snapshots. It could be a feature (it allows users to exclude parts of
their home directory from backups) but most users I've met who have
discovered this "feature" the hard way didn't enjoy it.

Historically I had other reasons to disallow subvol creates by
unprivileged users, but they are mostly removed in 4.18, now that 'rmdir'
works on an empty subvol.

> So there are historical reasons and interface limitations that led to the
> current state and multiple ways to do things.
>
> Per-inode attributes were originally a private ioctl of ext2 that other
> filesystems adopted for feature parity, and as the interface was
> bit-based, no additional values could be set (eg. for compression):
> limited number of bits, no precedence, inter-flag dependencies.
>
> > Stuff like compress and nodatacow can be set with chattr, so there are,
> > as far as I am aware, three methods of setting compression for example:
> > by mount options in fstab, by chattr, or by btrfs property set.
> >
> > I think it would be more consistent to have one interface for adjusting
> > behavior.
>
> I agree with that, and there's a proposal to unify that into properties
> as the interface once and for all, accessible through the extended
> attributes. But there are many more ways to do that wrong, so it hasn't
> been implemented so far.
>
> A suggestion for an inode flag here and there comes up from time to time,
> fixing one problem each time. Repeating that would lead to a mess that
> can be demonstrated on the existing mount options, so we've been there
> and need to do it the right way.
>
> > As I asked before, the future plan to have different storage profiles on
> > subvolumes seems to have been sneakily(?) removed from the wiki
>
> I don't think the per-subvolume storage options were ever tracked on the
> wiki; the closest match is per-subvolume mount options, which is still
> there:
>
> https://btrfs.wiki.kernel.org/index.php/Project_ideas#Per-subvolume_mount_options
>
> > - if that is indeed a dropped goal I can see why it makes sense to
> > keep the mount options; if not, I think the mount options should go in
> > favor of btrfs property set.

^ permalink raw reply	[flat|nested] 14+ messages in thread
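The asymmetry Zygo points out above - `user_subvol_rm_allowed` exists, but
there is no create-side counterpart - can be sketched as a pair of
permission checks. The proposed option name comes from the mail; the logic
is illustrative, not btrfs's implementation:

```python
# Sketch of the permission model discussed above. Today unprivileged users
# can always create subvolumes; a "user_subvol_create_disallowed" mount
# option would gate creation behind privilege, mirroring the way
# user_subvol_rm_allowed already gates removal.

def may_create_subvol(is_privileged, mount_opts):
    if "user_subvol_create_disallowed" in mount_opts:
        return is_privileged
    return True            # current behavior: anyone may create a subvolume

def may_delete_subvol(is_privileged, mount_opts):
    if is_privileged:
        return True
    # Current behavior: unprivileged removal needs an explicit opt-in.
    return "user_subvol_rm_allowed" in mount_opts

print(may_create_subvol(False, {"user_subvol_create_disallowed"}))  # -> False
print(may_delete_subvol(False, set()))                              # -> False
```

The motivation in the thread is that the create side defaults to permissive
while the remove side defaults to restrictive, and backup tools relying on
atomic snapshots are bitten by the permissive default.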
* Re: Why do we need these mount options?
  2021-01-15  3:54 ` Zygo Blaxell
@ 2021-01-15  9:32   ` waxhead
  2021-01-16  0:42     ` Zygo Blaxell
  2021-01-16  7:39   ` Andrei Borzenkov
  1 sibling, 1 reply; 14+ messages in thread
From: waxhead @ 2021-01-15  9:32 UTC (permalink / raw)
  To: Zygo Blaxell, dsterba, linux-btrfs

Zygo Blaxell wrote:
>
>>> commit
>>> space_cache / nospace_cache
>>> ssd / ssd_spread / nossd / nossd_spread
>
> How could those be anything other than filesystem-wide options?
>
Well, being me, I tend to live in a fantasy world where BTRFS has achieved
complete world domination and has become the VFS layer. As I have nagged
about before on this list, I really think that the only sensible way
forward for BTRFS (or dare I say BTRFS2) would be to make it possible to
assign "storage device groups", where you can make certain btrfs device
ids belong to group a, b, c, etc...

And with that it would be possible to assign a weight to subvolumes so
that they would be preferred to be stored on group a (SSDs perhaps), while
other subvolumes would be stored mostly or exclusively on HDDs, fast HDDs,
archival HDDs, etc... Maybe a bit over-enthusiastic thinking, perhaps, but
hopefully you see now why I think it is right that these are not
filesystem-wide but subvolume-based properties.

>>> discard / nodiscard
>
> Maybe, but probably requires too much introspection in a fast path (we'd
> have to add a check for the last owner of a deleted extent to see if it
> had 'discard' set on some parent level).
>
> On the other hand, I'm in favor of deprecating the whole discard option
> and going with fstrim instead. discard in its current form tends to
> increase write wear rather than decrease it, especially on metadata-heavy
> workloads. discard is roughly equivalent to running fstrim thousands
> of times a day, which is clearly bad for many (most? all?) SSDs.
>
> It might be possible to make the discard mount option's behavior more
> sane (e.g. discard only full chunks, configurable minimum discard length,
> discard only within data chunks, discard only once per hour, etc).
>
Interesting. It might also make sense to use the free space cache and a
slow LRU mechanism, e.g. "these chunks have not been in use for 64
hours/days" or something similar.

>>> compress / compress-force
>>> datacow / nodatacow
>>> datasum / nodatasum
>
> Here's where I prefer the mount option over the more local attributes,
> because I'd like filesystem-level sysadmin overrides for those, i.e.
> disallow all users, even privileged ones, from being able to create
> files that don't have csums or compression on a filesystem.
>
Then how about a mount option that allows only root to do certain things?
E.g. a security restriction.

>>> user_subvol_rm_allowed
>
> I'd like "user_subvol_create_disallowed" too. Unprivileged users can
> create subvols, and that breaks backups that rely on atomic btrfs
> snapshots. It could be a feature (it allows users to exclude parts of
> their home directory from backups) but most users I've met who have
> discovered this "feature" the hard way didn't enjoy it.
>
> Historically I had other reasons to disallow subvol creates by
> unprivileged users, but they are mostly removed in 4.18, now that 'rmdir'
> works on an empty subvol.
>
Again, see above...

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Why do we need these mount options?
  2021-01-15  9:32 ` waxhead
@ 2021-01-16  0:42   ` Zygo Blaxell
  2021-01-16  1:57     ` waxhead
  0 siblings, 1 reply; 14+ messages in thread
From: Zygo Blaxell @ 2021-01-16  0:42 UTC (permalink / raw)
  To: waxhead; +Cc: dsterba, linux-btrfs

On Fri, Jan 15, 2021 at 10:32:39AM +0100, waxhead wrote:
> Zygo Blaxell wrote:
> >
> > > > commit
> > > > space_cache / nospace_cache
> > > > ssd / ssd_spread / nossd / nossd_spread
> >
> > How could those be anything other than filesystem-wide options?
> >
> Well, being me, I tend to live in a fantasy world where BTRFS has
> achieved complete world domination and has become the VFS layer. As I
> have nagged about before on this list, I really think that the only
> sensible way forward for BTRFS (or dare I say BTRFS2) would be to make
> it possible to assign "storage device groups", where you can make
> certain btrfs device ids belong to group a, b, c, etc...
>
> And with that it would be possible to assign a weight to subvolumes so
> that they would be preferred to be stored on group a (SSDs perhaps),
> while other subvolumes would be stored mostly or exclusively on HDDs,
> fast HDDs, archival HDDs, etc... Maybe a bit over-enthusiastic thinking,
> perhaps, but hopefully you see now why I think it is right that these
> are not filesystem-wide but subvolume-based properties.

Sure, that's all wonderful, but it has nothing to do with any of those
mount options. ;)

commit sets a timer that forces a filesystem-wide sync() every now and
then. space_cache picks one of the allocator implementations, also for
the entire filesystem. ssd and related options affect the behavior of
the metadata allocator and superblocks.

> > > > discard / nodiscard
> >
> > Maybe, but probably requires too much introspection in a fast path
> > (we'd have to add a check for the last owner of a deleted extent to
> > see if it had 'discard' set on some parent level).
> >
> > On the other hand, I'm in favor of deprecating the whole discard
> > option and going with fstrim instead. discard in its current form
> > tends to increase write wear rather than decrease it, especially on
> > metadata-heavy workloads. discard is roughly equivalent to running
> > fstrim thousands of times a day, which is clearly bad for many (most?
> > all?) SSDs.
> >
> > It might be possible to make the discard mount option's behavior more
> > sane (e.g. discard only full chunks, configurable minimum discard
> > length, discard only within data chunks, discard only once per hour,
> > etc).
> >
> Interesting. It might also make sense to use the free space cache and a
> slow LRU mechanism, e.g. "these chunks have not been in use for 64
> hours/days" or something similar.

That would add more writes, as the free space cache is an on-disk entity.
It might make sense to maintain a 'discard tree', which lists extents
that have been freed but not yet discarded or overwritten, to make fstrim
more efficient. This wouldn't have to be very precise, just pointing to
general regions of the disk (maybe even entire block groups) so fstrim
doesn't issue discards to idle areas of the disk over and over.

Currently the discard extent list is stored in memory, so doing one
discard per T time units would use more memory. This feature would be
like discard=async, but 1) it would hold on to the pinned extents for a
few hundred transactions instead of just one or two (subject to memory
availability), and 2) it would be able to reclaim space from the discard
list as free space, thus removing the need to issue a discard at all.

But that's really complicated, considering that a cron job that runs
fstrim once an hour can do the same thing without all the complexity.
On the other hand, I just ran fstrim on a test machine and it took 34
minutes, so maybe some complexity might be useful after all... :-O

> > > > compress / compress-force
> > > > datacow / nodatacow
> > > > datasum / nodatasum
> >
> > Here's where I prefer the mount option over the more local attributes,
> > because I'd like filesystem-level sysadmin overrides for those, i.e.
> > disallow all users, even privileged ones, from being able to create
> > files that don't have csums or compression on a filesystem.
> >
> Then how about a mount option that allows only root to do certain
> things? E.g. a security restriction.

No, I don't want root doing those things either. Most of the applications
I want to bring to heel are already running as root.

Basically I want to say "every file on this filesystem shall be datacow
and datasum" and (short of altering the mount option) no other kind of
file can be created.

It might possibly make more sense to do this through tunefs--that way the
filesystem couldn't ever have nodatacow files (new kernels would refuse,
old kernels wouldn't be able to mount through a new feature flag).

> > > > user_subvol_rm_allowed
> >
> > I'd like "user_subvol_create_disallowed" too. Unprivileged users can
> > create subvols, and that breaks backups that rely on atomic btrfs
> > snapshots. It could be a feature (it allows users to exclude parts of
> > their home directory from backups) but most users I've met who have
> > discovered this "feature" the hard way didn't enjoy it.
> >
> > Historically I had other reasons to disallow subvol creates by
> > unprivileged users, but they are mostly removed in 4.18, now that
> > 'rmdir' works on an empty subvol.
> >
> Again, see above...

Here, unlike above, I was already asking precisely for subvol create to
be made root only.

That, or make snapshots recursive and atomic to avoid the accidental user
data loss/corruption case.

^ permalink raw reply	[flat|nested] 14+ messages in thread
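The batching scheme Zygo sketches above - hold freed extents for a while,
reclaim them as free space if the allocator reuses them, and only discard
what is left at a low rate - can be expressed as a small queue. All names,
the data representation, and the once-per-interval policy are illustrative;
none of this is btrfs code:

```python
# Sketch of a rate-limited discard queue, per the discard=async-like idea
# above: freed extents wait in the queue; if one is reallocated before the
# timer fires, it is reclaimed and never discarded at all.
class DiscardQueue:
    def __init__(self, interval):
        self.interval = interval      # minimum seconds between discard batches
        self.pending = set()          # freed, not yet discarded (start, len)
        self.last_flush = 0.0
        self.discarded = []           # what we would actually send to the device

    def extent_freed(self, extent):
        self.pending.add(extent)

    def extent_reallocated(self, extent):
        # Reused before discard: reclaim as free space, skip the discard
        # entirely - this is point 2) in the mail.
        self.pending.discard(extent)

    def tick(self, now):
        # Issue at most one batch of discards per interval ("discard only
        # once per hour" from earlier in the thread).
        if now - self.last_flush >= self.interval and self.pending:
            self.discarded.extend(sorted(self.pending))
            self.pending.clear()
            self.last_flush = now

q = DiscardQueue(interval=3600)
q.extent_freed((4096, 8192))
q.extent_freed((16384, 4096))
q.extent_reallocated((4096, 8192))    # reused: no discard ever issued for it
q.tick(now=3600.0)
print(q.discarded)                    # -> [(16384, 4096)]
```

The trade-off discussed in the mail shows up directly: a longer `interval`
and larger `pending` set mean fewer device discards but more memory pinned,
which is why a periodic fstrim job is the simple alternative.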
* Re: Why do we need these mount options? 2021-01-16 0:42 ` Zygo Blaxell @ 2021-01-16 1:57 ` waxhead 2021-01-16 3:51 ` Zygo Blaxell 0 siblings, 1 reply; 14+ messages in thread From: waxhead @ 2021-01-16 1:57 UTC (permalink / raw) To: Zygo Blaxell; +Cc: dsterba, linux-btrfs Zygo Blaxell wrote: > On Fri, Jan 15, 2021 at 10:32:39AM +0100, waxhead wrote: >> Zygo Blaxell wrote: >>> >>>>> commit >>>>> space_cache / nospace_cache >>>>> sdd / ssd_spread / nossd / no_ssdspread >>> >>> How could those be anything other than filesystem-wide options? >>> >> >> Well being me, I tend to live in a fantasy world where BTRFS have complete >> world domination and has become the VFS layer. >> As I have nagged about before on this list - I really think that the only >> sensible way forward for BTRFS (or dare I say BTRFS2) would be to make it >> possible to assign "storage device groups" where you can make certain btrfs >> device ids belong to group a,b,c, etc... >> >> And with that it would be possible to assign a weight to subvolumes so that >> they would be preferred to be stored on group a (SSD's perhaps), while other >> subvolumes would be stored mostly or exlusively on HDD's, Fast HDD's, >> Archival HDD's etc... So maybe a bit over enthusiastic in thinking perhaps , >> but hopefully you see now why I think it is right that this is not >> filesystem-wide , but subvolume baseed properties. > > Sure, that's all wonderful, but it has nothing to do with any of those > mount options. ;) > > commit sets a timer that forces a filesystem-wide sync() every now > and then. space_cache picks one of the allocator implementations, also > for the entire filesystem. ssd and related options affect the behavior > of the metadata allocator and superblocks. > Ok I understand the space_cache, but commit (the way I understand it) could be useful to keep stuff in memory for longer for certain subvolume. ssd options could also be useful pr. 
subvolume *if* and that is a big if BTRFS sometime in the far future allows for storage device groups. However when that happens maybe everything is solid state , who knows ;) >>>>> discard / nodiscard >>> >>> Maybe, but probably requires too much introspection in a fast path (we'd >>> have to add a check for the last owner of a deleted extent to see if it >>> had 'discard' set on some parent level). >>> >>> On the other hand, I'm in favor of deprecating the whole discard option >>> and going with fstrim instead. discard in its current form tends to >>> increase write wear rather than decrease it, especially on metadata-heavy >>> workloads. discard is roughly equivalent to running fstrim thousands >>> of times a day, which is clearly bad for many (most? all?) SSDs. >>> >>> It might be possible to make the discard mount option's behavior more >>> sane (e.g. discard only full chunks, configurable minimum discard length, >>> discard only within data chunks, discard only once per hour, etc). >>> >> Interesting, it might as well make sense to perhaps use the free space cache >> and a slow LRU mechanism e.g. "these chunks has not been in use for 64 >> hours/days" or something similar. > > That would add more writes, as the free space cache is an on-disk entity. > It might make sense to maintain a 'discard tree', which lists extents > that have been freed but not yet discarded or overwritten, to make fstrim > more efficient. This wouldn't have to be very precise, just pointing to > general regions of the disk (maybe even entire block groups) so fstrim > doesn't issue discards to idle areas of the disk over and over. > > Currently the discard extent list is stored in memory, so doing one > discard per T time units would use more memory. 
This feature would be > like discard=async, but 1) it would hold on to the pinned extents for a > few hundred transactions instead of just one or two (subject to memory > availability), and 2) it would be able to reclaim space from the discard > list as free space, thus removing the need to issue a discard at all. > > But that's really complicated, considering that a cron job that runs > fstrim once an hour can do the same thing without all the complexity. > On the other hand, I just ran fstrim on a test machine and it took > 34 minutes, so maybe some complexity might be useful after all... :-O > Thanks for the education. Inspired by this I ran fstrim on my main machine with 7 older and smaller SSD's. Have not run fstrim on it for maybe a year, so it took 15-20 minutes to do. But I see your point. Minimizing writes to SSD's is a good thing, but if I am not mistaken SMR disks can benefit from discard as well; I don't know the details, but maybe it would be more beneficial on real disks than on SSD storage? >>>>> compress / compress-force >>>>> datacow / nodatacow >>>>> datasum / nodatasum >>> >>> Here's where I prefer the mount option over the more local attributes, >>> because I'd like filesystem-level sysadmin overrides for those. >>> i.e. disallow all users, even privileged ones, from being able to create >>> files that don't have csums or compression on a filesystem. >>> >> Then how about a mount option that allows only root to do certain things? >> e.g. a security restriction. > > No, I don't want root doing those things either. Most of the applications > I want to bring to heel are already running as root. > > Basically I want to say "every file on this filesystem shall be datacow > and datasum" and (short of altering the mount option) no other kind of > file can be created.
> > It might possibly make more sense to do this through tunefs--that way the > filesystem couldn't ever have nodatacow files (new kernels would refuse, > old kernels wouldn't be able to mount through a new feature flag). > Allrighty point taken. >>>>> user_subvol_rm_allowed >>> >>> I'd like "user_subvol_create_disallowed" too. Unprivileged users can >>> create subvols, and that breaks backups that rely on atomic btrfs >>> snapshots. It could be a feature (it allows users to exclude parts of >>> their home directory from backups) but most users I've met who have >>> discovered this "feature" the hard way didn't enjoy it. >>> >>> Historically I had other reasons to disallow subvol creates by >>> unprivileged users, but they are mostly removed in 4.18, now that 'rmdir' >>> works on an empty subvol. >>> >> Again see above... > > Here, unlike above, I was already asking precisely for subvol create to > be made root only. > > That, or make snapshots recursive and atomic to avoid the accidental > user data loss/corruption case. > Allrighty again! :) ^ permalink raw reply [flat|nested] 14+ messages in thread
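As the opening mail of the thread notes, the same compression knob is exposed through three overlapping interfaces. A minimal sketch of all three as a command fragment; the device, paths, and the choice of zstd are purely illustrative, and all three require root and a mounted btrfs filesystem:

```shell
# 1. Mount option (e.g. in /etc/fstab) - applies to the whole mount:
mount -o compress=zstd /dev/sdb1 /mnt

# 2. chattr file attribute - per file or directory, affects new writes:
chattr +c /mnt/data

# 3. btrfs property - per file, directory, or subvolume:
btrfs property set /mnt/data compression zstd
```

Three entry points for one behavior is exactly the inconsistency the original mail complains about.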
* Re: Why do we need these mount options? 2021-01-16 1:57 ` waxhead @ 2021-01-16 3:51 ` Zygo Blaxell 0 siblings, 0 replies; 14+ messages in thread From: Zygo Blaxell @ 2021-01-16 3:51 UTC (permalink / raw) To: waxhead; +Cc: dsterba, linux-btrfs On Sat, Jan 16, 2021 at 02:57:05AM +0100, waxhead wrote: > > > Zygo Blaxell wrote: > > On Fri, Jan 15, 2021 at 10:32:39AM +0100, waxhead wrote: > > > Zygo Blaxell wrote: > > > > > > > > > > commit > > > > > > space_cache / nospace_cache > > > > > > ssd / ssd_spread / nossd / nossd_spread > > > > > > > > How could those be anything other than filesystem-wide options? > > > > > > > > > > Well being me, I tend to live in a fantasy world where BTRFS has complete > > > world domination and has become the VFS layer. > > > As I have nagged about before on this list - I really think that the only > > > sensible way forward for BTRFS (or dare I say BTRFS2) would be to make it > > > possible to assign "storage device groups" where you can make certain btrfs > > > device ids belong to group a,b,c, etc... > > > > > > And with that it would be possible to assign a weight to subvolumes so that > > > they would be preferred to be stored on group a (SSD's perhaps), while other > > > subvolumes would be stored mostly or exclusively on HDD's, Fast HDD's, > > > Archival HDD's etc... So maybe a bit overenthusiastic in my thinking perhaps, > > > but hopefully you see now why I think it is right that this is not > > > filesystem-wide, but subvolume-based properties. > > > > Sure, that's all wonderful, but it has nothing to do with any of those > > mount options. ;) > > > > commit sets a timer that forces a filesystem-wide sync() every now > > and then. space_cache picks one of the allocator implementations, also > > for the entire filesystem. ssd and related options affect the behavior > > of the metadata allocator and superblocks.
> > > Ok I understand the space_cache, but commit (the way I understand it) could > be useful to keep stuff in memory for longer for certain subvolumes. That's probably something to implement mostly above the filesystem level, as it's VFS that controls most of writeback, and maybe memcg could be adapted to have per-cgroup writeback settings. The btrfs transaction commit updates all of the filesystem trees at once (they are all children of the root subtree's root node). Changing that would be a significant restructuring. > ssd options could also be useful per subvolume *if* - and that is a big if - > BTRFS sometime in the far future allows for storage device groups. However > when that happens maybe everything is solid state, who knows ;) First the ssd options have to do something that could be per-subvolume. Right now they do almost nothing, and the little they do is per-device. There are various proposals for HSM, metadata-on-SSD, hot data tracking floating around. If they ever arrive at btrfs, they definitely won't be tacked onto the ssd mount option. > > > > > > discard / nodiscard > > > > > > > > Maybe, but probably requires too much introspection in a fast path (we'd > > > > have to add a check for the last owner of a deleted extent to see if it > > > > had 'discard' set on some parent level). > > > > > > > > On the other hand, I'm in favor of deprecating the whole discard option > > > > and going with fstrim instead. discard in its current form tends to > > > > increase write wear rather than decrease it, especially on metadata-heavy > > > > workloads. discard is roughly equivalent to running fstrim thousands > > > > of times a day, which is clearly bad for many (most? all?) SSDs. > > > > > > > > It might be possible to make the discard mount option's behavior more > > > > sane (e.g. discard only full chunks, configurable minimum discard length, > > > > discard only within data chunks, discard only once per hour, etc).
> > > > > > > Interesting, it might also make sense to use the free space cache > > > and a slow LRU mechanism e.g. "these chunks have not been in use for 64 > > > hours/days" or something similar. > > > > That would add more writes, as the free space cache is an on-disk entity. > > It might make sense to maintain a 'discard tree', which lists extents > > that have been freed but not yet discarded or overwritten, to make fstrim > > more efficient. This wouldn't have to be very precise, just pointing to > > general regions of the disk (maybe even entire block groups) so fstrim > > doesn't issue discards to idle areas of the disk over and over. > > > > Currently the discard extent list is stored in memory, so doing one > > discard per T time units would use more memory. This feature would be > > like discard=async, but 1) it would hold on to the pinned extents for a > > few hundred transactions instead of just one or two (subject to memory > > availability), and 2) it would be able to reclaim space from the discard > > list as free space, thus removing the need to issue a discard at all. > > > > But that's really complicated, considering that a cron job that runs > > fstrim once an hour can do the same thing without all the complexity. > > On the other hand, I just ran fstrim on a test machine and it took > > 34 minutes, so maybe some complexity might be useful after all... :-O > > > Thanks for the education. Inspired by this I ran fstrim on my main machine > with 7 older and smaller SSD's. Have not run fstrim on it for maybe a year, so > it took 15-20 minutes to do. But I see your point. Minimizing writes to > SSD's is a good thing, but if I am not mistaken SMR disks can benefit from > discard as well; I don't know the details, but maybe it would be more > beneficial on real disks than on SSD storage? Overwriting data at a given LBA implies discarding the data currently in the LBA, regardless of underlying storage.
discard is useful on SSD, SMR, and also caching/thin LV layers if the LBAs are not going to be overwritten for a long time (thin LVs take less space, caches have more room for other data). There is no need to send a separate discard command if the LBAs in question are not part of a file at the time, and there is a high probability of overwriting the same LBA with new data very soon. There are multiple variants of the discard operation in devices. One variant leaves the contents of the discarded blocks undefined. That's a nice feature, since the device is free to ignore the discard request if it might improve performance (i.e. there is no need to ensure the discard has immediate persistent effect, so it is acceptable for the drive to not flush discards to persistent storage right away, and merge the discard into some later write command to save one write cycle). Most of the SSDs I've tested do a different variant of discard, where the discarded blocks are guaranteed to return zeros. That forces the drive to do at least one flash write for each discard. On such drives it's important to not send redundant discards from the filesystem, as those discards are necessarily wasting write cycles. On SMR drives, redundant discards are even worse than on SSD. SMR drives have address translation layers like SSD, but they are stored on spinning disks. It will take milliseconds to move heads to record the discard in the translation layer, only to spend more milliseconds to write the replacement data in the same location immediately after. The disk platter has higher endurance than SSD, but can be orders of magnitude slower if there are too many translation layer updates. > > > > > > compress / compress-force > > > > > > datacow / nodatacow > > > > > > datasum / nodatasum > > > > > > > > Here's where I prefer the mount option over the more local attributes, > > > > because I'd like filesystem-level sysadmin overrides for those. > > > > i.e. 
disallow all users, even privileged ones, from being able to create > > > > files that don't have csums or compression on a filesystem. > > > > > > > Then how about a mount option that allows only root to do certain things? > > > e.g. a security restriction. > > > > No, I don't want root doing those things either. Most of the applications > > I want to bring to heel are already running as root. > > > > Basically I want to say "every file on this filesystem shall be datacow > > and datasum" and (short of altering the mount option) no other kind of > > file can be created. > > > > It might possibly make more sense to do this through tunefs--that way the > > filesystem couldn't ever have nodatacow files (new kernels would refuse, > > old kernels wouldn't be able to mount through a new feature flag). > > > Allrighty point taken. > > > > > > > user_subvol_rm_allowed > > > > > > > > I'd like "user_subvol_create_disallowed" too. Unprivileged users can > > > > create subvols, and that breaks backups that rely on atomic btrfs > > > > snapshots. It could be a feature (it allows users to exclude parts of > > > > their home directory from backups) but most users I've met who have > > > > discovered this "feature" the hard way didn't enjoy it. > > > > > > > > Historically I had other reasons to disallow subvol creates by > > > > unprivileged users, but they are mostly removed in 4.18, now that 'rmdir' > > > > works on an empty subvol. > > > > > > > Again see above... > > > > Here, unlike above, I was already asking precisely for subvol create to > > be made root only. > > > > That, or make snapshots recursive and atomic to avoid the accidental > > user data loss/corruption case. > > > Allrighty again! :) ^ permalink raw reply [flat|nested] 14+ messages in thread
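The "cron job that runs fstrim once an hour" alternative quoted above amounts to a one-line configuration fragment; a minimal sketch, with the schedule and file path purely illustrative (most distributions also ship a stock weekly fstrim.timer with util-linux that can simply be enabled):

```shell
# Hourly trim of all mounted filesystems that support discard:
# /etc/cron.d/fstrim-hourly
0 * * * * root /sbin/fstrim -a

# Or, on systemd systems, enable the stock weekly timer instead:
#   systemctl enable --now fstrim.timer
```

Batching the trims this way gets the space-reclaim benefit while avoiding the per-extent discard traffic the thread is worried about.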
* Re: Why do we need these mount options? 2021-01-15 3:54 ` Zygo Blaxell 2021-01-15 9:32 ` waxhead @ 2021-01-16 7:39 ` Andrei Borzenkov 2021-01-16 15:19 ` Adam Borowski 1 sibling, 1 reply; 14+ messages in thread From: Andrei Borzenkov @ 2021-01-16 7:39 UTC (permalink / raw) To: Zygo Blaxell, dsterba, waxhead, linux-btrfs 15.01.2021 06:54, Zygo Blaxell wrote: > > On the other hand, I'm in favor of deprecating the whole discard option > and going with fstrim instead. discard in its current form tends to > increase write wear rather than decrease it, especially on metadata-heavy > workloads. discard is roughly equivalent to running fstrim thousands > of times a day, which is clearly bad for many (most? all?) SSDs. > My (probably naive) understanding so far was that trim on SSD marks areas as "unused" which means the SSD needs to copy less residual data from an erase block when reusing it. Assuming the TRIM unit is (significantly) smaller than an erase block. I would appreciate it if you could elaborate on how trim results in more writes on an SSD? Or do you mean more writes from btrfs while performing discard? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Why do we need these mount options? 2021-01-16 7:39 ` Andrei Borzenkov @ 2021-01-16 15:19 ` Adam Borowski 2021-01-16 17:21 ` Andrei Borzenkov 0 siblings, 1 reply; 14+ messages in thread From: Adam Borowski @ 2021-01-16 15:19 UTC (permalink / raw) To: Andrei Borzenkov; +Cc: Zygo Blaxell, dsterba, waxhead, linux-btrfs On Sat, Jan 16, 2021 at 10:39:51AM +0300, Andrei Borzenkov wrote: > 15.01.2021 06:54, Zygo Blaxell wrote: > > On the other hand, I'm in favor of deprecating the whole discard option > > and going with fstrim instead. discard in its current form tends to > > increase write wear rather than decrease it, especially on metadata-heavy > > workloads. discard is roughly equivalent to running fstrim thousands > > of times a day, which is clearly bad for many (most? all?) SSDs. > > My (probably naive) understanding so far was that trim on SSD marks > areas as "unused" which means the SSD needs to copy less residual data from > an erase block when reusing it. Assuming the TRIM unit is (significantly) > smaller than an erase block. > > I would appreciate it if you could elaborate on how trim results in more writes on an SSD? The areas are not only marked as unused, but also zeroed. To keep the zeroing semantics, every discard must be persisted, thus requiring a write to the SSD's metadata (not btrfs metadata) area. Meow! -- ⢀⣴⠾⠻⢶⣦⠀ .--[ Makefile ] ⣾⠁⢠⠒⠀⣿⡁ # beware of races ⢿⡄⠘⠷⠚⠋⠀ all: pillage burn ⠈⠳⣄⠀⠀⠀⠀ `---- ^ permalink raw reply [flat|nested] 14+ messages in thread
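Whether a given drive actually advertises the zeroing behavior described above can be checked from userspace. A sketch as a command fragment; the device name is purely illustrative, and the hdparm check applies only to SATA drives:

```shell
# What the kernel knows about discard on a device:
# DISC-GRAN/DISC-MAX of 0 means no discard support; DISC-ZERO
# shows whether discarded blocks are defined to read back as zeroes.
lsblk --discard /dev/sda

# The drive-advertised TRIM capability bits for a SATA device:
hdparm -I /dev/sda | grep -i trim
# "Deterministic read ZEROs after TRIM" here is the zero-returning
# (RZAT) variant; if it is absent, the contents of discarded blocks
# are undefined and the drive is free to defer the work.
```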
* Re: Why do we need these mount options? 2021-01-16 15:19 ` Adam Borowski @ 2021-01-16 17:21 ` Andrei Borzenkov 2021-01-16 20:01 ` Zygo Blaxell 0 siblings, 1 reply; 14+ messages in thread From: Andrei Borzenkov @ 2021-01-16 17:21 UTC (permalink / raw) To: Adam Borowski; +Cc: Zygo Blaxell, dsterba, waxhead, linux-btrfs 16.01.2021 18:19, Adam Borowski wrote: > On Sat, Jan 16, 2021 at 10:39:51AM +0300, Andrei Borzenkov wrote: >> 15.01.2021 06:54, Zygo Blaxell wrote: >>> On the other hand, I'm in favor of deprecating the whole discard option >>> and going with fstrim instead. discard in its current form tends to >>> increase write wear rather than decrease it, especially on metadata-heavy >>> workloads. discard is roughly equivalent to running fstrim thousands >>> of times a day, which is clearly bad for many (most? all?) SSDs. >> >> My (probably naive) understanding so far was that trim on SSD marks >> areas as "unused" which means the SSD needs to copy less residual data from >> an erase block when reusing it. Assuming the TRIM unit is (significantly) >> smaller than an erase block. >> >> I would appreciate it if you could elaborate on how trim results in more writes on an SSD? > > The areas are not only marked as unused, but also zeroed. To keep the > zeroing semantics, every discard must be persisted, thus requiring a write > to the SSD's metadata (not btrfs metadata) area. > There is no requirement that TRIM do this. If the device sets the RZAT SUPPORTED bit, it should return zeroes for a trimmed range, but there is no need to physically zero anything - simply return zeroes for areas marked as unallocated. A discard must be persisted in the allocation table, but then every write must be persisted in the allocation table anyway. Moreover, to actually zero on TRIM either the trim request must be issued for the full erase block or the device must perform garbage collection. Do you have any links that show that discards increase write load on physical media? I am really curious.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Why do we need these mount options? 2021-01-16 17:21 ` Andrei Borzenkov @ 2021-01-16 20:01 ` Zygo Blaxell 0 siblings, 0 replies; 14+ messages in thread From: Zygo Blaxell @ 2021-01-16 20:01 UTC (permalink / raw) To: Andrei Borzenkov; +Cc: Adam Borowski, dsterba, waxhead, linux-btrfs On Sat, Jan 16, 2021 at 08:21:16PM +0300, Andrei Borzenkov wrote: > 16.01.2021 18:19, Adam Borowski wrote: > > On Sat, Jan 16, 2021 at 10:39:51AM +0300, Andrei Borzenkov wrote: > >> 15.01.2021 06:54, Zygo Blaxell wrote: > >>> On the other hand, I'm in favor of deprecating the whole discard option > >>> and going with fstrim instead. discard in its current form tends to > >>> increase write wear rather than decrease it, especially on metadata-heavy > >>> workloads. discard is roughly equivalent to running fstrim thousands > >>> of times a day, which is clearly bad for many (most? all?) SSDs. > >> > >> My (probably naive) understanding so far was that trim on SSD marks > >> areas as "unused" which means the SSD needs to copy less residual data from > >> an erase block when reusing it. Assuming the TRIM unit is (significantly) > >> smaller than an erase block. > >> > >> I would appreciate it if you could elaborate on how trim results in more writes on an SSD? > > > > The areas are not only marked as unused, but also zeroed. To keep the > > zeroing semantics, every discard must be persisted, thus requiring a write > > to the SSD's metadata (not btrfs metadata) area. > > > > There is no requirement that TRIM do this. If the device sets the RZAT SUPPORTED > bit, it should return zeroes for a trimmed range, but there is no need to > physically zero anything - simply return zeroes for areas marked as > unallocated. A discard must be persisted in the allocation table, but then > every write must be persisted in the allocation table anyway. That is exactly the problem--the persistence is a write that counts against total drive wear.
That is why TRIM variants that leave the contents of the discarded LBAs undefined are better than those which define the contents as zero. The effect seems to be the equivalent of a small write, i.e. a 16K write might be the same cost as any length of contiguous discard. So it's OK to discard block-group-sized regions, but not OK to issue one discard for every metadata free page hole. Different drives have different ratios between these costs, so parity might occur at 4K or 256K depending on the drive. AIUI there is a minimum discard length filter implemented in btrfs already, so maybe it just needs tuning? > Moreover, to actually zero on TRIM either trim request must be issued > for the full erase block or device must perform garbage collection. > > Do you have any links that show that discards increase write load on > physical media? I am really curious. I have no links, it's a directly observed result. It's fairly straightforward to replicate: Set up a machine to do git checkouts of each Linux kernel tag in random order, in a loop (maybe multiple instances of this if needed to get the SSD device IO saturated). While that happens, watch the percentage used endurance indicator reported on the drives (smartctl -x). Wait for the indicator to increment twice, and measure the time between the first and second increment. Use a low-cost consumer or OEM SSD so you get results in less than a few hundred hours. Then mount -o discard=async and wait for two more increments. Assuming the workload produces constant amounts of IO over time, and the percentage used endurance indicator variable from SMART is not a complete lie, the time between increments should roughly indicate the wear rates of the different workloads. In the field, we discovered this on CI builder workloads (lots of tiny files created, destroyed, and created again in rapid succession). They get almost double the SSD wear rate with discard on vs. discard off. 
We have monitoring on the p-u-e-i variable, and use it to project the date when 100% endurance will be reached. If that date lands within the date range when we want to be using the SSD, we get an alert. When discard is accidentally enabled on a CI server due to a configuration failure, we get an alert about a week later, as it shortens our drives' projected lifespan from more than 6 years to less than 4. Other workloads are less sensitive to this. If the workload has fewer metadata updates, bigger files, and sequential writes, then discard doesn't have a negative effect--though to be fair, it doesn't seem to have a positive effect either, at least not by this measurement method. ^ permalink raw reply [flat|nested] 14+ messages in thread
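The wear-rate measurement described above boils down to sampling one SMART attribute over time. A minimal sketch of the extraction step; the field position assumes the SATA device-statistics layout that `smartctl -x` prints (vendors vary, and NVMe drives report a "Percentage Used" line in a different format, so the pattern may need adjusting):

```shell
# Pull the value column from the device-statistics line that
# `smartctl -x` prints for SATA SSDs; the line looks like:
#   0x07  0x008  1               3  ---  Percentage Used Endurance Indicator
parse_endurance() {
  awk '/Percentage Used Endurance Indicator/ {print $4; exit}'
}

# A real run would be:   smartctl -x /dev/sda | parse_endurance
# Demonstrated here on a canned sample line:
echo '0x07  0x008  1               3  ---  Percentage Used Endurance Indicator' \
  | parse_endurance    # prints: 3
```

Logging this value periodically and timing the increments under a fixed workload gives the same wear-rate comparison without any vendor tooling.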
end of thread, other threads:[~2021-01-16 20:02 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-01-14 2:12 Why do we need these mount options? waxhead 2021-01-14 16:37 ` David Sterba 2021-01-15 0:02 ` waxhead 2021-01-15 15:29 ` David Sterba 2021-01-16 1:47 ` waxhead 2021-01-15 3:54 ` Zygo Blaxell 2021-01-15 9:32 ` waxhead 2021-01-16 0:42 ` Zygo Blaxell 2021-01-16 1:57 ` waxhead 2021-01-16 3:51 ` Zygo Blaxell 2021-01-16 7:39 ` Andrei Borzenkov 2021-01-16 15:19 ` Adam Borowski 2021-01-16 17:21 ` Andrei Borzenkov 2021-01-16 20:01 ` Zygo Blaxell