* Shrinking a device - performance? @ 2017-03-27 11:17 Christian Theune 2017-03-27 13:07 ` Hugo Mills 0 siblings, 1 reply; 42+ messages in thread From: Christian Theune @ 2017-03-27 11:17 UTC (permalink / raw) To: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 1146 bytes --] Hi, I’m currently shrinking a device and it seems that the performance of the shrink is abysmal. I intended to shrink a ~22TiB filesystem down to 20TiB. This is still using LVM underneath, so I can’t just remove a device from the filesystem but have to use the resize command. Label: 'backy' uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4 Total devices 1 FS bytes used 18.21TiB devid 1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy This has been running since last Thursday, so roughly 3.5 days now. The “used” number in devid 1 has moved about 1TiB in this time. The filesystem is seeing regular usage (read and write) and when I’m suspending any application traffic I see about 1GiB of movement every now and then. Maybe once every 30 seconds or so. Does this sound fishy or normal to you? Kind regards, Christian -- Christian Theune · ct@flyingcircus.io · +49 345 219401 0 Flying Circus Internet Operations GmbH · http://flyingcircus.io Forsterstraße 29 · 06112 Halle (Saale) · Deutschland HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick [-- Attachment #2: Message signed with OpenPGP --] [-- Type: application/pgp-signature, Size: 496 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
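For anyone who finds this thread later: the operation being described is the online btrfs shrink, and its progress is visible as the devid “used” figure falling toward the new device size. A rough sketch of the commands involved (the mount point /srv/backy is invented here; only the label and sizes above are from the actual system):

    # Set the new size of devid 1; chunks beyond that size get relocated while the
    # filesystem stays mounted and in use. The command normally does not return
    # until the relocation has finished.
    btrfs filesystem resize 1:20T /srv/backy

    # From another shell, watch "used" for devid 1 drop toward the new size:
    watch -n 60 'btrfs filesystem show /srv/backy; btrfs filesystem usage /srv/backy'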
* Re: Shrinking a device - performance? 2017-03-27 11:17 Shrinking a device - performance? Christian Theune @ 2017-03-27 13:07 ` Hugo Mills 2017-03-27 13:20 ` Christian Theune 0 siblings, 1 reply; 42+ messages in thread From: Hugo Mills @ 2017-03-27 13:07 UTC (permalink / raw) To: Christian Theune; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 1624 bytes --] On Mon, Mar 27, 2017 at 01:17:26PM +0200, Christian Theune wrote: > Hi, > > I’m currently shrinking a device and it seems that the performance of shrink is abysmal. I intended to shrink a ~22TiB filesystem down to 20TiB. This is still using LVM underneath so that I can’t just remove a device from the filesystem but have to use the resize command. > > Label: 'backy' uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4 > Total devices 1 FS bytes used 18.21TiB > devid 1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy > > This has been running since last Thursday, so roughly 3.5days now. The “used” number in devid1 has moved about 1TiB in this time. The filesystem is seeing regular usage (read and write) and when I’m suspending any application traffic I see about 1GiB of movement every now and then. Maybe once every 30 seconds or so. > > Does this sound fishy or normal to you? On my hardware (consumer HDDs and SATA, RAID-1 over 6 devices), it takes about a minute to move 1 GiB of data. At that rate, it would take 1000 minutes (or about 16 hours) to move 1 TiB of data. However, there are cases where some items of data can take *much* longer to move. The biggest of these is when you have lots of snapshots. When that happens, some (but not all) of the metadata can take a very long time. In my case, with a couple of hundred snapshots, some metadata chunks take 4+ hours to move. Hugo. -- Hugo Mills | Great films about cricket: Silly Point Break hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: E2AB1DE4 | [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
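Putting that estimate against the numbers in the original post, as a back-of-the-envelope sketch (it assumes the ~1 GiB/min figure above and the devid line quoted from the first mail; nothing here is measured):

    # Still to relocate: device "used" minus the new device size,
    # 20.71 TiB - 20.00 TiB = ~0.71 TiB, i.e. roughly 727 GiB:
    echo $(( (2071 - 2000) * 1024 / 100 ))
    # Best case at ~1 GiB/min that is ~727 minutes, about 12 hours.
    # The observed rate (~1 TiB in 3.5 days, ~0.29 TiB/day) is about 5x slower than
    # the ~1.4 TiB/day best case, which fits the "some chunks take much longer" caveat.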
* Re: Shrinking a device - performance? 2017-03-27 13:07 ` Hugo Mills @ 2017-03-27 13:20 ` Christian Theune 2017-03-27 13:24 ` Hugo Mills 2017-03-27 14:48 ` Roman Mamedov 0 siblings, 2 replies; 42+ messages in thread From: Christian Theune @ 2017-03-27 13:20 UTC (permalink / raw) To: Hugo Mills; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 1936 bytes --] Hi, > On Mar 27, 2017, at 3:07 PM, Hugo Mills <hugo@carfax.org.uk> wrote: > > On my hardware (consumer HDDs and SATA, RAID-1 over 6 devices), it > takes about a minute to move 1 GiB of data. At that rate, it would > take 1000 minutes (or about 16 hours) to move 1 TiB of data. > > However, there are cases where some items of data can take *much* > longer to move. The biggest of these is when you have lots of > snapshots. When that happens, some (but not all) of the metadata can > take a very long time. In my case, with a couple of hundred snapshots, > some metadata chunks take 4+ hours to move. Thanks for that info. The 1min per 1GiB is what I saw too - the “it can take longer” wasn’t really explainable to me. As I’m not using snapshots: would large files (100+gb) with long chains of CoW history (specifically reflink copies) also hurt? Something I’d like to verify: does having traffic on the volume have the potential to delay this infinitely? I.e. does the system write to any segments that we’re trying to free so it may have to work on the same chunk over and over again? If not, then this means it’s just slow and we’re looking forward to about 2 months worth of time shrinking this volume. (And then again on the next bigger server probably about 3-4 months). (Background info: we’re migrating large volumes from btrfs to xfs and can only do this step by step: copying some data, shrinking the btrfs volume, extending the xfs volume, rinse repeat. If someone should have any suggestions to speed this up and not having to think in terms of _months_ then I’m all ears.) Cheers, Christian -- Christian Theune · ct@flyingcircus.io · +49 345 219401 0 Flying Circus Internet Operations GmbH · http://flyingcircus.io Forsterstraße 29 · 06112 Halle (Saale) · Deutschland HR Stendal HRB 21169 · Geschäftsführer: Christian. Theune, Christian. Zagrodnick [-- Attachment #2: Message signed with OpenPGP --] [-- Type: application/pgp-signature, Size: 496 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
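For later readers, one round of the copy/shrink/grow cycle described above would look roughly like the following. This is a sketch only: the mount points, the vgsys LV names and the 2 TiB step are invented, and the btrfs resize is the slow step this thread is about. Order matters: the btrfs filesystem must be shrunk before the LV under it, and on the XFS side the LV must be grown before the filesystem (reducing an LV below the size of the filesystem on it destroys data).

    rsync -aHAX /srv/btrfs-backy/batch-1/ /srv/xfs-backy/batch-1/  # copy one batch of backups over
    btrfs filesystem resize -2T /srv/btrfs-backy                   # shrink the btrfs FS inside its LV (slow)
    lvreduce -L -2T vgsys/backy                                    # shrink the LV underneath to match
    lvextend -L +2T vgsys/backy-xfs                                # hand the freed extents to the XFS LV
    xfs_growfs /srv/xfs-backy                                      # grow XFS into the new space (online)

In practice one would shrink the filesystem slightly more than the LV (or use exact byte sizes) to keep a safety margin between the two steps.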
* Re: Shrinking a device - performance? 2017-03-27 13:20 ` Christian Theune @ 2017-03-27 13:24 ` Hugo Mills 2017-03-27 13:46 ` Austin S. Hemmelgarn 2017-03-27 14:48 ` Roman Mamedov 1 sibling, 1 reply; 42+ messages in thread From: Hugo Mills @ 2017-03-27 13:24 UTC (permalink / raw) To: Christian Theune; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2476 bytes --] On Mon, Mar 27, 2017 at 03:20:37PM +0200, Christian Theune wrote: > Hi, > > > On Mar 27, 2017, at 3:07 PM, Hugo Mills <hugo@carfax.org.uk> wrote: > > > > On my hardware (consumer HDDs and SATA, RAID-1 over 6 devices), it > > takes about a minute to move 1 GiB of data. At that rate, it would > > take 1000 minutes (or about 16 hours) to move 1 TiB of data. > > > > However, there are cases where some items of data can take *much* > > longer to move. The biggest of these is when you have lots of > > snapshots. When that happens, some (but not all) of the metadata can > > take a very long time. In my case, with a couple of hundred snapshots, > > some metadata chunks take 4+ hours to move. > Thanks for that info. The 1min per 1GiB is what I saw too - the “it > can take longer” wasn’t really explainable to me. > As I’m not using snapshots: would large files (100+gb) with long > chains of CoW history (specifically reflink copies) also hurt? Yes, that's the same issue -- it's to do with the number of times an extent is shared. Snapshots are one way of creating that sharing, reflinks are another. > Something I’d like to verify: does having traffic on the volume have > the potential to delay this infinitely? I.e. does the system write > to any segments that we’re trying to free so it may have to work on > the same chunk over and over again? If not, then this means it’s > just slow and we’re looking forward to about 2 months worth of time > shrinking this volume. (And then again on the next bigger server > probably about 3-4 months). I don't know. I would hope not, but I simply don't know enough about the internal algorithms for that. Maybe someone else can confirm? > (Background info: we’re migrating large volumes from btrfs to xfs > and can only do this step by step: copying some data, shrinking the > btrfs volume, extending the xfs volume, rinse repeat. If someone > should have any suggestions to speed this up and not having to think > in terms of _months_ then I’m all ears.) All I can suggest is to move some unused data off the volume and do it in fewer larger steps. Sorry. Hugo. -- Hugo Mills | Jenkins! Chap with the wings there! Five rounds hugo@... carfax.org.uk | rapid! http://carfax.org.uk/ | Brigadier Alistair Lethbridge-Stewart PGP: E2AB1DE4 | Dr Who and the Daemons [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-03-27 13:24 ` Hugo Mills @ 2017-03-27 13:46 ` Austin S. Hemmelgarn 2017-03-27 13:50 ` Christian Theune 0 siblings, 1 reply; 42+ messages in thread From: Austin S. Hemmelgarn @ 2017-03-27 13:46 UTC (permalink / raw) To: Hugo Mills, Christian Theune, linux-btrfs On 2017-03-27 09:24, Hugo Mills wrote: > On Mon, Mar 27, 2017 at 03:20:37PM +0200, Christian Theune wrote: >> Hi, >> >>> On Mar 27, 2017, at 3:07 PM, Hugo Mills <hugo@carfax.org.uk> wrote: >>> >>> On my hardware (consumer HDDs and SATA, RAID-1 over 6 devices), it >>> takes about a minute to move 1 GiB of data. At that rate, it would >>> take 1000 minutes (or about 16 hours) to move 1 TiB of data. >>> >>> However, there are cases where some items of data can take *much* >>> longer to move. The biggest of these is when you have lots of >>> snapshots. When that happens, some (but not all) of the metadata can >>> take a very long time. In my case, with a couple of hundred snapshots, >>> some metadata chunks take 4+ hours to move. > >> Thanks for that info. The 1min per 1GiB is what I saw too - the “it >> can take longer” wasn’t really explainable to me. > >> As I’m not using snapshots: would large files (100+gb) with long >> chains of CoW history (specifically reflink copies) also hurt? > > Yes, that's the same issue -- it's to do with the number of times > an extent is shared. Snapshots are one way of creating that sharing, > reflinks are another. FWIW, I've noticed less of an issue with reflinks than snapshots, but I can't comment on this specific case. > >> Something I’d like to verify: does having traffic on the volume have >> the potential to delay this infinitely? I.e. does the system write >> to any segments that we’re trying to free so it may have to work on >> the same chunk over and over again? If not, then this means it’s >> just slow and we’re looking forward to about 2 months worth of time >> shrinking this volume. (And then again on the next bigger server >> probably about 3-4 months). > > I don't know. I would hope not, but I simply don't know enough > about the internal algorithms for that. Maybe someone else can confirm? I'm not 100% certain, but I believe that while it can delay things, it can't do so infinitely. AFAICT from looking at the code (disclaimer: I am not a C programmer by profession), it looks like writes to chunks that are being compacted or moved will go to the new location, not the old one, but writes to chunks which aren't being touched by the resize currently will just go to where the chunk is currently. Based on this, lowering the amount of traffic to the FS could probably speed things up a bit, but it likely won't help much. > >> (Background info: we’re migrating large volumes from btrfs to xfs >> and can only do this step by step: copying some data, shrinking the >> btrfs volume, extending the xfs volume, rinse repeat. If someone >> should have any suggestions to speed this up and not having to think >> in terms of _months_ then I’m all ears.) > > All I can suggest is to move some unused data off the volume and do > it in fewer larger steps. Sorry. Same. The other option though is to just schedule a maintenance window, nuke the old FS, and restore from a backup. If you can afford to take the system off-line temporarily, this will almost certainly go faster (assuming you have a reasonably fast means of restoring backups). ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-03-27 13:46 ` Austin S. Hemmelgarn @ 2017-03-27 13:50 ` Christian Theune 2017-03-27 13:54 ` Christian Theune 2017-03-27 14:14 ` Austin S. Hemmelgarn 0 siblings, 2 replies; 42+ messages in thread From: Christian Theune @ 2017-03-27 13:50 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Hugo Mills, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2446 bytes --] Hi, > On Mar 27, 2017, at 3:46 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: >> >>> Something I’d like to verify: does having traffic on the volume have >>> the potential to delay this infinitely? I.e. does the system write >>> to any segments that we’re trying to free so it may have to work on >>> the same chunk over and over again? If not, then this means it’s >>> just slow and we’re looking forward to about 2 months worth of time >>> shrinking this volume. (And then again on the next bigger server >>> probably about 3-4 months). >> >> I don't know. I would hope not, but I simply don't know enough >> about the internal algorithms for that. Maybe someone else can confirm? > I'm not 100% certain, but I believe that while it can delay things, it can't do so infinitely. AFAICT from looking at the code (disclaimer: I am not a C programmer by profession), it looks like writes to chunks that are being compacted or moved will go to the new location, not the old one, but writes to chunks which aren't being touched by the resize currently will just go to where the chunk is currently. Based on this, lowering the amount of traffic to the FS could probably speed things up a bit, but it likely won't help much. I hoped that this is the strategy implemented, otherwise it would end up in an infinite cat-and-mouse game. ;) >>> (Background info: we’re migrating large volumes from btrfs to xfs >>> and can only do this step by step: copying some data, shrinking the >>> btrfs volume, extending the xfs volume, rinse repeat. If someone >>> should have any suggestions to speed this up and not having to think >>> in terms of _months_ then I’m all ears.) >> >> All I can suggest is to move some unused data off the volume and do >> it in fewer larger steps. Sorry. > Same. > > The other option though is to just schedule a maintenance window, nuke the old FS, and restore from a backup. If you can afford to take the system off-line temporarily, this will almost certainly go faster (assuming you have a reasonably fast means of restoring backups). Well. This is the backup. ;) Thanks, Christian -- Christian Theune · ct@flyingcircus.io · +49 345 219401 0 Flying Circus Internet Operations GmbH · http://flyingcircus.io Forsterstraße 29 · 06112 Halle (Saale) · Deutschland HR Stendal HRB 21169 · Geschäftsführer: Christian. Theune, Christian. Zagrodnick [-- Attachment #2: Message signed with OpenPGP --] [-- Type: application/pgp-signature, Size: 496 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-03-27 13:50 ` Christian Theune @ 2017-03-27 13:54 ` Christian Theune 2017-03-27 14:17 ` Austin S. Hemmelgarn 2017-03-27 14:14 ` Austin S. Hemmelgarn 1 sibling, 1 reply; 42+ messages in thread From: Christian Theune @ 2017-03-27 13:54 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Hugo Mills, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2898 bytes --] Hi, > On Mar 27, 2017, at 3:50 PM, Christian Theune <ct@flyingcircus.io> wrote: > > Hi, > >> On Mar 27, 2017, at 3:46 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: >>> >>>> Something I’d like to verify: does having traffic on the volume have >>>> the potential to delay this infinitely? I.e. does the system write >>>> to any segments that we’re trying to free so it may have to work on >>>> the same chunk over and over again? If not, then this means it’s >>>> just slow and we’re looking forward to about 2 months worth of time >>>> shrinking this volume. (And then again on the next bigger server >>>> probably about 3-4 months). >>> >>> I don't know. I would hope not, but I simply don't know enough >>> about the internal algorithms for that. Maybe someone else can confirm? >> I'm not 100% certain, but I believe that while it can delay things, it can't do so infinitely. AFAICT from looking at the code (disclaimer: I am not a C programmer by profession), it looks like writes to chunks that are being compacted or moved will go to the new location, not the old one, but writes to chunks which aren't being touched by the resize currently will just go to where the chunk is currently. Based on this, lowering the amount of traffic to the FS could probably speed things up a bit, but it likely won't help much. > > I hoped that this is the strategy implemented, otherwise it would end up in an infinite cat-and-mouse game. ;) > >>>> (Background info: we’re migrating large volumes from btrfs to xfs >>>> and can only do this step by step: copying some data, shrinking the >>>> btrfs volume, extending the xfs volume, rinse repeat. If someone >>>> should have any suggestions to speed this up and not having to think >>>> in terms of _months_ then I’m all ears.) >>> >>> All I can suggest is to move some unused data off the volume and do >>> it in fewer larger steps. Sorry. >> Same. >> >> The other option though is to just schedule a maintenance window, nuke the old FS, and restore from a backup. If you can afford to take the system off-line temporarily, this will almost certainly go faster (assuming you have a reasonably fast means of restoring backups). > > Well. This is the backup. ;) One strategy that does come to mind: we’re converting our backup from a system that uses reflinks to a non-reflink based system. We can convert this in place so this would remove all the reflink stuff in the existing filesystem and then we maybe can do the FS conversion faster when this isn’t an issue any longer. I think I’ll Christian -- Christian Theune · ct@flyingcircus.io · +49 345 219401 0 Flying Circus Internet Operations GmbH · http://flyingcircus.io Forsterstraße 29 · 06112 Halle (Saale) · Deutschland HR Stendal HRB 21169 · Geschäftsführer: Christian. Theune, Christian. Zagrodnick [-- Attachment #2: Message signed with OpenPGP --] [-- Type: application/pgp-signature, Size: 496 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-03-27 13:54 ` Christian Theune @ 2017-03-27 14:17 ` Austin S. Hemmelgarn 2017-03-27 14:49 ` Christian Theune 0 siblings, 1 reply; 42+ messages in thread From: Austin S. Hemmelgarn @ 2017-03-27 14:17 UTC (permalink / raw) To: Christian Theune; +Cc: Hugo Mills, linux-btrfs On 2017-03-27 09:54, Christian Theune wrote: > Hi, > >> On Mar 27, 2017, at 3:50 PM, Christian Theune <ct@flyingcircus.io> wrote: >> >> Hi, >> >>> On Mar 27, 2017, at 3:46 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: >>>> >>>>> Something I’d like to verify: does having traffic on the volume have >>>>> the potential to delay this infinitely? I.e. does the system write >>>>> to any segments that we’re trying to free so it may have to work on >>>>> the same chunk over and over again? If not, then this means it’s >>>>> just slow and we’re looking forward to about 2 months worth of time >>>>> shrinking this volume. (And then again on the next bigger server >>>>> probably about 3-4 months). >>>> >>>> I don't know. I would hope not, but I simply don't know enough >>>> about the internal algorithms for that. Maybe someone else can confirm? >>> I'm not 100% certain, but I believe that while it can delay things, it can't do so infinitely. AFAICT from looking at the code (disclaimer: I am not a C programmer by profession), it looks like writes to chunks that are being compacted or moved will go to the new location, not the old one, but writes to chunks which aren't being touched by the resize currently will just go to where the chunk is currently. Based on this, lowering the amount of traffic to the FS could probably speed things up a bit, but it likely won't help much. >> >> I hoped that this is the strategy implemented, otherwise it would end up in an infinite cat-and-mouse game. ;) >> >>>>> (Background info: we’re migrating large volumes from btrfs to xfs >>>>> and can only do this step by step: copying some data, shrinking the >>>>> btrfs volume, extending the xfs volume, rinse repeat. If someone >>>>> should have any suggestions to speed this up and not having to think >>>>> in terms of _months_ then I’m all ears.) >>>> >>>> All I can suggest is to move some unused data off the volume and do >>>> it in fewer larger steps. Sorry. >>> Same. >>> >>> The other option though is to just schedule a maintenance window, nuke the old FS, and restore from a backup. If you can afford to take the system off-line temporarily, this will almost certainly go faster (assuming you have a reasonably fast means of restoring backups). >> >> Well. This is the backup. ;) > > One strategy that does come to mind: we’re converting our backup from a system that uses reflinks to a non-reflink based system. We can convert this in place so this would remove all the reflink stuff in the existing filesystem and then we maybe can do the FS conversion faster when this isn’t an issue any longer. I think I’ll One other thing that I just thought of: For a backup system, assuming some reasonable thinning system is used for the backups, I would personally migrate things slowly over time by putting new backups on the new filesystem, and shrinking the old filesystem as the old backups there get cleaned out. Unfortunately, most backup software I've seen doesn't handle this well, so it's not all that easy to do, but it does save you from having to migrate data off of the old filesystem, and means you don't have to worry as much about the resize of the old FS taking forever. 
* Re: Shrinking a device - performance? 2017-03-27 14:17 ` Austin S. Hemmelgarn @ 2017-03-27 14:49 ` Christian Theune 2017-03-27 15:06 ` Roman Mamedov 0 siblings, 1 reply; 42+ messages in thread From: Christian Theune @ 2017-03-27 14:49 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Hugo Mills, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 1948 bytes --] Hi, > On Mar 27, 2017, at 4:17 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > > One other thing that I just thought of: > For a backup system, assuming some reasonable thinning system is used for the backups, I would personally migrate things slowly over time by putting new backups on the new filesystem, and shrinking the old filesystem as the old backups there get cleaned out. Unfortunately, most backup software I've seen doesn't handle this well, so it's not all that easy to do, but it does save you from having to migrate data off of the old filesystem, and means you don't have to worry as much about the resize of the old FS taking forever. Right. This is an option we can do from a software perspective (our own solution - https://bitbucket.org/flyingcircus/backy) but our systems in use can’t hold all the data twice. Even though we’re migrating to a backend implementation that uses less data than before I have to perform an “inplace” migration in some way. This is VM block device backup. So basically we migrate one VM with all its previous data and that works quite fine with a little headroom. However, migrating all VMs to a new “full” backup and then wait for the old to shrink would only work if we had a completely empty backup server in place, which we don’t. Also: the idea of migrating on btrfs also has its downside - the performance of “mkdir” and “fsync” is abysmal at the moment. I’m waiting for the current shrinking job to finish but this is likely limited to the “find free space” algorithm. We’re talking about a few megabytes converted per second. Sigh. Cheers, Christian Theune -- Christian Theune · ct@flyingcircus.io · +49 345 219401 0 Flying Circus Internet Operations GmbH · http://flyingcircus.io Forsterstraße 29 · 06112 Halle (Saale) · Deutschland HR Stendal HRB 21169 · Geschäftsführer: Christian. Theune, Christian. Zagrodnick [-- Attachment #2: Message signed with OpenPGP --] [-- Type: application/pgp-signature, Size: 496 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-03-27 14:49 ` Christian Theune @ 2017-03-27 15:06 ` Roman Mamedov 2017-04-01 9:05 ` Kai Krakow 0 siblings, 1 reply; 42+ messages in thread From: Roman Mamedov @ 2017-03-27 15:06 UTC (permalink / raw) To: Christian Theune; +Cc: Austin S. Hemmelgarn, Hugo Mills, linux-btrfs On Mon, 27 Mar 2017 16:49:47 +0200 Christian Theune <ct@flyingcircus.io> wrote: > Also: the idea of migrating on btrfs also has its downside - the performance of “mkdir” and “fsync” is abysmal at the moment. I’m waiting for the current shrinking job to finish but this is likely limited to the “find free space” algorithm. We’re talking about a few megabytes converted per second. Sigh. Btw since this is all on LVM already, you could set up lvmcache with a small SSD-based cache volume. Even some old 60GB SSD would work wonders for performance, and with the cache policy of "writethrough" you don't have to worry about its reliability (much). -- With respect, Roman ^ permalink raw reply [flat|nested] 42+ messages in thread
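For reference, the lvmcache setup suggested above would look something like this. It is a sketch with invented names (vgsys/backy as the slow origin LV, /dev/sdX as the small SSD, 55G to leave some headroom on an old 60GB drive); see lvmcache(7) for details and for how to detach the cache again (lvconvert --uncache vgsys/backy):

    vgextend vgsys /dev/sdX                                    # add the SSD to the existing volume group
    lvcreate -L 55G -n backy_cache      vgsys /dev/sdX         # cache data LV on the SSD
    lvcreate -L 1G  -n backy_cache_meta vgsys /dev/sdX         # cache metadata LV on the SSD
    lvconvert --type cache-pool --poolmetadata vgsys/backy_cache_meta vgsys/backy_cache
    lvconvert --type cache --cachemode writethrough --cachepool vgsys/backy_cache vgsys/backy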
* Re: Shrinking a device - performance? 2017-03-27 15:06 ` Roman Mamedov @ 2017-04-01 9:05 ` Kai Krakow 0 siblings, 0 replies; 42+ messages in thread From: Kai Krakow @ 2017-04-01 9:05 UTC (permalink / raw) To: linux-btrfs Am Mon, 27 Mar 2017 20:06:46 +0500 schrieb Roman Mamedov <rm@romanrm.net>: > On Mon, 27 Mar 2017 16:49:47 +0200 > Christian Theune <ct@flyingcircus.io> wrote: > > > Also: the idea of migrating on btrfs also has its downside - the > > performance of “mkdir” and “fsync” is abysmal at the moment. I’m > > waiting for the current shrinking job to finish but this is likely > > limited to the “find free space” algorithm. We’re talking about a > > few megabytes converted per second. Sigh. > > Btw since this is all on LVM already, you could set up lvmcache with > a small SSD-based cache volume. Even some old 60GB SSD would work > wonders for performance, and with the cache policy of "writethrough" > you don't have to worry about its reliability (much). That's maybe the best recommendation to speed things up. I'm using bcache here for the same reasons (speeding up random workloads) and it works wonders. Tho, for such big storage I'd maybe recommend a bigger SSD and a new one. Bigger SSDs tend to last much longer. Just don't use the whole of it to allow for better wear leveling and you'll get a final setup that can serve the system much longer than for the period of migration. -- Regards, Kai Replies to list-only preferred. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-03-27 13:50 ` Christian Theune 2017-03-27 13:54 ` Christian Theune @ 2017-03-27 14:14 ` Austin S. Hemmelgarn 1 sibling, 0 replies; 42+ messages in thread From: Austin S. Hemmelgarn @ 2017-03-27 14:14 UTC (permalink / raw) To: Christian Theune; +Cc: Hugo Mills, linux-btrfs On 2017-03-27 09:50, Christian Theune wrote: > Hi, > >> On Mar 27, 2017, at 3:46 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: >>> >>>> Something I’d like to verify: does having traffic on the volume have >>>> the potential to delay this infinitely? I.e. does the system write >>>> to any segments that we’re trying to free so it may have to work on >>>> the same chunk over and over again? If not, then this means it’s >>>> just slow and we’re looking forward to about 2 months worth of time >>>> shrinking this volume. (And then again on the next bigger server >>>> probably about 3-4 months). >>> >>> I don't know. I would hope not, but I simply don't know enough >>> about the internal algorithms for that. Maybe someone else can confirm? >> I'm not 100% certain, but I believe that while it can delay things, it can't do so infinitely. AFAICT from looking at the code (disclaimer: I am not a C programmer by profession), it looks like writes to chunks that are being compacted or moved will go to the new location, not the old one, but writes to chunks which aren't being touched by the resize currently will just go to where the chunk is currently. Based on this, lowering the amount of traffic to the FS could probably speed things up a bit, but it likely won't help much. > > I hoped that this is the strategy implemented, otherwise it would end up in an infinite cat-and-mouse game. ;) I know that balance and replace work this way, and the code for resize appears to handle things similarly to both, so I'm pretty certain it works this way. TBH though, it's really the only sane way to handle something like this. > >>>> (Background info: we’re migrating large volumes from btrfs to xfs >>>> and can only do this step by step: copying some data, shrinking the >>>> btrfs volume, extending the xfs volume, rinse repeat. If someone >>>> should have any suggestions to speed this up and not having to think >>>> in terms of _months_ then I’m all ears.) >>> >>> All I can suggest is to move some unused data off the volume and do >>> it in fewer larger steps. Sorry. >> Same. >> >> The other option though is to just schedule a maintenance window, nuke the old FS, and restore from a backup. If you can afford to take the system off-line temporarily, this will almost certainly go faster (assuming you have a reasonably fast means of restoring backups). > > Well. This is the backup. ;) Ah, yeah, that does complicate things a bit more. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-03-27 13:20 ` Christian Theune 2017-03-27 13:24 ` Hugo Mills @ 2017-03-27 14:48 ` Roman Mamedov 2017-03-27 14:53 ` Christian Theune 1 sibling, 1 reply; 42+ messages in thread From: Roman Mamedov @ 2017-03-27 14:48 UTC (permalink / raw) To: Christian Theune; +Cc: Hugo Mills, linux-btrfs On Mon, 27 Mar 2017 15:20:37 +0200 Christian Theune <ct@flyingcircus.io> wrote: > (Background info: we’re migrating large volumes from btrfs to xfs and can > only do this step by step: copying some data, shrinking the btrfs volume, > extending the xfs volume, rinse repeat. If someone should have any > suggestions to speed this up and not having to think in terms of _months_ > then I’m all ears.) I would only suggest that you reconsider XFS. You can't shrink XFS, therefore you won't have the flexibility to migrate in the same way to anything better that comes along in the future (ZFS perhaps? or even Bcachefs?). XFS does not perform that much better over Ext4, and very importantly, Ext4 can be shrunk. From the looks of it Ext4 has also overcome its 16TB limitation: http://askubuntu.com/questions/779754/how-do-i-resize-an-ext4-partition-beyond-the-16tb-limit -- With respect, Roman ^ permalink raw reply [flat|nested] 42+ messages in thread
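For anyone who does go the Ext4 route: what the linked answer boils down to is that growing past 16TiB needs the 64bit feature, which new filesystems get at mkfs time and existing ones can have enabled offline with e2fsprogs 1.43 or newer. A sketch with invented device names:

    mkfs.ext4 -O 64bit /dev/vgsys/newvol       # new filesystem: 64bit from the start
    # or, for an existing ext4 created without it (must be unmounted):
    umount /srv/vol
    e2fsck -f /dev/vgsys/vol
    resize2fs -b /dev/vgsys/vol                # switch on the 64bit feature
    resize2fs /dev/vgsys/vol                   # then grow into the enlarged LV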
* Re: Shrinking a device - performance? 2017-03-27 14:48 ` Roman Mamedov @ 2017-03-27 14:53 ` Christian Theune 2017-03-28 14:43 ` Peter Grandi 0 siblings, 1 reply; 42+ messages in thread From: Christian Theune @ 2017-03-27 14:53 UTC (permalink / raw) To: Roman Mamedov; +Cc: Hugo Mills, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 1988 bytes --] Hi, > On Mar 27, 2017, at 4:48 PM, Roman Mamedov <rm@romanrm.net> wrote: > > On Mon, 27 Mar 2017 15:20:37 +0200 > Christian Theune <ct@flyingcircus.io> wrote: > >> (Background info: we’re migrating large volumes from btrfs to xfs and can >> only do this step by step: copying some data, shrinking the btrfs volume, >> extending the xfs volume, rinse repeat. If someone should have any >> suggestions to speed this up and not having to think in terms of _months_ >> then I’m all ears.) > > I would only suggest that you reconsider XFS. You can't shrink XFS, therefore > you won't have the flexibility to migrate in the same way to anything better > that comes along in the future (ZFS perhaps? or even Bcachefs?). XFS does not > perform that much better over Ext4, and very importantly, Ext4 can be shrunk. That is true. However, we have moved the expected feature set of the filesystem (i.e. cow) down to “store files safely and reliably” and we’ve seen too much breakage with ext4 in the past. Of course “persistence means you’ll have to say I’m sorry” and thus with either choice we may be faced with some issue in the future that we might have circumvented with another solution, and yes, flexibility is worth a great deal. We’ve run XFS and ext4 on different (large and small) workloads in the last 2 years and I have to say I’m much happier with XFS even with the shrinking limitation. To us ext4 is prohibitive with its fsck performance and we do like the tight error checking in XFS. Thanks for the reminder though - especially in the public archive, making this tradeoff with flexibility known is wise to communicate. :-) Hugs, Christian -- Christian Theune · ct@flyingcircus.io · +49 345 219401 0 Flying Circus Internet Operations GmbH · http://flyingcircus.io Forsterstraße 29 · 06112 Halle (Saale) · Deutschland HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick [-- Attachment #2: Message signed with OpenPGP --] [-- Type: application/pgp-signature, Size: 496 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-03-27 14:53 ` Christian Theune @ 2017-03-28 14:43 ` Peter Grandi 2017-03-28 14:50 ` Tomasz Kusmierz ` (3 more replies) 0 siblings, 4 replies; 42+ messages in thread From: Peter Grandi @ 2017-03-28 14:43 UTC (permalink / raw) To: Linux fs Btrfs This is going to be long because I am writing something detailed hoping pointlessly that someone in the future will find it by searching the list archives while doing research before setting up a new storage system, and they will be the kind of person that tolerates reading messages longer than Twitter. :-). > I’m currently shrinking a device and it seems that the > performance of shrink is abysmal. When I read this kind of statement I am reminded of all the cases where someone left me to decatastrophize a storage system built on "optimistic" assumptions. The usual "optimism" is what I call the "syntactic approach", that is the axiomatic belief that any syntactically valid combination of features not only will "work", but very fast too and reliably despite slow cheap hardware and "unattentive" configuration. Some people call that the expectation that system developers provide or should provide an "O_PONIES" option. In particular I get very saddened when people use "performance" to mean "speed", as the difference between the two is very great. As a general consideration, shrinking a large filetree online in-place is an amazingly risky, difficult, slow operation and should be a last desperate resort (as apparently in this case), regardless of the filesystem type, and expecting otherwise is "optimistic". My guess is that very complex risky slow operations like that are provided by "clever" filesystem developers for "marketing" purposes, to win box-ticking competitions. That applies to those system developers who do know better; I suspect that even some filesystem developers are "optimistic" as to what they can actually achieve. > I intended to shrink a ~22TiB filesystem down to 20TiB. This is > still using LVM underneath so that I can’t just remove a device > from the filesystem but have to use the resize command. That is actually a very good idea because Btrfs multi-device is not quite as reliable as DM/LVM2 multi-device. > Label: 'backy' uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4 > Total devices 1 FS bytes used 18.21TiB > devid 1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy Maybe 'balance' should have been used a bit more. > This has been running since last Thursday, so roughly 3.5days > now. The “used” number in devid1 has moved about 1TiB in this > time. The filesystem is seeing regular usage (read and write) > and when I’m suspending any application traffic I see about > 1GiB of movement every now and then. Maybe once every 30 > seconds or so. Does this sound fishy or normal to you? With consistent "optimism" this is a request to assess whether "performance" of some operations is adequate on a filetree without telling us either what the filetree contents look like, what the regular workload is, or what the storage layer looks like. Being one of the few system administrators crippled by lack of psychic powers :-), I rely on guesses and inferences here, and having read the whole thread containing some belated details. >From the ~22TB total capacity my guess is that the storage layer involves rotating hard disks, and from later details the filesystem contents seems to be heavily reflinked files of several GB in size, and workload seems to be backups to those files from several source hosts. 
Considering the general level of "optimism" in the situation my wild guess is that the storage layer is based on large slow cheap rotating disks in teh 4GB-8GB range, with very low IOPS-per-TB. > Thanks for that info. The 1min per 1GiB is what I saw too - > the “it can take longer” wasn’t really explainable to me. A contemporary rotating disk device can do around 0.5MB/s transfer rate with small random accesses with barriers up to around 80-160MB/s in purely sequential access without barriers. 1GB/m of simultaneous read-write means around 16MB/s reads plus 16MB/s writes which is fairly good *performance* (even if slow *speed*) considering that moving extents around, even across disks, involves quite a bit of randomish same-disk updates of metadata; because it all depends usually on how much randomish metadata updates need to done, on any filesystem type, as those must be done with barriers. > As I’m not using snapshots: would large files (100+gb) Using 100GB sized VM virtual disks (never mind with COW) seems very unwise to me to start with, but of course a lot of other people know better :-). Just like a lot of other people know better that large single pool storage systems are awesome in every respect :-): cost, reliability, speed, flexibility, maintenance, etc. > with long chains of CoW history (specifically reflink copies) > also hurt? Oh yes... They are about one of the worst cases for using Btrfs. But also very "optimistic" to think that kind of stuff can work awesomely on *any* filesystem type. > Something I’d like to verify: does having traffic on the > volume have the potential to delay this infinitely? [ ... ] > it’s just slow and we’re looking forward to about 2 months > worth of time shrinking this volume. (And then again on the > next bigger server probably about 3-4 months). Those are pretty typical times for whole-filesystem operations like that on rotating disk media. There are some reports in the list and IRC channel archives to 'scrub' or 'balance' or 'check' times for filetrees of that size. > (Background info: we’re migrating large volumes from btrfs to > xfs and can only do this step by step: copying some data, > shrinking the btrfs volume, extending the xfs volume, rinse > repeat. That "extending the xfs volume" will have consequences too, but not too bad hopefully. > If someone should have any suggestions to speed this up and > not having to think in terms of _months_ then I’m all ears.) High IOPS-per-TB enterprise SSDs with capacitor backed caches :-). > One strategy that does come to mind: we’re converting our > backup from a system that uses reflinks to a non-reflink based > system. We can convert this in place so this would remove all > the reflink stuff in the existing filesystem Do you have enough space to do that? Either your reflinks are pointless or they are saving a lot of storage. But I guess that you can do it one 100GB file at a time... > and then we maybe can do the FS conversion faster when this > isn’t an issue any longer. I think I’ll I suspect the de-reflinking plus shrinking will take longer, but not totally sure. > Right. This is wan option we can do from a software perspective > (our own solution - https://bitbucket.org/flyingcircus/backy) Many thanks for sharing your system, I'll have a look. > but our systems in use can’t hold all the data twice. Even > though we’re migrating to a backend implementation that uses > less data than before I have to perform an “inplace” migration > in some way. This is VM block device backup. 
So basically we > migrate one VM with all its previous data and that works quite > fine with a little headroom. However, migrating all VMs to a > new “full” backup and then wait for the old to shrink would > only work if we had a completely empty backup server in place, > which we don’t. > Also: the idea of migrating on btrfs also has its downside - > the performance of “mkdir” and “fsync” is abysmal at the > moment. That *performance* is pretty good indeed, it is the *speed* that may be low, but that's obvious. Please consider looking at these entirely typical speeds: http://www.sabi.co.uk/blog/17-one.html?170302#170302 http://www.sabi.co.uk/blog/17-one.html?170228#170228 > I’m waiting for the current shrinking job to finish but this > is likely limited to the “find free space” algorithm. We’re > talking about a few megabytes converted per second. Sigh. Well, if the filetree is being actively used for COW backups while being shrunk that involves a lot of randomish IO with barriers. >> I would only suggest that you reconsider XFS. You can't >> shrink XFS, therefore you won't have the flexibility to >> migrate in the same way to anything better that comes along >> in the future (ZFS perhaps? or even Bcachefs?). XFS does not >> perform that much better over Ext4, and very importantly, >> Ext4 can be shrunk. ZFS is a complicated mess too with an intensely anisotropic performance envelope too and not necessarily that good for backup archival for various reasons. I would consider looking instead at using a collection of smaller "silo" JFS, F2FS, NILFS2 filetrees as well as XFS, and using MD RAID in RAID10 mode instead of DM/LVM2: http://www.sabi.co.uk/blog/16-two.html?161217#161217 http://www.sabi.co.uk/blog/17-one.html?170107#170107 http://www.sabi.co.uk/blog/12-fou.html?121223#121223 http://www.sabi.co.uk/blog/12-fou.html?121218b#121218b http://www.sabi.co.uk/blog/12-fou.html?121218#121218 and yes, Bcachefs looks promising, but I am sticking with Btrfs: https://lwn.net/Articles/717379 > That is true. However, we do have moved the expected feature > set of the filesystem (i.e. cow) That feature set is arguably not appropriate for VM images, but lots of people know better :-). > down to “store files safely and reliably” and we’ve seen too > much breakage with ext4 in the past. That is extremely unlikely unless your storage layer has unreliable barriers, and then you need a lot of "optimism". > Of course “persistence means you’ll have to say I’m sorry” and > thus with either choice we may be faced with some issue in the > future that we might have circumvented with another solution > and yes flexibility is worth a great deal. Enterprise SSDs with high small-random-write IOPS-per-TB can give both excellent speed and high flexibility :-). > We’ve run XFS and ext4 on different (large and small) > workloads in the last 2 years and I have to say I’m much more > happy about XFS even with the shrinking limitation. XFS and 'ext4' are essentially equivalent, except for the fixed-size inode table limitation of 'ext4' (and XFS reportedly has finer grained locking). Btrfs is nearly as good as either on most workloads is single-device mode without using the more complicated features (compression, qgroups, ...) and with appropriate use of the 'nowcow' options, and gives checksums on data too if needed. > To us ext4 is prohibitive with it’s fsck performance and we do > like the tight error checking in XFS. 
It is very pleasing to see someone care about the speed of whole-tree operations like 'fsck', a very often forgotten "little detail". But in my experience 'ext4' checking is quite competitive with XFS checking and repair, at least in recent years, as both have been hugely improved. XFS checking and repair still require a lot of RAM though. > Thanks for the reminder though - especially in the public > archive making this tradeoff with flexibility known is wise to > communicate. :-) "Flexibility" in filesystems, especially on rotating disk storage with extremely anisotropic performance envelopes, is very expensive, but of course lots of people know better :-). ^ permalink raw reply [flat|nested] 42+ messages in thread
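On the “nowcow” remark above: in practice that means either the nodatacow mount option or the No_COW attribute on the directories holding the VM images. A sketch (the path is invented); note that the attribute only affects files created after it is set, and that it disables data checksumming and compression for those files:

    mkdir -p /srv/backy/images
    chattr +C /srv/backy/images     # new files created in here inherit No_COW
    lsattr -d /srv/backy/images     # should now show the 'C' attribute
    # alternatively, for the whole filesystem:
    # mount -o nodatacow /dev/mapper/vgsys-backy /srv/backy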
* Re: Shrinking a device - performance? 2017-03-28 14:43 ` Peter Grandi @ 2017-03-28 14:50 ` Tomasz Kusmierz 2017-03-28 15:06 ` Peter Grandi 0 siblings, 1 reply; 42+ messages in thread From: Tomasz Kusmierz @ 2017-03-28 14:50 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux fs Btrfs I glazed over at “This is going to be long” … :) > On 28 Mar 2017, at 15:43, Peter Grandi <pg@btrfs.for.sabi.co.UK> wrote: > [ ... ] ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-03-28 14:50 ` Tomasz Kusmierz @ 2017-03-28 15:06 ` Peter Grandi 2017-03-28 15:35 ` Tomasz Kusmierz 0 siblings, 1 reply; 42+ messages in thread From: Peter Grandi @ 2017-03-28 15:06 UTC (permalink / raw) To: Linux fs Btrfs > I glazed over at “This is going to be long” … :) >> [ ... ] Not only that, you also top-posted while quoting it pointlessly in its entirety, to the whole mailing list. Well played :-). ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-03-28 15:06 ` Peter Grandi @ 2017-03-28 15:35 ` Tomasz Kusmierz 2017-03-28 16:20 ` Peter Grandi 0 siblings, 1 reply; 42+ messages in thread From: Tomasz Kusmierz @ 2017-03-28 15:35 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux fs Btrfs I’ve glazed over on “Not only that …” … can you make a youtube video of that :)))) > On 28 Mar 2017, at 16:06, Peter Grandi <pg@btrfs.for.sabi.co.UK> wrote: > >> I glazed over at “This is going to be long” … :) >>> [ ... ] > > Not only that, you also top-posted while quoting it pointlessly > in its entirety, to the whole mailing list. Well played :-). It’s because I’m special :* On a real note thanks for giving a f to provide a detailed comment … too much of open source stuff is based on short comments :/ ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-03-28 15:35 ` Tomasz Kusmierz @ 2017-03-28 16:20 ` Peter Grandi 0 siblings, 0 replies; 42+ messages in thread From: Peter Grandi @ 2017-03-28 16:20 UTC (permalink / raw) To: Linux fs Btrfs > I’ve glazed over on “Not only that …” … can you make a youtube > video of that :)) [ ... ] It’s because I’m special :* Well played again, that's a fairly credible impersonation of a node.js/mongodb developer :-). > On a real note thanks [ ... ] too much of open source stuff is > based on short comments :/ Yes... In part that's because the "sw engineering" aspect of programming takes a lot of time that unpaid volunteers sometimes cannot afford to take; in part, though, I have noticed that some free sw authors who do get paid to do free sw act as if they had a policy of obfuscation to protect their turf/jobs. Regardless, mailing lists, IRC channel logs, wikis, personal blogs, search engines allow a mosaic of lore to form, which in part remedies the situation, and here we are :-). ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-03-28 14:43 ` Peter Grandi 2017-03-28 14:50 ` Tomasz Kusmierz @ 2017-03-28 14:59 ` Peter Grandi 2017-03-28 15:20 ` Peter Grandi 2017-03-28 15:56 ` Austin S. Hemmelgarn 2017-03-30 15:00 ` Piotr Pawłow 3 siblings, 1 reply; 42+ messages in thread From: Peter Grandi @ 2017-03-28 14:59 UTC (permalink / raw) To: Linux fs Btrfs > [ ... ] reminded of all the cases where someone left me to > decatastrophize a storage system built on "optimistic" > assumptions. In particular when some "clever" sysadm with a "clever" (or dumb) manager slaps together a large storage system in the cheapest and quickest way knowing that while it is mostly empty it will seem very fast regardless and therefore to have awesome performance, and then the "clever" sysadm disappears surrounded by a halo of glory before the storage system gets full workload and fills up; when that happens usually I get to inherit it. BTW The same technique also can be done with HPC clusters. >> I intended to shrink a ~22TiB filesystem down to 20TiB. This >> is still using LVM underneath so that I can’t just remove a >> device from the filesystem but have to use the resize >> command. >> Label: 'backy' uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4 >> Total devices 1 FS bytes used 18.21TiB >> devid 1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy Ahh it is indeed a filled up storage system now running a full workload. At least it wasn't me who inherited it this time. :-) ^ permalink raw reply [flat|nested] 42+ messages in thread
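For the record, the in-place shrink under discussion is a two-step operation: the filesystem is shrunk first, and only afterwards the LV underneath it. A sketch following the thread's 22TiB-to-20TiB example; the mount point and exact figures are assumptions, not taken from the thread:

$ btrfs filesystem resize 1:-2t /srv/backy   # shrink the fs on devid 1 by 2TiB; this relocation is the slow part
$ lvreduce -L 20t vgsys/backy                # only then shrink the LV, never below the new filesystem size
$ btrfs filesystem resize 1:max /srv/backy   # optionally grow the fs back out to fill the LV exactly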
* Re: Shrinking a device - performance? 2017-03-28 14:59 ` Peter Grandi @ 2017-03-28 15:20 ` Peter Grandi 0 siblings, 0 replies; 42+ messages in thread From: Peter Grandi @ 2017-03-28 15:20 UTC (permalink / raw) To: Linux fs Btrfs > [ ... ] slaps together a large storage system in the cheapest > and quickest way knowing that while it is mostly empty it will > seem very fast regardless and therefore to have awesome > performance, and then the "clever" sysadm disappears surrounded > by a halo of glory before the storage system gets full workload > and fills up; [ ... ] Fortunately or unfortunately Btrfs is particularly suitable for this technique, as it has an enormous number of checkbox-ticking awesome looking features: transparent compression, dynamic add/remove, online balance/scrub, different sized member devices, online grow/shrink, online defrag, limitless scalability, online dedup, arbitrary subvolumes and snapshots, COW and reflinking, online conversion of RAID profiles, ... and one can use all of them at the same time, and for the initial period where volume workload is low and space used not much, it will look absolutely fantastic, cheap, flexible, always available, fast, the work of genius of a very cool sysadm. ^ permalink raw reply [flat|nested] 42+ messages in thread
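As an aside for archive readers, the gap between allocated chunks and actual data quoted earlier (20.71TiB "used" versus 18.21TiB of data) is the sort of thing a filtered balance can claw back before a shrink. A sketch with an assumed mount point:

$ btrfs filesystem usage /srv/backy           # shows how much of the chunk allocation actually holds data
$ btrfs balance start -dusage=50 /srv/backy   # rewrite only data chunks at most 50% full, releasing the rest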
* Re: Shrinking a device - performance? 2017-03-28 14:43 ` Peter Grandi 2017-03-28 14:50 ` Tomasz Kusmierz 2017-03-28 14:59 ` Peter Grandi @ 2017-03-28 15:56 ` Austin S. Hemmelgarn 2017-03-30 15:55 ` Peter Grandi 2017-03-30 15:00 ` Piotr Pawłow 3 siblings, 1 reply; 42+ messages in thread From: Austin S. Hemmelgarn @ 2017-03-28 15:56 UTC (permalink / raw) To: Peter Grandi, Linux fs Btrfs On 2017-03-28 10:43, Peter Grandi wrote: > This is going to be long because I am writing something detailed > hoping pointlessly that someone in the future will find it by > searching the list archives while doing research before setting > up a new storage system, and they will be the kind of person > that tolerates reading messages longer than Twitter. :-). > >> I’m currently shrinking a device and it seems that the >> performance of shrink is abysmal. > > When I read this kind of statement I am reminded of all the > cases where someone left me to decatastrophize a storage system > built on "optimistic" assumptions. The usual "optimism" is what > I call the "syntactic approach", that is the axiomatic belief > that any syntactically valid combination of features not only > will "work", but very fast too and reliably despite slow cheap > hardware and "unattentive" configuration. Some people call that > the expectation that system developers provide or should provide > an "O_PONIES" option. In particular I get very saddened when > people use "performance" to mean "speed", as the difference > between the two is very great. > > As a general consideration, shrinking a large filetree online > in-place is an amazingly risky, difficult, slow operation and > should be a last desperate resort (as apparently in this case), > regardless of the filesystem type, and expecting otherwise is > "optimistic". > > My guess is that very complex risky slow operations like that > are provided by "clever" filesystem developers for "marketing" > purposes, to win box-ticking competitions. That applies to those > system developers who do know better; I suspect that even some > filesystem developers are "optimistic" as to what they can > actually achieve. There are cases where there really is no other sane option. Not everyone has the kind of budget needed for proper HA setups, and if you need maximal uptime and as a result have to reprovision the system online, then you pretty much need a filesystem that supports online shrinking. Also, it's not really all that slow on most filesystems, BTRFS is just hurt by its comparatively poor performance, and the COW metadata updates that are needed. > >> I intended to shrink a ~22TiB filesystem down to 20TiB. This is >> still using LVM underneath so that I can’t just remove a device >> from the filesystem but have to use the resize command. > > That is actually a very good idea because Btrfs multi-device is > not quite as reliable as DM/LVM2 multi-device. This depends on how much you trust your storage hardware relative to how much you trust the kernel code. For raid5/6, yes, BTRFS multi-device is currently crap. For most people raid10 in BTRFS is too. For raid1 mode however, it really is personal opinion. > >> Label: 'backy' uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4 >> Total devices 1 FS bytes used 18.21TiB >> devid 1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy > > Maybe 'balance' should have been used a bit more. > >> This has been running since last Thursday, so roughly 3.5 days >> now. The “used” number in devid1 has moved about 1TiB in this >> time.
The filesystem is seeing regular usage (read and write) >> and when I’m suspending any application traffic I see about >> 1GiB of movement every now and then. Maybe once every 30 >> seconds or so. Does this sound fishy or normal to you? > > With consistent "optimism" this is a request to assess whether > "performance" of some operations is adequate on a filetree > without telling us either what the filetree contents look like, > what the regular workload is, or what the storage layer looks > like. > > Being one of the few system administrators crippled by lack of > psychic powers :-), I rely on guesses and inferences here, and > having read the whole thread containing some belated details. > > From the ~22TB total capacity my guess is that the storage layer > involves rotating hard disks, and from later details the > filesystem contents seem to be heavily reflinked files of > several GB in size, and workload seems to be backups to those > files from several source hosts. Considering the general level > of "optimism" in the situation my wild guess is that the storage > layer is based on large slow cheap rotating disks in the 4TB-8TB > range, with very low IOPS-per-TB. > >> Thanks for that info. The 1min per 1GiB is what I saw too - >> the “it can take longer” wasn’t really explainable to me. > > A contemporary rotating disk device can do around 0.5MB/s > transfer rate with small random accesses with barriers, up to > around 80-160MB/s in purely sequential access without barriers. > > 1GB/m of simultaneous read-write means around 16MB/s reads plus > 16MB/s writes which is fairly good *performance* (even if slow > *speed*) considering that moving extents around, even across > disks, involves quite a bit of randomish same-disk updates of > metadata; because it all depends usually on how many randomish > metadata updates need to be done, on any filesystem type, as those > must be done with barriers. > >> As I’m not using snapshots: would large files (100+gb) > > Using 100GB sized VM virtual disks (never mind with COW) seems > very unwise to me to start with, but of course a lot of other > people know better :-). Just like a lot of other people know > better that large single pool storage systems are awesome in > every respect :-): cost, reliability, speed, flexibility, > maintenance, etc. > >> with long chains of CoW history (specifically reflink copies) >> also hurt? > > Oh yes... They are about one of the worst cases for using > Btrfs. But also very "optimistic" to think that kind of stuff > can work awesomely on *any* filesystem type. It works just fine for archival storage on any number of other filesystems. Performance is poor, but with backups that shouldn't matter (performance should be your last criterion for designing a backup strategy, period). > >> Something I’d like to verify: does having traffic on the >> volume have the potential to delay this infinitely? [ ... ] >> it’s just slow and we’re looking forward to about 2 months >> worth of time shrinking this volume. (And then again on the >> next bigger server probably about 3-4 months). > > Those are pretty typical times for whole-filesystem operations > like that on rotating disk media. There are some reports in the > list and IRC channel archives of 'scrub' or 'balance' or 'check' > times for filetrees of that size. > >> (Background info: we’re migrating large volumes from btrfs to >> xfs and can only do this step by step: copying some data, >> shrinking the btrfs volume, extending the xfs volume, rinse >> repeat.
> > That "extending the xfs volume" will have consequences too, but > not too bad hopefully. It shouldn't have any beyond the FS being bigger and the FS level metadata being a bit fragmented. Extending a filesystem if done right (and XFS absolutely does it right) doesn't need to move any data, just allocate a bit more space in a few places and update the super-blocks to point to the new end of the filesystem. > >> If someone should have any suggestions to speed this up and >> not having to think in terms of _months_ then I’m all ears.) > > High IOPS-per-TB enterprise SSDs with capacitor backed caches :-). > >> One strategy that does come to mind: we’re converting our >> backup from a system that uses reflinks to a non-reflink based >> system. We can convert this in place so this would remove all >> the reflink stuff in the existing filesystem > > Do you have enough space to do that? Either your reflinks are > pointless or they are saving a lot of storage. But I guess that > you can do it one 100GB file at a time... > >> and then we maybe can do the FS conversion faster when this >> isn’t an issue any longer. I think I’ll > > I suspect the de-reflinking plus shrinking will take longer, but > not totally sure. > >> Right. This is wan option we can do from a software perspective >> (our own solution - https://bitbucket.org/flyingcircus/backy) > > Many thanks for sharing your system, I'll have a look. > >> but our systems in use can’t hold all the data twice. Even >> though we’re migrating to a backend implementation that uses >> less data than before I have to perform an “inplace” migration >> in some way. This is VM block device backup. So basically we >> migrate one VM with all its previous data and that works quite >> fine with a little headroom. However, migrating all VMs to a >> new “full” backup and then wait for the old to shrink would >> only work if we had a completely empty backup server in place, >> which we don’t. > >> Also: the idea of migrating on btrfs also has its downside - >> the performance of “mkdir” and “fsync” is abysmal at the >> moment. > > That *performance* is pretty good indeed, it is the *speed* that > may be low, but that's obvious. Please consider looking at these > entirely typical speeds: > > http://www.sabi.co.uk/blog/17-one.html?170302#170302 > http://www.sabi.co.uk/blog/17-one.html?170228#170228 > >> I’m waiting for the current shrinking job to finish but this >> is likely limited to the “find free space” algorithm. We’re >> talking about a few megabytes converted per second. Sigh. > > Well, if the filetree is being actively used for COW backups > while being shrunk that involves a lot of randomish IO with > barriers. > >>> I would only suggest that you reconsider XFS. You can't >>> shrink XFS, therefore you won't have the flexibility to >>> migrate in the same way to anything better that comes along >>> in the future (ZFS perhaps? or even Bcachefs?). XFS does not >>> perform that much better over Ext4, and very importantly, >>> Ext4 can be shrunk. > > ZFS is a complicated mess too with an intensely anisotropic > performance envelope too and not necessarily that good for > backup archival for various reasons. 
I would consider looking > instead at using a collection of smaller "silo" JFS, F2FS, > NILFS2 filetrees as well as XFS, and using MD RAID in RAID10 > mode instead of DM/LVM2: > > http://www.sabi.co.uk/blog/16-two.html?161217#161217 > http://www.sabi.co.uk/blog/17-one.html?170107#170107 > http://www.sabi.co.uk/blog/12-fou.html?121223#121223 > http://www.sabi.co.uk/blog/12-fou.html?121218b#121218b > http://www.sabi.co.uk/blog/12-fou.html?121218#121218 > > and yes, Bcachefs looks promising, but I am sticking with Btrfs: > > https://lwn.net/Articles/717379 > >> That is true. However, we have moved the expected feature >> set of the filesystem (i.e. cow) > > That feature set is arguably not appropriate for VM images, but > lots of people know better :-). That depends on a lot of factors. I have no issues personally running small VM images on BTRFS, but I'm also running on decent SSD's (>500MB/s read and write speeds), using sparse files, and keeping on top of managing them. Most of the issue boils down to 3 things:
1. Running Windows in VM's. Windows has a horrendous allocator and does a horrible job of keeping data localized, which makes fragmentation on the back-end far worse.
2. Running another COW filesystem inside the VM. Having multiple COW layers on top of each other nukes performance and makes file fragments breed like rabbits.
3. Not taking the time to do proper routine maintenance. Unless you're running directly on a block storage device, you should be defragmenting your VM images both in the VM and on the host (internal first of course), and generally keeping on top of making sure they stay in good condition.
> >> down to “store files safely and reliably” and we’ve seen too >> much breakage with ext4 in the past. > > That is extremely unlikely unless your storage layer has > unreliable barriers, and then you need a lot of "optimism". Then you've been lucky yourself. Outside of ZFS or BTRFS, most filesystems choke the moment they hit some at-rest data corruption, which has a much higher rate than most people want to admit. Hardware failures happen, as do transient errors, and XFS usually does a better job recovering from them than ext4. > >> Of course “persistence means you’ll have to say I’m sorry” and >> thus with either choice we may be faced with some issue in the >> future that we might have circumvented with another solution >> and yes flexibility is worth a great deal. > > Enterprise SSDs with high small-random-write IOPS-per-TB can > give both excellent speed and high flexibility :-). > >> We’ve run XFS and ext4 on different (large and small) >> workloads in the last 2 years and I have to say I’m much more >> happy about XFS even with the shrinking limitation. > > XFS and 'ext4' are essentially equivalent, except for the > fixed-size inode table limitation of 'ext4' (and XFS reportedly > has finer grained locking). Btrfs is nearly as good as either on > most workloads in single-device mode without using the more > complicated features (compression, qgroups, ...) and with > appropriate use of the 'nowcow' options, and gives checksums on > data too if needed. No, if you look at actual data, they aren't anywhere near equivalent unless you're comparing them to crappy filesystems like FAT32 or drastically different filesystems like NILFS2, ZFS, or BTRFS.
XFS supports metadata checksumming, reflinks and a number of other things ext4 doesn't, while also focusing on consistent performance across the life of the FS (so it performs worse on a clean FS than ext4, but better on a heavily used one than ext4). ext4 by contrast has support for a handful of things that XFS doesn't (like journaling all writes, not just metadata, optional lazy metadata initialization, optional multiple-mount protection, etc), and takes a rather optimistic view on performance, focusing on trying to make it as good as possible at all times. > >> To us ext4 is prohibitive with its fsck performance and we do >> like the tight error checking in XFS. > > It is very pleasing to see someone care about the speed of > whole-tree operations like 'fsck', a very often forgotten > "little detail". But in my experience 'ext4' checking is quite > competitive with XFS checking and repair, at least in recent > years, as both have been hugely improved. XFS checking and > repair still require a lot of RAM though. > >> Thanks for the reminder though - especially in the public >> archive making this tradeoff with flexibility known is wise to >> communicate. :-) > > "Flexibility" in filesystems, especially on rotating disk > storage with extremely anisotropic performance envelopes, is > very expensive, but of course lots of people know better :-). Time is not free, and humans generally prefer to minimize the amount of time they have to work on things. This is why ZFS is so popular, it handles most errors correctly by itself and usually requires very little human intervention for maintenance. 'Flexibility' in a filesystem costs some time on a regular basis, but can save a huge amount of time in the long run. To look at it another way, I have a home server system running BTRFS on top of LVM. Because of the flexibility this allows, I've been able to configure the system such that it is statistically certain that it will survive any combination of failed storage devices short of a complete catastrophic failure, keep running correctly and can recover completely with zero down-time, while still getting performance within 5-10% of what I would see just running BTRFS directly on the SSD's in the system. That flexibility is what makes this system work as well and reliably as it does, which in turn means that the extent of manual maintenance is running updates, thus saving me significantly more time than it costs in lost performance. ^ permalink raw reply [flat|nested] 42+ messages in thread
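A sketch of the host-side half of the routine maintenance described in point 3 above, with illustrative paths. Note that on kernels of this era defragmentation unshares extents, so reflinked or snapshotted images take up extra space afterwards:

$ btrfs filesystem defragment -r -t 32M /srv/vm-images   # rewrite fragmented image files with a larger target extent size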
* Re: Shrinking a device - performance? 2017-03-28 15:56 ` Austin S. Hemmelgarn @ 2017-03-30 15:55 ` Peter Grandi 2017-03-31 12:41 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 42+ messages in thread From: Peter Grandi @ 2017-03-30 15:55 UTC (permalink / raw) To: Linux fs Btrfs >> My guess is that very complex risky slow operations like that are >> provided by "clever" filesystem developers for "marketing" purposes, >> to win box-ticking competitions. That applies to those system >> developers who do know better; I suspect that even some filesystem >> developers are "optimistic" as to what they can actually achieve. > There are cases where there really is no other sane option. Not > everyone has the kind of budget needed for proper HA setups, Thanks for letting me know, that must have never occurred to me, just as it must have never occurred to me that some people expect extremely advanced features that imply big-budget high-IOPS high-reliability storage to be fast and reliable on small-budget storage too :-) > and if you need maximal uptime and as a result have to reprovision the > system online, then you pretty much need a filesystem that supports > online shrinking. That's a bigger topic than we can address here. The topic used to be known in one related domain as "Very Large Databases", which were defined as databases so large and critical that the time needed for maintenance and backup was too long to allow taking them offline etc.; that is a topic that has largely vanished from discussion, I guess because most management just don't want to hear it :-). > Also, it's not really all that slow on most filesystems, BTRFS is just > hurt by its comparatively poor performance, and the COW metadata > updates that are needed. Btrfs in realistic situations has pretty good speed *and* performance, and COW actually helps, as it often results in less head repositioning than update-in-place. What makes it a bit slower with metadata is having 'dup' by default to recover from especially damaging bitflips in metadata, but then that does not impact performance, only speed. >> That feature set is arguably not appropriate for VM images, but >> lots of people know better :-). > That depends on a lot of factors. I have no issues personally running > small VM images on BTRFS, but I'm also running on decent SSD's > (>500MB/s read and write speeds), using sparse files, and keeping on > top of managing them. [ ... ] Having (relatively) big-budget high-IOPS storage for high-IOPS workloads helps, that must have never occurred to me either :-). >> XFS and 'ext4' are essentially equivalent, except for the fixed-size >> inode table limitation of 'ext4' (and XFS reportedly has finer >> grained locking). Btrfs is nearly as good as either on most workloads >> in single-device mode [ ... ] > No, if you look at actual data, [ ... ] Well, I have looked at actual data in many published but often poorly made "benchmarks", and to me they seem quite equivalent indeed, within somewhat differently shaped performance envelopes, so the results depend on the testing point within that envelope. I have done my own simplistic actual data gathering, most recently here: http://www.sabi.co.uk/blog/17-one.html?170302#170302 http://www.sabi.co.uk/blog/17-one.html?170228#170228 and however simplistic, they are fairly informative (and for writes they point a finger at a layer below the filesystem type). [ ... 
] >> "Flexibility" in filesystems, especially on rotating disk >> storage with extremely anisotropic performance envelopes, is >> very expensive, but of course lots of people know better :-). > Time is not free, Your time seems especially and uniquely precious as you "waste" as little as possible editing your replies into readability. > and humans generally prefer to minimize the amount of time they have > to work on things. This is why ZFS is so popular, it handles most > errors correctly by itself and usually requires very little human > intervention for maintenance. That seems to me a pretty illusion, as it does not contain any magical AI, just pretty ordinary and limited error correction for trivial cases. > 'Flexibility' in a filesystem costs some time on a regular basis, but > can save a huge amount of time in the long run. Like everything else. The difficulty is having flexibility at scale with challenging workloads. "An engineer can do for a nickel what any damn fool can do for a dollar" :-). > To look at it another way, I have a home server system running BTRFS > on top of LVM. [ ... ] But usually home servers have "unchallenging" workloads, and it is relatively easy to overbudget their storage, because the total absolute cost is "affordable". ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-03-30 15:55 ` Peter Grandi @ 2017-03-31 12:41 ` Austin S. Hemmelgarn 2017-03-31 17:25 ` Peter Grandi 0 siblings, 1 reply; 42+ messages in thread From: Austin S. Hemmelgarn @ 2017-03-31 12:41 UTC (permalink / raw) To: Peter Grandi, Linux fs Btrfs On 2017-03-30 11:55, Peter Grandi wrote: >>> My guess is that very complex risky slow operations like that are >>> provided by "clever" filesystem developers for "marketing" purposes, >>> to win box-ticking competitions. That applies to those system >>> developers who do know better; I suspect that even some filesystem >>> developers are "optimistic" as to what they can actually achieve. > >> There are cases where there really is no other sane option. Not >> everyone has the kind of budget needed for proper HA setups, > > Thanks for letting me know, that must have never occurred to me, just as > it must have never occurred to me that some people expect extremely > advanced features that imply big-budget high-IOPS high-reliability > storage to be fast and reliable on small-budget storage too :-) You're missing my point (or intentionally ignoring it). Those types of operations are implemented because there are use cases that actually need them, not because some developer thought it would be cool. The one possible counter-example of this is XFS, which doesn't support shrinking the filesystem at all, but that was a conscious decision because their target use case (very large scale data storage) does not need that feature and not implementing it allows them to make certain other parts of the filesystem faster. > >> and if you need maximal uptime and as a result have to reprovision the >> system online, then you pretty much need a filesystem that supports >> online shrinking. > > That's a bigger topic than we can address here. The topic used to be > known in one related domain as "Very Large Databases", which were > defined as databases so large and critical that the time needed for > maintenance and backup was too long to allow taking them offline etc.; > that is a topic that has largely vanished from discussion, I guess > because most management just don't want to hear it :-). No, it's mostly vanished because of changes in best current practice. That was a topic in an era where the only platform that could handle high-availability was VMS, and software wasn't routinely written to handle things like load balancing. As a result, people ran a single system which hosted the database, and if that went down, everything went down. By contrast, it's rare these days outside of small companies to see singly hosted databases that aren't specific to the local system, and once you start parallelizing on the system level, backup and maintenance times generally go down. > >> Also, it's not really all that slow on most filesystems, BTRFS is just >> hurt by its comparatively poor performance, and the COW metadata >> updates that are needed. > > Btrfs in realistic situations has pretty good speed *and* performance, > and COW actually helps, as it often results in less head repositioning > than update-in-place. What makes it a bit slower with metadata is having > 'dup' by default to recover from especially damaging bitflips in > metadata, but then that does not impact performance, only speed. I and numerous other people have done benchmarks running single metadata and single data profiles on BTRFS, and it consistently performs worse than XFS and ext4 even under those circumstances.
It's not horrible performance (it's better for example than trying the same workload on NTFS on Windows), but it's still not what most people would call 'high' performance or speed. > >>> That feature set is arguably not appropriate for VM images, but >>> lots of people know better :-). > >> That depends on a lot of factors. I have no issues personally running >> small VM images on BTRFS, but I'm also running on decent SSD's >> (>500MB/s read and write speeds), using sparse files, and keeping on >> top of managing them. [ ... ] > > Having (relatively) big-budget high-IOPS storage for high-IOPS workloads > helps, that must have never occurred to me either :-). It's not big budget, the SSD's in question are at best mid-range consumer SSD's that cost only marginally more than a decent hard drive, and they really don't get all that great performance in terms of IOPS because they're all on the same cheap SATA controller. The point I was trying to make (which I should have been clearer about) is that they have good bulk throughput, which means that the OS can do much more aggressive writeback caching, which in turn means that COW and fragmentation have less impact. > >>> XFS and 'ext4' are essentially equivalent, except for the fixed-size >>> inode table limitation of 'ext4' (and XFS reportedly has finer >>> grained locking). Btrfs is nearly as good as either on most workloads >>> in single-device mode [ ... ] > >> No, if you look at actual data, [ ... ] > > Well, I have looked at actual data in many published but often poorly > made "benchmarks", and to me they seem quite equivalent > indeed, within somewhat differently shaped performance envelopes, so the > results depend on the testing point within that envelope. I have > done my own simplistic actual data gathering, most recently here: > > http://www.sabi.co.uk/blog/17-one.html?170302#170302 > http://www.sabi.co.uk/blog/17-one.html?170228#170228 > > and however simplistic, they are fairly informative (and for writes they > point a finger at a layer below the filesystem type). In terms of performance, yes they are roughly equivalent. Performance isn't all that matters though, and once you get past that point, ext4 and XFS are significantly different in what they offer. > > [ ... ] > >>> "Flexibility" in filesystems, especially on rotating disk >>> storage with extremely anisotropic performance envelopes, is >>> very expensive, but of course lots of people know better :-). > >> Time is not free, > > Your time seems especially and uniquely precious as you "waste" > as little as possible editing your replies into readability. > >> and humans generally prefer to minimize the amount of time they have >> to work on things. This is why ZFS is so popular, it handles most >> errors correctly by itself and usually requires very little human >> intervention for maintenance. > > That seems to me a pretty illusion, as it does not contain any magical > AI, just pretty ordinary and limited error correction for trivial cases. On average, trivial cases account for most errors in any computer. So, by definition, to handle most errors correctly, you can get by with just handling all 'trivial' cases correctly. By handling all trivial cases correctly, ZFS is doing far better than any other current filesystem or storage stack can even begin to claim. It's been doing this since before most modern Linux distributions made their first release too, so compared to just about anything else people are using these days, it's got a pretty solid track record.
Anyone trying to claim it's the best option in any case is obviously either a zealot or being paid, but for many cases, it really is one of the top options. > >> 'Flexibility' in a filesystem costs some time on a regular basis, but >> can save a huge amount of time in the long run. > > Like everything else. The difficulty is having flexibility at scale with > challenging workloads. "An engineer can do for a nickel what any damn > fool can do for a dollar" :-). > >> To look at it another way, I have a home server system running BTRFS >> on top of LVM. [ ... ] > > But usually home servers have "unchallenging" workloads, and it is > relatively easy to overbudget their storage, because the total absolute > cost is "affordable". OK, so running
* Almost a dozen statically allocated VM's with a variety of differing workloads including web-servers, a local mail server, DHCP and DNS for the network, a VPN server, and 3 different file sharing protocols (which see rather regular use) among other things
* On average between 4 and 10 transient VM's running regression testing on kernel patches (including automation of almost everything but selecting patches)
* A BOINC client
* GlusterFS (both client and storage node)
* Network security monitoring (Nagios plus a handful of custom scripts)
* Cloud storage software
All on the same system is an 'unchallenging' workload. Given the fact that it's only got 32G of RAM and a cheap quad-core Xeon, that's a pretty damn challenging workload by most people's standards. I call it a home server because I run it out of my house, not because it's some trivial dinky little file server that could run just fine on something like a Raspberry Pi. ^ permalink raw reply [flat|nested] 42+ messages in thread
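For completeness, the detection and repair of at-rest corruption that both sides refer to looks roughly like this on Btrfs, with an assumed mount point; with single-copy data profiles a scrub can detect, but not repair, data errors:

$ btrfs scrub start -Bd /srv/backy   # read and verify every checksummed block, foreground, per-device stats
$ btrfs scrub status /srv/backy      # or check on a background scrub later
$ btrfs device stats /srv/backy      # cumulative read/write/corruption error counters per device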
* Re: Shrinking a device - performance? 2017-03-31 12:41 ` Austin S. Hemmelgarn @ 2017-03-31 17:25 ` Peter Grandi 2017-03-31 19:38 ` GWB 0 siblings, 1 reply; 42+ messages in thread From: Peter Grandi @ 2017-03-31 17:25 UTC (permalink / raw) To: Linux fs Btrfs >>> My guess is that very complex risky slow operations like >>> that are provided by "clever" filesystem developers for >>> "marketing" purposes, to win box-ticking competitions. >>> That applies to those system developers who do know better; >>> I suspect that even some filesystem developers are >>> "optimistic" as to what they can actually achieve. >>> There are cases where there really is no other sane >>> option. Not everyone has the kind of budget needed for >>> proper HA setups, >> Thanks for letting me know, that must have never occurred to >> me, just as it must have never occurred to me that some >> people expect extremely advanced features that imply >> big-budget high-IOPS high-reliability storage to be fast and >> reliable on small-budget storage too :-) > You're missing my point (or intentionally ignoring it). In "Thanks for letting me know" I am not missing your point, I am simply pointing out that I do know that people try to run high-budget workloads on low-budget storage. The argument as to whether "very complex risky slow operations" should be provided in the filesystem itself is a very different one, and I did not develop it fully. But it is quite "optimistic" to simply state "there really is no other sane option", even for people that don't have "proper HA setups". Let's start by assuming, for the time being, that "very complex risky slow operations" are indeed feasible on very reliable high speed storage layers. Then the questions become:
* Is it really true that "there is no other sane option" to running "very complex risky slow operations" even on storage that is not "big-budget high-IOPS high-reliability"?
* Is it really true that it is a good idea to run "very complex risky slow operations" even on "big-budget high-IOPS high-reliability storage"?
> Those types of operations are implemented because there are > use cases that actually need them, not because some developer > thought it would be cool. [ ... ] And this is the really crucial bit, I'll disregard without agreeing too much (but in part I do) with the rest of the response, as those are less important matters, and this is going to be longer than a twitter message. First, I agree that "there are use cases that actually need them", and I need to explain what I am agreeing to: I believe that computer systems, "system" in a wide sense, have what I call "inevitable functionality", that is functionality that is not optional, but must be provided *somewhere*: for example print spooling is "inevitable functionality" as long as there are multiple users, and spell checking is another example. The only choice as to "inevitable functionality" is *where* to provide it. For example spooling can be done among two users by queuing jobs manually with one saying "I am going to print now", and the other user waits until the print is finished, or by using a spool program that queues jobs on the source system, or by using a spool program that queues jobs on the target printer. Spell checking can be done on the fly in the document processor, batch with a tool, or manually by the document author. 
All these are valid implementations of "inevitable functionality", just with very different performance envelopes, where the "system" includes the users as "peripherals" or "plugins" :-) in the manual implementations. There is no dispute from me that multiple devices, adding/removing block devices, data compression, structural repair, balancing, growing/shrinking, defragmentation, quota groups, integrity checking, deduplication, ... are all in the general case "inevitable functionality", and every non-trivial storage system *must* implement them. The big question is *where*: for example when I started using UNIX the 'fsck' tool was several years away, and when the system crashed I did, like everybody, filetree integrity checking and structure recovery myself (with the help of 'ncheck' and 'icheck' and 'adb'), that is 'fsck' was implemented in my head. In the general case there are four places where such "inevitable functionality" can be implemented:
* In the filesystem module in the kernel, for example Btrfs scrubbing.
* In a tool that uses hooks provided by the filesystem module in the kernel, for example Btrfs deduplication, 'send'/'receive'.
* In a tool, for example 'btrfsck'.
* In the system administrator.
Consider the "very complex risky slow" operation of defragmentation; the system administrator can implement it by dumping and reloading the volume, or a tool can implement it by running on the unmounted filesystem, or a tool and the kernel can implement it by using kernel module hooks, or it can be provided entirely in the kernel module. My argument is that providing "very complex risky slow" maintenance operations as filesystem primitives looks awesomely convenient, a good way to "win box-ticking competitions" for "marketing" purposes, but is a rather bad idea for several reasons, of varying strengths:
* Most system administrators apparently don't understand the most basic concepts of storage, or try to not understand them, and in particular don't understand that some in-place maintenance operations are "very complex risky slow" and should be avoided. Manual alternatives to shrinking like dumping and reloading should be encouraged.
* In an ideal world "very complex risky slow operations" could be done either "automagically" or manually, and wise system administrators would choose appropriately, but the risk of the wrong choice by less wise system administrators can reflect badly on the filesystem reputation and that of their designers, as in "after 10 years it still is like this" :-).
* In particular for whatever reasons many system administrators seem to be very "optimistic" as to cost/benefit planning, maybe because they want to be considered geniuses who can deliver large high performance high reliability storage for cheap, and systematically under-resource IOPS because they are very expensive, yet large quantities of these are consumed by most maintenance "very complex risky slow operations", especially those involving in-place manipulation, and then ingenuously or disingenuously complain when 'balance' takes 3 months, because after all it is a single command, and that single command hides a "very complex risky slow" operation. 
* In an ideal world implementing "very complex risky slow operations" in kernel modules (or even in tools) is entirely cost free, as kernel developers never make mistakes as to state machines or race conditions or lesser bugs despite the enormous complexity of the code paths needed to support many possible options, but kernel code is particularly fragile, kernel developers seem to be human after all, when they are not quite careless, and making it hard to stabilize kernel code can reflect badly on the filesystem reputation and that of their designers, as in "after 10 years it still is like this" :-).
Therefore in my judgement a filesystem design should only provide the barest and most direct functionality, unless the designers really overrate themselves, or rate highly their skill at marketing long lists of features as "magic dust". In my judgement higher level functionality can be left to the ingenuity of system administrators, both because crude methods like dump and reload actually work pretty well and quickly, even if they are more costly in terms of resources used, and because they give a more direct feel to system administrators of the real costs of doing certain maintenance operations. Put another way, as to this: > Those types of operations are implemented because there are > use cases that actually need them, Implementing "very complex risky slow operations" like in-place shrinking *in the kernel module* as a "just do it" primitive is certainly possible and looks great in a box-ticking competition but has large hidden costs as to complexity and opacity, and simpler cruder more manual out of kernel implementations are usually less complex, less risky, less slow, even if more expensive in terms of budget. In the end the question for either filesystem designers or system administrators is "Do you feel lucky?" :-). The following crudely tells part of the story, for example that some filesystem designers know better :-)
$ D='btrfs f2fs gfs2 hfsplus jfs nilfs2 reiserfs udf xfs'
$ find $D -name '*.ko' | xargs size | sed 's/^ *//;s/ .*\t//g'
text    filename
832719  btrfs/btrfs.ko
237952  f2fs/f2fs.ko
251805  gfs2/gfs2.ko
72731   hfsplus/hfsplus.ko
171623  jfs/jfs.ko
173540  nilfs2/nilfs2.ko
214655  reiserfs/reiserfs.ko
81628   udf/udf.ko
658637  xfs/xfs.ko
^ permalink raw reply [flat|nested] 42+ messages in thread
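A sketch of the rinse-repeat migration cycle this thread is actually about, in the "dump and reload" spirit argued for above. The LV names, mount points and 1TiB batch size are assumptions, not taken from the thread; also note that rsync does not preserve reflinks, so heavily shared data expands on the target:

$ rsync -aHAXS --info=progress2 /srv/backy-btrfs/vm0123/ /srv/backy-xfs/vm0123/   # copy one VM's backup tree across
$ btrfs filesystem resize -1t /srv/backy-btrfs   # give back roughly the space freed; this is the slow step
$ lvreduce -L -1t vgsys/backy-btrfs              # shrink the LV by the same amount, never below the fs size
$ lvextend -l +100%FREE vgsys/backy-xfs          # hand the freed extents to the XFS volume
$ xfs_growfs /srv/backy-xfs                      # and grow XFS into them online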
* Re: Shrinking a device - performance? 2017-03-31 17:25 ` Peter Grandi @ 2017-03-31 19:38 ` GWB 2017-03-31 20:27 ` Peter Grandi 0 siblings, 1 reply; 42+ messages in thread From: GWB @ 2017-03-31 19:38 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux fs Btrfs Well, now I am curious. Until we hear back from Christian on the progress of the never ending file system shrinkage, I suppose it can't hurt to ask what the significance of the xargs size limits of btrfs might be. Or, again, if Christian is already happily on his way to an xfs server running over lvm, skip, ignore, delete. Here is the output of xargs --show-limits on my laptop:
<<
$ xargs --show-limits
Your environment variables take up 4830 bytes
POSIX upper limit on argument length (this system): 2090274
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2085444
Size of command buffer we are actually using: 131072
Execution of xargs will continue now...
>>
That is for a laptop system. So what does it mean that btrfs has a higher xargs size limit than other file systems? Could I theoretically use 40% of the total allowed argument length of the system for btrfs arguments alone? Would that make balance, shrinkage, etc., faster? Does the higher capacity for argument length mean btrfs is overly complex and therefore more prone to breakage? Or does the lower capacity for argument length for hfsplus demonstrate it is the superior file system for avoiding breakage? Or does it mean that hfsplus is very old (and reflects older xargs limits), and that btrfs is newer code? I am relatively new to btrfs, and would like to find out. I am also attracted to the idea that it is better to leave some operations to the system itself, and not code them into the file system. For example, I think deduplication "off line" or "out of band" is an advantage for btrfs over zfs. But that's only for what I do. For other uses deduplication "in line", while writing the file, is preferred, and that is what zfs does (preferably with lots of memory, at least one ssd to run zil, caches, etc.). I use btrfs now because Ubuntu has it as a default in the kernel, and I assume that when (not "if") I have to use a system rescue disk (USB or CD) it will have some capacity to repair btrfs. Along the way, btrfs has been quite good as a general purpose file system on root; it makes and sends snapshots, and so far only needs an occasional scrub and balance. My earlier experience with btrfs on a 2TB drive was more complicated, but I expected that for a file system with a lot of potential but less maturity. Personally, I would go back to fossil and venti on Plan 9 for an archival data server (using WORM drives), and a VAX/VMS cluster for an HA server. But of course that no longer makes sense except for a very few usage cases. Time has moved on, prices have dropped drastically, and hardware can do a lot more per penny than it used to. Gordon On Fri, Mar 31, 2017 at 12:25 PM, Peter Grandi <pg@btrfs.list.sabi.co.uk> wrote: >>>> My guess is that very complex risky slow operations like >>>> that are provided by "clever" filesystem developers for >>>> "marketing" purposes, to win box-ticking competitions. > >>>> That applies to those system developers who do know better; >>>> I suspect that even some filesystem developers are >>>> "optimistic" as to what they can actually achieve. > >>>> There are cases where there really is no other sane >>>> option.
> [ ... ]
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-03-31 19:38 ` GWB @ 2017-03-31 20:27 ` Peter Grandi 2017-04-01 0:02 ` GWB 0 siblings, 1 reply; 42+ messages in thread From: Peter Grandi @ 2017-03-31 20:27 UTC (permalink / raw) To: Linux fs Btrfs > [ ... ] what the significance of the xargs size limits of > btrfs might be. [ ... ] So what does it mean that btrfs has a > higher xargs size limit than other file systems? [ ... ] Or > does the lower capacity for argument length for hfsplus > demonstrate it is the superior file system for avoiding > breakage? [ ... ] That confuses me, as my understanding of the command argument size limit is that it is a system, not filesystem, property, and for example can be obtained with 'getconf _POSIX_ARG_MAX'. > Personally, I would go back to fossil and venti on Plan 9 for > an archival data server (using WORM drives), In an ideal world we would be using Plan 9. Not necessarily with Fossil and Venti. As to storage/backup/archival, Linux-based options are not bad, even if the platform is far messier than Plan 9 (or some other alternatives). BTW I just noticed with a search that AWS might be offering Plan 9 hosts :-). > and a VAX/VMS cluster for an HA server. [ ... ] Uhmmm, however nice it was, it was fairly weird. An IA32 or AMD64 port has been promised however :-). https://www.theregister.co.uk/2016/10/13/openvms_moves_slowly_towards_x86/ ^ permalink raw reply [flat|nested] 42+ messages in thread
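For later readers, the two numbers being conflated in this sub-thread are unrelated; typical values from a Linux system are shown below, and yours may differ:

$ getconf ARG_MAX          # the per-system command-line limit that xargs reports against
2097152
$ getconf _POSIX_ARG_MAX   # the minimum POSIX requires any system to allow
4096
# the 832719 quoted earlier is the btrfs.ko module's text size in bytes, nothing to do with argument limits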
* Re: Shrinking a device - performance? 2017-03-31 20:27 ` Peter Grandi @ 2017-04-01 0:02 ` GWB 2017-04-01 2:42 ` Duncan 0 siblings, 1 reply; 42+ messages in thread From: GWB @ 2017-04-01 0:02 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux fs Btrfs It is confusing, and now that I look at it, more than a little funny. Your use of xargs returns the size of the kernel module for each of the filesystem types. I think I get it now: you are pointing to how large the kernel module for btrfs is compared to other file system kernel modules, 833 megs (piping find through xargs to sed). That does not mean the btrfs kernel module can accommodate an upper limit of a command line length that is 833 megs. It is just a very big loadable kernel module. So same question, but different expression: what is the significance of the large size of the btrfs kernel module? Is it that the larger the module, the more complex, the more prone to breakage, and the more difficult to debug? Is the hfsplus kernel module less complex, and more robust? What did the file system designers of hfsplus (or udf) know better (or worse?) than the file system designers of btrfs? VAX/VMS clusters just aren't happy outside of a deeply hidden bunker running 9 machines in a cluster from one storage device connected by Myrinet over 500 miles to the next cluster. I applaud the move to x86, but like I wrote earlier, time has moved on. I suppose weird is in the eye of the beholder, but yes, when dial up was king and disco pants roamed the earth, they were nice. I don't think x86 is a viable use case even for OpenVMS. If you really need a VAX/VMS cluster, chances are you have already had one running with a continuous uptime of more than a decade and you have already upgraded and changed out every component several times by cycling down one machine in the cluster at a time. Gordon On Fri, Mar 31, 2017 at 3:27 PM, Peter Grandi <pg@btrfs.list.sabi.co.uk> wrote: >> [ ... ] what the significance of the xargs size limits of >> btrfs might be. [ ... ] So what does it mean that btrfs has a >> higher xargs size limit than other file systems? [ ... ] Or >> does the lower capacity for argument length for hfsplus >> demonstrate it is the superior file system for avoiding >> breakage? [ ... ] > > That is confusing, as my understanding is that the command argument size > limit is a system, not filesystem, property, and for > example can be obtained with 'getconf _POSIX_ARG_MAX'. > >> Personally, I would go back to fossil and venti on Plan 9 for >> an archival data server (using WORM drives), > > In an ideal world we would be using Plan 9. Not necessarily with > Fossil and Venti. As to storage/backup/archival, Linux-based > options are not bad, even if the platform is far messier than > Plan 9 (or some other alternatives). BTW I just noticed with a > search that AWS might be offering Plan 9 hosts :-). > >> and VAX/VMS cluster for an HA server. [ ... ] > > Uhmmm, however nice it was, it was fairly weird. An IA32 or > AMD64 port has been promised however :-). > > https://www.theregister.co.uk/2016/10/13/openvms_moves_slowly_towards_x86/ > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-04-01 0:02 ` GWB @ 2017-04-01 2:42 ` Duncan 2017-04-01 4:26 ` GWB 0 siblings, 1 reply; 42+ messages in thread From: Duncan @ 2017-04-01 2:42 UTC (permalink / raw) To: linux-btrfs GWB posted on Fri, 31 Mar 2017 19:02:40 -0500 as excerpted: > It is confusing, and now that I look at it, more than a little funny. > Your use of xargs returns the size of the kernel module for each of the > filesystem types. I think I get it now: you are pointing to how large > the kernel module for btrfs is compared to other file system kernel > modules, 833 megs (piping find through xargs to sed). That does not > mean the btrfs kernel module can accommodate an upper limit of a command > line length that is 833 megs. It is just a very big loadable kernel > module. Umm... 833 K, not M, I believe. (The unit is bytes not KiB.) Because if just one kernel module is nearing a gigabyte, then the kernel must be many gigabytes either monolithic or once assembled in memory, and it just ain't so. But FWIW megs was my first-glance impression too, until my brain said "No way! Doesn't work!" and I took a second look. The kernel may indeed no longer fit on a 1.44 MB floppy, but it's still got a ways to go before it's multiple GiB! =:^) While they're XZ- compressed, I'm still fitting several monolithic-build kernels including their appended initramfs, along with grub, its config and modules, and a few other misc things, in a quarter-GB dup-mode btrfs, meaning 128 MiB capacity, including the 16 MiB system chunk so 112 MiB for data and metadata. That simply wouldn't be possible if the kernel itself were multi-GB, even uncompressed. Even XZ isn't /that/ good! -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 42+ messages in thread
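The ambiguity comes from the Berkeley output format of 'size', which prints plain byte counts with no unit suffix; against a single module it looks roughly like this (only the 'text' figure is taken from the listing earlier in the thread, the other columns are made-up placeholders):

$ size fs/btrfs/btrfs.ko
   text    data     bss     dec     hex filename
 832719   24721   18200  875640   d5c78 fs/btrfs/btrfs.ko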
* Re: Shrinking a device - performance? 2017-04-01 2:42 ` Duncan @ 2017-04-01 4:26 ` GWB 2017-04-01 11:30 ` Peter Grandi 0 siblings, 1 reply; 42+ messages in thread From: GWB @ 2017-04-01 4:26 UTC (permalink / raw) To: Btrfs BTRFS Indeed, that does make sense. It's the output of the size command in the Berkeley format of "text", not decimal, octal or hex. Out of curiosity about kernel module sizes, I dug up some old MacBooks and looked around in: /System/Library/Extensions/[modulename].kext/Content/MacOS: udf is 637K on Mac OS 10.6 exfat is 75K on Mac OS 10.9 msdosfs is 79K on Mac OS 10.9 ntfs is 394K (That must be Paragon's ntfs for Mac) And here's the kernel extension sizes for zfs (From OpenZFS): /Library/Extensions/[modulename].kext/Content/MacOS: zfs is 1.7M (10.9) spl is 247K (10.9) Different kernel from linux, of course (evidently a "mish mash" of NextStep, BSD, Mach and Apple's own code), but that is one large kernel extension for zfs. If they are somehow comparable even with the differences, 833K is not bad for btrfs compared to zfs. I did not look at the format of the file; it must be binary, but compression may be optional for third party kexts. So the kernel module sizes are large for both btrfs and zfs. Given the feature sets of both, is that surprising? My favourite kernel extension in Mac OS X is: /System/Library/Extensions/Dont Steal Mac OS X.kext/ Subtle, very subtle. Gordon On Fri, Mar 31, 2017 at 9:42 PM, Duncan <1i5t5.duncan@cox.net> wrote: > GWB posted on Fri, 31 Mar 2017 19:02:40 -0500 as excerpted: > >> It is confusing, and now that I look at it, more than a little funny. >> Your use of xargs returns the size of the kernel module for each of the >> filesystem types. I think I get it now: you are pointing to how large >> the kernel module for btrfs is compared to other file system kernel >> modules, 833 megs (piping find through xargs to sed). That does not >> mean the btrfs kernel module can accommodate an upper limit of a command >> line length that is 833 megs. It is just a very big loadable kernel >> module. > > Umm... 833 K, not M, I believe. (The unit is bytes not KiB.) > > Because if just one kernel module is nearing a gigabyte, then the kernel > must be many gigabytes either monolithic or once assembled in memory, and > it just ain't so. > > But FWIW megs was my first-glance impression too, until my brain said "No > way! Doesn't work!" and I took a second look. > > The kernel may indeed no longer fit on a 1.44 MB floppy, but it's still > got a ways to go before it's multiple GiB! =:^) While they're XZ- > compressed, I'm still fitting several monolithic-build kernels including > their appended initramfs, along with grub, its config and modules, and a > few other misc things, in a quarter-GB dup-mode btrfs, meaning 128 MiB > capacity, including the 16 MiB system chunk so 112 MiB for data and > metadata. That simply wouldn't be possible if the kernel itself were > multi-GB, even uncompressed. Even XZ isn't /that/ good! > > -- > Duncan - List replies preferred. No HTML msgs. > "Every nonfree program has a lord, a master -- > and if you use the program, he is your master." Richard Stallman > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-04-01 4:26 ` GWB @ 2017-04-01 11:30 ` Peter Grandi 0 siblings, 0 replies; 42+ messages in thread From: Peter Grandi @ 2017-04-01 11:30 UTC (permalink / raw) To: Linux fs Btrfs [ ... ] >>> $ D='btrfs f2fs gfs2 hfsplus jfs nilfs2 reiserfs udf xfs' >>> $ find $D -name '*.ko' | xargs size | sed 's/^ *//;s/ .*\t//g' >>> text filename >>> 832719 btrfs/btrfs.ko >>> 237952 f2fs/f2fs.ko >>> 251805 gfs2/gfs2.ko >>> 72731 hfsplus/hfsplus.ko >>> 171623 jfs/jfs.ko >>> 173540 nilfs2/nilfs2.ko >>> 214655 reiserfs/reiserfs.ko >>> 81628 udf/udf.ko >>> 658637 xfs/xfs.ko That was Linux AMD64. > udf is 637K on Mac OS 10.6 > exfat is 75K on Mac OS 10.9 > msdosfs is 79K on Mac OS 10.9 > ntfs is 394K (That must be Paragon's ntfs for Mac) ... > zfs is 1.7M (10.9) > spl is 247K (10.9) Similar on Linux AMD64 but smaller: $ size updates/dkms/*.ko | sed 's/^ *//;s/ .*\t//g' text filename 62005 updates/dkms/spl.ko 184370 updates/dkms/splat.ko 3879 updates/dkms/zavl.ko 22688 updates/dkms/zcommon.ko 1012212 updates/dkms/zfs.ko 39874 updates/dkms/znvpair.ko 18321 updates/dkms/zpios.ko 319224 updates/dkms/zunicode.ko > If they are somehow comparable even with the differences, 833K > is not bad for btrfs compared to zfs. I did not look at the > format of the file; it must be binary, but compression may be > optional for third party kexts. So the kernel module sizes are > large for both btrfs and zfs. Given the feature sets of both, > is that surprising? Not surprising and indeed I agree with the statement that appeared earlier that "there are use cases that actually need them". There are also use cases that need realtime translation of file content from Chinese to Spanish, and one could add to ZFS or Btrfs an extension to detect the language of text files and invoke via HTTP Google Translate, for example with option "translate=chinese-spanish" at mount time; or less flexibly there are many use cases where B-Tree lookup of records in files is useful, and it would be possible to add that to Btrfs or ZFS, so that for example 'lseek(4,"Jane Smith",SEEK_KEY)' would be possible, as in the ancient TSS/370 filesystem design. But the question is about engineering, where best to implement those "feature sets": in the kernel or at higher levels. There is no doubt for me that realtime language translation and seeking by key can be added to a filesystem kernel module, and would "work". The issue is a crudely technical one: "works" for an engineer is not a binary state, but a statistical property over a wide spectrum of cost/benefit tradeoffs. Adding "feature sets" because "there are use cases that actually need them" is fine, adding their implementation to the kernel driver of a filesystem is quite a different proposition, which may have downsides, as the implementations of those feature sets may make code more complex and harder to understand and test, never mind debug, even for the base features. But of course lots of people know better :-). But there is more; look again at some compiled code sizes as a crude proxy for complexity, divided into two groups, both of robust, full featured designs: 1012212 updates/dkms/zfs.ko 832719 btrfs/btrfs.ko 658637 xfs/xfs.ko 237952 f2fs/f2fs.ko 173540 nilfs2/nilfs2.ko 171623 jfs/jfs.ko 81628 udf/udf.ko The code size for JFS or NILFS2 or UDF is roughly 1/4 the code size for XFS, yet there is little difference in functionality. 
Compared to ZFS, as to base functionality, JFS lacks checksums and snapshots (in theory it has subvolumes, but they are disabled), but NILFS2 has snapshots and checksums (but does not verify them on ordinary reads), and yet the code size is 1/6 that of ZFS. ZFS also has RAID, but looking at the code size of the Linux MD RAID modules I see rather smaller numbers. Even so ZFS has a good reputation for reliability despite its amazing complexity, but that is also because SUN invested big into massive release engineering for it, and similarly for XFS. Therefore my impression is that the filesystems in the first group have a lot of cool features like compression or dedup etc. that could have been implemented at user level, and having them in the kernel is good "for 'marketing' purposes, to win box-ticking competitions". ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-03-28 14:43 ` Peter Grandi ` (2 preceding siblings ...) 2017-03-28 15:56 ` Austin S. Hemmelgarn @ 2017-03-30 15:00 ` Piotr Pawłow 2017-03-30 16:13 ` Peter Grandi 3 siblings, 1 reply; 42+ messages in thread From: Piotr Pawłow @ 2017-03-30 15:00 UTC (permalink / raw) To: Peter Grandi, Linux fs Btrfs > As a general consideration, shrinking a large filetree online > in-place is an amazingly risky, difficult, slow operation and > should be a last desperate resort (as apparently in this case), > regardless of the filesystem type, and expecting otherwise is > "optimistic". The way btrfs is designed I'd actually expect shrinking to be fast in most cases. It could probably be done by moving whole chunks at near platter speed, instead of extent-by-extent as it is done now, as long as there is enough free space. There was a discussion about it already: http://www.spinics.net/lists/linux-btrfs/msg38608.html. It just hasn't been implemented yet. ^ permalink raw reply [flat|nested] 42+ messages in thread
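For context, the shrink under discussion is the ordinary online resize; a sketch with example device id, sizes and mount point:

$ btrfs filesystem resize 1:20T /srv/backy     # shrink devid 1 to 20TiB
$ btrfs filesystem resize -2T /srv/backy       # or shrink by a relative amount
$ btrfs filesystem show /srv/backy             # the 'devid ... size/used' line shows progress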
* Re: Shrinking a device - performance? 2017-03-30 15:00 ` Piotr Pawłow @ 2017-03-30 16:13 ` Peter Grandi 2017-03-30 22:13 ` Piotr Pawłow 0 siblings, 1 reply; 42+ messages in thread From: Peter Grandi @ 2017-03-30 16:13 UTC (permalink / raw) To: Linux fs Btrfs >> As a general consideration, shrinking a large filetree online >> in-place is an amazingly risky, difficult, slow operation and >> should be a last desperate resort (as apparently in this case), >> regardless of the filesystem type, and expecting otherwise is >> "optimistic". > The way btrfs is designed I'd actually expect shrinking to be > fast in most cases. It could probably be done by moving whole > chunks at near platter speed, [ ... ] It just hasn't been > implemented yet. That seems to me a rather "optimistic" argument, as most of the cost of shrinking is the 'balance' to pack extents into chunks. As that thread implies, the current implementation in effect does a "balance" while shrinking, by moving extents from chunks "above the line" to free space in chunks "below the line". The proposed "move whole chunks" implementation helps only if there are enough unallocated chunks "below the line". If regular 'balance' is done on the filesystem there will be some, but that just spreads the cost of the 'balance' across time, it does not by itself make a «risky, difficult, slow operation» any less so, just spreads the risk, difficulty, slowness across time. More generally one of the downsides of Btrfs is that because of its two-level (allocated/unallocated chunks, used/free nodes or blocks) design it requires more than most other designs to do regular 'balance', which is indeed «risky, difficult, slow». Compare an even more COW design like NILFS2, which requires, but a bit less, to run its garbage collector, which is also «risky, difficult, slow». Just like in Btrfs that is a tradeoff that shrinks the performance envelope in one direction and expands it in another. But in the case of Btrfs it shrinks it perhaps a bit more than it expands it, as the added flexibility of having chunk-based 'profiles' is only very partially taken advantage of. ^ permalink raw reply [flat|nested] 42+ messages in thread
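The 'balance' cost being discussed can at least be bounded with filters, so that only mostly-empty chunks are rewritten rather than everything; the thresholds below are only examples:

$ btrfs balance start -dusage=50 /srv/backy    # rewrite data chunks that are at most 50% used
$ btrfs balance start -musage=30 /srv/backy    # metadata chunks can be filtered separately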
* Re: Shrinking a device - performance? 2017-03-30 16:13 ` Peter Grandi @ 2017-03-30 22:13 ` Piotr Pawłow 2017-03-31 1:00 ` GWB 2017-03-31 10:51 ` Peter Grandi 0 siblings, 2 replies; 42+ messages in thread From: Piotr Pawłow @ 2017-03-30 22:13 UTC (permalink / raw) To: Peter Grandi, Linux fs Btrfs > The proposed "move whole chunks" implementation helps only if > there are enough unallocated chunks "below the line". If regular > 'balance' is done on the filesystem there will be some, but that > just spreads the cost of the 'balance' across time, it does not > by itself make a «risky, difficult, slow operation» any less so, > just spreads the risk, difficulty, slowness across time. Isn't that too pessimistic? Most of my filesystems have 90+% of free space unallocated, even those I never run balance on. For me it wouldn't just spread the cost, it would reduce it considerably. ^ permalink raw reply [flat|nested] 42+ messages in thread
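The allocated-versus-unallocated split being compared here is visible directly from the tools (the mount point is an example):

$ btrfs filesystem usage /srv/backy   # 'Device size' vs 'Device allocated' shows unallocated space
$ btrfs filesystem df /srv/backy      # breaks the allocated part down into data/metadata/system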
* Re: Shrinking a device - performance? 2017-03-30 22:13 ` Piotr Pawłow @ 2017-03-31 1:00 ` GWB 2017-03-31 5:26 ` Duncan 2017-03-31 11:37 ` Peter Grandi 0 siblings, 2 replies; 42+ messages in thread From: GWB @ 2017-03-31 1:00 UTC (permalink / raw) To: ct, Linux fs Btrfs Hello, Christian, I very much enjoyed the discussion you sparked with your original post. My ability in btrfs is very limited, much less than the others who have replied here, so this may not be much help. Let us assume that you have been able to shrink the device to the size you need, and you are now merrily on your way to moving the data to XFS. If so, ignore this email, delete, whatever, and read no further. If that is not the case, perhaps try something like the following. Can you try to first dedup the btrfs volume? This is probably out of date, but you could try one of these: https://btrfs.wiki.kernel.org/index.php/Deduplication If that does not work, this is a longer shot, but you might consider adding an intermediate step of creating yet another btrfs volume on the underlying lvm2 device mapper, turning on dedup, compression, and whatever else can squeeze some extra space out of the current btrfs volume. You could then try to copy over files and see if you get the results you need (or try sending the current btrfs volume as a snapshot, but I'm guessing 20TB is too much). Once the new btrfs volume on top of lvm2 is complete, you could just delete the old one, and then transfer the (hopefully compressed and deduped) data to XFS. Yep, that's probably a lot of work. I use both btrfs (on root on Ubuntu) and zfs (for data, home), and I try to do as little as possible with live mounted file systems other than snapshots. I avoid sending and receiving snapshots from the live system (mostly zfs, but sometimes btrfs) but instead write incremental snapshots as a file on the backup disks, and then import the incremental snaps into a backup pool at night. My recollection is that btrfs handles deduplication differently than zfs, but both of them can be very, very slow (from the human perspective; call that what you will; a suboptimal relationship of the parameters of performance and speed). The advantage you have is that with lvm you can create a number of different file systems. And lvm can also create snapshots. I think zfs and btrfs both have a more "elegant" way of dealing with snapshots, but lvm allows a file system without that feature to have it. Others on the list can tell you about the disadvantages. I would be curious how it turns out for you. If you are able to move the data to XFS running on top of lvm, what is your plan for snapshots in lvm? Again, I'm not an expert in btrfs, but in most cases a full balance and scrub takes care of any problems on the root partition, but that is a relatively small partition. A full balance (without the options) and scrub on 20 TiB must take a very long time even with robust hardware, would it not? CentOS, Redhat, and Oracle seem to take the position that very large data subvolumes using btrfs should work fine. But I would be curious what the rest of the list thinks about 20 TiB in one volume/subvolume. Gordon On Thu, Mar 30, 2017 at 5:13 PM, Piotr Pawłow <pp@siedziba.pl> wrote: >> The proposed "move whole chunks" implementation helps only if >> there are enough unallocated chunks "below the line". 
If regular >> 'balance' is done on the filesystem there will be some, but that >> just spreads the cost of the 'balance' across time, it does not >> by itself make a «risky, difficult, slow operation» any less so, >> just spreads the risk, difficulty, slowness across time. > > Isn't that too pessimistic? Most of my filesystems have 90+% of free > space unallocated, even those I never run balance on. For me it wouldn't > just spread the cost, it would reduce it considerably. > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 42+ messages in thread
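One way to attempt the dedup step suggested above is an out-of-band tool driving the kernel's extent-same ioctl; a sketch assuming 'duperemove' is installed (and noting that on ~20TiB this pass can itself take a very long time):

$ duperemove -dhr /srv/backy    # -d actually dedupe, -h human-readable sizes, -r recurse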
* Re: Shrinking a device - performance? 2017-03-31 1:00 ` GWB @ 2017-03-31 5:26 ` Duncan 2017-03-31 5:38 ` Duncan 2017-03-31 11:37 ` Peter Grandi 1 sibling, 1 reply; 42+ messages in thread From: Duncan @ 2017-03-31 5:26 UTC (permalink / raw) To: linux-btrfs GWB posted on Thu, 30 Mar 2017 20:00:22 -0500 as excerpted: > CentOS, Redhat, and Oracle seem to take the position that very large > data subvolumes using btrfs should work fine. But I would be curious > what the rest of the list thinks about 20 TiB in one volume/subvolume. To be sure I'm a biased voice here, as I have multiple independent btrfs on multiple partitions here, with no btrfs over 100 GiB in size, and that's on ssd so maintenance commands normally return in minutes or even seconds, not the hours to days or even weeks it takes on multi-TB btrfs on spinning rust. But FWIW... IMO there are two rules favoring multiple relatively smaller btrfs over single far larger btrfs: 1) Don't put all your data eggs in one basket, especially when that basket isn't yet entirely stable and mature. A mantra commonly repeated on this list is that btrfs is still stabilizing, not fully stable and mature, the result being that keeping backups of any data you value more than the time/cost/hassle-factor of the backup, and being practically prepared to use them, is even *MORE* important than it is on fully mature and stable filesystems. If potential users aren't prepared to do that, flat answer, they should be looking at other filesystems, tho in reality, that rule applies to stable and mature filesystems too, and any good sysadmin understands that not having a backup is in reality defining the data in question as worth less than the cost of that backup, regardless of any protests to the contrary. Based on that and the fact that if this less than 100% stable and mature filesystem fails, all those subvolumes and snapshots you painstakingly created aren't going to matter, it's all up in smoke, it just makes sense to subdivide that data roughly along functional lines and split it up into multiple independent btrfs, so that if a filesystem fails, it'll take only a fraction of the total data with it, and restoring/repairing/ rebuilding will hopefully only have to be done on a small fraction of that data. Which brings us to rule #2: 2) Don't make your filesystems so large that any maintenance on them, including both filesystem maintenance like btrfs balance/scrub/check/ whatever, and normal backup and restore operations, takes impractically long, where "impractically" can be reasonably defined as so long it discourages you from doing them in the first place and/or so long that it's going to cause unwarranted downtime. Some years ago, before I started using btrfs and while I was using mdraid, I learned this one the hard way. I had a bunch of rather large mdraids setup, each with multiple partitions and filesystems[1]. This was before mdraid got proper write-intent bitmap support, so after a crash, I'd have to repair any of these large mdraids that had been active at the time, a process taking hours, even for the primary one containing root and /home, because it contained for example a large media partition that was unlikely to have been mounted at the same time. After getting tired of this I redid things, putting each partition/ filesystem on its own mdraid. 
Then it would take only a few minutes each for the mdraids for root, /home and /var/log, and I could be back in business with them in half an hour or so, instead of the couple hours I had to wait before, to get the bigger mdraid back up and repaired. Sure, if the much larger media raid was active and the partition mounted too, I'd still have it to repair, but I could do that in the background. And there was a good chance it was /not/ active and mounted at the time of the crash and thus didn't need repaired, saving that time entirely! =:^) Eventually I arranged things so I could keep root mounted read-only unless I was updating it, and that's still the way I run it today. That makes it very nice when a crash impairs /home and /var/log, since there's much less chance root was affected, and with a normal root mount, at least I have my full normal system available to me, including the latest installed btrfs-progs, and manpages and text-mode browsers such as lynx available to me to help troubleshoot, that aren't normally available in typical distros' rescue modes. Meanwhile, a scrub (my btrfs but for /boot are raid1 both data and metadata, and /boot is mixed-mode dup, so scrub can normally repair crash damage getting the two mirrors out of sync) of root takes only ~10 seconds, a scrub of /home takes only ~45 seconds, and a scrub of /var/log is normally done nearly as fast as I hit enter on the command. Similarly, btrfs balance and btrfs check normally run in under a minute, partly because I'm on ssd, and partly because those three filesystems are all well under 50 GiB each. Of course I may have to run two or three scrubs, depending on what was mounted writable at the time of the crash, and I've had /home and /var/ log (but not root as it's read-only by default) go unmountable until repaired a couple times, but repairs are typically short too, and if that fails, blow away with a fresh mkfs.btrfs and restore from backup is typically well under an hour. So I don't tend to be down for more than an hour. Of course some other partitions may still need fixed, but that can continue in the background, while I'm back up and posting about it to the btrfs list or whatever. Compare that to the current thread where someone's trying to do a resize of a 20+ TB btrfs and it was looking to take a week, due to the massive size and the slow speed of balance on his highly reflinked filesystem on spinning rust. Point of fact. If it's multiple TBs, chances are it's going to be faster to simply blow away and recreate from backup, than it is to try to repair... and repair may or may not actually work and leave you with a fully functional btrfs afterward. Apparently that 20+ TB /is/ the backup, but it's a backup of a whole bunch of systems. OK, so even if they'd still put all those backups on the same physical hardware, consider how much simpler it would have been had they had an independent btrfs of say a TB or two for each system they were backing up. At 2 TB, it's possible to work with one or two at a time, copying them over to say a 3-4 TB hard drive (or btrfs raid1 with a pair of hard drives), blowing away the original partition, and copying back from the second backup. But with a single 20+ TB monster, they don't have anything else close to that size to work with, and have to do the shrink-current-btrfs, expand-new-filesystem (which is xfs IIRC, they're getting off of btrfs), move-more-over-from-the-old-one, repeat, dance. And /each/ /iteration/ of that dance is taking them a week or so! 
What would they have done had the btrfs gone bad and needed repaired? Try repair and wait a week or two to see if it worked? Blow away the filesystem as it was only the backup and recreate? A single 20+ TB btrfs was clearly beyond anything practical for them. Had rule #2 been followed, they'd have never been in this spot in the first place, as even if all those backups from multiple machines (virtual or physical) were on the same hardware, they'd be in different independent btrfs, and those could be handled independently. Of course once they're multiple independent btrfs, it would make sense to split that 20+ TB onto smaller hardware setups as well, and they'd have been dealing with less data overall too, because part of it would have been unaffected (or handled separately if they were moving it /all/) as it would have been on other machines. Much like creating multiple mdraids and putting a single filesystem in each, instead of putting a bunch of data on a single mdraid, ended up working much better for me, because then only a fraction of the data was affected and I could do the repairs on those mdraids far faster as there wasn't as much data to deal with! But like I said I'm biased. By hard experience, yes, and getting the sizes for the partitions wrong can be a hassle until you get to know your use-case and size them correctly, but it's a definite bias. --- [1] Partitions and filesystems: I had learned about a somewhat different benefit of multiple partitions and filesystems even longer ago, 1997 or so, when I was still on MS, testing an IE 4 beta that for performance reasons used direct-disk IO on its cache-index file, but it forgot to set the system attribute on it that would have kept defrag from touching it. So defrag would move the file out from under the now constantly running IE, as IE was part of the explorer shell. IE would then happily overwrite whatever got moved into the old index file location, and a number of testers had important files seriously damaged that way. I didn't, because I had my cache on a separate "temp" partition, so while it could and did still damage data, all it could touch was "temporary" data in the first place, meaning no real damage on my system. =:^) All because I had the temp data on its own partition/filesystem. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 42+ messages in thread
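For reference, the per-filesystem maintenance operations being timed above are simply (mount points and device are examples):

$ btrfs scrub start -Bd /home       # -B stay in foreground, -d per-device statistics
$ btrfs balance start /home
$ btrfs check /dev/sdb2             # only on an unmounted filesystem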
* Re: Shrinking a device - performance? 2017-03-31 5:26 ` Duncan @ 2017-03-31 5:38 ` Duncan 2017-03-31 12:37 ` Peter Grandi 0 siblings, 1 reply; 42+ messages in thread From: Duncan @ 2017-03-31 5:38 UTC (permalink / raw) To: linux-btrfs Duncan posted on Fri, 31 Mar 2017 05:26:39 +0000 as excerpted: > Compare that to the current thread where someone's trying to do a resize > of a 20+ TB btrfs and it was looking to take a week, due to the massive > size and the slow speed of balance on his highly reflinked filesystem on > spinning rust. Heh, /this/ thread. =:^) I obviously lost track of the thread I was replying to. Which in a way makes the reply even more forceful, as it's obviously generically targeted, not just at this thread. Even if I were so devious as to arrange that deliberately (I'm not and I didn't, FWIW, but of course if you suspect that than this assurance won't mean much either). -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-03-31 5:38 ` Duncan @ 2017-03-31 12:37 ` Peter Grandi 0 siblings, 0 replies; 42+ messages in thread From: Peter Grandi @ 2017-03-31 12:37 UTC (permalink / raw) To: linux-btrfs >> [ ... ] CentOS, Redhat, and Oracle seem to take the position >> that very large data subvolumes using btrfs should work >> fine. But I would be curious what the rest of the list thinks >> about 20 TiB in one volume/subvolume. > To be sure I'm a biased voice here, as I have multiple > independent btrfs on multiple partitions here, with no btrfs > over 100 GiB in size, and that's on ssd so maintenance > commands normally return in minutes or even seconds, That's a bit extreme I think, as there are downsides to have many too small volumes too. > not the hours to days or even weeks it takes on multi-TB btrfs > on spinning rust. Or months :-). > But FWIW... 1) Don't put all your data eggs in one basket, > especially when that basket isn't yet entirely stable and > mature. Really good point here. > A mantra commonly repeated on this list is that btrfs is still > stabilizing, My impression is that most 4.x and later versions are very reliable for "base" functionality, that is excluding multi-device, compression, qgroups, ... Put another way, what scratches the Facebook itches works well :-). > [ ... ] the time/cost/hassle-factor of the backup, and being > practically prepared to use them, is even *MORE* important > than it is on fully mature and stable filesystems. Indeed, or at least *different* filesystems. I backup JFS filesystems to XFS ones, and Btrfs filesystems to NILFS2 ones, for example. > 2) Don't make your filesystems so large that any maintenance > on them, including both filesystem maintenance like btrfs > balance/scrub/check/ whatever, and normal backup and restore > operations, takes impractically long, As per my preceding post, that's the big deal, but so many people "know better" :-). > where "impractically" can be reasonably defined as so long it > discourages you from doing them in the first place and/or so > long that it's going to cause unwarranted downtime. That's the "Very Large DataBase" level of trouble. > Some years ago, before I started using btrfs and while I was > using mdraid, I learned this one the hard way. I had a bunch > of rather large mdraids setup, [ ... ] I have recently seen another much "funnier" example: people who "know better" and follow every cool trend decide to consolidate their server farm on VMs, backed by a storage server with a largish single pool of storage holding the virtual disk images of all the server VMs. They look like geniuses until the storage pool system crashes, and a minimal integrity check on restart takes two days during which the whole organization is without access to any email, files, databases, ... > [ ... ] And there was a good chance it was /not/ active and > mounted at the time of the crash and thus didn't need > repaired, saving that time entirely! =:^) As to that I have switched to using 'autofs' to mount volumes only on access, using a simple script that turns '/etc/fstab' into an automounter dynamic map, which means that most of the time most volumes on my (home) systems are not mounted: http://www.sabi.co.uk/blog/anno06-3rd.html?060928#060928 > Eventually I arranged things so I could keep root mounted > read-only unless I was updating it, and that's still the way I > run it today. 
The ancient way was, instead of having '/' RO and '/var' RW, to have '/' RW and '/usr' RO (so for example it could be shared across many systems via NFS etc.), and while both are good ideas, I prefer the ancient way. But then some people who know better are moving to merge '/' with '/usr' without understanding what the history and the advantages are. > [ ... ] If it's multiple TBs, chances are it's going to be > faster to simply blow away and recreate from backup, than it > is to try to repair... [ ... ] Or to shrink or defragment or dedup etc., except on very high IOPS-per-TB storage. > [ ... ] how much simpler it would have been had they had an > independent btrfs of say a TB or two for each system they were > backing up. That is the general alternative to a single large pool/volume: sharding/chunking of filetrees, sometimes (as with Lustre or Ceph etc.) with a "metafilesystem" layer on top. Done manually, my suggestion is to do the sharding per-week (or other suitable period) rather than per-system, in a circular "crop rotation" scheme. So that once a volume has been filled, it becomes read-only and can even be unmounted until it needs to be reused: http://www.sabi.co.uk/blog/12-fou.html?121218b#121218b Then there is the problem that "a TB or two" is less easy with increasing disk capacities, but then I think that disks with a capacity larger than 1TB are not suitable for ordinary workloads, and are more for tape-cartridge-like usage. > What would they have done had the btrfs gone bad and needed > repaired? [ ... ] In most cases I have seen of designs aimed at achieving the lowest cost and highest flexibility "low IOPS single pool" at the expense of scalability and maintainability, the "clever" designer had been promoted or had wisely moved to another job while the storage system was still mostly empty so the problems had not yet happened. [ ... ] > But like I said I'm biased. By hard experience, yes, and > getting the sizes for the partitions wrong can be a hassle > until you get to know your use-case and size them correctly, > but it's a definite bias. Yes, I am very pleased that this post shares this and many other insights from the wisdom of the ancients, not everybody knows better :-). [ ... ] ^ permalink raw reply [flat|nested] 42+ messages in thread
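A minimal sketch of the 'autofs' arrangement described above, using a plain direct map rather than the fstab-conversion script from the blog post (paths, device, options and timeout are only examples):

$ cat /etc/auto.master.d/direct.autofs
/- /etc/auto.direct --timeout=120
$ cat /etc/auto.direct
/srv/backy -fstype=btrfs,noatime :/dev/mapper/vgsys-backy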
* Re: Shrinking a device - performance? 2017-03-31 1:00 ` GWB 2017-03-31 5:26 ` Duncan @ 2017-03-31 11:37 ` Peter Grandi 1 sibling, 0 replies; 42+ messages in thread From: Peter Grandi @ 2017-03-31 11:37 UTC (permalink / raw) To: Linux fs Btrfs > Can you try to first dedup the btrfs volume? This is probably > out of date, but you could try one of these: [ ... ] Yep, > that's probably a lot of work. [ ... ] My recollection is that > btrfs handles deduplication differently than zfs, but both of > them can be very, very slow But the big deal there is that dedup is indeed a very expensive operation, even worse than 'balance'. A balanced, deduped volume will shrink faster in most cases, but the time taken is simply moved from shrinking to preparing. > Again, I'm not an expert in btrfs, but in most cases a full > balance and scrub takes care of any problems on the root > partition, but that is a relatively small partition. A full > balance (without the options) and scrub on 20 TiB must take a > very long time even with robust hardware, would it not? There have been reports of several months for volumes of that size subject to ordinary workload. > CentOS, Redhat, and Oracle seem to take the position that very > large data subvolumes using btrfs should work fine. This is a long-standing controversy, and for example there have been "interesting" debates in the XFS mailing list. Btrfs in this is not really different from others, with one major difference in context, that many Btrfs developers work for a company that relies on large numbers of small servers, to the point that fixing multidevice issues has not been a priority. The controversy over large volumes is that while no doubt the logical structures of recent filesystem types can support single volumes of many petabytes (or even much larger), and such volumes have indeed been created and "work"-ish, so they are unquestionably "syntactically valid", the tradeoffs involved especially as to maintainability may mean that they don't "work" well and sustainably so. The fundamental issue is metadata: while the logical structures, using 48-64 bit pointers, unquestionably scale "syntactically", they don't scale pragmatically when considering whole-volume maintenance like checking, repair, balancing, scrubbing, indexing (which includes making incremental backups etc.). Note: large volumes don't have just a speed problem for whole-volume operations, they also have a memory problem, as most tools hold an in-memory copy of the metadata. There have been cases where indexing or repair of a volume requires a lot more RAM (many hundreds of GiB or some TiB of RAM) than the system on which the volume was being used. The problem is of course smaller if the large volume contains mostly large files, and bigger if the volume is stored on low IOPS-per-TB devices and used on small-memory systems. But even with large files, even if filetree object metadata (inodes etc.) are relatively few, eventually space metadata must at least potentially resolve down to single sectors, and that can be a lot of metadata unless both used and free space are very unfragmented. The fundamental technological issue is: *data* IO rates, in both random IOPS and sequential ones, can be scaled "almost" linearly by parallelizing them using RAID or equivalent, allowing large volumes to serve scalably large and parallel *data* workloads, but *metadata* IO rates cannot be easily parallelized, because metadata structures are graphs, not arrays of bytes like files. 
So a large volume on 100 storage devices can serve in parallel a significant percentage of 100 times the data workload of a small volume on 1 storage device, but not so much for the metadata workload. For example, I have never seen a parallel 'fsck' tool that can take advantage of 100 storage devices to complete a scan of a single volume on 100 storage devices in not much longer time than the scan of a volume on 1 of the storage devices. > But I would be curious what the rest of the list thinks about > 20 TiB in one volume/subvolume. Personally I think that while volumes of many petabytes "work" syntactically, there are serious maintainability problems (which I have seen happen at a number of sites) with volumes larger than 4TB-8TB with any current local filesystem design. That depends also on number/size of storage devices, and their nature, that is IOPS, as after all metadata workloads do scale a bit with the number of available IOPS, even if far more slowly than data workloads. For example I think that an 8TB volume is not desirable on a single 8TB disk for ordinary workloads (but then I think that disks above 1-2TB are just not suitable for ordinary filesystem workloads), but with lots of smaller/faster disks a 12TB volume would probably be acceptable, and maybe a number of flash SSDs might make even a 20TB volume acceptable. Of course there are lots of people who know better. :-) ^ permalink raw reply [flat|nested] 42+ messages in thread
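On the memory point: newer btrfs-progs offer a lower-memory checking mode that trades RAM for extra metadata reads; a hedged sketch (the device name is a placeholder, and the mode has its own limitations):

$ btrfs check /dev/mapper/vgsys-backy                 # default mode caches metadata in RAM
$ btrfs check --mode=lowmem /dev/mapper/vgsys-backy   # re-reads metadata instead of caching it all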
* Re: Shrinking a device - performance? 2017-03-30 22:13 ` Piotr Pawłow 2017-03-31 1:00 ` GWB @ 2017-03-31 10:51 ` Peter Grandi 1 sibling, 0 replies; 42+ messages in thread From: Peter Grandi @ 2017-03-31 10:51 UTC (permalink / raw) To: Linux fs Btrfs >>> The way btrfs is designed I'd actually expect shrinking to >>> be fast in most cases. [ ... ] >> The proposed "move whole chunks" implementation helps only if >> there are enough unallocated chunks "below the line". If regular >> 'balance' is done on the filesystem there will be some, but that >> just spreads the cost of the 'balance' across time, it does not >> by itself make a «risky, difficult, slow operation» any less so, >> just spreads the risk, difficulty, slowness across time. > Isn't that too pessimistic? Maybe, it depends on the workload impacting the volume and how much it churns the free/unallocated situation. > Most of my filesystems have 90+% of free space unallocated, > even those I never run balance on. That seems quite lucky to me, as it definitely is not my experience or even my expectation in the general case: in my laptop and desktop with relatively few updates I have to run 'balance' fairly frequently, and "Knorrie" has produced a nice tool that produces a graphical map of free vs. unallocated space, and most examples and users find quite a bit of balancing needs to be done. > For me it wouldn't just spread the cost, it would reduce it > considerably. In your case the cost of the implicit or explicit 'balance' simply does not arise because 'balance' is not necessary, and then moving whole chunks is indeed cheap. The argument here is in part whether used space (extents) or allocated space (chunks) is more fragmented as well as the amount of metadata to update in either case. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Shrinking a device - performance? @ 2017-03-27 11:51 Christian Theune 2017-03-27 12:55 ` Christian Theune 0 siblings, 1 reply; 42+ messages in thread From: Christian Theune @ 2017-03-27 11:51 UTC (permalink / raw) To: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 1280 bytes --] Hi, (I hope I’m not double posting. My mail client was misconfigured and I think I only managed to send the mail correctly this time.) I’m currently shrinking a device and it seems that the performance of shrink is abysmal. I intended to shrink a ~22TiB filesystem down to 20TiB. This is still using LVM underneath so that I can’t just remove a device from the filesystem but have to use the resize command. Label: 'backy' uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4 Total devices 1 FS bytes used 18.21TiB devid 1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy This has been running since last Thursday, so roughly 3.5days now. The “used” number in devid1 has moved about 1TiB in this time. The filesystem is seeing regular usage (read and write) and when I’m suspending any application traffic I see about 1GiB of movement every now and then. Maybe once every 30 seconds or so. Does this sound fishy or normal to you? Kind regards, Christian -- Christian Theune · ct@flyingcircus.io · +49 345 219401 0 Flying Circus Internet Operations GmbH · http://flyingcircus.io Forsterstraße 29 · 06112 Halle (Saale) · Deutschland HR Stendal HRB 21169 · Geschäftsführer: Christian. Theune, Christian. Zagrodnick [-- Attachment #2: Message signed with OpenPGP --] [-- Type: application/pgp-signature, Size: 496 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Shrinking a device - performance? 2017-03-27 11:51 Christian Theune @ 2017-03-27 12:55 ` Christian Theune 0 siblings, 0 replies; 42+ messages in thread From: Christian Theune @ 2017-03-27 12:55 UTC (permalink / raw) To: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 549 bytes --] > On Mar 27, 2017, at 1:51 PM, Christian Theune <ct@flyingcircus.io> wrote: > > Hi, > > (I hope I’m not double posting. My mail client was misconfigured and I think I only managed to send the mail correctly this time.) Turns out I did double post. Mea culpa. -- Christian Theune · ct@flyingcircus.io · +49 345 219401 0 Flying Circus Internet Operations GmbH · http://flyingcircus.io Forsterstraße 29 · 06112 Halle (Saale) · Deutschland HR Stendal HRB 21169 · Geschäftsführer: Christian. Theune, Christian. Zagrodnick [-- Attachment #2: Message signed with OpenPGP --] [-- Type: application/pgp-signature, Size: 496 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
end of thread, other threads:[~2017-04-01 11:30 UTC | newest] Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-03-27 11:17 Shrinking a device - performance? Christian Theune 2017-03-27 13:07 ` Hugo Mills 2017-03-27 13:20 ` Christian Theune 2017-03-27 13:24 ` Hugo Mills 2017-03-27 13:46 ` Austin S. Hemmelgarn 2017-03-27 13:50 ` Christian Theune 2017-03-27 13:54 ` Christian Theune 2017-03-27 14:17 ` Austin S. Hemmelgarn 2017-03-27 14:49 ` Christian Theune 2017-03-27 15:06 ` Roman Mamedov 2017-04-01 9:05 ` Kai Krakow 2017-03-27 14:14 ` Austin S. Hemmelgarn 2017-03-27 14:48 ` Roman Mamedov 2017-03-27 14:53 ` Christian Theune 2017-03-28 14:43 ` Peter Grandi 2017-03-28 14:50 ` Tomasz Kusmierz 2017-03-28 15:06 ` Peter Grandi 2017-03-28 15:35 ` Tomasz Kusmierz 2017-03-28 16:20 ` Peter Grandi 2017-03-28 14:59 ` Peter Grandi 2017-03-28 15:20 ` Peter Grandi 2017-03-28 15:56 ` Austin S. Hemmelgarn 2017-03-30 15:55 ` Peter Grandi 2017-03-31 12:41 ` Austin S. Hemmelgarn 2017-03-31 17:25 ` Peter Grandi 2017-03-31 19:38 ` GWB 2017-03-31 20:27 ` Peter Grandi 2017-04-01 0:02 ` GWB 2017-04-01 2:42 ` Duncan 2017-04-01 4:26 ` GWB 2017-04-01 11:30 ` Peter Grandi 2017-03-30 15:00 ` Piotr Pawłow 2017-03-30 16:13 ` Peter Grandi 2017-03-30 22:13 ` Piotr Pawłow 2017-03-31 1:00 ` GWB 2017-03-31 5:26 ` Duncan 2017-03-31 5:38 ` Duncan 2017-03-31 12:37 ` Peter Grandi 2017-03-31 11:37 ` Peter Grandi 2017-03-31 10:51 ` Peter Grandi 2017-03-27 11:51 Christian Theune 2017-03-27 12:55 ` Christian Theune