* Shrinking a device - performance?
@ 2017-03-27 11:17 Christian Theune
  2017-03-27 13:07 ` Hugo Mills
  0 siblings, 1 reply; 42+ messages in thread
From: Christian Theune @ 2017-03-27 11:17 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I’m currently shrinking a device and it seems that the performance of shrink is abysmal. I intended to shrink a ~22TiB filesystem down to 20TiB. This is still using LVM underneath so that I can’t just remove a device from the filesystem but have to use the resize command.

Label: 'backy'  uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
        Total devices 1 FS bytes used 18.21TiB
        devid    1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy
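
The resize command in question, with the mount point being an illustrative placeholder here:

    btrfs filesystem resize 20T /srv/backy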

This has been running since last Thursday, so roughly 3.5days now. The “used” number in devid1 has moved about 1TiB in this time. The filesystem is seeing regular usage (read and write) and when I’m suspending any application traffic I see about 1GiB of movement every now and then. Maybe once every 30 seconds or so.

Does this sound fishy or normal to you?

Kind regards,
Christian

--
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick



* Re: Shrinking a device - performance?
  2017-03-27 11:17 Shrinking a device - performance? Christian Theune
@ 2017-03-27 13:07 ` Hugo Mills
  2017-03-27 13:20   ` Christian Theune
  0 siblings, 1 reply; 42+ messages in thread
From: Hugo Mills @ 2017-03-27 13:07 UTC (permalink / raw)
  To: Christian Theune; +Cc: linux-btrfs

On Mon, Mar 27, 2017 at 01:17:26PM +0200, Christian Theune wrote:
> Hi,
> 
> I’m currently shrinking a device and it seems that the performance of shrink is abysmal. I intended to shrink a ~22TiB filesystem down to 20TiB. This is still using LVM underneath so that I can’t just remove a device from the filesystem but have to use the resize command.
> 
> Label: 'backy'  uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
>         Total devices 1 FS bytes used 18.21TiB
>         devid    1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy
> 
> This has been running since last Thursday, so roughly 3.5days now. The “used” number in devid1 has moved about 1TiB in this time. The filesystem is seeing regular usage (read and write) and when I’m suspending any application traffic I see about 1GiB of movement every now and then. Maybe once every 30 seconds or so.
> 
> Does this sound fishy or normal to you?

   On my hardware (consumer HDDs and SATA, RAID-1 over 6 devices), it
takes about a minute to move 1 GiB of data. At that rate, it would
take 1000 minutes (or about 16 hours) to move 1 TiB of data.

   However, there are cases where some items of data can take *much*
longer to move. The biggest of these is when you have lots of
snapshots. When that happens, some (but not all) of the metadata can
take a very long time. In my case, with a couple of hundred snapshots,
some metadata chunks take 4+ hours to move.
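
   To see how much data vs metadata is still left to relocate, the
usual commands are (mount point illustrative):

      btrfs filesystem df /mountpoint
      btrfs filesystem usage /mountpoint

the latter needing a reasonably recent btrfs-progs.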

   Hugo.

-- 
Hugo Mills             | Great films about cricket: Silly Point Break
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |


* Re: Shrinking a device - performance?
  2017-03-27 13:07 ` Hugo Mills
@ 2017-03-27 13:20   ` Christian Theune
  2017-03-27 13:24     ` Hugo Mills
  2017-03-27 14:48     ` Roman Mamedov
  0 siblings, 2 replies; 42+ messages in thread
From: Christian Theune @ 2017-03-27 13:20 UTC (permalink / raw)
  To: Hugo Mills; +Cc: linux-btrfs

Hi,

> On Mar 27, 2017, at 3:07 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
> 
>   On my hardware (consumer HDDs and SATA, RAID-1 over 6 devices), it
> takes about a minute to move 1 GiB of data. At that rate, it would
> take 1000 minutes (or about 16 hours) to move 1 TiB of data.
> 
>   However, there are cases where some items of data can take *much*
> longer to move. The biggest of these is when you have lots of
> snapshots. When that happens, some (but not all) of the metadata can
> take a very long time. In my case, with a couple of hundred snapshots,
> some metadata chunks take 4+ hours to move.

Thanks for that info. The 1min per 1GiB is what I saw too - the “it can take longer” wasn’t really explainable to me.

As I’m not using snapshots: would large files (100+gb) with long chains of CoW history (specifically reflink copies) also hurt?
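
(By reflink copies I mean shared-extent copies of the kind made with e.g.

    cp --reflink=always image.current image.new

where the file names are just illustrative - so successive revisions share most of their extents.)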

Something I’d like to verify: does having traffic on the volume have the potential to delay this infinitely? I.e. does the system write to any segments that we’re trying to free so it may have to work on the same chunk over and over again? If not, then this means it’s just slow and we’re looking forward to about 2 months worth of time shrinking this volume. (And then again on the next bigger server probably about 3-4 months).

(Background info: we’re migrating large volumes from btrfs to xfs and can only do this step by step: copying some data, shrinking the btrfs volume, extending the xfs volume, rinse repeat. If someone should have any suggestions to speed this up and not having to think in terms of _months_ then I’m all ears.)
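
Each round of that loop looks roughly like this - LV names, mount points and the step size are illustrative, and the LV is only reduced after the filesystem shrink has finished:

    btrfs filesystem resize -2T /srv/backy-btrfs
    lvreduce -L -2T vgsys/backy-btrfs
    lvextend -L +2T vgsys/backy-xfs
    xfs_growfs /srv/backy-xfs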

Cheers,
Christian

--
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick



* Re: Shrinking a device - performance?
  2017-03-27 13:20   ` Christian Theune
@ 2017-03-27 13:24     ` Hugo Mills
  2017-03-27 13:46       ` Austin S. Hemmelgarn
  2017-03-27 14:48     ` Roman Mamedov
  1 sibling, 1 reply; 42+ messages in thread
From: Hugo Mills @ 2017-03-27 13:24 UTC (permalink / raw)
  To: Christian Theune; +Cc: linux-btrfs

On Mon, Mar 27, 2017 at 03:20:37PM +0200, Christian Theune wrote:
> Hi,
> 
> > On Mar 27, 2017, at 3:07 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
> > 
> >   On my hardware (consumer HDDs and SATA, RAID-1 over 6 devices), it
> > takes about a minute to move 1 GiB of data. At that rate, it would
> > take 1000 minutes (or about 16 hours) to move 1 TiB of data.
> > 
> >   However, there are cases where some items of data can take *much*
> > longer to move. The biggest of these is when you have lots of
> > snapshots. When that happens, some (but not all) of the metadata can
> > take a very long time. In my case, with a couple of hundred snapshots,
> > some metadata chunks take 4+ hours to move.

> Thanks for that info. The 1min per 1GiB is what I saw too - the “it
> can take longer” wasn’t really explainable to me.

> As I’m not using snapshots: would large files (100+gb) with long
> chains of CoW history (specifically reflink copies) also hurt?

   Yes, that's the same issue -- it's to do with the number of times
an extent is shared. Snapshots are one way of creating that sharing,
reflinks are another.

> Something I’d like to verify: does having traffic on the volume have
> the potential to delay this infinitely? I.e. does the system write
> to any segments that we’re trying to free so it may have to work on
> the same chunk over and over again? If not, then this means it’s
> just slow and we’re looking forward to about 2 months worth of time
> shrinking this volume. (And then again on the next bigger server
> probably about 3-4 months).

   I don't know. I would hope not, but I simply don't know enough
about the internal algorithms for that. Maybe someone else can confirm?

> (Background info: we’re migrating large volumes from btrfs to xfs
> and can only do this step by step: copying some data, shrinking the
> btrfs volume, extending the xfs volume, rinse repeat. If someone
> should have any suggestions to speed this up and not having to think
> in terms of _months_ then I’m all ears.)

   All I can suggest is to move some unused data off the volume and do
it in fewer larger steps. Sorry.

   Hugo.

-- 
Hugo Mills             | Jenkins! Chap with the wings there! Five rounds
hugo@... carfax.org.uk | rapid!
http://carfax.org.uk/  |                 Brigadier Alistair Lethbridge-Stewart
PGP: E2AB1DE4          |                                Dr Who and the Daemons


* Re: Shrinking a device - performance?
  2017-03-27 13:24     ` Hugo Mills
@ 2017-03-27 13:46       ` Austin S. Hemmelgarn
  2017-03-27 13:50         ` Christian Theune
  0 siblings, 1 reply; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2017-03-27 13:46 UTC (permalink / raw)
  To: Hugo Mills, Christian Theune, linux-btrfs

On 2017-03-27 09:24, Hugo Mills wrote:
> On Mon, Mar 27, 2017 at 03:20:37PM +0200, Christian Theune wrote:
>> Hi,
>>
>>> On Mar 27, 2017, at 3:07 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
>>>
>>>   On my hardware (consumer HDDs and SATA, RAID-1 over 6 devices), it
>>> takes about a minute to move 1 GiB of data. At that rate, it would
>>> take 1000 minutes (or about 16 hours) to move 1 TiB of data.
>>>
>>>   However, there are cases where some items of data can take *much*
>>> longer to move. The biggest of these is when you have lots of
>>> snapshots. When that happens, some (but not all) of the metadata can
>>> take a very long time. In my case, with a couple of hundred snapshots,
>>> some metadata chunks take 4+ hours to move.
>
>> Thanks for that info. The 1min per 1GiB is what I saw too - the “it
>> can take longer” wasn’t really explainable to me.
>
>> As I’m not using snapshots: would large files (100+gb) with long
>> chains of CoW history (specifically reflink copies) also hurt?
>
>    Yes, that's the same issue -- it's to do with the number of times
> an extent is shared. Snapshots are one way of creating that sharing,
> reflinks are another.
FWIW, I've noticed less of an issue with reflinks than snapshots, but I 
can't comment on this specific case.
>
>> Something I’d like to verify: does having traffic on the volume have
>> the potential to delay this infinitely? I.e. does the system write
>> to any segments that we’re trying to free so it may have to work on
>> the same chunk over and over again? If not, then this means it’s
>> just slow and we’re looking forward to about 2 months worth of time
>> shrinking this volume. (And then again on the next bigger server
>> probably about 3-4 months).
>
>    I don't know. I would hope not, but I simply don't know enough
> about the internal algorithms for that. Maybe someone else can confirm?
I'm not 100% certain, but I believe that while it can delay things, it 
can't do so infinitely.  AFAICT from looking at the code (disclaimer: I 
am not a C programmer by profession), it looks like writes to chunks 
that are being compacted or moved will go to the new location, not the 
old one, but writes to chunks which aren't being touched by the resize 
currently will just go to where the chunk is currently.  Based on this, 
lowering the amount of traffic to the FS could probably speed things up 
a bit, but it likely won't help much.
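
As an aside, you can get a rough idea of how far along the resize is 
from the kernel log, which (at least on the kernels I've looked at) 
prints a line per block group as it relocates them:

    dmesg | grep -i 'relocating block group'
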
>
>> (Background info: we’re migrating large volumes from btrfs to xfs
>> and can only do this step by step: copying some data, shrinking the
>> btrfs volume, extending the xfs volume, rinse repeat. If someone
>> should have any suggestions to speed this up and not having to think
>> in terms of _months_ then I’m all ears.)
>
>    All I can suggest is to move some unused data off the volume and do
> it in fewer larger steps. Sorry.
Same.

The other option though is to just schedule a maintenance window, nuke 
the old FS, and restore from a backup.  If you can afford to take the 
system off-line temporarily, this will almost certainly go faster 
(assuming you have a reasonably fast means of restoring backups).


* Re: Shrinking a device - performance?
  2017-03-27 13:46       ` Austin S. Hemmelgarn
@ 2017-03-27 13:50         ` Christian Theune
  2017-03-27 13:54           ` Christian Theune
  2017-03-27 14:14           ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 42+ messages in thread
From: Christian Theune @ 2017-03-27 13:50 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Hugo Mills, linux-btrfs

Hi,

> On Mar 27, 2017, at 3:46 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>> 
>>> Something I’d like to verify: does having traffic on the volume have
>>> the potential to delay this infinitely? I.e. does the system write
>>> to any segments that we’re trying to free so it may have to work on
>>> the same chunk over and over again? If not, then this means it’s
>>> just slow and we’re looking forward to about 2 months worth of time
>>> shrinking this volume. (And then again on the next bigger server
>>> probably about 3-4 months).
>> 
>>   I don't know. I would hope not, but I simply don't know enough
>> about the internal algorithms for that. Maybe someone else can confirm?
> I'm not 100% certain, but I believe that while it can delay things, it can't do so infinitely.  AFAICT from looking at the code (disclaimer: I am not a C programmer by profession), it looks like writes to chunks that are being compacted or moved will go to the new location, not the old one, but writes to chunks which aren't being touched by the resize currently will just go to where the chunk is currently.  Based on this, lowering the amount of traffic to the FS could probably speed things up a bit, but it likely won't help much.

I hoped that this is the strategy implemented, otherwise it would end up in an infinite cat-and-mouse game. ;)

>>> (Background info: we’re migrating large volumes from btrfs to xfs
>>> and can only do this step by step: copying some data, shrinking the
>>> btrfs volume, extending the xfs volume, rinse repeat. If someone
>>> should have any suggestions to speed this up and not having to think
>>> in terms of _months_ then I’m all ears.)
>> 
>>   All I can suggest is to move some unused data off the volume and do
>> it in fewer larger steps. Sorry.
> Same.
> 
> The other option though is to just schedule a maintenance window, nuke the old FS, and restore from a backup.  If you can afford to take the system off-line temporarily, this will almost certainly go faster (assuming you have a reasonably fast means of restoring backups).

Well. This is the backup. ;)

Thanks,
Christian

--
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick



* Re: Shrinking a device - performance?
  2017-03-27 13:50         ` Christian Theune
@ 2017-03-27 13:54           ` Christian Theune
  2017-03-27 14:17             ` Austin S. Hemmelgarn
  2017-03-27 14:14           ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Theune @ 2017-03-27 13:54 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Hugo Mills, linux-btrfs

Hi,

> On Mar 27, 2017, at 3:50 PM, Christian Theune <ct@flyingcircus.io> wrote:
> 
> Hi,
> 
>> On Mar 27, 2017, at 3:46 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>>> 
>>>> Something I’d like to verify: does having traffic on the volume have
>>>> the potential to delay this infinitely? I.e. does the system write
>>>> to any segments that we’re trying to free so it may have to work on
>>>> the same chunk over and over again? If not, then this means it’s
>>>> just slow and we’re looking forward to about 2 months worth of time
>>>> shrinking this volume. (And then again on the next bigger server
>>>> probably about 3-4 months).
>>> 
>>>  I don't know. I would hope not, but I simply don't know enough
>>> about the internal algorithms for that. Maybe someone else can confirm?
>> I'm not 100% certain, but I believe that while it can delay things, it can't do so infinitely.  AFAICT from looking at the code (disclaimer: I am not a C programmer by profession), it looks like writes to chunks that are being compacted or moved will go to the new location, not the old one, but writes to chunks which aren't being touched by the resize currently will just go to where the chunk is currently.  Based on this, lowering the amount of traffic to the FS could probably speed things up a bit, but it likely won't help much.
> 
> I hoped that this is the strategy implemented, otherwise it would end up in an infinite cat-and-mouse game. ;)
> 
>>>> (Background info: we’re migrating large volumes from btrfs to xfs
>>>> and can only do this step by step: copying some data, shrinking the
>>>> btrfs volume, extending the xfs volume, rinse repeat. If someone
>>>> should have any suggestions to speed this up and not having to think
>>>> in terms of _months_ then I’m all ears.)
>>> 
>>>  All I can suggest is to move some unused data off the volume and do
>>> it in fewer larger steps. Sorry.
>> Same.
>> 
>> The other option though is to just schedule a maintenance window, nuke the old FS, and restore from a backup.  If you can afford to take the system off-line temporarily, this will almost certainly go faster (assuming you have a reasonably fast means of restoring backups).
> 
> Well. This is the backup. ;)

One strategy that does come to mind: we’re converting our backup from a system that uses reflinks to a non-reflink based system. We can convert this in place so this would remove all the reflink stuff in the existing filesystem and then we maybe can do the FS conversion faster when this isn’t an issue any longer. I think I’ll

Christian

--
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick



* Re: Shrinking a device - performance?
  2017-03-27 13:50         ` Christian Theune
  2017-03-27 13:54           ` Christian Theune
@ 2017-03-27 14:14           ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2017-03-27 14:14 UTC (permalink / raw)
  To: Christian Theune; +Cc: Hugo Mills, linux-btrfs

On 2017-03-27 09:50, Christian Theune wrote:
> Hi,
>
>> On Mar 27, 2017, at 3:46 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>>>
>>>> Something I’d like to verify: does having traffic on the volume have
>>>> the potential to delay this infinitely? I.e. does the system write
>>>> to any segments that we’re trying to free so it may have to work on
>>>> the same chunk over and over again? If not, then this means it’s
>>>> just slow and we’re looking forward to about 2 months worth of time
>>>> shrinking this volume. (And then again on the next bigger server
>>>> probably about 3-4 months).
>>>
>>>   I don't know. I would hope not, but I simply don't know enough
>>> about the internal algorithms for that. Maybe someone else can confirm?
>> I'm not 100% certain, but I believe that while it can delay things, it can't do so infinitely.  AFAICT from looking at the code (disclaimer: I am not a C programmer by profession), it looks like writes to chunks that are being compacted or moved will go to the new location, not the old one, but writes to chunks which aren't being touched by the resize currently will just go to where the chunk is currently.  Based on this, lowering the amount of traffic to the FS could probably speed things up a bit, but it likely won't help much.
>
> I hoped that this is the strategy implemented, otherwise it would end up in an infinite cat-and-mouse game. ;)
I know that balance and replace work this way, and the code for resize 
appears to handle things similarly to both, so I'm pretty certain it 
works this way.  TBH though, it's really the only sane way to handle 
something like this.
>
>>>> (Background info: we’re migrating large volumes from btrfs to xfs
>>>> and can only do this step by step: copying some data, shrinking the
>>>> btrfs volume, extending the xfs volume, rinse repeat. If someone
>>>> should have any suggestions to speed this up and not having to think
>>>> in terms of _months_ then I’m all ears.)
>>>
>>>   All I can suggest is to move some unused data off the volume and do
>>> it in fewer larger steps. Sorry.
>> Same.
>>
>> The other option though is to just schedule a maintenance window, nuke the old FS, and restore from a backup.  If you can afford to take the system off-line temporarily, this will almost certainly go faster (assuming you have a reasonably fast means of restoring backups).
>
> Well. This is the backup. ;)
Ah, yeah, that does complicate things a bit more.



* Re: Shrinking a device - performance?
  2017-03-27 13:54           ` Christian Theune
@ 2017-03-27 14:17             ` Austin S. Hemmelgarn
  2017-03-27 14:49               ` Christian Theune
  0 siblings, 1 reply; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2017-03-27 14:17 UTC (permalink / raw)
  To: Christian Theune; +Cc: Hugo Mills, linux-btrfs

On 2017-03-27 09:54, Christian Theune wrote:
> Hi,
>
>> On Mar 27, 2017, at 3:50 PM, Christian Theune <ct@flyingcircus.io> wrote:
>>
>> Hi,
>>
>>> On Mar 27, 2017, at 3:46 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>>>>
>>>>> Something I’d like to verify: does having traffic on the volume have
>>>>> the potential to delay this infinitely? I.e. does the system write
>>>>> to any segments that we’re trying to free so it may have to work on
>>>>> the same chunk over and over again? If not, then this means it’s
>>>>> just slow and we’re looking forward to about 2 months worth of time
>>>>> shrinking this volume. (And then again on the next bigger server
>>>>> probably about 3-4 months).
>>>>
>>>>  I don't know. I would hope not, but I simply don't know enough
>>>> about the internal algorithms for that. Maybe someone else can confirm?
>>> I'm not 100% certain, but I believe that while it can delay things, it can't do so infinitely.  AFAICT from looking at the code (disclaimer: I am not a C programmer by profession), it looks like writes to chunks that are being compacted or moved will go to the new location, not the old one, but writes to chunks which aren't being touched by the resize currently will just go to where the chunk is currently.  Based on this, lowering the amount of traffic to the FS could probably speed things up a bit, but it likely won't help much.
>>
>> I hoped that this is the strategy implemented, otherwise it would end up in an infinite cat-and-mouse game. ;)
>>
>>>>> (Background info: we’re migrating large volumes from btrfs to xfs
>>>>> and can only do this step by step: copying some data, shrinking the
>>>>> btrfs volume, extending the xfs volume, rinse repeat. If someone
>>>>> should have any suggestions to speed this up and not having to think
>>>>> in terms of _months_ then I’m all ears.)
>>>>
>>>>  All I can suggest is to move some unused data off the volume and do
>>>> it in fewer larger steps. Sorry.
>>> Same.
>>>
>>> The other option though is to just schedule a maintenance window, nuke the old FS, and restore from a backup.  If you can afford to take the system off-line temporarily, this will almost certainly go faster (assuming you have a reasonably fast means of restoring backups).
>>
>> Well. This is the backup. ;)
>
> One strategy that does come to mind: we’re converting our backup from a system that uses reflinks to a non-reflink based system. We can convert this in place so this would remove all the reflink stuff in the existing filesystem and then we maybe can do the FS conversion faster when this isn’t an issue any longer. I think I’ll
One other thing that I just thought of:
For a backup system, assuming some reasonable thinning system is used 
for the backups, I would personally migrate things slowly over time by 
putting new backups on the new filesystem, and shrinking the old 
filesystem as the old backups there get cleaned out.  Unfortunately, 
most backup software I've seen doesn't handle this well, so it's not all 
that easy to do, but it does save you from having to migrate data off of 
the old filesystem, and means you don't have to worry as much about the 
resize of the old FS taking forever.


* Re: Shrinking a device - performance?
  2017-03-27 13:20   ` Christian Theune
  2017-03-27 13:24     ` Hugo Mills
@ 2017-03-27 14:48     ` Roman Mamedov
  2017-03-27 14:53       ` Christian Theune
  1 sibling, 1 reply; 42+ messages in thread
From: Roman Mamedov @ 2017-03-27 14:48 UTC (permalink / raw)
  To: Christian Theune; +Cc: Hugo Mills, linux-btrfs

On Mon, 27 Mar 2017 15:20:37 +0200
Christian Theune <ct@flyingcircus.io> wrote:

> (Background info: we’re migrating large volumes from btrfs to xfs and can
> only do this step by step: copying some data, shrinking the btrfs volume,
> extending the xfs volume, rinse repeat. If someone should have any
> suggestions to speed this up and not having to think in terms of _months_
> then I’m all ears.)

I would only suggest that you reconsider XFS. You can't shrink XFS, therefore
you won't have the flexibility to migrate in the same way to anything better
that comes along in the future (ZFS perhaps? or even Bcachefs?). XFS does not
perform that much better than Ext4, and very importantly, Ext4 can be shrunk.

From the looks of it Ext4 has also overcome its 16TB limitation:
http://askubuntu.com/questions/779754/how-do-i-resize-an-ext4-partition-beyond-the-16tb-limit
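
For reference, the ext4 shrink has to be done offline, roughly (mount
point and target size illustrative):

  umount /srv/backy
  e2fsck -f /dev/vgsys/backy
  resize2fs /dev/vgsys/backy 18T
  lvreduce -L 18T vgsys/backy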

-- 
With respect,
Roman


* Re: Shrinking a device - performance?
  2017-03-27 14:17             ` Austin S. Hemmelgarn
@ 2017-03-27 14:49               ` Christian Theune
  2017-03-27 15:06                 ` Roman Mamedov
  0 siblings, 1 reply; 42+ messages in thread
From: Christian Theune @ 2017-03-27 14:49 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Hugo Mills, linux-btrfs

Hi,

> On Mar 27, 2017, at 4:17 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
> 
> One other thing that I just thought of:
> For a backup system, assuming some reasonable thinning system is used for the backups, I would personally migrate things slowly over time by putting new backups on the new filesystem, and shrinking the old filesystem as the old backups there get cleaned out.  Unfortunately, most backup software I've seen doesn't handle this well, so it's not all that easy to do, but it does save you from having to migrate data off of the old filesystem, and means you don't have to worry as much about the resize of the old FS taking forever.

Right. This is an option we can do from a software perspective (our own solution - https://bitbucket.org/flyingcircus/backy) but our systems in use can’t hold all the data twice. Even though we’re migrating to a backend implementation that uses less data than before I have to perform an “inplace” migration in some way. This is VM block device backup. So basically we migrate one VM with all its previous data and that works quite fine with a little headroom. However, migrating all VMs to a new “full” backup and then wait for the old to shrink would only work if we had a completely empty backup server in place, which we don’t.

Also: the idea of migrating on btrfs also has its downside - the performance of “mkdir” and “fsync” is abysmal at the moment. I’m waiting for the current shrinking job to finish but this is likely limited to the “find free space” algorithm. We’re talking about a few megabytes converted per second. Sigh.

Cheers,
Christian Theune

--
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick



* Re: Shrinking a device - performance?
  2017-03-27 14:48     ` Roman Mamedov
@ 2017-03-27 14:53       ` Christian Theune
  2017-03-28 14:43         ` Peter Grandi
  0 siblings, 1 reply; 42+ messages in thread
From: Christian Theune @ 2017-03-27 14:53 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Hugo Mills, linux-btrfs

Hi,

> On Mar 27, 2017, at 4:48 PM, Roman Mamedov <rm@romanrm.net> wrote:
> 
> On Mon, 27 Mar 2017 15:20:37 +0200
> Christian Theune <ct@flyingcircus.io> wrote:
> 
>> (Background info: we’re migrating large volumes from btrfs to xfs and can
>> only do this step by step: copying some data, shrinking the btrfs volume,
>> extending the xfs volume, rinse repeat. If someone should have any
>> suggestions to speed this up and not having to think in terms of _months_
>> then I’m all ears.)
> 
> I would only suggest that you reconsider XFS. You can't shrink XFS, therefore
> you won't have the flexibility to migrate in the same way to anything better
> that comes along in the future (ZFS perhaps? or even Bcachefs?). XFS does not
> perform that much better over Ext4, and very importantly, Ext4 can be shrunk.

That is true. However, we have moved the expected feature set of the filesystem (i.e. cow) down to “store files safely and reliably”, and we’ve seen too much breakage with ext4 in the past. Of course “persistence means you’ll have to say I’m sorry” and thus with either choice we may be faced with some issue in the future that we might have circumvented with another solution, and yes, flexibility is worth a great deal.

We’ve run XFS and ext4 on different (large and small) workloads in the last 2 years and I have to say I’m much more happy about XFS even with the shrinking limitation.

To us ext4 is prohibitive with its fsck performance and we do like the tight error checking in XFS.

Thanks for the reminder though - especially in the public archive making this tradeoff with flexibility known is wise to communicate. :-)

Hugs,
Christian

--
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick



* Re: Shrinking a device - performance?
  2017-03-27 14:49               ` Christian Theune
@ 2017-03-27 15:06                 ` Roman Mamedov
  2017-04-01  9:05                   ` Kai Krakow
  0 siblings, 1 reply; 42+ messages in thread
From: Roman Mamedov @ 2017-03-27 15:06 UTC (permalink / raw)
  To: Christian Theune; +Cc: Austin S. Hemmelgarn, Hugo Mills, linux-btrfs

On Mon, 27 Mar 2017 16:49:47 +0200
Christian Theune <ct@flyingcircus.io> wrote:

> Also: the idea of migrating on btrfs also has its downside - the performance of “mkdir” and “fsync” is abysmal at the moment. I’m waiting for the current shrinking job to finish but this is likely limited to the “find free space” algorithm. We’re talking about a few megabytes converted per second. Sigh.

Btw since this is all on LVM already, you could set up lvmcache with a small
SSD-based cache volume. Even some old 60GB SSD would work wonders for
performance, and with the cache policy of "writethrough" you don't have to
worry about its reliability (much).
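
The exact commands depend on the LVM version, but the setup is roughly
(SSD device, size and names illustrative):

  pvcreate /dev/sdX
  vgextend vgsys /dev/sdX
  lvcreate --type cache-pool -L 50G -n backy_cache vgsys /dev/sdX
  lvconvert --type cache --cachepool vgsys/backy_cache \
            --cachemode writethrough vgsys/backy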

-- 
With respect,
Roman


* Re: Shrinking a device - performance?
  2017-03-27 14:53       ` Christian Theune
@ 2017-03-28 14:43         ` Peter Grandi
  2017-03-28 14:50           ` Tomasz Kusmierz
                             ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: Peter Grandi @ 2017-03-28 14:43 UTC (permalink / raw)
  To: Linux fs Btrfs

This is going to be long because I am writing something detailed
hoping pointlessly that someone in the future will find it by
searching the list archives while doing research before setting
up a new storage system, and they will be the kind of person
that tolerates reading messages longer than Twitter. :-).

> I’m currently shrinking a device and it seems that the
> performance of shrink is abysmal.

When I read this kind of statement I am reminded of all the
cases where someone left me to decatastrophize a storage system
built on "optimistic" assumptions. The usual "optimism" is what
I call the "syntactic approach", that is the axiomatic belief
that any syntactically valid combination of features not only
will "work", but very fast too and reliably despite slow cheap
hardware and "unattentive" configuration. Some people call that
the expectation that system developers provide or should provide
an "O_PONIES" option. In particular I get very saddened when
people use "performance" to mean "speed", as the difference
between the two is very great.

As a general consideration, shrinking a large filetree online
in-place is an amazingly risky, difficult, slow operation and
should be a last desperate resort (as apparently in this case),
regardless of the filesystem type, and expecting otherwise is
"optimistic".

My guess is that very complex risky slow operations like that
are provided by "clever" filesystem developers for "marketing"
purposes, to win box-ticking competitions. That applies to those
system developers who do know better; I suspect that even some
filesystem developers are "optimistic" as to what they can
actually achieve.

> I intended to shrink a ~22TiB filesystem down to 20TiB. This is
> still using LVM underneath so that I can’t just remove a device
> from the filesystem but have to use the resize command.

That is actually a very good idea because Btrfs multi-device is
not quite as reliable as DM/LVM2 multi-device.

> Label: 'backy'  uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
>        Total devices 1 FS bytes used 18.21TiB
>        devid    1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy

Maybe 'balance' should have been used a bit more.
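
For example a periodic filtered balance to compact mostly-empty
chunks, something like (mount point illustrative):

  btrfs balance start -dusage=50 -musage=50 /srv/backy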

> This has been running since last Thursday, so roughly 3.5days
> now. The “used” number in devid1 has moved about 1TiB in this
> time. The filesystem is seeing regular usage (read and write)
> and when I’m suspending any application traffic I see about
> 1GiB of movement every now and then. Maybe once every 30
> seconds or so. Does this sound fishy or normal to you?

With consistent "optimism" this is a request to assess whether
"performance" of some operations is adequate on a filetree
without telling us either what the filetree contents look like,
what the regular workload is, or what the storage layer looks
like.

Being one of the few system administrators crippled by lack of
psychic powers :-), I rely on guesses and inferences here, and
having read the whole thread containing some belated details.

From the ~22TB total capacity my guess is that the storage layer
involves rotating hard disks, and from later details the
filesystem contents seem to be heavily reflinked files of
several GB in size, and the workload seems to be backups to those
files from several source hosts. Considering the general level
of "optimism" in the situation my wild guess is that the storage
layer is based on large slow cheap rotating disks in the 4TB-8TB
range, with very low IOPS-per-TB.

> Thanks for that info. The 1min per 1GiB is what I saw too -
> the “it can take longer” wasn’t really explainable to me.

A contemporary rotating disk device can do around 0.5MB/s
transfer rate with small random accesses with barriers, up to
around 80-160MB/s in purely sequential access without barriers.

1GB/m of simultaneous read-write means around 16MB/s reads plus
16MB/s writes which is fairly good *performance* (even if slow
*speed*) considering that moving extents around, even across
disks, involves quite a bit of randomish same-disk updates of
metadata; because it all usually depends on how many randomish
metadata updates need to be done, on any filesystem type, as those
must be done with barriers.

> As I’m not using snapshots: would large files (100+gb)

Using 100GB sized VM virtual disks (never mind with COW) seems
very unwise to me to start with, but of course a lot of other
people know better :-). Just like a lot of other people know
better that large single pool storage systems are awesome in
every respect :-): cost, reliability, speed, flexibility,
maintenance, etc.

> with long chains of CoW history (specifically reflink copies)
> also hurt?

Oh yes... They are about one of the worst cases for using
Btrfs. But also very "optimistic" to think that kind of stuff
can work awesomely on *any* filesystem type.

> Something I’d like to verify: does having traffic on the
> volume have the potential to delay this infinitely? [ ... ]
> it’s just slow and we’re looking forward to about 2 months
> worth of time shrinking this volume. (And then again on the
> next bigger server probably about 3-4 months).

Those are pretty typical times for whole-filesystem operations
like that on rotating disk media. There are some reports in the
list and IRC channel archives of 'scrub' or 'balance' or 'check'
times for filetrees of that size.

> (Background info: we’re migrating large volumes from btrfs to
> xfs and can only do this step by step: copying some data,
> shrinking the btrfs volume, extending the xfs volume, rinse
> repeat.

That "extending the xfs volume" will have consequences too, but
not too bad hopefully.

> If someone should have any suggestions to speed this up and
> not having to think in terms of _months_ then I’m all ears.)

High IOPS-per-TB enterprise SSDs with capacitor backed caches :-).

> One strategy that does come to mind: we’re converting our
> backup from a system that uses reflinks to a non-reflink based
> system. We can convert this in place so this would remove all
> the reflink stuff in the existing filesystem

Do you have enough space to do that? Either your reflinks are
pointless or they are saving a lot of storage. But I guess that
you can do it one 100GB file at a time...
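
Per file that would be something along the lines of (file name
illustrative):

  cp --reflink=never image image.flat && mv image.flat image

or a 'btrfs filesystem defragment' of the file, which also unshares
its extents - usually a caveat, here the whole point - at the cost
of the corresponding extra space.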

> and then we maybe can do the FS conversion faster when this
> isn’t an issue any longer. I think I’ll

I suspect the de-reflinking plus shrinking will take longer, but
I'm not totally sure.

>> Right. This is an option we can do from a software perspective
> (our own solution - https://bitbucket.org/flyingcircus/backy)

Many thanks for sharing your system, I'll have a look.

> but our systems in use can’t hold all the data twice. Even
> though we’re migrating to a backend implementation that uses
> less data than before I have to perform an “inplace” migration
> in some way. This is VM block device backup. So basically we
> migrate one VM with all its previous data and that works quite
> fine with a little headroom. However, migrating all VMs to a
> new “full” backup and then wait for the old to shrink would
> only work if we had a completely empty backup server in place,
> which we don’t.

> Also: the idea of migrating on btrfs also has its downside -
> the performance of “mkdir” and “fsync” is abysmal at the
> moment.

That *performance* is pretty good indeed, it is the *speed* that
may be low, but that's obvious. Please consider looking at these
entirely typical speeds:

  http://www.sabi.co.uk/blog/17-one.html?170302#170302
  http://www.sabi.co.uk/blog/17-one.html?170228#170228

> I’m waiting for the current shrinking job to finish but this
> is likely limited to the “find free space” algorithm. We’re
> talking about a few megabytes converted per second. Sigh.

Well, if the filetree is being actively used for COW backups
while being shrunk that involves a lot of randomish IO with
barriers.

>> I would only suggest that you reconsider XFS. You can't
>> shrink XFS, therefore you won't have the flexibility to
>> migrate in the same way to anything better that comes along
>> in the future (ZFS perhaps? or even Bcachefs?). XFS does not
>> perform that much better than Ext4, and very importantly,
>> Ext4 can be shrunk.

ZFS is a complicated mess too, with an intensely anisotropic
performance envelope, and not necessarily that good for
backup archival for various reasons. I would consider looking
instead at using a collection of smaller "silo" JFS, F2FS,
NILFS2 filetrees as well as XFS, and using MD RAID in RAID10
mode instead of DM/LVM2:

  http://www.sabi.co.uk/blog/16-two.html?161217#161217
  http://www.sabi.co.uk/blog/17-one.html?170107#170107
  http://www.sabi.co.uk/blog/12-fou.html?121223#121223
  http://www.sabi.co.uk/blog/12-fou.html?121218b#121218b
  http://www.sabi.co.uk/blog/12-fou.html?121218#121218

and yes, Bcachefs looks promising, but I am sticking with Btrfs:

  https://lwn.net/Articles/717379

>> That is true. However, we have moved the expected feature
> set of the filesystem (i.e. cow)

That feature set is arguably not appropriate for VM images, but
lots of people know better :-).

> down to “store files safely and reliably” and we’ve seen too
> much breakage with ext4 in the past.

That is extremely unlikely unless your storage layer has
unreliable barriers, and then you need a lot of "optimism".

> Of course “persistence means you’ll have to say I’m sorry” and
> thus with either choice we may be faced with some issue in the
> future that we might have circumvented with another solution
> and yes flexibility is worth a great deal.

Enterprise SSDs with high small-random-write IOPS-per-TB can
give both excellent speed and high flexibility :-).

> We’ve run XFS and ext4 on different (large and small)
> workloads in the last 2 years and I have to say I’m much more
> happy about XFS even with the shrinking limitation.

XFS and 'ext4' are essentially equivalent, except for the
fixed-size inode table limitation of 'ext4' (and XFS reportedly
has finer grained locking). Btrfs is nearly as good as either on
most workloads in single-device mode without using the more
complicated features (compression, qgroups, ...) and with
appropriate use of the 'nowcow' options, and gives checksums on
data too if needed.
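
(The 'nowcow' route being either the "nodatacow" mount option or, more
selectively, setting the NOCOW attribute on the directory that will
hold the images before they are created, e.g.

  chattr +C /srv/vm-images

which only takes effect for files created afterwards.)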

>> To us ext4 is prohibitive with its fsck performance and we do
> like the tight error checking in XFS.

It is very pleasing to see someone care about the speed of
whole-tree operations like 'fsck', a very often forgotten
"little detail". But in my experience 'ext4' checking is quite
competitive with XFS checking and repair, at least in recent
years, as both have been hugely improved. XFS checking and
repair still require a lot of RAM though.

> Thanks for the reminder though - especially in the public
> archive making this tradeoff with flexibility known is wise to
> communicate. :-)

"Flexibility" in filesystems, especially on rotating disk
storage with extremely anisotropic performance envelopes, is
very expensive, but of course lots of people know better :-).


* Re: Shrinking a device - performance?
  2017-03-28 14:43         ` Peter Grandi
@ 2017-03-28 14:50           ` Tomasz Kusmierz
  2017-03-28 15:06             ` Peter Grandi
  2017-03-28 14:59           ` Peter Grandi
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 42+ messages in thread
From: Tomasz Kusmierz @ 2017-03-28 14:50 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs Btrfs

I glazed over at “This is going to be long” … :)

> On 28 Mar 2017, at 15:43, Peter Grandi <pg@btrfs.for.sabi.co.UK> wrote:
> 
> [ ... full message quoted verbatim ... ]



* Re: Shrinking a device - performance?
  2017-03-28 14:43         ` Peter Grandi
  2017-03-28 14:50           ` Tomasz Kusmierz
@ 2017-03-28 14:59           ` Peter Grandi
  2017-03-28 15:20             ` Peter Grandi
  2017-03-28 15:56           ` Austin S. Hemmelgarn
  2017-03-30 15:00           ` Piotr Pawłow
  3 siblings, 1 reply; 42+ messages in thread
From: Peter Grandi @ 2017-03-28 14:59 UTC (permalink / raw)
  To: Linux fs Btrfs

> [ ... ] reminded of all the cases where someone left me to
> decatastrophize a storage system built on "optimistic"
> assumptions.

In particular when some "clever" sysadm with a "clever" (or
dumb) manager slaps together a large storage system in the
cheapest and quickest way, knowing that while it is mostly empty
it will seem very fast regardless and will therefore appear to
have awesome performance, and then the "clever" sysadm
disappears surrounded by a halo of glory before the storage
system gets its full workload and fills up; when that happens,
usually I get to inherit it.
BTW, the same technique can also be applied to HPC clusters.

>> I intended to shrink a ~22TiB filesystem down to 20TiB. This
>> is still using LVM underneath so that I can’t just remove a
>> device from the filesystem but have to use the resize
>> command.

>> Label: 'backy'  uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
>> Total devices 1 FS bytes used 18.21TiB
>> devid    1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy

Ahh it is indeed a filled up storage system now running a full
workload. At least it wasn't me who inherited it this time. :-)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-28 14:50           ` Tomasz Kusmierz
@ 2017-03-28 15:06             ` Peter Grandi
  2017-03-28 15:35               ` Tomasz Kusmierz
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Grandi @ 2017-03-28 15:06 UTC (permalink / raw)
  To: Linux fs Btrfs

> I glazed over at “This is going to be long” … :)
>> [ ... ]

Not only that, you also top-posted while quoting it pointlessly
in its entirety, to the whole mailing list. Well played :-).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-28 14:59           ` Peter Grandi
@ 2017-03-28 15:20             ` Peter Grandi
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Grandi @ 2017-03-28 15:20 UTC (permalink / raw)
  To: Linux fs Btrfs

>  [ ... ] slaps together a large storage system in the cheapest
> and quickest way knowing that while it is mostly empty it will
> seem very fast regardless and therefore to have awesome
> performance, and then the "clever" sysadm disappears surrounded
> by a halo of glory before the storage system gets full workload
> and fills up; [ ... ]

Fortunately or unfortunately Btrfs is particularly suitable for
this technique, as it has an enormous number of checkbox-ticking,
awesome-looking features: transparent compression, dynamic
add/remove, online balance/scrub, different-sized member devices,
online grow/shrink, online defrag, limitless scalability, online
dedup, arbitrary subvolumes and snapshots, COW and reflinking,
online conversion of RAID profiles, ... and one can use all of
them at the same time. For the initial period, while volume
workload is low and not much space is used, it will look
absolutely fantastic, cheap, flexible, always available, fast:
the work of genius of a very cool sysadm.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-28 15:06             ` Peter Grandi
@ 2017-03-28 15:35               ` Tomasz Kusmierz
  2017-03-28 16:20                 ` Peter Grandi
  0 siblings, 1 reply; 42+ messages in thread
From: Tomasz Kusmierz @ 2017-03-28 15:35 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs Btrfs

I’ve glazed over on “Not only that …” … can you make a YouTube video of that? :))))
> On 28 Mar 2017, at 16:06, Peter Grandi <pg@btrfs.for.sabi.co.UK> wrote:
> 
>> I glazed over at “This is going to be long” … :)
>>> [ ... ]
> 
> Not only that, you also top-posted while quoting it pointlessly
> in its entirety, to the whole mailing list. Well played :-).
It’s because I’m special :* 

On a real note, thanks for giving a f to provide a detailed comment … too much open source stuff is based on short comments :/


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-28 14:43         ` Peter Grandi
  2017-03-28 14:50           ` Tomasz Kusmierz
  2017-03-28 14:59           ` Peter Grandi
@ 2017-03-28 15:56           ` Austin S. Hemmelgarn
  2017-03-30 15:55             ` Peter Grandi
  2017-03-30 15:00           ` Piotr Pawłow
  3 siblings, 1 reply; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2017-03-28 15:56 UTC (permalink / raw)
  To: Peter Grandi, Linux fs Btrfs

On 2017-03-28 10:43, Peter Grandi wrote:
> This is going to be long because I am writing something detailed
> hoping pointlessly that someone in the future will find it by
> searching the list archives while doing research before setting
> up a new storage system, and they will be the kind of person
> that tolerates reading messages longer than Twitter. :-).
>
>> I’m currently shrinking a device and it seems that the
>> performance of shrink is abysmal.
>
> When I read this kind of statement I am reminded of all the
> cases where someone left me to decatastrophize a storage system
> built on "optimistic" assumptions. The usual "optimism" is what
> I call the "syntactic approach", that is the axiomatic belief
> that any syntactically valid combination of features not only
> will "work", but very fast too and reliably despite slow cheap
> hardware and "unattentive" configuration. Some people call that
> the expectation that system developers provide or should provide
> an "O_PONIES" option. In particular I get very saddened when
> people use "performance" to mean "speed", as the difference
> between the two is very great.
>
> As a general consideration, shrinking a large filetree online
> in-place is an amazingly risky, difficult, slow operation and
> should be a last desperate resort (as apparently in this case),
> regardless of the filesystem type, and expecting otherwise is
> "optimistic".
>
> My guess is that very complex risky slow operations like that
> are provided by "clever" filesystem developers for "marketing"
> purposes, to win box-ticking competitions. That applies to those
> system developers who do know better; I suspect that even some
> filesystem developers are "optimistic" as to what they can
> actually achieve.
There are cases where there really is no other sane option.  Not 
everyone has the kind of budget needed for proper HA setups, and if you 
need maximal uptime and as a result have to reprovision the system 
online, then you pretty much need a filesystem that supports online 
shrinking.  Also, it's not really all that slow on most filesystems; 
BTRFS is just hurt by its comparatively poor performance and the COW 
metadata updates that are needed.
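For reference, the shrink itself is a single command while the 
filesystem stays mounted; a minimal sketch, with a hypothetical mount 
point of /mnt/backy:

   # shrink the (single) device by 2TiB, online
   btrfs filesystem resize -2T /mnt/backy
   # or give devid 1 an absolute target size instead
   btrfs filesystem resize 1:20T /mnt/backy

The command only returns once all affected extents have been 
relocated, which is where the days or weeks go.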
>
>> I intended to shrink a ~22TiB filesystem down to 20TiB. This is
>> still using LVM underneath so that I can’t just remove a device
>> from the filesystem but have to use the resize command.
>
> That is actually a very good idea because Btrfs multi-device is
> not quite as reliable as DM/LVM2 multi-device.
This depends on how much you trust your storage hardware relative to how 
much you trust the kernel code.  For raid5/6, yes, BTRFS multi-device is 
currently crap.  For most people raid10 in BTRFS is too.  For raid1 
mode, however, it really comes down to personal opinion.
>
>> Label: 'backy'  uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
>>        Total devices 1 FS bytes used 18.21TiB
>>        devid    1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy
>
> Maybe 'balance' should have been used a bit more.
>
>> This has been running since last Thursday, so roughly 3.5days
>> now. The “used” number in devid1 has moved about 1TiB in this
>> time. The filesystem is seeing regular usage (read and write)
>> and when I’m suspending any application traffic I see about
>> 1GiB of movement every now and then. Maybe once every 30
>> seconds or so. Does this sound fishy or normal to you?
>
> With consistent "optimism" this is a request to assess whether
> "performance" of some operations is adequate on a filetree
> without telling us either what the filetree contents look like,
> what the regular workload is, or what the storage layer looks
> like.
>
> Being one of the few system administrators crippled by lack of
> psychic powers :-), I rely on guesses and inferences here, and
> having read the whole thread containing some belated details.
>
> From the ~22TB total capacity my guess is that the storage layer
> involves rotating hard disks, and from later details the
> filesystem contents seems to be heavily reflinked files of
> several GB in size, and workload seems to be backups to those
> files from several source hosts. Considering the general level
> of "optimism" in the situation my wild guess is that the storage
> layer is based on large slow cheap rotating disks in the 4TB-8TB
> range, with very low IOPS-per-TB.
>
>> Thanks for that info. The 1min per 1GiB is what I saw too -
>> the “it can take longer” wasn’t really explainable to me.
>
> A contemporary rotating disk device can do around 0.5MB/s
> transfer rate with small random accesses with barriers up to
> around 80-160MB/s in purely sequential access without barriers.
>
> 1GB/m of simultaneous read-write means around 16MB/s reads plus
> 16MB/s writes which is fairly good *performance* (even if slow
> *speed*) considering that moving extents around, even across
> disks, involves quite a bit of randomish same-disk updates of
> metadata; because it all depends usually on how much randomish
> metadata updates need to done, on any filesystem type, as those
> must be done with barriers.
>
>> As I’m not using snapshots: would large files (100+gb)
>
> Using 100GB sized VM virtual disks (never mind with COW) seems
> very unwise to me to start with, but of course a lot of other
> people know better :-). Just like a lot of other people know
> better that large single pool storage systems are awesome in
> every respect :-): cost, reliability, speed, flexibility,
> maintenance, etc.
>
>> with long chains of CoW history (specifically reflink copies)
>> also hurt?
>
> Oh yes... They are about one of the worst cases for using
> Btrfs. But also very "optimistic" to think that kind of stuff
> can work awesomely on *any* filesystem type.
It works just fine for archival storage on any number of other 
filesystems.  Performance is poor, but with backups that shouldn't 
matter (performance should be your last criterion when designing a backup 
strategy, period).
>
>> Something I’d like to verify: does having traffic on the
>> volume have the potential to delay this infinitely? [ ... ]
>> it’s just slow and we’re looking forward to about 2 months
>> worth of time shrinking this volume. (And then again on the
>> next bigger server probably about 3-4 months).
>
> Those are pretty typical times for whole-filesystem operations
> like that on rotating disk media. There are some reports in the
> list and IRC channel archives to 'scrub' or 'balance' or 'check'
> times for filetrees of that size.
>
>> (Background info: we’re migrating large volumes from btrfs to
>> xfs and can only do this step by step: copying some data,
>> shrinking the btrfs volume, extending the xfs volume, rinse
>> repeat.
>
> That "extending the xfs volume" will have consequences too, but
> not too bad hopefully.
It shouldn't have any consequences beyond the FS being bigger and the 
FS-level metadata being a bit fragmented.  Extending a filesystem, if 
done right (and XFS absolutely does it right), doesn't need to move any 
data; it just allocates a bit more space in a few places and updates the 
super-blocks to point to the new end of the filesystem.
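For completeness, the grow path on the XFS side is two short online 
steps; a sketch with invented LV and mount-point names:

   # give the XFS LV the space freed from the btrfs LV
   lvextend -L +2T /dev/vgsys/backy-xfs
   # grow the mounted XFS filesystem to fill the enlarged LV
   xfs_growfs /mnt/backy-xfs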
>
>> If someone should have any suggestions to speed this up and
>> not having to think in terms of _months_ then I’m all ears.)
>
> High IOPS-per-TB enterprise SSDs with capacitor backed caches :-).
>
>> One strategy that does come to mind: we’re converting our
>> backup from a system that uses reflinks to a non-reflink based
>> system. We can convert this in place so this would remove all
>> the reflink stuff in the existing filesystem
>
> Do you have enough space to do that? Either your reflinks are
> pointless or they are saving a lot of storage. But I guess that
> you can do it one 100GB file at a time...
>
>> and then we maybe can do the FS conversion faster when this
>> isn’t an issue any longer. I think I’ll
>
> I suspect the de-reflinking plus shrinking will take longer, but
> not totally sure.
>
>> Right. This is wan option we can do from a software perspective
>> (our own solution - https://bitbucket.org/flyingcircus/backy)
>
> Many thanks for sharing your system, I'll have a look.
>
>> but our systems in use can’t hold all the data twice. Even
>> though we’re migrating to a backend implementation that uses
>> less data than before I have to perform an “inplace” migration
>> in some way. This is VM block device backup. So basically we
>> migrate one VM with all its previous data and that works quite
>> fine with a little headroom. However, migrating all VMs to a
>> new “full” backup and then wait for the old to shrink would
>> only work if we had a completely empty backup server in place,
>> which we don’t.
>
>> Also: the idea of migrating on btrfs also has its downside -
>> the performance of “mkdir” and “fsync” is abysmal at the
>> moment.
>
> That *performance* is pretty good indeed, it is the *speed* that
> may be low, but that's obvious. Please consider looking at these
> entirely typical speeds:
>
>   http://www.sabi.co.uk/blog/17-one.html?170302#170302
>   http://www.sabi.co.uk/blog/17-one.html?170228#170228
>
>> I’m waiting for the current shrinking job to finish but this
>> is likely limited to the “find free space” algorithm. We’re
>> talking about a few megabytes converted per second. Sigh.
>
> Well, if the filetree is being actively used for COW backups
> while being shrunk that involves a lot of randomish IO with
> barriers.
>
>>> I would only suggest that you reconsider XFS. You can't
>>> shrink XFS, therefore you won't have the flexibility to
>>> migrate in the same way to anything better that comes along
>>> in the future (ZFS perhaps? or even Bcachefs?). XFS does not
>>> perform that much better over Ext4, and very importantly,
>>> Ext4 can be shrunk.
>
> ZFS is a complicated mess too with an intensely anisotropic
> performance envelope too and not necessarily that good for
> backup archival for various reasons. I would consider looking
> instead at using a collection of smaller "silo" JFS, F2FS,
> NILFS2 filetrees as well as XFS, and using MD RAID in RAID10
> mode instead of DM/LVM2:
>
>   http://www.sabi.co.uk/blog/16-two.html?161217#161217
>   http://www.sabi.co.uk/blog/17-one.html?170107#170107
>   http://www.sabi.co.uk/blog/12-fou.html?121223#121223
>   http://www.sabi.co.uk/blog/12-fou.html?121218b#121218b
>   http://www.sabi.co.uk/blog/12-fou.html?121218#121218
>
> and yes, Bcachefs looks promising, but I am sticking with Btrfs:
>
>   https://lwn.net/Articles/717379
>
>> That is true. However, we have moved the expected feature
>> set of the filesystem (i.e. cow)
>
> That feature set is arguably not appropriate for VM images, but
> lots of people know better :-).
That depends on a lot of factors.  I have no issues personally running 
small VM images on BTRFS, but I'm also running on decent SSDs (>500MB/s 
read and write speeds), using sparse files, and keeping on top of 
managing them.  Most of the issue boils down to 3 things:
1. Running Windows in VMs.  Windows has a horrendous allocator and does 
a horrible job of keeping data localized, which makes fragmentation on 
the back-end far worse.
2. Running another COW filesystem inside the VM.  Having multiple COW 
layers on top of each other nukes performance and makes file fragments 
breed like rabbits.
3. Not taking the time to do proper routine maintenance.  Unless you're 
running directly on a block storage device, you should be defragmenting 
your VM images both in the VM and on the host (internal first of 
course), and generally keeping on top of making sure they stay in good 
condition.
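For what it's worth, the host-side part of point 3 can be as simple as 
the sketch below (the path and extent-size target are only 
illustrative, and note that on these kernels defragmenting also breaks 
reflink/snapshot sharing):

   # defragment existing images into larger extents
   btrfs filesystem defragment -r -t 32M /var/lib/libvirt/images
   # mark the directory NOCOW so newly created images skip COW
   chattr +C /var/lib/libvirt/images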
>
>> down to “store files safely and reliably” and we’ve seen too
>> much breakage with ext4 in the past.
>
> That is extremely unlikely unless your storage layer has
> unreliable barriers, and then you need a lot of "optimism".
Then you've been lucky.  Outside of ZFS or BTRFS, most 
filesystems choke the moment they hit some at-rest data corruption, 
which has a much higher rate than most people want to admit.  Hardware 
failures happen, as do transient errors, and XFS usually does a better 
job recovering from them than ext4.
>
>> Of course “persistence means you’ll have to say I’m sorry” and
>> thus with either choice we may be faced with some issue in the
>> future that we might have circumvented with another solution
>> and yes flexibility is worth a great deal.
>
> Enterprise SSDs with high small-random-write IOPS-per-TB can
> give both excellent speed and high flexibility :-).
>
>> We’ve run XFS and ext4 on different (large and small)
>> workloads in the last 2 years and I have to say I’m much more
>> happy about XFS even with the shrinking limitation.
>
> XFS and 'ext4' are essentially equivalent, except for the
> fixed-size inode table limitation of 'ext4' (and XFS reportedly
> has finer grained locking). Btrfs is nearly as good as either on
> most workloads in single-device mode without using the more
> complicated features (compression, qgroups, ...) and with
> appropriate use of the 'nocow' options, and gives checksums on
> data too if needed.
No, if you look at actual data, they aren't anywhere near equivalent 
unless you're comparing them to crappy filesystems like FAT32 or 
drastically different filesystems like NILFS2, ZFS, or BTRFS.  XFS 
supports metadata checksumming, reflinks, and a number of other things 
ext4 doesn't, while also focusing on consistent performance across the 
life of the FS (so it performs worse than ext4 on a clean FS, but 
better than ext4 on a heavily used one).  ext4 by contrast supports a 
handful of things that XFS doesn't (like journaling all writes, not 
just metadata, optional lazy metadata initialization, optional 
multiple-mount protection, etc.), and takes a rather optimistic view of 
performance, focusing on trying to make it as good as possible at all 
times.
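As a concrete illustration of the XFS side (the device name is made 
up, and reflink support was still marked experimental around this 
time):

   # metadata checksums plus reflink support at mkfs time
   mkfs.xfs -m crc=1,reflink=1 /dev/vgsys/backy-xfs
   # a shared-extent copy, just like on btrfs
   cp --reflink=always image.raw image-clone.raw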
>
>> To us ext4 is prohibitive with its fsck performance and we do
>> like the tight error checking in XFS.
>
> It is very pleasing to see someone care about the speed of
> whole-tree operations like 'fsck', a very often forgotten
> "little detail". But in my experience 'ext4' checking is quite
> competitive with XFS checking and repair, at least in recent
> years, as both have been hugely improved. XFS checking and
> repair still require a lot of RAM though.
>
>> Thanks for the reminder though - especially in the public
>> archive making this tradeoff with flexibility known is wise to
>> communicate. :-)
>
> "Flexibility" in filesystems, especially on rotating disk
> storage with extremely anisotropic performance envelopes, is
> very expensive, but of course lots of people know better :-).
Time is not free, and humans generally prefer to minimize the amount of 
time they have to work on things.  This is why ZFS is so popular, it 
handles most errors correctly by itself and usually requires very little 
human intervention for maintenance.  'Flexibility' in a filesystem costs 
some time on a regular basis, but can save a huge amount of time in the 
long run.

To look at it another way, I have a home server system running BTRFS on 
top of LVM.  Because of the flexibility this allows, I've been able to 
configure the system such that it is statistically certain that it will 
survive any combination of failed storage devices short of a complete 
catastrophic failure, keep running correctly, and recover completely 
with zero down-time, while still getting performance within 5-10% of 
what I would see just running BTRFS directly on the SSDs in the system. 
That flexibility is what makes this system work as well and as reliably 
as it does, which in turn means that the extent of manual maintenance is 
running updates, thus saving me significantly more time than it costs in 
lost performance.
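For the curious, the kind of layering I mean is roughly the following 
sketch; the volume group name and sizes are invented and this is not a 
recipe for my exact setup:

   # a mirrored LV underneath, so one failed disk is invisible to btrfs
   lvcreate --type raid1 -m 1 -L 100G -n home vg0
   # btrfs on top, with duplicated metadata for extra self-repair
   mkfs.btrfs -m dup -d single /dev/vg0/home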

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-28 15:35               ` Tomasz Kusmierz
@ 2017-03-28 16:20                 ` Peter Grandi
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Grandi @ 2017-03-28 16:20 UTC (permalink / raw)
  To: Linux fs Btrfs

> I’ve glazed over on “Not only that …” … can you make youtube
> video of that :)) [ ... ]  It’s because I’m special :*

Well played again, that's a fairly credible impersonation of a
node.js/mongodb developer :-).

> On a real note thank’s [ ... ] to much of open source stuff is
> based on short comments :/

Yes... In part that's because the "sw engineering" aspect of
programming takes a lot of time that unpaid volunteers sometimes
cannot afford to take; in part, though, I have noticed that some
free sw authors who do get paid to do free sw act as if they had
a policy of obfuscation to protect their turf/jobs.

Regardless, mailing lists, IRC channel logs, wikis, personal
blogs, search engines allow a mosaic of lore to form, which
in part remedies the situation, and here we are :-).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-28 14:43         ` Peter Grandi
                             ` (2 preceding siblings ...)
  2017-03-28 15:56           ` Austin S. Hemmelgarn
@ 2017-03-30 15:00           ` Piotr Pawłow
  2017-03-30 16:13             ` Peter Grandi
  3 siblings, 1 reply; 42+ messages in thread
From: Piotr Pawłow @ 2017-03-30 15:00 UTC (permalink / raw)
  To: Peter Grandi, Linux fs Btrfs

> As a general consideration, shrinking a large filetree online
> in-place is an amazingly risky, difficult, slow operation and
> should be a last desperate resort (as apparently in this case),
> regardless of the filesystem type, and expecting otherwise is
> "optimistic".

The way btrfs is designed I'd actually expect shrinking to be fast in
most cases. It could probably be done by moving whole chunks at near
platter speed, instead of extent-by-extent as it is done now, as long as
there is enough free space. There was a discussion about it already:
http://www.spinics.net/lists/linux-btrfs/msg38608.html. It just hasn't
been implemented yet.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-28 15:56           ` Austin S. Hemmelgarn
@ 2017-03-30 15:55             ` Peter Grandi
  2017-03-31 12:41               ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Grandi @ 2017-03-30 15:55 UTC (permalink / raw)
  To: Linux fs Btrfs

>> My guess is that very complex risky slow operations like that are
>> provided by "clever" filesystem developers for "marketing" purposes,
>> to win box-ticking competitions. That applies to those system
>> developers who do know better; I suspect that even some filesystem
>> developers are "optimistic" as to what they can actually achieve.

> There are cases where there really is no other sane option. Not
> everyone has the kind of budget needed for proper HA setups,

Thanks for letting me know, that must have never occurred to me, just as
it must have never occurred to me that some people expect extremely
advanced features that imply big-budget high-IOPS high-reliability
storage to be fast and reliable on small-budget storage too :-)

> and if you need maximal uptime and as a result have to reprovision the
> system online, then you pretty much need a filesystem that supports
> online shrinking.

That's a bigger topic than we can address here. The topic used to be
known in one related domain as "Very Large Databases", which were
defined as databases so large and critical that the time needed for
maintenance and backup was too long to take them offline etc.;
that is a topic that has largely vanished from discussion, I guess
because most management just don't want to hear about it :-).

> Also, it's not really all that slow on most filesystem, BTRFS is just
> hurt by it's comparatively poor performance, and the COW metadata
> updates that are needed.

Btrfs in realistic situations has pretty good speed *and* performance,
and COW actually helps, as it often results in less head repositioning
than update-in-place. What makes it a bit slower with metadata is having
'dup' by default to recover from especially damaging bitflips in
metadata, but then that does not impact performance, only speed.

>> That feature set is arguably not appropriate for VM images, but
>> lots of people know better :-).

> That depends on a lot of factors.  I have no issues personally running
> small VM images on BTRFS, but I'm also running on decent SSD's
> (>500MB/s read and write speeds), using sparse files, and keeping on
> top of managing them. [ ... ]

Having (relatively) big-budget high-IOPS storage for high-IOPS workloads
helps, that must have never occurred to me either :-).

>> XFS and 'ext4' are essentially equivalent, except for the fixed-size
>> inode table limitation of 'ext4' (and XFS reportedly has finer
>> grained locking). Btrfs is nearly as good as either on most workloads
>> in single-device mode [ ... ]

> No, if you look at actual data, [ ... ]

Well, I have looked at actual data in many published but often poorly
made "benchmarks", and to me they seem quite equivalent
indeed, within somewhat differently shaped performance envelopes, so the
results depend on the testing point within that envelope. I have done
my own simplistic actual data gathering, most recently here:

  http://www.sabi.co.uk/blog/17-one.html?170302#170302
  http://www.sabi.co.uk/blog/17-one.html?170228#170228

and however simplistic, they are fairly informative (and for writes they
point a finger at a layer below the filesystem type).

[ ... ]

>> "Flexibility" in filesystems, especially on rotating disk
>> storage with extremely anisotropic performance envelopes, is
>> very expensive, but of course lots of people know better :-).

> Time is not free,

Your time seems especially and uniquely precious as you "waste"
as little as possible editing your replies into readability.

> and humans generally prefer to minimize the amount of time they have
> to work on things. This is why ZFS is so popular, it handles most
> errors correctly by itself and usually requires very little human
> intervention for maintenance.

That seems to me a pretty illusion, as it does not contain any magical
AI, just pretty ordinary and limited error correction for trivial cases.

> 'Flexibility' in a filesystem costs some time on a regular basis, but
> can save a huge amount of time in the long run.

Like everything else. The difficulty is having flexibility at scale with
challenging workloads. "An engineer can do for a nickel what any damn
fool can do for a dollar" :-).

> To look at it another way, I have a home server system running BTRFS
> on top of LVM. [ ... ]

But usually home servers have "unchallenging" workloads, and it is
relatively easy to overbudget their storage, because the total absolute
cost is "affordable".

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-30 15:00           ` Piotr Pawłow
@ 2017-03-30 16:13             ` Peter Grandi
  2017-03-30 22:13               ` Piotr Pawłow
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Grandi @ 2017-03-30 16:13 UTC (permalink / raw)
  To: Linux fs Btrfs

>> As a general consideration, shrinking a large filetree online
>> in-place is an amazingly risky, difficult, slow operation and
>> should be a last desperate resort (as apparently in this case),
>> regardless of the filesystem type, and expecting otherwise is
>> "optimistic".

> The way btrfs is designed I'd actually expect shrinking to be
> fast in most cases. It could probably be done by moving whole
> chunks at near platter speed, [ ... ] It just hasn't been
> implemented yet.

That seems to me a rather "optimistic" argument, as most of the
cost of shrinking is the 'balance' to pack extents into chunks.

As that thread implies, the current implementation in effect
does a "balance" while shrinking, by moving extents from chunks
"above the line" to free space in chunks "below the line".

The proposed "move whole chunks" implementation helps only if
there are enough unallocated chunks "below the line". If regular
'balance' is done on the filesystem there will be some, but that
just spreads the cost of the 'balance' across time, it does not
by itself make a «risky, difficult, slow operation» any less so,
just spreads the risk, difficulty, slowness across time.

More generally one of the downsides of Btrfs is that because of
its two-level (allocated/unallocated chunks, used/free nodes or
blocks) design it requires more than most other designs to do
regular 'balance', which is indeed «risky, difficult, slow».
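For example, the usual way to hand mostly-empty chunks back to the
unallocated pool is a filtered 'balance'; the usage thresholds
below are only an illustration:

   # rewrite only data/metadata chunks that are less than 50% used
   btrfs balance start -dusage=50 -musage=50 /mnt/backy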

Compare an even more COW-oriented design like NILFS2, which also
requires running its garbage collector (though a bit less often),
an operation that is likewise «risky, difficult, slow». Just like
in Btrfs that is a tradeoff that shrinks the performance envelope
in one direction and expands it in another.

But in the case of Btrfs it shrinks it perhaps a bit more than it
expands it, as the added flexibility of having chunk-based
'profiles' is only very partially taken advantage of.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-30 16:13             ` Peter Grandi
@ 2017-03-30 22:13               ` Piotr Pawłow
  2017-03-31  1:00                 ` GWB
  2017-03-31 10:51                 ` Peter Grandi
  0 siblings, 2 replies; 42+ messages in thread
From: Piotr Pawłow @ 2017-03-30 22:13 UTC (permalink / raw)
  To: Peter Grandi, Linux fs Btrfs

> The proposed "move whole chunks" implementation helps only if
> there are enough unallocated chunks "below the line". If regular
> 'balance' is done on the filesystem there will be some, but that
> just spreads the cost of the 'balance' across time, it does not
> by itself make a «risky, difficult, slow operation» any less so,
> just spreads the risk, difficulty, slowness across time.

Isn't that too pessimistic? Most of my filesystems have 90+% of free
space unallocated, even those I never run balance on. For me it wouldn't
just spread the cost, it would reduce it considerably.
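For anyone who wants to check their own volumes, the allocated versus
unallocated split is visible directly in the usage report, e.g.:

   # compare the "Device allocated" and "Device unallocated" lines
   btrfs filesystem usage /mnt

(on older progs, "btrfs filesystem show" plus "btrfs filesystem df"
give the same picture).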

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-30 22:13               ` Piotr Pawłow
@ 2017-03-31  1:00                 ` GWB
  2017-03-31  5:26                   ` Duncan
  2017-03-31 11:37                   ` Peter Grandi
  2017-03-31 10:51                 ` Peter Grandi
  1 sibling, 2 replies; 42+ messages in thread
From: GWB @ 2017-03-31  1:00 UTC (permalink / raw)
  To: ct, Linux fs Btrfs

Hello, Christian,

I very much enjoyed the discussion you sparked with your original
post.  My ability in btrfs is very limited, much less than the others
who have replied here, so this may not be much help.

Let us assume that you have been able to shrink the device to the size
you need, and you are now merrily on your way to moving the data to
XFS.  If so, ignore this email, delete, whatever, and read no further.

If that is not the case, perhaps try something like the following.

Can you try to first dedup the btrfs volume?  This is probably out of
date, but you could try one of these:

https://btrfs.wiki.kernel.org/index.php/Deduplication
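For example, with an out-of-band tool such as duperemove, a run could
look like the sketch below (the mount point and hashfile path are made
up, and on ~20TiB it will take a very long time):

   # incremental out-of-band dedup with a persistent hash database
   duperemove -rd --hashfile=/var/tmp/backy.hash /srv/backy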

If that does not work, this is a longer shot, but you might consider
adding an intermediate step of creating yet another btrfs volume on
the underlying lvm2 device mapper, turning on dedup, compression, and
whatever else can squeeze some extra space out of the current btrfs
volume.  You could then try to copy over files and see if you get the
results you need (or try sending the current btrfs volume as a
snapshot, but I'm guessing 20TB is too much).

Once the new btrfs volume on top of lvm2 is complete, you could just
delete the old one, and then transfer the (hopefully compressed and
deduped) data to XFS.
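A rough sketch of that intermediate step, with made-up LV names and
sizes and the compression that was available at the time:

   # carve a new LV out of whatever free space the VG still has
   lvcreate -L 2T -n backy-tmp vgsys
   mkfs.btrfs -L backy-tmp /dev/vgsys/backy-tmp
   # mount with transparent compression before copying data over
   mount -o compress=zlib,noatime /dev/vgsys/backy-tmp /mnt/backy-tmp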

Yep, that's probably a lot of work.

I use both btrfs (on root on Ubuntu) and zfs (for data, home), and I
try to do as little as possible with live mounted file systems other
than snapshots.  I avoid sending and receiving snapshots from the live
system (mostly zfs, but sometimes btrfs) and instead write incremental
snapshots as files on the backup disks, and then import the
incremental snaps into a backup pool at night.

My recollection is that btrfs handles deduplication differently than
zfs, but both of them can be very, very slow (from the human
perspective; call that what you will: a suboptimal relationship
between the parameters of performance and speed).

The advantage you have is that with lvm you can create a number of
different file systems.  And lvm can also create snapshots.  I think
zfs and btrfs both have a more "elegant" way of dealing with
snapshots, but lvm allows a file system without that feature to have
it.  Others on the list can tell you about the disadvantages.

I would be curious how it turns out for you.  If you are able to move
the data to XFS running on top of lvm, what is your plan for snapshots
in lvm?
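For reference, a classic LVM snapshot needs a preallocated COW area,
roughly as in this sketch (names and sizes are invented; thinly
provisioned LVs make the sizing less painful):

   # snapshot of the XFS-carrying LV, 200G reserved for changed blocks
   lvcreate -s -L 200G -n backy-xfs-snap /dev/vgsys/backy-xfs

The snapshot's XFS can then be mounted with '-o ro,nouuid' for backup
purposes.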

Again, I'm not an expert in btrfs, but in most cases a full balance
and scrub takes care of any problems on the root partition, but that
is a relatively small partition.  A full balance (without any filter options)
and scrub on 20 TiB must take a very long time even with robust
hardware, would it not?

CentOS, Redhat, and Oracle seem to take the position that very large
data subvolumes using btrfs should work fine.  But I would be curious
what the rest of the list thinks about 20 TiB in one volume/subvolume.

Gordon



On Thu, Mar 30, 2017 at 5:13 PM, Piotr Pawłow <pp@siedziba.pl> wrote:
>> The proposed "move whole chunks" implementation helps only if
>> there are enough unallocated chunks "below the line". If regular
>> 'balance' is done on the filesystem there will be some, but that
>> just spreads the cost of the 'balance' across time, it does not
>> by itself make a «risky, difficult, slow operation» any less so,
>> just spreads the risk, difficulty, slowness across time.
>
> Isn't that too pessimistic? Most of my filesystems have 90+% of free
> space unallocated, even those I never run balance on. For me it wouldn't
> just spread the cost, it would reduce it considerably.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-31  1:00                 ` GWB
@ 2017-03-31  5:26                   ` Duncan
  2017-03-31  5:38                     ` Duncan
  2017-03-31 11:37                   ` Peter Grandi
  1 sibling, 1 reply; 42+ messages in thread
From: Duncan @ 2017-03-31  5:26 UTC (permalink / raw)
  To: linux-btrfs

GWB posted on Thu, 30 Mar 2017 20:00:22 -0500 as excerpted:

> CentOS, Redhat, and Oracle seem to take the position that very large
> data subvolumes using btrfs should work fine.  But I would be curious
> what the rest of the list thinks about 20 TiB in one volume/subvolume.

To be sure I'm a biased voice here, as I have multiple independent btrfs 
on multiple partitions here, with no btrfs over 100 GiB in size, and 
that's on ssd so maintenance commands normally return in minutes or even 
seconds, not the hours to days or even weeks it takes on multi-TB btrfs 
on spinning rust.  But FWIW...

IMO there are two rules favoring multiple relatively smaller btrfs over 
single far larger btrfs:

1) Don't put all your data eggs in one basket, especially when that 
basket isn't yet entirely stable and mature.

A mantra commonly repeated on this list is that btrfs is still 
stabilizing, not fully stable and mature, the result being that keeping 
backups of any data you value more than the time/cost/hassle-factor of 
the backup, and being practically prepared to use them, is even *MORE* 
important than it is on fully mature and stable filesystems.  If 
potential users aren't prepared to do that, flat answer, they should be 
looking at other filesystems, tho in reality, that rule applies to stable 
and mature filesystems too, and any good sysadmin understands that not 
having a backup is in reality defining the data in question as worth less 
than the cost of that backup, regardless of any protests to the contrary.

Based on that and the fact that if this less than 100% stable and mature 
filesystem fails, all those subvolumes and snapshots you painstakingly 
created aren't going to matter, it's all up in smoke, it just makes sense 
to subdivide that data roughly along functional lines and split it up 
into multiple independent btrfs, so that if a filesystem fails, it'll 
take only a fraction of the total data with it, and restoring/repairing/
rebuilding will hopefully only have to be done on a small fraction of 
that data.

Which brings us to rule #2:

2) Don't make your filesystems so large that any maintenance on them, 
including both filesystem maintenance like btrfs balance/scrub/check/
whatever, and normal backup and restore operations, takes impractically 
long, where "impractically" can be reasonably defined as so long it 
discourages you from doing them in the first place and/or so long that 
it's going to cause unwarranted downtime.

Some years ago, before I started using btrfs and while I was using 
mdraid, I learned this one the hard way.  I had a bunch of rather large 
mdraids setup, each with multiple partitions and filesystems[1].  This 
was before mdraid got proper write-intent bitmap support, so after a 
crash, I'd have to repair any of these large mdraids that had been active 
at the time, a process taking hours, even for the primary one containing 
root and /home, because it contained for example a large media partition 
that was unlikely to have been mounted at the same time.

After getting tired of this I redid things, putting each partition/
filesystem on its own mdraid.  Then it would take only a few minutes each 
for the mdraids for root, /home and /var/log, and I could be back in 
business with them in half an hour or so, instead of the couple hours I 
had to wait before, to get the bigger mdraid back up and repaired.  Sure, 
if the much larger media raid was active and the partition mounted too, 
I'd still have it to repair, but I could do that in the background.  And 
there was a good chance it was /not/ active and mounted at the time of 
the crash and thus didn't need repaired, saving that time entirely! =:^)

Eventually I arranged things so I could keep root mounted read-only 
unless I was updating it, and that's still the way I run it today.  That 
makes it very nice when a crash impairs /home and /var/log, since there's 
much less chance root was affected, and with a normal root mount, at 
least I have my full normal system available to me, including the latest 
installed btrfs-progs, and manpages and text-mode browsers such as lynx 
available to me to help troubleshoot, that aren't normally available in 
typical distros' rescue modes.

Meanwhile, a scrub (my btrfs but for /boot are raid1 both data and 
metadata, and /boot is mixed-mode dup, so scrub can normally repair crash 
damage getting the two mirrors out of sync) of root takes only ~10 
seconds, a scrub of /home takes only ~45 seconds, and a scrub of /var/log 
is normally done nearly as fast as I hit enter on the command.  
Similarly, btrfs balance and btrfs check normally run in under a minute, 
partly because I'm on ssd, and partly because those three filesystems are 
all well under 50 GiB each.

Of course I may have to run two or three scrubs, depending on what was 
mounted writable at the time of the crash, and I've had /home and /var/
log (but not root as it's read-only by default) go unmountable until 
repaired a couple times, but repairs are typically short too, and if that 
fails, blow away with a fresh mkfs.btrfs and restore from backup is 
typically well under an hour.

So I don't tend to be down for more than an hour.  Of course some other 
partitions may still need fixed, but that can continue in the background, 
while I'm back up and posting about it to the btrfs list or whatever.

Compare that to the current thread where someone's trying to do a resize 
of a 20+ TB btrfs and it was looking to take a week, due to the massive 
size and the slow speed of balance on his highly reflinked filesystem on 
spinning rust.

Point of fact.  If it's multiple TBs, chances are it's going to be faster 
to simply blow away and recreate from backup, than it is to try to 
repair... and repair may or may not actually work and leave you with a 
fully functional btrfs afterward.

Apparently that 20+ TB /is/ the backup, but it's a backup of a whole 
bunch of systems.  OK, so even if they'd still put all those backups on 
the same physical hardware, consider how much simpler it would have been 
had they had an independent btrfs of say a TB or two for each system they 
were backing up.  At 2 TB, it's possible to work with one or two at a 
time, copying them over to say a 3-4 TB hard drive (or btrfs raid1 with a 
pair of hard drives), blowing away the original partition, and copying 
back from the second backup.  But with a single 20+ TB monster, they 
don't have anything else close to that size to work with, and have to do 
the shrink-current-btrfs, expand-new-filesystem (which is xfs IIRC, 
they're getting off of btrfs), move-more-over-from-the-old-one, repeat, 
dance.  And /each/ /iteration/ of that dance is taking them a week or so!

What would they have done had the btrfs gone bad and needed repaired?  
Try repair and wait a week or two to see if it worked?  Blow away the 
filesystem as it was only the backup and recreate?

A single 20+ TB btrfs was clearly beyond anything practical for them.  
Had rule #2 been followed, they'd have never been in this spot in the 
first place, as even if all those backups from multiple machines (virtual 
or physical) were on the same hardware, they'd be in different 
independent btrfs, and those could be handled independently.

Of course once they're multiple independent btrfs, it would make sense to 
split that 20+ TB onto smaller hardware setups as well, and they'd have 
been dealing with less data overall too, because part of it would have 
been unaffected (or handled separately if they were moving it /all/) as 
it would have been on other machines.  Much like creating multiple mdraids 
and putting a single filesystem in each, instead of putting a bunch of 
data on a single mdraid, ended up working much better for me, because 
then only a fraction of the data was affected and I could do the repairs 
on those mdraids far faster as there wasn't as much data to deal with!

But like I said I'm biased.  By hard experience, yes, and getting the 
sizes for the partitions wrong can be a hassle until you get to know your 
use-case and size them correctly, but it's a definite bias.

---
[1] Partitions and filesystems:  I had learned about a somewhat different 
benefit of multiple partitions and filesystems even longer ago, 1997 or 
so, when I was still on MS, testing an IE 4 beta that for performance 
reasons used direct-disk IO on its cache-index file, but it forgot to set 
the system attribute on it that would have kept defrag from touching it.  
So defrag would move the file out from under the now constantly running 
IE, as IE was part of the explorer shell.  IE would then happily 
overwrite whatever got moved into the old index file location, and a 
number of testers had important files seriously damaged that way.  I 
didn't, because I had my cache on a separate "temp" partition, so while 
it could and did still damage data, all it could touch was "temporary" 
data in the first place, meaning no real damage on my system. =:^)  All 
because I had the temp data on its own partition/filesystem.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-31  5:26                   ` Duncan
@ 2017-03-31  5:38                     ` Duncan
  2017-03-31 12:37                       ` Peter Grandi
  0 siblings, 1 reply; 42+ messages in thread
From: Duncan @ 2017-03-31  5:38 UTC (permalink / raw)
  To: linux-btrfs

Duncan posted on Fri, 31 Mar 2017 05:26:39 +0000 as excerpted:

> Compare that to the current thread where someone's trying to do a resize
> of a 20+ TB btrfs and it was looking to take a week, due to the massive
> size and the slow speed of balance on his highly reflinked filesystem on
> spinning rust.

Heh, /this/ thread.  =:^)  I obviously lost track of the thread I was 
replying to.

Which in a way makes the reply even more forceful, as it's obviously 
generically targeted, not just at this thread.  Even if I were so devious 
as to arrange that deliberately (I'm not and I didn't, FWIW, but of 
course if you suspect that then this assurance won't mean much either).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-30 22:13               ` Piotr Pawłow
  2017-03-31  1:00                 ` GWB
@ 2017-03-31 10:51                 ` Peter Grandi
  1 sibling, 0 replies; 42+ messages in thread
From: Peter Grandi @ 2017-03-31 10:51 UTC (permalink / raw)
  To: Linux fs Btrfs

>>> The way btrfs is designed I'd actually expect shrinking to
>>> be fast in most cases. [ ... ]

>> The proposed "move whole chunks" implementation helps only if
>> there are enough unallocated chunks "below the line". If regular
>> 'balance' is done on the filesystem there will be some, but that
>> just spreads the cost of the 'balance' across time, it does not
>> by itself make a «risky, difficult, slow operation» any less so,
>> just spreads the risk, difficulty, slowness across time.

> Isn't that too pessimistic?

Maybe, it depends on the workload impacting the volume and how
much it churns the free/unallocated situation.

> Most of my filesystems have 90+% of free space unallocated,
> even those I never run balance on.

That seems quite lucky to me, as that is definitely not my
experience or even my expectation in the general case: on my
laptop and desktop, with relatively few updates, I have to run
'balance' fairly frequently, and "Knorrie" has produced a nice
tool that draws a graphical map of free vs. unallocated space,
and in most of its example maps users find quite a bit of
balancing needs to be done.

> For me it wouldn't just spread the cost, it would reduce it
> considerably.

In your case the cost of the implicit or explicit 'balance'
simply does not arise because 'balance' is not necessary, and
then moving whole chunks is indeed cheap. The argument here is
in part whether used space (extents) or allocated space (chunks)
is more fragmented as well as the amount of metadata to update
in either case.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-31  1:00                 ` GWB
  2017-03-31  5:26                   ` Duncan
@ 2017-03-31 11:37                   ` Peter Grandi
  1 sibling, 0 replies; 42+ messages in thread
From: Peter Grandi @ 2017-03-31 11:37 UTC (permalink / raw)
  To: Linux fs Btrfs

> Can you try to first dedup the btrfs volume?  This is probably
> out of date, but you could try one of these: [ ... ] Yep,
> that's probably a lot of work. [ ... ] My recollection is that
> btrfs handles deduplication differently than zfs, but both of
> them can be very, very slow

But the big deal there is that dedup is indeed a very expensive
operation, even worse than 'balance'. A balanced, deduped volume
will shrink faster in most cases, but the time taken is simply
moved from shrinking to preparing.

> Again, I'm not an expert in btrfs, but in most cases a full
> balance and scrub takes care of any problems on the root
> partition, but that is a relatively small partition.  A full
> balance (without the options) and scrub on 20 TiB must take a
> very long time even with robust hardware, would it not?

There have been reports of several months for volumes of that
size subject to ordinary workload.

> CentOS, Redhat, and Oracle seem to take the position that very
> large data subvolumes using btrfs should work fine.

This is a long-standing controversy, and for example there have
been "interesting" debates in the XFS mailing list. Btrfs in
this is not really different from others, with one major
difference in context, that many Btrfs developers work for a
company that relies on large numbers of small servers, to the
point that fixing multidevice issues has not been a priority.

The controversy of large volumes is that while no doubt the
logical structures of recent filesystem types can support single
volumes of many petabytes (or even much larger), and such
volumes have indeed been created and "work"-ish, so they are
unquestionably "syntactically valid", the tradeoffs involved
especially as to maintainability may mean that they don't "work"
well and sustainably so.

The fundamental issue is metadata: while the logical structures,
using 48-64 bit pointers, unquestionably scale "syntactically",
they don't scale pragmatically when considering whole-volume
maintenance like checking, repair, balancing, scrubbing,
indexing (which includes making incremental backups etc.).

Note: large volumes don't have just a speed problem for
whole-volume operations, they also have a memory problem, as
most tools hold an in-memory copy of the metadata. There have been
cases where indexing or repair of a volume requires a lot more
RAM (many hundreds GiB or some TiB of RAM) than the system on
which the volume was being used.

The problem is of course smaller if the large volume contains
mostly large files, and bigger if the volume is stored on low
IOPS-per-TB devices and used on small-memory systems. But even
with large files, even if filetree object metadata (inodes etc.)
are relatively few, eventually space metadata must at least
potentially resolve down to single sectors, and that can be a
lot of metadata unless both used and free space are very
unfragmented.

The fundamental technological issue is: *data* IO rates, in both
random IOPS and sequential ones, can be scaled "almost" linearly
by parallelizing them using RAID or equivalent, allowing large
volumes to serve scalably large and parallel *data* workloads,
but *metadata* IO rates cannot be easily parallelized, because
metadata structures are graphs, not arrays of bytes like files.

So a large volume on 100 storage devices can serve in parallel a
significant percentage of 100 times the data workload of a small
volume on 1 storage device, but not so much for the metadata
workload.

For example, I have never seen a parallel 'fsck' tool that can
take advantage of 100 storage devices to complete a scan of a
single volume on 100 storage devices in not much longer time
than the scan of a volume on 1 of the storage devices.

> But I would be curious what the rest of the list thinks about
> 20 TiB in one volume/subvolume.

Personally I think that while volumes of many petabytes "work"
syntactically, there are serious maintainability problems (which
I have seen happen at a number of sites) with volumes larger
than 4TB-8TB with any current local filesystem design.

That depends also on number/size of storage devices, and their
nature, that is IOPS, as after all metadata workloads do scale a
bit with number of available IOPS, even if far more slowly than
data workloads.

For example I think that an 8TB volume is not desirable on a
single 8TB disk for ordinary workloads (but then I think that
disks above 1-2TB are just not suitable for ordinary filesystem
workloads), but with lots of smaller/faster disks a 12TB volume
would probably be acceptable, and maybe a number of flash SSDs
might make even a 20TB volume acceptable.

Of course there are lots of people who know better. :-)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-31  5:38                     ` Duncan
@ 2017-03-31 12:37                       ` Peter Grandi
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Grandi @ 2017-03-31 12:37 UTC (permalink / raw)
  To: linux-btrfs

>> [ ... ] CentOS, Redhat, and Oracle seem to take the position
>> that very large data subvolumes using btrfs should work
>> fine. But I would be curious what the rest of the list thinks
>> about 20 TiB in one volume/subvolume.

> To be sure I'm a biased voice here, as I have multiple
> independent btrfs on multiple partitions here, with no btrfs
> over 100 GiB in size, and that's on ssd so maintenance
> commands normally return in minutes or even seconds,

That's a bit extreme I think, as there are downsides to having
many too-small volumes too.

> not the hours to days or even weeks it takes on multi-TB btrfs
> on spinning rust.

Or months :-).

> But FWIW... 1) Don't put all your data eggs in one basket,
> especially when that basket isn't yet entirely stable and
> mature.

Really good point here.

> A mantra commonly repeated on this list is that btrfs is still
> stabilizing,

My impression is that most 4.x and later versions are very
reliable for "base" functionality, that is excluding
multi-device, compression, qgroups, ... Put another way, what
scratches the Facebook itches works well :-).

> [ ... ] the time/cost/hassle-factor of the backup, and being
> practically prepared to use them, is even *MORE* important
> than it is on fully mature and stable filesystems.

Indeed, or at least *different* filesystems. I backup JFS
filesystems to XFS ones, and Btrfs filesystems to NILFS2 ones,
for example.

> 2) Don't make your filesystems so large that any maintenance
> on them, including both filesystem maintenance like btrfs
> balance/scrub/check/ whatever, and normal backup and restore
> operations, takes impractically long,

As per my preceding post, that's the big deal, but so many
people "know better" :-).

> where "impractically" can be reasonably defined as so long it
> discourages you from doing them in the first place and/or so
> long that it's going to cause unwarranted downtime.

That's the "Very Large DataBase" level of trouble.

> Some years ago, before I started using btrfs and while I was
> using mdraid, I learned this one the hard way. I had a bunch
> of rather large mdraids setup, [ ... ]

I have recently seen another much "funnier" example: people who
"know better" and follow every cool trend decide to consolidate
their server farm on VMs, backed by a storage server with a
largish single pool of storage holding the virtual disk images
of all the server VMs. They look like geniuses until the storage
pool system crashes, and a minimal integrity check on restart
takes two days during which the whole organization is without
access to any email, files, databases, ...

> [ ... ] And there was a good chance it was /not/ active and
> mounted at the time of the crash and thus didn't need
> repaired, saving that time entirely! =:^)

As to that I have switched to using 'autofs' to mount volumes
only on access, using a simple script that turns '/etc/fstab'
into an automounter dynamic map, which means that most of the
time most volumes on my (home) systems are not mounted:

  http://www.sabi.co.uk/blog/anno06-3rd.html?060928#060928
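The maps themselves are nothing special; a minimal sketch of the
idea, with made-up paths and timeout:

   # /etc/auto.master: indirect map for rarely-used volumes
   /vol  /etc/auto.vol  --timeout=300

   # /etc/auto.vol: one entry per volume, generated from /etc/fstab
   backy  -fstype=btrfs,noatime  :/dev/mapper/vgsys-backy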

> Eventually I arranged things so I could keep root mounted
> read-only unless I was updating it, and that's still the way I
> run it today.

The ancient way was, instead of having '/' RO and '/var' RW, to
have '/' RW and '/usr' RO (so for example it could be shared
across many systems via NFS etc.); both are good ideas, but I
prefer the ancient way. But then some people who know better are
moving to merge '/' with '/usr' without understanding the history
and the advantages.

> [ ... ] If it's multiple TBs, chances are it's going to be
> faster to simply blow away and recreate from backup, than it
> is to try to repair... [ ... ]

Or to shrink or defragment or dedup etc., except on very high
IOPS-per-TB storage.

> [ ... ] how much simpler it would have been had they had an
> independent btrfs of say a TB or two for each system they were
> backing up.

That is the general alternative to a single large pool/volume:
sharding/chunking of filetrees, sometimes, as with Lustre or
Ceph etc., with a "metafilesystem" layer on top.

Done manually, my suggestion is to do the sharding per-week (or
other suitable period) rather than per-system, in a circular
"crop rotation" scheme. So that once a volume has been filled,
it becomes read-only and can even be unmounted until it needs
to be reused:

  http://www.sabi.co.uk/blog/12-fou.html?121218b#121218b
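
A crude sketch of the rotation itself (purely illustrative: the
volume names, the count of four volumes and the use of rsync are
assumptions, not a prescription):

  # pick this week's destination from a fixed set of backup volumes
  week=$(date +%V)                  # ISO week number, 01..53
  vols=4                            # volumes in the rotation
  dest="/srv/backup/vol$(( (10#$week - 1) % vols ))"

  # volumes stay read-only (or unmounted) except while in use
  mount -o remount,rw "$dest"
  rsync -a /data/ "$dest/$(date +%G-W%V)/"
  mount -o remount,ro "$dest"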

Then there is the problem that "a TB or two" is less easy with
increasing disk capacities, but then I think that disks with a
capacity larger than 1TB are not suitable for ordinary
workloads, and are more suited to tape-cartridge-like usage.

> What would they have done had the btrfs gone bad and needed
> repaired? [ ... ]

In most cases I have seen of designs aimed at achieving the
lowest cost and highest flexibility "low IOPS single pool" at
the expense of scalability and maintainability, the "clever"
designer had been promoted or had wisely moved to another job
while the storage system was still mostly empty, so the problems
had not yet happened.

[ ... ]

> But like I said I'm biased.  By hard experience, yes, and
> getting the sizes for the partitions wrong can be a hassle
> until you get to know your use-case and size them correctly,
> but it's a definite bias.

Yes, I am very pleased that this post shares this and many
other insights from the wisdom of the ancients, not everybody
knows better :-).

[ ... ]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-30 15:55             ` Peter Grandi
@ 2017-03-31 12:41               ` Austin S. Hemmelgarn
  2017-03-31 17:25                 ` Peter Grandi
  0 siblings, 1 reply; 42+ messages in thread
From: Austin S. Hemmelgarn @ 2017-03-31 12:41 UTC (permalink / raw)
  To: Peter Grandi, Linux fs Btrfs

On 2017-03-30 11:55, Peter Grandi wrote:
>>> My guess is that very complex risky slow operations like that are
>>> provided by "clever" filesystem developers for "marketing" purposes,
>>> to win box-ticking competitions. That applies to those system
>>> developers who do know better; I suspect that even some filesystem
>>> developers are "optimistic" as to what they can actually achieve.
>
>> There are cases where there really is no other sane option. Not
>> everyone has the kind of budget needed for proper HA setups,
>
> Thanks for letting me know, that must have never occurred to me, just as
> it must have never occurred to me that some people expect extremely
> advanced features that imply big-budget high-IOPS high-reliability
> storage to be fast and reliable on small-budget storage too :-)
You're missing my point (or intentionally ignoring it).  Those types of 
operations are implemented because there are use cases that actually 
need them, not because some developer thought it would be cool.  The one 
possible counter-example of this is XFS, which doesn't support shrinking 
the filesystem at all, but that was a conscious decision because their 
target use case (very large scale data storage) does not need that 
feature and not implementing it allows them to make certain other parts 
of the filesystem faster.
>
>> and if you need maximal uptime and as a result have to reprovision the
>> system online, then you pretty much need a filesystem that supports
>> online shrinking.
>
> That's a bigger topic than we can address here. The topic used to be
> known in one related domain as "Very Large Databases", which were
> defined as databases so large and critical that the time needed for
> maintenance and backup was too long to take them offline etc.;
> that is a topic that has largely vanished from discussion, I guess
> because most management just don't want to hear it :-).
No, it's mostly vanished because of changes in best current practice. 
That was a topic in an era where the only platform that could handle 
high-availability was VMS, and software wasn't routinely written to 
handle things like load balancing.  As a result, people ran a single 
system which hosted the database, and if that went down, everything went 
down.  By contrast, it's rare these days outside of small companies to 
see singly hosted databases that aren't specific to the local system, 
and once you start parallelizing on the system level, backup and 
maintenance times generally go down.
>
>> Also, it's not really all that slow on most filesystems, BTRFS is just
>> hurt by its comparatively poor performance, and the COW metadata
>> updates that are needed.
>
> Btrfs in realistic situations has pretty good speed *and* performance,
> and COW actually helps, as it often results in less head repositioning
> than update-in-place. What makes it a bit slower with metadata is having
> 'dup' by default to recover from especially damaging bitflips in
> metadata, but then that does not impact performance, only speed.
I and numerous other people have done benchmarks running single metadata 
and single data profiles on BTRFS, and it consistently performs worse 
than XFS and ext4 even under those circumstances.  It's not horrible 
performance (it's better for example than trying the same workload on 
NTFS on Windows), but it's still not what most people would call 'high' 
performance or speed.
>
>>> That feature set is arguably not appropriate for VM images, but
>>> lots of people know better :-).
>
>> That depends on a lot of factors.  I have no issues personally running
>> small VM images on BTRFS, but I'm also running on decent SSD's
>> (>500MB/s read and write speeds), using sparse files, and keeping on
>> top of managing them. [ ... ]
>
> Having (relatively) big-budget high-IOPS storage for high-IOPS workloads
> helps, that must have never occurred to me either :-).
It's not big budget; the SSDs in question are at best mid-range 
consumer SSDs that cost only marginally more than a decent hard drive, 
and they really don't get all that great performance in terms of IOPS 
because they're all on the same cheap SATA controller.  The point I was 
trying to make (which I should have been clearer about) is that they 
have good bulk throughput, which means that the OS can do much more 
aggressive writeback caching, which in turn means that COW and 
fragmentation have less impact.
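
(For reference, the writeback aggressiveness mentioned here is 
governed by the usual vm.dirty_* sysctls; the values below are purely 
illustrative, not a tuning recommendation:)

   # let up to ~4 GiB of dirty pages accumulate before writers are
   # throttled, and start background writeback at ~1 GiB
   sysctl -w vm.dirty_bytes=$((4 * 1024 * 1024 * 1024))
   sysctl -w vm.dirty_background_bytes=$((1 * 1024 * 1024 * 1024))
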
>
>>> XFS and 'ext4' are essentially equivalent, except for the fixed-size
>>> inode table limitation of 'ext4' (and XFS reportedly has finer
>>> grained locking). Btrfs is nearly as good as either on most workloads
>>> in single-device mode [ ... ]
>
>> No, if you look at actual data, [ ... ]
>
> Well, I have looked at actual data in many published but often poorly
> made "benchmarks", and to me they seem quite equivalent
> indeed, within somewhat differently shaped performance envelopes, so the
> results depend on the testing point within that envelope. I have
> done my own simplistic actual data gathering, most recently here:
>
>   http://www.sabi.co.uk/blog/17-one.html?170302#170302
>   http://www.sabi.co.uk/blog/17-one.html?170228#170228
>
> and however simplistic, they are fairly informative (and for writes they
> point a finger at a layer below the filesystem type).
In terms of performance, yes they are roughly equivalent.  Performance 
isn't all that matters though, and once you get past that point, ext4 and XFS 
are significantly different in what they offer.
>
> [ ... ]
>
>>> "Flexibility" in filesystems, especially on rotating disk
>>> storage with extremely anisotropic performance envelopes, is
>>> very expensive, but of course lots of people know better :-).
>
>> Time is not free,
>
> Your time seems especially and uniquely precious as you "waste"
> as little as possible editing your replies into readability.
>
>> and humans generally prefer to minimize the amount of time they have
>> to work on things. This is why ZFS is so popular, it handles most
>> errors correctly by itself and usually requires very little human
>> intervention for maintenance.
>
> That seems to me a pretty illusion, as it does not contain any magical
> AI, just pretty ordinary and limited error correction for trivial cases.
On average, trivial cases account for most errors in any computer.  So, 
by definition, to handle most errors correctly, you can get by with just 
handling all 'trivial' cases correctly.  By handling all trivial cases 
correctly, ZFS is doing far better than any other current filesystem or 
storage stack can even begin to claim.  It's been doing this since 
before most modern Linux distributions made their first release too, so 
compared to just about anything else people are using these days, it's 
got a pretty solid track record.  Anyone trying to claim it's the best 
option in any case is obviously either a zealot or being paid, but for 
many cases, it really is one of the top options.
>
>> 'Flexibility' in a filesystem costs some time on a regular basis, but
>> can save a huge amount of time in the long run.
>
> Like everything else. The difficulty is having flexibility at scale with
> challenging workloads. "An engineer can do for a nickel what any damn
> fool can do for a dollar" :-).
>
>> To look at it another way, I have a home server system running BTRFS
>> on top of LVM. [ ... ]
>
> But usually home servers have "unchallenging" workloads, and it is
> relatively easy to overbudget their storage, because the total absolute
> cost is "affordable".
OK, so running
  * Almost a dozen statically allocated VMs with a variety of
    differing workloads including web servers, a local mail server,
    DHCP and DNS for the network, a VPN server, and 3 different file
    sharing protocols (which see rather regular use), among other things
  * On average between 4 and 10 transient VMs running regression
    testing on kernel patches (including automation of almost
    everything but selecting patches)
  * A BOINC client
  * GlusterFS (both client and storage node)
  * Network security monitoring (Nagios plus a handful of custom scripts)
  * Cloud storage software
all on the same system is an 'unchallenging' workload.  Given the fact 
that it's only got 32G of RAM and a cheap quad-core Xeon, that's a 
pretty damn challenging workload by most people's standards.  I call it 
a home server because I run it out of my house, not because it's some 
trivial dinky little file server that could run just fine on something 
like a Raspberry Pi.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-31 12:41               ` Austin S. Hemmelgarn
@ 2017-03-31 17:25                 ` Peter Grandi
  2017-03-31 19:38                   ` GWB
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Grandi @ 2017-03-31 17:25 UTC (permalink / raw)
  To: Linux fs Btrfs

>>> My guess is that very complex risky slow operations like
>>> that are provided by "clever" filesystem developers for
>>> "marketing" purposes, to win box-ticking competitions.

>>> That applies to those system developers who do know better;
>>> I suspect that even some filesystem developers are
>>> "optimistic" as to what they can actually achieve.

>>> There are cases where there really is no other sane
>>> option. Not everyone has the kind of budget needed for
>>> proper HA setups,

>> Thanks for letting me know, that must have never occurred to
>> me, just as it must have never occurred to me that some
>> people expect extremely advanced features that imply
>> big-budget high-IOPS high-reliability storage to be fast and
>> reliable on small-budget storage too :-)

> You're missing my point (or intentionally ignoring it).

In "Thanks for letting me know" I am not missing your point, I
am simply pointing out that I do know that people try to run
high-budget workloads on low-budget storage.

The argument as to whether "very complex risky slow operations"
should be provided in the filesystem itself is a very different
one, and I did not develop it fully. But it is quite "optimistic"
to simply state "there really is no other sane option", even
for people that don't have "proper HA setups".

Let's start by assuming for the time being that "very complex
risky slow operations" are indeed feasible on very reliable high
speed storage layers. Then the questions become:

* Is it really true that "there is no other sane option" to
  running "very complex risky slow operations" even on storage
  that is not "big-budget high-IOPS high-reliability"?

* Is it really true that it is a good idea to run "very complex
  risky slow operations" even on "big-budget high-IOPS
  high-reliability storage"?

> Those types of operations are implemented because there are
> use cases that actually need them, not because some developer
> thought it would be cool. [ ... ]

And this is the really crucial bit; I'll disregard, without
agreeing too much (but in part I do), the rest of the
response, as those are less important matters, and this is going
to be longer than a twitter message.

First, I agree that "there are use cases that actually need
them", and I need to explain what I am agreeing to: I believe
that computer systems, "system" in a wide sense, have what I
call "inevitable functionality", that is functionality that is
not optional, but must be provided *somewhere*: for example
print spooling is "inevitable functionality" as long as there
are multiple users, and spell checking is another example.

The only choice as to "inevitable functionality" is *where* to
provide it. For example spooling can be done among two users by
queuing jobs manually with one saying "I am going to print now",
while the other user waits until the print is finished, or by
using a spool program that queues jobs on the source system, or
by using a spool program that queues jobs on the target
printer. Spell checking can be done on the fly in the document
processor, batch with a tool, or manually by the document
author. All these are valid implementations of "inevitable
functionality", just with very different performance envelope,
where the "system" includes the users as "peripherals" or
"plugins" :-) in the manual implementations.

There is no dispute from me that multiple devices,
adding/removing block devices, data compression, structural
repair, balancing, growing/shrinking, defragmentation, quota
groups, integrity checking, deduplication, ... are all in the
general case "inevitable functionality", and every non-trivial
storage system *must* implement them.

The big question is *where*: for example when I started using
UNIX the 'fsck' tool was several years away, and when the system
crashed I did, like everybody, filetree integrity checking and
structure recovery myself (with the help of 'ncheck' and
'icheck' and 'adb'), that is 'fsck' was implemented in my head.

In the general case there are four places where such
"inevitable functionality" can be implemented:

* In the filesystem module in the kernel, for example Btrfs
  scrubbing.
* In a tool that uses hooks provided by the filesystem module in
  the kernel, for example Btrfs deduplication, 'send'/'receive'.
* In a tool, for example 'btrfsck'.
* In the system administrator.

Consider the "very complex risky slow" operation of
defragmentation; the system administrator can implement it by
dumping and reloading the volume, or a tool can implement it by
running on the unmounted filesystem, or a tool and the kernel
can implement it by using kernel module hooks, or it can be
provided entirely in the kernel module.

My argument is that providing "very complex risky slow"
maintenance operations as filesystem primitives looks awesomely
convenient, a good way to "win box-ticking competitions" for
"marketing" purposes, but is rather bad idea for several
reasons, of varying strengths:

* Most system administrators apparently don't understand the
  most basic concepts of storage, or try to not understand them,
  and in particular don't understand that some in-place
  maintenance operations are "very complex risky slow" and
  should be avoided. Manual alternatives to shrinking like
  dumping and reloading should be encouraged.

* In an ideal world "very complex risky slow operations" could
  be done either "automagically" or manually, and wise system
  administrators would choose appropriately, but the risk of the
  wrong choice by less wise system administrators can reflect
  badly on the filesystem reputation and that of their
  designers, as in "after 10 years it still is like this" :-).

* In particular for whatever reasons many system administrators
  seem to be very "optimistic" as to cost/benefit planning,
  maybe because they want to be considered geniuses who can
  deliver large high performance high reliability storage for
  cheap, and systematically under-resource IOPS because they are
  very expensive, yet large quantities of these are consumed by
  most maintenance "very complex risky slow operations",
  especially those involving in-place manipulation, and then
  ingenuously or disingenuously complain when 'balance' takes 3
  months, because after all it is a single command, and that
  single command hides a "very complex risky slow" operation.

* In an ideal world implementing "very complex risky slow
  operations" in kernel modules (or even in tools) is entirely
  cost free, as kernel developers never make mistakes as to
  state machines or race conditions or lesser bugs despite the
  enormous complexity of the code paths needed to support many
  possible options, but kernel code is particularly fragile,
  kernel developers seem to be human after all, when they are
  not quite careless, and making it hard to stabilize kernel
  code can reflect badly on the filesystem reputation and that
  of their designers, as in "after 10 years it still is like
  this" :-).

Therefore in my judgement a filesystem design should only
provide the barest and most direct functionality, unless the
designers really overrate themselves, or rate highly their skill
at marketing long lists of features as "magic dust". In my
judgement higher level functionality can be left to the
ingenuity of system administrators, both because crude methods
like dump and reload actually work pretty well and quickly, even
if they are more costly in terms of resources used, and because
they give a more direct feel to system administrators of the
real costs of doing certain maintenance operations.
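
For concreteness, a minimal sketch of such a dump-and-reload
shrink (volume and mount point names are hypothetical, and rsync
is just one possible copy tool; 'btrfs send'/'receive' of a
read-only snapshot would do as well):

  # create the new, smaller volume and filesystem
  lvcreate -L 20T -n backy_new vgsys
  mkfs.btrfs /dev/vgsys/backy_new
  mount /dev/vgsys/backy_new /mnt/new

  # copy while the old volume is quiesced (read-only)
  mount -o remount,ro /srv/backy
  rsync -aHAX /srv/backy/ /mnt/new/

  # swap the mounts, then retire the old LV at leisure
  umount /srv/backy
  mount /dev/vgsys/backy_new /srv/backy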

Put another way, as to this:

> Those types of operations are implemented because there are
> use cases that actually need them,

Implementing "very complex risky slow operations" like in-place
shrinking *in the kernel module* as a "just do it" primitive is
certainly possible and looks great in a box-ticking competition
but has large hidden costs as to complexity and opacity, and
simpler, cruder, more manual out-of-kernel implementations are
usually less complex, less risky, less slow, even if more
expensive in terms of budget. In the end the question for either
filesystem designers or system administrators is "Do you feel
lucky?" :-).

The following crudely tells part of the story, for example that
some filesystem designers know better :-)

  $  D='btrfs f2fs gfs2 hfsplus jfs nilfs2 reiserfs udf xfs'
  $  find $D -name '*.ko' | xargs size | sed 's/^  *//;s/ .*\t//g'
  text    filename
  832719  btrfs/btrfs.ko
  237952  f2fs/f2fs.ko
  251805  gfs2/gfs2.ko
  72731   hfsplus/hfsplus.ko
  171623  jfs/jfs.ko
  173540  nilfs2/nilfs2.ko
  214655  reiserfs/reiserfs.ko
  81628   udf/udf.ko
  658637  xfs/xfs.ko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-31 17:25                 ` Peter Grandi
@ 2017-03-31 19:38                   ` GWB
  2017-03-31 20:27                     ` Peter Grandi
  0 siblings, 1 reply; 42+ messages in thread
From: GWB @ 2017-03-31 19:38 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs Btrfs

Well, now I am curious.  Until we hear back from Christian on the
progress of the never-ending file system shrinkage, I suppose it can't
hurt to ask what the significance of the xargs size limits of btrfs
might be.  Or, again, if Christian is already happily on his way to
an xfs server running over lvm, skip, ignore, delete.

Here is the output of xargs --show-limits on my laptop:

<<
$ xargs --show-limits
Your environment variables take up 4830 bytes
POSIX upper limit on argument length (this system): 2090274
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2085444
Size of command buffer we are actually using: 131072

Execution of xargs will continue now...
>>

That is for a laptop system.  So what does it mean that btrfs has a
higher xargs size limit than other file systems?  Could I
theoretically use 40% of the total allowed argument length of the
system for btrfs arguments alone?  Would that make balance, shrinkage,
etc., faster?  Does the higher capacity for argument length mean btrfs
is overly complex and therefore more prone to breakage?  Or does the
lower capacity for argument length for hfsplus demonstrate it is the
superior file system for avoiding breakage?

Or does it mean that hfsplus is very old (and reflects older xargs
limits), and that btrfs is newer code?  I am relatively new to btrfs,
and would like to find out.  I am also attracted to the idea that it
is better to leave some operations to the system itself, and not code
them into the file system.  For example, I think deduplication "off
line" or "out of band" is an advantage for btrfs over zfs.  But that's
only for what I do.  For other uses deduplication "in line", while
writing the file, is preferred, and that is what zfs does (preferably
with lots of memory, at least one ssd to run zil, caches, etc.).
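
(As a concrete illustration of the out-of-band style, using the
third-party duperemove tool and a made-up path:)

  # scan the tree for duplicate extents and submit them to the
  # kernel's dedupe ioctl, after the data has already been written
  duperemove -dhr /srv/data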

I use btrfs now because Ubuntu has it as a default in the kernel, and
I assume that when (not "if") I have to use a system rescue disk (USB
or CD) it will have some capacity to repair btrfs.  Along the way,
btrfs has been quite good as a general purpose file system on root; it
makes and sends snapshots, and so far only needs an occasional scrub
and balance.  My earlier experience with btrfs on a 2TB drive was more
complicated, but I expected that for a file system with a lot of
potential but less maturity.
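
(For reference, the occasional scrub and balance mentioned above
usually amounts to something like the following; the mount point is
just an example:)

  # verify all checksums in the background
  btrfs scrub start /

  # rewrite only data chunks that are at most half full, which keeps
  # the balance cheap
  btrfs balance start -dusage=50 /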

Personally, I would go back to fossil and venti on Plan 9 for an
archival data server (using WORM drives), and VAX/VMS cluster for an
HA server.  But of course that no longer makes sense except for a very
few usage cases.  Time has moved on, prices have dropped drastically,
and hardware can do a lot more per penny than it used to.

Gordon

On Fri, Mar 31, 2017 at 12:25 PM, Peter Grandi <pg@btrfs.list.sabi.co.uk> wrote:
> [ ... ]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-31 19:38                   ` GWB
@ 2017-03-31 20:27                     ` Peter Grandi
  2017-04-01  0:02                       ` GWB
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Grandi @ 2017-03-31 20:27 UTC (permalink / raw)
  To: Linux fs Btrfs

> [ ... ] what the significance of the xargs size limits of
> btrfs might be. [ ... ] So what does it mean that btrfs has a
> higher xargs size limit than other file systems? [ ... ] Or
> does the lower capacity for argument length for hfsplus
> demonstrate it is the superior file system for avoiding
> breakage? [ ... ]

That confuses me, as my understanding of the command argument size
limit is that it is a system, not a filesystem, property, and for
example it can be obtained with 'getconf _POSIX_ARG_MAX'.
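
For example (the values below are illustrative and differ per
system):

  $ getconf _POSIX_ARG_MAX    # minimum guaranteed by POSIX
  4096
  $ getconf ARG_MAX           # actual limit on this system
  2097152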

> Personally, I would go back to fossil and venti on Plan 9 for
> an archival data server (using WORM drives),

In an ideal world we would be using Plan 9. Not necessarily with
Fossil and Venti. As to storage/backup/archival, Linux-based
options are not bad, even if the platform is far messier than
Plan 9 (or some other alternatives). BTW I just noticed with a
search that AWS might be offering Plan 9 hosts :-).

> and VAX/VMS cluster for an HA server. [ ... ]

Uhmmm, however nice it was, it was fairly weird. An IA32 or
AMD64 port has been promised however :-).

https://www.theregister.co.uk/2016/10/13/openvms_moves_slowly_towards_x86/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-31 20:27                     ` Peter Grandi
@ 2017-04-01  0:02                       ` GWB
  2017-04-01  2:42                         ` Duncan
  0 siblings, 1 reply; 42+ messages in thread
From: GWB @ 2017-04-01  0:02 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs Btrfs

It is confusing, and now that I look at it, more than a little funny.
Your use of xargs returns the size of the kernel module for each of
the filesystem types.  I think I get it now: you are pointing to how
large the kernel module for btrfs is compared to other file system
kernel modules, 833 megs (piping find through xargs to sed).  That
does not mean the btrfs kernel module can accommodate an upper limit
of a command line length that is 833 megs.  It is just a very big
loadable kernel module.

So same question, but different expression: what is the significance
of the large size of the btrfs kernel module?  Is it that the larger
the module, the more complex, the more prone to breakage, and more
difficult to debug?  Is the hfsplus kernel module less complex, and
more robust?  What did the file system designers of hfsplus (or udf)
know better (or worse?) than the file system designers of btrfs?

VAX/VMS clusters just aren't happy outside of a deeply hidden bunker
running 9 machines in a cluster from one storage device connected by
Myrinet over 500 miles to the next cluster.  I applaud the move to
x86, but like I wrote earlier, time has moved on.  I suppose weird is
in the eye of the beholder, but yes, when dial-up was king and disco
pants roamed the earth, they were nice.  I don't think x86 is a viable
use case even for OpenVMS.  If you really need a VAX/VMS cluster,
chances are you already have had one running with a continuous
uptime of more than a decade and you have already upgraded and changed
out every component several times by cycling down one machine in the
cluster at a time.

Gordon

On Fri, Mar 31, 2017 at 3:27 PM, Peter Grandi <pg@btrfs.list.sabi.co.uk> wrote:
> [ ... ]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-04-01  0:02                       ` GWB
@ 2017-04-01  2:42                         ` Duncan
  2017-04-01  4:26                           ` GWB
  0 siblings, 1 reply; 42+ messages in thread
From: Duncan @ 2017-04-01  2:42 UTC (permalink / raw)
  To: linux-btrfs

GWB posted on Fri, 31 Mar 2017 19:02:40 -0500 as excerpted:

> It is confusing, and now that I look at it, more than a little funny.
> Your use of xargs returns the size of the kernel module for each of the
> filesystem types.  I think I get it now: you are pointing to how large
> the kernel module for btrfs is compared to other file system kernel
> modules, 833 megs (piping find through xargs to sed).  That does not
> mean the btrfs kernel module can accommodate an upper limit of a command
> line length that is 833 megs.  It is just a very big loadable kernel
> module.

Umm... 833 K, not M, I believe.  (The unit is bytes not KiB.)

Because if just one kernel module is nearing a gigabyte, then the kernel 
must be many gigabytes either monolithic or once assembled in memory, and 
it just ain't so.

But FWIW megs was my first-glance impression too, until my brain said "No 
way!  Doesn't work!" and I took a second look.

The kernel may indeed no longer fit on a 1.44 MB floppy, but it's still 
got a ways to go before it's multiple GiB! =:^)  While they're XZ-
compressed, I'm still fitting several monolithic-build kernels including 
their appended initramfs, along with grub, its config and modules, and a 
few other misc things, in a quarter-GB dup-mode btrfs, meaning 128 MiB 
capacity, including the 16 MiB system chunk so 112 MiB for data and 
metadata.  That simply wouldn't be possible if the kernel itself were 
multi-GB, even uncompressed.  Even XZ isn't /that/ good!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-04-01  2:42                         ` Duncan
@ 2017-04-01  4:26                           ` GWB
  2017-04-01 11:30                             ` Peter Grandi
  0 siblings, 1 reply; 42+ messages in thread
From: GWB @ 2017-04-01  4:26 UTC (permalink / raw)
  To: Btrfs BTRFS

Indeed, that does make sense.  It's the output of the size command in
the Berkeley format of "text", not decimal, octal or hex.  Out of
curiosity about kernel module sizes, I dug up some old MacBooks and
looked around in:

/System/Library/Extensions/[modulename].kext/Contents/MacOS:

udf is 637K on Mac OS 10.6
exfat is 75K on Mac OS 10.9
msdosfs is 79K on Mac OS 10.9
ntfs is 394K (That must be Paragon's ntfs for Mac)

And here's the kernel extension sizes for zfs (From OpenZFS):

/Library/Extensions/[modulename].kext/Contents/MacOS:

zfs is 1.7M (10.9)
spl is 247K (10.9)

Different kernel from Linux, of course (evidently a "mish mash" of
NeXTSTEP, BSD, Mach and Apple's own code), but that is one large
kernel extension for zfs.  If they are somehow comparable even with
the differences, 833K is not bad for btrfs compared to zfs.  I did not
look at the format of the file; it must be binary, but compression may
be optional for third party kexts.

So the kernel module sizes are large for both btrfs and zfs.  Given
the feature sets of both, is that surprising?

My favourite kernel extension in Mac OS X is:

/System/Library/Extensions/Dont Steal Mac OS X.kext/

Subtle, very subtle.

Gordon

On Fri, Mar 31, 2017 at 9:42 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> [ ... ]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-27 15:06                 ` Roman Mamedov
@ 2017-04-01  9:05                   ` Kai Krakow
  0 siblings, 0 replies; 42+ messages in thread
From: Kai Krakow @ 2017-04-01  9:05 UTC (permalink / raw)
  To: linux-btrfs

Am Mon, 27 Mar 2017 20:06:46 +0500
schrieb Roman Mamedov <rm@romanrm.net>:

> On Mon, 27 Mar 2017 16:49:47 +0200
> Christian Theune <ct@flyingcircus.io> wrote:
> 
> > Also: the idea of migrating on btrfs also has its downside - the
> > performance of “mkdir” and “fsync” is abysmal at the moment. I’m
> > waiting for the current shrinking job to finish but this is likely
> > limited to the “find free space” algorithm. We’re talking about a
> > few megabytes converted per second. Sigh.  
> 
> Btw since this is all on LVM already, you could set up lvmcache with
> a small SSD-based cache volume. Even some old 60GB SSD would work
> wonders for performance, and with the cache policy of "writethrough"
> you don't have to worry about its reliability (much).

That's maybe the best recommendation to speed things up. I'm using
bcache here for the same reasons (speeding up random workloads) and it
works wonders.

Though for such big storage I'd maybe recommend a bigger and newer
SSD. Bigger SSDs tend to last much longer. Just don't use the whole of
it, to allow for better wear leveling, and you'll get a final setup that
can serve the system much longer than just the period of migration.
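
A minimal sketch of the lvmcache variant quoted above, reusing the
vgsys/backy names from this thread (the SSD device name and the
cache size are placeholders):

  # add the SSD to the existing VG and build a cache pool on it
  pvcreate /dev/sdX
  vgextend vgsys /dev/sdX
  lvcreate --type cache-pool -L 55G -n backy_cache vgsys /dev/sdX

  # attach it to the big LV; with writethrough a failing SSD costs
  # only the caching benefit, not data
  lvconvert --type cache --cachepool vgsys/backy_cache \
            --cachemode writethrough vgsys/backy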

-- 
Regards,
Kai

Replies to list-only preferred.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-04-01  4:26                           ` GWB
@ 2017-04-01 11:30                             ` Peter Grandi
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Grandi @ 2017-04-01 11:30 UTC (permalink / raw)
  To: Linux fs Btrfs

[ ... ]

>>>   $  D='btrfs f2fs gfs2 hfsplus jfs nilfs2 reiserfs udf xfs'
>>>   $  find $D -name '*.ko' | xargs size | sed 's/^  *//;s/ .*\t//g'
>>>   text    filename
>>>   832719  btrfs/btrfs.ko
>>>   237952  f2fs/f2fs.ko
>>>   251805  gfs2/gfs2.ko
>>>   72731   hfsplus/hfsplus.ko
>>>   171623  jfs/jfs.ko
>>>   173540  nilfs2/nilfs2.ko
>>>   214655  reiserfs/reiserfs.ko
>>>   81628   udf/udf.ko
>>>   658637  xfs/xfs.ko

That was Linux AMD64.

> udf is 637K on Mac OS 10.6
> exfat is 75K on Mac OS 10.9
> msdosfs is 79K on Mac OS 10.9
> ntfs is 394K (That must be Paragon's ntfs for Mac)
...
> zfs is 1.7M (10.9)
> spl is 247K (10.9)

Similar on Linux AMD64 but smaller:

  $ size updates/dkms/*.ko | sed 's/^  *//;s/ .*\t//g'
  text    filename
  62005   updates/dkms/spl.ko
  184370  updates/dkms/splat.ko
  3879    updates/dkms/zavl.ko
  22688   updates/dkms/zcommon.ko
  1012212 updates/dkms/zfs.ko
  39874   updates/dkms/znvpair.ko
  18321   updates/dkms/zpios.ko
  319224  updates/dkms/zunicode.ko

> If they are somehow comparable even with the differences, 833K
> is not bad for btrfs compared to zfs. I did not look at the
> format of the file; it must be binary, but compression may be
> optional for third party kexts. So the kernel module sizes are
> large for both btrfs and zfs. Given the feature sets of both,
> is that surprising?

Not surprising and indeed I agree with the statement that
appeared earlier that "there are use cases that actually need
them". There are also use cases that need realtime translation
of file content from Chinese to Spanish, and one could add to
ZFS or Btrfs an extension to detect the language of text files
and invoke Google Translate via HTTP, for example with option
"translate=chinese-spanish" at mount time; or less flexibly
there are many use cases where B-Tree lookup of records in files
is useful, and it would be possible to add that to Btrfs or ZFS,
so that for example 'lseek(4,"Jane Smith",SEEK_KEY)' would be
possible, as in the ancient TSS/370 filesystem design.

But the question is about engineering, where best to implement
those "feature sets": in the kernel or higher levels. There is
no doubt for me that realtime language translation and seeking
by key can be added to a filesystem kernel module, and would
"work". The issue is a crudely technical one: "works" for an
engineer is not a binary state, but a statistical property over
a wide spectrum of cost/benefit tradeoffs.

Adding "feature sets" because "there are use cases that actually
need them" is fine, adding their implementation to the kernel
driver of a filesystem is quite a different proposition, which
may have downsides, as the implementations of those feature sets
may make code more complex and harder to understand and test,
never mind debug, even for the base features. But of course lots
of people know better :-).

But there is more; look again at some compiled code sizes as a
crude proxy for complexity, divided into two groups, both of
robust, full-featured designs:

  1012212 updates/dkms/zfs.ko
  832719  btrfs/btrfs.ko
  658637  xfs/xfs.ko

  237952  f2fs/f2fs.ko
  173540  nilfs2/nilfs2.ko
  171623  jfs/jfs.ko
  81628   udf/udf.ko

The code size for JFS or NILFS2 or UDF is roughly 1/4 the code
size for XFS, yet there is little difference in functionality.
Compared to ZFS as to base functionality JFS lacks checksums and
snapshots (in theory it has subvolumes, but they are disabled),
but NILFS2 has snapshots and checksums (but does not verify them
on ordinary reads), and yet the code size is 1/6 that of ZFS.
ZFS also has RAID, but looking at the code size of the Linux MD
RAID modules I see rather smaller numbers. Even so ZFS has a
good reputation for reliability despite its amazing complexity,
but that is also because Sun invested big into massive release
engineering for it, and similarly for XFS.

Therefore my impression is that the filesystems in the first
group have a lot of cool features like compression or dedup
etc. that could have been implemented at user level, and having
them in the kernel is good "for 'marketing' purposes, to win
box-ticking competitions".

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Shrinking a device - performance?
  2017-03-27 11:51 Christian Theune
@ 2017-03-27 12:55 ` Christian Theune
  0 siblings, 0 replies; 42+ messages in thread
From: Christian Theune @ 2017-03-27 12:55 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 549 bytes --]


> On Mar 27, 2017, at 1:51 PM, Christian Theune <ct@flyingcircus.io> wrote:
> 
> Hi,
> 
> (I hope I’m not double posting. My mail client was misconfigured and I think I only managed to send the mail correctly this time.)

Turns out I did double post. Mea culpa.

--
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian. Theune, Christian. Zagrodnick


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Shrinking a device - performance?
@ 2017-03-27 11:51 Christian Theune
  2017-03-27 12:55 ` Christian Theune
  0 siblings, 1 reply; 42+ messages in thread
From: Christian Theune @ 2017-03-27 11:51 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1280 bytes --]

Hi,

(I hope I’m not double posting. My mail client was misconfigured and I think I only managed to send the mail correctly this time.)

I’m currently shrinking a device and it seems that the performance of shrink is abysmal. I intended to shrink a ~22TiB filesystem down to 20TiB. This is still using LVM underneath so that I can’t just remove a device from the filesystem but have to use the resize command.

Label: 'backy'  uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
       Total devices 1 FS bytes used 18.21TiB
       devid    1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy

This has been running since last Thursday, so roughly 3.5days now. The “used” number in devid1 has moved about 1TiB in this time. The filesystem is seeing regular usage (read and write) and when I’m suspending any application traffic I see about 1GiB of movement every now and then. Maybe once every 30 seconds or so.

Does this sound fishy or normal to you?

Kind regards,
Christian

--
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian. Theune, Christian. Zagrodnick


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2017-04-01 11:30 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-27 11:17 Shrinking a device - performance? Christian Theune
2017-03-27 13:07 ` Hugo Mills
2017-03-27 13:20   ` Christian Theune
2017-03-27 13:24     ` Hugo Mills
2017-03-27 13:46       ` Austin S. Hemmelgarn
2017-03-27 13:50         ` Christian Theune
2017-03-27 13:54           ` Christian Theune
2017-03-27 14:17             ` Austin S. Hemmelgarn
2017-03-27 14:49               ` Christian Theune
2017-03-27 15:06                 ` Roman Mamedov
2017-04-01  9:05                   ` Kai Krakow
2017-03-27 14:14           ` Austin S. Hemmelgarn
2017-03-27 14:48     ` Roman Mamedov
2017-03-27 14:53       ` Christian Theune
2017-03-28 14:43         ` Peter Grandi
2017-03-28 14:50           ` Tomasz Kusmierz
2017-03-28 15:06             ` Peter Grandi
2017-03-28 15:35               ` Tomasz Kusmierz
2017-03-28 16:20                 ` Peter Grandi
2017-03-28 14:59           ` Peter Grandi
2017-03-28 15:20             ` Peter Grandi
2017-03-28 15:56           ` Austin S. Hemmelgarn
2017-03-30 15:55             ` Peter Grandi
2017-03-31 12:41               ` Austin S. Hemmelgarn
2017-03-31 17:25                 ` Peter Grandi
2017-03-31 19:38                   ` GWB
2017-03-31 20:27                     ` Peter Grandi
2017-04-01  0:02                       ` GWB
2017-04-01  2:42                         ` Duncan
2017-04-01  4:26                           ` GWB
2017-04-01 11:30                             ` Peter Grandi
2017-03-30 15:00           ` Piotr Pawłow
2017-03-30 16:13             ` Peter Grandi
2017-03-30 22:13               ` Piotr Pawłow
2017-03-31  1:00                 ` GWB
2017-03-31  5:26                   ` Duncan
2017-03-31  5:38                     ` Duncan
2017-03-31 12:37                       ` Peter Grandi
2017-03-31 11:37                   ` Peter Grandi
2017-03-31 10:51                 ` Peter Grandi
2017-03-27 11:51 Christian Theune
2017-03-27 12:55 ` Christian Theune
