* Healthy amount of free space?
@ 2018-07-16 20:58 Wolf
2018-07-17 7:20 ` Nikolay Borisov
2018-07-17 11:46 ` Austin S. Hemmelgarn
0 siblings, 2 replies; 19+ messages in thread
From: Wolf @ 2018-07-16 20:58 UTC (permalink / raw)
To: linux-btrfs
Greetings,
I would like to ask what is a healthy amount of free space to keep on
each device for btrfs to be happy?
This is what my disk array currently looks like:
[root@dennas ~]# btrfs fi usage /raid
Overall:
Device size: 29.11TiB
Device allocated: 21.26TiB
Device unallocated: 7.85TiB
Device missing: 0.00B
Used: 21.18TiB
Free (estimated): 3.96TiB (min: 3.96TiB)
Data ratio: 2.00
Metadata ratio: 2.00
Global reserve: 512.00MiB (used: 0.00B)
Data,RAID1: Size:10.61TiB, Used:10.58TiB
/dev/mapper/data1 1.75TiB
/dev/mapper/data2 1.75TiB
/dev/mapper/data3 856.00GiB
/dev/mapper/data4 856.00GiB
/dev/mapper/data5 1.75TiB
/dev/mapper/data6 1.75TiB
/dev/mapper/data7 6.29TiB
/dev/mapper/data8 6.29TiB
Metadata,RAID1: Size:15.00GiB, Used:13.00GiB
/dev/mapper/data1 2.00GiB
/dev/mapper/data2 3.00GiB
/dev/mapper/data3 1.00GiB
/dev/mapper/data4 1.00GiB
/dev/mapper/data5 3.00GiB
/dev/mapper/data6 1.00GiB
/dev/mapper/data7 9.00GiB
/dev/mapper/data8 10.00GiB
System,RAID1: Size:64.00MiB, Used:1.50MiB
/dev/mapper/data2 32.00MiB
/dev/mapper/data6 32.00MiB
/dev/mapper/data7 32.00MiB
/dev/mapper/data8 32.00MiB
Unallocated:
/dev/mapper/data1 1004.52GiB
/dev/mapper/data2 1004.49GiB
/dev/mapper/data3 1006.01GiB
/dev/mapper/data4 1006.01GiB
/dev/mapper/data5 1004.52GiB
/dev/mapper/data6 1004.49GiB
/dev/mapper/data7 1005.00GiB
/dev/mapper/data8 1005.00GiB
Btrfs does quite a good job of evenly using space on all devices. Now, how
low can I let that go? In other words, with how much free/unallocated
space remaining should I consider adding a new disk?
Thanks for the advice :)
W.
--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
* Re: Healthy amount of free space?
2018-07-16 20:58 Healthy amount of free space? Wolf
@ 2018-07-17 7:20 ` Nikolay Borisov
2018-07-17 8:02 ` Martin Steigerwald
2018-07-17 11:46 ` Austin S. Hemmelgarn
1 sibling, 1 reply; 19+ messages in thread
From: Nikolay Borisov @ 2018-07-17 7:20 UTC (permalink / raw)
To: Wolf, linux-btrfs
On 16.07.2018 23:58, Wolf wrote:
> Greetings,
> I would like to ask what what is healthy amount of free space to keep on
> each device for btrfs to be happy?
>
> This is how my disk array currently looks like
>
> [root@dennas ~]# btrfs fi usage /raid
> Overall:
> Device size: 29.11TiB
> Device allocated: 21.26TiB
> Device unallocated: 7.85TiB
> Device missing: 0.00B
> Used: 21.18TiB
> Free (estimated): 3.96TiB (min: 3.96TiB)
> Data ratio: 2.00
> Metadata ratio: 2.00
> Global reserve: 512.00MiB (used: 0.00B)
>
> Data,RAID1: Size:10.61TiB, Used:10.58TiB
> /dev/mapper/data1 1.75TiB
> /dev/mapper/data2 1.75TiB
> /dev/mapper/data3 856.00GiB
> /dev/mapper/data4 856.00GiB
> /dev/mapper/data5 1.75TiB
> /dev/mapper/data6 1.75TiB
> /dev/mapper/data7 6.29TiB
> /dev/mapper/data8 6.29TiB
>
> Metadata,RAID1: Size:15.00GiB, Used:13.00GiB
> /dev/mapper/data1 2.00GiB
> /dev/mapper/data2 3.00GiB
> /dev/mapper/data3 1.00GiB
> /dev/mapper/data4 1.00GiB
> /dev/mapper/data5 3.00GiB
> /dev/mapper/data6 1.00GiB
> /dev/mapper/data7 9.00GiB
> /dev/mapper/data8 10.00GiB
>
> System,RAID1: Size:64.00MiB, Used:1.50MiB
> /dev/mapper/data2 32.00MiB
> /dev/mapper/data6 32.00MiB
> /dev/mapper/data7 32.00MiB
> /dev/mapper/data8 32.00MiB
>
> Unallocated:
> /dev/mapper/data1 1004.52GiB
> /dev/mapper/data2 1004.49GiB
> /dev/mapper/data3 1006.01GiB
> /dev/mapper/data4 1006.01GiB
> /dev/mapper/data5 1004.52GiB
> /dev/mapper/data6 1004.49GiB
> /dev/mapper/data7 1005.00GiB
> /dev/mapper/data8 1005.00GiB
>
> Btrfs does quite good job of evenly using space on all devices. No, how
> low can I let that go? In other words, with how much space
> free/unallocated remaining space should I consider adding new disk?
Btrfs will start running into problems when you run out of unallocated
space. So the best advice is to monitor your per-device unallocated
space; once it gets really low - say 2-3 GiB - I suggest you run a
balance, which will try to free up unallocated space by rewriting data
more compactly into sparsely populated block groups. If after running
balance you haven't really freed any space, then you should consider
adding a new drive and running balance again to even out the spread of
data/metadata.
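As a rough sketch of that workflow (illustrative only - the /raid mount
point is taken from the report above, and the usage thresholds are
arbitrary values you would tune to your own array):

btrfs filesystem usage /raid    # watch the per-device "Unallocated" lines
btrfs device usage /raid        # the same information, broken down per device

# Compact partially filled block groups; start with a low usage filter and
# only raise it if nothing is reclaimed, since higher values rewrite more data
btrfs balance start -dusage=25 /raid
btrfs balance start -dusage=50 -musage=50 /raid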
>
> Thanks for advice :)
>
> W.
>
* Re: Healthy amount of free space?
2018-07-17 7:20 ` Nikolay Borisov
@ 2018-07-17 8:02 ` Martin Steigerwald
2018-07-17 8:16 ` Nikolay Borisov
0 siblings, 1 reply; 19+ messages in thread
From: Martin Steigerwald @ 2018-07-17 8:02 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Wolf, linux-btrfs
Hi Nikolay.
Nikolay Borisov - 17.07.18, 09:20:
> On 16.07.2018 23:58, Wolf wrote:
> > Greetings,
> > I would like to ask what what is healthy amount of free space to
> > keep on each device for btrfs to be happy?
> >
> > This is how my disk array currently looks like
> >
> > [root@dennas ~]# btrfs fi usage /raid
> >
> > Overall:
> > Device size: 29.11TiB
> > Device allocated: 21.26TiB
> > Device unallocated: 7.85TiB
> > Device missing: 0.00B
> > Used: 21.18TiB
> > Free (estimated): 3.96TiB (min: 3.96TiB)
> > Data ratio: 2.00
> > Metadata ratio: 2.00
> > Global reserve: 512.00MiB (used: 0.00B)
[…]
> > Btrfs does quite good job of evenly using space on all devices. No,
> > how low can I let that go? In other words, with how much space
> > free/unallocated remaining space should I consider adding new disk?
>
> Btrfs will start running into problems when you run out of unallocated
> space. So the best advice will be monitor your device unallocated,
> once it gets really low - like 2-3 gb I will suggest you run balance
> which will try to free up unallocated space by rewriting data more
> compactly into sparsely populated block groups. If after running
> balance you haven't really freed any space then you should consider
> adding a new drive and running balance to even out the spread of
> data/metadata.
What are these issues exactly?
I have
% btrfs fi us -T /home
Overall:
Device size: 340.00GiB
Device allocated: 340.00GiB
Device unallocated: 2.00MiB
Device missing: 0.00B
Used: 308.37GiB
Free (estimated): 14.65GiB (min: 14.65GiB)
Data ratio: 2.00
Metadata ratio: 2.00
Global reserve: 512.00MiB (used: 0.00B)
Data Metadata System
Id Path RAID1 RAID1 RAID1 Unallocated
-- ---------------------- --------- -------- -------- -----------
1 /dev/mapper/msata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB
2 /dev/mapper/sata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB
-- ---------------------- --------- -------- -------- -----------
Total 165.89GiB 4.08GiB 32.00MiB 2.00MiB
Used 151.24GiB 2.95GiB 48.00KiB
on a RAID-1 filesystem to which one, and part of the time two, Plasma
desktops + KDEPIM and Akonadi + Baloo desktop search + you name it write
like mad.
Since kernel 4.5 or 4.6 this simply works. Before that, BTRFS sometimes
crawled to a halt searching for free blocks, and I had to switch off the
laptop uncleanly. If that happened, a balance helped for a while.
But since 4.5 or 4.6 this has not happened anymore.
I found that with SLES 12 SP3 or so, btrfsmaintenance runs a balance
weekly. That created an issue on our Proxmox + Ceph opensource demo lab
based on Intel NUCs. This is for sure no recommended configuration for
Ceph, and Ceph is quite slow on these 2.5 inch harddisks and a 1 GBit
network link, despite some - albeit minimal, limited to 5 GiB - m.2 SSD
caching. What happened is that the VM crawled to a halt and the kernel
gave "task hung for more than 120 seconds" messages. The VM was
basically unusable during the balance. Sure, that should not happen with
a "proper" setup, but it also did not happen without the automatic
balance.
Also, what would happen on a hypervisor setup with several thousand
BTRFS VMs, when several hundred of them decide to start a balance at
around the same time? It could probably bring the underlying I/O system
to a halt, as many enterprise storage systems are designed to sustain
burst I/O loads, but not maximum utilization over an extended period of
time.
I am really wondering what to recommend in my Linux performance tuning
and analysis courses. On my own laptop I do not do regular balances so
far, following the thinking: if it is not broken, do not fix it.
My personal opinion here also is: if the filesystem degrades so much
that it becomes unusable without regular maintenance from user space,
the filesystem needs to be fixed. Ideally I would not have to worry
about whether to regularly balance a BTRFS filesystem or not. In other
words: I should not have to attend a performance analysis and tuning
course in order to use a computer with a BTRFS filesystem.
Thanks,
--
Martin
* Re: Healthy amount of free space?
2018-07-17 8:02 ` Martin Steigerwald
@ 2018-07-17 8:16 ` Nikolay Borisov
2018-07-17 17:54 ` Martin Steigerwald
0 siblings, 1 reply; 19+ messages in thread
From: Nikolay Borisov @ 2018-07-17 8:16 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: Wolf, linux-btrfs
On 17.07.2018 11:02, Martin Steigerwald wrote:
> Hi Nikolay.
>
> Nikolay Borisov - 17.07.18, 09:20:
>> On 16.07.2018 23:58, Wolf wrote:
>>> Greetings,
>>> I would like to ask what what is healthy amount of free space to
>>> keep on each device for btrfs to be happy?
>>>
>>> This is how my disk array currently looks like
>>>
>>> [root@dennas ~]# btrfs fi usage /raid
>>>
>>> Overall:
>>> Device size: 29.11TiB
>>> Device allocated: 21.26TiB
>>> Device unallocated: 7.85TiB
>>> Device missing: 0.00B
>>> Used: 21.18TiB
>>> Free (estimated): 3.96TiB (min: 3.96TiB)
>>> Data ratio: 2.00
>>> Metadata ratio: 2.00
>>> Global reserve: 512.00MiB (used: 0.00B)
> […]
>>> Btrfs does quite good job of evenly using space on all devices. No,
>>> how low can I let that go? In other words, with how much space
>>> free/unallocated remaining space should I consider adding new disk?
>>
>> Btrfs will start running into problems when you run out of unallocated
>> space. So the best advice will be monitor your device unallocated,
>> once it gets really low - like 2-3 gb I will suggest you run balance
>> which will try to free up unallocated space by rewriting data more
>> compactly into sparsely populated block groups. If after running
>> balance you haven't really freed any space then you should consider
>> adding a new drive and running balance to even out the spread of
>> data/metadata.
>
> What are these issues exactly?
For example, if you have plenty of data space but your metadata is full,
then you will be getting ENOSPC.
>
> I have
>
> % btrfs fi us -T /home
> Overall:
> Device size: 340.00GiB
> Device allocated: 340.00GiB
> Device unallocated: 2.00MiB
> Device missing: 0.00B
> Used: 308.37GiB
> Free (estimated): 14.65GiB (min: 14.65GiB)
> Data ratio: 2.00
> Metadata ratio: 2.00
> Global reserve: 512.00MiB (used: 0.00B)
>
> Data Metadata System
> Id Path RAID1 RAID1 RAID1 Unallocated
> -- ---------------------- --------- -------- -------- -----------
> 1 /dev/mapper/msata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB
> 2 /dev/mapper/sata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB
> -- ---------------------- --------- -------- -------- -----------
> Total 165.89GiB 4.08GiB 32.00MiB 2.00MiB
> Used 151.24GiB 2.95GiB 48.00KiB
Your metadata is already quite full (2.95 GiB used out of the 4.08 GiB
allocated), so if your workload turned out to actually be making more
metadata-heavy changes, e.g. snapshots, you could exhaust it and get
ENOSPC despite having around 14 GiB of free data space. Furthermore,
this data space is spread across multiple data chunks; depending on how
populated they are, a balance could be able to free up unallocated space
which could later be re-purposed for metadata (again, depending on what
you are doing).
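As a sketch of what that could look like in practice (the /home mount
point is just the one from the output above, and the 20% threshold is an
arbitrary example), a usage-filtered balance rewrites only nearly empty
data chunks, so the freed space can later be allocated as metadata:

# Rewrite only data block groups that are at most 20% full; well-filled
# chunks are left alone, which keeps the I/O cost far below a full balance
btrfs balance start -dusage=20 /home

# Then check whether the per-device unallocated space actually grew
btrfs filesystem usage -T /home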
>
> on a RAID-1 filesystem one, part of the time two Plasma desktops +
> KDEPIM and Akonadi + Baloo desktop search + you name it write to like
> mad.
>
<snip>
>
> Thanks,
>
* Re: Healthy amount of free space?
2018-07-16 20:58 Healthy amount of free space? Wolf
2018-07-17 7:20 ` Nikolay Borisov
@ 2018-07-17 11:46 ` Austin S. Hemmelgarn
1 sibling, 0 replies; 19+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-17 11:46 UTC (permalink / raw)
To: Wolf, linux-btrfs
On 2018-07-16 16:58, Wolf wrote:
> Greetings,
> I would like to ask what what is healthy amount of free space to keep on
> each device for btrfs to be happy?
>
> This is how my disk array currently looks like
>
> [root@dennas ~]# btrfs fi usage /raid
> Overall:
> Device size: 29.11TiB
> Device allocated: 21.26TiB
> Device unallocated: 7.85TiB
> Device missing: 0.00B
> Used: 21.18TiB
> Free (estimated): 3.96TiB (min: 3.96TiB)
> Data ratio: 2.00
> Metadata ratio: 2.00
> Global reserve: 512.00MiB (used: 0.00B)
>
> Data,RAID1: Size:10.61TiB, Used:10.58TiB
> /dev/mapper/data1 1.75TiB
> /dev/mapper/data2 1.75TiB
> /dev/mapper/data3 856.00GiB
> /dev/mapper/data4 856.00GiB
> /dev/mapper/data5 1.75TiB
> /dev/mapper/data6 1.75TiB
> /dev/mapper/data7 6.29TiB
> /dev/mapper/data8 6.29TiB
>
> Metadata,RAID1: Size:15.00GiB, Used:13.00GiB
> /dev/mapper/data1 2.00GiB
> /dev/mapper/data2 3.00GiB
> /dev/mapper/data3 1.00GiB
> /dev/mapper/data4 1.00GiB
> /dev/mapper/data5 3.00GiB
> /dev/mapper/data6 1.00GiB
> /dev/mapper/data7 9.00GiB
> /dev/mapper/data8 10.00GiB
Slightly OT, but the distribution of metadata chunks across devices
looks a bit sub-optimal here. If you can tolerate the volume being
somewhat slower for a while, I'd suggest balancing these (it should get
you better performance long-term).
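If it helps, a metadata-only balance along these lines (a sketch only,
using the /raid mount point from the report above) leaves the data
chunks untouched:

# Rebalance only the metadata block groups so they spread more evenly
# across the eight devices; data chunks are not rewritten
btrfs balance start -m /raid

# Or, more conservatively, rewrite only metadata chunks below a fill level
btrfs balance start -musage=60 /raid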
>
> System,RAID1: Size:64.00MiB, Used:1.50MiB
> /dev/mapper/data2 32.00MiB
> /dev/mapper/data6 32.00MiB
> /dev/mapper/data7 32.00MiB
> /dev/mapper/data8 32.00MiB
>
> Unallocated:
> /dev/mapper/data1 1004.52GiB
> /dev/mapper/data2 1004.49GiB
> /dev/mapper/data3 1006.01GiB
> /dev/mapper/data4 1006.01GiB
> /dev/mapper/data5 1004.52GiB
> /dev/mapper/data6 1004.49GiB
> /dev/mapper/data7 1005.00GiB
> /dev/mapper/data8 1005.00GiB
>
> Btrfs does quite good job of evenly using space on all devices. No, how
> low can I let that go? In other words, with how much space
> free/unallocated remaining space should I consider adding new disk?
Disclaimer: What I'm about to say is based on personal experience. YMMV.
It depends on how you use the filesystem.
Realistically, there are a couple of things I consider when trying to
decide on this myself:
* How quickly does the total usage increase on average, and how much can
it be expected to increase in one day in the worst-case scenario? This
isn't really BTRFS specific, but it's worth mentioning. I usually don't
let an array get close enough to full that it couldn't safely handle at
least one day of the worst-case increase plus another two days of
average increases. In BTRFS terms, the 'safely handle' part means you
should be adding about 5 GB for a multi-TB array like you have, or about
1 GB for a sub-TB array. (A minimal monitoring sketch follows after this
list.)
* What are the typical write patterns? Do files get rewritten in-place,
or are they only ever rewritten with a replace-by-rename? Are writes
mostly random, or mostly sequential? Are writes mostly small or mostly
large? The more towards the first possibility listed in each of those
question (in-place rewrites, random access, and small writes), the more
free space you should keep on the volume.
* Does this volume see heavy usage of fallocate(), either to preallocate
space (note that this _DOES NOT WORK SANELY_ on BTRFS) or to punch holes
or remove ranges from files? If whatever software you're using does this
a lot on this volume, you want even more free space.
* Do old files tend to get removed in large batches? That is, possibly
hundreds or thousands of files at a time. If so, and you're running a
reasonably recent (4.x series) kernel or regularly balance the volume to
clean up empty chunks, you don't need quite as much free space.
* How quickly can you get a new device added, and is it critical that
this volume always be writable? Sounds stupid, but a lot of people
don't consider this. If you can trivially get a new device added
immediately, you can generally let things go a bit further than you
normally would; the same goes if the volume being read-only can be
tolerated for a while without significant issues.
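To track the growth rate mentioned in the first point, something as
simple as the following could work (a hypothetical crontab entry; the
mount point, schedule and log file are placeholders):

# Record overall and per-device usage once a day, so the average and
# worst-case daily increase can be read off later
0 3 * * * btrfs filesystem usage /raid >> /var/log/btrfs-usage.log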
It's worth noting that I explicitly do not care about snapshot usage.
It rarely has much impact on this other than changing how the total
usage increases in a day.
Evaluating all of this is of course something I can't really do for you.
If I had to guess, with no other information than the allocations shown,
I'd say that you're probably generally fine until you get down to about
5GB more than twice the average amount by which the total usage
increases in a day. That's a rather conservative guess without any spare
overhead for more than a day, and it assumes you aren't using fallocate
much but have an otherwise evenly mixed write/delete workload.
* Re: Healthy amount of free space?
2018-07-17 8:16 ` Nikolay Borisov
@ 2018-07-17 17:54 ` Martin Steigerwald
2018-07-18 12:35 ` Austin S. Hemmelgarn
0 siblings, 1 reply; 19+ messages in thread
From: Martin Steigerwald @ 2018-07-17 17:54 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Wolf, linux-btrfs
Nikolay Borisov - 17.07.18, 10:16:
> On 17.07.2018 11:02, Martin Steigerwald wrote:
> > Nikolay Borisov - 17.07.18, 09:20:
> >> On 16.07.2018 23:58, Wolf wrote:
> >>> Greetings,
> >>> I would like to ask what what is healthy amount of free space to
> >>> keep on each device for btrfs to be happy?
> >>>
> >>> This is how my disk array currently looks like
> >>>
> >>> [root@dennas ~]# btrfs fi usage /raid
> >>>
> >>> Overall:
> >>> Device size: 29.11TiB
> >>> Device allocated: 21.26TiB
> >>> Device unallocated: 7.85TiB
> >>> Device missing: 0.00B
> >>> Used: 21.18TiB
> >>> Free (estimated): 3.96TiB (min: 3.96TiB)
> >>> Data ratio: 2.00
> >>> Metadata ratio: 2.00
> >>> Global reserve: 512.00MiB (used: 0.00B)
> >
> > […]
> >
> >>> Btrfs does quite good job of evenly using space on all devices.
> >>> No,
> >>> how low can I let that go? In other words, with how much space
> >>> free/unallocated remaining space should I consider adding new
> >>> disk?
> >>
> >> Btrfs will start running into problems when you run out of
> >> unallocated space. So the best advice will be monitor your device
> >> unallocated, once it gets really low - like 2-3 gb I will suggest
> >> you run balance which will try to free up unallocated space by
> >> rewriting data more compactly into sparsely populated block
> >> groups. If after running balance you haven't really freed any
> >> space then you should consider adding a new drive and running
> >> balance to even out the spread of data/metadata.
> >
> > What are these issues exactly?
>
> For example if you have plenty of data space but your metadata is full
> then you will be getting ENOSPC.
Of that one I am aware.
It just has not happened so far.
I did not yet add it explicitly to the training slides, but I just made
myself a note to do that.
Anything else?
> > I have
> >
> > % btrfs fi us -T /home
> >
> > Overall:
> > Device size: 340.00GiB
> > Device allocated: 340.00GiB
> > Device unallocated: 2.00MiB
> > Device missing: 0.00B
> > Used: 308.37GiB
> > Free (estimated): 14.65GiB (min: 14.65GiB)
> > Data ratio: 2.00
> > Metadata ratio: 2.00
> > Global reserve: 512.00MiB (used: 0.00B)
> >
> > Data Metadata System
> >
> > Id Path RAID1 RAID1 RAID1 Unallocated
> > -- ---------------------- --------- -------- -------- -----------
> >
> > 1 /dev/mapper/msata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB
> > 2 /dev/mapper/sata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB
> >
> > -- ---------------------- --------- -------- -------- -----------
> >
> > Total 165.89GiB 4.08GiB 32.00MiB 2.00MiB
> > Used 151.24GiB 2.95GiB 48.00KiB
>
> You already have only 33% of your metadata full so if your workload
> turned out to actually be making more metadata-heavy changed i.e
> snapshots you could exhaust this and get ENOSPC, despite having around
> 14gb of free data space. Furthermore this data space is spread around
> multiple data chunks, depending on how populated they are a balance
> could be able to free up unallocated space which later could be
> re-purposed for metadata (again, depending on what you are doing).
The filesystem above IMO is not fit for snapshots. It would fill up
rather quickly, I think even if I balanced metadata. Actually, I tried
this, and as I remember it took at most a day until it was full.
If I read the above figures correctly, at maximum I could currently gain
one additional GiB by balancing metadata. That would not make a huge
difference.
I bet I am already running this filesystem beyond recommendation, as I
bet many would argue it is too full already for regular usage… I do not
see the benefit of squeezing the last free space out of it just to fit
in another GiB.
So I still do not see why it would make sense to balance it at this
point in time, especially as the 1 GiB I could regain is not even
needed. And I do not see the point of balancing it weekly. I would
regain about 1 GiB of metadata space every now and then, but the cost
would be a lot of additional I/O to the SSDs. They have taken it very
nicely so far, but I think, right now, there is simply no point in
balancing, at least not regularly, unless…
there were a performance gain. However, whenever I balanced a complete
filesystem with data and metadata, I saw a gross drop in performance,
like doubling the boot time for example (no scientific measurement, just
my personal observation). I admit I have not done this for a long time
and balancing might have gotten better during the last few years of
kernel development, but I am not yet convinced of that.
So is balancing this filesystem likely to improve its performance? And
if so, why?
What it could improve, I think, is allocation of new data, because after
the balance BTRFS might have freed some chunks, so when lots of new data
is written it does not have to search inside existing chunks, whose free
space is likely to fragment over time.
I would just like to understand this better. Right now I am quite
confused about what recommendations to give about balancing.
I bet the SLES developers had a good reason for going with weekly
balancing. Right now I just don't get it. And as you work at SUSE, I
thought I would just ask about it.
I am aware of some earlier threads, but I have not read everything that
has been discussed so far. In case there is a good summary, feel free to
point me to it.
I bet a page in the BTRFS wiki about performance aspects would be a nice
idea. I would even create one, if I can still access the wiki.
> > on a RAID-1 filesystem one, part of the time two Plasma desktops +
> > KDEPIM and Akonadi + Baloo desktop search + you name it write to
> > like
> > mad.
Thanks,
--
Martin
* Re: Healthy amount of free space?
2018-07-17 17:54 ` Martin Steigerwald
@ 2018-07-18 12:35 ` Austin S. Hemmelgarn
2018-07-18 13:07 ` Chris Murphy
0 siblings, 1 reply; 19+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-18 12:35 UTC (permalink / raw)
To: Martin Steigerwald, Nikolay Borisov; +Cc: Wolf, linux-btrfs
On 2018-07-17 13:54, Martin Steigerwald wrote:
> Nikolay Borisov - 17.07.18, 10:16:
>> On 17.07.2018 11:02, Martin Steigerwald wrote:
>>> Nikolay Borisov - 17.07.18, 09:20:
>>>> On 16.07.2018 23:58, Wolf wrote:
>>>>> Greetings,
>>>>> I would like to ask what what is healthy amount of free space to
>>>>> keep on each device for btrfs to be happy?
>>>>>
>>>>> This is how my disk array currently looks like
>>>>>
>>>>> [root@dennas ~]# btrfs fi usage /raid
>>>>>
>>>>> Overall:
>>>>> Device size: 29.11TiB
>>>>> Device allocated: 21.26TiB
>>>>> Device unallocated: 7.85TiB
>>>>> Device missing: 0.00B
>>>>> Used: 21.18TiB
>>>>> Free (estimated): 3.96TiB (min: 3.96TiB)
>>>>> Data ratio: 2.00
>>>>> Metadata ratio: 2.00
>>>>> Global reserve: 512.00MiB (used: 0.00B)
>>>
>>> […]
>>>
>>>>> Btrfs does quite good job of evenly using space on all devices.
>>>>> No,
>>>>> how low can I let that go? In other words, with how much space
>>>>> free/unallocated remaining space should I consider adding new
>>>>> disk?
>>>>
>>>> Btrfs will start running into problems when you run out of
>>>> unallocated space. So the best advice will be monitor your device
>>>> unallocated, once it gets really low - like 2-3 gb I will suggest
>>>> you run balance which will try to free up unallocated space by
>>>> rewriting data more compactly into sparsely populated block
>>>> groups. If after running balance you haven't really freed any
>>>> space then you should consider adding a new drive and running
>>>> balance to even out the spread of data/metadata.
>>>
>>> What are these issues exactly?
>>
>> For example if you have plenty of data space but your metadata is full
>> then you will be getting ENOSPC.
>
> Of that one I am aware.
>
> This just did not happen so far.
>
> I did not yet add it explicitly to the training slides, but I just make
> myself a note to do that.
>
> Anything else?
If you're doing a training presentation, it may be worth mentioning that
preallocation with fallocate() does not behave the same on BTRFS as it
does on other filesystems. For example, the following sequence of commands:
fallocate -l X ./tmp
dd if=/dev/zero of=./tmp bs=1 count=X
Will always work on ext4, XFS, and most other filesystems, for any value
of X between zero and just below the total amount of free space on the
filesystem. On BTRFS though, it will reliably fail with ENOSPC for
values of X that are greater than _half_ of the total amount of free
space on the filesystem (actually, greater than just short of half). In
essence, preallocating space does not prevent COW semantics for the
first write unless the file is marked NOCOW.
* Re: Healthy amount of free space?
2018-07-18 12:35 ` Austin S. Hemmelgarn
@ 2018-07-18 13:07 ` Chris Murphy
2018-07-18 13:30 ` Austin S. Hemmelgarn
0 siblings, 1 reply; 19+ messages in thread
From: Chris Murphy @ 2018-07-18 13:07 UTC (permalink / raw)
To: Austin S. Hemmelgarn
Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS
On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> If you're doing a training presentation, it may be worth mentioning that
> preallocation with fallocate() does not behave the same on BTRFS as it does
> on other filesystems. For example, the following sequence of commands:
>
> fallocate -l X ./tmp
> dd if=/dev/zero of=./tmp bs=1 count=X
>
> Will always work on ext4, XFS, and most other filesystems, for any value of
> X between zero and just below the total amount of free space on the
> filesystem. On BTRFS though, it will reliably fail with ENOSPC for values
> of X that are greater than _half_ of the total amount of free space on the
> filesystem (actually, greater than just short of half). In essence,
> preallocating space does not prevent COW semantics for the first write
> unless the file is marked NOCOW.
Is this a bug, or is it suboptimal behavior, or is it intentional?
And then I wonder what happens with XFS COW:
fallocate -l X ./tmp
cp --reflink ./tmp ./tmp2
dd if=/dev/zero of=./tmp bs=1 count=X
--
Chris Murphy
* Re: Healthy amount of free space?
2018-07-18 13:07 ` Chris Murphy
@ 2018-07-18 13:30 ` Austin S. Hemmelgarn
2018-07-18 17:04 ` Chris Murphy
2018-07-20 5:01 ` Andrei Borzenkov
0 siblings, 2 replies; 19+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-18 13:30 UTC (permalink / raw)
To: Chris Murphy; +Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS
On 2018-07-18 09:07, Chris Murphy wrote:
> On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>
>> If you're doing a training presentation, it may be worth mentioning that
>> preallocation with fallocate() does not behave the same on BTRFS as it does
>> on other filesystems. For example, the following sequence of commands:
>>
>> fallocate -l X ./tmp
>> dd if=/dev/zero of=./tmp bs=1 count=X
>>
>> Will always work on ext4, XFS, and most other filesystems, for any value of
>> X between zero and just below the total amount of free space on the
>> filesystem. On BTRFS though, it will reliably fail with ENOSPC for values
>> of X that are greater than _half_ of the total amount of free space on the
>> filesystem (actually, greater than just short of half). In essence,
>> preallocating space does not prevent COW semantics for the first write
>> unless the file is marked NOCOW.
>
> Is this a bug, or is it suboptimal behavior, or is it intentional?
It's been discussed before, though I can't find the email thread right
now. Pretty much, this is _technically_ not incorrect behavior, as the
documentation for fallocate doesn't say that subsequent writes can't
fail due to lack of space. I personally consider it a bug though
because it breaks from existing behavior in a way that is avoidable and
defies user expectations.
There are two issues here:
1. Regions preallocated with fallocate still do COW on the first write
to any given block in that region. This can be handled by either
treating the first write to each block as NOCOW, or by allocating a bit
of extra space and doing a rotating approach like this for writes:
- Write goes into the extra space.
- Once the write is done, convert the region covered by the write
into a new block of extra space.
- When the final block of the preallocated region is written,
deallocate the extra space.
2. Preallocation does not completely account for necessary metadata
space that will be needed to store the data there. This may not be
necessary if the first issue is addressed properly.
>
> And then I wonder what happens with XFS COW:
>
> fallocate -l X ./tmp
> cp --reflink ./tmp ./tmp2
> dd if=/dev/zero of=./tmp bs=1 count=X
I'm not sure. In this particular case, this will fail on BTRFS for any
X larger than just short of one third of the total free space. I would
expect it to fail for any X larger than just short of half instead.
ZFS gets around this by not supporting fallocate (well, kind of: if
you're using glibc and call posix_fallocate, that _will_ work, but it
will take forever because it works by writing out each block of space
that's being allocated, which, ironically, means that it potentially
still suffers from the same issue that we have).
* Re: Healthy amount of free space?
2018-07-18 13:30 ` Austin S. Hemmelgarn
@ 2018-07-18 17:04 ` Chris Murphy
2018-07-18 17:06 ` Austin S. Hemmelgarn
2018-07-20 5:01 ` Andrei Borzenkov
1 sibling, 1 reply; 19+ messages in thread
From: Chris Murphy @ 2018-07-18 17:04 UTC (permalink / raw)
To: Austin S. Hemmelgarn
Cc: Chris Murphy, Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS
On Wed, Jul 18, 2018 at 7:30 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
>
> I'm not sure. In this particular case, this will fail on BTRFS for any X
> larger than just short of one third of the total free space. I would expect
> it to fail for any X larger than just short of half instead.
I'm confused. I can't get it to fail when X is 3/4 of free space.
lvcreate -V 2g -T vg/thintastic -n btrfstest
mkfs.btrfs -M /dev/mapper/vg-btrfstest
mount /dev/mapper/vg-btrfstest /mnt/btrfs
cd /mnt/btrfs
fallocate -l 1500m tmp
dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450
Succeeds. No enospc. This is on kernel 4.17.6.
Copied from terminal:
[chris@f28s btrfs]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest 2.0G 17M 2.0G 1% /mnt/btrfs
[chris@f28s btrfs]$ sudo fallocate -l 1500m /mnt/btrfs/tmp
[chris@f28s btrfs]$ filefrag -v tmp
Filesystem type is: 9123683e
File size of tmp is 1572864000 (384000 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 32767: 16400.. 49167: 32768: unwritten
1: 32768.. 65535: 56576.. 89343: 32768: 49168: unwritten
2: 65536.. 98303: 109824.. 142591: 32768: 89344: unwritten
3: 98304.. 131071: 163072.. 195839: 32768: 142592: unwritten
4: 131072.. 163839: 216320.. 249087: 32768: 195840: unwritten
5: 163840.. 196607: 269568.. 302335: 32768: 249088: unwritten
6: 196608.. 229375: 322816.. 355583: 32768: 302336: unwritten
7: 229376.. 262143: 376064.. 408831: 32768: 355584: unwritten
8: 262144.. 294911: 429312.. 462079: 32768: 408832: unwritten
9: 294912.. 327679: 482560.. 515327: 32768: 462080: unwritten
10: 327680.. 344063: 89344.. 105727: 16384: 515328: unwritten
11: 344064.. 360447: 142592.. 158975: 16384: 105728: unwritten
12: 360448.. 376831: 195840.. 212223: 16384: 158976: unwritten
13: 376832.. 383999: 249088.. 256255: 7168: 212224:
last,unwritten,eof
tmp: 14 extents found
[chris@f28s btrfs]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest 2.0G 1.5G 543M 74% /mnt/btrfs
[chris@f28s btrfs]$ sudo dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450
1450+0 records in
1450+0 records out
1520435200 bytes (1.5 GB, 1.4 GiB) copied, 13.4757 s, 113 MB/s
[chris@f28s btrfs]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest 2.0G 1.5G 591M 72% /mnt/btrfs
[chris@f28s btrfs]$ filefrag -v tmp
Filesystem type is: 9123683e
File size of tmp is 1520435200 (371200 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 16383: 302336.. 318719: 16384:
1: 16384.. 32767: 355584.. 371967: 16384: 318720:
2: 32768.. 49151: 408832.. 425215: 16384: 371968:
3: 49152.. 65535: 462080.. 478463: 16384: 425216:
4: 65536.. 73727: 515328.. 523519: 8192: 478464:
5: 73728.. 86015: 3328.. 15615: 12288: 523520:
6: 86016.. 98303: 256256.. 268543: 12288: 15616:
7: 98304.. 104959: 49168.. 55823: 6656: 268544:
8: 104960.. 109047: 105728.. 109815: 4088: 55824:
9: 109048.. 113143: 158976.. 163071: 4096: 109816:
10: 113144.. 117239: 212224.. 216319: 4096: 163072:
11: 117240.. 121335: 318720.. 322815: 4096: 216320:
12: 121336.. 125431: 371968.. 376063: 4096: 322816:
13: 125432.. 128251: 425216.. 428035: 2820: 376064:
14: 128252.. 131071: 478464.. 481283: 2820: 428036:
15: 131072.. 132409: 1460.. 2797: 1338: 481284:
16: 132410.. 165177: 322816.. 355583: 32768: 2798:
17: 165178.. 197945: 376064.. 408831: 32768: 355584:
18: 197946.. 230713: 429312.. 462079: 32768: 408832:
19: 230714.. 263481: 482560.. 515327: 32768: 462080:
20: 263482.. 296249: 16400.. 49167: 32768: 515328:
21: 296250.. 327687: 56576.. 88013: 31438: 49168:
22: 327688.. 328711: 428036.. 429059: 1024: 88014:
23: 328712.. 361479: 109824.. 142591: 32768: 429060:
24: 361480.. 371199: 88014.. 97733: 9720: 142592: last,eof
tmp: 25 extents found
[chris@f28s btrfs]$
*shrug*
--
Chris Murphy
* Re: Healthy amount of free space?
2018-07-18 17:04 ` Chris Murphy
@ 2018-07-18 17:06 ` Austin S. Hemmelgarn
2018-07-18 17:14 ` Chris Murphy
0 siblings, 1 reply; 19+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-18 17:06 UTC (permalink / raw)
To: Chris Murphy; +Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS
On 2018-07-18 13:04, Chris Murphy wrote:
> On Wed, Jul 18, 2018 at 7:30 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>
>>
>> I'm not sure. In this particular case, this will fail on BTRFS for any X
>> larger than just short of one third of the total free space. I would expect
>> it to fail for any X larger than just short of half instead.
>
> I'm confused. I can't get it to fail when X is 3/4 of free space.
>
> lvcreate -V 2g -T vg/thintastic -n btrfstest
> mkfs.btrfs -M /dev/mapper/vg-btrfstest
> mount /dev/mapper/vg-btrfstest /mnt/btrfs
> cd /mnt/btrfs
> fallocate -l 1500m tmp
> dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450
>
> Succeeds. No enospc. This is on kernel 4.17.6.
Odd, I could have sworn it would fail reliably. Unless something has
changed since I last tested though, doing it with X equal to the free
space on the filesystem will fail.
>
>
> Copied from terminal:
>
> [chris@f28s btrfs]$ df -h
> Filesystem Size Used Avail Use% Mounted on
> /dev/mapper/vg-btrfstest 2.0G 17M 2.0G 1% /mnt/btrfs
> [chris@f28s btrfs]$ sudo fallocate -l 1500m /mnt/btrfs/tmp
> [chris@f28s btrfs]$ filefrag -v tmp
> Filesystem type is: 9123683e
> File size of tmp is 1572864000 (384000 blocks of 4096 bytes)
> ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 32767: 16400.. 49167: 32768: unwritten
> 1: 32768.. 65535: 56576.. 89343: 32768: 49168: unwritten
> 2: 65536.. 98303: 109824.. 142591: 32768: 89344: unwritten
> 3: 98304.. 131071: 163072.. 195839: 32768: 142592: unwritten
> 4: 131072.. 163839: 216320.. 249087: 32768: 195840: unwritten
> 5: 163840.. 196607: 269568.. 302335: 32768: 249088: unwritten
> 6: 196608.. 229375: 322816.. 355583: 32768: 302336: unwritten
> 7: 229376.. 262143: 376064.. 408831: 32768: 355584: unwritten
> 8: 262144.. 294911: 429312.. 462079: 32768: 408832: unwritten
> 9: 294912.. 327679: 482560.. 515327: 32768: 462080: unwritten
> 10: 327680.. 344063: 89344.. 105727: 16384: 515328: unwritten
> 11: 344064.. 360447: 142592.. 158975: 16384: 105728: unwritten
> 12: 360448.. 376831: 195840.. 212223: 16384: 158976: unwritten
> 13: 376832.. 383999: 249088.. 256255: 7168: 212224:
> last,unwritten,eof
> tmp: 14 extents found
> [chris@f28s btrfs]$ df -h
> Filesystem Size Used Avail Use% Mounted on
> /dev/mapper/vg-btrfstest 2.0G 1.5G 543M 74% /mnt/btrfs
> [chris@f28s btrfs]$ sudo dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450
> 1450+0 records in
> 1450+0 records out
> 1520435200 bytes (1.5 GB, 1.4 GiB) copied, 13.4757 s, 113 MB/s
> [chris@f28s btrfs]$ df -h
> Filesystem Size Used Avail Use% Mounted on
> /dev/mapper/vg-btrfstest 2.0G 1.5G 591M 72% /mnt/btrfs
> [chris@f28s btrfs]$ filefrag -v tmp
> Filesystem type is: 9123683e
> File size of tmp is 1520435200 (371200 blocks of 4096 bytes)
> ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 16383: 302336.. 318719: 16384:
> 1: 16384.. 32767: 355584.. 371967: 16384: 318720:
> 2: 32768.. 49151: 408832.. 425215: 16384: 371968:
> 3: 49152.. 65535: 462080.. 478463: 16384: 425216:
> 4: 65536.. 73727: 515328.. 523519: 8192: 478464:
> 5: 73728.. 86015: 3328.. 15615: 12288: 523520:
> 6: 86016.. 98303: 256256.. 268543: 12288: 15616:
> 7: 98304.. 104959: 49168.. 55823: 6656: 268544:
> 8: 104960.. 109047: 105728.. 109815: 4088: 55824:
> 9: 109048.. 113143: 158976.. 163071: 4096: 109816:
> 10: 113144.. 117239: 212224.. 216319: 4096: 163072:
> 11: 117240.. 121335: 318720.. 322815: 4096: 216320:
> 12: 121336.. 125431: 371968.. 376063: 4096: 322816:
> 13: 125432.. 128251: 425216.. 428035: 2820: 376064:
> 14: 128252.. 131071: 478464.. 481283: 2820: 428036:
> 15: 131072.. 132409: 1460.. 2797: 1338: 481284:
> 16: 132410.. 165177: 322816.. 355583: 32768: 2798:
> 17: 165178.. 197945: 376064.. 408831: 32768: 355584:
> 18: 197946.. 230713: 429312.. 462079: 32768: 408832:
> 19: 230714.. 263481: 482560.. 515327: 32768: 462080:
> 20: 263482.. 296249: 16400.. 49167: 32768: 515328:
> 21: 296250.. 327687: 56576.. 88013: 31438: 49168:
> 22: 327688.. 328711: 428036.. 429059: 1024: 88014:
> 23: 328712.. 361479: 109824.. 142591: 32768: 429060:
> 24: 361480.. 371199: 88014.. 97733: 9720: 142592: last,eof
> tmp: 25 extents found
> [chris@f28s btrfs]$
>
>
> *shrug*
>
>
* Re: Healthy amount of free space?
2018-07-18 17:06 ` Austin S. Hemmelgarn
@ 2018-07-18 17:14 ` Chris Murphy
2018-07-18 17:40 ` Chris Murphy
0 siblings, 1 reply; 19+ messages in thread
From: Chris Murphy @ 2018-07-18 17:14 UTC (permalink / raw)
To: Austin S. Hemmelgarn
Cc: Chris Murphy, Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS
On Wed, Jul 18, 2018 at 11:06 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2018-07-18 13:04, Chris Murphy wrote:
>>
>> On Wed, Jul 18, 2018 at 7:30 AM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>
>>>
>>> I'm not sure. In this particular case, this will fail on BTRFS for any X
>>> larger than just short of one third of the total free space. I would
>>> expect
>>> it to fail for any X larger than just short of half instead.
>>
>>
>> I'm confused. I can't get it to fail when X is 3/4 of free space.
>>
>> lvcreate -V 2g -T vg/thintastic -n btrfstest
>> mkfs.btrfs -M /dev/mapper/vg-btrfstest
>> mount /dev/mapper/vg-btrfstest /mnt/btrfs
>> cd /mnt/btrfs
>> fallocate -l 1500m tmp
>> dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450
>>
>> Succeeds. No enospc. This is on kernel 4.17.6.
>
> Odd, I could have sworn it would fail reliably. Unless something has
> changed since I last tested though, doing it with X equal to the free space
> on the filesystem will fail.
OK, well, X is being defined twice here, so I can't tell if I'm doing
this correctly. There's the fallocate X, which is 75% of free space on
the empty fs at the time of the fallocate.
And then there's the dd, which is 1450M, i.e. ~2.67x the free space at
the time of the dd.
I don't know for sure, but based on the addresses reported before and
after dd for the fallocated tmp file, it looks like Btrfs is not using
the originally fallocated addresses for dd. So maybe it is COWing into
new blocks, but just as quickly deallocating the fallocated blocks as it
goes, and hence doesn't end up in ENOSPC?
--
Chris Murphy
* Re: Healthy amount of free space?
2018-07-18 17:14 ` Chris Murphy
@ 2018-07-18 17:40 ` Chris Murphy
2018-07-18 18:01 ` Austin S. Hemmelgarn
0 siblings, 1 reply; 19+ messages in thread
From: Chris Murphy @ 2018-07-18 17:40 UTC (permalink / raw)
To: Chris Murphy
Cc: Austin S. Hemmelgarn, Martin Steigerwald, Nikolay Borisov, Wolf,
Btrfs BTRFS
On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy <lists@colorremedies.com> wrote:
> I don't know for sure, but based on the addresses reported before and
> after dd for the fallocated tmp file, it looks like Btrfs is not using
> the originally fallocated addresses for dd. So maybe it is COWing into
> new blocks, but is just as quickly deallocating the fallocated blocks
> as it goes, and hence doesn't end up in enospc?
The previous thread is "Problem with file system" from August 2017. And
there are these reproduction steps from Austin, which have the fallocate
coming after the dd.
truncate --size=4G ./test-fs
mkfs.btrfs ./test-fs
mkdir ./test
mount -t auto ./test-fs ./test
dd if=/dev/zero of=./test/test bs=65536 count=32768
fallocate -l 2147483650 ./test/test && echo "Success!"
My test Btrfs is 2G not 4G, so I'm cutting the values of dd and
fallocate in half.
[chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s
[chris@f28s btrfs]$ sync
[chris@f28s btrfs]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest 2.0G 1018M 1.1G 50% /mnt/btrfs
[chris@f28s btrfs]$ sudo fallocate -l 1000m tmp
Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over
it, this fails, but I kind of expect that because there's only 1.1G of
free space. But maybe that's what you're saying is the bug - that it
shouldn't fail?
--
Chris Murphy
* Re: Healthy amount of free space?
2018-07-18 17:40 ` Chris Murphy
@ 2018-07-18 18:01 ` Austin S. Hemmelgarn
2018-07-18 21:32 ` Chris Murphy
0 siblings, 1 reply; 19+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-18 18:01 UTC (permalink / raw)
To: Chris Murphy; +Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS
On 2018-07-18 13:40, Chris Murphy wrote:
> On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy <lists@colorremedies.com> wrote:
>
>> I don't know for sure, but based on the addresses reported before and
>> after dd for the fallocated tmp file, it looks like Btrfs is not using
>> the originally fallocated addresses for dd. So maybe it is COWing into
>> new blocks, but is just as quickly deallocating the fallocated blocks
>> as it goes, and hence doesn't end up in enospc?
>
> Previous thread is "Problem with file system" from August 2017. And
> there's these reproduce steps from Austin which have fallocate coming
> after the dd.
>
> truncate --size=4G ./test-fs
> mkfs.btrfs ./test-fs
> mkdir ./test
> mount -t auto ./test-fs ./test
> dd if=/dev/zero of=./test/test bs=65536 count=32768
> fallocate -l 2147483650 ./test/test && echo "Success!"
>
>
> My test Btrfs is 2G not 4G, so I'm cutting the values of dd and
> fallocate in half.
>
> [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s
> [chris@f28s btrfs]$ sync
> [chris@f28s btrfs]$ df -h
> Filesystem Size Used Avail Use% Mounted on
> /dev/mapper/vg-btrfstest 2.0G 1018M 1.1G 50% /mnt/btrfs
> [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp
>
>
> Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over
> it, this fails, but I kinda expect that because there's only 1.1G free
> space. But maybe that's what you're saying is the bug, it shouldn't
> fail?
Yes, you're right, I had things backwards (well, kind of, this does work
on ext4 and regular XFS, so it arguably should work here).
* Re: Healthy amount of free space?
2018-07-18 18:01 ` Austin S. Hemmelgarn
@ 2018-07-18 21:32 ` Chris Murphy
2018-07-18 21:47 ` Chris Murphy
2018-07-19 11:21 ` Austin S. Hemmelgarn
0 siblings, 2 replies; 19+ messages in thread
From: Chris Murphy @ 2018-07-18 21:32 UTC (permalink / raw)
To: Austin S. Hemmelgarn
Cc: Chris Murphy, Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS
On Wed, Jul 18, 2018 at 12:01 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2018-07-18 13:40, Chris Murphy wrote:
>>
>> On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy <lists@colorremedies.com>
>> wrote:
>>
>>> I don't know for sure, but based on the addresses reported before and
>>> after dd for the fallocated tmp file, it looks like Btrfs is not using
>>> the originally fallocated addresses for dd. So maybe it is COWing into
>>> new blocks, but is just as quickly deallocating the fallocated blocks
>>> as it goes, and hence doesn't end up in enospc?
>>
>>
>> Previous thread is "Problem with file system" from August 2017. And
>> there's these reproduce steps from Austin which have fallocate coming
>> after the dd.
>>
>> truncate --size=4G ./test-fs
>> mkfs.btrfs ./test-fs
>> mkdir ./test
>> mount -t auto ./test-fs ./test
>> dd if=/dev/zero of=./test/test bs=65536 count=32768
>> fallocate -l 2147483650 ./test/test && echo "Success!"
>>
>>
>> My test Btrfs is 2G not 4G, so I'm cutting the values of dd and
>> fallocate in half.
>>
>> [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000
>> 1000+0 records in
>> 1000+0 records out
>> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s
>> [chris@f28s btrfs]$ sync
>> [chris@f28s btrfs]$ df -h
>> Filesystem Size Used Avail Use% Mounted on
>> /dev/mapper/vg-btrfstest 2.0G 1018M 1.1G 50% /mnt/btrfs
>> [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp
>>
>>
>> Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over
>> it, this fails, but I kinda expect that because there's only 1.1G free
>> space. But maybe that's what you're saying is the bug, it shouldn't
>> fail?
>
> Yes, you're right, I had things backwards (well, kind of, this does work on
> ext4 and regular XFS, so it arguably should work here).
I guess I'm confused about what it even means to fallocate over a file
with in-use blocks unless either the -d or -p option is used. And from
the man page, I don't grok the distinction between -d and -p either. But
based on their descriptions, I'd expect both to work without ENOSPC.
--
Chris Murphy
* Re: Healthy amount of free space?
2018-07-18 21:32 ` Chris Murphy
@ 2018-07-18 21:47 ` Chris Murphy
2018-07-19 11:21 ` Austin S. Hemmelgarn
1 sibling, 0 replies; 19+ messages in thread
From: Chris Murphy @ 2018-07-18 21:47 UTC (permalink / raw)
To: Chris Murphy
Cc: Austin S. Hemmelgarn, Martin Steigerwald, Nikolay Borisov, Wolf,
Btrfs BTRFS
Related on XFS list.
https://www.spinics.net/lists/linux-xfs/msg20722.html
* Re: Healthy amount of free space?
2018-07-18 21:32 ` Chris Murphy
2018-07-18 21:47 ` Chris Murphy
@ 2018-07-19 11:21 ` Austin S. Hemmelgarn
1 sibling, 0 replies; 19+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-19 11:21 UTC (permalink / raw)
To: Chris Murphy; +Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS
On 2018-07-18 17:32, Chris Murphy wrote:
> On Wed, Jul 18, 2018 at 12:01 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2018-07-18 13:40, Chris Murphy wrote:
>>>
>>> On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy <lists@colorremedies.com>
>>> wrote:
>>>
>>>> I don't know for sure, but based on the addresses reported before and
>>>> after dd for the fallocated tmp file, it looks like Btrfs is not using
>>>> the originally fallocated addresses for dd. So maybe it is COWing into
>>>> new blocks, but is just as quickly deallocating the fallocated blocks
>>>> as it goes, and hence doesn't end up in enospc?
>>>
>>>
>>> Previous thread is "Problem with file system" from August 2017. And
>>> there's these reproduce steps from Austin which have fallocate coming
>>> after the dd.
>>>
>>> truncate --size=4G ./test-fs
>>> mkfs.btrfs ./test-fs
>>> mkdir ./test
>>> mount -t auto ./test-fs ./test
>>> dd if=/dev/zero of=./test/test bs=65536 count=32768
>>> fallocate -l 2147483650 ./test/test && echo "Success!"
>>>
>>>
>>> My test Btrfs is 2G not 4G, so I'm cutting the values of dd and
>>> fallocate in half.
>>>
>>> [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000
>>> 1000+0 records in
>>> 1000+0 records out
>>> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s
>>> [chris@f28s btrfs]$ sync
>>> [chris@f28s btrfs]$ df -h
>>> Filesystem Size Used Avail Use% Mounted on
>>> /dev/mapper/vg-btrfstest 2.0G 1018M 1.1G 50% /mnt/btrfs
>>> [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp
>>>
>>>
>>> Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over
>>> it, this fails, but I kinda expect that because there's only 1.1G free
>>> space. But maybe that's what you're saying is the bug, it shouldn't
>>> fail?
>>
>> Yes, you're right, I had things backwards (well, kind of, this does work on
>> ext4 and regular XFS, so it arguably should work here).
>
> I guess I'm confused what it even means to fallocate over a file with
> in-use blocks unless either -d or -p options are used. And from the
> man page, I don't grok the distinction between -d and -p either. But
> based on their descriptions I'd expect they both should work without
> enospc.
>
Without any specific options, it forces allocation of any sparse regions
in the file (that is, it gets rid of holes in the file). On BTRFS, I
believe the command also forcibly unshares all the extents in the file
(for the system call, there's a special flag for doing this).
Additionally, you can extend a file with fallocate this way by
specifying a length longer than the current size of the file, which
guarantees that writes into that region will succeed, unlike truncating
the file to a larger size, which just creates a hole at the end of the
file to bring it up to size.
As far as `-d` versus `-p`: `-p` directly translates to the option for
the system call that punches a hole. It requires a length and possibly
an offset, and will punch a hole at that exact location of that exact
size. `-d` is a special option that's only available for the command.
It tells the `fallocate` command to search the file for zero-filled
regions, and punch holes there. Neither option should ever trigger an
ENOSPC, except possibly if it has to split an extent for some reason and
you are completely out of metadata space.
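For reference, roughly how those modes are invoked from the command line
(the file name, offsets and lengths are arbitrary examples):

fallocate -l 2G ./tmp                  # extend/preallocate up to 2 GiB (no hole at the end)
fallocate -p -o 4096 -l 8192 ./tmp     # punch a hole of an exact size at an exact offset
fallocate -d ./tmp                     # scan for zero-filled regions and punch holes there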
* Re: Healthy amount of free space?
2018-07-18 13:30 ` Austin S. Hemmelgarn
2018-07-18 17:04 ` Chris Murphy
@ 2018-07-20 5:01 ` Andrei Borzenkov
2018-07-20 11:36 ` Austin S. Hemmelgarn
1 sibling, 1 reply; 19+ messages in thread
From: Andrei Borzenkov @ 2018-07-20 5:01 UTC (permalink / raw)
To: Austin S. Hemmelgarn, Chris Murphy
Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS
18.07.2018 16:30, Austin S. Hemmelgarn wrote:
> On 2018-07-18 09:07, Chris Murphy wrote:
>> On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>
>>> If you're doing a training presentation, it may be worth mentioning that
>>> preallocation with fallocate() does not behave the same on BTRFS as
>>> it does
>>> on other filesystems. For example, the following sequence of commands:
>>>
>>> fallocate -l X ./tmp
>>> dd if=/dev/zero of=./tmp bs=1 count=X
>>>
>>> Will always work on ext4, XFS, and most other filesystems, for any
>>> value of
>>> X between zero and just below the total amount of free space on the
>>> filesystem. On BTRFS though, it will reliably fail with ENOSPC for
>>> values
>>> of X that are greater than _half_ of the total amount of free space
>>> on the
>>> filesystem (actually, greater than just short of half). In essence,
>>> preallocating space does not prevent COW semantics for the first write
>>> unless the file is marked NOCOW.
>>
>> Is this a bug, or is it suboptimal behavior, or is it intentional?
> It's been discussed before, though I can't find the email thread right
> now. Pretty much, this is _technically_ not incorrect behavior, as the
> documentation for fallocate doesn't say that subsequent writes can't
> fail due to lack of space. I personally consider it a bug though
> because it breaks from existing behavior in a way that is avoidable and
> defies user expectations.
>
> There are two issues here:
>
> 1. Regions preallocated with fallocate still do COW on the first write
> to any given block in that region. This can be handled by either
> treating the first write to each block as NOCOW, or by allocating a bit
How is that possible? As long as fallocate actually allocates space,
that space should be checksummed, which means it is no longer possible
to overwrite it in place. Maybe fallocate on btrfs could simply reserve
space. I am not sure whether that complies with the fallocate
specification, but as long as the intention is to ensure that a write
will not fail for lack of space, it should be adequate (to the extent
that can be ensured on btrfs, of course). Also, a hole in a file returns
zeros by definition, which also matches fallocate behavior.
> of extra space and doing a rotating approach like this for writes:
> - Write goes into the extra space.
> - Once the write is done, convert the region covered by the write
> into a new block of extra space.
> - When the final block of the preallocated region is written,
> deallocate the extra space.
> 2. Preallocation does not completely account for necessary metadata
> space that will be needed to store the data there. This may not be
> necessary if the first issue is addressed properly.
>>
>> And then I wonder what happens with XFS COW:
>>
>> fallocate -l X ./tmp
>> cp --reflink ./tmp ./tmp2
>> dd if=/dev/zero of=./tmp bs=1 count=X
> I'm not sure. In this particular case, this will fail on BTRFS for any
> X larger than just short of one third of the total free space. I would
> expect it to fail for any X larger than just short of half instead.
>
> ZFS gets around this by not supporting fallocate (well, kind of, if
> you're using glibc and call posix_fallocate, that _will_ work, but it
> will take forever because it works by writing out each block of space
> that's being allocated, which, ironically, means that that still suffers
> from the same issue potentially that we have).
What happens on btrfs then? fallocate specifies that new space should be
initialized to zero, so something should still write those zeros?
* Re: Healthy amount of free space?
2018-07-20 5:01 ` Andrei Borzenkov
@ 2018-07-20 11:36 ` Austin S. Hemmelgarn
0 siblings, 0 replies; 19+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-20 11:36 UTC (permalink / raw)
To: Andrei Borzenkov, Chris Murphy
Cc: Martin Steigerwald, Nikolay Borisov, Wolf, Btrfs BTRFS
On 2018-07-20 01:01, Andrei Borzenkov wrote:
> 18.07.2018 16:30, Austin S. Hemmelgarn wrote:
>> On 2018-07-18 09:07, Chris Murphy wrote:
>>> On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn
>>> <ahferroin7@gmail.com> wrote:
>>>
>>>> If you're doing a training presentation, it may be worth mentioning that
>>>> preallocation with fallocate() does not behave the same on BTRFS as
>>>> it does
>>>> on other filesystems. For example, the following sequence of commands:
>>>>
>>>> fallocate -l X ./tmp
>>>> dd if=/dev/zero of=./tmp bs=1 count=X
>>>>
>>>> Will always work on ext4, XFS, and most other filesystems, for any
>>>> value of
>>>> X between zero and just below the total amount of free space on the
>>>> filesystem. On BTRFS though, it will reliably fail with ENOSPC for
>>>> values
>>>> of X that are greater than _half_ of the total amount of free space
>>>> on the
>>>> filesystem (actually, greater than just short of half). In essence,
>>>> preallocating space does not prevent COW semantics for the first write
>>>> unless the file is marked NOCOW.
>>>
>>> Is this a bug, or is it suboptimal behavior, or is it intentional?
>> It's been discussed before, though I can't find the email thread right
>> now. Pretty much, this is _technically_ not incorrect behavior, as the
>> documentation for fallocate doesn't say that subsequent writes can't
>> fail due to lack of space. I personally consider it a bug though
>> because it breaks from existing behavior in a way that is avoidable and
>> defies user expectations.
>>
>> There are two issues here:
>>
>> 1. Regions preallocated with fallocate still do COW on the first write
>> to any given block in that region. This can be handled by either
>> treating the first write to each block as NOCOW, or by allocating a bit
>
> How is it possible? As long as fallocate actually allocates space, this
> should be checksummed which means it is no more possible to overwrite
> it. May be fallocate on btrfs could simply reserve space. Not sure
> whether it complies with fallocate specification, but as long as
> intention is to ensure write will not fail for the lack of space it
> should be adequate (to the extent it can be ensured on btrfs of course).
> Also hole in file returns zeros by definition which also matches
> fallocate behavior.
Except it doesn't _have_ to be checksummed if there's no data there, and
that will always be the case for a new allocation. When I say it could
be NOCOW, I'm talking specifically about the first write to each newly
allocated block (that is, one either beyond the previous end of the
file, or one in a region that used to be a hole). This obviously won't
work for places where there are already data.
>
>> of extra space and doing a rotating approach like this for writes:
>> - Write goes into the extra space.
>> - Once the write is done, convert the region covered by the write
>> into a new block of extra space.
>> - When the final block of the preallocated region is written,
>> deallocate the extra space.
>> 2. Preallocation does not completely account for necessary metadata
>> space that will be needed to store the data there. This may not be
>> necessary if the first issue is addressed properly.
>>>
>>> And then I wonder what happens with XFS COW:
>>>
>>> fallocate -l X ./tmp
>>> cp --reflink ./tmp ./tmp2
>>> dd if=/dev/zero of=./tmp bs=1 count=X
>> I'm not sure. In this particular case, this will fail on BTRFS for any
>> X larger than just short of one third of the total free space. I would
>> expect it to fail for any X larger than just short of half instead.
>>
>> ZFS gets around this by not supporting fallocate (well, kind of, if
>> you're using glibc and call posix_fallocate, that _will_ work, but it
>> will take forever because it works by writing out each block of space
>> that's being allocated, which, ironically, means that that still suffers
>> from the same issue potentially that we have).
>
> What happens on btrfs then? fallocate specifies that new space should be
> initialized to zero, so something should still write those zeros?
>
For new regions (places that were holes previously, or were beyond the
end of the file), we create an unwritten extent, which is a region
that's 'allocated', but everything reads back as zero. The problem is
that we don't write into the blocks allocated for the unwritten extent
at all, and only deallocate them once a write to another block finishes.
In essence, we're (either explicitly or implicitly) applying COW
semantics to a region that should not be COW until after the first write
to each block.
For the case of calling fallocate on existing data, we don't really do
anything (unless the flag telling fallocate to unshare the region is
passed). This is actually consistent with pretty much every other
filesystem in existence, but that's because pretty much every other
filesystem in existence implicitly provides the same guarantee that
fallocate does for regions that already have data. This case can in
theory be handled by the same looping algorithm I described above
without needing the base amount of space allocated, but I wouldn't
consider it important enough currently to worry about (because calling
fallocate on regions with existing data is not a common practice).
Thread overview: 19+ messages
2018-07-16 20:58 Healthy amount of free space? Wolf
2018-07-17 7:20 ` Nikolay Borisov
2018-07-17 8:02 ` Martin Steigerwald
2018-07-17 8:16 ` Nikolay Borisov
2018-07-17 17:54 ` Martin Steigerwald
2018-07-18 12:35 ` Austin S. Hemmelgarn
2018-07-18 13:07 ` Chris Murphy
2018-07-18 13:30 ` Austin S. Hemmelgarn
2018-07-18 17:04 ` Chris Murphy
2018-07-18 17:06 ` Austin S. Hemmelgarn
2018-07-18 17:14 ` Chris Murphy
2018-07-18 17:40 ` Chris Murphy
2018-07-18 18:01 ` Austin S. Hemmelgarn
2018-07-18 21:32 ` Chris Murphy
2018-07-18 21:47 ` Chris Murphy
2018-07-19 11:21 ` Austin S. Hemmelgarn
2018-07-20 5:01 ` Andrei Borzenkov
2018-07-20 11:36 ` Austin S. Hemmelgarn
2018-07-17 11:46 ` Austin S. Hemmelgarn