* exclusive subvolume space missing
@ 2017-12-01 16:15 Tomasz Pala
  2017-12-01 21:27 ` Duncan
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Tomasz Pala @ 2017-12-01 16:15 UTC (permalink / raw)
  To: linux-btrfs

Hello,

I've got a problem with btrfs running out of space (not THE
Internet-wide, well-known issue of misinterpreting the numbers).

The problem is: something eats the space while nothing is running that
would justify it. There were 18 GB of free space available; suddenly it
dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
with rebalance -dusage=5 -musage=5 (or something like that), but space is
being eaten right now, just as I'm writing this e-mail:

/dev/sda2        64G   63G  452M 100% /
/dev/sda2        64G   63G  365M 100% /
/dev/sda2        64G   63G  316M 100% /
/dev/sda2        64G   63G  287M 100% /
/dev/sda2        64G   63G  268M 100% /
/dev/sda2        64G   63G  239M 100% /
/dev/sda2        64G   63G  230M 100% /
/dev/sda2        64G   63G  182M 100% /
/dev/sda2        64G   63G  163M 100% /
/dev/sda2        64G   64G  153M 100% /
/dev/sda2        64G   64G  143M 100% /
/dev/sda2        64G   64G   96M 100% /
/dev/sda2        64G   64G   88M 100% /
/dev/sda2        64G   64G   57M 100% /
/dev/sda2        64G   64G   25M 100% /

while my rough calculations show that there should be at least 10 GB of
free space. After enabling quotas this is somewhat confirmed:

# btrfs qgroup sh --sort=excl / 
qgroupid         rfer         excl 
--------         ----         ---- 
0/5          16.00KiB     16.00KiB 
[30 snapshots with about 100 MiB excl]
0/333        24.53GiB    305.79MiB 
0/298        13.44GiB    312.74MiB 
0/327        23.79GiB    427.13MiB 
0/331        23.93GiB    930.51MiB 
0/260        12.25GiB      3.22GiB 
0/312        19.70GiB      4.56GiB 
0/388        28.75GiB      7.15GiB 
0/291        30.60GiB      9.01GiB <- this is the running one

This is about 30 GB of total excl (I didn't find a switch to sum this up). I
know I can't just add up 'excl' to get usage, so I tried to pinpoint the
exact files that occupy space in 0/388 exclusively (this is the last
snapshot taken; all of the snapshots are created from the running fs).


Now, the weird part for me is exclusive data count:

# btrfs sub sh ./snapshot-171125
[...]
        Subvolume ID:           388
# btrfs fi du -s ./snapshot-171125 
     Total   Exclusive  Set shared  Filename
  21.50GiB    63.35MiB    20.77GiB  snapshot-171125


How is that possible? This doesn't even remotely relate to the 7.15 GiB
from qgroup. The same amount differs in the totals: 28.75-21.50=7.25 GiB.
And the same happens with other snapshots: much more exclusive data is
shown in qgroup than is actually found in the files. So if not files, where
is that space wasted? Metadata?

btrfs-progs-4.12 running on Linux 4.9.46.

best regards,
-- 
Tomasz Pala <gotar@pld-linux.org>


* Re: exclusive subvolume space missing
  2017-12-01 16:15 exclusive subvolume space missing Tomasz Pala
@ 2017-12-01 21:27 ` Duncan
  2017-12-01 21:36 ` Hugo Mills
  2017-12-02  0:27 ` Qu Wenruo
  2 siblings, 0 replies; 32+ messages in thread
From: Duncan @ 2017-12-01 21:27 UTC (permalink / raw)
  To: linux-btrfs

Tomasz Pala posted on Fri, 01 Dec 2017 17:15:55 +0100 as excerpted:

> Hello,
> 
> I got a problem with btrfs running out of space (not THE
> Internet-wide, well known issues with interpretation).
> 
> The problem is: something eats the space while not running anything that
> justifies this. There were 18 GB free space available, suddenly it
> dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
> with rebalance -dusage=5 -musage=5 (or sth about), but it is being eaten
> right now, just as I'm writing this e-mail:
> 
> /dev/sda2        64G   63G  452M 100% /
> /dev/sda2        64G   63G  365M 100% /
> /dev/sda2        64G   63G  316M 100% /
> /dev/sda2        64G   63G  287M 100% /
> /dev/sda2        64G   63G  268M 100% /
> /dev/sda2        64G   63G  239M 100% /
> /dev/sda2        64G   63G  230M 100% /
> /dev/sda2        64G   63G  182M 100% /
> /dev/sda2        64G   63G  163M 100% /
> /dev/sda2        64G   64G  153M 100% /
> /dev/sda2        64G   64G  143M 100% /
> /dev/sda2        64G   64G   96M 100% /
> /dev/sda2        64G   64G   88M 100% /
> /dev/sda2        64G   64G   57M 100% /
> /dev/sda2        64G   64G   25M 100% /

Scary.

> while my rough calculations show, that there should be at least 10 GB of
> free space. After enabling quotas it is somehow confirmed:

I don't use quotas so won't claim working knowledge or an explanation of
that side of things, however...
> 
> btrfs-progs-4.12 running on Linux 4.9.46.

Until quite recently btrfs quotas were too buggy to recommend for use.
While the known blocker-level bugs are now fixed, scaling and real-world
performance are still an issue, and AFAIK, the fixes didn't make 4.9 and
may not be backported as the feature was simply known to be broken beyond
reliable usability at that point.

Based on comments in other threads here, I /think/ the critical quota
fixes hit 4.10, but of course, not being an LTS, 4.10 is long out of support.
I'd suggest either turning off and forgetting about quotas, since it doesn't
appear you actually need them, or upgrading to at least 4.13 and keeping
current, or to the LTS 4.14 if you want to stay on the same kernel series
for a while.

As for the scaling and performance issues, during normal/generic filesystem
use things are generally fine; it's various btrfs maintenance commands such
as balance, snapshot deletion, and btrfs check that have the scaling
issues, and they have /some/ scaling issues even without quotas, it's just
that quotas make the problem *much* worse.  One workaround for balance
and snapshot deletion is to temporarily disable quotas while the job is
running, then reenable them (and rescan if necessary; as I don't use the
feature here, I'm not sure whether it is).  That can literally turn a job
that was looking to take /weeks/ due to the scaling issue into a job of
hours.  Unfortunately, the sorts of conditions that would trigger running
a btrfs check don't lend themselves to the same sort of workaround, so not
having quotas on at all is the only workaround there.
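
A rough sketch of that workaround (the balance filters here are just an
example, and whether a rescan is needed after re-enabling may depend on
the kernel version):

# btrfs quota disable /
# btrfs balance start -dusage=50 -musage=50 /   <- or delete snapshots here
# btrfs quota enable /
# btrfs quota rescan -w /    <- wait for the rescan before trusting the numbers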


As to your space being eaten problem, the output of btrfs filesystem usage
(and perhaps btrfs device usage if it's a multi-device btrfs) could be
really helpful here, much more so than quota reports if it's a btrfs
issue, or to help eliminate btrfs as the problem if it's not.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: exclusive subvolume space missing
  2017-12-01 16:15 exclusive subvolume space missing Tomasz Pala
  2017-12-01 21:27 ` Duncan
@ 2017-12-01 21:36 ` Hugo Mills
  2017-12-02  0:53   ` Tomasz Pala
  2017-12-02  0:27 ` Qu Wenruo
  2 siblings, 1 reply; 32+ messages in thread
From: Hugo Mills @ 2017-12-01 21:36 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: linux-btrfs


On Fri, Dec 01, 2017 at 05:15:55PM +0100, Tomasz Pala wrote:
> Hello,
> 
> I got a problem with btrfs running out of space (not THE
> Internet-wide, well known issues with interpretation).
> 
> The problem is: something eats the space while not running anything that
> justifies this. There were 18 GB free space available, suddenly it
> dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
> with rebalance -dusage=5 -musage=5 (or sth about), but it is being eaten
> right now, just as I'm writing this e-mail:
> 
> /dev/sda2        64G   63G  452M 100% /
> /dev/sda2        64G   63G  365M 100% /
> /dev/sda2        64G   63G  316M 100% /
> /dev/sda2        64G   63G  287M 100% /
> /dev/sda2        64G   63G  268M 100% /
> /dev/sda2        64G   63G  239M 100% /
> /dev/sda2        64G   63G  230M 100% /
> /dev/sda2        64G   63G  182M 100% /
> /dev/sda2        64G   63G  163M 100% /
> /dev/sda2        64G   64G  153M 100% /
> /dev/sda2        64G   64G  143M 100% /
> /dev/sda2        64G   64G   96M 100% /
> /dev/sda2        64G   64G   88M 100% /
> /dev/sda2        64G   64G   57M 100% /
> /dev/sda2        64G   64G   25M 100% /
> 
> while my rough calculations show, that there should be at least 10 GB of
> free space. After enabling quotas it is somehow confirmed:
> 
> # btrfs qgroup sh --sort=excl / 
> qgroupid         rfer         excl 
> --------         ----         ---- 
> 0/5          16.00KiB     16.00KiB 
> [30 snapshots with about 100 MiB excl]
> 0/333        24.53GiB    305.79MiB 
> 0/298        13.44GiB    312.74MiB 
> 0/327        23.79GiB    427.13MiB 
> 0/331        23.93GiB    930.51MiB 
> 0/260        12.25GiB      3.22GiB 
> 0/312        19.70GiB      4.56GiB 
> 0/388        28.75GiB      7.15GiB 
> 0/291        30.60GiB      9.01GiB <- this is the running one
> 
> This is about 30 GB total excl (didn't find a switch to sum this up). I
> know I can't just add 'excl' to get usage, so tried to pinpoint the
> exact files that occupy space in 0/388 exclusively (this is the last
> snapshots taken, all of the snapshots are created from the running fs).

   The thing I'd first go looking for here is some rogue process
writing lots of data. I've had something like this happen to me
before, a few times. First, I'd look for large files with "du -ms /* |
sort -n", then work down into the tree until you find them.

   If that doesn't show up anything unusually large, then use lsof to look
for open but deleted files (orphans) which are still being written to
by some process.

   This is very likely _not_ to be a btrfs problem, but instead some
runaway process writing lots of crap very fast. Log files are probably
the most plausible location, but not the only one.
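
   For example (a sketch of the sort of checks meant above; adjust the
paths as needed):

# du -xms /* 2>/dev/null | sort -n | tail   <- largest top-level trees, without crossing filesystems
# lsof +L1                                  <- open files with link count 0 (deleted but still held open)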

> Now, the weird part for me is exclusive data count:
> 
> # btrfs sub sh ./snapshot-171125
> [...]
>         Subvolume ID:           388
> # btrfs fi du -s ./snapshot-171125 
>      Total   Exclusive  Set shared  Filename
>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
> 
> 
> How is that possible? This doesn't even remotely relate to 7.15 GiB
> from qgroup. The same amount differs in total: 28.75-21.50=7.25 GiB.
> And the same happens with other snapshots, much more exclusive data
> shown in qgroup than actually found in files. So if not files, where
> is that space wasted? Metadata?

   Personally, I'd trust qgroups' output about as far as I could spit
Belgium(*).

   Hugo.

(*) No offence intended to Belgium.

-- 
Hugo Mills             | I used to live in hope, but I got evicted.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |



* Re: exclusive subvolume space missing
  2017-12-01 16:15 exclusive subvolume space missing Tomasz Pala
  2017-12-01 21:27 ` Duncan
  2017-12-01 21:36 ` Hugo Mills
@ 2017-12-02  0:27 ` Qu Wenruo
  2017-12-02  1:23   ` Tomasz Pala
  2017-12-05 18:47   ` How exclusive in parent qgroup is computed? (was: Re: exclusive subvolume space missing) Andrei Borzenkov
  2 siblings, 2 replies; 32+ messages in thread
From: Qu Wenruo @ 2017-12-02  0:27 UTC (permalink / raw)
  To: Tomasz Pala, linux-btrfs





On 2017-12-02 00:15, Tomasz Pala wrote:
> Hello,
> 
> I got a problem with btrfs running out of space (not THE
> Internet-wide, well known issues with interpretation).
> 
> The problem is: something eats the space while not running anything that
> justifies this. There were 18 GB free space available, suddenly it
> dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
> with rebalance -dusage=5 -musage=5 (or sth about), but it is being eaten
> right now, just as I'm writing this e-mail:
> 
> /dev/sda2        64G   63G  452M 100% /
> /dev/sda2        64G   63G  365M 100% /
> /dev/sda2        64G   63G  316M 100% /
> /dev/sda2        64G   63G  287M 100% /
> /dev/sda2        64G   63G  268M 100% /
> /dev/sda2        64G   63G  239M 100% /
> /dev/sda2        64G   63G  230M 100% /
> /dev/sda2        64G   63G  182M 100% /
> /dev/sda2        64G   63G  163M 100% /
> /dev/sda2        64G   64G  153M 100% /
> /dev/sda2        64G   64G  143M 100% /
> /dev/sda2        64G   64G   96M 100% /
> /dev/sda2        64G   64G   88M 100% /
> /dev/sda2        64G   64G   57M 100% /
> /dev/sda2        64G   64G   25M 100% /
> 
> while my rough calculations show, that there should be at least 10 GB of
> free space. After enabling quotas it is somehow confirmed:
> 
> # btrfs qgroup sh --sort=excl / 
> qgroupid         rfer         excl 
> --------         ----         ---- 
> 0/5          16.00KiB     16.00KiB 
> [30 snapshots with about 100 MiB excl]
> 0/333        24.53GiB    305.79MiB 
> 0/298        13.44GiB    312.74MiB 
> 0/327        23.79GiB    427.13MiB 
> 0/331        23.93GiB    930.51MiB 
> 0/260        12.25GiB      3.22GiB 
> 0/312        19.70GiB      4.56GiB 
> 0/388        28.75GiB      7.15GiB 
> 0/291        30.60GiB      9.01GiB <- this is the running one
> 
> This is about 30 GB total excl (didn't find a switch to sum this up). I
> know I can't just add 'excl' to get usage, so tried to pinpoint the
> exact files that occupy space in 0/388 exclusively (this is the last
> snapshots taken, all of the snapshots are created from the running fs).

I assume there is a program eating up the space,
not btrfs itself.

> 
> 
> Now, the weird part for me is exclusive data count:
> 
> # btrfs sub sh ./snapshot-171125
> [...]
>         Subvolume ID:           388
> # btrfs fi du -s ./snapshot-171125 
>      Total   Exclusive  Set shared  Filename
>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125

That's the difference between how sub show and quota work.

For quota, it's a per-root owner check.
That means that even if a file extent is shared between different inodes,
as long as all of those inodes are inside the same subvolume, it's counted
as exclusive.
And if any reference to the file extent belongs to another subvolume, then
it's counted as shared.

For fi du, it's a per-inode owner check. (The exact behavior is a little
more complex; I'll skip such corner cases to make it a little easier to
understand.)

That's to say, if one file extent is shared by different inodes, then
it's counted as shared, no matter whether those inodes belong to different
subvolumes or to the same one.

So "fi du" has a looser condition for the "shared" calculation,
and that should explain why you have 20+G shared.
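
(A minimal sketch illustrating the difference; the device path is made up,
and quota is enabled only to get the qgroup view:)

# mkfs.btrfs -f /dev/sdX && mount /dev/sdX /mnt
# btrfs subvolume create /mnt/subv1
# btrfs quota enable /mnt
# dd if=/dev/zero of=/mnt/subv1/a bs=1M count=64
# cp --reflink=always /mnt/subv1/a /mnt/subv1/b   <- one extent, two inodes, same subvolume
# sync
# btrfs fi du -s /mnt/subv1       <- per-inode view: the 64M shows up as "Set shared"
# btrfs qgroup show --sync /mnt   <- per-root view: the same 64M is "excl" of subv1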

Thanks,
Qu


> 
> 
> How is that possible? This doesn't even remotely relate to 7.15 GiB
> from qgroup. The same amount differs in total: 28.75-21.50=7.25 GiB.
> And the same happens with other snapshots, much more exclusive data
> shown in qgroup than actually found in files. So if not files, where
> is that space wasted? Metadata?
> 
> btrfs-progs-4.12 running on Linux 4.9.46.
> 
> best regards,
> 




* Re: exclusive subvolume space missing
  2017-12-01 21:36 ` Hugo Mills
@ 2017-12-02  0:53   ` Tomasz Pala
  2017-12-02  1:05     ` Qu Wenruo
                       ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Tomasz Pala @ 2017-12-02  0:53 UTC (permalink / raw)
  To: Hugo Mills; +Cc: linux-btrfs

On Fri, Dec 01, 2017 at 21:36:14 +0000, Hugo Mills wrote:

>    The thing I'd first go looking for here is some rogue process
> writing lots of data. I've had something like this happen to me
> before, a few times. First, I'd look for large files with "du -ms /* |
> sort -n", then work down into the tree until you find them.

I already did a handful of searches (mounting the parent node in a separate
directory and diving into the default working subvolume in order to unhide
things possibly covered by any other mounts on top of the actual root fs).
This is what it looks like:

[~/test/@]#  du -sh . 
15G     .
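
(For reference, a sketch of the kind of mount used for that check; the
device is the one from this system and ~/test is just a scratch directory:)

# mount -o subvolid=5 /dev/sda2 ~/test   <- top-level subvolume, bypassing mounts stacked on /
# du -sh ~/test/@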

>    If that doesn't show up anything unusually large, then lsof to look
> for open but deleted files (orphans) which are still being written to
> by some process.

No (deleted) files; the only activity in iotop is internal...

  174 be/4 root       15.64 K/s    3.67 M/s  0.00 %  5.88 % [btrfs-transacti]
 1439 be/4 root        0.00 B/s 1173.22 K/s  0.00 %  0.00 % [kworker/u8:8]

Only systemd-journald is writing, but /var/log is mounted on a
separate ext3 partition (with journald restarted after the mount); this
is also confirmed by looking into the separate mount. Anyway, it can't be
open-but-deleted files, as the usage doesn't change after booting into
emergency mode. The worst thing is that the 8 GB was lost during the night,
when nothing except the stats collector was running.

As already said, this is not the classical "Linux eats my HDD" problem.

>    This is very likely _not_ to be a btrfs problem, but instead some
> runaway process writing lots of crap very fast. Log files are probably
> the most plausible location, but not the only one.

That would be visible in iostat or /proc/diskstats - it isn't. The free
space disappears without being physically written, which means it is
some allocation problem.
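
For example, actual write traffic would be visible with something like:

# iostat -dm 5 sda sdb                   <- per-device MB written per interval
# grep -E 'sda2|sdb2' /proc/diskstats    <- watch the sectors-written column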


I also created a list of files modified between the snapshots with:

find test/@ -xdev -newer some_reference_file_inside_snapshot

and there is nothing bigger than a few MBs.
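
For the record, sorted by size it would be something like (GNU find assumed):

# find test/@ -xdev -type f -newer some_reference_file_inside_snapshot -printf '%s\t%p\n' | sort -n | tail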


I've changed the snapshots to rw and removed some data from all the
instances: 4.8 GB in two ISO images and 5 GB-limited .ccache directory.
After this I got 11 GB freed, so the numbers are fine.

#  btrfs fi usage /
Overall:
    Device size:                 128.00GiB
    Device allocated:            117.19GiB
    Device unallocated:           10.81GiB
    Device missing:                  0.00B
    Used:                        103.56GiB
    Free (estimated):             11.19GiB      (min: 11.14GiB)
    Data ratio:                       1.98
    Metadata ratio:                   2.00
    Global reserve:              146.08MiB      (used: 0.00B)

Data,single: Size:1.19GiB, Used:1.18GiB
   /dev/sda2       1.07GiB
   /dev/sdb2     132.00MiB

Data,RAID1: Size:55.97GiB, Used:50.30GiB
   /dev/sda2      55.97GiB
   /dev/sdb2      55.97GiB

Metadata,RAID1: Size:2.00GiB, Used:908.61MiB
   /dev/sda2       2.00GiB
   /dev/sdb2       2.00GiB

System,RAID1: Size:32.00MiB, Used:16.00KiB
   /dev/sda2      32.00MiB
   /dev/sdb2      32.00MiB

Unallocated:
   /dev/sda2       4.93GiB
   /dev/sdb2       5.87GiB

>> Now, the weird part for me is exclusive data count:
>> 
>> # btrfs sub sh ./snapshot-171125
>> [...]
>>         Subvolume ID:           388
>> # btrfs fi du -s ./snapshot-171125 
>>      Total   Exclusive  Set shared  Filename
>>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
>> 
>> How is that possible? This doesn't even remotely relate to 7.15 GiB
>> from qgroup. The same amount differs in total: 28.75-21.50=7.25 GiB.
>> And the same happens with other snapshots, much more exclusive data
>> shown in qgroup than actually found in files. So if not files, where
>> is that space wasted? Metadata?
> 
>    Personally, I'd trust qgroups' output about as far as I could spit
> Belgium(*).

Well, there is something wrong here, as after removing the .ccache
directories inside all the snapshots the 'excl' values decreased
...except for the last snapshot (the list below is short by ~40 snapshots
that have 2 GB excl in total):

qgroupid         rfer         excl 
--------         ----         ---- 
0/260        12.25GiB      3.22GiB	from 170712 - first snapshot
0/312        17.54GiB      4.56GiB	from 170811
0/366        25.59GiB      2.44GiB	from 171028
0/370        23.27GiB     59.46MiB 	from 171118 - prev snapshot
0/388        21.69GiB      7.16GiB	from 171125 - last snapshot
0/291        24.29GiB      9.77GiB	default subvolume


[~/test/snapshot-171125]#  du -sh .
15G     .


After changing back to ro I tested how much data really has changed
between the previous and last snapshot:

[~/test]#  btrfs send -p snapshot-171118 snapshot-171125 | pv > /dev/null
At subvol snapshot-171125
74.2MiB 0:00:32 [2.28MiB/s]

This means there can't be 7 GiB of exclusive data in the last snapshot.

Well, even btrfs send -p snapshot-170712 snapshot-171125 | pv > /dev/null
5.68GiB 0:03:23 [28.6MiB/s]

I've created a new snapshot right now to compare it with 171125:
75.5MiB 0:00:43 [1.73MiB/s]


OK, I could even compare all the snapshots in sequence:

# for i in snapshot-17*; btrfs prop set $i ro true
# p=''; for i in snapshot-17*; do [ -n "$p" ] && btrfs send -p "$p" "$i" | pv > /dev/null; p="$i"; done
 1.7GiB 0:00:15 [ 114MiB/s]
1.03GiB 0:00:38 [27.2MiB/s]
 155MiB 0:00:08 [19.1MiB/s]
1.08GiB 0:00:47 [23.3MiB/s]
 294MiB 0:00:29 [ 9.9MiB/s]
 324MiB 0:00:42 [7.69MiB/s]
82.8MiB 0:00:06 [12.7MiB/s]
64.3MiB 0:00:05 [11.6MiB/s]
 137MiB 0:00:07 [19.3MiB/s]
85.3MiB 0:00:13 [6.18MiB/s]
62.8MiB 0:00:19 [3.21MiB/s]
 132MiB 0:00:42 [3.15MiB/s]
 102MiB 0:00:42 [2.42MiB/s]
 197MiB 0:00:50 [3.91MiB/s]
 321MiB 0:01:01 [5.21MiB/s]
 229MiB 0:00:18 [12.3MiB/s]
 109MiB 0:00:11 [ 9.7MiB/s]
 139MiB 0:00:14 [9.32MiB/s]
 573MiB 0:00:35 [15.9MiB/s]
64.1MiB 0:00:30 [2.11MiB/s]
 172MiB 0:00:11 [14.9MiB/s]
98.9MiB 0:00:07 [14.1MiB/s]
  54MiB 0:00:08 [6.17MiB/s]
78.6MiB 0:00:02 [32.1MiB/s]
15.1MiB 0:00:01 [12.5MiB/s]
20.6MiB 0:00:00 [  23MiB/s]
20.3MiB 0:00:00 [  23MiB/s]
 110MiB 0:00:14 [7.39MiB/s]
62.6MiB 0:00:11 [5.67MiB/s]
65.7MiB 0:00:08 [7.58MiB/s]
 731MiB 0:00:42 [  17MiB/s]
73.7MiB 0:00:29 [ 2.5MiB/s]
 322MiB 0:00:53 [6.04MiB/s]
 105MiB 0:00:35 [2.95MiB/s]
95.2MiB 0:00:36 [2.58MiB/s]
74.2MiB 0:00:30 [2.43MiB/s]
75.5MiB 0:00:46 [1.61MiB/s]

This is 9.3 GB of total diffs between all the snapshots I got.
Plus 15 GB of initial snapshot means there is about 25 GB used,
while df reports twice the amount, way too much for overhead:
/dev/sda2        64G   52G   11G  84% /


# btrfs quota enable /
# btrfs qgroup show /
WARNING: quota disabled, qgroup data may be out of date
[...]
# btrfs quota enable /		- for the second time!
# btrfs qgroup show /
WARNING: qgroup data inconsistent, rescan recommended
[...]
0/428        15.96GiB     19.23MiB 	newly created (now) snapshot



Assuming the qgroups output is bogus and the space isn't physically
occupied (which is consistent with the btrfs fi du output and my
expectation), the question remains: why is that bogus excl removed from the
available space as reported by df or btrfs fi df/usage? And how do I reclaim it?


[~/test]#  btrfs device usage /
/dev/sda2, ID: 1
   Device size:            64.00GiB
   Device slack:              0.00B
   Data,single:             1.07GiB
   Data,RAID1:             55.97GiB
   Metadata,RAID1:          2.00GiB
   System,RAID1:           32.00MiB
   Unallocated:             4.93GiB

/dev/sdb2, ID: 2
   Device size:            64.00GiB
   Device slack:              0.00B
   Data,single:           132.00MiB
   Data,RAID1:             55.97GiB
   Metadata,RAID1:          2.00GiB
   System,RAID1:           32.00MiB
   Unallocated:             5.87GiB

-- 
Tomasz Pala <gotar@pld-linux.org>


* Re: exclusive subvolume space missing
  2017-12-02  0:53   ` Tomasz Pala
@ 2017-12-02  1:05     ` Qu Wenruo
  2017-12-02  1:43       ` Tomasz Pala
  2017-12-02  2:56     ` Duncan
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2017-12-02  1:05 UTC (permalink / raw)
  To: Tomasz Pala, Hugo Mills; +Cc: linux-btrfs



>>> Now, the weird part for me is exclusive data count:
>>>
>>> # btrfs sub sh ./snapshot-171125
>>> [...]
>>>         Subvolume ID:           388
>>> # btrfs fi du -s ./snapshot-171125 
>>>      Total   Exclusive  Set shared  Filename
>>>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
>>>
>>> How is that possible? This doesn't even remotely relate to 7.15 GiB
>>> from qgroup. The same amount differs in total: 28.75-21.50=7.25 GiB.
>>> And the same happens with other snapshots, much more exclusive data
>>> shown in qgroup than actually found in files. So if not files, where
>>> is that space wasted? Metadata?
>>
>>    Personally, I'd trust qgroups' output about as far as I could spit
>> Belgium(*).
> 
> Well, there is something wrong here, as after removing the .ccache
> directories inside all the snapshots the 'excl' values decreased
> ...except for the last snapshot (the list below is short by ~40 snapshots
> that have 2 GB excl in total):
> 
> qgroupid         rfer         excl 
> --------         ----         ---- 
> 0/260        12.25GiB      3.22GiB	from 170712 - first snapshot
> 0/312        17.54GiB      4.56GiB	from 170811
> 0/366        25.59GiB      2.44GiB	from 171028
> 0/370        23.27GiB     59.46MiB 	from 111118 - prev snapshot
> 0/388        21.69GiB      7.16GiB	from 171125 - last snapshot
> 0/291        24.29GiB      9.77GiB	default subvolume

You may need to manually sync the filesystem (trigger a transaction
commit) to update the qgroup accounting.
> 
> 
> [~/test/snapshot-171125]#  du -sh .
> 15G     .
> 
> 
> After changing back to ro I tested how much data really has changed
> between the previous and last snapshot:
> 
> [~/test]#  btrfs send -p snapshot-171118 snapshot-171125 | pv > /dev/null
> At subvol snapshot-171125
> 74.2MiB 0:00:32 [2.28MiB/s]
> 
> This means there can't be 7 GiB of exclusive data in the last snapshot.

As mentioned before, sync the fs before checking the qgroup numbers,
or use the --sync option along with qgroup show.
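
That is, something like:

# btrfs qgroup show --sort=excl --sync /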

> 
> Well, even btrfs send -p snapshot-170712 snapshot-171125 | pv > /dev/null
> 5.68GiB 0:03:23 [28.6MiB/s]
> 
> I've created a new snapshot right now to compare it with 171125:
> 75.5MiB 0:00:43 [1.73MiB/s]
> 
> 
> OK, I could even compare all the snapshots in sequence:
> 
> # for i in snapshot-17*; btrfs prop set $i ro true
> # p=''; for i in snapshot-17*; do [ -n "$p" ] && btrfs send -p "$p" "$i" | pv > /dev/null; p="$i" done
>  1.7GiB 0:00:15 [ 114MiB/s]
> 1.03GiB 0:00:38 [27.2MiB/s]
>  155MiB 0:00:08 [19.1MiB/s]
> 1.08GiB 0:00:47 [23.3MiB/s]
>  294MiB 0:00:29 [ 9.9MiB/s]
>  324MiB 0:00:42 [7.69MiB/s]
> 82.8MiB 0:00:06 [12.7MiB/s]
> 64.3MiB 0:00:05 [11.6MiB/s]
>  137MiB 0:00:07 [19.3MiB/s]
> 85.3MiB 0:00:13 [6.18MiB/s]
> 62.8MiB 0:00:19 [3.21MiB/s]
>  132MiB 0:00:42 [3.15MiB/s]
>  102MiB 0:00:42 [2.42MiB/s]
>  197MiB 0:00:50 [3.91MiB/s]
>  321MiB 0:01:01 [5.21MiB/s]
>  229MiB 0:00:18 [12.3MiB/s]
>  109MiB 0:00:11 [ 9.7MiB/s]
>  139MiB 0:00:14 [9.32MiB/s]
>  573MiB 0:00:35 [15.9MiB/s]
> 64.1MiB 0:00:30 [2.11MiB/s]
>  172MiB 0:00:11 [14.9MiB/s]
> 98.9MiB 0:00:07 [14.1MiB/s]
>   54MiB 0:00:08 [6.17MiB/s]
> 78.6MiB 0:00:02 [32.1MiB/s]
> 15.1MiB 0:00:01 [12.5MiB/s]
> 20.6MiB 0:00:00 [  23MiB/s]
> 20.3MiB 0:00:00 [  23MiB/s]
>  110MiB 0:00:14 [7.39MiB/s]
> 62.6MiB 0:00:11 [5.67MiB/s]
> 65.7MiB 0:00:08 [7.58MiB/s]
>  731MiB 0:00:42 [  17MiB/s]
> 73.7MiB 0:00:29 [ 2.5MiB/s]
>  322MiB 0:00:53 [6.04MiB/s]
>  105MiB 0:00:35 [2.95MiB/s]
> 95.2MiB 0:00:36 [2.58MiB/s]
> 74.2MiB 0:00:30 [2.43MiB/s]
> 75.5MiB 0:00:46 [1.61MiB/s]
> 
> This is 9.3 GB of total diffs between all the snapshots I got.
> Plus 15 GB of initial snapshot means there is about 25 GB used,
> while df reports twice the amount, way too much for overhead:
> /dev/sda2        64G   52G   11G  84% /
> 
> 
> # btrfs quota enable /
> # btrfs qgroup show /
> WARNING: quota disabled, qgroup data may be out of date
> [...]
> # btrfs quota enable /		- for the second time!
> # btrfs qgroup show /
> WARNING: qgroup data inconsistent, rescan recommended

Please wait for the rescan to finish, or none of the numbers will be correct.
(Although they will only be smaller than the actually occupied space.)

It's highly recommended to read btrfs-quota(8) and btrfs-qgroup(8) to
ensure you understand all the limitations.

> [...]
> 0/428        15.96GiB     19.23MiB 	newly created (now) snapshot
> 
> 
> 
> Assuming the qgroups output is bugus and the space isn't physically
> occupied (which is coherent with btrfs fi du output and my expectation)
> the question remains: why is that bogus-excl removed from available
> space as reported by df or btrfs fi df/usage? And how to reclaim it?

Already explained the difference in another thread.

Thanks,
Qu

> 
> 
> [~/test]#  btrfs device usage /
> /dev/sda2, ID: 1
>    Device size:            64.00GiB
>    Device slack:              0.00B
>    Data,single:             1.07GiB
>    Data,RAID1:             55.97GiB
>    Metadata,RAID1:          2.00GiB
>    System,RAID1:           32.00MiB
>    Unallocated:             4.93GiB
> 
> /dev/sdb2, ID: 2
>    Device size:            64.00GiB
>    Device slack:              0.00B
>    Data,single:           132.00MiB
>    Data,RAID1:             55.97GiB
>    Metadata,RAID1:          2.00GiB
>    System,RAID1:           32.00MiB
>    Unallocated:             5.87GiB
> 




* Re: exclusive subvolume space missing
  2017-12-02  0:27 ` Qu Wenruo
@ 2017-12-02  1:23   ` Tomasz Pala
  2017-12-02  1:47     ` Qu Wenruo
  2017-12-05 18:47   ` How exclusive in parent qgroup is computed? (was: Re: exclusive subvolume space missing) Andrei Borzenkov
  1 sibling, 1 reply; 32+ messages in thread
From: Tomasz Pala @ 2017-12-02  1:23 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sat, Dec 02, 2017 at 08:27:56 +0800, Qu Wenruo wrote:

> I assume there is program eating up the space.
> Not btrfs itself.

Very doubtful. I've encountered an ext3 "eating" problem once, which couldn't
be found by lsof on a 3.4.75 kernel, but the space came back after killing
Xorg. The system I'm having the problem on now is very recent; the space
doesn't return after a reboot/emergency boot and doesn't add up with the files.

>> Now, the weird part for me is exclusive data count:
>> 
>> # btrfs sub sh ./snapshot-171125
>> [...]
>>         Subvolume ID:           388
>> # btrfs fi du -s ./snapshot-171125 
>>      Total   Exclusive  Set shared  Filename
>>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
> 
> That's the difference between how sub show and quota works.
> 
> For quota, it's per-root owner check.

Just to be clear: I've enabled quota _only_ to see the subvolume usage on
the spot. And the exclusive data - the more detailed approach is the one
I've described in the e-mail I sent a minute ago.

> Means even a file extent is shared between different inodes, if all
> inodes are inside the same subvolume, it's counted as exclusive.
> And if any of the file extent belongs to other subvolume, then it's
> counted as shared.

Good to know, but this is an almost UID0-only system. There are system
users (vendor provided) and 2 ssh accounts for su, but nobody uses this
machine for daily work. The quota values were the last tool I could find
for debugging.

> For fi du, it's per-inode owner check. (The exact behavior is a little
> more complex, I'll skip such corner case to make it a little easier to
> understand).
> 
> That's to say, if one file extent is shared by different inodes, then
> it's counted as shared, no matter if these inodes belong to different or
> the same subvolume.
> 
> That's to say, "fi du" has a looser condition for "shared" calculation,
> and that should explain why you have 20+G shared.

There shouldn't be many multi-inode extents inside a single subvolume, as
this is a mostly fresh system, with no containers and no deduplication;
snapshots are taken from the same running system before or after some more
important change is done. By 'change' I mean mostly altering text config
files (plus etckeeper's git metadata), so the volume of differences is
extremely low. Actually, most of the diffs between subvolumes come from
updating distro packages. There were not many reflink copies made on this
partition, only one kernel source compiled (.ccache files removed
today). So this partition is about as clean as it could be after almost
5 months of use.

Actually I should rephrase the problem:

"snapshot has taken 8 GB of space despite nothing has altered source subvolume"

-- 
Tomasz Pala <gotar@pld-linux.org>


* Re: exclusive subvolume space missing
  2017-12-02  1:05     ` Qu Wenruo
@ 2017-12-02  1:43       ` Tomasz Pala
  2017-12-02  2:17         ` Qu Wenruo
  0 siblings, 1 reply; 32+ messages in thread
From: Tomasz Pala @ 2017-12-02  1:43 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Hugo Mills, linux-btrfs

On Sat, Dec 02, 2017 at 09:05:50 +0800, Qu Wenruo wrote:

>> qgroupid         rfer         excl 
>> --------         ----         ---- 
>> 0/260        12.25GiB      3.22GiB	from 170712 - first snapshot
>> 0/312        17.54GiB      4.56GiB	from 170811
>> 0/366        25.59GiB      2.44GiB	from 171028
>> 0/370        23.27GiB     59.46MiB 	from 111118 - prev snapshot
>> 0/388        21.69GiB      7.16GiB	from 171125 - last snapshot
>> 0/291        24.29GiB      9.77GiB	default subvolume
> 
> You may need to manually sync the filesystem (trigger a transaction
> commitment) to update qgroup accounting.

The data I've pasted were just calculated.

>> # btrfs quota enable /
>> # btrfs qgroup show /
>> WARNING: quota disabled, qgroup data may be out of date
>> [...]
>> # btrfs quota enable /		- for the second time!
>> # btrfs qgroup show /
>> WARNING: qgroup data inconsistent, rescan recommended
> 
> Please wait the rescan, or any number is not correct.

Here I was pointing out that the first "quota enable" resulted in a "quota
disabled" warning until I enabled it a second time.

> It's highly recommended to read btrfs-quota(8) and btrfs-qgroup(8) to
> ensure you understand all the limitation.

I probably won't understand them all, but that is not really my concern, as
I don't use quotas. They are simply the only way I am aware of that could
show me per-subvolume stats. Well, the only straightforward way, as the
hard way I'm using (btrfs send) confirms the problem.

You could simply discard all the quota results I've posted and there would
still be the underlying problem: the 25 GB of data I have occupies 52 GB.
At least one recent snapshot, which was taken after some minor (<100 MB)
changes to the subvolume, and whose source has undergone only minor changes
since then, came to occupy 8 GB during one night while the entire system
was idling.

This was cross-checked against file metadata (mtimes compared) and 'du'
results.


As a last resort I've rebalanced the disk (once again), this time with
-dconvert=raid1 (to get rid of the 'single' residue).

-- 
Tomasz Pala <gotar@pld-linux.org>


* Re: exclusive subvolume space missing
  2017-12-02  1:23   ` Tomasz Pala
@ 2017-12-02  1:47     ` Qu Wenruo
  2017-12-02  2:21       ` Tomasz Pala
  0 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2017-12-02  1:47 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: linux-btrfs





On 2017-12-02 09:23, Tomasz Pala wrote:
> On Sat, Dec 02, 2017 at 08:27:56 +0800, Qu Wenruo wrote:
> 
>> I assume there is program eating up the space.
>> Not btrfs itself.
> 
> Very doubtful. I've encountered ext3 "eating" problem once, that couldn't be
> find by lsof on 3.4.75 kernel, but the space was returning after killing
> Xorg. The system I'm having problem now is very recent, the space
> doesn't return after reboot/emergency and doesn't sum up with files.

Unlike vanilla df or "fi usage" or "fi df", btrfs quota only counts
on-disk extents.

That's to say, reserved space won't contribute to the qgroup numbers.
The exception is anonymous files, which are opened but unlinked, so no
one can access them except the owner.
(Which I doubt is your case.)

This should make quota the best tool to debug your problem.
(As long as you follow the various limitations of btrfs quota;
especially, you need to sync or use the --sync option to show correct
qgroup numbers.)

> 
>>> Now, the weird part for me is exclusive data count:
>>>
>>> # btrfs sub sh ./snapshot-171125
>>> [...]
>>>         Subvolume ID:           388
>>> # btrfs fi du -s ./snapshot-171125 
>>>      Total   Exclusive  Set shared  Filename
>>>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
>>
>> That's the difference between how sub show and quota works.
>>
>> For quota, it's per-root owner check.
> 
> Just to be clear: I've enabled quota _only_ to see subvolume usage on
> spot. And exclusive data - the more detailed approach I've described in
> e-mail I've send a minute ago.
> 
>> Means even a file extent is shared between different inodes, if all
>> inodes are inside the same subvolume, it's counted as exclusive.
>> And if any of the file extent belongs to other subvolume, then it's
>> counted as shared.
> 
> Good to know, but this is almost UID0-only system. There are system
> users (vendor provided) and 2 ssh accounts for su, but nobody uses this
> machine for daily work. The quota values were the last tool I could find
> to debug.
> 
>> For fi du, it's per-inode owner check. (The exact behavior is a little
>> more complex, I'll skip such corner case to make it a little easier to
>> understand).
>>
>> That's to say, if one file extent is shared by different inodes, then
>> it's counted as shared, no matter if these inodes belong to different or
>> the same subvolume.
>>
>> That's to say, "fi du" has a looser condition for "shared" calculation,
>> and that should explain why you have 20+G shared.
> 
> There shouldn't be many multi-inode extents inside single subvolume, as this is mostly fresh
> system, with no containers, no deduplication, snapshots are taken from
> the same running system before or after some more important change is
> done. By 'change' I mean altering text config files mostly (plus
> etckeeper's git metadata), so the volume of difference is extremelly
> low. Actually most of the difs between subvolumes come from updating
> distro packages. There were not much reflink copies made on this
> partition, only one kernel source compiled (.ccache files removed
> today). So this partition is as clean, as it could be after almost
> 5 months in use.
> 
> Actually I should rephrase the problem:
> 
> "snapshot has taken 8 GB of space despite nothing has altered source subvolume"

Then please provide correct qgroup numbers.

The correct numbers can be obtained by:
# btrfs quota enable <mnt>
# btrfs quota rescan -w <mnt>
# btrfs qgroup show -prce --sync <mnt>

Rescan and --sync are important to get correct numbers.
(While a rescan can take a long, long time to finish.)

Furthermore, please ensure that all deleted files are really deleted.
Btrfs delays file and subvolume deletion, so you may need to sync several
times or use "btrfs subv sync" to ensure deleted files are really gone.

(Vanilla du won't tell you whether such delayed file deletion is actually done.)
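
For example:

# sync; btrfs subvolume sync /   <- waits until deleted subvolumes are fully cleaned up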

Thanks,
Qu
> 




* Re: exclusive subvolume space missing
  2017-12-02  1:43       ` Tomasz Pala
@ 2017-12-02  2:17         ` Qu Wenruo
  0 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2017-12-02  2:17 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: Hugo Mills, linux-btrfs





On 2017-12-02 09:43, Tomasz Pala wrote:
> On Sat, Dec 02, 2017 at 09:05:50 +0800, Qu Wenruo wrote:
> 
>>> qgroupid         rfer         excl 
>>> --------         ----         ---- 
>>> 0/260        12.25GiB      3.22GiB	from 170712 - first snapshot
>>> 0/312        17.54GiB      4.56GiB	from 170811
>>> 0/366        25.59GiB      2.44GiB	from 171028
>>> 0/370        23.27GiB     59.46MiB 	from 111118 - prev snapshot
>>> 0/388        21.69GiB      7.16GiB	from 171125 - last snapshot
>>> 0/291        24.29GiB      9.77GiB	default subvolume
>>
>> You may need to manually sync the filesystem (trigger a transaction
>> commitment) to update qgroup accounting.
> 
> The data I've pasted were just calculated.
> 
>>> # btrfs quota enable /
>>> # btrfs qgroup show /
>>> WARNING: quota disabled, qgroup data may be out of date
>>> [...]
>>> # btrfs quota enable /		- for the second time!
>>> # btrfs qgroup show /
>>> WARNING: qgroup data inconsistent, rescan recommended
>>
>> Please wait the rescan, or any number is not correct.
> 
> Here I was pointing that first "quota enable" resulted in "quota
> disabled" warning until I've enabled it once again.
> 
>> It's highly recommended to read btrfs-quota(8) and btrfs-qgroup(8) to
>> ensure you understand all the limitation.
> 
> I probably won't understand them all, but this is not an issue of my
> concern as I don't use it. There is simply no other way I am aware that
> could show me per-subvolume stats. Well, straightforward way, as the
> hard way I'm using (btrfs send) confirms the problem.

Unfortunately, send doesn't count everything.

The most common case: send doesn't account for extent bookkeeping space.
Try the following commands:

# fallocate -l 1G <tmp_file>
# mkfs.btrfs -f <tmp_file>
# mount <tmp_file> <mnt>
# btrfs subv create <mnt>/subv1
# xfs_io -f -c "pwrite 0 128M" -c "sync" <mnt>/subv1/file1
# xfs_io -f "fpunch 0 127M" -c "sync" <mnt>/subv1/file1
# btrfs subv snapshot -r <mnt>/subv1 <mnt>/snapshot
# btrfs send <mnt>/snapshot

You will only get about 1M of data in the send stream, while the file still
takes 128M of space on-disk.

Btrfs extent bookkeeping will only free the whole extent when there
is no inode referring to *ANY* part of the extent.

Even if only 1M of a 128M file extent is still used, it still takes 128M
of space on-disk.

And that's what send can't tell you.
And that's also what qgroup can tell you.

That's also why I need *CORRECT* qgroup numbers to further investigate
the problem.

> 
> You could simply remove all the quota results I've posted and there will
> be the underlaying problem, that the 25 GB of data I got occupies 52 GB.

If you only want to know why your "25G" of data occupies 52G on
disk, the above is one of the possible explanations.

(And I think I should put it into btrfs(5), although I highly doubt
users will really read it.)

You could try to defrag, but I'm not sure whether defrag works well in the
multi-subvolume case.
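
A sketch of that (the path is a placeholder; note that defragmenting may
break extent sharing with snapshots and can therefore increase space usage):

# btrfs filesystem defragment -r -v /path/to/subvolume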

> At least one recent snapshot, that was taken after some minor (<100 MB) changes
> from the subvolume, that has undergo some minor changes since then,
> occupied 8 GB during one night when the entire system was idling.

The only possible method to fully isolate all the disturbing factors is
to get rid of snapshots.

Build the subvolume from scratch (with not even a cp --reflink from another
subvolume), then test what's happening.

Only in that case can you trust vanilla du (as long as you don't do any
reflinks). While you can always trust the qgroup numbers, such a subvolume
built from scratch makes the exclusive number equal to the referenced one,
making debugging a little easier.

Thanks,
Qu

> 
> This was crosschecked on files metadata (mtimes compared) and 'du'
> results.
> 
> 
> As a last-resort I've rebalanced the disk (once again), this time with
> -dconvert=raid1 (to get rid of the single residue).
> 




* Re: exclusive subvolume space missing
  2017-12-02  1:47     ` Qu Wenruo
@ 2017-12-02  2:21       ` Tomasz Pala
  2017-12-02  2:35         ` Qu Wenruo
  0 siblings, 1 reply; 32+ messages in thread
From: Tomasz Pala @ 2017-12-02  2:21 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sat, Dec 02, 2017 at 09:47:19 +0800, Qu Wenruo wrote:

>> Actually I should rephrase the problem:
>> 
>> "snapshot has taken 8 GB of space despite nothing has altered source subvolume"

Actually, after:

# btrfs balance start -v -dconvert=raid1 /
ctrl-c on block group 35G/113G
# btrfs balance start -v -dconvert=raid1,soft /
# btrfs balance start -v -dusage=55 /
Done, had to relocate 1 out of 56 chunks
# btrfs balance start -v -musage=55 /
Done, had to relocate 2 out of 55 chunks

and after waiting a few minutes ...the 8 GB I lost yesterday is back:

#  btrfs fi sh /
Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
        Total devices 2 FS bytes used 44.10GiB
        devid    1 size 64.00GiB used 54.00GiB path /dev/sda2
        devid    2 size 64.00GiB used 54.00GiB path /dev/sdb2

#  btrfs fi usage /
Overall:
    Device size:                 128.00GiB
    Device allocated:            108.00GiB
    Device unallocated:           20.00GiB
    Device missing:                  0.00B
    Used:                         88.19GiB
    Free (estimated):             18.75GiB      (min: 18.75GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              131.14MiB      (used: 0.00B)

Data,RAID1: Size:51.97GiB, Used:43.22GiB
   /dev/sda2      51.97GiB
   /dev/sdb2      51.97GiB

Metadata,RAID1: Size:2.00GiB, Used:895.69MiB
   /dev/sda2       2.00GiB
   /dev/sdb2       2.00GiB

System,RAID1: Size:32.00MiB, Used:16.00KiB
   /dev/sda2      32.00MiB
   /dev/sdb2      32.00MiB

Unallocated:
   /dev/sda2      10.00GiB
   /dev/sdb2      10.00GiB

#  btrfs dev usage /
/dev/sda2, ID: 1
   Device size:            64.00GiB
   Device slack:              0.00B
   Data,RAID1:             51.97GiB
   Metadata,RAID1:          2.00GiB
   System,RAID1:           32.00MiB
   Unallocated:            10.00GiB

/dev/sdb2, ID: 2
   Device size:            64.00GiB
   Device slack:              0.00B
   Data,RAID1:             51.97GiB
   Metadata,RAID1:          2.00GiB
   System,RAID1:           32.00MiB
   Unallocated:            10.00GiB

#  btrfs fi df /    
Data, RAID1: total=51.97GiB, used=43.22GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=895.69MiB
GlobalReserve, single: total=131.14MiB, used=0.00B

# df
/dev/sda2        64G   45G   19G  71% /

However the difference is on active root fs:

-0/291        24.29GiB      9.77GiB
+0/291        15.99GiB     76.00MiB

Still, 45G used, while there is (if I counted this correctly) 25G of data...

> Then please provide correct qgroup numbers.
> 
> The correct number should be get by:
> # btrfs quota enable <mnt>
> # btrfs quota rescan -w <mnt>
> # btrfs qgroup show -prce --sync <mnt>

OK, just added the --sort=excl:

qgroupid         rfer         excl     max_rfer     max_excl parent  child 
--------         ----         ----     --------     -------- ------  ----- 
0/5          16.00KiB     16.00KiB         none         none ---     ---  
0/361        22.57GiB      7.00MiB         none         none ---     ---  
0/358        22.54GiB      7.50MiB         none         none ---     ---  
0/343        22.36GiB      7.84MiB         none         none ---     ---  
0/345        22.49GiB      8.05MiB         none         none ---     ---  
0/357        22.50GiB      9.27MiB         none         none ---     ---  
0/360        22.57GiB     10.27MiB         none         none ---     ---  
0/344        22.48GiB     11.09MiB         none         none ---     ---  
0/359        22.55GiB     12.57MiB         none         none ---     ---  
0/362        22.59GiB     22.96MiB         none         none ---     ---  
0/302        12.87GiB     31.23MiB         none         none ---     ---  
0/428        15.96GiB     38.68MiB         none         none ---     ---  
0/294        11.09GiB     47.86MiB         none         none ---     ---  
0/336        21.80GiB     49.59MiB         none         none ---     ---  
0/300        12.56GiB     51.43MiB         none         none ---     ---  
0/342        22.31GiB     52.93MiB         none         none ---     ---  
0/333        21.71GiB     54.54MiB         none         none ---     ---  
0/363        22.63GiB     58.83MiB         none         none ---     ---  
0/370        23.27GiB     59.46MiB         none         none ---     ---  
0/305        13.01GiB     61.47MiB         none         none ---     ---  
0/331        21.61GiB     61.49MiB         none         none ---     ---  
0/334        21.78GiB     62.95MiB         none         none ---     ---  
0/306        13.04GiB     64.11MiB         none         none ---     ---  
0/304        12.96GiB     64.90MiB         none         none ---     ---  
0/303        12.94GiB     68.39MiB         none         none ---     ---  
0/367        23.20GiB     68.52MiB         none         none ---     ---  
0/366        23.22GiB     69.79MiB         none         none ---     ---  
0/364        22.63GiB     72.03MiB         none         none ---     ---  
0/285        10.78GiB     75.95MiB         none         none ---     ---  
0/291        15.99GiB     76.24MiB         none         none ---     ---  <- this one (default rootfs) got fixed
0/323        21.35GiB     95.85MiB         none         none ---     ---  
0/369        23.26GiB     96.12MiB         none         none ---     ---  
0/324        21.36GiB    104.46MiB         none         none ---     ---  
0/327        21.36GiB    115.42MiB         none         none ---     ---  
0/368        23.27GiB    118.25MiB         none         none ---     ---  
0/295        11.20GiB    148.59MiB         none         none ---     ---  
0/298        12.38GiB    283.41MiB         none         none ---     ---  
0/260        12.25GiB      3.22GiB         none         none ---     ---  <- 170712, initial snapshot, OK
0/312        17.54GiB      4.56GiB         none         none ---     ---  <- 170811, definitely less excl
0/388        21.69GiB      7.16GiB         none         none ---     ---  <- this one has <100M exclusive


So the one block of data was released, but there are probably two more
stuck here. If the 4.5G and 7G were freed I would have 45-4.5-7=33G used,
which would agree with the 25G of data I've counted manually.

Any ideas how to look inside these two snapshots?

> Rescan and --sync are important to get the correct number.
> (while rescan can take a long long time to finish)

#  time btrfs quota rescan -w /
quota rescan started
btrfs quota rescan -w /  0.00s user 0.00s system 0% cpu 30.798 total

> And further more, please ensure that all deleted files are really deleted.
> Btrfs delay file and subvolume deletion, so you may need to sync several
> times or use "btrfs subv sync" to ensure deleted files are deleted.

Yes, I was aware of that. However, I've never had to wait after a rebalance...

regards,
-- 
Tomasz Pala <gotar@pld-linux.org>


* Re: exclusive subvolume space missing
  2017-12-02  2:21       ` Tomasz Pala
@ 2017-12-02  2:35         ` Qu Wenruo
  2017-12-02  9:33           ` Tomasz Pala
  0 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2017-12-02  2:35 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: linux-btrfs





On 2017-12-02 10:21, Tomasz Pala wrote:
> On Sat, Dec 02, 2017 at 09:47:19 +0800, Qu Wenruo wrote:
> 
>>> Actually I should rephrase the problem:
>>>
>>> "snapshot has taken 8 GB of space despite nothing has altered source subvolume"
> 
> Actually, after:
> 
> # btrfs balance start -v -dconvert=raid1 /
> ctrl-c on block group 35G/113G
> # btrfs balance start -v -dconvert=raid1,soft /
> # btrfs balance start -v -dusage=55 /
> Done, had to relocate 1 out of 56 chunks
> # btrfs balance start -v -musage=55 /
> Done, had to relocate 2 out of 55 chunks
> 
> and waiting a few minutes after ...the 8 GB I've lost yesterday is back:
> 
> #  btrfs fi sh /
> Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
>         Total devices 2 FS bytes used 44.10GiB
>         devid    1 size 64.00GiB used 54.00GiB path /dev/sda2
>         devid    2 size 64.00GiB used 54.00GiB path /dev/sdb2
> 
> #  btrfs fi usage /
> Overall:
>     Device size:                 128.00GiB
>     Device allocated:            108.00GiB
>     Device unallocated:           20.00GiB
>     Device missing:                  0.00B
>     Used:                         88.19GiB
>     Free (estimated):             18.75GiB      (min: 18.75GiB)
>     Data ratio:                       2.00
>     Metadata ratio:                   2.00
>     Global reserve:              131.14MiB      (used: 0.00B)
> 
> Data,RAID1: Size:51.97GiB, Used:43.22GiB
>    /dev/sda2      51.97GiB
>    /dev/sdb2      51.97GiB
> 
> Metadata,RAID1: Size:2.00GiB, Used:895.69MiB
>    /dev/sda2       2.00GiB
>    /dev/sdb2       2.00GiB
> 
> System,RAID1: Size:32.00MiB, Used:16.00KiB
>    /dev/sda2      32.00MiB
>    /dev/sdb2      32.00MiB
> 
> Unallocated:
>    /dev/sda2      10.00GiB
>    /dev/sdb2      10.00GiB
> 
> #  btrfs dev usage /
> /dev/sda2, ID: 1
>    Device size:            64.00GiB
>    Device slack:              0.00B
>    Data,RAID1:             51.97GiB
>    Metadata,RAID1:          2.00GiB
>    System,RAID1:           32.00MiB
>    Unallocated:            10.00GiB
> 
> /dev/sdb2, ID: 2
>    Device size:            64.00GiB
>    Device slack:              0.00B
>    Data,RAID1:             51.97GiB
>    Metadata,RAID1:          2.00GiB
>    System,RAID1:           32.00MiB
>    Unallocated:            10.00GiB
> 
> #  btrfs fi df /    
> Data, RAID1: total=51.97GiB, used=43.22GiB
> System, RAID1: total=32.00MiB, used=16.00KiB
> Metadata, RAID1: total=2.00GiB, used=895.69MiB
> GlobalReserve, single: total=131.14MiB, used=0.00B
> 
> # df
> /dev/sda2        64G   45G   19G  71% /
> 
> However the difference is on active root fs:
> 
> -0/291        24.29GiB      9.77GiB
> +0/291        15.99GiB     76.00MiB
> 
> Still, 45G used, while there is (if I counted this correctly) 25G of data...
> 
>> Then please provide correct qgroup numbers.
>>
>> The correct number should be get by:
>> # btrfs quota enable <mnt>
>> # btrfs quota rescan -w <mnt>
>> # btrfs qgroup show -prce --sync <mnt>
> 
> OK, just added the --sort=excl:
> 
> qgroupid         rfer         excl     max_rfer     max_excl parent  child 
> --------         ----         ----     --------     -------- ------  ----- 
> 0/5          16.00KiB     16.00KiB         none         none ---     ---  
> 0/361        22.57GiB      7.00MiB         none         none ---     ---  
> 0/358        22.54GiB      7.50MiB         none         none ---     ---  
> 0/343        22.36GiB      7.84MiB         none         none ---     ---  
> 0/345        22.49GiB      8.05MiB         none         none ---     ---  
> 0/357        22.50GiB      9.27MiB         none         none ---     ---  
> 0/360        22.57GiB     10.27MiB         none         none ---     ---  
> 0/344        22.48GiB     11.09MiB         none         none ---     ---  
> 0/359        22.55GiB     12.57MiB         none         none ---     ---  
> 0/362        22.59GiB     22.96MiB         none         none ---     ---  
> 0/302        12.87GiB     31.23MiB         none         none ---     ---  
> 0/428        15.96GiB     38.68MiB         none         none ---     ---  
> 0/294        11.09GiB     47.86MiB         none         none ---     ---  
> 0/336        21.80GiB     49.59MiB         none         none ---     ---  
> 0/300        12.56GiB     51.43MiB         none         none ---     ---  
> 0/342        22.31GiB     52.93MiB         none         none ---     ---  
> 0/333        21.71GiB     54.54MiB         none         none ---     ---  
> 0/363        22.63GiB     58.83MiB         none         none ---     ---  
> 0/370        23.27GiB     59.46MiB         none         none ---     ---  
> 0/305        13.01GiB     61.47MiB         none         none ---     ---  
> 0/331        21.61GiB     61.49MiB         none         none ---     ---  
> 0/334        21.78GiB     62.95MiB         none         none ---     ---  
> 0/306        13.04GiB     64.11MiB         none         none ---     ---  
> 0/304        12.96GiB     64.90MiB         none         none ---     ---  
> 0/303        12.94GiB     68.39MiB         none         none ---     ---  
> 0/367        23.20GiB     68.52MiB         none         none ---     ---  
> 0/366        23.22GiB     69.79MiB         none         none ---     ---  
> 0/364        22.63GiB     72.03MiB         none         none ---     ---  
> 0/285        10.78GiB     75.95MiB         none         none ---     ---  
> 0/291        15.99GiB     76.24MiB         none         none ---     ---  <- this one (default rootfs) got fixed
> 0/323        21.35GiB     95.85MiB         none         none ---     ---  
> 0/369        23.26GiB     96.12MiB         none         none ---     ---  
> 0/324        21.36GiB    104.46MiB         none         none ---     ---  
> 0/327        21.36GiB    115.42MiB         none         none ---     ---  
> 0/368        23.27GiB    118.25MiB         none         none ---     ---  
> 0/295        11.20GiB    148.59MiB         none         none ---     ---  
> 0/298        12.38GiB    283.41MiB         none         none ---     ---  
> 0/260        12.25GiB      3.22GiB         none         none ---     ---  <- 170712, initial snapshot, OK
> 0/312        17.54GiB      4.56GiB         none         none ---     ---  <- 170811, definitely less excl
> 0/388        21.69GiB      7.16GiB         none         none ---     ---  <- this one has <100M exclusive
> 
> 
> So the one block of data was released, but there are probably two more
> stuck here. If the 4.5G and 7G were freed I would have 45-4.5-7=33G used,
> which would agree with the 25G of data I've counted manually.
> 
> Any ideas how to look inside these two snapshots?

I would try du first to locate the largest file.

Then use "btrfs fi du" to ensure it's not shared.
(If it's not shared in "fi du", it must be exclusive to the qgroup.)
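
For example (a minimal sketch; the paths are illustrative, not taken from
the report):

# du -xm --max-depth=2 /mnt/snapshot-171125 | sort -n | tail   # locate the largest directories
# btrfs fi du -s /mnt/snapshot-171125/some/large/dir           # compare Total vs Exclusive vs Set shared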

Although this will not work as expected if it's not a large file but
tons of small files.


Another method is to use send output.
Use "btrfs receive --dump" to get a comprehensive view of what's
different between two snapshots.
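
A minimal sketch of that (the snapshot names are illustrative; both must be
read-only snapshots for send to accept them):

# btrfs send -p /snapshots/snapshot-171118 /snapshots/snapshot-171125 | btrfs receive --dump > snapshot-diff.txt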

Since you have already shown the sizes of the snapshots, which hardly
go beyond 1G, extent bookkeeping (partially rewritten extents still being
kept in full) may be the cause.

And considering it's all exclusive, defrag may help in this case.

Thanks,
Qu

> 
>> Rescan and --sync are important to get the correct number.
>> (while rescan can take a long long time to finish)
> 
> #  time btrfs quota rescan -w /
> quota rescan started
> btrfs quota rescan -w /  0.00s user 0.00s system 0% cpu 30.798 total
>
>> And further more, please ensure that all deleted files are really deleted.
>> Btrfs delay file and subvolume deletion, so you may need to sync several
>> times or use "btrfs subv sync" to ensure deleted files are deleted.
> 
> Yes, I was aware about that. However I've never had to wait after rebalance...
> 
> regards,
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-02  0:53   ` Tomasz Pala
  2017-12-02  1:05     ` Qu Wenruo
@ 2017-12-02  2:56     ` Duncan
  2017-12-02 16:28     ` Tomasz Pala
  2017-12-04  4:58     ` Chris Murphy
  3 siblings, 0 replies; 32+ messages in thread
From: Duncan @ 2017-12-02  2:56 UTC (permalink / raw)
  To: linux-btrfs

Tomasz Pala posted on Sat, 02 Dec 2017 01:53:39 +0100 as excerpted:

> #  btrfs fi usage /
> Overall:
>     Device size:                 128.00GiB
>     Device allocated:            117.19GiB
>     Device unallocated:           10.81GiB
>     Device missing:                  0.00B
>     Used:                        103.56GiB
>     Free (estimated):             11.19GiB      (min: 11.14GiB)
>     Data ratio:                       1.98
>     Metadata ratio:                   2.00
>     Global reserve:              146.08MiB      (used: 0.00B)
> 
> Data,single: Size:1.19GiB, Used:1.18GiB
>    /dev/sda2       1.07GiB
>    /dev/sdb2     132.00MiB
> 
> Data,RAID1: Size:55.97GiB, Used:50.30GiB
>    /dev/sda2      55.97GiB
>    /dev/sdb2      55.97GiB
> 
> Metadata,RAID1: Size:2.00GiB, Used:908.61MiB
>    /dev/sda2       2.00GiB
>    /dev/sdb2       2.00GiB
> 
> System,RAID1: Size:32.00MiB, Used:16.00KiB
>    /dev/sda2      32.00MiB
>    /dev/sdb2      32.00MiB
> 
> Unallocated:
>    /dev/sda2       4.93GiB
>    /dev/sdb2       5.87GiB

OK, is this supposed to be raid1 or single data?  The above shows
metadata as all raid1, while some data is single tho most is raid1.
Old mkfs used to create unused single chunks on raid1 that had to be
removed manually via balance, but those single data chunks aren't unused.

Which means that if it's supposed to be raid1, you don't have redundancy
on that single data.

Assuming the intent is raid1, I'd recommend doing...

btrfs balance start -dconvert=raid1,soft /

Probably disable quotas at least temporarily while you do so, tho, as
they don't scale well with balance and make it take much longer.

That should go reasonably fast as it's only a bit over 1 GiB on the one
device, and 132 MiB on the other (from your btrfs device usage), and the
soft allows it to skip chunks that don't need conversion.

It should kill those single entries and even up usage on both devices,
along with making the filesystem much more tolerant of loss of one of
the two devices.
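
You can verify the result afterwards (a quick sketch):

# btrfs fi df /       # the "Data, single" line should be gone
# btrfs fi usage /    # per-device allocation should be even again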


Other than that, what we can see from the above is that it's a relatively
small filesystem, 64 GiB each on a pair of devices, raid1 but for the
above.

We also see that the allocated-chunks vs. chunk-usage spread isn't /too/
bad, that being a somewhat common problem.  However, given the relatively
small 64 GiB per-device, two-device raid1 filesystem, there is some
slack, about 5 GiB worth, in that raid1 data, that you can recover.

btrfs balance start -dusage=N /

Where N represents a percentage full, so 0-100.  Normally, smaller
values of N complete much faster, and have the most effect when they're
enough, because at say 10% usage, ten 90%-empty chunks can be rewritten
into a single 100%-full chunk.

The idea is to start with a small N value since it completes fast, and
redo with higher values as necessary to shrink the total data chunk
allocated value toward usage.  I too run relatively small btrfs raid1s
and would suggest trying N=5, 20, 40, 70, until the spread between
used and total is under 2 gigs, under a gig if you want to go that far
(nominal data chunk size is a gig so even a full balance will be unlikely
to get you a spread less than that).  Over 70 likely won't get you much
so isn't worth it.
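
A rough sketch of that stepwise approach (watch the data total shrink
toward used between steps):

# for N in 5 20 40 70; do btrfs balance start -dusage=$N /; btrfs fi df /; done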

That should return the excess to unallocated, leaving the filesystem
able to use the freed space for data or metadata chunks as necessary,
tho you're unlikely to see an increase in available space in (non-btrfs)
df or similar.  If the unallocated value gets down below 1 GiB you may
have trouble freeing space, since balance needs room to write a new
chunk before it can free the old ones, so you probably want to keep an
eye on this and rebalance if unallocated drops under 2-3 gigs, assuming
of course that there's slack between used and total that /can/ be freed
by a rebalance.

FWIW the same can be done with metadata using -musage=, with metadata
chunks being 256 MiB nominal, but keep in mind that global reserve is
allocated from metadata space but doesn't count as used, so you typically
can't get the spread down below half a GiB or so.  And in most cases
it's data chunks that get the big spread, not metadata, so it's much
more common to have to do -d for data than -m for metadata.


All that said, the numbers don't show a runaway spread between total
and used, so while this might help, it's not going to fix the thread's
primary space-being-eaten problem, as I had hoped it might.

Additionally, at 2 GiB total per device, metadata chunks aren't consuming
your space at a runaway rate either, as I'd suspect they might if the
problem were for instance atime updates, so while noatime is certainly
recommended and might help some, it doesn't appear to be a primary
contributor to the problem either.


The other possibility that comes to mind here has to do with btrfs COW
write patterns...

Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
the GiB+ example typically used due to the filesystem size being small,
64 GiB usable capacity due to raid1).  And for simplicity, suppose it's
allocated as a single 100 MiB extent.

Now make various small changes to the file, say under 16 KiB each.  These
will each be COWed elsewhere as one might expect, by default 16 KiB at
a time I believe (it might be 4 KiB, as it was back when the default leaf
size was 4 KiB, but I think with the change to 16 KiB leaf sizes by default
it's now 16 KiB).

But here's the kicker.  Even without a snapshot locking that original 100
MiB extent in place, if even one of the original 16 KiB blocks isn't
rewritten, that entire 100 MiB extent will remain locked in place.  The
original 16 KiB blocks that have been changed and thus COWed elsewhere
aren't freed one at a time; the full 100 MiB extent only gets freed, all
at once, once no references to it remain, which means once that last
block of the extent gets rewritten.

So perhaps you have a pattern where files of several MiB get mostly
rewritten, taking more space for the rewrites due to COW, but one or
more blocks remain as originally written, locking the original extent
in place at its full size, thus taking twice the space of the original
file.

Of course worst-case is rewrite the file minus a block, then rewrite
that minus a block, then rewrite... in which case the total space
usage will end up being several times the size of the original file!
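
This is easy to demonstrate (a sketch, assuming no compression and that
the initial 100 MiB lands in a single extent; the path is illustrative):

# dd if=/dev/zero of=/mnt/testfile bs=1M count=100 conv=fsync
# df -h /mnt                     # note Used
# dd if=/dev/urandom of=/mnt/testfile bs=16K count=6000 conv=notrunc,fsync
# df -h /mnt                     # note Used again

Used grows by roughly the ~94 MiB that was rewritten even though the file
is still 100 MiB, because the original extent stays pinned until its last
untouched block is also rewritten.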

Luckily few people have this sort of usage pattern, but if you do...

It would certainly explain the space eating...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-02  2:35         ` Qu Wenruo
@ 2017-12-02  9:33           ` Tomasz Pala
  2017-12-04  0:34             ` Qu Wenruo
  0 siblings, 1 reply; 32+ messages in thread
From: Tomasz Pala @ 2017-12-02  9:33 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

OK, I seriously need to address that, as during the night I lost
3 GB again:

On Sat, Dec 02, 2017 at 10:35:12 +0800, Qu Wenruo wrote:

>> #  btrfs fi sh /
>> Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
>>         Total devices 2 FS bytes used 44.10GiB
           Total devices 2 FS bytes used 47.28GiB

>> #  btrfs fi usage /
>> Overall:
>>     Used:                         88.19GiB
       Used:                         94.58GiB
>>     Free (estimated):             18.75GiB      (min: 18.75GiB)
       Free (estimated):             15.56GiB      (min: 15.56GiB)
>> 
>> #  btrfs dev usage /
- output not changed

>> #  btrfs fi df /    
>> Data, RAID1: total=51.97GiB, used=43.22GiB
   Data, RAID1: total=51.97GiB, used=46.42GiB
>> System, RAID1: total=32.00MiB, used=16.00KiB
>> Metadata, RAID1: total=2.00GiB, used=895.69MiB
>> GlobalReserve, single: total=131.14MiB, used=0.00B
   GlobalReserve, single: total=135.50MiB, used=0.00B
>> 
>> # df
>> /dev/sda2        64G   45G   19G  71% /
   /dev/sda2        64G   48G   16G  76% /
>> However the difference is on active root fs:
>> 
>> -0/291        24.29GiB      9.77GiB
>> +0/291        15.99GiB     76.00MiB
    0/291        19.19GiB      3.28GiB
> 
> Since you have already showed the size of the snapshots, which hardly
> goes beyond 1G, it may be possible that extent booking is the cause.
> 
> And considering it's all exclusive, defrag may help in this case.

I'm going to try defrag here, but have a bunch of questions first;
as defrag would break CoW sharing, I don't want to defrag files that span
multiple snapshots unless they have huge overhead:
1. is there any switch resulting in 'defrag only exclusive data'?
2. is there any switch resulting in 'defrag only extents fragmented more than X'
   or 'defrag only fragments that could possibly be freed'?
3. I guess there aren't, so how could I accomplish my goal, i.e.
   reclaiming space that was lost due to fragmentation, without breaking
   snapshotted CoW where it would be not only pointless, but actually harmful?
4. How can I prevent this from happening again? All the files that are
   written constantly (stats collector here, PostgreSQL database and
   logs on other machines) are marked with nocow (+C); maybe some new
   attribute to mark a file as autodefrag? +t?

For example, the largest file from stats collector:
     Total   Exclusive  Set shared  Filename
 432.00KiB   176.00KiB   256.00KiB  load/load.rrd

but most of them have 'Set shared'==0.

5. The stats collector has been running from the beginning; judging from the
quota output it was not the issue until something happened. If the problem
was triggered by (guessing) a low-space condition, and it results in even
more space lost, there is a dangerous positive feedback loop, as it makes
any filesystem unstable ("once you run out of space, you won't recover").
Does it mean btrfs is simply not suitable (yet?) for frequent-update usage
patterns, like RRD files?

6. Or maybe some extra steps should be taken just before taking a snapshot?
I guess 'defrag exclusive' would be perfect here - reclaiming space
before it gets locked inside a snapshot.
The rationale behind this is obvious: since snapshot-aware defrag was
removed, allow defragging only the data exclusive to the live subvolume.
This would of course result in partial file defragmentation, but that
should be enough for pathological cases like mine.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-02  0:53   ` Tomasz Pala
  2017-12-02  1:05     ` Qu Wenruo
  2017-12-02  2:56     ` Duncan
@ 2017-12-02 16:28     ` Tomasz Pala
  2017-12-02 17:18       ` Tomasz Pala
  2017-12-04  4:58     ` Chris Murphy
  3 siblings, 1 reply; 32+ messages in thread
From: Tomasz Pala @ 2017-12-02 16:28 UTC (permalink / raw)
  To: linux-btrfs

On Fri, 01 Dec 2017 18:57:08 -0800, Duncan wrote:

> OK, is this supposed to be raid1 or single data, because the above shows
> metadata as all raid1, while some data is single tho most is raid1, and
> while old mkfs used to create unused single chunks on raid1 that had to
> be removed manually via balance, those single data chunks aren't unused.

It is supposed to be RAID1; the single data chunks were leftovers from my
previous attempts to gain some space by converting to the single profile.
That failed miserably BTW (would it have been smarter with the "soft"
option?), but I've already managed to clear this up.

> Assuming the intent is raid1, I'd recommend doing...
>
> btrfs balance start -dconvert=raid1,soft /

Yes, this was the way to go. It also reclaimed the 8 GB. I assume the
failed -dconvert=single somehow locked that 8 GB, so btrfs-progs should
be taught to report such a locked-out region. You've already noted that
the single-profile data itself occupied much less.

So this was the first issue; the second is a running overhead that
accumulates over time. Since yesterday, when I had 19 GB free, I've lost
4 GB already. The scenario you've described is very probable:

> btrfs balance start -dusage=N /
[...]
> allocated value toward usage.  I too run relatively small btrfs raid1s
> and would suggest trying N=5, 20, 40, 70, until the spread between

There was no further effect above N=10 (for both dusage and musage).

> consuming your space either, as I'd suspect they might if the problem were
> for instance atime updates, so while noatime is certainly recommended and

I have been using noatime by default for years, so that's not the source of the problem here.

> The other possibility that comes to mind here has to do with btrfs COW
> write patterns...

> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
[...]
> Now make various small changes to the file, say under 16 KiB each.  These
> will each be COWed elsewhere as one might expect. by default 16 KiB at
> a time I believe (might be 4 KiB, as it was back when the default leaf

I've got ~500 small files (100-500 kB) that are partially updated at regular
intervals:

# du -Lc **/*.rrd | tail -n1
105M    total

> But here's the kicker.  Even without a snapshot locking that original 100
> MiB extent in place, if even one of the original 16 KiB blocks isn't
> rewritten, that entire 100 MiB extent will remain locked in place, as the
> original 16 KiB blocks that have been changed and thus COWed elsewhere
> aren't freed one at a time, the full 100 MiB extent only gets freed, all
> at once, once no references to it remain, which means once that last
> block of the extent gets rewritten.
>
> So perhaps you have a pattern where files of several MiB get mostly
> rewritten, taking more space for the rewrites due to COW, but one or
> more blocks remain as originally written, locking the original extent
> in place at its full size, thus taking twice the space of the original
> file.
>
> Of course worst-case is rewrite the file minus a block, then rewrite
> that minus a block, then rewrite... in which case the total space
> usage will end up being several times the size of the original file!
>
> Luckily few people have this sort of usage pattern, but if you do...
>
> It would certainly explain the space eating...

Has anyone investigated how this relates to RRD rewrites? I don't use
rrdcached, and never thought that 100 MB of data might trash an entire
filesystem...

best regards,
-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-02 16:28     ` Tomasz Pala
@ 2017-12-02 17:18       ` Tomasz Pala
  2017-12-03  1:45         ` Duncan
  0 siblings, 1 reply; 32+ messages in thread
From: Tomasz Pala @ 2017-12-02 17:18 UTC (permalink / raw)
  To: linux-btrfs

On Sat, Dec 02, 2017 at 17:28:12 +0100, Tomasz Pala wrote:

>> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
> [...]
>> Now make various small changes to the file, say under 16 KiB each.  These
>> will each be COWed elsewhere as one might expect. by default 16 KiB at
>> a time I believe (might be 4 KiB, as it was back when the default leaf
> 
> I got ~500 small files (100-500 kB) updated partially in regular
> intervals:
> 
> # du -Lc **/*.rrd | tail -n1
> 105M    total
> 
>> But here's the kicker.  Even without a snapshot locking that original 100
>> MiB extent in place, if even one of the original 16 KiB blocks isn't
>> rewritten, that entire 100 MiB extent will remain locked in place, as the
>> original 16 KiB blocks that have been changed and thus COWed elsewhere
>> aren't freed one at a time, the full 100 MiB extent only gets freed, all
>> at once, once no references to it remain, which means once that last
>> block of the extent gets rewritten.

OTOH - should this happen with nodatacow files? As I mentioned before,
these files are chattr'ed +C (however, this was not their initial state
due to https://bugzilla.kernel.org/show_bug.cgi?id=189671 ).
Am I wrong in thinking that in such a case they should occupy twice their
size at most? Or is there some tool that could show me the real space
wasted by a file, including extent counts etc.?
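
For the extent counts at least there are existing tools (a sketch; the
path is just an example, and note that FIEMAP shows fragmentation but not
how much of a pinned extent is no longer referenced):

# filefrag -v /path/to/load.rrd    # per-extent layout and total extent count
# btrfs fi du /path/to/load.rrd    # shared vs exclusive bytes for the file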

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-02 17:18       ` Tomasz Pala
@ 2017-12-03  1:45         ` Duncan
  2017-12-03 10:47           ` Adam Borowski
  2017-12-10 10:49           ` Tomasz Pala
  0 siblings, 2 replies; 32+ messages in thread
From: Duncan @ 2017-12-03  1:45 UTC (permalink / raw)
  To: linux-btrfs

Tomasz Pala posted on Sat, 02 Dec 2017 18:18:19 +0100 as excerpted:

> On Sat, Dec 02, 2017 at 17:28:12 +0100, Tomasz Pala wrote:
> 
>>> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
>> [...]
>>> Now make various small changes to the file, say under 16 KiB each.  These
>>> will each be COWed elsewhere as one might expect. by default 16 KiB at
>>> a time I believe (might be 4 KiB, as it was back when the default leaf
>> 
>> I got ~500 small files (100-500 kB) updated partially in regular
>> intervals:
>> 
>> # du -Lc **/*.rrd | tail -n1
>> 105M    total

FWIW, I've no idea what rrd files, or rrdcached (from the grandparent post)
are (other than that a quick google suggests that it's...
round-robin-database... and the database bit alone sounds bad in this context
as database-file rewrites are known to be a worst-case for cow-based
filesystems), but it sounds like you suspect that they have this
rewrite-most pattern that could explain your problem...

>>> But here's the kicker.  Even without a snapshot locking that original 100
>>> MiB extent in place, if even one of the original 16 KiB blocks isn't
>>> rewritten, that entire 100 MiB extent will remain locked in place, as the
>>> original 16 KiB blocks that have been changed and thus COWed elsewhere
>>> aren't freed one at a time, the full 100 MiB extent only gets freed, all
>>> at once, once no references to it remain, which means once that last
>>> block of the extent gets rewritten.
> 
> OTOH - should this happen with nodatacow files? As I mentioned before,
> these files are chattred +C (however this was not their initial state
> due to https://bugzilla.kernel.org/show_bug.cgi?id=189671 ).
> Am I wrong thinking, that in such case they should occupy twice their
> size maximum? Or maybe there is some tool that could show me the real
> space wasted by file, including extents count etc?

Nodatacow... isn't as simple as the name might suggest.

For one thing, snapshots depend on COW and lock the extents they reference
in-place, so while a file might be set nocow and that setting is retained,
the first write to a block after a snapshot *MUST* cow that block... because
the snapshot has the existing version referenced and it can't change without
changing the snapshot as well, and that would of course defeat the purpose
of snapshots.

Tho the attribute is retained and further writes to the same already cowed
block won't cow it again.

FWIW, on this list that behavior is often referred to as cow1, cow only the
first time that a block is written after a snapshot locks the previous
version in place.

The effect of cow1 depends on the frequency and extent of block rewrites vs.
the frequency of snapshots of the subvolume they're on.  As should be
obvious if you think about it, once you've done the cow1, further rewrites
to the same block before further snapshots won't cow further, so if only
a few blocks are repeatedly rewritten multiple times between snapshots, the
effect should be relatively small.  Similarly if snapshots happen far more
frequently than block rewrites, since in that case most of the snapshots
won't have anything changed (for that file anyway) since the last one.

However, if most of the file gets rewritten between snapshots and the
snapshot frequency is often enough to be a major factor, the effect can be
practically as bad as if the file weren't nocow in the first place.

If I knew a bit more about rrd's rewrite pattern... and your snapshot
pattern...


Second, as you alluded, for btrfs files must be set nocow before anything
is written to them.  Quoting the chattr (1) manpage:  "If it is set on a
file which already has data blocks, it is undefined when the blocks
assigned to the file will be fully stable."

Not being a dev I don't read the code to know what that means in practice,
but it could well be effectively cow1, which would yield the maximum 2X
size you assumed.

But I think it's best to take "undefined" at face value, and assume the
worst case of "no effect at all" for size-calculation purposes, unless you
really /did/ set it at file creation, before the file had content.

And the easiest way to do /that/, and something that might be worthwhile
doing anyway if you think unreclaimed still referenced extents are your
problem, is to set the nocow flag on the /directory/, then copy the
files into it, taking care to actually create them new, that is, use
--reflink=never or copy the files to a different filesystem, perhaps
tmpfs, and back, so they /have/ to be created new.  Of course with the
rewriter (rrdcached, apparently) shut down for the process.

Then, once the files are safely back in place and the filesystem synced
so the data is actually on disk, you can delete the old copies (which
will continue to serve as backups until then), and sync the filesystem
again.
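
A minimal sketch of that procedure (directory names are illustrative, the
collector/rrdcached is stopped first, and --reflink=never needs a cp that
supports it - otherwise bounce the files through another filesystem as
described above):

# mkdir /var/lib/rrd.new && chattr +C /var/lib/rrd.new      # new files created inside inherit nocow
# cp -a --reflink=never /var/lib/rrd/. /var/lib/rrd.new/    # force genuinely new extents
# sync
# mv /var/lib/rrd /var/lib/rrd.old && mv /var/lib/rrd.new /var/lib/rrd
# sync
# rm -rf /var/lib/rrd.old && sync                           # only after verifying the new copies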

While snapshots will of course continue to keep extents they reference
locked, for unsnapshotted files at least, this process should clear up
any still referenced but partially unused extents for those files, thus
clearing up the problem if this is it.  After deleting the original
copies to free the space and syncing, you can check to see.


Meanwhile, /because/ nocow has these complexities along with others (nocow
automatically turns off data checksumming and compression for the files
too), and because they nullify some of the big reasons people might
choose btrfs in the first place, I actually don't recommend setting
nocow at all -- if usage is such that a file needs nocow,
my thinking is that btrfs isn't a particularly good hosting choice for
that file in the first place; a more traditional rewrite-in-place
filesystem is likely to be a better fit.

OTOH, it's also quite possible that people chose btrfs at least partly
for other reasons, say the "storage pool" qualities, and would rather
just shove everything on a single btrfs "pool" and not have to worry
about it, however much that sets off my own "all eggs in one basket"
risk alert alarms. [shrug]  For them, having to separate all their
nocow stuff into a different non-btrfs filesystem would defeat their
purpose, and they'd rather just deal with all the complexities of
nocow.

For this sort of usage, we actually have reports that first setting up
the nocow dirs and ensuring that files inherit the nocow at creation,
so they /are/ actually nocow, then setting up snapshotting at a sane
schedule, /and/ setting up a periodic (perhaps weekly or monthly)
defrag of their nocow files to eliminate the fragmentation caused
by the snapshot-triggered cow1, actually works reasonably well.
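
For instance (a sketch; the path and schedule are illustrative):

# crontab entry: weekly defrag of the nocow'd stats directory
0 4 * * 0  btrfs filesystem defragment -r /var/lib/collectd/rrd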

Of course if snapshots are being kept effectively "forever", that'll
make space usage even /worse/, because defrag breaks reflinks and
unshares the data, but arguably, that's doing it wrong, because
snapshots are /not/ backups, and what might be temporarily snapshotted
should eventually be real-backed-up, allowing one to delete those
snapshots, thus freeing the space they took.  Of course if you're
using btrfs send/receive to do those backups, keeping around
selected "parent" snapshots to reference with send is useful,
but choosing a new "send parent" at least every quarter, say, and
deleting the old ones, does at least put a reasonable limit on
the time such snapshots need to be kept on the operational filesystem.

And since anything beyond double-digit snapshot counts of the same subvolume
creates scaling issues for btrfs balance, snapshot deletion, etc, a regular
snapshot thinning schedule combined with a cap on the age of the oldest
one nicely helps there as well. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-03  1:45         ` Duncan
@ 2017-12-03 10:47           ` Adam Borowski
  2017-12-04  5:11             ` Chris Murphy
  2017-12-10 10:49           ` Tomasz Pala
  1 sibling, 1 reply; 32+ messages in thread
From: Adam Borowski @ 2017-12-03 10:47 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On Sun, Dec 03, 2017 at 01:45:45AM +0000, Duncan wrote:
> Tomasz Pala posted on Sat, 02 Dec 2017 18:18:19 +0100 as excerpted:
> >> I got ~500 small files (100-500 kB) updated partially in regular
> >> intervals:
> >> 
> >> # du -Lc **/*.rrd | tail -n1
> >> 105M    total
> 
> FWIW, I've no idea what rrd files, or rrdcached (from the grandparent post)
> are (other than that a quick google suggests that it's...
> round-robin-database...

Basically: preallocate a file whose size doesn't change after that.  Every
few minutes, write several bytes into the file, slowly advancing.

This is indeed the worst possible case for btrfs, and nocow doesn't help in
the slightest, as the database doesn't wrap around before a typical snapshot
interval.

> Meanwhile, /because/ nocow has these complexities along with others (nocow
> automatically turns off data checksumming and compression for the files
> too), and the fact that they nullify some of the big reasons people might
> choose btrfs in the first place, I actually don't recommend setting
> nocow in the first place -- if usage is such than a file needs nocow,
> my thinking is that btrfs isn't a particularly good hosting choice for
> that file in the first place, a more traditional rewrite-in-place
> filesystem is likely to be a better fit.

I'd say that the only good use for nocow is "I wish I had placed this file
on a non-btrfs filesystem, but it'd be too much hassle to repartition".

If you snapshot nocow at all, you get the worst of both worlds.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Mozilla's Hippocritic Oath: "Keep trackers off your trail"
⣾⠁⢰⠒⠀⣿⡁ blah blah evading "tracking technology" blah blah
⢿⡄⠘⠷⠚⠋⠀ "https://click.e.mozilla.org/?qs=e7bb0dcf14b1013fca3820..."
⠈⠳⣄⠀⠀⠀⠀ (same for all links)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-02  9:33           ` Tomasz Pala
@ 2017-12-04  0:34             ` Qu Wenruo
  2017-12-10 11:27               ` Tomasz Pala
  0 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2017-12-04  0:34 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 4456 bytes --]



On 2017-12-02 17:33, Tomasz Pala wrote:
> OK, I seriously need to address that, as during the night I lost
> 3 GB again:
> 
> On Sat, Dec 02, 2017 at 10:35:12 +0800, Qu Wenruo wrote:
> 
>>> #  btrfs fi sh /
>>> Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
>>>         Total devices 2 FS bytes used 44.10GiB
>            Total devices 2 FS bytes used 47.28GiB
> 
>>> #  btrfs fi usage /
>>> Overall:
>>>     Used:                         88.19GiB
>        Used:                         94.58GiB
>>>     Free (estimated):             18.75GiB      (min: 18.75GiB)
>        Free (estimated):             15.56GiB      (min: 15.56GiB)
>>>
>>> #  btrfs dev usage /
> - output not changed
> 
>>> #  btrfs fi df /    
>>> Data, RAID1: total=51.97GiB, used=43.22GiB
>    Data, RAID1: total=51.97GiB, used=46.42GiB
>>> System, RAID1: total=32.00MiB, used=16.00KiB
>>> Metadata, RAID1: total=2.00GiB, used=895.69MiB
>>> GlobalReserve, single: total=131.14MiB, used=0.00B
>    GlobalReserve, single: total=135.50MiB, used=0.00B
>>>
>>> # df
>>> /dev/sda2        64G   45G   19G  71% /
>    /dev/sda2        64G   48G   16G  76% /
>>> However the difference is on active root fs:
>>>
>>> -0/291        24.29GiB      9.77GiB
>>> +0/291        15.99GiB     76.00MiB
>     0/291        19.19GiB      3.28GiB
>>
>> Since you have already showed the size of the snapshots, which hardly
>> goes beyond 1G, it may be possible that extent booking is the cause.
>>
>> And considering it's all exclusive, defrag may help in this case.
> 
> I'm going to try defrag here, but have a bunch of questions before;
> as defrag would break CoW, I don't want to defrag files that span
> multiple snapshots, unless they have huge overhead:
> 1. is there any switch resulting in 'defrag only exclusive data'?

IIRC, no.

> 2. is there any switch resulting in 'defrag only extents fragmented more than X'
>    or 'defrag only fragments that would be possibly freed'?

No, either.

> 3. I guess there aren't, so how could I accomplish my target, i.e.
>    reclaiming space that was lost due to fragmentation, without breaking
>    spanshoted CoW where it would be not only pointless, but actually harmful?

What about using an old kernel, like v4.13?

> 4. How can I prevent this from happening again? All the files, that are
>    written constantly (stats collector here, PostgreSQL database and
>    logs on other machines), are marked with nocow (+C); maybe some new
>    attribute to mark file as autodefrag? +t?

Unfortunately, nocow only works if there is no other subvolume/inode
referring to the extent.

That's to say, if you're using snapshots, then NOCOW won't help as much
as you expect, but it's still much better than normal data CoW.

> 
> For example, the largest file from stats collector:
>      Total   Exclusive  Set shared  Filename
>  432.00KiB   176.00KiB   256.00KiB  load/load.rrd
> 
> but most of them has 'Set shared'==0.
> 
> 5. The stats collector is running from the beginning, according to the
> quota output was not the issue since something happened. If the problem
> was triggered by (guessing) low space condition, and it results in even
> more space lost, there is positive feedback that is dangerous, as makes
> any filesystem unstable ("once you run out of space, you won't recover").
> Does it mean btrfs is simply not suitable (yet?) for frequent updates usage
> pattern, like RRD files?

Hard to say what the cause is.

But in my understanding, btrfs is not suitable for such a conflicting
situation, where you want to have snapshots of frequent partial updates.

IIRC, btrfs is better for use cases where either updates are less frequent,
or an update replaces the whole file, not just part of it.

So btrfs is good for root filesystem paths like /etc and /usr (and /bin and
/lib, which point to /usr/bin and /usr/lib), but not for /var or /run.

> 
> 6. Or maybe some extra steps just before taking snapshot should be taken?
> I guess 'defrag exclusive' would be perfect here - reclaiming space
> before it is being locked inside snapshot.

Yes, this sounds perfectly reasonable.

Thanks,
Qu

> Rationale behind this is obvious: since the snapshot-aware defrag was
> removed, allow to defrag snapshot exclusive data only.
> This would of course result in partial file defragmentation, but that
> should be enough for pathological cases like mine.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-02  0:53   ` Tomasz Pala
                       ` (2 preceding siblings ...)
  2017-12-02 16:28     ` Tomasz Pala
@ 2017-12-04  4:58     ` Chris Murphy
  3 siblings, 0 replies; 32+ messages in thread
From: Chris Murphy @ 2017-12-04  4:58 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: Btrfs BTRFS

On Fri, Dec 1, 2017 at 5:53 PM, Tomasz Pala <gotar@polanet.pl> wrote:

> #  btrfs fi usage /
> Overall:
>     Device size:                 128.00GiB
>     Device allocated:            117.19GiB
>     Device unallocated:           10.81GiB
>     Device missing:                  0.00B
>     Used:                        103.56GiB
>     Free (estimated):             11.19GiB      (min: 11.14GiB)
>     Data ratio:                       1.98
>     Metadata ratio:                   2.00
>     Global reserve:              146.08MiB      (used: 0.00B)
>
> Data,single: Size:1.19GiB, Used:1.18GiB
>    /dev/sda2       1.07GiB
>    /dev/sdb2     132.00MiB

This is asking for trouble. Both devices hold single-copy data chunks;
if the drive holding a given chunk dies, you lose that data. But the
metadata referring to those files will survive, and Btrfs will keep
complaining about them at every scrub until they're all deleted - there is
no command that makes this easy. You'd have to scrape the scrub output,
which includes paths to the missing files, and script something to delete
them all.

You should convert this with something like 'btrfs balance start
-dconvert=raid1,soft <mountpoint>'



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-03 10:47           ` Adam Borowski
@ 2017-12-04  5:11             ` Chris Murphy
  0 siblings, 0 replies; 32+ messages in thread
From: Chris Murphy @ 2017-12-04  5:11 UTC (permalink / raw)
  To: Adam Borowski; +Cc: Btrfs BTRFS

On Sun, Dec 3, 2017 at 3:47 AM, Adam Borowski <kilobyte@angband.pl> wrote:

> I'd say that the only good use for nocow is "I wish I have placed this file
> on a non-btrfs, but it'd be too much hassle to repartition".
>
> If you snapshot nocow at all, you get the worst of both worlds.

I think it's better to have the option than not to have it, but for the
regular Joe user I think it's a problem. And that's why I'm not such a
big fan of systemd-journald using chattr +C on journals when on Btrfs,
by default. I wouldn't mind it if systemd also made /var/log/journal/
a subvolume, just like it automatically creates /var/lib/machines as
a subvolume. That way, by default, /var/log/journal would be immune to
snapshots.
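
One can already set that up by hand (a sketch; assumes journald is stopped
and the default paths are in use):

# mv /var/log/journal /var/log/journal.old
# btrfs subvolume create /var/log/journal
# chattr +C /var/log/journal                   # optional: nocow for newly created journals
# cp -a /var/log/journal.old/. /var/log/journal/
# rm -rf /var/log/journal.old

Snapshots of the parent subvolume then simply stop at the /var/log/journal
subvolume boundary.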

Or alternatively a rework of how journals are written to be more COW friendly.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* How exclusive in parent qgroup is computed? (was: Re: exclusive subvolume space missing)
  2017-12-02  0:27 ` Qu Wenruo
  2017-12-02  1:23   ` Tomasz Pala
@ 2017-12-05 18:47   ` Andrei Borzenkov
  2017-12-05 23:57     ` How exclusive in parent qgroup is computed? Qu Wenruo
  1 sibling, 1 reply; 32+ messages in thread
From: Andrei Borzenkov @ 2017-12-05 18:47 UTC (permalink / raw)
  To: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 3033 bytes --]

02.12.2017 03:27, Qu Wenruo wrote:
> 
> That's the difference between how sub show and quota works.
> 
> For quota, it's per-root owner check.
> Means even a file extent is shared between different inodes, if all
> inodes are inside the same subvolume, it's counted as exclusive.
> And if any of the file extent belongs to other subvolume, then it's
> counted as shared.
> 

Could you also explain how a parent qgroup computes exclusive space? I.e.

10:~ # mkfs -t btrfs -f /dev/sdb1
btrfs-progs v4.13.3
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM /dev/sdb1 (1023.00MiB) ...
Label:              (null)
UUID:               b9b0643f-a248-4667-9e69-acf5baaef05b
Node size:          16384
Sector size:        4096
Filesystem size:    1023.00MiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         DUP              51.12MiB
  System:           DUP               8.00MiB
SSD detected:       no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1  1023.00MiB  /dev/sdb1

10:~ # mount -t btrfs /dev/sdb1 /mnt
10:~ # cd /mnt
10:/mnt # btrfs quota enable .
10:/mnt # btrfs su cre sub1
Create subvolume './sub1'
10:/mnt # dd if=/dev/urandom of=sub1/file1 bs=1K count=1024
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00833739 s, 126 MB/s
10:/mnt # dd if=/dev/urandom of=sub1/file2 bs=1K count=1024
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0179272 s, 58.5 MB/s
10:/mnt # btrfs subvolume snapshot sub1 sub2
Create a snapshot of 'sub1' in './sub2'
10:/mnt # dd if=/dev/urandom of=sub2/file2 bs=1K count=1024 conv=notrunc
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0348762 s, 30.1 MB/s
10:/mnt # btrfs qgroup show --sync -p .
qgroupid         rfer         excl parent
--------         ----         ---- ------
0/5          16.00KiB     16.00KiB ---
0/256         2.02MiB      1.02MiB ---
0/257         2.02MiB      1.02MiB ---

So far so good. This is expected: each subvolume has 1MiB shared and
1MiB exclusive.

10:/mnt # btrfs qgroup create 22/7 /mnt
10:/mnt # btrfs qgroup assign --rescan 0/256 22/7 /mnt
Quota data changed, rescan scheduled
10:/mnt # btrfs quota rescan -s /mnt
no rescan operation in progress
10:/mnt # btrfs qgroup assign --rescan 0/257 22/7 /mnt
Quota data changed, rescan scheduled
10:/mnt # btrfs quota rescan -s /mnt
no rescan operation in progress
10:/mnt # btrfs qgroup show --sync -p .
qgroupid         rfer         excl parent
--------         ----         ---- ------
0/5          16.00KiB     16.00KiB ---
0/256         2.02MiB      1.02MiB 22/7
0/257         2.02MiB      1.02MiB 22/7
22/7          3.03MiB      3.03MiB ---
10:/mnt #

Oops. The total for 22/7 is correct (1MiB shared + 2 * 1MiB exclusive), but
why is all the data treated as exclusive here? It does not match your
explanation ...


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: How exclusive in parent qgroup is computed?
  2017-12-05 18:47   ` How exclusive in parent qgroup is computed? (was: Re: exclusive subvolume space missing) Andrei Borzenkov
@ 2017-12-05 23:57     ` Qu Wenruo
  0 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2017-12-05 23:57 UTC (permalink / raw)
  To: Andrei Borzenkov, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 3755 bytes --]



On 2017-12-06 02:47, Andrei Borzenkov wrote:
> 02.12.2017 03:27, Qu Wenruo wrote:
>>
>> That's the difference between how sub show and quota works.
>>
>> For quota, it's per-root owner check.
>> Means even a file extent is shared between different inodes, if all
>> inodes are inside the same subvolume, it's counted as exclusive.
>> And if any of the file extent belongs to other subvolume, then it's
>> counted as shared.
>>
> 
> Could you also explain how parent qgroup computes exclusive space? I.e.
> 
> 10:~ # mkfs -t btrfs -f /dev/sdb1
> btrfs-progs v4.13.3
> See http://btrfs.wiki.kernel.org for more information.
> 
> Performing full device TRIM /dev/sdb1 (1023.00MiB) ...
> Label:              (null)
> UUID:               b9b0643f-a248-4667-9e69-acf5baaef05b
> Node size:          16384
> Sector size:        4096
> Filesystem size:    1023.00MiB
> Block group profiles:
>   Data:             single            8.00MiB
>   Metadata:         DUP              51.12MiB
>   System:           DUP               8.00MiB
> SSD detected:       no
> Incompat features:  extref, skinny-metadata
> Number of devices:  1
> Devices:
>    ID        SIZE  PATH
>     1  1023.00MiB  /dev/sdb1
> 
> 10:~ # mount -t btrfs /dev/sdb1 /mnt
> 10:~ # cd /mnt
> 10:/mnt # btrfs quota enable .
> 10:/mnt # btrfs su cre sub1
> Create subvolume './sub1'
> 10:/mnt # dd if=/dev/urandom of=sub1/file1 bs=1K count=1024
> 1024+0 records in
> 1024+0 records out
> 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00833739 s, 126 MB/s
> 10:/mnt # dd if=/dev/urandom of=sub1/file2 bs=1K count=1024
> 1024+0 records in
> 1024+0 records out
> 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0179272 s, 58.5 MB/s
> 10:/mnt # btrfs subvolume snapshot sub1 sub2
> Create a snapshot of 'sub1' in './sub2'
> 10:/mnt # dd if=/dev/urandom of=sub2/file2 bs=1K count=1024 conv=notrunc
> 1024+0 records in
> 1024+0 records out
> 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0348762 s, 30.1 MB/s
> 10:/mnt # btrfs qgroup show --sync -p .
> qgroupid         rfer         excl parent
> --------         ----         ---- ------
> 0/5          16.00KiB     16.00KiB ---
> 0/256         2.02MiB      1.02MiB ---
> 0/257         2.02MiB      1.02MiB ---
> 
> So far so good. This is expected, each subvolume has 1MiB shared and
> 1MiB exclusive.
> 
> 10:/mnt # btrfs qgroup create 22/7 /mnt
> 10:/mnt # btrfs qgroup assign --rescan 0/256 22/7 /mnt
> Quota data changed, rescan scheduled
> 10:/mnt # btrfs quota rescan -s /mnt
> no rescan operation in progress
> 10:/mnt # btrfs qgroup assign --rescan 0/257 22/7 /mnt
> Quota data changed, rescan scheduled
> 10:/mnt # btrfs quota rescan -s /mnt
> no rescan operation in progress
> 10:/mnt # btrfs qgroup show --sync -p .
> qgroupid         rfer         excl parent
> --------         ----         ---- ------
> 0/5          16.00KiB     16.00KiB ---
> 0/256         2.02MiB      1.02MiB 22/7
> 0/257         2.02MiB      1.02MiB 22/7
> 22/7          3.03MiB      3.03MiB ---
> 10:/mnt #
> 
> Oops. Total for 22/7 is correct (1MiB shared + 2 * 1MiB exclusive) but
> why all data is treated as exclusive here? It does not match your
> explanation ...


Why? The qgroup calculates it correctly, without problem.

All extents in subvolumes 256 and 257 belong to qgroup 22/7.
The 1M of shared data all belongs to qgroup 22/7, so it's counted as exclusive.

Not to mention the already-exclusive extents from 0/256 and 0/257.


The name "exclusive" just means that all referencers of an extent are
inside the qgroup.
So the explanation is still correct.


Please read btrfs-quota(8) 'SUBVOLUME QUOTA GROUPS' section for more
details.

Thanks,
Qu

> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-03  1:45         ` Duncan
  2017-12-03 10:47           ` Adam Borowski
@ 2017-12-10 10:49           ` Tomasz Pala
  1 sibling, 0 replies; 32+ messages in thread
From: Tomasz Pala @ 2017-12-10 10:49 UTC (permalink / raw)
  To: Duncan

On Sun, Dec 03, 2017 at 01:45:45 +0000, Duncan wrote:

> OTOH, it's also quite possible that people chose btrfs at least partly
> for other reasons, say the "storage pool" qualities, and would rather

Well, to name some:

1. filesystem-level backups via snapshot/send/receive - much cleaner and
faster than rsync or other old-fashioned methods. This obviously requires the CoW-once feature;

- caveat: for btrfs-killing usage patterns all the snapshots but the
  last one need to be removed;


2. block-level checksums with RAID1-awareness - in contrast to mdadm
RAIDx, which returns an arbitrary copy of the data from the underlying
devices, this is much less susceptible to bit rot;

- caveats: requires CoW enabled, RAID1 reading is dumb (even/odd PID
  instead of real balancing), no N-way mirroring nor write-mostly flag.


3. compression - there is no real alternative, however:

- caveat: requires CoW enabled, which makes it unsuitable for
  ...systemd journals, which compress at a great ratio (ca. 1:10),
  and for various databases, as they will be nocowed sooner or later;


4. storage pools, which you've mentioned - they are actually not much
superior to an LVM-based approach; until one can create a subvolume with a
different profile (e.g. 'disable RAID1 for /var/log/journal') it is still
better to create separate filesystems, meaning one has to use LVM or (the
hard way) partitioning.


Some of the drawbacks above are inherent to CoW and so shouldn't be
expected to be fixed internally, as the needs are conflicting, but their
impact might be nullified by some housekeeping.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-04  0:34             ` Qu Wenruo
@ 2017-12-10 11:27               ` Tomasz Pala
  2017-12-10 15:49                 ` Tomasz Pala
  2017-12-10 23:44                 ` Qu Wenruo
  0 siblings, 2 replies; 32+ messages in thread
From: Tomasz Pala @ 2017-12-10 11:27 UTC (permalink / raw)
  To: linux-btrfs

On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote:

>> 1. is there any switch resulting in 'defrag only exclusive data'?
> 
> IIRC, no.

I have found a directory - the pam_abl databases - which occupies 10 MB
(yes, TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after
defrag. After defragging, the files were not snapshotted again and I've
lost 3.6 GB again, so this is fully reproducible.
There are 7 files, one of which accounts for 99% of the space (10 MB). None
of them has nocow set, so they're riding all-btrfs.

I could debug something before I clean this up; is there anything you
want me to check or know about the files?

The fragmentation impact is HUGE here; a 1000:1 ratio is almost a DoS
condition which could be triggered by a malicious user within a few hours
or faster - I've lost 3.6 GB during the night with a reasonably small
amount of writes, and I guess it might be possible to trash an entire
filesystem within 10 minutes if doing this on purpose.

>> 3. I guess there aren't, so how could I accomplish my target, i.e.
>>    reclaiming space that was lost due to fragmentation, without breaking
>>    spanshoted CoW where it would be not only pointless, but actually harmful?
> 
> What about using old kernel, like v4.13?

Unfortunately (I guess you had 3.13 in mind), I need the new ones and
will be pushing towards 4.14.

>> 4. How can I prevent this from happening again? All the files, that are
>>    written constantly (stats collector here, PostgreSQL database and
>>    logs on other machines), are marked with nocow (+C); maybe some new
>>    attribute to mark file as autodefrag? +t?
> 
> Unfortunately, nocow only works if there is no other subvolume/inode
> referring to it.

This shouldn't be the case for me anymore after defrag (== breaking the links).
I guess there's no easy way to check the refcounts of the blocks?

> But in my understanding, btrfs is not suitable for such conflicting
> situation, where you want to have snapshots of frequent partial updates.
> 
> IIRC, btrfs is better for use case where either update is less frequent,
> or update is replacing the whole file, not just part of it.
> 
> So btrfs is good for root filesystem like /etc /usr (and /bin /lib which
> is pointing to /usr/bin and /usr/lib) , but not for /var or /run.

That is consistent with my conclusions after 2 years on btrfs;
however, I didn't expect a single file to eat 1000 times more space than
it should...


I wonder how many other filesystems were trashed like this - I'm short
~10 GB on another system, and many other users might be affected by this
(hence the Internet stories about btrfs running out of space).

It is not a problem that I need to defrag a file; the problem is that I don't know:
1. whether I need to defrag,
2. *what* I should defrag,
nor do I have a tool that would defrag smartly - only the exclusive data or,
in general, only the blocks worth defragging, i.e. where the space released
from extents is greater than the space lost to inter-snapshot duplication.

I can't just defrag the entire filesystem since it breaks the links with
snapshots. This change was a real deal-breaker here...
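
As a rough interim heuristic for picking candidates (a sketch only; the
extent count shows fragmentation, not how much of a pinned extent is
actually wasted):

# find /var -xdev -type f -size +1M > /tmp/bigfiles
# while read -r f; do echo "$(filefrag "$f" | awk '{print $(NF-2)}') $f"; done < /tmp/bigfiles | sort -rn | head -20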

Any way to feed the deduplication code with snapshots, maybe? The
directories and files are in the same layout, so this could be fast-tracked
to check and deduplicate.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-10 11:27               ` Tomasz Pala
@ 2017-12-10 15:49                 ` Tomasz Pala
  2017-12-10 23:44                 ` Qu Wenruo
  1 sibling, 0 replies; 32+ messages in thread
From: Tomasz Pala @ 2017-12-10 15:49 UTC (permalink / raw)
  To: linux-btrfs

On Sun, Dec 10, 2017 at 12:27:38 +0100, Tomasz Pala wrote:

> I have found a directory - pam_abl databases, which occupy 10 MB (yes,
> TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after

#  df            
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        64G   61G  2.8G  96% /

#  btrfs fi du .
     Total   Exclusive  Set shared  Filename
     0.00B       0.00B           -  ./1/__db.register
  10.00MiB    10.00MiB           -  ./1/log.0000000001
  16.00KiB       0.00B           -  ./1/hosts.db
  16.00KiB       0.00B           -  ./1/users.db
 168.00KiB       0.00B           -  ./1/__db.001
  40.00KiB       0.00B           -  ./1/__db.002
  44.00KiB       0.00B           -  ./1/__db.003
  10.28MiB    10.00MiB           -  ./1
     0.00B       0.00B           -  ./__db.register
  16.00KiB    16.00KiB           -  ./hosts.db
  16.00KiB    16.00KiB           -  ./users.db
  10.00MiB    10.00MiB           -  ./log.0000000013
     0.00B       0.00B           -  ./__db.001
     0.00B       0.00B           -  ./__db.002
     0.00B       0.00B           -  ./__db.003
  20.31MiB    20.03MiB   284.00KiB  .

#  btrfs fi defragment log.0000000013 
#  df
/dev/sda2        64G   54G  9.4G  86% /


6.6 GB / 10 MB = 660:1 overhead within 1 day of uptime.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-10 11:27               ` Tomasz Pala
  2017-12-10 15:49                 ` Tomasz Pala
@ 2017-12-10 23:44                 ` Qu Wenruo
  2017-12-11  0:24                   ` Qu Wenruo
  2017-12-11 11:40                   ` Tomasz Pala
  1 sibling, 2 replies; 32+ messages in thread
From: Qu Wenruo @ 2017-12-10 23:44 UTC (permalink / raw)
  To: Tomasz Pala, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 5357 bytes --]



On 2017-12-10 19:27, Tomasz Pala wrote:
> On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote:
> 
>>> 1. is there any switch resulting in 'defrag only exclusive data'?
>>
>> IIRC, no.
> 
> I have found a directory - pam_abl databases, which occupy 10 MB (yes,
> TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after
> defrag. After defragging files were not snapshotted again and I've lost
> 3.6 GB again, so I got this fully reproducible.
> There are 7 files, one of which is 99% of the space (10 MB). None of
> them has nocow set, so they're riding all-btrfs.
> 
> I could debug something before I'll clean this up, is there anything you
> want to me to check/know about the files?

The fiemap result along with the btrfs dump-tree -t2 result.

Neither output contains anything related to file or directory names, only
some "meaningless" bytenrs, so it should be completely OK to share them.

> 
> The fragmentation impact is HUGE here, 1000-ratio is almost a DoS
> condition which could be triggered by malicious user during a few hours
> or faster

You won't want to hear this:
the biggest ratio in theory is 128M / 4K = 32768, i.e. a single live 4 KiB
block pinning a full 128 MiB extent.

> - I've lost 3.6 GB during the night with reasonably small
> amount of writes, I guess it might be possible to trash entire
> filesystem within 10 minutes if doing this on purpose.

That's a little complex.
To get into such a situation, snapshots must be used and one must know
which file extents are shared and how they're shared.

But yes, it's possible.

On the other hand, XFS, which also supports reflink, handles this quite
well, so I'm wondering if it's possible for btrfs to follow its behavior.

> 
>>> 3. I guess there aren't, so how could I accomplish my target, i.e.
>>>    reclaiming space that was lost due to fragmentation, without breaking
>>>    spanshoted CoW where it would be not only pointless, but actually harmful?
>>
>> What about using old kernel, like v4.13?
> 
> Unfortunately (I guess you had 3.13 on mind), I need the new ones and
> will be pushing towards 4.14.

No, I really mean v4.13.

From btrfs(5):
---
               Warning
               Defragmenting with Linux kernel versions < 3.9 or ≥
3.14-rc2 as
               well as with Linux stable kernel versions ≥ 3.10.31, ≥
3.12.12
               or ≥ 3.13.4 will break up the ref-links of CoW data (for
               example files copied with cp --reflink, snapshots or
               de-duplicated data). This may cause considerable increase of
               space usage depending on the broken up ref-links.
---

> 
>>> 4. How can I prevent this from happening again? All the files, that are
>>>    written constantly (stats collector here, PostgreSQL database and
>>>    logs on other machines), are marked with nocow (+C); maybe some new
>>>    attribute to mark file as autodefrag? +t?
>>
>> Unfortunately, nocow only works if there is no other subvolume/inode
>> referring to it.
> 
> This shouldn't be my case anymore after defrag (==breaking links).
> I guess no easy way to check refcounts of the blocks?

No easy way, unfortunately.
It's either time-consuming (what qgroups do) or complex (manually searching
the trees and doing the backref walk yourself).

> 
>> But in my understanding, btrfs is not suitable for such conflicting
>> situation, where you want to have snapshots of frequent partial updates.
>>
>> IIRC, btrfs is better for use case where either update is less frequent,
>> or update is replacing the whole file, not just part of it.
>>
>> So btrfs is good for root filesystem like /etc /usr (and /bin /lib which
>> is pointing to /usr/bin and /usr/lib) , but not for /var or /run.
> 
> That is something coherent with my conclusions after 2 years on btrfs,
> however I didn't expect a single file to eat 1000 times more space than it
> should...
> 
> 
> I wonder how many other filesystems were trashed like this - I'm short
> of ~10 GB on other system, many other users might be affected by that
> (telling the Internet stories about btrfs running out of space).

Firstly, no other filesystem supports snapshots like this,
so it's pretty hard to get a baseline.

But as I mentioned, XFS supports reflink, which means a file extent can be
shared between several inodes.

From what I heard from the XFS guys, they free any unused space of a
file extent, so XFS should handle this quite well.

But that's quite hard to achieve in btrfs; it needs years of development
at least.

> 
> It is not a problem that I need to defrag a file, the problem is I don't know:
> 1. whether I need to defrag,
> 2. *what* should I defrag
> nor have a tool that would defrag smart - only the exclusive data or, in
> general, the block that are worth defragging if space released from
> extents is greater than space lost on inter-snapshot duplication.
> 
> I can't just defrag entire filesystem since it breaks links with snapshots.
> This change was a real deal-breaker here...

IIRC it would be better to add an option to make defrag snapshot-aware
(don't break snapshot sharing, only defragment the exclusive data).
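
Until such an option exists, a crude approximation with the current tools
(only a heuristic sketch, and the path is just an example): defragment only
files whose data is already almost entirely exclusive according to
btrfs fi du, since for those there is no sharing left to break.

# per-file breakdown: where Exclusive is (nearly) equal to Total, the file
# is not shared with any snapshot, so defragmenting it cannot duplicate data
btrfs filesystem du /var/lib/pam_abl/

Note this only answers "is it shared?"; it says nothing about how much space
pinned extents waste, which is the actual problem here.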

Thanks,
Qu

> 
> Any way to fed the deduplication code with snapshots maybe? There are
> directories and files in the same layout, this could be fast-tracked to
> check and deduplicate.
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-10 23:44                 ` Qu Wenruo
@ 2017-12-11  0:24                   ` Qu Wenruo
  2017-12-11 11:40                   ` Tomasz Pala
  1 sibling, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2017-12-11  0:24 UTC (permalink / raw)
  To: Tomasz Pala, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 5654 bytes --]



On 2017年12月11日 07:44, Qu Wenruo wrote:
> 
> 
> On 2017年12月10日 19:27, Tomasz Pala wrote:
>> On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote:
>>
>>>> 1. is there any switch resulting in 'defrag only exclusive data'?
>>>
>>> IIRC, no.
>>
>> I have found a directory - pam_abl databases, which occupy 10 MB (yes,
>> TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after
>> defrag. After defragging files were not snapshotted again and I've lost
>> 3.6 GB again, so I got this fully reproducible.
>> There are 7 files, one of which is 99% of the space (10 MB). None of
>> them has nocow set, so they're riding all-btrfs.
>>
>> I could debug something before I'll clean this up, is there anything you
>> want to me to check/know about the files?
> 
> fiemap result along with btrfs dump-tree -t2 result.
> 
> Both output has nothing related to file name/dir name, but only some
> "meaningless" bytenr, so it should be completely OK to share them.
> 
>>
>> The fragmentation impact is HUGE here, 1000-ratio is almost a DoS
>> condition which could be triggered by malicious user during a few hours
>> or faster
> 
> You won't want to hear this:
> The biggest ratio in theory is, 128M / 4K = 32768.
> 
>> - I've lost 3.6 GB during the night with reasonably small
>> amount of writes, I guess it might be possible to trash entire
>> filesystem within 10 minutes if doing this on purpose.
> 
> That's a little complex.
> To get into such situation, snapshot must be used and one must know
> which file extent is shared and how it's shared.
> 
> But yes, it's possible.
> 
> While on the other hand, XFS, which also supports reflink, handles it
> quite well, so I'm wondering if it's possible for btrfs to follow its
> behavior.
> 
>>
>>>> 3. I guess there aren't, so how could I accomplish my target, i.e.
>>>>    reclaiming space that was lost due to fragmentation, without breaking
>>>>    spanshoted CoW where it would be not only pointless, but actually harmful?
>>>
>>> What about using old kernel, like v4.13?
>>
>> Unfortunately (I guess you had 3.13 on mind), I need the new ones and
>> will be pushing towards 4.14.
> 
> No, I really mean v4.13.

My fault, I really meant v3.13.

What a stupid error...

> 
> From btrfs(5):
> ---
>                Warning
>                Defragmenting with Linux kernel versions < 3.9 or ≥
> 3.14-rc2 as
>                well as with Linux stable kernel versions ≥ 3.10.31, ≥
> 3.12.12
>                or ≥ 3.13.4 will break up the ref-links of CoW data (for
>                example files copied with cp --reflink, snapshots or
>                de-duplicated data). This may cause considerable increase of
>                space usage depending on the broken up ref-links.
> ---
> 
>>
>>>> 4. How can I prevent this from happening again? All the files, that are
>>>>    written constantly (stats collector here, PostgreSQL database and
>>>>    logs on other machines), are marked with nocow (+C); maybe some new
>>>>    attribute to mark file as autodefrag? +t?
>>>
>>> Unfortunately, nocow only works if there is no other subvolume/inode
>>> referring to it.
>>
>> This shouldn't be my case anymore after defrag (==breaking links).
>> I guess no easy way to check refcounts of the blocks?
> 
> No easy way unfortunately.
> It's either time consuming (used by qgroup) or complex (manually tree
> search and do the backref walk by yourself)
> 
>>
>>> But in my understanding, btrfs is not suitable for such conflicting
>>> situation, where you want to have snapshots of frequent partial updates.
>>>
>>> IIRC, btrfs is better for use case where either update is less frequent,
>>> or update is replacing the whole file, not just part of it.
>>>
>>> So btrfs is good for root filesystem like /etc /usr (and /bin /lib which
>>> is pointing to /usr/bin and /usr/lib) , but not for /var or /run.
>>
>> That is something coherent with my conclusions after 2 years on btrfs,
>> however I didn't expect a single file to eat 1000 times more space than it
>> should...
>>
>>
>> I wonder how many other filesystems were trashed like this - I'm short
>> of ~10 GB on other system, many other users might be affected by that
>> (telling the Internet stories about btrfs running out of space).
> 
> Firstly, no other filesystem supports snapshot.
> So it's pretty hard to get a baseline.
> 
> But as I mentioned, XFS supports reflink, which means file extent can be
> shared between several inodes.
> 
> From the message I got from XFS guys, they free any unused space of a
> file extent, so it should handle it quite well.
> 
> But it's quite a hard work to achieve in btrfs, needs years development
> at least.
> 
>>
>> It is not a problem that I need to defrag a file, the problem is I don't know:
>> 1. whether I need to defrag,
>> 2. *what* should I defrag
>> nor have a tool that would defrag smart - only the exclusive data or, in
>> general, the block that are worth defragging if space released from
>> extents is greater than space lost on inter-snapshot duplication.
>>
>> I can't just defrag entire filesystem since it breaks links with snapshots.
>> This change was a real deal-breaker here...
> 
> IIRC it's better to add a option to make defrag snapshot-aware.
> (Don't break snapshot sharing but only to defrag exclusive data)
> 
> Thanks,
> Qu
> 
>>
>> Any way to fed the deduplication code with snapshots maybe? There are
>> directories and files in the same layout, this could be fast-tracked to
>> check and deduplicate.
>>
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-10 23:44                 ` Qu Wenruo
  2017-12-11  0:24                   ` Qu Wenruo
@ 2017-12-11 11:40                   ` Tomasz Pala
  2017-12-12  0:50                     ` Qu Wenruo
  1 sibling, 1 reply; 32+ messages in thread
From: Tomasz Pala @ 2017-12-11 11:40 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2219 bytes --]

On Mon, Dec 11, 2017 at 07:44:46 +0800, Qu Wenruo wrote:

>> I could debug something before I'll clean this up, is there anything you
>> want to me to check/know about the files?
> 
> fiemap result along with btrfs dump-tree -t2 result.

fiemap attached, but dump-tree requires an unmounted fs, doesn't it?
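
For reference, a roughly equivalent per-extent listing can also be pulled
from the mounted filesystem with filefrag (a sketch; it uses the same FIEMAP
ioctl, and the path needs adjusting to wherever the file really lives):

# one line per extent: logical offset, physical location, length, flags
filefrag -v /path/to/log.0000000014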

>> - I've lost 3.6 GB during the night with reasonably small
>> amount of writes, I guess it might be possible to trash entire
>> filesystem within 10 minutes if doing this on purpose.
> 
> That's a little complex.
> To get into such situation, snapshot must be used and one must know
> which file extent is shared and how it's shared.

A hostile user might assume that any of his own files that are old enough
have already been snapshotted. Unless snapshots are not used at all...

The 'obvious' solution would be for quotas to limit the data size including
the extent space lost to fragmentation, but that is not a real solution, as
users don't care about fragmentation. So we're back to square one.

> But as I mentioned, XFS supports reflink, which means file extent can be
> shared between several inodes.
> 
> From the message I got from XFS guys, they free any unused space of a
> file extent, so it should handle it quite well.

Forgive my ignorance, as I'm not familiar with the details, but isn't the
problem 'solvable' by reusing the space freed within an extent for the same
(i.e. a single) inode? This would certainly increase the fragmentation of a
file, but reduce extent usage significantly.


Still, I don't understand the cause of my situation. If, after doing a
defrag (after snapshotting whatever was already trashed), btrfs decides to
allocate new extents for the file, why doesn't it use them efficiently as
long as I'm not taking snapshots anymore?

I'm attaching the second fiemap, for the same file from the last snapshot
taken. According to this one-liner:

for i in `awk '{print $3}' fiemap`; do grep $i fiemap_old; done

the current file doesn't share any physical locations with the old one.
But it still grows, so what does this situation have to do with snapshots
anyway?

Oh, and BTW - 900+ extents for ~5 GB taken means there is about 5.5 MB
occupied per extent. How is that possible?
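
A quick way to put numbers on that (assuming GNU awk for strtonum and the
column layout of the attached fiemap file):

# count the extents and sum the hex Length column of the attached fiemap,
# then compare the referenced total with the ~5 GB the file pins on disk
awk 'NR > 2 && NF >= 4 { n++; sum += strtonum("0x" $4) }
     END { printf "%d extents, %.1f MiB referenced\n", n, sum / 2^20 }' fiemap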

-- 
Tomasz Pala <gotar@pld-linux.org>

[-- Attachment #2: fiemap --]
[-- Type: text/plain, Size: 56900 bytes --]

File log.0000000014 has 933 extents:
#	Logical          Physical         Length           Flags
0:	0000000000000000 000000297a001000 0000000000001000 0000
1:	0000000000001000 000000297aa01000 0000000000001000 0000
2:	0000000000002000 0000002979ffe000 0000000000001000 0000
3:	0000000000003000 000000297d1fc000 0000000000001000 0000
4:	0000000000004000 000000297e5f7000 0000000000001000 0000
5:	0000000000005000 000000297d1fe000 0000000000001000 0000
6:	0000000000006000 000000297c7f4000 0000000000001000 0000
7:	0000000000007000 000000297dbf9000 0000000000001000 0000
8:	0000000000008000 000000297eff3000 0000000000001000 0000
9:	0000000000009000 00000029821c7000 0000000000001000 0000
10:	000000000000a000 0000002982bbf000 0000000000001000 0000
11:	000000000000b000 00000029803e0000 0000000000001000 0000
12:	000000000000c000 000000297b400000 0000000000001000 0000
13:	000000000000d000 0000002979601000 0000000000001000 0000
14:	000000000000e000 0000002980dd5000 0000000000001000 0000
15:	000000000000f000 00000029821be000 0000000000001000 0000
16:	0000000000010000 000000298715f000 0000000000001000 0000
17:	0000000000011000 0000002985d71000 0000000000001000 0000
18:	0000000000012000 000000298537f000 0000000000001000 0000
19:	0000000000013000 0000002986760000 0000000000001000 0000
20:	0000000000014000 000000298498d000 0000000000001000 0000
21:	0000000000015000 00000029821b4000 0000000000001000 0000
22:	0000000000016000 00000029817c7000 0000000000001000 0000
23:	0000000000017000 000000298a2fa000 0000000000001000 0000
24:	0000000000018000 0000002988f1f000 0000000000001000 0000
25:	0000000000019000 000000298d47f000 0000000000001000 0000
26:	000000000001a000 000000298c0af000 0000000000001000 0000
27:	000000000001b000 000000298a2ee000 0000000000001000 0000
28:	000000000001c000 000000298a2eb000 0000000000001000 0000
29:	000000000001d000 00000029905f2000 0000000000001000 0000
30:	000000000001e000 000000298f22a000 0000000000001000 0000
31:	000000000001f000 000000298de66000 0000000000001000 0000
32:	0000000000020000 000000298ace3000 0000000000001000 0000
33:	0000000000021000 000000298a2e9000 0000000000001000 0000
34:	0000000000022000 000000298a2e7000 0000000000001000 0000
35:	0000000000023000 000000298b6c3000 0000000000001000 0000
36:	0000000000024000 0000002990fd5000 0000000000001000 0000
37:	0000000000025000 0000002992d6c000 0000000000001000 0000
38:	0000000000026000 00000029954db000 0000000000001000 0000
39:	0000000000027000 0000002993747000 0000000000001000 0000
40:	0000000000028000 0000002992d62000 0000000000001000 0000
41:	0000000000029000 0000002992389000 0000000000001000 0000
42:	000000000002a000 00000029919b1000 0000000000001000 0000
43:	000000000002b000 0000002998fe2000 0000000000001000 0000
44:	000000000002c000 000000299d4b5000 0000000000001000 0000
45:	000000000002d000 000000299cadb000 0000000000001000 0000
46:	000000000002e000 000000299c102000 0000000000001000 0000
47:	000000000002f000 000000299ad5b000 0000000000001000 0000
48:	0000000000030000 000000299a388000 0000000000001000 0000
49:	0000000000031000 00000029999b7000 0000000000001000 0000
50:	0000000000032000 00000029a0f97000 0000000000001000 0000
51:	0000000000033000 00000029a5439000 0000000000001000 0000
52:	0000000000034000 00000029a4098000 0000000000001000 0000
53:	0000000000035000 00000029a36c8000 0000000000001000 0000
54:	0000000000036000 00000029a2330000 0000000000001000 0000
55:	0000000000037000 00000029a7198000 0000000000001000 0000
56:	0000000000038000 00000029a67ce000 0000000000001000 0000
57:	0000000000039000 00000029a7b61000 0000000000001000 0000
58:	000000000003a000 00000029a36c0000 0000000000001000 0000
59:	000000000003b000 00000029a232d000 0000000000001000 0000
60:	000000000003c000 00000029a1966000 0000000000001000 0000
61:	000000000003d000 00000029a98b3000 0000000000001000 0000
62:	000000000003e000 00000029a8eea000 0000000000001000 0000
63:	000000000003f000 00000029ad33b000 0000000000001000 0000
64:	0000000000040000 00000029abfb6000 0000000000001000 0000
65:	0000000000041000 00000029a98ae000 0000000000001000 0000
66:	0000000000042000 00000029aac34000 0000000000001000 0000
67:	0000000000043000 00000029aa276000 0000000000001000 0000
68:	0000000000044000 00000029ab5f2000 0000000000001000 0000
69:	0000000000045000 00000029b1767000 0000000000001000 0000
70:	0000000000046000 00000029afa2b000 0000000000001000 0000
71:	0000000000047000 00000029b1757000 0000000000001000 0000
72:	0000000000048000 00000029b0d9d000 0000000000001000 0000
73:	0000000000049000 00000029b51b5000 0000000000001000 0000
74:	000000000004a000 00000029b348e000 0000000000001000 0000
75:	000000000004b000 00000029b2122000 0000000000001000 0000
76:	000000000004c000 00000029b7888000 0000000000001000 0000
77:	000000000004d000 00000027bda37000 0000000000001000 0000
78:	000000000004e000 00000027bb365000 0000000000002000 0000
79:	0000000000050000 00000027c00fa000 0000000000001000 0000
80:	0000000000051000 00000027be3e8000 0000000000001000 0000
81:	0000000000052000 00000027bda39000 0000000000001000 0000
82:	0000000000053000 00000027bb35e000 0000000000001000 0000
83:	0000000000054000 00000027bb35c000 0000000000001000 0000
84:	0000000000055000 00000027c0aaa000 0000000000001000 0000
85:	0000000000056000 00000027bed97000 0000000000001000 0000
86:	0000000000057000 00000027bbd17000 0000000000001000 0000
87:	0000000000058000 00000027ba001000 0000000000001000 0000
88:	0000000000059000 00000027c314f000 0000000000001000 0000
89:	000000000005a000 00000027c3148000 0000000000001000 0000
90:	000000000005b000 00000027c27a0000 0000000000001000 0000
91:	000000000005c000 00000027bf741000 0000000000001000 0000
92:	000000000005d000 00000027bc6c0000 0000000000001000 0000
93:	000000000005e000 00000027bd063000 0000000000001000 0000
94:	000000000005f000 00000027c1455000 0000000000001000 0000
95:	0000000000060000 00000027c279b000 0000000000001000 0000
96:	0000000000061000 00000027c616c000 0000000000001000 0000
97:	0000000000062000 00000027c57c9000 0000000000001000 0000
98:	0000000000063000 00000027c448b000 0000000000001000 0000
99:	0000000000064000 00000027c4e28000 0000000000001000 0000
100:	0000000000065000 00000027c1df7000 0000000000001000 0000
101:	0000000000066000 00000027c9b0e000 0000000000001000 0000
102:	0000000000067000 00000027c9b10000 0000000000001000 0000
103:	0000000000068000 00000027c7e3b000 0000000000001000 0000
104:	0000000000069000 00000027c6b0b000 0000000000001000 0000
105:	000000000006a000 00000027c3151000 0000000000001000 0000
106:	000000000006b000 00000027ccb02000 0000000000001000 0000
107:	000000000006c000 00000027c9afd000 0000000000001000 0000
108:	000000000006d000 00000027c9afb000 0000000000001000 0000
109:	000000000006e000 00000027cb7cd000 0000000000001000 0000
110:	000000000006f000 00000027cae3a000 0000000000001000 0000
111:	0000000000070000 00000027c9163000 0000000000001000 0000
112:	0000000000071000 00000027cc15f000 0000000000001000 0000
113:	0000000000072000 00000027c9af4000 0000000000001000 0000
114:	0000000000073000 00000027d2a9a000 0000000000001000 0000
115:	0000000000074000 00000027d0ddf000 0000000000001000 0000
116:	0000000000075000 00000027d50c8000 0000000000001000 0000
117:	0000000000076000 00000027d2a9c000 0000000000001000 0000
118:	0000000000077000 00000027d20f5000 0000000000001000 0000
119:	0000000000078000 00000027d2a87000 0000000000001000 0000
120:	0000000000079000 00000027d2a81000 0000000000001000 0000
121:	000000000007a000 00000027d63d9000 0000000000001000 0000
122:	000000000007b000 00000027d5a53000 0000000000001000 0000
123:	000000000007c000 00000027d3daa000 0000000000001000 0000
124:	000000000007d000 00000027d472e000 0000000000001000 0000
125:	000000000007e000 00000027d176b000 0000000000001000 0000
126:	000000000007f000 00000027d20ed000 0000000000001000 0000
127:	0000000000080000 00000027d6d5f000 0000000000001000 0000
128:	0000000000081000 00000027da65b000 0000000000001000 0000
129:	0000000000082000 00000027d9359000 0000000000001000 0000
130:	0000000000083000 00000027dcc54000 0000000000001000 0000
131:	0000000000084000 00000027db959000 0000000000001000 0000
132:	0000000000085000 00000027da653000 0000000000001000 0000
133:	0000000000086000 00000027dafda000 0000000000001000 0000
134:	0000000000087000 00000027e0eb6000 0000000000001000 0000
135:	0000000000088000 00000027e5107000 0000000000001000 0000
136:	0000000000089000 00000027e4787000 0000000000001000 0000
137:	000000000008a000 00000027e5a7f000 0000000000001000 0000
138:	000000000008b000 00000027e3e06000 0000000000001000 0000
139:	000000000008c000 00000027e2b18000 0000000000001000 0000
140:	000000000008d000 00000027e9341000 0000000000001000 0000
141:	000000000008e000 00000027ed566000 0000000000001000 0000
142:	000000000008f000 00000027ec27c000 0000000000001000 0000
143:	0000000000090000 00000027ea625000 0000000000001000 0000
144:	0000000000091000 00000027ec27a000 0000000000001000 0000
145:	0000000000092000 00000027ec273000 0000000000001000 0000
146:	0000000000093000 00000027eded8000 0000000000001000 0000
147:	0000000000094000 00000027ee845000 0000000000001000 0000
148:	0000000000095000 00000027ecbed000 0000000000001000 0000
149:	0000000000096000 00000027eaf95000 0000000000001000 0000
150:	0000000000097000 00000027e9cb5000 0000000000001000 0000
151:	0000000000098000 00000027f1754000 0000000000001000 0000
152:	0000000000099000 00000027f2a24000 0000000000001000 0000
153:	000000000009a000 00000027f20bc000 0000000000001000 0000
154:	000000000009b000 00000027f047c000 0000000000001000 0000
155:	000000000009c000 00000027efb15000 0000000000001000 0000
156:	000000000009d000 00000027f1756000 0000000000001000 0000
157:	000000000009e000 00000027ea61e000 0000000000001000 0000
158:	000000000009f000 00000027f0de1000 0000000000001000 0000
159:	00000000000a0000 00000027f590b000 0000000000001000 0000
160:	00000000000a1000 00000027f7529000 0000000000001000 0000
161:	00000000000a2000 00000027f6bc9000 0000000000001000 0000
162:	00000000000a3000 0000002983f12000 0000000000001000 0000
163:	00000000000a4000 0000002992387000 0000000000001000 0000
164:	00000000000a5000 0000002992d66000 0000000000001000 0000
165:	00000000000a6000 000000299e7e3000 0000000000001000 0000
166:	00000000000a7000 00000029a409a000 0000000000001000 0000
167:	00000000000a8000 00000029ac976000 0000000000001000 0000
168:	00000000000a9000 00000029b5b6c000 0000000000001000 0000
169:	00000000000aa000 00000029b175e000 0000000000001000 0000
170:	00000000000ab000 00000027c9af6000 0000000000001000 0000
171:	00000000000ac000 00000027d8033000 0000000000001000 0000
172:	00000000000ad000 00000027df1ca000 0000000000001000 0000
173:	00000000000ae000 00000027e046f000 0000000000001000 0000
174:	00000000000af000 00000027e182f000 0000000000001000 0000
175:	00000000000b0000 00000027e2ad5000 0000000000001000 0000
176:	00000000000b1000 00000027f1743000 0000000000001000 0000
177:	00000000000b2000 00000027f4627000 0000000000001000 0000
178:	00000000000b3000 00000029817c9000 0000000000001000 0000
179:	00000000000b4000 0000002989907000 0000000000001000 0000
180:	00000000000b5000 0000002994120000 0000000000001000 0000
181:	00000000000b6000 000000299c104000 0000000000001000 0000
182:	00000000000b7000 00000029a232a000 0000000000001000 0000
183:	00000000000b8000 00000029aef8c000 0000000000001000 0000
184:	00000000000b9000 00000029b64c3000 0000000000001000 0000
185:	00000000000ba000 00000029b6e0a000 0000000000001000 0000
186:	00000000000bb000 00000027ca4a9000 0000000000001000 0000
187:	00000000000bc000 00000027cdddb000 0000000000001000 0000
188:	00000000000bd000 00000027d20ef000 0000000000001000 0000
189:	00000000000be000 00000027d9cd7000 0000000000001000 0000
190:	00000000000bf000 00000027e2180000 0000000000001000 0000
191:	00000000000c0000 00000027e88f5000 0000000000001000 0000
192:	00000000000c1000 00000027e7fb2000 0000000000001000 0000
193:	00000000000c2000 00000027f626b000 0000000000001000 0000
194:	00000000000c3000 0000002982117000 0000000000001000 0000
195:	00000000000c4000 0000002992d60000 0000000000001000 0000
196:	00000000000c5000 00000029967f0000 0000000000001000 0000
197:	00000000000c6000 00000029a03b1000 0000000000001000 0000
198:	00000000000c7000 00000029a49f3000 0000000000001000 0000
199:	00000000000c8000 00000029ae634000 0000000000001000 0000
200:	00000000000c9000 00000029b175a000 0000000000001000 0000
201:	00000000000ca000 00000027c2794000 0000000000001000 0000
202:	00000000000cb000 00000027d3426000 0000000000001000 0000
203:	00000000000cc000 00000027e2b1a000 0000000000001000 0000
204:	00000000000cd000 00000027ef1b1000 0000000000001000 0000
205:	00000000000ce000 000000297f9eb000 0000000000001000 0000
206:	00000000000cf000 000000298e847000 0000000000001000 0000
207:	00000000000d0000 0000002992d69000 0000000000001000 0000
208:	00000000000d1000 00000029a5e06000 0000000000001000 0000
209:	00000000000d2000 00000029b823c000 0000000000001000 0000
210:	00000000000d3000 00000027c74a2000 0000000000001000 0000
211:	00000000000d4000 00000027d02a3000 0000000000001000 0000
212:	00000000000d5000 00000027dfb1d000 0000000000001000 0000
213:	00000000000d6000 00000027e2ac2000 0000000000001000 0000
214:	00000000000d7000 00000027f87b1000 0000000000001000 0000
215:	00000000000d8000 00000029835b5000 0000000000001000 0000
216:	00000000000d9000 000000299712b000 0000000000001000 0000
217:	00000000000da000 000000299fa63000 0000000000001000 0000
218:	00000000000db000 00000029b03e5000 0000000000001000 0000
219:	00000000000dc000 00000029b1760000 0000000000001000 0000
220:	00000000000dd000 00000027c9af9000 0000000000001000 0000
221:	00000000000de000 00000027dd5d1000 0000000000001000 0000
222:	00000000000df000 00000027e3e08000 0000000000001000 0000
223:	00000000000e0000 00000027f338b000 0000000000001000 0000
224:	00000000000e1000 000000298ca95000 0000000000001000 0000
225:	00000000000e2000 000000299ad58000 0000000000001000 0000
226:	00000000000e3000 00000029a98ac000 0000000000001000 0000
227:	00000000000e4000 00000027c87d3000 0000000000001000 0000
228:	00000000000e5000 00000027d76df000 0000000000001000 0000
229:	00000000000e6000 00000027e6d0f000 0000000000001000 0000
230:	00000000000e7000 00000029821c9000 0000000000001000 0000
231:	00000000000e8000 0000002995383000 0000000000001000 0000
232:	00000000000e9000 00000029a3611000 0000000000001000 0000
233:	00000000000ea000 00000027c3ae7000 0000000000001000 0000
234:	00000000000eb000 00000027d8987000 0000000000001000 0000
235:	00000000000ec000 00000027eb8ff000 0000000000001000 0000
236:	00000000000ed000 000000298fc0c000 0000000000001000 0000
237:	00000000000ee000 000000299b72c000 0000000000001000 0000
238:	00000000000ef000 00000029b175c000 0000000000001000 0000
239:	00000000000f0000 00000027cf02f000 0000000000001000 0000
240:	00000000000f1000 00000027e7629000 0000000000001000 0000
241:	00000000000f2000 0000002987b4f000 0000000000001000 0000
242:	00000000000f3000 000000299835f000 0000000000001000 0000
243:	00000000000f4000 00000029a36ca000 0000000000001000 0000
244:	00000000000f5000 00000027c9aff000 0000000000001000 0000
245:	00000000000f6000 00000027e344e000 0000000000001000 0000
246:	00000000000f7000 0000002995eb5000 0000000000001000 0000
247:	00000000000f8000 00000029adcfc000 0000000000001000 0000
248:	00000000000f9000 00000027ce71f000 0000000000001000 0000
249:	00000000000fa000 00000027f3cab000 0000000000001000 0000
250:	00000000000fb000 0000002995370000 0000000000001000 0000
251:	00000000000fc000 00000029b1762000 0000000000001000 0000
252:	00000000000fd000 00000027de7f6000 0000000000001000 0000
253:	00000000000fe000 0000002997a52000 0000000000001000 0000
254:	00000000000ff000 00000029b4745000 0000000000001000 0000
255:	0000000000100000 00000027e63f5000 0000000000001000 0000
256:	0000000000101000 000000299c03e000 0000000000001000 0000
257:	0000000000102000 00000027bb2a7000 0000000000001000 0000
258:	0000000000103000 000000298845d000 0000000000001000 0000
259:	0000000000104000 00000029b2ad7000 0000000000001000 0000
260:	0000000000105000 00000027ddef3000 0000000000001000 0000
261:	0000000000106000 000000299f13d000 0000000000001000 0000
262:	0000000000107000 00000027c9b01000 0000000000001000 0000
263:	0000000000108000 0000002992d64000 0000000000001000 0000
264:	0000000000109000 00000027cf93f000 0000000000001000 0000
265:	000000000010a000 00000029a2cfa000 0000000000001000 0000
266:	000000000010b000 00000027f4f75000 0000000000001000 0000
267:	000000000010c000 00000029b8b6a000 0000000000001000 0000
268:	000000000010d000 000000299de89000 0000000000001000 0000
269:	000000000010e000 0000002979ff4000 0000000000001000 0000
270:	000000000010f000 00000029b9600000 0000000000001000 0000
271:	0000000000110000 00000029bd498000 0000000000001000 0000
272:	0000000000111000 00000029bcba1000 0000000000001000 0000
273:	0000000000112000 00000029bb9be000 0000000000001000 0000
274:	0000000000113000 00000029ba7df000 0000000000001000 0000
275:	0000000000114000 00000029ba7dd000 0000000000001000 0000
276:	0000000000115000 00000029c130e000 0000000000001000 0000
277:	0000000000116000 00000029bf846000 0000000000001000 0000
278:	0000000000117000 00000029c3fa1000 0000000000001000 0000
279:	0000000000118000 00000029c2dcb000 0000000000001000 0000
280:	0000000000119000 00000029c1bf9000 0000000000001000 0000
281:	000000000011a000 00000029c3f9b000 0000000000001000 0000
282:	000000000011b000 00000029c3f98000 0000000000001000 0000
283:	000000000011c000 00000029c6336000 0000000000001000 0000
284:	000000000011d000 00000029c488a000 0000000000001000 0000
285:	000000000011e000 00000029c9898000 0000000000001000 0000
286:	000000000011f000 00000029c7ddc000 0000000000001000 0000
287:	0000000000120000 00000029c2dc5000 0000000000001000 0000
288:	0000000000121000 00000029c2dc3000 0000000000001000 0000
289:	0000000000122000 00000029c6c1a000 0000000000001000 0000
290:	0000000000123000 00000029c516d000 0000000000001000 0000
291:	0000000000124000 00000029c86bd000 0000000000001000 0000
292:	0000000000125000 00000029caa55000 0000000000001000 0000
293:	0000000000126000 00000029ca17a000 0000000000001000 0000
294:	0000000000127000 00000029c2dbc000 0000000000001000 0000
295:	0000000000128000 00000029c5a4a000 0000000000001000 0000
296:	0000000000129000 00000029c36b3000 0000000000001000 0000
297:	000000000012a000 00000029c24e1000 0000000000001000 0000
298:	000000000012b000 00000029c8f99000 0000000000001000 0000
299:	000000000012c000 00000029ce83a000 0000000000001000 0000
300:	000000000012d000 00000029c9873000 0000000000001000 0000
301:	000000000012e000 00000029cdf4a000 0000000000001000 0000
302:	000000000012f000 00000029cf9df000 0000000000001000 0000
303:	0000000000130000 00000029cd670000 0000000000001000 0000
304:	0000000000131000 00000029cb330000 0000000000001000 0000
305:	0000000000132000 00000029cbbff000 0000000000001000 0000
306:	0000000000133000 00000029d2eb6000 0000000000001000 0000
307:	0000000000134000 00000029c9875000 0000000000001000 0000
308:	0000000000135000 00000029cc4cd000 0000000000001000 0000
309:	0000000000136000 00000029ccd98000 0000000000001000 0000
310:	0000000000137000 00000029cf10e000 0000000000001000 0000
311:	0000000000138000 00000029d02b0000 0000000000001000 0000
312:	0000000000139000 00000029d1d0a000 0000000000001000 0000
313:	000000000013a000 00000029d1d07000 0000000000001000 0000
314:	000000000013b000 00000029d2eb8000 0000000000001000 0000
315:	000000000013c000 00000029d1d0c000 0000000000001000 0000
316:	000000000013d000 00000029d25d0000 0000000000001000 0000
317:	000000000013e000 00000029d143a000 0000000000001000 0000
318:	000000000013f000 00000029d1d04000 0000000000001000 0000
319:	0000000000140000 00000029d1d02000 0000000000001000 0000
320:	0000000000141000 00000029d6338000 0000000000001000 0000
321:	0000000000142000 00000029d51b7000 0000000000001000 0000
322:	0000000000143000 00000029da069000 0000000000001000 0000
323:	0000000000144000 00000029ddd95000 0000000000001000 0000
324:	0000000000145000 00000029d7d6d000 0000000000001000 0000
325:	0000000000146000 00000029d2e93000 0000000000001000 0000
326:	0000000000147000 00000029d6bf7000 0000000000001000 0000
327:	0000000000148000 00000029d74b0000 0000000000001000 0000
328:	0000000000149000 00000029d48eb000 0000000000001000 0000
329:	000000000014a000 00000029d4033000 0000000000001000 0000
330:	000000000014b000 00000029d377d000 0000000000001000 0000
331:	000000000014c000 00000029d1cfc000 0000000000001000 0000
332:	000000000014d000 00000029d5a75000 0000000000001000 0000
333:	000000000014e000 00000029dd4a3000 0000000000001000 0000
334:	000000000014f000 00000029dc339000 0000000000001000 0000
335:	0000000000150000 00000029da926000 0000000000001000 0000
336:	0000000000151000 00000029d9786000 0000000000001000 0000
337:	0000000000152000 00000029da03e000 0000000000001000 0000
338:	0000000000153000 00000029da035000 0000000000001000 0000
339:	0000000000154000 00000029de651000 0000000000001000 0000
340:	0000000000155000 00000029dcbea000 0000000000001000 0000
341:	0000000000156000 00000029dba80000 0000000000001000 0000
342:	0000000000157000 00000029db1d6000 0000000000001000 0000
343:	0000000000158000 00000029d1cfe000 0000000000001000 0000
344:	0000000000159000 00000029d8628000 0000000000001000 0000
345:	000000000015a000 00000029d8ed0000 0000000000001000 0000
346:	000000000015b000 00000029e2b88000 0000000000001000 0000
347:	000000000015c000 00000029e1a31000 0000000000001000 0000
348:	000000000015d000 00000029e1189000 0000000000001000 0000
349:	000000000015e000 00000029e3ccf000 0000000000001000 0000
350:	000000000015f000 00000029e1a2c000 0000000000001000 0000
351:	0000000000160000 00000029e81df000 0000000000001000 0000
352:	0000000000161000 00000029e67ed000 0000000000001000 0000
353:	0000000000162000 00000029e56ad000 0000000000001000 0000
354:	0000000000163000 00000029e4571000 0000000000001000 0000
355:	0000000000164000 00000029e342d000 0000000000001000 0000
356:	0000000000165000 00000029e1a2e000 0000000000001000 0000
357:	0000000000166000 00000029eacec000 0000000000001000 0000
358:	0000000000167000 00000029e9319000 0000000000001000 0000
359:	0000000000168000 00000029ed7e6000 0000000000001000 0000
360:	0000000000169000 00000029ec6b4000 0000000000001000 0000
361:	000000000016a000 00000029ea449000 0000000000001000 0000
362:	000000000016b000 00000029eacdf000 0000000000001000 0000
363:	000000000016c000 00000029eace7000 0000000000001000 0000
364:	000000000016d000 00000029efa37000 0000000000001000 0000
365:	000000000016e000 00000029ef1a2000 0000000000001000 0000
366:	000000000016f000 00000029ee07e000 0000000000001000 0000
367:	0000000000170000 00000029eb586000 0000000000001000 0000
368:	0000000000171000 00000029e9bb2000 0000000000001000 0000
369:	0000000000172000 00000029f1c75000 0000000000001000 0000
370:	0000000000173000 00000029f13e4000 0000000000001000 0000
371:	0000000000174000 00000029f1c77000 0000000000001000 0000
372:	0000000000175000 00000029f4fbb000 0000000000001000 0000
373:	0000000000176000 00000029f3617000 0000000000001000 0000
374:	0000000000177000 00000029f4fb8000 0000000000001000 0000
375:	0000000000178000 00000029f3615000 0000000000001000 0000
376:	0000000000179000 00000029f60cd000 0000000000001000 0000
377:	000000000017a000 00000029f5846000 0000000000001000 0000
378:	000000000017b000 00000029f4726000 0000000000001000 0000
379:	000000000017c000 00000029f6954000 0000000000001000 0000
380:	000000000017d000 00000027f7e88000 0000000000001000 0000
381:	000000000017e000 00000029ba775000 0000000000001000 0000
382:	000000000017f000 00000029bc2ac000 0000000000001000 0000
383:	0000000000180000 00000029d862a000 0000000000001000 0000
384:	0000000000181000 00000029e5f4b000 0000000000001000 0000
385:	0000000000182000 00000029ebe16000 0000000000001000 0000
386:	0000000000183000 00000029f2d80000 0000000000001000 0000
387:	0000000000184000 00000029f35fd000 0000000000001000 0000
388:	0000000000185000 00000029b9ef1000 0000000000001000 0000
389:	0000000000186000 00000029bdd88000 0000000000001000 0000
390:	0000000000187000 00000029df776000 0000000000001000 0000
391:	0000000000188000 00000029e7904000 0000000000001000 0000
392:	0000000000189000 00000029f2503000 0000000000001000 0000
393:	000000000018a000 00000029f2d7a000 0000000000001000 0000
394:	000000000018b000 00000029b1755000 0000000000001000 0000
395:	000000000018c000 00000029c09a4000 0000000000001000 0000
396:	000000000018d000 00000029e22d5000 0000000000001000 0000
397:	000000000018e000 00000029e9bb4000 0000000000001000 0000
398:	000000000018f000 00000029f82ba000 0000000000001000 0000
399:	0000000000190000 00000029a35f0000 0000000000001000 0000
400:	0000000000191000 00000029c986e000 0000000000001000 0000
401:	0000000000192000 00000029ecf4b000 0000000000001000 0000
402:	0000000000193000 00000029f7a45000 0000000000001000 0000
403:	0000000000194000 00000029a8528000 0000000000001000 0000
404:	0000000000195000 00000029d0b78000 0000000000001000 0000
405:	0000000000196000 00000029e2b49000 0000000000001000 0000
406:	0000000000197000 00000027bb212000 0000000000001000 0000
407:	0000000000198000 00000029bee6a000 0000000000001000 0000
408:	0000000000199000 00000029e4e0e000 0000000000001000 0000
409:	000000000019a000 00000029f02ca000 0000000000001000 0000
410:	000000000019b000 000000297bdf4000 0000000000001000 0000
411:	000000000019c000 00000029deefd000 0000000000001000 0000
412:	000000000019d000 00000029f1c71000 0000000000001000 0000
413:	000000000019e000 00000029bb0cc000 0000000000001000 0000
414:	000000000019f000 00000029e8a7f000 0000000000001000 0000
415:	00000000001a0000 00000027ba9a9000 0000000000001000 0000
416:	00000000001a1000 00000029c74f8000 0000000000001000 0000
417:	00000000001a2000 00000029f71d8000 0000000000001000 0000
418:	00000000001a3000 00000029ba76d000 0000000000001000 0000
419:	00000000001a4000 00000029f8b2b000 0000000000001000 0000
420:	00000000001a5000 00000029dffef000 0000000000001000 0000
421:	00000000001a6000 00000027dc2d5000 0000000000001000 0000
422:	00000000001a7000 00000029e708c000 0000000000001000 0000
423:	00000000001a8000 00000029be602000 0000000000001000 0000
424:	00000000001a9000 00000029f9e57000 0000000000001000 0000
425:	00000000001aa000 00000029fd8b9000 0000000000001000 0000
426:	00000000001ab000 00000029fbfb0000 0000000000001000 0000
427:	00000000001ac000 00000029fd05b000 0000000000001000 0000
428:	00000000001ad000 00000029faeff000 0000000000001000 0000
429:	00000000001ae000 00000029fa6ab000 0000000000001000 0000
430:	00000000001af000 00000029f9600000 0000000000001000 0000
431:	00000000001b0000 00000029faefd000 0000000000001000 0000
432:	00000000001b1000 00000029fe959000 0000000000001000 0000
433:	00000000001b2000 00000029fd8bb000 0000000000001000 0000
434:	00000000001b3000 00000029fb752000 0000000000001000 0000
435:	00000000001b4000 00000029f9e59000 0000000000001000 0000
436:	00000000001b5000 00000029f9602000 0000000000001000 0000
437:	00000000001b6000 0000002a0236c000 0000000000001000 0000
438:	00000000001b7000 0000002a012cc000 0000000000001000 0000
439:	00000000001b8000 0000002a04cd6000 0000000000001000 0000
440:	00000000001b9000 0000002a03c43000 0000000000001000 0000
441:	00000000001ba000 0000002a0448a000 0000000000001000 0000
442:	00000000001bb000 0000002a02bb3000 0000000000001000 0000
443:	00000000001bc000 0000002a02359000 0000000000001000 0000
444:	00000000001bd000 0000002a086bc000 0000000000001000 0000
445:	00000000001be000 0000002a065a2000 0000000000001000 0000
446:	00000000001bf000 0000002a05d5f000 0000000000001000 0000
447:	00000000001c0000 0000002a07e64000 0000000000001000 0000
448:	00000000001c1000 0000002a033f8000 0000000000001000 0000
449:	00000000001c2000 0000002a02354000 0000000000001000 0000
450:	00000000001c3000 0000002a01b15000 0000000000001000 0000
451:	00000000001c4000 0000002a0aff7000 0000000000001000 0000
452:	00000000001c5000 0000002a0973a000 0000000000001000 0000
453:	00000000001c6000 0000002a09f76000 0000000000001000 0000
454:	00000000001c7000 0000002a0761d000 0000000000001000 0000
455:	00000000001c8000 0000002a02356000 0000000000001000 0000
456:	00000000001c9000 0000002a02351000 0000000000001000 0000
457:	00000000001ca000 0000002a01b18000 0000000000001000 0000
458:	00000000001cb000 0000002a08eff000 0000000000001000 0000
459:	00000000001cc000 0000002a0d0d1000 0000000000001000 0000
460:	00000000001cd000 0000002a0e96b000 0000000000001000 0000
461:	00000000001ce000 0000002a09738000 0000000000001000 0000
462:	00000000001cf000 0000002a0afe3000 0000000000001000 0000
463:	00000000001d0000 0000002a0e135000 0000000000001000 0000
464:	00000000001d1000 0000002a0d905000 0000000000001000 0000
465:	00000000001d2000 0000002a0c88f000 0000000000001000 0000
466:	00000000001d3000 0000002a0f19e000 0000000000001000 0000
467:	00000000001d4000 0000002a0a7b0000 0000000000001000 0000
468:	00000000001d5000 0000002a122a4000 0000000000001000 0000
469:	00000000001d6000 0000002a0f9cb000 0000000000001000 0000
470:	00000000001d7000 0000002a12acf000 0000000000001000 0000
471:	00000000001d8000 0000002a11a72000 0000000000001000 0000
472:	00000000001d9000 0000002a11243000 0000000000001000 0000
473:	00000000001da000 0000002a13b1e000 0000000000001000 0000
474:	00000000001db000 0000002a11a6b000 0000000000001000 0000
475:	00000000001dc000 0000002a122a6000 0000000000001000 0000
476:	00000000001dd000 0000002a16bf3000 0000000000001000 0000
477:	00000000001de000 0000002a163cc000 0000000000001000 0000
478:	00000000001df000 0000002a17416000 0000000000001000 0000
479:	00000000001e0000 0000002a17c37000 0000000000001000 0000
480:	00000000001e1000 0000002a11a6f000 0000000000001000 0000
481:	00000000001e2000 0000002a1229a000 0000000000001000 0000
482:	00000000001e3000 0000002a14344000 0000000000001000 0000
483:	00000000001e4000 0000002a18457000 0000000000001000 0000
484:	00000000001e5000 0000002a19cad000 0000000000001000 0000
485:	00000000001e6000 0000002a1a4c8000 0000000000001000 0000
486:	00000000001e7000 0000002a1229c000 0000000000001000 0000
487:	00000000001e8000 0000002a1229e000 0000000000001000 0000
488:	00000000001e9000 0000002a14b61000 0000000000001000 0000
489:	00000000001ea000 0000002a132f8000 0000000000001000 0000
490:	00000000001eb000 0000002a15378000 0000000000001000 0000
491:	00000000001ec000 0000002a18c73000 0000000000001000 0000
492:	00000000001ed000 0000002a1d54b000 0000000000001000 0000
493:	00000000001ee000 0000002a19c9c000 0000000000001000 0000
494:	00000000001ef000 0000002a1c515000 0000000000001000 0000
495:	00000000001f0000 0000002a1ace2000 0000000000001000 0000
496:	00000000001f1000 0000002a15b8d000 0000000000001000 0000
497:	00000000001f2000 0000002a19487000 0000000000001000 0000
498:	00000000001f3000 0000002a1fd98000 0000000000001000 0000
499:	00000000001f4000 0000002a19c9a000 0000000000001000 0000
500:	00000000001f5000 0000002a1e569000 0000000000001000 0000
501:	00000000001f6000 0000002a1dd5e000 0000000000001000 0000
502:	00000000001f7000 0000002a1cd26000 0000000000001000 0000
503:	00000000001f8000 0000002a1b4f2000 0000000000001000 0000
504:	00000000001f9000 0000002a22dd1000 0000000000001000 0000
505:	00000000001fa000 0000002a21db7000 0000000000001000 0000
506:	00000000001fb000 0000002a19c99000 0000000000001000 0000
507:	00000000001fc000 0000002a20da9000 0000000000001000 0000
508:	00000000001fd000 0000002a225bc000 0000000000001000 0000
509:	00000000001fe000 0000002a205a5000 0000000000001000 0000
510:	00000000001ff000 0000002a1f575000 0000000000001000 0000
511:	0000000000200000 0000002a19c97000 0000000000001000 0000
512:	0000000000201000 0000002a19c95000 0000000000001000 0000
513:	0000000000202000 0000002a215ad000 0000000000001000 0000
514:	0000000000203000 0000002a255d0000 0000000000001000 0000
515:	0000000000204000 0000002a265c9000 0000000000001000 0000
516:	0000000000205000 0000002a26dc5000 0000000000001000 0000
517:	0000000000206000 0000002a21db0000 0000000000001000 0000
518:	0000000000207000 0000002a21dae000 0000000000001000 0000
519:	0000000000208000 0000002a245c8000 0000000000001000 0000
520:	0000000000209000 0000002a23dcf000 0000000000001000 0000
521:	000000000020a000 0000002a235d8000 0000000000001000 0000
522:	000000000020b000 0000002a29d92000 0000000000001000 0000
523:	000000000020c000 0000002a28d9c000 0000000000001000 0000
524:	000000000020d000 0000002a21db3000 0000000000001000 0000
525:	000000000020e000 0000002a21db9000 0000000000001000 0000
526:	000000000020f000 0000002a25dcd000 0000000000001000 0000
527:	0000000000210000 0000002a24dc0000 0000000000001000 0000
528:	0000000000211000 0000002a29590000 0000000000001000 0000
529:	0000000000212000 0000002a2c545000 0000000000001000 0000
530:	0000000000213000 0000002a29d86000 0000000000001000 0000
531:	0000000000214000 0000002a2dd0b000 0000000000001000 0000
532:	0000000000215000 0000002a2e4f7000 0000000000001000 0000
533:	0000000000216000 0000002a2d51d000 0000000000001000 0000
534:	0000000000217000 0000002a2cd33000 0000000000001000 0000
535:	0000000000218000 0000002a2b557000 0000000000001000 0000
536:	0000000000219000 0000002a29d7f000 0000000000001000 0000
537:	000000000021a000 0000002a29d8a000 0000000000001000 0000
538:	000000000021b000 0000002a31460000 0000000000001000 0000
539:	000000000021c000 0000002a343be000 0000000000001000 0000
540:	000000000021d000 0000002a32c0d000 0000000000001000 0000
541:	000000000021e000 0000002a31c45000 0000000000001000 0000
542:	000000000021f000 0000002a32c09000 0000000000001000 0000
543:	0000000000220000 0000002a32c07000 0000000000001000 0000
544:	0000000000221000 0000002a37ae4000 0000000000001000 0000
545:	0000000000222000 0000002a3633c000 0000000000001000 0000
546:	0000000000223000 0000002a35b5c000 0000000000001000 0000
547:	0000000000224000 0000002a3537e000 0000000000001000 0000
548:	0000000000225000 00000029ba76f000 0000000000001000 0000
549:	0000000000226000 00000029f9e4f000 0000000000001000 0000
550:	0000000000227000 0000002a0551e000 0000000000001000 0000
551:	0000000000228000 0000002a1bcfa000 0000000000001000 0000
552:	0000000000229000 0000002a2ece2000 0000000000001000 0000
553:	000000000022a000 0000002a30c3b000 0000000000001000 0000
554:	000000000022b000 0000002a372ef000 0000000000001000 0000
555:	000000000022c000 0000002a32bff000 0000000000001000 0000
556:	000000000022d000 00000029f0b30000 0000000000001000 0000
557:	000000000022e000 00000029fe109000 0000000000001000 0000
558:	000000000022f000 0000002a0c004000 0000000000001000 0000
559:	0000000000230000 0000002a2a587000 0000000000001000 0000
560:	0000000000231000 0000002a333f0000 0000000000001000 0000
561:	0000000000232000 00000027c9b03000 0000000000001000 0000
562:	0000000000233000 00000029ff1a8000 0000000000001000 0000
563:	0000000000234000 0000002a0b833000 0000000000001000 0000
564:	0000000000235000 0000002a2ad57000 0000000000001000 0000
565:	0000000000236000 0000002a34ba2000 0000000000001000 0000
566:	0000000000237000 0000002994a6b000 0000000000001000 0000
567:	0000000000238000 00000029f9e51000 0000000000001000 0000
568:	0000000000239000 0000002a0234e000 0000000000001000 0000
569:	000000000023a000 0000002a2f4b9000 0000000000001000 0000
570:	000000000023b000 0000002a33bbf000 0000000000001000 0000
571:	000000000023c000 00000029ee90f000 0000000000001000 0000
572:	000000000023d000 0000002a06de4000 0000000000001000 0000
573:	000000000023e000 0000002a21dab000 0000000000001000 0000
574:	000000000023f000 0000002a32bea000 0000000000001000 0000
575:	0000000000240000 00000029f3ea1000 0000000000001000 0000
576:	0000000000241000 0000002a109b4000 0000000000001000 0000
577:	0000000000242000 0000002a3043d000 0000000000001000 0000
578:	0000000000243000 0000002a382c3000 0000000000001000 0000
579:	0000000000244000 0000002a00131000 0000000000001000 0000
580:	0000000000245000 0000002a21db5000 0000000000001000 0000
581:	0000000000246000 00000029b3e44000 0000000000001000 0000
582:	0000000000247000 0000002a101f5000 0000000000001000 0000
583:	0000000000248000 0000002a32427000 0000000000001000 0000
584:	0000000000249000 00000029ff975000 0000000000001000 0000
585:	000000000024a000 0000002a2852c000 0000000000001000 0000
586:	000000000024b000 00000029d9776000 0000000000001000 0000
587:	000000000024c000 0000002a2fc7f000 0000000000001000 0000
588:	000000000024d000 0000002a008ed000 0000000000001000 0000
589:	000000000024e000 0000002a36b1a000 0000000000001000 0000
590:	000000000024f000 0000002a1ed74000 0000000000001000 0000
591:	0000000000250000 00000029e084a000 0000000000001000 0000
592:	0000000000251000 0000002a3cbcf000 0000000000001000 0000
593:	0000000000252000 0000002a3c419000 0000000000001000 0000
594:	0000000000253000 0000002a3b4b8000 0000000000001000 0000
595:	0000000000254000 0000002a3ad08000 0000000000001000 0000
596:	0000000000255000 0000002a3c41b000 0000000000001000 0000
597:	0000000000256000 0000002a39dab000 0000000000001000 0000
598:	0000000000257000 0000002a39da9000 0000000000001000 0000
599:	0000000000258000 0000002a3acfd000 0000000000001000 0000
600:	0000000000259000 0000002a3a555000 0000000000001000 0000
601:	000000000025a000 0000002a39600000 0000000000001000 0000
602:	000000000025b000 0000002a3f9b4000 0000000000001000 0000
603:	000000000025c000 0000002a3f205000 0000000000001000 0000
604:	000000000025d000 0000002a3acff000 0000000000001000 0000
605:	000000000025e000 0000002a3ad01000 0000000000001000 0000
606:	000000000025f000 0000002a3e2b4000 0000000000001000 0000
607:	0000000000260000 0000002a42f1c000 0000000000001000 0000
608:	0000000000261000 0000002a3cbd1000 0000000000001000 0000
609:	0000000000262000 0000002a3d370000 0000000000001000 0000
610:	0000000000263000 0000002a40159000 0000000000001000 0000
611:	0000000000264000 0000002a41fcb000 0000000000001000 0000
612:	0000000000265000 0000002a436bc000 0000000000001000 0000
613:	0000000000266000 0000002a41fcd000 0000000000001000 0000
614:	0000000000267000 0000002a41828000 0000000000001000 0000
615:	0000000000268000 0000002a4108f000 0000000000001000 0000
616:	0000000000269000 0000002a44d85000 0000000000001000 0000
617:	000000000026a000 0000002a41fc2000 0000000000001000 0000
618:	000000000026b000 0000002a43e57000 0000000000001000 0000
619:	000000000026c000 0000002a42767000 0000000000001000 0000
620:	000000000026d000 0000002a48293000 0000000000001000 0000
621:	000000000026e000 0000002a47364000 0000000000001000 0000
622:	000000000026f000 0000002a46bcf000 0000000000001000 0000
623:	0000000000270000 0000002a41fc4000 0000000000001000 0000
624:	0000000000271000 0000002a41fc6000 0000000000001000 0000
625:	0000000000272000 0000002a48a26000 0000000000001000 0000
626:	0000000000273000 0000002a4a0d2000 0000000000001000 0000
627:	0000000000274000 0000002a49942000 0000000000001000 0000
628:	0000000000275000 0000002a47af6000 0000000000001000 0000
629:	0000000000276000 0000002a49940000 0000000000001000 0000
630:	0000000000277000 0000002a42efb000 0000000000001000 0000
631:	0000000000278000 0000002a45ca4000 0000000000001000 0000
632:	0000000000279000 0000002a4642c000 0000000000001000 0000
633:	000000000027a000 0000002a4dd0b000 0000000000001000 0000
634:	000000000027b000 0000002a4beee000 0000000000001000 0000
635:	000000000027c000 0000002a4afe3000 0000000000001000 0000
636:	000000000027d000 0000002a41fc8000 0000000000001000 0000
637:	000000000027e000 0000002a4551c000 0000000000001000 0000
638:	000000000027f000 0000002a491b4000 0000000000001000 0000
639:	0000000000280000 0000002a4a85f000 0000000000001000 0000
640:	0000000000281000 0000002a4fb0e000 0000000000001000 0000
641:	0000000000282000 0000002a4ec0f000 0000000000001000 0000
642:	0000000000283000 0000002a49939000 0000000000001000 0000
643:	0000000000284000 0000002a4cdef000 0000000000001000 0000
644:	0000000000285000 0000002a5028d000 0000000000001000 0000
645:	0000000000286000 0000002a51182000 0000000000001000 0000
646:	0000000000287000 0000002a518fc000 0000000000001000 0000
647:	0000000000288000 0000002a545d2000 0000000000001000 0000
648:	0000000000289000 0000002a53e54000 0000000000001000 0000
649:	000000000028a000 0000002a52f62000 0000000000001000 0000
650:	000000000028b000 0000002a527eb000 0000000000001000 0000
651:	000000000028c000 0000002a52075000 0000000000001000 0000
652:	000000000028d000 0000002a56b1d000 0000000000001000 0000
653:	000000000028e000 0000002a554bc000 0000000000001000 0000
654:	000000000028f000 0000002a53e46000 0000000000001000 0000
655:	0000000000290000 0000002a53e43000 0000000000001000 0000
656:	0000000000291000 0000002a53e56000 0000000000001000 0000
657:	0000000000292000 0000002a52f63000 0000000000001000 0000
658:	0000000000293000 0000002a588da000 0000000000001000 0000
659:	0000000000294000 0000002a579fc000 0000000000001000 0000
660:	0000000000295000 0000002a53e3d000 0000000000001000 0000
661:	0000000000296000 0000002a53e3b000 0000000000001000 0000
662:	0000000000297000 0000002a55c2e000 0000000000001000 0000
663:	0000000000298000 0000002a57290000 0000000000001000 0000
664:	0000000000299000 0000002a54d4a000 0000000000001000 0000
665:	000000000029a000 0000002a536d1000 0000000000001000 0000
666:	000000000029b000 0000002a59047000 0000000000001000 0000
667:	000000000029c000 0000002a5a678000 0000000000001000 0000
668:	000000000029d000 0000002a5cb78000 0000000000001000 0000
669:	000000000029e000 0000002a5bca4000 0000000000001000 0000
670:	000000000029f000 0000002a5d2db000 0000000000001000 0000
671:	00000000002a0000 0000002a59f0d000 0000000000001000 0000
672:	00000000002a1000 0000002a5e8fa000 0000000000001000 0000
673:	00000000002a2000 0000002a5a670000 0000000000001000 0000
674:	00000000002a3000 0000002a5da3c000 0000000000001000 0000
675:	00000000002a4000 0000002a5b532000 0000000000001000 0000
676:	00000000002a5000 0000002a5a67a000 0000000000001000 0000
677:	00000000002a6000 0000002a5add5000 0000000000001000 0000
678:	00000000002a7000 0000002a5c406000 0000000000001000 0000
679:	00000000002a8000 0000002a59f04000 0000000000001000 0000
680:	00000000002a9000 0000002a59f06000 0000000000001000 0000
681:	00000000002aa000 0000002a5e199000 0000000000001000 0000
682:	00000000002ab000 0000002a5f059000 0000000000001000 0000
683:	00000000002ac000 0000002a623b8000 0000000000001000 0000
684:	00000000002ad000 0000002a61c53000 0000000000001000 0000
685:	00000000002ae000 0000002a59f08000 0000000000001000 0000
686:	00000000002af000 0000002a59f01000 0000000000001000 0000
687:	00000000002b0000 0000002a5fefe000 0000000000001000 0000
688:	00000000002b1000 0000002a597ad000 0000000000001000 0000
689:	00000000002b2000 0000002a64846000 0000000000001000 0000
690:	00000000002b3000 0000002a639a6000 0000000000001000 0000
691:	00000000002b4000 0000002a61c32000 0000000000001000 0000
692:	00000000002b5000 0000002a59efd000 0000000000001000 0000
693:	00000000002b6000 0000002a614e2000 0000000000001000 0000
694:	00000000002b7000 0000002a63255000 0000000000001000 0000
695:	00000000002b8000 0000002a67b47000 0000000000001000 0000
696:	00000000002b9000 0000002a66cb0000 0000000000001000 0000
697:	00000000002ba000 0000002a66566000 0000000000001000 0000
698:	00000000002bb000 0000002a61c30000 0000000000001000 0000
699:	00000000002bc000 0000002a65e1c000 0000000000001000 0000
700:	00000000002bd000 0000002a640f3000 0000000000001000 0000
701:	00000000002be000 0000002a6828f000 0000000000001000 0000
702:	00000000002bf000 0000002a69857000 0000000000001000 0000
703:	00000000002c0000 0000002a656d4000 0000000000001000 0000
704:	00000000002c1000 0000002a61c34000 0000000000001000 0000
705:	00000000002c2000 0000002a64f94000 0000000000001000 0000
706:	00000000002c3000 0000002a673f7000 0000000000001000 0000
707:	00000000002c4000 0000002a62b0c000 0000000000001000 0000
708:	00000000002c5000 0000002a6910c000 0000000000001000 0000
709:	00000000002c6000 0000002a6d232000 0000000000001000 0000
710:	00000000002c7000 0000002a6984f000 0000000000001000 0000
711:	00000000002c8000 0000002a69851000 0000000000001000 0000
712:	00000000002c9000 0000002a6b53d000 0000000000001000 0000
713:	00000000002ca000 0000002a6a6ce000 0000000000001000 0000
714:	00000000002cb000 0000002a6ef0d000 0000000000001000 0000
715:	00000000002cc000 0000002a6e0a0000 0000000000001000 0000
716:	00000000002cd000 0000002a6984c000 0000000000001000 0000
717:	00000000002ce000 0000002a69854000 0000000000001000 0000
718:	00000000002cf000 0000002a6c3a5000 0000000000001000 0000
719:	00000000002d0000 0000002a6bc74000 0000000000001000 0000
720:	00000000002d1000 0000002a712ff000 0000000000001000 0000
721:	00000000002d2000 0000002a6fd70000 0000000000001000 0000
722:	00000000002d3000 0000002a7049e000 0000000000001000 0000
723:	00000000002d4000 0000002a6a6c6000 0000000000001000 0000
724:	00000000002d5000 0000002a73e0a000 0000000000001000 0000
725:	00000000002d6000 0000002a736da000 0000000000001000 0000
726:	00000000002d7000 0000002a72fac000 0000000000001000 0000
727:	00000000002d8000 0000002a72157000 0000000000001000 0000
728:	00000000002d9000 0000002a7702e000 0000000000001000 0000
729:	00000000002da000 0000002a72fa8000 0000000000001000 0000
730:	00000000002db000 0000002a75aa4000 0000000000001000 0000
731:	00000000002dc000 0000002a7537d000 0000000000001000 0000
732:	00000000002dd000 0000002a74c58000 0000000000001000 0000
733:	00000000002de000 00000029c0130000 0000000000001000 0000
734:	00000000002df000 0000002a3bc65000 0000000000001000 0000
735:	00000000002e0000 0000002a49937000 0000000000001000 0000
736:	00000000002e1000 0000002a52f60000 0000000000001000 0000
737:	00000000002e2000 0000002a6d96c000 0000000000001000 0000
738:	00000000002e3000 0000002a71a2e000 0000000000001000 0000
739:	00000000002e4000 0000002a77e71000 0000000000001000 0000
740:	00000000002e5000 0000002a27cdb000 0000000000001000 0000
741:	00000000002e6000 0000002a49935000 0000000000001000 0000
742:	00000000002e7000 0000002a53e37000 0000000000001000 0000
743:	00000000002e8000 0000002a6e7d4000 0000000000001000 0000
744:	00000000002e9000 0000002a768e0000 0000000000001000 0000
745:	00000000002ea000 0000002a77755000 0000000000001000 0000
746:	00000000002eb000 0000002a38a80000 0000000000001000 0000
747:	00000000002ec000 0000002a4993d000 0000000000001000 0000
748:	00000000002ed000 0000002a53e39000 0000000000001000 0000
749:	00000000002ee000 0000002a6cad6000 0000000000001000 0000
750:	00000000002ef000 0000002a78c9e000 0000000000001000 0000
751:	00000000002f0000 0000002a0236e000 0000000000001000 0000
752:	00000000002f1000 0000002a4b767000 0000000000001000 0000
753:	00000000002f2000 0000002a4e491000 0000000000001000 0000
754:	00000000002f3000 0000002a59f0a000 0000000000001000 0000
755:	00000000002f4000 0000002a74535000 0000000000001000 0000
756:	00000000002f5000 0000002a225ab000 0000000000001000 0000
757:	00000000002f6000 0000002a3ea55000 0000000000001000 0000
758:	00000000002f7000 0000002a4a0cf000 0000000000001000 0000
759:	00000000002f8000 0000002a60d56000 0000000000001000 0000
760:	00000000002f9000 0000002a7287f000 0000000000001000 0000
761:	00000000002fa000 0000002a275c0000 0000000000001000 0000
762:	00000000002fb000 0000002a4c673000 0000000000001000 0000
763:	00000000002fc000 0000002a58168000 0000000000001000 0000
764:	00000000002fd000 0000002a69849000 0000000000001000 0000
765:	00000000002fe000 00000027c9b05000 0000000000001000 0000
766:	00000000002ff000 0000002a445ec000 0000000000001000 0000
767:	0000000000300000 0000002a56397000 0000000000001000 0000
768:	0000000000301000 0000002a69f98000 0000000000001000 0000
769:	0000000000302000 0000002a2bd3f000 0000000000001000 0000
770:	0000000000303000 0000002a6064e000 0000000000001000 0000
771:	0000000000304000 00000027c9b07000 0000000000001000 0000
772:	0000000000305000 0000002a59eff000 0000000000001000 0000
773:	0000000000306000 0000002a7858d000 0000000000001000 0000
774:	0000000000307000 0000002a50a08000 0000000000001000 0000
775:	0000000000308000 00000027cd497000 0000000000001000 0000
776:	0000000000309000 0000002a689d1000 0000000000001000 0000
777:	000000000030a000 0000002a39da7000 0000000000001000 0000
778:	000000000030b000 0000002a7214b000 0000000000001000 0000
779:	000000000030c000 0000002a6f642000 0000000000001000 0000
780:	000000000030d000 0000002a4f38d000 0000000000001000 0000
781:	000000000030e000 0000002a7aad9000 0000000000001000 0000
782:	000000000030f000 0000002a7a3e4000 0000000000001000 0000
783:	0000000000310000 0000002a79600000 0000000000001000 0000
784:	0000000000311000 0000002a7aad7000 0000000000001000 0000
785:	0000000000312000 0000002a7e259000 0000000000001000 0000
786:	0000000000313000 0000002a7d46c000 0000000000001000 0000
787:	0000000000314000 0000002a7bfa3000 0000000000001000 0000
788:	0000000000315000 0000002a7c68f000 0000000000001000 0000
789:	0000000000316000 0000002a7b8b5000 0000000000001000 0000
790:	0000000000317000 0000002a7a3d9000 0000000000001000 0000
791:	0000000000318000 0000002a7b1cb000 0000000000001000 0000
792:	0000000000319000 0000002a80bcd000 0000000000001000 0000
793:	000000000031a000 0000002a7fdf9000 0000000000001000 0000
794:	000000000031b000 0000002a7f711000 0000000000001000 0000
795:	000000000031c000 0000002a7e947000 0000000000001000 0000
796:	000000000031d000 0000002a7a3d6000 0000000000001000 0000
797:	000000000031e000 0000002a7a3d4000 0000000000001000 0000
798:	000000000031f000 0000002a7db59000 0000000000001000 0000
799:	0000000000320000 0000002a79cf1000 0000000000001000 0000
800:	0000000000321000 0000002a8351a000 0000000000001000 0000
801:	0000000000322000 0000002a82750000 0000000000001000 0000
802:	0000000000323000 0000002a8206e000 0000000000001000 0000
803:	0000000000324000 0000002a7a3db000 0000000000001000 0000
804:	0000000000325000 0000002a8198f000 0000000000001000 0000
805:	0000000000326000 0000002a7cd7a000 0000000000001000 0000
806:	0000000000327000 0000002a849ab000 0000000000001000 0000
807:	0000000000328000 0000002a842d1000 0000000000001000 0000
808:	0000000000329000 0000002a8198c000 0000000000001000 0000
809:	000000000032a000 0000002a8198a000 0000000000001000 0000
810:	000000000032b000 0000002a82070000 0000000000001000 0000
811:	000000000032c000 0000002a7f02b000 0000000000001000 0000
812:	000000000032d000 0000002a804df000 0000000000001000 0000
813:	000000000032e000 0000002a872a0000 0000000000001000 0000
814:	000000000032f000 0000002a864f7000 0000000000001000 0000
815:	0000000000330000 0000002a81986000 0000000000001000 0000
816:	0000000000331000 0000002a82e2e000 0000000000001000 0000
817:	0000000000332000 0000002a83bf9000 0000000000001000 0000
818:	0000000000333000 0000002a812b4000 0000000000001000 0000
819:	0000000000334000 0000002a88dd9000 0000000000001000 0000
820:	0000000000335000 0000002a8803d000 0000000000001000 0000
821:	0000000000336000 0000002a81988000 0000000000001000 0000
822:	0000000000337000 0000002a85e16000 0000000000001000 0000
823:	0000000000338000 0000002a8574c000 0000000000001000 0000
824:	0000000000339000 0000002a894a5000 0000000000001000 0000
825:	000000000033a000 0000002a8a8fa000 0000000000001000 0000
826:	000000000033b000 0000002a8a232000 0000000000001000 0000
827:	000000000033c000 0000002a8d198000 0000000000001000 0000
828:	000000000033d000 0000002a8d18f000 0000000000001000 0000
829:	000000000033e000 0000002a8c406000 0000000000001000 0000
830:	000000000033f000 0000002a8bd42000 0000000000001000 0000
831:	0000000000340000 0000002a8b680000 0000000000001000 0000
832:	0000000000341000 0000002a8afc0000 0000000000001000 0000
833:	0000000000342000 0000002a8a22a000 0000000000001000 0000
834:	0000000000343000 0000002a8a8f8000 0000000000001000 0000
835:	0000000000344000 0000002a89b6c000 0000000000001000 0000
836:	0000000000345000 0000002a8fa01000 0000000000001000 0000
837:	0000000000346000 0000002a8f33c000 0000000000001000 0000
838:	0000000000347000 0000002a8e5c5000 0000000000001000 0000
839:	0000000000348000 0000002a8d852000 0000000000001000 0000
840:	0000000000349000 0000002a8a22c000 0000000000001000 0000
841:	000000000034a000 0000002a8cac8000 0000000000001000 0000
842:	000000000034b000 0000002a91b91000 0000000000001000 0000
843:	000000000034c000 0000002a90e24000 0000000000001000 0000
844:	000000000034d000 0000002a8ec7e000 0000000000001000 0000
845:	000000000034e000 0000002a900bc000 0000000000001000 0000
846:	000000000034f000 0000002a8a22e000 0000000000001000 0000
847:	0000000000350000 0000002a914d8000 0000000000001000 0000
848:	0000000000351000 0000002a93653000 0000000000001000 0000
849:	0000000000352000 0000002a9510c000 0000000000001000 0000
850:	0000000000353000 0000002a93d02000 0000000000001000 0000
851:	0000000000354000 0000002a943af000 0000000000001000 0000
852:	0000000000355000 0000002a91b88000 0000000000001000 0000
853:	0000000000356000 0000002a93645000 0000000000001000 0000
854:	0000000000357000 0000002a9650c000 0000000000001000 0000
855:	0000000000358000 0000002a94a5b000 0000000000001000 0000
856:	0000000000359000 0000002a92f94000 0000000000001000 0000
857:	000000000035a000 0000002a957ba000 0000000000001000 0000
858:	000000000035b000 0000002a91b8a000 0000000000001000 0000
859:	000000000035c000 0000002a92f90000 0000000000001000 0000
860:	000000000035d000 0000002a97f9e000 0000000000001000 0000
861:	000000000035e000 0000002a97257000 0000000000001000 0000
862:	000000000035f000 0000002a96bb5000 0000000000001000 0000
863:	0000000000360000 0000002a928e6000 0000000000001000 0000
864:	0000000000361000 0000002a91b8e000 0000000000001000 0000
865:	0000000000362000 0000002a928e4000 0000000000001000 0000
866:	0000000000363000 0000002a95e60000 0000000000001000 0000
867:	0000000000364000 0000002a92246000 0000000000001000 0000
868:	0000000000365000 0000002a9b48a000 0000000000001000 0000
869:	0000000000366000 0000002a9a0b2000 0000000000001000 0000
870:	0000000000367000 0000002a9ade5000 0000000000001000 0000
871:	0000000000368000 0000002a928e2000 0000000000001000 0000
872:	0000000000369000 0000002a9bb25000 0000000000001000 0000
873:	000000000036a000 0000002a98cd7000 0000000000001000 0000
874:	000000000036b000 0000002a978f9000 0000000000001000 0000
875:	000000000036c000 0000002a9dc0e000 0000000000001000 0000
876:	000000000036d000 0000002a9cee2000 0000000000001000 0000
877:	000000000036e000 0000002a9a094000 0000000000001000 0000
878:	000000000036f000 0000002a9d575000 0000000000001000 0000
879:	0000000000370000 0000002a9c84c000 0000000000001000 0000
880:	0000000000371000 0000002a9e931000 0000000000001000 0000
881:	0000000000372000 0000002a9e2a2000 0000000000001000 0000
882:	0000000000373000 0000002a9c1bc000 0000000000001000 0000
883:	0000000000374000 0000002a9a087000 0000000000001000 0000
884:	0000000000375000 0000002a91b8c000 0000000000001000 0000
885:	0000000000376000 0000002a999f7000 0000000000001000 0000
886:	0000000000377000 0000002a9936d000 0000000000001000 0000
887:	0000000000378000 0000002aa09e0000 0000000000001000 0000
888:	0000000000379000 0000002aa16ef000 0000000000001000 0000
889:	000000000037a000 0000002a9a085000 0000000000001000 0000
890:	000000000037b000 0000002a9a083000 0000000000001000 0000
891:	000000000037c000 0000002a9f644000 0000000000001000 0000
892:	000000000037d000 0000002a9a74c000 0000000000001000 0000
893:	000000000037e000 0000002aa3784000 0000000000001000 0000
894:	000000000037f000 0000002aa2a78000 0000000000001000 0000
895:	0000000000380000 0000002aa30f9000 0000000000001000 0000
896:	0000000000381000 0000002a9a081000 0000000000001000 0000
897:	0000000000382000 0000002aa5180000 0000000000001000 0000
898:	0000000000383000 0000002aa3e06000 0000000000001000 0000
899:	0000000000384000 0000002aa1d76000 0000000000001000 0000
900:	0000000000385000 0000002aa0343000 0000000000001000 0000
901:	0000000000386000 0000002aa23f2000 0000000000001000 0000
902:	0000000000387000 0000002aa16e1000 0000000000001000 0000
903:	0000000000388000 0000002aa4483000 0000000000001000 0000
904:	0000000000389000 0000002a9fcc8000 0000000000001000 0000
905:	000000000038a000 0000002a9efc0000 0000000000001000 0000
906:	000000000038b000 0000002aa1069000 0000000000001000 0000
907:	000000000038c000 0000002aa4afb000 0000000000001000 0000
908:	000000000038d000 0000002aa16e8000 0000000000001000 0000
909:	000000000038e000 0000002aa16e5000 0000000000001000 0000
910:	000000000038f000 0000002aa71c2000 0000000000001000 0000
911:	0000000000390000 0000002aa64de000 0000000000001000 0000
912:	0000000000391000 0000002aa9ed2000 0000000000001000 0000
913:	0000000000392000 0000002aa8b7e000 0000000000001000 0000
914:	0000000000393000 0000002aa16e6000 0000000000001000 0000
915:	0000000000394000 0000002aa9ec4000 0000000000001000 0000
916:	0000000000395000 0000002aa7e9e000 0000000000001000 0000
917:	0000000000396000 0000002aa7833000 0000000000001000 0000
918:	0000000000397000 0000002aa6b4e000 0000000000001000 0000
919:	0000000000398000 0000002aa16de000 0000000000001000 0000
920:	0000000000399000 0000002aa9ebe000 0000000000001000 0000
921:	000000000039a000 0000002aaaba7000 0000000000001000 0000
922:	000000000039b000 0000002aaa541000 0000000000001000 0000
923:	000000000039c000 0000002aa9850000 0000000000001000 0000
924:	000000000039d000 0000002aabed3000 0000000000001000 0000
925:	000000000039e000 0000002aab20d000 0000000000001000 0000
926:	000000000039f000 0000002aa16ea000 0000000000001000 0000
927:	00000000003a0000 0000002aa8509000 0000000000001000 0000
928:	00000000003a1000 0000002aa91ed000 0000000000001000 0000
929:	00000000003a2000 0000002aac536000 0000000000001000 0000
930:	00000000003a3000 0000002aaf826000 0000000000001000 0000
931:	00000000003a4000 0000002aa9eb9000 0000000000001000 0000
932:	00000000003a5000 0000002aaf828000 000000000065b000 0001


[-- Attachment #3: Type: message/external-body, Size: 99 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-11 11:40                   ` Tomasz Pala
@ 2017-12-12  0:50                     ` Qu Wenruo
  2017-12-15  8:22                       ` Tomasz Pala
  0 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2017-12-12  0:50 UTC (permalink / raw)
  To: Tomasz Pala, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 4367 bytes --]



On 2017-12-11 19:40, Tomasz Pala wrote:
> On Mon, Dec 11, 2017 at 07:44:46 +0800, Qu Wenruo wrote:
> 
>>> I could debug something before I'll clean this up, is there anything you
>>> want to me to check/know about the files?
>>
>> fiemap result along with btrfs dump-tree -t2 result.
> 
> fiemap attached, but dump-tree requires unmounted fs, doesn't it?

It doesn't.

You can dump your tree with the fs mounted, although it may affect the accuracy.
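
(For reference, with the fs still mounted that would be something along the
lines of the command below; the device path is an example and tree id 2 is
the extent tree:)

# btrfs inspect-internal dump-tree -t 2 /dev/sda2 > /tmp/extent-tree.txt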

The good news is that, in your case, the extent tree isn't really needed, as
there are no shared extents here.

> 
>>> - I've lost 3.6 GB during the night with reasonably small
>>> amount of writes, I guess it might be possible to trash entire
>>> filesystem within 10 minutes if doing this on purpose.
>>
>> That's a little complex.
>> To get into such situation, snapshot must be used and one must know
>> which file extent is shared and how it's shared.
> 
> Hostile user might assume that any of his own files old enough were
> being snapshotted. Unless snapshots are not used at all...
> 
> The 'obvious' solution would be for quotas to limit the data size including
> extents lost due to fragmentation, but this is not the real solution as
> users don't care about fragmentation. So we're back to square one.
> 
>> But as I mentioned, XFS supports reflink, which means file extent can be
>> shared between several inodes.
>>
>> From the message I got from XFS guys, they free any unused space of a
>> file extent, so it should handle it quite well.
> 
> Forgive my ignorance, as I'm not familiar with details, but isn't the
> problem 'solvable' by reusing space freed from the same extent for any
> single (i.e. the same) inode?

Not that easy.

The extent tree design makes it a little tricky to do that.
So btrfs uses the current extent bookkeeping, the laziest way to delete extents.

> This would certainly increase
> fragmentation of a file, but reduce extent usage significially.
> 
> 
> Still, I don't comprehend the cause of my situation. If - after doing a
> defrag (after snapshotting whatever there were already trashed) btrfs
> decides to allocate new extents for the file, why doesn't is use them
> efficiently as long as I'm not doing snapshots anymore?

Even without snapshots, things can easily go crazy.

This will write a 128M file (the max btrfs file extent size) and sync it to disk.
# xfs_io -f -c "pwrite 0 128M" -c "sync" /mnt/btrfs/file

Then, overwrite the 1M~128M range.
# xfs_io -f -c "pwrite 1M 127M" -c "sync" /mnt/btrfs/file

Guess the real disk usage: it's 127M + 128M = 255M.

The point here is that if there is any reference to a file extent, the whole
extent won't be freed, even if only 1M of a 128M extent is still referenced.

Defrag, on the other hand, will basically read out the whole 128M file and
rewrite it.
Basically the same as:

# dd if=/mnt/btrfs/file of=/mnt/btrfs/file2
# rm /mnt/btrfs/file

In this case, it will create a new 128M file extent, while the old 128M and
127M extents lose all their references and are freed.
As a result, it frees 127M net.
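
Picking up right after the two xfs_io writes above, a minimal way to watch
this happen (the mount point is just an example) would be:

# btrfs filesystem df /mnt/btrfs
(data usage should show roughly 255M for the single 128M file)
# btrfs filesystem defragment /mnt/btrfs/file
# sync
# btrfs filesystem df /mnt/btrfs
(once the old extents lose their last reference, usage drops back to ~128M)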



> I'm attaching the second fiemap, the same file from last snapshot taken.
> According to this one-liner:
> 
> for i in `awk '{print $3}' fiemap`; do grep $i fiemap_old; done
> 
> current file doesn't share any physical locations with the old one.
> But still grows, so what does this situation have with snapshots anyway?

In your fiemap, all your file extents are exclusive, so this is not really
related to snapshots.

But the file is very fragmented.
Most of the extents are 4K, several are 8K,
and the final one is 220K.

Are you pre-allocating the file before writing, using tools like dd?
If so, just as I explained above, it will at least *DOUBLE* the on-disk
space usage and cause tons of fragmentation.

It's recommended to use fallocate to preallocate the file instead of things
like dd.
(A preallocated range acts much like nocow, although only for the first write.)

And if possible, use nocow for this file.
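
A sketch of both suggestions combined (the path is only an example; note
that +C has to be set while the file is still empty):

# touch /mnt/btrfs/db.file
# chattr +C /mnt/btrfs/db.file
# fallocate -l 128M /mnt/btrfs/db.file

Unlike dd, fallocate reserves the range without writing any data, so no
throwaway extents are created and the first real write lands in the
preallocated space.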


> 
> Oh, and BTW - 900+ extents for ~5 GB taken means there is about 5.5 MB
> occupied per extent. How is that possible?

Appending small writes with frequent fsync, or small random DIO.

Avoid such patterns, or at least use nocow.
Also avoid using dd to preallocate the file.

Another solution is autodefrag, but I doubt the effect.
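
(For reference, autodefrag is a mount option; enabling it would look
something like the lines below, shown purely as an illustration:)

# mount -o remount,autodefrag /
or permanently in /etc/fstab:
UUID=...  /  btrfs  defaults,autodefrag  0 0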

Thanks,
Qu

> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-12  0:50                     ` Qu Wenruo
@ 2017-12-15  8:22                       ` Tomasz Pala
  2017-12-16  3:21                         ` Duncan
  0 siblings, 1 reply; 32+ messages in thread
From: Tomasz Pala @ 2017-12-15  8:22 UTC (permalink / raw)
  To: linux-btrfs

On Tue, Dec 12, 2017 at 08:50:15 +0800, Qu Wenruo wrote:

> Even without snapshots, things can easily go crazy.
> 
> This will write a 128M file (the max btrfs file extent size) and sync it to disk.
> # xfs_io -f -c "pwrite 0 128M" -c "sync" /mnt/btrfs/file
> 
> Then, overwrite the 1M~128M range.
> # xfs_io -f -c "pwrite 1M 127M" -c "sync" /mnt/btrfs/file
> 
> Guess the real disk usage: it's 127M + 128M = 255M.
> 
> The point here is that if there is any reference to a file extent, the whole
> extent won't be freed, even if only 1M of a 128M extent is still referenced.

OK, /this/ is scary. I guess nocow prevents this behaviour?
I have chattr'ed +C the file eating my space and it ceased.
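
(For anyone repeating this: chattr +C is not guaranteed to take effect on a
file that already has data, so the safe way seems to be the empty-file-then-copy
dance below; the file names are only examples:)

# touch /var/lib/abl/hosts.db.new
# chattr +C /var/lib/abl/hosts.db.new
# cp --reflink=never /var/lib/abl/hosts.db /var/lib/abl/hosts.db.new
# mv /var/lib/abl/hosts.db.new /var/lib/abl/hosts.db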

> Are you pre-allocating the file before writing, using tools like dd?

I have no idea; this could be checked in the source of http://pam-abl.sourceforge.net/
But this is plain Berkeley DB (5.3 in my case)... which scares me even
more:

$  rpm -q --what-requires 'libdb-5.2.so()(64bit)' 'libdb-5.3.so()(64bit)' | wc -l
14

#  ipoldek desc -B db5.3
Package:        db5.3-5.3.28.0-4.x86_64
Required(by):   apache1-base, apache1-mod_ssl, apr-util-dbm-db,
bogofilter, 
    c-icap, c-icap-srv_url_check, courier-authlib, 
    courier-authlib-authuserdb, courier-imap, courier-imap-common, 
    cyrus-imapd, cyrus-imapd-libs, cyrus-sasl, cyrus-sasl-sasldb, 
    db5.3-devel, db5.3-utils, dnshistory, dsniff, evolution-data-server, 
    evolution-data-server-libs, exim, gda-db, ggz-server, 
    heimdal-libs-common, hotkeys, inn, inn-libs, isync, jabberd, jigdo, 
    jigdo-gtk, jnettop, libetpan, libgda3, libgda3-devel, libhome, libqxt, 
    libsolv, lizardfs-master, maildrop, moc, mutt, netatalk, nss_updatedb, 
    ocaml-dbm, opensips, opensmtpd, pam-pam_abl, pam-pam_ccreds, perl-BDB, 
    perl-BerkeleyDB, perl-BerkeleyDB, perl-DB_File, perl-URPM, 
    perl-cyrus-imapd, php4-dba, php52-dba, php53-dba, php54-dba, php55-dba, 
    php56-dba, php70-dba, php70-dba, php71-dba, php71-dba, php72-dba, 
    php72-dba, postfix, python-bsddb, python-modules, python3-bsddb3, 
    redland, ruby-modules, sendmail, squid-session_acl, 
    squid-time_quota_acl, squidGuard, subversion-libs, swish-e, tomoe-svn, 
    webalizer-base, wwwcount

OK, not many end-user applications here, as those mostly use sqlite.
I wonder how that db library behaves:

$  find . -name \*.sqlite | xargs ls -gGhS | head -n1
-rw-r--r-- 1  15M 2017-12-08 12:14 ./.mozilla/firefox/vni9ojqi.default/extension-data/ublock0.sqlite

$  ~/fiemap ./.mozilla/firefox/*.default/extension-data/ublock0.sqlite | head -n1
File ./.mozilla/firefox/vni9ojqi.default/extension-data/ublock0.sqlite has 128 extents:
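
(Plain filefrag from e2fsprogs uses the same fiemap ioctl, so it should
report a similar extent count:)

$  filefrag ./.mozilla/firefox/vni9ojqi.default/extension-data/ublock0.sqlite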


At least every $HOME/{.{,c}cache,tmp} should be +C...
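
Something along these lines (directory names are examples); note that +C on a
directory only affects files created in it afterwards, existing files would
have to be copied back into place to pick it up:

$  chattr +C ~/.cache ~/.ccache ~/tmp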

> And if possible, use nocow for this file.

Actually, it should be officially advised to use +C for the entire /var tree and
every other tree that might be exposed to hostile write patterns, like /home
or /tmp (if held on btrfs).

I'd say that from a security point of view nocow should be the default,
unless overridden per mount or per file... Currently, if I mount
with nocow, there is no way to whitelist trusted users or secure
locations, and until btrfs-specific options can be handled per
subvolume, there is really no alternative.


-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: exclusive subvolume space missing
  2017-12-15  8:22                       ` Tomasz Pala
@ 2017-12-16  3:21                         ` Duncan
  0 siblings, 0 replies; 32+ messages in thread
From: Duncan @ 2017-12-16  3:21 UTC (permalink / raw)
  To: linux-btrfs

Tomasz Pala posted on Fri, 15 Dec 2017 09:22:14 +0100 as excerpted:

> I wonder how that db library behaves:
> 
> $  find . -name \*.sqlite | xargs ls -gGhS | head -n1
> -rw-r--r-- 1  15M 2017-12-08 12:14
> ./.mozilla/firefox/vni9ojqi.default/extension-data/ublock0.sqlite
> 
> $  ~/fiemap ./.mozilla/firefox/*.default/extension-data/ublock0.sqlite |
> head -n1
> File ./.mozilla/firefox/vni9ojqi.default/extension-data/ublock0.sqlite
> has 128 extents:
> 
> 
> At least every $HOME/{.{,c}cache,tmp} should be +C...

Many admins will put tmp, and sometimes cache or selected parts of it, on 
tmpfs anyway... thereby both automatically clearing it on reboot, and 
allowing enforced size control as necessary.
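
A typical setup would be a couple of fstab entries along these lines (sizes
and paths purely illustrative):

tmpfs  /tmp            tmpfs  size=2G,mode=1777,nosuid,nodev  0  0
tmpfs  /home/u/.cache  tmpfs  size=1G,mode=0700               0  0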

>> And if possible, use nocow for this file.
> 
> Actually, it should be officially advised to use +C for the entire /var
> tree and every other tree that might be exposed to hostile write
> patterns, like /home or /tmp (if held on btrfs).
> 
> I'd say that from a security point of view nocow should be the default,
> unless overridden per mount or per file... Currently, if I mount
> with nocow, there is no way to whitelist trusted users or secure
> locations, and until btrfs-specific options can be handled per
> subvolume, there is really no alternative.

Nocow disables many of the features people run btrfs for in the first place, 
including checksumming and damage detection, with auto-repair from other 
copies where available (primarily raid1/10 and dup modes), as well as 
btrfs transparent compression, for users using that.  Additionally, 
snapshotting, another feature people use btrfs for, turns nocow into cow1 
(cow the first time a block is written after a snapshot), since 
snapshotting locks down the previous extent in order to maintain the 
snapshotted reference.
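
A quick way to see that cow1 behaviour (paths are examples):

# touch /mnt/btrfs/nocow.img && chattr +C /mnt/btrfs/nocow.img
# xfs_io -c "pwrite 0 64M" -c "sync" /mnt/btrfs/nocow.img
# btrfs subvolume snapshot /mnt/btrfs /mnt/btrfs/snap
# xfs_io -c "pwrite 0 4K" -c "sync" /mnt/btrfs/nocow.img
# filefrag -v /mnt/btrfs/nocow.img

The first 4K should now sit in a new extent (cow'd once), while further
rewrites of that block stay in place until the next snapshot.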

And given that any user can create a snapshot any time they want (even if 
you lock down the btrfs executable, if they're malevolent users and not 
locked to only running specifically whitelisted executables, they can 
always get a copy of the executable elsewhere), and /home or individual 
user subvols may well be auto-snapshotted already, setting nocow isn't 
likely to be of much security value at all.

So nocow is, as one regular wrote, most useful for "this really should go 
on something other than btrfs, but I'm too lazy to set it up that way and 
I'm already on btrfs, so the nocow band-aid is all I got.  And yes, I try 
using my screwdriver as a hammer too, because that's what I have there 
too!"

In that sort of case, just use some other filesystem more appropriate to 
the use-case, and you won't have to worry about btrfs issues, cow-
triggered or otherwise, in the first place.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2017-12-16  3:21 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-01 16:15 exclusive subvolume space missing Tomasz Pala
2017-12-01 21:27 ` Duncan
2017-12-01 21:36 ` Hugo Mills
2017-12-02  0:53   ` Tomasz Pala
2017-12-02  1:05     ` Qu Wenruo
2017-12-02  1:43       ` Tomasz Pala
2017-12-02  2:17         ` Qu Wenruo
2017-12-02  2:56     ` Duncan
2017-12-02 16:28     ` Tomasz Pala
2017-12-02 17:18       ` Tomasz Pala
2017-12-03  1:45         ` Duncan
2017-12-03 10:47           ` Adam Borowski
2017-12-04  5:11             ` Chris Murphy
2017-12-10 10:49           ` Tomasz Pala
2017-12-04  4:58     ` Chris Murphy
2017-12-02  0:27 ` Qu Wenruo
2017-12-02  1:23   ` Tomasz Pala
2017-12-02  1:47     ` Qu Wenruo
2017-12-02  2:21       ` Tomasz Pala
2017-12-02  2:35         ` Qu Wenruo
2017-12-02  9:33           ` Tomasz Pala
2017-12-04  0:34             ` Qu Wenruo
2017-12-10 11:27               ` Tomasz Pala
2017-12-10 15:49                 ` Tomasz Pala
2017-12-10 23:44                 ` Qu Wenruo
2017-12-11  0:24                   ` Qu Wenruo
2017-12-11 11:40                   ` Tomasz Pala
2017-12-12  0:50                     ` Qu Wenruo
2017-12-15  8:22                       ` Tomasz Pala
2017-12-16  3:21                         ` Duncan
2017-12-05 18:47   ` How exclusive in parent qgroup is computed? (was: Re: exclusive subvolume space missing) Andrei Borzenkov
2017-12-05 23:57     ` How exclusive in parent qgroup is computed? Qu Wenruo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.