* BTRFS doesn't compress on the fly
@ 2023-11-30 11:21 Gerhard Wiesinger
  2023-11-30 20:53 ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Gerhard Wiesinger @ 2023-11-30 11:21 UTC (permalink / raw)
  To: linux-btrfs

Dear All,

I created a new BTRFS volume and migrated an existing PostgreSQL
database onto it. Versions are recent.

Compression is not done on the fly although everything is IMHO 
configured correctly to do so.

I need to run the following command so that everything gets compressed:
btrfs filesystem defragment -r -v -czstd /var/lib/pgsql

I also had a problem that
chattr -R +c /var/lib/pgsql
didn't work for some files.

Find further details below.

Looks like a bug to me.

Any ideas?

Thanx.

Ciao,
Gerhard

uname -a
Linux myhostname 6.5.12-300.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Nov 
20 22:44:24 UTC 2023 x86_64 GNU/Linux

btrfs --version
btrfs-progs v6.5.1

btrfs filesystem show
Label: 'database'  uuid: 6ad6ef90-30fa-4979-9509-99803f7545aa
         Total devices 1 FS bytes used 15.76GiB
         devid    1 size 129.98GiB used 21.06GiB path /dev/mapper/datab

btrfs filesystem df /var/lib/pgsql
Data, single: total=19.00GiB, used=15.61GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=1.00GiB, used=151.92MiB
GlobalReserve, single: total=85.38MiB, used=0.00B

# Mounted via force
findmnt -vno OPTIONS /var/lib/pgsql
rw,relatime,compress-force=zstd:3,space_cache=v2,subvolid=5,subvol=/

# all files even have "c" attribute, set after creation of the filesystem
lsattr /var/lib/pgsql
--------c------------- /var/lib/pgsql/16

# Should be empty and is empty, so everything has the compressed
attribute (after creation and also for all new files)
lsattr -R /var/lib/pgsql | grep -v "^/" | grep -v "^$" | grep -v 
"^........c"

# Stays here at this compression level
compsize -x /var/lib/pgsql
Processed 5332 files, 575858 regular extents (591204 refs), 40 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       63%       51G          80G          80G
none       100%       40G          40G          40G
zstd        27%       10G          40G          40G
prealloc   100%      5.0M         5.0M         5.5M

# After running: btrfs filesystem defragment -r -v -czstd /var/lib/pgsql
compsize -x /var/lib/pgsql
Processed 5563 files, 664076 regular extents (664076 refs), 40 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       19%       15G          80G          80G
none       100%      120K         120K         120K
zstd        19%       15G          80G          80G

# When first creating the filesystem I also had the problem that I
couldn't change all attributes, and didn't find a way to get rid of this.
Any ideas?
chattr -R +c /var/lib/pgsql
chattr: Invalid argument while setting flags on 
/var/lib/pgsql/16/data/base/1/2836
chattr: Invalid argument while setting flags on 
/var/lib/pgsql/16/data/base/1/2840
chattr: Invalid argument while setting flags on 
/var/lib/pgsql/16/data/base/1/2838
chattr: Invalid argument while setting flags on 
/var/lib/pgsql/16/data/base/4/2836
chattr: Invalid argument while setting flags on 
/var/lib/pgsql/16/data/base/4/2838
chattr: Invalid argument while setting flags on 
/var/lib/pgsql/16/data/base/5/2836
chattr: Invalid argument while setting flags on 
/var/lib/pgsql/16/data/base/5/2838


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: BTRFS doesn't compress on the fly
  2023-11-30 11:21 BTRFS doesn't compress on the fly Gerhard Wiesinger
@ 2023-11-30 20:53 ` Qu Wenruo
  2023-12-02 12:02   ` Gerhard Wiesinger
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2023-11-30 20:53 UTC (permalink / raw)
  To: Gerhard Wiesinger, linux-btrfs



On 2023/11/30 21:51, Gerhard Wiesinger wrote:
> Dear All,
>
> I created a new BTRFS volume and migrated an existing PostgreSQL
> database onto it. Versions are recent.

Does the database directory have something like NODATACOW or NODATASUM set?
The other possibility is preallocation: for the first write into a
preallocated range, the write is treated as NOCOW no matter whether
compression is enabled.
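For a quick userspace check (a sketch; the helper name and the sample lsattr lines below are illustrative), NODATACOW shows up as an uppercase 'C' in lsattr output, while lowercase 'c' is the compress attribute:

```shell
# NODATACOW appears as 'C' in the lsattr flag field; the match must stay
# case-sensitive, since lowercase 'c' is the compress attribute.
check_nodatacow() {
  awk '$1 ~ /C/ { print $2 }'
}

# Example against captured lsattr output; only the first entry has No_COW:
printf '%s\n' \
  '---------------C------ /var/lib/pgsql/nocow-file' \
  '--------c------------- /var/lib/pgsql/16' |
  check_nodatacow
# -> /var/lib/pgsql/nocow-file
```

Against the real tree this would be `lsattr -R /var/lib/pgsql | check_nodatacow`.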

>
> Compression is not done on the fly although everything is IMHO
> configured correctly to do so.
>
> I need to run the following command so that everything gets compressed:
> btrfs filesystem defragment -r -v -czstd /var/lib/pgsql
>
> I also had a problem that
> chattr -R +c /var/lib/pgsql
> didn't work for some files.
>
> Find further details below.
>
> Looks like a bug to me.
>
> Any ideas?
>
> Thanx.
>
> Ciao,
> Gerhard
>
> uname -a
> Linux myhostname 6.5.12-300.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Nov
> 20 22:44:24 UTC 2023 x86_64 GNU/Linux
>
> btrfs --version
> btrfs-progs v6.5.1
>
> btrfs filesystem show
> Label: 'database'  uuid: 6ad6ef90-30fa-4979-9509-99803f7545aa
>          Total devices 1 FS bytes used 15.76GiB
>          devid    1 size 129.98GiB used 21.06GiB path /dev/mapper/datab
>
> btrfs filesystem df /var/lib/pgsql
> Data, single: total=19.00GiB, used=15.61GiB
> System, DUP: total=32.00MiB, used=16.00KiB
> Metadata, DUP: total=1.00GiB, used=151.92MiB
> GlobalReserve, single: total=85.38MiB, used=0.00B
>
> # Mounted via force
> findmnt -vno OPTIONS /var/lib/pgsql
> rw,relatime,compress-force=zstd:3,space_cache=v2,subvolid=5,subvol=/
>
> # all files even have "c" attribute, set after creation of the filesystem
> lsattr /var/lib/pgsql
> --------c------------- /var/lib/pgsql/16
>
> # Should be empty and is empty, so everything has the compressed
> attribute (after creation and also for all new files)
> lsattr -R /var/lib/pgsql | grep -v "^/" | grep -v "^$" | grep -v
> "^........c"
>
> # Stays here at this compression level
> compsize -x /var/lib/pgsql
> Processed 5332 files, 575858 regular extents (591204 refs), 40 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL       63%       51G          80G          80G
> none       100%       40G          40G          40G
> zstd        27%       10G          40G          40G
> prealloc   100%      5.0M         5.0M         5.5M

Not sure if the preallocation is the cause, but maybe you can try
disabling preallocation of postgresql?

Preallocation doesn't make much sense on btrfs anyway; there are too
many cases that can break it.

>
> # After running: btrfs filesystem defragment -r -v -czstd /var/lib/pgsql
> compsize -x /var/lib/pgsql
> Processed 5563 files, 664076 regular extents (664076 refs), 40 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL       19%       15G          80G          80G
> none       100%      120K         120K         120K
> zstd        19%       15G          80G          80G
>
> # When first creating the filesystem I also had the problem that I
> couldn't change all attributes, and didn't find a way to get rid of this.
> Any ideas?
> chattr -R +c /var/lib/pgsql
> chattr: Invalid argument while setting flags on

A lot of flags can only be set on empty files IIRC.

Thanks,
Qu

> /var/lib/pgsql/16/data/base/1/2836
> chattr: Invalid argument while setting flags on
> /var/lib/pgsql/16/data/base/1/2840
> chattr: Invalid argument while setting flags on
> /var/lib/pgsql/16/data/base/1/2838
> chattr: Invalid argument while setting flags on
> /var/lib/pgsql/16/data/base/4/2836
> chattr: Invalid argument while setting flags on
> /var/lib/pgsql/16/data/base/4/2838
> chattr: Invalid argument while setting flags on
> /var/lib/pgsql/16/data/base/5/2836
> chattr: Invalid argument while setting flags on
> /var/lib/pgsql/16/data/base/5/2838
>
>


* Re: BTRFS doesn't compress on the fly
  2023-11-30 20:53 ` Qu Wenruo
@ 2023-12-02 12:02   ` Gerhard Wiesinger
  2023-12-02 20:07     ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Gerhard Wiesinger @ 2023-12-02 12:02 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Hello Qu,

Thank you for the answers, see inline.

Any further ideas?

Ciao,
Gerhard.

On 30.11.2023 21:53, Qu Wenruo wrote:
>
>
> On 2023/11/30 21:51, Gerhard Wiesinger wrote:
>> Dear All,
>>
>> I created a new BTRFS volume and migrated an existing PostgreSQL
>> database onto it. Versions are recent.
>
> Does the database directory have something like NODATACOW or NODATASUM
> set?
> The other possibility is preallocation: for the first write into a
> preallocated range, the write is treated as NOCOW no matter whether
> compression is enabled.
>
I don't think so. How can I find out (I already googled a lot)?

At least it is not mounted with these options (see also original post).

# Mounted via force
findmnt -vno OPTIONS /var/lib/pgsql
rw,relatime,compress-force=zstd:3,space_cache=v2,subvolid=5,subvol=/

According to the following link it should compress anyway with the -o 
compress-force option:

https://archive.kernel.org/oldwiki/btrfs.wiki.kernel.org/index.php/Compression.html#What.27s_the_precedence_of_all_the_options_affecting_compression.3F
Compression to newly written data happens:
always -- if the filesystem is mounted with -o compress-force
never -- if the NOCOMPRESS flag is set per-file/-directory
if possible -- if the COMPRESS per-file flag (aka chattr +c) is set, but 
it may get converted to NOCOMPRESS eventually
if possible -- if the -o compress mount option is specified
Note, that mounting with -o compress will not set the +c file attribute.
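That precedence can be sketched as a small decision helper (purely illustrative, not btrfs code):

```shell
# The documented precedence for newly written data, top rule first:
# compress-force always wins, then per-file NOCOMPRESS, then +c / -o compress.
compression_policy() {
  mount_opt="$1"   # "compress-force", "compress", or ""
  file_flag="$2"   # "NOCOMPRESS", "COMPRESS", or ""
  if [ "$mount_opt" = "compress-force" ]; then
    echo "always"
  elif [ "$file_flag" = "NOCOMPRESS" ]; then
    echo "never"
  elif [ "$file_flag" = "COMPRESS" ] || [ "$mount_opt" = "compress" ]; then
    echo "if possible"
  else
    echo "never"
  fi
}

compression_policy "compress-force" ""   # -> always
```

Which is exactly why the behaviour reported here is surprising: the volume is mounted with compress-force, so the first rule should apply.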

>>
>> Compression is not done on the fly although everything is IMHO
>> configured correctly to do so.
>>
>> I need to run the following command so that everything gets compressed:
>> btrfs filesystem defragment -r -v -czstd /var/lib/pgsql
>>
>> I also had a problem that
>> chattr -R +c /var/lib/pgsql
>> didn't work for some files.
>>
>> Find further details below.
>>
>> Looks like a bug to me.
>>
>> Any ideas?
>>
>> Thanx.
>>
>> Ciao,
>> Gerhard
>>
>> uname -a
>> Linux myhostname 6.5.12-300.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Nov
>> 20 22:44:24 UTC 2023 x86_64 GNU/Linux
>>
>> btrfs --version
>> btrfs-progs v6.5.1
>>
>> btrfs filesystem show
>> Label: 'database'  uuid: 6ad6ef90-30fa-4979-9509-99803f7545aa
>>          Total devices 1 FS bytes used 15.76GiB
>>          devid    1 size 129.98GiB used 21.06GiB path /dev/mapper/datab
>>
>> btrfs filesystem df /var/lib/pgsql
>> Data, single: total=19.00GiB, used=15.61GiB
>> System, DUP: total=32.00MiB, used=16.00KiB
>> Metadata, DUP: total=1.00GiB, used=151.92MiB
>> GlobalReserve, single: total=85.38MiB, used=0.00B
>>
>> # Mounted via force
>> findmnt -vno OPTIONS /var/lib/pgsql
>> rw,relatime,compress-force=zstd:3,space_cache=v2,subvolid=5,subvol=/
>>
>> # all files even have "c" attribute, set after creation of the 
>> filesystem
>> lsattr /var/lib/pgsql
>> --------c------------- /var/lib/pgsql/16
>>
>> # Should be empty and is empty, so everything has the compressed
>> attribute (after creation and also for all new files)
>> lsattr -R /var/lib/pgsql | grep -v "^/" | grep -v "^$" | grep -v
>> "^........c"
>>
>> # Stays here at this compression level
>> compsize -x /var/lib/pgsql
>> Processed 5332 files, 575858 regular extents (591204 refs), 40 inline.
>> Type       Perc     Disk Usage   Uncompressed Referenced
>> TOTAL       63%       51G          80G          80G
>> none       100%       40G          40G          40G
>> zstd        27%       10G          40G          40G
>> prealloc   100%      5.0M         5.0M         5.5M
>
> Not sure if the preallocation is the cause, but maybe you can try
> disabling preallocation of postgresql?
>
> Preallocation doesn't make much sense on btrfs anyway; there are too
> many cases that can break it.


I googled a lot and didn't find anything useful about preallocation and
PostgreSQL (it looks like it doesn't use fallocate).

How can I find out whether preallocation is happening?
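One userspace way to spot preallocated ranges (a sketch; filefrag is from e2fsprogs, and the sample line below is fabricated): fallocated extents are reported with the "unwritten" flag:

```shell
# Print the file name passed as $1 when the `filefrag -v` listing on stdin
# contains at least one "unwritten" (i.e. preallocated) extent.
report_prealloc() {
  grep -q 'unwritten' && printf '%s\n' "$1"
}

# Example against a captured filefrag -v line:
echo '   0:        0..      31:     212992..    213023:     32: unwritten' |
  report_prealloc /var/lib/pgsql/16/data/base/1/2836
# -> /var/lib/pgsql/16/data/base/1/2836
```

On the live tree (as root) something like `find /var/lib/pgsql -type f -exec sh -c 'filefrag -v "$1" | grep -q unwritten && echo "$1"' _ {} \;` would list the candidates.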




* Re: BTRFS doesn't compress on the fly
  2023-12-02 12:02   ` Gerhard Wiesinger
@ 2023-12-02 20:07     ` Qu Wenruo
  2023-12-02 21:56       ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2023-12-02 20:07 UTC (permalink / raw)
  To: Gerhard Wiesinger, linux-btrfs



On 2023/12/2 22:32, Gerhard Wiesinger wrote:
> Hello Qu,
>
> Thank you for the answers, see inline.
>
> Any further ideas?
>
> Ciao,
> Gerhard.
>
> On 30.11.2023 21:53, Qu Wenruo wrote:
>>
>>
>> On 2023/11/30 21:51, Gerhard Wiesinger wrote:
>>> Dear All,
>>>
>>> I created a new BTRFS volume and migrated an existing PostgreSQL
>>> database onto it. Versions are recent.
>>
>> Does the database directory have something like NODATACOW or NODATASUM
>> set?
>> The other possibility is preallocation: for the first write into a
>> preallocated range, the write is treated as NOCOW no matter whether
>> compression is enabled.
>>
> I don't think so. How can I find out (I already googled a lot)?

I normally go `btrfs ins dump-tree`, dump the subvolume, grep for the
inode number with `grep -A 3 "item .* key (257 INODE_ITEM 0)"`, which
would show something like this:

	item 6 key (257 INODE_ITEM 0) itemoff 15811 itemsize 160
		generation 7 transid 8 size 4194304 nbytes 4194304
		block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
		sequence 513 flags 0x10(PREALLOC)

The flags field shows the btrfs-specific inode flags, which would
include NODATACOW or NODATASUM.
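As a worked example of that extraction (the helper name is made up, and the here-doc just replays the item layout above):

```shell
# Extract inode numbers whose INODE_ITEM carries PREALLOC or NODATACOW,
# given `btrfs ins dump-tree` output on stdin.
flagged_inodes() {
  grep -B 3 -E 'flags.*(PREALLOC|NODATACOW)' |
    sed -n 's/.*key (\([0-9]*\) INODE_ITEM 0).*/\1/p'
}

# Run it over a captured dump-tree fragment:
flagged_inodes <<'EOF'
	item 6 key (257 INODE_ITEM 0) itemoff 15811 itemsize 160
		generation 7 transid 8 size 4194304 nbytes 4194304
		block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
		sequence 513 flags 0x10(PREALLOC)
EOF
# -> 257
```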

>
> At least it is not mounted with these options (see also original post).
>
> # Mounted via force
> findmnt -vno OPTIONS /var/lib/pgsql
> rw,relatime,compress-force=zstd:3,space_cache=v2,subvolid=5,subvol=/
>
> According to the following link it should compress anyway with the -o
> compress-force option:
>
> https://archive.kernel.org/oldwiki/btrfs.wiki.kernel.org/index.php/Compression.html#What.27s_the_precedence_of_all_the_options_affecting_compression.3F
> Compression to newly written data happens:
> always -- if the filesystem is mounted with -o compress-force
> never -- if the NOCOMPRESS flag is set per-file/-directory
> if possible -- if the COMPRESS per-file flag (aka chattr +c) is set, but
> it may get converted to NOCOMPRESS eventually
> if possible -- if the -o compress mount option is specified
> Note, that mounting with -o compress will not set the +c file attribute.

Well, if you check the kernel code, btrfs_run_delalloc_range() calls
should_nocow() to decide whether we should fall back to the NOCOW path.

should_nocow() checks whether the inode has the NODATACOW or PREALLOC
flag, then verifies whether there is any defrag request for it.
If there is no defrag request, the write can go NOCOW, thus breaking the
COW requirement.

>
[...]
>>> # Stays here at this compression level
>>> compsize -x /var/lib/pgsql
>>> Processed 5332 files, 575858 regular extents (591204 refs), 40 inline.
>>> Type       Perc     Disk Usage   Uncompressed Referenced
>>> TOTAL       63%       51G          80G          80G
>>> none       100%       40G          40G          40G
>>> zstd        27%       10G          40G          40G
>>> prealloc   100%      5.0M         5.0M         5.5M
>>
>> Not sure if the preallocation is the cause, but maybe you can try
>> disabling preallocation of postgresql?
>>
>> Preallocation doesn't make much sense on btrfs anyway; there are too
>> many cases that can break it.
>
>
> I googled a lot and didn't find anything useful about preallocation and
> PostgreSQL (it looks like it doesn't use fallocate).

I don't think so.

>
> How can I find out whether preallocation is happening?

The compsize output above already shows there is some preallocated space.

Thus I'm wondering if the preallocation is the cause, as should_nocow()
also checks the PREALLOC inode flag and tries the NOCOW path first (then
falls back to COW if needed).

Thanks,
Qu

>
>
>


* Re: BTRFS doesn't compress on the fly
  2023-12-02 20:07     ` Qu Wenruo
@ 2023-12-02 21:56       ` Qu Wenruo
  2023-12-03  8:24         ` Gerhard Wiesinger
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2023-12-02 21:56 UTC (permalink / raw)
  To: Gerhard Wiesinger, linux-btrfs



On 2023/12/3 06:37, Qu Wenruo wrote:
>
>
> On 2023/12/2 22:32, Gerhard Wiesinger wrote:
>> Hello Qu,
>>
>> Thank you for the answers, see inline.
>>
>> Any further ideas?
>>
>> Ciao,
>> Gerhard.
>>
>> On 30.11.2023 21:53, Qu Wenruo wrote:
>>>
>>>
>>> On 2023/11/30 21:51, Gerhard Wiesinger wrote:
>>>> Dear All,
>>>>
>>>> I created a new BTRFS volume and migrated an existing PostgreSQL
>>>> database onto it. Versions are recent.
>>>
>>> Does the database directory have something like NODATACOW or NODATASUM
>>> set?
>>> The other possibility is preallocation: for the first write into a
>>> preallocated range, the write is treated as NOCOW no matter whether
>>> compression is enabled.
>>>
>> I don't think so. How can I find out (I already googled a lot)?
>
> I normally go `btrfs ins dump-tree`, dump the subvolume, grep for the
> inode number with `grep -A 3 "item .* key (257 INODE_ITEM 0)"`, which
> would show something like this:
>
>      item 6 key (257 INODE_ITEM 0) itemoff 15811 itemsize 160
>          generation 7 transid 8 size 4194304 nbytes 4194304
>          block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>          sequence 513 flags 0x10(PREALLOC)
>
> The flags field shows the btrfs-specific inode flags, which would
> include NODATACOW or NODATASUM.
>
>>
>> At least it is not mounted with these options (see also original post).
>>
>> # Mounted via force
>> findmnt -vno OPTIONS /var/lib/pgsql
>> rw,relatime,compress-force=zstd:3,space_cache=v2,subvolid=5,subvol=/
>>
>> According to the following link it should compress anyway with the -o
>> compress-force option:
>>
>> https://archive.kernel.org/oldwiki/btrfs.wiki.kernel.org/index.php/Compression.html#What.27s_the_precedence_of_all_the_options_affecting_compression.3F
>> Compression to newly written data happens:
>> always -- if the filesystem is mounted with -o compress-force
>> never -- if the NOCOMPRESS flag is set per-file/-directory
>> if possible -- if the COMPRESS per-file flag (aka chattr +c) is set, but
>> it may get converted to NOCOMPRESS eventually
>> if possible -- if the -o compress mount option is specified
>> Note, that mounting with -o compress will not set the +c file attribute.
>
> Well, if you check the kernel code, btrfs_run_delalloc_range() calls
> should_nocow() to decide whether we should fall back to the NOCOW path.
>
> should_nocow() checks whether the inode has the NODATACOW or PREALLOC
> flag, then verifies whether there is any defrag request for it.
> If there is no defrag request, the write can go NOCOW, thus breaking the
> COW requirement.
>
>>
> [...]
>>>> # Stays here at this compression level
>>>> compsize -x /var/lib/pgsql
>>>> Processed 5332 files, 575858 regular extents (591204 refs), 40 inline.
>>>> Type       Perc     Disk Usage   Uncompressed Referenced
>>>> TOTAL       63%       51G          80G          80G
>>>> none       100%       40G          40G          40G
>>>> zstd        27%       10G          40G          40G
>>>> prealloc   100%      5.0M         5.0M         5.5M
>>>
>>> Not sure if the preallocation is the cause, but maybe you can try
>>> disabling preallocation of postgresql?
>>>
>>> Preallocation doesn't make much sense on btrfs anyway; there are too
>>> many cases that can break it.
>>
>>
>> I googled a lot and didn't find anything useful about preallocation and
>> PostgreSQL (it looks like it doesn't use fallocate).
>
> I don't think so.
>
>>
>> How can I find out whether preallocation is happening?
>
> The compsize output above already shows there is some preallocated space.
>
> Thus I'm wondering if the preallocation is the cause, as should_nocow()
> also checks the PREALLOC inode flag and tries the NOCOW path first (then
> falls back to COW if needed).

Yep, I just reproduced it: for any inode with the PREALLOC flag (i.e. the
file has some preallocated range), compression does not happen even when
we're writing into a range that needs COW anyway (e.g. new writes that
enlarge the file).

  # mkfs.btrfs test.img
  # mount test.img -o compress-force=zstd /mnt/btrfs
  # fallocate -l 128k /mnt/btrfs/file1
  # xfs_io -f -c "pwrite 128k 128k" /mnt/btrfs/file1
  # xfs_io -f -c "pwrite 128k 128k" /mnt/btrfs/file2
  # sync

Since file1 has a 128K preallocated range, the inode has the PREALLOC
flag, which leads to no compression:

	item 6 key (257 INODE_ITEM 0) itemoff 15811 itemsize 160
		generation 8 transid 8 size 262144 nbytes 262144
		block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
		sequence 33 flags 0x10(PREALLOC) <<<<
	item 7 key (257 INODE_REF 256) itemoff 15796 itemsize 15
		index 2 namelen 5 name: file1
	item 8 key (257 EXTENT_DATA 0) itemoff 15743 itemsize 53
		generation 8 type 2 (prealloc)
		prealloc data disk byte 13631488 nr 131072
		prealloc data offset 0 nr 131072
	item 9 key (257 EXTENT_DATA 131072) itemoff 15690 itemsize 53
		generation 8 type 1 (regular)
		extent data disk byte 13762560 nr 131072
		extent data offset 0 nr 131072 ram 131072
		extent compression 0 (none) <<<

Meanwhile the other file, which has no prealloc, goes through the
regular compression path:

	item 10 key (258 INODE_ITEM 0) itemoff 15530 itemsize 160
		generation 8 transid 8 size 262144 nbytes 131072
		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
		sequence 32 flags 0x0(none)
	item 11 key (258 INODE_REF 256) itemoff 15515 itemsize 15
		index 3 namelen 5 name: file2
	item 12 key (258 EXTENT_DATA 131072) itemoff 15462 itemsize 53
		generation 8 type 1 (regular)
		extent data disk byte 13893632 nr 4096
		extent data offset 0 nr 131072 ram 131072
		extent compression 3 (zstd)

To me, this looks like a bug, and the reason is exactly what I explained before.

The worst thing is that as long as the inode has the PREALLOC flag, it
prevents compression from happening for that inode forever, even after
all preallocated extents have been written.

Let me try to fix the fallback to COW path to include compression.

Thanks,
Qu
>
> Thanks,
> Qu
>
>>
>>
>>
>


* Re: BTRFS doesn't compress on the fly
  2023-12-02 21:56       ` Qu Wenruo
@ 2023-12-03  8:24         ` Gerhard Wiesinger
  2023-12-03  9:11           ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Gerhard Wiesinger @ 2023-12-03  8:24 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 02.12.2023 22:56, Qu Wenruo wrote:
>>
>>>
>>> How can I find out whether preallocation is happening?
>>
>> The compsize output above already shows there is some preallocated space.
>>
>> Thus I'm wondering if the preallocation is the cause, as should_nocow()
>> also checks the PREALLOC inode flag and tries the NOCOW path first (then
>> falls back to COW if needed).
>
> Yep, I just reproduced it: for any inode with the PREALLOC flag (i.e. the
> file has some preallocated range), compression does not happen even when
> we're writing into a range that needs COW anyway (e.g. new writes that
> enlarge the file).
>
>  # mkfs.btrfs test.img
>  # mount test.img -o compress-force=zstd /mnt/btrfs
>  # fallocate -l 128k /mnt/btrfs/file1
>  # xfs_io -f -c "pwrite 128k 128k" /mnt/btrfs/file1
>  # xfs_io -f -c "pwrite 128k 128k" /mnt/btrfs/file2
>  # sync
>
> Since file1 has a 128K preallocated range, the inode has the PREALLOC
> flag, which leads to no compression:
>
>     item 6 key (257 INODE_ITEM 0) itemoff 15811 itemsize 160
>         generation 8 transid 8 size 262144 nbytes 262144
>         block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>         sequence 33 flags 0x10(PREALLOC) <<<<
>     item 7 key (257 INODE_REF 256) itemoff 15796 itemsize 15
>         index 2 namelen 5 name: file1
>     item 8 key (257 EXTENT_DATA 0) itemoff 15743 itemsize 53
>         generation 8 type 2 (prealloc)
>         prealloc data disk byte 13631488 nr 131072
>         prealloc data offset 0 nr 131072
>     item 9 key (257 EXTENT_DATA 131072) itemoff 15690 itemsize 53
>         generation 8 type 1 (regular)
>         extent data disk byte 13762560 nr 131072
>         extent data offset 0 nr 131072 ram 131072
>         extent compression 0 (none) <<<
>
> Meanwhile the other file, which has no prealloc, goes through the
> regular compression path:
>
>     item 10 key (258 INODE_ITEM 0) itemoff 15530 itemsize 160
>         generation 8 transid 8 size 262144 nbytes 131072
>         block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>         sequence 32 flags 0x0(none)
>     item 11 key (258 INODE_REF 256) itemoff 15515 itemsize 15
>         index 3 namelen 5 name: file2
>     item 12 key (258 EXTENT_DATA 131072) itemoff 15462 itemsize 53
>         generation 8 type 1 (regular)
>         extent data disk byte 13893632 nr 4096
>         extent data offset 0 nr 131072 ram 131072
>         extent compression 3 (zstd)
>
> To me, this looks like a bug, and the reason is exactly what I explained
> before.
>
> The worst thing is that as long as the inode has the PREALLOC flag, it
> prevents compression from happening for that inode forever, even after
> all preallocated extents have been written.
>
> Let me try to fix the fallback to COW path to include compression.


Thank you for reproducing it. I think we nailed it down.

Is there a way to get that chunk/item output for a given file?

Thnx.

Ciao,

Gerhard



* Re: BTRFS doesn't compress on the fly
  2023-12-03  8:24         ` Gerhard Wiesinger
@ 2023-12-03  9:11           ` Qu Wenruo
  2023-12-03  9:45             ` Gerhard Wiesinger
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2023-12-03  9:11 UTC (permalink / raw)
  To: Gerhard Wiesinger, linux-btrfs



On 2023/12/3 18:54, Gerhard Wiesinger wrote:
> On 02.12.2023 22:56, Qu Wenruo wrote:
>>>
>>>>
>>>> How can I find out whether preallocation is happening?
>>>
>>> The compsize output above already shows there is some preallocated space.
>>>
>>> Thus I'm wondering if the preallocation is the cause, as should_nocow()
>>> also checks the PREALLOC inode flag and tries the NOCOW path first (then
>>> falls back to COW if needed).
>>
>> Yep, I just reproduced it: for any inode with the PREALLOC flag (i.e. the
>> file has some preallocated range), compression does not happen even when
>> we're writing into a range that needs COW anyway (e.g. new writes that
>> enlarge the file).
>>
>>  # mkfs.btrfs test.img
>>  # mount test.img -o compress-force=zstd /mnt/btrfs
>>  # fallocate -l 128k /mnt/btrfs/file1
>>  # xfs_io -f -c "pwrite 128k 128k" /mnt/btrfs/file1
>>  # xfs_io -f -c "pwrite 128k 128k" /mnt/btrfs/file2
>>  # sync
>>
>> Since file1 has a 128K preallocated range, the inode has the PREALLOC
>> flag, which leads to no compression:
>>
>>     item 6 key (257 INODE_ITEM 0) itemoff 15811 itemsize 160
>>         generation 8 transid 8 size 262144 nbytes 262144
>>         block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>>         sequence 33 flags 0x10(PREALLOC) <<<<
>>     item 7 key (257 INODE_REF 256) itemoff 15796 itemsize 15
>>         index 2 namelen 5 name: file1
>>     item 8 key (257 EXTENT_DATA 0) itemoff 15743 itemsize 53
>>         generation 8 type 2 (prealloc)
>>         prealloc data disk byte 13631488 nr 131072
>>         prealloc data offset 0 nr 131072
>>     item 9 key (257 EXTENT_DATA 131072) itemoff 15690 itemsize 53
>>         generation 8 type 1 (regular)
>>         extent data disk byte 13762560 nr 131072
>>         extent data offset 0 nr 131072 ram 131072
>>         extent compression 0 (none) <<<
>>
>> Meanwhile the other file, which has no prealloc, goes through the
>> regular compression path:
>>
>>     item 10 key (258 INODE_ITEM 0) itemoff 15530 itemsize 160
>>         generation 8 transid 8 size 262144 nbytes 131072
>>         block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>>         sequence 32 flags 0x0(none)
>>     item 11 key (258 INODE_REF 256) itemoff 15515 itemsize 15
>>         index 3 namelen 5 name: file2
>>     item 12 key (258 EXTENT_DATA 131072) itemoff 15462 itemsize 53
>>         generation 8 type 1 (regular)
>>         extent data disk byte 13893632 nr 4096
>>         extent data offset 0 nr 131072 ram 131072
>>         extent compression 3 (zstd)
>>
>> To me, this looks like a bug, and the reason is exactly what I explained
>> before.
>>
>> The worst thing is that as long as the inode has the PREALLOC flag, it
>> prevents compression from happening for that inode forever, even after
>> all preallocated extents have been written.
>>
>> Let me try to fix the fallback to COW path to include compression.
>
>
> Thank you for reproducing it. I think we nailed it down.
>
> Is there a way to get that chunk/item output for a given file?

You can always dump the full subvolume (`btrfs ins dump-tree -t
<subvolid> <device>`), then grep for inodes which have the PREALLOC
flag (`| grep -C 5 "flags.*PREALLOC"`), which would include the inode
number; then you can pin down the inodes which have the PREALLOC flag
and are not undergoing compression.

I won't be surprised if most (if not all) postgresql files have that
flag.

Thanks,
Qu
>
> Thnx.
>
> Ciao,
>
> Gerhard
>


* Re: BTRFS doesn't compress on the fly
  2023-12-03  9:11           ` Qu Wenruo
@ 2023-12-03  9:45             ` Gerhard Wiesinger
  2023-12-03 10:19               ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Gerhard Wiesinger @ 2023-12-03  9:45 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 03.12.2023 10:11, Qu Wenruo wrote:
>
>>
>> Thank you for reproducing it. I think we nailed it down.
>>
>> Is there a way to get that chunk/item output for a given file?
>
> You can always dump the full subvolume (`btrfs ins dump-tree -t
> <subvolid> <device>`), then grep for inodes which have the PREALLOC
> flag (`| grep -C 5 "flags.*PREALLOC"`), which would include the inode
> number; then you can pin down the inodes which have the PREALLOC flag
> and are not undergoing compression.
>
> I won't be surprised if most (if not all) postgresql files have that
> flag.

Looks like only a small part has PREALLOC:

find /var/lib/pgsql -type f | wc -l
5569

btrfs inspect-internal dump-tree /dev/mapper/datab | grep -i PREALLOC | 
wc -l
95

For reference:

How to find the file at a certain btrfs inode
https://serverfault.com/questions/746938/how-to-find-the-file-at-a-certain-btrfs-inode

btrfs inspect-internal inode-resolve 13269 /var/lib/pgsql
/var/lib/pgsql/16/data/base/16400/16419

find /var/lib/pgsql -xdev -inum 13269
/var/lib/pgsql/16/data/base/16400/16419

# Get  files from inodes

btrfs inspect-internal dump-tree /dev/mapper/datab | grep -C 5 
"flags.*PREALLOC" | grep -i INODE | perl -pe 's/.*?\((.*?) .*/$1/' | 
sort | uniq | while read INODE; do echo -n "$INODE: ";btrfs 
inspect-internal inode-resolve ${INODE} /var/lib/pgsql; done

# Number of inodes, count is consistent

btrfs inspect-internal dump-tree /dev/mapper/datab | grep -C 5 
"flags.*PREALLOC" | grep -i INODE | perl -pe 's/.*?\((.*?) .*/$1/' | 
sort | uniq | while read INODE; do echo -n "$INODE: ";btrfs 
inspect-internal inode-resolve ${INODE} /var/lib/pgsql; done | wc -l

95

All files are in subdirectories: /var/lib/pgsql/16/data/base/

Do you already have an idea for the fix?

BTW:

if compression is forced, shouldn't just any "block" be compressed?

Or, what's the problem in the logic?

Thnx.

Ciao,

Gerhard



* Re: BTRFS doesn't compress on the fly
  2023-12-03  9:45             ` Gerhard Wiesinger
@ 2023-12-03 10:19               ` Qu Wenruo
  2023-12-22  5:58                 ` Gerhard Wiesinger
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2023-12-03 10:19 UTC (permalink / raw)
  To: Gerhard Wiesinger, linux-btrfs



On 2023/12/3 20:15, Gerhard Wiesinger wrote:
> On 03.12.2023 10:11, Qu Wenruo wrote:
>>
>>>
>>> Thank you for reproducing it. I think we nailed it down.
>>>
>>> Is there a way to get that chunk/item output for a given file?
>>
>> You can always dump the full subvolume (`btrfs ins dump-tree -t
>> <subvolid> <device>`), then try to grep the inode which has PREALLOC
>> alloc (`| grep -C 5 "flags.*PREALLOC"), which would include the inode
>> number, then you can ping down the inodes which has PREALLOC flags and
>> not undergoing compression.
>>
>> I won't be surprised most (if not all) files of postgresql would have
>> that flag.
>
> Looks like only a small part has PREALLOC:
>
> find /var/lib/pgsql -type f | wc -l
> 5569
>
> btrfs inspect-internal dump-tree /dev/mapper/datab | grep -i PREALLOC |
> wc -l
> 95
>
> For reference:
>
> How to find the file at a certain btrfs inode
> https://serverfault.com/questions/746938/how-to-find-the-file-at-a-certain-btrfs-inode
>
> btrfs inspect-internal inode-resolve 13269 /var/lib/pgsql
> /var/lib/pgsql/16/data/base/16400/16419
>
> find /var/lib/pgsql -xdev -inum 13269
> /var/lib/pgsql/16/data/base/16400/16419
>
> # Get  files from inodes
>
> btrfs inspect-internal dump-tree /dev/mapper/datab | grep -C 5
> "flags.*PREALLOC" | grep -i INODE | perl -pe 's/.*?\((.*?) .*/$1/' |
> sort | uniq | while read INODE; do echo -n "$INODE: ";btrfs
> inspect-internal inode-resolve ${INODE} /var/lib/pgsql; done
>
> # Number of inodes, count is consistent
>
> btrfs inspect-internal dump-tree /dev/mapper/datab | grep -C 5
> "flags.*PREALLOC" | grep -i INODE | perl -pe 's/.*?\((.*?) .*/$1/' |
> sort | uniq | while read INODE; do echo -n "$INODE: ";btrfs
> inspect-internal inode-resolve ${INODE} /var/lib/pgsql; done | wc -l
>
> 95
>
> All files are in subdirectories: /var/lib/pgsql/16/data/base/
>
> Already an idea for the fix?

As a workaround, you can copy the files (without using reflink) to a
temporary location (preferably outside btrfs), then copy them back to
overwrite the existing files.
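A sketch of that copy round-trip (assuming GNU cp for `--reflink=never`; paths are examples, and PostgreSQL must be stopped first so files are not modified mid-copy):

```shell
#!/bin/sh
# Rewrite a directory tree so its files lose their preallocated extents
# and come back as fresh COW writes that compression can apply to.
set -e

rewrite_tree() {
    src=$1
    tmp=$2          # temporary copy, ideally on a non-btrfs filesystem
    cp -a --reflink=never "$src" "$tmp"   # full data copy, no shared extents
    rm -rf "$src"
    cp -a --reflink=never "$tmp" "$src"   # written back as normal writes
    rm -rf "$tmp"
}

# Example (hypothetical paths):
#   systemctl stop postgresql
#   rewrite_tree /var/lib/pgsql /var/tmp/pgsql-copy
#   systemctl start postgresql
```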

The root cause is still inside pgsql: as long as it preallocates files,
the same problem will happen again.
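For illustration only (a sketch, not PostgreSQL's actual code): reserving file space up front, as databases commonly do via fallocate(2)/posix_fallocate(3), is what creates the unwritten extents that carry the PREALLOC flag and are currently skipped by compression even with compress-force:

```shell
#!/bin/sh
# Preallocate space the way a database typically does.  On btrfs the
# reserved range becomes an unwritten extent with the PREALLOC flag.
f=$(mktemp)
fallocate -l 1M "$f"     # reserve 1 MiB without writing any data
stat -c %s "$f"          # size is reserved even though nothing was written
rm -f "$f"
```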

>
> BTW:
>
> if compression is forced, should be then just any "block" be compressed?

There is a long-standing problem with combining compression and
preallocation.

One simple example: if we compress a preallocated range, what do we do
with the gap that is left over (the compressed size is always smaller
than the real size)?

If we leave the gap, read performance can be even worse, as we now have
to read several small extents with gaps between them instead of one
large contiguous read.
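To put made-up numbers on that gap: btrfs compresses data in chunks of up to 128 KiB, so compressing in place inside a preallocated extent would leave slack after every chunk (compressed sizes below are hypothetical):

```shell
#!/bin/sh
# Illustrative arithmetic only: slack left behind if compressed chunks
# stayed inside their preallocated 128 KiB slots.
CHUNK=$((128 * 1024))            # max uncompressed size per compression chunk
total_gap=0
n=0
for compressed in 40960 52224 35840; do   # hypothetical zstd output per chunk
    total_gap=$((total_gap + CHUNK - compressed))
    n=$((n + 1))
done
echo "$total_gap bytes of preallocated space left as gaps"
echo "$n fragmented reads instead of 1 contiguous read"
```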

IIRC, years ago when I was a btrfs newbie, that's the direction I tried
to go, but it never reached upstream.

So you can see some of the reasons why we do not compress preallocated
ranges.

But I still don't believe the current behavior is the right one. We
should still try to compress as long as we know the write needs COW
anyway, so we should fix it.

Thanks,
Qu

>
> Or, what's the problem of the logic?
>
> Thnx.
>
> Ciao,
>
> Gerhard
>


* Re: BTRFS doesn't compress on the fly
  2023-12-03 10:19               ` Qu Wenruo
@ 2023-12-22  5:58                 ` Gerhard Wiesinger
  2023-12-22  6:13                   ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Gerhard Wiesinger @ 2023-12-22  5:58 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 03.12.2023 11:19, Qu Wenruo wrote:
>
>
>> BTW:
>>
>> if compression is forced, should be then just any "block" be compressed?
>
> There is a long existing problem with compression with preallocation.
>
> One easy example is, if we go compression for the preallocated range,
> what we do with the gap (compressed size is always smaller than the real
> size).
>
> If we leave the gap, then the read performance can be even worse, as now
> we have to read several small extents with gaps between them, vs a large
> contig read.
>
> IIRC years ago when I was a btrfs newbie, that's the direction I tried
> to go, but never reached upstream.
>
> Thus you can see some of the reason why we do not go compression for
> preallocated range.
>
> But I still don't believe we should go as the current behavior.
> We should still try to go compression as long as we know the write still
> needs COW, thus we should fix it.


Any progress on the fix?

Thanks.

Ciao,

Gerhard



* Re: BTRFS doesn't compress on the fly
  2023-12-22  5:58                 ` Gerhard Wiesinger
@ 2023-12-22  6:13                   ` Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2023-12-22  6:13 UTC (permalink / raw)
  To: Gerhard Wiesinger, linux-btrfs



On 2023/12/22 16:28, Gerhard Wiesinger wrote:
> On 03.12.2023 11:19, Qu Wenruo wrote:
>>
>>
>>> BTW:
>>>
>>> if compression is forced, should be then just any "block" be compressed?
>>
>> There is a long existing problem with compression with preallocation.
>>
>> One easy example is, if we go compression for the preallocated range,
>> what we do with the gap (compressed size is always smaller than the real
>> size).
>>
>> If we leave the gap, then the read performance can be even worse, as now
>> we have to read several small extents with gaps between them, vs a large
>> contig read.
>>
>> IIRC years ago when I was a btrfs newbie, that's the direction I tried
>> to go, but never reached upstream.
>>
>> Thus you can see some of the reason why we do not go compression for
>> preallocated range.
>>
>> But I still don't believe we should go as the current behavior.
>> We should still try to go compression as long as we know the write still
>> needs COW, thus we should fix it.
>
>
> Any progress with the fix?

I tried several solutions; the best one still leads to a reserved-space
underflow.

A proper fix would require larger changes to the whole delalloc
mechanism.

So it's not something that can be easily fixed yet.

Thanks,
Qu
>
> Thnx.
>
> Ciao,
>
> Gerhard
>


end of thread, other threads:[~2023-12-22  6:14 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-30 11:21 BTRFS doesn't compress on the fly Gerhard Wiesinger
2023-11-30 20:53 ` Qu Wenruo
2023-12-02 12:02   ` Gerhard Wiesinger
2023-12-02 20:07     ` Qu Wenruo
2023-12-02 21:56       ` Qu Wenruo
2023-12-03  8:24         ` Gerhard Wiesinger
2023-12-03  9:11           ` Qu Wenruo
2023-12-03  9:45             ` Gerhard Wiesinger
2023-12-03 10:19               ` Qu Wenruo
2023-12-22  5:58                 ` Gerhard Wiesinger
2023-12-22  6:13                   ` Qu Wenruo
