linux-btrfs.vger.kernel.org archive mirror

Re: enospc errors during balance — how to prevent out of space

2024-04-27T12:03:22Z


My system had been making backups for about a week.

It was also running "btrfs balance" automatically.



Yesterday it went through:

btrfs balance -dusage=0
btrfs balance -dusage=10
btrfs balance -dusage=20
btrfs balance -dusage=30
...
btrfs balance -dusage=100
...
btrfs balance -musage=0
btrfs balance -musage=10
btrfs balance -musage=20
...


Something went wrong while balancing with musage (m for metadata).
The filesystem went read-only.



While this happened, btrfs was in the process of deleting four snapshots
(the output of btrfs sub list / -d was not empty).




It had 450 GB of free space (as shown by df -h).

It had almost no Unallocated space (btrfs dev usa /).



After a reboot the system is mounted read-only.
The console shows the maintenance prompt (Ctrl+D or give root password for maintenance).


Running    btrfs balance -dusage /      on the read-only system failed.

Trying    mount -oremount,rw    hung.


Reboot.




Booted Finnix from a USB key to repair.


Started mounting the filesystem.


Dmesg shows:

              bdev /dev/sdc3 errs: wr 0, rd 0, flush 0, corrupt 35967, gen 0


The mount has been running for a long time now.
Nothing more in dmesg.
The mount command seems stalled, but in iotop I see "btrfs-transaction"
running and writing about 10 M/s.



I will leave the system running overnight and check tomorrow or on Monday
whether the mount was successful.




PS. The script that was doing the balancing:


         findmnt --types btrfs --output SOURCE --nofsroot --noheadings | sort | uniq |
         while read dev; do
                 mnt=$(findmnt --source "$dev" --output TARGET --first-only --noheadings)
                 test -d "$mnt" || continue

                 # no balance if plenty of unallocated space
                 btrfs dev usage "$mnt" -g |
                 perl -ne '/Unallocated: +([0-9]+\.[0-9]{2})GiB/ and $1 < 21 and print $1' |
                 grep -q . || continue

                 for typ in dusage musage; do
                         for usa in $(seq 0 10 100); do
                                 # if relocated, then get out of two loops for next "$dev"
                                 btrfs balance start -$typ=$usa,limit=3 "$mnt" 2>&1 |
                                 grep -Eq "Done, had to relocate [1-9][0-9]* out of [0-9]+ chunks" &&
                                 break 2
                         done
                 done
         done

Re: What's the difference between `btrfs sub del -c` and `btrfs fi sync`?

2024-04-27T00:04:38Z


On 2024/4/27 08:44, intelfx@intelfx.name wrote:
> On 2024-04-27 at 08:36 +0930, Qu Wenruo wrote:
>>
>> On 2024/4/27 08:22, intelfx@intelfx.name wrote:
>>> Hi,
>>>
>>> I've been trying to read btrfs-progs code to understand btrfs ioctls
>>> and one thing evades my understanding.
>>>
>>> A `btrfs subvolume delete --commit-{after,each}` operation involves
>>> issuing two ioctls at the commit time: BTRFS_IOC_START_SYNC immediately
>>> followed by BTRFS_IOC_WAIT_SYNC. Notably, the relevant comment says
>>> "<...> issue SYNC ioctl <...>" and the function that encapsulates the
>>> two ioctls is called `wait_for_commit()`.
>>>
>>> On the other hand, a `btrfs filesystem sync` operation involves issuing
>>> just one ioctl, BTRFS_IOC_SYNC (encapsulated in a function called
>>> `btrfs_util_sync_fd()`).
>>>
>>> I tried to look at the kernel code for the three ioctls but to my
>>> untrained eye, they look like they are doing different things with
>>> different side effects.
>>>
>>> What is the difference, and why is it needed (i.e. why are there two
>>> sets of sync-related ioctls)?
>>
>> IIRC --commit-after/each only commits the current transaction; it is
>> just doing the same as `btrfs fi sync` after all/each subvolume deletion.
>>
>> The reason is to ensure that the unlinking (not the full deletion) of the
>> target subvolume is fully committed to disk, so a sudden power loss after
>> the deletion won't make the target subvolume(s) reappear.
>>
>>
>> However, there is another behavior involved, `btrfs subvolume sync`,
>> which waits for a deleted subvolume to be fully dropped.
>> Subvolume deletion can be a heavy load, so btrfs only unlinks the
>> to-be-deleted subvolume and marks it for background deletion.
>> `btrfs subvolume sync` waits for any such orphan subvolume to be
>> deleted.
>>
>> Thanks,
>> Qu
>>
>>
>>>
>>> Cheers,
> 
> Thanks for the fast reply!
> 
> Yes, I'm aware of `btrfs sub sync`. I understand that's a totally
> different operation.
> 
> What I was asking about was specifically the difference between
> `btrfs _filesystem_ sync` and the operation that happens at the end of
> a `btrfs subvolume delete --commit-after`.
> 
> Or, in kernel terms: what exactly is the difference between issuing a
> BTRFS_IOC_SYNC and issuing a BTRFS_IOC_START_SYNC immediately followed
> by a BTRFS_IOC_WAIT_SYNC?

If you go really deep, there is a small difference, but overall you
can consider them the same, even though START_SYNC/WAIT_SYNC is an
asynchronous pair of operations, while IOC_SYNC waits for the commit itself.

> 
> It is not immediately obvious that the kernel code for the three ioctls
> is equivalent (even if it is). For instance, BTRFS_IOC_SYNC begins with
> a call to btrfs_start_delalloc_roots() whereas BTRFS_IOC_START_SYNC
> begins with a call to btrfs_orphan_cleanup(), and the subsequent
> transaction handling code seems subtly different.
>
There is a small difference, but it does not really affect end users.

IOC_SYNC starts and waits for the writeback of all dirty files
(i.e. the same behavior as the `sync` command).
Meanwhile, IOC_START_SYNC does not trigger the writeback; it just commits
the metadata that is already dirty.

For --commit-after/each, IOC_START_SYNC is faster: since
IOC_SNAP_DESTROY has already dirtied the necessary metadata, we only
need to commit the dirtied metadata in the current transaction, with no
need to wait for other data writeback.

Thanks,
Qu

Re: What's the difference between `btrfs sub del -c` and `btrfs fi sync`?

2024-04-26T23:14:04Z

On 2024-04-27 at 08:36 +0930, Qu Wenruo wrote:
> 
> On 2024/4/27 08:22, intelfx@intelfx.name wrote:
> > Hi,
> > 
> > I've been trying to read btrfs-progs code to understand btrfs ioctls
> > and one thing evades my understanding.
> > 
> > A `btrfs subvolume delete --commit-{after,each}` operation involves
> > issuing two ioctls at the commit time: BTRFS_IOC_START_SYNC immediately
> > followed by BTRFS_IOC_WAIT_SYNC. Notably, the relevant comment says
> > "<...> issue SYNC ioctl <...>" and the function that encapsulates the
> > two ioctls is called `wait_for_commit()`.
> > 
> > On the other hand, a `btrfs filesystem sync` operation involves issuing
> > just one ioctl, BTRFS_IOC_SYNC (encapsulated in a function called
> > `btrfs_util_sync_fd()`).
> > 
> > I tried to look at the kernel code for the three ioctls but to my
> > untrained eye, they look like they are doing different things with
> > different side effects.
> > 
> > What is the difference, and why is it needed (i.e. why are there two
> > sets of sync-related ioctls)?
> 
> IIRC --commit-after/each only commits the current transaction; it is
> just doing the same as `btrfs fi sync` after all/each subvolume deletion.
> 
> The reason is to ensure that the unlinking (not the full deletion) of the
> target subvolume is fully committed to disk, so a sudden power loss after
> the deletion won't make the target subvolume(s) reappear.
> 
> 
> However, there is another behavior involved, `btrfs subvolume sync`,
> which waits for a deleted subvolume to be fully dropped.
> Subvolume deletion can be a heavy load, so btrfs only unlinks the
> to-be-deleted subvolume and marks it for background deletion.
> `btrfs subvolume sync` waits for any such orphan subvolume to be
> deleted.
> 
> Thanks,
> Qu
> 
> 
> > 
> > Cheers,

Thanks for the fast reply!

Yes, I'm aware of `btrfs sub sync`. I understand that's a totally
different operation.

What I was asking about was specifically the difference between
`btrfs _filesystem_ sync` and the operation that happens at the end of
a `btrfs subvolume delete --commit-after`.

Or, in kernel terms: what exactly is the difference between issuing a
BTRFS_IOC_SYNC and issuing a BTRFS_IOC_START_SYNC immediately followed
by a BTRFS_IOC_WAIT_SYNC?

It is not immediately obvious that the kernel code for the three ioctls
is equivalent (even if it is). For instance, BTRFS_IOC_SYNC begins with
a call to btrfs_start_delalloc_roots() whereas BTRFS_IOC_START_SYNC
begins with a call to btrfs_orphan_cleanup(), and the subsequent
transaction handling code seems subtly different.

-- 
Ivan Shapovalov / intelfx /

Re: What's the difference between `btrfs sub del -c` and `btrfs fi sync`?

2024-04-26T23:06:32Z


On 2024/4/27 08:22, intelfx@intelfx.name wrote:
> Hi,
>
> I've been trying to read btrfs-progs code to understand btrfs ioctls
> and one thing evades my understanding.
>
> A `btrfs subvolume delete --commit-{after,each}` operation involves
> issuing two ioctls at the commit time: BTRFS_IOC_START_SYNC immediately
> followed by BTRFS_IOC_WAIT_SYNC. Notably, the relevant comment says
> "<...> issue SYNC ioctl <...>" and the function that encapsulates the
> two ioctls is called `wait_for_commit()`.
>
> On the other hand, a `btrfs filesystem sync` operation involves issuing
> just one ioctl, BTRFS_IOC_SYNC (encapsulated in a function called
> `btrfs_util_sync_fd()`).
>
> I tried to look at the kernel code for the three ioctls but to my
> untrained eye, they look like they are doing different things with
> different side effects.
>
> What is the difference, and why is it needed (i.e. why are there two
> sets of sync-related ioctls)?

IIRC --commit-after/each only commits the current transaction; it is
just doing the same as `btrfs fi sync` after all/each subvolume deletion.

The reason is to ensure that the unlinking (not the full deletion) of the
target subvolume is fully committed to disk, so a sudden power loss after
the deletion won't make the target subvolume(s) reappear.


However, there is another behavior involved, `btrfs subvolume sync`,
which waits for a deleted subvolume to be fully dropped.
Subvolume deletion can be a heavy load, so btrfs only unlinks the
to-be-deleted subvolume and marks it for background deletion.
`btrfs subvolume sync` waits for any such orphan subvolume to be
deleted.

Thanks,
Qu


>
> Cheers,

What's the difference between `btrfs sub del -c` and `btrfs fi sync`?

2024-04-26T22:52:19Z

Hi,

I've been trying to read btrfs-progs code to understand btrfs ioctls
and one thing evades my understanding.

A `btrfs subvolume delete --commit-{after,each}` operation involves
issuing two ioctls at the commit time: BTRFS_IOC_START_SYNC immediately
followed by BTRFS_IOC_WAIT_SYNC. Notably, the relevant comment says
"<...> issue SYNC ioctl <...>" and the function that encapsulates the
two ioctls is called `wait_for_commit()`.

On the other hand, a `btrfs filesystem sync` operation involves issuing
just one ioctl, BTRFS_IOC_SYNC (encapsulated in a function called
`btrfs_util_sync_fd()`).

I tried to look at the kernel code for the three ioctls but to my
untrained eye, they look like they are doing different things with
different side effects.

What is the difference, and why is it needed (i.e. why are there two
sets of sync-related ioctls)?

Cheers,
-- 
Ivan Shapovalov / intelfx /

[RFC PATCH 6/6] btrfs: zlib: add support for zlib-deflate through acomp

2024-04-26T11:10:23Z

From: Weigang Li 

Add support for zlib compression and decompression through the acomp
APIs.
Input pages are added to an sg-list and sent to acomp in one request.
Since acomp is asynchronous, the thread is put to sleep and the CPU
is freed up. Once compression is done, the acomp callback is triggered
and the thread is woken up.

This patch doesn't change the BTRFS disk format, which means that files
compressed by hardware engines can be decompressed by the zlib software
library, and vice versa.

Limitations:
  * The implementation always tries to use acomp, even if only
    zlib-deflate-scomp is present
  * Acomp does not provide a way to support compression levels
  * Acomp is an asynchronous API but used here synchronously

Signed-off-by: Weigang Li 
---
 fs/btrfs/zlib.c | 216 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 216 insertions(+)

diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
index e5b3f2003896..b5bbb8c97244 100644
--- a/fs/btrfs/zlib.c
+++ b/fs/btrfs/zlib.c
@@ -18,6 +18,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include "compression.h"
 
 /* workspace buffer size for s390 zlib hardware support */
@@ -33,6 +35,201 @@ struct workspace {
 
 static struct workspace_manager wsm;
 
+static int acomp_comp_pages(struct address_space *mapping, u64 start,
+			    unsigned long len, struct page **pages,
+			    unsigned long *out_pages,
+			    unsigned long *total_in,
+			    unsigned long *total_out)
+{
+	unsigned int nr_src_pages = 0, nr_dst_pages = 0, nr_pages = 0;
+	struct sg_table in_sg = { 0 }, out_sg = { 0 };
+	struct page *in_page, *out_page, **in_pages;
+	struct crypto_acomp *tfm = NULL;
+	struct acomp_req *req = NULL;
+	struct crypto_wait wait;
+	int ret, i;
+
+	nr_src_pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	in_pages = kcalloc(nr_src_pages, sizeof(struct page *), GFP_KERNEL);
+	if (!in_pages) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	for (i = 0; i < nr_src_pages; i++) {
+		in_page = find_get_page(mapping, start >> PAGE_SHIFT);
+		out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+		if (!in_page || !out_page) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		in_pages[i] = in_page;
+		pages[i] = out_page;
+		nr_dst_pages += 1;
+		start += PAGE_SIZE;
+	}
+
+	ret = sg_alloc_table_from_pages(&in_sg, in_pages, nr_src_pages, 0,
+					nr_src_pages << PAGE_SHIFT, GFP_KERNEL);
+	if (ret)
+		goto out;
+
+	ret = sg_alloc_table_from_pages(&out_sg, pages, nr_dst_pages, 0,
+					nr_dst_pages << PAGE_SHIFT, GFP_KERNEL);
+	if (ret)
+		goto out;
+
+	crypto_init_wait(&wait);
+	tfm = crypto_alloc_acomp("zlib-deflate", 0, 0);
+	if (IS_ERR(tfm)) {
+		ret = PTR_ERR(tfm);
+		goto out;
+	}
+
+	req = acomp_request_alloc(tfm);
+	if (!req) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	acomp_request_set_params(req, in_sg.sgl, out_sg.sgl, len,
+				 nr_dst_pages << PAGE_SHIFT);
+	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+				   crypto_req_done, &wait);
+
+	ret = crypto_wait_req(crypto_acomp_compress(req), &wait);
+	if (ret)
+		goto out;
+
+	*total_in = len;
+	*total_out = req->dlen;
+	nr_pages = (*total_out + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+out:
+	sg_free_table(&in_sg);
+	sg_free_table(&out_sg);
+
+	if (in_pages) {
+		for (i = 0; i < nr_src_pages; i++)
+			put_page(in_pages[i]);
+		kfree(in_pages);
+	}
+
+	/* free un-used out pages */
+	for (i = nr_pages; i < nr_dst_pages; i++)
+		put_page(pages[i]);
+
+	if (req)
+		acomp_request_free(req);
+
+	if (tfm)
+		crypto_free_acomp(tfm);
+
+	*out_pages = nr_pages;
+
+	return ret;
+}
+
+static int acomp_zlib_decomp_bio(struct page **in_pages,
+				 struct compressed_bio *cb, size_t srclen,
+				 unsigned long total_pages_in)
+{
+	unsigned int nr_dst_pages = BTRFS_MAX_COMPRESSED_PAGES;
+	struct sg_table in_sg = { 0 }, out_sg = { 0 };
+	struct bio *orig_bio = &cb->orig_bbio->bio;
+	char *data_out = NULL, *bv_buf = NULL;
+	int copy_len = 0, bytes_left = 0;
+	struct crypto_acomp *tfm = NULL;
+	struct page **out_pages = NULL;
+	struct acomp_req *req = NULL;
+	struct crypto_wait wait;
+	struct bio_vec bvec;
+	int ret, i = 0;
+
+	ret = sg_alloc_table_from_pages(&in_sg, in_pages, total_pages_in,
+					0, srclen, GFP_KERNEL);
+	if (ret)
+		goto out;
+
+	out_pages = kcalloc(nr_dst_pages, sizeof(struct page *), GFP_KERNEL);
+	if (!out_pages) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	for (i = 0; i < nr_dst_pages; i++) {
+		out_pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+		if (!out_pages[i]) {
+			ret = -ENOMEM;
+			goto out;
+		}
+	}
+
+	ret = sg_alloc_table_from_pages(&out_sg, out_pages, nr_dst_pages, 0,
+					nr_dst_pages << PAGE_SHIFT, GFP_KERNEL);
+	if (ret)
+		goto out;
+
+	crypto_init_wait(&wait);
+	tfm = crypto_alloc_acomp("zlib-deflate", 0, 0);
+	if (IS_ERR(tfm)) {
+		ret = PTR_ERR(tfm);
+		goto out;
+	}
+
+	req = acomp_request_alloc(tfm);
+	if (!req) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	acomp_request_set_params(req, in_sg.sgl, out_sg.sgl, srclen,
+				 nr_dst_pages << PAGE_SHIFT);
+	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+				   crypto_req_done, &wait);
+
+	ret = crypto_wait_req(crypto_acomp_decompress(req), &wait);
+	if (ret)
+		goto out;
+
+	/* Copy decompressed buffer to bio pages */
+	bytes_left = req->dlen;
+	for (i = 0; i < nr_dst_pages; i++) {
+		copy_len = bytes_left > PAGE_SIZE ? PAGE_SIZE : bytes_left;
+		data_out = kmap_local_page(out_pages[i]);
+
+		bvec = bio_iter_iovec(orig_bio, orig_bio->bi_iter);
+		bv_buf = kmap_local_page(bvec.bv_page);
+		memcpy(bv_buf, data_out, copy_len);
+		kunmap_local(bv_buf);
+
+		bio_advance(orig_bio, copy_len);
+		if (!orig_bio->bi_iter.bi_size)
+			break;
+		bytes_left -= copy_len;
+		if (bytes_left <= 0)
+			break;
+	}
+out:
+	sg_free_table(&in_sg);
+	sg_free_table(&out_sg);
+
+	if (out_pages) {
+		for (i = 0; i < nr_dst_pages; i++) {
+			if (out_pages[i])
+				put_page(out_pages[i]);
+		}
+		kfree(out_pages);
+	}
+
+	if (req)
+		acomp_request_free(req);
+	if (tfm)
+		crypto_free_acomp(tfm);
+
+	return ret;
+}
+
 struct list_head *zlib_get_workspace(unsigned int level)
 {
 	struct list_head *ws = btrfs_get_workspace(BTRFS_COMPRESS_ZLIB, level);
@@ -108,6 +305,15 @@ int zlib_compress_pages(struct list_head *ws, struct address_space *mapping,
 	unsigned long nr_dest_pages = *out_pages;
 	const unsigned long max_out = nr_dest_pages * PAGE_SIZE;
 
+	if (crypto_has_acomp("zlib-deflate", 0, 0)) {
+		ret = acomp_comp_pages(mapping, start, len, pages, out_pages,
+				       total_in, total_out);
+		if (!ret)
+			return ret;
+
+		pr_warn("BTRFS: acomp compression failed: ret = %d\n", ret);
+		/* Fallback to SW implementation if HW compression failed */
+	}
 	*out_pages = 0;
 	*total_out = 0;
 	*total_in = 0;
@@ -281,6 +487,16 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
 	unsigned long buf_start;
 	struct page **pages_in = cb->compressed_pages;
 
+	if (crypto_has_acomp("zlib-deflate", 0, 0)) {
+		ret = acomp_zlib_decomp_bio(pages_in, cb, srclen,
+					    total_pages_in);
+		if (!ret)
+			return ret;
+
+		pr_warn("BTRFS: acomp decompression failed, ret=%d\n", ret);
+		/* Fallback to SW implementation if HW decompression failed */
+	}
+
 	data_in = kmap_local_page(pages_in[page_in_index]);
 	workspace->strm.next_in = data_in;
 	workspace->strm.avail_in = min_t(size_t, srclen, PAGE_SIZE);
-- 
2.44.0
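
The page accounting used by acomp_comp_pages() in the patch above can be
tried outside the kernel. This is a minimal sketch, assuming a 4 KiB page
size; SKETCH_PAGE_SIZE, pages_for_len() and unused_out_pages() are
hypothetical names chosen for illustration, not kernel symbols.

```c
#include <assert.h>

/* Assumed stand-in for the kernel's PAGE_SIZE (4 KiB here). */
#define SKETCH_PAGE_SIZE 4096UL

/* Number of whole pages needed to hold len bytes, rounding up. */
unsigned long pages_for_len(unsigned long len)
{
	return (len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/*
 * Output pages that can be released once the compressed length is known.
 * The patch allocates one output page per input page up front, then keeps
 * only the pages covered by total_out.
 */
unsigned long unused_out_pages(unsigned long nr_dst_pages,
			       unsigned long total_out)
{
	unsigned long used = pages_for_len(total_out);

	/* The compressed output must fit in the pages that were allocated. */
	assert(used <= nr_dst_pages);
	return nr_dst_pages - used;
}
```

For example, a 128 KiB extent needs pages_for_len(131072) = 32 input pages;
if it compresses to 5000 bytes, two output pages stay in use and the other
30 are the ones the "free un-used out pages" loop would drop.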

[RFC PATCH 5/6] crypto: qat - change compressor settings for QAT GEN4

2024-04-26T11:10:19Z

Enable dynamic compression by default for QAT GEN4 and change
compression level to 9.

Signed-off-by: Giovanni Cabiddu 
---
 drivers/crypto/intel/qat/qat_common/adf_gen4_dc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/intel/qat/qat_common/adf_gen4_dc.c b/drivers/crypto/intel/qat/qat_common/adf_gen4_dc.c
index 5859238e37de..34f418b88738 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_gen4_dc.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_gen4_dc.c
@@ -22,7 +22,7 @@ static void qat_comp_build_deflate(void *ctx)
 	header->hdr_flags =
 		ICP_QAT_FW_COMN_HDR_FLAGS_BUILD(ICP_QAT_FW_COMN_REQ_FLAG_SET);
 	header->service_type = ICP_QAT_FW_COMN_REQ_CPM_FW_COMP;
-	header->service_cmd_id = ICP_QAT_FW_COMP_CMD_STATIC;
+	header->service_cmd_id = ICP_QAT_FW_COMP_CMD_DYNAMIC;
 	header->comn_req_flags =
 		ICP_QAT_FW_COMN_FLAGS_BUILD(QAT_COMN_CD_FLD_TYPE_16BYTE_DATA,
 					    QAT_COMN_PTR_TYPE_SGL);
@@ -35,7 +35,7 @@ static void qat_comp_build_deflate(void *ctx)
 	hw_comp_lower_csr.skip_ctrl = ICP_QAT_HW_COMP_20_BYTE_SKIP_3BYTE_LITERAL;
 	hw_comp_lower_csr.algo = ICP_QAT_HW_COMP_20_HW_COMP_FORMAT_ILZ77;
 	hw_comp_lower_csr.lllbd = ICP_QAT_HW_COMP_20_LLLBD_CTRL_LLLBD_ENABLED;
-	hw_comp_lower_csr.sd = ICP_QAT_HW_COMP_20_SEARCH_DEPTH_LEVEL_1;
+	hw_comp_lower_csr.sd = ICP_QAT_HW_COMP_20_SEARCH_DEPTH_LEVEL_9;
 	hw_comp_lower_csr.hash_update = ICP_QAT_HW_COMP_20_SKIP_HASH_UPDATE_DONT_ALLOW;
 	hw_comp_lower_csr.edmm = ICP_QAT_HW_COMP_20_EXTENDED_DELAY_MATCH_MODE_EDMM_ENABLED;
 	hw_comp_upper_csr.nice = ICP_QAT_HW_COMP_20_CONFIG_CSR_NICE_PARAM_DEFAULT_VAL;
-- 
2.44.0

[RFC PATCH 4/6] Revert "crypto: qat - remove unused macros in qat_comp_alg.c"

2024-04-26T11:10:15Z

This reverts commit b4bf8295892924fca60d0704ac7cbc3b5897d233.
---
 drivers/crypto/intel/qat/qat_common/qat_comp_algs.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c b/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
index 79de04cfa012..b533984906ec 100644
--- a/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
+++ b/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
@@ -13,6 +13,15 @@
 #include "qat_compression.h"
 #include "qat_algs_send.h"
 
+#define QAT_RFC_1950_HDR_SIZE 2
+#define QAT_RFC_1950_FOOTER_SIZE 4
+#define QAT_RFC_1950_CM_DEFLATE 8
+#define QAT_RFC_1950_CM_DEFLATE_CINFO_32K 7
+#define QAT_RFC_1950_CM_MASK 0x0f
+#define QAT_RFC_1950_CM_OFFSET 4
+#define QAT_RFC_1950_DICT_MASK 0x20
+#define QAT_RFC_1950_COMP_HDR 0x785e
+
 static DEFINE_MUTEX(algs_lock);
 static unsigned int active_devs;
 
-- 
2.44.0

[RFC PATCH 3/6] Revert "crypto: qat - Remove zlib-deflate"

2024-04-26T11:10:11Z

This reverts commit e9dd20e0e5f62d01d9404db2cf9824d1faebcf71.
---
 .../intel/qat/qat_common/qat_comp_algs.c      | 129 +++++++++++++++++-
 1 file changed, 128 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c b/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
index 2ba4aa22e092..79de04cfa012 100644
--- a/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
+++ b/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
@@ -100,6 +100,69 @@ static void qat_comp_resubmit(struct work_struct *work)
 	acomp_request_complete(areq, ret);
 }
 
+static int parse_zlib_header(u16 zlib_h)
+{
+	int ret = -EINVAL;
+	__be16 header;
+	u8 *header_p;
+	u8 cmf, flg;
+
+	header = cpu_to_be16(zlib_h);
+	header_p = (u8 *)&header;
+
+	flg = header_p[0];
+	cmf = header_p[1];
+
+	if (cmf >> QAT_RFC_1950_CM_OFFSET > QAT_RFC_1950_CM_DEFLATE_CINFO_32K)
+		return ret;
+
+	if ((cmf & QAT_RFC_1950_CM_MASK) != QAT_RFC_1950_CM_DEFLATE)
+		return ret;
+
+	if (flg & QAT_RFC_1950_DICT_MASK)
+		return ret;
+
+	return 0;
+}
+
+static int qat_comp_rfc1950_callback(struct qat_compression_req *qat_req,
+				     void *resp)
+{
+	struct acomp_req *areq = qat_req->acompress_req;
+	enum direction dir = qat_req->dir;
+	__be32 qat_produced_adler;
+
+	qat_produced_adler = cpu_to_be32(qat_comp_get_produced_adler32(resp));
+
+	if (dir == COMPRESSION) {
+		__be16 zlib_header;
+
+		zlib_header = cpu_to_be16(QAT_RFC_1950_COMP_HDR);
+		scatterwalk_map_and_copy(&zlib_header, areq->dst, 0, QAT_RFC_1950_HDR_SIZE, 1);
+		areq->dlen += QAT_RFC_1950_HDR_SIZE;
+
+		scatterwalk_map_and_copy(&qat_produced_adler, areq->dst, areq->dlen,
+					 QAT_RFC_1950_FOOTER_SIZE, 1);
+		areq->dlen += QAT_RFC_1950_FOOTER_SIZE;
+	} else {
+		__be32 decomp_adler;
+		int footer_offset;
+		int consumed;
+
+		consumed = qat_comp_get_consumed_ctr(resp);
+		footer_offset = consumed + QAT_RFC_1950_HDR_SIZE;
+		if (footer_offset + QAT_RFC_1950_FOOTER_SIZE > areq->slen)
+			return -EBADMSG;
+
+		scatterwalk_map_and_copy(&decomp_adler, areq->src, footer_offset,
+					 QAT_RFC_1950_FOOTER_SIZE, 0);
+
+		if (qat_produced_adler != decomp_adler)
+			return -EBADMSG;
+	}
+	return 0;
+}
+
 static void qat_comp_generic_callback(struct qat_compression_req *qat_req,
 				      void *resp)
 {
@@ -221,6 +284,18 @@ static void qat_comp_alg_exit_tfm(struct crypto_acomp *acomp_tfm)
 	memset(ctx, 0, sizeof(*ctx));
 }
 
+static int qat_comp_alg_rfc1950_init_tfm(struct crypto_acomp *acomp_tfm)
+{
+	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
+	struct qat_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+	int ret;
+
+	ret = qat_comp_alg_init_tfm(acomp_tfm);
+	ctx->qat_comp_callback = &qat_comp_rfc1950_callback;
+
+	return ret;
+}
+
 static int qat_comp_alg_compress_decompress(struct acomp_req *areq, enum direction dir,
 					    unsigned int shdr, unsigned int sftr,
 					    unsigned int dhdr, unsigned int dftr)
@@ -316,6 +391,43 @@ static int qat_comp_alg_decompress(struct acomp_req *req)
 	return qat_comp_alg_compress_decompress(req, DECOMPRESSION, 0, 0, 0, 0);
 }
 
+static int qat_comp_alg_rfc1950_compress(struct acomp_req *req)
+{
+	if (!req->dst && req->dlen != 0)
+		return -EINVAL;
+
+	if (req->dst && req->dlen <= QAT_RFC_1950_HDR_SIZE + QAT_RFC_1950_FOOTER_SIZE)
+		return -EINVAL;
+
+	return qat_comp_alg_compress_decompress(req, COMPRESSION, 0, 0,
+						QAT_RFC_1950_HDR_SIZE,
+						QAT_RFC_1950_FOOTER_SIZE);
+}
+
+static int qat_comp_alg_rfc1950_decompress(struct acomp_req *req)
+{
+	struct crypto_acomp *acomp_tfm = crypto_acomp_reqtfm(req);
+	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
+	struct qat_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+	struct adf_accel_dev *accel_dev = ctx->inst->accel_dev;
+	u16 zlib_header;
+	int ret;
+
+	if (req->slen <= QAT_RFC_1950_HDR_SIZE + QAT_RFC_1950_FOOTER_SIZE)
+		return -EBADMSG;
+
+	scatterwalk_map_and_copy(&zlib_header, req->src, 0, QAT_RFC_1950_HDR_SIZE, 0);
+
+	ret = parse_zlib_header(zlib_header);
+	if (ret) {
+		dev_dbg(&GET_DEV(accel_dev), "Error parsing zlib header\n");
+		return ret;
+	}
+
+	return qat_comp_alg_compress_decompress(req, DECOMPRESSION, QAT_RFC_1950_HDR_SIZE,
+						QAT_RFC_1950_FOOTER_SIZE, 0, 0);
+}
+
 static struct acomp_alg qat_acomp[] = { {
 	.base = {
 		.cra_name = "deflate",
@@ -331,7 +443,22 @@ static struct acomp_alg qat_acomp[] = { {
 	.decompress = qat_comp_alg_decompress,
 	.dst_free = sgl_free,
 	.reqsize = sizeof(struct qat_compression_req),
-}};
+}, {
+	.base = {
+		.cra_name = "zlib-deflate",
+		.cra_driver_name = "qat_zlib_deflate",
+		.cra_priority = 4001,
+		.cra_flags = CRYPTO_ALG_ASYNC,
+		.cra_ctxsize = sizeof(struct qat_compression_ctx),
+		.cra_module = THIS_MODULE,
+	},
+	.init = qat_comp_alg_rfc1950_init_tfm,
+	.exit = qat_comp_alg_exit_tfm,
+	.compress = qat_comp_alg_rfc1950_compress,
+	.decompress = qat_comp_alg_rfc1950_decompress,
+	.dst_free = sgl_free,
+	.reqsize = sizeof(struct qat_compression_req),
+} };
 
 int qat_comp_algs_register(void)
 {
-- 
2.44.0
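
The checks in parse_zlib_header() above follow the two-byte stream header
defined in RFC 1950. This is a minimal userspace sketch of the same
validation, assuming the caller passes the first stream byte as cmf and the
second as flg; zlib_header_ok() is a hypothetical name. The final FCHECK
test comes from RFC 1950 and is an addition here, since the patch does not
perform it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Validate a zlib (RFC 1950) header: cmf is byte 0, flg is byte 1. */
bool zlib_header_ok(uint8_t cmf, uint8_t flg)
{
	/* CM (low nibble of CMF) must be 8, i.e. deflate. */
	if ((cmf & 0x0f) != 8)
		return false;

	/* CINFO (high nibble) above 7 would mean a window larger than 32 KiB. */
	if ((cmf >> 4) > 7)
		return false;

	/* FDICT set means a preset dictionary; rejected, as in the patch. */
	if (flg & 0x20)
		return false;

	/* FCHECK: CMF * 256 + FLG must be a multiple of 31 (not in the patch). */
	return ((((unsigned int)cmf) << 8) | flg) % 31 == 0;
}
```

Both 0x78 0x5e (the value of QAT_RFC_1950_COMP_HDR) and 0x78 0x9c pass
these checks, while a header with the FDICT bit set, such as 0x78 0x7e, is
rejected.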

[RFC PATCH 2/6] Revert "crypto: deflate - Remove zlib-deflate"

2024-04-26T11:10:08Z

This reverts commit 62a465c25e99b9a98259a6b7f5bb759f5296d501.
---
 crypto/deflate.c | 61 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 44 insertions(+), 17 deletions(-)

diff --git a/crypto/deflate.c b/crypto/deflate.c
index 6e31e0db0e86..b2a46f6dc961 100644
--- a/crypto/deflate.c
+++ b/crypto/deflate.c
@@ -39,20 +39,24 @@ struct deflate_ctx {
 	struct z_stream_s decomp_stream;
 };
 
-static int deflate_comp_init(struct deflate_ctx *ctx)
+static int deflate_comp_init(struct deflate_ctx *ctx, int format)
 {
 	int ret = 0;
 	struct z_stream_s *stream = &ctx->comp_stream;
 
 	stream->workspace = vzalloc(zlib_deflate_workspacesize(
-				    -DEFLATE_DEF_WINBITS, MAX_MEM_LEVEL));
+				    MAX_WBITS, MAX_MEM_LEVEL));
 	if (!stream->workspace) {
 		ret = -ENOMEM;
 		goto out;
 	}
-	ret = zlib_deflateInit2(stream, DEFLATE_DEF_LEVEL, Z_DEFLATED,
-				-DEFLATE_DEF_WINBITS, DEFLATE_DEF_MEMLEVEL,
-				Z_DEFAULT_STRATEGY);
+	if (format)
+		ret = zlib_deflateInit(stream, 3);
+	else
+		ret = zlib_deflateInit2(stream, DEFLATE_DEF_LEVEL, Z_DEFLATED,
+					-DEFLATE_DEF_WINBITS,
+					DEFLATE_DEF_MEMLEVEL,
+					Z_DEFAULT_STRATEGY);
 	if (ret != Z_OK) {
 		ret = -EINVAL;
 		goto out_free;
@@ -64,7 +68,7 @@ static int deflate_comp_init(struct deflate_ctx *ctx)
 	goto out;
 }
 
-static int deflate_decomp_init(struct deflate_ctx *ctx)
+static int deflate_decomp_init(struct deflate_ctx *ctx, int format)
 {
 	int ret = 0;
 	struct z_stream_s *stream = &ctx->decomp_stream;
@@ -74,7 +78,10 @@ static int deflate_decomp_init(struct deflate_ctx *ctx)
 		ret = -ENOMEM;
 		goto out;
 	}
-	ret = zlib_inflateInit2(stream, -DEFLATE_DEF_WINBITS);
+	if (format)
+		ret = zlib_inflateInit(stream);
+	else
+		ret = zlib_inflateInit2(stream, -DEFLATE_DEF_WINBITS);
 	if (ret != Z_OK) {
 		ret = -EINVAL;
 		goto out_free;
@@ -98,21 +105,21 @@ static void deflate_decomp_exit(struct deflate_ctx *ctx)
 	vfree(ctx->decomp_stream.workspace);
 }
 
-static int __deflate_init(void *ctx)
+static int __deflate_init(void *ctx, int format)
 {
 	int ret;
 
-	ret = deflate_comp_init(ctx);
+	ret = deflate_comp_init(ctx, format);
 	if (ret)
 		goto out;
-	ret = deflate_decomp_init(ctx);
+	ret = deflate_decomp_init(ctx, format);
 	if (ret)
 		deflate_comp_exit(ctx);
 out:
 	return ret;
 }
 
-static void *deflate_alloc_ctx(struct crypto_scomp *tfm)
+static void *gen_deflate_alloc_ctx(struct crypto_scomp *tfm, int format)
 {
 	struct deflate_ctx *ctx;
 	int ret;
@@ -121,7 +128,7 @@ static void *deflate_alloc_ctx(struct crypto_scomp *tfm)
 	if (!ctx)
 		return ERR_PTR(-ENOMEM);
 
-	ret = __deflate_init(ctx);
+	ret = __deflate_init(ctx, format);
 	if (ret) {
 		kfree(ctx);
 		return ERR_PTR(ret);
@@ -130,11 +137,21 @@ static void *deflate_alloc_ctx(struct crypto_scomp *tfm)
 	return ctx;
 }
 
+static void *deflate_alloc_ctx(struct crypto_scomp *tfm)
+{
+	return gen_deflate_alloc_ctx(tfm, 0);
+}
+
+static void *zlib_deflate_alloc_ctx(struct crypto_scomp *tfm)
+{
+	return gen_deflate_alloc_ctx(tfm, 1);
+}
+
 static int deflate_init(struct crypto_tfm *tfm)
 {
 	struct deflate_ctx *ctx = crypto_tfm_ctx(tfm);
 
-	return __deflate_init(ctx);
+	return __deflate_init(ctx, 0);
 }
 
 static void __deflate_exit(void *ctx)
@@ -269,7 +286,7 @@ static struct crypto_alg alg = {
 	.coa_decompress  	= deflate_decompress } }
 };
 
-static struct scomp_alg scomp = {
+static struct scomp_alg scomp[] = { {
 	.alloc_ctx		= deflate_alloc_ctx,
 	.free_ctx		= deflate_free_ctx,
 	.compress		= deflate_scompress,
@@ -279,7 +296,17 @@ static struct scomp_alg scomp = {
 		.cra_driver_name = "deflate-scomp",
 		.cra_module	 = THIS_MODULE,
 	}
-};
+}, {
+	.alloc_ctx		= zlib_deflate_alloc_ctx,
+	.free_ctx		= deflate_free_ctx,
+	.compress		= deflate_scompress,
+	.decompress		= deflate_sdecompress,
+	.base			= {
+		.cra_name	= "zlib-deflate",
+		.cra_driver_name = "zlib-deflate-scomp",
+		.cra_module	 = THIS_MODULE,
+	}
+} };
 
 static int __init deflate_mod_init(void)
 {
@@ -289,7 +316,7 @@ static int __init deflate_mod_init(void)
 	if (ret)
 		return ret;
 
-	ret = crypto_register_scomp(&scomp);
+	ret = crypto_register_scomps(scomp, ARRAY_SIZE(scomp));
 	if (ret) {
 		crypto_unregister_alg(&alg);
 		return ret;
@@ -301,7 +328,7 @@ static int __init deflate_mod_init(void)
 static void __exit deflate_mod_fini(void)
 {
 	crypto_unregister_alg(&alg);
-	crypto_unregister_scomp(&scomp);
+	crypto_unregister_scomps(scomp, ARRAY_SIZE(scomp));
 }
 
 subsys_initcall(deflate_mod_init);
-- 
2.44.0

[RFC PATCH 1/6] Revert "crypto: testmgr - Remove zlib-deflate"

2024-04-26T11:10:04Z

This reverts commit 30febae71c6182e0762dc7744737012b4f8e6a6d.
---
 crypto/testmgr.c | 10 +++++++
 crypto/testmgr.h | 75 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 00f5a6cf341a..ab904ab74bee 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -5733,6 +5733,16 @@ static const struct alg_test_desc alg_test_descs[] = {
 		.suite = {
 			.hash = __VECS(xxhash64_tv_template)
 		}
+	}, {
+		.alg = "zlib-deflate",
+		.test = alg_test_comp,
+		.fips_allowed = 1,
+		.suite = {
+			.comp = {
+				.comp = __VECS(zlib_deflate_comp_tv_template),
+				.decomp = __VECS(zlib_deflate_decomp_tv_template)
+			}
+		}
 	}, {
 		.alg = "zstd",
 		.test = alg_test_comp,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 5350cfd9d325..71d87a2fd842 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -34752,6 +34752,81 @@ static const struct comp_testvec deflate_decomp_tv_template[] = {
 	},
 };
 
+static const struct comp_testvec zlib_deflate_comp_tv_template[] = {
+	{
+		.inlen	= 70,
+		.outlen	= 44,
+		.input	= "Join us now and share the software "
+			"Join us now and share the software ",
+		.output	= "\x78\x5e\xf3\xca\xcf\xcc\x53\x28"
+			  "\x2d\x56\xc8\xcb\x2f\x57\x48\xcc"
+			  "\x4b\x51\x28\xce\x48\x2c\x4a\x55"
+			  "\x28\xc9\x48\x55\x28\xce\x4f\x2b"
+			  "\x29\x07\x71\xbc\x08\x2b\x01\x00"
+			  "\x7c\x65\x19\x3d",
+	}, {
+		.inlen	= 191,
+		.outlen	= 129,
+		.input	= "This document describes a compression method based on the DEFLATE"
+			"compression algorithm.  This document defines the application of "
+			"the DEFLATE algorithm to the IP Payload Compression Protocol.",
+		.output	= "\x78\x5e\x5d\xce\x41\x0a\xc3\x30"
+			  "\x0c\x04\xc0\xaf\xec\x0b\xf2\x87"
+			  "\xd2\xa6\x50\xe8\xc1\x07\x7f\x40"
+			  "\xb1\x95\x5a\x60\x5b\xc6\x56\x0f"
+			  "\xfd\x7d\x93\x1e\x42\xe8\x51\xec"
+			  "\xee\x20\x9f\x64\x20\x6a\x78\x17"
+			  "\xae\x86\xc8\x23\x74\x59\x78\x80"
+			  "\x10\xb4\xb4\xce\x63\x88\x56\x14"
+			  "\xb6\xa4\x11\x0b\x0d\x8e\xd8\x6e"
+			  "\x4b\x8c\xdb\x7c\x7f\x5e\xfc\x7c"
+			  "\xae\x51\x7e\x69\x17\x4b\x65\x02"
+			  "\xfc\x1f\xbc\x4a\xdd\xd8\x7d\x48"
+			  "\xad\x65\x09\x64\x3b\xac\xeb\xd9"
+			  "\xc2\x01\xc0\xf4\x17\x3c\x1c\x1c"
+			  "\x7d\xb2\x52\xc4\xf5\xf4\x8f\xeb"
+			  "\x6a\x1a\x34\x4f\x5f\x2e\x32\x45"
+			  "\x4e",
+	},
+};
+
+static const struct comp_testvec zlib_deflate_decomp_tv_template[] = {
+	{
+		.inlen	= 128,
+		.outlen	= 191,
+		.input	= "\x78\x9c\x5d\x8d\x31\x0e\xc2\x30"
+			  "\x10\x04\xbf\xb2\x2f\xc8\x1f\x10"
+			  "\x04\x09\x89\xc2\x85\x3f\x70\xb1"
+			  "\x2f\xf8\x24\xdb\x67\xd9\x47\xc1"
+			  "\xef\x49\x68\x12\x51\xae\x76\x67"
+			  "\xd6\x27\x19\x88\x1a\xde\x85\xab"
+			  "\x21\xf2\x08\x5d\x16\x1e\x20\x04"
+			  "\x2d\xad\xf3\x18\xa2\x15\x85\x2d"
+			  "\x69\xc4\x42\x83\x23\xb6\x6c\x89"
+			  "\x71\x9b\xef\xcf\x8b\x9f\xcf\x33"
+			  "\xca\x2f\xed\x62\xa9\x4c\x80\xff"
+			  "\x13\xaf\x52\x37\xed\x0e\x52\x6b"
+			  "\x59\x02\xd9\x4e\xe8\x7a\x76\x1d"
+			  "\x02\x98\xfe\x8a\x87\x83\xa3\x4f"
+			  "\x56\x8a\xb8\x9e\x8e\x5c\x57\xd3"
+			  "\xa0\x79\xfa\x02\x2e\x32\x45\x4e",
+		.output	= "This document describes a compression method based on the DEFLATE"
+			"compression algorithm.  This document defines the application of "
+			"the DEFLATE algorithm to the IP Payload Compression Protocol.",
+	}, {
+		.inlen	= 44,
+		.outlen	= 70,
+		.input	= "\x78\x9c\xf3\xca\xcf\xcc\x53\x28"
+			  "\x2d\x56\xc8\xcb\x2f\x57\x48\xcc"
+			  "\x4b\x51\x28\xce\x48\x2c\x4a\x55"
+			  "\x28\xc9\x48\x55\x28\xce\x4f\x2b"
+			  "\x29\x07\x71\xbc\x08\x2b\x01\x00"
+			  "\x7c\x65\x19\x3d",
+		.output	= "Join us now and share the software "
+			"Join us now and share the software ",
+	},
+};
+
 /*
  * LZO test vectors (null-terminated strings).
  */
-- 
2.44.0
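
For readers comparing these vectors with the plain deflate ones: each
zlib-deflate vector starts with the two-byte zlib header described in
RFC 1950 (0x78 0x5e in the compression template, 0x78 0x9c in the
decompression template). A minimal shell sketch of the header checks
follows. The function names zlib_header_ok and zlib_flevel are hypothetical
helpers for illustration and are not part of the patch.

```shell
# zlib_header_ok CMF FLG: succeed if the two bytes form a valid zlib header
# for a deflate stream without a preset dictionary (RFC 1950, section 2.2).
zlib_header_ok() {
    cmf=$(( $1 ))
    flg=$(( $2 ))
    # CM (low nibble of CMF) must be 8, meaning "deflate"
    [ $(( cmf & 0x0f )) -eq 8 ] || return 1
    # CINFO (high nibble) is log2(window size) - 8 and must not exceed 7
    [ $(( cmf >> 4 )) -le 7 ] || return 1
    # FDICT (bit 5 of FLG) must be clear: no preset dictionary
    [ $(( flg & 0x20 )) -eq 0 ] || return 1
    # FCHECK: CMF * 256 + FLG must be a multiple of 31
    [ $(( (cmf * 256 + flg) % 31 )) -eq 0 ]
}

# zlib_flevel FLG: print the FLEVEL hint stored in the top two bits of FLG
zlib_flevel() {
    echo $(( ($1 >> 6) & 3 ))
}

zlib_header_ok 0x78 0x5e && echo "comp header ok, FLEVEL $(zlib_flevel 0x5e)"
zlib_header_ok 0x78 0x9c && echo "decomp header ok, FLEVEL $(zlib_flevel 0x9c)"
```

Both headers pass the check. FLEVEL is only a hint about the compressor
setting and is not needed for decompression.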

[RFC PATCH 0/6] btrfs: offload zlib-deflate to accelerators

2024-04-26T11:10:00Z

Add support for zlib compression and decompression through the acomp
APIs in BTRFS. This enables [de]compression operations to be offloaded
to accelerators. This is a rework of [1].

This set also re-enables zlib-deflate in the Crypto API and in the QAT
driver, as both were removed in [2] since there was no in-kernel user.
The re-enablement is done by reverting the commits that removed the
feature.

The code has been benchmarked on a system with the following specs:
 * Dual socket Intel(R) Xeon(R) Platinum 8470N
 * 512GB (16x32GB DDR5 4800 MT/s [4800 MT/s])
 * 4 NVMe disks (349.3G INTEL SSDPE21K375GA)
 * 2 QAT 4xxx devices, one per socket, configured for compression only
 * Kernel 6.8.2

The test consisted of 4 processes running `dd` that wrote in parallel
50GB of data (Silesia corpus) to the 4 NVMe disks separately. We captured
disk write throughput, CPU utilization and compression ratio:

    +---------------------------+---------+---------+---------+---------+
    |                           | QAT-L9  | ZSTD-L3 | ZLIB-L3 | LZO-L1  |
    +---------------------------+---------+---------+---------+---------+
    | Disk Write TPUT (GiB/s)   | 6.5     | 5.2     | 2.2     | 6.5     |
    +---------------------------+---------+---------+---------+---------+
    | CPU utils %age @208 cores | 4.56%   | 15.67%  | 12.79%  | 19.85%  |
    +---------------------------+---------+---------+---------+---------+
    | Compression Ratio         | 34%     | 35%     | 37%     | 58%     |
    +---------------------------+---------+---------+---------+---------+

From the results we see that BTRFS with QAT configured for zlib-deflate Level 9
provides the highest throughput (matched only by lzo), with lower CPU
utilization and a better compression ratio than software zstd-l3, zlib-l3 and
lzo.

Limitations:
  * The implementation is synchronous, even though acomp is an asynchronous API.
  * The implementation always tries to use an acomp tfm, even if only
    zlib-deflate-scomp is present. This ignores the compression level
    configured for zlib.
  * There is no way to configure a compression level for acomp(zlib-deflate).
    This is hardcoded in the acomp algorithm implementation/provider.

[1] https://lore.kernel.org/all/1467083180-111750-1-git-send-email-weigang.li@intel.com/  
[2] https://lore.kernel.org/all/ZO8ULhlJSrJ0Mcsx@gondor.apana.org.au/

Giovanni Cabiddu (5):
  Revert "crypto: testmgr - Remove zlib-deflate"
  Revert "crypto: deflate - Remove zlib-deflate"
  Revert "crypto: qat - Remove zlib-deflate"
  Revert "crypto: qat - remove unused macros in qat_comp_alg.c"
  crypto: qat - change compressor settings for QAT GEN4

Weigang Li (1):
  btrfs: zlib: add support for zlib-deflate through acomp

 crypto/deflate.c                              |  61 +++--
 crypto/testmgr.c                              |  10 +
 crypto/testmgr.h                              |  75 ++++++
 .../crypto/intel/qat/qat_common/adf_gen4_dc.c |   4 +-
 .../intel/qat/qat_common/qat_comp_algs.c      | 138 ++++++++++-
 fs/btrfs/zlib.c                               | 216 ++++++++++++++++++
 6 files changed, 484 insertions(+), 20 deletions(-)

base-commit: ed265f7fd9a635d77c8022fc6d9a1b735dd4dfd7
-- 
2.44.0

[PATCH] btrfs: remove the recursive include of btrfs_inode.h from itself

2024-04-26T04:18:53Z

btrfs_inode.h includes itself. Although this does not cause any problem,
it is reported by clangd and is unnecessary.

Just remove the recursive include.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/btrfs_inode.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 91c994b569f3..de918d89a582 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -19,7 +19,6 @@
 #include 
 #include 
 #include "block-rsv.h"
-#include "btrfs_inode.h"
 #include "extent_map.h"
 #include "extent_io.h"
 #include "extent-io-tree.h"
-- 
2.44.0

[PATCH] btrfs-progs: fix documentation build due to phony contents.rst

2024-04-25T23:00:44Z

[BUG]
Since commit 8049446bb0ba ("btrfs-progs: docs: placeholder for
contents.rst file on older sphinx version"), on systems with a much newer
sphinx-build, "make" does not work for the Documentation directory:

 $ make clean-all && ./autogen.sh && ./configure --prefix=/usr/ && make -j12
 $ ls -alh Documentation/_build
 ls: cannot access 'Documentation/_build': No such file or directory

The sphinx-build has a much newer version:

 $ sphinx-build --version
 sphinx-build 7.2.6

[CAUSE]
On systems that don't need the workaround, the phony target
contents.rst seems to cause a dependency loop:

 GNU Make 4.4.1
 Built for x86_64-pc-linux-gnu
 Copyright (C) 1988-2023 Free Software Foundation, Inc.
 License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
 This is free software: you are free to change and redistribute it.
 There is NO WARRANTY, to the extent permitted by law.
 Reading makefiles...
 Reading makefile 'Makefile'...
 Updating makefiles....
  Considering target file 'Makefile'.
   Looking for an implicit rule for 'Makefile'.
    Trying pattern rule '%:' with stem 'Makefile'.
   Found implicit rule '%:' for 'Makefile'.
  Finished prerequisites of target file 'Makefile'.
  No need to remake target 'Makefile'.
 Updating goal targets....
 Considering target file 'contents.rst'.
  File 'contents.rst' does not exist.
 Finished prerequisites of target file 'contents.rst'.
 Must remake target 'contents.rst'.
 Makefile:35: update target 'contents.rst' due to: target is .PHONY
 if [ "$(sphinx-build --version | cut -d' ' -f2)" \< "1.7.7" ]; then \
 	touch contents.rst; \
 fi
 Putting child 0x64ee81960130 (contents.rst) PID 66833 on the chain.
 Live child 0x64ee81960130 (contents.rst) PID 66833
 Reaping winning child 0x64ee81960130 PID 66833
 Removing child 0x64ee81960130 PID 66833 from chain.
 Successfully remade target file 'contents.rst'.

All the default make invocation does is try to generate contents.rst, but
since we have a much newer version, the file is not generated at all.
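
A side note on the version check visible in the log above: the
`[ ... \< "1.7.7" ]` test compares strings lexicographically. That gives the
intended answer for 7.2.6, but it would misjudge a version such as 1.10.0,
which sorts before 1.7.7 as a string. A minimal sketch of a numeric-aware
check, assuming a sort that supports -V (GNU coreutils does). The helper
version_lt is hypothetical and not part of this patch:

```shell
# version_lt A B: succeed if version A is strictly older than version B.
# Assumes "sort -V" (version sort) is available.
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example: decide whether the old-sphinx workaround would be needed
if version_lt "7.2.6" "1.7.7"; then
    echo "workaround needed"
else
    echo "workaround not needed"
fi
```

In a Makefile recipe the `$` signs would need to be doubled, as in the
existing rule.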

[FIX]
Instead of a phony target, move the contents.rst generation into the
man page target, so that contents.rst no longer causes a target loop.

Fixes: 8049446bb0ba ("btrfs-progs: docs: placeholder for contents.rst file on older sphinx version")
Signed-off-by: Qu Wenruo 
---
 Documentation/Makefile.in | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/Documentation/Makefile.in b/Documentation/Makefile.in
index b4c09dcc255a..76e0cbbc242f 100644
--- a/Documentation/Makefile.in
+++ b/Documentation/Makefile.in
@@ -28,19 +28,16 @@ man5dir = $(mandir)/man5
 man8dir = $(mandir)/man8
 
 .PHONY: all man help
-.PHONY: contents.rst
+
+# Build manual pages by default
+all: man
 
 # Workaround for old sphinx that requires the contents.rst file
-contents.rst:
+man:
 	@if [ "$$(sphinx-build --version | cut -d' ' -f2)" \< "1.7.7" ]; then \
 		touch contents.rst; \
 	fi
 
-# Build manual pages by default
-
-all: man
-
-man:
 	$(QUIET_SPHINX)$(SPHINXBUILD) -M man "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
 
 help:
-- 
2.44.0

[PATCH 2/2] btrfs: misc-tests: remove the subvol-delete-qgroup test case

2024-04-25T22:06:20Z

The test case relies on the `--delete-qgroup` option, but the feature was not
properly designed from the very beginning and would not work in most
cases.

The test case does not take the complexity of subvolume dropping into
consideration and only tests the simplest cases.

Since the `--delete-qgroup` option patch is reverted, we need to
revert this one too.

Signed-off-by: Qu Wenruo 
---
 .../061-subvol-delete-qgroup/test.sh          | 47 -------------------
 1 file changed, 47 deletions(-)
 delete mode 100755 tests/misc-tests/061-subvol-delete-qgroup/test.sh

diff --git a/tests/misc-tests/061-subvol-delete-qgroup/test.sh b/tests/misc-tests/061-subvol-delete-qgroup/test.sh
deleted file mode 100755
index c2637ac33cdc..000000000000
--- a/tests/misc-tests/061-subvol-delete-qgroup/test.sh
+++ /dev/null
@@ -1,47 +0,0 @@
-#!/bin/bash
-# Create subvolumes with enabled qutoas and check that subvolume deleteion will
-# also delete the 0-level qgruop.
-
-source "$TEST_TOP/common" || exit
-
-setup_root_helper
-prepare_test_dev
-
-run_check_mkfs_test_dev
-run_check_mount_test_dev
-run_check $SUDO_HELPER dd if=/dev/zero of="$TEST_MNT"/file bs=1M count=1
-
-# Without quotas
-run_check $SUDO_HELPER "$TOP/btrfs" subvolume create "$TEST_MNT/subv1"
-run_check $SUDO_HELPER "$TOP/btrfs" subvolume create "$TEST_MNT/subv2"
-run_check $SUDO_HELPER "$TOP/btrfs" subvolume delete --delete-qgroup "$TEST_MNT/subv1"
-run_check $SUDO_HELPER "$TOP/btrfs" subvolume delete --no-delete-qgroup "$TEST_MNT/subv2"
-run_check $SUDO_HELPER "$TOP/btrfs" filesystem sync "$TEST_MNT"
-run_check $SUDO_HELPER "$TOP/btrfs" subvolume sync "$TEST_MNT"
-run_check $SUDO_HELPER "$TOP/btrfs" subvol list "$TEST_MNT"
-
-# With quotas enabled
-run_check $SUDO_HELPER "$TOP/btrfs" quota enable "$TEST_MNT"
-run_check $SUDO_HELPER "$TOP/btrfs" subvolume create "$TEST_MNT/subv1"
-rootid1=$(run_check_stdout "$TOP/btrfs" inspect-internal rootid "$TEST_MNT/subv1")
-run_check $SUDO_HELPER "$TOP/btrfs" subvolume create "$TEST_MNT/subv2"
-rootid2=$(run_check_stdout "$TOP/btrfs" inspect-internal rootid "$TEST_MNT/subv2")
-run_check $SUDO_HELPER "$TOP/btrfs" qgroup create 1/1 "$TEST_MNT"
-run_check $SUDO_HELPER "$TOP/btrfs" qgroup assign "0/$rootid1" 1/1 "$TEST_MNT"
-run_check $SUDO_HELPER "$TOP/btrfs" qgroup assign "0/$rootid2" 1/1 "$TEST_MNT"
-run_check $SUDO_HELPER "$TOP/btrfs" quota rescan --wait "$TEST_MNT"
-run_check $SUDO_HELPER "$TOP/btrfs" subvol list "$TEST_MNT"
-run_check $SUDO_HELPER "$TOP/btrfs" qgroup show "$TEST_MNT"
-run_check $SUDO_HELPER "$TOP/btrfs" subvolume delete --delete-qgroup "$TEST_MNT/subv1"
-run_check $SUDO_HELPER "$TOP/btrfs" subvolume delete --no-delete-qgroup "$TEST_MNT/subv2"
-run_check $SUDO_HELPER "$TOP/btrfs" filesystem sync "$TEST_MNT"
-run_check $SUDO_HELPER "$TOP/btrfs" subvolume sync "$TEST_MNT"
-run_check $SUDO_HELPER "$TOP/btrfs" subvol list "$TEST_MNT"
-run_check $SUDO_HELPER "$TOP/btrfs" qgroup show "$TEST_MNT"
-if run_check_stdout $SUDO_HELPER "$TOP/btrfs" qgroup show "$TEST_MNT" | grep -q "0/$rootid1"; then
-	_fail "qgroup 0/$rootid1 not deleted"
-fi
-if ! run_check_stdout $SUDO_HELPER "$TOP/btrfs" qgroup show "$TEST_MNT" | grep -q "0/$rootid2"; then
-	_fail "qgroup 0/$rootid2 deleted"
-fi
-run_check_umount_test_dev
-- 
2.44.0

[PATCH 1/2] Revert "btrfs-progs: subvol delete: add options to delete the qgroup"

2024-04-25T22:06:17Z

This reverts commit 9da773aa46ba33a9c3bdd83b31e15b031b3bfe4d.

There are several problems related to the --delete-qgroup option:

- Currently the kernel does not allow deleting non-empty qgroups

- A qgroup can only be empty after the subvolume is fully dropped and a
  transaction is committed
  The tool does not take either factor into consideration

- Things like drop_subtree_threshold or other operations can mark the qgroup
  inconsistent and skip accounting
  This can mean the target qgroup will never be empty until the next rescan

On the other hand, even if we did it the proper way, it would hugely delay
the command (waiting until the subvolume is dropped).

Furthermore, even if all the waiting is handled properly,
drop_subtree_threshold can still prevent us from deleting the qgroup (qgroup
numbers are inconsistent, and accounting is skipped completely).

So this qgroup cleanup needs kernel work anyway, and it
would be much easier to handle all the operations inside the kernel.

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-subvolume.rst |  7 -------
 cmds/subvolume.c                  | 26 --------------------------
 2 files changed, 33 deletions(-)

diff --git a/Documentation/btrfs-subvolume.rst b/Documentation/btrfs-subvolume.rst
index d4379a2df83d..d1e89f15e1e2 100644
--- a/Documentation/btrfs-subvolume.rst
+++ b/Documentation/btrfs-subvolume.rst
@@ -112,13 +112,6 @@ delete [options] [ [...]], delete -i|--subvolid 
         -i|--subvolid 
                 subvolume id to be removed instead of the  that should point to the
                 filesystem with the subvolume
-
-        --delete-qgroup
-                also delete the qgroup 0/subvolid if it exists
-
-        --no-delete-qgroup
-                do not delete the 0/subvolid qgroup (default)
-
         -v|--verbose
                 (deprecated) alias for global *-v* option
 
diff --git a/cmds/subvolume.c b/cmds/subvolume.c
index 319f595ca495..f77a6e091569 100644
--- a/cmds/subvolume.c
+++ b/cmds/subvolume.c
@@ -348,8 +348,6 @@ static const char * const cmd_subvolume_delete_usage[] = {
 	OPTLINE("-c|--commit-after", "wait for transaction commit at the end of the operation"),
 	OPTLINE("-C|--commit-each", "wait for transaction commit after deleting each subvolume"),
 	OPTLINE("-i|--subvolid", "subvolume id of the to be removed subvolume"),
-	OPTLINE("--delete-qgroup", "also delete the qgroup 0/subvolid if it exists"),
-	OPTLINE("--no-delete-qgroup", "do not delete the qgroup 0/subvolid if it exists (default)"),
 	OPTLINE("-v|--verbose", "deprecated, alias for global -v option"),
 	HELPINFO_INSERT_GLOBALS,
 	HELPINFO_INSERT_VERBOSE,
@@ -378,20 +376,15 @@ static int cmd_subvolume_delete(const struct cmd_struct *cmd, int argc, char **a
 	enum { COMMIT_AFTER = 1, COMMIT_EACH = 2 };
 	enum btrfs_util_error err;
 	uint64_t default_subvol_id, target_subvol_id = 0;
-	bool opt_delete_qgroup = false;
 
 	optind = 0;
 	while (1) {
 		int c;
-		enum { GETOPT_VAL_DELETE_QGROUP = GETOPT_VAL_FIRST,
-		       GETOPT_VAL_NO_DELETE_QGROUP };
 		static const struct option long_options[] = {
 			{"commit-after", no_argument, NULL, 'c'},
 			{"commit-each", no_argument, NULL, 'C'},
 			{"subvolid", required_argument, NULL, 'i'},
 			{"verbose", no_argument, NULL, 'v'},
-			{"delete-qgroup", no_argument, NULL, GETOPT_VAL_DELETE_QGROUP },
-			{"no-delete-qgroup", no_argument, NULL, GETOPT_VAL_NO_DELETE_QGROUP },
 			{NULL, 0, NULL, 0}
 		};
 
@@ -412,12 +405,6 @@ static int cmd_subvolume_delete(const struct cmd_struct *cmd, int argc, char **a
 		case 'v':
 			bconf_be_verbose();
 			break;
-		case GETOPT_VAL_DELETE_QGROUP:
-			opt_delete_qgroup = true;
-			break;
-		case GETOPT_VAL_NO_DELETE_QGROUP:
-			opt_delete_qgroup = false;
-			break;
 		default:
 			usage_unknown_option(cmd, argv);
 		}
@@ -553,19 +540,6 @@ again:
 			warning("deletion failed with EPERM, you don't have permissions or send may be in progress");
 		ret = 1;
 		goto out;
-	} else if (opt_delete_qgroup) {
-		struct btrfs_ioctl_qgroup_create_args args = { .qgroupid = target_subvol_id };
-
-		ret = ioctl(fd, BTRFS_IOC_QGROUP_CREATE, &args);
-		if (ret == 0) {
-			pr_verbose(LOG_DEFAULT, "Delete qgroup 0/%" PRIu64 "\n", target_subvol_id);
-		} else if (ret < 0 && (errno == ENOTCONN || errno == ENOENT)) {
-			/* Quotas not enabled, or there's no qgroup. */
-		} else {
-			warning("unable to delete qgroup 0/%llu: %m", subvolid);
-		}
-		/* Qgroup errors are not fatal. */
-		ret = 0;
 	}
 
 	if (commit_mode == COMMIT_EACH) {
-- 
2.44.0

[PATCH 0/2] btrfs-progs: revert `btrfs subvolume delete --delete-qgroup` option

2024-04-25T22:06:16Z

The `btrfs subvolume delete --delete-qgroup` option does not
work for a lot of real world cases.

This leads to unnecessary errors, and can be very confusing for
end users.

Furthermore, the new options do not take the lifespan of a subvolume into
consideration, nor the possible conflicts with other qgroup features.

Although it's already too late, we should revert it to prevent further
confusion and damage.

Qu Wenruo (2):
  Revert "btrfs-progs: subvol delete: add options to delete the qgroup"
  btrfs: misc-tests: remove the subvol-delete-qgroup test case

 Documentation/btrfs-subvolume.rst             |  7 ---
 cmds/subvolume.c                              | 26 ----------
 .../061-subvol-delete-qgroup/test.sh          | 47 -------------------
 3 files changed, 80 deletions(-)
 delete mode 100755 tests/misc-tests/061-subvol-delete-qgroup/test.sh

--
2.44.0

Re: [PATCH v2 2/2] btrfs: automatically remove the subvolume qgroup

2024-04-25T21:51:19Z


On 2024/4/25 22:04, David Sterba wrote:
> On Thu, Apr 25, 2024 at 07:49:12AM +0930, Qu Wenruo wrote:
>>
>>
>> On 2024/4/24 22:11, David Sterba wrote:
>>> On Fri, Apr 19, 2024 at 07:16:53PM +0930, Qu Wenruo wrote:
>>>> Currently if we fully removed a subvolume (not only unlinked, but fully
>>>> dropped its root item), its qgroup would not be removed.
>>>>
>>>> Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups.
>>>
>>> There's also an option 'btrfs subvolume delete --delete-qgroup' that
>>> does that and is going to be default in 6.9. With this kernel change it
>>> would break the behaviour of the --no-delete-qgroup, which is there for
>>> the case something depends on that.  For now I'd rather postpone
>>> changing the kernel behaviour.
>>>
>>
>> A quick glance of the --delete-qgroup shows it won't work as expected at
>> all.
>>
>> Firstly, the qgroup delete requires the qgroup numbers to be 0.
>> Meanwhile qgroup numbers can only be 0 after 1) the full subvolume has
>> been dropped 2) a transaction is committed to reflect the qgroup numbers.
>
> The deletion option calls ioctl, so this means that 'btrfs qgroup remove'
> will not delete it either?

Nope, at least if the subvolume is not cleaned up immediately.
>
>> Both situation is only handled in my patchset, thus this means for a lot
>> of cases it won't work at all.
>>
>> Furthermore, there is the drop_subtree_threshold thing, which can mark
>> qgroup inconsistent and skip accounting, making the target subvolume's
>> qgroup numbers never fall back to 0 (until next rescan).
>>
>> So I'm afraid the --delete-qgroup won't work until the 1/2 patch get
>> merged (allowing deleting qgroups as long as the target subvolume is gone).
>
> Ok, so for emulation of the complete removal in userspace it's
>
> btrfs subvolume delete 123
> btrfs subvolume sync 123
> btrfs qgroup remove 0/123
>
> but this needs to wait until the sync is finished and that is not
> expected for the subvolume delete command.

That's the problem, and why doing it in user space has its limits.

Furthermore, with drop_subtree_threshold or other qgroup operations
marking the qgroup inconsistent, you can not delete that qgroup at all,
until the next rescan.

> It needs to be fixed but now
> I'm not sure this can be default in 6.9 as planned.

I'd say, you should not implement this feature without really
understanding the challenges in the first place.

And that's why I really prefer you send out non-trivial btrfs-progs patches
for review, rather than pushing them directly into the github repo.

Thanks,
Qu

Re: [PATCH 02/30] btrfs: Use a folio in write_dev_supers()

2024-04-25T16:38:58Z

On Thu, Apr 25, 2024 at 04:44:03PM +0200, David Sterba wrote:
> On Sat, Apr 20, 2024 at 03:49:57AM +0100, Matthew Wilcox (Oracle) wrote:
> > @@ -3812,8 +3814,7 @@ static int write_dev_supers(struct btrfs_device *device,
> >  		bio->bi_iter.bi_sector = bytenr >> SECTOR_SHIFT;
> >  		bio->bi_private = device;
> >  		bio->bi_end_io = btrfs_end_super_write;
> > -		__bio_add_page(bio, page, BTRFS_SUPER_INFO_SIZE,
> > -			       offset_in_page(bytenr));
> > +		bio_add_folio_nofail(bio, folio, BTRFS_SUPER_INFO_SIZE, offset);
> 
> Compilation fails when btrfs is built as a module, bio_add_folio_nofail()
> is not exported. I can keep __bio_add_page() and the conversion can be
> done later.

I'd rather you added the obvious patch I just sent ...

(please don't get me stuck in the infinite loop of "you can't export a
symbol without any users" "you can't add a user until this is exported")

Re: [PATCH 02/30] btrfs: Use a folio in write_dev_supers()

2024-04-25T14:51:40Z

On Sat, Apr 20, 2024 at 03:49:57AM +0100, Matthew Wilcox (Oracle) wrote:
> @@ -3812,8 +3814,7 @@ static int write_dev_supers(struct btrfs_device *device,
>  		bio->bi_iter.bi_sector = bytenr >> SECTOR_SHIFT;
>  		bio->bi_private = device;
>  		bio->bi_end_io = btrfs_end_super_write;
> -		__bio_add_page(bio, page, BTRFS_SUPER_INFO_SIZE,
> -			       offset_in_page(bytenr));
> +		bio_add_folio_nofail(bio, folio, BTRFS_SUPER_INFO_SIZE, offset);

Compilation fails when btrfs is built as a module, bio_add_folio_nofail()
is not exported. I can keep __bio_add_page() and the conversion can be
done later.

Re: [PATCH v2 2/2] btrfs: automatically remove the subvolume qgroup

2024-04-25T12:42:26Z

On Thu, Apr 25, 2024 at 07:49:12AM +0930, Qu Wenruo wrote:
> 
> 
> On 2024/4/24 22:11, David Sterba wrote:
> > On Fri, Apr 19, 2024 at 07:16:53PM +0930, Qu Wenruo wrote:
> >> Currently if we fully removed a subvolume (not only unlinked, but fully
> >> dropped its root item), its qgroup would not be removed.
> >>
> >> Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups.
> >
> > There's also an option 'btrfs subvolume delete --delete-qgroup' that
> > does that and is going to be default in 6.9. With this kernel change it
> > would break the behaviour of the --no-delete-qgroup, which is there for
> > the case something depends on that.  For now I'd rather postpone
> > changing the kernel behaviour.
> >
> 
> A quick glance of the --delete-qgroup shows it won't work as expected at
> all.
> 
> Firstly, the qgroup delete requires the qgroup numbers to be 0.
> Meanwhile qgroup numbers can only be 0 after 1) the full subvolume has
> been dropped 2) a transaction is committed to reflect the qgroup numbers.

The deletion option calls ioctl, so this means that 'btrfs qgroup remove'
will not delete it either?

> Both situation is only handled in my patchset, thus this means for a lot
> of cases it won't work at all.
> 
> Furthermore, there is the drop_subtree_threshold thing, which can mark
> qgroup inconsistent and skip accounting, making the target subvolume's
> qgroup numbers never fall back to 0 (until next rescan).
> 
> So I'm afraid the --delete-qgroup won't work until the 1/2 patch get
> merged (allowing deleting qgroups as long as the target subvolume is gone).

Ok, so for emulation of the complete removal in userspace it's

btrfs subvolume delete 123
btrfs subvolume sync 123
btrfs qgroup remove 0/123

but this needs to wait until the sync is finished and that is not
expected for the subvolume delete command. It needs to be fixed but now
I'm not sure this can be default in 6.9 as planned.
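
The three steps above are shorthand. A minimal dry-run sketch of the same
sequence, assuming the full command forms `btrfs subvolume sync <path>
<subvolid>` and `btrfs qgroup destroy <qgroupid> <path>` (destroy rather
than remove, since `qgroup remove` drops a parent/child relation). The
function name and the example mount point are hypothetical, and the function
only prints the commands:

```shell
# print_delete_with_qgroup MNT SUBVOL ID: print the userspace sequence for
# removing subvolume MNT/SUBVOL (id ID) together with its level-0 qgroup.
# Nothing is executed here; the output is meant for review or a test setup.
print_delete_with_qgroup() {
    mnt=$1
    subvol=$2
    id=$3
    echo "btrfs subvolume delete $mnt/$subvol"
    # Blocks until the cleaner has fully dropped the subvolume
    echo "btrfs subvolume sync $mnt $id"
    # Commit a transaction so the qgroup numbers are up to date
    echo "btrfs filesystem sync $mnt"
    echo "btrfs qgroup destroy 0/$id $mnt"
}

print_delete_with_qgroup /mnt/test subv1 256
```

As discussed in this thread, the last step can still fail if the qgroup has
been marked inconsistent, until the next rescan.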

Re: mkfs.btrfs enabled RST by default casuing unable to mount on stable kernel

2024-04-25T03:07:43Z

On 2024/4/25 3:57, Goffredo Baroncelli wrote:
> On 24/04/2024 18.15, HAN Yuwei wrote:
>>
>> On 2024/4/24 17:48, David Sterba wrote:
>>> On Wed, Apr 24, 2024 at 01:45:24PM +0800, HAN Yuwei wrote:
>>>> I have found incompatibility issue on btrfs-prog & kernel. btrfs-progs
>>>> is v6.7.1, kernel is 6.7.5-aosc-main.
>>>>
>>>> Using this command to create btrfs volume:
>>>> # mkfs.btrfs -f -d raid10 -m raid1c4 -s 16k -L HYWDATA_ZONED_TEST
>>>> /dev/sdb /dev/sdc /dev/sdd /dev/sde
>>>>
>>>> When mounting, dmesg said:
>>>> [  329.071403] BTRFS info (device sdb): first mount of filesystem
>>>> 7b4f2ca6-efe3-48d9-81f6-ba65a00db85e
>>>> [  329.080422] BTRFS info (device sdb): using crc32c (crc32c-generic)
>>>> checksum algorithm
>>>> [  329.088222] BTRFS info (device sdb): using free space tree
>>>> [  329.093673] BTRFS error (device sdb): cannot mount because of 
>>>> unknown
>>>> incompat features (0x5b41)
>>>> [  329.102442] BTRFS error (device sdb): open_ctree failed
>>>>
>>>> dump-super said:
>>>> [...]
>>>> incompat_flags          0x5b41
>>>>                           ( MIXED_BACKREF |
>>>>                             EXTENDED_IREF |
>>>>                             SKINNY_METADATA |
>>>>                             NO_HOLES |
>>>>                             RAID1C34 |
>>>>                             ZONED |
>>>>                             RAID_STRIPE_TREE )
>>>> [...]
>>>>
>>>>
>>>> RAID_STRIPE_TREE need CONFIG_BTRFS_DEBUG to be supported and this
>>>> feature flag is disabled on most distributions. I hope RST can be
>>>> disabled by default on btrfs-progs.
>>> IMO this works as intended. Features may be enabled ahead of time in
>>> btrfs-progs due to early testing and not requiring the experimental
>>> build. The experimental status of progs features is about completeness
>>> of the implementation, so if mkfs creates a filesystem with RST then it
>>> could be enabled.
>> If due to early testing, btrfs-progs could have --experimental option 
>> to enable it instead of
>>
>> enabling it by default which would causing confusion to normal users. 
>> For experienced user who wants to test new feature, we can hint them 
>> to use this option when needed.
>>
>> e.g.
>>
>> # mkfs.btrfs -f -d raid10 -m raid1c4 -s 16k
>> can't use raid10, this is a experimental feature, use --experimental 
>> if you really want it.
>>
>> # mkfs.btrfs -f -d raid10 -m raid1c4 -s 16k --experimental
>>
>> [succeed]
>>
>>> The kernel support is still missing some features and there are some
>>> known bugs, this is conveniently hidden behind the DEBUG option so it
>>> does not affect distributions.
>>>
>>> However it seems that the documentation is not clear about that and the
>>> status page should be updated.
>>>
>
> I think that the problem is the following: if you want to use a "zoned 
> device"
> and a "raid device", you NEED a raid-stripe-tree.
> In fact by default the RST is not enabled, but it is pulled if we have
> a zoned device and a raid10.
>
> So it is not a problem of btrfs-progs itself.
>
> I think that btrfs.mkfs should be more verbose about pulling incomapt
> feature as a dependency.
>
>
I think btrfs-progs should require the user to explicitly enable
features that the current stable kernel may not support by default, instead
of being more verbose in its output.

Users may 1. not read the output, or 2. be using a tty that
truncates the output. Making this explicit is better than surprising
users afterwards.

HAN Yuwei
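
For reference, the incompat_flags value from the dump-super output quoted
above can be decoded bit by bit. A minimal sketch, assuming the bit
positions of the BTRFS_FEATURE_INCOMPAT_* flags as defined in the kernel's
include/uapi/linux/btrfs.h around 6.7 (bits 0 to 14 only). The helper
decode_incompat is hypothetical:

```shell
# decode_incompat VALUE: print the name of every incompat flag set in VALUE.
# The list is ordered by bit position, starting at bit 0 (assumed layout).
decode_incompat() {
    flags=$(( $1 ))
    bit=0
    for name in MIXED_BACKREF DEFAULT_SUBVOL MIXED_GROUPS COMPRESS_LZO \
                COMPRESS_ZSTD BIG_METADATA EXTENDED_IREF RAID56 \
                SKINNY_METADATA NO_HOLES METADATA_UUID RAID1C34 ZONED \
                EXTENT_TREE_V2 RAID_STRIPE_TREE; do
        # Test the current bit and report the flag if it is set
        if [ $(( flags & (1 << bit) )) -ne 0 ]; then
            echo "$name"
        fi
        bit=$(( bit + 1 ))
    done
}

decode_incompat 0x5b41
```

For 0x5b41 this prints the same seven names that dump-super lists, with
RAID_STRIPE_TREE (bit 14) being the one the stable kernel rejects.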



Re: [PATCH v2 2/2] btrfs: automatically remove the subvolume qgroup

2024-04-24T22:19:24Z

On 2024/4/24 22:11, David Sterba wrote:
> On Fri, Apr 19, 2024 at 07:16:53PM +0930, Qu Wenruo wrote:
>> Currently if we fully removed a subvolume (not only unlinked, but fully
>> dropped its root item), its qgroup would not be removed.
>>
>> Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups.
>
> There's also an option 'btrfs subvolume delete --delete-qgroup' that
> does that and is going to be default in 6.9. With this kernel change it
> would break the behaviour of the --no-delete-qgroup, which is there for
> the case something depends on that.  For now I'd rather postpone
> changing the kernel behaviour.
>

A quick glance at --delete-qgroup shows it won't work as expected at
all.

Firstly, deleting a qgroup requires the qgroup numbers to be 0.
Meanwhile, qgroup numbers can only be 0 after 1) the full subvolume has
been dropped and 2) a transaction is committed to reflect the qgroup numbers.

Both situations are only handled in my patchset, which means that in a lot
of cases it won't work at all.

Furthermore, there is the drop_subtree_threshold thing, which can mark
the qgroup inconsistent and skip accounting, so the target subvolume's
qgroup numbers never fall back to 0 (until the next rescan).

So I'm afraid --delete-qgroup won't work until the 1/2 patch gets
merged (allowing deleting qgroups as long as the target subvolume is gone).

Thanks,
Qu

Re: [PATCH] btrfs-progs: fsfeatures: hide RST behind experimental builds

2024-04-24T19:58:14Z

On 24/04/2024 21.30, Goffredo Baroncelli wrote:
> On 24/04/2024 07.49, Qu Wenruo wrote:
>> [BUG]
>> There is a report that with v6.7.1 btrfs-progs, `mkfs.btrfs -O rst`, but
>> kernel 6.7/6.8 are unable to mount it at all.
>>
>> [CAUSE]
>> Although the feature string states the raid-stripe-tree feature is
>> supported since v6.7 kernel, it's not correct.
>> Only debug kernel with CONFIG_BTRFS_DEBUG would support the RST feature.
>>
>> Thus RST is still considered experimental.
>>
>> [FIX]
>> Move the RST mkfs features back to experimental.
>>
>> This patch would only hide the RST features from 'mkfs -O' options, the
>> existing supporting code for RST would still be there, thus
>> non-experimental build of `btrfs check` can still verify a btrfs with
>> RST.
>>
>> Reported-by: HAN Yuwei 
>> Fixes: b421fdff9574 ("btrfs-progs: move raid-stripe-tree and squota build out of experimental")
>> Signed-off-by: Qu Wenruo 
>> ---
>>   common/fsfeatures.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/common/fsfeatures.c b/common/fsfeatures.c
>> index c5e81629ccea..7aaddab6a192 100644
>> --- a/common/fsfeatures.c
>> +++ b/common/fsfeatures.c
>> @@ -222,6 +222,7 @@ static const struct btrfs_feature mkfs_features[] = {
>>           VERSION_NULL(default),
>>           .desc        = "block group tree, more efficient block group tracking to reduce mount time"
>>       },
>> +#if EXPERIMENTAL
>>       {
>>           .name        = "rst",
>>           .incompat_flag    = BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE,
>> @@ -238,6 +239,7 @@ static const struct btrfs_feature mkfs_features[] = {
>>           VERSION_NULL(default),
>>           .desc        = "raid stripe tree, enhanced file extent tracking"
>>       },
>> +#endif
>>       {
>>           .name        = "squota",
>>           .incompat_flag    = BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA,
> 
> I am a bit confused.

Ignore this email.

> 
> Han's report says that the problem is due to the fact that the option is enabled by *default*.
> 
> So why should we remove this option from the binary entirely? Shouldn't it be enough to remove it from the options enabled by default?
> 
> Something like:
> 
> -        VERSION_TO_STRING2(compat, 6,7),
> -        VERSION_NULL(safe),
> 
> +       VERSION_NULL(compat),
> 
> +        VERSION_TO_STRING2(safe, 6,7),
> 
> 

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

Re: [PATCH 16/26] cxl/extent: Realize extent devices

2024-04-24T19:58:02Z

Dave Jiang wrote:
> 
> 
> On 3/24/24 4:18 PM, ira.weiny@intel.com wrote:
> > From: Navneet Singh 
> > 

[snip]

> > diff --git a/drivers/cxl/core/extent.c b/drivers/cxl/core/extent.c
> > new file mode 100644
> > index 000000000000..487c220f1c3c
> > --- /dev/null
> > +++ b/drivers/cxl/core/extent.c
> > @@ -0,0 +1,133 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright(c) 2024 Intel Corporation. All rights reserved. */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +
> > +static DEFINE_IDA(cxl_extent_ida);
> 
> According to Documentation/core-api/idr.rst, IDR interface is deprecated and
> xarray usage is preferred.

IDA != IDR

ida_alloc() provides a unique, unused id for the device.  I worked hard to
eliminate all extra references to the extent objects so as to ensure object
lifetimes.  So I'm keeping this for now.
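
To illustrate the distinction: an IDA only hands out unique integer ids
and stores no pointers, whereas an IDR maps ids to pointers. Below is a
toy userspace stand-in, not the kernel implementation; the toy_* names
and the 64-id limit are invented for the sketch.

```c
#include <stdint.h>

#define TOY_IDA_MAX 64

/* Toy stand-in for struct ida: one bit per id, no pointer storage. */
struct toy_ida {
	uint64_t used;
};

/* Hand out the lowest unused id, or -1 when all ids are taken. */
static int toy_ida_alloc(struct toy_ida *ida)
{
	for (int id = 0; id < TOY_IDA_MAX; id++) {
		if (!(ida->used & (UINT64_C(1) << id))) {
			ida->used |= UINT64_C(1) << id;
			return id;
		}
	}
	return -1;
}

/* Return an id to the pool so it can be handed out again. */
static void toy_ida_free(struct toy_ida *ida, int id)
{
	if (id >= 0 && id < TOY_IDA_MAX)
		ida->used &= ~(UINT64_C(1) << id);
}
```

In the driver the equivalent calls are ida_alloc() and ida_free() on
cxl_extent_ida; since the allocator never holds a pointer to the extent,
it adds no extra reference to the object.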

> > +
> > +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> > +			 char *buf)
> 
> Parameter alignment a bit off here? and some of the other functions as well.

Thanks, fixed.

[snip]

> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 9e33a0976828..6b00e717e42b 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -1020,6 +1020,32 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> >  	return rc;
> >  }
> >  
> > +static int cxl_send_dc_cap_response(struct cxl_memdev_state *mds,
> > +				    struct range *extent, int opcode)
> > +{
> > +	struct cxl_mbox_cmd mbox_cmd;
> > +	size_t size;
> > +
> > +	struct cxl_mbox_dc_response *dc_res __free(kfree);
> > +	size = struct_size(dc_res, extent_list, 1);
> > +	dc_res = kzalloc(size, GFP_KERNEL);
> > +	if (!dc_res)
> > +		return -ENOMEM;
> > +
> > +	dc_res->extent_list[0].dpa_start = cpu_to_le64(extent->start);
> > +	memset(dc_res->extent_list[0].reserved, 0, 8);
> 
> Not needed. kzalloc already zeroed.

Thanks, Fan mentioned it too.
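
For reference, a userspace analogue of why the memset() is redundant,
with calloc() standing in for kzalloc() and a hand-rolled size
computation standing in for struct_size(). The struct layouts here are
invented for the sketch and are not the real cxl_mbox_dc_response.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layouts for illustration only. */
struct toy_extent {
	uint64_t dpa_start;
	uint64_t length;
	uint8_t reserved[8];
};

struct toy_response {
	uint32_t extent_list_size;
	uint8_t reserved[4];
	struct toy_extent extent_list[]; /* flexible array member */
};

/* Allocate a zero-filled response with room for nr extents. */
static struct toy_response *toy_alloc_response(size_t nr)
{
	/* Refuse sizes that would overflow size_t. */
	if (nr > (SIZE_MAX - sizeof(struct toy_response)) /
		 sizeof(struct toy_extent))
		return NULL;

	/* calloc() zero-fills the header and every array element. */
	return calloc(1, sizeof(struct toy_response) +
			 nr * sizeof(struct toy_extent));
}
```

Every byte of extent_list[0], including reserved[], is already zero when
the allocation returns, so only the non-zero fields need to be assigned.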

Ira