linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC 0/2] btrfs: defrag: further preparation for multi-page sector size
@ 2024-01-24  3:59 Qu Wenruo
  2024-01-24  3:59 ` [PATCH RFC 1/2] btrfs: introduce cached folio size Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Qu Wenruo @ 2024-01-24  3:59 UTC (permalink / raw)
  To: linux-btrfs

With the folio interface, it's much easier to support multi-page sector
size (aka, sector/block size > PAGE_SIZE, which is rare between major
upstream filesystems).

The basic idea is, if we firstly convert to full folio interface, and
allow an address space to only allocate folio which is exactly
block/sector size, the support for multi-page would be mostly done.

But before that support, there are still quite some conversion left for
btrfs.

Furthermore, with both subpage and multipage sector size, we need to
handle folio different:

- For subpage
  The folio would always be page sized.

- For multipage (and regular sectorsize == PAGE_SIZE)
  The folio would be sector sized.

Furthermore, the filemap interface would make various shifts more
complex.
As filemap_*() interfaces use index which is PAGE_SHIFT based,
meanwhile with potential larger folio, the folio shift can be larger
than PAGE_SHIFT.

Thus in the future, we may want to change filemap interface to accept
bytenr to avoid confusion between page and folio shifts.

The first patch would introduce a cached folio size and shift.

The second patch would make defrag to utilize the new cached folio size
and shift.
(And still use PAGE_SHIFT only for filemap*() interfaces)

Qu Wenruo (2):
  btrfs: introduce cached folio size
  btrfs: defrag: prepare defrag for larger data folio size

 fs/btrfs/defrag.c  | 69 +++++++++++++++++++++++++---------------------
 fs/btrfs/disk-io.c | 11 ++++++++
 fs/btrfs/fs.h      | 10 +++++++
 3 files changed, 58 insertions(+), 32 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH RFC 1/2] btrfs: introduce cached folio size
  2024-01-24  3:59 [PATCH RFC 0/2] btrfs: defrag: further preparation for multi-page sector size Qu Wenruo
@ 2024-01-24  3:59 ` Qu Wenruo
  2024-01-24  3:59 ` [PATCH RFC 2/2] btrfs: defrag: prepare defrag for larger data " Qu Wenruo
  2024-01-24  4:03 ` [PATCH RFC 0/2] btrfs: defrag: further preparation for multi-page sector size Qu Wenruo
  2 siblings, 0 replies; 12+ messages in thread
From: Qu Wenruo @ 2024-01-24  3:59 UTC (permalink / raw)
  To: linux-btrfs

For the future multipage sectorsize support (sectorsize > PAGE_SIZE), we
want to fully utilize folio interface, thus making every data sector to
be represented by a folio.

However this would lead to a small problem that multipage and subpage
support would have every different folio size expectation.

For subpage, since folio can not be smaller than a page, one folio would
always be page sized.
But for multiplage, each folio would be sector sized.

For callsites directly handling pages/folios (aka, all read/write paths)
we don't want to do such check every time we got a folio.
So instead we cache the data folio size and its shift into
btrfs_fs_info.

This would make later folio conversion easier.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 11 +++++++++++
 fs/btrfs/fs.h      | 10 ++++++++++
 2 files changed, 21 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 51c6127508af..429fb95320ec 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2813,6 +2813,8 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	fs_info->sectorsize = 4096;
 	fs_info->sectorsize_bits = ilog2(4096);
 	fs_info->stripesize = 4096;
+	fs_info->folio_size = PAGE_SIZE;
+	fs_info->folio_shift = PAGE_SHIFT;
 
 	/* Default compress algorithm when user does -o compress */
 	fs_info->compress_type = BTRFS_COMPRESS_ZLIB;
@@ -3306,6 +3308,15 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	fs_info->csums_per_leaf = BTRFS_MAX_ITEM_SIZE(fs_info) / fs_info->csum_size;
 	fs_info->stripesize = stripesize;
 
+	if (sectorsize > PAGE_SIZE) {
+		/* For future multi-page sectorsize support */
+		fs_info->folio_size = sectorsize;
+		fs_info->sectorsize_bits = fs_info->sectorsize_bits;
+	} else {
+		fs_info->folio_size = PAGE_SIZE;
+		fs_info->folio_shift = PAGE_SHIFT;
+	}
+
 	/*
 	 * Handle the space caching options appropriately now that we have the
 	 * super block loaded and validated.
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index f8bb73d6ab68..52f67bf71ba5 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -750,6 +750,16 @@ struct btrfs_fs_info {
 	u32 csums_per_leaf;
 	u32 stripesize;
 
+	/*
+	 * For future subpage and multipage sectorsize support.
+	 *
+	 * For subpage, all of our data folios would still be PAGE_SIZE.
+	 * But for multipage, those data folios would be sector sized.
+	 * This is the cached result to read/write path to utilize.
+	 */
+	u32 folio_size;
+	u32 folio_shift;
+
 	/*
 	 * Maximum size of an extent. BTRFS_MAX_EXTENT_SIZE on regular
 	 * filesystem, on zoned it depends on the device constraints.
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH RFC 2/2] btrfs: defrag: prepare defrag for larger data folio size
  2024-01-24  3:59 [PATCH RFC 0/2] btrfs: defrag: further preparation for multi-page sector size Qu Wenruo
  2024-01-24  3:59 ` [PATCH RFC 1/2] btrfs: introduce cached folio size Qu Wenruo
@ 2024-01-24  3:59 ` Qu Wenruo
  2024-02-15 20:23   ` Matthew Wilcox
  2024-01-24  4:03 ` [PATCH RFC 0/2] btrfs: defrag: further preparation for multi-page sector size Qu Wenruo
  2 siblings, 1 reply; 12+ messages in thread
From: Qu Wenruo @ 2024-01-24  3:59 UTC (permalink / raw)
  To: linux-btrfs

Although we have migrated defrag to use the folio interface, we can
still further enhance it for the future larger data folio size.

This patch would:

- Rename page related variables to folio's equivalent

- Change "pgoff_t index" to "u64 folio_start" for defrag_prepare_one_folio()
  For the future multi-page sectorsize support, each data folio would be
  sector sized (except for subpage cases).
  Thus we should avoid using index, as there would be two different
  shifts:
  * PAGE_SHIFT based index
    Would be utilized in filemap related interfaces

  * Folio shift based index
    Would be utilized for the remaining cases

  So here we use the "u64 folio_start" to represent one folio

- Use fs_info->folio_shift to replace PAGE_SHIFT
  Since in the future the data folios would no longer be page sized, use
  the cached fs_info->folio_shift to handle both multi-page and subpage
  cases.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/defrag.c | 69 +++++++++++++++++++++++++----------------------
 1 file changed, 37 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index dd1b5a060366..07df0081ac57 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -861,18 +861,19 @@ static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em,
  * NOTE: Caller should also wait for page writeback after the cluster is
  * prepared, here we don't do writeback wait for each page.
  */
-static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t index)
+static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode,
+					      u64 folio_start)
 {
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct address_space *mapping = inode->vfs_inode.i_mapping;
 	gfp_t mask = btrfs_alloc_write_mask(mapping);
-	u64 page_start = (u64)index << PAGE_SHIFT;
-	u64 page_end = page_start + PAGE_SIZE - 1;
+	u64 folio_end = folio_start + fs_info->folio_size - 1;
 	struct extent_state *cached_state = NULL;
 	struct folio *folio;
 	int ret;
 
 again:
-	folio = __filemap_get_folio(mapping, index,
+	folio = __filemap_get_folio(mapping, folio_start >> PAGE_SHIFT,
 				    FGP_LOCK | FGP_ACCESSED | FGP_CREAT, mask);
 	if (IS_ERR(folio))
 		return folio;
@@ -902,9 +903,10 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
 	while (1) {
 		struct btrfs_ordered_extent *ordered;
 
-		lock_extent(&inode->io_tree, page_start, page_end, &cached_state);
-		ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_SIZE);
-		unlock_extent(&inode->io_tree, page_start, page_end,
+		lock_extent(&inode->io_tree, folio_start, folio_end, &cached_state);
+		ordered = btrfs_lookup_ordered_range(inode, folio_start,
+						     fs_info->folio_size);
+		unlock_extent(&inode->io_tree, folio_start, folio_end,
 			      &cached_state);
 		if (!ordered)
 			break;
@@ -1163,20 +1165,20 @@ static_assert(PAGE_ALIGNED(CLUSTER_SIZE));
  */
 static int defrag_one_locked_target(struct btrfs_inode *inode,
 				    struct defrag_target_range *target,
-				    struct folio **folios, int nr_pages,
+				    struct folio **folios, int nr_folios,
 				    struct extent_state **cached_state)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct extent_changeset *data_reserved = NULL;
 	const u64 start = target->start;
 	const u64 len = target->len;
-	unsigned long last_index = (start + len - 1) >> PAGE_SHIFT;
-	unsigned long start_index = start >> PAGE_SHIFT;
+	unsigned long last_index = (start + len - 1) >> fs_info->folio_shift;
+	unsigned long start_index = start >> fs_info->folio_shift;
 	unsigned long first_index = folios[0]->index;
 	int ret = 0;
 	int i;
 
-	ASSERT(last_index - first_index + 1 <= nr_pages);
+	ASSERT(last_index - first_index + 1 <= nr_folios);
 
 	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, len);
 	if (ret < 0)
@@ -1187,7 +1189,7 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
 	set_extent_bit(&inode->io_tree, start, start + len - 1,
 		       EXTENT_DELALLOC | EXTENT_DEFRAG, cached_state);
 
-	/* Update the page status */
+	/* Update the folio status */
 	for (i = start_index - first_index; i <= last_index - first_index; i++) {
 		folio_clear_checked(folios[i]);
 		btrfs_folio_clamp_set_dirty(fs_info, folios[i], start, len);
@@ -1202,40 +1204,42 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 			    u32 extent_thresh, u64 newer_than, bool do_compress,
 			    u64 *last_scanned_ret)
 {
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct extent_state *cached_state = NULL;
 	struct defrag_target_range *entry;
 	struct defrag_target_range *tmp;
 	LIST_HEAD(target_list);
 	struct folio **folios;
-	const u32 sectorsize = inode->root->fs_info->sectorsize;
-	u64 last_index = (start + len - 1) >> PAGE_SHIFT;
-	u64 start_index = start >> PAGE_SHIFT;
-	unsigned int nr_pages = last_index - start_index + 1;
+	const u32 sectorsize = fs_info->sectorsize;
+	u64 last_index = (start + len - 1) >> fs_info->folio_shift;
+	u64 start_index = start >> fs_info->folio_shift;
+	unsigned int nr_folios = last_index - start_index + 1;
 	int ret = 0;
 	int i;
 
-	ASSERT(nr_pages <= CLUSTER_SIZE / PAGE_SIZE);
+	ASSERT(nr_folios <= (CLUSTER_SIZE >> fs_info->folio_shift));
 	ASSERT(IS_ALIGNED(start, sectorsize) && IS_ALIGNED(len, sectorsize));
 
-	folios = kcalloc(nr_pages, sizeof(struct folio *), GFP_NOFS);
+	folios = kcalloc(nr_folios, sizeof(struct folio *), GFP_NOFS);
 	if (!folios)
 		return -ENOMEM;
 
 	/* Prepare all pages */
-	for (i = 0; i < nr_pages; i++) {
-		folios[i] = defrag_prepare_one_folio(inode, start_index + i);
+	for (i = 0; i < nr_folios ; i++) {
+		folios[i] = defrag_prepare_one_folio(inode,
+				(start_index + i) << fs_info->folio_shift);
 		if (IS_ERR(folios[i])) {
 			ret = PTR_ERR(folios[i]);
-			nr_pages = i;
+			nr_folios = i;
 			goto free_folios;
 		}
 	}
-	for (i = 0; i < nr_pages; i++)
+	for (i = 0; i < nr_folios; i++)
 		folio_wait_writeback(folios[i]);
 
 	/* Lock the pages range */
-	lock_extent(&inode->io_tree, start_index << PAGE_SHIFT,
-		    (last_index << PAGE_SHIFT) + PAGE_SIZE - 1,
+	lock_extent(&inode->io_tree, start_index << fs_info->folio_shift,
+		    (last_index << fs_info->folio_shift) + fs_info->folio_size - 1,
 		    &cached_state);
 	/*
 	 * Now we have a consistent view about the extent map, re-check
@@ -1251,7 +1255,7 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 		goto unlock_extent;
 
 	list_for_each_entry(entry, &target_list, list) {
-		ret = defrag_one_locked_target(inode, entry, folios, nr_pages,
+		ret = defrag_one_locked_target(inode, entry, folios, nr_folios,
 					       &cached_state);
 		if (ret < 0)
 			break;
@@ -1262,11 +1266,11 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 		kfree(entry);
 	}
 unlock_extent:
-	unlock_extent(&inode->io_tree, start_index << PAGE_SHIFT,
-		      (last_index << PAGE_SHIFT) + PAGE_SIZE - 1,
+	unlock_extent(&inode->io_tree, start_index << fs_info->folio_shift,
+		      (last_index << fs_info->folio_shift) + fs_info->folio_size - 1,
 		      &cached_state);
 free_folios:
-	for (i = 0; i < nr_pages; i++) {
+	for (i = 0; i < nr_folios; i++) {
 		folio_unlock(folios[i]);
 		folio_put(folios[i]);
 	}
@@ -1282,7 +1286,8 @@ static int defrag_one_cluster(struct btrfs_inode *inode,
 			      unsigned long max_sectors,
 			      u64 *last_scanned_ret)
 {
-	const u32 sectorsize = inode->root->fs_info->sectorsize;
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	const u32 sectorsize = fs_info->sectorsize;
 	struct defrag_target_range *entry;
 	struct defrag_target_range *tmp;
 	LIST_HEAD(target_list);
@@ -1421,7 +1426,7 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
 	 * Make writeback start from the beginning of the range, so that the
 	 * defrag range can be written sequentially.
 	 */
-	start_index = cur >> PAGE_SHIFT;
+	start_index = cur >> fs_info->folio_shift;
 	if (start_index < inode->i_mapping->writeback_index)
 		inode->i_mapping->writeback_index = start_index;
 
@@ -1436,8 +1441,8 @@ int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
 		}
 
 		/* We want the cluster end at page boundary when possible */
-		cluster_end = (((cur >> PAGE_SHIFT) +
-			       (SZ_256K >> PAGE_SHIFT)) << PAGE_SHIFT) - 1;
+		cluster_end = (((cur >> fs_info->folio_shift) +
+			(SZ_256K >> fs_info->folio_shift)) << fs_info->folio_shift) - 1;
 		cluster_end = min(cluster_end, last_byte);
 
 		btrfs_inode_lock(BTRFS_I(inode), 0);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC 0/2] btrfs: defrag: further preparation for multi-page sector size
  2024-01-24  3:59 [PATCH RFC 0/2] btrfs: defrag: further preparation for multi-page sector size Qu Wenruo
  2024-01-24  3:59 ` [PATCH RFC 1/2] btrfs: introduce cached folio size Qu Wenruo
  2024-01-24  3:59 ` [PATCH RFC 2/2] btrfs: defrag: prepare defrag for larger data " Qu Wenruo
@ 2024-01-24  4:03 ` Qu Wenruo
  2024-01-24  4:48   ` Matthew Wilcox
  2 siblings, 1 reply; 12+ messages in thread
From: Qu Wenruo @ 2024-01-24  4:03 UTC (permalink / raw)
  To: linux-btrfs, Linux Memory Management List


[-- Attachment #1.1.1: Type: text/plain, Size: 1403 bytes --]

Adding MM list to this cover letter.

On 2024/1/24 14:29, Qu Wenruo wrote:
> With the folio interface, it's much easier to support multi-page sector
> size (aka, sector/block size > PAGE_SIZE, which is rare between major
> upstream filesystems).
> 
> The basic idea is, if we firstly convert to full folio interface, and
> allow an address space to only allocate folio which is exactly
> block/sector size, the support for multi-page would be mostly done.
> 
> But before that support, there are still quite some conversion left for
> btrfs.
> 
> Furthermore, with both subpage and multipage sector size, we need to
> handle folio different:
> 
> - For subpage
>    The folio would always be page sized.
> 
> - For multipage (and regular sectorsize == PAGE_SIZE)
>    The folio would be sector sized.
> 
> Furthermore, the filemap interface would make various shifts more
> complex.
> As filemap_*() interfaces use index which is PAGE_SHIFT based,
> meanwhile with potential larger folio, the folio shift can be larger
> than PAGE_SHIFT.

As I really want some feedback on this part.

I'm pretty sure we would have some filesystems go utilizing larger 
folios to implement their multi-page block size support.

Thus in that case, can we have an interface change to make all folio 
versions of filemap_*() to accept a file offset instead of page index?

Thanks,
Qu

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 7027 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC 0/2] btrfs: defrag: further preparation for multi-page sector size
  2024-01-24  4:03 ` [PATCH RFC 0/2] btrfs: defrag: further preparation for multi-page sector size Qu Wenruo
@ 2024-01-24  4:48   ` Matthew Wilcox
  2024-01-24  5:27     ` Qu Wenruo
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2024-01-24  4:48 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Linux Memory Management List

On Wed, Jan 24, 2024 at 02:33:22PM +1030, Qu Wenruo wrote:
> I'm pretty sure we would have some filesystems go utilizing larger folios to
> implement their multi-page block size support.
> 
> Thus in that case, can we have an interface change to make all folio
> versions of filemap_*() to accept a file offset instead of page index?

You're confused.  There's no change needed to the filemap API to support
large folios used by large block sizes.  Quite possibly more of btrfs
is confused, but it's really very simple.  index == pos / PAGE_SIZE.
That's all.  Even if you have a 64kB block size device on a 4kB PAGE_SIZE
machine.

That implies that folios must be at least order-4, but you can still
look up a folio at index 23 and get back the folio which was stored at
index 16 (range 16-31).

hugetlbfs made the mistake of 'hstate->order' and it's still not fixed.
It's a little better than it was (thanks to Sid), but more work is needed.
Just use the same approach as THPs or you're going to end up hurt.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC 0/2] btrfs: defrag: further preparation for multi-page sector size
  2024-01-24  4:48   ` Matthew Wilcox
@ 2024-01-24  5:27     ` Qu Wenruo
  2024-01-24  5:43       ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Qu Wenruo @ 2024-01-24  5:27 UTC (permalink / raw)
  To: Matthew Wilcox, Qu Wenruo; +Cc: linux-btrfs, Linux Memory Management List



On 2024/1/24 15:18, Matthew Wilcox wrote:
> On Wed, Jan 24, 2024 at 02:33:22PM +1030, Qu Wenruo wrote:
>> I'm pretty sure we would have some filesystems go utilizing larger folios to
>> implement their multi-page block size support.
>>
>> Thus in that case, can we have an interface change to make all folio
>> versions of filemap_*() to accept a file offset instead of page index?
>
> You're confused.  There's no change needed to the filemap API to support
> large folios used by large block sizes.  Quite possibly more of btrfs
> is confused, but it's really very simple.  index == pos / PAGE_SIZE.
> That's all.  Even if you have a 64kB block size device on a 4kB PAGE_SIZE
> machine.

Yes, I understand that filemap API is always working on PAGE_SHIFTed index.

The concern is, (hopefully) with more fses going to utilized large
folios, there would be two shifts.

One folio shift (ilog2(blocksize)), one PAGE_SHIFT for filemap interfaces.

And I'm pretty sure it's going to cause confusion, e.g. someone doing
the conversion without much think, and all go the folio shift, even for
filemap_get_folio().

Thus I'm wondering if it's possible to get a bytenr version of
filemap_get_folio().

(Or is it better just creating an inline wrapper inside the fs to avoid
confusion?)

>
> That implies that folios must be at least order-4, but you can still
> look up a folio at index 23 and get back the folio which was stored at
> index 16 (range 16-31).

Yep, that's also what I expect, and that is very handy.

Thanks,
Qu

>
> hugetlbfs made the mistake of 'hstate->order' and it's still not fixed.
> It's a little better than it was (thanks to Sid), but more work is needed.
> Just use the same approach as THPs or you're going to end up hurt.
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC 0/2] btrfs: defrag: further preparation for multi-page sector size
  2024-01-24  5:27     ` Qu Wenruo
@ 2024-01-24  5:43       ` Matthew Wilcox
  2024-01-24  5:50         ` Qu Wenruo
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2024-01-24  5:43 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs, Linux Memory Management List

On Wed, Jan 24, 2024 at 03:57:39PM +1030, Qu Wenruo wrote:
> 
> 
> On 2024/1/24 15:18, Matthew Wilcox wrote:
> > On Wed, Jan 24, 2024 at 02:33:22PM +1030, Qu Wenruo wrote:
> > > I'm pretty sure we would have some filesystems go utilizing larger folios to
> > > implement their multi-page block size support.
> > > 
> > > Thus in that case, can we have an interface change to make all folio
> > > versions of filemap_*() to accept a file offset instead of page index?
> > 
> > You're confused.  There's no change needed to the filemap API to support
> > large folios used by large block sizes.  Quite possibly more of btrfs
> > is confused, but it's really very simple.  index == pos / PAGE_SIZE.
> > That's all.  Even if you have a 64kB block size device on a 4kB PAGE_SIZE
> > machine.
> 
> Yes, I understand that filemap API is always working on PAGE_SHIFTed index.

OK, good.

> The concern is, (hopefully) with more fses going to utilized large
> folios, there would be two shifts.
> 
> One folio shift (ilog2(blocksize)), one PAGE_SHIFT for filemap interfaces.

Don't shift the file position by the folio_shift().  You want to support
large(r) folios _and_ large blocksizes at the same time.  ie 64kB might
be the block size, but all that would mean would be that folio_shift()
would be at least 16.  It might be 17, 18 or 21 (for a THP).

Filesystems already have to deal with different PAGE_SIZE, SECTOR_SIZE,
fsblock size and LBA size.  Folios aren't making things any worse here
(they're also not making anything better in this area, but I never
claimed they would).

btrfs is slightly unusual in that it defined PAGE_SIZE and fsblock size
to be the same (and then had to deal with the consequences of arm64/x86
interoperability later).  But most filesystems have pretty good separation
of the four concepts.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC 0/2] btrfs: defrag: further preparation for multi-page sector size
  2024-01-24  5:43       ` Matthew Wilcox
@ 2024-01-24  5:50         ` Qu Wenruo
  0 siblings, 0 replies; 12+ messages in thread
From: Qu Wenruo @ 2024-01-24  5:50 UTC (permalink / raw)
  To: Matthew Wilcox, Qu Wenruo; +Cc: linux-btrfs, Linux Memory Management List


[-- Attachment #1.1.1: Type: text/plain, Size: 2352 bytes --]



On 2024/1/24 16:13, Matthew Wilcox wrote:
> On Wed, Jan 24, 2024 at 03:57:39PM +1030, Qu Wenruo wrote:
>>
>>
>> On 2024/1/24 15:18, Matthew Wilcox wrote:
>>> On Wed, Jan 24, 2024 at 02:33:22PM +1030, Qu Wenruo wrote:
>>>> I'm pretty sure we would have some filesystems go utilizing larger folios to
>>>> implement their multi-page block size support.
>>>>
>>>> Thus in that case, can we have an interface change to make all folio
>>>> versions of filemap_*() to accept a file offset instead of page index?
>>>
>>> You're confused.  There's no change needed to the filemap API to support
>>> large folios used by large block sizes.  Quite possibly more of btrfs
>>> is confused, but it's really very simple.  index == pos / PAGE_SIZE.
>>> That's all.  Even if you have a 64kB block size device on a 4kB PAGE_SIZE
>>> machine.
>>
>> Yes, I understand that filemap API is always working on PAGE_SHIFTed index.
> 
> OK, good.
> 
>> The concern is, (hopefully) with more fses going to utilized large
>> folios, there would be two shifts.
>>
>> One folio shift (ilog2(blocksize)), one PAGE_SHIFT for filemap interfaces.
> 
> Don't shift the file position by the folio_shift().  You want to support
> large(r) folios _and_ large blocksizes at the same time.  ie 64kB might
> be the block size, but all that would mean would be that folio_shift()
> would be at least 16.  It might be 17, 18 or 21 (for a THP).

Indeed, I forgot we have THP.

> 
> Filesystems already have to deal with different PAGE_SIZE, SECTOR_SIZE,
> fsblock size and LBA size.  Folios aren't making things any worse here
> (they're also not making anything better in this area, but I never
> claimed they would).

OK, that makes sense.

So all the folio shifts would be an fs internal deal, and we have to 
handle it properly.

> 
> btrfs is slightly unusual in that it defined PAGE_SIZE and fsblock size
> to be the same (and then had to deal with the consequences of arm64/x86
> interoperability later).  But most filesystems have pretty good separation
> of the four concepts.

Indeed, although I also found most major fses do not support larger 
block size (> PAGE_SIZE) either.

I guess subpage is simpler in the past, and hopefully with larger folio, 
we can see more fses support multi-page blocksize soon.

Thanks,
Qu

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 7027 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC 2/2] btrfs: defrag: prepare defrag for larger data folio size
  2024-01-24  3:59 ` [PATCH RFC 2/2] btrfs: defrag: prepare defrag for larger data " Qu Wenruo
@ 2024-02-15 20:23   ` Matthew Wilcox
  2024-02-15 23:07     ` Qu Wenruo
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2024-02-15 20:23 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, David Sterba

On Wed, Jan 24, 2024 at 02:29:08PM +1030, Qu Wenruo wrote:
> Although we have migrated defrag to use the folio interface, we can
> still further enhance it for the future larger data folio size.

This patch is wrong.  Please drop it.

>  {
>  	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>  	struct extent_changeset *data_reserved = NULL;
>  	const u64 start = target->start;
>  	const u64 len = target->len;
> -	unsigned long last_index = (start + len - 1) >> PAGE_SHIFT;
> -	unsigned long start_index = start >> PAGE_SHIFT;
> +	unsigned long last_index = (start + len - 1) >> fs_info->folio_shift;
> +	unsigned long start_index = start >> fs_info->folio_shift;

indices are always in multiples of PAGE_SIZE.

>  	unsigned long first_index = folios[0]->index;

... so if you've shifted a file position by some "folio_shift" and then
subtracted it from folio->index, you have garbage.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC 2/2] btrfs: defrag: prepare defrag for larger data folio size
  2024-02-15 20:23   ` Matthew Wilcox
@ 2024-02-15 23:07     ` Qu Wenruo
  2024-02-27 21:32       ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Qu Wenruo @ 2024-02-15 23:07 UTC (permalink / raw)
  To: Matthew Wilcox, Qu Wenruo; +Cc: linux-btrfs, David Sterba



在 2024/2/16 06:53, Matthew Wilcox 写道:
> On Wed, Jan 24, 2024 at 02:29:08PM +1030, Qu Wenruo wrote:
>> Although we have migrated defrag to use the folio interface, we can
>> still further enhance it for the future larger data folio size.
>
> This patch is wrong.  Please drop it.
>
>>   {
>>   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>>   	struct extent_changeset *data_reserved = NULL;
>>   	const u64 start = target->start;
>>   	const u64 len = target->len;
>> -	unsigned long last_index = (start + len - 1) >> PAGE_SHIFT;
>> -	unsigned long start_index = start >> PAGE_SHIFT;
>> +	unsigned long last_index = (start + len - 1) >> fs_info->folio_shift;
>> +	unsigned long start_index = start >> fs_info->folio_shift;
>
> indices are always in multiples of PAGE_SIZE.

So is the fs_info->folio_shift. It would always be >= PAGE_SHIFT.

That folio_shift is assigned inside the first patch:
(Although in the first patch, there is a bug that it should assign
folio_shift, not sectorsize_bits again)

+	fs_info->folio_size = PAGE_SIZE;
+	fs_info->folio_shift = PAGE_SHIFT;
+	if (sectorsize > PAGE_SIZE) {
+		/* For future multi-page sectorsize support */
+		fs_info->folio_size = sectorsize;
+		fs_info->folio_shift  = fs_info->sectorsize_bits;
+	} else {
+		fs_info->folio_size = PAGE_SIZE;
+		fs_info->folio_shift = PAGE_SHIFT;
+	}

>
>>   	unsigned long first_index = folios[0]->index;
>
> ... so if you've shifted a file position by some "folio_shift" and then
> subtracted it from folio->index, you have garbage.

For the future larger folio support, all folio would be in the size of
sectorsize.
Mind to explain why the folio->index would be garbage?

Thanks,
Qu

>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC 2/2] btrfs: defrag: prepare defrag for larger data folio size
  2024-02-15 23:07     ` Qu Wenruo
@ 2024-02-27 21:32       ` Matthew Wilcox
  2024-02-27 21:42         ` Qu Wenruo
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2024-02-27 21:32 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs, David Sterba

On Fri, Feb 16, 2024 at 09:37:01AM +1030, Qu Wenruo wrote:
> 在 2024/2/16 06:53, Matthew Wilcox 写道:
> > On Wed, Jan 24, 2024 at 02:29:08PM +1030, Qu Wenruo wrote:
> > > Although we have migrated defrag to use the folio interface, we can
> > > still further enhance it for the future larger data folio size.
> > 
> > This patch is wrong.  Please drop it.
> > 
> > >   {
> > >   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > >   	struct extent_changeset *data_reserved = NULL;
> > >   	const u64 start = target->start;
> > >   	const u64 len = target->len;
> > > -	unsigned long last_index = (start + len - 1) >> PAGE_SHIFT;
> > > -	unsigned long start_index = start >> PAGE_SHIFT;
> > > +	unsigned long last_index = (start + len - 1) >> fs_info->folio_shift;
> > > +	unsigned long start_index = start >> fs_info->folio_shift;
> > 
> > indices are always in multiples of PAGE_SIZE.
> 
> So is the fs_info->folio_shift. It would always be >= PAGE_SHIFT.

No, you don't understand.  folio->index * PAGE_SIZE == byte offset of folio
in the file.  What you've done here breaks that.

> > >   	unsigned long first_index = folios[0]->index;
> > 
> > ... so if you've shifted a file position by some "folio_shift" and then
> > subtracted it from folio->index, you have garbage.
> 
> For the future larger folio support, all folio would be in the size of
> sectorsize.

Yes, folios are always an integer multiple of sector size.  But their
_index_ is expressed as a multiple of the page size.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC 2/2] btrfs: defrag: prepare defrag for larger data folio size
  2024-02-27 21:32       ` Matthew Wilcox
@ 2024-02-27 21:42         ` Qu Wenruo
  0 siblings, 0 replies; 12+ messages in thread
From: Qu Wenruo @ 2024-02-27 21:42 UTC (permalink / raw)
  To: Matthew Wilcox, Qu Wenruo; +Cc: linux-btrfs, David Sterba



在 2024/2/28 08:02, Matthew Wilcox 写道:
> On Fri, Feb 16, 2024 at 09:37:01AM +1030, Qu Wenruo wrote:
>> 在 2024/2/16 06:53, Matthew Wilcox 写道:
>>> On Wed, Jan 24, 2024 at 02:29:08PM +1030, Qu Wenruo wrote:
>>>> Although we have migrated defrag to use the folio interface, we can
>>>> still further enhance it for the future larger data folio size.
>>>
>>> This patch is wrong.  Please drop it.
>>>
>>>>    {
>>>>    	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>>>>    	struct extent_changeset *data_reserved = NULL;
>>>>    	const u64 start = target->start;
>>>>    	const u64 len = target->len;
>>>> -	unsigned long last_index = (start + len - 1) >> PAGE_SHIFT;
>>>> -	unsigned long start_index = start >> PAGE_SHIFT;
>>>> +	unsigned long last_index = (start + len - 1) >> fs_info->folio_shift;
>>>> +	unsigned long start_index = start >> fs_info->folio_shift;
>>>
>>> indices are always in multiples of PAGE_SIZE.
>>
>> So is the fs_info->folio_shift. It would always be >= PAGE_SHIFT.
> 
> No, you don't understand.  folio->index * PAGE_SIZE == byte offset of folio
> in the file.  What you've done here breaks that.

Oh, I see it now.

Then it's all my bad. We should still go PAGE_SHIFT/PAGE_SIZE for filemap.

In that case, the series should be dropped.

Thanks for pointing this out,
Qu
> 
>>>>    	unsigned long first_index = folios[0]->index;
>>>
>>> ... so if you've shifted a file position by some "folio_shift" and then
>>> subtracted it from folio->index, you have garbage.
>>
>> For the future larger folio support, all folio would be in the size of
>> sectorsize.
> 
> Yes, folios are always an integer multiple of sector size.  But their
> _index_ is expressed as a multiple of the page size.
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-02-27 21:42 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-24  3:59 [PATCH RFC 0/2] btrfs: defrag: further preparation for multi-page sector size Qu Wenruo
2024-01-24  3:59 ` [PATCH RFC 1/2] btrfs: introduce cached folio size Qu Wenruo
2024-01-24  3:59 ` [PATCH RFC 2/2] btrfs: defrag: prepare defrag for larger data " Qu Wenruo
2024-02-15 20:23   ` Matthew Wilcox
2024-02-15 23:07     ` Qu Wenruo
2024-02-27 21:32       ` Matthew Wilcox
2024-02-27 21:42         ` Qu Wenruo
2024-01-24  4:03 ` [PATCH RFC 0/2] btrfs: defrag: further preparation for multi-page sector size Qu Wenruo
2024-01-24  4:48   ` Matthew Wilcox
2024-01-24  5:27     ` Qu Wenruo
2024-01-24  5:43       ` Matthew Wilcox
2024-01-24  5:50         ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).