* [PATCH RFC 0/2] btrfs: make extent buffer memory contiguous
From: Qu Wenruo @ 2023-07-25  2:57 UTC
  To: linux-btrfs

[REPO]
https://github.com/adam900710/linux/tree/eb_page_cleanups

This branch also includes the previously submitted extent buffer
accessor cleanups as a dependency.

[BACKGROUND]
We have a lot of extent buffer code dedicated to handling cross-page
accesses. Other filesystems such as XFS instead map their xfs_buf into
kernel virtual address space, so they can access the contents of an
xfs_buf without worrying about page boundaries.

[OBJECTIVE]
This patchset mostly borrows the xfs_buf approach to greatly simplify
the extent buffer accessors.

Now all the extent buffer accessors are turned into thin wrappers
around memcpy()/memcmp()/memmove().
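
For example, read_extent_buffer() essentially collapses to the
following (a simplified sketch that omits the range checking;
btrfs_get_eb_addr() is the helper introduced in patch 2):

	void read_extent_buffer(const struct extent_buffer *eb, void *dstv,
				unsigned long start, unsigned long len)
	{
		/* The eb memory is contiguous, physically or via vmap. */
		memcpy(dstv, btrfs_get_eb_addr(eb) + start, len);
	}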

For now, it passes the btrfs test cases without new regressions.

[RFC]
I still want to get more feedback on this topic, since it changes the
very core of the btrfs extent buffer.

Furthermore, this change may not be friendly to 32-bit systems, as the
kernel virtual address (vmalloc) space is only 128MiB there, and I'm
not sure whether it will cause regressions on such systems.
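
As a back-of-the-envelope example (assuming the default 16KiB nodesize
and ignoring guard pages and other vmalloc users): 128MiB / 16KiB =
8192, so at most around 8K extent buffers could stay vmapped at once
on such a system.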

[TODO]
- Benchmarks
  I'm not 100% sure whether this is going to cause any performance change.
  In theory we offload the cross-page handling to the hardware MMU, which
  should improve performance, but we also spend more time initializing the
  extent buffer.

- More tests on 32bit and 64bit systems

Qu Wenruo (2):
  btrfs: map non-contiguous extent buffer pages into virtual address space
  btrfs: utilize the physically/virtually contiguous extent buffer
    memory

 fs/btrfs/disk-io.c   |  18 +--
 fs/btrfs/extent_io.c | 303 ++++++++++++++-----------------------------
 fs/btrfs/extent_io.h |  17 +++
 3 files changed, 119 insertions(+), 219 deletions(-)

-- 
2.41.0



* [PATCH RFC 1/2] btrfs: map non-contiguous extent buffer pages into virtual address space
From: Qu Wenruo @ 2023-07-25  2:57 UTC
  To: linux-btrfs

Currently btrfs implements its extent buffer read/write using various
helpers that do cross-page handling for the pages array.

However, other filesystems such as XFS map the pages into kernel
virtual address space, which greatly simplifies the access.

This patch learns from XFS and maps the pages into virtual address
space, if and only if the pages are not physically contiguous.
(Note: a single page counts as physically contiguous.)

For now we only create the mapping; the mapped address is not yet
utilized.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 70 ++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/extent_io.h |  7 +++++
 2 files changed, 77 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4144c649718e..f40d48f641c0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -14,6 +14,7 @@
 #include <linux/pagevec.h>
 #include <linux/prefetch.h>
 #include <linux/fsverity.h>
+#include <linux/vmalloc.h>
 #include "misc.h"
 #include "extent_io.h"
 #include "extent-io-tree.h"
@@ -3206,6 +3207,8 @@ static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb)
 	ASSERT(!extent_buffer_under_io(eb));
 
 	num_pages = num_extent_pages(eb);
+	if (eb->vaddr)
+		vm_unmap_ram(eb->vaddr, num_pages);
 	for (i = 0; i < num_pages; i++) {
 		struct page *page = eb->pages[i];
 
@@ -3255,6 +3258,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
 {
 	int i;
 	struct extent_buffer *new;
+	bool pages_contig = true;
 	int num_pages = num_extent_pages(src);
 	int ret;
 
@@ -3279,6 +3283,9 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
 		int ret;
 		struct page *p = new->pages[i];
 
+		if (i && p != new->pages[i - 1] + 1)
+			pages_contig = false;
+
 		ret = attach_extent_buffer_page(new, p, NULL);
 		if (ret < 0) {
 			btrfs_release_extent_buffer(new);
@@ -3286,6 +3293,23 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
 		}
 		WARN_ON(PageDirty(p));
 	}
+	if (!pages_contig) {
+		unsigned int nofs_flag;
+		int retried = 0;
+
+		nofs_flag = memalloc_nofs_save();
+		do {
+			new->vaddr = vm_map_ram(new->pages, num_pages, -1);
+			if (new->vaddr)
+				break;
+			vm_unmap_aliases();
+		} while ((retried++) <= 1);
+		memalloc_nofs_restore(nofs_flag);
+		if (!new->vaddr) {
+			btrfs_release_extent_buffer(new);
+			return NULL;
+		}
+	}
 	copy_extent_buffer_full(new, src);
 	set_extent_buffer_uptodate(new);
 
@@ -3296,6 +3320,7 @@ struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
 						  u64 start, unsigned long len)
 {
 	struct extent_buffer *eb;
+	bool pages_contig = true;
 	int num_pages;
 	int i;
 	int ret;
@@ -3312,11 +3337,29 @@ struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = eb->pages[i];
 
+		if (i && p != eb->pages[i - 1] + 1)
+			pages_contig = false;
+
 		ret = attach_extent_buffer_page(eb, p, NULL);
 		if (ret < 0)
 			goto err;
 	}
 
+	if (!pages_contig) {
+		unsigned int nofs_flag;
+		int retried = 0;
+
+		nofs_flag = memalloc_nofs_save();
+		do {
+			eb->vaddr = vm_map_ram(eb->pages, num_pages, -1);
+			if (eb->vaddr)
+				break;
+			vm_unmap_aliases();
+		} while ((retried++) <= 1);
+		memalloc_nofs_restore(nofs_flag);
+		if (!eb->vaddr)
+			goto err;
+	}
 	set_extent_buffer_uptodate(eb);
 	btrfs_set_header_nritems(eb, 0);
 	set_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);
@@ -3539,6 +3582,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	struct address_space *mapping = fs_info->btree_inode->i_mapping;
 	struct btrfs_subpage *prealloc = NULL;
 	u64 lockdep_owner = owner_root;
+	bool pages_contig = true;
 	int uptodate = 1;
 	int ret;
 
@@ -3611,6 +3655,10 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		/* Should not fail, as we have preallocated the memory */
 		ret = attach_extent_buffer_page(eb, p, prealloc);
 		ASSERT(!ret);
+
+		if (i && p != eb->pages[i - 1] + 1)
+			pages_contig = false;
+
 		/*
 		 * To inform we have extra eb under allocation, so that
 		 * detach_extent_buffer_page() won't release the page private
@@ -3636,6 +3684,28 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		 * we could crash.
 		 */
 	}
+
+	/*
+	 * If the pages are not physically contiguous, map them into a
+	 * contiguous virtual range to make later access easier.
+	 */
+	if (!pages_contig) {
+		unsigned int nofs_flag;
+		int retried = 0;
+
+		nofs_flag = memalloc_nofs_save();
+		do {
+			eb->vaddr = vm_map_ram(eb->pages, num_pages, -1);
+			if (eb->vaddr)
+				break;
+			vm_unmap_aliases();
+		} while ((retried++) <= 1);
+		memalloc_nofs_restore(nofs_flag);
+		if (!eb->vaddr) {
+			exists = ERR_PTR(-ENOMEM);
+			goto free_eb;
+		}
+	}
 	if (uptodate)
 		set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
 again:
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5966d810af7b..f1505c3a05cc 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -88,6 +88,13 @@ struct extent_buffer {
 
 	struct rw_semaphore lock;
 
+	/*
+	 * The virtually mapped address of the extent buffer.
+	 *
+	 * NULL if the pages are physically contiguous.
+	 */
+	void *vaddr;
+
 	struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
 #ifdef CONFIG_BTRFS_DEBUG
 	struct list_head leak_list;
-- 
2.41.0



* [PATCH RFC 2/2] btrfs: utilize the physically/virtually contiguous extent buffer memory
From: Qu Wenruo @ 2023-07-25  2:57 UTC
  To: linux-btrfs

Since the extent buffer pages are now either physically or virtually
contiguous, let's take advantage of that.

This involves the following changes:

- Extent buffer accessors
  The read/write/memcpy/memmove_extent_buffer() functions are now just
  wrappers around memcpy()/memmove().

  Cross-page handling is done by the hardware MMU.

- csum_tree_block()
  We can call crypto_shash_digest() directly, as we no longer need to
  handle page boundaries.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c   |  18 +---
 fs/btrfs/extent_io.c | 233 ++++++-------------------------------------
 fs/btrfs/extent_io.h |  10 ++
 3 files changed, 42 insertions(+), 219 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b4495d4c1533..8ca12ca2dc32 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -75,24 +75,14 @@ static void btrfs_free_csum_hash(struct btrfs_fs_info *fs_info)
 static void csum_tree_block(struct extent_buffer *buf, u8 *result)
 {
 	struct btrfs_fs_info *fs_info = buf->fs_info;
-	const int num_pages = num_extent_pages(buf);
-	const int first_page_part = min_t(u32, PAGE_SIZE, fs_info->nodesize);
 	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
-	char *kaddr;
-	int i;
+	void *eb_addr = btrfs_get_eb_addr(buf);
 
+	memset(result, 0, BTRFS_CSUM_SIZE);
 	shash->tfm = fs_info->csum_shash;
 	crypto_shash_init(shash);
-	kaddr = page_address(buf->pages[0]) + offset_in_page(buf->start);
-	crypto_shash_update(shash, kaddr + BTRFS_CSUM_SIZE,
-			    first_page_part - BTRFS_CSUM_SIZE);
-
-	for (i = 1; i < num_pages && INLINE_EXTENT_BUFFER_PAGES > 1; i++) {
-		kaddr = page_address(buf->pages[i]);
-		crypto_shash_update(shash, kaddr, PAGE_SIZE);
-	}
-	memset(result, 0, BTRFS_CSUM_SIZE);
-	crypto_shash_final(shash, result);
+	crypto_shash_digest(shash, eb_addr + BTRFS_CSUM_SIZE,
+			    buf->len - BTRFS_CSUM_SIZE, result);
 }
 
 /*
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f40d48f641c0..98077bbefc48 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4126,100 +4126,39 @@ static inline int check_eb_range(const struct extent_buffer *eb,
 void read_extent_buffer(const struct extent_buffer *eb, void *dstv,
 			unsigned long start, unsigned long len)
 {
-	size_t cur;
-	size_t offset;
-	struct page *page;
-	char *kaddr;
-	char *dst = (char *)dstv;
-	unsigned long i = get_eb_page_index(start);
+	void *eb_addr = btrfs_get_eb_addr(eb);
 
 	if (check_eb_range(eb, start, len))
 		return;
 
-	offset = get_eb_offset_in_page(eb, start);
-
-	while (len > 0) {
-		page = eb->pages[i];
-
-		cur = min(len, (PAGE_SIZE - offset));
-		kaddr = page_address(page);
-		memcpy(dst, kaddr + offset, cur);
-
-		dst += cur;
-		len -= cur;
-		offset = 0;
-		i++;
-	}
+	memcpy(dstv, eb_addr + start, len);
 }
 
 int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb,
 				       void __user *dstv,
 				       unsigned long start, unsigned long len)
 {
-	size_t cur;
-	size_t offset;
-	struct page *page;
-	char *kaddr;
-	char __user *dst = (char __user *)dstv;
-	unsigned long i = get_eb_page_index(start);
-	int ret = 0;
+	void *eb_addr = btrfs_get_eb_addr(eb);
+	int ret;
 
 	WARN_ON(start > eb->len);
 	WARN_ON(start + len > eb->start + eb->len);
 
-	offset = get_eb_offset_in_page(eb, start);
-
-	while (len > 0) {
-		page = eb->pages[i];
-
-		cur = min(len, (PAGE_SIZE - offset));
-		kaddr = page_address(page);
-		if (copy_to_user_nofault(dst, kaddr + offset, cur)) {
-			ret = -EFAULT;
-			break;
-		}
-
-		dst += cur;
-		len -= cur;
-		offset = 0;
-		i++;
-	}
-
-	return ret;
+	ret = copy_to_user_nofault(dstv, eb_addr + start, len);
+	if (ret)
+		return -EFAULT;
+	return 0;
 }
 
 int memcmp_extent_buffer(const struct extent_buffer *eb, const void *ptrv,
 			 unsigned long start, unsigned long len)
 {
-	size_t cur;
-	size_t offset;
-	struct page *page;
-	char *kaddr;
-	char *ptr = (char *)ptrv;
-	unsigned long i = get_eb_page_index(start);
-	int ret = 0;
+	void *eb_addr = btrfs_get_eb_addr(eb);
 
 	if (check_eb_range(eb, start, len))
 		return -EINVAL;
 
-	offset = get_eb_offset_in_page(eb, start);
-
-	while (len > 0) {
-		page = eb->pages[i];
-
-		cur = min(len, (PAGE_SIZE - offset));
-
-		kaddr = page_address(page);
-		ret = memcmp(ptr, kaddr + offset, cur);
-		if (ret)
-			break;
-
-		ptr += cur;
-		len -= cur;
-		offset = 0;
-		i++;
-	}
-	return ret;
+	return memcmp(ptrv, eb_addr + start, len);
 }
 
 /*
@@ -4253,67 +4192,20 @@ static void assert_eb_page_uptodate(const struct extent_buffer *eb,
 	}
 }
 
-static void __write_extent_buffer(const struct extent_buffer *eb,
-				  const void *srcv, unsigned long start,
-				  unsigned long len, bool use_memmove)
-{
-	size_t cur;
-	size_t offset;
-	struct page *page;
-	char *kaddr;
-	char *src = (char *)srcv;
-	unsigned long i = get_eb_page_index(start);
-	/* For unmapped (dummy) ebs, no need to check their uptodate status. */
-	const bool check_uptodate = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);
-
-	WARN_ON(test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags));
-
-	if (check_eb_range(eb, start, len))
-		return;
-
-	offset = get_eb_offset_in_page(eb, start);
-
-	while (len > 0) {
-		page = eb->pages[i];
-		if (check_uptodate)
-			assert_eb_page_uptodate(eb, page);
-
-		cur = min(len, PAGE_SIZE - offset);
-		kaddr = page_address(page);
-		if (use_memmove)
-			memmove(kaddr + offset, src, cur);
-		else
-			memcpy(kaddr + offset, src, cur);
-
-		src += cur;
-		len -= cur;
-		offset = 0;
-		i++;
-	}
-}
-
 void write_extent_buffer(const struct extent_buffer *eb, const void *srcv,
 			 unsigned long start, unsigned long len)
 {
-	return __write_extent_buffer(eb, srcv, start, len, false);
+	void *eb_addr = btrfs_get_eb_addr(eb);
+
+	memcpy(eb_addr + start, srcv, len);
 }
 
 static void memset_extent_buffer(const struct extent_buffer *eb, int c,
 				 unsigned long start, unsigned long len)
 {
-	unsigned long cur = start;
+	void *eb_addr = btrfs_get_eb_addr(eb);
 
-	while (cur < start + len) {
-		unsigned long index = get_eb_page_index(cur);
-		unsigned int offset = get_eb_offset_in_page(eb, cur);
-		unsigned int cur_len = min(start + len - cur, PAGE_SIZE - offset);
-		struct page *page = eb->pages[index];
-
-		assert_eb_page_uptodate(eb, page);
-		memset(page_address(page) + offset, c, cur_len);
-
-		cur += cur_len;
-	}
+	memset(eb_addr + start, c, len);
 }
 
 void memzero_extent_buffer(const struct extent_buffer *eb, unsigned long start,
@@ -4327,20 +4219,12 @@ void memzero_extent_buffer(const struct extent_buffer *eb, unsigned long start,
 void copy_extent_buffer_full(const struct extent_buffer *dst,
 			     const struct extent_buffer *src)
 {
-	unsigned long cur = 0;
+	void *dst_addr = btrfs_get_eb_addr(dst);
+	void *src_addr = btrfs_get_eb_addr(src);
 
 	ASSERT(dst->len == src->len);
 
-	while (cur < src->len) {
-		unsigned long index = get_eb_page_index(cur);
-		unsigned long offset = get_eb_offset_in_page(src, cur);
-		unsigned long cur_len = min(src->len, PAGE_SIZE - offset);
-		void *addr = page_address(src->pages[index]) + offset;
-
-		write_extent_buffer(dst, addr, cur, cur_len);
-
-		cur += cur_len;
-	}
+	memcpy(dst_addr, src_addr, dst->len);
 }
 
 void copy_extent_buffer(const struct extent_buffer *dst,
@@ -4349,11 +4233,8 @@ void copy_extent_buffer(const struct extent_buffer *dst,
 			unsigned long len)
 {
 	u64 dst_len = dst->len;
-	size_t cur;
-	size_t offset;
-	struct page *page;
-	char *kaddr;
-	unsigned long i = get_eb_page_index(dst_offset);
+	void *dst_addr = btrfs_get_eb_addr(dst);
+	void *src_addr = btrfs_get_eb_addr(src);
 
 	if (check_eb_range(dst, dst_offset, len) ||
 	    check_eb_range(src, src_offset, len))
@@ -4361,22 +4242,7 @@ void copy_extent_buffer(const struct extent_buffer *dst,
 
 	WARN_ON(src->len != dst_len);
 
-	offset = get_eb_offset_in_page(dst, dst_offset);
-
-	while (len > 0) {
-		page = dst->pages[i];
-		assert_eb_page_uptodate(dst, page);
-
-		cur = min(len, (unsigned long)(PAGE_SIZE - offset));
-
-		kaddr = page_address(page);
-		read_extent_buffer(src, kaddr + offset, src_offset, cur);
-
-		src_offset += cur;
-		len -= cur;
-		offset = 0;
-		i++;
-	}
+	memcpy(dst_addr + dst_offset, src_addr + src_offset, len);
 }
 
 /*
@@ -4524,72 +4390,29 @@ void memcpy_extent_buffer(const struct extent_buffer *dst,
 			  unsigned long dst_offset, unsigned long src_offset,
 			  unsigned long len)
 {
-	unsigned long cur_off = 0;
+	void *eb_addr = btrfs_get_eb_addr(dst);
 
 	if (check_eb_range(dst, dst_offset, len) ||
 	    check_eb_range(dst, src_offset, len))
 		return;
 
-	while (cur_off < len) {
-		unsigned long cur_src = cur_off + src_offset;
-		unsigned long pg_index = get_eb_page_index(cur_src);
-		unsigned long pg_off = get_eb_offset_in_page(dst, cur_src);
-		unsigned long cur_len = min(src_offset + len - cur_src,
-					    PAGE_SIZE - pg_off);
-		void *src_addr = page_address(dst->pages[pg_index]) + pg_off;
-		const bool use_memmove = areas_overlap(src_offset + cur_off,
-						       dst_offset + cur_off, cur_len);
-
-		__write_extent_buffer(dst, src_addr, dst_offset + cur_off, cur_len,
-				      use_memmove);
-		cur_off += cur_len;
-	}
+	if (areas_overlap(dst_offset, src_offset, len))
+		memmove(eb_addr + dst_offset, eb_addr + src_offset, len);
+	else
+		memcpy(eb_addr + dst_offset, eb_addr + src_offset, len);
 }
 
 void memmove_extent_buffer(const struct extent_buffer *dst,
 			   unsigned long dst_offset, unsigned long src_offset,
 			   unsigned long len)
 {
-	unsigned long dst_end = dst_offset + len - 1;
-	unsigned long src_end = src_offset + len - 1;
+	void *eb_addr = btrfs_get_eb_addr(dst);
 
 	if (check_eb_range(dst, dst_offset, len) ||
 	    check_eb_range(dst, src_offset, len))
 		return;
 
-	if (dst_offset < src_offset) {
-		memcpy_extent_buffer(dst, dst_offset, src_offset, len);
-		return;
-	}
-
-	while (len > 0) {
-		unsigned long src_i;
-		size_t cur;
-		size_t dst_off_in_page;
-		size_t src_off_in_page;
-		void *src_addr;
-		bool use_memmove;
-
-		src_i = get_eb_page_index(src_end);
-
-		dst_off_in_page = get_eb_offset_in_page(dst, dst_end);
-		src_off_in_page = get_eb_offset_in_page(dst, src_end);
-
-		cur = min_t(unsigned long, len, src_off_in_page + 1);
-		cur = min(cur, dst_off_in_page + 1);
-
-		src_addr = page_address(dst->pages[src_i]) + src_off_in_page -
-			   cur + 1;
-		use_memmove = areas_overlap(src_end - cur + 1, dst_end - cur + 1,
-					    cur);
-
-		__write_extent_buffer(dst, src_addr, dst_end - cur + 1, cur,
-				      use_memmove);
-
-		dst_end -= cur;
-		src_end -= cur;
-		len -= cur;
-	}
+	memmove(eb_addr + dst_offset, eb_addr + src_offset, len);
 }
 
 #define GANG_LOOKUP_SIZE	16
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index f1505c3a05cc..f97707829ee5 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -134,6 +134,16 @@ static inline unsigned long get_eb_page_index(unsigned long offset)
 	return offset >> PAGE_SHIFT;
 }
 
+static inline void *btrfs_get_eb_addr(const struct extent_buffer *eb)
+{
+	/* For the vmapped (non-contiguous pages) fallback. */
+	if (eb->vaddr)
+		return eb->vaddr;
+
+	/* For physically contiguous pages and the subpage case. */
+	return page_address(eb->pages[0]) + offset_in_page(eb->start);
+}
+
 /*
  * Structure to record how many bytes and which ranges are set/cleared
  */
-- 
2.41.0



* Re: [PATCH RFC 0/2] btrfs: make extent buffer memory contiguous
From: Qu Wenruo @ 2023-07-26  9:16 UTC
  To: linux-btrfs



On 2023/7/25 10:57, Qu Wenruo wrote:
> [...]
>
> [TODO]
> - Benchmarks
>    I'm not 100% sure whether this is going to cause any performance change.
>    In theory we offload the cross-page handling to the hardware MMU, which
>    should improve performance, but we also spend more time initializing the
>    extent buffer.

I tried an fio run with the following parameters on a PCIe 3.0 NVMe device:

fio -rw=randrw --size=8g \
	--bsrange=512b-64k --bs_unaligned \
	--ioengine=libaio --fsync=1024 \
	--filename=$mnt/job --name=job1 --name=job2 --runtime=300s

This results in a heavy enough metadata workload for btrfs, and the new
patchset did get a small improvement in both throughput and latency:

Baseline:
   READ: bw=33.0MiB/s (34.6MB/s), 16.5MiB/s-16.6MiB/s (17.3MB/s-17.4MB/s), io=8136MiB (8531MB), run=245999-246658msec
  WRITE: bw=33.0MiB/s (34.6MB/s), 16.5MiB/s-16.5MiB/s (17.3MB/s-17.3MB/s), io=8144MiB (8539MB), run=245999-246658msec

Patched:
   READ: bw=33.0MiB/s (34.6MB/s), 16.5MiB/s-16.6MiB/s (17.3MB/s-17.4MB/s), io=8136MiB (8531MB), run=245999-246658msec
  WRITE: bw=33.0MiB/s (34.6MB/s), 16.5MiB/s-16.5MiB/s (17.3MB/s-17.3MB/s), io=8144MiB (8539MB), run=245999-246658msec

Both throughput and latency improved by around 2.6%.

Thanks,
Qu


* Re: [PATCH RFC 1/2] btrfs: map non-contiguous extent buffer pages into virtual address space
From: David Sterba @ 2023-07-27 14:18 UTC
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Jul 25, 2023 at 10:57:21AM +0800, Qu Wenruo wrote:
> Currently btrfs implements its extent buffer read/write using various
> helpers that do cross-page handling for the pages array.
>
> However, other filesystems such as XFS map the pages into kernel
> virtual address space, which greatly simplifies the access.
>
> This patch learns from XFS and maps the pages into virtual address
> space, if and only if the pages are not physically contiguous.
> (Note: a single page counts as physically contiguous.)
>
> For now we only create the mapping; the mapped address is not yet
> utilized.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> [...]
> @@ -3312,11 +3337,29 @@ struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
>  	for (i = 0; i < num_pages; i++) {
>  		struct page *p = eb->pages[i];
>  
> +		if (i && p != eb->pages[i - 1] + 1)
> +			pages_contig = false;

Chances that allocated pages in eb->pages will be contiguous decrease
over time basically to zero, because even one page out of order will
ruin it. This means we can assume that virtual mapping will have to be
used almost every time.

The virtual mapping can also fail and we have no fallback, and there
are two more places where allocating an extent buffer can fail.

There's alloc_pages(gfp, order) that can try to allocate contiguous
pages of a given order, and we have nodesize always matching a power of
two so we could use it. Although this also forces alignment to the same
order, which we don't need, and adds to the failure modes.
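
A rough sketch of that alternative (untested and hypothetical; the
helper name and the fallback wiring are made up, only the
alloc_pages()/get_order() calls are real kernel API):

	/*
	 * Try one physically contiguous allocation covering the whole eb.
	 * On failure the caller would fall back to per-page allocation
	 * plus vm_map_ram(), as in this patch.
	 */
	static int alloc_eb_pages_contig(struct extent_buffer *eb)
	{
		int num_pages = num_extent_pages(eb);
		struct page *page;
		int i;

		/* nodesize is a power of two, so one order covers it exactly */
		page = alloc_pages(GFP_NOFS, get_order(eb->len));
		if (!page)
			return -ENOMEM;

		for (i = 0; i < num_pages; i++)
			eb->pages[i] = page + i;

		/* NB: freeing must then use __free_pages() with the same order. */
		return 0;
	}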


* Re: [PATCH RFC 1/2] btrfs: map non-contiguous extent buffer pages into virtual address space
From: Qu Wenruo @ 2023-07-27 22:24 UTC
  To: dsterba, Qu Wenruo; +Cc: linux-btrfs



On 2023/7/27 22:18, David Sterba wrote:
> On Tue, Jul 25, 2023 at 10:57:21AM +0800, Qu Wenruo wrote:
>> Currently btrfs implements its extent buffer read/write using various
>> helpers that do cross-page handling for the pages array.
>>
>> However, other filesystems such as XFS map the pages into kernel
>> virtual address space, which greatly simplifies the access.
>>
>> This patch learns from XFS and maps the pages into virtual address
>> space, if and only if the pages are not physically contiguous.
>> (Note: a single page counts as physically contiguous.)
>>
>> For now we only create the mapping; the mapped address is not yet
>> utilized.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> [...]
>> @@ -3312,11 +3337,29 @@ struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
>>   	for (i = 0; i < num_pages; i++) {
>>   		struct page *p = eb->pages[i];
>>
>> +		if (i && p != eb->pages[i - 1] + 1)
>> +			pages_contig = false;
>
> Chances that allocated pages in eb->pages will be contiguous decrease
> over time basically to zero, because even one page out of order will
> ruin it.

I doubt this assumption.

Shouldn't things like THP require physically contiguous pages?
Thus, if your assumption were correct, THP would not work at all on
long-running servers, which doesn't look right to me.

> This means we can assume that virtual mapping will have to be
> used almost every time.
>
> The virtual mapping can also fail and we have no fallback, and there
> are two more places where allocating an extent buffer can fail.

Yes, it can indeed fail, but only when the virtual address space is
exhausted, which is more of a problem for 32-bit systems.

Although my counterargument is: XFS has been doing this for a long
time, and even though XFS has much smaller metadata compared to btrfs,
it doesn't have any known problem of hitting such failures.
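
For reference, the retry pattern used in the patch mirrors what XFS
does in _xfs_buf_map_pages() (quoted from memory, so treat it as
approximate rather than the exact upstream code):

	nofs_flag = memalloc_nofs_save();
	do {
		bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count, -1);
		if (bp->b_addr)
			break;
		vm_unmap_aliases();
	} while (retried++ <= 1);
	memalloc_nofs_restore(nofs_flag);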

>
> There's alloc_pages(gfp, order) that can try to allocate contiguous
> pages of a given order, and we have nodesize always matching a power of
> two so we could use it.

Wouldn't this lead to problems with exhausted contiguous pages (if that
assumption is really true)?

> Although this also forces alignment to the same
> order, which we don't need, and adds to the failure modes.

We're migrating to rejecting non-nodesize-aligned tree blocks in the
long run.

I have submitted a patch to warn about such tree blocks already:
https://patchwork.kernel.org/project/linux-btrfs/patch/fee2c7df3d0a2e91e9b5e07a04242fcf28ade6bf.1690178924.git.wqu@suse.com/

Btrfs has had a similar (but less strict; it only checks the
cross-stripe boundary) check for a long time, and we haven't received
such a warning for a while, so I believe we can gradually move to
rejecting such tree blocks.

Thanks,
Qu


* Re: [PATCH RFC 1/2] btrfs: map non-contiguous extent buffer pages into virtual address space
From: Qu Wenruo @ 2023-08-17 11:32 UTC
  To: dsterba, Qu Wenruo; +Cc: linux-btrfs



On 2023/7/28 06:24, Qu Wenruo wrote:
> [...]

Any extra feedback? Without this series, it's much harder to convert
the eb pages to folios.

Thanks,
Qu

