* [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace
@ 2020-08-19  5:53 Qu Wenruo
  2020-08-19  8:50 ` kernel test robot
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Qu Wenruo @ 2020-08-19  5:53 UTC (permalink / raw)
  To: linux-btrfs

Before this patch, btrfs_buffered_write() does page copies in an 8-page
batch.

EXT4, by contrast, uses generic_perform_write(), which copies page by
page.

The 8-page batch behavior makes a lot of things more complex:
- More complex error handling
  We need to handle all errors for the half-written case.

- More complex advance check
  With 8 pages we must consider cases like only 4 pages copied, which
  means releasing the reserved space for the untouched 4 pages.

- More wrappers for multi-page operations
  The most obvious one is btrfs_copy_from_user(), which introduces way
  more complexity than we need.

This patch changes the behavior to a page-by-page pace: each iteration
reserves space for one page and copies one page.

A lot of complexity still remains, mostly for short copies,
non-uptodate pages and extent locking, but that is more or less the
same as generic_perform_write().

The performance is the same for 4K block size buffered writes, but
there is an obvious impact when the write block size spans multiple
pages:

The test involves writing a 128MiB file, which is smaller than 1/8th of
the system memory.
		Speed (MiB/sec)		Ops (ops/sec)
Unpatched:	931.498			14903.9756
Patched:	447.606			7161.6806
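
(Dividing speed by ops gives 64KiB per op in both rows, so each write
call here covers 16 pages.)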

In fact, if we account for the execution time of btrfs_buffered_write(),
meta/data reservation and later page dirtying take way more time than
the memory copy:

Patched:
 nr_runs          = 32768
 total_prepare_ns = 66908022
 total_copy_ns    = 75532103
 total_cleanup_ns = 135749090

Unpatched:
 nr_runs          = 2176
 total_prepare_ns = 7425773
 total_copy_ns    = 87780898
 total_cleanup_ns = 37704811
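
(Back-of-the-envelope, dividing the totals by the 32768 pages written in
both runs: per-page copy cost is similar, ~2.3us patched vs ~2.7us
unpatched, while per-page prepare + cleanup grows from ~1.4us to ~6.2us,
so the reservation/dirtying overhead, not the memory copy, accounts for
the slowdown.)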

The patched behavior is now similar to EXT4's, where buffered write
performance remains mostly unchanged from 4K block size upward.

On the other hand, XFS uses iomap, which supports multi-page reserve and
copy, leading to performance similar to that of unpatched btrfs.

It looks like we'd better go the iomap route rather than
generic_perform_write().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Reason for RFC:

The performance drop is enough for this patch to be discarded.
---
 fs/btrfs/file.c | 293 ++++++++++++------------------------------------
 1 file changed, 72 insertions(+), 221 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index bbfc8819cf28..be595da9bc05 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -379,60 +379,6 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
 	return 0;
 }
 
-/* simple helper to fault in pages and copy.  This should go away
- * and be replaced with calls into generic code.
- */
-static noinline int btrfs_copy_from_user(loff_t pos, size_t write_bytes,
-					 struct page **prepared_pages,
-					 struct iov_iter *i)
-{
-	size_t copied = 0;
-	size_t total_copied = 0;
-	int pg = 0;
-	int offset = offset_in_page(pos);
-
-	while (write_bytes > 0) {
-		size_t count = min_t(size_t,
-				     PAGE_SIZE - offset, write_bytes);
-		struct page *page = prepared_pages[pg];
-		/*
-		 * Copy data from userspace to the current page
-		 */
-		copied = iov_iter_copy_from_user_atomic(page, i, offset, count);
-
-		/* Flush processor's dcache for this page */
-		flush_dcache_page(page);
-
-		/*
-		 * if we get a partial write, we can end up with
-		 * partially up to date pages.  These add
-		 * a lot of complexity, so make sure they don't
-		 * happen by forcing this copy to be retried.
-		 *
-		 * The rest of the btrfs_file_write code will fall
-		 * back to page at a time copies after we return 0.
-		 */
-		if (!PageUptodate(page) && copied < count)
-			copied = 0;
-
-		iov_iter_advance(i, copied);
-		write_bytes -= copied;
-		total_copied += copied;
-
-		/* Return to btrfs_file_write_iter to fault page */
-		if (unlikely(copied == 0))
-			break;
-
-		if (copied < PAGE_SIZE - offset) {
-			offset += copied;
-		} else {
-			pg++;
-			offset = 0;
-		}
-	}
-	return total_copied;
-}
-
 /*
  * unlocks pages after btrfs_file_write is done with them
  */
@@ -443,8 +389,8 @@ static void btrfs_drop_pages(struct page **pages, size_t num_pages)
 		/* page checked is some magic around finding pages that
 		 * have been modified without going through btrfs_set_page_dirty
 		 * clear it here. There should be no need to mark the pages
-		 * accessed as prepare_pages should have marked them accessed
-		 * in prepare_pages via find_or_create_page()
+		 * accessed as prepare_pages() should have marked them accessed
+		 * in prepare_pages() via find_or_create_page()
 		 */
 		ClearPageChecked(pages[i]);
 		unlock_page(pages[i]);
@@ -1400,58 +1346,6 @@ static int prepare_uptodate_page(struct inode *inode,
 	return 0;
 }
 
-/*
- * this just gets pages into the page cache and locks them down.
- */
-static noinline int prepare_pages(struct inode *inode, struct page **pages,
-				  size_t num_pages, loff_t pos,
-				  size_t write_bytes, bool force_uptodate)
-{
-	int i;
-	unsigned long index = pos >> PAGE_SHIFT;
-	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
-	int err = 0;
-	int faili;
-
-	for (i = 0; i < num_pages; i++) {
-again:
-		pages[i] = find_or_create_page(inode->i_mapping, index + i,
-					       mask | __GFP_WRITE);
-		if (!pages[i]) {
-			faili = i - 1;
-			err = -ENOMEM;
-			goto fail;
-		}
-
-		if (i == 0)
-			err = prepare_uptodate_page(inode, pages[i], pos,
-						    force_uptodate);
-		if (!err && i == num_pages - 1)
-			err = prepare_uptodate_page(inode, pages[i],
-						    pos + write_bytes, false);
-		if (err) {
-			put_page(pages[i]);
-			if (err == -EAGAIN) {
-				err = 0;
-				goto again;
-			}
-			faili = i - 1;
-			goto fail;
-		}
-		wait_on_page_writeback(pages[i]);
-	}
-
-	return 0;
-fail:
-	while (faili >= 0) {
-		unlock_page(pages[faili]);
-		put_page(pages[faili]);
-		faili--;
-	}
-	return err;
-
-}
-
 /*
  * This function locks the extent and properly waits for data=ordered extents
  * to finish before allowing the pages to be modified if need.
@@ -1619,6 +1513,38 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode)
 	btrfs_drew_write_unlock(&inode->root->snapshot_lock);
 }
 
+static int prepare_one_page(struct inode *inode, struct page **page_ret,
+			    loff_t pos, size_t write_bytes, bool force_uptodate)
+{
+	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping) | __GFP_WRITE;
+	struct page *page;
+	int ret;
+
+again:
+	page = find_or_create_page(inode->i_mapping, pos >> PAGE_SHIFT, mask);
+	if (!page)
+		return -ENOMEM;
+
+	/*
+	 * We need the page uptodate for the following cases:
+	 * - Write range only covers part of the page
+	 * - We got a short copy on non-uptodate page in previous run
+	 */
+	if ((!(offset_in_page(pos) == 0 && write_bytes == PAGE_SIZE) ||
+	     force_uptodate) && !PageUptodate(page)) {
+		ret = prepare_uptodate_page(inode, page, pos, true);
+		if (ret) {
+			put_page(page);
+			if (ret == -EAGAIN)
+				goto again;
+			return ret;
+		}
+		wait_on_page_writeback(page);
+	}
+	*page_ret = page;
+	return 0;
+}
+
 static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 					       struct iov_iter *i)
 {
@@ -1626,45 +1552,26 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 	loff_t pos = iocb->ki_pos;
 	struct inode *inode = file_inode(file);
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct page **pages = NULL;
+	struct page *page = NULL;
 	struct extent_changeset *data_reserved = NULL;
-	u64 release_bytes = 0;
 	u64 lockstart;
 	u64 lockend;
 	size_t num_written = 0;
-	int nrptrs;
 	int ret = 0;
 	bool only_release_metadata = false;
 	bool force_page_uptodate = false;
 
-	nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
-			PAGE_SIZE / (sizeof(struct page *)));
-	nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied);
-	nrptrs = max(nrptrs, 8);
-	pages = kmalloc_array(nrptrs, sizeof(struct page *), GFP_KERNEL);
-	if (!pages)
-		return -ENOMEM;
-
 	while (iov_iter_count(i) > 0) {
 		struct extent_state *cached_state = NULL;
 		size_t offset = offset_in_page(pos);
-		size_t sector_offset;
 		size_t write_bytes = min(iov_iter_count(i),
-					 nrptrs * (size_t)PAGE_SIZE -
-					 offset);
-		size_t num_pages = DIV_ROUND_UP(write_bytes + offset,
-						PAGE_SIZE);
-		size_t reserve_bytes;
-		size_t dirty_pages;
+					 PAGE_SIZE - offset);
+		size_t reserve_bytes = PAGE_SIZE;
 		size_t copied;
-		size_t dirty_sectors;
-		size_t num_sectors;
 		int extents_locked;
 
-		WARN_ON(num_pages > nrptrs);
-
 		/*
-		 * Fault pages before locking them in prepare_pages
+		 * Fault pages before locking them in prepare_page()
 		 * to avoid recursive lock
 		 */
 		if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) {
@@ -1673,37 +1580,27 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 		}
 
 		only_release_metadata = false;
-		sector_offset = pos & (fs_info->sectorsize - 1);
-		reserve_bytes = round_up(write_bytes + sector_offset,
-				fs_info->sectorsize);
 
 		extent_changeset_release(data_reserved);
 		ret = btrfs_check_data_free_space(BTRFS_I(inode),
 						  &data_reserved, pos,
 						  write_bytes);
 		if (ret < 0) {
+			size_t tmp = write_bytes;
 			if (btrfs_check_nocow_lock(BTRFS_I(inode), pos,
-						   &write_bytes) > 0) {
+						   &tmp) > 0) {
+				ASSERT(tmp == write_bytes);
 				/*
 				 * For nodata cow case, no need to reserve
 				 * data space.
 				 */
 				only_release_metadata = true;
-				/*
-				 * our prealloc extent may be smaller than
-				 * write_bytes, so scale down.
-				 */
-				num_pages = DIV_ROUND_UP(write_bytes + offset,
-							 PAGE_SIZE);
-				reserve_bytes = round_up(write_bytes +
-							 sector_offset,
-							 fs_info->sectorsize);
+				reserve_bytes = 0;
 			} else {
 				break;
 			}
 		}
 
-		WARN_ON(reserve_bytes == 0);
 		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
 				reserve_bytes);
 		if (ret) {
@@ -1716,16 +1613,9 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 			break;
 		}
 
-		release_bytes = reserve_bytes;
 again:
-		/*
-		 * This is going to setup the pages array with the number of
-		 * pages we want, so we don't really need to worry about the
-		 * contents of pages from loop to loop
-		 */
-		ret = prepare_pages(inode, pages, num_pages,
-				    pos, write_bytes,
-				    force_page_uptodate);
+		ret = prepare_one_page(inode, &page, pos, write_bytes,
+				       force_page_uptodate);
 		if (ret) {
 			btrfs_delalloc_release_extents(BTRFS_I(inode),
 						       reserve_bytes);
@@ -1733,9 +1623,8 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 		}
 
 		extents_locked = lock_and_cleanup_extent_if_need(
-				BTRFS_I(inode), pages,
-				num_pages, pos, write_bytes, &lockstart,
-				&lockend, &cached_state);
+				BTRFS_I(inode), &page, 1, pos, write_bytes,
+				&lockstart, &lockend, &cached_state);
 		if (extents_locked < 0) {
 			if (extents_locked == -EAGAIN)
 				goto again;
@@ -1745,57 +1634,38 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 			break;
 		}
 
-		copied = btrfs_copy_from_user(pos, write_bytes, pages, i);
-
-		num_sectors = BTRFS_BYTES_TO_BLKS(fs_info, reserve_bytes);
-		dirty_sectors = round_up(copied + sector_offset,
-					fs_info->sectorsize);
-		dirty_sectors = BTRFS_BYTES_TO_BLKS(fs_info, dirty_sectors);
-
-		/*
-		 * if we have trouble faulting in the pages, fall
-		 * back to one page at a time
-		 */
-		if (copied < write_bytes)
-			nrptrs = 1;
+		copied = iov_iter_copy_from_user_atomic(page, i, offset,
+							write_bytes);
+		flush_dcache_page(page);
 
-		if (copied == 0) {
+		if (!PageUptodate(page) && copied < write_bytes) {
+			/*
+			 * Short write on non-uptodate page, we must retry and
+			 * force the page uptodate in next run.
+			 */
+			copied = 0;
 			force_page_uptodate = true;
-			dirty_sectors = 0;
-			dirty_pages = 0;
 		} else {
+			/* Next run doesn't need forced uptodate */
 			force_page_uptodate = false;
-			dirty_pages = DIV_ROUND_UP(copied + offset,
-						   PAGE_SIZE);
 		}
 
-		if (num_sectors > dirty_sectors) {
-			/* release everything except the sectors we dirtied */
-			release_bytes -= dirty_sectors <<
-						fs_info->sb->s_blocksize_bits;
-			if (only_release_metadata) {
-				btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							release_bytes, true);
-			} else {
-				u64 __pos;
+		iov_iter_advance(i, copied);
 
-				__pos = round_down(pos,
-						   fs_info->sectorsize) +
-					(dirty_pages << PAGE_SHIFT);
+		if (copied > 0) {
+			ret = btrfs_dirty_pages(BTRFS_I(inode), &page, 1, pos,
+						copied, &cached_state);
+		} else {
+			/* No bytes copied, need to free reserved space */
+			if (only_release_metadata)
+				btrfs_delalloc_release_metadata(BTRFS_I(inode),
+						reserve_bytes, true);
+			else
 				btrfs_delalloc_release_space(BTRFS_I(inode),
-						data_reserved, __pos,
-						release_bytes, true);
-			}
+						data_reserved, pos, write_bytes,
+						true);
 		}
 
-		release_bytes = round_up(copied + sector_offset,
-					fs_info->sectorsize);
-
-		if (copied > 0)
-			ret = btrfs_dirty_pages(BTRFS_I(inode), pages,
-						dirty_pages, pos, copied,
-						&cached_state);
-
 		/*
 		 * If we have not locked the extent range, because the range's
 		 * start offset is >= i_size, we might still have a non-NULL
@@ -1811,26 +1681,22 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 
 		btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes);
 		if (ret) {
-			btrfs_drop_pages(pages, num_pages);
+			btrfs_drop_pages(&page, 1);
 			break;
 		}
 
-		release_bytes = 0;
-		if (only_release_metadata)
+		if (only_release_metadata) {
 			btrfs_check_nocow_unlock(BTRFS_I(inode));
-
-		if (only_release_metadata && copied > 0) {
 			lockstart = round_down(pos,
 					       fs_info->sectorsize);
-			lockend = round_up(pos + copied,
-					   fs_info->sectorsize) - 1;
+			lockend = lockstart + PAGE_SIZE - 1;
 
 			set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
 				       lockend, EXTENT_NORESERVE, NULL,
 				       NULL, GFP_NOFS);
 		}
 
-		btrfs_drop_pages(pages, num_pages);
+		btrfs_drop_pages(&page, 1);
 
 		cond_resched();
 
@@ -1840,21 +1706,6 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 		num_written += copied;
 	}
 
-	kfree(pages);
-
-	if (release_bytes) {
-		if (only_release_metadata) {
-			btrfs_check_nocow_unlock(BTRFS_I(inode));
-			btrfs_delalloc_release_metadata(BTRFS_I(inode),
-					release_bytes, true);
-		} else {
-			btrfs_delalloc_release_space(BTRFS_I(inode),
-					data_reserved,
-					round_down(pos, fs_info->sectorsize),
-					release_bytes, true);
-		}
-	}
-
 	extent_changeset_free(data_reserved);
 	return num_written ? num_written : ret;
 }
-- 
2.28.0



* Re: [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace
  2020-08-19  5:53 [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace Qu Wenruo
@ 2020-08-19  8:50 ` kernel test robot
  2020-08-19  9:09 ` Filipe Manana
  2020-08-19 20:11 ` Josef Bacik
  2 siblings, 0 replies; 5+ messages in thread
From: kernel test robot @ 2020-08-19  8:50 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: kbuild-all, clang-built-linux

[-- Attachment #1: Type: text/plain, Size: 15979 bytes --]

Hi Qu,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on kdave/for-next]
[also build test WARNING on v5.9-rc1 next-20200819]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Qu-Wenruo/btrfs-switch-btrfs_buffered_write-to-page-by-page-pace/20200819-135541
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: arm-randconfig-r022-20200818 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project b34b1e38381fa4d1b1d9751a6b5233b68e734cfe)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from fs/btrfs/file.c:19:
   fs/btrfs/ctree.h:2265:8: warning: 'const' type qualifier on return type has no effect [-Wignored-qualifiers]
   size_t __const btrfs_get_num_csums(void);
          ^~~~~~~~
>> fs/btrfs/file.c:1570:24: warning: comparison of distinct pointer types ('typeof (iov_iter_count(i)) *' (aka 'unsigned int *') and 'typeof (((1UL) << 12) - offset) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
                   size_t write_bytes = min(iov_iter_count(i),
                                        ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/kernel.h:884:19: note: expanded from macro 'min'
   #define min(x, y)       __careful_cmp(x, y, <)
                           ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/kernel.h:875:24: note: expanded from macro '__careful_cmp'
           __builtin_choose_expr(__safe_cmp(x, y), \
                                 ^~~~~~~~~~~~~~~~
   include/linux/kernel.h:865:4: note: expanded from macro '__safe_cmp'
                   (__typecheck(x, y) && __no_side_effects(x, y))
                    ^~~~~~~~~~~~~~~~~
   include/linux/kernel.h:851:29: note: expanded from macro '__typecheck'
                   (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                              ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   2 warnings generated.
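
The warning is min()'s strict type check firing: iov_iter_count() returns
size_t (unsigned int on 32-bit arm), while PAGE_SIZE - offset promotes to
unsigned long. A minimal sketch of the usual fix is an explicit min_t():

	size_t write_bytes = min_t(size_t, iov_iter_count(i),
				   PAGE_SIZE - offset);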

# https://github.com/0day-ci/linux/commit/89f414f96d87eab3a1c3a3afc5283f80154e7a8b
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Qu-Wenruo/btrfs-switch-btrfs_buffered_write-to-page-by-page-pace/20200819-135541
git checkout 89f414f96d87eab3a1c3a3afc5283f80154e7a8b
vim +1570 fs/btrfs/file.c

89f414f96d87ea Qu Wenruo          2020-08-19  1550  
e4af400a9c5081 Goldwyn Rodrigues  2018-06-17  1551  static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
e4af400a9c5081 Goldwyn Rodrigues  2018-06-17  1552  					       struct iov_iter *i)
39279cc3d2704c Chris Mason        2007-06-12  1553  {
e4af400a9c5081 Goldwyn Rodrigues  2018-06-17  1554  	struct file *file = iocb->ki_filp;
e4af400a9c5081 Goldwyn Rodrigues  2018-06-17  1555  	loff_t pos = iocb->ki_pos;
496ad9aa8ef448 Al Viro            2013-01-23  1556  	struct inode *inode = file_inode(file);
0b246afa62b0cf Jeff Mahoney       2016-06-22  1557  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
89f414f96d87ea Qu Wenruo          2020-08-19  1558  	struct page *page = NULL;
364ecf3651e086 Qu Wenruo          2017-02-27  1559  	struct extent_changeset *data_reserved = NULL;
376cc685cb3b43 Miao Xie           2013-12-10  1560  	u64 lockstart;
376cc685cb3b43 Miao Xie           2013-12-10  1561  	u64 lockend;
d0215f3e5ebb58 Josef Bacik        2011-01-25  1562  	size_t num_written = 0;
c9149235a42ab9 Tsutomu Itoh       2011-03-30  1563  	int ret = 0;
7ee9e4405f264e Josef Bacik        2013-06-21  1564  	bool only_release_metadata = false;
b6316429af7f36 Josef Bacik        2011-09-30  1565  	bool force_page_uptodate = false;
4b46fce23349bf Josef Bacik        2010-05-23  1566  
d0215f3e5ebb58 Josef Bacik        2011-01-25  1567  	while (iov_iter_count(i) > 0) {
c67d970f0ea8dc Filipe Manana      2019-09-30  1568  		struct extent_state *cached_state = NULL;
7073017aeb98db Johannes Thumshirn 2018-12-05  1569  		size_t offset = offset_in_page(pos);
d0215f3e5ebb58 Josef Bacik        2011-01-25 @1570  		size_t write_bytes = min(iov_iter_count(i),
89f414f96d87ea Qu Wenruo          2020-08-19  1571  					 PAGE_SIZE - offset);
89f414f96d87ea Qu Wenruo          2020-08-19  1572  		size_t reserve_bytes = PAGE_SIZE;
d0215f3e5ebb58 Josef Bacik        2011-01-25  1573  		size_t copied;
79f015f216539d Goldwyn Rodrigues  2017-10-16  1574  		int extents_locked;
39279cc3d2704c Chris Mason        2007-06-12  1575  
914ee295af418e Xin Zhong          2010-12-09  1576  		/*
89f414f96d87ea Qu Wenruo          2020-08-19  1577  		 * Fault pages before locking them in prepare_page()
914ee295af418e Xin Zhong          2010-12-09  1578  		 * to avoid recursive lock
914ee295af418e Xin Zhong          2010-12-09  1579  		 */
d0215f3e5ebb58 Josef Bacik        2011-01-25  1580  		if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) {
914ee295af418e Xin Zhong          2010-12-09  1581  			ret = -EFAULT;
d0215f3e5ebb58 Josef Bacik        2011-01-25  1582  			break;
914ee295af418e Xin Zhong          2010-12-09  1583  		}
914ee295af418e Xin Zhong          2010-12-09  1584  
a0e248bb502d51 Filipe Manana      2019-10-11  1585  		only_release_metadata = false;
d9d8b2a51a404c Qu Wenruo          2015-09-08  1586  
364ecf3651e086 Qu Wenruo          2017-02-27  1587  		extent_changeset_release(data_reserved);
36ea6f3e931391 Nikolay Borisov    2020-06-03  1588  		ret = btrfs_check_data_free_space(BTRFS_I(inode),
36ea6f3e931391 Nikolay Borisov    2020-06-03  1589  						  &data_reserved, pos,
364ecf3651e086 Qu Wenruo          2017-02-27  1590  						  write_bytes);
c6887cd11149d7 Josef Bacik        2016-03-25  1591  		if (ret < 0) {
89f414f96d87ea Qu Wenruo          2020-08-19  1592  			size_t tmp = write_bytes;
38d37aa9c32938 Qu Wenruo          2020-06-24  1593  			if (btrfs_check_nocow_lock(BTRFS_I(inode), pos,
89f414f96d87ea Qu Wenruo          2020-08-19  1594  						   &tmp) > 0) {
89f414f96d87ea Qu Wenruo          2020-08-19  1595  				ASSERT(tmp == write_bytes);
d9d8b2a51a404c Qu Wenruo          2015-09-08  1596  				/*
d9d8b2a51a404c Qu Wenruo          2015-09-08  1597  				 * For nodata cow case, no need to reserve
d9d8b2a51a404c Qu Wenruo          2015-09-08  1598  				 * data space.
d9d8b2a51a404c Qu Wenruo          2015-09-08  1599  				 */
7ee9e4405f264e Josef Bacik        2013-06-21  1600  				only_release_metadata = true;
89f414f96d87ea Qu Wenruo          2020-08-19  1601  				reserve_bytes = 0;
c6887cd11149d7 Josef Bacik        2016-03-25  1602  			} else {
d0215f3e5ebb58 Josef Bacik        2011-01-25  1603  				break;
c6887cd11149d7 Josef Bacik        2016-03-25  1604  			}
c6887cd11149d7 Josef Bacik        2016-03-25  1605  		}
1832a6d5ee3b1a Chris Mason        2007-12-21  1606  
9f3db423f98c5c Nikolay Borisov    2017-02-20  1607  		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
9f3db423f98c5c Nikolay Borisov    2017-02-20  1608  				reserve_bytes);
7ee9e4405f264e Josef Bacik        2013-06-21  1609  		if (ret) {
7ee9e4405f264e Josef Bacik        2013-06-21  1610  			if (!only_release_metadata)
25ce28caaa1ddc Nikolay Borisov    2020-06-03  1611  				btrfs_free_reserved_data_space(BTRFS_I(inode),
bc42bda22345ef Qu Wenruo          2017-02-27  1612  						data_reserved, pos,
d9d8b2a51a404c Qu Wenruo          2015-09-08  1613  						write_bytes);
8257b2dc3c1a10 Miao Xie           2014-03-06  1614  			else
38d37aa9c32938 Qu Wenruo          2020-06-24  1615  				btrfs_check_nocow_unlock(BTRFS_I(inode));
7ee9e4405f264e Josef Bacik        2013-06-21  1616  			break;
7ee9e4405f264e Josef Bacik        2013-06-21  1617  		}
7ee9e4405f264e Josef Bacik        2013-06-21  1618  
376cc685cb3b43 Miao Xie           2013-12-10  1619  again:
89f414f96d87ea Qu Wenruo          2020-08-19  1620  		ret = prepare_one_page(inode, &page, pos, write_bytes,
b6316429af7f36 Josef Bacik        2011-09-30  1621  				       force_page_uptodate);
8b62f87bad9cf0 Josef Bacik        2017-10-19  1622  		if (ret) {
8b62f87bad9cf0 Josef Bacik        2017-10-19  1623  			btrfs_delalloc_release_extents(BTRFS_I(inode),
8702ba9396bf7b Qu Wenruo          2019-10-14  1624  						       reserve_bytes);
d0215f3e5ebb58 Josef Bacik        2011-01-25  1625  			break;
8b62f87bad9cf0 Josef Bacik        2017-10-19  1626  		}
39279cc3d2704c Chris Mason        2007-06-12  1627  
79f015f216539d Goldwyn Rodrigues  2017-10-16  1628  		extents_locked = lock_and_cleanup_extent_if_need(
89f414f96d87ea Qu Wenruo          2020-08-19  1629  				BTRFS_I(inode), &page, 1, pos, write_bytes,
89f414f96d87ea Qu Wenruo          2020-08-19  1630  				&lockstart, &lockend, &cached_state);
79f015f216539d Goldwyn Rodrigues  2017-10-16  1631  		if (extents_locked < 0) {
79f015f216539d Goldwyn Rodrigues  2017-10-16  1632  			if (extents_locked == -EAGAIN)
376cc685cb3b43 Miao Xie           2013-12-10  1633  				goto again;
8b62f87bad9cf0 Josef Bacik        2017-10-19  1634  			btrfs_delalloc_release_extents(BTRFS_I(inode),
8702ba9396bf7b Qu Wenruo          2019-10-14  1635  						       reserve_bytes);
79f015f216539d Goldwyn Rodrigues  2017-10-16  1636  			ret = extents_locked;
376cc685cb3b43 Miao Xie           2013-12-10  1637  			break;
376cc685cb3b43 Miao Xie           2013-12-10  1638  		}
376cc685cb3b43 Miao Xie           2013-12-10  1639  
89f414f96d87ea Qu Wenruo          2020-08-19  1640  		copied = iov_iter_copy_from_user_atomic(page, i, offset,
89f414f96d87ea Qu Wenruo          2020-08-19  1641  							write_bytes);
89f414f96d87ea Qu Wenruo          2020-08-19  1642  		flush_dcache_page(page);
56244ef151c3cd Chris Mason        2016-05-16  1643  
89f414f96d87ea Qu Wenruo          2020-08-19  1644  		if (!PageUptodate(page) && copied < write_bytes) {
b1bf862e9dad43 Chris Mason        2011-02-28  1645  			/*
89f414f96d87ea Qu Wenruo          2020-08-19  1646  			 * Short write on non-uptodate page, we must retry and
89f414f96d87ea Qu Wenruo          2020-08-19  1647  			 * force the page uptodate in next run.
b1bf862e9dad43 Chris Mason        2011-02-28  1648  			 */
89f414f96d87ea Qu Wenruo          2020-08-19  1649  			copied = 0;
b6316429af7f36 Josef Bacik        2011-09-30  1650  			force_page_uptodate = true;
b6316429af7f36 Josef Bacik        2011-09-30  1651  		} else {
89f414f96d87ea Qu Wenruo          2020-08-19  1652  			/* Next run doesn't need forced uptodate */
b6316429af7f36 Josef Bacik        2011-09-30  1653  			force_page_uptodate = false;
b6316429af7f36 Josef Bacik        2011-09-30  1654  		}
914ee295af418e Xin Zhong          2010-12-09  1655  
89f414f96d87ea Qu Wenruo          2020-08-19  1656  		iov_iter_advance(i, copied);
485290a734f142 Qu Wenruo          2015-10-29  1657  
89f414f96d87ea Qu Wenruo          2020-08-19  1658  		if (copied > 0) {
89f414f96d87ea Qu Wenruo          2020-08-19  1659  			ret = btrfs_dirty_pages(BTRFS_I(inode), &page, 1, pos,
89f414f96d87ea Qu Wenruo          2020-08-19  1660  						copied, &cached_state);
89f414f96d87ea Qu Wenruo          2020-08-19  1661  		} else {
89f414f96d87ea Qu Wenruo          2020-08-19  1662  			/* No bytes copied, need to free reserved space */
89f414f96d87ea Qu Wenruo          2020-08-19  1663  			if (only_release_metadata)
89f414f96d87ea Qu Wenruo          2020-08-19  1664  				btrfs_delalloc_release_metadata(BTRFS_I(inode),
89f414f96d87ea Qu Wenruo          2020-08-19  1665  						reserve_bytes, true);
89f414f96d87ea Qu Wenruo          2020-08-19  1666  			else
86d52921a2ba51 Nikolay Borisov    2020-06-03  1667  				btrfs_delalloc_release_space(BTRFS_I(inode),
89f414f96d87ea Qu Wenruo          2020-08-19  1668  						data_reserved, pos, write_bytes,
89f414f96d87ea Qu Wenruo          2020-08-19  1669  						true);
485290a734f142 Qu Wenruo          2015-10-29  1670  		}
914ee295af418e Xin Zhong          2010-12-09  1671  
c67d970f0ea8dc Filipe Manana      2019-09-30  1672  		/*
c67d970f0ea8dc Filipe Manana      2019-09-30  1673  		 * If we have not locked the extent range, because the range's
c67d970f0ea8dc Filipe Manana      2019-09-30  1674  		 * start offset is >= i_size, we might still have a non-NULL
c67d970f0ea8dc Filipe Manana      2019-09-30  1675  		 * cached extent state, acquired while marking the extent range
c67d970f0ea8dc Filipe Manana      2019-09-30  1676  		 * as delalloc through btrfs_dirty_pages(). Therefore free any
c67d970f0ea8dc Filipe Manana      2019-09-30  1677  		 * possible cached extent state to avoid a memory leak.
c67d970f0ea8dc Filipe Manana      2019-09-30  1678  		 */
79f015f216539d Goldwyn Rodrigues  2017-10-16  1679  		if (extents_locked)
376cc685cb3b43 Miao Xie           2013-12-10  1680  			unlock_extent_cached(&BTRFS_I(inode)->io_tree,
e43bbe5e16d87b David Sterba       2017-12-12  1681  					     lockstart, lockend, &cached_state);
c67d970f0ea8dc Filipe Manana      2019-09-30  1682  		else
c67d970f0ea8dc Filipe Manana      2019-09-30  1683  			free_extent_state(cached_state);
c67d970f0ea8dc Filipe Manana      2019-09-30  1684  
8702ba9396bf7b Qu Wenruo          2019-10-14  1685  		btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes);
f1de968376340c Miao Xie           2014-01-09  1686  		if (ret) {
89f414f96d87ea Qu Wenruo          2020-08-19  1687  			btrfs_drop_pages(&page, 1);
d0215f3e5ebb58 Josef Bacik        2011-01-25  1688  			break;
f1de968376340c Miao Xie           2014-01-09  1689  		}
39279cc3d2704c Chris Mason        2007-06-12  1690  
89f414f96d87ea Qu Wenruo          2020-08-19  1691  		if (only_release_metadata) {
38d37aa9c32938 Qu Wenruo          2020-06-24  1692  			btrfs_check_nocow_unlock(BTRFS_I(inode));
da17066c40472c Jeff Mahoney       2016-06-15  1693  			lockstart = round_down(pos,
0b246afa62b0cf Jeff Mahoney       2016-06-22  1694  					       fs_info->sectorsize);
89f414f96d87ea Qu Wenruo          2020-08-19  1695  			lockend = lockstart + PAGE_SIZE - 1;
7ee9e4405f264e Josef Bacik        2013-06-21  1696  
7ee9e4405f264e Josef Bacik        2013-06-21  1697  			set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
7ee9e4405f264e Josef Bacik        2013-06-21  1698  				       lockend, EXTENT_NORESERVE, NULL,
7ee9e4405f264e Josef Bacik        2013-06-21  1699  				       NULL, GFP_NOFS);
7ee9e4405f264e Josef Bacik        2013-06-21  1700  		}
7ee9e4405f264e Josef Bacik        2013-06-21  1701  
89f414f96d87ea Qu Wenruo          2020-08-19  1702  		btrfs_drop_pages(&page, 1);
f1de968376340c Miao Xie           2014-01-09  1703  
d0215f3e5ebb58 Josef Bacik        2011-01-25  1704  		cond_resched();
d0215f3e5ebb58 Josef Bacik        2011-01-25  1705  
d0e1d66b5aa1ec Namjae Jeon        2012-12-11  1706  		balance_dirty_pages_ratelimited(inode->i_mapping);
cb843a6f513a1a Chris Mason        2008-10-03  1707  
914ee295af418e Xin Zhong          2010-12-09  1708  		pos += copied;
914ee295af418e Xin Zhong          2010-12-09  1709  		num_written += copied;
d0215f3e5ebb58 Josef Bacik        2011-01-25  1710  	}
39279cc3d2704c Chris Mason        2007-06-12  1711  
364ecf3651e086 Qu Wenruo          2017-02-27  1712  	extent_changeset_free(data_reserved);
d0215f3e5ebb58 Josef Bacik        2011-01-25  1713  	return num_written ? num_written : ret;
39279cc3d2704c Chris Mason        2007-06-12  1714  }
d0215f3e5ebb58 Josef Bacik        2011-01-25  1715  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30266 bytes --]


* Re: [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace
  2020-08-19  5:53 [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace Qu Wenruo
  2020-08-19  8:50 ` kernel test robot
@ 2020-08-19  9:09 ` Filipe Manana
  2020-08-19  9:16   ` Qu Wenruo
  2020-08-19 20:11 ` Josef Bacik
  2 siblings, 1 reply; 5+ messages in thread
From: Filipe Manana @ 2020-08-19  9:09 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Aug 19, 2020 at 6:55 AM Qu Wenruo <wqu@suse.com> wrote:
>
> Before this patch, btrfs_buffered_write() does page copies in an 8-page
> batch.
>
> EXT4, by contrast, uses generic_perform_write(), which copies page by
> page.
>
> The 8-page batch behavior makes a lot of things more complex:
> - More complex error handling
>   We need to handle all errors for the half-written case.
>
> - More complex advance check
>   With 8 pages we must consider cases like only 4 pages copied, which
>   means releasing the reserved space for the untouched 4 pages.
>
> - More wrappers for multi-page operations
>   The most obvious one is btrfs_copy_from_user(), which introduces way
>   more complexity than we need.
>
> This patch changes the behavior to a page-by-page pace: each iteration
> reserves space for one page and copies one page.
>
> A lot of complexity still remains, mostly for short copies,
> non-uptodate pages and extent locking, but that is more or less the
> same as generic_perform_write().
>
> The performance is the same for 4K block size buffered writes, but
> there is an obvious impact when the write block size spans multiple
> pages:
>
> The test involves writing a 128MiB file, which is smaller than 1/8th of
> the system memory.
>                 Speed (MiB/sec)         Ops (ops/sec)
> Unpatched:      931.498                 14903.9756
> Patched:        447.606                 7161.6806
>
> In fact, if we account for the execution time of btrfs_buffered_write(),
> meta/data reservation and later page dirtying take way more time than
> the memory copy:
>
> Patched:
>  nr_runs          = 32768
>  total_prepare_ns = 66908022
>  total_copy_ns    = 75532103
>  total_cleanup_ns = 135749090
>
> Unpatched:
>  nr_runs          = 2176
>  total_prepare_ns = 7425773
>  total_copy_ns    = 87780898
>  total_cleanup_ns = 37704811
>
> The patched behavior is now similar to EXT4's, where buffered write
> performance remains mostly unchanged from 4K block size upward.
>
> On the other hand, XFS uses iomap, which supports multi-page reserve and
> copy, leading to performance similar to that of unpatched btrfs.
>
> It looks like we'd better go the iomap route rather than
> generic_perform_write().
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Reason for RFC:
>
> The performance drop is enough for this patch to be discarded.

More than enough, a nearly 50% performance drop for a very common operation.
I can't even understand why you bothered proposing it.

Thanks.

> ---
>  fs/btrfs/file.c | 293 ++++++++++++------------------------------------
>  1 file changed, 72 insertions(+), 221 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index bbfc8819cf28..be595da9bc05 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -379,60 +379,6 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
>         return 0;
>  }
>
> -/* simple helper to fault in pages and copy.  This should go away
> - * and be replaced with calls into generic code.
> - */
> -static noinline int btrfs_copy_from_user(loff_t pos, size_t write_bytes,
> -                                        struct page **prepared_pages,
> -                                        struct iov_iter *i)
> -{
> -       size_t copied = 0;
> -       size_t total_copied = 0;
> -       int pg = 0;
> -       int offset = offset_in_page(pos);
> -
> -       while (write_bytes > 0) {
> -               size_t count = min_t(size_t,
> -                                    PAGE_SIZE - offset, write_bytes);
> -               struct page *page = prepared_pages[pg];
> -               /*
> -                * Copy data from userspace to the current page
> -                */
> -               copied = iov_iter_copy_from_user_atomic(page, i, offset, count);
> -
> -               /* Flush processor's dcache for this page */
> -               flush_dcache_page(page);
> -
> -               /*
> -                * if we get a partial write, we can end up with
> -                * partially up to date pages.  These add
> -                * a lot of complexity, so make sure they don't
> -                * happen by forcing this copy to be retried.
> -                *
> -                * The rest of the btrfs_file_write code will fall
> -                * back to page at a time copies after we return 0.
> -                */
> -               if (!PageUptodate(page) && copied < count)
> -                       copied = 0;
> -
> -               iov_iter_advance(i, copied);
> -               write_bytes -= copied;
> -               total_copied += copied;
> -
> -               /* Return to btrfs_file_write_iter to fault page */
> -               if (unlikely(copied == 0))
> -                       break;
> -
> -               if (copied < PAGE_SIZE - offset) {
> -                       offset += copied;
> -               } else {
> -                       pg++;
> -                       offset = 0;
> -               }
> -       }
> -       return total_copied;
> -}
> -
>  /*
>   * unlocks pages after btrfs_file_write is done with them
>   */
> @@ -443,8 +389,8 @@ static void btrfs_drop_pages(struct page **pages, size_t num_pages)
>                 /* page checked is some magic around finding pages that
>                  * have been modified without going through btrfs_set_page_dirty
>                  * clear it here. There should be no need to mark the pages
> -                * accessed as prepare_pages should have marked them accessed
> -                * in prepare_pages via find_or_create_page()
> +                * accessed as prepare_pages() should have marked them accessed
> +                * in prepare_pages() via find_or_create_page()
>                  */
>                 ClearPageChecked(pages[i]);
>                 unlock_page(pages[i]);
> @@ -1400,58 +1346,6 @@ static int prepare_uptodate_page(struct inode *inode,
>         return 0;
>  }
>
> -/*
> - * this just gets pages into the page cache and locks them down.
> - */
> -static noinline int prepare_pages(struct inode *inode, struct page **pages,
> -                                 size_t num_pages, loff_t pos,
> -                                 size_t write_bytes, bool force_uptodate)
> -{
> -       int i;
> -       unsigned long index = pos >> PAGE_SHIFT;
> -       gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
> -       int err = 0;
> -       int faili;
> -
> -       for (i = 0; i < num_pages; i++) {
> -again:
> -               pages[i] = find_or_create_page(inode->i_mapping, index + i,
> -                                              mask | __GFP_WRITE);
> -               if (!pages[i]) {
> -                       faili = i - 1;
> -                       err = -ENOMEM;
> -                       goto fail;
> -               }
> -
> -               if (i == 0)
> -                       err = prepare_uptodate_page(inode, pages[i], pos,
> -                                                   force_uptodate);
> -               if (!err && i == num_pages - 1)
> -                       err = prepare_uptodate_page(inode, pages[i],
> -                                                   pos + write_bytes, false);
> -               if (err) {
> -                       put_page(pages[i]);
> -                       if (err == -EAGAIN) {
> -                               err = 0;
> -                               goto again;
> -                       }
> -                       faili = i - 1;
> -                       goto fail;
> -               }
> -               wait_on_page_writeback(pages[i]);
> -       }
> -
> -       return 0;
> -fail:
> -       while (faili >= 0) {
> -               unlock_page(pages[faili]);
> -               put_page(pages[faili]);
> -               faili--;
> -       }
> -       return err;
> -
> -}
> -
>  /*
>   * This function locks the extent and properly waits for data=ordered extents
>   * to finish before allowing the pages to be modified if need.
> @@ -1619,6 +1513,38 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode)
>         btrfs_drew_write_unlock(&inode->root->snapshot_lock);
>  }
>
> +static int prepare_one_page(struct inode *inode, struct page **page_ret,
> +                           loff_t pos, size_t write_bytes, bool force_uptodate)
> +{
> +       gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping) | __GFP_WRITE;
> +       struct page *page;
> +       int ret;
> +
> +again:
> +       page = find_or_create_page(inode->i_mapping, pos >> PAGE_SHIFT, mask);
> +       if (!page)
> +               return -ENOMEM;
> +
> +       /*
> +        * We need the page uptodate for the following cases:
> +        * - Write range only covers part of the page
> +        * - We got a short copy on non-uptodate page in previous run
> +        */
> +       if ((!(offset_in_page(pos) == 0 && write_bytes == PAGE_SIZE) ||
> +            force_uptodate) && !PageUptodate(page)) {
> +               ret = prepare_uptodate_page(inode, page, pos, true);
> +               if (ret) {
> +                       put_page(page);
> +                       if (ret == -EAGAIN)
> +                               goto again;
> +                       return ret;
> +               }
> +               wait_on_page_writeback(page);
> +       }
> +       *page_ret = page;
> +       return 0;
> +}
> +
>  static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>                                                struct iov_iter *i)
>  {
> @@ -1626,45 +1552,26 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>         loff_t pos = iocb->ki_pos;
>         struct inode *inode = file_inode(file);
>         struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> -       struct page **pages = NULL;
> +       struct page *page = NULL;
>         struct extent_changeset *data_reserved = NULL;
> -       u64 release_bytes = 0;
>         u64 lockstart;
>         u64 lockend;
>         size_t num_written = 0;
> -       int nrptrs;
>         int ret = 0;
>         bool only_release_metadata = false;
>         bool force_page_uptodate = false;
>
> -       nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
> -                       PAGE_SIZE / (sizeof(struct page *)));
> -       nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied);
> -       nrptrs = max(nrptrs, 8);
> -       pages = kmalloc_array(nrptrs, sizeof(struct page *), GFP_KERNEL);
> -       if (!pages)
> -               return -ENOMEM;
> -
>         while (iov_iter_count(i) > 0) {
>                 struct extent_state *cached_state = NULL;
>                 size_t offset = offset_in_page(pos);
> -               size_t sector_offset;
>                 size_t write_bytes = min(iov_iter_count(i),
> -                                        nrptrs * (size_t)PAGE_SIZE -
> -                                        offset);
> -               size_t num_pages = DIV_ROUND_UP(write_bytes + offset,
> -                                               PAGE_SIZE);
> -               size_t reserve_bytes;
> -               size_t dirty_pages;
> +                                        PAGE_SIZE - offset);
> +               size_t reserve_bytes = PAGE_SIZE;
>                 size_t copied;
> -               size_t dirty_sectors;
> -               size_t num_sectors;
>                 int extents_locked;
>
> -               WARN_ON(num_pages > nrptrs);
> -
>                 /*
> -                * Fault pages before locking them in prepare_pages
> +                * Fault pages before locking them in prepare_page()
>                  * to avoid recursive lock
>                  */
>                 if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) {
> @@ -1673,37 +1580,27 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>                 }
>
>                 only_release_metadata = false;
> -               sector_offset = pos & (fs_info->sectorsize - 1);
> -               reserve_bytes = round_up(write_bytes + sector_offset,
> -                               fs_info->sectorsize);
>
>                 extent_changeset_release(data_reserved);
>                 ret = btrfs_check_data_free_space(BTRFS_I(inode),
>                                                   &data_reserved, pos,
>                                                   write_bytes);
>                 if (ret < 0) {
> +                       size_t tmp = write_bytes;
>                         if (btrfs_check_nocow_lock(BTRFS_I(inode), pos,
> -                                                  &write_bytes) > 0) {
> +                                                  &tmp) > 0) {
> +                               ASSERT(tmp == write_bytes);
>                                 /*
>                                  * For nodata cow case, no need to reserve
>                                  * data space.
>                                  */
>                                 only_release_metadata = true;
> -                               /*
> -                                * our prealloc extent may be smaller than
> -                                * write_bytes, so scale down.
> -                                */
> -                               num_pages = DIV_ROUND_UP(write_bytes + offset,
> -                                                        PAGE_SIZE);
> -                               reserve_bytes = round_up(write_bytes +
> -                                                        sector_offset,
> -                                                        fs_info->sectorsize);
> +                               reserve_bytes = 0;
>                         } else {
>                                 break;
>                         }
>                 }
>
> -               WARN_ON(reserve_bytes == 0);
>                 ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
>                                 reserve_bytes);
>                 if (ret) {
> @@ -1716,16 +1613,9 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>                         break;
>                 }
>
> -               release_bytes = reserve_bytes;
>  again:
> -               /*
> -                * This is going to setup the pages array with the number of
> -                * pages we want, so we don't really need to worry about the
> -                * contents of pages from loop to loop
> -                */
> -               ret = prepare_pages(inode, pages, num_pages,
> -                                   pos, write_bytes,
> -                                   force_page_uptodate);
> +               ret = prepare_one_page(inode, &page, pos, write_bytes,
> +                                      force_page_uptodate);
>                 if (ret) {
>                         btrfs_delalloc_release_extents(BTRFS_I(inode),
>                                                        reserve_bytes);
> @@ -1733,9 +1623,8 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>                 }
>
>                 extents_locked = lock_and_cleanup_extent_if_need(
> -                               BTRFS_I(inode), pages,
> -                               num_pages, pos, write_bytes, &lockstart,
> -                               &lockend, &cached_state);
> +                               BTRFS_I(inode), &page, 1, pos, write_bytes,
> +                               &lockstart, &lockend, &cached_state);
>                 if (extents_locked < 0) {
>                         if (extents_locked == -EAGAIN)
>                                 goto again;
> @@ -1745,57 +1634,38 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>                         break;
>                 }
>
> -               copied = btrfs_copy_from_user(pos, write_bytes, pages, i);
> -
> -               num_sectors = BTRFS_BYTES_TO_BLKS(fs_info, reserve_bytes);
> -               dirty_sectors = round_up(copied + sector_offset,
> -                                       fs_info->sectorsize);
> -               dirty_sectors = BTRFS_BYTES_TO_BLKS(fs_info, dirty_sectors);
> -
> -               /*
> -                * if we have trouble faulting in the pages, fall
> -                * back to one page at a time
> -                */
> -               if (copied < write_bytes)
> -                       nrptrs = 1;
> +               copied = iov_iter_copy_from_user_atomic(page, i, offset,
> +                                                       write_bytes);
> +               flush_dcache_page(page);
>
> -               if (copied == 0) {
> +               if (!PageUptodate(page) && copied < write_bytes) {
> +                       /*
> +                        * Short write on non-uptodate page, we must retry and
> +                        * force the page uptodate in next run.
> +                        */
> +                       copied = 0;
>                         force_page_uptodate = true;
> -                       dirty_sectors = 0;
> -                       dirty_pages = 0;
>                 } else {
> +                       /* Next run doesn't need forced uptodate */
>                         force_page_uptodate = false;
> -                       dirty_pages = DIV_ROUND_UP(copied + offset,
> -                                                  PAGE_SIZE);
>                 }
>
> -               if (num_sectors > dirty_sectors) {
> -                       /* release everything except the sectors we dirtied */
> -                       release_bytes -= dirty_sectors <<
> -                                               fs_info->sb->s_blocksize_bits;
> -                       if (only_release_metadata) {
> -                               btrfs_delalloc_release_metadata(BTRFS_I(inode),
> -                                                       release_bytes, true);
> -                       } else {
> -                               u64 __pos;
> +               iov_iter_advance(i, copied);
>
> -                               __pos = round_down(pos,
> -                                                  fs_info->sectorsize) +
> -                                       (dirty_pages << PAGE_SHIFT);
> +               if (copied > 0) {
> +                       ret = btrfs_dirty_pages(BTRFS_I(inode), &page, 1, pos,
> +                                               copied, &cached_state);
> +               } else {
> +                       /* No bytes copied, need to free reserved space */
> +                       if (only_release_metadata)
> +                               btrfs_delalloc_release_metadata(BTRFS_I(inode),
> +                                               reserve_bytes, true);
> +                       else
>                                 btrfs_delalloc_release_space(BTRFS_I(inode),
> -                                               data_reserved, __pos,
> -                                               release_bytes, true);
> -                       }
> +                                               data_reserved, pos, write_bytes,
> +                                               true);
>                 }
>
> -               release_bytes = round_up(copied + sector_offset,
> -                                       fs_info->sectorsize);
> -
> -               if (copied > 0)
> -                       ret = btrfs_dirty_pages(BTRFS_I(inode), pages,
> -                                               dirty_pages, pos, copied,
> -                                               &cached_state);
> -
>                 /*
>                  * If we have not locked the extent range, because the range's
>                  * start offset is >= i_size, we might still have a non-NULL
> @@ -1811,26 +1681,22 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>
>                 btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes);
>                 if (ret) {
> -                       btrfs_drop_pages(pages, num_pages);
> +                       btrfs_drop_pages(&page, 1);
>                         break;
>                 }
>
> -               release_bytes = 0;
> -               if (only_release_metadata)
> +               if (only_release_metadata) {
>                         btrfs_check_nocow_unlock(BTRFS_I(inode));
> -
> -               if (only_release_metadata && copied > 0) {
>                         lockstart = round_down(pos,
>                                                fs_info->sectorsize);
> -                       lockend = round_up(pos + copied,
> -                                          fs_info->sectorsize) - 1;
> +                       lockend = lockstart + PAGE_SIZE - 1;
>
>                         set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
>                                        lockend, EXTENT_NORESERVE, NULL,
>                                        NULL, GFP_NOFS);
>                 }
>
> -               btrfs_drop_pages(pages, num_pages);
> +               btrfs_drop_pages(&page, 1);
>
>                 cond_resched();
>
> @@ -1840,21 +1706,6 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>                 num_written += copied;
>         }
>
> -       kfree(pages);
> -
> -       if (release_bytes) {
> -               if (only_release_metadata) {
> -                       btrfs_check_nocow_unlock(BTRFS_I(inode));
> -                       btrfs_delalloc_release_metadata(BTRFS_I(inode),
> -                                       release_bytes, true);
> -               } else {
> -                       btrfs_delalloc_release_space(BTRFS_I(inode),
> -                                       data_reserved,
> -                                       round_down(pos, fs_info->sectorsize),
> -                                       release_bytes, true);
> -               }
> -       }
> -
>         extent_changeset_free(data_reserved);
>         return num_written ? num_written : ret;
>  }
> --
> 2.28.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace
  2020-08-19  9:09 ` Filipe Manana
@ 2020-08-19  9:16   ` Qu Wenruo
  0 siblings, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2020-08-19  9:16 UTC (permalink / raw)
  To: fdmanana, Qu Wenruo; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 23579 bytes --]



On 2020/8/19 5:09 PM, Filipe Manana wrote:
> On Wed, Aug 19, 2020 at 6:55 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> Before this patch, btrfs_buffered_write() does page copies in an 8-page
>> batch.
>>
>> EXT4, by contrast, uses generic_perform_write(), which copies page by
>> page.
>>
>> The 8-page batch behavior makes a lot of things more complex:
>> - More complex error handling
>>   We need to handle all errors for the half-written case.
>>
>> - More complex advance check
>>   With 8 pages we must consider cases like only 4 pages copied, which
>>   means releasing the reserved space for the untouched 4 pages.
>>
>> - More wrappers for multi-page operations
>>   The most obvious one is btrfs_copy_from_user(), which introduces way
>>   more complexity than we need.
>>
>> This patch changes the behavior to a page-by-page pace: each iteration
>> reserves space for one page and copies one page.
>>
>> A lot of complexity still remains, mostly for short copies,
>> non-uptodate pages and extent locking, but that is more or less the
>> same as generic_perform_write().
>>
>> The performance is the same for 4K block size buffered writes, but
>> there is an obvious impact when the write block size spans multiple
>> pages:
>>
>> The test involves writing a 128MiB file, which is smaller than 1/8th of
>> the system memory.
>>                 Speed (MiB/sec)         Ops (ops/sec)
>> Unpatched:      931.498                 14903.9756
>> Patched:        447.606                 7161.6806
>>
>> In fact, if we account for the execution time of btrfs_buffered_write(),
>> meta/data reservation and later page dirtying take way more time than
>> the memory copy:
>>
>> Patched:
>>  nr_runs          = 32768
>>  total_prepare_ns = 66908022
>>  total_copy_ns    = 75532103
>>  total_cleanup_ns = 135749090
>>
>> Unpatched:
>>  nr_runs          = 2176
>>  total_prepare_ns = 7425773
>>  total_copy_ns    = 87780898
>>  total_cleanup_ns = 37704811
>>
>> The patched behavior is now similar to EXT4's, where buffered write
>> performance remains mostly unchanged from 4K block size upward.
>>
>> On the other hand, XFS uses iomap, which supports multi-page reserve and
>> copy, leading to performance similar to that of unpatched btrfs.
>>
>> It looks like we'd better go the iomap route rather than
>> generic_perform_write().
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Reason for RFC:
>>
>> The performance drop is enough for this patch to be discarded.
> 
> More than enough, a nearly 50% drop in performance for a very common operation.
> I can't even understand why you bothered proposing it.

Well, at least for blocksize <= 4K, it's still the same performance, but
with a much cleaner code base.

Anyway, I will go in the iomap direction, as iomap seems to have extra
machinery to ensure:
- The iov_iter is always readable
  So we won't need to bother with short in-page copies, and thus no need
  to handle the copied == 0 case.

- The needed operations are done before the real page copy
  I'm still wondering what the proper way is to interact with the page
  and extent locks.
  If we can lock the extent io tree first and then try to lock the pages,
  we can remove one "goto again" call and make it easier to integrate
  with iomap (a rough sketch of the entry point follows).
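
For reference, a minimal sketch of what that iomap entry point could look
like, modeled on the XFS buffered write path. The
btrfs_buffered_write_iomap_ops name is hypothetical; the real work would be
implementing its ->iomap_begin()/->iomap_end() hooks to do the space
reservation and extent locking that btrfs_buffered_write() currently
open-codes:

static const struct iomap_ops btrfs_buffered_write_iomap_ops; /* hypothetical */

static ssize_t btrfs_iomap_buffered_write(struct kiocb *iocb,
					  struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	inode_lock(inode);
	/*
	 * iomap drives the fault-in, per-page copy and short-copy retry
	 * loop itself; the filesystem maps (and reserves space for) one
	 * extent at a time, which keeps the batching.
	 */
	ret = iomap_file_buffered_write(iocb, from,
					&btrfs_buffered_write_iomap_ops);
	inode_unlock(inode);
	return ret;
}

Since iomap takes the page lock inside its copy loop, the extent io tree
would have to be locked from ->iomap_begin(), i.e. before the page lock,
which matches the lock ordering described above.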

Thanks,
Qu

> 
> Thanks.
> 
>> ---
>>  fs/btrfs/file.c | 293 ++++++++++++------------------------------------
>>  1 file changed, 72 insertions(+), 221 deletions(-)
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index bbfc8819cf28..be595da9bc05 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -379,60 +379,6 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
>>         return 0;
>>  }
>>
>> -/* simple helper to fault in pages and copy.  This should go away
>> - * and be replaced with calls into generic code.
>> - */
>> -static noinline int btrfs_copy_from_user(loff_t pos, size_t write_bytes,
>> -                                        struct page **prepared_pages,
>> -                                        struct iov_iter *i)
>> -{
>> -       size_t copied = 0;
>> -       size_t total_copied = 0;
>> -       int pg = 0;
>> -       int offset = offset_in_page(pos);
>> -
>> -       while (write_bytes > 0) {
>> -               size_t count = min_t(size_t,
>> -                                    PAGE_SIZE - offset, write_bytes);
>> -               struct page *page = prepared_pages[pg];
>> -               /*
>> -                * Copy data from userspace to the current page
>> -                */
>> -               copied = iov_iter_copy_from_user_atomic(page, i, offset, count);
>> -
>> -               /* Flush processor's dcache for this page */
>> -               flush_dcache_page(page);
>> -
>> -               /*
>> -                * if we get a partial write, we can end up with
>> -                * partially up to date pages.  These add
>> -                * a lot of complexity, so make sure they don't
>> -                * happen by forcing this copy to be retried.
>> -                *
>> -                * The rest of the btrfs_file_write code will fall
>> -                * back to page at a time copies after we return 0.
>> -                */
>> -               if (!PageUptodate(page) && copied < count)
>> -                       copied = 0;
>> -
>> -               iov_iter_advance(i, copied);
>> -               write_bytes -= copied;
>> -               total_copied += copied;
>> -
>> -               /* Return to btrfs_file_write_iter to fault page */
>> -               if (unlikely(copied == 0))
>> -                       break;
>> -
>> -               if (copied < PAGE_SIZE - offset) {
>> -                       offset += copied;
>> -               } else {
>> -                       pg++;
>> -                       offset = 0;
>> -               }
>> -       }
>> -       return total_copied;
>> -}
>> -
>>  /*
>>   * unlocks pages after btrfs_file_write is done with them
>>   */
>> @@ -443,8 +389,8 @@ static void btrfs_drop_pages(struct page **pages, size_t num_pages)
>>                 /* page checked is some magic around finding pages that
>>                  * have been modified without going through btrfs_set_page_dirty
>>                  * clear it here. There should be no need to mark the pages
>> -                * accessed as prepare_pages should have marked them accessed
>> -                * in prepare_pages via find_or_create_page()
>> +                * accessed as prepare_one_page() should have marked them accessed
>> +                * in prepare_one_page() via find_or_create_page()
>>                  */
>>                 ClearPageChecked(pages[i]);
>>                 unlock_page(pages[i]);
>> @@ -1400,58 +1346,6 @@ static int prepare_uptodate_page(struct inode *inode,
>>         return 0;
>>  }
>>
>> -/*
>> - * this just gets pages into the page cache and locks them down.
>> - */
>> -static noinline int prepare_pages(struct inode *inode, struct page **pages,
>> -                                 size_t num_pages, loff_t pos,
>> -                                 size_t write_bytes, bool force_uptodate)
>> -{
>> -       int i;
>> -       unsigned long index = pos >> PAGE_SHIFT;
>> -       gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
>> -       int err = 0;
>> -       int faili;
>> -
>> -       for (i = 0; i < num_pages; i++) {
>> -again:
>> -               pages[i] = find_or_create_page(inode->i_mapping, index + i,
>> -                                              mask | __GFP_WRITE);
>> -               if (!pages[i]) {
>> -                       faili = i - 1;
>> -                       err = -ENOMEM;
>> -                       goto fail;
>> -               }
>> -
>> -               if (i == 0)
>> -                       err = prepare_uptodate_page(inode, pages[i], pos,
>> -                                                   force_uptodate);
>> -               if (!err && i == num_pages - 1)
>> -                       err = prepare_uptodate_page(inode, pages[i],
>> -                                                   pos + write_bytes, false);
>> -               if (err) {
>> -                       put_page(pages[i]);
>> -                       if (err == -EAGAIN) {
>> -                               err = 0;
>> -                               goto again;
>> -                       }
>> -                       faili = i - 1;
>> -                       goto fail;
>> -               }
>> -               wait_on_page_writeback(pages[i]);
>> -       }
>> -
>> -       return 0;
>> -fail:
>> -       while (faili >= 0) {
>> -               unlock_page(pages[faili]);
>> -               put_page(pages[faili]);
>> -               faili--;
>> -       }
>> -       return err;
>> -
>> -}
>> -
>>  /*
>>   * This function locks the extent and properly waits for data=ordered extents
>>   * to finish before allowing the pages to be modified if need.
>> @@ -1619,6 +1513,38 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode)
>>         btrfs_drew_write_unlock(&inode->root->snapshot_lock);
>>  }
>>
>> +static int prepare_one_page(struct inode *inode, struct page **page_ret,
>> +                           loff_t pos, size_t write_bytes, bool force_uptodate)
>> +{
>> +       gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping) | __GFP_WRITE;
>> +       struct page *page;
>> +       int ret;
>> +
>> +again:
>> +       page = find_or_create_page(inode->i_mapping, pos >> PAGE_SHIFT, mask);
>> +       if (!page)
>> +               return -ENOMEM;
>> +
>> +       /*
>> +        * We need the page uptodate for the following cases:
>> +        * - Write range only covers part of the page
>> +        * - We got a short copy on non-uptodate page in previous run
>> +        */
>> +       if ((!(offset_in_page(pos) == 0 && write_bytes == PAGE_SIZE) ||
>> +            force_uptodate) && !PageUptodate(page)) {
>> +               ret = prepare_uptodate_page(inode, page, pos, true);
>> +               if (ret) {
>> +                       put_page(page);
>> +                       if (ret == -EAGAIN)
>> +                               goto again;
>> +                       return ret;
>> +               }
>> +               wait_on_page_writeback(page);
>> +       }
>> +       *page_ret = page;
>> +       return 0;
>> +}
>> +
>>  static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>>                                                struct iov_iter *i)
>>  {
>> @@ -1626,45 +1552,26 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>>         loff_t pos = iocb->ki_pos;
>>         struct inode *inode = file_inode(file);
>>         struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>> -       struct page **pages = NULL;
>> +       struct page *page = NULL;
>>         struct extent_changeset *data_reserved = NULL;
>> -       u64 release_bytes = 0;
>>         u64 lockstart;
>>         u64 lockend;
>>         size_t num_written = 0;
>> -       int nrptrs;
>>         int ret = 0;
>>         bool only_release_metadata = false;
>>         bool force_page_uptodate = false;
>>
>> -       nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
>> -                       PAGE_SIZE / (sizeof(struct page *)));
>> -       nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied);
>> -       nrptrs = max(nrptrs, 8);
>> -       pages = kmalloc_array(nrptrs, sizeof(struct page *), GFP_KERNEL);
>> -       if (!pages)
>> -               return -ENOMEM;
>> -
>>         while (iov_iter_count(i) > 0) {
>>                 struct extent_state *cached_state = NULL;
>>                 size_t offset = offset_in_page(pos);
>> -               size_t sector_offset;
>>                 size_t write_bytes = min(iov_iter_count(i),
>> -                                        nrptrs * (size_t)PAGE_SIZE -
>> -                                        offset);
>> -               size_t num_pages = DIV_ROUND_UP(write_bytes + offset,
>> -                                               PAGE_SIZE);
>> -               size_t reserve_bytes;
>> -               size_t dirty_pages;
>> +                                        PAGE_SIZE - offset);
>> +               size_t reserve_bytes = PAGE_SIZE;
>>                 size_t copied;
>> -               size_t dirty_sectors;
>> -               size_t num_sectors;
>>                 int extents_locked;
>>
>> -               WARN_ON(num_pages > nrptrs);
>> -
>>                 /*
>> -                * Fault pages before locking them in prepare_pages
>> +                * Fault pages before locking them in prepare_one_page()
>>                  * to avoid recursive lock
>>                  */
>>                 if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) {
>> @@ -1673,37 +1580,27 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>>                 }
>>
>>                 only_release_metadata = false;
>> -               sector_offset = pos & (fs_info->sectorsize - 1);
>> -               reserve_bytes = round_up(write_bytes + sector_offset,
>> -                               fs_info->sectorsize);
>>
>>                 extent_changeset_release(data_reserved);
>>                 ret = btrfs_check_data_free_space(BTRFS_I(inode),
>>                                                   &data_reserved, pos,
>>                                                   write_bytes);
>>                 if (ret < 0) {
>> +                       size_t tmp = write_bytes;
>>                         if (btrfs_check_nocow_lock(BTRFS_I(inode), pos,
>> -                                                  &write_bytes) > 0) {
>> +                                                  &tmp) > 0) {
>> +                               ASSERT(tmp == write_bytes);
>>                                 /*
>>                                  * For nodata cow case, no need to reserve
>>                                  * data space.
>>                                  */
>>                                 only_release_metadata = true;
>> -                               /*
>> -                                * our prealloc extent may be smaller than
>> -                                * write_bytes, so scale down.
>> -                                */
>> -                               num_pages = DIV_ROUND_UP(write_bytes + offset,
>> -                                                        PAGE_SIZE);
>> -                               reserve_bytes = round_up(write_bytes +
>> -                                                        sector_offset,
>> -                                                        fs_info->sectorsize);
>> +                               reserve_bytes = 0;
>>                         } else {
>>                                 break;
>>                         }
>>                 }
>>
>> -               WARN_ON(reserve_bytes == 0);
>>                 ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
>>                                 reserve_bytes);
>>                 if (ret) {
>> @@ -1716,16 +1613,9 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>>                         break;
>>                 }
>>
>> -               release_bytes = reserve_bytes;
>>  again:
>> -               /*
>> -                * This is going to setup the pages array with the number of
>> -                * pages we want, so we don't really need to worry about the
>> -                * contents of pages from loop to loop
>> -                */
>> -               ret = prepare_pages(inode, pages, num_pages,
>> -                                   pos, write_bytes,
>> -                                   force_page_uptodate);
>> +               ret = prepare_one_page(inode, &page, pos, write_bytes,
>> +                                      force_page_uptodate);
>>                 if (ret) {
>>                         btrfs_delalloc_release_extents(BTRFS_I(inode),
>>                                                        reserve_bytes);
>> @@ -1733,9 +1623,8 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>>                 }
>>
>>                 extents_locked = lock_and_cleanup_extent_if_need(
>> -                               BTRFS_I(inode), pages,
>> -                               num_pages, pos, write_bytes, &lockstart,
>> -                               &lockend, &cached_state);
>> +                               BTRFS_I(inode), &page, 1, pos, write_bytes,
>> +                               &lockstart, &lockend, &cached_state);
>>                 if (extents_locked < 0) {
>>                         if (extents_locked == -EAGAIN)
>>                                 goto again;
>> @@ -1745,57 +1634,38 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>>                         break;
>>                 }
>>
>> -               copied = btrfs_copy_from_user(pos, write_bytes, pages, i);
>> -
>> -               num_sectors = BTRFS_BYTES_TO_BLKS(fs_info, reserve_bytes);
>> -               dirty_sectors = round_up(copied + sector_offset,
>> -                                       fs_info->sectorsize);
>> -               dirty_sectors = BTRFS_BYTES_TO_BLKS(fs_info, dirty_sectors);
>> -
>> -               /*
>> -                * if we have trouble faulting in the pages, fall
>> -                * back to one page at a time
>> -                */
>> -               if (copied < write_bytes)
>> -                       nrptrs = 1;
>> +               copied = iov_iter_copy_from_user_atomic(page, i, offset,
>> +                                                       write_bytes);
>> +               flush_dcache_page(page);
>>
>> -               if (copied == 0) {
>> +               if (!PageUptodate(page) && copied < write_bytes) {
>> +                       /*
>> +                        * Short write on non-uptodate page, we must retry and
>> +                        * force the page uptodate in next run.
>> +                        */
>> +                       copied = 0;
>>                         force_page_uptodate = true;
>> -                       dirty_sectors = 0;
>> -                       dirty_pages = 0;
>>                 } else {
>> +                       /* Next run doesn't need forced uptodate */
>>                         force_page_uptodate = false;
>> -                       dirty_pages = DIV_ROUND_UP(copied + offset,
>> -                                                  PAGE_SIZE);
>>                 }
>>
>> -               if (num_sectors > dirty_sectors) {
>> -                       /* release everything except the sectors we dirtied */
>> -                       release_bytes -= dirty_sectors <<
>> -                                               fs_info->sb->s_blocksize_bits;
>> -                       if (only_release_metadata) {
>> -                               btrfs_delalloc_release_metadata(BTRFS_I(inode),
>> -                                                       release_bytes, true);
>> -                       } else {
>> -                               u64 __pos;
>> +               iov_iter_advance(i, copied);
>>
>> -                               __pos = round_down(pos,
>> -                                                  fs_info->sectorsize) +
>> -                                       (dirty_pages << PAGE_SHIFT);
>> +               if (copied > 0) {
>> +                       ret = btrfs_dirty_pages(BTRFS_I(inode), &page, 1, pos,
>> +                                               copied, &cached_state);
>> +               } else {
>> +                       /* No bytes copied, need to free reserved space */
>> +                       if (only_release_metadata)
>> +                               btrfs_delalloc_release_metadata(BTRFS_I(inode),
>> +                                               reserve_bytes, true);
>> +                       else
>>                                 btrfs_delalloc_release_space(BTRFS_I(inode),
>> -                                               data_reserved, __pos,
>> -                                               release_bytes, true);
>> -                       }
>> +                                               data_reserved, pos, write_bytes,
>> +                                               true);
>>                 }
>>
>> -               release_bytes = round_up(copied + sector_offset,
>> -                                       fs_info->sectorsize);
>> -
>> -               if (copied > 0)
>> -                       ret = btrfs_dirty_pages(BTRFS_I(inode), pages,
>> -                                               dirty_pages, pos, copied,
>> -                                               &cached_state);
>> -
>>                 /*
>>                  * If we have not locked the extent range, because the range's
>>                  * start offset is >= i_size, we might still have a non-NULL
>> @@ -1811,26 +1681,22 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>>
>>                 btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes);
>>                 if (ret) {
>> -                       btrfs_drop_pages(pages, num_pages);
>> +                       btrfs_drop_pages(&page, 1);
>>                         break;
>>                 }
>>
>> -               release_bytes = 0;
>> -               if (only_release_metadata)
>> +               if (only_release_metadata) {
>>                         btrfs_check_nocow_unlock(BTRFS_I(inode));
>> -
>> -               if (only_release_metadata && copied > 0) {
>>                         lockstart = round_down(pos,
>>                                                fs_info->sectorsize);
>> -                       lockend = round_up(pos + copied,
>> -                                          fs_info->sectorsize) - 1;
>> +                       lockend = lockstart + PAGE_SIZE - 1;
>>
>>                         set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
>>                                        lockend, EXTENT_NORESERVE, NULL,
>>                                        NULL, GFP_NOFS);
>>                 }
>>
>> -               btrfs_drop_pages(pages, num_pages);
>> +               btrfs_drop_pages(&page, 1);
>>
>>                 cond_resched();
>>
>> @@ -1840,21 +1706,6 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>>                 num_written += copied;
>>         }
>>
>> -       kfree(pages);
>> -
>> -       if (release_bytes) {
>> -               if (only_release_metadata) {
>> -                       btrfs_check_nocow_unlock(BTRFS_I(inode));
>> -                       btrfs_delalloc_release_metadata(BTRFS_I(inode),
>> -                                       release_bytes, true);
>> -               } else {
>> -                       btrfs_delalloc_release_space(BTRFS_I(inode),
>> -                                       data_reserved,
>> -                                       round_down(pos, fs_info->sectorsize),
>> -                                       release_bytes, true);
>> -               }
>> -       }
>> -
>>         extent_changeset_free(data_reserved);
>>         return num_written ? num_written : ret;
>>  }
>> --
>> 2.28.0
>>
> 
> 




* Re: [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace
  2020-08-19  5:53 [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace Qu Wenruo
  2020-08-19  8:50 ` kernel test robot
  2020-08-19  9:09 ` Filipe Manana
@ 2020-08-19 20:11 ` Josef Bacik
  2 siblings, 0 replies; 5+ messages in thread
From: Josef Bacik @ 2020-08-19 20:11 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 8/19/20 1:53 AM, Qu Wenruo wrote:
> Before this patch, btrfs_buffered_write() does page copies in an 8-page
> batch.
> 
> EXT4, by contrast, uses generic_perform_write(), which does page-by-page
> copies.
> 
> This 8-page batch behavior makes a lot of things more complex:
> - More complex error handling
>    Now we need to handle all errors for the half-written case.
> 
> - More complex advance check
>    Since we handle 8 pages at a time, we need to consider cases like
>    only 4 pages copied, which means we need to release the reserved
>    space for the untouched 4 pages.
> 
> - More wrappers for multi-page operations
>    The most obvious one is btrfs_copy_from_user(), which introduces way
>    more complexity than we need.
> 
> This patch changes the behavior to a page-by-page pace: each time we
> reserve space for only one page and do a single page copy.
> 
> A lot of complexity still remains, mostly around short copies,
> non-uptodate pages and extent locking, but that's more or less the same
> as in generic_perform_write().
> 
> The performance is the same for 4K block size buffered writes, but there
> is an obvious impact when using block sizes spanning multiple pages:
> 
> The test involves writing a 128MiB file, which is smaller than 1/8th of
> the system memory.
> 		Speed (MiB/sec)		Ops (ops/sec)
> Unpatched:	931.498			14903.9756
> Patched:	447.606			7161.6806
> 
> In fact, if we account for the execution time inside
> btrfs_buffered_write(), the meta/data reservation and later page dirtying
> take way more time than the memory copy:
> 
> Patched:
>   nr_runs          = 32768
>   total_prepare_ns = 66908022
>   total_copy_ns    = 75532103
>   total_cleanup_ns = 135749090
> 
> Unpatched:
>   nr_runs          = 2176
>   total_prepare_ns = 7425773
>   total_copy_ns    = 87780898
>   total_cleanup_ns = 37704811
> 
> The patched behavior is now similar to EXT4's; buffered write performance
> remains mostly unchanged from 4K block size and larger.
> 
> On the other hand, XFS uses iomap, which supports multi-page reserve and
> copy, leading to performance similar to unpatched btrfs.
> 
> It looks like we'd better go the iomap route rather than
> generic_perform_write().

I'm fine with tying it into iomap if we get to keep our performance.  The
reality is that we do so much extra work per write in the write path that we
really do need to batch as much as possible.  If you can figure out a way to
modify the generic code so that we can use it and still get the
batching/performance, then go for it.  Otherwise the complexity is worth the
price.  Thanks,

Josef
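
As a rough cross-check of that batching argument, the profiling numbers
from the commit message work out as follows (128MiB = 32768 4K pages):

  Unpatched: (7425773 + 37704811) ns / 2176 runs  ~= 20.7 us of prepare +
             cleanup per iteration, amortized over ~15 pages per batch
             (32768 / 2176), i.e. ~1.4 us of overhead per page.
  Patched:   (66908022 + 135749090) ns / 32768 runs ~= 6.2 us of prepare +
             cleanup per page.

With the ~2.3-2.7 us per-page memory copy added in, the total per-page cost
roughly doubles (~4.1 us vs ~8.5 us), matching the measured drop from
~931 MiB/s to ~448 MiB/s.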


end of thread, other threads:[~2020-08-19 20:11 UTC | newest]

Thread overview: 5+ messages
2020-08-19  5:53 [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace Qu Wenruo
2020-08-19  8:50 ` kernel test robot
2020-08-19  9:09 ` Filipe Manana
2020-08-19  9:16   ` Qu Wenruo
2020-08-19 20:11 ` Josef Bacik
