* [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace
@ 2020-08-19 5:53 Qu Wenruo
2020-08-19 8:50 ` kernel test robot
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: Qu Wenruo @ 2020-08-19 5:53 UTC (permalink / raw)
To: linux-btrfs
Before this patch, btrfs_buffered_write() copies pages in batches of 8.
EXT4, by contrast, uses generic_perform_write(), which copies page by
page.
This 8-page batch behavior makes a lot of things more complex:
- More complex error handling
  We need to handle all the errors of the half-written cases.
- More complex advance checks
  Since a batch covers 8 pages, we need to consider cases like only 4
  pages copied, which means releasing the reserved space for the
  untouched 4 pages.
- More wrappers for multi-page operations
  The most obvious one is btrfs_copy_from_user(), which introduces way
  more complexity than we need.
This patch changes the behavior to a page-by-page pace: each iteration
reserves space for just one page and does one page copy.
A lot of complexity still remains, mostly for short copies,
non-uptodate pages and extent locking, but that is more or less the
same as in generic_perform_write().
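The per-iteration arithmetic of the page-by-page pace can be modeled in
userspace (a sketch assuming 4 KiB pages; count_iterations() is an
illustrative helper, not kernel code):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/*
 * Userspace model of the patched btrfs_buffered_write() loop: each
 * iteration handles at most one page, clamping the copy length to the
 * page boundary. Returns the number of loop iterations needed to
 * write `count` bytes starting at file offset `pos`.
 */
static unsigned long count_iterations(unsigned long pos, unsigned long count)
{
	unsigned long iters = 0;

	while (count > 0) {
		/* offset_in_page(pos) */
		unsigned long offset = pos & (PAGE_SIZE - 1);
		unsigned long write_bytes = count < PAGE_SIZE - offset ?
					    count : PAGE_SIZE - offset;

		/* one page worth of reserve + copy + dirty happens here */
		pos += write_bytes;
		count -= write_bytes;
		iters++;
	}
	return iters;
}
```

An unaligned 10000-byte write starting at offset 100 takes 3 iterations
(3996 + 4096 + 1908 bytes), where the unpatched code would cover it in
a single batch.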
The performance is the same for 4K block size buffered writes, but
there is an obvious regression when the block size spans multiple
pages:
The test involves writing a 128MiB file, which is smaller than 1/8th of
the system memory.
             Speed (MiB/sec)   Ops (ops/sec)
Unpatched:   931.498           14903.9756
Patched:     447.606            7161.6806
In fact, if we account for the execution time inside
btrfs_buffered_write(), the meta/data reservation and later page
dirtying take far more time than the memory copy itself:
Patched:
nr_runs = 32768
total_prepare_ns = 66908022
total_copy_ns = 75532103
total_cleanup_ns = 135749090
Unpatched:
nr_runs = 2176
total_prepare_ns = 7425773
total_copy_ns = 87780898
total_cleanup_ns = 37704811
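Dividing those counters by the page count makes the regression
concrete (a sketch: the 32768 figure assumes 4 KiB pages, so the
128MiB file spans 32768 pages; ns_per_page() is an illustrative
helper, not part of the patch):

```c
#include <assert.h>

/* 128MiB / 4KiB pages = 32768 pages dirtied by the test */
#define NR_PAGES 32768ULL

/* Average cost per page, in integer nanoseconds (rounded down) */
static unsigned long long ns_per_page(unsigned long long total_ns)
{
	return total_ns / NR_PAGES;
}
```

The per-page copy cost barely moves (~2.3us patched vs ~2.7us
unpatched), while the per-page prepare cost rises roughly 9x (226ns to
2041ns) and cleanup roughly 3.6x, which is exactly the reserve/dirty
overhead described above.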
The patched behavior is now similar to EXT4's: buffered write
performance remains mostly flat from 4K block size upward.
On the other hand, XFS uses iomap, which supports multi-page reserve and
copy, leading to similar performance of unpatched btrfs.
It looks like we'd better go the iomap route rather than
generic_perform_write().
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Reason for RFC:
The performance drop is enough for this patch to be discarded.
---
fs/btrfs/file.c | 293 ++++++++++++------------------------------------
1 file changed, 72 insertions(+), 221 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index bbfc8819cf28..be595da9bc05 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -379,60 +379,6 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
return 0;
}
-/* simple helper to fault in pages and copy. This should go away
- * and be replaced with calls into generic code.
- */
-static noinline int btrfs_copy_from_user(loff_t pos, size_t write_bytes,
- struct page **prepared_pages,
- struct iov_iter *i)
-{
- size_t copied = 0;
- size_t total_copied = 0;
- int pg = 0;
- int offset = offset_in_page(pos);
-
- while (write_bytes > 0) {
- size_t count = min_t(size_t,
- PAGE_SIZE - offset, write_bytes);
- struct page *page = prepared_pages[pg];
- /*
- * Copy data from userspace to the current page
- */
- copied = iov_iter_copy_from_user_atomic(page, i, offset, count);
-
- /* Flush processor's dcache for this page */
- flush_dcache_page(page);
-
- /*
- * if we get a partial write, we can end up with
- * partially up to date pages. These add
- * a lot of complexity, so make sure they don't
- * happen by forcing this copy to be retried.
- *
- * The rest of the btrfs_file_write code will fall
- * back to page at a time copies after we return 0.
- */
- if (!PageUptodate(page) && copied < count)
- copied = 0;
-
- iov_iter_advance(i, copied);
- write_bytes -= copied;
- total_copied += copied;
-
- /* Return to btrfs_file_write_iter to fault page */
- if (unlikely(copied == 0))
- break;
-
- if (copied < PAGE_SIZE - offset) {
- offset += copied;
- } else {
- pg++;
- offset = 0;
- }
- }
- return total_copied;
-}
-
/*
* unlocks pages after btrfs_file_write is done with them
*/
@@ -443,8 +389,8 @@ static void btrfs_drop_pages(struct page **pages, size_t num_pages)
/* page checked is some magic around finding pages that
* have been modified without going through btrfs_set_page_dirty
* clear it here. There should be no need to mark the pages
- * accessed as prepare_pages should have marked them accessed
- * in prepare_pages via find_or_create_page()
+ * accessed as prepare_pages() should have marked them accessed
+ * in prepare_pages() via find_or_create_page()
*/
ClearPageChecked(pages[i]);
unlock_page(pages[i]);
@@ -1400,58 +1346,6 @@ static int prepare_uptodate_page(struct inode *inode,
return 0;
}
-/*
- * this just gets pages into the page cache and locks them down.
- */
-static noinline int prepare_pages(struct inode *inode, struct page **pages,
- size_t num_pages, loff_t pos,
- size_t write_bytes, bool force_uptodate)
-{
- int i;
- unsigned long index = pos >> PAGE_SHIFT;
- gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
- int err = 0;
- int faili;
-
- for (i = 0; i < num_pages; i++) {
-again:
- pages[i] = find_or_create_page(inode->i_mapping, index + i,
- mask | __GFP_WRITE);
- if (!pages[i]) {
- faili = i - 1;
- err = -ENOMEM;
- goto fail;
- }
-
- if (i == 0)
- err = prepare_uptodate_page(inode, pages[i], pos,
- force_uptodate);
- if (!err && i == num_pages - 1)
- err = prepare_uptodate_page(inode, pages[i],
- pos + write_bytes, false);
- if (err) {
- put_page(pages[i]);
- if (err == -EAGAIN) {
- err = 0;
- goto again;
- }
- faili = i - 1;
- goto fail;
- }
- wait_on_page_writeback(pages[i]);
- }
-
- return 0;
-fail:
- while (faili >= 0) {
- unlock_page(pages[faili]);
- put_page(pages[faili]);
- faili--;
- }
- return err;
-
-}
-
/*
* This function locks the extent and properly waits for data=ordered extents
* to finish before allowing the pages to be modified if need.
@@ -1619,6 +1513,38 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode)
btrfs_drew_write_unlock(&inode->root->snapshot_lock);
}
+static int prepare_one_page(struct inode *inode, struct page **page_ret,
+ loff_t pos, size_t write_bytes, bool force_uptodate)
+{
+ gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping) | __GFP_WRITE;
+ struct page *page;
+ int ret;
+
+again:
+ page = find_or_create_page(inode->i_mapping, pos >> PAGE_SHIFT, mask);
+ if (!page)
+ return -ENOMEM;
+
+ /*
+ * We need the page uptodate for the following cases:
+ * - Write range only covers part of the page
+ * - We got a short copy on non-uptodate page in previous run
+ */
+ if ((!(offset_in_page(pos) == 0 && write_bytes == PAGE_SIZE) ||
+ force_uptodate) && !PageUptodate(page)) {
+ ret = prepare_uptodate_page(inode, page, pos, true);
+ if (ret) {
+ put_page(page);
+ if (ret == -EAGAIN)
+ goto again;
+ return ret;
+ }
+ wait_on_page_writeback(page);
+ }
+ *page_ret = page;
+ return 0;
+}
+
static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
struct iov_iter *i)
{
@@ -1626,45 +1552,26 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
loff_t pos = iocb->ki_pos;
struct inode *inode = file_inode(file);
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
- struct page **pages = NULL;
+ struct page *page = NULL;
struct extent_changeset *data_reserved = NULL;
- u64 release_bytes = 0;
u64 lockstart;
u64 lockend;
size_t num_written = 0;
- int nrptrs;
int ret = 0;
bool only_release_metadata = false;
bool force_page_uptodate = false;
- nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
- PAGE_SIZE / (sizeof(struct page *)));
- nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied);
- nrptrs = max(nrptrs, 8);
- pages = kmalloc_array(nrptrs, sizeof(struct page *), GFP_KERNEL);
- if (!pages)
- return -ENOMEM;
-
while (iov_iter_count(i) > 0) {
struct extent_state *cached_state = NULL;
size_t offset = offset_in_page(pos);
- size_t sector_offset;
size_t write_bytes = min(iov_iter_count(i),
- nrptrs * (size_t)PAGE_SIZE -
- offset);
- size_t num_pages = DIV_ROUND_UP(write_bytes + offset,
- PAGE_SIZE);
- size_t reserve_bytes;
- size_t dirty_pages;
+ PAGE_SIZE - offset);
+ size_t reserve_bytes = PAGE_SIZE;
size_t copied;
- size_t dirty_sectors;
- size_t num_sectors;
int extents_locked;
- WARN_ON(num_pages > nrptrs);
-
/*
- * Fault pages before locking them in prepare_pages
+ * Fault pages before locking them in prepare_page()
* to avoid recursive lock
*/
if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) {
@@ -1673,37 +1580,27 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
}
only_release_metadata = false;
- sector_offset = pos & (fs_info->sectorsize - 1);
- reserve_bytes = round_up(write_bytes + sector_offset,
- fs_info->sectorsize);
extent_changeset_release(data_reserved);
ret = btrfs_check_data_free_space(BTRFS_I(inode),
&data_reserved, pos,
write_bytes);
if (ret < 0) {
+ size_t tmp = write_bytes;
if (btrfs_check_nocow_lock(BTRFS_I(inode), pos,
- &write_bytes) > 0) {
+ &tmp) > 0) {
+ ASSERT(tmp == write_bytes);
/*
* For nodata cow case, no need to reserve
* data space.
*/
only_release_metadata = true;
- /*
- * our prealloc extent may be smaller than
- * write_bytes, so scale down.
- */
- num_pages = DIV_ROUND_UP(write_bytes + offset,
- PAGE_SIZE);
- reserve_bytes = round_up(write_bytes +
- sector_offset,
- fs_info->sectorsize);
+ reserve_bytes = 0;
} else {
break;
}
}
- WARN_ON(reserve_bytes == 0);
ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
reserve_bytes);
if (ret) {
@@ -1716,16 +1613,9 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
break;
}
- release_bytes = reserve_bytes;
again:
- /*
- * This is going to setup the pages array with the number of
- * pages we want, so we don't really need to worry about the
- * contents of pages from loop to loop
- */
- ret = prepare_pages(inode, pages, num_pages,
- pos, write_bytes,
- force_page_uptodate);
+ ret = prepare_one_page(inode, &page, pos, write_bytes,
+ force_page_uptodate);
if (ret) {
btrfs_delalloc_release_extents(BTRFS_I(inode),
reserve_bytes);
@@ -1733,9 +1623,8 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
}
extents_locked = lock_and_cleanup_extent_if_need(
- BTRFS_I(inode), pages,
- num_pages, pos, write_bytes, &lockstart,
- &lockend, &cached_state);
+ BTRFS_I(inode), &page, 1, pos, write_bytes,
+ &lockstart, &lockend, &cached_state);
if (extents_locked < 0) {
if (extents_locked == -EAGAIN)
goto again;
@@ -1745,57 +1634,38 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
break;
}
- copied = btrfs_copy_from_user(pos, write_bytes, pages, i);
-
- num_sectors = BTRFS_BYTES_TO_BLKS(fs_info, reserve_bytes);
- dirty_sectors = round_up(copied + sector_offset,
- fs_info->sectorsize);
- dirty_sectors = BTRFS_BYTES_TO_BLKS(fs_info, dirty_sectors);
-
- /*
- * if we have trouble faulting in the pages, fall
- * back to one page at a time
- */
- if (copied < write_bytes)
- nrptrs = 1;
+ copied = iov_iter_copy_from_user_atomic(page, i, offset,
+ write_bytes);
+ flush_dcache_page(page);
- if (copied == 0) {
+ if (!PageUptodate(page) && copied < write_bytes) {
+ /*
+ * Short write on non-uptodate page, we must retry and
+ * force the page uptodate in next run.
+ */
+ copied = 0;
force_page_uptodate = true;
- dirty_sectors = 0;
- dirty_pages = 0;
} else {
+ /* Next run doesn't need forced uptodate */
force_page_uptodate = false;
- dirty_pages = DIV_ROUND_UP(copied + offset,
- PAGE_SIZE);
}
- if (num_sectors > dirty_sectors) {
- /* release everything except the sectors we dirtied */
- release_bytes -= dirty_sectors <<
- fs_info->sb->s_blocksize_bits;
- if (only_release_metadata) {
- btrfs_delalloc_release_metadata(BTRFS_I(inode),
- release_bytes, true);
- } else {
- u64 __pos;
+ iov_iter_advance(i, copied);
- __pos = round_down(pos,
- fs_info->sectorsize) +
- (dirty_pages << PAGE_SHIFT);
+ if (copied > 0) {
+ ret = btrfs_dirty_pages(BTRFS_I(inode), &page, 1, pos,
+ copied, &cached_state);
+ } else {
+ /* No bytes copied, need to free reserved space */
+ if (only_release_metadata)
+ btrfs_delalloc_release_metadata(BTRFS_I(inode),
+ reserve_bytes, true);
+ else
btrfs_delalloc_release_space(BTRFS_I(inode),
- data_reserved, __pos,
- release_bytes, true);
- }
+ data_reserved, pos, write_bytes,
+ true);
}
- release_bytes = round_up(copied + sector_offset,
- fs_info->sectorsize);
-
- if (copied > 0)
- ret = btrfs_dirty_pages(BTRFS_I(inode), pages,
- dirty_pages, pos, copied,
- &cached_state);
-
/*
* If we have not locked the extent range, because the range's
* start offset is >= i_size, we might still have a non-NULL
@@ -1811,26 +1681,22 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes);
if (ret) {
- btrfs_drop_pages(pages, num_pages);
+ btrfs_drop_pages(&page, 1);
break;
}
- release_bytes = 0;
- if (only_release_metadata)
+ if (only_release_metadata) {
btrfs_check_nocow_unlock(BTRFS_I(inode));
-
- if (only_release_metadata && copied > 0) {
lockstart = round_down(pos,
fs_info->sectorsize);
- lockend = round_up(pos + copied,
- fs_info->sectorsize) - 1;
+ lockend = lockstart + PAGE_SIZE - 1;
set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
lockend, EXTENT_NORESERVE, NULL,
NULL, GFP_NOFS);
}
- btrfs_drop_pages(pages, num_pages);
+ btrfs_drop_pages(&page, 1);
cond_resched();
@@ -1840,21 +1706,6 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
num_written += copied;
}
- kfree(pages);
-
- if (release_bytes) {
- if (only_release_metadata) {
- btrfs_check_nocow_unlock(BTRFS_I(inode));
- btrfs_delalloc_release_metadata(BTRFS_I(inode),
- release_bytes, true);
- } else {
- btrfs_delalloc_release_space(BTRFS_I(inode),
- data_reserved,
- round_down(pos, fs_info->sectorsize),
- release_bytes, true);
- }
- }
-
extent_changeset_free(data_reserved);
return num_written ? num_written : ret;
}
--
2.28.0
* Re: [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace
2020-08-19 5:53 [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace Qu Wenruo
@ 2020-08-19 8:50 ` kernel test robot
2020-08-19 9:09 ` Filipe Manana
2020-08-19 20:11 ` Josef Bacik
2 siblings, 0 replies; 5+ messages in thread
From: kernel test robot @ 2020-08-19 8:50 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs; +Cc: kbuild-all, clang-built-linux
[-- Attachment #1: Type: text/plain, Size: 15979 bytes --]
Hi Qu,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on kdave/for-next]
[also build test WARNING on v5.9-rc1 next-20200819]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Qu-Wenruo/btrfs-switch-btrfs_buffered_write-to-page-by-page-pace/20200819-135541
base: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: arm-randconfig-r022-20200818 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project b34b1e38381fa4d1b1d9751a6b5233b68e734cfe)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install arm cross compiling tool for clang build
# apt-get install binutils-arm-linux-gnueabi
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All warnings (new ones prefixed by >>):
In file included from fs/btrfs/file.c:19:
fs/btrfs/ctree.h:2265:8: warning: 'const' type qualifier on return type has no effect [-Wignored-qualifiers]
size_t __const btrfs_get_num_csums(void);
^~~~~~~~
>> fs/btrfs/file.c:1570:24: warning: comparison of distinct pointer types ('typeof (iov_iter_count(i)) *' (aka 'unsigned int *') and 'typeof (((1UL) << 12) - offset) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
size_t write_bytes = min(iov_iter_count(i),
^~~~~~~~~~~~~~~~~~~~~~
include/linux/kernel.h:884:19: note: expanded from macro 'min'
#define min(x, y) __careful_cmp(x, y, <)
^~~~~~~~~~~~~~~~~~~~~~
include/linux/kernel.h:875:24: note: expanded from macro '__careful_cmp'
__builtin_choose_expr(__safe_cmp(x, y), \
^~~~~~~~~~~~~~~~
include/linux/kernel.h:865:4: note: expanded from macro '__safe_cmp'
(__typecheck(x, y) && __no_side_effects(x, y))
^~~~~~~~~~~~~~~~~
include/linux/kernel.h:851:29: note: expanded from macro '__typecheck'
(!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~
2 warnings generated.
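The warning comes from the kernel's type-checking min(): mixing an
unsigned int (iov_iter_count() here) with an unsigned long (PAGE_SIZE -
offset) trips -Wcompare-distinct-pointer-types, and the usual fix is
min_t(), which casts both sides to an explicit type first. A standalone
sketch (the macro mirrors the kernel's min_t() semantics without the
type check; clamp_write_bytes() is a hypothetical helper):

```c
#include <assert.h>
#include <stddef.h>

/* Cast both operands to one chosen type before comparing, as the
 * kernel's min_t() does; this sidesteps the distinct-type warning. */
#define min_t(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))

/* Clamp the iov byte count to the room left in the current page */
static size_t clamp_write_bytes(unsigned int iter_count, size_t page_room)
{
	return min_t(size_t, iter_count, page_room);
}
```

With a plain kernel min() this would fail to build cleanly on 32-bit
arm, where the two operand types differ.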
# https://github.com/0day-ci/linux/commit/89f414f96d87eab3a1c3a3afc5283f80154e7a8b
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Qu-Wenruo/btrfs-switch-btrfs_buffered_write-to-page-by-page-pace/20200819-135541
git checkout 89f414f96d87eab3a1c3a3afc5283f80154e7a8b
vim +1570 fs/btrfs/file.c
89f414f96d87ea Qu Wenruo 2020-08-19 1550
e4af400a9c5081 Goldwyn Rodrigues 2018-06-17 1551 static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
e4af400a9c5081 Goldwyn Rodrigues 2018-06-17 1552 struct iov_iter *i)
39279cc3d2704c Chris Mason 2007-06-12 1553 {
e4af400a9c5081 Goldwyn Rodrigues 2018-06-17 1554 struct file *file = iocb->ki_filp;
e4af400a9c5081 Goldwyn Rodrigues 2018-06-17 1555 loff_t pos = iocb->ki_pos;
496ad9aa8ef448 Al Viro 2013-01-23 1556 struct inode *inode = file_inode(file);
0b246afa62b0cf Jeff Mahoney 2016-06-22 1557 struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
89f414f96d87ea Qu Wenruo 2020-08-19 1558 struct page *page = NULL;
364ecf3651e086 Qu Wenruo 2017-02-27 1559 struct extent_changeset *data_reserved = NULL;
376cc685cb3b43 Miao Xie 2013-12-10 1560 u64 lockstart;
376cc685cb3b43 Miao Xie 2013-12-10 1561 u64 lockend;
d0215f3e5ebb58 Josef Bacik 2011-01-25 1562 size_t num_written = 0;
c9149235a42ab9 Tsutomu Itoh 2011-03-30 1563 int ret = 0;
7ee9e4405f264e Josef Bacik 2013-06-21 1564 bool only_release_metadata = false;
b6316429af7f36 Josef Bacik 2011-09-30 1565 bool force_page_uptodate = false;
4b46fce23349bf Josef Bacik 2010-05-23 1566
d0215f3e5ebb58 Josef Bacik 2011-01-25 1567 while (iov_iter_count(i) > 0) {
c67d970f0ea8dc Filipe Manana 2019-09-30 1568 struct extent_state *cached_state = NULL;
7073017aeb98db Johannes Thumshirn 2018-12-05 1569 size_t offset = offset_in_page(pos);
d0215f3e5ebb58 Josef Bacik 2011-01-25 @1570 size_t write_bytes = min(iov_iter_count(i),
89f414f96d87ea Qu Wenruo 2020-08-19 1571 PAGE_SIZE - offset);
89f414f96d87ea Qu Wenruo 2020-08-19 1572 size_t reserve_bytes = PAGE_SIZE;
d0215f3e5ebb58 Josef Bacik 2011-01-25 1573 size_t copied;
79f015f216539d Goldwyn Rodrigues 2017-10-16 1574 int extents_locked;
39279cc3d2704c Chris Mason 2007-06-12 1575
914ee295af418e Xin Zhong 2010-12-09 1576 /*
89f414f96d87ea Qu Wenruo 2020-08-19 1577 * Fault pages before locking them in prepare_page()
914ee295af418e Xin Zhong 2010-12-09 1578 * to avoid recursive lock
914ee295af418e Xin Zhong 2010-12-09 1579 */
d0215f3e5ebb58 Josef Bacik 2011-01-25 1580 if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) {
914ee295af418e Xin Zhong 2010-12-09 1581 ret = -EFAULT;
d0215f3e5ebb58 Josef Bacik 2011-01-25 1582 break;
914ee295af418e Xin Zhong 2010-12-09 1583 }
914ee295af418e Xin Zhong 2010-12-09 1584
a0e248bb502d51 Filipe Manana 2019-10-11 1585 only_release_metadata = false;
d9d8b2a51a404c Qu Wenruo 2015-09-08 1586
364ecf3651e086 Qu Wenruo 2017-02-27 1587 extent_changeset_release(data_reserved);
36ea6f3e931391 Nikolay Borisov 2020-06-03 1588 ret = btrfs_check_data_free_space(BTRFS_I(inode),
36ea6f3e931391 Nikolay Borisov 2020-06-03 1589 &data_reserved, pos,
364ecf3651e086 Qu Wenruo 2017-02-27 1590 write_bytes);
c6887cd11149d7 Josef Bacik 2016-03-25 1591 if (ret < 0) {
89f414f96d87ea Qu Wenruo 2020-08-19 1592 size_t tmp = write_bytes;
38d37aa9c32938 Qu Wenruo 2020-06-24 1593 if (btrfs_check_nocow_lock(BTRFS_I(inode), pos,
89f414f96d87ea Qu Wenruo 2020-08-19 1594 &tmp) > 0) {
89f414f96d87ea Qu Wenruo 2020-08-19 1595 ASSERT(tmp == write_bytes);
d9d8b2a51a404c Qu Wenruo 2015-09-08 1596 /*
d9d8b2a51a404c Qu Wenruo 2015-09-08 1597 * For nodata cow case, no need to reserve
d9d8b2a51a404c Qu Wenruo 2015-09-08 1598 * data space.
d9d8b2a51a404c Qu Wenruo 2015-09-08 1599 */
7ee9e4405f264e Josef Bacik 2013-06-21 1600 only_release_metadata = true;
89f414f96d87ea Qu Wenruo 2020-08-19 1601 reserve_bytes = 0;
c6887cd11149d7 Josef Bacik 2016-03-25 1602 } else {
d0215f3e5ebb58 Josef Bacik 2011-01-25 1603 break;
c6887cd11149d7 Josef Bacik 2016-03-25 1604 }
c6887cd11149d7 Josef Bacik 2016-03-25 1605 }
1832a6d5ee3b1a Chris Mason 2007-12-21 1606
9f3db423f98c5c Nikolay Borisov 2017-02-20 1607 ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
9f3db423f98c5c Nikolay Borisov 2017-02-20 1608 reserve_bytes);
7ee9e4405f264e Josef Bacik 2013-06-21 1609 if (ret) {
7ee9e4405f264e Josef Bacik 2013-06-21 1610 if (!only_release_metadata)
25ce28caaa1ddc Nikolay Borisov 2020-06-03 1611 btrfs_free_reserved_data_space(BTRFS_I(inode),
bc42bda22345ef Qu Wenruo 2017-02-27 1612 data_reserved, pos,
d9d8b2a51a404c Qu Wenruo 2015-09-08 1613 write_bytes);
8257b2dc3c1a10 Miao Xie 2014-03-06 1614 else
38d37aa9c32938 Qu Wenruo 2020-06-24 1615 btrfs_check_nocow_unlock(BTRFS_I(inode));
7ee9e4405f264e Josef Bacik 2013-06-21 1616 break;
7ee9e4405f264e Josef Bacik 2013-06-21 1617 }
7ee9e4405f264e Josef Bacik 2013-06-21 1618
376cc685cb3b43 Miao Xie 2013-12-10 1619 again:
89f414f96d87ea Qu Wenruo 2020-08-19 1620 ret = prepare_one_page(inode, &page, pos, write_bytes,
b6316429af7f36 Josef Bacik 2011-09-30 1621 force_page_uptodate);
8b62f87bad9cf0 Josef Bacik 2017-10-19 1622 if (ret) {
8b62f87bad9cf0 Josef Bacik 2017-10-19 1623 btrfs_delalloc_release_extents(BTRFS_I(inode),
8702ba9396bf7b Qu Wenruo 2019-10-14 1624 reserve_bytes);
d0215f3e5ebb58 Josef Bacik 2011-01-25 1625 break;
8b62f87bad9cf0 Josef Bacik 2017-10-19 1626 }
39279cc3d2704c Chris Mason 2007-06-12 1627
79f015f216539d Goldwyn Rodrigues 2017-10-16 1628 extents_locked = lock_and_cleanup_extent_if_need(
89f414f96d87ea Qu Wenruo 2020-08-19 1629 BTRFS_I(inode), &page, 1, pos, write_bytes,
89f414f96d87ea Qu Wenruo 2020-08-19 1630 &lockstart, &lockend, &cached_state);
79f015f216539d Goldwyn Rodrigues 2017-10-16 1631 if (extents_locked < 0) {
79f015f216539d Goldwyn Rodrigues 2017-10-16 1632 if (extents_locked == -EAGAIN)
376cc685cb3b43 Miao Xie 2013-12-10 1633 goto again;
8b62f87bad9cf0 Josef Bacik 2017-10-19 1634 btrfs_delalloc_release_extents(BTRFS_I(inode),
8702ba9396bf7b Qu Wenruo 2019-10-14 1635 reserve_bytes);
79f015f216539d Goldwyn Rodrigues 2017-10-16 1636 ret = extents_locked;
376cc685cb3b43 Miao Xie 2013-12-10 1637 break;
376cc685cb3b43 Miao Xie 2013-12-10 1638 }
376cc685cb3b43 Miao Xie 2013-12-10 1639
89f414f96d87ea Qu Wenruo 2020-08-19 1640 copied = iov_iter_copy_from_user_atomic(page, i, offset,
89f414f96d87ea Qu Wenruo 2020-08-19 1641 write_bytes);
89f414f96d87ea Qu Wenruo 2020-08-19 1642 flush_dcache_page(page);
56244ef151c3cd Chris Mason 2016-05-16 1643
89f414f96d87ea Qu Wenruo 2020-08-19 1644 if (!PageUptodate(page) && copied < write_bytes) {
b1bf862e9dad43 Chris Mason 2011-02-28 1645 /*
89f414f96d87ea Qu Wenruo 2020-08-19 1646 * Short write on non-uptodate page, we must retry and
89f414f96d87ea Qu Wenruo 2020-08-19 1647 * force the page uptodate in next run.
b1bf862e9dad43 Chris Mason 2011-02-28 1648 */
89f414f96d87ea Qu Wenruo 2020-08-19 1649 copied = 0;
b6316429af7f36 Josef Bacik 2011-09-30 1650 force_page_uptodate = true;
b6316429af7f36 Josef Bacik 2011-09-30 1651 } else {
89f414f96d87ea Qu Wenruo 2020-08-19 1652 /* Next run doesn't need forced uptodate */
b6316429af7f36 Josef Bacik 2011-09-30 1653 force_page_uptodate = false;
b6316429af7f36 Josef Bacik 2011-09-30 1654 }
914ee295af418e Xin Zhong 2010-12-09 1655
89f414f96d87ea Qu Wenruo 2020-08-19 1656 iov_iter_advance(i, copied);
485290a734f142 Qu Wenruo 2015-10-29 1657
89f414f96d87ea Qu Wenruo 2020-08-19 1658 if (copied > 0) {
89f414f96d87ea Qu Wenruo 2020-08-19 1659 ret = btrfs_dirty_pages(BTRFS_I(inode), &page, 1, pos,
89f414f96d87ea Qu Wenruo 2020-08-19 1660 copied, &cached_state);
89f414f96d87ea Qu Wenruo 2020-08-19 1661 } else {
89f414f96d87ea Qu Wenruo 2020-08-19 1662 /* No bytes copied, need to free reserved space */
89f414f96d87ea Qu Wenruo 2020-08-19 1663 if (only_release_metadata)
89f414f96d87ea Qu Wenruo 2020-08-19 1664 btrfs_delalloc_release_metadata(BTRFS_I(inode),
89f414f96d87ea Qu Wenruo 2020-08-19 1665 reserve_bytes, true);
89f414f96d87ea Qu Wenruo 2020-08-19 1666 else
86d52921a2ba51 Nikolay Borisov 2020-06-03 1667 btrfs_delalloc_release_space(BTRFS_I(inode),
89f414f96d87ea Qu Wenruo 2020-08-19 1668 data_reserved, pos, write_bytes,
89f414f96d87ea Qu Wenruo 2020-08-19 1669 true);
485290a734f142 Qu Wenruo 2015-10-29 1670 }
914ee295af418e Xin Zhong 2010-12-09 1671
c67d970f0ea8dc Filipe Manana 2019-09-30 1672 /*
c67d970f0ea8dc Filipe Manana 2019-09-30 1673 * If we have not locked the extent range, because the range's
c67d970f0ea8dc Filipe Manana 2019-09-30 1674 * start offset is >= i_size, we might still have a non-NULL
c67d970f0ea8dc Filipe Manana 2019-09-30 1675 * cached extent state, acquired while marking the extent range
c67d970f0ea8dc Filipe Manana 2019-09-30 1676 * as delalloc through btrfs_dirty_pages(). Therefore free any
c67d970f0ea8dc Filipe Manana 2019-09-30 1677 * possible cached extent state to avoid a memory leak.
c67d970f0ea8dc Filipe Manana 2019-09-30 1678 */
79f015f216539d Goldwyn Rodrigues 2017-10-16 1679 if (extents_locked)
376cc685cb3b43 Miao Xie 2013-12-10 1680 unlock_extent_cached(&BTRFS_I(inode)->io_tree,
e43bbe5e16d87b David Sterba 2017-12-12 1681 lockstart, lockend, &cached_state);
c67d970f0ea8dc Filipe Manana 2019-09-30 1682 else
c67d970f0ea8dc Filipe Manana 2019-09-30 1683 free_extent_state(cached_state);
c67d970f0ea8dc Filipe Manana 2019-09-30 1684
8702ba9396bf7b Qu Wenruo 2019-10-14 1685 btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes);
f1de968376340c Miao Xie 2014-01-09 1686 if (ret) {
89f414f96d87ea Qu Wenruo 2020-08-19 1687 btrfs_drop_pages(&page, 1);
d0215f3e5ebb58 Josef Bacik 2011-01-25 1688 break;
f1de968376340c Miao Xie 2014-01-09 1689 }
39279cc3d2704c Chris Mason 2007-06-12 1690
89f414f96d87ea Qu Wenruo 2020-08-19 1691 if (only_release_metadata) {
38d37aa9c32938 Qu Wenruo 2020-06-24 1692 btrfs_check_nocow_unlock(BTRFS_I(inode));
da17066c40472c Jeff Mahoney 2016-06-15 1693 lockstart = round_down(pos,
0b246afa62b0cf Jeff Mahoney 2016-06-22 1694 fs_info->sectorsize);
89f414f96d87ea Qu Wenruo 2020-08-19 1695 lockend = lockstart + PAGE_SIZE - 1;
7ee9e4405f264e Josef Bacik 2013-06-21 1696
7ee9e4405f264e Josef Bacik 2013-06-21 1697 set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
7ee9e4405f264e Josef Bacik 2013-06-21 1698 lockend, EXTENT_NORESERVE, NULL,
7ee9e4405f264e Josef Bacik 2013-06-21 1699 NULL, GFP_NOFS);
7ee9e4405f264e Josef Bacik 2013-06-21 1700 }
7ee9e4405f264e Josef Bacik 2013-06-21 1701
89f414f96d87ea Qu Wenruo 2020-08-19 1702 btrfs_drop_pages(&page, 1);
f1de968376340c Miao Xie 2014-01-09 1703
d0215f3e5ebb58 Josef Bacik 2011-01-25 1704 cond_resched();
d0215f3e5ebb58 Josef Bacik 2011-01-25 1705
d0e1d66b5aa1ec Namjae Jeon 2012-12-11 1706 balance_dirty_pages_ratelimited(inode->i_mapping);
cb843a6f513a1a Chris Mason 2008-10-03 1707
914ee295af418e Xin Zhong 2010-12-09 1708 pos += copied;
914ee295af418e Xin Zhong 2010-12-09 1709 num_written += copied;
d0215f3e5ebb58 Josef Bacik 2011-01-25 1710 }
39279cc3d2704c Chris Mason 2007-06-12 1711
364ecf3651e086 Qu Wenruo 2017-02-27 1712 extent_changeset_free(data_reserved);
d0215f3e5ebb58 Josef Bacik 2011-01-25 1713 return num_written ? num_written : ret;
39279cc3d2704c Chris Mason 2007-06-12 1714 }
d0215f3e5ebb58 Josef Bacik 2011-01-25 1715
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30266 bytes --]
* Re: [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace
2020-08-19 5:53 [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace Qu Wenruo
2020-08-19 8:50 ` kernel test robot
@ 2020-08-19 9:09 ` Filipe Manana
2020-08-19 9:16 ` Qu Wenruo
2020-08-19 20:11 ` Josef Bacik
2 siblings, 1 reply; 5+ messages in thread
From: Filipe Manana @ 2020-08-19 9:09 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Wed, Aug 19, 2020 at 6:55 AM Qu Wenruo <wqu@suse.com> wrote:
>
> Before this patch, btrfs_buffered_write() copies pages in batches of 8.
>
> EXT4, by contrast, uses generic_perform_write(), which copies page by
> page.
>
> This 8-page batch behavior makes a lot of things more complex:
> - More complex error handling
>   We need to handle all the errors of the half-written cases.
>
> - More complex advance checks
>   Since a batch covers 8 pages, we need to consider cases like only 4
>   pages copied, which means releasing the reserved space for the
>   untouched 4 pages.
>
> - More wrappers for multi-page operations
> The most obvious one is btrfs_copy_from_user(), which introduces way
> more complexity than we need.
>
> This patch changes the behavior to a page-by-page pace: each iteration
> reserves space for just one page and does one page copy.
>
> A lot of complexity still remains, mostly for short copies,
> non-uptodate pages and extent locking, but that is more or less the
> same as in generic_perform_write().
>
> The performance is the same for 4K block size buffered writes, but
> there is an obvious regression when the block size spans multiple
> pages:
>
> The test involves writing a 128MiB file, which is smaller than 1/8th of
> the system memory.
> Speed (MiB/sec) Ops (ops/sec)
> Unpatched: 931.498 14903.9756
> Patched: 447.606 7161.6806
>
> In fact, if we account for the execution time inside
> btrfs_buffered_write(), the meta/data reservation and later page
> dirtying take far more time than the memory copy itself:
>
> Patched:
> nr_runs = 32768
> total_prepare_ns = 66908022
> total_copy_ns = 75532103
> total_cleanup_ns = 135749090
>
> Unpatched:
> nr_runs = 2176
> total_prepare_ns = 7425773
> total_copy_ns = 87780898
> total_cleanup_ns = 37704811
>
> The patched behavior is now similar to EXT4's: buffered write
> performance remains mostly flat from 4K block size upward.
>
> On the other hand, XFS uses iomap, which supports multi-page reserve and
> copy, leading to similar performance of unpatched btrfs.
>
> It looks like we'd better go the iomap route rather than
> generic_perform_write().
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Reason for RFC:
>
> The performance drop is enough for this patch to be discarded.
More than enough, a nearly 50% performance drop for a very common operation.
I can't even understand why you bothered proposing it.
Thanks.
> ---
> fs/btrfs/file.c | 293 ++++++++++++------------------------------------
> 1 file changed, 72 insertions(+), 221 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index bbfc8819cf28..be595da9bc05 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -379,60 +379,6 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
> return 0;
> }
>
> -/* simple helper to fault in pages and copy. This should go away
> - * and be replaced with calls into generic code.
> - */
> -static noinline int btrfs_copy_from_user(loff_t pos, size_t write_bytes,
> - struct page **prepared_pages,
> - struct iov_iter *i)
> -{
> - size_t copied = 0;
> - size_t total_copied = 0;
> - int pg = 0;
> - int offset = offset_in_page(pos);
> -
> - while (write_bytes > 0) {
> - size_t count = min_t(size_t,
> - PAGE_SIZE - offset, write_bytes);
> - struct page *page = prepared_pages[pg];
> - /*
> - * Copy data from userspace to the current page
> - */
> - copied = iov_iter_copy_from_user_atomic(page, i, offset, count);
> -
> - /* Flush processor's dcache for this page */
> - flush_dcache_page(page);
> -
> - /*
> - * if we get a partial write, we can end up with
> - * partially up to date pages. These add
> - * a lot of complexity, so make sure they don't
> - * happen by forcing this copy to be retried.
> - *
> - * The rest of the btrfs_file_write code will fall
> - * back to page at a time copies after we return 0.
> - */
> - if (!PageUptodate(page) && copied < count)
> - copied = 0;
> -
> - iov_iter_advance(i, copied);
> - write_bytes -= copied;
> - total_copied += copied;
> -
> - /* Return to btrfs_file_write_iter to fault page */
> - if (unlikely(copied == 0))
> - break;
> -
> - if (copied < PAGE_SIZE - offset) {
> - offset += copied;
> - } else {
> - pg++;
> - offset = 0;
> - }
> - }
> - return total_copied;
> -}
> -
> /*
> * unlocks pages after btrfs_file_write is done with them
> */
> @@ -443,8 +389,8 @@ static void btrfs_drop_pages(struct page **pages, size_t num_pages)
> /* page checked is some magic around finding pages that
> * have been modified without going through btrfs_set_page_dirty
> * clear it here. There should be no need to mark the pages
> - * accessed as prepare_pages should have marked them accessed
> - * in prepare_pages via find_or_create_page()
> + * accessed as prepare_pages() should have marked them accessed
> + * in prepare_pages() via find_or_create_page()
> */
> ClearPageChecked(pages[i]);
> unlock_page(pages[i]);
> @@ -1400,58 +1346,6 @@ static int prepare_uptodate_page(struct inode *inode,
> return 0;
> }
>
> -/*
> - * this just gets pages into the page cache and locks them down.
> - */
> -static noinline int prepare_pages(struct inode *inode, struct page **pages,
> - size_t num_pages, loff_t pos,
> - size_t write_bytes, bool force_uptodate)
> -{
> - int i;
> - unsigned long index = pos >> PAGE_SHIFT;
> - gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
> - int err = 0;
> - int faili;
> -
> - for (i = 0; i < num_pages; i++) {
> -again:
> - pages[i] = find_or_create_page(inode->i_mapping, index + i,
> - mask | __GFP_WRITE);
> - if (!pages[i]) {
> - faili = i - 1;
> - err = -ENOMEM;
> - goto fail;
> - }
> -
> - if (i == 0)
> - err = prepare_uptodate_page(inode, pages[i], pos,
> - force_uptodate);
> - if (!err && i == num_pages - 1)
> - err = prepare_uptodate_page(inode, pages[i],
> - pos + write_bytes, false);
> - if (err) {
> - put_page(pages[i]);
> - if (err == -EAGAIN) {
> - err = 0;
> - goto again;
> - }
> - faili = i - 1;
> - goto fail;
> - }
> - wait_on_page_writeback(pages[i]);
> - }
> -
> - return 0;
> -fail:
> - while (faili >= 0) {
> - unlock_page(pages[faili]);
> - put_page(pages[faili]);
> - faili--;
> - }
> - return err;
> -
> -}
> -
> /*
> * This function locks the extent and properly waits for data=ordered extents
> * to finish before allowing the pages to be modified if need.
> @@ -1619,6 +1513,38 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode)
> btrfs_drew_write_unlock(&inode->root->snapshot_lock);
> }
>
> +static int prepare_one_page(struct inode *inode, struct page **page_ret,
> + loff_t pos, size_t write_bytes, bool force_uptodate)
> +{
> + gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping) | __GFP_WRITE;
> + struct page *page;
> + int ret;
> +
> +again:
> + page = find_or_create_page(inode->i_mapping, pos >> PAGE_SHIFT, mask);
> + if (!page)
> + return -ENOMEM;
> +
> + /*
> + * We need the page uptodate for the following cases:
> + * - Write range only covers part of the page
> + * - We got a short copy on non-uptodate page in previous run
> + */
> + if ((!(offset_in_page(pos) == 0 && write_bytes == PAGE_SIZE) ||
> + force_uptodate) && !PageUptodate(page)) {
> + ret = prepare_uptodate_page(inode, page, pos, true);
> + if (ret) {
> + put_page(page);
> + if (ret == -EAGAIN)
> + goto again;
> + return ret;
> + }
> + wait_on_page_writeback(page);
> + }
> + *page_ret = page;
> + return 0;
> +}
> +
> static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
> struct iov_iter *i)
> {
> @@ -1626,45 +1552,26 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
> loff_t pos = iocb->ki_pos;
> struct inode *inode = file_inode(file);
> struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> - struct page **pages = NULL;
> + struct page *page = NULL;
> struct extent_changeset *data_reserved = NULL;
> - u64 release_bytes = 0;
> u64 lockstart;
> u64 lockend;
> size_t num_written = 0;
> - int nrptrs;
> int ret = 0;
> bool only_release_metadata = false;
> bool force_page_uptodate = false;
>
> - nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
> - PAGE_SIZE / (sizeof(struct page *)));
> - nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied);
> - nrptrs = max(nrptrs, 8);
> - pages = kmalloc_array(nrptrs, sizeof(struct page *), GFP_KERNEL);
> - if (!pages)
> - return -ENOMEM;
> -
> while (iov_iter_count(i) > 0) {
> struct extent_state *cached_state = NULL;
> size_t offset = offset_in_page(pos);
> - size_t sector_offset;
> size_t write_bytes = min(iov_iter_count(i),
> - nrptrs * (size_t)PAGE_SIZE -
> - offset);
> - size_t num_pages = DIV_ROUND_UP(write_bytes + offset,
> - PAGE_SIZE);
> - size_t reserve_bytes;
> - size_t dirty_pages;
> + PAGE_SIZE - offset);
> + size_t reserve_bytes = PAGE_SIZE;
> size_t copied;
> - size_t dirty_sectors;
> - size_t num_sectors;
> int extents_locked;
>
> - WARN_ON(num_pages > nrptrs);
> -
> /*
> - * Fault pages before locking them in prepare_pages
> + * Fault pages before locking them in prepare_page()
> * to avoid recursive lock
> */
> if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) {
> @@ -1673,37 +1580,27 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
> }
>
> only_release_metadata = false;
> - sector_offset = pos & (fs_info->sectorsize - 1);
> - reserve_bytes = round_up(write_bytes + sector_offset,
> - fs_info->sectorsize);
>
> extent_changeset_release(data_reserved);
> ret = btrfs_check_data_free_space(BTRFS_I(inode),
> &data_reserved, pos,
> write_bytes);
> if (ret < 0) {
> + size_t tmp = write_bytes;
> if (btrfs_check_nocow_lock(BTRFS_I(inode), pos,
> - &write_bytes) > 0) {
> + &tmp) > 0) {
> + ASSERT(tmp == write_bytes);
> /*
> * For nodata cow case, no need to reserve
> * data space.
> */
> only_release_metadata = true;
> - /*
> - * our prealloc extent may be smaller than
> - * write_bytes, so scale down.
> - */
> - num_pages = DIV_ROUND_UP(write_bytes + offset,
> - PAGE_SIZE);
> - reserve_bytes = round_up(write_bytes +
> - sector_offset,
> - fs_info->sectorsize);
> + reserve_bytes = 0;
> } else {
> break;
> }
> }
>
> - WARN_ON(reserve_bytes == 0);
> ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
> reserve_bytes);
> if (ret) {
> @@ -1716,16 +1613,9 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
> break;
> }
>
> - release_bytes = reserve_bytes;
> again:
> - /*
> - * This is going to setup the pages array with the number of
> - * pages we want, so we don't really need to worry about the
> - * contents of pages from loop to loop
> - */
> - ret = prepare_pages(inode, pages, num_pages,
> - pos, write_bytes,
> - force_page_uptodate);
> + ret = prepare_one_page(inode, &page, pos, write_bytes,
> + force_page_uptodate);
> if (ret) {
> btrfs_delalloc_release_extents(BTRFS_I(inode),
> reserve_bytes);
> @@ -1733,9 +1623,8 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
> }
>
> extents_locked = lock_and_cleanup_extent_if_need(
> - BTRFS_I(inode), pages,
> - num_pages, pos, write_bytes, &lockstart,
> - &lockend, &cached_state);
> + BTRFS_I(inode), &page, 1, pos, write_bytes,
> + &lockstart, &lockend, &cached_state);
> if (extents_locked < 0) {
> if (extents_locked == -EAGAIN)
> goto again;
> @@ -1745,57 +1634,38 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
> break;
> }
>
> - copied = btrfs_copy_from_user(pos, write_bytes, pages, i);
> -
> - num_sectors = BTRFS_BYTES_TO_BLKS(fs_info, reserve_bytes);
> - dirty_sectors = round_up(copied + sector_offset,
> - fs_info->sectorsize);
> - dirty_sectors = BTRFS_BYTES_TO_BLKS(fs_info, dirty_sectors);
> -
> - /*
> - * if we have trouble faulting in the pages, fall
> - * back to one page at a time
> - */
> - if (copied < write_bytes)
> - nrptrs = 1;
> + copied = iov_iter_copy_from_user_atomic(page, i, offset,
> + write_bytes);
> + flush_dcache_page(page);
>
> - if (copied == 0) {
> + if (!PageUptodate(page) && copied < write_bytes) {
> + /*
> + * Short write on non-uptodate page, we must retry and
> + * force the page uptodate in next run.
> + */
> + copied = 0;
> force_page_uptodate = true;
> - dirty_sectors = 0;
> - dirty_pages = 0;
> } else {
> + /* Next run doesn't need forced uptodate */
> force_page_uptodate = false;
> - dirty_pages = DIV_ROUND_UP(copied + offset,
> - PAGE_SIZE);
> }
>
> - if (num_sectors > dirty_sectors) {
> - /* release everything except the sectors we dirtied */
> - release_bytes -= dirty_sectors <<
> - fs_info->sb->s_blocksize_bits;
> - if (only_release_metadata) {
> - btrfs_delalloc_release_metadata(BTRFS_I(inode),
> - release_bytes, true);
> - } else {
> - u64 __pos;
> + iov_iter_advance(i, copied);
>
> - __pos = round_down(pos,
> - fs_info->sectorsize) +
> - (dirty_pages << PAGE_SHIFT);
> + if (copied > 0) {
> + ret = btrfs_dirty_pages(BTRFS_I(inode), &page, 1, pos,
> + copied, &cached_state);
> + } else {
> + /* No bytes copied, need to free reserved space */
> + if (only_release_metadata)
> + btrfs_delalloc_release_metadata(BTRFS_I(inode),
> + reserve_bytes, true);
> + else
> btrfs_delalloc_release_space(BTRFS_I(inode),
> - data_reserved, __pos,
> - release_bytes, true);
> - }
> + data_reserved, pos, write_bytes,
> + true);
> }
>
> - release_bytes = round_up(copied + sector_offset,
> - fs_info->sectorsize);
> -
> - if (copied > 0)
> - ret = btrfs_dirty_pages(BTRFS_I(inode), pages,
> - dirty_pages, pos, copied,
> - &cached_state);
> -
> /*
> * If we have not locked the extent range, because the range's
> * start offset is >= i_size, we might still have a non-NULL
> @@ -1811,26 +1681,22 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>
> btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes);
> if (ret) {
> - btrfs_drop_pages(pages, num_pages);
> + btrfs_drop_pages(&page, 1);
> break;
> }
>
> - release_bytes = 0;
> - if (only_release_metadata)
> + if (only_release_metadata) {
> btrfs_check_nocow_unlock(BTRFS_I(inode));
> -
> - if (only_release_metadata && copied > 0) {
> lockstart = round_down(pos,
> fs_info->sectorsize);
> - lockend = round_up(pos + copied,
> - fs_info->sectorsize) - 1;
> + lockend = lockstart + PAGE_SIZE - 1;
>
> set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
> lockend, EXTENT_NORESERVE, NULL,
> NULL, GFP_NOFS);
> }
>
> - btrfs_drop_pages(pages, num_pages);
> + btrfs_drop_pages(&page, 1);
>
> cond_resched();
>
> @@ -1840,21 +1706,6 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
> num_written += copied;
> }
>
> - kfree(pages);
> -
> - if (release_bytes) {
> - if (only_release_metadata) {
> - btrfs_check_nocow_unlock(BTRFS_I(inode));
> - btrfs_delalloc_release_metadata(BTRFS_I(inode),
> - release_bytes, true);
> - } else {
> - btrfs_delalloc_release_space(BTRFS_I(inode),
> - data_reserved,
> - round_down(pos, fs_info->sectorsize),
> - release_bytes, true);
> - }
> - }
> -
> extent_changeset_free(data_reserved);
> return num_written ? num_written : ret;
> }
> --
> 2.28.0
>
--
Filipe David Manana,
“Whether you think you can, or you think you can't — you're right.”
* Re: [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace
2020-08-19 9:09 ` Filipe Manana
@ 2020-08-19 9:16 ` Qu Wenruo
0 siblings, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2020-08-19 9:16 UTC (permalink / raw)
To: fdmanana, Qu Wenruo; +Cc: linux-btrfs
On 2020/8/19 5:09 PM, Filipe Manana wrote:
> On Wed, Aug 19, 2020 at 6:55 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> Before this patch, btrfs_buffered_write() does page copies in an 8-page
>> batch.
>>
>> EXT4, by contrast, uses generic_perform_write(), which copies page by
>> page.
>>
>> This 8-page batch behavior makes a lot of things more complex:
>> - More complex error handling
>> We need to handle errors for the half-written case.
>>
>> - More complex advance checks
>> With 8 pages we need to consider cases like only 4 of them copied,
>> which means releasing the reserved space for the untouched 4 pages.
>>
>> - More wrappers for multi-page operations
>> The most obvious one is btrfs_copy_from_user(), which introduces way
>> more complexity than we need.
>>
>> This patch changes the behavior to a page-by-page pace: each iteration
>> reserves space for one page and copies one page.
>>
>> A lot of complexity still remains, mostly around short copies,
>> non-uptodate pages and extent locking.
>> But that's more or less the same as generic_perform_write().
>>
>> The performance is the same for 4K block size buffered writes, but there
>> is an obvious regression when a single write spans multiple pages:
>>
>> The test involves writing a 128MiB file, which is smaller than 1/8th of
>> the system memory.
>> Speed (MiB/sec) Ops (ops/sec)
>> Unpatched: 931.498 14903.9756
>> Patched: 447.606 7161.6806
>>
>> In fact, if we account the execution time of btrfs_buffered_write(),
>> meta/data rsv and later page dirty takes way more time than memory copy:
>>
>> Patched:
>> nr_runs = 32768
>> total_prepare_ns = 66908022
>> total_copy_ns = 75532103
>> total_cleanup_ns = 135749090
>>
>> Unpatched:
>> nr_runs = 2176
>> total_prepare_ns = 7425773
>> total_copy_ns = 87780898
>> total_cleanup_ns = 37704811
>>
>> The patched behavior is now similar to EXT4's; the buffered write path
>> remains mostly unchanged for 4K blocksize and larger.
>>
>> On the other hand, XFS uses iomap, which supports multi-page reserve and
>> copy, leading to similar performance of unpatched btrfs.
>>
>> It looks like we'd better go the iomap route rather than
>> generic_perform_write().
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Reason for RFC:
>>
>> The performance drop is enough for this patch to be discarded.
>
> More than enough: a nearly 50% performance drop for a very common operation.
> I can't even understand why you bothered proposing it.
Well, at least for blocksize <= 4K it's still the same performance, but
with a much cleaner code base.
Anyway, I will go the iomap direction, as iomap seems to have extra
machinery to ensure:
- The iov_iter is always readable
So we won't need to bother with short in-page copies, and thus no need
to handle the copied == 0 case.
- The needed operations are done before the real page copy
I'm wondering what the proper way is to interact with the page and
extent locks.
If we could first lock the extent io tree and then try to get the pages
locked, we could remove one "goto again" call and make it easier to
integrate with iomap.
Thanks,
Qu
>
> Thanks.
>
>> ---
>> fs/btrfs/file.c | 293 ++++++++++++------------------------------------
>> 1 file changed, 72 insertions(+), 221 deletions(-)
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index bbfc8819cf28..be595da9bc05 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -379,60 +379,6 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
>> return 0;
>> }
>>
>> -/* simple helper to fault in pages and copy. This should go away
>> - * and be replaced with calls into generic code.
>> - */
>> -static noinline int btrfs_copy_from_user(loff_t pos, size_t write_bytes,
>> - struct page **prepared_pages,
>> - struct iov_iter *i)
>> -{
>> - size_t copied = 0;
>> - size_t total_copied = 0;
>> - int pg = 0;
>> - int offset = offset_in_page(pos);
>> -
>> - while (write_bytes > 0) {
>> - size_t count = min_t(size_t,
>> - PAGE_SIZE - offset, write_bytes);
>> - struct page *page = prepared_pages[pg];
>> - /*
>> - * Copy data from userspace to the current page
>> - */
>> - copied = iov_iter_copy_from_user_atomic(page, i, offset, count);
>> -
>> - /* Flush processor's dcache for this page */
>> - flush_dcache_page(page);
>> -
>> - /*
>> - * if we get a partial write, we can end up with
>> - * partially up to date pages. These add
>> - * a lot of complexity, so make sure they don't
>> - * happen by forcing this copy to be retried.
>> - *
>> - * The rest of the btrfs_file_write code will fall
>> - * back to page at a time copies after we return 0.
>> - */
>> - if (!PageUptodate(page) && copied < count)
>> - copied = 0;
>> -
>> - iov_iter_advance(i, copied);
>> - write_bytes -= copied;
>> - total_copied += copied;
>> -
>> - /* Return to btrfs_file_write_iter to fault page */
>> - if (unlikely(copied == 0))
>> - break;
>> -
>> - if (copied < PAGE_SIZE - offset) {
>> - offset += copied;
>> - } else {
>> - pg++;
>> - offset = 0;
>> - }
>> - }
>> - return total_copied;
>> -}
>> -
>> /*
>> * unlocks pages after btrfs_file_write is done with them
>> */
>> @@ -443,8 +389,8 @@ static void btrfs_drop_pages(struct page **pages, size_t num_pages)
>> /* page checked is some magic around finding pages that
>> * have been modified without going through btrfs_set_page_dirty
>> * clear it here. There should be no need to mark the pages
>> - * accessed as prepare_pages should have marked them accessed
>> - * in prepare_pages via find_or_create_page()
>> + * accessed as prepare_pages() should have marked them accessed
>> + * in prepare_pages() via find_or_create_page()
>> */
>> ClearPageChecked(pages[i]);
>> unlock_page(pages[i]);
>> @@ -1400,58 +1346,6 @@ static int prepare_uptodate_page(struct inode *inode,
>> return 0;
>> }
>>
>> -/*
>> - * this just gets pages into the page cache and locks them down.
>> - */
>> -static noinline int prepare_pages(struct inode *inode, struct page **pages,
>> - size_t num_pages, loff_t pos,
>> - size_t write_bytes, bool force_uptodate)
>> -{
>> - int i;
>> - unsigned long index = pos >> PAGE_SHIFT;
>> - gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
>> - int err = 0;
>> - int faili;
>> -
>> - for (i = 0; i < num_pages; i++) {
>> -again:
>> - pages[i] = find_or_create_page(inode->i_mapping, index + i,
>> - mask | __GFP_WRITE);
>> - if (!pages[i]) {
>> - faili = i - 1;
>> - err = -ENOMEM;
>> - goto fail;
>> - }
>> -
>> - if (i == 0)
>> - err = prepare_uptodate_page(inode, pages[i], pos,
>> - force_uptodate);
>> - if (!err && i == num_pages - 1)
>> - err = prepare_uptodate_page(inode, pages[i],
>> - pos + write_bytes, false);
>> - if (err) {
>> - put_page(pages[i]);
>> - if (err == -EAGAIN) {
>> - err = 0;
>> - goto again;
>> - }
>> - faili = i - 1;
>> - goto fail;
>> - }
>> - wait_on_page_writeback(pages[i]);
>> - }
>> -
>> - return 0;
>> -fail:
>> - while (faili >= 0) {
>> - unlock_page(pages[faili]);
>> - put_page(pages[faili]);
>> - faili--;
>> - }
>> - return err;
>> -
>> -}
>> -
>> /*
>> * This function locks the extent and properly waits for data=ordered extents
>> * to finish before allowing the pages to be modified if need.
>> @@ -1619,6 +1513,38 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode)
>> btrfs_drew_write_unlock(&inode->root->snapshot_lock);
>> }
>>
>> +static int prepare_one_page(struct inode *inode, struct page **page_ret,
>> + loff_t pos, size_t write_bytes, bool force_uptodate)
>> +{
>> + gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping) | __GFP_WRITE;
>> + struct page *page;
>> + int ret;
>> +
>> +again:
>> + page = find_or_create_page(inode->i_mapping, pos >> PAGE_SHIFT, mask);
>> + if (!page)
>> + return -ENOMEM;
>> +
>> + /*
>> + * We need the page uptodate for the following cases:
>> + * - Write range only covers part of the page
>> + * - We got a short copy on non-uptodate page in previous run
>> + */
>> + if ((!(offset_in_page(pos) == 0 && write_bytes == PAGE_SIZE) ||
>> + force_uptodate) && !PageUptodate(page)) {
>> + ret = prepare_uptodate_page(inode, page, pos, true);
>> + if (ret) {
>> + put_page(page);
>> + if (ret == -EAGAIN)
>> + goto again;
>> + return ret;
>> + }
>> + wait_on_page_writeback(page);
>> + }
>> + *page_ret = page;
>> + return 0;
>> +}
>> +
>> static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>> struct iov_iter *i)
>> {
>> @@ -1626,45 +1552,26 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>> loff_t pos = iocb->ki_pos;
>> struct inode *inode = file_inode(file);
>> struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>> - struct page **pages = NULL;
>> + struct page *page = NULL;
>> struct extent_changeset *data_reserved = NULL;
>> - u64 release_bytes = 0;
>> u64 lockstart;
>> u64 lockend;
>> size_t num_written = 0;
>> - int nrptrs;
>> int ret = 0;
>> bool only_release_metadata = false;
>> bool force_page_uptodate = false;
>>
>> - nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
>> - PAGE_SIZE / (sizeof(struct page *)));
>> - nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied);
>> - nrptrs = max(nrptrs, 8);
>> - pages = kmalloc_array(nrptrs, sizeof(struct page *), GFP_KERNEL);
>> - if (!pages)
>> - return -ENOMEM;
>> -
>> while (iov_iter_count(i) > 0) {
>> struct extent_state *cached_state = NULL;
>> size_t offset = offset_in_page(pos);
>> - size_t sector_offset;
>> size_t write_bytes = min(iov_iter_count(i),
>> - nrptrs * (size_t)PAGE_SIZE -
>> - offset);
>> - size_t num_pages = DIV_ROUND_UP(write_bytes + offset,
>> - PAGE_SIZE);
>> - size_t reserve_bytes;
>> - size_t dirty_pages;
>> + PAGE_SIZE - offset);
>> + size_t reserve_bytes = PAGE_SIZE;
>> size_t copied;
>> - size_t dirty_sectors;
>> - size_t num_sectors;
>> int extents_locked;
>>
>> - WARN_ON(num_pages > nrptrs);
>> -
>> /*
>> - * Fault pages before locking them in prepare_pages
>> + * Fault pages before locking them in prepare_page()
>> * to avoid recursive lock
>> */
>> if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) {
>> @@ -1673,37 +1580,27 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>> }
>>
>> only_release_metadata = false;
>> - sector_offset = pos & (fs_info->sectorsize - 1);
>> - reserve_bytes = round_up(write_bytes + sector_offset,
>> - fs_info->sectorsize);
>>
>> extent_changeset_release(data_reserved);
>> ret = btrfs_check_data_free_space(BTRFS_I(inode),
>> &data_reserved, pos,
>> write_bytes);
>> if (ret < 0) {
>> + size_t tmp = write_bytes;
>> if (btrfs_check_nocow_lock(BTRFS_I(inode), pos,
>> - &write_bytes) > 0) {
>> + &tmp) > 0) {
>> + ASSERT(tmp == write_bytes);
>> /*
>> * For nodata cow case, no need to reserve
>> * data space.
>> */
>> only_release_metadata = true;
>> - /*
>> - * our prealloc extent may be smaller than
>> - * write_bytes, so scale down.
>> - */
>> - num_pages = DIV_ROUND_UP(write_bytes + offset,
>> - PAGE_SIZE);
>> - reserve_bytes = round_up(write_bytes +
>> - sector_offset,
>> - fs_info->sectorsize);
>> + reserve_bytes = 0;
>> } else {
>> break;
>> }
>> }
>>
>> - WARN_ON(reserve_bytes == 0);
>> ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
>> reserve_bytes);
>> if (ret) {
>> @@ -1716,16 +1613,9 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>> break;
>> }
>>
>> - release_bytes = reserve_bytes;
>> again:
>> - /*
>> - * This is going to setup the pages array with the number of
>> - * pages we want, so we don't really need to worry about the
>> - * contents of pages from loop to loop
>> - */
>> - ret = prepare_pages(inode, pages, num_pages,
>> - pos, write_bytes,
>> - force_page_uptodate);
>> + ret = prepare_one_page(inode, &page, pos, write_bytes,
>> + force_page_uptodate);
>> if (ret) {
>> btrfs_delalloc_release_extents(BTRFS_I(inode),
>> reserve_bytes);
>> @@ -1733,9 +1623,8 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>> }
>>
>> extents_locked = lock_and_cleanup_extent_if_need(
>> - BTRFS_I(inode), pages,
>> - num_pages, pos, write_bytes, &lockstart,
>> - &lockend, &cached_state);
>> + BTRFS_I(inode), &page, 1, pos, write_bytes,
>> + &lockstart, &lockend, &cached_state);
>> if (extents_locked < 0) {
>> if (extents_locked == -EAGAIN)
>> goto again;
>> @@ -1745,57 +1634,38 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>> break;
>> }
>>
>> - copied = btrfs_copy_from_user(pos, write_bytes, pages, i);
>> -
>> - num_sectors = BTRFS_BYTES_TO_BLKS(fs_info, reserve_bytes);
>> - dirty_sectors = round_up(copied + sector_offset,
>> - fs_info->sectorsize);
>> - dirty_sectors = BTRFS_BYTES_TO_BLKS(fs_info, dirty_sectors);
>> -
>> - /*
>> - * if we have trouble faulting in the pages, fall
>> - * back to one page at a time
>> - */
>> - if (copied < write_bytes)
>> - nrptrs = 1;
>> + copied = iov_iter_copy_from_user_atomic(page, i, offset,
>> + write_bytes);
>> + flush_dcache_page(page);
>>
>> - if (copied == 0) {
>> + if (!PageUptodate(page) && copied < write_bytes) {
>> + /*
>> + * Short write on non-uptodate page, we must retry and
>> + * force the page uptodate in next run.
>> + */
>> + copied = 0;
>> force_page_uptodate = true;
>> - dirty_sectors = 0;
>> - dirty_pages = 0;
>> } else {
>> + /* Next run doesn't need forced uptodate */
>> force_page_uptodate = false;
>> - dirty_pages = DIV_ROUND_UP(copied + offset,
>> - PAGE_SIZE);
>> }
>>
>> - if (num_sectors > dirty_sectors) {
>> - /* release everything except the sectors we dirtied */
>> - release_bytes -= dirty_sectors <<
>> - fs_info->sb->s_blocksize_bits;
>> - if (only_release_metadata) {
>> - btrfs_delalloc_release_metadata(BTRFS_I(inode),
>> - release_bytes, true);
>> - } else {
>> - u64 __pos;
>> + iov_iter_advance(i, copied);
>>
>> - __pos = round_down(pos,
>> - fs_info->sectorsize) +
>> - (dirty_pages << PAGE_SHIFT);
>> + if (copied > 0) {
>> + ret = btrfs_dirty_pages(BTRFS_I(inode), &page, 1, pos,
>> + copied, &cached_state);
>> + } else {
>> + /* No bytes copied, need to free reserved space */
>> + if (only_release_metadata)
>> + btrfs_delalloc_release_metadata(BTRFS_I(inode),
>> + reserve_bytes, true);
>> + else
>> btrfs_delalloc_release_space(BTRFS_I(inode),
>> - data_reserved, __pos,
>> - release_bytes, true);
>> - }
>> + data_reserved, pos, write_bytes,
>> + true);
>> }
>>
>> - release_bytes = round_up(copied + sector_offset,
>> - fs_info->sectorsize);
>> -
>> - if (copied > 0)
>> - ret = btrfs_dirty_pages(BTRFS_I(inode), pages,
>> - dirty_pages, pos, copied,
>> - &cached_state);
>> -
>> /*
>> * If we have not locked the extent range, because the range's
>> * start offset is >= i_size, we might still have a non-NULL
>> @@ -1811,26 +1681,22 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>>
>> btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes);
>> if (ret) {
>> - btrfs_drop_pages(pages, num_pages);
>> + btrfs_drop_pages(&page, 1);
>> break;
>> }
>>
>> - release_bytes = 0;
>> - if (only_release_metadata)
>> + if (only_release_metadata) {
>> btrfs_check_nocow_unlock(BTRFS_I(inode));
>> -
>> - if (only_release_metadata && copied > 0) {
>> lockstart = round_down(pos,
>> fs_info->sectorsize);
>> - lockend = round_up(pos + copied,
>> - fs_info->sectorsize) - 1;
>> + lockend = lockstart + PAGE_SIZE - 1;
>>
>> set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
>> lockend, EXTENT_NORESERVE, NULL,
>> NULL, GFP_NOFS);
>> }
>>
>> - btrfs_drop_pages(pages, num_pages);
>> + btrfs_drop_pages(&page, 1);
>>
>> cond_resched();
>>
>> @@ -1840,21 +1706,6 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>> num_written += copied;
>> }
>>
>> - kfree(pages);
>> -
>> - if (release_bytes) {
>> - if (only_release_metadata) {
>> - btrfs_check_nocow_unlock(BTRFS_I(inode));
>> - btrfs_delalloc_release_metadata(BTRFS_I(inode),
>> - release_bytes, true);
>> - } else {
>> - btrfs_delalloc_release_space(BTRFS_I(inode),
>> - data_reserved,
>> - round_down(pos, fs_info->sectorsize),
>> - release_bytes, true);
>> - }
>> - }
>> -
>> extent_changeset_free(data_reserved);
>> return num_written ? num_written : ret;
>> }
>> --
>> 2.28.0
>>
>
>
* Re: [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace
2020-08-19 5:53 [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace Qu Wenruo
2020-08-19 8:50 ` kernel test robot
2020-08-19 9:09 ` Filipe Manana
@ 2020-08-19 20:11 ` Josef Bacik
2 siblings, 0 replies; 5+ messages in thread
From: Josef Bacik @ 2020-08-19 20:11 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 8/19/20 1:53 AM, Qu Wenruo wrote:
> Before this patch, btrfs_buffered_write() does page copies in an 8-page
> batch.
>
> EXT4, by contrast, uses generic_perform_write(), which copies page by
> page.
>
> This 8-page batch behavior makes a lot of things more complex:
> - More complex error handling
> We need to handle errors for the half-written case.
>
> - More complex advance checks
> With 8 pages we need to consider cases like only 4 of them copied,
> which means releasing the reserved space for the untouched 4 pages.
>
> - More wrappers for multi-page operations
> The most obvious one is btrfs_copy_from_user(), which introduces way
> more complexity than we need.
>
> This patch changes the behavior to a page-by-page pace: each iteration
> reserves space for one page and copies one page.
>
> A lot of complexity still remains, mostly around short copies,
> non-uptodate pages and extent locking.
> But that's more or less the same as generic_perform_write().
>
> The performance is the same for 4K block size buffered writes, but there
> is an obvious regression when a single write spans multiple pages:
>
> The test involves writing a 128MiB file, which is smaller than 1/8th of
> the system memory.
> Speed (MiB/sec) Ops (ops/sec)
> Unpatched: 931.498 14903.9756
> Patched: 447.606 7161.6806
>
> In fact, if we account the execution time of btrfs_buffered_write(),
> meta/data rsv and later page dirty takes way more time than memory copy:
>
> Patched:
> nr_runs = 32768
> total_prepare_ns = 66908022
> total_copy_ns = 75532103
> total_cleanup_ns = 135749090
>
> Unpatched:
> nr_runs = 2176
> total_prepare_ns = 7425773
> total_copy_ns = 87780898
> total_cleanup_ns = 37704811
>
> The patched behavior is now similar to EXT4's; the buffered write path
> remains mostly unchanged for 4K blocksize and larger.
>
> On the other hand, XFS uses iomap, which supports multi-page reserve and
> copy, leading to similar performance of unpatched btrfs.
>
> It looks like we'd better go the iomap route rather than
> generic_perform_write().
I'm fine with tying it into iomap if we get to keep our performance. The
reality is we do so much more in the write path per-write that we really do need
to batch as much as possible. If you can figure out a way to modify the generic
code to allow us to use that and still get the batching/performance then go for
it. Otherwise the complexity is worth the price. Thanks,
Josef
end of thread, other threads:[~2020-08-19 20:11 UTC | newest]
Thread overview: 5+ messages
2020-08-19 5:53 [PATCH] btrfs: switch btrfs_buffered_write() to page-by-page pace Qu Wenruo
2020-08-19 8:50 ` kernel test robot
2020-08-19 9:09 ` Filipe Manana
2020-08-19 9:16 ` Qu Wenruo
2020-08-19 20:11 ` Josef Bacik