* [PATCH v2 00/18] btrfs: add read-only support for subpage sector size
@ 2020-12-10 6:38 Qu Wenruo
2020-12-10 6:38 ` [PATCH v2 01/18] btrfs: extent_io: rename @offset parameter to @disk_bytenr for submit_extent_page() Qu Wenruo
` (17 more replies)
0 siblings, 18 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:38 UTC (permalink / raw)
To: linux-btrfs
Patches can be fetched from github:
https://github.com/adam900710/linux/tree/subpage
Currently the branch also contains partial RW data support (there are
still some out-of-sync subpage data page status bugs).
Great thanks to David for his effort reviewing and merging the
preparation patches into misc-next.
Now all previously submitted preparation patches are already in
misc-next.
=== What works ===
Just from the patchset:
- Data read
Both regular and compressed data, with csum check.
- Metadata read
This means that with this patchset, 64K page systems can at least mount
btrfs with 4K sector size.
In the subpage branch
- Metadata read write
Not yet fully tested, as data write still has bugs that need to be
solved.
But considering that the metadata operations from the previous
iteration are mostly untouched, metadata read write should be pretty
stable.
- Data read write
WIP. There are fsstress runs which lead to out-of-sync subpage dirty
status and cause some ordered extents to never finish.
Still fixing it.
=== Needs feedback ===
The following design needs extra comments:
- u16 bitmap
As David mentioned, using u16 as the bitmap is not the fastest way.
That's also why the current bitmap code requires unsigned long (at
least u32) as its minimal unit.
But using the regular bitmap helpers directly would double the memory
usage.
Thus the best way may be to pack two u16 bitmaps into one u32 bitmap,
but that still needs extra investigation to find the best practice.
Anyway the skeleton should be pretty simple to expand.
- Separate handling for subpage metadata
Currently the metadata read path (and later the write path) handles
subpage metadata differently, mostly because page locking must be
skipped for subpage metadata.
I tried several times to share as much common code as possible, but
every time I ended up reverting back to the current code.
Thankfully, for data handling we will use the same common code.
=== Patchset structure ===
Patch 01~03: New preparation patches.
Mostly readability related patches found during RW
development
Patch 04~08: Subpage handling for extent buffer allocation and
freeing
Patch 09~18: Subpage handling for extent buffer read path
=== Changelog ===
v1:
- Separate the main implementation from the previous huge patchset
A huge patchset doesn't make much sense for review.
- Use bitmap implementation
Now page::private will be a pointer to btrfs_subpage structure, which
contains bitmaps for various page status.
v2:
- Use page::private as btrfs_subpage for extra info
This replaces the old extent io tree based solution, which reduces
latency and doesn't require memory allocation for its operations.
- Cherry-pick new preparation patches from RW development
Those new preparation patches improve the readability on their own.
Qu Wenruo (18):
btrfs: extent_io: rename @offset parameter to @disk_bytenr for
submit_extent_page()
btrfs: extent_io: refactor __extent_writepage_io() to improve
readability
btrfs: file: update comment for btrfs_dirty_pages()
btrfs: extent_io: introduce a helper to grab an existing extent buffer
from a page
btrfs: extent_io: introduce the skeleton of btrfs_subpage structure
btrfs: extent_io: make attach_extent_buffer_page() to handle subpage
case
btrfs: extent_io: make grab_extent_buffer_from_page() to handle
subpage case
btrfs: extent_io: support subpage for extent buffer page release
btrfs: subpage: introduce helper for subpage uptodate status
btrfs: subpage: introduce helper for subpage error status
btrfs: extent_io: make set/clear_extent_buffer_uptodate() to support
subpage size
btrfs: extent_io: implement try_release_extent_buffer() for subpage
metadata support
btrfs: extent_io: introduce read_extent_buffer_subpage()
btrfs: extent_io: make endio_readpage_update_page_status() to handle
subpage case
btrfs: disk-io: introduce subpage metadata validation check
btrfs: introduce btrfs_subpage for data inodes
btrfs: integrate page status update for read path into
begin/end_page_read()
btrfs: allow RO mount of 4K sector size fs on 64K page system
fs/btrfs/Makefile | 3 +-
fs/btrfs/compression.c | 10 +-
fs/btrfs/disk-io.c | 107 +++++++-
fs/btrfs/extent_io.c | 507 ++++++++++++++++++++++++++++--------
fs/btrfs/extent_io.h | 3 +-
fs/btrfs/file.c | 25 +-
fs/btrfs/free-space-cache.c | 15 +-
fs/btrfs/inode.c | 12 +-
fs/btrfs/ioctl.c | 5 +-
fs/btrfs/reflink.c | 5 +-
fs/btrfs/relocation.c | 12 +-
fs/btrfs/subpage.c | 34 +++
fs/btrfs/subpage.h | 264 +++++++++++++++++++
fs/btrfs/super.c | 7 +
14 files changed, 876 insertions(+), 133 deletions(-)
create mode 100644 fs/btrfs/subpage.c
create mode 100644 fs/btrfs/subpage.h
--
2.29.2
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH v2 01/18] btrfs: extent_io: rename @offset parameter to @disk_bytenr for submit_extent_page()
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
@ 2020-12-10 6:38 ` Qu Wenruo
2020-12-17 15:44 ` Josef Bacik
2020-12-10 6:38 ` [PATCH v2 02/18] btrfs: extent_io: refactor __extent_writepage_io() to improve readability Qu Wenruo
` (16 subsequent siblings)
17 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:38 UTC (permalink / raw)
To: linux-btrfs
The parameter @offset can't be more confusing.
In fact that parameter is the disk bytenr for metadata/data.
Rename it to @disk_bytenr and update the comment to reduce confusion.
Since we're here, also rename all @offset passed into
submit_extent_page() to @disk_bytenr.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 30 +++++++++++++++---------------
1 file changed, 15 insertions(+), 15 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6e3b72e63e42..2650e8720394 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3064,10 +3064,10 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size)
* @opf: bio REQ_OP_* and REQ_* flags as one value
* @wbc: optional writeback control for io accounting
* @page: page to add to the bio
+ * @disk_bytenr: the logical bytenr where the write will be
+ * @size: portion of page that we want to write
* @pg_offset: offset of the new bio or to check whether we are adding
* a contiguous page to the previous one
- * @size: portion of page that we want to write
- * @offset: starting offset in the page
* @bio_ret: must be valid pointer, newly allocated bio will be stored there
* @end_io_func: end_io callback for new bio
* @mirror_num: desired mirror to read/write
@@ -3076,7 +3076,7 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size)
*/
static int submit_extent_page(unsigned int opf,
struct writeback_control *wbc,
- struct page *page, u64 offset,
+ struct page *page, u64 disk_bytenr,
size_t size, unsigned long pg_offset,
struct bio **bio_ret,
bio_end_io_t end_io_func,
@@ -3088,7 +3088,7 @@ static int submit_extent_page(unsigned int opf,
int ret = 0;
struct bio *bio;
size_t io_size = min_t(size_t, size, PAGE_SIZE);
- sector_t sector = offset >> 9;
+ sector_t sector = disk_bytenr >> 9;
struct extent_io_tree *tree = &BTRFS_I(page->mapping->host)->io_tree;
ASSERT(bio_ret);
@@ -3122,7 +3122,7 @@ static int submit_extent_page(unsigned int opf,
}
}
- bio = btrfs_bio_alloc(offset);
+ bio = btrfs_bio_alloc(disk_bytenr);
bio_add_page(bio, page, io_size, pg_offset);
bio->bi_end_io = end_io_func;
bio->bi_private = tree;
@@ -3244,7 +3244,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
}
while (cur <= end) {
bool force_bio_submit = false;
- u64 offset;
+ u64 disk_bytenr;
if (cur >= last_byte) {
char *userpage;
@@ -3282,9 +3282,9 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
cur_end = min(extent_map_end(em) - 1, end);
iosize = ALIGN(iosize, blocksize);
if (this_bio_flag & EXTENT_BIO_COMPRESSED)
- offset = em->block_start;
+ disk_bytenr = em->block_start;
else
- offset = em->block_start + extent_offset;
+ disk_bytenr = em->block_start + extent_offset;
block_start = em->block_start;
if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
block_start = EXTENT_MAP_HOLE;
@@ -3373,7 +3373,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
}
ret = submit_extent_page(REQ_OP_READ | read_flags, NULL,
- page, offset, iosize,
+ page, disk_bytenr, iosize,
pg_offset, bio,
end_bio_extent_readpage, 0,
*bio_flags,
@@ -3550,8 +3550,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
blocksize = inode->vfs_inode.i_sb->s_blocksize;
while (cur <= end) {
+ u64 disk_bytenr;
u64 em_end;
- u64 offset;
if (cur >= i_size) {
btrfs_writepage_endio_finish_ordered(page, cur,
@@ -3571,7 +3571,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
BUG_ON(end < cur);
iosize = min(em_end - cur, end - cur + 1);
iosize = ALIGN(iosize, blocksize);
- offset = em->block_start + extent_offset;
+ disk_bytenr = em->block_start + extent_offset;
block_start = em->block_start;
compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
free_extent_map(em);
@@ -3601,7 +3601,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
}
ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
- page, offset, iosize, pg_offset,
+ page, disk_bytenr, iosize, pg_offset,
&epd->bio,
end_bio_extent_writepage,
0, 0, 0, false);
@@ -3925,7 +3925,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
struct writeback_control *wbc,
struct extent_page_data *epd)
{
- u64 offset = eb->start;
+ u64 disk_bytenr = eb->start;
u32 nritems;
int i, num_pages;
unsigned long start, end;
@@ -3958,7 +3958,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
clear_page_dirty_for_io(p);
set_page_writeback(p);
ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
- p, offset, PAGE_SIZE, 0,
+ p, disk_bytenr, PAGE_SIZE, 0,
&epd->bio,
end_bio_extent_buffer_writepage,
0, 0, 0, false);
@@ -3971,7 +3971,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
ret = -EIO;
break;
}
- offset += PAGE_SIZE;
+ disk_bytenr += PAGE_SIZE;
update_nr_written(wbc, 1);
unlock_page(p);
}
--
2.29.2
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH v2 02/18] btrfs: extent_io: refactor __extent_writepage_io() to improve readability
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
2020-12-10 6:38 ` [PATCH v2 01/18] btrfs: extent_io: rename @offset parameter to @disk_bytenr for submit_extent_page() Qu Wenruo
@ 2020-12-10 6:38 ` Qu Wenruo
2020-12-10 12:12 ` Nikolay Borisov
2020-12-17 15:43 ` Josef Bacik
2020-12-10 6:38 ` [PATCH v2 03/18] btrfs: file: update comment for btrfs_dirty_pages() Qu Wenruo
` (15 subsequent siblings)
17 siblings, 2 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:38 UTC (permalink / raw)
To: linux-btrfs
The refactor involves the following modifications:
- iosize alignment
In fact we don't really need to do the alignment manually at all.
All extent maps should already be aligned, thus a basic ASSERT() check
would be enough.
- redundant variables
We have extra variables like blocksize/pg_offset/end.
They are all unnecessary.
@blocksize can be replaced by sectorsize directly, as it's only
used to verify that the em start/size is aligned.
@pg_offset can be easily calculated using @cur and page_offset(page).
@end is just assigned from @page_end and never modified, so use
@page_end to replace it.
- remove some BUG_ON()s
The BUG_ON()s are for the extent map, whose on-disk extent data items
are already verified by the tree-checker and runtime checks.
ASSERT() should be enough.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 37 +++++++++++++++++--------------------
1 file changed, 17 insertions(+), 20 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2650e8720394..612fe60b367e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3515,17 +3515,14 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
unsigned long nr_written,
int *nr_ret)
{
+ struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct extent_io_tree *tree = &inode->io_tree;
u64 start = page_offset(page);
u64 page_end = start + PAGE_SIZE - 1;
- u64 end;
u64 cur = start;
u64 extent_offset;
u64 block_start;
- u64 iosize;
struct extent_map *em;
- size_t pg_offset = 0;
- size_t blocksize;
int ret = 0;
int nr = 0;
const unsigned int write_flags = wbc_to_write_flags(wbc);
@@ -3546,19 +3543,17 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
*/
update_nr_written(wbc, nr_written + 1);
- end = page_end;
- blocksize = inode->vfs_inode.i_sb->s_blocksize;
-
- while (cur <= end) {
+ while (cur <= page_end) {
u64 disk_bytenr;
u64 em_end;
+ u32 iosize;
if (cur >= i_size) {
btrfs_writepage_endio_finish_ordered(page, cur,
page_end, 1);
break;
}
- em = btrfs_get_extent(inode, NULL, 0, cur, end - cur + 1);
+ em = btrfs_get_extent(inode, NULL, 0, cur, page_end - cur + 1);
if (IS_ERR_OR_NULL(em)) {
SetPageError(page);
ret = PTR_ERR_OR_ZERO(em);
@@ -3567,16 +3562,20 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
extent_offset = cur - em->start;
em_end = extent_map_end(em);
- BUG_ON(em_end <= cur);
- BUG_ON(end < cur);
- iosize = min(em_end - cur, end - cur + 1);
- iosize = ALIGN(iosize, blocksize);
- disk_bytenr = em->block_start + extent_offset;
+ ASSERT(cur <= em_end);
+ ASSERT(cur < page_end);
+ ASSERT(IS_ALIGNED(em->start, fs_info->sectorsize));
+ ASSERT(IS_ALIGNED(em->len, fs_info->sectorsize));
block_start = em->block_start;
compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
+ disk_bytenr = em->block_start + extent_offset;
+
+ /* Note that em_end from extent_map_end() is exclusive */
+ iosize = min(em_end, page_end + 1) - cur;
free_extent_map(em);
em = NULL;
+
/*
* compressed and inline extents are written through other
* paths in the FS
@@ -3589,7 +3588,6 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
btrfs_writepage_endio_finish_ordered(page, cur,
cur + iosize - 1, 1);
cur += iosize;
- pg_offset += iosize;
continue;
}
@@ -3597,12 +3595,12 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
if (!PageWriteback(page)) {
btrfs_err(inode->root->fs_info,
"page %lu not writeback, cur %llu end %llu",
- page->index, cur, end);
+ page->index, cur, page_end);
}
ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
- page, disk_bytenr, iosize, pg_offset,
- &epd->bio,
+ page, disk_bytenr, iosize,
+ cur - page_offset(page), &epd->bio,
end_bio_extent_writepage,
0, 0, 0, false);
if (ret) {
@@ -3611,8 +3609,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
end_page_writeback(page);
}
- cur = cur + iosize;
- pg_offset += iosize;
+ cur += iosize;
nr++;
}
*nr_ret = nr;
--
2.29.2
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH v2 03/18] btrfs: file: update comment for btrfs_dirty_pages()
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
2020-12-10 6:38 ` [PATCH v2 01/18] btrfs: extent_io: rename @offset parameter to @disk_bytenr for submit_extent_page() Qu Wenruo
2020-12-10 6:38 ` [PATCH v2 02/18] btrfs: extent_io: refactor __extent_writepage_io() to improve readability Qu Wenruo
@ 2020-12-10 6:38 ` Qu Wenruo
2020-12-10 12:16 ` Nikolay Borisov
2020-12-10 6:38 ` [PATCH v2 04/18] btrfs: extent_io: introduce a helper to grab an existing extent buffer from a page Qu Wenruo
` (14 subsequent siblings)
17 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:38 UTC (permalink / raw)
To: linux-btrfs
The original comment is from the initial merge, and has several
problems:
- No hole check is done any more
- No inline extent decision is made here
Update the out-of-date comment with a more correct one.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/file.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0e41459b8de6..a29b50208eee 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -453,12 +453,15 @@ static void btrfs_drop_pages(struct page **pages, size_t num_pages)
}
/*
- * after copy_from_user, pages need to be dirtied and we need to make
- * sure holes are created between the current EOF and the start of
- * any next extents (if required).
- *
- * this also makes the decision about creating an inline extent vs
- * doing real data extents, marking pages dirty and delalloc as required.
+ * After btrfs_copy_from_user(), update the following things for delalloc:
+ * - DELALLOC extent io tree bits
+ * Later btrfs_run_delalloc_range() relies on this bit to determine the
+ * writeback range.
+ * - Page status
+ * Including basic status like Dirty and Uptodate, and btrfs specific bit
+ * like Checked (for cow fixup)
+ * - Inode size update
+ * If needed
*/
int btrfs_dirty_pages(struct btrfs_inode *inode, struct page **pages,
size_t num_pages, loff_t pos, size_t write_bytes,
--
2.29.2
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH v2 04/18] btrfs: extent_io: introduce a helper to grab an existing extent buffer from a page
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
` (2 preceding siblings ...)
2020-12-10 6:38 ` [PATCH v2 03/18] btrfs: file: update comment for btrfs_dirty_pages() Qu Wenruo
@ 2020-12-10 6:38 ` Qu Wenruo
2020-12-10 13:51 ` Nikolay Borisov
2020-12-17 15:50 ` Josef Bacik
2020-12-10 6:38 ` [PATCH v2 05/18] btrfs: extent_io: introduce the skeleton of btrfs_subpage structure Qu Wenruo
` (13 subsequent siblings)
17 siblings, 2 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:38 UTC (permalink / raw)
To: linux-btrfs; +Cc: Johannes Thumshirn
This patch will extract the code to grab an extent buffer from a page
into a helper, grab_extent_buffer_from_page().
This reduces one indent level, and provides the place for later
expansion for subpage support.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 52 +++++++++++++++++++++++++++-----------------
1 file changed, 32 insertions(+), 20 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 612fe60b367e..6350c2687c7e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5251,6 +5251,32 @@ struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
}
#endif
+static struct extent_buffer *grab_extent_buffer_from_page(struct page *page)
+{
+ struct extent_buffer *exists;
+
+ /* Page not yet attached to an extent buffer */
+ if (!PagePrivate(page))
+ return NULL;
+
+ /*
+ * We could have already allocated an eb for this page
+ * and attached one so lets see if we can get a ref on
+ * the existing eb, and if we can we know it's good and
+ * we can just return that one, else we know we can just
+ * overwrite page->private.
+ */
+ exists = (struct extent_buffer *)page->private;
+ if (atomic_inc_not_zero(&exists->refs)) {
+ mark_extent_buffer_accessed(exists, page);
+ return exists;
+ }
+
+ WARN_ON(PageDirty(page));
+ detach_page_private(page);
+ return NULL;
+}
+
struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
u64 start, u64 owner_root, int level)
{
@@ -5296,26 +5322,12 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
}
spin_lock(&mapping->private_lock);
- if (PagePrivate(p)) {
- /*
- * We could have already allocated an eb for this page
- * and attached one so lets see if we can get a ref on
- * the existing eb, and if we can we know it's good and
- * we can just return that one, else we know we can just
- * overwrite page->private.
- */
- exists = (struct extent_buffer *)p->private;
- if (atomic_inc_not_zero(&exists->refs)) {
- spin_unlock(&mapping->private_lock);
- unlock_page(p);
- put_page(p);
- mark_extent_buffer_accessed(exists, p);
- goto free_eb;
- }
- exists = NULL;
-
- WARN_ON(PageDirty(p));
- detach_page_private(p);
+ exists = grab_extent_buffer_from_page(p);
+ if (exists) {
+ spin_unlock(&mapping->private_lock);
+ unlock_page(p);
+ put_page(p);
+ goto free_eb;
}
attach_extent_buffer_page(eb, p);
spin_unlock(&mapping->private_lock);
--
2.29.2
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH v2 05/18] btrfs: extent_io: introduce the skeleton of btrfs_subpage structure
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
` (3 preceding siblings ...)
2020-12-10 6:38 ` [PATCH v2 04/18] btrfs: extent_io: introduce a helper to grab an existing extent buffer from a page Qu Wenruo
@ 2020-12-10 6:38 ` Qu Wenruo
2020-12-17 15:52 ` Josef Bacik
2020-12-10 6:38 ` [PATCH v2 06/18] btrfs: extent_io: make attach_extent_buffer_page() to handle subpage case Qu Wenruo
` (12 subsequent siblings)
17 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:38 UTC (permalink / raw)
To: linux-btrfs
For btrfs subpage support, we need a structure to record extra info
about the status of each sector of a page.
This patch will introduce the skeleton structure for future btrfs
subpage support.
All subpage related code will go into subpage.[ch] to avoid polluting
the existing code base.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/Makefile | 3 ++-
fs/btrfs/subpage.c | 34 ++++++++++++++++++++++++++++++++++
fs/btrfs/subpage.h | 31 +++++++++++++++++++++++++++++++
3 files changed, 67 insertions(+), 1 deletion(-)
create mode 100644 fs/btrfs/subpage.c
create mode 100644 fs/btrfs/subpage.h
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 9f1b1a88e317..942562e11456 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -11,7 +11,8 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
- block-rsv.o delalloc-space.o block-group.o discard.o reflink.o
+ block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
+ subpage.o
btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
new file mode 100644
index 000000000000..9ca9f9ca61a9
--- /dev/null
+++ b/fs/btrfs/subpage.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "subpage.h"
+
+int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page)
+{
+ struct btrfs_subpage *subpage;
+
+ ASSERT(PageLocked(page));
+ /* Either not subpage, or the page already has private attached */
+ if (fs_info->sectorsize == PAGE_SIZE || PagePrivate(page))
+ return 0;
+
+ subpage = kzalloc(sizeof(*subpage), GFP_NOFS);
+ if (!subpage)
+ return -ENOMEM;
+
+ spin_lock_init(&subpage->lock);
+ attach_page_private(page, subpage);
+ return 0;
+}
+
+void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page)
+{
+ struct btrfs_subpage *subpage;
+
+ /* Either not subpage, or already detached */
+ if (fs_info->sectorsize == PAGE_SIZE || !PagePrivate(page))
+ return;
+
+ subpage = (struct btrfs_subpage *)detach_page_private(page);
+ ASSERT(subpage);
+ kfree(subpage);
+}
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
new file mode 100644
index 000000000000..96f3b226913e
--- /dev/null
+++ b/fs/btrfs/subpage.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef BTRFS_SUBPAGE_H
+#define BTRFS_SUBPAGE_H
+
+#include <linux/spinlock.h>
+#include "ctree.h"
+
+/*
+ * Since the maximum page size btrfs is going to support is 64K while the
+ * minimum sectorsize is 4K, this means a u16 bitmap is enough.
+ *
+ * The regular bitmap requires 32 bits as minimal bitmap size, so we can't use
+ * existing bitmap_* helpers here.
+ */
+#define BTRFS_SUBPAGE_BITMAP_SIZE 16
+
+/*
+ * Structure to trace status of each sector inside a page.
+ *
+ * Will be attached to page::private for both data and metadata inodes.
+ */
+struct btrfs_subpage {
+ /* Common members for both data and metadata pages */
+ spinlock_t lock;
+};
+
+int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
+void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
+
+#endif /* BTRFS_SUBPAGE_H */
--
2.29.2
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH v2 06/18] btrfs: extent_io: make attach_extent_buffer_page() to handle subpage case
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
` (4 preceding siblings ...)
2020-12-10 6:38 ` [PATCH v2 05/18] btrfs: extent_io: introduce the skeleton of btrfs_subpage structure Qu Wenruo
@ 2020-12-10 6:38 ` Qu Wenruo
2020-12-10 15:30 ` Nikolay Borisov
` (2 more replies)
2020-12-10 6:38 ` [PATCH v2 07/18] btrfs: extent_io: make grab_extent_buffer_from_page() " Qu Wenruo
` (11 subsequent siblings)
17 siblings, 3 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:38 UTC (permalink / raw)
To: linux-btrfs
For subpage case, we need to allocate new memory for each metadata page.
So we need to:
- Allow attach_extent_buffer_page() to return int
To indicate allocation failure
- Prealloc page->private for alloc_extent_buffer()
We don't want to do memory allocation with a spinlock held, so
do the preallocation before we acquire the spinlock.
- Handle subpage and regular case differently in
attach_extent_buffer_page()
For regular case, just do the usual thing.
For subpage case, allocate new memory and update the tree_block
bitmap.
The bitmap update will be handled by new subpage specific helper,
btrfs_subpage_set_tree_block().
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 69 +++++++++++++++++++++++++++++++++++---------
fs/btrfs/subpage.h | 44 ++++++++++++++++++++++++++++
2 files changed, 99 insertions(+), 14 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6350c2687c7e..51dd7ec3c2b3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -24,6 +24,7 @@
#include "rcu-string.h"
#include "backref.h"
#include "disk-io.h"
+#include "subpage.h"
static struct kmem_cache *extent_state_cache;
static struct kmem_cache *extent_buffer_cache;
@@ -3142,22 +3143,41 @@ static int submit_extent_page(unsigned int opf,
return ret;
}
-static void attach_extent_buffer_page(struct extent_buffer *eb,
+static int attach_extent_buffer_page(struct extent_buffer *eb,
struct page *page)
{
- /*
- * If the page is mapped to btree inode, we should hold the private
- * lock to prevent race.
- * For cloned or dummy extent buffers, their pages are not mapped and
- * will not race with any other ebs.
- */
- if (page->mapping)
- lockdep_assert_held(&page->mapping->private_lock);
+ struct btrfs_fs_info *fs_info = eb->fs_info;
+ int ret;
- if (!PagePrivate(page))
- attach_page_private(page, eb);
- else
- WARN_ON(page->private != (unsigned long)eb);
+ if (fs_info->sectorsize == PAGE_SIZE) {
+ /*
+ * If the page is mapped to btree inode, we should hold the
+ * private lock to prevent race.
+ * For cloned or dummy extent buffers, their pages are not
+ * mapped and will not race with any other ebs.
+ */
+ if (page->mapping)
+ lockdep_assert_held(&page->mapping->private_lock);
+
+ if (!PagePrivate(page))
+ attach_page_private(page, eb);
+ else
+ WARN_ON(page->private != (unsigned long)eb);
+ return 0;
+ }
+
+ /* Already mapped, just update the existing range */
+ if (PagePrivate(page))
+ goto update_bitmap;
+
+ /* Do new allocation to attach subpage */
+ ret = btrfs_attach_subpage(fs_info, page);
+ if (ret < 0)
+ return ret;
+
+update_bitmap:
+ btrfs_subpage_set_tree_block(fs_info, page, eb->start, eb->len);
+ return 0;
}
void set_page_extent_mapped(struct page *page)
@@ -5067,12 +5087,19 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
return NULL;
for (i = 0; i < num_pages; i++) {
+ int ret;
+
p = alloc_page(GFP_NOFS);
if (!p) {
btrfs_release_extent_buffer(new);
return NULL;
}
- attach_extent_buffer_page(new, p);
+ ret = attach_extent_buffer_page(new, p);
+ if (ret < 0) {
+ put_page(p);
+ btrfs_release_extent_buffer(new);
+ return NULL;
+ }
WARN_ON(PageDirty(p));
SetPageUptodate(p);
new->pages[i] = p;
@@ -5321,6 +5348,18 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
goto free_eb;
}
+ /*
+ * Preallocate page->private for subpage case, so that
+ * we won't allocate memory with private_lock held.
+ */
+ ret = btrfs_attach_subpage(fs_info, p);
+ if (ret < 0) {
+ unlock_page(p);
+ put_page(p);
+ exists = ERR_PTR(-ENOMEM);
+ goto free_eb;
+ }
+
spin_lock(&mapping->private_lock);
exists = grab_extent_buffer_from_page(p);
if (exists) {
@@ -5329,8 +5368,10 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
put_page(p);
goto free_eb;
}
+ /* Should not fail, as we have attached the subpage already */
attach_extent_buffer_page(eb, p);
spin_unlock(&mapping->private_lock);
+
WARN_ON(PageDirty(p));
eb->pages[i] = p;
if (!PageUptodate(p))
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 96f3b226913e..c2ce603e7848 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -23,9 +23,53 @@
struct btrfs_subpage {
/* Common members for both data and metadata pages */
spinlock_t lock;
+ union {
+ /* Structures only used by metadata */
+ struct {
+ u16 tree_block_bitmap;
+ };
+ /* structures only used by data */
+ };
};
int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
+/*
+ * Convert the [start, start + len) range into a u16 bitmap
+ *
+ * E.g. if start == page_offset() + 16K, len = 16K, we get 0x00f0.
+ */
+static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,
+ struct page *page, u64 start, u32 len)
+{
+ int bit_start = (start - page_offset(page)) >> fs_info->sectorsize_bits;
+ int nbits = len >> fs_info->sectorsize_bits;
+
+ /* Basic checks */
+ ASSERT(PagePrivate(page) && page->private);
+ ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
+ IS_ALIGNED(len, fs_info->sectorsize));
+ ASSERT(page_offset(page) <= start &&
+ start + len <= page_offset(page) + PAGE_SIZE);
+ /*
+ * Here nbits can be 16, thus the shift can go beyond the u16 range.
+ * So we do the first left shift in unsigned long (at least u32),
+ * then truncate the result to u16.
+ */
+ return (u16)(((1UL << nbits) - 1) << bit_start);
+}
+
+static inline void btrfs_subpage_set_tree_block(struct btrfs_fs_info *fs_info,
+ struct page *page, u64 start, u32 len)
+{
+ struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+ unsigned long flags;
+ u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+
+ spin_lock_irqsave(&subpage->lock, flags);
+ subpage->tree_block_bitmap |= tmp;
+ spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
#endif /* BTRFS_SUBPAGE_H */
--
2.29.2
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH v2 07/18] btrfs: extent_io: make grab_extent_buffer_from_page() to handle subpage case
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
` (5 preceding siblings ...)
2020-12-10 6:38 ` [PATCH v2 06/18] btrfs: extent_io: make attach_extent_buffer_page() to handle subpage case Qu Wenruo
@ 2020-12-10 6:38 ` Qu Wenruo
2020-12-10 15:39 ` Nikolay Borisov
2020-12-17 16:02 ` Josef Bacik
2020-12-10 6:38 ` [PATCH v2 08/18] btrfs: extent_io: support subpage for extent buffer page release Qu Wenruo
` (10 subsequent siblings)
17 siblings, 2 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:38 UTC (permalink / raw)
To: linux-btrfs
For the subpage case, grab_extent_buffer_from_page() can't get an
extent buffer just from btrfs_subpage.
We could use btrfs_subpage::tree_block_bitmap to find the bytenr of an
existing extent buffer, then do a radix tree search to grab that eb.
However alloc_extent_buffer() already does a radix tree insert check,
so there is no need for the extra hassle; just let alloc_extent_buffer()
handle an existing eb found in the radix tree.
So for the subpage case, make grab_extent_buffer_from_page() always
return NULL.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 51dd7ec3c2b3..b99bd0402130 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5278,10 +5278,19 @@ struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
}
#endif
-static struct extent_buffer *grab_extent_buffer_from_page(struct page *page)
+static struct extent_buffer *grab_extent_buffer_from_page(
+ struct btrfs_fs_info *fs_info, struct page *page)
{
struct extent_buffer *exists;
+ /*
+ * For the subpage case, we completely rely on the radix tree to ensure
+ * we don't try to insert two ebs for the same bytenr.
+ * So here we always return NULL and let the caller continue.
+ */
+ if (fs_info->sectorsize < PAGE_SIZE)
+ return NULL;
+
/* Page not yet attached to an extent buffer */
if (!PagePrivate(page))
return NULL;
@@ -5361,7 +5370,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
}
spin_lock(&mapping->private_lock);
- exists = grab_extent_buffer_from_page(p);
+ exists = grab_extent_buffer_from_page(fs_info, p);
if (exists) {
spin_unlock(&mapping->private_lock);
unlock_page(p);
--
2.29.2
* [PATCH v2 08/18] btrfs: extent_io: support subpage for extent buffer page release
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
` (6 preceding siblings ...)
2020-12-10 6:38 ` [PATCH v2 07/18] btrfs: extent_io: make grab_extent_buffer_from_page() " Qu Wenruo
@ 2020-12-10 6:38 ` Qu Wenruo
2020-12-10 16:13 ` Nikolay Borisov
2020-12-10 6:38 ` [PATCH v2 09/18] btrfs: subpage: introduce helper for subpage uptodate status Qu Wenruo
` (9 subsequent siblings)
17 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:38 UTC (permalink / raw)
To: linux-btrfs
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
To do so, introduce a new helper, detach_extent_buffer_page(), to do
different handling for regular and subpage cases.
For the subpage case, the new trick is to clear the range of the
current extent buffer, and detach the page private if and only if we're
the last tree block of the page.
This part is handled by the subpage helper,
btrfs_subpage_clear_and_test_tree_block().
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 59 +++++++++++++++++++++++++++++++-------------
fs/btrfs/subpage.h | 24 ++++++++++++++++++
2 files changed, 66 insertions(+), 17 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b99bd0402130..ee81a2a1baa2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4994,25 +4994,12 @@ int extent_buffer_under_io(const struct extent_buffer *eb)
test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
}
-/*
- * Release all pages attached to the extent buffer.
- */
-static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb)
+static void detach_extent_buffer_page(struct extent_buffer *eb,
+ struct page *page)
{
- int i;
- int num_pages;
- int mapped = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);
-
- BUG_ON(extent_buffer_under_io(eb));
-
- num_pages = num_extent_pages(eb);
- for (i = 0; i < num_pages; i++) {
- struct page *page = eb->pages[i];
+ struct btrfs_fs_info *fs_info = eb->fs_info;
- if (!page)
- continue;
- if (mapped)
- spin_lock(&page->mapping->private_lock);
+ if (fs_info->sectorsize == PAGE_SIZE) {
/*
* We do this since we'll remove the pages after we've
* removed the eb from the radix tree, so we could race
@@ -5031,6 +5018,44 @@ static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb)
*/
detach_page_private(page);
}
+ return;
+ }
+
+ /*
+ * For subpage case, clear the range in tree_block_bitmap,
+ * and if we're the last one, detach private completely.
+ */
+ if (PagePrivate(page)) {
+ bool last = false;
+
+ last = btrfs_subpage_clear_and_test_tree_block(fs_info, page,
+ eb->start, eb->len);
+ if (last)
+ btrfs_detach_subpage(fs_info, page);
+ }
+}
+
+/*
+ * Release all pages attached to the extent buffer.
+ */
+static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb)
+{
+ int i;
+ int num_pages;
+ int mapped = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);
+
+ ASSERT(!extent_buffer_under_io(eb));
+
+ num_pages = num_extent_pages(eb);
+ for (i = 0; i < num_pages; i++) {
+ struct page *page = eb->pages[i];
+
+ if (!page)
+ continue;
+ if (mapped)
+ spin_lock(&page->mapping->private_lock);
+
+ detach_extent_buffer_page(eb, page);
if (mapped)
spin_unlock(&page->mapping->private_lock);
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index c2ce603e7848..87b4e028ae18 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -72,4 +72,28 @@ static inline void btrfs_subpage_set_tree_block(struct btrfs_fs_info *fs_info,
spin_unlock_irqrestore(&subpage->lock, flags);
}
+/*
+ * Clear the bits of the given range in tree_block_bitmap.
+ *
+ * Return true if the cleared bits were the last ones set in
+ * tree_block_bitmap.
+ * Return false otherwise.
+ */
+static inline bool btrfs_subpage_clear_and_test_tree_block(
+ struct btrfs_fs_info *fs_info, struct page *page,
+ u64 start, u32 len)
+{
+ struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+ u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+ unsigned long flags;
+ bool last = false;
+
+ spin_lock_irqsave(&subpage->lock, flags);
+ subpage->tree_block_bitmap &= ~tmp;
+ if (subpage->tree_block_bitmap == 0)
+ last = true;
+ spin_unlock_irqrestore(&subpage->lock, flags);
+ return last;
+}
+
#endif /* BTRFS_SUBPAGE_H */
--
2.29.2
* [PATCH v2 09/18] btrfs: subpage: introduce helper for subpage uptodate status
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
` (7 preceding siblings ...)
2020-12-10 6:38 ` [PATCH v2 08/18] btrfs: extent_io: support subpage for extent buffer page release Qu Wenruo
@ 2020-12-10 6:38 ` Qu Wenruo
2020-12-11 10:10 ` Nikolay Borisov
2020-12-10 6:38 ` [PATCH v2 10/18] btrfs: subpage: introduce helper for subpage error status Qu Wenruo
` (8 subsequent siblings)
17 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:38 UTC (permalink / raw)
To: linux-btrfs
This patch introduces the following functions to handle btrfs subpage
uptodate status:
- btrfs_subpage_set_uptodate()
- btrfs_subpage_clear_uptodate()
- btrfs_subpage_test_uptodate()
Those helpers can only be called when the range is ensured to be
inside the page.
- btrfs_page_set_uptodate()
- btrfs_page_clear_uptodate()
- btrfs_page_test_uptodate()
Those helpers can handle both regular sector size and subpage without
problem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/subpage.h | 98 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 98 insertions(+)
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 87b4e028ae18..b3cf9171ec98 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -23,6 +23,7 @@
struct btrfs_subpage {
/* Common members for both data and metadata pages */
spinlock_t lock;
+ u16 uptodate_bitmap;
union {
/* Structures only used by metadata */
struct {
@@ -35,6 +36,17 @@ struct btrfs_subpage {
int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
+static inline void btrfs_subpage_clamp_range(struct page *page,
+ u64 *start, u32 *len)
+{
+ u64 orig_start = *start;
+ u32 orig_len = *len;
+
+ *start = max_t(u64, page_offset(page), orig_start);
+ *len = min_t(u64, page_offset(page) + PAGE_SIZE,
+ orig_start + orig_len) - *start;
+}
+
/*
* Convert the [start, start + len) range into a u16 bitmap
*
@@ -96,4 +108,90 @@ static inline bool btrfs_subpage_clear_and_test_tree_block(
return last;
}
+static inline void btrfs_subpage_set_uptodate(struct btrfs_fs_info *fs_info,
+ struct page *page, u64 start, u32 len)
+{
+ struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+ u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+ unsigned long flags;
+
+ spin_lock_irqsave(&subpage->lock, flags);
+ subpage->uptodate_bitmap |= tmp;
+ if (subpage->uptodate_bitmap == (u16)-1)
+ SetPageUptodate(page);
+ spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
+static inline void btrfs_subpage_clear_uptodate(struct btrfs_fs_info *fs_info,
+ struct page *page, u64 start, u32 len)
+{
+ struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+ u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+ unsigned long flags;
+
+ spin_lock_irqsave(&subpage->lock, flags);
+ subpage->uptodate_bitmap &= ~tmp;
+ ClearPageUptodate(page);
+ spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
+/*
+ * Unlike set/clear which is dependent on each page status, for test all bits
+ * are tested in the same way.
+ */
+#define DECLARE_BTRFS_SUBPAGE_TEST_OP(name) \
+static inline bool btrfs_subpage_test_##name(struct btrfs_fs_info *fs_info, \
+ struct page *page, u64 start, u32 len) \
+{ \
+ struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private; \
+ u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len); \
+ unsigned long flags; \
+ bool ret; \
+ \
+ spin_lock_irqsave(&subpage->lock, flags); \
+ ret = ((subpage->name##_bitmap & tmp) == tmp); \
+ spin_unlock_irqrestore(&subpage->lock, flags); \
+ return ret; \
+}
+DECLARE_BTRFS_SUBPAGE_TEST_OP(uptodate);
+
+/*
+ * Note that in selftests, especially extent-io-tests, we can have a NULL
+ * fs_info passed in.
+ * Thankfully selftests only cover sectorsize == PAGE_SIZE cases so far,
+ * thus we can fall back to the regular sectorsize branch.
+ */
+#define DECLARE_BTRFS_PAGE_OPS(name, set_page_func, clear_page_func, \
+ test_page_func) \
+static inline void btrfs_page_set_##name(struct btrfs_fs_info *fs_info, \
+ struct page *page, u64 start, u32 len) \
+{ \
+ if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) { \
+ set_page_func(page); \
+ return; \
+ } \
+ btrfs_subpage_clamp_range(page, &start, &len); \
+ btrfs_subpage_set_##name(fs_info, page, start, len); \
+} \
+static inline void btrfs_page_clear_##name(struct btrfs_fs_info *fs_info, \
+ struct page *page, u64 start, u32 len) \
+{ \
+ if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) { \
+ clear_page_func(page); \
+ return; \
+ } \
+ btrfs_subpage_clamp_range(page, &start, &len); \
+ btrfs_subpage_clear_##name(fs_info, page, start, len); \
+} \
+static inline bool btrfs_page_test_##name(struct btrfs_fs_info *fs_info, \
+ struct page *page, u64 start, u32 len) \
+{ \
+ if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) \
+ return test_page_func(page); \
+ btrfs_subpage_clamp_range(page, &start, &len); \
+ return btrfs_subpage_test_##name(fs_info, page, start, len); \
+}
+DECLARE_BTRFS_PAGE_OPS(uptodate, SetPageUptodate, ClearPageUptodate,
+ PageUptodate);
+
#endif /* BTRFS_SUBPAGE_H */
--
2.29.2
* [PATCH v2 10/18] btrfs: subpage: introduce helper for subpage error status
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
` (8 preceding siblings ...)
2020-12-10 6:38 ` [PATCH v2 09/18] btrfs: subpage: introduce helper for subpage uptodate status Qu Wenruo
@ 2020-12-10 6:38 ` Qu Wenruo
2020-12-10 6:38 ` [PATCH v2 11/18] btrfs: extent_io: make set/clear_extent_buffer_uptodate() to support subpage size Qu Wenruo
` (7 subsequent siblings)
17 siblings, 0 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:38 UTC (permalink / raw)
To: linux-btrfs
This patch introduces the following functions to handle btrfs subpage
error status:
- btrfs_subpage_set_error()
- btrfs_subpage_clear_error()
- btrfs_subpage_test_error()
Those helpers can only be called when the range is ensured to be
inside the page.
- btrfs_page_set_error()
- btrfs_page_clear_error()
- btrfs_page_test_error()
Those helpers can handle both regular sector size and subpage without
problem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/subpage.h | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index b3cf9171ec98..8592234d773e 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -24,6 +24,7 @@ struct btrfs_subpage {
/* Common members for both data and metadata pages */
spinlock_t lock;
u16 uptodate_bitmap;
+ u16 error_bitmap;
union {
/* Structures only used by metadata */
struct {
@@ -135,6 +136,35 @@ static inline void btrfs_subpage_clear_uptodate(struct btrfs_fs_info *fs_info,
spin_unlock_irqrestore(&subpage->lock, flags);
}
+static inline void btrfs_subpage_set_error(struct btrfs_fs_info *fs_info,
+ struct page *page, u64 start,
+ u32 len)
+{
+ struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+ u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+ unsigned long flags;
+
+ spin_lock_irqsave(&subpage->lock, flags);
+ subpage->error_bitmap |= tmp;
+ SetPageError(page);
+ spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
+static inline void btrfs_subpage_clear_error(struct btrfs_fs_info *fs_info,
+ struct page *page, u64 start,
+ u32 len)
+{
+ struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+ u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+ unsigned long flags;
+
+ spin_lock_irqsave(&subpage->lock, flags);
+ subpage->error_bitmap &= ~tmp;
+ if (subpage->error_bitmap == 0)
+ ClearPageError(page);
+ spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
/*
* Unlike set/clear which is dependent on each page status, for test all bits
* are tested in the same way.
@@ -154,6 +184,7 @@ static inline bool btrfs_subpage_test_##name(struct btrfs_fs_info *fs_info, \
return ret; \
}
DECLARE_BTRFS_SUBPAGE_TEST_OP(uptodate);
+DECLARE_BTRFS_SUBPAGE_TEST_OP(error);
/*
* Note that, in selftest, especially extent-io-tests, we can have empty
@@ -193,5 +224,6 @@ static inline bool btrfs_page_test_##name(struct btrfs_fs_info *fs_info, \
}
DECLARE_BTRFS_PAGE_OPS(uptodate, SetPageUptodate, ClearPageUptodate,
PageUptodate);
+DECLARE_BTRFS_PAGE_OPS(error, SetPageError, ClearPageError, PageError);
#endif /* BTRFS_SUBPAGE_H */
--
2.29.2
* [PATCH v2 11/18] btrfs: extent_io: make set/clear_extent_buffer_uptodate() to support subpage size
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
` (9 preceding siblings ...)
2020-12-10 6:38 ` [PATCH v2 10/18] btrfs: subpage: introduce helper for subpage error status Qu Wenruo
@ 2020-12-10 6:38 ` Qu Wenruo
2020-12-10 6:38 ` [PATCH v2 12/18] btrfs: extent_io: implement try_release_extent_buffer() for subpage metadata support Qu Wenruo
` (6 subsequent siblings)
17 siblings, 0 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:38 UTC (permalink / raw)
To: linux-btrfs
For those functions, supporting subpage size just requires calling the
btrfs_page_set/clear_uptodate() wrappers.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ee81a2a1baa2..141e414b1ab9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5611,30 +5611,33 @@ bool set_extent_buffer_dirty(struct extent_buffer *eb)
void clear_extent_buffer_uptodate(struct extent_buffer *eb)
{
- int i;
+ struct btrfs_fs_info *fs_info = eb->fs_info;
struct page *page;
int num_pages;
+ int i;
clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
num_pages = num_extent_pages(eb);
for (i = 0; i < num_pages; i++) {
page = eb->pages[i];
if (page)
- ClearPageUptodate(page);
+ btrfs_page_clear_uptodate(fs_info, page,
+ eb->start, eb->len);
}
}
void set_extent_buffer_uptodate(struct extent_buffer *eb)
{
- int i;
+ struct btrfs_fs_info *fs_info = eb->fs_info;
struct page *page;
int num_pages;
+ int i;
set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
num_pages = num_extent_pages(eb);
for (i = 0; i < num_pages; i++) {
page = eb->pages[i];
- SetPageUptodate(page);
+ btrfs_page_set_uptodate(fs_info, page, eb->start, eb->len);
}
}
--
2.29.2
* [PATCH v2 12/18] btrfs: extent_io: implement try_release_extent_buffer() for subpage metadata support
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
` (10 preceding siblings ...)
2020-12-10 6:38 ` [PATCH v2 11/18] btrfs: extent_io: make set/clear_extent_buffer_uptodate() to support subpage size Qu Wenruo
@ 2020-12-10 6:38 ` Qu Wenruo
2020-12-11 12:00 ` Nikolay Borisov
2020-12-10 6:39 ` [PATCH v2 13/18] btrfs: extent_io: introduce read_extent_buffer_subpage() Qu Wenruo
` (5 subsequent siblings)
17 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:38 UTC (permalink / raw)
To: linux-btrfs
Unlike the original try_release_extent_buffer,
try_release_subpage_extent_buffer() will iterate through
btrfs_subpage::tree_block_bitmap, and try to release each extent buffer.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 73 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 141e414b1ab9..4d55803302e9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -6258,10 +6258,83 @@ void memmove_extent_buffer(const struct extent_buffer *dst,
}
}
+static int try_release_subpage_extent_buffer(struct page *page)
+{
+ struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+ u64 page_start = page_offset(page);
+ int bitmap_size = BTRFS_SUBPAGE_BITMAP_SIZE;
+ int bit_start = 0;
+ int ret;
+
+ while (bit_start < bitmap_size) {
+ struct btrfs_subpage *subpage;
+ struct extent_buffer *eb;
+ unsigned long flags;
+ u16 tmp = 1 << bit_start;
+ u64 start;
+
+ /*
+ * Make sure the page still has private attached, as a previous
+ * iteration can have detached it.
+ */
+ spin_lock(&page->mapping->private_lock);
+ if (!PagePrivate(page)) {
+ spin_unlock(&page->mapping->private_lock);
+ break;
+ }
+ subpage = (struct btrfs_subpage *)page->private;
+ spin_unlock(&page->mapping->private_lock);
+
+ spin_lock_irqsave(&subpage->lock, flags);
+ if (!(tmp & subpage->tree_block_bitmap)) {
+ spin_unlock_irqrestore(&subpage->lock, flags);
+ bit_start++;
+ continue;
+ }
+ spin_unlock_irqrestore(&subpage->lock, flags);
+
+ start = bit_start * fs_info->sectorsize + page_start;
+ bit_start += fs_info->nodesize >> fs_info->sectorsize_bits;
+ /*
+ * Here we can't call find_extent_buffer() which will increase
+ * eb->refs.
+ */
+ rcu_read_lock();
+ eb = radix_tree_lookup(&fs_info->buffer_radix,
+ start >> fs_info->sectorsize_bits);
+ rcu_read_unlock();
+ ASSERT(eb);
+ spin_lock(&eb->refs_lock);
+ if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb) ||
+ !test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
+ spin_unlock(&eb->refs_lock);
+ continue;
+ }
+ /*
+ * Here we don't care about the return value, as we will always
+ * check the page private at the end.
+ * And release_extent_buffer() will release the refs_lock.
+ */
+ release_extent_buffer(eb);
+ }
+ /* Finally check if we have cleared the page private */
+ spin_lock(&page->mapping->private_lock);
+ if (!PagePrivate(page))
+ ret = 1;
+ else
+ ret = 0;
+ spin_unlock(&page->mapping->private_lock);
+ return ret;
+
+}
+
int try_release_extent_buffer(struct page *page)
{
struct extent_buffer *eb;
+ if (btrfs_sb(page->mapping->host->i_sb)->sectorsize < PAGE_SIZE)
+ return try_release_subpage_extent_buffer(page);
+
/*
* We need to make sure nobody is attaching this page to an eb right
* now.
--
2.29.2
* [PATCH v2 13/18] btrfs: extent_io: introduce read_extent_buffer_subpage()
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
` (11 preceding siblings ...)
2020-12-10 6:38 ` [PATCH v2 12/18] btrfs: extent_io: implement try_release_extent_buffer() for subpage metadata support Qu Wenruo
@ 2020-12-10 6:39 ` Qu Wenruo
2020-12-10 6:39 ` [PATCH v2 14/18] btrfs: extent_io: make endio_readpage_update_page_status() to handle subpage case Qu Wenruo
` (4 subsequent siblings)
17 siblings, 0 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:39 UTC (permalink / raw)
To: linux-btrfs
Introduce a new helper, read_extent_buffer_subpage(), to do the subpage
extent buffer read.
The differences between the regular and subpage routines are:
- No page locking
Here we completely rely on extent locking.
Page locking can reduce the concurrency greatly, as if we lock one
page to read one extent buffer, all the other extent buffers in the
same page will have to wait.
- Extent uptodate condition
In addition to the existing PageUptodate() and EXTENT_BUFFER_UPTODATE
checks, we also need to check btrfs_subpage::uptodate_bitmap.
- No page loop
There is just one page, no need to loop, which greatly simplifies the
subpage routine.
This patch only implements the bio submit part; endio support comes in
later patches.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/disk-io.c | 1 +
fs/btrfs/extent_io.c | 70 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 71 insertions(+)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 765deefda92b..b6c03a8b0c72 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -602,6 +602,7 @@ int btrfs_validate_metadata_buffer(struct btrfs_io_bio *io_bio,
ASSERT(page->private);
eb = (struct extent_buffer *)page->private;
+
/*
* The pending IO might have been the only thing that kept this buffer
* in memory. Make sure we have a ref for all this other checks
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4d55803302e9..1ec9de2aa910 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5641,6 +5641,73 @@ void set_extent_buffer_uptodate(struct extent_buffer *eb)
}
}
+static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
+ int mirror_num)
+{
+ struct btrfs_fs_info *fs_info = eb->fs_info;
+ struct extent_io_tree *io_tree;
+ struct page *page = eb->pages[0];
+ struct bio *bio = NULL;
+ int ret = 0;
+
+ ASSERT(!test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags));
+ ASSERT(PagePrivate(page));
+ io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
+
+ if (wait == WAIT_NONE) {
+ ret = try_lock_extent(io_tree, eb->start,
+ eb->start + eb->len - 1);
+ if (ret <= 0)
+ return ret;
+ } else {
+ ret = lock_extent(io_tree, eb->start, eb->start + eb->len - 1);
+ if (ret < 0)
+ return ret;
+ }
+
+ ret = 0;
+ if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags) ||
+ PageUptodate(page) ||
+ btrfs_subpage_test_uptodate(fs_info, page, eb->start, eb->len)) {
+ set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
+ unlock_extent(io_tree, eb->start, eb->start + eb->len - 1);
+ return ret;
+ }
+
+ clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
+ eb->read_mirror = 0;
+ atomic_set(&eb->io_pages, 1);
+ check_buffer_tree_ref(eb);
+
+ ret = submit_extent_page(REQ_OP_READ | REQ_META, NULL, page, eb->start,
+ eb->len, eb->start - page_offset(page), &bio,
+ end_bio_extent_readpage, mirror_num, 0, 0,
+ true);
+ if (ret) {
+ /*
+ * In the endio function, if we hit something wrong we will
+ * increase io_pages, so here we need to decrease it for the
+ * error path.
+ */
+ atomic_dec(&eb->io_pages);
+ }
+ if (bio) {
+ int tmp;
+
+ tmp = submit_one_bio(bio, mirror_num, 0);
+ if (tmp < 0)
+ return tmp;
+ }
+ if (ret || wait != WAIT_COMPLETE)
+ return ret;
+
+ wait_extent_bit(io_tree, eb->start, eb->start + eb->len - 1,
+ EXTENT_LOCKED);
+ if (!test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
+ ret = -EIO;
+ return ret;
+}
+
int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
{
int i;
@@ -5657,6 +5724,9 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
return 0;
+ if (eb->fs_info->sectorsize < PAGE_SIZE)
+ return read_extent_buffer_subpage(eb, wait, mirror_num);
+
num_pages = num_extent_pages(eb);
for (i = 0; i < num_pages; i++) {
page = eb->pages[i];
--
2.29.2
* [PATCH v2 14/18] btrfs: extent_io: make endio_readpage_update_page_status() to handle subpage case
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
` (12 preceding siblings ...)
2020-12-10 6:39 ` [PATCH v2 13/18] btrfs: extent_io: introduce read_extent_buffer_subpage() Qu Wenruo
@ 2020-12-10 6:39 ` Qu Wenruo
2020-12-14 9:57 ` Nikolay Borisov
2020-12-10 6:39 ` [PATCH v2 15/18] btrfs: disk-io: introduce subpage metadata validation check Qu Wenruo
` (3 subsequent siblings)
17 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:39 UTC (permalink / raw)
To: linux-btrfs
To handle subpage status update, add the following new tricks:
- Use btrfs_page_*() helpers to update page status
Now we can handle both cases well.
- No page unlock for subpage metadata
Since subpage metadata doesn't utilize page locking at all, skip it.
For subpage data locking, it's handled in later commits.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1ec9de2aa910..64a19c1884fc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2841,15 +2841,26 @@ static void endio_readpage_release_extent(struct processed_extent *processed,
processed->uptodate = uptodate;
}
-static void endio_readpage_update_page_status(struct page *page, bool uptodate)
+static void endio_readpage_update_page_status(struct page *page, bool uptodate,
+ u64 start, u64 end)
{
+ struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+ u32 len;
+
+ ASSERT(page_offset(page) <= start &&
+ end <= page_offset(page) + PAGE_SIZE - 1);
+ len = end + 1 - start;
+
if (uptodate) {
- SetPageUptodate(page);
+ btrfs_page_set_uptodate(fs_info, page, start, len);
} else {
- ClearPageUptodate(page);
- SetPageError(page);
+ btrfs_page_clear_uptodate(fs_info, page, start, len);
+ btrfs_page_set_error(fs_info, page, start, len);
}
- unlock_page(page);
+
+ if (fs_info->sectorsize == PAGE_SIZE)
+ unlock_page(page);
+ /* Subpage locking will be handled in later patches */
}
/*
@@ -2986,7 +2997,7 @@ static void end_bio_extent_readpage(struct bio *bio)
bio_offset += len;
/* Update page status and unlock */
- endio_readpage_update_page_status(page, uptodate);
+ endio_readpage_update_page_status(page, uptodate, start, end);
endio_readpage_release_extent(&processed, BTRFS_I(inode),
start, end, uptodate);
}
--
2.29.2
* [PATCH v2 15/18] btrfs: disk-io: introduce subpage metadata validation check
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
` (13 preceding siblings ...)
2020-12-10 6:39 ` [PATCH v2 14/18] btrfs: extent_io: make endio_readpage_update_page_status() to handle subpage case Qu Wenruo
@ 2020-12-10 6:39 ` Qu Wenruo
2020-12-10 13:24 ` kernel test robot
` (2 more replies)
2020-12-10 6:39 ` [PATCH v2 16/18] btrfs: introduce btrfs_subpage for data inodes Qu Wenruo
` (2 subsequent siblings)
17 siblings, 3 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:39 UTC (permalink / raw)
To: linux-btrfs
For the subpage metadata validation check, there are some differences:
- Read must finish in one bvec
Since we're just reading one subpage range in one page, it should
never be split into two bios nor two bvecs.
- How to grab the existing eb
Instead of grabbing the eb using page->private, we have to search the
radix tree, as we don't have any direct pointer at hand.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/disk-io.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 82 insertions(+)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b6c03a8b0c72..adda76895058 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -591,6 +591,84 @@ static int validate_extent_buffer(struct extent_buffer *eb)
return ret;
}
+static int validate_subpage_buffer(struct page *page, u64 start, u64 end,
+ int mirror)
+{
+ struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+ struct extent_buffer *eb;
+ int reads_done;
+ int ret = 0;
+
+ if (!IS_ALIGNED(start, fs_info->sectorsize) ||
+ !IS_ALIGNED(end - start + 1, fs_info->sectorsize) ||
+ !IS_ALIGNED(end - start + 1, fs_info->nodesize)) {
+ WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
+ btrfs_err(fs_info, "invalid tree read bytenr");
+ return -EUCLEAN;
+ }
+
+ /*
+ * We don't allow bio merge for subpage metadata read, so we should
+ * only get one eb for each endio hook.
+ */
+ ASSERT(end == start + fs_info->nodesize - 1);
+ ASSERT(PagePrivate(page));
+
+ rcu_read_lock();
+ eb = radix_tree_lookup(&fs_info->buffer_radix,
+ start / fs_info->sectorsize);
+ rcu_read_unlock();
+
+ /*
+ * When we are reading one tree block, eb must have been
+ * inserted into the radix tree. If not something is wrong.
+ */
+ if (!eb) {
+ WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
+ btrfs_err(fs_info,
+ "can't find extent buffer for bytenr %llu",
+ start);
+ return -EUCLEAN;
+ }
+ /*
+ * The pending IO might have been the only thing that kept
+ * this buffer in memory. Make sure we have a ref for all
+ * these other checks
+ */
+ atomic_inc(&eb->refs);
+
+ reads_done = atomic_dec_and_test(&eb->io_pages);
+ /* Subpage read must finish in page read */
+ ASSERT(reads_done);
+
+ eb->read_mirror = mirror;
+ if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
+ ret = -EIO;
+ goto err;
+ }
+ ret = validate_extent_buffer(eb);
+ if (ret < 0)
+ goto err;
+
+ if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
+ btree_readahead_hook(eb, ret);
+
+ set_extent_buffer_uptodate(eb);
+
+ free_extent_buffer(eb);
+ return ret;
+err:
+ /*
+ * Our IO error hook is going to decrement io_pages again,
+ * so we have to make sure it has something to decrement.
+ */
+ atomic_inc(&eb->io_pages);
+ clear_extent_buffer_uptodate(eb);
+ free_extent_buffer(eb);
+ return ret;
+}
+
int btrfs_validate_metadata_buffer(struct btrfs_io_bio *io_bio,
struct page *page, u64 start, u64 end,
int mirror)
@@ -600,6 +678,10 @@ int btrfs_validate_metadata_buffer(struct btrfs_io_bio *io_bio,
int reads_done;
ASSERT(page->private);
+
+ if (btrfs_sb(page->mapping->host->i_sb)->sectorsize < PAGE_SIZE)
+ return validate_subpage_buffer(page, start, end, mirror);
+
eb = (struct extent_buffer *)page->private;
--
2.29.2
* [PATCH v2 16/18] btrfs: introduce btrfs_subpage for data inodes
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
` (14 preceding siblings ...)
2020-12-10 6:39 ` [PATCH v2 15/18] btrfs: disk-io: introduce subpage metadata validation check Qu Wenruo
@ 2020-12-10 6:39 ` Qu Wenruo
2020-12-10 9:44 ` kernel test robot
` (2 more replies)
2020-12-10 6:39 ` [PATCH v2 17/18] btrfs: integrate page status update for read path into begin/end_page_read() Qu Wenruo
2020-12-10 6:39 ` [PATCH v2 18/18] btrfs: allow RO mount of 4K sector size fs on 64K page system Qu Wenruo
17 siblings, 3 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:39 UTC (permalink / raw)
To: linux-btrfs
To support subpage sector size, data pages also need extra info to track
which sectors in a page are uptodate/dirty/...
This patch makes pages for data inodes get a btrfs_subpage structure
attached, and detaches it when the page is freed.
This patch also slightly changes the timing of when
set_page_extent_mapped() is called, to make sure:
- We have page->mapping set
page->mapping->host is used to grab btrfs_fs_info, thus we can only
call this function after the page is mapped to an inode.
One call site attaches pages to an inode manually, thus we have to
modify the timing of set_page_extent_mapped() a little.
- It is called as soon as possible, before other operations
Since memory allocation can fail, we have to do extra error handling.
Calling set_page_extent_mapped() as soon as possible simplifies the
error handling for several call sites.
The idea is pretty much the same as iomap_page, but with more bitmaps
for btrfs specific cases.
Currently the plan is to switch to iomap if iomap can provide
sector-aligned writeback (write back only the dirty sectors, not the
full page; data balance requires this feature).
So we will stick to the btrfs-specific bitmaps for now.
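For illustration, the attach/detach decision boils down to something like the following userspace sketch. This is not the kernel code: struct names, the marker value, and the sectorsize parameter are simplified stand-ins for PagePrivate()/attach_page_private()/btrfs_attach_subpage(). The point is that the subpage case performs a real allocation that can fail, which is why callers must now check the return value and why the call is moved as early as possible:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 65536u

/* Hypothetical stand-ins for the kernel structures. */
struct subpage { uint16_t uptodate_bitmap; };
struct page { void *private; };

/*
 * Native case: attach a cheap marker that cannot fail.
 * Subpage case: allocate a real per-page structure, which CAN fail,
 * so the function now returns an error code the caller must check.
 */
static int set_page_extent_mapped_sketch(struct page *page, uint32_t sectorsize)
{
	if (page->private)
		return 0; /* already attached, nothing to do */
	if (sectorsize == PAGE_SIZE) {
		page->private = (void *)1; /* stands in for EXTENT_PAGE_PRIVATE */
		return 0;
	}
	page->private = calloc(1, sizeof(struct subpage));
	return page->private ? 0 : -ENOMEM;
}
```

Calling this before any other page operation means a failure can be handled by a plain unlock/put of the page, instead of unwinding half-finished state later in the function.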
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/compression.c | 10 ++++++--
fs/btrfs/extent_io.c | 47 +++++++++++++++++++++++++++++++++----
fs/btrfs/extent_io.h | 3 ++-
fs/btrfs/file.c | 10 +++++---
fs/btrfs/free-space-cache.c | 15 +++++++++---
fs/btrfs/inode.c | 12 ++++++----
fs/btrfs/ioctl.c | 5 +++-
fs/btrfs/reflink.c | 5 +++-
fs/btrfs/relocation.c | 12 ++++++++--
9 files changed, 98 insertions(+), 21 deletions(-)
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 5ae3fa0386b7..6d203acfdeb3 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -542,13 +542,19 @@ static noinline int add_ra_bio_pages(struct inode *inode,
goto next;
}
- end = last_offset + PAGE_SIZE - 1;
/*
* at this point, we have a locked page in the page cache
* for these bytes in the file. But, we have to make
* sure they map to this compressed extent on disk.
*/
- set_page_extent_mapped(page);
+ ret = set_page_extent_mapped(page);
+ if (ret < 0) {
+ unlock_page(page);
+ put_page(page);
+ break;
+ }
+
+ end = last_offset + PAGE_SIZE - 1;
lock_extent(tree, last_offset, end);
read_lock(&em_tree->lock);
em = lookup_extent_mapping(em_tree, last_offset,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 64a19c1884fc..4e4ed9c453ae 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3191,10 +3191,40 @@ static int attach_extent_buffer_page(struct extent_buffer *eb,
return 0;
}
-void set_page_extent_mapped(struct page *page)
+int __must_check set_page_extent_mapped(struct page *page)
{
- if (!PagePrivate(page))
+ struct btrfs_fs_info *fs_info;
+
+ ASSERT(page->mapping);
+
+ if (PagePrivate(page))
+ return 0;
+
+ fs_info = btrfs_sb(page->mapping->host->i_sb);
+ if (fs_info->sectorsize == PAGE_SIZE) {
attach_page_private(page, (void *)EXTENT_PAGE_PRIVATE);
+ return 0;
+ }
+
+ return btrfs_attach_subpage(fs_info, page);
+}
+
+void clear_page_extent_mapped(struct page *page)
+{
+ struct btrfs_fs_info *fs_info;
+
+ ASSERT(page->mapping);
+
+ if (!PagePrivate(page))
+ return;
+
+ fs_info = btrfs_sb(page->mapping->host->i_sb);
+ if (fs_info->sectorsize == PAGE_SIZE) {
+ detach_page_private(page);
+ return;
+ }
+
+ btrfs_detach_subpage(fs_info, page);
}
static struct extent_map *
@@ -3251,7 +3281,12 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
unsigned long this_bio_flag = 0;
struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
- set_page_extent_mapped(page);
+ ret = set_page_extent_mapped(page);
+ if (ret < 0) {
+ unlock_extent(tree, start, end);
+ SetPageError(page);
+ goto out;
+ }
if (!PageUptodate(page)) {
if (cleancache_get_page(page) == 0) {
@@ -3693,7 +3728,11 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
flush_dcache_page(page);
}
- set_page_extent_mapped(page);
+ ret = set_page_extent_mapped(page);
+ if (ret < 0) {
+ SetPageError(page);
+ goto done;
+ }
if (!epd->extent_locked) {
ret = writepage_delalloc(BTRFS_I(inode), page, wbc, start,
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 19221095c635..349d044c1254 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -178,7 +178,8 @@ int btree_write_cache_pages(struct address_space *mapping,
void extent_readahead(struct readahead_control *rac);
int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 len);
-void set_page_extent_mapped(struct page *page);
+int __must_check set_page_extent_mapped(struct page *page);
+void clear_page_extent_mapped(struct page *page);
struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
u64 start, u64 owner_root, int level);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a29b50208eee..9b878616b489 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1373,6 +1373,12 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
goto fail;
}
+ err = set_page_extent_mapped(pages[i]);
+ if (err < 0) {
+ faili = i;
+ goto fail;
+ }
+
if (i == 0)
err = prepare_uptodate_page(inode, pages[i], pos,
force_uptodate);
@@ -1470,10 +1476,8 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
* We'll call btrfs_dirty_pages() later on, and that will flip around
* delalloc bits and dirty the pages as required.
*/
- for (i = 0; i < num_pages; i++) {
- set_page_extent_mapped(pages[i]);
+ for (i = 0; i < num_pages; i++)
WARN_ON(!PageLocked(pages[i]));
- }
return ret;
}
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 71d0d14bc18b..c347b415060a 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -431,11 +431,22 @@ static int io_ctl_prepare_pages(struct btrfs_io_ctl *io_ctl, bool uptodate)
int i;
for (i = 0; i < io_ctl->num_pages; i++) {
+ int ret;
+
page = find_or_create_page(inode->i_mapping, i, mask);
if (!page) {
io_ctl_drop_pages(io_ctl);
return -ENOMEM;
}
+
+ ret = set_page_extent_mapped(page);
+ if (ret < 0) {
+ unlock_page(page);
+ put_page(page);
+ io_ctl_drop_pages(io_ctl);
+ return -ENOMEM;
+ }
+
io_ctl->pages[i] = page;
if (uptodate && !PageUptodate(page)) {
btrfs_readpage(NULL, page);
@@ -455,10 +466,8 @@ static int io_ctl_prepare_pages(struct btrfs_io_ctl *io_ctl, bool uptodate)
}
}
- for (i = 0; i < io_ctl->num_pages; i++) {
+ for (i = 0; i < io_ctl->num_pages; i++)
clear_page_dirty_for_io(io_ctl->pages[i]);
- set_page_extent_mapped(io_ctl->pages[i]);
- }
return 0;
}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 070716650df8..5b64715df92e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4720,6 +4720,9 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
ret = -ENOMEM;
goto out;
}
+ ret = set_page_extent_mapped(page);
+ if (ret < 0)
+ goto out_unlock;
if (!PageUptodate(page)) {
ret = btrfs_readpage(NULL, page);
@@ -4737,7 +4740,6 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
wait_on_page_writeback(page);
lock_extent_bits(io_tree, block_start, block_end, &cached_state);
- set_page_extent_mapped(page);
ordered = btrfs_lookup_ordered_extent(inode, block_start);
if (ordered) {
@@ -8117,7 +8119,7 @@ static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
{
int ret = try_release_extent_mapping(page, gfp_flags);
if (ret == 1)
- detach_page_private(page);
+ clear_page_extent_mapped(page);
return ret;
}
@@ -8276,7 +8278,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
}
ClearPageChecked(page);
- detach_page_private(page);
+ clear_page_extent_mapped(page);
}
/*
@@ -8355,7 +8357,9 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
wait_on_page_writeback(page);
lock_extent_bits(io_tree, page_start, page_end, &cached_state);
- set_page_extent_mapped(page);
+ ret = set_page_extent_mapped(page);
+ if (ret < 0)
+ goto out_unlock;
/*
* we can't set the delalloc bits if there are pending ordered
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index dde49a791f3e..1d58ffb9212f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1319,6 +1319,10 @@ static int cluster_pages_for_defrag(struct inode *inode,
if (!page)
break;
+ ret = set_page_extent_mapped(page);
+ if (ret < 0)
+ break;
+
page_start = page_offset(page);
page_end = page_start + PAGE_SIZE - 1;
while (1) {
@@ -1440,7 +1444,6 @@ static int cluster_pages_for_defrag(struct inode *inode,
for (i = 0; i < i_done; i++) {
clear_page_dirty_for_io(pages[i]);
ClearPageChecked(pages[i]);
- set_page_extent_mapped(pages[i]);
set_page_dirty(pages[i]);
unlock_page(pages[i]);
put_page(pages[i]);
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index b03e7891394e..b24396cf2f99 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -81,7 +81,10 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
goto out_unlock;
}
- set_page_extent_mapped(page);
+ ret = set_page_extent_mapped(page);
+ if (ret < 0)
+ goto out_unlock;
+
clear_extent_bit(&inode->io_tree, file_offset, range_end,
EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
0, 0, NULL);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 19b7db8b2117..41ee0f376af3 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2679,6 +2679,16 @@ static int relocate_file_extent_cluster(struct inode *inode,
goto out;
}
}
+ ret = set_page_extent_mapped(page);
+ if (ret < 0) {
+ btrfs_delalloc_release_metadata(BTRFS_I(inode),
+ PAGE_SIZE, true);
+ btrfs_delalloc_release_extents(BTRFS_I(inode),
+ PAGE_SIZE);
+ unlock_page(page);
+ put_page(page);
+ goto out;
+ }
if (PageReadahead(page)) {
page_cache_async_readahead(inode->i_mapping,
@@ -2706,8 +2716,6 @@ static int relocate_file_extent_cluster(struct inode *inode,
lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
- set_page_extent_mapped(page);
-
if (nr < cluster->nr &&
page_start + offset == cluster->boundary[nr]) {
set_extent_bits(&BTRFS_I(inode)->io_tree,
--
2.29.2
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH v2 17/18] btrfs: integrate page status update for read path into begin/end_page_read()
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
` (15 preceding siblings ...)
2020-12-10 6:39 ` [PATCH v2 16/18] btrfs: introduce btrfs_subpage for data inodes Qu Wenruo
@ 2020-12-10 6:39 ` Qu Wenruo
2020-12-14 13:59 ` Nikolay Borisov
2020-12-10 6:39 ` [PATCH v2 18/18] btrfs: allow RO mount of 4K sector size fs on 64K page system Qu Wenruo
17 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:39 UTC (permalink / raw)
To: linux-btrfs
In the btrfs data page read path, page status updates are handled in two
different locations:
btrfs_do_read_page()
{
while (cur <= end) {
/* No need to read from disk */
if (HOLE/PREALLOC/INLINE){
memset();
set_extent_uptodate();
continue;
}
/* Read from disk */
ret = submit_extent_page(end_bio_extent_readpage);
}
end_bio_extent_readpage()
{
endio_readpage_uptodate_page_status();
}
This is fine for the sectorsize == PAGE_SIZE case, as the above loop
only hits one branch and then exits.
But for subpage, there is more work to be done for page status updates:
- Page Unlock condition
Unlike the regular sectorsize == PAGE_SIZE case, we can no longer just
unlock the page.
Only the last reader of the page can unlock it.
This means we may unlock the page either in the while() loop, or in
the endio function.
- Page uptodate condition
Since we have multiple sectors to read for a page, we can only mark
the full page uptodate if all sectors are uptodate.
To handle both subpage and regular cases, introduce a pair of functions
to help handling page status update:
- begin_page_read()
For the regular case, it does nothing.
For the subpage case, it updates the reader counter so that the later
end_page_read() can know who is the last one to unlock the page.
- end_page_read()
This is just endio_readpage_uptodate_page_status() renamed.
The original name is a little too long and too specific for endio.
The only new trick added is the condition for page unlock:
for subpage data, we unlock the page only if we're the last reader.
This not only provides the basis for subpage data read, but also hides
the special handling of page reads from the main read loop.
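The reader-counting scheme can be modeled in a few lines of userspace C. This is an illustrative sketch, not the kernel implementation: the global `page_locked` flag stands in for the real page lock, and begin/end correspond to btrfs_subpage_start_reader()/btrfs_subpage_end_reader(). One "reader" is armed per sector, and whichever range finishes last drops the lock:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE  65536u
#define SECTORSIZE 4096u

/* Hypothetical model of the per-page reader accounting. */
struct subpage_model { atomic_int readers; };

static bool page_locked = true; /* page starts locked for read */

/* begin_page_read(): arm one reader per sector in the page */
static void start_reader(struct subpage_model *sp, uint32_t len)
{
	atomic_store(&sp->readers, (int)(len / SECTORSIZE));
}

/*
 * end_page_read(): drop the readers covered by [start, start + len);
 * the caller that brings the count to zero unlocks the page, which is
 * what atomic_sub_and_test() expresses in the kernel patch.
 */
static void end_reader(struct subpage_model *sp, uint32_t len)
{
	int nbits = (int)(len / SECTORSIZE);

	if (atomic_fetch_sub(&sp->readers, nbits) - nbits == 0)
		page_locked = false; /* stands in for unlock_page() */
}
```

Because each finished range (whether completed inline for holes/inline extents, or in the endio handler after real I/O) calls the same end helper, the main read loop no longer needs a special-case unlock at the end.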
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 39 +++++++++++++++++++++++++-----------
fs/btrfs/subpage.h | 47 ++++++++++++++++++++++++++++++++++++++------
2 files changed, 68 insertions(+), 18 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4e4ed9c453ae..56174e7f0ae8 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2841,8 +2841,18 @@ static void endio_readpage_release_extent(struct processed_extent *processed,
processed->uptodate = uptodate;
}
-static void endio_readpage_update_page_status(struct page *page, bool uptodate,
- u64 start, u64 end)
+static void begin_data_page_read(struct btrfs_fs_info *fs_info, struct page *page)
+{
+ ASSERT(PageLocked(page));
+ if (fs_info->sectorsize == PAGE_SIZE)
+ return;
+
+ ASSERT(PagePrivate(page) && page->private);
+ ASSERT(page->mapping->host != fs_info->btree_inode);
+ btrfs_subpage_start_reader(fs_info, page, page_offset(page), PAGE_SIZE);
+}
+
+static void end_page_read(struct page *page, bool uptodate, u64 start, u64 end)
{
struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
u32 len;
@@ -2860,7 +2870,12 @@ static void endio_readpage_update_page_status(struct page *page, bool uptodate,
if (fs_info->sectorsize == PAGE_SIZE)
unlock_page(page);
- /* Subpage locking will be handled in later patches */
+ else if (page->mapping->host != fs_info->btree_inode)
+ /*
+ * For subpage data, unlock the page if we're the last reader.
+ * For subpage metadata, page lock is not utilized for read.
+ */
+ btrfs_subpage_end_reader(fs_info, page, start, len);
}
/*
@@ -2997,7 +3012,7 @@ static void end_bio_extent_readpage(struct bio *bio)
bio_offset += len;
/* Update page status and unlock */
- endio_readpage_update_page_status(page, uptodate, start, end);
+ end_page_read(page, uptodate, start, end);
endio_readpage_release_extent(&processed, BTRFS_I(inode),
start, end, uptodate);
}
@@ -3265,6 +3280,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
unsigned int read_flags, u64 *prev_em_start)
{
struct inode *inode = page->mapping->host;
+ struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
u64 start = page_offset(page);
const u64 end = start + PAGE_SIZE - 1;
u64 cur = start;
@@ -3308,6 +3324,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
kunmap_atomic(userpage);
}
}
+ begin_data_page_read(fs_info, page);
while (cur <= end) {
bool force_bio_submit = false;
u64 disk_bytenr;
@@ -3325,13 +3342,14 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
&cached, GFP_NOFS);
unlock_extent_cached(tree, cur,
cur + iosize - 1, &cached);
+ end_page_read(page, true, cur, cur + iosize - 1);
break;
}
em = __get_extent_map(inode, page, pg_offset, cur,
end - cur + 1, em_cached);
if (IS_ERR_OR_NULL(em)) {
- SetPageError(page);
unlock_extent(tree, cur, end);
+ end_page_read(page, false, cur, end);
break;
}
extent_offset = cur - em->start;
@@ -3414,6 +3432,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
&cached, GFP_NOFS);
unlock_extent_cached(tree, cur,
cur + iosize - 1, &cached);
+ end_page_read(page, true, cur, cur + iosize - 1);
cur = cur + iosize;
pg_offset += iosize;
continue;
@@ -3423,6 +3442,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
EXTENT_UPTODATE, 1, NULL)) {
check_page_uptodate(tree, page);
unlock_extent(tree, cur, cur + iosize - 1);
+ end_page_read(page, true, cur, cur + iosize - 1);
cur = cur + iosize;
pg_offset += iosize;
continue;
@@ -3431,8 +3451,8 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
* to date. Error out
*/
if (block_start == EXTENT_MAP_INLINE) {
- SetPageError(page);
unlock_extent(tree, cur, cur + iosize - 1);
+ end_page_read(page, false, cur, cur + iosize - 1);
cur = cur + iosize;
pg_offset += iosize;
continue;
@@ -3449,19 +3469,14 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
nr++;
*bio_flags = this_bio_flag;
} else {
- SetPageError(page);
unlock_extent(tree, cur, cur + iosize - 1);
+ end_page_read(page, false, cur, cur + iosize - 1);
goto out;
}
cur = cur + iosize;
pg_offset += iosize;
}
out:
- if (!nr) {
- if (!PageError(page))
- SetPageUptodate(page);
- unlock_page(page);
- }
return ret;
}
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 8592234d773e..6c801ef00d2d 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -31,6 +31,9 @@ struct btrfs_subpage {
u16 tree_block_bitmap;
};
/* structures only used by data */
+ struct {
+ atomic_t readers;
+ };
};
};
@@ -48,6 +51,17 @@ static inline void btrfs_subpage_clamp_range(struct page *page,
orig_start + orig_len) - *start;
}
+static inline void btrfs_subpage_assert(struct btrfs_fs_info *fs_info,
+ struct page *page, u64 start, u32 len)
+{
+ /* Basic checks */
+ ASSERT(PagePrivate(page) && page->private);
+ ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
+ IS_ALIGNED(len, fs_info->sectorsize));
+ ASSERT(page_offset(page) <= start &&
+ start + len <= page_offset(page) + PAGE_SIZE);
+}
+
/*
* Convert the [start, start + len) range into a u16 bitmap
*
@@ -59,12 +73,8 @@ static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,
int bit_start = (start - page_offset(page)) >> fs_info->sectorsize_bits;
int nbits = len >> fs_info->sectorsize_bits;
- /* Basic checks */
- ASSERT(PagePrivate(page) && page->private);
- ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
- IS_ALIGNED(len, fs_info->sectorsize));
- ASSERT(page_offset(page) <= start &&
- start + len <= page_offset(page) + PAGE_SIZE);
+ btrfs_subpage_assert(fs_info, page, start, len);
+
/*
* Here nbits can be 16, thus can go beyond u16 range. Here we make the
* first left shift to be calculated in unsigned long (u32), then
@@ -73,6 +83,31 @@ static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,
return (u16)(((1UL << nbits) - 1) << bit_start);
}
+static inline void btrfs_subpage_start_reader(struct btrfs_fs_info *fs_info,
+ struct page *page, u64 start,
+ u32 len)
+{
+ struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+ int nbits = len >> fs_info->sectorsize_bits;
+
+ btrfs_subpage_assert(fs_info, page, start, len);
+
+ ASSERT(atomic_read(&subpage->readers) == 0);
+ atomic_set(&subpage->readers, nbits);
+}
+
+static inline void btrfs_subpage_end_reader(struct btrfs_fs_info *fs_info,
+ struct page *page, u64 start, u32 len)
+{
+ struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+ int nbits = len >> fs_info->sectorsize_bits;
+
+ btrfs_subpage_assert(fs_info, page, start, len);
+ ASSERT(atomic_read(&subpage->readers) >= nbits);
+ if (atomic_sub_and_test(nbits, &subpage->readers))
+ unlock_page(page);
+}
+
static inline void btrfs_subpage_set_tree_block(struct btrfs_fs_info *fs_info,
struct page *page, u64 start, u32 len)
{
--
2.29.2
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH v2 18/18] btrfs: allow RO mount of 4K sector size fs on 64K page system
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
` (16 preceding siblings ...)
2020-12-10 6:39 ` [PATCH v2 17/18] btrfs: integrate page status update for read path into begin/end_page_read() Qu Wenruo
@ 2020-12-10 6:39 ` Qu Wenruo
17 siblings, 0 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 6:39 UTC (permalink / raw)
To: linux-btrfs
This adds basic RO mount support for 4K sector size on 64K page
systems.
Currently we only plan to support 4K and 64K page sizes.
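The supported combinations reduce to a small matrix, sketched below as a standalone helper (illustrative only; the function name is made up, and the kernel check is split between validate_super(), which rejects invalid sector sizes, and open_ctree(), which separately enforces RO for the 64K-page/4K-sector case):

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_4K  4096u
#define SZ_64K 65536u

/*
 * On a 4K page kernel, only a 4K sector size is valid.
 * On a 64K page kernel, 64K sectors are fully supported and
 * 4K sectors are accepted (but, per this series, RO-only).
 * Any other combination is rejected.
 */
static bool sectorsize_valid(unsigned long page_size, uint64_t sectorsize)
{
	if (page_size == SZ_4K)
		return sectorsize == SZ_4K;
	if (page_size == SZ_64K)
		return sectorsize == SZ_4K || sectorsize == SZ_64K;
	return false;
}
```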
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/disk-io.c | 24 +++++++++++++++++++++---
fs/btrfs/super.c | 7 +++++++
2 files changed, 28 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index adda76895058..8ab6308ff852 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2510,13 +2510,21 @@ static int validate_super(struct btrfs_fs_info *fs_info,
btrfs_err(fs_info, "invalid sectorsize %llu", sectorsize);
ret = -EINVAL;
}
- /* Only PAGE SIZE is supported yet */
- if (sectorsize != PAGE_SIZE) {
+
+ /*
+ * For 4K page size, we only support 4K sector size.
+ * For 64K page size, we support RW for 64K sector size, and RO for
+ * 4K sector size.
+ */
+ if ((SZ_4K == PAGE_SIZE && sectorsize != PAGE_SIZE) ||
+ (SZ_64K == PAGE_SIZE && (sectorsize != SZ_4K &&
+ sectorsize != SZ_64K))) {
btrfs_err(fs_info,
- "sectorsize %llu not supported yet, only support %lu",
+ "sectorsize %llu not supported yet for page size %lu",
sectorsize, PAGE_SIZE);
ret = -EINVAL;
}
+
if (!is_power_of_2(nodesize) || nodesize < sectorsize ||
nodesize > BTRFS_MAX_METADATA_BLOCKSIZE) {
btrfs_err(fs_info, "invalid nodesize %llu", nodesize);
@@ -3272,6 +3280,16 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
goto fail_alloc;
}
+ /* For 4K sector size support, it's only read-only yet */
+ if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
+ if (!sb_rdonly(sb) || btrfs_super_log_root(disk_super)) {
+ btrfs_err(fs_info,
+ "subpage sector size only support RO yet");
+ err = -EINVAL;
+ goto fail_alloc;
+ }
+ }
+
ret = btrfs_init_workqueues(fs_info, fs_devices);
if (ret) {
err = ret;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 022f20810089..a8068c389d60 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1996,6 +1996,13 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
ret = -EINVAL;
goto restore;
}
+ if (fs_info->sectorsize < PAGE_SIZE) {
+ btrfs_warn(fs_info,
+ "read-write mount is not yet allowed for sector size %u page size %lu",
+ fs_info->sectorsize, PAGE_SIZE);
+ ret = -EINVAL;
+ goto restore;
+ }
/*
* NOTE: when remounting with a change that does writes, don't
--
2.29.2
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [PATCH v2 16/18] btrfs: introduce btrfs_subpage for data inodes
2020-12-10 6:39 ` [PATCH v2 16/18] btrfs: introduce btrfs_subpage for data inodes Qu Wenruo
@ 2020-12-10 9:44 ` kernel test robot
2020-12-11 0:43 ` kernel test robot
2020-12-14 12:46 ` Nikolay Borisov
2 siblings, 0 replies; 71+ messages in thread
From: kernel test robot @ 2020-12-10 9:44 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs; +Cc: kbuild-all
[-- Attachment #1: Type: text/plain, Size: 8390 bytes --]
Hi Qu,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on kdave/for-next]
[also build test WARNING on next-20201209]
[cannot apply to v5.10-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Qu-Wenruo/btrfs-add-read-only-support-for-subpage-sector-size/20201210-144442
base: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: x86_64-randconfig-s021-20201210 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.3-179-ga00755aa-dirty
# https://github.com/0day-ci/linux/commit/3852ff477c118432fb205a3422aa538dc8ac3a5f
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Qu-Wenruo/btrfs-add-read-only-support-for-subpage-sector-size/20201210-144442
git checkout 3852ff477c118432fb205a3422aa538dc8ac3a5f
# save the attached .config to linux build tree
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
"sparse warnings: (new ones prefixed by >>)"
>> fs/btrfs/inode.c:8360:13: sparse: sparse: incorrect type in assignment (different base types) @@ expected restricted vm_fault_t [assigned] [usertype] ret @@ got int @@
fs/btrfs/inode.c:8360:13: sparse: expected restricted vm_fault_t [assigned] [usertype] ret
fs/btrfs/inode.c:8360:13: sparse: got int
>> fs/btrfs/inode.c:8361:13: sparse: sparse: restricted vm_fault_t degrades to integer
vim +8360 fs/btrfs/inode.c
8283
8284 /*
8285 * btrfs_page_mkwrite() is not allowed to change the file size as it gets
8286 * called from a page fault handler when a page is first dirtied. Hence we must
8287 * be careful to check for EOF conditions here. We set the page up correctly
8288 * for a written page which means we get ENOSPC checking when writing into
8289 * holes and correct delalloc and unwritten extent mapping on filesystems that
8290 * support these features.
8291 *
8292 * We are not allowed to take the i_mutex here so we have to play games to
8293 * protect against truncate races as the page could now be beyond EOF. Because
8294 * truncate_setsize() writes the inode size before removing pages, once we have
8295 * the page lock we can determine safely if the page is beyond EOF. If it is not
8296 * beyond EOF, then the page is guaranteed safe against truncation until we
8297 * unlock the page.
8298 */
8299 vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
8300 {
8301 struct page *page = vmf->page;
8302 struct inode *inode = file_inode(vmf->vma->vm_file);
8303 struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
8304 struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
8305 struct btrfs_ordered_extent *ordered;
8306 struct extent_state *cached_state = NULL;
8307 struct extent_changeset *data_reserved = NULL;
8308 char *kaddr;
8309 unsigned long zero_start;
8310 loff_t size;
8311 vm_fault_t ret;
8312 int ret2;
8313 int reserved = 0;
8314 u64 reserved_space;
8315 u64 page_start;
8316 u64 page_end;
8317 u64 end;
8318
8319 reserved_space = PAGE_SIZE;
8320
8321 sb_start_pagefault(inode->i_sb);
8322 page_start = page_offset(page);
8323 page_end = page_start + PAGE_SIZE - 1;
8324 end = page_end;
8325
8326 /*
8327 * Reserving delalloc space after obtaining the page lock can lead to
8328 * deadlock. For example, if a dirty page is locked by this function
8329 * and the call to btrfs_delalloc_reserve_space() ends up triggering
8330 * dirty page write out, then the btrfs_writepage() function could
8331 * end up waiting indefinitely to get a lock on the page currently
8332 * being processed by btrfs_page_mkwrite() function.
8333 */
8334 ret2 = btrfs_delalloc_reserve_space(BTRFS_I(inode), &data_reserved,
8335 page_start, reserved_space);
8336 if (!ret2) {
8337 ret2 = file_update_time(vmf->vma->vm_file);
8338 reserved = 1;
8339 }
8340 if (ret2) {
8341 ret = vmf_error(ret2);
8342 if (reserved)
8343 goto out;
8344 goto out_noreserve;
8345 }
8346
8347 ret = VM_FAULT_NOPAGE; /* make the VM retry the fault */
8348 again:
8349 lock_page(page);
8350 size = i_size_read(inode);
8351
8352 if ((page->mapping != inode->i_mapping) ||
8353 (page_start >= size)) {
8354 /* page got truncated out from underneath us */
8355 goto out_unlock;
8356 }
8357 wait_on_page_writeback(page);
8358
8359 lock_extent_bits(io_tree, page_start, page_end, &cached_state);
> 8360 ret = set_page_extent_mapped(page);
> 8361 if (ret < 0)
8362 goto out_unlock;
8363
8364 /*
8365 * we can't set the delalloc bits if there are pending ordered
8366 * extents. Drop our locks and wait for them to finish
8367 */
8368 ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), page_start,
8369 PAGE_SIZE);
8370 if (ordered) {
8371 unlock_extent_cached(io_tree, page_start, page_end,
8372 &cached_state);
8373 unlock_page(page);
8374 btrfs_start_ordered_extent(ordered, 1);
8375 btrfs_put_ordered_extent(ordered);
8376 goto again;
8377 }
8378
8379 if (page->index == ((size - 1) >> PAGE_SHIFT)) {
8380 reserved_space = round_up(size - page_start,
8381 fs_info->sectorsize);
8382 if (reserved_space < PAGE_SIZE) {
8383 end = page_start + reserved_space - 1;
8384 btrfs_delalloc_release_space(BTRFS_I(inode),
8385 data_reserved, page_start,
8386 PAGE_SIZE - reserved_space, true);
8387 }
8388 }
8389
8390 /*
8391 * page_mkwrite gets called when the page is firstly dirtied after it's
8392 * faulted in, but write(2) could also dirty a page and set delalloc
8393 * bits, thus in this case for space account reason, we still need to
8394 * clear any delalloc bits within this page range since we have to
8395 * reserve data&meta space before lock_page() (see above comments).
8396 */
8397 clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start, end,
8398 EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
8399 EXTENT_DEFRAG, 0, 0, &cached_state);
8400
8401 ret2 = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start, end, 0,
8402 &cached_state);
8403 if (ret2) {
8404 unlock_extent_cached(io_tree, page_start, page_end,
8405 &cached_state);
8406 ret = VM_FAULT_SIGBUS;
8407 goto out_unlock;
8408 }
8409
8410 /* page is wholly or partially inside EOF */
8411 if (page_start + PAGE_SIZE > size)
8412 zero_start = offset_in_page(size);
8413 else
8414 zero_start = PAGE_SIZE;
8415
8416 if (zero_start != PAGE_SIZE) {
8417 kaddr = kmap(page);
8418 memset(kaddr + zero_start, 0, PAGE_SIZE - zero_start);
8419 flush_dcache_page(page);
8420 kunmap(page);
8421 }
8422 ClearPageChecked(page);
8423 set_page_dirty(page);
8424 SetPageUptodate(page);
8425
8426 BTRFS_I(inode)->last_trans = fs_info->generation;
8427 BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
8428 BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
8429
8430 unlock_extent_cached(io_tree, page_start, page_end, &cached_state);
8431
8432 btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
8433 sb_end_pagefault(inode->i_sb);
8434 extent_changeset_free(data_reserved);
8435 return VM_FAULT_LOCKED;
8436
8437 out_unlock:
8438 unlock_page(page);
8439 out:
8440 btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
8441 btrfs_delalloc_release_space(BTRFS_I(inode), data_reserved, page_start,
8442 reserved_space, (ret != 0));
8443 out_noreserve:
8444 sb_end_pagefault(inode->i_sb);
8445 extent_changeset_free(data_reserved);
8446 return ret;
8447 }
8448
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 37952 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 16/18] btrfs: introduce btrfs_subpage for data inodes
@ 2020-12-10 9:44 ` kernel test robot
0 siblings, 0 replies; 71+ messages in thread
From: kernel test robot @ 2020-12-10 9:44 UTC (permalink / raw)
To: kbuild-all
[-- Attachment #1: Type: text/plain, Size: 8599 bytes --]
Hi Qu,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on kdave/for-next]
[also build test WARNING on next-20201209]
[cannot apply to v5.10-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Qu-Wenruo/btrfs-add-read-only-support-for-subpage-sector-size/20201210-144442
base: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: x86_64-randconfig-s021-20201210 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.3-179-ga00755aa-dirty
# https://github.com/0day-ci/linux/commit/3852ff477c118432fb205a3422aa538dc8ac3a5f
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Qu-Wenruo/btrfs-add-read-only-support-for-subpage-sector-size/20201210-144442
git checkout 3852ff477c118432fb205a3422aa538dc8ac3a5f
# save the attached .config to linux build tree
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
sparse warnings: (new ones prefixed by >>)
>> fs/btrfs/inode.c:8360:13: sparse: sparse: incorrect type in assignment (different base types) @@ expected restricted vm_fault_t [assigned] [usertype] ret @@ got int @@
fs/btrfs/inode.c:8360:13: sparse: expected restricted vm_fault_t [assigned] [usertype] ret
fs/btrfs/inode.c:8360:13: sparse: got int
>> fs/btrfs/inode.c:8361:13: sparse: sparse: restricted vm_fault_t degrades to integer
vim +8360 fs/btrfs/inode.c
8283
8284 /*
8285 * btrfs_page_mkwrite() is not allowed to change the file size as it gets
8286 * called from a page fault handler when a page is first dirtied. Hence we must
8287 * be careful to check for EOF conditions here. We set the page up correctly
8288 * for a written page which means we get ENOSPC checking when writing into
8289 * holes and correct delalloc and unwritten extent mapping on filesystems that
8290 * support these features.
8291 *
8292 * We are not allowed to take the i_mutex here so we have to play games to
8293 * protect against truncate races as the page could now be beyond EOF. Because
8294 * truncate_setsize() writes the inode size before removing pages, once we have
8295 * the page lock we can determine safely if the page is beyond EOF. If it is not
8296 * beyond EOF, then the page is guaranteed safe against truncation until we
8297 * unlock the page.
8298 */
8299 vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
8300 {
8301 struct page *page = vmf->page;
8302 struct inode *inode = file_inode(vmf->vma->vm_file);
8303 struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
8304 struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
8305 struct btrfs_ordered_extent *ordered;
8306 struct extent_state *cached_state = NULL;
8307 struct extent_changeset *data_reserved = NULL;
8308 char *kaddr;
8309 unsigned long zero_start;
8310 loff_t size;
8311 vm_fault_t ret;
8312 int ret2;
8313 int reserved = 0;
8314 u64 reserved_space;
8315 u64 page_start;
8316 u64 page_end;
8317 u64 end;
8318
8319 reserved_space = PAGE_SIZE;
8320
8321 sb_start_pagefault(inode->i_sb);
8322 page_start = page_offset(page);
8323 page_end = page_start + PAGE_SIZE - 1;
8324 end = page_end;
8325
8326 /*
8327 * Reserving delalloc space after obtaining the page lock can lead to
8328 * deadlock. For example, if a dirty page is locked by this function
8329 * and the call to btrfs_delalloc_reserve_space() ends up triggering
8330 * dirty page write out, then the btrfs_writepage() function could
8331 * end up waiting indefinitely to get a lock on the page currently
8332 * being processed by btrfs_page_mkwrite() function.
8333 */
8334 ret2 = btrfs_delalloc_reserve_space(BTRFS_I(inode), &data_reserved,
8335 page_start, reserved_space);
8336 if (!ret2) {
8337 ret2 = file_update_time(vmf->vma->vm_file);
8338 reserved = 1;
8339 }
8340 if (ret2) {
8341 ret = vmf_error(ret2);
8342 if (reserved)
8343 goto out;
8344 goto out_noreserve;
8345 }
8346
8347 ret = VM_FAULT_NOPAGE; /* make the VM retry the fault */
8348 again:
8349 lock_page(page);
8350 size = i_size_read(inode);
8351
8352 if ((page->mapping != inode->i_mapping) ||
8353 (page_start >= size)) {
8354 /* page got truncated out from underneath us */
8355 goto out_unlock;
8356 }
8357 wait_on_page_writeback(page);
8358
8359 lock_extent_bits(io_tree, page_start, page_end, &cached_state);
> 8360 ret = set_page_extent_mapped(page);
> 8361 if (ret < 0)
8362 goto out_unlock;
8363
8364 /*
8365 * we can't set the delalloc bits if there are pending ordered
8366 * extents. Drop our locks and wait for them to finish
8367 */
8368 ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), page_start,
8369 PAGE_SIZE);
8370 if (ordered) {
8371 unlock_extent_cached(io_tree, page_start, page_end,
8372 &cached_state);
8373 unlock_page(page);
8374 btrfs_start_ordered_extent(ordered, 1);
8375 btrfs_put_ordered_extent(ordered);
8376 goto again;
8377 }
8378
8379 if (page->index == ((size - 1) >> PAGE_SHIFT)) {
8380 reserved_space = round_up(size - page_start,
8381 fs_info->sectorsize);
8382 if (reserved_space < PAGE_SIZE) {
8383 end = page_start + reserved_space - 1;
8384 btrfs_delalloc_release_space(BTRFS_I(inode),
8385 data_reserved, page_start,
8386 PAGE_SIZE - reserved_space, true);
8387 }
8388 }
8389
8390 /*
8391 * page_mkwrite gets called when the page is firstly dirtied after it's
8392 * faulted in, but write(2) could also dirty a page and set delalloc
8393 * bits, thus in this case for space account reason, we still need to
8394 * clear any delalloc bits within this page range since we have to
8395 * reserve data&meta space before lock_page() (see above comments).
8396 */
8397 clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start, end,
8398 EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
8399 EXTENT_DEFRAG, 0, 0, &cached_state);
8400
8401 ret2 = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start, end, 0,
8402 &cached_state);
8403 if (ret2) {
8404 unlock_extent_cached(io_tree, page_start, page_end,
8405 &cached_state);
8406 ret = VM_FAULT_SIGBUS;
8407 goto out_unlock;
8408 }
8409
8410 /* page is wholly or partially inside EOF */
8411 if (page_start + PAGE_SIZE > size)
8412 zero_start = offset_in_page(size);
8413 else
8414 zero_start = PAGE_SIZE;
8415
8416 if (zero_start != PAGE_SIZE) {
8417 kaddr = kmap(page);
8418 memset(kaddr + zero_start, 0, PAGE_SIZE - zero_start);
8419 flush_dcache_page(page);
8420 kunmap(page);
8421 }
8422 ClearPageChecked(page);
8423 set_page_dirty(page);
8424 SetPageUptodate(page);
8425
8426 BTRFS_I(inode)->last_trans = fs_info->generation;
8427 BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
8428 BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
8429
8430 unlock_extent_cached(io_tree, page_start, page_end, &cached_state);
8431
8432 btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
8433 sb_end_pagefault(inode->i_sb);
8434 extent_changeset_free(data_reserved);
8435 return VM_FAULT_LOCKED;
8436
8437 out_unlock:
8438 unlock_page(page);
8439 out:
8440 btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
8441 btrfs_delalloc_release_space(BTRFS_I(inode), data_reserved, page_start,
8442 reserved_space, (ret != 0));
8443 out_noreserve:
8444 sb_end_pagefault(inode->i_sb);
8445 extent_changeset_free(data_reserved);
8446 return ret;
8447 }
8448
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 37952 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 02/18] btrfs: extent_io: refactor __extent_writepage_io() to improve readability
2020-12-10 6:38 ` [PATCH v2 02/18] btrfs: extent_io: refactor __extent_writepage_io() to improve readability Qu Wenruo
@ 2020-12-10 12:12 ` Nikolay Borisov
2020-12-10 12:53 ` Qu Wenruo
2020-12-17 15:43 ` Josef Bacik
1 sibling, 1 reply; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-10 12:12 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 10.12.20 at 8:38, Qu Wenruo wrote:
> The refactor involves the following modifications:
> - iosize alignment
> In fact we don't really need to manually do alignment at all.
> All extent maps should already be aligned, thus basic ASSERT() check
> would be enough.
>
> - redundant variables
> We have extra variables like blocksize/pg_offset/end.
> They are all unnecessary.
>
> @blocksize can be replaced by sectorsize directly, and it's only
> used to verify the em start/size is aligned.
>
> @pg_offset can be easily calculated using @cur and page_offset(page).
>
> @end is just assigned to @page_end and never modified, use @page_end
> to replace it.
>
> - remove some BUG_ON()s
> The BUG_ON()s are for extent map, which we have tree-checker to check
> on-disk extent data item and runtime check.
> ASSERT() should be enough.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/extent_io.c | 37 +++++++++++++++++--------------------
> 1 file changed, 17 insertions(+), 20 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 2650e8720394..612fe60b367e 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3515,17 +3515,14 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
> unsigned long nr_written,
> int *nr_ret)
> {
> + struct btrfs_fs_info *fs_info = inode->root->fs_info;
> struct extent_io_tree *tree = &inode->io_tree;
> u64 start = page_offset(page);
> u64 page_end = start + PAGE_SIZE - 1;
nit: page_end should be renamed to end because start now points to the
logical byte offset, i.e. having "page" in the name is misleading.
<snip>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 03/18] btrfs: file: update comment for btrfs_dirty_pages()
2020-12-10 6:38 ` [PATCH v2 03/18] btrfs: file: update comment for btrfs_dirty_pages() Qu Wenruo
@ 2020-12-10 12:16 ` Nikolay Borisov
0 siblings, 0 replies; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-10 12:16 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 10.12.20 at 8:38, Qu Wenruo wrote:
> The original comment is from the initial merge, which has several
> problems:
> - There is no holes check any more
> - The inline extent decision is not made here any more
>
> Update the out-of-date comment with a more correct one.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/file.c | 15 +++++++++------
> 1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 0e41459b8de6..a29b50208eee 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -453,12 +453,15 @@ static void btrfs_drop_pages(struct page **pages, size_t num_pages)
> }
>
> /*
> - * after copy_from_user, pages need to be dirtied and we need to make
> - * sure holes are created between the current EOF and the start of
> - * any next extents (if required).
> - *
> - * this also makes the decision about creating an inline extent vs
> - * doing real data extents, marking pages dirty and delalloc as required.
> + * After btrfs_copy_from_user(), update the following things for delalloc:
> + * - DELALLOC extent io tree bits
> + * Later btrfs_run_delalloc_range() relies on this bit to determine the
> + * writeback range.
IMO the following seems more coherent and concise:
- Mark newly dirtied pages as DELALLOC in the io tree. Used to advise
which range is to be written back.
> + * - Page status
> + * Including basic status like Dirty and Uptodate, and btrfs specific bit
> + * like Checked (for cow fixup)
- Marks modified pages as Uptodate/Dirty and not needing cowfixup
> + * - Inode size update
> + * If needed
- Update inode size for past EOF write.
> */
> int btrfs_dirty_pages(struct btrfs_inode *inode, struct page **pages,
> size_t num_pages, loff_t pos, size_t write_bytes,
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 02/18] btrfs: extent_io: refactor __extent_writepage_io() to improve readability
2020-12-10 12:12 ` Nikolay Borisov
@ 2020-12-10 12:53 ` Qu Wenruo
2020-12-10 12:58 ` Nikolay Borisov
0 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2020-12-10 12:53 UTC (permalink / raw)
To: Nikolay Borisov, Qu Wenruo, linux-btrfs
On 2020/12/10 8:12 PM, Nikolay Borisov wrote:
>
>
> On 10.12.20 г. 8:38 ч., Qu Wenruo wrote:
>> The refactor involves the following modifications:
>> - iosize alignment
>> In fact we don't really need to manually do alignment at all.
>> All extent maps should already be aligned, thus basic ASSERT() check
>> would be enough.
>>
>> - redundant variables
>> We have extra variable like blocksize/pg_offset/end.
>> They are all unnecessary.
>>
>> @blocksize can be replaced by sectorsize size directly, and it's only
>> used to verify the em start/size is aligned.
>>
>> @pg_offset can be easily calculated using @cur and page_offset(page).
>>
>> @end is just assigned to @page_end and never modified, use @page_end
>> to replace it.
>>
>> - remove some BUG_ON()s
>> The BUG_ON()s are for extent map, which we have tree-checker to check
>> on-disk extent data item and runtime check.
>> ASSERT() should be enough.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> fs/btrfs/extent_io.c | 37 +++++++++++++++++--------------------
>> 1 file changed, 17 insertions(+), 20 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 2650e8720394..612fe60b367e 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -3515,17 +3515,14 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>> unsigned long nr_written,
>> int *nr_ret)
>> {
>> + struct btrfs_fs_info *fs_info = inode->root->fs_info;
>> struct extent_io_tree *tree = &inode->io_tree;
>> u64 start = page_offset(page);
>> u64 page_end = start + PAGE_SIZE - 1;
>
> nit: page_end should be renamed to end because start now points to the
> logical logical byte offset, i.e having "page" in the name is misleading.
But page_offset() along with page_end is still a logical bytenr, thus I
don't see much confusion here...
Thanks,
Qu
>
> <snip>
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 02/18] btrfs: extent_io: refactor __extent_writepage_io() to improve readability
2020-12-10 12:53 ` Qu Wenruo
@ 2020-12-10 12:58 ` Nikolay Borisov
0 siblings, 0 replies; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-10 12:58 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, linux-btrfs
On 10.12.20 at 14:53, Qu Wenruo wrote:
>
>
> On 2020/12/10 8:12 PM, Nikolay Borisov wrote:
>>
>>
>> On 10.12.20 г. 8:38 ч., Qu Wenruo wrote:
>>> The refactor involves the following modifications:
>>> - iosize alignment
>>> In fact we don't really need to manually do alignment at all.
>>> All extent maps should already be aligned, thus basic ASSERT() check
>>> would be enough.
>>>
>>> - redundant variables
>>> We have extra variable like blocksize/pg_offset/end.
>>> They are all unnecessary.
>>>
>>> @blocksize can be replaced by sectorsize size directly, and it's only
>>> used to verify the em start/size is aligned.
>>>
>>> @pg_offset can be easily calculated using @cur and page_offset(page).
>>>
>>> @end is just assigned to @page_end and never modified, use @page_end
>>> to replace it.
>>>
>>> - remove some BUG_ON()s
>>> The BUG_ON()s are for extent map, which we have tree-checker to check
>>> on-disk extent data item and runtime check.
>>> ASSERT() should be enough.
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>> fs/btrfs/extent_io.c | 37 +++++++++++++++++--------------------
>>> 1 file changed, 17 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index 2650e8720394..612fe60b367e 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -3515,17 +3515,14 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>>> unsigned long nr_written,
>>> int *nr_ret)
>>> {
>>> + struct btrfs_fs_info *fs_info = inode->root->fs_info;
>>> struct extent_io_tree *tree = &inode->io_tree;
>>> u64 start = page_offset(page);
>>> u64 page_end = start + PAGE_SIZE - 1;
>>
>> nit: page_end should be renamed to end because start now points to the
>> logical logical byte offset, i.e having "page" in the name is misleading.
>
> But page_offset() along page_end is still logical bytenr, thus I didn't
> see much confusion here...
Exactly. page_offset() converts the page index to a logical bytenr, and at
that point we no longer care about the physical page but about the logical
range, which is PAGE_SIZE. 'page_end' is really a logical offset which
spans a PAGE_SIZE region.
>
> Thanks,
> Qu
>>
>> <snip>
>>
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 15/18] btrfs: disk-io: introduce subpage metadata validation check
2020-12-10 6:39 ` [PATCH v2 15/18] btrfs: disk-io: introduce subpage metadata validation check Qu Wenruo
@ 2020-12-10 13:24 ` kernel test robot
2020-12-10 13:39 ` kernel test robot
2020-12-14 10:21 ` Nikolay Borisov
2 siblings, 0 replies; 71+ messages in thread
From: kernel test robot @ 2020-12-10 13:24 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs; +Cc: kbuild-all
[-- Attachment #1: Type: text/plain, Size: 1828 bytes --]
Hi Qu,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on kdave/for-next]
[also build test ERROR on next-20201209]
[cannot apply to btrfs/next v5.10-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Qu-Wenruo/btrfs-add-read-only-support-for-subpage-sector-size/20201210-144442
base: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: nds32-randconfig-r004-20201209 (attached as .config)
compiler: nds32le-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/e01cdf51d0d32647697616c0dd08f2cc3220bde4
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Qu-Wenruo/btrfs-add-read-only-support-for-subpage-sector-size/20201210-144442
git checkout e01cdf51d0d32647697616c0dd08f2cc3220bde4
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=nds32
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
nds32le-linux-ld: fs/btrfs/disk-io.o: in function `btrfs_validate_metadata_buffer':
disk-io.c:(.text+0x4200): undefined reference to `__udivdi3'
>> nds32le-linux-ld: disk-io.c:(.text+0x4204): undefined reference to `__udivdi3'
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 22008 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 15/18] btrfs: disk-io: introduce subpage metadata validation check
2020-12-10 6:39 ` [PATCH v2 15/18] btrfs: disk-io: introduce subpage metadata validation check Qu Wenruo
@ 2020-12-10 13:39 ` kernel test robot
2020-12-10 13:39 ` kernel test robot
2020-12-14 10:21 ` Nikolay Borisov
2 siblings, 0 replies; 71+ messages in thread
From: kernel test robot @ 2020-12-10 13:39 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs; +Cc: kbuild-all
[-- Attachment #1: Type: text/plain, Size: 4077 bytes --]
Hi Qu,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on kdave/for-next]
[also build test ERROR on next-20201210]
[cannot apply to v5.10-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Qu-Wenruo/btrfs-add-read-only-support-for-subpage-sector-size/20201210-144442
base: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: i386-randconfig-a013-20201209 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce (this is a W=1 build):
# https://github.com/0day-ci/linux/commit/e01cdf51d0d32647697616c0dd08f2cc3220bde4
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Qu-Wenruo/btrfs-add-read-only-support-for-subpage-sector-size/20201210-144442
git checkout e01cdf51d0d32647697616c0dd08f2cc3220bde4
# save the attached .config to linux build tree
make W=1 ARCH=i386
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
ld: fs/btrfs/disk-io.o: in function `validate_subpage_buffer':
>> fs/btrfs/disk-io.c:619: undefined reference to `__udivdi3'
vim +619 fs/btrfs/disk-io.c
593
594 static int validate_subpage_buffer(struct page *page, u64 start, u64 end,
595 int mirror)
596 {
597 struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
598 struct extent_buffer *eb;
599 int reads_done;
600 int ret = 0;
601
602 if (!IS_ALIGNED(start, fs_info->sectorsize) ||
603 !IS_ALIGNED(end - start + 1, fs_info->sectorsize) ||
604 !IS_ALIGNED(end - start + 1, fs_info->nodesize)) {
605 WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
606 btrfs_err(fs_info, "invalid tree read bytenr");
607 return -EUCLEAN;
608 }
609
610 /*
611 * We don't allow bio merge for subpage metadata read, so we should
612 * only get one eb for each endio hook.
613 */
614 ASSERT(end == start + fs_info->nodesize - 1);
615 ASSERT(PagePrivate(page));
616
617 rcu_read_lock();
618 eb = radix_tree_lookup(&fs_info->buffer_radix,
> 619 start / fs_info->sectorsize);
620 rcu_read_unlock();
621
622 /*
623 * When we are reading one tree block, eb must have been
624 * inserted into the radix tree. If not something is wrong.
625 */
626 if (!eb) {
627 WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
628 btrfs_err(fs_info,
629 "can't find extent buffer for bytenr %llu",
630 start);
631 return -EUCLEAN;
632 }
633 /*
634 * The pending IO might have been the only thing that kept
635 * this buffer in memory. Make sure we have a ref for all
636 * this other checks
637 */
638 atomic_inc(&eb->refs);
639
640 reads_done = atomic_dec_and_test(&eb->io_pages);
641 /* Subpage read must finish in page read */
642 ASSERT(reads_done);
643
644 eb->read_mirror = mirror;
645 if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
646 ret = -EIO;
647 goto err;
648 }
649 ret = validate_extent_buffer(eb);
650 if (ret < 0)
651 goto err;
652
653 if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
654 btree_readahead_hook(eb, ret);
655
656 set_extent_buffer_uptodate(eb);
657
658 free_extent_buffer(eb);
659 return ret;
660 err:
661 /*
662 * our io error hook is going to dec the io pages
663 * again, we have to make sure it has something to
664 * decrement
665 */
666 atomic_inc(&eb->io_pages);
667 clear_extent_buffer_uptodate(eb);
668 free_extent_buffer(eb);
669 return ret;
670 }
671
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 36747 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 04/18] btrfs: extent_io: introduce a helper to grab an existing extent buffer from a page
2020-12-10 6:38 ` [PATCH v2 04/18] btrfs: extent_io: introduce a helper to grab an existing extent buffer from a page Qu Wenruo
@ 2020-12-10 13:51 ` Nikolay Borisov
2020-12-17 15:50 ` Josef Bacik
1 sibling, 0 replies; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-10 13:51 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs; +Cc: Johannes Thumshirn
On 10.12.20 at 8:38, Qu Wenruo wrote:
> This patch will extract the code to grab an extent buffer from a page
> into a helper, grab_extent_buffer_from_page().
>
> This reduces one indent level, and provides the place for later
> expansion for subpage support.
>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/extent_io.c | 52 +++++++++++++++++++++++++++-----------------
> 1 file changed, 32 insertions(+), 20 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 612fe60b367e..6350c2687c7e 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5251,6 +5251,32 @@ struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
> }
> #endif
>
> +static struct extent_buffer *grab_extent_buffer_from_page(struct page *page)
nit: Make the name just grab_extent_buffer/get_extent_buffer; given you
pass in a page as an input parameter, "from_page" is obvious.
> +{
> + struct extent_buffer *exists;
> +
> + /* Page not yet attached to an extent buffer */
> + if (!PagePrivate(page))
> + return NULL;
> +
> + /*
> + * We could have already allocated an eb for this page
> + * and attached one so lets see if we can get a ref on
> + * the existing eb, and if we can we know it's good and
> + * we can just return that one, else we know we can just
> + * overwrite page->private.
> + */
> + exists = (struct extent_buffer *)page->private;
> + if (atomic_inc_not_zero(&exists->refs)) {
> + mark_extent_buffer_accessed(exists, page);
> + return exists;
> + }
nit: This patch slightly changes the timing of
mark_extent_buffer_accessed(), as it's now called under
mapping->private_lock with the respective page locked. Looking at
mark_extent_buffer_accessed(), it iterates over the pages and calls
mark_page_accessed() on them, as well as calling check_buffer_tree_ref(),
which does some atomic ops. While it might not be a big hit, I'd expect
some minimal performance regression.
> +
> + WARN_ON(PageDirty(page));
> + detach_page_private(page);
> + return NULL;
> +}
> +
> struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> u64 start, u64 owner_root, int level)
> {
> @@ -5296,26 +5322,12 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> }
>
> spin_lock(&mapping->private_lock);
> - if (PagePrivate(p)) {
> - /*
> - * We could have already allocated an eb for this page
> - * and attached one so lets see if we can get a ref on
> - * the existing eb, and if we can we know it's good and
> - * we can just return that one, else we know we can just
> - * overwrite page->private.
> - */
> - exists = (struct extent_buffer *)p->private;
> - if (atomic_inc_not_zero(&exists->refs)) {
> - spin_unlock(&mapping->private_lock);
> - unlock_page(p);
> - put_page(p);
> - mark_extent_buffer_accessed(exists, p);
> - goto free_eb;
> - }
> - exists = NULL;
> -
> - WARN_ON(PageDirty(p));
> - detach_page_private(p);
> + exists = grab_extent_buffer_from_page(p);
> + if (exists) {
> + spin_unlock(&mapping->private_lock);
> + unlock_page(p);
> + put_page(p);
> + goto free_eb;
> }
> attach_extent_buffer_page(eb, p);
> spin_unlock(&mapping->private_lock);
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 06/18] btrfs: extent_io: make attach_extent_buffer_page() to handle subpage case
2020-12-10 6:38 ` [PATCH v2 06/18] btrfs: extent_io: make attach_extent_buffer_page() to handle subpage case Qu Wenruo
@ 2020-12-10 15:30 ` Nikolay Borisov
2020-12-17 6:48 ` Qu Wenruo
2020-12-10 16:09 ` Nikolay Borisov
2020-12-17 16:00 ` Josef Bacik
2 siblings, 1 reply; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-10 15:30 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 10.12.20 at 8:38, Qu Wenruo wrote:
> For subpage case, we need to allocate new memory for each metadata page.
>
> So we need to:
> - Allow attach_extent_buffer_page() to return int
> To indicate allocation failure
>
> - Prealloc page->private for alloc_extent_buffer()
> We don't want to call memory allocation with spinlock hold, so
> do preallocation before we acquire the spin lock.
>
> - Handle subpage and regular case differently in
> attach_extent_buffer_page()
> For regular case, just do the usual thing.
> For subpage case, allocate new memory and update the tree_block
> bitmap.
>
> The bitmap update will be handled by new subpage specific helper,
> btrfs_subpage_set_tree_block().
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/extent_io.c | 69 +++++++++++++++++++++++++++++++++++---------
> fs/btrfs/subpage.h | 44 ++++++++++++++++++++++++++++
> 2 files changed, 99 insertions(+), 14 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 6350c2687c7e..51dd7ec3c2b3 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -24,6 +24,7 @@
> #include "rcu-string.h"
> #include "backref.h"
> #include "disk-io.h"
> +#include "subpage.h"
>
> static struct kmem_cache *extent_state_cache;
> static struct kmem_cache *extent_buffer_cache;
> @@ -3142,22 +3143,41 @@ static int submit_extent_page(unsigned int opf,
> return ret;
> }
>
> -static void attach_extent_buffer_page(struct extent_buffer *eb,
> +static int attach_extent_buffer_page(struct extent_buffer *eb,
> struct page *page)
> {
> - /*
> - * If the page is mapped to btree inode, we should hold the private
> - * lock to prevent race.
> - * For cloned or dummy extent buffers, their pages are not mapped and
> - * will not race with any other ebs.
> - */
> - if (page->mapping)
> - lockdep_assert_held(&page->mapping->private_lock);
> + struct btrfs_fs_info *fs_info = eb->fs_info;
> + int ret;
>
> - if (!PagePrivate(page))
> - attach_page_private(page, eb);
> - else
> - WARN_ON(page->private != (unsigned long)eb);
> + if (fs_info->sectorsize == PAGE_SIZE) {
> + /*
> + * If the page is mapped to btree inode, we should hold the
> + * private lock to prevent race.
> + * For cloned or dummy extent buffers, their pages are not
> + * mapped and will not race with any other ebs.
> + */
> + if (page->mapping)
> + lockdep_assert_held(&page->mapping->private_lock);
> +
> + if (!PagePrivate(page))
> + attach_page_private(page, eb);
> + else
> + WARN_ON(page->private != (unsigned long)eb);
> + return 0;
> + }
> +
> + /* Already mapped, just update the existing range */
> + if (PagePrivate(page))
> + goto update_bitmap;
How can this check ever be false, given that btrfs_attach_subpage is
called unconditionally in alloc_extent_buffer precisely so that you can
avoid allocating memory with the private lock held, yet in this function
you check whether memory hasn't been allocated and proceed to allocate
it? Also, that memory allocation is done with GFP_NOFS under a spinlock;
GFP_NOFS is not atomic, i.e. IO can still be kicked, which means you can
go to sleep while holding a spinlock. Not cool.
> +
> + /* Do new allocation to attach subpage */
> + ret = btrfs_attach_subpage(fs_info, page);
> + if (ret < 0)
> + return ret;
> +
> +update_bitmap:
> + btrfs_subpage_set_tree_block(fs_info, page, eb->start, eb->len);
> + return 0;
Those are really two functions, demarcated by the if. Given that
attach_extent_buffer_page is called in only 2 places, can't you
open-code the if (fs_info->sectorsize == PAGE_SIZE) check in the callers
and define two functions: one for the subpage block size and the other
for the old code?
> }
>
<snip>
> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
> index 96f3b226913e..c2ce603e7848 100644
> --- a/fs/btrfs/subpage.h
> +++ b/fs/btrfs/subpage.h
> @@ -23,9 +23,53 @@
> struct btrfs_subpage {
> /* Common members for both data and metadata pages */
> spinlock_t lock;
> + union {
> + /* Structures only used by metadata */
> + struct {
> + u16 tree_block_bitmap;
> + };
> + /* structures only used by data */
> + };
> };
>
> int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
> void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>
> +/*
> + * Convert the [start, start + len) range into a u16 bitmap
> + *
> + * E.g. if start == page_offset() + 16K, len = 16K, we get 0x00f0.
> + */
> +static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,
> + struct page *page, u64 start, u32 len)
> +{
> + int bit_start = (start - page_offset(page)) >> fs_info->sectorsize_bits;
> + int nbits = len >> fs_info->sectorsize_bits;
> +
> + /* Basic checks */
> + ASSERT(PagePrivate(page) && page->private);
> + ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
> + IS_ALIGNED(len, fs_info->sectorsize));
Separate the two alignment assertions, so that if one fails it's evident which one it was.
> + ASSERT(page_offset(page) <= start &&
> + start + len <= page_offset(page) + PAGE_SIZE);
ditto. Also, instead of checking 'page_offset(page) <= start' you can
simply check 'bit_start >= 0', as that's what you ultimately care about.
> + /*
> + * Here nbits can be 16, thus can go beyond u16 range. Here we make the
> + * first left shift to be calculated in unsigned long (u32), then
> + * truncate the result to u16.
> + */
> + return (u16)(((1UL << nbits) - 1) << bit_start);
> +}
> +
> +static inline void btrfs_subpage_set_tree_block(struct btrfs_fs_info *fs_info,
> + struct page *page, u64 start, u32 len)
> +{
> + struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
> + unsigned long flags;
> + u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
> +
> + spin_lock_irqsave(&subpage->lock, flags);
> + subpage->tree_block_bitmap |= tmp;
> + spin_unlock_irqrestore(&subpage->lock, flags);
> +}
> +
> #endif /* BTRFS_SUBPAGE_H */
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 07/18] btrfs: extent_io: make grab_extent_buffer_from_page() to handle subpage case
2020-12-10 6:38 ` [PATCH v2 07/18] btrfs: extent_io: make grab_extent_buffer_from_page() " Qu Wenruo
@ 2020-12-10 15:39 ` Nikolay Borisov
2020-12-17 6:55 ` Qu Wenruo
2020-12-17 16:02 ` Josef Bacik
1 sibling, 1 reply; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-10 15:39 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 10.12.20 at 8:38, Qu Wenruo wrote:
> For subpage case, grab_extent_buffer_from_page() can't really get an
> extent buffer just from btrfs_subpage.
>
> Although we have btrfs_subpage::tree_block_bitmap, which can be used to
> grab the bytenr of an existing extent buffer, and can then go radix tree
> search to grab that existing eb.
>
> However we are still doing radix tree insert check in
> alloc_extent_buffer(), thus we don't really need to do the extra hassle,
> just let alloc_extent_buffer() to handle existing eb in radix tree.
>
> So for grab_extent_buffer_from_page(), just always return NULL for
> subpage case.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/extent_io.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 51dd7ec3c2b3..b99bd0402130 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5278,10 +5278,19 @@ struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
> }
> #endif
>
> -static struct extent_buffer *grab_extent_buffer_from_page(struct page *page)
> +static struct extent_buffer *grab_extent_buffer_from_page(
> + struct btrfs_fs_info *fs_info, struct page *page)
> {
> struct extent_buffer *exists;
>
> + /*
> + * For subpage case, we completely rely on radix tree to ensure we
> + * don't try to insert two eb for the same bytenr.
> + * So here we always return NULL and just continue.
> + */
> + if (fs_info->sectorsize < PAGE_SIZE)
> + return NULL;
> +
Instead of hiding this in the function, just open-code it in the only caller. It would look like:
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b99bd0402130..440dab207944 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5370,8 +5370,9 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
}
spin_lock(&mapping->private_lock);
- exists = grab_extent_buffer_from_page(fs_info, p);
- if (exists) {
+ if (fs_info->sectorsize == PAGE_SIZE &&
+ (exists = grab_extent_buffer_from_page(fs_info, p)))
+ {
spin_unlock(&mapping->private_lock);
unlock_page(p);
put_page(p);
Admittedly, that 'exists = ...' in the if condition is a bit of an anti-pattern, but given that it's used
in only one place and makes the flow of the code more linear, I'd say it's a win. But I would like to hear David's opinion.
> /* Page not yet attached to an extent buffer */
> if (!PagePrivate(page))
> return NULL;
> @@ -5361,7 +5370,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> }
>
> spin_lock(&mapping->private_lock);
> - exists = grab_extent_buffer_from_page(p);
> + exists = grab_extent_buffer_from_page(fs_info, p);
> if (exists) {
> spin_unlock(&mapping->private_lock);
> unlock_page(p);
>
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [PATCH v2 06/18] btrfs: extent_io: make attach_extent_buffer_page() to handle subpage case
2020-12-10 6:38 ` [PATCH v2 06/18] btrfs: extent_io: make attach_extent_buffer_page() to handle subpage case Qu Wenruo
2020-12-10 15:30 ` Nikolay Borisov
@ 2020-12-10 16:09 ` Nikolay Borisov
2020-12-17 16:00 ` Josef Bacik
2 siblings, 0 replies; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-10 16:09 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 10.12.20 at 8:38, Qu Wenruo wrote:
> For subpage case, we need to allocate new memory for each metadata page.
>
> So we need to:
> - Allow attach_extent_buffer_page() to return int
> To indicate allocation failure
>
> - Prealloc page->private for alloc_extent_buffer()
> We don't want to call memory allocation with spinlock hold, so
> do preallocation before we acquire the spin lock.
>
> - Handle subpage and regular case differently in
> attach_extent_buffer_page()
> For regular case, just do the usual thing.
> For subpage case, allocate new memory and update the tree_block
> bitmap.
>
> The bitmap update will be handled by new subpage specific helper,
> btrfs_subpage_set_tree_block().
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/extent_io.c | 69 +++++++++++++++++++++++++++++++++++---------
> fs/btrfs/subpage.h | 44 ++++++++++++++++++++++++++++
> 2 files changed, 99 insertions(+), 14 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 6350c2687c7e..51dd7ec3c2b3 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -24,6 +24,7 @@
> #include "rcu-string.h"
> #include "backref.h"
> #include "disk-io.h"
> +#include "subpage.h"
>
<snip>
> void set_page_extent_mapped(struct page *page)
> @@ -5067,12 +5087,19 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
> return NULL;
>
> for (i = 0; i < num_pages; i++) {
> + int ret;
> +
> p = alloc_page(GFP_NOFS);
> if (!p) {
> btrfs_release_extent_buffer(new);
> return NULL;
> }
> - attach_extent_buffer_page(new, p);
> + ret = attach_extent_buffer_page(new, p);
> + if (ret < 0) {
> + put_page(p);
> + btrfs_release_extent_buffer(new);
> + return NULL;
> + }
In this function you need to move
'set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);' line before entering
the loop otherwise when btrfs_release_extent_buffer is called it will
try to erroneously acquire the mapping lock since BUFFER_UNMAPPED
wouldn't have been set.
<snip>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 08/18] btrfs: extent_io: support subpage for extent buffer page release
2020-12-10 6:38 ` [PATCH v2 08/18] btrfs: extent_io: support subpage for extent buffer page release Qu Wenruo
@ 2020-12-10 16:13 ` Nikolay Borisov
0 siblings, 0 replies; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-10 16:13 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 10.12.20 at 8:38, Qu Wenruo wrote:
> In btrfs_release_extent_buffer_pages(), we need to add extra handling
> for subpage.
>
> To do so, introduce a new helper, detach_extent_buffer_page(), to do
> different handling for regular and subpage cases.
>
> For subpage case, the new trick is to clear the range of current extent
> buffer, and detach page private if and only if we're the last tree block
> of the page.
> This part is handled by the subpage helper,
> btrfs_subpage_clear_and_test_tree_block().
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/extent_io.c | 59 +++++++++++++++++++++++++++++++-------------
> fs/btrfs/subpage.h | 24 ++++++++++++++++++
> 2 files changed, 66 insertions(+), 17 deletions(-)
<snip>
> @@ -5031,6 +5018,44 @@ static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb)
> */
> detach_page_private(page);
> }
> + return;
> + }
> +
> + /*
> + * For subpage case, clear the range in tree_block_bitmap,
> + * and if we're the last one, detach private completely.
> + */
> + if (PagePrivate(page)) {
Under what condition can you have a subpage filesystem and call
detach_extent_buffer_page on a page that doesn't have the PagePrivate
flag set? I think that's impossible, i.e. that check should really be an
ASSERT.
> + bool last = false;
> +
> + last = btrfs_subpage_clear_and_test_tree_block(fs_info, page,
> + eb->start, eb->len);
> + if (last)
> + btrfs_detach_subpage(fs_info, page);
> + }
> +}
> +
<snip>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 16/18] btrfs: introduce btrfs_subpage for data inodes
2020-12-10 6:39 ` [PATCH v2 16/18] btrfs: introduce btrfs_subpage for data inodes Qu Wenruo
@ 2020-12-11 0:43 ` kernel test robot
2020-12-11 0:43 ` kernel test robot
2020-12-14 12:46 ` Nikolay Borisov
2 siblings, 0 replies; 71+ messages in thread
From: kernel test robot @ 2020-12-11 0:43 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs; +Cc: kbuild-all
[-- Attachment #1: Type: text/plain, Size: 7566 bytes --]
Hi Qu,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on kdave/for-next]
[also build test WARNING on next-20201210]
[cannot apply to v5.10-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Qu-Wenruo/btrfs-add-read-only-support-for-subpage-sector-size/20201210-144442
base: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: i386-randconfig-m021-20201209 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
New smatch warnings:
fs/btrfs/inode.c:8361 btrfs_page_mkwrite() warn: unsigned 'ret' is never less than zero.
Old smatch warnings:
include/linux/fs.h:862 i_size_write() warn: statement has no effect 31
vim +/ret +8361 fs/btrfs/inode.c
8283
8284 /*
8285 * btrfs_page_mkwrite() is not allowed to change the file size as it gets
8286 * called from a page fault handler when a page is first dirtied. Hence we must
8287 * be careful to check for EOF conditions here. We set the page up correctly
8288 * for a written page which means we get ENOSPC checking when writing into
8289 * holes and correct delalloc and unwritten extent mapping on filesystems that
8290 * support these features.
8291 *
8292 * We are not allowed to take the i_mutex here so we have to play games to
8293 * protect against truncate races as the page could now be beyond EOF. Because
8294 * truncate_setsize() writes the inode size before removing pages, once we have
8295 * the page lock we can determine safely if the page is beyond EOF. If it is not
8296 * beyond EOF, then the page is guaranteed safe against truncation until we
8297 * unlock the page.
8298 */
8299 vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
8300 {
8301 struct page *page = vmf->page;
8302 struct inode *inode = file_inode(vmf->vma->vm_file);
8303 struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
8304 struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
8305 struct btrfs_ordered_extent *ordered;
8306 struct extent_state *cached_state = NULL;
8307 struct extent_changeset *data_reserved = NULL;
8308 char *kaddr;
8309 unsigned long zero_start;
8310 loff_t size;
8311 vm_fault_t ret;
8312 int ret2;
8313 int reserved = 0;
8314 u64 reserved_space;
8315 u64 page_start;
8316 u64 page_end;
8317 u64 end;
8318
8319 reserved_space = PAGE_SIZE;
8320
8321 sb_start_pagefault(inode->i_sb);
8322 page_start = page_offset(page);
8323 page_end = page_start + PAGE_SIZE - 1;
8324 end = page_end;
8325
8326 /*
8327 * Reserving delalloc space after obtaining the page lock can lead to
8328 * deadlock. For example, if a dirty page is locked by this function
8329 * and the call to btrfs_delalloc_reserve_space() ends up triggering
8330 * dirty page write out, then the btrfs_writepage() function could
8331 * end up waiting indefinitely to get a lock on the page currently
8332 * being processed by btrfs_page_mkwrite() function.
8333 */
8334 ret2 = btrfs_delalloc_reserve_space(BTRFS_I(inode), &data_reserved,
8335 page_start, reserved_space);
8336 if (!ret2) {
8337 ret2 = file_update_time(vmf->vma->vm_file);
8338 reserved = 1;
8339 }
8340 if (ret2) {
8341 ret = vmf_error(ret2);
8342 if (reserved)
8343 goto out;
8344 goto out_noreserve;
8345 }
8346
8347 ret = VM_FAULT_NOPAGE; /* make the VM retry the fault */
8348 again:
8349 lock_page(page);
8350 size = i_size_read(inode);
8351
8352 if ((page->mapping != inode->i_mapping) ||
8353 (page_start >= size)) {
8354 /* page got truncated out from underneath us */
8355 goto out_unlock;
8356 }
8357 wait_on_page_writeback(page);
8358
8359 lock_extent_bits(io_tree, page_start, page_end, &cached_state);
8360 ret = set_page_extent_mapped(page);
> 8361 if (ret < 0)
8362 goto out_unlock;
8363
8364 /*
8365 * we can't set the delalloc bits if there are pending ordered
8366 * extents. Drop our locks and wait for them to finish
8367 */
8368 ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), page_start,
8369 PAGE_SIZE);
8370 if (ordered) {
8371 unlock_extent_cached(io_tree, page_start, page_end,
8372 &cached_state);
8373 unlock_page(page);
8374 btrfs_start_ordered_extent(ordered, 1);
8375 btrfs_put_ordered_extent(ordered);
8376 goto again;
8377 }
8378
8379 if (page->index == ((size - 1) >> PAGE_SHIFT)) {
8380 reserved_space = round_up(size - page_start,
8381 fs_info->sectorsize);
8382 if (reserved_space < PAGE_SIZE) {
8383 end = page_start + reserved_space - 1;
8384 btrfs_delalloc_release_space(BTRFS_I(inode),
8385 data_reserved, page_start,
8386 PAGE_SIZE - reserved_space, true);
8387 }
8388 }
8389
8390 /*
8391 * page_mkwrite gets called when the page is firstly dirtied after it's
8392 * faulted in, but write(2) could also dirty a page and set delalloc
8393 * bits, thus in this case for space account reason, we still need to
8394 * clear any delalloc bits within this page range since we have to
8395 * reserve data&meta space before lock_page() (see above comments).
8396 */
8397 clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start, end,
8398 EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
8399 EXTENT_DEFRAG, 0, 0, &cached_state);
8400
8401 ret2 = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start, end, 0,
8402 &cached_state);
8403 if (ret2) {
8404 unlock_extent_cached(io_tree, page_start, page_end,
8405 &cached_state);
8406 ret = VM_FAULT_SIGBUS;
8407 goto out_unlock;
8408 }
8409
8410 /* page is wholly or partially inside EOF */
8411 if (page_start + PAGE_SIZE > size)
8412 zero_start = offset_in_page(size);
8413 else
8414 zero_start = PAGE_SIZE;
8415
8416 if (zero_start != PAGE_SIZE) {
8417 kaddr = kmap(page);
8418 memset(kaddr + zero_start, 0, PAGE_SIZE - zero_start);
8419 flush_dcache_page(page);
8420 kunmap(page);
8421 }
8422 ClearPageChecked(page);
8423 set_page_dirty(page);
8424 SetPageUptodate(page);
8425
8426 BTRFS_I(inode)->last_trans = fs_info->generation;
8427 BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
8428 BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
8429
8430 unlock_extent_cached(io_tree, page_start, page_end, &cached_state);
8431
8432 btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
8433 sb_end_pagefault(inode->i_sb);
8434 extent_changeset_free(data_reserved);
8435 return VM_FAULT_LOCKED;
8436
8437 out_unlock:
8438 unlock_page(page);
8439 out:
8440 btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
8441 btrfs_delalloc_release_space(BTRFS_I(inode), data_reserved, page_start,
8442 reserved_space, (ret != 0));
8443 out_noreserve:
8444 sb_end_pagefault(inode->i_sb);
8445 extent_changeset_free(data_reserved);
8446 return ret;
8447 }
8448
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 36210 bytes --]
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 09/18] btrfs: subpage: introduce helper for subpage uptodate status
2020-12-10 6:38 ` [PATCH v2 09/18] btrfs: subpage: introduce helper for subpage uptodate status Qu Wenruo
@ 2020-12-11 10:10 ` Nikolay Borisov
2020-12-11 10:48 ` Qu Wenruo
0 siblings, 1 reply; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-11 10:10 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 10.12.20 at 8:38, Qu Wenruo wrote:
> This patch introduce the following functions to handle btrfs subpage
> uptodate status:
> - btrfs_subpage_set_uptodate()
> - btrfs_subpage_clear_uptodate()
> - btrfs_subpage_test_uptodate()
> Those helpers can only be called when the range is ensured to be
> inside the page.
>
> - btrfs_page_set_uptodate()
> - btrfs_page_clear_uptodate()
> - btrfs_page_test_uptodate()
> Those helpers can handle both regular sector size and subpage without
> problem.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/subpage.h | 98 ++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 98 insertions(+)
>
> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
> index 87b4e028ae18..b3cf9171ec98 100644
> --- a/fs/btrfs/subpage.h
> +++ b/fs/btrfs/subpage.h
> @@ -23,6 +23,7 @@
> struct btrfs_subpage {
> /* Common members for both data and metadata pages */
> spinlock_t lock;
> + u16 uptodate_bitmap;
> union {
> /* Structures only used by metadata */
> struct {
> @@ -35,6 +36,17 @@ struct btrfs_subpage {
> int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
> void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>
> +static inline void btrfs_subpage_clamp_range(struct page *page,
> + u64 *start, u32 *len)
> +{
> + u64 orig_start = *start;
> + u32 orig_len = *len;
> +
> + *start = max_t(u64, page_offset(page), orig_start);
> + *len = min_t(u64, page_offset(page) + PAGE_SIZE,
> + orig_start + orig_len) - *start;
> +}
This handles ebs which span pages, right? If so, a comment is in order,
since there is no design document specifying whether an eb can or
cannot span multiple pages.
> +
> /*
> * Convert the [start, start + len) range into a u16 bitmap
> *
> @@ -96,4 +108,90 @@ static inline bool btrfs_subpage_clear_and_test_tree_block(
> return last;
> }
>
> +static inline void btrfs_subpage_set_uptodate(struct btrfs_fs_info *fs_info,
> + struct page *page, u64 start, u32 len)
> +{
> + struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
> + u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
> + unsigned long flags;
> +
> + spin_lock_irqsave(&subpage->lock, flags);
> + subpage->uptodate_bitmap |= tmp;
> + if (subpage->uptodate_bitmap == (u16)-1)
just use U16_MAX instead of (u16)-1.
> + SetPageUptodate(page);
> + spin_unlock_irqrestore(&subpage->lock, flags);
> +}
> +
> +static inline void btrfs_subpage_clear_uptodate(struct btrfs_fs_info *fs_info,
> + struct page *page, u64 start, u32 len)
> +{
> + struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
> + u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
> + unsigned long flags;
> +
> + spin_lock_irqsave(&subpage->lock, flags);
> + subpage->tree_block_bitmap &= ~tmp;
I guess you meant to clear uptodate_bitmap and not tree_block_bitmap?
> + ClearPageUptodate(page);
> + spin_unlock_irqrestore(&subpage->lock, flags);
> +}
> +
<snip>
> +DECLARE_BTRFS_SUBPAGE_TEST_OP(uptodate);
> +
> +/*
> + * Note that, in selftest, especially extent-io-tests, we can have empty
> + * fs_info passed in.
> + * Thanfully in selftest, we only test sectorsize == PAGE_SIZE cases so far
nit: s/Thanfully/Thankfully
> + * thus we can fall back to regular sectorsize branch.
> + */
<snip>
>
* Re: [PATCH v2 09/18] btrfs: subpage: introduce helper for subpage uptodate status
2020-12-11 10:10 ` Nikolay Borisov
@ 2020-12-11 10:48 ` Qu Wenruo
2020-12-11 11:41 ` Nikolay Borisov
0 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2020-12-11 10:48 UTC (permalink / raw)
To: Nikolay Borisov, Qu Wenruo, linux-btrfs
On 2020/12/11 6:10 PM, Nikolay Borisov wrote:
>
>
> On 10.12.20 8:38, Qu Wenruo wrote:
>> This patch introduce the following functions to handle btrfs subpage
>> uptodate status:
>> - btrfs_subpage_set_uptodate()
>> - btrfs_subpage_clear_uptodate()
>> - btrfs_subpage_test_uptodate()
>> Those helpers can only be called when the range is ensured to be
>> inside the page.
>>
>> - btrfs_page_set_uptodate()
>> - btrfs_page_clear_uptodate()
>> - btrfs_page_test_uptodate()
>> Those helpers can handle both regular sector size and subpage without
>> problem.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> fs/btrfs/subpage.h | 98 ++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 98 insertions(+)
>>
>> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
>> index 87b4e028ae18..b3cf9171ec98 100644
>> --- a/fs/btrfs/subpage.h
>> +++ b/fs/btrfs/subpage.h
>> @@ -23,6 +23,7 @@
>> struct btrfs_subpage {
>> /* Common members for both data and metadata pages */
>> spinlock_t lock;
>> + u16 uptodate_bitmap;
>> union {
>> /* Structures only used by metadata */
>> struct {
>> @@ -35,6 +36,17 @@ struct btrfs_subpage {
>> int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>> void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>>
>> +static inline void btrfs_subpage_clamp_range(struct page *page,
>> + u64 *start, u32 *len)
>> +{
>> + u64 orig_start = *start;
>> + u32 orig_len = *len;
>> +
>> + *start = max_t(u64, page_offset(page), orig_start);
>> + *len = min_t(u64, page_offset(page) + PAGE_SIZE,
>> + orig_start + orig_len) - *start;
>> +}
>
> This handles EB's which span pages, right? If so - a comment is in order
> since there is no design document specifying whether eb can or cannot
> span multiple pages.
Didn't I already state that in the subpage eb accessors patch?
No subpage eb can cross a page boundary.
>
>> +
>> /*
>> * Convert the [start, start + len) range into a u16 bitmap
>> *
>> @@ -96,4 +108,90 @@ static inline bool btrfs_subpage_clear_and_test_tree_block(
>> return last;
>> }
>>
>> +static inline void btrfs_subpage_set_uptodate(struct btrfs_fs_info *fs_info,
>> + struct page *page, u64 start, u32 len)
>> +{
>> + struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
>> + u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&subpage->lock, flags);
>> + subpage->uptodate_bitmap |= tmp;
>> + if (subpage->uptodate_bitmap == (u16)-1)
>
> just use U16_MAX instead of (u16)-1.
>
>> + SetPageUptodate(page);
>> + spin_unlock_irqrestore(&subpage->lock, flags);
>> +}
>> +
>> +static inline void btrfs_subpage_clear_uptodate(struct btrfs_fs_info *fs_info,
>> + struct page *page, u64 start, u32 len)
>> +{
>> + struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
>> + u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&subpage->lock, flags);
>> + subpage->tree_block_bitmap &= ~tmp;
>
> I guess you meant to clear uptodate_bitmap and not tree_block_bitmap?
Oh my...
Thanks for catching this,
Qu
>
>> + ClearPageUptodate(page);
>> + spin_unlock_irqrestore(&subpage->lock, flags);
>> +}
>> +
>
> <snip>
>
>> +DECLARE_BTRFS_SUBPAGE_TEST_OP(uptodate);
>> +
>> +/*
>> + * Note that, in selftest, especially extent-io-tests, we can have empty
>> + * fs_info passed in.
>> + * Thanfully in selftest, we only test sectorsize == PAGE_SIZE cases so far
>
> nit: s/Thanfully/Thankfully
>
>> + * thus we can fall back to regular sectorsize branch.
>> + */
>
> <snip>
>
>>
* Re: [PATCH v2 09/18] btrfs: subpage: introduce helper for subpage uptodate status
2020-12-11 10:48 ` Qu Wenruo
@ 2020-12-11 11:41 ` Nikolay Borisov
2020-12-11 11:56 ` Qu Wenruo
0 siblings, 1 reply; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-11 11:41 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, linux-btrfs
On 11.12.20 12:48, Qu Wenruo wrote:
>
>
> On 2020/12/11 6:10 PM, Nikolay Borisov wrote:
>>
>>
>> On 10.12.20 8:38, Qu Wenruo wrote:
>>> This patch introduce the following functions to handle btrfs subpage
>>> uptodate status:
>>> - btrfs_subpage_set_uptodate()
>>> - btrfs_subpage_clear_uptodate()
>>> - btrfs_subpage_test_uptodate()
>>> Those helpers can only be called when the range is ensured to be
>>> inside the page.
>>>
>>> - btrfs_page_set_uptodate()
>>> - btrfs_page_clear_uptodate()
>>> - btrfs_page_test_uptodate()
>>> Those helpers can handle both regular sector size and subpage without
>>> problem.
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>> fs/btrfs/subpage.h | 98 ++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 98 insertions(+)
>>>
>>> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
>>> index 87b4e028ae18..b3cf9171ec98 100644
>>> --- a/fs/btrfs/subpage.h
>>> +++ b/fs/btrfs/subpage.h
>>> @@ -23,6 +23,7 @@
>>> struct btrfs_subpage {
>>> /* Common members for both data and metadata pages */
>>> spinlock_t lock;
>>> + u16 uptodate_bitmap;
>>> union {
>>> /* Structures only used by metadata */
>>> struct {
>>> @@ -35,6 +36,17 @@ struct btrfs_subpage {
>>> int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>>> void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>>>
>>> +static inline void btrfs_subpage_clamp_range(struct page *page,
>>> + u64 *start, u32 *len)
>>> +{
>>> + u64 orig_start = *start;
>>> + u32 orig_len = *len;
>>> +
>>> + *start = max_t(u64, page_offset(page), orig_start);
>>> + *len = min_t(u64, page_offset(page) + PAGE_SIZE,
>>> + orig_start + orig_len) - *start;
>>> +}
>>
>> This handles EB's which span pages, right? If so - a comment is in order
>> since there is no design document specifying whether eb can or cannot
>> span multiple pages.
>
> Didn't I already state that in the subpage eb accessors patch?
>
> No subpage eb can cross a page boundary.
>
As just discussed during the whiteboard session, this function is
really dead code for ebs, because they are guaranteed not to span
pages. Even for RW support it seems only btrfs_dirty_pages changes
page flags without having clamped the range first, i.e. there is only
one exception. In light of this I think it would be better to replace
this function with ASSERTs and handle the only exception at the call
site.
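A possible shape of the ASSERT-based variant being suggested here (a hypothetical sketch against the original helper's arguments, not a tested replacement):

```c
/*
 * Hypothetical replacement for btrfs_subpage_clamp_range(): callers
 * are expected to pass ranges already contained within the page, so
 * just assert instead of clamping.
 */
ASSERT(start >= page_offset(page) &&
       start + len <= page_offset(page) + PAGE_SIZE);
```

The one call site that can legitimately pass an unclamped range (btrfs_dirty_pages, per the discussion above) would then clamp before calling in.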
<snip>
* Re: [PATCH v2 09/18] btrfs: subpage: introduce helper for subpage uptodate status
2020-12-11 11:41 ` Nikolay Borisov
@ 2020-12-11 11:56 ` Qu Wenruo
0 siblings, 0 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-11 11:56 UTC (permalink / raw)
To: Nikolay Borisov, Qu Wenruo, linux-btrfs
On 2020/12/11 7:41 PM, Nikolay Borisov wrote:
>
>
> On 11.12.20 12:48, Qu Wenruo wrote:
>>
>>
>> On 2020/12/11 6:10 PM, Nikolay Borisov wrote:
>>>
>>>
>>> On 10.12.20 8:38, Qu Wenruo wrote:
>>>> This patch introduce the following functions to handle btrfs subpage
>>>> uptodate status:
>>>> - btrfs_subpage_set_uptodate()
>>>> - btrfs_subpage_clear_uptodate()
>>>> - btrfs_subpage_test_uptodate()
>>>> Those helpers can only be called when the range is ensured to be
>>>> inside the page.
>>>>
>>>> - btrfs_page_set_uptodate()
>>>> - btrfs_page_clear_uptodate()
>>>> - btrfs_page_test_uptodate()
>>>> Those helpers can handle both regular sector size and subpage without
>>>> problem.
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>> fs/btrfs/subpage.h | 98 ++++++++++++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 98 insertions(+)
>>>>
>>>> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
>>>> index 87b4e028ae18..b3cf9171ec98 100644
>>>> --- a/fs/btrfs/subpage.h
>>>> +++ b/fs/btrfs/subpage.h
>>>> @@ -23,6 +23,7 @@
>>>> struct btrfs_subpage {
>>>> /* Common members for both data and metadata pages */
>>>> spinlock_t lock;
>>>> + u16 uptodate_bitmap;
>>>> union {
>>>> /* Structures only used by metadata */
>>>> struct {
>>>> @@ -35,6 +36,17 @@ struct btrfs_subpage {
>>>> int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>>>> void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>>>>
>>>> +static inline void btrfs_subpage_clamp_range(struct page *page,
>>>> + u64 *start, u32 *len)
>>>> +{
>>>> + u64 orig_start = *start;
>>>> + u32 orig_len = *len;
>>>> +
>>>> + *start = max_t(u64, page_offset(page), orig_start);
>>>> + *len = min_t(u64, page_offset(page) + PAGE_SIZE,
>>>> + orig_start + orig_len) - *start;
>>>> +}
>>>
>>> This handles EB's which span pages, right? If so - a comment is in order
>>> since there is no design document specifying whether eb can or cannot
>>> span multiple pages.
>>
>> Didn't I already state that in the subpage eb accessors patch?
>>
>> No subpage eb can cross a page boundary.
>>
>
> As just discussed during the whiteboard session this function is really
> dead code for eb's because they are guaranteed to not span pages. Even
> for RW support it seems there is only btrfs_dirty_pages which changes
> page flags without having clamped the data i.e. there's only 1
> exception. In light of this I think it would be better to replace this
> function with ASSERTS and handle the only exception at the call site.
You're completely right.
I'll definitely change these in the next update.
Thanks,
Qu
>
>
> <snip>
>
* Re: [PATCH v2 12/18] btrfs: extent_io: implement try_release_extent_buffer() for subpage metadata support
2020-12-10 6:38 ` [PATCH v2 12/18] btrfs: extent_io: implement try_release_extent_buffer() for subpage metadata support Qu Wenruo
@ 2020-12-11 12:00 ` Nikolay Borisov
2020-12-11 12:11 ` Qu Wenruo
0 siblings, 1 reply; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-11 12:00 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 10.12.20 8:38, Qu Wenruo wrote:
> Unlike the original try_release_extent_buffer,
> try_release_subpage_extent_buffer() will iterate through
> btrfs_subpage::tree_block_bitmap, and try to release each extent buffer.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/extent_io.c | 73 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 73 insertions(+)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 141e414b1ab9..4d55803302e9 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -6258,10 +6258,83 @@ void memmove_extent_buffer(const struct extent_buffer *dst,
> }
> }
>
> +static int try_release_subpage_extent_buffer(struct page *page)
> +{
> + struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
> + u64 page_start = page_offset(page);
> + int bitmap_size = BTRFS_SUBPAGE_BITMAP_SIZE;
Remove this variable and directly use BTRFS_SUBPAGE_BITMAP_SIZE as a
terminating condition
> + int bit_start = 0;
> + int ret;
> +
> + while (bit_start < bitmap_size) {
You really want to iterate for a fixed number of items so switch that to
a for loop.
> + struct btrfs_subpage *subpage;
> + struct extent_buffer *eb;
> + unsigned long flags;
> + u16 tmp = 1 << bit_start;
> + u64 start;
> +
> + /*
> + * Make sure the page still has private, as previous run can
> + * detach the private
> + */
But if a previous run had executed, it would have disposed of this eb
and you wouldn't find this page at all, no?
> + spin_lock(&page->mapping->private_lock);
> + if (!PagePrivate(page)) {
> + spin_unlock(&page->mapping->private_lock);
> + break;
> + }
> + subpage = (struct btrfs_subpage *)page->private;
> + spin_unlock(&page->mapping->private_lock);
> +
> + spin_lock_irqsave(&subpage->lock, flags);
> + if (!(tmp & subpage->tree_block_bitmap)) {
> + spin_unlock_irqrestore(&subpage->lock, flags);
> + bit_start++;
> + continue;
> + }
> + spin_unlock_irqrestore(&subpage->lock, flags);
> +
> + start = bit_start * fs_info->sectorsize + page_start;
> + bit_start += fs_info->nodesize >> fs_info->sectorsize_bits;
By doing this you are really saying "skip all blocks pertaining to this
eb". For this to be correct it would imply that bit_start should
_always_ be 0, 4, 8 or 12 - am I correct? But what happens if
if (!(tmp & subpage->tree_block_bitmap)) has executed and bit_start is
now 1? Then you'd make start = page_start + 4K and skip the next 4
sectors (16K), but that would be wrong, no?
Essentially the page would look like:
|0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15
So you want to release the EBs that span 0-3, 4-7, 8-11 and 12-15, but
what if bit_start becomes 1 and you add 4 to that? This offsets all
further calculation by 1, i.e. you are going into the next eb.
> + /*
> + * Here we can't call find_extent_buffer() which will increase
> + * eb->refs.
> + */
> + rcu_read_lock();
> + eb = radix_tree_lookup(&fs_info->buffer_radix,
> + start >> fs_info->sectorsize_bits);
> + rcu_read_unlock();
> + ASSERT(eb);
> + spin_lock(&eb->refs_lock);
> + if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb) ||
> + !test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
> + spin_unlock(&eb->refs_lock);
> + continue;
> + }
> + /*
> + * Here we don't care the return value, we will always check
> + * the page private at the end.
> + * And release_extent_buffer() will release the refs_lock.
> + */
> + release_extent_buffer(eb);
> + }
> + /* Finally to check if we have cleared page private */
> + spin_lock(&page->mapping->private_lock);
> + if (!PagePrivate(page))
> + ret = 1;
> + else
> + ret = 0;
> + spin_unlock(&page->mapping->private_lock);
> + return ret;
> +
> +}
> +
> int try_release_extent_buffer(struct page *page)
> {
> struct extent_buffer *eb;
>
> + if (btrfs_sb(page->mapping->host->i_sb)->sectorsize < PAGE_SIZE)
> + return try_release_subpage_extent_buffer(page);
> +
> /*
> * We need to make sure nobody is attaching this page to an eb right
> * now.
>
* Re: [PATCH v2 12/18] btrfs: extent_io: implement try_release_extent_buffer() for subpage metadata support
2020-12-11 12:00 ` Nikolay Borisov
@ 2020-12-11 12:11 ` Qu Wenruo
2020-12-11 16:57 ` Nikolay Borisov
0 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2020-12-11 12:11 UTC (permalink / raw)
To: Nikolay Borisov, linux-btrfs
On 2020/12/11 8:00 PM, Nikolay Borisov wrote:
>
>
> On 10.12.20 8:38, Qu Wenruo wrote:
>> Unlike the original try_release_extent_buffer,
>> try_release_subpage_extent_buffer() will iterate through
>> btrfs_subpage::tree_block_bitmap, and try to release each extent buffer.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> fs/btrfs/extent_io.c | 73 ++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 73 insertions(+)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 141e414b1ab9..4d55803302e9 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -6258,10 +6258,83 @@ void memmove_extent_buffer(const struct extent_buffer *dst,
>> }
>> }
>>
>> +static int try_release_subpage_extent_buffer(struct page *page)
>> +{
>> + struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
>> + u64 page_start = page_offset(page);
>> + int bitmap_size = BTRFS_SUBPAGE_BITMAP_SIZE;
>
> Remove this variable and directly use BTRFS_SUBPAGE_BITMAP_SIZE as a
> terminating condition
>
>> + int bit_start = 0;
>> + int ret;
>> +
>> + while (bit_start < bitmap_size) {
>
> You really want to iterate for a fixed number of items so switch that to
> a for loop.
The problem here is that the step is not always fixed.
If it finds a set bit, it skips (nodesize >> sectorsize_bits) bits.
But if the bit is not set, it advances by just one bit.
Thus I'm not sure a for loop is really a good choice for this variable
step.
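The variable-step scan being described can be modelled in a small userspace sketch (hypothetical names, not the kernel code; assumes 4K sectors and 16K nodes on a 64K page, i.e. 16 bits per page and 4 sectors per tree block):

```c
#include <assert.h>
#include <stdint.h>

#define SECTORS_PER_PAGE 16	/* 64K page / 4K sector */
#define SECTORS_PER_NODE 4	/* 16K node / 4K sector */

/*
 * Sketch of the variable-step scan: a set bit is the first sector of a
 * tree block, so after handling it we jump over the whole block; a
 * clear bit only advances by a single sector. Records each found tree
 * block's starting sector in starts[] and returns how many were found.
 */
static int scan_tree_blocks(uint16_t bitmap, int *starts, int max)
{
	int found = 0;
	int bit = 0;

	while (bit < SECTORS_PER_PAGE) {
		if (!(bitmap & (1u << bit))) {
			bit++;	/* empty sector: step by one bit */
			continue;
		}
		if (found < max)
			starts[found] = bit;
		found++;
		bit += SECTORS_PER_NODE;	/* skip the whole tree block */
	}
	return found;
}
```

Because clear bits advance by one sector only, a tree block that starts at sector 1 (not nodesize-aligned within the page) is still found at the right bit, which a fixed nodesize step would get wrong.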
>
>> + struct btrfs_subpage *subpage;
>> + struct extent_buffer *eb;
>> + unsigned long flags;
>> + u16 tmp = 1 << bit_start;
>> + u64 start;
>> +
>> + /*
>> + * Make sure the page still has private, as previous run can
>> + * detach the private
>> + */
>
> But if previous run has run it would have disposed of this eb and you
> won't find this page at all, no ?
By "previous run" I mean a previous iteration of the same loop.
E.g. the page has 4 bits set, covering just one eb (16K nodesize).
The first iteration releases the only eb of the page and clears page
private.
The second iteration then sees private is cleared and needs to break
out.
>
>> + spin_lock(&page->mapping->private_lock);
>> + if (!PagePrivate(page)) {
>> + spin_unlock(&page->mapping->private_lock);
>> + break;
>> + }
>> + subpage = (struct btrfs_subpage *)page->private;
>> + spin_unlock(&page->mapping->private_lock);
>> +
>> + spin_lock_irqsave(&subpage->lock, flags);
>> + if (!(tmp & subpage->tree_block_bitmap)) {
>> + spin_unlock_irqrestore(&subpage->lock, flags);
>> + bit_start++;
>> + continue;
>> + }
>> + spin_unlock_irqrestore(&subpage->lock, flags);
>> +
>> + start = bit_start * fs_info->sectorsize + page_start;
>> + bit_start += fs_info->nodesize >> fs_info->sectorsize_bits;
>
> By doing this you are really saying "skip all blocks pertaining to this
> eb". In order for this to be correct it would imply that bit_start
> should _always_ be 0,4,8,12 - am I correct?
Nope. As long as no eb crosses a page boundary, it won't cause a
problem.
So in theory we support a case like an eb spanning sectors 1~5.
> But what happens if
> if (!(tmp & subpage->tree_block_bitmap)) has executed and bit_start is
> now 1, then you'd make start = page_start + 4k , skip next 4(16k) blocks
> but that would be wrong, no ?
For the (!(tmp & subpage->tree_block_bitmap)) branch, isn't bit_start
just increased by one?
Exactly as I said, we check the next sector until we hit the first set
bit.
And only when we hit a set bit do we increase bit_start by nodesize /
sectorsize.
>
> Essentially the page would look like:
>
> |0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15
>
> So you want to release the EB's that spawn 0-3, 4-7, 8-11, 12-15, but
> what if bit_start becomes 1 and you add 4 to that, this offsets all
> further calculation by 1 i.e you are going into the next eb.
Nope, 4 is only added when we hit a set bit.
If we hit a zero bit, we jump to the next bit, not ahead by nodesize >>
sectorsize_bits.
That's exactly the reason I'm not using a for() loop here: the step
size differs.
Thanks,
Qu
>
>
>> + /*
>> + * Here we can't call find_extent_buffer() which will increase
>> + * eb->refs.
>> + */
>> + rcu_read_lock();
>> + eb = radix_tree_lookup(&fs_info->buffer_radix,
>> + start >> fs_info->sectorsize_bits);
>> + rcu_read_unlock();
>> + ASSERT(eb);
>> + spin_lock(&eb->refs_lock);
>> + if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb) ||
>> + !test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
>> + spin_unlock(&eb->refs_lock);
>> + continue;
>> + }
>> + /*
>> + * Here we don't care the return value, we will always check
>> + * the page private at the end.
>> + * And release_extent_buffer() will release the refs_lock.
>> + */
>> + release_extent_buffer(eb);
>> + }
>> + /* Finally to check if we have cleared page private */
>> + spin_lock(&page->mapping->private_lock);
>> + if (!PagePrivate(page))
>> + ret = 1;
>> + else
>> + ret = 0;
>> + spin_unlock(&page->mapping->private_lock);
>> + return ret;
>> +
>> +}
>> +
>> int try_release_extent_buffer(struct page *page)
>> {
>> struct extent_buffer *eb;
>>
>> + if (btrfs_sb(page->mapping->host->i_sb)->sectorsize < PAGE_SIZE)
>> + return try_release_subpage_extent_buffer(page);
>> +
>> /*
>> * We need to make sure nobody is attaching this page to an eb right
>> * now.
>>
>
* Re: [PATCH v2 12/18] btrfs: extent_io: implement try_release_extent_buffer() for subpage metadata support
2020-12-11 12:11 ` Qu Wenruo
@ 2020-12-11 16:57 ` Nikolay Borisov
2020-12-12 1:28 ` Qu Wenruo
2020-12-12 5:44 ` Qu Wenruo
0 siblings, 2 replies; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-11 16:57 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 11.12.20 14:11, Qu Wenruo wrote:
>
>
> On 2020/12/11 8:00 PM, Nikolay Borisov wrote:
>>
>>
>> On 10.12.20 8:38, Qu Wenruo wrote:
>>> Unlike the original try_release_extent_buffer,
>>> try_release_subpage_extent_buffer() will iterate through
>>> btrfs_subpage::tree_block_bitmap, and try to release each extent buffer.
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>> fs/btrfs/extent_io.c | 73 ++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 73 insertions(+)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index 141e414b1ab9..4d55803302e9 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -6258,10 +6258,83 @@ void memmove_extent_buffer(const struct extent_buffer *dst,
>>> }
>>> }
>>>
>>> +static int try_release_subpage_extent_buffer(struct page *page)
>>> +{
>>> + struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
>>> + u64 page_start = page_offset(page);
>>> + int bitmap_size = BTRFS_SUBPAGE_BITMAP_SIZE;
>>
>> Remove this variable and directly use BTRFS_SUBPAGE_BITMAP_SIZE as a
>> terminating condition
>>
>>> + int bit_start = 0;
>>> + int ret;
>>> +
>>> + while (bit_start < bitmap_size) {
>>
>> You really want to iterate for a fixed number of items so switch that to
>> a for loop.
>
> The problem here is, it's not always fixed.
>
> If it finds one bit set, it will skip (nodesize >> sectorsize_bits) bits.
>
> But if not found, it will skip to just next bit.
>
> Thus I'm not sure if for loop is really a good choice here for
> differential step.
>
>>
>>> + struct btrfs_subpage *subpage;
>>> + struct extent_buffer *eb;
>>> + unsigned long flags;
>>> + u16 tmp = 1 << bit_start;
>>> + u64 start;
>>> +
>>> + /*
>>> + * Make sure the page still has private, as previous run can
>>> + * detach the private
>>> + */
>>
>> But if previous run has run it would have disposed of this eb and you
>> won't find this page at all, no ?
>
> For the "previous run" I mean, previous iteration in the same loop.
>
> E.g. the page has 4 bits set, just one eb (16K nodesize).
Isn't it guaranteed that if, while iterating the ebs in a page, you
meet an empty block, then the whole extent buffer is gone? Hence
instead of doing bit_start++ you ought to also increment by nodesize.
For example, assume a page contains 4 EBs:
0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15
x|x|x|x|0|0|0|0|x|x| x|x |0 | 0|0 |0
So the first bit is set, so you proceed to call
release_extent_buffer() on it, which clears the first 4 bits in
tree_block_bitmap; in this case you've incremented by nodesize, so the
next iteration begins at index 4. You detect it's unset (0), hence you
increment by 1, and you repeat this for the next 3 bits, then you free
the whole of the next eb. I argue that you also need to increment by
nodesize in the case of a bit which is not set, because you cannot
really see a partially freed eb, i.e. you cannot see the following
state:
0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15
x|x|x|x|x|0|0|0|x|x| x|x |0 | 0|0 |0
Am I missing something?
>
> For the first run, it release the only eb of the page, and cleared page
> private.
> For the second run, since private is cleared, we need to break out.
>
>>
>>> + spin_lock(&page->mapping->private_lock);
>>> + if (!PagePrivate(page)) {
>>> + spin_unlock(&page->mapping->private_lock);
>>> + break;
>>> + }
Aren't we guaranteed that a page has private if this function is called?
>>> + subpage = (struct btrfs_subpage *)page->private;
>>> + spin_unlock(&page->mapping->private_lock);
>>> +
>>> + spin_lock_irqsave(&subpage->lock, flags);
>>> + if (!(tmp & subpage->tree_block_bitmap)) {
>>> + spin_unlock_irqrestore(&subpage->lock, flags);
>>> + bit_start++;
>>> + continue;
>>> + }
>>> + spin_unlock_irqrestore(&subpage->lock, flags);
>>> +
>>> + start = bit_start * fs_info->sectorsize + page_start;
>>> + bit_start += fs_info->nodesize >> fs_info->sectorsize_bits;
<snip>
> Thanks,
> Qu
>>
>>
>>> + /*
>>> + * Here we can't call find_extent_buffer() which will increase
>>> + * eb->refs.
>>> + */
>>> + rcu_read_lock();
>>> + eb = radix_tree_lookup(&fs_info->buffer_radix,
>>> + start >> fs_info->sectorsize_bits);
>>> + rcu_read_unlock();
Your usage of radix_tree_lookup + the RCU lock is wrong. RCU
guarantees that an EB you get won't be freed while the RCU read-side
section is active; however, you obtain a reference to the EB without
incrementing the ref count WHILE holding the RCU critical section.
Consult find_extent_buffer() for the correct usage pattern.
Frankly, the locking in this function is insane: first
mapping->private_lock is acquired to check if PagePrivate is set, and
then page->private is dereferenced, but that access is not protected
at all. Then subpage->lock is taken to check the tree_block_bitmap,
then the lock is dropped. At that point no locks are held, so this
page could possibly be referenced by someone else? Then the buggy
lookup is used to get the eb, then you lock refs_lock and call
release_extent_buffer()...
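For reference, a sketch of the lookup-then-ref pattern find_extent_buffer() follows (reusing the patch's variable names; a sketch only, not a drop-in fix): the reference is taken while still inside the RCU read-side critical section, so the eb cannot be freed between lookup and ref.

```c
rcu_read_lock();
eb = radix_tree_lookup(&fs_info->buffer_radix,
		       start >> fs_info->sectorsize_bits);
/* Take the ref before leaving the RCU read-side critical section */
if (eb && !atomic_inc_not_zero(&eb->refs))
	eb = NULL;	/* eb is being freed; treat as not found */
rcu_read_unlock();
```

The extra reference then has to be dropped (e.g. via free_extent_buffer()) once the caller is done with the eb, which the loop above would need to account for on its continue paths.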
>>> + ASSERT(eb);
Doing this outside of the RCU read-side critical section _without_
having incremented the ref count is buggy!
>>> + spin_lock(&eb->refs_lock);
>>> + if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb) ||
>>> + !test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
>>> + spin_unlock(&eb->refs_lock);
>>> + continue;
>>> + }
>>> + /*
<snip>
* Re: [PATCH v2 12/18] btrfs: extent_io: implement try_release_extent_buffer() for subpage metadata support
2020-12-11 16:57 ` Nikolay Borisov
@ 2020-12-12 1:28 ` Qu Wenruo
2020-12-12 9:26 ` Nikolay Borisov
2020-12-12 5:44 ` Qu Wenruo
1 sibling, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2020-12-12 1:28 UTC (permalink / raw)
To: Nikolay Borisov, Qu Wenruo, linux-btrfs
On 2020/12/12 12:57 AM, Nikolay Borisov wrote:
>
>
> On 11.12.20 14:11, Qu Wenruo wrote:
>>
>>
>> On 2020/12/11 8:00 PM, Nikolay Borisov wrote:
>>>
>>>
>>> On 10.12.20 8:38, Qu Wenruo wrote:
>>>> Unlike the original try_release_extent_buffer,
>>>> try_release_subpage_extent_buffer() will iterate through
>>>> btrfs_subpage::tree_block_bitmap, and try to release each extent buffer.
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>> fs/btrfs/extent_io.c | 73 ++++++++++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 73 insertions(+)
>>>>
>>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>>> index 141e414b1ab9..4d55803302e9 100644
>>>> --- a/fs/btrfs/extent_io.c
>>>> +++ b/fs/btrfs/extent_io.c
>>>> @@ -6258,10 +6258,83 @@ void memmove_extent_buffer(const struct extent_buffer *dst,
>>>> }
>>>> }
>>>>
>>>> +static int try_release_subpage_extent_buffer(struct page *page)
>>>> +{
>>>> + struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
>>>> + u64 page_start = page_offset(page);
>>>> + int bitmap_size = BTRFS_SUBPAGE_BITMAP_SIZE;
>>>
>>> Remove this variable and directly use BTRFS_SUBPAGE_BITMAP_SIZE as a
>>> terminating condition
>>>
>>>> + int bit_start = 0;
>>>> + int ret;
>>>> +
>>>> + while (bit_start < bitmap_size) {
>>>
>>> You really want to iterate for a fixed number of items so switch that to
>>> a for loop.
>>
>> The problem here is, it's not always fixed.
>>
>> If it finds one bit set, it will skip (nodesize >> sectorsize_bits) bits.
>>
>> But if not found, it will skip to just next bit.
>>
>> Thus I'm not sure if for loop is really a good choice here for
>> differential step.
>>
>>>
>>>> + struct btrfs_subpage *subpage;
>>>> + struct extent_buffer *eb;
>>>> + unsigned long flags;
>>>> + u16 tmp = 1 << bit_start;
>>>> + u64 start;
>>>> +
>>>> + /*
>>>> + * Make sure the page still has private, as previous run can
>>>> + * detach the private
>>>> + */
>>>
>>> But if previous run has run it would have disposed of this eb and you
>>> won't find this page at all, no ?
>>
>> For the "previous run" I mean, previous iteration in the same loop.
>>
>> E.g. the page has 4 bits set, just one eb (16K nodesize).
>
> Isn't it guaranteed that if you iterate the eb's in a page if you meet
> an empty block then the whole extent buffer is gone, hence instead of
> doing bit_start++ you ought to also increment by the size of nodesize.
>
> For example, assume a page contains 4 EBs:
>
> 0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15
> x|x|x|x|0|0|0|0|x|x| x|x |0 | 0|0 |0
>
> So first bit is set, so you proceed to call release_extent_buffer on it,
> which clears the first 4 bits in tree_block_bitmap, in this case you've
> incremented by nodesize so next iteration begins at index 4. You detect
> it's unset (0) hence you increment it byte 1 and you repeat this for the
> next 3 bits, then you free the whole of the next eb. I argue that you
> also need to increment by nodesize in the case of a bit which is not
> set, because you cannot really see partially freed eb i.e you cannot see
> the following state:
>
> 0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15
> x|x|x|x|x|0|0|0|x|x| x|x |0 | 0|0 |0
>
> Am I missing something?
It's not for a partly freed eb, but for a nodesize-unaligned eb.
E.g. if we have an eb starting at sector 1 of a page, your
nodesize-based iteration would go crazy.
Although we have ensured no subpage eb can cross a page boundary,
there is no equivalent requirement for nodesize alignment.
Thus I use the extra-safe step for the empty bit.
Thanks,
Qu
>
>
>
>
>>
>> For the first run, it release the only eb of the page, and cleared page
>> private.
>> For the second run, since private is cleared, we need to break out.
>>
>>>
>>>> + spin_lock(&page->mapping->private_lock);
>>>> + if (!PagePrivate(page)) {
>>>> + spin_unlock(&page->mapping->private_lock);
>>>> + break;
>>>> + }
>
> Aren't we guaranteed that a page has private if this function is called ?
>
>>>> + subpage = (struct btrfs_subpage *)page->private;
>>>> + spin_unlock(&page->mapping->private_lock);
>>>> +
>>>> + spin_lock_irqsave(&subpage->lock, flags);
>>>> + if (!(tmp & subpage->tree_block_bitmap)) {
>>>> + spin_unlock_irqrestore(&subpage->lock, flags);
>>>> + bit_start++;
>>>> + continue;
>>>> + }
>>>> + spin_unlock_irqrestore(&subpage->lock, flags);
>>>> +
>>>> + start = bit_start * fs_info->sectorsize + page_start;
>>>> + bit_start += fs_info->nodesize >> fs_info->sectorsize_bits;
>
> <snip>
>
>> Thanks,
>> Qu
>>>
>>>
>>>> + /*
>>>> + * Here we can't call find_extent_buffer() which will increase
>>>> + * eb->refs.
>>>> + */
>>>> + rcu_read_lock();
>>>> + eb = radix_tree_lookup(&fs_info->buffer_radix,
>>>> + start >> fs_info->sectorsize_bits);
>>>> + rcu_read_unlock();
>
> Your usage of radix_tree_lookup + rcu lock is wrong. rcu guarantees that
> an EB you get won't be freed while the rcu section is active, however
> you get a reference to the EB and you do not increment the ref count
> WHILE holding the RCU critical section, consult find_extent_buffer
> what's the correct usage pattern.
>
> Frankly the locking in this function is insane, first mapping->private
> lock is acquired to check if Page_private is set and then page->private
> is referenced but that is not signalled at all. Then subpage->lock is
> taken to check the tree_block_bitmap, then the lock is dropped. At that
> point no locks are held so this page could possibly be referenced by
> someone else? Then the buggy locking is used to get the eb, then you
> lock refs_lock and call release_extent_buffer...
>
>>>> + ASSERT(eb);
>
> Doing this outside of the rcu read side critical section _without_
> incrementing the ref count is buggy!
>
>>>> + spin_lock(&eb->refs_lock);
>>>> + if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb) ||
>>>> + !test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
>>>> + spin_unlock(&eb->refs_lock);
>>>> + continue;
>>>> + }
>>>> + /*
>
>
> <snip>
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 12/18] btrfs: extent_io: implement try_release_extent_buffer() for subpage metadata support
2020-12-11 16:57 ` Nikolay Borisov
2020-12-12 1:28 ` Qu Wenruo
@ 2020-12-12 5:44 ` Qu Wenruo
2020-12-12 10:30 ` Nikolay Borisov
1 sibling, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2020-12-12 5:44 UTC (permalink / raw)
To: Nikolay Borisov, Qu Wenruo, linux-btrfs
On 2020/12/12 at 12:57 AM, Nikolay Borisov wrote:
>
>
> On 11.12.20 г. 14:11 ч., Qu Wenruo wrote:
>>
>>
>> On 2020/12/11 下午8:00, Nikolay Borisov wrote:
>>>
>>>
>>> On 10.12.20 г. 8:38 ч., Qu Wenruo wrote:
>>>> Unlike the original try_release_extent_buffer,
>>>> try_release_subpage_extent_buffer() will iterate through
>>>> btrfs_subpage::tree_block_bitmap, and try to release each extent buffer.
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>> fs/btrfs/extent_io.c | 73 ++++++++++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 73 insertions(+)
>>>>
>>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>>> index 141e414b1ab9..4d55803302e9 100644
>>>> --- a/fs/btrfs/extent_io.c
>>>> +++ b/fs/btrfs/extent_io.c
>>>> @@ -6258,10 +6258,83 @@ void memmove_extent_buffer(const struct extent_buffer *dst,
>>>> }
>>>> }
>>>>
>>>> +static int try_release_subpage_extent_buffer(struct page *page)
>>>> +{
>>>> + struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
>>>> + u64 page_start = page_offset(page);
>>>> + int bitmap_size = BTRFS_SUBPAGE_BITMAP_SIZE;
>>>
>>> Remove this variable and directly use BTRFS_SUBPAGE_BITMAP_SIZE as a
>>> terminating condition
>>>
>>>> + int bit_start = 0;
>>>> + int ret;
>>>> +
>>>> + while (bit_start < bitmap_size) {
>>>
>>> You really want to iterate for a fixed number of items so switch that to
>>> a for loop.
>>
>> The problem here is, it's not always fixed.
>>
>> If it finds one bit set, it will skip (nodesize >> sectorsize_bits) bits.
>>
>> But if not found, it will skip to just next bit.
>>
>> Thus I'm not sure if for loop is really a good choice here for
>> differential step.
>>
>>>
>>>> + struct btrfs_subpage *subpage;
>>>> + struct extent_buffer *eb;
>>>> + unsigned long flags;
>>>> + u16 tmp = 1 << bit_start;
>>>> + u64 start;
>>>> +
>>>> + /*
>>>> + * Make sure the page still has private, as previous run can
>>>> + * detach the private
>>>> + */
>>>
>>> But if previous run has run it would have disposed of this eb and you
>>> won't find this page at all, no ?
>>
>> For the "previous run" I mean, previous iteration in the same loop.
>>
>> E.g. the page has 4 bits set, just one eb (16K nodesize).
>
> Isn't it guaranteed that if you iterate the eb's in a page if you meet
> an empty block then the whole extent buffer is gone, hence instead of
> doing bit_start++ you ought to also increment by the size of nodesize.
>
> For example, assume a page contains 4 EBs:
>
> 0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15
> x|x|x|x|0|0|0|0|x|x| x|x |0 | 0|0 |0
>
> So first bit is set, so you proceed to call release_extent_buffer on it,
> which clears the first 4 bits in tree_block_bitmap, in this case you've
> incremented by nodesize so next iteration begins at index 4. You detect
> it's unset (0) hence you increment it by 1, and you repeat this for the
> next 3 bits, then you free the whole of the next eb. I argue that you
> also need to increment by nodesize in the case of a bit which is not
> set, because you cannot really see a partially freed eb, i.e. you cannot see
> the following state:
>
> 0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15
> x|x|x|x|x|0|0|0|x|x| x|x |0 | 0|0 |0
>
> Am I missing something?
>
>
>
>
>>
>> For the first run, it release the only eb of the page, and cleared page
>> private.
>> For the second run, since private is cleared, we need to break out.
>>
>>>
>>>> + spin_lock(&page->mapping->private_lock);
>>>> + if (!PagePrivate(page)) {
>>>> + spin_unlock(&page->mapping->private_lock);
>>>> + break;
>>>> + }
>
> Aren't we guaranteed that a page has private if this function is called ?
>
>>>> + subpage = (struct btrfs_subpage *)page->private;
>>>> + spin_unlock(&page->mapping->private_lock);
>>>> +
>>>> + spin_lock_irqsave(&subpage->lock, flags);
>>>> + if (!(tmp & subpage->tree_block_bitmap)) {
>>>> + spin_unlock_irqrestore(&subpage->lock, flags);
>>>> + bit_start++;
>>>> + continue;
>>>> + }
>>>> + spin_unlock_irqrestore(&subpage->lock, flags);
>>>> +
>>>> + start = bit_start * fs_info->sectorsize + page_start;
>>>> + bit_start += fs_info->nodesize >> fs_info->sectorsize_bits;
>
> <snip>
>
>> Thanks,
>> Qu
>>>
>>>
>>>> + /*
>>>> + * Here we can't call find_extent_buffer() which will increase
>>>> + * eb->refs.
>>>> + */
>>>> + rcu_read_lock();
>>>> + eb = radix_tree_lookup(&fs_info->buffer_radix,
>>>> + start >> fs_info->sectorsize_bits);
>>>> + rcu_read_unlock();
>
> Your usage of radix_tree_lookup + rcu lock is wrong. rcu guarantees that
> an EB you get won't be freed while the rcu section is active, however
> you get a reference to the EB and you do not increment the ref count
> WHILE holding the RCU critical section, consult find_extent_buffer
> what's the correct usage pattern.
Nope, you just fell into the same trap I fell into before.
Here, if the eb has no other holder, its refs is just 1 (because it's
still in the tree).
If you go and increase the refs, the eb becomes referenced again and
release_extent_buffer() won't free it at all.
So no eb would ever be freed.
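A toy userspace model of this refs rule (hypothetical names, C11 atomics standing in for the kernel helpers): an eb held only by the buffer radix tree has refs == 1, and the release path frees it only in that state, so bumping refs first would defeat the whole point.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for an extent buffer: refs starts at 1, held
 * by the buffer radix tree alone. */
struct toy_eb {
	atomic_int refs;
	int freed;
};

/* Mimic the release rule: the eb may only be freed while refs is
 * exactly 1 (the tree's own reference).  If a caller "helpfully"
 * bumps refs before calling this, the eb counts as in use and
 * survives, exactly the trap described above. */
static int toy_release_extent_buffer(struct toy_eb *eb)
{
	if (atomic_load(&eb->refs) != 1)
		return 0;		/* still referenced: keep it */
	atomic_store(&eb->refs, 0);
	eb->freed = 1;			/* would be freed for real here */
	return 1;
}
```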
>
> Frankly the locking in this function is insane, first mapping->private
> lock is acquired to check if Page_private is set and then page->private
> is referenced but that is not signalled at all.
Because we just want the page::private pointer.
We won't touch page::private until we're really going to detach/attach it.
But detach/attach will also modify subpage::tree_block_bitmap, which is
protected by subpage::lock.
So just grabbing the subpage pointer here is completely fine.
> Then subpage->lock is
> taken to check the tree_block_bitmap, then the lock is dropped. At that
> point no locks are held so this page could possibly be referenced by
> someone else?
Does it matter? We have the info we need (the eb bytenr), and that's all.
Other metadata operations may touch the page, but that won't cause
anything wrong.
> Then the buggy locking is used to get the eb, then you
> lock refs_lock and call release_extent_buffer...
Nope, the eb access is not buggy.
If you increased the refs, that would be buggy.
>
>>>> + ASSERT(eb);
>
> Doing this outside of the rcu read side critical section _without_
> incrementing the ref count is buggy!
Try increasing the refs when we're about to clean up an eb; that's what would really be buggy.
Thanks,
Qu
>
>>>> + spin_lock(&eb->refs_lock);
>>>> + if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb) ||
>>>> + !test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
>>>> + spin_unlock(&eb->refs_lock);
>>>> + continue;
>>>> + }
>>>> + /*
>
>
> <snip>
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 12/18] btrfs: extent_io: implement try_release_extent_buffer() for subpage metadata support
2020-12-12 1:28 ` Qu Wenruo
@ 2020-12-12 9:26 ` Nikolay Borisov
2020-12-12 10:26 ` Qu Wenruo
0 siblings, 1 reply; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-12 9:26 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, linux-btrfs
On 12.12.20 at 3:28, Qu Wenruo wrote:
>
>
> On 2020/12/12 上午12:57, Nikolay Borisov wrote:
>>
>>
>> On 11.12.20 г. 14:11 ч., Qu Wenruo wrote:
>>>
>>>
>>> On 2020/12/11 下午8:00, Nikolay Borisov wrote:
>>>>
>>>>
>>>> On 10.12.20 г. 8:38 ч., Qu Wenruo wrote:
>>>>> Unlike the original try_release_extent_buffer,
>>>>> try_release_subpage_extent_buffer() will iterate through
>>>>> btrfs_subpage::tree_block_bitmap, and try to release each extent
>>>>> buffer.
>>>>>
>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>> ---
>>>>> fs/btrfs/extent_io.c | 73
>>>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>>> 1 file changed, 73 insertions(+)
>>>>>
>>>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>>>> index 141e414b1ab9..4d55803302e9 100644
>>>>> --- a/fs/btrfs/extent_io.c
>>>>> +++ b/fs/btrfs/extent_io.c
>>>>> @@ -6258,10 +6258,83 @@ void memmove_extent_buffer(const struct
>>>>> extent_buffer *dst,
>>>>> }
>>>>> }
>>>>>
>>>>> +static int try_release_subpage_extent_buffer(struct page *page)
>>>>> +{
>>>>> + struct btrfs_fs_info *fs_info =
>>>>> btrfs_sb(page->mapping->host->i_sb);
>>>>> + u64 page_start = page_offset(page);
>>>>> + int bitmap_size = BTRFS_SUBPAGE_BITMAP_SIZE;
>>>>
>>>> Remove this variable and directly use BTRFS_SUBPAGE_BITMAP_SIZE as a
>>>> terminating condition
>>>>
>>>>> + int bit_start = 0;
>>>>> + int ret;
>>>>> +
>>>>> + while (bit_start < bitmap_size) {
>>>>
>>>> You really want to iterate for a fixed number of items so switch
>>>> that to
>>>> a for loop.
>>>
>>> The problem here is, it's not always fixed.
>>>
>>> If it finds one bit set, it will skip (nodesize >> sectorsize_bits)
>>> bits.
>>>
>>> But if not found, it will skip to just next bit.
>>>
>>> Thus I'm not sure if for loop is really a good choice here for
>>> differential step.
>>>
>>>>
>>>>> + struct btrfs_subpage *subpage;
>>>>> + struct extent_buffer *eb;
>>>>> + unsigned long flags;
>>>>> + u16 tmp = 1 << bit_start;
>>>>> + u64 start;
>>>>> +
>>>>> + /*
>>>>> + * Make sure the page still has private, as previous run can
>>>>> + * detach the private
>>>>> + */
>>>>
>>>> But if previous run has run it would have disposed of this eb and you
>>>> won't find this page at all, no ?
>>>
>>> For the "previous run" I mean, previous iteration in the same loop.
>>>
>>> E.g. the page has 4 bits set, just one eb (16K nodesize).
>>
>> Isn't it guaranteed that if you iterate the eb's in a page if you meet
>> an empty block then the whole extent buffer is gone, hence instead of
>> doing bit_start++ you ought to also increment by the size of nodesize.
>>
>> For example, assume a page contains 4 EBs:
>>
>> 0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15
>> x|x|x|x|0|0|0|0|x|x| x|x |0 | 0|0 |0
>>
>> So first bit is set, so you proceed to call release_extent_buffer on it,
>> which clears the first 4 bits in tree_block_bitmap, in this case you've
>> incremented by nodesize so next iteration begins at index 4. You detect
>> it's unset (0) hence you increment it by 1, and you repeat this for the
>> next 3 bits, then you free the whole of the next eb. I argue that you
>> also need to increment by nodesize in the case of a bit which is not
>> set, because you cannot really see a partially freed eb, i.e. you cannot see
>> the following state:
>>
>> 0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15
>> x|x|x|x|x|0|0|0|x|x| x|x |0 | 0|0 |0
>>
>> Am I missing something?
>
> It's not for a partially freed eb, but for a nodesize-unaligned eb.
>
> E.g. if an eb starts at sector 1 of a page, your nodesize-based
> iteration would go wrong.
> Although we have ensured no subpage eb can cross a page boundary, there
> is no equivalent requirement for nodesize alignment.
>
> Thus I use the extra-safe single-bit step for the empty bit.
Which of course cannot happen, because the allocator ensures that
returned addresses are always aligned to fs_info::stripesize, which in
turn is always equal to sectorsize... So you add extra complexity for no
apparent reason, making code which is already subtle even more subtle.
<snip>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 12/18] btrfs: extent_io: implement try_release_extent_buffer() for subpage metadata support
2020-12-12 9:26 ` Nikolay Borisov
@ 2020-12-12 10:26 ` Qu Wenruo
0 siblings, 0 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-12 10:26 UTC (permalink / raw)
To: Nikolay Borisov, Qu Wenruo, linux-btrfs
On 2020/12/12 at 5:26 PM, Nikolay Borisov wrote:
>
>
> On 12.12.20 г. 3:28 ч., Qu Wenruo wrote:
>>
>>
>> On 2020/12/12 上午12:57, Nikolay Borisov wrote:
>>>
>>>
>>> On 11.12.20 г. 14:11 ч., Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2020/12/11 下午8:00, Nikolay Borisov wrote:
>>>>>
>>>>>
>>>>> On 10.12.20 г. 8:38 ч., Qu Wenruo wrote:
>>>>>> Unlike the original try_release_extent_buffer,
>>>>>> try_release_subpage_extent_buffer() will iterate through
>>>>>> btrfs_subpage::tree_block_bitmap, and try to release each extent
>>>>>> buffer.
>>>>>>
>>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>>> ---
>>>>>> fs/btrfs/extent_io.c | 73
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>>>> 1 file changed, 73 insertions(+)
>>>>>>
>>>>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>>>>> index 141e414b1ab9..4d55803302e9 100644
>>>>>> --- a/fs/btrfs/extent_io.c
>>>>>> +++ b/fs/btrfs/extent_io.c
>>>>>> @@ -6258,10 +6258,83 @@ void memmove_extent_buffer(const struct
>>>>>> extent_buffer *dst,
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> +static int try_release_subpage_extent_buffer(struct page *page)
>>>>>> +{
>>>>>> + struct btrfs_fs_info *fs_info =
>>>>>> btrfs_sb(page->mapping->host->i_sb);
>>>>>> + u64 page_start = page_offset(page);
>>>>>> + int bitmap_size = BTRFS_SUBPAGE_BITMAP_SIZE;
>>>>>
>>>>> Remove this variable and directly use BTRFS_SUBPAGE_BITMAP_SIZE as a
>>>>> terminating condition
>>>>>
>>>>>> + int bit_start = 0;
>>>>>> + int ret;
>>>>>> +
>>>>>> + while (bit_start < bitmap_size) {
>>>>>
>>>>> You really want to iterate for a fixed number of items so switch
>>>>> that to
>>>>> a for loop.
>>>>
>>>> The problem here is, it's not always fixed.
>>>>
>>>> If it finds one bit set, it will skip (nodesize >> sectorsize_bits)
>>>> bits.
>>>>
>>>> But if not found, it will skip to just next bit.
>>>>
>>>> Thus I'm not sure if for loop is really a good choice here for
>>>> differential step.
>>>>
>>>>>
>>>>>> + struct btrfs_subpage *subpage;
>>>>>> + struct extent_buffer *eb;
>>>>>> + unsigned long flags;
>>>>>> + u16 tmp = 1 << bit_start;
>>>>>> + u64 start;
>>>>>> +
>>>>>> + /*
>>>>>> + * Make sure the page still has private, as previous run can
>>>>>> + * detach the private
>>>>>> + */
>>>>>
>>>>> But if previous run has run it would have disposed of this eb and you
>>>>> won't find this page at all, no ?
>>>>
>>>> For the "previous run" I mean, previous iteration in the same loop.
>>>>
>>>> E.g. the page has 4 bits set, just one eb (16K nodesize).
>>>
>>> Isn't it guaranteed that if you iterate the eb's in a page if you meet
>>> an empty block then the whole extent buffer is gone, hence instead of
>>> doing bit_start++ you ought to also increment by the size of nodesize.
>>>
>>> For example, assume a page contains 4 EBs:
>>>
>>> 0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15
>>> x|x|x|x|0|0|0|0|x|x| x|x |0 | 0|0 |0
>>>
>>> So first bit is set, so you proceed to call release_extent_buffer on it,
>>> which clears the first 4 bits in tree_block_bitmap, in this case you've
>>> incremented by nodesize so next iteration begins at index 4. You detect
>>> it's unset (0) hence you increment it by 1, and you repeat this for the
>>> next 3 bits, then you free the whole of the next eb. I argue that you
>>> also need to increment by nodesize in the case of a bit which is not
>>> set, because you cannot really see a partially freed eb, i.e. you cannot see
>>> the following state:
>>>
>>> 0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15
>>> x|x|x|x|x|0|0|0|x|x| x|x |0 | 0|0 |0
>>>
>>> Am I missing something?
>>
>> It's not for a partially freed eb, but for a nodesize-unaligned eb.
>>
>> E.g. if an eb starts at sector 1 of a page, your nodesize-based
>> iteration would go wrong.
>> Although we have ensured no subpage eb can cross a page boundary, there
>> is no equivalent requirement for nodesize alignment.
>>
>> Thus I use the extra-safe single-bit step for the empty bit.
>
> Which of course cannot happen because the allocator ensures that
> returned addresses are always aligned to fs_info::stripesize which in
> turn is always equal to sectorsize...
Nope again.
Think again: sectorsize is only 4K, while nodesize is 16K.
So it's valid (though not really good) to have an eb bytenr which is
aligned to 4K but not aligned to 16K.
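A quick check of that arithmetic; IS_ALIGNED here is a local re-definition of the kernel macro, and the bytenr is an illustrative example, not from the thread:

```c
#include <stdint.h>

/* Local copy of the kernel's alignment check, for illustration only. */
#define IS_ALIGNED(x, a) (((x) & ((uint64_t)(a) - 1)) == 0)

/* 20K = five 4K sectors: sector-aligned, but not 16K node-aligned. */
static const uint64_t example_bytenr = 5 * 4096;
```

So a stripesize (== sectorsize) alignment guarantee from the allocator says nothing about nodesize alignment.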
> So you add extra complexity for no
> apparent reason making code which is already subtle to be even more subtle.
>
> <snip>
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 12/18] btrfs: extent_io: implement try_release_extent_buffer() for subpage metadata support
2020-12-12 5:44 ` Qu Wenruo
@ 2020-12-12 10:30 ` Nikolay Borisov
2020-12-12 10:31 ` Qu Wenruo
0 siblings, 1 reply; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-12 10:30 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, linux-btrfs
On 12.12.20 at 7:44, Qu Wenruo wrote:
>
>
> On 2020/12/12 上午12:57, Nikolay Borisov wrote:
>>
>>
>> On 11.12.20 г. 14:11 ч., Qu Wenruo wrote:
>>>
>>>
>>> On 2020/12/11 下午8:00, Nikolay Borisov wrote:
>>>>
>>>>
<snip>
>>>>
>>>>> + /*
>>>>> + * Here we can't call find_extent_buffer() which will
>>>>> increase
>>>>> + * eb->refs.
>>>>> + */
>>>>> + rcu_read_lock();
>>>>> + eb = radix_tree_lookup(&fs_info->buffer_radix,
>>>>> + start >> fs_info->sectorsize_bits);
>>>>> + rcu_read_unlock();
>>
>> Your usage of radix_tree_lookup + rcu lock is wrong. rcu guarantees that
>> an EB you get won't be freed while the rcu section is active, however
>> you get a reference to the EB and you do not increment the ref count
>> WHILE holding the RCU critical section, consult find_extent_buffer
>> what's the correct usage pattern.
>
> Nope, you just fell into the same trap I fell into before.
>
> Here, if the eb has no other holder, its refs is just 1 (because it's
> still in the tree).
>
> If you go and increase the refs, the eb becomes referenced again and
> release_extent_buffer() won't free it at all.
>
> So no eb would ever be freed.
After the rcu_read_unlock you hold a reference to eb, without having
incremented the eb's refs, without having locked eb's refs_lock. At this
point nothing prevents the eb from disappearing from underneath you. The
correct way would be to increment the eb's ref and check if ref is > 2
(1 for the buffer radix tree, 1 for you), then you acquire the refs_lock
and drop your current ref leaving it to 1 and call release_extent_buffer.
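The pattern described above can be sketched as a userspace toy (hypothetical names, C11 atomics standing in for kernel helpers): take a temporary reference first so the eb cannot vanish, then back off if anyone besides the tree and us holds it.

```c
#include <stdatomic.h>

struct toy_eb {
	atomic_int refs;	/* 1 == held only by the radix tree */
	int freed;
};

/* Take a temporary reference so the eb cannot disappear underneath
 * us, then check whether anyone besides the tree (1) and us (1)
 * holds it.  If so, back off; otherwise drop our temporary ref back
 * to 1 and release. */
static int toy_try_release(struct toy_eb *eb)
{
	if (atomic_fetch_add(&eb->refs, 1) + 1 > 2) {
		atomic_fetch_sub(&eb->refs, 1);	/* in use: back off */
		return 0;
	}
	/* the kernel would take refs_lock around this part */
	atomic_fetch_sub(&eb->refs, 1);		/* drop temporary ref */
	eb->freed = 1;				/* release for real */
	return 1;
}
```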
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 12/18] btrfs: extent_io: implement try_release_extent_buffer() for subpage metadata support
2020-12-12 10:30 ` Nikolay Borisov
@ 2020-12-12 10:31 ` Qu Wenruo
0 siblings, 0 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-12 10:31 UTC (permalink / raw)
To: Nikolay Borisov, Qu Wenruo, linux-btrfs
On 2020/12/12 at 6:30 PM, Nikolay Borisov wrote:
>
>
> On 12.12.20 г. 7:44 ч., Qu Wenruo wrote:
>>
>>
>> On 2020/12/12 上午12:57, Nikolay Borisov wrote:
>>>
>>>
>>> On 11.12.20 г. 14:11 ч., Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2020/12/11 下午8:00, Nikolay Borisov wrote:
>>>>>
>>>>>
>
> <snip>
>
>>>>>
>>>>>> + /*
>>>>>> + * Here we can't call find_extent_buffer() which will
>>>>>> increase
>>>>>> + * eb->refs.
>>>>>> + */
>>>>>> + rcu_read_lock();
>>>>>> + eb = radix_tree_lookup(&fs_info->buffer_radix,
>>>>>> + start >> fs_info->sectorsize_bits);
>>>>>> + rcu_read_unlock();
>>>
>>> Your usage of radix_tree_lookup + rcu lock is wrong. rcu guarantees that
>>> an EB you get won't be freed while the rcu section is active, however
>>> you get a reference to the EB and you do not increment the ref count
>>> WHILE holding the RCU critical section, consult find_extent_buffer
>>> what's the correct usage pattern.
>>
>> Nope, you just fell into the same trap I fell into before.
>>
>> Here, if the eb has no other holder, its refs is just 1 (because it's
>> still in the tree).
>>
>> If you go and increase the refs, the eb becomes referenced again and
>> release_extent_buffer() won't free it at all.
>>
>> So no eb would ever be freed.
>
> After the rcu_read_unlock you hold a reference to eb, without having
> incremented the eb's refs, without having locked eb's refs_lock.
Haven't you checked the original try_release_extent_buffer()?
> At this
> point nothing prevents the eb from disappearing from underneath you. The
> correct way would be to increment the eb's ref and check if ref is > 2
> (1 for the buffer radix tree, 1 for you), then you acquire the refs_lock
> and drop your current ref leaving it to 1 and call release_extent_buffer.
>
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 14/18] btrfs: extent_io: make endio_readpage_update_page_status() to handle subpage case
2020-12-10 6:39 ` [PATCH v2 14/18] btrfs: extent_io: make endio_readpage_update_page_status() to handle subpage case Qu Wenruo
@ 2020-12-14 9:57 ` Nikolay Borisov
2020-12-14 10:46 ` Qu Wenruo
0 siblings, 1 reply; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-14 9:57 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 10.12.20 at 8:39, Qu Wenruo wrote:
> To handle subpage status update, add the following new tricks:
> - Use btrfs_page_*() helpers to update page status
> Now we can handle both cases well.
>
> - No page unlock for subpage metadata
> Since subpage metadata doesn't utilize page locking at all, skip it.
> For subpage data locking, it's handled in later commits.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/extent_io.c | 23 +++++++++++++++++------
> 1 file changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 1ec9de2aa910..64a19c1884fc 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2841,15 +2841,26 @@ static void endio_readpage_release_extent(struct processed_extent *processed,
> processed->uptodate = uptodate;
> }
>
> -static void endio_readpage_update_page_status(struct page *page, bool uptodate)
> +static void endio_readpage_update_page_status(struct page *page, bool uptodate,
> + u64 start, u64 end)
> {
> + struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
> + u32 len;
> +
> + ASSERT(page_offset(page) <= start &&
> + end <= page_offset(page) + PAGE_SIZE - 1);
'start' in this case is
'start = page_offset(page) + bvec->bv_offset;' from
end_bio_extent_readpage, so it can't possibly be less than page_offset;
at minimum it will equal page_offset when bvec->bv_offset is 0.
However, can we really guarantee this?
You are using 'end' only for the assert, and given you already have
the 'len' parameter calculated in the caller, I'd rather have this
function take start/len. That would save you from recalculating len,
and for someone looking at the code it would be apparent that it's the
length of the currently processed bvec. I looked through the rest of
the series and you never use 'end', just 'len'.
<snip>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 15/18] btrfs: disk-io: introduce subpage metadata validation check
2020-12-10 6:39 ` [PATCH v2 15/18] btrfs: disk-io: introduce subpage metadata validation check Qu Wenruo
2020-12-10 13:24 ` kernel test robot
2020-12-10 13:39 ` kernel test robot
@ 2020-12-14 10:21 ` Nikolay Borisov
2020-12-14 10:50 ` Qu Wenruo
2 siblings, 1 reply; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-14 10:21 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 10.12.20 at 8:39, Qu Wenruo wrote:
> For subpage metadata validation check, there are some difference:
> - Read must finish in one bvec
> Since we're just reading one subpage range in one page, it should
> never be split into two bios nor two bvecs.
>
> - How to grab the existing eb
> Instead of grabbing eb using page->private, we have to go search radix
> tree as we don't have any direct pointer at hand.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/disk-io.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 82 insertions(+)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index b6c03a8b0c72..adda76895058 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -591,6 +591,84 @@ static int validate_extent_buffer(struct extent_buffer *eb)
> return ret;
> }
>
> +static int validate_subpage_buffer(struct page *page, u64 start, u64 end,
> + int mirror)
> +{
> + struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
> + struct extent_buffer *eb;
> + int reads_done;
> + int ret = 0;
> +
> + if (!IS_ALIGNED(start, fs_info->sectorsize) ||
That's guaranteed by the allocator.
> + !IS_ALIGNED(end - start + 1, fs_info->sectorsize) ||
That's guaranteed by the fact that nodesize is a multiple of sectorsize.
> + !IS_ALIGNED(end - start + 1, fs_info->nodesize)) {
And that's also guaranteed that the size of an eb is always a nodesize.
Also aren't those checks already performed by the tree-checker during
write? Just remove this as it adds noise.
> + WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
> + btrfs_err(fs_info, "invalid tree read bytenr");
> + return -EUCLEAN;
> + }
> +
> + /*
> + * We don't allow bio merge for subpage metadata read, so we should
> + * only get one eb for each endio hook.
> + */
> + ASSERT(end == start + fs_info->nodesize - 1);
> + ASSERT(PagePrivate(page));
> +
> + rcu_read_lock();
> + eb = radix_tree_lookup(&fs_info->buffer_radix,
> + start / fs_info->sectorsize);
This division op likely produces the kernel test robot's warning; it
could be written as >> fs_info->sectorsize_bits instead. Furthermore,
this usage of radix tree + RCU without acquiring a reference is unsafe,
as per my explanation of the essentially identical issue in patch 12
and our offline chat about it.
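The shift form is equivalent for power-of-two sector sizes and avoids the 64-bit division helper on 32-bit builds; a quick illustration (the constants are example values, not taken from the thread):

```c
#include <stdint.h>

/* For a 4K sector size, sectorsize_bits is 12, and a right shift is
 * exactly the same as dividing by sectorsize. */
static const uint32_t sectorsize = 4096;
static const uint32_t sectorsize_bits = 12;

static uint64_t sector_index(uint64_t start)
{
	return start >> sectorsize_bits;	/* == start / sectorsize */
}
```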
> + rcu_read_unlock();
> +
> + /*
> + * When we are reading one tree block, eb must have been
> + * inserted into the radix tree. If not something is wrong.
> + */
> + if (!eb) {
> + WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
> + btrfs_err(fs_info,
> + "can't find extent buffer for bytenr %llu",
> + start);
> + return -EUCLEAN;
> + }
That's impossible to execute, and such a failure will result in a crash
anyway, so just remove this code.
> + /*
> + * The pending IO might have been the only thing that kept
> + * this buffer in memory. Make sure we have a ref for all
> + * this other checks
> + */
> + atomic_inc(&eb->refs);
> +
> + reads_done = atomic_dec_and_test(&eb->io_pages);
> + /* Subpage read must finish in page read */
> + ASSERT(reads_done);
Just do ASSERT(atomic_dec_and_test(&eb->io_pages)). Again, for subpage
I think that's a bit much, since the eb only has 1 page, so it's
guaranteed to always be true.
> +
> + eb->read_mirror = mirror;
> + if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
> + ret = -EIO;
> + goto err;
> + }
> + ret = validate_extent_buffer(eb);
> + if (ret < 0)
> + goto err;
> +
> + if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
> + btree_readahead_hook(eb, ret);
> +
> + set_extent_buffer_uptodate(eb);
> +
> + free_extent_buffer(eb);
> + return ret;
> +err:
> + /*
> + * our io error hook is going to dec the io pages
> + * again, we have to make sure it has something to
> + * decrement
> + */
That comment is slightly ambiguous - it's not the io error hook that
does the decrement but end_bio_extent_readpage. Just rewrite the comment
to:
"end_bio_extent_readpage decrements io_pages in case of error, make sure
it has ...."
> + atomic_inc(&eb->io_pages);
> + clear_extent_buffer_uptodate(eb);
> + free_extent_buffer(eb);
> + return ret;
> +}
> +
> int btrfs_validate_metadata_buffer(struct btrfs_io_bio *io_bio,
> struct page *page, u64 start, u64 end,
> int mirror)
> @@ -600,6 +678,10 @@ int btrfs_validate_metadata_buffer(struct btrfs_io_bio *io_bio,
> int reads_done;
>
> ASSERT(page->private);
> +
> + if (btrfs_sb(page->mapping->host->i_sb)->sectorsize < PAGE_SIZE)
> + return validate_subpage_buffer(page, start, end, mirror);
nit: validate_metadata_buffer is called in only one place, so I'm
wondering whether it would be more readable to lift this check to its
sole caller, so that when reading end_bio_extent_readpage it's apparent
what's going on. Though the nesting in the caller would get somewhat
unwieldy, so I won't press hard for this.
> +
> eb = (struct extent_buffer *)page->private;
>
>
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 14/18] btrfs: extent_io: make endio_readpage_update_page_status() to handle subpage case
2020-12-14 9:57 ` Nikolay Borisov
@ 2020-12-14 10:46 ` Qu Wenruo
0 siblings, 0 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-14 10:46 UTC (permalink / raw)
To: Nikolay Borisov, Qu Wenruo, linux-btrfs
On 2020/12/14 at 5:57 PM, Nikolay Borisov wrote:
>
>
> On 10.12.20 г. 8:39 ч., Qu Wenruo wrote:
>> To handle subpage status update, add the following new tricks:
>> - Use btrfs_page_*() helpers to update page status
>> Now we can handle both cases well.
>>
>> - No page unlock for subpage metadata
>> Since subpage metadata doesn't utilize page locking at all, skip it.
>> For subpage data locking, it's handled in later commits.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> fs/btrfs/extent_io.c | 23 +++++++++++++++++------
>> 1 file changed, 17 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 1ec9de2aa910..64a19c1884fc 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2841,15 +2841,26 @@ static void endio_readpage_release_extent(struct processed_extent *processed,
>> processed->uptodate = uptodate;
>> }
>>
>> -static void endio_readpage_update_page_status(struct page *page, bool uptodate)
>> +static void endio_readpage_update_page_status(struct page *page, bool uptodate,
>> + u64 start, u64 end)
>> {
>> + struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
>> + u32 len;
>> +
>> + ASSERT(page_offset(page) <= start &&
>> + end <= page_offset(page) + PAGE_SIZE - 1);
>
> 'start' in this case is
> 'start = page_offset(page) + bvec->bv_offset;' from
> end_bio_extent_readpage, so it can't possibly be less than page_offset,
> instead it will at least be equal to page_offset if bvec->bv_offset is 0
> . However, can we really guarantee this ?
I believe we can.
But as you may have already found, I'm sometimes over-cautious, so I
tend to use ASSERT() as a way to document the prerequisites.
>
>
> You are using the end only for the assert, and given you already have
> the 'len' parameter calculated in the caller I'd rather have this
> function take start/len, that would save you from recalculating the len
> and also for someone looking at the code it would be apparent it's the
> length of the currently processed bvec. I looked through the end of the
> series and you never use 'end' just 'len'
Right, using len would be better; I'll stick to the start/len scheme
for new functions in the series.
Thanks,
Qu
>
> <snip>
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 15/18] btrfs: disk-io: introduce subpage metadata validation check
2020-12-14 10:21 ` Nikolay Borisov
@ 2020-12-14 10:50 ` Qu Wenruo
2020-12-14 11:17 ` Nikolay Borisov
0 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2020-12-14 10:50 UTC (permalink / raw)
To: Nikolay Borisov, Qu Wenruo, linux-btrfs
On 2020/12/14 6:21 PM, Nikolay Borisov wrote:
>
>
> On 10.12.20 at 8:39, Qu Wenruo wrote:
>> For subpage metadata validation check, there are some differences:
>> - Read must finish in one bvec
>> Since we're just reading one subpage range in one page, it should
>> never be split into two bios nor two bvecs.
>>
>> - How to grab the existing eb
>> Instead of grabbing eb using page->private, we have to go search radix
>> tree as we don't have any direct pointer at hand.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> fs/btrfs/disk-io.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 82 insertions(+)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index b6c03a8b0c72..adda76895058 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -591,6 +591,84 @@ static int validate_extent_buffer(struct extent_buffer *eb)
>> return ret;
>> }
>>
>> +static int validate_subpage_buffer(struct page *page, u64 start, u64 end,
>> + int mirror)
>> +{
>> + struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
>> + struct extent_buffer *eb;
>> + int reads_done;
>> + int ret = 0;
>> +
>> + if (!IS_ALIGNED(start, fs_info->sectorsize) ||
>
> That's guaranteed by the allocator.
>
>> + !IS_ALIGNED(end - start + 1, fs_info->sectorsize) ||
> That's guaranteed by the fact that nodesize is a multiple of sectorsize.
>
>> + !IS_ALIGNED(end - start + 1, fs_info->nodesize)) {
>
> And that's also guaranteed that the size of an eb is always a nodesize.
> Also aren't those checks already performed by the tree-checker during
> write? Just remove this as it adds noise.
>
>> + WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
>> + btrfs_err(fs_info, "invalid tree read bytenr");
>> + return -EUCLEAN;
>> + }
>> +
>> + /*
>> + * We don't allow bio merge for subpage metadata read, so we should
>> + * only get one eb for each endio hook.
>> + */
>> + ASSERT(end == start + fs_info->nodesize - 1);
>> + ASSERT(PagePrivate(page));
>> +
>> + rcu_read_lock();
>> + eb = radix_tree_lookup(&fs_info->buffer_radix,
>> + start / fs_info->sectorsize);
>
> This division op likely produces the kernel test robot's warning. It
> could be written to use >> fs_info->sectorsize_bits. Furthermore, this
> usage of radix tree + rcu without acquiring a ref is unsafe, as per my
> explanation of an essentially identical issue in patch 12 and our
> offline chat about it.
Another relic I forgot in the long update history, nice find.
>
>> + rcu_read_unlock();
>> +
>> + /*
>> + * When we are reading one tree block, eb must have been
>> + * inserted into the radix tree. If not something is wrong.
>> + */
>> + if (!eb) {
>> + WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
>> + btrfs_err(fs_info,
>> + "can't find extent buffer for bytenr %llu",
>> + start);
>> + return -EUCLEAN;
>> + }
>
> That's impossible to execute and such a failure will result in a crash
> so just remove this code.
>
>> + /*
>> + * The pending IO might have been the only thing that kept
>> + * this buffer in memory. Make sure we have a ref for all
>> + * this other checks
>> + */
>> + atomic_inc(&eb->refs);
>> +
>> + reads_done = atomic_dec_and_test(&eb->io_pages);
>> + /* Subpage read must finish in page read */
>> + ASSERT(reads_done);
>
> Just ASSERT(atomic_dec_and_test(&eb->io_pages)). Again, for subpage I
> think that's a bit much since it only has 1 page so it's guaranteed that
> it will always be true.
IIRC ASSERT() won't evaluate whatever is in it for a non-debug build.
Thus ASSERT(atomic_*) would cause a non-debug kernel to never decrease
io_pages and hang the system.
Exactly the pitfall I'm thinking of.
Thanks,
Qu
>> +
>> + eb->read_mirror = mirror;
>> + if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
>> + ret = -EIO;
>> + goto err;
>> + }
>> + ret = validate_extent_buffer(eb);
>> + if (ret < 0)
>> + goto err;
>> +
>> + if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
>> + btree_readahead_hook(eb, ret);
>> +
>> + set_extent_buffer_uptodate(eb);
>> +
>> + free_extent_buffer(eb);
>> + return ret;
>> +err:
>> + /*
>> + * our io error hook is going to dec the io pages
>> + * again, we have to make sure it has something to
>> + * decrement
>> + */
>
> That comment is slightly ambiguous - it's not the io error hook that
> does the decrement but end_bio_extent_readpage. Just rewrite the comment
> to :
>
> "end_bio_extent_readpage decrements io_pages in case of error, make sure
> it has ...."
>
>> + atomic_inc(&eb->io_pages);
>> + clear_extent_buffer_uptodate(eb);
>> + free_extent_buffer(eb);
>> + return ret;
>> +}
>> +
>> int btrfs_validate_metadata_buffer(struct btrfs_io_bio *io_bio,
>> struct page *page, u64 start, u64 end,
>> int mirror)
>> @@ -600,6 +678,10 @@ int btrfs_validate_metadata_buffer(struct btrfs_io_bio *io_bio,
>> int reads_done;
>>
>> ASSERT(page->private);
>> +
>> + if (btrfs_sb(page->mapping->host->i_sb)->sectorsize < PAGE_SIZE)
>> + return validate_subpage_buffer(page, start, end, mirror);
>
> nit: validate_metadata_buffer is called in only once place so I'm
> wondering won't it make it more readable if this check is lifted to its
> sole caller so that when reading end_bio_extent_readpage it's apparent
> what's going on. Though it's apparent that the nesting in the caller
> will get somewhat unwieldy so won't be pressing hard for this.
>> +
>> eb = (struct extent_buffer *)page->private;
>>
>>
>>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 15/18] btrfs: disk-io: introduce subpage metadata validation check
2020-12-14 10:50 ` Qu Wenruo
@ 2020-12-14 11:17 ` Nikolay Borisov
2020-12-14 11:32 ` Qu Wenruo
0 siblings, 1 reply; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-14 11:17 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, linux-btrfs
On 14.12.20 at 12:50, Qu Wenruo wrote:
>
> IIRC ASSERT() won't evaluate whatever is in it for a non-debug build.
> Thus ASSERT(atomic_*) would cause a non-debug kernel to never decrease
> io_pages and hang the system.
Nope:
#ifdef CONFIG_BTRFS_ASSERT
__cold __noreturn
static inline void assertfail(const char *expr, const char *file, int line)
{
	pr_err("assertion failed: %s, in %s:%d\n", expr, file, line);
	BUG();
}

#define ASSERT(expr)	\
	(likely(expr) ? (void)0 : assertfail(#expr, __FILE__, __LINE__))

#else
static inline void assertfail(const char *expr, const char* file, int line) { }
#define ASSERT(expr)	(void)(expr)	<-- expression is evaluated.
#endif
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 15/18] btrfs: disk-io: introduce subpage metadata validation check
2020-12-14 11:17 ` Nikolay Borisov
@ 2020-12-14 11:32 ` Qu Wenruo
2020-12-14 12:40 ` Nikolay Borisov
0 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2020-12-14 11:32 UTC (permalink / raw)
To: Nikolay Borisov, Qu Wenruo, linux-btrfs
On 2020/12/14 7:17 PM, Nikolay Borisov wrote:
>
>
> On 14.12.20 at 12:50, Qu Wenruo wrote:
>>
>> IIRC ASSERT() won't evaluate whatever is in it for a non-debug build.
>> Thus ASSERT(atomic_*) would cause a non-debug kernel to never decrease
>> io_pages and hang the system.
>
> Nope:
>
> #ifdef CONFIG_BTRFS_ASSERT
> __cold __noreturn
> static inline void assertfail(const char *expr, const char *file, int line)
> {
> 	pr_err("assertion failed: %s, in %s:%d\n", expr, file, line);
> 	BUG();
> }
>
> #define ASSERT(expr)	\
> 	(likely(expr) ? (void)0 : assertfail(#expr, __FILE__, __LINE__))
>
> #else
> static inline void assertfail(const char *expr, const char* file, int line) { }
> #define ASSERT(expr)	(void)(expr)	<-- expression is evaluated.
> #endif
>
Wow, that's quite tricky, and maybe that's the reason Josef is
complaining that the ASSERT()s slow down the system.
In fact, judging from the assert(3) man page, we're at least doing
things differently from user space:
If the macro NDEBUG is defined at the moment <assert.h> was last
included, the macro assert() generates no code, and hence does nothing
at all.
So I'm confused, what's the proper way to do ASSERT()?
Thanks,
Qu
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 15/18] btrfs: disk-io: introduce subpage metadata validation check
2020-12-14 11:32 ` Qu Wenruo
@ 2020-12-14 12:40 ` Nikolay Borisov
0 siblings, 0 replies; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-14 12:40 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, linux-btrfs; +Cc: David Sterba
On 14.12.20 at 13:32, Qu Wenruo wrote:
>
>
>> On 2020/12/14 7:17 PM, Nikolay Borisov wrote:
>>
>>
>> On 14.12.20 at 12:50, Qu Wenruo wrote:
>>>
>>> IIRC ASSERT() won't evaluate whatever is in it for a non-debug build.
>>> Thus ASSERT(atomic_*) would cause a non-debug kernel to never decrease
>>> io_pages and hang the system.
>>
>> Nope:
>>
>> #ifdef CONFIG_BTRFS_ASSERT
>> __cold __noreturn
>> static inline void assertfail(const char *expr, const char *file, int line)
>> {
>> 	pr_err("assertion failed: %s, in %s:%d\n", expr, file, line);
>> 	BUG();
>> }
>>
>> #define ASSERT(expr)	\
>> 	(likely(expr) ? (void)0 : assertfail(#expr, __FILE__, __LINE__))
>>
>> #else
>> static inline void assertfail(const char *expr, const char* file, int line) { }
>> #define ASSERT(expr)	(void)(expr)	<-- expression is evaluated.
>> #endif
>>
> Wow, that's quite tricky, and maybe that's the reason Josef is
> complaining that the ASSERT()s slow down the system.
>
> In fact, judging from the assert(3) man page, we're at least doing
> things differently from user space:
>
> If the macro NDEBUG is defined at the moment <assert.h> was last
> included, the macro assert() generates no code, and hence does nothing
> at all.
>
> So I'm confused, what's the proper way to do ASSERT()?
Well, as it stands now, what I suggested would work. OTOH this really
raises the question of why we leave that code around at all (well, the
compiler should really eliminate those redundant checks). Hm, David?
>
> Thanks,
> Qu
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 16/18] btrfs: introduce btrfs_subpage for data inodes
2020-12-10 6:39 ` [PATCH v2 16/18] btrfs: introduce btrfs_subpage for data inodes Qu Wenruo
2020-12-10 9:44 ` kernel test robot
2020-12-11 0:43 ` kernel test robot
@ 2020-12-14 12:46 ` Nikolay Borisov
2 siblings, 0 replies; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-14 12:46 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 10.12.20 at 8:39, Qu Wenruo wrote:
> To support subpage sector size, data also needs extra info to track
> which sectors in a page are uptodate/dirty/...
>
> This patch will make pages for data inodes to get btrfs_subpage
> structure attached, and detached when the page is freed.
>
> This patch also slightly changes the timing when
> set_page_extent_mapped() to make sure:
> - We have page->mapping set
> page->mapping->host is used to grab btrfs_fs_info, thus we can only
> call this function after page is mapped to an inode.
>
> One call site attaches pages to inode manually, thus we have to modify
> the timing of set_page_extent_mapped() a little.
>
> - As soon as possible, before other operations
> Since memory allocation can fail, we have to do extra error handling.
> Calling set_page_extent_mapped() as soon as possible can simplify the
> error handling for several call sites.
>
> The idea is pretty much the same as iomap_page, but with more bitmaps
> for btrfs specific cases.
>
> Currently the plan is to switch to iomap if iomap can provide sector
> aligned write back (only write back dirty sectors, not the full page;
> data balance requires this feature).
>
> So we will stick to btrfs specific bitmap for now.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/compression.c | 10 ++++++--
> fs/btrfs/extent_io.c | 47 +++++++++++++++++++++++++++++++++----
> fs/btrfs/extent_io.h | 3 ++-
> fs/btrfs/file.c | 10 +++++---
> fs/btrfs/free-space-cache.c | 15 +++++++++---
> fs/btrfs/inode.c | 12 ++++++----
> fs/btrfs/ioctl.c | 5 +++-
> fs/btrfs/reflink.c | 5 +++-
> fs/btrfs/relocation.c | 12 ++++++++--
> 9 files changed, 98 insertions(+), 21 deletions(-)
>
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 5ae3fa0386b7..6d203acfdeb3 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -542,13 +542,19 @@ static noinline int add_ra_bio_pages(struct inode *inode,
> goto next;
> }
>
> - end = last_offset + PAGE_SIZE - 1;
> /*
> * at this point, we have a locked page in the page cache
> * for these bytes in the file. But, we have to make
> * sure they map to this compressed extent on disk.
> */
> - set_page_extent_mapped(page);
> + ret = set_page_extent_mapped(page);
> + if (ret < 0) {
> + unlock_page(page);
> + put_page(page);
> + break;
> + }
> +
> + end = last_offset + PAGE_SIZE - 1;
> lock_extent(tree, last_offset, end);
> read_lock(&em_tree->lock);
> em = lookup_extent_mapping(em_tree, last_offset,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 64a19c1884fc..4e4ed9c453ae 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3191,10 +3191,40 @@ static int attach_extent_buffer_page(struct extent_buffer *eb,
> return 0;
> }
>
> -void set_page_extent_mapped(struct page *page)
> +int __must_check set_page_extent_mapped(struct page *page)
> {
> - if (!PagePrivate(page))
> + struct btrfs_fs_info *fs_info;
> +
> + ASSERT(page->mapping);
> +
> + if (PagePrivate(page))
> + return 0;
> +
> + fs_info = btrfs_sb(page->mapping->host->i_sb);
> + if (fs_info->sectorsize == PAGE_SIZE) {
> attach_page_private(page, (void *)EXTENT_PAGE_PRIVATE);
> + return 0;
> + }
> +
> + return btrfs_attach_subpage(fs_info, page);
In all previous patches < PAGE_SIZE is the special case, but in this
function it's reversed. For the sake of consistency change that so
btrfs_attach_subpage() is executed inside the conditional.
> +}
> +
> +void clear_page_extent_mapped(struct page *page)
> +{
> + struct btrfs_fs_info *fs_info;
> +
> + ASSERT(page->mapping);
> +
> + if (!PagePrivate(page))
> + return;
> +
> + fs_info = btrfs_sb(page->mapping->host->i_sb);
> + if (fs_info->sectorsize == PAGE_SIZE) {
> + detach_page_private(page);
> + return;
> + }
> +
> + btrfs_detach_subpage(fs_info, page);
DITTO
> }
>
> static struct extent_map *
<snip>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index a29b50208eee..9b878616b489 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1373,6 +1373,12 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
> goto fail;
> }
>
> + err = set_page_extent_mapped(pages[i]);
> + if (err < 0) {
> + faili = i;
> + goto fail;
> + }
> +
> if (i == 0)
> err = prepare_uptodate_page(inode, pages[i], pos,
> force_uptodate);
> @@ -1470,10 +1476,8 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
> * We'll call btrfs_dirty_pages() later on, and that will flip around
> * delalloc bits and dirty the pages as required.
> */
> - for (i = 0; i < num_pages; i++) {
> - set_page_extent_mapped(pages[i]);
> + for (i = 0; i < num_pages; i++)
> WARN_ON(!PageLocked(pages[i]));
> - }
The comment above this needs to be removed/rewritten, I guess?
Essentially, set_page_extent_mapped() has moved to prepare_pages().
>
> return ret;
> }
<snip>
> @@ -8355,7 +8357,9 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
> wait_on_page_writeback(page);
>
> lock_extent_bits(io_tree, page_start, page_end, &cached_state);
> - set_page_extent_mapped(page);
> + ret = set_page_extent_mapped(page);
> + if (ret < 0)
> + goto out_unlock;
You should use ret2, ret in this function is used for the retval of
vmf_error().
>
> /*
> * we can't set the delalloc bits if there are pending ordered
<snip>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 17/18] btrfs: integrate page status update for read path into begin/end_page_read()
2020-12-10 6:39 ` [PATCH v2 17/18] btrfs: integrate page status update for read path into begin/end_page_read() Qu Wenruo
@ 2020-12-14 13:59 ` Nikolay Borisov
0 siblings, 0 replies; 71+ messages in thread
From: Nikolay Borisov @ 2020-12-14 13:59 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 10.12.20 at 8:39, Qu Wenruo wrote:
> In btrfs data page read path, the page status update are handled in two
> different locations:
>
> btrfs_do_read_page()
> {
> while (cur <= end) {
> /* No need to read from disk */
> if (HOLE/PREALLOC/INLINE){
> memset();
> set_extent_uptodate();
> continue;
> }
> /* Read from disk */
> ret = submit_extent_page(end_bio_extent_readpage);
> }
>
> end_bio_extent_readpage()
> {
> endio_readpage_uptodate_page_status();
> }
>
> This is fine for sectorsize == PAGE_SIZE case, as for above loop we
> should only hit one branch and then exit.
>
> But for subpage, there is more work to be done in page status update:
> - Page Unlock condition
> Unlike regular page size == sectorsize case, we can no longer just
> unlock a page.
> Only the last reader of the page can unlock the page.
> This means, we can unlock the page either in the while() loop, or in
> the endio function.
>
> - Page uptodate condition
> Since we have multiple sectors to read for a page, we can only mark
> the full page uptodate if all sectors are uptodate.
>
> To handle both subpage and regular cases, introduce a pair of functions
> to help handling page status update:
>
> - begin_page_read()
> For regular case, it does nothing.
> For subpage case, it updates the reader counters so that later
> end_page_read() can know who is the last one to unlock the page.
>
> - end_page_read()
> This is just endio_readpage_uptodate_page_status() renamed.
> The original name is a little too long and too specific for endio.
>
> The only new trick added is the condition for page unlock.
> Now for subpage data, we unlock the page if we're the last reader.
>
> This does not only provide the basis for subpage data read, but also
> hide the special handling of page read from the main read loop.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/extent_io.c | 39 +++++++++++++++++++++++++-----------
> fs/btrfs/subpage.h | 47 ++++++++++++++++++++++++++++++++++++++------
> 2 files changed, 68 insertions(+), 18 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 4e4ed9c453ae..56174e7f0ae8 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2841,8 +2841,18 @@ static void endio_readpage_release_extent(struct processed_extent *processed,
> processed->uptodate = uptodate;
> }
>
> -static void endio_readpage_update_page_status(struct page *page, bool uptodate,
> - u64 start, u64 end)
> +static void begin_data_page_read(struct btrfs_fs_info *fs_info, struct page *page)
> +{
> + ASSERT(PageLocked(page));
> + if (fs_info->sectorsize == PAGE_SIZE)
> + return;
> +
> + ASSERT(PagePrivate(page) && page->private);
The 2nd part of the assert condition is redundant; page->private should
only be set via the respective generic helper, which is never called
with NULL as the 2nd argument.
> + ASSERT(page->mapping->host != fs_info->btree_inode);
That function is only called by btrfs_do_readpage which is used only for
data read out, so do we really need this? I understand you want to be
extra careful but I think this is going over the top.
> + btrfs_subpage_start_reader(fs_info, page, page_offset(page), PAGE_SIZE);
> +}
> +
> +static void end_page_read(struct page *page, bool uptodate, u64 start, u64 end)
> {
> struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
> u32 len;
> @@ -2860,7 +2870,12 @@ static void endio_readpage_update_page_status(struct page *page, bool uptodate,
>
> if (fs_info->sectorsize == PAGE_SIZE)
> unlock_page(page);
> - /* Subpage locking will be handled in later patches */
> + else if (page->mapping->host != fs_info->btree_inode)
Use is_data_inode() helper
> + /*
> + * For subpage data, unlock the page if we're the last reader.
> + * For subpage metadata, page lock is not utilized for read.
> + */
> + btrfs_subpage_end_reader(fs_info, page, start, len);
> }
>
> /*
<snip>
> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
> index 8592234d773e..6c801ef00d2d 100644
> --- a/fs/btrfs/subpage.h
> +++ b/fs/btrfs/subpage.h
> @@ -31,6 +31,9 @@ struct btrfs_subpage {
> u16 tree_block_bitmap;
> };
> /* structures only used by data */
> + struct {
> + atomic_t readers;
> + };
> };
> };
>
> @@ -48,6 +51,17 @@ static inline void btrfs_subpage_clamp_range(struct page *page,
> orig_start + orig_len) - *start;
> }
>
> +static inline void btrfs_subpage_assert(struct btrfs_fs_info *fs_info,
> + struct page *page, u64 start, u32 len)
> +{
> + /* Basic checks */
> + ASSERT(PagePrivate(page) && page->private);
> + ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
> + IS_ALIGNED(len, fs_info->sectorsize));
> + ASSERT(page_offset(page) <= start &&
> + start + len <= page_offset(page) + PAGE_SIZE);
> +}
> +
> /*
> * Convert the [start, start + len) range into a u16 bitmap
> *
> @@ -59,12 +73,8 @@ static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,
> int bit_start = (start - page_offset(page)) >> fs_info->sectorsize_bits;
> int nbits = len >> fs_info->sectorsize_bits;
>
> - /* Basic checks */
> - ASSERT(PagePrivate(page) && page->private);
> - ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
> - IS_ALIGNED(len, fs_info->sectorsize));
> - ASSERT(page_offset(page) <= start &&
> - start + len <= page_offset(page) + PAGE_SIZE);
> + btrfs_subpage_assert(fs_info, page, start, len);
> +
> /*
> * Here nbits can be 16, thus can go beyond u16 range. Here we make the
> * first left shift to be calculated in unsigned long (u32), then
> @@ -73,6 +83,31 @@ static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,
> return (u16)(((1UL << nbits) - 1) << bit_start);
> }
>
> +static inline void btrfs_subpage_start_reader(struct btrfs_fs_info *fs_info,
> + struct page *page, u64 start,
> + u32 len)
> +{
> + struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
> + int nbits = len >> fs_info->sectorsize_bits;
> +
> + btrfs_subpage_assert(fs_info, page, start, len);
> +
> + ASSERT(atomic_read(&subpage->readers) == 0);
> + atomic_set(&subpage->readers, nbits);
To make this more explicit implement it via atomic_add_unless and assert
on the return value.
> +}
> +
> +static inline void btrfs_subpage_end_reader(struct btrfs_fs_info *fs_info,
> + struct page *page, u64 start, u32 len)
> +{
> + struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
> + int nbits = len >> fs_info->sectorsize_bits;
> +
> + btrfs_subpage_assert(fs_info, page, start, len);
> + ASSERT(atomic_read(&subpage->readers) >= nbits);
> + if (atomic_sub_and_test(nbits, &subpage->readers))
> + unlock_page(page);
> +}
> +
> static inline void btrfs_subpage_set_tree_block(struct btrfs_fs_info *fs_info,
> struct page *page, u64 start, u32 len)
> {
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 06/18] btrfs: extent_io: make attach_extent_buffer_page() to handle subpage case
2020-12-10 15:30 ` Nikolay Borisov
@ 2020-12-17 6:48 ` Qu Wenruo
0 siblings, 0 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-17 6:48 UTC (permalink / raw)
To: Nikolay Borisov, Qu Wenruo, linux-btrfs, David Sterba
On 2020/12/10 11:30 PM, Nikolay Borisov wrote:
>
>
> On 10.12.20 at 8:38, Qu Wenruo wrote:
>> For subpage case, we need to allocate new memory for each metadata page.
>>
>> So we need to:
>> - Allow attach_extent_buffer_page() to return int
>> To indicate allocation failure
>>
>> - Prealloc page->private for alloc_extent_buffer()
>> We don't want to call memory allocation with spinlock hold, so
>> do preallocation before we acquire the spin lock.
>>
>> - Handle subpage and regular case differently in
>> attach_extent_buffer_page()
>> For regular case, just do the usual thing.
>> For subpage case, allocate new memory and update the tree_block
>> bitmap.
>>
>> The bitmap update will be handled by new subpage specific helper,
>> btrfs_subpage_set_tree_block().
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> fs/btrfs/extent_io.c | 69 +++++++++++++++++++++++++++++++++++---------
>> fs/btrfs/subpage.h | 44 ++++++++++++++++++++++++++++
>> 2 files changed, 99 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 6350c2687c7e..51dd7ec3c2b3 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -24,6 +24,7 @@
>> #include "rcu-string.h"
>> #include "backref.h"
>> #include "disk-io.h"
>> +#include "subpage.h"
>>
>> static struct kmem_cache *extent_state_cache;
>> static struct kmem_cache *extent_buffer_cache;
>> @@ -3142,22 +3143,41 @@ static int submit_extent_page(unsigned int opf,
>> return ret;
>> }
>>
>> -static void attach_extent_buffer_page(struct extent_buffer *eb,
>> +static int attach_extent_buffer_page(struct extent_buffer *eb,
>> struct page *page)
>> {
>> - /*
>> - * If the page is mapped to btree inode, we should hold the private
>> - * lock to prevent race.
>> - * For cloned or dummy extent buffers, their pages are not mapped and
>> - * will not race with any other ebs.
>> - */
>> - if (page->mapping)
>> - lockdep_assert_held(&page->mapping->private_lock);
>> + struct btrfs_fs_info *fs_info = eb->fs_info;
>> + int ret;
>>
>> - if (!PagePrivate(page))
>> - attach_page_private(page, eb);
>> - else
>> - WARN_ON(page->private != (unsigned long)eb);
>> + if (fs_info->sectorsize == PAGE_SIZE) {
>> + /*
>> + * If the page is mapped to btree inode, we should hold the
>> + * private lock to prevent race.
>> + * For cloned or dummy extent buffers, their pages are not
>> + * mapped and will not race with any other ebs.
>> + */
>> + if (page->mapping)
>> + lockdep_assert_held(&page->mapping->private_lock);
>> +
>> + if (!PagePrivate(page))
>> + attach_page_private(page, eb);
>> + else
>> + WARN_ON(page->private != (unsigned long)eb);
>> + return 0;
>> + }
>> +
>> + /* Already mapped, just update the existing range */
>> + if (PagePrivate(page))
>> + goto update_bitmap;
>
> How can this check ever be false, given btrfs_attach_subpage is called
> unconditionally in alloc_extent_buffer so that you can avoid allocating
> memory with private lock held, yet in this function you check if memory
> hasn't been allocated and you proceed to do it? Also that memory
> allocation is done with GFP_NOFS under a spinlock, that's not atomic i.e
> IO can still be kicked which means you can go to sleep while holding a
> spinlock, not cool.
There are two callers of attach_extent_buffer_page(). One is in
alloc_extent_buffer(), where we pre-allocate page::private before
calling attach_extent_buffer_page(), and that pre-allocation happens
outside the spinlock.
Thus there is no memory allocation at all for that call site.
The other caller is in btrfs_clone_extent_buffer(), which needs proper
memory allocation.
>
>> +
>> + /* Do new allocation to attach subpage */
>> + ret = btrfs_attach_subpage(fs_info, page);
>> + if (ret < 0)
>> + return ret;
>> +
>> +update_bitmap:
>> + btrfs_subpage_set_tree_block(fs_info, page, eb->start, eb->len);
>> + return 0;
>
> Those are really 2 functions, demarcated by the if. Given that
> attach_extent_buffer is called in only 2 places, can't you opencode the
> if (fs_info->sectorize) check in the callers and define 2 functions:
>
> 1 for subpage blocksize and the other one for the old code?
Tried that; it looks much worse than the current code, especially since
we need to add one more indent level in btrfs_clone_extent_buffer().
>
>> }
>>
>
> <snip>
>
>> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
>> index 96f3b226913e..c2ce603e7848 100644
>> --- a/fs/btrfs/subpage.h
>> +++ b/fs/btrfs/subpage.h
>> @@ -23,9 +23,53 @@
>> struct btrfs_subpage {
>> /* Common members for both data and metadata pages */
>> spinlock_t lock;
>> + union {
>> + /* Structures only used by metadata */
>> + struct {
>> + u16 tree_block_bitmap;
>> + };
>> + /* structures only used by data */
>> + };
>> };
>>
>> int btrfs_attach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>> void btrfs_detach_subpage(struct btrfs_fs_info *fs_info, struct page *page);
>>
>> +/*
>> + * Convert the [start, start + len) range into a u16 bitmap
>> + *
>> + * E.g. if start == page_offset() + 16K, len = 16K, we get 0x00f0.
>> + */
>> +static inline u16 btrfs_subpage_calc_bitmap(struct btrfs_fs_info *fs_info,
>> + struct page *page, u64 start, u32 len)
>> +{
>> + int bit_start = (start - page_offset(page)) >> fs_info->sectorsize_bits;
>> + int nbits = len >> fs_info->sectorsize_bits;
>> +
>> + /* Basic checks */
>> + ASSERT(PagePrivate(page) && page->private);
>> + ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
>> + IS_ALIGNED(len, fs_info->sectorsize));
>
> Separate aligns so if they feel it's evident which one failed.
I guess we keep forgetting when ASSERT() should be used.
It's for something that shouldn't fail.
It's not used as a less-terrible BUG_ON(), but really to indicate what's
expected. Thus I don't really expect it to be triggered, nor does it
matter whether it's two lines or one.
what's your idea on this David?
>
>> + ASSERT(page_offset(page) <= start &&
>> + start + len <= page_offset(page) + PAGE_SIZE);
>
> ditto. Also instead of checking 'page_offset(page) <= start' you can
> simply check 'bit_start is >= 0' as that's what you ultimately care about.
ASSERT() usage aside, the start + len vs page_offset() comparison is
much easier to grasp without having to refer back to bit_start.
Thanks,
Qu
>
>> + /*
>> + * Here nbits can be 16, thus can go beyond u16 range. Here we make the
>> + * first left shift to be calculated in unsigned long (u32), then
>> + * truncate the result to u16.
>> + */
>> + return (u16)(((1UL << nbits) - 1) << bit_start);
>> +}
>> +
>> +static inline void btrfs_subpage_set_tree_block(struct btrfs_fs_info *fs_info,
>> + struct page *page, u64 start, u32 len)
>> +{
>> + struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
>> + unsigned long flags;
>> + u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
>> +
>> + spin_lock_irqsave(&subpage->lock, flags);
>> + subpage->tree_block_bitmap |= tmp;
>> + spin_unlock_irqrestore(&subpage->lock, flags);
>> +}
>> +
>> #endif /* BTRFS_SUBPAGE_H */
>>
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 07/18] btrfs: extent_io: make grab_extent_buffer_from_page() to handle subpage case
2020-12-10 15:39 ` Nikolay Borisov
@ 2020-12-17 6:55 ` Qu Wenruo
0 siblings, 0 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-17 6:55 UTC (permalink / raw)
To: Nikolay Borisov, Qu Wenruo, linux-btrfs
On 2020/12/10 11:39 PM, Nikolay Borisov wrote:
>
>
> On 10.12.20 at 8:38, Qu Wenruo wrote:
>> For subpage case, grab_extent_buffer_from_page() can't really get an
>> extent buffer just from btrfs_subpage.
>>
>> Although we have btrfs_subpage::tree_block_bitmap, which can be used to
>> grab the bytenr of an existing extent buffer, and can then go radix tree
>> search to grab that existing eb.
>>
>> However we are still doing radix tree insert check in
>> alloc_extent_buffer(), thus we don't really need to do the extra hassle,
>> just let alloc_extent_buffer() to handle existing eb in radix tree.
>>
>> So for grab_extent_buffer_from_page(), just always return NULL for
>> subpage case.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> fs/btrfs/extent_io.c | 13 +++++++++++--
>> 1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 51dd7ec3c2b3..b99bd0402130 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -5278,10 +5278,19 @@ struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
>> }
>> #endif
>>
>> -static struct extent_buffer *grab_extent_buffer_from_page(struct page *page)
>> +static struct extent_buffer *grab_extent_buffer_from_page(
>> + struct btrfs_fs_info *fs_info, struct page *page)
>> {
>> struct extent_buffer *exists;
>>
>> + /*
>> + * For subpage case, we completely rely on the radix tree to ensure we
>> + * don't try to insert two ebs for the same bytenr.
>> + * So here we always return NULL and just continue.
>> + */
>> + if (fs_info->sectorsize < PAGE_SIZE)
>> + return NULL;
>> +
>
> Instead of hiding this in the function, just open-code it in the only caller. It would look like:
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index b99bd0402130..440dab207944 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5370,8 +5370,9 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> }
>
> spin_lock(&mapping->private_lock);
> - exists = grab_extent_buffer_from_page(fs_info, p);
> - if (exists) {
> + if (fs_info->sectorsize == PAGE_SIZE &&
> + (exists = grab_extent_buffer_from_page(fs_info, p))) {
> spin_unlock(&mapping->private_lock);
> unlock_page(p);
> put_page(p);
>
>
> Admittedly that exists = ... in the if condition is a bit of an anti-pattern but given it's used in only one place
> and makes the flow of code more linear I'd say it's a win. But would like to hear David's opinion.
Personally speaking, the (exists = ...) assignment inside the condition
really looks ugly and is hard to grasp.
And since grab_extent_buffer_from_page() is only called once, the
generated code shouldn't be that much different anyway as the compiler
would mostly just inline it.
So I still prefer the current code, not to mention it also provides
extra space for the comment.
Thanks,
Qu
>
>> /* Page not yet attached to an extent buffer */
>> if (!PagePrivate(page))
>> return NULL;
>> @@ -5361,7 +5370,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>> }
>>
>> spin_lock(&mapping->private_lock);
>> - exists = grab_extent_buffer_from_page(p);
>> + exists = grab_extent_buffer_from_page(fs_info, p);
>> if (exists) {
>> spin_unlock(&mapping->private_lock);
>> unlock_page(p);
>>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 02/18] btrfs: extent_io: refactor __extent_writepage_io() to improve readability
2020-12-10 6:38 ` [PATCH v2 02/18] btrfs: extent_io: refactor __extent_writepage_io() to improve readability Qu Wenruo
2020-12-10 12:12 ` Nikolay Borisov
@ 2020-12-17 15:43 ` Josef Bacik
1 sibling, 0 replies; 71+ messages in thread
From: Josef Bacik @ 2020-12-17 15:43 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 12/10/20 1:38 AM, Qu Wenruo wrote:
> The refactor involves the following modifications:
> - iosize alignment
> In fact we don't really need to manually do alignment at all.
> All extent maps should already be aligned, thus basic ASSERT() check
> would be enough.
>
> - redundant variables
> We have extra variables like blocksize/pg_offset/end.
> They are all unnecessary.
>
> @blocksize can be replaced by sectorsize directly, and it's only
> used to verify the em start/size is aligned.
>
> @pg_offset can be easily calculated using @cur and page_offset(page).
>
> @end is just assigned to @page_end and never modified, use @page_end
> to replace it.
>
> - remove some BUG_ON()s
> The BUG_ON()s are for the extent map, which is already covered by the
> tree-checker on the on-disk extent data item and by runtime checks.
> ASSERT() should be enough.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/extent_io.c | 37 +++++++++++++++++--------------------
> 1 file changed, 17 insertions(+), 20 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 2650e8720394..612fe60b367e 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3515,17 +3515,14 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
> unsigned long nr_written,
> int *nr_ret)
> {
> + struct btrfs_fs_info *fs_info = inode->root->fs_info;
> struct extent_io_tree *tree = &inode->io_tree;
> u64 start = page_offset(page);
> u64 page_end = start + PAGE_SIZE - 1;
> - u64 end;
> u64 cur = start;
> u64 extent_offset;
> u64 block_start;
> - u64 iosize;
> struct extent_map *em;
> - size_t pg_offset = 0;
> - size_t blocksize;
> int ret = 0;
> int nr = 0;
> const unsigned int write_flags = wbc_to_write_flags(wbc);
> @@ -3546,19 +3543,17 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
> */
> update_nr_written(wbc, nr_written + 1);
>
> - end = page_end;
> - blocksize = inode->vfs_inode.i_sb->s_blocksize;
> -
> - while (cur <= end) {
> + while (cur <= page_end) {
> u64 disk_bytenr;
> u64 em_end;
> + u32 iosize;
>
> if (cur >= i_size) {
> btrfs_writepage_endio_finish_ordered(page, cur,
> page_end, 1);
> break;
> }
> - em = btrfs_get_extent(inode, NULL, 0, cur, end - cur + 1);
> + em = btrfs_get_extent(inode, NULL, 0, cur, page_end - cur + 1);
> if (IS_ERR_OR_NULL(em)) {
> SetPageError(page);
> ret = PTR_ERR_OR_ZERO(em);
> @@ -3567,16 +3562,20 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>
> extent_offset = cur - em->start;
> em_end = extent_map_end(em);
> - BUG_ON(em_end <= cur);
> - BUG_ON(end < cur);
> - iosize = min(em_end - cur, end - cur + 1);
> - iosize = ALIGN(iosize, blocksize);
> - disk_bytenr = em->block_start + extent_offset;
> + ASSERT(cur <= em_end);
> + ASSERT(cur < page_end);
> + ASSERT(IS_ALIGNED(em->start, fs_info->sectorsize));
> + ASSERT(IS_ALIGNED(em->len, fs_info->sectorsize));
> block_start = em->block_start;
> compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
> + disk_bytenr = em->block_start + extent_offset;
> +
> + /* Note that em_end from extent_map_end() is exclusive */
> + iosize = min(em_end, page_end + 1) - cur;
> free_extent_map(em);
> em = NULL;
>
> +
Random extra whitespace. Once you fix that you can add
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Thanks,
Josef
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 01/18] btrfs: extent_io: rename @offset parameter to @disk_bytenr for submit_extent_page()
2020-12-10 6:38 ` [PATCH v2 01/18] btrfs: extent_io: rename @offset parameter to @disk_bytenr for submit_extent_page() Qu Wenruo
@ 2020-12-17 15:44 ` Josef Bacik
0 siblings, 0 replies; 71+ messages in thread
From: Josef Bacik @ 2020-12-17 15:44 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 12/10/20 1:38 AM, Qu Wenruo wrote:
> The parameter @offset can't be more confusing.
> In fact that parameter is the disk bytenr for metadata/data.
>
> Rename it to @disk_bytenr and update the comment to reduce confusion.
>
> Since we're here, also rename all @offset passed into
> submit_extent_page() to @disk_bytenr.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Thanks,
Josef
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 04/18] btrfs: extent_io: introduce a helper to grab an existing extent buffer from a page
2020-12-10 6:38 ` [PATCH v2 04/18] btrfs: extent_io: introduce a helper to grab an existing extent buffer from a page Qu Wenruo
2020-12-10 13:51 ` Nikolay Borisov
@ 2020-12-17 15:50 ` Josef Bacik
1 sibling, 0 replies; 71+ messages in thread
From: Josef Bacik @ 2020-12-17 15:50 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs; +Cc: Johannes Thumshirn
On 12/10/20 1:38 AM, Qu Wenruo wrote:
> This patch will extract the code to grab an extent buffer from a page
> into a helper, grab_extent_buffer_from_page().
>
> This reduces one indent level, and provides a place for later
> expansion for subpage support.
>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/extent_io.c | 52 +++++++++++++++++++++++++++-----------------
> 1 file changed, 32 insertions(+), 20 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 612fe60b367e..6350c2687c7e 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5251,6 +5251,32 @@ struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
> }
> #endif
>
> +static struct extent_buffer *grab_extent_buffer_from_page(struct page *page)
> +{
> + struct extent_buffer *exists;
> +
> + /* Page not yet attached to an extent buffer */
> + if (!PagePrivate(page))
> + return NULL;
> +
> + /*
> + * We could have already allocated an eb for this page
> + * and attached one so lets see if we can get a ref on
> + * the existing eb, and if we can we know it's good and
> + * we can just return that one, else we know we can just
> + * overwrite page->private.
> + */
> + exists = (struct extent_buffer *)page->private;
> + if (atomic_inc_not_zero(&exists->refs)) {
> + mark_extent_buffer_accessed(exists, page);
> + return exists;
> + }
> +
> + WARN_ON(PageDirty(page));
> + detach_page_private(page);
> + return NULL;
> +}
> +
> struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> u64 start, u64 owner_root, int level)
> {
> @@ -5296,26 +5322,12 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> }
>
> spin_lock(&mapping->private_lock);
> - if (PagePrivate(p)) {
> - /*
> - * We could have already allocated an eb for this page
> - * and attached one so lets see if we can get a ref on
> - * the existing eb, and if we can we know it's good and
> - * we can just return that one, else we know we can just
> - * overwrite page->private.
> - */
> - exists = (struct extent_buffer *)p->private;
> - if (atomic_inc_not_zero(&exists->refs)) {
> - spin_unlock(&mapping->private_lock);
> - unlock_page(p);
> - put_page(p);
> - mark_extent_buffer_accessed(exists, p);
> - goto free_eb;
> - }
> - exists = NULL;
> -
> - WARN_ON(PageDirty(p));
> - detach_page_private(p);
> + exists = grab_extent_buffer_from_page(p);
> + if (exists) {
> + spin_unlock(&mapping->private_lock);
> + unlock_page(p);
> + put_page(p);
Put the mark_extent_buffer_accessed() here. Thanks,
Josef
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 05/18] btrfs: extent_io: introduce the skeleton of btrfs_subpage structure
2020-12-10 6:38 ` [PATCH v2 05/18] btrfs: extent_io: introduce the skeleton of btrfs_subpage structure Qu Wenruo
@ 2020-12-17 15:52 ` Josef Bacik
0 siblings, 0 replies; 71+ messages in thread
From: Josef Bacik @ 2020-12-17 15:52 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 12/10/20 1:38 AM, Qu Wenruo wrote:
> For btrfs subpage support, we need a structure to record extra info for
> the status of each sectors of a page.
>
> This patch will introduce the skeleton structure for future btrfs
> subpage support.
> All subpage related code would go to subpage.[ch] to avoid populating
> the existing code base.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Thanks,
Josef
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 06/18] btrfs: extent_io: make attach_extent_buffer_page() to handle subpage case
2020-12-10 6:38 ` [PATCH v2 06/18] btrfs: extent_io: make attach_extent_buffer_page() to handle subpage case Qu Wenruo
2020-12-10 15:30 ` Nikolay Borisov
2020-12-10 16:09 ` Nikolay Borisov
@ 2020-12-17 16:00 ` Josef Bacik
2020-12-18 0:44 ` Qu Wenruo
2 siblings, 1 reply; 71+ messages in thread
From: Josef Bacik @ 2020-12-17 16:00 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 12/10/20 1:38 AM, Qu Wenruo wrote:
> For subpage case, we need to allocate new memory for each metadata page.
>
> So we need to:
> - Allow attach_extent_buffer_page() to return int
> To indicate allocation failure
>
> - Prealloc page->private for alloc_extent_buffer()
>> We don't want to call memory allocation with the spinlock held, so
> do preallocation before we acquire the spin lock.
>
> - Handle subpage and regular case differently in
> attach_extent_buffer_page()
> For regular case, just do the usual thing.
> For subpage case, allocate new memory and update the tree_block
> bitmap.
>
> The bitmap update will be handled by new subpage specific helper,
> btrfs_subpage_set_tree_block().
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/extent_io.c | 69 +++++++++++++++++++++++++++++++++++---------
> fs/btrfs/subpage.h | 44 ++++++++++++++++++++++++++++
> 2 files changed, 99 insertions(+), 14 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 6350c2687c7e..51dd7ec3c2b3 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -24,6 +24,7 @@
> #include "rcu-string.h"
> #include "backref.h"
> #include "disk-io.h"
> +#include "subpage.h"
>
> static struct kmem_cache *extent_state_cache;
> static struct kmem_cache *extent_buffer_cache;
> @@ -3142,22 +3143,41 @@ static int submit_extent_page(unsigned int opf,
> return ret;
> }
>
> -static void attach_extent_buffer_page(struct extent_buffer *eb,
> +static int attach_extent_buffer_page(struct extent_buffer *eb,
> struct page *page)
> {
> - /*
> - * If the page is mapped to btree inode, we should hold the private
> - * lock to prevent race.
> - * For cloned or dummy extent buffers, their pages are not mapped and
> - * will not race with any other ebs.
> - */
> - if (page->mapping)
> - lockdep_assert_held(&page->mapping->private_lock);
> + struct btrfs_fs_info *fs_info = eb->fs_info;
> + int ret;
>
> - if (!PagePrivate(page))
> - attach_page_private(page, eb);
> - else
> - WARN_ON(page->private != (unsigned long)eb);
> + if (fs_info->sectorsize == PAGE_SIZE) {
> + /*
> + * If the page is mapped to btree inode, we should hold the
> + * private lock to prevent race.
> + * For cloned or dummy extent buffers, their pages are not
> + * mapped and will not race with any other ebs.
> + */
> + if (page->mapping)
> + lockdep_assert_held(&page->mapping->private_lock);
> +
> + if (!PagePrivate(page))
> + attach_page_private(page, eb);
> + else
> + WARN_ON(page->private != (unsigned long)eb);
> + return 0;
> + }
> +
> + /* Already mapped, just update the existing range */
> + if (PagePrivate(page))
> + goto update_bitmap;
> +
> + /* Do new allocation to attach subpage */
> + ret = btrfs_attach_subpage(fs_info, page);
> + if (ret < 0)
> + return ret;
> +
> +update_bitmap:
> + btrfs_subpage_set_tree_block(fs_info, page, eb->start, eb->len);
> + return 0;
> }
>
> void set_page_extent_mapped(struct page *page)
> @@ -5067,12 +5087,19 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
> return NULL;
>
> for (i = 0; i < num_pages; i++) {
> + int ret;
> +
> p = alloc_page(GFP_NOFS);
> if (!p) {
> btrfs_release_extent_buffer(new);
> return NULL;
> }
> - attach_extent_buffer_page(new, p);
> + ret = attach_extent_buffer_page(new, p);
> + if (ret < 0) {
> + put_page(p);
> + btrfs_release_extent_buffer(new);
> + return NULL;
> + }
> WARN_ON(PageDirty(p));
> SetPageUptodate(p);
> new->pages[i] = p;
> @@ -5321,6 +5348,18 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> goto free_eb;
> }
>
> + /*
> + * Preallocate page->private for subpage case, so that
>> + * we won't allocate memory with private_lock held.
> + */
> + ret = btrfs_attach_subpage(fs_info, p);
> + if (ret < 0) {
> + unlock_page(p);
> + put_page(p);
> + exists = ERR_PTR(-ENOMEM);
> + goto free_eb;
> + }
> +
This is broken: if we race with another thread adding an extent buffer for this
same range, we'll overwrite the page private with the new thing, losing any of
the work that was done previously. Thanks,
Josef
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 07/18] btrfs: extent_io: make grab_extent_buffer_from_page() to handle subpage case
2020-12-10 6:38 ` [PATCH v2 07/18] btrfs: extent_io: make grab_extent_buffer_from_page() " Qu Wenruo
2020-12-10 15:39 ` Nikolay Borisov
@ 2020-12-17 16:02 ` Josef Bacik
2020-12-18 0:49 ` Qu Wenruo
1 sibling, 1 reply; 71+ messages in thread
From: Josef Bacik @ 2020-12-17 16:02 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 12/10/20 1:38 AM, Qu Wenruo wrote:
> For subpage case, grab_extent_buffer_from_page() can't really get an
> extent buffer just from btrfs_subpage.
>
> Although we have btrfs_subpage::tree_block_bitmap, which can be used to
> grab the bytenr of an existing extent buffer, and can then go radix tree
> search to grab that existing eb.
>
> However we are still doing radix tree insert check in
> alloc_extent_buffer(), thus we don't really need to do the extra hassle,
> just let alloc_extent_buffer() to handle existing eb in radix tree.
>
> So for grab_extent_buffer_from_page(), just always return NULL for
> subpage case.
This is fundamentally flawed. The extent buffer radix tree look up is done
_after_ the pages are init'ed. This is why there's that complicated dance of
checking for existing extent buffers attached to a page, because we can race
at the initialization stage and attach an EB to a page before it's in the radix
tree. What you'll end up doing here is overwriting your existing subpage stuff
anytime there's a race, and it'll end very badly. Thanks,
Josef
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 06/18] btrfs: extent_io: make attach_extent_buffer_page() to handle subpage case
2020-12-17 16:00 ` Josef Bacik
@ 2020-12-18 0:44 ` Qu Wenruo
2020-12-18 15:41 ` Josef Bacik
0 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2020-12-18 0:44 UTC (permalink / raw)
To: Josef Bacik, Qu Wenruo, linux-btrfs
On 2020/12/18 12:00 AM, Josef Bacik wrote:
> On 12/10/20 1:38 AM, Qu Wenruo wrote:
>> For subpage case, we need to allocate new memory for each metadata page.
>>
>> So we need to:
>> - Allow attach_extent_buffer_page() to return int
>> To indicate allocation failure
>>
>> - Prealloc page->private for alloc_extent_buffer()
>> We don't want to call memory allocation with the spinlock held, so
>> do preallocation before we acquire the spin lock.
>>
>> - Handle subpage and regular case differently in
>> attach_extent_buffer_page()
>> For regular case, just do the usual thing.
>> For subpage case, allocate new memory and update the tree_block
>> bitmap.
>>
>> The bitmap update will be handled by new subpage specific helper,
>> btrfs_subpage_set_tree_block().
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> fs/btrfs/extent_io.c | 69 +++++++++++++++++++++++++++++++++++---------
>> fs/btrfs/subpage.h | 44 ++++++++++++++++++++++++++++
>> 2 files changed, 99 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 6350c2687c7e..51dd7ec3c2b3 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -24,6 +24,7 @@
>> #include "rcu-string.h"
>> #include "backref.h"
>> #include "disk-io.h"
>> +#include "subpage.h"
>> static struct kmem_cache *extent_state_cache;
>> static struct kmem_cache *extent_buffer_cache;
>> @@ -3142,22 +3143,41 @@ static int submit_extent_page(unsigned int opf,
>> return ret;
>> }
>> -static void attach_extent_buffer_page(struct extent_buffer *eb,
>> +static int attach_extent_buffer_page(struct extent_buffer *eb,
>> struct page *page)
>> {
>> - /*
>> - * If the page is mapped to btree inode, we should hold the private
>> - * lock to prevent race.
>> - * For cloned or dummy extent buffers, their pages are not mapped
>> and
>> - * will not race with any other ebs.
>> - */
>> - if (page->mapping)
>> - lockdep_assert_held(&page->mapping->private_lock);
>> + struct btrfs_fs_info *fs_info = eb->fs_info;
>> + int ret;
>> - if (!PagePrivate(page))
>> - attach_page_private(page, eb);
>> - else
>> - WARN_ON(page->private != (unsigned long)eb);
>> + if (fs_info->sectorsize == PAGE_SIZE) {
>> + /*
>> + * If the page is mapped to btree inode, we should hold the
>> + * private lock to prevent race.
>> + * For cloned or dummy extent buffers, their pages are not
>> + * mapped and will not race with any other ebs.
>> + */
>> + if (page->mapping)
>> + lockdep_assert_held(&page->mapping->private_lock);
>> +
>> + if (!PagePrivate(page))
>> + attach_page_private(page, eb);
>> + else
>> + WARN_ON(page->private != (unsigned long)eb);
>> + return 0;
>> + }
>> +
>> + /* Already mapped, just update the existing range */
>> + if (PagePrivate(page))
>> + goto update_bitmap;
>> +
>> + /* Do new allocation to attach subpage */
>> + ret = btrfs_attach_subpage(fs_info, page);
>> + if (ret < 0)
>> + return ret;
>> +
>> +update_bitmap:
>> + btrfs_subpage_set_tree_block(fs_info, page, eb->start, eb->len);
>> + return 0;
>> }
>> void set_page_extent_mapped(struct page *page)
>> @@ -5067,12 +5087,19 @@ struct extent_buffer
>> *btrfs_clone_extent_buffer(const struct extent_buffer *src)
>> return NULL;
>> for (i = 0; i < num_pages; i++) {
>> + int ret;
>> +
>> p = alloc_page(GFP_NOFS);
>> if (!p) {
>> btrfs_release_extent_buffer(new);
>> return NULL;
>> }
>> - attach_extent_buffer_page(new, p);
>> + ret = attach_extent_buffer_page(new, p);
>> + if (ret < 0) {
>> + put_page(p);
>> + btrfs_release_extent_buffer(new);
>> + return NULL;
>> + }
>> WARN_ON(PageDirty(p));
>> SetPageUptodate(p);
>> new->pages[i] = p;
>> @@ -5321,6 +5348,18 @@ struct extent_buffer
>> *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>> goto free_eb;
>> }
>> + /*
>> + * Preallocate page->private for subpage case, so that
>> + * we won't allocate memory with private_lock held.
>> + */
>> + ret = btrfs_attach_subpage(fs_info, p);
>> + if (ret < 0) {
>> + unlock_page(p);
>> + put_page(p);
>> + exists = ERR_PTR(-ENOMEM);
>> + goto free_eb;
>> + }
>> +
>
> This is broken, if we race with another thread adding an extent buffer
> for this same range we'll overwrite the page private with the new thing,
> losing any of the work that was done previously. Thanks,
Firstly, the page is locked, so only one thread can grab the page at a time.
Secondly, btrfs_attach_subpage() would just exit if it detects the page
is already private.
So there shouldn't be a race.
Thanks,
Qu
>
> Josef
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 07/18] btrfs: extent_io: make grab_extent_buffer_from_page() to handle subpage case
2020-12-17 16:02 ` Josef Bacik
@ 2020-12-18 0:49 ` Qu Wenruo
0 siblings, 0 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-18 0:49 UTC (permalink / raw)
To: Josef Bacik, Qu Wenruo, linux-btrfs
On 2020/12/18 12:02 AM, Josef Bacik wrote:
> On 12/10/20 1:38 AM, Qu Wenruo wrote:
>> For subpage case, grab_extent_buffer_from_page() can't really get an
>> extent buffer just from btrfs_subpage.
>>
>> Although we have btrfs_subpage::tree_block_bitmap, which can be used to
>> grab the bytenr of an existing extent buffer, and can then go radix tree
>> search to grab that existing eb.
>>
>> However we are still doing radix tree insert check in
>> alloc_extent_buffer(), thus we don't really need to do the extra hassle,
>> just let alloc_extent_buffer() to handle existing eb in radix tree.
>>
>> So for grab_extent_buffer_from_page(), just always return NULL for
>> subpage case.
>
> This is fundamentally flawed. The extent buffer radix tree look up is
> done _after_ the pages are init'ed. This is why there's that
> complicated dance of checking for existing extent buffers attached to to
> a page, because we can race at the initialization stage and attach an EB
> to a page before it's in the radix tree. What you'll end up doing here
> is overwriting your existing subpage stuff anytime there's a race, and
> it'll end very badly. Thanks,
We have the page lock preventing two ebs from getting the same page.
And btrfs_attach_subpage() won't overwrite the existing page::private,
thus it's safe.
Thanks,
Qu
>
> Josef
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 06/18] btrfs: extent_io: make attach_extent_buffer_page() to handle subpage case
2020-12-18 0:44 ` Qu Wenruo
@ 2020-12-18 15:41 ` Josef Bacik
2020-12-19 0:24 ` Qu Wenruo
0 siblings, 1 reply; 71+ messages in thread
From: Josef Bacik @ 2020-12-18 15:41 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, linux-btrfs
On 12/17/20 7:44 PM, Qu Wenruo wrote:
>
>
>> On 2020/12/18 12:00 AM, Josef Bacik wrote:
>> On 12/10/20 1:38 AM, Qu Wenruo wrote:
>>> For subpage case, we need to allocate new memory for each metadata page.
>>>
>>> So we need to:
>>> - Allow attach_extent_buffer_page() to return int
>>> To indicate allocation failure
>>>
>>> - Prealloc page->private for alloc_extent_buffer()
>>> We don't want to call memory allocation with the spinlock held, so
>>> do preallocation before we acquire the spin lock.
>>>
>>> - Handle subpage and regular case differently in
>>> attach_extent_buffer_page()
>>> For regular case, just do the usual thing.
>>> For subpage case, allocate new memory and update the tree_block
>>> bitmap.
>>>
>>> The bitmap update will be handled by new subpage specific helper,
>>> btrfs_subpage_set_tree_block().
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>> fs/btrfs/extent_io.c | 69 +++++++++++++++++++++++++++++++++++---------
>>> fs/btrfs/subpage.h | 44 ++++++++++++++++++++++++++++
>>> 2 files changed, 99 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index 6350c2687c7e..51dd7ec3c2b3 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -24,6 +24,7 @@
>>> #include "rcu-string.h"
>>> #include "backref.h"
>>> #include "disk-io.h"
>>> +#include "subpage.h"
>>> static struct kmem_cache *extent_state_cache;
>>> static struct kmem_cache *extent_buffer_cache;
>>> @@ -3142,22 +3143,41 @@ static int submit_extent_page(unsigned int opf,
>>> return ret;
>>> }
>>> -static void attach_extent_buffer_page(struct extent_buffer *eb,
>>> +static int attach_extent_buffer_page(struct extent_buffer *eb,
>>> struct page *page)
>>> {
>>> - /*
>>> - * If the page is mapped to btree inode, we should hold the private
>>> - * lock to prevent race.
>>> - * For cloned or dummy extent buffers, their pages are not mapped and
>>> - * will not race with any other ebs.
>>> - */
>>> - if (page->mapping)
>>> - lockdep_assert_held(&page->mapping->private_lock);
>>> + struct btrfs_fs_info *fs_info = eb->fs_info;
>>> + int ret;
>>> - if (!PagePrivate(page))
>>> - attach_page_private(page, eb);
>>> - else
>>> - WARN_ON(page->private != (unsigned long)eb);
>>> + if (fs_info->sectorsize == PAGE_SIZE) {
>>> + /*
>>> + * If the page is mapped to btree inode, we should hold the
>>> + * private lock to prevent race.
>>> + * For cloned or dummy extent buffers, their pages are not
>>> + * mapped and will not race with any other ebs.
>>> + */
>>> + if (page->mapping)
>>> + lockdep_assert_held(&page->mapping->private_lock);
>>> +
>>> + if (!PagePrivate(page))
>>> + attach_page_private(page, eb);
>>> + else
>>> + WARN_ON(page->private != (unsigned long)eb);
>>> + return 0;
>>> + }
>>> +
>>> + /* Already mapped, just update the existing range */
>>> + if (PagePrivate(page))
>>> + goto update_bitmap;
>>> +
>>> + /* Do new allocation to attach subpage */
>>> + ret = btrfs_attach_subpage(fs_info, page);
>>> + if (ret < 0)
>>> + return ret;
>>> +
>>> +update_bitmap:
>>> + btrfs_subpage_set_tree_block(fs_info, page, eb->start, eb->len);
>>> + return 0;
>>> }
>>> void set_page_extent_mapped(struct page *page)
>>> @@ -5067,12 +5087,19 @@ struct extent_buffer *btrfs_clone_extent_buffer(const
>>> struct extent_buffer *src)
>>> return NULL;
>>> for (i = 0; i < num_pages; i++) {
>>> + int ret;
>>> +
>>> p = alloc_page(GFP_NOFS);
>>> if (!p) {
>>> btrfs_release_extent_buffer(new);
>>> return NULL;
>>> }
>>> - attach_extent_buffer_page(new, p);
>>> + ret = attach_extent_buffer_page(new, p);
>>> + if (ret < 0) {
>>> + put_page(p);
>>> + btrfs_release_extent_buffer(new);
>>> + return NULL;
>>> + }
>>> WARN_ON(PageDirty(p));
>>> SetPageUptodate(p);
>>> new->pages[i] = p;
>>> @@ -5321,6 +5348,18 @@ struct extent_buffer *alloc_extent_buffer(struct
>>> btrfs_fs_info *fs_info,
>>> goto free_eb;
>>> }
>>> + /*
>>> + * Preallocate page->private for subpage case, so that
>>> + * we won't allocate memory with private_lock held.
>>> + */
>>> + ret = btrfs_attach_subpage(fs_info, p);
>>> + if (ret < 0) {
>>> + unlock_page(p);
>>> + put_page(p);
>>> + exists = ERR_PTR(-ENOMEM);
>>> + goto free_eb;
>>> + }
>>> +
>>
>> This is broken, if we race with another thread adding an extent buffer for
>> this same range we'll overwrite the page private with the new thing, losing
>> any of the work that was done previously. Thanks,
>
> Firstly the page is locked, so there should be only one to grab the page.
>
> Secondly, btrfs_attach_subpage() would just exit if it detects the page is
> already private.
>
> So there shouldn't be a race.
>
Task1                                  Task2
alloc_extent_buffer(4096)              alloc_extent_buffer(4096)
find_extent_buffer, nothing            find_extent_buffer, nothing
find_or_create_page(1)
                                       find_or_create_page(1)
                                       waits on page lock
btrfs_attach_subpage()
radix_tree_insert()
unlock pages
                                       exit find_or_create_page()
                                       btrfs_attach_subpage(), BAD
There's definitely a race; again, this is why the code checks whether a
private is already attached to the page. Thanks,
Josef
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH v2 06/18] btrfs: extent_io: make attach_extent_buffer_page() to handle subpage case
2020-12-18 15:41 ` Josef Bacik
@ 2020-12-19 0:24 ` Qu Wenruo
2020-12-21 10:15 ` Qu Wenruo
0 siblings, 1 reply; 71+ messages in thread
From: Qu Wenruo @ 2020-12-19 0:24 UTC (permalink / raw)
To: Josef Bacik, Qu Wenruo, linux-btrfs
On 2020/12/18 11:41 PM, Josef Bacik wrote:
> On 12/17/20 7:44 PM, Qu Wenruo wrote:
>>
>>
>> On 2020/12/18 12:00 AM, Josef Bacik wrote:
>>> On 12/10/20 1:38 AM, Qu Wenruo wrote:
>>>> For subpage case, we need to allocate new memory for each metadata
>>>> page.
>>>>
>>>> So we need to:
>>>> - Allow attach_extent_buffer_page() to return int
>>>> To indicate allocation failure
>>>>
>>>> - Prealloc page->private for alloc_extent_buffer()
>>>> We don't want to call memory allocation with the spinlock held, so
>>>> do preallocation before we acquire the spin lock.
>>>>
>>>> - Handle subpage and regular case differently in
>>>> attach_extent_buffer_page()
>>>> For regular case, just do the usual thing.
>>>> For subpage case, allocate new memory and update the tree_block
>>>> bitmap.
>>>>
>>>> The bitmap update will be handled by new subpage specific helper,
>>>> btrfs_subpage_set_tree_block().
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>> fs/btrfs/extent_io.c | 69
>>>> +++++++++++++++++++++++++++++++++++---------
>>>> fs/btrfs/subpage.h | 44 ++++++++++++++++++++++++++++
>>>> 2 files changed, 99 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>>> index 6350c2687c7e..51dd7ec3c2b3 100644
>>>> --- a/fs/btrfs/extent_io.c
>>>> +++ b/fs/btrfs/extent_io.c
>>>> @@ -24,6 +24,7 @@
>>>> #include "rcu-string.h"
>>>> #include "backref.h"
>>>> #include "disk-io.h"
>>>> +#include "subpage.h"
>>>> static struct kmem_cache *extent_state_cache;
>>>> static struct kmem_cache *extent_buffer_cache;
>>>> @@ -3142,22 +3143,41 @@ static int submit_extent_page(unsigned int opf,
>>>> return ret;
>>>> }
>>>> -static void attach_extent_buffer_page(struct extent_buffer *eb,
>>>> +static int attach_extent_buffer_page(struct extent_buffer *eb,
>>>> struct page *page)
>>>> {
>>>> - /*
>>>> - * If the page is mapped to btree inode, we should hold the
>>>> private
>>>> - * lock to prevent race.
>>>> - * For cloned or dummy extent buffers, their pages are not
>>>> mapped and
>>>> - * will not race with any other ebs.
>>>> - */
>>>> - if (page->mapping)
>>>> - lockdep_assert_held(&page->mapping->private_lock);
>>>> + struct btrfs_fs_info *fs_info = eb->fs_info;
>>>> + int ret;
>>>> - if (!PagePrivate(page))
>>>> - attach_page_private(page, eb);
>>>> - else
>>>> - WARN_ON(page->private != (unsigned long)eb);
>>>> + if (fs_info->sectorsize == PAGE_SIZE) {
>>>> + /*
>>>> + * If the page is mapped to btree inode, we should hold the
>>>> + * private lock to prevent race.
>>>> + * For cloned or dummy extent buffers, their pages are not
>>>> + * mapped and will not race with any other ebs.
>>>> + */
>>>> + if (page->mapping)
>>>> + lockdep_assert_held(&page->mapping->private_lock);
>>>> +
>>>> + if (!PagePrivate(page))
>>>> + attach_page_private(page, eb);
>>>> + else
>>>> + WARN_ON(page->private != (unsigned long)eb);
>>>> + return 0;
>>>> + }
>>>> +
>>>> + /* Already mapped, just update the existing range */
>>>> + if (PagePrivate(page))
>>>> + goto update_bitmap;
>>>> +
>>>> + /* Do new allocation to attach subpage */
>>>> + ret = btrfs_attach_subpage(fs_info, page);
>>>> + if (ret < 0)
>>>> + return ret;
>>>> +
>>>> +update_bitmap:
>>>> + btrfs_subpage_set_tree_block(fs_info, page, eb->start, eb->len);
>>>> + return 0;
>>>> }
>>>> void set_page_extent_mapped(struct page *page)
>>>> @@ -5067,12 +5087,19 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
>>>> return NULL;
>>>> for (i = 0; i < num_pages; i++) {
>>>> + int ret;
>>>> +
>>>> p = alloc_page(GFP_NOFS);
>>>> if (!p) {
>>>> btrfs_release_extent_buffer(new);
>>>> return NULL;
>>>> }
>>>> - attach_extent_buffer_page(new, p);
>>>> + ret = attach_extent_buffer_page(new, p);
>>>> + if (ret < 0) {
>>>> + put_page(p);
>>>> + btrfs_release_extent_buffer(new);
>>>> + return NULL;
>>>> + }
>>>> WARN_ON(PageDirty(p));
>>>> SetPageUptodate(p);
>>>> new->pages[i] = p;
>>>> @@ -5321,6 +5348,18 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>>>> goto free_eb;
>>>> }
>>>> + /*
>>>> + * Preallocate page->private for subpage case, so that
>>>> + * we won't allocate memory with private_lock held.
>>>> + */
>>>> + ret = btrfs_attach_subpage(fs_info, p);
>>>> + if (ret < 0) {
>>>> + unlock_page(p);
>>>> + put_page(p);
>>>> + exists = ERR_PTR(-ENOMEM);
>>>> + goto free_eb;
>>>> + }
>>>> +
>>>
>>> This is broken, if we race with another thread adding an extent
>>> buffer for this same range we'll overwrite the page private with the
>>> new thing, losing any of the work that was done previously. Thanks,
>>
>> Firstly, the page is locked, so there should be only one task able to grab the page.
>>
>> Secondly, btrfs_attach_subpage() would just exit if it detects the
>> page is already private.
>>
>> So there shouldn't be a race.
>>
> Task1 Task2
> alloc_extent_buffer(4096) alloc_extent_buffer(4096)
> find_extent_buffer, nothing find_extent_buffer, nothing
> find_or_create_page(1)
> find_or_create_page(1)
> waits on page lock
> btrfs_attach_subpage()
> radix_tree_insert()
> unlock pages
> exit find_or_create_page()
> btrfs_attach_subpage(), BAD
>
> there's definitely a race, again this is why the code does the check to
> see if there's a private attached to the EB already. Thanks,
btrfs_attach_subpage() is already doing the private check.
Thanks,
Qu
>
> Josef
* Re: [PATCH v2 06/18] btrfs: extent_io: make attach_extent_buffer_page() to handle subpage case
2020-12-19 0:24 ` Qu Wenruo
@ 2020-12-21 10:15 ` Qu Wenruo
0 siblings, 0 replies; 71+ messages in thread
From: Qu Wenruo @ 2020-12-21 10:15 UTC (permalink / raw)
To: Josef Bacik, Qu Wenruo, linux-btrfs
On 2020/12/19 8:24 AM, Qu Wenruo wrote:
>
>
> On 2020/12/18 11:41 PM, Josef Bacik wrote:
>> On 12/17/20 7:44 PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2020/12/18 12:00 AM, Josef Bacik wrote:
>>>> On 12/10/20 1:38 AM, Qu Wenruo wrote:
>>>>> For subpage case, we need to allocate new memory for each metadata
>>>>> page.
>>>>>
>>>>> So we need to:
>>>>> - Allow attach_extent_buffer_page() to return int
>>>>> To indicate allocation failure
>>>>>
>>>>> - Prealloc page->private for alloc_extent_buffer()
>>>>> We don't want to call memory allocation with a spinlock held, so
>>>>> do preallocation before we acquire the spin lock.
>>>>>
>>>>> - Handle subpage and regular case differently in
>>>>> attach_extent_buffer_page()
>>>>> For regular case, just do the usual thing.
>>>>> For subpage case, allocate new memory and update the tree_block
>>>>> bitmap.
>>>>>
>>>>> The bitmap update will be handled by new subpage specific helper,
>>>>> btrfs_subpage_set_tree_block().
>>>>>
>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>> ---
>>>>> fs/btrfs/extent_io.c | 69 +++++++++++++++++++++++++++++++++++---------
>>>>> fs/btrfs/subpage.h | 44 ++++++++++++++++++++++++++++
>>>>> 2 files changed, 99 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>>>> index 6350c2687c7e..51dd7ec3c2b3 100644
>>>>> --- a/fs/btrfs/extent_io.c
>>>>> +++ b/fs/btrfs/extent_io.c
>>>>> @@ -24,6 +24,7 @@
>>>>> #include "rcu-string.h"
>>>>> #include "backref.h"
>>>>> #include "disk-io.h"
>>>>> +#include "subpage.h"
>>>>> static struct kmem_cache *extent_state_cache;
>>>>> static struct kmem_cache *extent_buffer_cache;
>>>>> @@ -3142,22 +3143,41 @@ static int submit_extent_page(unsigned int opf,
>>>>> return ret;
>>>>> }
>>>>> -static void attach_extent_buffer_page(struct extent_buffer *eb,
>>>>> +static int attach_extent_buffer_page(struct extent_buffer *eb,
>>>>> struct page *page)
>>>>> {
>>>>> - /*
>>>>> - * If the page is mapped to btree inode, we should hold the private
>>>>> - * lock to prevent race.
>>>>> - * For cloned or dummy extent buffers, their pages are not mapped and
>>>>> - * will not race with any other ebs.
>>>>> - */
>>>>> - if (page->mapping)
>>>>> - lockdep_assert_held(&page->mapping->private_lock);
>>>>> + struct btrfs_fs_info *fs_info = eb->fs_info;
>>>>> + int ret;
>>>>> - if (!PagePrivate(page))
>>>>> - attach_page_private(page, eb);
>>>>> - else
>>>>> - WARN_ON(page->private != (unsigned long)eb);
>>>>> + if (fs_info->sectorsize == PAGE_SIZE) {
>>>>> + /*
>>>>> + * If the page is mapped to btree inode, we should hold the
>>>>> + * private lock to prevent race.
>>>>> + * For cloned or dummy extent buffers, their pages are not
>>>>> + * mapped and will not race with any other ebs.
>>>>> + */
>>>>> + if (page->mapping)
>>>>> + lockdep_assert_held(&page->mapping->private_lock);
>>>>> +
>>>>> + if (!PagePrivate(page))
>>>>> + attach_page_private(page, eb);
>>>>> + else
>>>>> + WARN_ON(page->private != (unsigned long)eb);
>>>>> + return 0;
>>>>> + }
>>>>> +
>>>>> + /* Already mapped, just update the existing range */
>>>>> + if (PagePrivate(page))
>>>>> + goto update_bitmap;
>>>>> +
>>>>> + /* Do new allocation to attach subpage */
>>>>> + ret = btrfs_attach_subpage(fs_info, page);
>>>>> + if (ret < 0)
>>>>> + return ret;
>>>>> +
>>>>> +update_bitmap:
>>>>> + btrfs_subpage_set_tree_block(fs_info, page, eb->start, eb->len);
>>>>> + return 0;
>>>>> }
>>>>> void set_page_extent_mapped(struct page *page)
>>>>> @@ -5067,12 +5087,19 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
>>>>> return NULL;
>>>>> for (i = 0; i < num_pages; i++) {
>>>>> + int ret;
>>>>> +
>>>>> p = alloc_page(GFP_NOFS);
>>>>> if (!p) {
>>>>> btrfs_release_extent_buffer(new);
>>>>> return NULL;
>>>>> }
>>>>> - attach_extent_buffer_page(new, p);
>>>>> + ret = attach_extent_buffer_page(new, p);
>>>>> + if (ret < 0) {
>>>>> + put_page(p);
>>>>> + btrfs_release_extent_buffer(new);
>>>>> + return NULL;
>>>>> + }
>>>>> WARN_ON(PageDirty(p));
>>>>> SetPageUptodate(p);
>>>>> new->pages[i] = p;
>>>>> @@ -5321,6 +5348,18 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>>>>> goto free_eb;
>>>>> }
>>>>> + /*
>>>>> + * Preallocate page->private for subpage case, so that
>>>>> + * we won't allocate memory with private_lock held.
>>>>> + */
>>>>> + ret = btrfs_attach_subpage(fs_info, p);
>>>>> + if (ret < 0) {
>>>>> + unlock_page(p);
>>>>> + put_page(p);
>>>>> + exists = ERR_PTR(-ENOMEM);
>>>>> + goto free_eb;
>>>>> + }
>>>>> +
>>>>
>>>> This is broken, if we race with another thread adding an extent
>>>> buffer for this same range we'll overwrite the page private with the
>>>> new thing, losing any of the work that was done previously. Thanks,
>>>
>>> Firstly, the page is locked, so there should be only one task able to
>>> grab the page.
>>>
>>> Secondly, btrfs_attach_subpage() would just exit if it detects the
>>> page is already private.
>>>
>>> So there shouldn't be a race.
>>>
>> Task1 Task2
>> alloc_extent_buffer(4096) alloc_extent_buffer(4096)
>> find_extent_buffer, nothing find_extent_buffer, nothing
>> find_or_create_page(1)
>> find_or_create_page(1)
>> waits on page lock
>> btrfs_attach_subpage()
>> radix_tree_insert()
>> unlock pages
>> exit find_or_create_page()
>> btrfs_attach_subpage(), BAD
To be more clear: in the above case, btrfs_attach_subpage() would find the
page is already private, and thus exit without doing anything (no extra
attaching nor bitmap update).
Thus no btrfs_subpage info gets overwritten.
>>
>> there's definitely a race, again this is why the code does the check
>> to see if there's a private attached to the EB already. Thanks,
That's exactly what btrfs_attach_subpage() is doing.
Anyway, all this hassle is needed just to avoid memory allocation while
holding the spinlock.
Personally speaking, I don't see any better solution than pre-allocating
right now.
Thanks,
Qu
>
> btrfs_attach_subpage() is already doing the private check.
>
> Thanks,
> Qu
>
>>
>> Josef
end of thread, other threads:[~2020-12-21 10:17 UTC | newest]
Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-10 6:38 [PATCH v2 00/18] btrfs: add read-only support for subpage sector size Qu Wenruo
2020-12-10 6:38 ` [PATCH v2 01/18] btrfs: extent_io: rename @offset parameter to @disk_bytenr for submit_extent_page() Qu Wenruo
2020-12-17 15:44 ` Josef Bacik
2020-12-10 6:38 ` [PATCH v2 02/18] btrfs: extent_io: refactor __extent_writepage_io() to improve readability Qu Wenruo
2020-12-10 12:12 ` Nikolay Borisov
2020-12-10 12:53 ` Qu Wenruo
2020-12-10 12:58 ` Nikolay Borisov
2020-12-17 15:43 ` Josef Bacik
2020-12-10 6:38 ` [PATCH v2 03/18] btrfs: file: update comment for btrfs_dirty_pages() Qu Wenruo
2020-12-10 12:16 ` Nikolay Borisov
2020-12-10 6:38 ` [PATCH v2 04/18] btrfs: extent_io: introduce a helper to grab an existing extent buffer from a page Qu Wenruo
2020-12-10 13:51 ` Nikolay Borisov
2020-12-17 15:50 ` Josef Bacik
2020-12-10 6:38 ` [PATCH v2 05/18] btrfs: extent_io: introduce the skeleton of btrfs_subpage structure Qu Wenruo
2020-12-17 15:52 ` Josef Bacik
2020-12-10 6:38 ` [PATCH v2 06/18] btrfs: extent_io: make attach_extent_buffer_page() to handle subpage case Qu Wenruo
2020-12-10 15:30 ` Nikolay Borisov
2020-12-17 6:48 ` Qu Wenruo
2020-12-10 16:09 ` Nikolay Borisov
2020-12-17 16:00 ` Josef Bacik
2020-12-18 0:44 ` Qu Wenruo
2020-12-18 15:41 ` Josef Bacik
2020-12-19 0:24 ` Qu Wenruo
2020-12-21 10:15 ` Qu Wenruo
2020-12-10 6:38 ` [PATCH v2 07/18] btrfs: extent_io: make grab_extent_buffer_from_page() " Qu Wenruo
2020-12-10 15:39 ` Nikolay Borisov
2020-12-17 6:55 ` Qu Wenruo
2020-12-17 16:02 ` Josef Bacik
2020-12-18 0:49 ` Qu Wenruo
2020-12-10 6:38 ` [PATCH v2 08/18] btrfs: extent_io: support subpage for extent buffer page release Qu Wenruo
2020-12-10 16:13 ` Nikolay Borisov
2020-12-10 6:38 ` [PATCH v2 09/18] btrfs: subpage: introduce helper for subpage uptodate status Qu Wenruo
2020-12-11 10:10 ` Nikolay Borisov
2020-12-11 10:48 ` Qu Wenruo
2020-12-11 11:41 ` Nikolay Borisov
2020-12-11 11:56 ` Qu Wenruo
2020-12-10 6:38 ` [PATCH v2 10/18] btrfs: subpage: introduce helper for subpage error status Qu Wenruo
2020-12-10 6:38 ` [PATCH v2 11/18] btrfs: extent_io: make set/clear_extent_buffer_uptodate() to support subpage size Qu Wenruo
2020-12-10 6:38 ` [PATCH v2 12/18] btrfs: extent_io: implement try_release_extent_buffer() for subpage metadata support Qu Wenruo
2020-12-11 12:00 ` Nikolay Borisov
2020-12-11 12:11 ` Qu Wenruo
2020-12-11 16:57 ` Nikolay Borisov
2020-12-12 1:28 ` Qu Wenruo
2020-12-12 9:26 ` Nikolay Borisov
2020-12-12 10:26 ` Qu Wenruo
2020-12-12 5:44 ` Qu Wenruo
2020-12-12 10:30 ` Nikolay Borisov
2020-12-12 10:31 ` Qu Wenruo
2020-12-10 6:39 ` [PATCH v2 13/18] btrfs: extent_io: introduce read_extent_buffer_subpage() Qu Wenruo
2020-12-10 6:39 ` [PATCH v2 14/18] btrfs: extent_io: make endio_readpage_update_page_status() to handle subpage case Qu Wenruo
2020-12-14 9:57 ` Nikolay Borisov
2020-12-14 10:46 ` Qu Wenruo
2020-12-10 6:39 ` [PATCH v2 15/18] btrfs: disk-io: introduce subpage metadata validation check Qu Wenruo
2020-12-10 13:24 ` kernel test robot
2020-12-10 13:24 ` kernel test robot
2020-12-10 13:39 ` kernel test robot
2020-12-10 13:39 ` kernel test robot
2020-12-14 10:21 ` Nikolay Borisov
2020-12-14 10:50 ` Qu Wenruo
2020-12-14 11:17 ` Nikolay Borisov
2020-12-14 11:32 ` Qu Wenruo
2020-12-14 12:40 ` Nikolay Borisov
2020-12-10 6:39 ` [PATCH v2 16/18] btrfs: introduce btrfs_subpage for data inodes Qu Wenruo
2020-12-10 9:44 ` kernel test robot
2020-12-10 9:44 ` kernel test robot
2020-12-11 0:43 ` kernel test robot
2020-12-11 0:43 ` kernel test robot
2020-12-14 12:46 ` Nikolay Borisov
2020-12-10 6:39 ` [PATCH v2 17/18] btrfs: integrate page status update for read path into begin/end_page_read() Qu Wenruo
2020-12-14 13:59 ` Nikolay Borisov
2020-12-10 6:39 ` [PATCH v2 18/18] btrfs: allow RO mount of 4K sector size fs on 64K page system Qu Wenruo