* [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

Btrfs assumes the filesystem block size to be the same as the machine's
page size. This means that a Btrfs instance created on a 4k page size
machine (e.g. x86) will not be mountable on machines with larger page
sizes (e.g. PPC64/AARCH64). This patchset aims to resolve this
incompatibility.

This patchset continues with the work posted previously at
http://marc.info/?l=linux-btrfs&m=146760691422240&w=2

This patchset is based on top of Josef's
1. Metadata throttling in writeback patches
2. Kill the btree inode patches

The major change in this version is the usage of kmalloc()-ed memory for
holding metadata blocks whose size is less than the machine's page size.
This vastly reduces the complexity of extent buffer management (thanks
to Josef's "Kill the btree inode" patches).

When writing back dirty extent buffers, we currently track the
corresponding extent buffers using the pointer at page->private. With
kmalloc()-ed memory this isn't possible, hence we track the first extent
buffer under writeback using bio->bi_private. Also, for kmalloc()-ed
extent buffers this patchset currently limits the number of dirty extent
buffers in a "write" bio to 1. This limit will be removed in a future
patchset.
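
As a rough illustration of that bi_private scheme (a minimal userspace
sketch with illustrative stand-in types and names such as
submit_eb_write()/eb_write_endio(); this is not the kernel code itself):

	struct eb_model {		/* stands in for struct extent_buffer */
		int under_writeback;
	};

	struct bio_model {		/* stands in for struct bio */
		void *bi_private;
	};

	static void submit_eb_write(struct bio_model *bio, struct eb_model *eb)
	{
		/* only one dirty extent buffer per "write" bio for now */
		bio->bi_private = eb;
		eb->under_writeback = 1;
	}

	static void eb_write_endio(struct bio_model *bio)
	{
		/* no page->private to walk; recover the eb from the bio */
		struct eb_model *eb = bio->bi_private;

		eb->under_writeback = 0;
	}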

The commits for the Btrfs kernel module can be found at
https://github.com/chandanr/linux/tree/btrfs/subpagesize-blocksize.

To create a filesystem with block size < page size, a patched version
of the Btrfs-progs package is required. The corresponding fixes for
Btrfs-progs can be found at
https://github.com/chandanr/btrfs-progs/tree/btrfs/subpagesize-blocksize.

Fstests run status:
1. x86_64
   - With 4k sectorsize, all the tests that succeed with the
     for-linus-4.8 branch of
     git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git
     also do so with the patches applied.
2. ppc64
   - With 4k sectorsize, 16k nodesize and the "nospace_cache" mount
     option, all the tests that succeed with the for-next branch of
     git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git
     also do so with the patches applied, except for the scrub and
     compression tests.
   - With 64k sectorsize & nodesize, all the tests that succeed with
     the for-linus-4.8 branch of
     git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git
     also do so with the patches applied.

TODO:
1. On ppc64, btrfsck segfaults when checking a filesystem instance
   with a 2k sectorsize.
2. I am planning to fix scrub & compression via a separate patchset.

Changes from V20:
1. Applied all the review comments suggested by Josef for version V20.
   However, there are still some instances of

	if (compare_sectorsize_with_page_size)
		/* do something */

   One such instance is in check_page_uptodate(), where we need to
   check for BLK_STATE_UPTODATE only if sectorsize < page_size. For the
   page_size == sectorsize case, we unconditionally set the PG_uptodate
   flag.
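
For reference, the shape of that remaining check is roughly the
following (a minimal userspace sketch with illustrative stand-in types;
this is not the actual kernel code):

	#include <stdbool.h>

	#define PAGE_SIZE_MODEL	65536	/* illustrative 64k page */

	struct page_model {
		unsigned int sectorsize;
		unsigned int nr_blks;
		bool blk_uptodate[32];	/* per-block BLK_STATE_UPTODATE */
	};

	static void check_page_uptodate_model(struct page_model *p,
					      bool *pg_uptodate)
	{
		if (p->sectorsize == PAGE_SIZE_MODEL) {
			/* page_size == sectorsize: set unconditionally */
			*pg_uptodate = true;
			return;
		}

		/* sectorsize < page_size: every block must be uptodate */
		for (unsigned int i = 0; i < p->nr_blks; i++)
			if (!p->blk_uptodate[i])
				return;
		*pg_uptodate = true;
	}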

Changes from V19:
1. The patchset has been rebased on top of kdave/for-next branch.
2. The patch "Btrfs: subpage-blocksize: extent_clear_unlock_delalloc:
   Prevent page from being unlocked more than once" changes the
   signatures of the functions "cow_file_range" &
   "extent_clear_unlock_delalloc". This patch has now been moved to be
   the first patch in the patchset.
3. A new patch "Btrfs: subpage-blocksize: Rate limit scrub error
   message" has been added. btrfs/073 invokes the scrub ioctl in a
   tight loop. In subpage-blocksize scenario this results in a lot of
   "scrub: size assumption sectorsize != PAGE_SIZE" messages being
   printed on the console. Hence this patch rate limits such error
   messages.

Changes from V18:
1. The per-page bitmap used to track the block status is now allocated
   from a slab cache (a layout sketch follows this list).
2. The per-page bitmap is allocated and used only in cases where
   sectorsize < PAGE_SIZE.
3. The new patch "Btrfs: subpage-blocksize: Disable compression"
   disables compression in the subpage-blocksize scenario.
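
A minimal userspace sketch of the bitmap layout implied by items 1 and
2 above (the enum and sizing mirror the patchset's extent_io.h; the
helper is an illustrative stand-in for the kernel's set_bit() based
code):

	#include <limits.h>

	enum blk_state {
		BLK_STATE_UPTODATE,
		BLK_STATE_DIRTY,
		BLK_STATE_IO,
		BLK_NR_STATE,
	};

	/* at most 32 blocks per page: 2k blocks with a 64k page */
	#define MAX_BLKS_PER_PAGE	32
	#define BITS_PER_LONG_MODEL	(CHAR_BIT * sizeof(long))
	#define BLK_STATE_NR_LONGS					\
		((BLK_NR_STATE * MAX_BLKS_PER_PAGE +			\
		  BITS_PER_LONG_MODEL - 1) / BITS_PER_LONG_MODEL)

	/* block 'blk' keeps its states in BLK_NR_STATE consecutive bits */
	static void set_blk_state(unsigned long *bstate, unsigned long blk,
				  enum blk_state state)
	{
		unsigned long bit = blk * BLK_NR_STATE + state;

		bstate[bit / BITS_PER_LONG_MODEL] |=
			1UL << (bit % BITS_PER_LONG_MODEL);
	}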

Changes from V17:
1. Due to mistakes made during git rebase operations, fixes ended up
   in incorrect patches. This patchset gets the fixes in the right
   patches.

Changes from V16:
1. The V15 patchset consisted of patches obtained from an incorrect
   git branch. Apologies for the mistake. All the entries listed under
   "Changes from V15" hold good for V16.

Changes from V15:
1. The invocation of cleancache_get_page() in __do_readpage() assumed
   blocksize to be the same as PAGE_SIZE. We now invoke
   cleancache_get_page() only if blocksize is the same as PAGE_SIZE.
   Thanks to David Sterba for pointing this out.
2. In __extent_writepage_io() we used to accumulate all the contiguous
   dirty blocks within the page before submitting the file offset range
   for I/O. In some cases this caused the bio to span more than one
   stripe. For example, with a 4k block size, 64k stripe size and 64k
   page size, assume that
   - All the blocks mapped by the page are contiguous in the logical
     address space.
   - The first block of the page is mapped to the second block of the
     stripe.
   In such a scenario we would add all the blocks of the page to the
   bio, overflowing the stripe by one 4k block. Hence this patchset
   removes the optimization and invokes submit_extent_page() for every
   dirty 4k block (see the worked example after this list).
3. The following patches are newly added:
   - Btrfs: subpage-blocksize: __btrfs_lookup_bio_sums: Set offset
     when moving to a new bio_vec 
   - Btrfs: subpage-blocksize: Make file extent relocate code subpage
     blocksize aware 
   - Btrfs: btrfs_clone: Flush dirty blocks of a page that do not map
     the clone range
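
The arithmetic behind item 2 above, as a small self-contained example
(plain userspace C, illustrative only):

	#include <stdio.h>

	int main(void)
	{
		unsigned int blocksize = 4096;
		unsigned int stripe_size = 65536;
		unsigned int page_size = 65536;

		/* first block of the page sits on the *second* block
		 * of the stripe */
		unsigned int stripe_offset = blocksize;

		/* a single bio covering all 16 blocks of the page */
		unsigned int bio_end = stripe_offset + page_size;

		/* 4096 + 65536 = 69632 > 65536: one 4k block spills into
		 * the next stripe, hence one submit_extent_page() call
		 * per dirty block instead */
		printf("stripe overflow: %u bytes\n", bio_end - stripe_size);
		return 0;
	}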

Changes from V14:
1. Fix usage of cleancache_get_page() in __do_readpage().
   In filesystems which support the subpage-blocksize scenario, a page
   can map one or more blocks. Hence cleancache_get_page() should be
   invoked only when the page maps a non-hole extent and the block size
   being used is equal to the page size. Thanks to David Sterba for
   pointing this out.
2. Replace page_read_complete() and page_write_complete() functions
   with page_io_complete().
3. Provide more documentation (as part of both commit message and code
   comments) about the usage of the per-page
   btrfs_page_private->io_lock.

Changes from V13:
1. Enable dedup ioctl to work in subpagesize-blocksize scenario.

Changes from V12:
1. The logic in the function btrfs_punch_hole() has been fixed to
   check for the presence of BLK_STATE_UPTODATE flags for blocks in
   pages which partially map the file range being punched.
   
Changes from V11:
1. Addressed the review comments provided by Liu Bo for version V11.
2. Fixed file defragmentation code to work in subpagesize-blocksize
   scenario.
3. Many "hard to reproduce" bugs were fixed.

Chandan Rajendra (19):
  Btrfs: subpage-blocksize: extent_clear_unlock_delalloc: Prevent page
    from being unlocked more than once
  Btrfs: subpage-blocksize: Make sure delalloc range intersects with the
    locked page's range
  Btrfs: subpage-blocksize: Use PG_Uptodate flag to track block uptodate
    status
  Btrfs: Remove extent_io_tree's track_uptodate member
  Btrfs: subpage-blocksize: Fix whole page read.
  Btrfs: subpage-blocksize: Fix whole page write
  Btrfs: subpage-blocksize: Use kmalloc()-ed memory to hold metadata
    blocks
  Btrfs: subpage-blocksize: Execute sanity tests on all possible block
    sizes
  Btrfs: subpage-blocksize: Compute free space tree BITMAP_RANGE based
    on sectorsize
  Btrfs: subpage-blocksize: Allow mounting filesystems where sectorsize
    < PAGE_SIZE
  Btrfs: subpage-blocksize: Deal with partial ordered extent
    allocations.
  Btrfs: subpage-blocksize: Explicitly track I/O status of blocks of an
    ordered extent.
  Btrfs: subpage-blocksize: btrfs_punch_hole: Fix uptodate blocks check
  Btrfs: subpage-blocksize: Fix file defragmentation code
  Btrfs: subpage-blocksize: Enable dedupe ioctl
  Btrfs: subpage-blocksize: btrfs_clone: Flush dirty blocks of a page
    that do not map the clone range
  Btrfs: subpage-blocksize: Make file extent relocate code subpage
    blocksize aware
  Btrfs: subpage-blocksize: __btrfs_lookup_bio_sums: Set offset when
    moving to a new bio_vec
  Btrfs: subpage-blocksize: Disable compression

 fs/btrfs/ctree.h                       |   6 +-
 fs/btrfs/disk-io.c                     |  49 +--
 fs/btrfs/disk-io.h                     |   2 +-
 fs/btrfs/extent-tree.c                 |   4 +-
 fs/btrfs/extent_io.c                   | 739 +++++++++++++++++++++------------
 fs/btrfs/extent_io.h                   |  99 ++++-
 fs/btrfs/file-item.c                   |   7 +-
 fs/btrfs/file.c                        | 105 ++++-
 fs/btrfs/inode.c                       | 472 +++++++++++++++------
 fs/btrfs/ioctl.c                       | 232 +++++++----
 fs/btrfs/ordered-data.c                |  19 +
 fs/btrfs/ordered-data.h                |   4 +
 fs/btrfs/relocation.c                  |  87 +++-
 fs/btrfs/super.c                       |  19 +
 fs/btrfs/tests/btrfs-tests.c           |   8 +-
 fs/btrfs/tests/extent-io-tests.c       |   4 +-
 fs/btrfs/tests/free-space-tree-tests.c |  79 ++--
 fs/btrfs/tree-log.c                    |   2 +-
 fs/btrfs/volumes.c                     |  10 +-
 19 files changed, 1373 insertions(+), 574 deletions(-)

-- 
2.5.5



* [PATCH V21 01/19] Btrfs: subpage-blocksize: extent_clear_unlock_delalloc: Prevent page from being unlocked more than once
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

extent_clear_unlock_delalloc() can unlock a page more than once as shown
below (assume 4k as the block size and 64k as the page size).

cow_file_range
  create 4k ordered extent corresponding to page offsets 0 - 4095
  extent_clear_unlock_delalloc corresponding to page offsets 0 - 4095
    unlock page
  create 4k ordered extent corresponding to page offsets 4096 - 8191
  extent_clear_unlock_delalloc corresponding to page offsets 4096 - 8191
    unlock page

To prevent such a scenario this commit passes "delalloc end" to
extent_clear_unlock_delalloc() to help decide whether the page can be unlocked
or not.

NOTE: Since extent_clear_unlock_delalloc() is used by compression code
as well, the commit passes ordered extent "end" as the value for the
argument corresponding to "delalloc end" for invocations made from
compression code path. This will be fixed by a future commit that gets
compression to work in subpage-blocksize scenario.
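
In essence, the new unlock rule reduces to the following predicate (a
minimal sketch with illustrative names; this is not the kernel code
itself):

	#include <stdbool.h>

	typedef unsigned long long u64;

	/*
	 * A page may be unlocked only when the range being cleared
	 * covers it completely, or when the range ends at delalloc_end,
	 * i.e. this is the final extent_clear_unlock_delalloc() call
	 * for the delalloc range.
	 */
	static bool may_unlock_page(u64 page_start, u64 page_size,
				    u64 end, u64 delalloc_end)
	{
		u64 page_end = page_start + page_size - 1;

		return page_end <= end || end == delalloc_end;
	}

For the 64k page example above, the call for page offsets 0 - 4095 sees
page_end (65535) > end (4095) and end != delalloc_end, so the page
stays locked until the call that covers the final range of the page.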

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/extent_io.c | 16 +++++++----
 fs/btrfs/extent_io.h |  5 ++--
 fs/btrfs/inode.c     | 78 +++++++++++++++++++++++++++++-----------------------
 3 files changed, 57 insertions(+), 42 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f669240..dc60c604 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1708,9 +1708,8 @@ out_failed:
 }
 
 void extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
-				 struct page *locked_page,
-				 unsigned clear_bits,
-				 unsigned long page_ops)
+				u64 delalloc_end, struct page *locked_page,
+				unsigned clear_bits, unsigned long page_ops)
 {
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 	int ret;
@@ -1718,6 +1717,7 @@ void extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
 	unsigned long index = start >> PAGE_SHIFT;
 	unsigned long end_index = end >> PAGE_SHIFT;
 	unsigned long nr_pages = end_index - index + 1;
+	u64 page_end;
 	int i;
 
 	clear_extent_bit(tree, start, end, clear_bits, 1, 0, NULL, GFP_NOFS);
@@ -1748,8 +1748,14 @@ void extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
 				SetPageError(pages[i]);
 			if (page_ops & PAGE_END_WRITEBACK)
 				end_page_writeback(pages[i]);
-			if (page_ops & PAGE_UNLOCK)
-				unlock_page(pages[i]);
+
+			if (page_ops & PAGE_UNLOCK) {
+				page_end = page_offset(pages[i]) +
+					PAGE_SIZE - 1;
+				if ((page_end <= end)
+					|| (end == delalloc_end))
+					unlock_page(pages[i]);
+			}
 			put_page(pages[i]);
 		}
 		nr_pages -= ret;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 06b6f14..0948bca 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -430,9 +430,8 @@ int map_private_extent_buffer(struct extent_buffer *eb, unsigned long offset,
 void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end);
 void extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end);
 void extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
-				 struct page *locked_page,
-				 unsigned bits_to_clear,
-				 unsigned long page_ops);
+				u64 delalloc_end, struct page *locked_page,
+				unsigned bits_to_clear, unsigned long page_ops);
 struct bio *
 btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs,
 		gfp_t gfp_flags);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3440b52..3e4feac 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -560,12 +560,13 @@ cont:
 			 * we don't need to create any more async work items.
 			 * Unlock and free up our temp pages.
 			 */
-			extent_clear_unlock_delalloc(inode, start, end, NULL,
-						     clear_flags, PAGE_UNLOCK |
-						     PAGE_CLEAR_DIRTY |
-						     PAGE_SET_WRITEBACK |
-						     page_error_op |
-						     PAGE_END_WRITEBACK);
+			extent_clear_unlock_delalloc(inode, start, end, end,
+						NULL, clear_flags,
+						PAGE_UNLOCK
+						| PAGE_CLEAR_DIRTY
+						| PAGE_SET_WRITEBACK
+						| page_error_op
+						| PAGE_END_WRITEBACK);
 			goto free_pages_out;
 		}
 	}
@@ -835,6 +836,8 @@ retry:
 		extent_clear_unlock_delalloc(inode, async_extent->start,
 				async_extent->start +
 				async_extent->ram_size - 1,
+				async_extent->start +
+				async_extent->ram_size - 1,
 				NULL, EXTENT_LOCKED | EXTENT_DELALLOC,
 				PAGE_UNLOCK | PAGE_CLEAR_DIRTY |
 				PAGE_SET_WRITEBACK);
@@ -854,9 +857,10 @@ retry:
 			tree->ops->writepage_end_io_hook(p, start, end,
 							 NULL, 0);
 			p->mapping = NULL;
-			extent_clear_unlock_delalloc(inode, start, end, NULL, 0,
-						     PAGE_END_WRITEBACK |
-						     PAGE_SET_ERROR);
+			extent_clear_unlock_delalloc(inode, start, end, end,
+						NULL, 0,
+						PAGE_END_WRITEBACK |
+						PAGE_SET_ERROR);
 			free_async_extent_pages(async_extent);
 		}
 		alloc_hint = ins.objectid + ins.offset;
@@ -871,6 +875,8 @@ out_free:
 	extent_clear_unlock_delalloc(inode, async_extent->start,
 				     async_extent->start +
 				     async_extent->ram_size - 1,
+				     async_extent->start +
+				     async_extent->ram_size - 1,
 				     NULL, EXTENT_LOCKED | EXTENT_DELALLOC |
 				     EXTENT_DEFRAG | EXTENT_DO_ACCOUNTING,
 				     PAGE_UNLOCK | PAGE_CLEAR_DIRTY |
@@ -942,6 +948,7 @@ static noinline int cow_file_range(struct inode *inode,
 	struct btrfs_key ins;
 	struct extent_map *em;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+	unsigned long page_ops, extent_ops;
 	int ret = 0;
 
 	if (btrfs_is_free_space_inode(inode)) {
@@ -964,7 +971,8 @@ static noinline int cow_file_range(struct inode *inode,
 		ret = cow_file_range_inline(root, inode, start, end, 0, 0,
 					    NULL);
 		if (ret == 0) {
-			extent_clear_unlock_delalloc(inode, start, end, NULL,
+			extent_clear_unlock_delalloc(inode, start, end,
+				     delalloc_end, NULL,
 				     EXTENT_LOCKED | EXTENT_DELALLOC |
 				     EXTENT_DEFRAG, PAGE_UNLOCK |
 				     PAGE_CLEAR_DIRTY | PAGE_SET_WRITEBACK |
@@ -986,8 +994,6 @@ static noinline int cow_file_range(struct inode *inode,
 	btrfs_drop_extent_cache(inode, start, start + num_bytes - 1, 0);
 
 	while (disk_num_bytes > 0) {
-		unsigned long op;
-
 		cur_alloc_size = disk_num_bytes;
 		ret = btrfs_reserve_extent(root, cur_alloc_size,
 					   root->sectorsize, 0, alloc_hint,
@@ -1055,13 +1061,12 @@ static noinline int cow_file_range(struct inode *inode,
 		 * Do set the Private2 bit so we know this page was properly
 		 * setup for writepage
 		 */
-		op = unlock ? PAGE_UNLOCK : 0;
-		op |= PAGE_SET_PRIVATE2;
-
-		extent_clear_unlock_delalloc(inode, start,
-					     start + ram_size - 1, locked_page,
-					     EXTENT_LOCKED | EXTENT_DELALLOC,
-					     op);
+		page_ops = unlock ? PAGE_UNLOCK : 0;
+		page_ops |= PAGE_SET_PRIVATE2;
+		extent_ops = EXTENT_LOCKED | EXTENT_DELALLOC;
+		extent_clear_unlock_delalloc(inode, start, start + ram_size - 1,
+					delalloc_end, locked_page, extent_ops,
+					page_ops);
 		disk_num_bytes -= cur_alloc_size;
 		num_bytes -= cur_alloc_size;
 		alloc_hint = ins.objectid + ins.offset;
@@ -1076,11 +1081,14 @@ out_reserve:
 	btrfs_dec_block_group_reservations(root->fs_info, ins.objectid);
 	btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1);
 out_unlock:
-	extent_clear_unlock_delalloc(inode, start, end, locked_page,
-				     EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
-				     EXTENT_DELALLOC | EXTENT_DEFRAG,
-				     PAGE_UNLOCK | PAGE_CLEAR_DIRTY |
-				     PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK);
+	page_ops = unlock ? PAGE_UNLOCK : 0;
+	page_ops |= PAGE_CLEAR_DIRTY | PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK
+		| PAGE_SET_ERROR;
+	extent_ops = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING
+		| EXTENT_DEFRAG;
+
+	extent_clear_unlock_delalloc(inode, start, end, delalloc_end,
+				locked_page, extent_ops, page_ops);
 	goto out;
 }
 
@@ -1227,9 +1235,9 @@ static noinline int csum_exist_in_range(struct btrfs_root *root,
  * blocks on disk
  */
 static noinline int run_delalloc_nocow(struct inode *inode,
-				       struct page *locked_page,
-			      u64 start, u64 end, int *page_started, int force,
-			      unsigned long *nr_written)
+				struct page *locked_page,
+				u64 start, u64 end, int *page_started,
+				int force, unsigned long *nr_written)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_trans_handle *trans;
@@ -1255,7 +1263,8 @@ static noinline int run_delalloc_nocow(struct inode *inode,
 
 	path = btrfs_alloc_path();
 	if (!path) {
-		extent_clear_unlock_delalloc(inode, start, end, locked_page,
+		extent_clear_unlock_delalloc(inode, start, end, end,
+					     locked_page,
 					     EXTENT_LOCKED | EXTENT_DELALLOC |
 					     EXTENT_DO_ACCOUNTING |
 					     EXTENT_DEFRAG, PAGE_UNLOCK |
@@ -1273,7 +1282,8 @@ static noinline int run_delalloc_nocow(struct inode *inode,
 		trans = btrfs_join_transaction(root);
 
 	if (IS_ERR(trans)) {
-		extent_clear_unlock_delalloc(inode, start, end, locked_page,
+		extent_clear_unlock_delalloc(inode, start, end, end,
+					     locked_page,
 					     EXTENT_LOCKED | EXTENT_DELALLOC |
 					     EXTENT_DO_ACCOUNTING |
 					     EXTENT_DEFRAG, PAGE_UNLOCK |
@@ -1487,10 +1497,10 @@ out_check:
 		}
 
 		extent_clear_unlock_delalloc(inode, cur_offset,
-					     cur_offset + num_bytes - 1,
-					     locked_page, EXTENT_LOCKED |
-					     EXTENT_DELALLOC, PAGE_UNLOCK |
-					     PAGE_SET_PRIVATE2);
+					cur_offset + num_bytes - 1, end,
+					locked_page, EXTENT_LOCKED |
+					EXTENT_DELALLOC, PAGE_UNLOCK |
+					PAGE_SET_PRIVATE2);
 		if (!nolock && nocow)
 			btrfs_end_write_no_snapshoting(root);
 		cur_offset = extent_end;
@@ -1517,7 +1527,7 @@ error:
 		ret = err;
 
 	if (ret && cur_offset < end)
-		extent_clear_unlock_delalloc(inode, cur_offset, end,
+		extent_clear_unlock_delalloc(inode, cur_offset, end, end,
 					     locked_page, EXTENT_LOCKED |
 					     EXTENT_DELALLOC | EXTENT_DEFRAG |
 					     EXTENT_DO_ACCOUNTING, PAGE_UNLOCK |
-- 
2.5.5



* [PATCH V21 02/19] Btrfs: subpage-blocksize: Make sure delalloc range intersects with the locked page's range
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

find_delalloc_range() indirectly depended on EXTENT_UPTODATE to make
sure that the delalloc range returned intersects with the file range
mapped by the page. Since we now track "uptodate" status in a per-page
bitmap (i.e. in btrfs_page_private->bstate), this commit adds an
explicit check to make sure that the delalloc range starts from within
the file range mapped by the page.
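
The added check boils down to the following (a minimal sketch with
illustrative names; this is not the kernel code itself):

	#include <stdbool.h>

	typedef unsigned long long u64;

	/*
	 * A candidate delalloc extent that starts beyond the end of the
	 * locked page cannot intersect the page's file range, so the
	 * delalloc search stops there.
	 */
	static bool delalloc_starts_within_page(u64 page_offset,
						u64 page_size,
						u64 state_start)
	{
		u64 page_end = page_offset + page_size - 1;

		return state_start <= page_end;
	}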

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/extent_io.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index dc60c604..dd7faa1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1477,6 +1477,7 @@ out:
  * 1 is returned if we find something, 0 if nothing was in the tree
  */
 static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
+					struct page *locked_page,
 					u64 *start, u64 *end, u64 max_bytes,
 					struct extent_state **cached_state)
 {
@@ -1485,6 +1486,9 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
 	u64 cur_start = *start;
 	u64 found = 0;
 	u64 total_bytes = 0;
+	u64 page_end;
+
+	page_end = page_offset(locked_page) + PAGE_SIZE - 1;
 
 	spin_lock(&tree->lock);
 
@@ -1505,7 +1509,8 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
 			      (state->state & EXTENT_BOUNDARY))) {
 			goto out;
 		}
-		if (!(state->state & EXTENT_DELALLOC)) {
+		if (!(state->state & EXTENT_DELALLOC)
+			|| (page_end < state->start)) {
 			if (!found)
 				*end = state->end;
 			goto out;
@@ -1643,8 +1648,9 @@ again:
 	/* step one, find a bunch of delalloc bytes starting at start */
 	delalloc_start = *start;
 	delalloc_end = 0;
-	found = find_delalloc_range(tree, &delalloc_start, &delalloc_end,
-				    max_bytes, &cached_state);
+	found = find_delalloc_range(tree, locked_page,
+				&delalloc_start, &delalloc_end,
+				max_bytes, &cached_state);
 	if (!found || delalloc_end <= *start) {
 		*start = delalloc_start;
 		*end = delalloc_end;
-- 
2.5.5



* [PATCH V21 03/19] Btrfs: subpage-blocksize: Use PG_Uptodate flag to track block uptodate status
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

This commit causes a block's uptodate status to be tracked using
struct page's PG_Uptodate flag instead of extent_io_tree's
EXTENT_UPTODATE flag.

This is in preparation for the subpage-blocksize patchset, which will
use a per-page bitmap to track an individual block's uptodate status
when blocksize < PAGE_SIZE. We will continue to use the PG_Uptodate
flag to track uptodate status in the blocksize == PAGE_SIZE scenario.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/extent_io.c | 61 +++++++---------------------------------------------
 fs/btrfs/extent_io.h |  2 +-
 fs/btrfs/inode.c     |  6 ++----
 3 files changed, 11 insertions(+), 58 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index dd7faa1..522c943 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1950,12 +1950,9 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
  * helper function to set a given page up to date if all the
  * extents in the tree for that page are up to date
  */
-static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
+static void check_page_uptodate(struct page *page)
 {
-	u64 start = page_offset(page);
-	u64 end = start + PAGE_SIZE - 1;
-	if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL))
-		SetPageUptodate(page);
+	SetPageUptodate(page);
 }
 
 int free_io_failure(struct extent_io_tree *failure_tree,
@@ -2492,18 +2489,6 @@ static void end_bio_extent_writepage(struct bio *bio)
 	bio_put(bio);
 }
 
-static void
-endio_readpage_release_extent(struct extent_io_tree *tree, u64 start, u64 len,
-			      int uptodate)
-{
-	struct extent_state *cached = NULL;
-	u64 end = start + len - 1;
-
-	if (uptodate && tree->track_uptodate)
-		set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
-	unlock_extent_cached(tree, start, end, &cached, GFP_ATOMIC);
-}
-
 /*
  * after a readpage IO is done, we need to:
  * clear the uptodate bits on error
@@ -2525,8 +2510,6 @@ static void end_bio_extent_readpage(struct bio *bio)
 	u64 start;
 	u64 end;
 	u64 len;
-	u64 extent_start = 0;
-	u64 extent_len = 0;
 	int mirror;
 	int ret;
 	int i;
@@ -2612,7 +2595,7 @@ readpage_ok:
 			off = i_size & (PAGE_SIZE-1);
 			if (page->index == end_index && off)
 				zero_user_segment(page, off, PAGE_SIZE);
-			SetPageUptodate(page);
+			check_page_uptodate(page);
 		} else {
 			ClearPageUptodate(page);
 			SetPageError(page);
@@ -2620,32 +2603,10 @@ readpage_ok:
 		unlock_page(page);
 		offset += len;
 
-		if (unlikely(!uptodate)) {
-			if (extent_len) {
-				endio_readpage_release_extent(tree,
-							      extent_start,
-							      extent_len, 1);
-				extent_start = 0;
-				extent_len = 0;
-			}
-			endio_readpage_release_extent(tree, start,
-						      end - start + 1, 0);
-		} else if (!extent_len) {
-			extent_start = start;
-			extent_len = end + 1 - start;
-		} else if (extent_start + extent_len == start) {
-			extent_len += end + 1 - start;
-		} else {
-			endio_readpage_release_extent(tree, extent_start,
-						      extent_len, uptodate);
-			extent_start = start;
-			extent_len = end + 1 - start;
-		}
+		unlock_extent_cached(tree, start, end, NULL, GFP_ATOMIC);
+
 	}
 
-	if (extent_len)
-		endio_readpage_release_extent(tree, extent_start, extent_len,
-					      uptodate);
 	if (io_bio->end_io)
 		io_bio->end_io(io_bio, bio->bi_error);
 	bio_put(bio);
@@ -2933,18 +2894,15 @@ static int __do_readpage(struct extent_io_tree *tree,
 
 		if (cur >= last_byte) {
 			char *userpage;
-			struct extent_state *cached = NULL;
 
 			iosize = PAGE_SIZE - pg_offset;
 			userpage = kmap_atomic(page);
 			memset(userpage + pg_offset, 0, iosize);
 			flush_dcache_page(page);
 			kunmap_atomic(userpage);
-			set_extent_uptodate(tree, cur, cur + iosize - 1,
-					    &cached, GFP_NOFS);
 			unlock_extent_cached(tree, cur,
 					     cur + iosize - 1,
-					     &cached, GFP_NOFS);
+					     NULL, GFP_NOFS);
 			break;
 		}
 		em = __get_extent_map(inode, page, pg_offset, cur,
@@ -3034,8 +2992,6 @@ static int __do_readpage(struct extent_io_tree *tree,
 			flush_dcache_page(page);
 			kunmap_atomic(userpage);
 
-			set_extent_uptodate(tree, cur, cur + iosize - 1,
-					    &cached, GFP_NOFS);
 			unlock_extent_cached(tree, cur,
 					     cur + iosize - 1,
 					     &cached, GFP_NOFS);
@@ -3044,9 +3000,8 @@ static int __do_readpage(struct extent_io_tree *tree,
 			continue;
 		}
 		/* the get_extent function already copied into the page */
-		if (test_range_bit(tree, cur, cur_end,
-				   EXTENT_UPTODATE, 1, NULL)) {
-			check_page_uptodate(tree, page);
+		if (PageUptodate(page)) {
+			check_page_uptodate(page);
 			unlock_extent(tree, cur, cur + iosize - 1);
 			cur = cur + iosize;
 			pg_offset += iosize;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 0948bca..922f4c1 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -310,7 +310,7 @@ static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 start,
 		u64 end, struct extent_state **cached_state)
 {
 	return set_extent_bit(tree, start, end,
-			      EXTENT_DELALLOC | EXTENT_UPTODATE,
+			      EXTENT_DELALLOC,
 			      NULL, cached_state, GFP_NOFS);
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3e4feac..652d01d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3009,7 +3009,6 @@ out:
 		else
 			start = ordered_extent->file_offset;
 		end = ordered_extent->file_offset + ordered_extent->len - 1;
-		clear_extent_uptodate(io_tree, start, end, NULL, GFP_NOFS);
 
 		/* Drop the cache for the part of the extent we didn't write. */
 		btrfs_drop_extent_cache(inode, start, end, 0);
@@ -6807,7 +6806,6 @@ struct extent_map *btrfs_get_extent(struct inode *inode, struct page *page,
 	struct btrfs_key found_key;
 	struct extent_map *em = NULL;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
-	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct btrfs_trans_handle *trans = NULL;
 	const bool new_inline = !page || create;
 
@@ -6984,8 +6982,8 @@ next:
 			kunmap(page);
 			btrfs_mark_buffer_dirty(leaf);
 		}
-		set_extent_uptodate(io_tree, em->start,
-				    extent_map_end(em) - 1, NULL, GFP_NOFS);
+
+		SetPageUptodate(page);
 		goto insert;
 	}
 not_found:
-- 
2.5.5



* [PATCH V21 04/19] Btrfs: Remove extent_io_tree's track_uptodate member
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

We now track block uptodate status using a page's PG_Uptodate
flag. Hence this commit removes the now unused
extent_io_tree->track_uptodate member.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/disk-io.c   | 1 -
 fs/btrfs/extent_io.h | 1 -
 fs/btrfs/inode.c     | 2 --
 3 files changed, 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 03ac601..9ff48a7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2085,7 +2085,6 @@ int btrfs_init_eb_info(struct btrfs_fs_info *fs_info)
 
 	eb_info->fs_info = fs_info;
 	extent_io_tree_init(&eb_info->io_tree, eb_info);
-	eb_info->io_tree.track_uptodate = 0;
 	eb_info->io_tree.ops = &btree_extent_io_ops;
 	extent_io_tree_init(&eb_info->io_failure_tree, eb_info);
 	INIT_RADIX_TREE(&eb_info->buffer_radix, GFP_ATOMIC);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 922f4c1..9aa22f9 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -102,7 +102,6 @@ struct extent_io_tree {
 	struct rb_root state;
 	void *private_data;
 	u64 dirty_bytes;
-	int track_uptodate;
 	spinlock_t lock;
 	const struct extent_io_ops *ops;
 };
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 652d01d..ac4a7c0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9306,8 +9306,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	extent_map_tree_init(&ei->extent_tree);
 	extent_io_tree_init(&ei->io_tree, inode);
 	extent_io_tree_init(&ei->io_failure_tree, inode);
-	ei->io_tree.track_uptodate = 1;
-	ei->io_failure_tree.track_uptodate = 1;
 	atomic_set(&ei->sync_writers, 0);
 	mutex_init(&ei->log_mutex);
 	mutex_init(&ei->delalloc_mutex);
-- 
2.5.5



* [PATCH V21 05/19] Btrfs: subpage-blocksize: Fix whole page read.
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

For the subpage-blocksize scenario, a page can contain multiple
blocks. This patch handles reading data from files in such cases.

To track the status of individual blocks of a page, this patch makes use
of a bitmap pointed to by the newly introduced per-page 'struct
btrfs_page_private'.

The per-page btrfs_page_private->io_lock plays the same role as
BH_Uptodate_Lock (see end_buffer_async_read()) i.e. without the io_lock
we may end up in the following situation,

NOTE: Assume 64k page size and 4k block size. Also assume that the first
12 blocks of the page are contiguous while the next 4 blocks are
contiguous. When reading the page we end up submitting two "logical
address space" bios. So end_bio_extent_readpage function is invoked
twice, once for each bio.

|-------------------------+-------------------------+-------------|
| Task A                  | Task B                  | Task C      |
|-------------------------+-------------------------+-------------|
| end_bio_extent_readpage |                         |             |
| process block 0         |                         |             |
| - clear BLK_STATE_IO    |                         |             |
| - page_read_complete    |                         |             |
| process block 1         |                         |             |
|                         |                         |             |
|                         |                         |             |
|                         | end_bio_extent_readpage |             |
|                         | process block 0         |             |
|                         | - clear BLK_STATE_IO    |             |
|                         | - page_read_complete    |             |
|                         | process block 1         |             |
|                         |                         |             |
| process block 11        | process block 3         |             |
| - clear BLK_STATE_IO    | - clear BLK_STATE_IO    |             |
| - page_read_complete    | - page_read_complete    |             |
|   - returns true        |   - returns true        |             |
|   - unlock_page()       |                         |             |
|                         |                         | lock_page() |
|                         |   - unlock_page()       |             |
|-------------------------+-------------------------+-------------|

We end up incorrectly unlocking the page twice and "Task C" ends up
working on an unlocked page. So private->io_lock makes sure that only
one of the tasks gets "true" as the return value when page_io_complete()
is invoked. As an optimization the patch gets the io_lock only when the
last block of the bio_vec is being processed.
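
The io_lock rule can be modelled in a few lines of userspace C (an
illustrative sketch only; pthread_mutex_t stands in for the kernel
spinlock):

	#include <pthread.h>
	#include <stdbool.h>

	struct page_private_model {
		pthread_mutex_t io_lock;
		unsigned long io_bits;	/* one bit per block under I/O */
	};

	/*
	 * Returns true for exactly one caller: the one that clears the
	 * last in-flight block and may therefore unlock the page.
	 */
	static bool complete_block_read(struct page_private_model *pp,
					int blk)
	{
		bool last;

		pthread_mutex_lock(&pp->io_lock);
		pp->io_bits &= ~(1UL << blk);	/* clear BLK_STATE_IO */
		last = (pp->io_bits == 0);	/* page_io_complete() */
		pthread_mutex_unlock(&pp->io_lock);

		return last;
	}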

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/extent_io.c | 299 +++++++++++++++++++++++++++++++++++++++++----------
 fs/btrfs/extent_io.h |  76 ++++++++++++-
 fs/btrfs/inode.c     |  13 +--
 3 files changed, 320 insertions(+), 68 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 522c943..b3885cc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -23,6 +23,7 @@
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
+static struct kmem_cache *page_private_cache;
 static struct bio_set *btrfs_bioset;
 
 static inline bool extent_state_in_tree(const struct extent_state *state)
@@ -163,10 +164,16 @@ int __init extent_io_init(void)
 	if (!extent_buffer_cache)
 		goto free_state_cache;
 
+	page_private_cache = kmem_cache_create("btrfs_page_private",
+			sizeof(struct btrfs_page_private), 0,
+			SLAB_MEM_SPREAD, NULL);
+	if (!page_private_cache)
+		goto free_buffer_cache;
+
 	btrfs_bioset = bioset_create(BIO_POOL_SIZE,
 				     offsetof(struct btrfs_io_bio, bio));
 	if (!btrfs_bioset)
-		goto free_buffer_cache;
+		goto free_page_private_cache;
 
 	if (bioset_integrity_create(btrfs_bioset, BIO_POOL_SIZE))
 		goto free_bioset;
@@ -177,6 +184,10 @@ free_bioset:
 	bioset_free(btrfs_bioset);
 	btrfs_bioset = NULL;
 
+free_page_private_cache:
+	kmem_cache_destroy(page_private_cache);
+	page_private_cache = NULL;
+
 free_buffer_cache:
 	kmem_cache_destroy(extent_buffer_cache);
 	extent_buffer_cache = NULL;
@@ -1311,6 +1322,96 @@ int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 				  changeset);
 }
 
+static int modify_page_blks_state(struct page *page,
+				unsigned long blk_states,
+				u64 start, u64 end, int set)
+{
+	struct inode *inode = page->mapping->host;
+	unsigned long *bitmap;
+	unsigned long first_state;
+	unsigned long state;
+	u64 nr_blks;
+	u64 blk;
+
+	if (BTRFS_I(inode)->root->sectorsize == PAGE_SIZE)
+		return 0;
+
+	bitmap = ((struct btrfs_page_private *)page->private)->bstate;
+
+	blk = BTRFS_BYTES_TO_BLKS(BTRFS_I(inode)->root->fs_info,
+				start & (PAGE_SIZE - 1));
+	nr_blks = BTRFS_BYTES_TO_BLKS(BTRFS_I(inode)->root->fs_info,
+				(end - start + 1));
+
+	first_state = find_next_bit(&blk_states, BLK_NR_STATE, 0);
+
+	while (nr_blks--) {
+		state = first_state;
+
+		while (state < BLK_NR_STATE) {
+			if (set)
+				set_bit((blk * BLK_NR_STATE) + state, bitmap);
+			else
+				clear_bit((blk * BLK_NR_STATE) + state, bitmap);
+
+			state = find_next_bit(&blk_states, BLK_NR_STATE,
+					state + 1);
+		}
+
+		++blk;
+	}
+
+	return 0;
+}
+
+int set_page_blks_state(struct page *page, unsigned long blk_states,
+			u64 start, u64 end)
+{
+	return modify_page_blks_state(page, blk_states, start, end, 1);
+}
+
+int clear_page_blks_state(struct page *page, unsigned long blk_states,
+			u64 start, u64 end)
+{
+	return modify_page_blks_state(page, blk_states, start, end, 0);
+}
+
+int test_page_blks_state(struct page *page, enum blk_state blk_state,
+			u64 start, u64 end, int check_all)
+{
+	struct inode *inode = page->mapping->host;
+	unsigned long *bitmap;
+	unsigned long blk;
+	u64 nr_blks;
+	int found = 0;
+
+	ASSERT(BTRFS_I(inode)->root->sectorsize < PAGE_SIZE);
+
+	bitmap = ((struct btrfs_page_private *)page->private)->bstate;
+
+	blk = BTRFS_BYTES_TO_BLKS(BTRFS_I(inode)->root->fs_info,
+				start & (PAGE_SIZE - 1));
+	nr_blks = BTRFS_BYTES_TO_BLKS(BTRFS_I(inode)->root->fs_info,
+				(end - start + 1));
+
+	while (nr_blks--) {
+		if (test_bit((blk * BLK_NR_STATE) + blk_state, bitmap)) {
+			if (!check_all)
+				return 1;
+			found = 1;
+		} else if (check_all) {
+			return 0;
+		}
+
+		++blk;
+	}
+
+	if (!check_all && !found)
+		return 0;
+
+	return 1;
+}
+
 /*
  * either insert or lock state struct between start and end use mask to tell
  * us if waiting is desired.
@@ -1950,9 +2051,25 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
  * helper function to set a given page up to date if all the
  * extents in the tree for that page are up to date
  */
-static void check_page_uptodate(struct page *page)
+void check_page_uptodate(struct page *page)
 {
-	SetPageUptodate(page);
+	struct inode *inode = page->mapping->host;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	u64 start = page_offset(page);
+	u64 end = start + PAGE_SIZE - 1;
+
+	if (root->sectorsize == PAGE_SIZE
+		|| test_page_blks_state(page, BLK_STATE_UPTODATE, start,
+					end, 1))
+		SetPageUptodate(page);
+}
+
+static int page_io_complete(struct page *page)
+{
+	u64 start = page_offset(page);
+	u64 end = start + PAGE_SIZE - 1;
+
+	return !test_page_blks_state(page, BLK_STATE_IO, start, end, 0);
 }
 
 int free_io_failure(struct extent_io_tree *failure_tree,
@@ -2282,7 +2399,9 @@ int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
 	 *	a) deliver good data to the caller
 	 *	b) correct the bad sectors on disk
 	 */
-	if (failed_bio->bi_vcnt > 1) {
+	if ((failed_bio->bi_vcnt > 1)
+		|| (failed_bio->bi_io_vec->bv_len
+			> BTRFS_I(inode)->root->sectorsize)) {
 		/*
 		 * to fulfill b), we need to know the exact failing sectors, as
 		 * we don't want to rewrite any more than the failed ones. thus,
@@ -2506,17 +2625,21 @@ static void end_bio_extent_readpage(struct bio *bio)
 	int uptodate = !bio->bi_error;
 	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
 	struct extent_io_tree *tree, *failure_tree;
+	struct btrfs_page_private *pg_private;
+	unsigned long flags;
 	u64 offset = 0;
 	u64 start;
 	u64 end;
-	u64 len;
+	int nr_sectors;
 	int mirror;
+	int unlock;
 	int ret;
 	int i;
 
 	bio_for_each_segment_all(bvec, bio, i) {
 		struct page *page = bvec->bv_page;
 		struct inode *inode = page->mapping->host;
+		struct btrfs_root *root = BTRFS_I(inode)->root;
 
 		pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, "
 			 "mirror=%u\n", (u64)bio->bi_iter.bi_sector,
@@ -2524,28 +2647,14 @@ static void end_bio_extent_readpage(struct bio *bio)
 		tree = &BTRFS_I(inode)->io_tree;
 		failure_tree = &BTRFS_I(inode)->io_failure_tree;
 
-		/* We always issue full-page reads, but if some block
-		 * in a page fails to read, blk_update_request() will
-		 * advance bv_offset and adjust bv_len to compensate.
-		 * Print a warning for nonzero offsets, and an error
-		 * if they don't add up to a full page.  */
-		if (bvec->bv_offset || bvec->bv_len != PAGE_SIZE) {
-			if (bvec->bv_offset + bvec->bv_len != PAGE_SIZE)
-				btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
-				   "partial page read in btrfs with offset %u and length %u",
-					bvec->bv_offset, bvec->bv_len);
-			else
-				btrfs_info(BTRFS_I(page->mapping->host)->root->fs_info,
-				   "incomplete page read in btrfs with offset %u and "
-				   "length %u",
-					bvec->bv_offset, bvec->bv_len);
-		}
+		start = page_offset(page) + bvec->bv_offset;
+		nr_sectors = BTRFS_BYTES_TO_BLKS(root->fs_info,
+						bvec->bv_len);
+		mirror = io_bio->mirror_num;
 
-		start = page_offset(page);
-		end = start + bvec->bv_offset + bvec->bv_len - 1;
-		len = bvec->bv_len;
+next_block:
+		end = start + root->sectorsize - 1;
 
-		mirror = io_bio->mirror_num;
 		if (likely(uptodate && tree->ops &&
 			   tree->ops->readpage_end_io_hook)) {
 			ret = tree->ops->readpage_end_io_hook(io_bio, offset,
@@ -2556,17 +2665,11 @@ static void end_bio_extent_readpage(struct bio *bio)
 			else
 				clean_io_failure(BTRFS_I(inode)->root->fs_info,
 						 failure_tree, tree, start,
-						 page, btrfs_ino(inode), 0);
+						 page, btrfs_ino(inode),
+						 start - page_offset(page));
 		}
 
-		if (likely(uptodate))
-			goto readpage_ok;
-
-		if (tree->ops && tree->ops->readpage_io_failed_hook) {
-			ret = tree->ops->readpage_io_failed_hook(page, mirror);
-			if (!ret && !bio->bi_error)
-				uptodate = 1;
-		} else {
+		if (!uptodate) {
 			/*
 			 * The generic bio_readpage_error handles errors the
 			 * following way: If possible, new read requests are
@@ -2581,30 +2684,58 @@ static void end_bio_extent_readpage(struct bio *bio)
 						 mirror);
 			if (ret == 0) {
 				uptodate = !bio->bi_error;
-				offset += len;
-				continue;
+				offset += root->sectorsize;
+				if (--nr_sectors) {
+					start = end + 1;
+					goto next_block;
+				} else {
+					continue;
+				}
 			}
 		}
-readpage_ok:
-		if (likely(uptodate)) {
-			loff_t i_size = i_size_read(inode);
-			pgoff_t end_index = i_size >> PAGE_SHIFT;
-			unsigned off;
-
-			/* Zero out the end if this page straddles i_size */
-			off = i_size & (PAGE_SIZE-1);
-			if (page->index == end_index && off)
-				zero_user_segment(page, off, PAGE_SIZE);
+
+		if (uptodate) {
+			set_page_blks_state(page, 1 << BLK_STATE_UPTODATE,
+					start, end);
 			check_page_uptodate(page);
 		} else {
 			ClearPageUptodate(page);
 			SetPageError(page);
 		}
-		unlock_page(page);
-		offset += len;
 
-		unlock_extent_cached(tree, start, end, NULL, GFP_ATOMIC);
+		offset += root->sectorsize;
+
+		if (--nr_sectors) {
+			clear_page_blks_state(page, 1 << BLK_STATE_IO,
+					start, end);
+			clear_extent_bit(tree, start, end,
+					EXTENT_LOCKED, 1, 0, NULL, GFP_ATOMIC);
+			start = end + 1;
+			goto next_block;
+		}
+
+		WARN_ON(!PagePrivate(page));
+
+		unlock = 1;
+
+		if (root->sectorsize < PAGE_SIZE) {
+			pg_private = (struct btrfs_page_private *)page->private;
+
+			spin_lock_irqsave(&pg_private->io_lock, flags);
+
+			clear_page_blks_state(page, 1 << BLK_STATE_IO,
+					start, end);
+
+			unlock = page_io_complete(page);
+
+			spin_unlock_irqrestore(&pg_private->io_lock, flags);
+		}
+
+		clear_extent_bit(tree, start, end, EXTENT_LOCKED, 1, 0, NULL,
+				GFP_ATOMIC);
 
+		if (unlock)
+			unlock_page(page);
 	}
 
 	if (io_bio->end_io)
@@ -2794,13 +2925,51 @@ static void attach_extent_buffer_page(struct extent_buffer *eb,
 	}
 }
 
-void set_page_extent_mapped(struct page *page)
+int set_page_extent_mapped(struct page *page)
 {
+	struct inode *inode = page->mapping->host;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_page_private *pg_private;
+	unsigned long private = EXTENT_PAGE_PRIVATE;
+
 	if (!PagePrivate(page)) {
+		if (root->sectorsize < PAGE_SIZE) {
+			pg_private = kmem_cache_zalloc(page_private_cache,
+						GFP_NOFS);
+			if (!pg_private)
+				return -ENOMEM;
+
+			spin_lock_init(&pg_private->io_lock);
+
+			private = (unsigned long)pg_private;
+		}
+
 		SetPagePrivate(page);
 		get_page(page);
-		set_page_private(page, EXTENT_PAGE_PRIVATE);
+		set_page_private(page, private);
 	}
+
+	return 0;
+}
+
+int clear_page_extent_mapped(struct page *page)
+{
+	struct inode *inode = page->mapping->host;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_page_private *pg_private;
+
+	if (PagePrivate(page)) {
+		if (root->sectorsize < PAGE_SIZE) {
+			pg_private = (struct btrfs_page_private *)(page->private);
+			kmem_cache_free(page_private_cache, pg_private);
+		}
+
+		ClearPagePrivate(page);
+		set_page_private(page, 0);
+		put_page(page);
+	}
+
+	return 0;
 }
 
 static struct extent_map *
@@ -2846,6 +3015,7 @@ static int __do_readpage(struct extent_io_tree *tree,
 			 u64 *prev_em_start)
 {
 	struct inode *inode = page->mapping->host;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
 	u64 start = page_offset(page);
 	u64 page_end = start + PAGE_SIZE - 1;
 	u64 end;
@@ -2868,12 +3038,10 @@ static int __do_readpage(struct extent_io_tree *tree,
 	set_page_extent_mapped(page);
 
 	end = page_end;
-	if (!PageUptodate(page)) {
-		if (cleancache_get_page(page) == 0) {
-			BUG_ON(blocksize != PAGE_SIZE);
-			unlock_extent(tree, start, end);
-			goto out;
-		}
+	if ((blocksize == PAGE_SIZE) && !PageUptodate(page) &&
+		(cleancache_get_page(page) == 0)) {
+		unlock_extent(tree, start, end);
+		goto out;
 	}
 
 	if (page->index == last_byte >> PAGE_SHIFT) {
@@ -2900,6 +3068,8 @@ static int __do_readpage(struct extent_io_tree *tree,
 			memset(userpage + pg_offset, 0, iosize);
 			flush_dcache_page(page);
 			kunmap_atomic(userpage);
+			set_page_blks_state(page, 1 << BLK_STATE_UPTODATE, cur,
+						cur + iosize - 1);
 			unlock_extent_cached(tree, cur,
 					     cur + iosize - 1,
 					     NULL, GFP_NOFS);
@@ -2992,6 +3162,9 @@ static int __do_readpage(struct extent_io_tree *tree,
 			flush_dcache_page(page);
 			kunmap_atomic(userpage);
 
+			set_page_blks_state(page, 1 << BLK_STATE_UPTODATE, cur,
+					cur + iosize - 1);
+
 			unlock_extent_cached(tree, cur,
 					     cur + iosize - 1,
 					     &cached, GFP_NOFS);
@@ -3000,7 +3173,12 @@ static int __do_readpage(struct extent_io_tree *tree,
 			continue;
 		}
 		/* the get_extent function already copied into the page */
-		if (PageUptodate(page)) {
+		if ((root->sectorsize == PAGE_SIZE
+				&& PageUptodate(page))
+			|| (root->sectorsize < PAGE_SIZE
+				&& test_page_blks_state(page,
+							BLK_STATE_UPTODATE, cur,
+							cur_end, 1))) {
 			check_page_uptodate(page);
 			unlock_extent(tree, cur, cur + iosize - 1);
 			cur = cur + iosize;
@@ -3019,6 +3197,9 @@ static int __do_readpage(struct extent_io_tree *tree,
 		}
 
 		pnr -= page->index;
+
+		set_page_blks_state(page, 1 << BLK_STATE_IO, cur,
+				cur + iosize - 1);
 		ret = submit_extent_page(REQ_OP_READ, read_flags, tree, NULL,
 					 page, sector, disk_io_size, pg_offset,
 					 bdev, bio, pnr,
@@ -3031,6 +3212,8 @@ static int __do_readpage(struct extent_io_tree *tree,
 			*bio_flags = this_bio_flag;
 		} else {
 			SetPageError(page);
+			clear_page_blks_state(page, 1 << BLK_STATE_IO,
+					cur, cur + iosize - 1);
 			unlock_extent(tree, cur, cur + iosize - 1);
 			goto out;
 		}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 9aa22f9..e7a0462 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -55,11 +55,72 @@
 #define PAGE_SET_ERROR		(1 << 5)
 
 /*
- * page->private values.  Every page that is controlled by the extent
- * map has page->private set to one.
+ * page->private values for "sector size" == "page size" case.  Every
+ * page that is controlled by the extent map has page->private set to
+ * one.
  */
 #define EXTENT_PAGE_PRIVATE 1
 
+enum blk_state {
+	BLK_STATE_UPTODATE,
+	BLK_STATE_DIRTY,
+	BLK_STATE_IO,
+	BLK_NR_STATE,
+};
+
+/*
+ * The maximum number of blocks per page (i.e. 32) occurs when using 2k
+ * as the block size and having 64k as the page size.
+ */
+#define BLK_STATE_NR_LONGS DIV_ROUND_UP(BLK_NR_STATE * 32, BITS_PER_LONG)
+
+
+/*
+ * btrfs_page_private->io_lock plays the same role as BH_Uptodate_Lock
+ * (see end_buffer_async_read()) i.e. without the io_lock we may end up
+ * in the following situation,
+ *
+ * NOTE: Assume 64k page size and 4k block size. Also assume that the first 12
+ * blocks of the page are contiguous while the next 4 blocks are contiguous. When
+ * reading the page we end up submitting two "logical address space" bios. So
+ * end_bio_extent_readpage function is invoked twice, once for each bio.
+ *
+ * |-------------------------+-------------------------+-------------|
+ * | Task A                  | Task B                  | Task C      |
+ * |-------------------------+-------------------------+-------------|
+ * | end_bio_extent_readpage |                         |             |
+ * | process block 0         |                         |             |
+ * | - clear BLK_STATE_IO    |                         |             |
+ * | - page_read_complete    |                         |             |
+ * | process block 1         |                         |             |
+ * |                         |                         |             |
+ * |                         |                         |             |
+ * |                         | end_bio_extent_readpage |             |
+ * |                         | process block 0         |             |
+ * |                         | - clear BLK_STATE_IO    |             |
+ * |                         | - page_read_complete    |             |
+ * |                         | process block 1         |             |
+ * |                         |                         |             |
+ * | process block 11        | process block 3         |             |
+ * | - clear BLK_STATE_IO    | - clear BLK_STATE_IO    |             |
+ * | - page_read_complete    | - page_read_complete    |             |
+ * |   - returns true        |   - returns true        |             |
+ * |   - unlock_page()       |                         |             |
+ * |                         |                         | lock_page() |
+ * |                         |   - unlock_page()       |             |
+ * |-------------------------+-------------------------+-------------|
+ *
+ * We end up incorrectly unlocking the page twice and "Task C" ends up
+ * working on an unlocked page. So private->io_lock makes sure that
+ * only one of the tasks gets "true" as the return value when
+ * page_io_complete() is invoked. As an optimization the patch gets the
+ * io_lock only when the last block of the bio_vec is being processed.
+ */
+struct btrfs_page_private {
+	spinlock_t io_lock;
+	unsigned long bstate[BLK_STATE_NR_LONGS];
+};
+
 struct extent_state;
 struct btrfs_root;
 struct btrfs_io_bio;
@@ -356,8 +417,14 @@ int extent_readpages(struct extent_io_tree *tree,
 		     get_extent_t get_extent);
 int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		__u64 start, __u64 len, get_extent_t *get_extent);
-void set_page_extent_mapped(struct page *page);
-
+int set_page_extent_mapped(struct page *page);
+int clear_page_extent_mapped(struct page *page);
+int set_page_blks_state(struct page *page, unsigned long blk_states,
+			u64 start, u64 end);
+int clear_page_blks_state(struct page *page, unsigned long blk_states,
+			u64 start, u64 end);
+int test_page_blks_state(struct page *page, enum blk_state blk_state,
+			u64 start, u64 end, int check_all);
 struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 					  u64 start);
 struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_eb_info *eb_info,
@@ -477,6 +544,7 @@ struct io_failure_record {
 	int in_validation;
 };
 
+void check_page_uptodate(struct page *page);
 void btrfs_free_io_failure_record(struct inode *inode, u64 start, u64 end);
 int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
 				struct io_failure_record **failrec_ret);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ac4a7c0..10dcb44 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6983,7 +6983,10 @@ next:
 			btrfs_mark_buffer_dirty(leaf);
 		}
 
-		SetPageUptodate(page);
+		set_page_blks_state(page, 1 << BLK_STATE_UPTODATE,
+				em->start, extent_map_end(em) - 1);
+		check_page_uptodate(page);
+
 		goto insert;
 	}
 not_found:
@@ -8802,11 +8805,9 @@ static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
 	map = &BTRFS_I(page->mapping->host)->extent_tree;
 	ret = try_release_extent_mapping(map, tree, page, gfp_flags);
-	if (ret == 1) {
-		ClearPagePrivate(page);
-		set_page_private(page, 0);
-		put_page(page);
-	}
+	if (ret == 1)
+		clear_page_extent_mapped(page);
+
 	return ret;
 }
 
-- 
2.5.5



* [PATCH V21 06/19] Btrfs: subpage-blocksize: Fix whole page write
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

For the subpage-blocksize scenario, a page can contain multiple
blocks. This patch handles writing data to files in such cases.

Also, when setting EXTENT_DELALLOC, we no longer set the
EXTENT_UPTODATE bit in the extent_io_tree, since uptodate status is now
tracked either by the bitmap pointed to by page->private or by the
PG_uptodate flag.
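
The core of the write-path change is a per-block loop of roughly this
shape (an illustrative sketch; test_block_dirty() and
submit_block_write() are hypothetical stand-ins for the kernel
helpers):

	typedef unsigned long long u64;

	#define PAGE_SIZE_MODEL	65536

	extern int test_block_dirty(void *page, u64 off);	/* hypothetical */
	extern void submit_block_write(void *page, u64 off);	/* hypothetical */

	static void write_page_blocks(void *page, u64 page_start,
				      unsigned int blocksize)
	{
		u64 cur;

		for (cur = page_start; cur < page_start + PAGE_SIZE_MODEL;
		     cur += blocksize) {
			/* subpage case: skip blocks that are not dirty */
			if (blocksize < PAGE_SIZE_MODEL &&
			    !test_block_dirty(page, cur))
				continue;

			submit_block_write(page, cur);
		}
	}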

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/extent_io.c  | 114 +++++++++++++++++++++++++++-----------------------
 fs/btrfs/file.c       |  16 +++++++
 fs/btrfs/inode.c      |  69 ++++++++++++++++++++++++------
 fs/btrfs/relocation.c |   3 ++
 4 files changed, 137 insertions(+), 65 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b3885cc..6cac61f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2573,36 +2573,41 @@ void end_extent_writepage(struct page *page, int err, u64 start, u64 end)
  */
 static void end_bio_extent_writepage(struct bio *bio)
 {
+	struct btrfs_page_private *pg_private;
 	struct bio_vec *bvec;
+	unsigned long flags;
 	u64 start;
 	u64 end;
+	int clear_writeback;
 	int i;
 
 	bio_for_each_segment_all(bvec, bio, i) {
 		struct page *page = bvec->bv_page;
+		struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
 
-		/* We always issue full-page reads, but if some block
-		 * in a page fails to read, blk_update_request() will
-		 * advance bv_offset and adjust bv_len to compensate.
-		 * Print a warning for nonzero offsets, and an error
-		 * if they don't add up to a full page.  */
-		if (bvec->bv_offset || bvec->bv_len != PAGE_SIZE) {
-			if (bvec->bv_offset + bvec->bv_len != PAGE_SIZE)
-				btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
-				   "partial page write in btrfs with offset %u and length %u",
-					bvec->bv_offset, bvec->bv_len);
-			else
-				btrfs_info(BTRFS_I(page->mapping->host)->root->fs_info,
-				   "incomplete page write in btrfs with offset %u and "
-				   "length %u",
-					bvec->bv_offset, bvec->bv_len);
-		}
+		pg_private = NULL;
+		flags = 0;
+		clear_writeback = 1;
+
+		start = page_offset(page) + bvec->bv_offset;
+		end = start + bvec->bv_len - 1;
 
-		start = page_offset(page);
-		end = start + bvec->bv_offset + bvec->bv_len - 1;
+		if (root->sectorsize < PAGE_SIZE) {
+			pg_private = (struct btrfs_page_private *)page->private;
+			spin_lock_irqsave(&pg_private->io_lock, flags);
+		}
 
 		end_extent_writepage(page, bio->bi_error, start, end);
-		end_page_writeback(page);
+
+		if (root->sectorsize < PAGE_SIZE) {
+			clear_page_blks_state(page, 1 << BLK_STATE_IO, start,
+					end);
+			clear_writeback = page_io_complete(page);
+			spin_unlock_irqrestore(&pg_private->io_lock, flags);
+		}
+
+		if (clear_writeback)
+			end_page_writeback(page);
 	}
 
 	bio_put(bio);
@@ -3465,7 +3470,6 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 	u64 block_start;
 	u64 iosize;
 	sector_t sector;
-	struct extent_state *cached_state = NULL;
 	struct extent_map *em;
 	struct block_device *bdev;
 	size_t pg_offset = 0;
@@ -3517,20 +3521,29 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 							 page_end, NULL, 1);
 			break;
 		}
-		em = epd->get_extent(inode, page, pg_offset, cur,
-				     end - cur + 1, 1);
+
+		if (blocksize < PAGE_SIZE
+			&& !test_page_blks_state(page, BLK_STATE_DIRTY, cur,
+						cur + blocksize - 1, 1)) {
+			cur += blocksize;
+			continue;
+		}
+
+		pg_offset = cur & (PAGE_SIZE - 1);
+
+		em = epd->get_extent(inode, page, pg_offset, cur, blocksize, 1);
 		if (IS_ERR_OR_NULL(em)) {
 			SetPageError(page);
 			ret = PTR_ERR_OR_ZERO(em);
 			break;
 		}
 
-		extent_offset = cur - em->start;
 		em_end = extent_map_end(em);
 		BUG_ON(em_end <= cur);
 		BUG_ON(end < cur);
-		iosize = min(em_end - cur, end - cur + 1);
-		iosize = ALIGN(iosize, blocksize);
+
+		iosize = blocksize;
+		extent_offset = cur - em->start;
 		sector = (em->block_start + extent_offset) >> 9;
 		bdev = em->bdev;
 		block_start = em->block_start;
@@ -3538,36 +3551,32 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 		free_extent_map(em);
 		em = NULL;
 
-		/*
-		 * compressed and inline extents are written through other
-		 * paths in the FS
-		 */
-		if (compressed || block_start == EXTENT_MAP_HOLE ||
-		    block_start == EXTENT_MAP_INLINE) {
-			/*
-			 * end_io notification does not happen here for
-			 * compressed extents
-			 */
-			if (!compressed && tree->ops &&
-			    tree->ops->writepage_end_io_hook)
-				tree->ops->writepage_end_io_hook(page, cur,
-							 cur + iosize - 1,
-							 NULL, 1);
-			else if (compressed) {
-				/* we don't want to end_page_writeback on
-				 * a compressed extent.  this happens
-				 * elsewhere
-				 */
-				nr++;
-			}
+		ASSERT(!compressed);
+		ASSERT(block_start != EXTENT_MAP_INLINE);
 
+		if (block_start == EXTENT_MAP_HOLE) {
+			if (blocksize < PAGE_SIZE) {
+				if (test_page_blks_state(page, BLK_STATE_UPTODATE,
+								cur, cur + iosize - 1,
+								1)) {
+					clear_page_blks_state(page,
+							1 << BLK_STATE_DIRTY, cur,
+							cur + iosize - 1);
+				} else {
+					ASSERT(0);
+				}
+			} else if (!PageUptodate(page)) {
+				ASSERT(0);
+			}
 			cur += iosize;
-			pg_offset += iosize;
 			continue;
 		}
 
 		max_nr = (i_size >> PAGE_SHIFT) + 1;
 
+		clear_page_blks_state(page,
+				1 << BLK_STATE_DIRTY, cur, cur + iosize - 1);
+
 		set_range_writeback(tree, cur, cur + iosize - 1);
 		if (!PageWriteback(page)) {
 			btrfs_err(BTRFS_I(inode)->root->fs_info,
@@ -3575,6 +3584,9 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 			       page->index, cur, end);
 		}
 
+		set_page_blks_state(page, 1 << BLK_STATE_IO, cur,
+				cur + iosize - 1);
+
 		ret = submit_extent_page(REQ_OP_WRITE, write_flags, tree, wbc,
 					 page, sector, iosize, pg_offset,
 					 bdev, &epd->bio, max_nr,
@@ -3583,17 +3595,13 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 		if (ret)
 			SetPageError(page);
 
-		cur = cur + iosize;
-		pg_offset += iosize;
+		cur += iosize;
 		nr++;
 	}
 done:
 	*nr_ret = nr;
 
 done_unlocked:
-
-	/* drop our reference on any cached states */
-	free_extent_state(cached_state);
 	return ret;
 }
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 85bf035..54602e6 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -495,6 +495,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 	u64 num_bytes;
 	u64 start_pos;
 	u64 end_of_last_block;
+	u64 start;
+	u64 end;
+	u64 page_end;
 	u64 end_pos = pos + write_bytes;
 	loff_t isize = i_size_read(inode);
 
@@ -507,11 +510,24 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 	if (err)
 		return err;
 
+	start = start_pos;
+
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = pages[i];
 		SetPageUptodate(p);
 		ClearPageChecked(p);
+
+		end = page_end = page_offset(p) + PAGE_SIZE - 1;
+
+		if (i == num_pages - 1)
+			end = min_t(u64, page_end, end_of_last_block);
+
+		set_page_blks_state(p,
+				1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
+				start, end);
 		set_page_dirty(p);
+
+		start = page_end + 1;
 	}
 
 	/*
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 10dcb44..42f844b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -212,6 +212,9 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 		page = find_get_page(inode->i_mapping,
 				     start >> PAGE_SHIFT);
 		btrfs_set_file_extent_compression(leaf, ei, 0);
+		clear_page_blks_state(page, 1 << BLK_STATE_DIRTY, start,
+				round_up(start + size - 1, root->sectorsize)
+				- 1);
 		kaddr = kmap_atomic(page);
 		offset = start & (PAGE_SIZE - 1);
 		write_extent_buffer(leaf, kaddr + offset, ptr, size);
@@ -2018,6 +2021,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 	struct btrfs_writepage_fixup *fixup;
 	struct btrfs_ordered_extent *ordered;
 	struct extent_state *cached_state = NULL;
+	struct btrfs_root *root;
 	struct page *page;
 	struct inode *inode;
 	u64 page_start;
@@ -2034,6 +2038,7 @@ again:
 	}
 
 	inode = page->mapping->host;
+	root = BTRFS_I(inode)->root;
 	page_start = page_offset(page);
 	page_end = page_offset(page) + PAGE_SIZE - 1;
 
@@ -2065,6 +2070,11 @@ again:
 	 }
 
 	btrfs_set_extent_delalloc(inode, page_start, page_end, &cached_state);
+
+	set_page_blks_state(page,
+			1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
+			page_start, page_end);
+
 	ClearPageChecked(page);
 	set_page_dirty(page);
 out:
@@ -3066,26 +3076,48 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
 	struct btrfs_ordered_extent *ordered_extent = NULL;
 	struct btrfs_workqueue *wq;
 	btrfs_work_func_t func;
+	u64 ordered_start, ordered_end;
+	int done;
 
 	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
 
 	ClearPagePrivate2(page);
-	if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
-					    end - start + 1, uptodate))
-		return 0;
+loop:
+	ordered_extent = btrfs_lookup_ordered_range(inode, start,
+						end - start + 1);
+	if (!ordered_extent)
+		goto out;
 
-	if (btrfs_is_free_space_inode(inode)) {
-		wq = root->fs_info->endio_freespace_worker;
-		func = btrfs_freespace_write_helper;
-	} else {
-		wq = root->fs_info->endio_write_workers;
-		func = btrfs_endio_write_helper;
+	ordered_start = max_t(u64, start, ordered_extent->file_offset);
+	ordered_end = min_t(u64, end,
+			ordered_extent->file_offset + ordered_extent->len - 1);
+
+	done = btrfs_dec_test_ordered_pending(inode, &ordered_extent,
+					ordered_start,
+					ordered_end - ordered_start + 1,
+					uptodate);
+	if (done) {
+		if (btrfs_is_free_space_inode(inode)) {
+			wq = root->fs_info->endio_freespace_worker;
+			func = btrfs_freespace_write_helper;
+		} else {
+			wq = root->fs_info->endio_write_workers;
+			func = btrfs_endio_write_helper;
+		}
+
+		btrfs_init_work(&ordered_extent->work, func,
+				finish_ordered_fn, NULL, NULL);
+		btrfs_queue_work(wq, &ordered_extent->work);
 	}
 
-	btrfs_init_work(&ordered_extent->work, func, finish_ordered_fn, NULL,
-			NULL);
-	btrfs_queue_work(wq, &ordered_extent->work);
+	btrfs_put_ordered_extent(ordered_extent);
+
+	start = ordered_end + 1;
 
+	if (start < end)
+		goto loop;
+
+out:
 	return 0;
 }
 
@@ -4752,6 +4784,10 @@ again:
 		goto out_unlock;
 	}
 
+	set_page_blks_state(page,
+			1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
+			block_start, block_end);
+
 	if (offset != blocksize) {
 		if (!len)
 			len = blocksize - offset;
@@ -8910,6 +8946,10 @@ again:
 	 *    This means the reserved space should be freed here.
 	 */
 	btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE);
+
+	clear_page_blks_state(page, 1 << BLK_STATE_DIRTY, page_start,
+			page_end);
+
 	if (!inode_evicting) {
 		clear_extent_bit(tree, page_start, page_end,
 				 EXTENT_LOCKED | EXTENT_DIRTY |
@@ -9053,6 +9093,11 @@ again:
 		ret = VM_FAULT_SIGBUS;
 		goto out_unlock;
 	}
+
+	set_page_blks_state(page,
+			1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
+			page_start, end);
+
 	ret = 0;
 
 	/* page is wholly or partially inside EOF */
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 62dfc2c..f724fb5 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3198,6 +3198,9 @@ static int relocate_file_extent_cluster(struct inode *inode,
 		}
 
 		btrfs_set_extent_delalloc(inode, page_start, page_end, NULL);
+		set_page_blks_state(page,
+				1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
+				page_start, page_end);
 		set_page_dirty(page);
 
 		unlock_extent(&BTRFS_I(inode)->io_tree,
-- 
2.5.5



* [PATCH V21 07/19] Btrfs: subpage-blocksize: Use kmalloc()-ed memory to hold metadata blocks
  2016-10-02 13:24 [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (5 preceding siblings ...)
  2016-10-02 13:24 ` [PATCH V21 06/19] Btrfs: subpage-blocksize: Fix whole page write Chandan Rajendra
@ 2016-10-02 13:24 ` Chandan Rajendra
  2016-10-02 13:24 ` [PATCH V21 08/19] Btrfs: subpage-blocksize: Execute sanity tests on all possible block sizes Chandan Rajendra
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

For subpage-blocksizes this commit uses kmalloc()-ed memory to hold
metadata blocks.

When reading/writing metadata blocks, we now track the first extent
buffer using bio->bi_private, since with kmalloc()-ed memory we cannot
use page->private. Hence, when writing dirty extent buffers in the
subpage-blocksize scenario, this commit forces each bio to contain a
single extent buffer. For the non-subpage-blocksize scenario we continue
to track the corresponding extent buffer using page->private, and hence
a single write bio can still carry more than one dirty extent buffer.
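
A kmalloc()-ed buffer is only usable here if it does not straddle a page
boundary; otherwise a single (page, pg_offset) pair cannot describe it
and the allocation falls back to alloc_page(). A minimal userspace
sketch of that check, with made-up addresses for illustration:

  #include <stddef.h>
  #include <stdint.h>
  #include <stdio.h>

  #define PAGE_SIZE 4096UL
  #define PAGE_MASK (~(PAGE_SIZE - 1))

  /* the same condition the patch applies to eb->addr */
  static int spans_two_pages(uintptr_t addr, size_t len)
  {
          return ((addr + len - 1) & PAGE_MASK) != (addr & PAGE_MASK);
  }

  int main(void)
  {
          /* 2K buffer starting 1K into a page: one (page, offset) pair works */
          printf("%d\n", spans_two_pages(0x10000400UL, 2048)); /* 0 */
          /* 2K buffer starting 3K into a page: must fall back to alloc_page() */
          printf("%d\n", spans_two_pages(0x10000c00UL, 2048)); /* 1 */
          return 0;
  }

When the check passes, eb->pages[0] = virt_to_page(eb->addr) and
eb->pg_offset = offset_in_page(eb->addr) are enough to describe the
buffer for bio submission, which is why the I/O paths below can hand the
extent buffer around via bio->bi_private instead of page->private.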

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/ctree.h                 |   6 +-
 fs/btrfs/disk-io.c               |  27 +++---
 fs/btrfs/extent_io.c             | 204 +++++++++++++++++++++++++--------------
 fs/btrfs/extent_io.h             |   8 +-
 fs/btrfs/tests/extent-io-tests.c |   4 +-
 5 files changed, 158 insertions(+), 91 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b9ee7cf..745284c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1491,14 +1491,16 @@ static inline void btrfs_set_token_##name(struct extent_buffer *eb,	\
 #define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits)		\
 static inline u##bits btrfs_##name(struct extent_buffer *eb)		\
 {									\
-	type *p = page_address(eb->pages[0]);				\
+	type *p = (type *)((u8 *)page_address(eb->pages[0])		\
+			+ eb->pg_offset);				\
 	u##bits res = le##bits##_to_cpu(p->member);			\
 	return res;							\
 }									\
 static inline void btrfs_set_##name(struct extent_buffer *eb,		\
 				    u##bits val)			\
 {									\
-	type *p = page_address(eb->pages[0]);				\
+	type *p = (type *)((u8 *)page_address(eb->pages[0])		\
+			+ eb->pg_offset);				\
 	p->member = cpu_to_le##bits(val);				\
 }
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9ff48a7..5663481 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -448,13 +448,10 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
  * we only fill in the checksum field in the first page of a multi-page block
  */
 
-static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
+static int csum_dirty_buffer(struct btrfs_fs_info *fs_info,
+			struct extent_buffer *eb)
 {
-	struct extent_buffer *eb;
 
-	eb = (struct extent_buffer *)page->private;
-	if (page != eb->pages[0])
-		return 0;
 	ASSERT(memcmp_extent_buffer(eb, fs_info->fsid,
 			btrfs_header_fsid(), BTRFS_FSID_SIZE) == 0);
 
@@ -557,11 +554,10 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	int ret = 0;
 	int reads_done;
 
-	if (!page->private)
+	eb = (io_bio->bio).bi_private;
+	if (!eb)
 		goto out;
 
-	eb = (struct extent_buffer *)page->private;
-
 	/* the pending IO might have been the only thing that kept this buffer
 	 * in memory.  Make sure we have a ref for all this other checks
 	 */
@@ -646,11 +642,11 @@ out:
 	return ret;
 }
 
-static int btree_io_failed_hook(struct page *page, int failed_mirror)
+static int btree_io_failed_hook(struct page *page, void *private,
+				int failed_mirror)
 {
-	struct extent_buffer *eb;
+	struct extent_buffer *eb = private;
 
-	eb = (struct extent_buffer *)page->private;
 	set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = failed_mirror;
 	atomic_dec(&eb->io_pages);
@@ -829,11 +825,18 @@ int btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 
 static int btree_csum_one_bio(struct btrfs_fs_info *fs_info, struct bio *bio)
 {
+	struct extent_buffer *eb = bio->bi_private;
 	struct bio_vec *bvec;
 	int i, ret = 0;
 
 	bio_for_each_segment_all(bvec, bio, i) {
-		ret = csum_dirty_buffer(fs_info, bvec->bv_page);
+		if (eb->len >= PAGE_SIZE)
+			eb = (struct extent_buffer *)(bvec->bv_page->private);
+
+		if (bvec->bv_page != eb->pages[0])
+			continue;
+
+		ret = csum_dirty_buffer(fs_info, eb);
 		if (ret)
 			break;
 	}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6cac61f..8ace367 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2817,18 +2817,17 @@ struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
 }
 
 
-static int __must_check submit_one_bio(struct bio *bio, int mirror_num,
-				       unsigned long bio_flags)
+static int __must_check submit_one_bio(struct bio *bio,
+				struct extent_io_tree *tree, int mirror_num,
+				unsigned long bio_flags)
 {
 	int ret = 0;
 	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
 	struct page *page = bvec->bv_page;
-	struct extent_io_tree *tree = bio->bi_private;
 	u64 start;
 
 	start = page_offset(page) + bvec->bv_offset;
 
-	bio->bi_private = NULL;
 	bio_get(bio);
 
 	if (tree->ops && tree->ops->submit_bio_hook)
@@ -2864,7 +2863,8 @@ static int submit_extent_page(int op, int op_flags, struct extent_io_tree *tree,
 			      int mirror_num,
 			      unsigned long prev_bio_flags,
 			      unsigned long bio_flags,
-			      bool force_bio_submit)
+			      bool force_bio_submit,
+			      void *private)
 {
 	int ret = 0;
 	struct bio *bio;
@@ -2883,7 +2883,8 @@ static int submit_extent_page(int op, int op_flags, struct extent_io_tree *tree,
 		    force_bio_submit ||
 		    merge_bio(tree, page, offset, page_size, bio, bio_flags) ||
 		    bio_add_page(bio, page, page_size, offset) < page_size) {
-			ret = submit_one_bio(bio, mirror_num, prev_bio_flags);
+			ret = submit_one_bio(bio, tree, mirror_num,
+					prev_bio_flags);
 			if (ret < 0) {
 				*bio_ret = NULL;
 				return ret;
@@ -2903,7 +2904,7 @@ static int submit_extent_page(int op, int op_flags, struct extent_io_tree *tree,
 
 	bio_add_page(bio, page, page_size, offset);
 	bio->bi_end_io = end_io_func;
-	bio->bi_private = tree;
+	bio->bi_private = private;
 	bio_set_op_attrs(bio, op, op_flags);
 	if (wbc) {
 		wbc_init_bio(wbc, bio);
@@ -2913,7 +2914,7 @@ static int submit_extent_page(int op, int op_flags, struct extent_io_tree *tree,
 	if (bio_ret)
 		*bio_ret = bio;
 	else
-		ret = submit_one_bio(bio, mirror_num, bio_flags);
+		ret = submit_one_bio(bio, tree, mirror_num, bio_flags);
 
 	return ret;
 }
@@ -3211,7 +3212,7 @@ static int __do_readpage(struct extent_io_tree *tree,
 					 end_bio_extent_readpage, mirror_num,
 					 *bio_flags,
 					 this_bio_flag,
-					 force_bio_submit);
+					 force_bio_submit, NULL);
 		if (!ret) {
 			nr++;
 			*bio_flags = this_bio_flag;
@@ -3346,7 +3347,7 @@ int extent_read_full_page(struct extent_io_tree *tree, struct page *page,
 	ret = __extent_read_full_page(tree, page, get_extent, &bio, mirror_num,
 				      &bio_flags, 0);
 	if (bio)
-		ret = submit_one_bio(bio, mirror_num, bio_flags);
+		ret = submit_one_bio(bio, tree, mirror_num, bio_flags);
 	return ret;
 }
 
@@ -3591,7 +3592,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 					 page, sector, iosize, pg_offset,
 					 bdev, &epd->bio, max_nr,
 					 end_bio_extent_writepage,
-					 0, 0, 0, false);
+					 0, 0, 0, false, NULL);
 		if (ret)
 			SetPageError(page);
 
@@ -3774,9 +3775,8 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
 	}
 }
 
-static void set_btree_ioerr(struct page *page)
+static void set_btree_ioerr(struct extent_buffer *eb)
 {
-	struct extent_buffer *eb = (struct extent_buffer *)page->private;
 	struct btrfs_fs_info *fs_info = eb->eb_info->fs_info;
 
 	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
@@ -3837,19 +3837,23 @@ static void set_btree_ioerr(struct page *page)
 static void end_bio_extent_buffer_writepage(struct bio *bio)
 {
 	struct bio_vec *bvec;
-	struct extent_buffer *eb;
+	struct extent_buffer *eb = bio->bi_private;
+	u32 nodesize = eb->len;
 	int i, done;
 
 	bio_for_each_segment_all(bvec, bio, i) {
 		struct page *page = bvec->bv_page;
 
-		eb = (struct extent_buffer *)page->private;
-		BUG_ON(!eb);
+		if (nodesize >= PAGE_SIZE) {
+			eb = (struct extent_buffer *)page->private;
+			BUG_ON(!eb);
+		}
+
 		done = atomic_dec_and_test(&eb->io_pages);
 
 		if (bio->bi_error ||
 		    test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
-			set_btree_ioerr(page);
+			set_btree_ioerr(eb);
 
 		account_metadata_end_writeback(page,
 					       &eb->eb_info->fs_info->bdi);
@@ -3871,6 +3875,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	u64 offset = eb->start;
 	unsigned long i, num_pages;
 	unsigned long bio_flags = 0;
+	size_t len;
 	int write_flags = (epd->sync_io ? WRITE_SYNC : 0) | REQ_META;
 	int ret = 0;
 
@@ -3880,27 +3885,33 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
 		bio_flags = EXTENT_BIO_TREE_LOG;
 
+	len = min_t(size_t, eb->len, PAGE_SIZE);
+
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = eb->pages[i];
 
 		ret = submit_extent_page(REQ_OP_WRITE, write_flags, tree, wbc,
-					 p, offset >> 9, PAGE_SIZE, 0, bdev,
-					 &epd->bio, -1,
+					 p, offset >> 9, len, eb->pg_offset,
+					 bdev, &epd->bio, -1,
 					 end_bio_extent_buffer_writepage,
-					 0, epd->bio_flags, bio_flags, false);
+					 0, epd->bio_flags, bio_flags, false,
+					 eb);
 		epd->bio_flags = bio_flags;
 		if (ret) {
-			set_btree_ioerr(p);
+			set_btree_ioerr(eb);
 			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
 				end_extent_buffer_writeback(eb);
 			ret = -EIO;
 			break;
 		}
 		account_metadata_writeback(p, &fs_info->bdi);
-		offset += PAGE_SIZE;
+		offset += len;
 		update_nr_written(p, wbc, 1);
 	}
 
+	if (!ret && len < PAGE_SIZE)
+		flush_write_bio(epd);
+
 	return ret;
 }
 
@@ -3964,7 +3975,7 @@ repeat:
 	}
 	rcu_read_unlock();
 	if (ret)
-		*index = (ebs[ret - 1]->start >> PAGE_SHIFT) + 1;
+		*index = ebs[ret - 1]->start + 1;
 	return ret;
 }
 
@@ -3997,8 +4008,8 @@ static int btree_write_cache_pages(struct btrfs_fs_info *fs_info,
 		index = eb_info->writeback_index; /* Start from prev offset */
 		end = -1;
 	} else {
-		index = wbc->range_start >> PAGE_SHIFT;
-		end = wbc->range_end >> PAGE_SHIFT;
+		index = wbc->range_start;
+		end = wbc->range_end;
 		scanned = 1;
 	}
 	if (wbc->sync_mode == WB_SYNC_ALL)
@@ -4097,19 +4108,18 @@ int btree_write_range(struct btrfs_fs_info *fs_info, u64 start, u64 end)
 int btree_wait_range(struct btrfs_fs_info *fs_info, u64 start, u64 end)
 {
 	struct extent_buffer *ebs[EBVEC_SIZE];
-	pgoff_t index = start >> PAGE_SHIFT;
-	pgoff_t end_index = end >> PAGE_SHIFT;
 	unsigned nr_ebs;
 	int ret = 0;
 
 	if (end < start)
 		return ret;
 
-	while ((index <= end) &&
-	       (nr_ebs = eb_lookup_tag(fs_info->eb_info, ebs, &index,
+	while ((start <= end) &&
+		(nr_ebs = eb_lookup_tag(fs_info->eb_info, ebs,
+				       (pgoff_t *)&start,
 				       PAGECACHE_TAG_WRITEBACK,
-				       min(end_index - index,
-					   (pgoff_t)EBVEC_SIZE-1) + 1)) != 0) {
+				       min_t(u64, end - start,
+					     EBVEC_SIZE-1) + 1)) != 0) {
 		unsigned i;
 
 		for (i = 0; i < nr_ebs; i++) {
@@ -4296,7 +4306,7 @@ static void flush_epd_write_bio(struct extent_page_data *epd)
 		bio_set_op_attrs(epd->bio, REQ_OP_WRITE,
 				 epd->sync_io ? WRITE_SYNC : 0);
 
-		ret = submit_one_bio(epd->bio, 0, epd->bio_flags);
+		ret = submit_one_bio(epd->bio, epd->tree, 0, epd->bio_flags);
 		BUG_ON(ret < 0); /* -ENOMEM */
 		epd->bio = NULL;
 	}
@@ -4436,7 +4446,7 @@ int extent_readpages(struct extent_io_tree *tree,
 
 	BUG_ON(!list_empty(pages));
 	if (bio)
-		return submit_one_bio(bio, 0, bio_flags);
+		return submit_one_bio(bio, tree, 0, bio_flags);
 	return 0;
 }
 
@@ -4818,6 +4828,12 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
 		return;
 
 	ASSERT(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
+
+	if (test_bit(EXTENT_BUFFER_MEM, &eb->bflags)) {
+		kfree(eb->addr);
+		return;
+	}
+
 	do {
 		index--;
 		page = eb->pages[index];
@@ -4925,12 +4941,35 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_eb_info *eb_info,
 	if (!eb)
 		return NULL;
 
+	if (len < PAGE_SIZE) {
+		eb->addr = kmalloc(len, GFP_NOFS);
+		if (!eb->addr)
+			goto err;
+
+		if (((unsigned long)(eb->addr + len - 1) & PAGE_MASK) !=
+		    ((unsigned long)eb->addr & PAGE_MASK)) {
+			/* eb->addr spans two pages - use alloc_page instead */
+			kfree(eb->addr);
+			eb->addr = NULL;
+			goto use_alloc_page;
+		}
+
+		set_bit(EXTENT_BUFFER_MEM, &eb->bflags);
+		eb->pg_offset = offset_in_page(eb->addr);
+		eb->pages[0] = virt_to_page(eb->addr);
+		goto init_eb;
+	}
+
+use_alloc_page:
+
 	for (i = 0; i < num_pages; i++) {
 		eb->pages[i] = alloc_page(GFP_NOFS);
 		if (!eb->pages[i])
 			goto err;
 		attach_extent_buffer_page(eb, eb->pages[i]);
 	}
+
+init_eb:
 	set_extent_buffer_uptodate(eb);
 	btrfs_set_header_nritems(eb, 0);
 	set_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
@@ -4996,8 +5035,7 @@ struct extent_buffer *find_extent_buffer(struct btrfs_eb_info *eb_info,
 	struct extent_buffer *eb;
 
 	rcu_read_lock();
-	eb = radix_tree_lookup(&eb_info->buffer_radix,
-			       start >> PAGE_SHIFT);
+	eb = radix_tree_lookup(&eb_info->buffer_radix, start);
 	if (eb && atomic_inc_not_zero(&eb->refs)) {
 		rcu_read_unlock();
 		/*
@@ -5046,8 +5084,7 @@ again:
 	if (ret)
 		goto free_eb;
 	spin_lock_irq(&eb_info->buffer_lock);
-	ret = radix_tree_insert(&eb_info->buffer_radix,
-				start >> PAGE_SHIFT, eb);
+	ret = radix_tree_insert(&eb_info->buffer_radix, start, eb);
 	spin_unlock_irq(&eb_info->buffer_lock);
 	radix_tree_preload_end();
 	if (ret == -EEXIST) {
@@ -5102,6 +5139,29 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	if (!eb)
 		return ERR_PTR(-ENOMEM);
 
+	if (len < PAGE_SIZE) {
+		eb->addr = kmalloc(len, GFP_NOFS);
+		if (!eb->addr) {
+			exists = ERR_PTR(-ENOMEM);
+			goto free_eb;
+		}
+
+		if (((unsigned long)(eb->addr + len - 1) & PAGE_MASK) !=
+		    ((unsigned long)eb->addr & PAGE_MASK)) {
+			/* eb->addr spans two pages - use alloc_page instead */
+			kfree(eb->addr);
+			eb->addr = NULL;
+			goto use_alloc_page;
+		}
+
+		set_bit(EXTENT_BUFFER_MEM, &eb->bflags);
+		eb->pg_offset = offset_in_page(eb->addr);
+		eb->pages[0] = virt_to_page(eb->addr);
+		goto insert_into_tree;
+	}
+
+use_alloc_page:
+
 	for (i = 0; i < num_pages; i++) {
 		p = alloc_page(GFP_NOFS|__GFP_NOFAIL);
 		if (!p) {
@@ -5124,7 +5184,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		attach_extent_buffer_page(eb, p);
 		eb->pages[i] = p;
 	}
-again:
+insert_into_tree:
 	ret = radix_tree_preload(GFP_NOFS);
 	if (ret) {
 		exists = ERR_PTR(ret);
@@ -5132,8 +5192,7 @@ again:
 	}
 
 	spin_lock_irq(&eb_info->buffer_lock);
-	ret = radix_tree_insert(&eb_info->buffer_radix,
-				start >> PAGE_SHIFT, eb);
+	ret = radix_tree_insert(&eb_info->buffer_radix, start, eb);
 	spin_unlock_irq(&eb_info->buffer_lock);
 	radix_tree_preload_end();
 	if (ret == -EEXIST) {
@@ -5141,7 +5200,7 @@ again:
 		if (exists)
 			goto free_eb;
 		else
-			goto again;
+			goto insert_into_tree;
 	}
 	/* add one reference for the tree */
 	check_buffer_tree_ref(eb);
@@ -5412,7 +5471,9 @@ int extent_buffer_uptodate(struct extent_buffer *eb)
 static void end_bio_extent_buffer_readpage(struct bio *bio)
 {
 	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
-	struct extent_io_tree *tree = NULL;
+	struct extent_buffer *eb = bio->bi_private;
+	struct btrfs_eb_info *eb_info = eb->eb_info;
+	struct extent_io_tree *tree = &eb_info->io_tree;
 	struct bio_vec *bvec;
 	u64 unlock_start = 0, unlock_len = 0;
 	int mirror_num = io_bio->mirror_num;
@@ -5421,16 +5482,7 @@ static void end_bio_extent_buffer_readpage(struct bio *bio)
 
 	bio_for_each_segment_all(bvec, bio, i) {
 		struct page *page = bvec->bv_page;
-		struct btrfs_eb_info *eb_info;
-		struct extent_buffer *eb;
-
-		eb = (struct extent_buffer *)page->private;
-		if (WARN_ON(!eb))
-			continue;
 
-		eb_info = eb->eb_info;
-		if (!tree)
-			tree = &eb_info->io_tree;
 		if (uptodate) {
 			/*
 			 * btree_readpage_end_io_hook doesn't care about
@@ -5454,7 +5506,8 @@ static void end_bio_extent_buffer_readpage(struct bio *bio)
 				}
 				clean_io_failure(eb_info->fs_info,
 						 &eb_info->io_failure_tree,
-						 tree, start, page, 0, 0);
+						 tree, start, page, 0,
+						 eb->pg_offset);
 			}
 		}
 		/*
@@ -5464,11 +5517,12 @@ static void end_bio_extent_buffer_readpage(struct bio *bio)
 		 * anything.
 		 */
 		if (!uptodate)
-			tree->ops->readpage_io_failed_hook(page, mirror_num);
+			tree->ops->readpage_io_failed_hook(page, eb,
+							mirror_num);
 
 		if (unlock_start == 0) {
 			unlock_start = eb->start;
-			unlock_len = PAGE_SIZE;
+			unlock_len = min(eb->len, PAGE_SIZE);
 		} else {
 			unlock_len += PAGE_SIZE;
 		}
@@ -5493,6 +5547,7 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait,
 	u64 unlock_start = 0, unlock_len = 0;
 	unsigned long i;
 	struct page *page;
+	size_t len;
 	int err;
 	int ret = 0;
 	unsigned long num_pages;
@@ -5515,10 +5570,13 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait,
 	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = 0;
 	atomic_set(&eb->io_pages, num_pages);
+
+	len = min_t(size_t, eb->len, PAGE_SIZE);
+
 	for (i = 0; i < num_pages; i++) {
 		page = eb->pages[i];
 		if (ret) {
-			unlock_len += PAGE_SIZE;
+			unlock_len += len;
 			if (atomic_dec_and_test(&eb->io_pages)) {
 				clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
 				smp_mb__after_atomic();
@@ -5528,10 +5586,10 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait,
 		}
 
 		err = submit_extent_page(REQ_OP_READ, REQ_META, io_tree, NULL,
-					 page, offset >> 9, PAGE_SIZE, 0, bdev,
-					 &bio, -1,
+					 page, offset >> 9, len, eb->pg_offset,
+					 bdev, &bio, -1,
 					 end_bio_extent_buffer_readpage,
-					 mirror_num, 0, 0, false);
+					 mirror_num, 0, 0, false, eb);
 		if (err) {
 			ret = err;
 			/*
@@ -5548,13 +5606,13 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait,
 				wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
 			}
 			unlock_start = offset;
-			unlock_len = PAGE_SIZE;
+			unlock_len = len;
 		}
-		offset += PAGE_SIZE;
+		offset += len;
 	}
 
 	if (bio) {
-		err = submit_one_bio(bio, mirror_num, 0);
+		err = submit_one_bio(bio, io_tree, mirror_num, 0);
 		if (err)
 			return err;
 	}
@@ -5581,7 +5639,7 @@ void read_extent_buffer(struct extent_buffer *eb, void *dstv,
 	struct page *page;
 	char *kaddr;
 	char *dst = (char *)dstv;
-	size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
+	size_t start_offset = eb->pg_offset;
 	unsigned long i = (start_offset + start) >> PAGE_SHIFT;
 
 	WARN_ON(start > eb->len);
@@ -5612,7 +5670,7 @@ int read_extent_buffer_to_user(struct extent_buffer *eb, void __user *dstv,
 	struct page *page;
 	char *kaddr;
 	char __user *dst = (char __user *)dstv;
-	size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
+	size_t start_offset = eb->pg_offset;
 	unsigned long i = (start_offset + start) >> PAGE_SHIFT;
 	int ret = 0;
 
@@ -5650,10 +5708,10 @@ int map_private_extent_buffer(struct extent_buffer *eb, unsigned long start,
 			       unsigned long *map_start,
 			       unsigned long *map_len)
 {
-	size_t offset = start & (PAGE_SIZE - 1);
+	size_t offset;
 	char *kaddr;
 	struct page *p;
-	size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
+	size_t start_offset = eb->pg_offset;
 	unsigned long i = (start_offset + start) >> PAGE_SHIFT;
 	unsigned long end_i = (start_offset + start + min_len - 1) >>
 		PAGE_SHIFT;
@@ -5679,7 +5737,7 @@ int map_private_extent_buffer(struct extent_buffer *eb, unsigned long start,
 	p = eb->pages[i];
 	kaddr = page_address(p);
 	*map = kaddr + offset;
-	*map_len = PAGE_SIZE - offset;
+	*map_len = (eb->len >= PAGE_SIZE) ? PAGE_SIZE - offset : eb->len;
 	return 0;
 }
 
@@ -5692,7 +5750,7 @@ int memcmp_extent_buffer(struct extent_buffer *eb, const void *ptrv,
 	struct page *page;
 	char *kaddr;
 	char *ptr = (char *)ptrv;
-	size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
+	size_t start_offset = eb->pg_offset;
 	unsigned long i = (start_offset + start) >> PAGE_SHIFT;
 	int ret = 0;
 
@@ -5727,7 +5785,7 @@ void write_extent_buffer(struct extent_buffer *eb, const void *srcv,
 	struct page *page;
 	char *kaddr;
 	char *src = (char *)srcv;
-	size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
+	size_t start_offset = eb->pg_offset;
 	unsigned long i = (start_offset + start) >> PAGE_SHIFT;
 
 	WARN_ON(start > eb->len);
@@ -5756,7 +5814,7 @@ void memset_extent_buffer(struct extent_buffer *eb, char c,
 	size_t offset;
 	struct page *page;
 	char *kaddr;
-	size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
+	size_t start_offset = eb->pg_offset;
 	unsigned long i = (start_offset + start) >> PAGE_SHIFT;
 
 	WARN_ON(start > eb->len);
@@ -5786,7 +5844,7 @@ void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src,
 	size_t offset;
 	struct page *page;
 	char *kaddr;
-	size_t start_offset = dst->start & ((u64)PAGE_SIZE - 1);
+	size_t start_offset = dst->pg_offset;
 	unsigned long i = (start_offset + dst_offset) >> PAGE_SHIFT;
 
 	WARN_ON(src->len != dst_len);
@@ -5839,7 +5897,7 @@ static inline void eb_bitmap_offset(struct extent_buffer *eb,
 				    unsigned long *page_index,
 				    size_t *page_offset)
 {
-	size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
+	size_t start_offset = eb->pg_offset;
 	size_t byte_offset = BIT_BYTE(nr);
 	size_t offset;
 
@@ -5987,7 +6045,7 @@ void memcpy_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 	size_t cur;
 	size_t dst_off_in_page;
 	size_t src_off_in_page;
-	size_t start_offset = dst->start & ((u64)PAGE_SIZE - 1);
+	size_t start_offset = dst->pg_offset;
 	unsigned long dst_i;
 	unsigned long src_i;
 
@@ -6035,7 +6093,7 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 	size_t src_off_in_page;
 	unsigned long dst_end = dst_offset + len - 1;
 	unsigned long src_end = src_offset + len - 1;
-	size_t start_offset = dst->start & ((u64)PAGE_SIZE - 1);
+	size_t start_offset = dst->pg_offset;
 	unsigned long dst_i;
 	unsigned long src_i;
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index e7a0462..6a02343 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -45,6 +45,7 @@
 #define EXTENT_BUFFER_WRITE_ERR 11    /* write IO error */
 #define EXTENT_BUFFER_MIXED_PAGES 12	/* the pages span multiple zones or numa nodes. */
 #define EXTENT_BUFFER_READING 13 /* currently reading this eb. */
+#define EXTENT_BUFFER_MEM 14
 
 /* these are flags for extent_clear_unlock_delalloc */
 #define PAGE_UNLOCK		(1 << 0)
@@ -138,7 +139,8 @@ struct extent_io_ops {
 	int (*merge_bio_hook)(struct page *page, unsigned long offset,
 			      size_t size, struct bio *bio,
 			      unsigned long bio_flags);
-	int (*readpage_io_failed_hook)(struct page *page, int failed_mirror);
+	int (*readpage_io_failed_hook)(struct page *page, void *private,
+				int failed_mirror);
 	int (*readpage_end_io_hook)(struct btrfs_io_bio *io_bio, u64 phy_offset,
 				    struct page *page, u64 start, u64 end,
 				    int mirror);
@@ -234,6 +236,8 @@ struct extent_buffer {
 	 */
 	wait_queue_head_t read_lock_wq;
 	struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
+	void *addr;
+	unsigned int pg_offset;
 #ifdef CONFIG_BTRFS_DEBUG
 	struct list_head leak_list;
 #endif
@@ -454,7 +458,7 @@ static inline void extent_buffer_get(struct extent_buffer *eb)
 
 static inline unsigned long eb_index(struct extent_buffer *eb)
 {
-	return eb->start >> PAGE_SHIFT;
+	return eb->start;
 }
 
 int memcmp_extent_buffer(struct extent_buffer *eb, const void *ptrv,
diff --git a/fs/btrfs/tests/extent-io-tests.c b/fs/btrfs/tests/extent-io-tests.c
index 45524f1..b85a57e 100644
--- a/fs/btrfs/tests/extent-io-tests.c
+++ b/fs/btrfs/tests/extent-io-tests.c
@@ -379,7 +379,7 @@ static int test_eb_bitmaps(u32 sectorsize, u32 nodesize)
 	 * In ppc64, sectorsize can be 64K, thus 4 * 64K will be larger than
 	 * BTRFS_MAX_METADATA_BLOCKSIZE.
 	 */
-	len = (sectorsize < BTRFS_MAX_METADATA_BLOCKSIZE)
+	len = ((sectorsize * 4) <= BTRFS_MAX_METADATA_BLOCKSIZE)
 		? sectorsize * 4 : sectorsize;
 
 	bitmap = kmalloc(len, GFP_KERNEL);
@@ -401,7 +401,7 @@ static int test_eb_bitmaps(u32 sectorsize, u32 nodesize)
 
 	/* Do it over again with an extent buffer which isn't page-aligned. */
 	free_extent_buffer(eb);
-	eb = alloc_dummy_extent_buffer(NULL, nodesize / 2, len);
+	eb = alloc_dummy_extent_buffer(NULL, PAGE_SIZE / 2, len);
 	if (!eb) {
 		test_msg("Couldn't allocate test extent buffer\n");
 		kfree(bitmap);
-- 
2.5.5



* [PATCH V21 08/19] Btrfs: subpage-blocksize: Execute sanity tests on all possible block sizes
  2016-10-02 13:24 [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (6 preceding siblings ...)
  2016-10-02 13:24 ` [PATCH V21 07/19] Btrfs: subpage-blocksize: Use kmalloc()-ed memory to hold metadata blocks Chandan Rajendra
@ 2016-10-02 13:24 ` Chandan Rajendra
  2016-10-02 13:24 ` [PATCH V21 09/19] Btrfs: subpage-blocksize: Compute free space tree BITMAP_RANGE based on sectorsize Chandan Rajendra
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

This commit executes the sanity tests for every supported
sectorsize/nodesize combination, skipping sectorsizes larger than the
machine's page size.
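
As a rough illustration, the pairs that end up being exercised can be
enumerated with the same loop structure as the patch (a standalone
sketch; the constants mirror the kernel's values, and PAGE_SIZE is of
course machine-dependent):

  #include <stdio.h>

  #define PAGE_SIZE                    4096U
  #define BTRFS_MAX_METADATA_BLOCKSIZE 65536U

  int main(void)
  {
          unsigned int test_sectorsize[] = { 4096, 8192, 16384, 32768, 65536 };
          unsigned int sectorsize, nodesize, i;

          for (i = 0; i < sizeof(test_sectorsize) / sizeof(test_sectorsize[0]); i++) {
                  sectorsize = test_sectorsize[i];
                  if (sectorsize > PAGE_SIZE)
                          break; /* larger sectorsizes can't be tested here */
                  for (nodesize = sectorsize;
                       nodesize <= BTRFS_MAX_METADATA_BLOCKSIZE;
                       nodesize <<= 1)
                          printf("sectorsize=%u nodesize=%u\n",
                                 sectorsize, nodesize);
          }
          return 0;
  }

On x86 (4K pages) this yields only the 4K sectorsize with nodesizes 4K
through 64K; on ppc64 (64K pages) all five sectorsizes are exercised.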

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/tests/btrfs-tests.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index dca90d6..0f2afb6 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -261,13 +261,19 @@ int btrfs_run_sanity_tests(void)
 	int ret, i;
 	u32 sectorsize, nodesize;
 	u32 test_sectorsize[] = {
-		PAGE_SIZE,
+		4096,
+		8192,
+		16384,
+		32768,
+		65536,
 	};
 	ret = btrfs_init_test_fs();
 	if (ret)
 		return ret;
 	for (i = 0; i < ARRAY_SIZE(test_sectorsize); i++) {
 		sectorsize = test_sectorsize[i];
+		if (sectorsize > PAGE_SIZE)
+			break;
 		for (nodesize = sectorsize;
 		     nodesize <= BTRFS_MAX_METADATA_BLOCKSIZE;
 		     nodesize <<= 1) {
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH V21 09/19] Btrfs: subpage-blocksize: Compute free space tree BITMAP_RANGE based on sectorsize
  2016-10-02 13:24 [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (7 preceding siblings ...)
  2016-10-02 13:24 ` [PATCH V21 08/19] Btrfs: subpage-blocksize: Execute sanity tests on all possible block sizes Chandan Rajendra
@ 2016-10-02 13:24 ` Chandan Rajendra
  2016-10-02 13:24 ` [PATCH V21 10/19] Btrfs: subpage-blocksize: Allow mounting filesystems where sectorsize < PAGE_SIZE Chandan Rajendra
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

The default bitmap length computation in the free space tree sanity
tests assumes PAGE_SIZE as the sectorsize. This commit fixes that by
turning BITMAP_RANGE into a macro that takes the sectorsize as an
argument.
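
As a worked example: assuming BTRFS_FREE_SPACE_BITMAP_BITS is 2048 (256
bytes of bitmap at 8 bits per byte), BITMAP_RANGE(4096) works out to
2048 * 4096 = 8MiB for a 4K-sectorsize filesystem, whereas the old
PAGE_SIZE-based constant would have been 2048 * 65536 = 128MiB on a
64K-page ppc64 machine regardless of the filesystem's sectorsize.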

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/tests/free-space-tree-tests.c | 79 ++++++++++++++++++----------------
 1 file changed, 43 insertions(+), 36 deletions(-)

diff --git a/fs/btrfs/tests/free-space-tree-tests.c b/fs/btrfs/tests/free-space-tree-tests.c
index 3bf5df1..11d9fb0 100644
--- a/fs/btrfs/tests/free-space-tree-tests.c
+++ b/fs/btrfs/tests/free-space-tree-tests.c
@@ -31,7 +31,7 @@ struct free_space_extent {
  * The test cases align their operations to this in order to hit some of the
  * edge cases in the bitmap code.
  */
-#define BITMAP_RANGE (BTRFS_FREE_SPACE_BITMAP_BITS * PAGE_SIZE)
+#define BITMAP_RANGE(sectorsize) (BTRFS_FREE_SPACE_BITMAP_BITS * (sectorsize))
 
 static int __check_free_space_extents(struct btrfs_trans_handle *trans,
 				      struct btrfs_fs_info *fs_info,
@@ -203,14 +203,15 @@ static int test_remove_beginning(struct btrfs_trans_handle *trans,
 				 struct btrfs_block_group_cache *cache,
 				 struct btrfs_path *path)
 {
+	u64 bitmap_range = BITMAP_RANGE(fs_info->tree_root->sectorsize);
 	struct free_space_extent extents[] = {
-		{cache->key.objectid + BITMAP_RANGE,
-			cache->key.offset - BITMAP_RANGE},
+		{cache->key.objectid + bitmap_range,
+			cache->key.offset - bitmap_range},
 	};
 	int ret;
 
 	ret = __remove_from_free_space_tree(trans, fs_info, cache, path,
-					    cache->key.objectid, BITMAP_RANGE);
+					    cache->key.objectid, bitmap_range);
 	if (ret) {
 		test_msg("Could not remove free space\n");
 		return ret;
@@ -226,15 +227,16 @@ static int test_remove_end(struct btrfs_trans_handle *trans,
 			   struct btrfs_block_group_cache *cache,
 			   struct btrfs_path *path)
 {
+	u64 bitmap_range = BITMAP_RANGE(fs_info->tree_root->sectorsize);
 	struct free_space_extent extents[] = {
-		{cache->key.objectid, cache->key.offset - BITMAP_RANGE},
+		{cache->key.objectid, cache->key.offset - bitmap_range},
 	};
 	int ret;
 
 	ret = __remove_from_free_space_tree(trans, fs_info, cache, path,
 					    cache->key.objectid +
-					    cache->key.offset - BITMAP_RANGE,
-					    BITMAP_RANGE);
+					    cache->key.offset - bitmap_range,
+					    bitmap_range);
 	if (ret) {
 		test_msg("Could not remove free space\n");
 		return ret;
@@ -249,16 +251,17 @@ static int test_remove_middle(struct btrfs_trans_handle *trans,
 			      struct btrfs_block_group_cache *cache,
 			      struct btrfs_path *path)
 {
+	u64 bitmap_range = BITMAP_RANGE(fs_info->tree_root->sectorsize);
 	struct free_space_extent extents[] = {
-		{cache->key.objectid, BITMAP_RANGE},
-		{cache->key.objectid + 2 * BITMAP_RANGE,
-			cache->key.offset - 2 * BITMAP_RANGE},
+		{cache->key.objectid, bitmap_range},
+		{cache->key.objectid + 2 * bitmap_range,
+			cache->key.offset - 2 * bitmap_range},
 	};
 	int ret;
 
 	ret = __remove_from_free_space_tree(trans, fs_info, cache, path,
-					    cache->key.objectid + BITMAP_RANGE,
-					    BITMAP_RANGE);
+					    cache->key.objectid + bitmap_range,
+					    bitmap_range);
 	if (ret) {
 		test_msg("Could not remove free space\n");
 		return ret;
@@ -273,8 +276,9 @@ static int test_merge_left(struct btrfs_trans_handle *trans,
 			   struct btrfs_block_group_cache *cache,
 			   struct btrfs_path *path)
 {
+	u64 bitmap_range = BITMAP_RANGE(fs_info->tree_root->sectorsize);
 	struct free_space_extent extents[] = {
-		{cache->key.objectid, 2 * BITMAP_RANGE},
+		{cache->key.objectid, 2 * bitmap_range},
 	};
 	int ret;
 
@@ -287,15 +291,15 @@ static int test_merge_left(struct btrfs_trans_handle *trans,
 	}
 
 	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
-				       cache->key.objectid, BITMAP_RANGE);
+				       cache->key.objectid, bitmap_range);
 	if (ret) {
 		test_msg("Could not add free space\n");
 		return ret;
 	}
 
 	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
-				       cache->key.objectid + BITMAP_RANGE,
-				       BITMAP_RANGE);
+				       cache->key.objectid + bitmap_range,
+				       bitmap_range);
 	if (ret) {
 		test_msg("Could not add free space\n");
 		return ret;
@@ -310,8 +314,9 @@ static int test_merge_right(struct btrfs_trans_handle *trans,
 			   struct btrfs_block_group_cache *cache,
 			   struct btrfs_path *path)
 {
+	u64 bitmap_range = BITMAP_RANGE(fs_info->tree_root->sectorsize);
 	struct free_space_extent extents[] = {
-		{cache->key.objectid + BITMAP_RANGE, 2 * BITMAP_RANGE},
+		{cache->key.objectid + bitmap_range, 2 * bitmap_range},
 	};
 	int ret;
 
@@ -324,16 +329,16 @@ static int test_merge_right(struct btrfs_trans_handle *trans,
 	}
 
 	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
-				       cache->key.objectid + 2 * BITMAP_RANGE,
-				       BITMAP_RANGE);
+				       cache->key.objectid + 2 * bitmap_range,
+				       bitmap_range);
 	if (ret) {
 		test_msg("Could not add free space\n");
 		return ret;
 	}
 
 	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
-				       cache->key.objectid + BITMAP_RANGE,
-				       BITMAP_RANGE);
+				       cache->key.objectid + bitmap_range,
+				       bitmap_range);
 	if (ret) {
 		test_msg("Could not add free space\n");
 		return ret;
@@ -348,8 +353,9 @@ static int test_merge_both(struct btrfs_trans_handle *trans,
 			   struct btrfs_block_group_cache *cache,
 			   struct btrfs_path *path)
 {
+	u64 bitmap_range = BITMAP_RANGE(fs_info->tree_root->sectorsize);
 	struct free_space_extent extents[] = {
-		{cache->key.objectid, 3 * BITMAP_RANGE},
+		{cache->key.objectid, 3 * bitmap_range},
 	};
 	int ret;
 
@@ -362,23 +368,23 @@ static int test_merge_both(struct btrfs_trans_handle *trans,
 	}
 
 	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
-				       cache->key.objectid, BITMAP_RANGE);
+				       cache->key.objectid, bitmap_range);
 	if (ret) {
 		test_msg("Could not add free space\n");
 		return ret;
 	}
 
 	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
-				       cache->key.objectid + 2 * BITMAP_RANGE,
-				       BITMAP_RANGE);
+				       cache->key.objectid + 2 * bitmap_range,
+				       bitmap_range);
 	if (ret) {
 		test_msg("Could not add free space\n");
 		return ret;
 	}
 
 	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
-				       cache->key.objectid + BITMAP_RANGE,
-				       BITMAP_RANGE);
+				       cache->key.objectid + bitmap_range,
+				       bitmap_range);
 	if (ret) {
 		test_msg("Could not add free space\n");
 		return ret;
@@ -393,10 +399,11 @@ static int test_merge_none(struct btrfs_trans_handle *trans,
 			   struct btrfs_block_group_cache *cache,
 			   struct btrfs_path *path)
 {
+	u64 bitmap_range = BITMAP_RANGE(fs_info->tree_root->sectorsize);
 	struct free_space_extent extents[] = {
-		{cache->key.objectid, BITMAP_RANGE},
-		{cache->key.objectid + 2 * BITMAP_RANGE, BITMAP_RANGE},
-		{cache->key.objectid + 4 * BITMAP_RANGE, BITMAP_RANGE},
+		{cache->key.objectid, bitmap_range},
+		{cache->key.objectid + 2 * bitmap_range, bitmap_range},
+		{cache->key.objectid + 4 * bitmap_range, bitmap_range},
 	};
 	int ret;
 
@@ -409,23 +416,23 @@ static int test_merge_none(struct btrfs_trans_handle *trans,
 	}
 
 	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
-				       cache->key.objectid, BITMAP_RANGE);
+				       cache->key.objectid, bitmap_range);
 	if (ret) {
 		test_msg("Could not add free space\n");
 		return ret;
 	}
 
 	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
-				       cache->key.objectid + 4 * BITMAP_RANGE,
-				       BITMAP_RANGE);
+				       cache->key.objectid + 4 * bitmap_range,
+				       bitmap_range);
 	if (ret) {
 		test_msg("Could not add free space\n");
 		return ret;
 	}
 
 	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
-				       cache->key.objectid + 2 * BITMAP_RANGE,
-				       BITMAP_RANGE);
+				       cache->key.objectid + 2 * bitmap_range,
+				       bitmap_range);
 	if (ret) {
 		test_msg("Could not add free space\n");
 		return ret;
@@ -480,7 +487,7 @@ static int run_test(test_func_t test_func, int bitmaps,
 	btrfs_set_header_nritems(root->node, 0);
 	root->alloc_bytenr += 2 * nodesize;
 
-	cache = btrfs_alloc_dummy_block_group(8 * BITMAP_RANGE, sectorsize);
+	cache = btrfs_alloc_dummy_block_group(8 * BITMAP_RANGE(sectorsize), sectorsize);
 	if (!cache) {
 		test_msg("Couldn't allocate dummy block group cache\n");
 		ret = -ENOMEM;
-- 
2.5.5



* [PATCH V21 10/19] Btrfs: subpage-blocksize: Allow mounting filesystems where sectorsize < PAGE_SIZE
  2016-10-02 13:24 [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (8 preceding siblings ...)
  2016-10-02 13:24 ` [PATCH V21 09/19] Btrfs: subpage-blocksize: Compute free space tree BITMAP_RANGE based on sectorsize Chandan Rajendra
@ 2016-10-02 13:24 ` Chandan Rajendra
  2016-10-02 13:24 ` [PATCH V21 11/19] Btrfs: subpage-blocksize: Deal with partial ordered extent allocations Chandan Rajendra
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

This commit allows mounting filesystem instances with a sectorsize
smaller than PAGE_SIZE.

Since the superblock (BTRFS_SUPER_INFO_SIZE bytes) can be larger than
the sectorsize and the nodesize, this commit brings back the blocksize
argument of the btrfs_find_create_tree_block() function. This change
allows us to mount and use filesystems with a sectorsize of 2048 bytes.
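
For example, with sectorsize = nodesize = 2048, a nodesize-d tree block
can no longer hold the superblock, so btrfs_read_sys_array() now passes
BTRFS_SUPER_INFO_SIZE to btrfs_find_create_tree_block() explicitly
rather than relying on nodesize (see the fs/btrfs/volumes.c hunk below).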

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/disk-io.c     | 21 ++++++++-------------
 fs/btrfs/disk-io.h     |  2 +-
 fs/btrfs/extent-tree.c |  4 ++--
 fs/btrfs/extent_io.c   |  3 +--
 fs/btrfs/extent_io.h   |  2 +-
 fs/btrfs/tree-log.c    |  2 +-
 fs/btrfs/volumes.c     | 10 +++-------
 7 files changed, 17 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5663481..2684438 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -936,7 +936,7 @@ void readahead_tree_block(struct btrfs_root *root, u64 bytenr)
 {
 	struct extent_buffer *buf = NULL;
 
-	buf = btrfs_find_create_tree_block(root, bytenr);
+	buf = btrfs_find_create_tree_block(root, bytenr, root->nodesize);
 	if (IS_ERR(buf))
 		return;
 	read_extent_buffer_pages(buf, WAIT_NONE, 0);
@@ -949,7 +949,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr,
 	struct extent_buffer *buf = NULL;
 	int ret;
 
-	buf = btrfs_find_create_tree_block(root, bytenr);
+	buf = btrfs_find_create_tree_block(root, bytenr, root->nodesize);
 	if (IS_ERR(buf))
 		return 0;
 
@@ -979,12 +979,12 @@ struct extent_buffer *btrfs_find_tree_block(struct btrfs_fs_info *fs_info,
 }
 
 struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
-						 u64 bytenr)
+						 u64 bytenr, u32 blocksize)
 {
 	if (btrfs_is_testing(root->fs_info))
 		return alloc_test_extent_buffer(root->fs_info->eb_info, bytenr,
-						root->nodesize);
-	return alloc_extent_buffer(root->fs_info, bytenr);
+						blocksize);
+	return alloc_extent_buffer(root->fs_info, bytenr, blocksize);
 }
 
 
@@ -1006,7 +1006,7 @@ struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
 	struct extent_buffer *buf = NULL;
 	int ret;
 
-	buf = btrfs_find_create_tree_block(root, bytenr);
+	buf = btrfs_find_create_tree_block(root, bytenr, root->nodesize);
 	if (IS_ERR(buf))
 		return buf;
 
@@ -3891,17 +3891,12 @@ static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
 	 * Check sectorsize and nodesize first, other check will need it.
 	 * Check all possible sectorsize(4K, 8K, 16K, 32K, 64K) here.
 	 */
-	if (!is_power_of_2(sectorsize) || sectorsize < 4096 ||
+	if (!is_power_of_2(sectorsize) || sectorsize < 2048 ||
 	    sectorsize > BTRFS_MAX_METADATA_BLOCKSIZE) {
 		printk(KERN_ERR "BTRFS: invalid sectorsize %llu\n", sectorsize);
 		ret = -EINVAL;
 	}
-	/* Only PAGE SIZE is supported yet */
-	if (sectorsize != PAGE_SIZE) {
-		printk(KERN_ERR "BTRFS: sectorsize %llu not supported yet, only support %lu\n",
-				sectorsize, PAGE_SIZE);
-		ret = -EINVAL;
-	}
+
 	if (!is_power_of_2(nodesize) || nodesize < sectorsize ||
 	    nodesize > BTRFS_MAX_METADATA_BLOCKSIZE) {
 		printk(KERN_ERR "BTRFS: invalid nodesize %llu\n", nodesize);
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 591f078..5f6263e 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -50,7 +50,7 @@ void readahead_tree_block(struct btrfs_root *root, u64 bytenr);
 int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr,
 			 int mirror_num, struct extent_buffer **eb);
 struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
-						   u64 bytenr);
+						   u64 bytenr, u32 blocksize);
 void clean_tree_block(struct btrfs_trans_handle *trans,
 		      struct btrfs_fs_info *fs_info, struct extent_buffer *buf);
 int open_ctree(struct super_block *sb,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 03da2f6..25fbfa2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8238,7 +8238,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 {
 	struct extent_buffer *buf;
 
-	buf = btrfs_find_create_tree_block(root, bytenr);
+	buf = btrfs_find_create_tree_block(root, bytenr, root->nodesize);
 	if (IS_ERR(buf))
 		return buf;
 
@@ -8885,7 +8885,7 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 
 	next = btrfs_find_tree_block(root->fs_info, bytenr);
 	if (!next) {
-		next = btrfs_find_create_tree_block(root, bytenr);
+		next = btrfs_find_create_tree_block(root, bytenr, blocksize);
 		if (IS_ERR(next))
 			return PTR_ERR(next);
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8ace367..9af8237 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5113,9 +5113,8 @@ free_eb:
 #endif
 
 struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
-					  u64 start)
+					  u64 start, unsigned long len)
 {
-	unsigned long len = fs_info->tree_root->nodesize;
 	unsigned long num_pages = num_extent_pages(start, len);
 	unsigned long i;
 	struct extent_buffer *eb;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 6a02343..ad5b000 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -430,7 +430,7 @@ int clear_page_blks_state(struct page *page, unsigned long blk_states,
 int test_page_blks_state(struct page *page, enum blk_state blk_state,
 			u64 start, u64 end, int check_all);
 struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
-					  u64 start);
+					  u64 start, unsigned long len);
 struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_eb_info *eb_info,
 						u64 start, unsigned long len);
 struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index a25be18b..bc5e0c1 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2421,7 +2421,7 @@ static noinline int walk_down_log_tree(struct btrfs_trans_handle *trans,
 		parent = path->nodes[*level];
 		root_owner = btrfs_header_owner(parent);
 
-		next = btrfs_find_create_tree_block(root, bytenr);
+		next = btrfs_find_create_tree_block(root, bytenr, blocksize);
 		if (IS_ERR(next))
 			return PTR_ERR(next);
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 89dc9c7..3994ab9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6664,15 +6664,11 @@ int btrfs_read_sys_array(struct btrfs_root *root)
 	u64 type;
 	struct btrfs_key key;
 
-	ASSERT(BTRFS_SUPER_INFO_SIZE <= root->nodesize);
-	/*
-	 * This will create extent buffer of nodesize, superblock size is
-	 * fixed to BTRFS_SUPER_INFO_SIZE. If nodesize > sb size, this will
-	 * overallocate but we can keep it as-is, only the first page is used.
-	 */
-	sb = btrfs_find_create_tree_block(root, BTRFS_SUPER_INFO_OFFSET);
+	sb = btrfs_find_create_tree_block(root, BTRFS_SUPER_INFO_OFFSET,
+					  BTRFS_SUPER_INFO_SIZE);
 	if (IS_ERR(sb))
 		return PTR_ERR(sb);
+
 	set_extent_buffer_uptodate(sb);
 	btrfs_set_buffer_lockdep_class(root->root_key.objectid, sb, 0);
 	/*
-- 
2.5.5



* [PATCH V21 11/19] Btrfs: subpage-blocksize: Deal with partial ordered extent allocations.
  2016-10-02 13:24 [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (9 preceding siblings ...)
  2016-10-02 13:24 ` [PATCH V21 10/19] Btrfs: subpage-blocksize: Allow mounting filesystems where sectorsize < PAGE_SIZE Chandan Rajendra
@ 2016-10-02 13:24 ` Chandan Rajendra
  2016-10-02 13:24 ` [PATCH V21 12/19] Btrfs: subpage-blocksize: Explicitly track I/O status of blocks of an ordered extent Chandan Rajendra
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

In the subpage-blocksize scenario, extent allocation can succeed for
only some of the dirty blocks of a page while failing for the rest.
This patch allows I/O against such pages to be submitted.
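
For example, on a 64K page holding sixteen 4K blocks, cow_file_range()
may allocate an ordered extent covering the first few blocks and then
fail for the remainder (the specific failure, say ENOSPC, is an
illustrative assumption); the blocks that did get an allocation must
still be written back. The PagePrivate2 flag set for the successfully
allocated ranges is what lets extent_clear_unlock_delalloc() below skip
clearing dirty state and ending writeback for them.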

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/extent_io.c | 27 ++++++++++++++-------------
 fs/btrfs/inode.c     | 11 ++++++++++-
 2 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9af8237..0832797 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1843,17 +1843,23 @@ void extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
 			if (page_ops & PAGE_SET_PRIVATE2)
 				SetPagePrivate2(pages[i]);
 
+			if (page_ops & PAGE_SET_ERROR)
+				SetPageError(pages[i]);
+
 			if (pages[i] == locked_page) {
 				put_page(pages[i]);
 				continue;
 			}
-			if (page_ops & PAGE_CLEAR_DIRTY)
+
+			if ((page_ops & PAGE_CLEAR_DIRTY)
+				&& !PagePrivate2(pages[i]))
 				clear_page_dirty_for_io(pages[i]);
-			if (page_ops & PAGE_SET_WRITEBACK)
+			if ((page_ops & PAGE_SET_WRITEBACK)
+				&& !PagePrivate2(pages[i]))
 				set_page_writeback(pages[i]);
-			if (page_ops & PAGE_SET_ERROR)
-				SetPageError(pages[i]);
-			if (page_ops & PAGE_END_WRITEBACK)
+
+			if ((page_ops & PAGE_END_WRITEBACK)
+				&& !PagePrivate2(pages[i]))
 				end_page_writeback(pages[i]);
 
 			if (page_ops & PAGE_UNLOCK) {
@@ -2554,7 +2560,7 @@ void end_extent_writepage(struct page *page, int err, u64 start, u64 end)
 			uptodate = 0;
 	}
 
-	if (!uptodate) {
+	if (!uptodate || PageError(page)) {
 		ClearPageUptodate(page);
 		SetPageError(page);
 		ret = ret < 0 ? ret : -EIO;
@@ -3401,7 +3407,6 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode,
 					       nr_written);
 		/* File system has been set read-only */
 		if (ret) {
-			SetPageError(page);
 			/* fill_delalloc should be return < 0 for error
 			 * but just in case, we use > 0 here meaning the
 			 * IO is started, so we don't want to return > 0
@@ -3618,7 +3623,6 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
 	struct inode *inode = page->mapping->host;
 	struct extent_page_data *epd = data;
 	u64 start = page_offset(page);
-	u64 page_end = start + PAGE_SIZE - 1;
 	int ret;
 	int nr = 0;
 	size_t pg_offset = 0;
@@ -3661,7 +3665,7 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
 	ret = writepage_delalloc(inode, page, wbc, epd, start, &nr_written);
 	if (ret == 1)
 		goto done_unlocked;
-	if (ret)
+	if (ret && !PagePrivate2(page))
 		goto done;
 
 	ret = __extent_writepage_io(inode, page, wbc, epd,
@@ -3675,10 +3679,7 @@ done:
 		set_page_writeback(page);
 		end_page_writeback(page);
 	}
-	if (PageError(page)) {
-		ret = ret < 0 ? ret : -EIO;
-		end_extent_writepage(page, ret, start, page_end);
-	}
+
 	unlock_page(page);
 	return ret;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 42f844b..cf55622 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -951,6 +951,7 @@ static noinline int cow_file_range(struct inode *inode,
 	struct btrfs_key ins;
 	struct extent_map *em;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+	struct btrfs_ordered_extent *ordered;
 	unsigned long page_ops, extent_ops;
 	int ret = 0;
 
@@ -1049,7 +1050,7 @@ static noinline int cow_file_range(struct inode *inode,
 			ret = btrfs_reloc_clone_csums(inode, start,
 						      cur_alloc_size);
 			if (ret)
-				goto out_drop_extent_cache;
+				goto out_remove_ordered_extent;
 		}
 
 		btrfs_dec_block_group_reservations(root->fs_info, ins.objectid);
@@ -1078,6 +1079,14 @@ static noinline int cow_file_range(struct inode *inode,
 out:
 	return ret;
 
+out_remove_ordered_extent:
+	ordered = btrfs_lookup_ordered_extent(inode, start);
+	ASSERT(ordered);
+	btrfs_remove_ordered_extent(inode, ordered);
+	/* once for us */
+	btrfs_put_ordered_extent(ordered);
+	/* once for the tree */
+	btrfs_put_ordered_extent(ordered);
 out_drop_extent_cache:
 	btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0);
 out_reserve:
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH V21 12/19] Btrfs: subpage-blocksize: Explicitly track I/O status of blocks of an ordered extent.
  2016-10-02 13:24 [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (10 preceding siblings ...)
  2016-10-02 13:24 ` [PATCH V21 11/19] Btrfs: subpage-blocksize: Deal with partial ordered extent allocations Chandan Rajendra
@ 2016-10-02 13:24 ` Chandan Rajendra
  2016-10-02 13:24 ` [PATCH V21 13/19] Btrfs: subpage-blocksize: btrfs_punch_hole: Fix uptodate blocks check Chandan Rajendra
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

In the subpage-blocksize scenario a page can have more than one
block. So in addition to the PagePrivate2 flag, we have to track the
I/O status of each block of a page to reliably mark the ordered extent
as complete.
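
As a rough model of the bookkeeping introduced below (not kernel code;
the real implementation uses test_and_set_bit() on the
ordered->blocks_done bitmap, with a single inline long as a fast path
for small extents), each ordered extent carries one bit per block and
is considered complete only once every bit is set. The demo assumes at
most 63 blocks so a single 64-bit word suffices.

#include <stdbool.h>
#include <stdio.h>

#define BLOCKSIZE       2048ULL

struct ordered_extent {
        unsigned long long file_offset;
        unsigned long long len;
        unsigned long long blocks_done; /* 1 bit per block */
};

/* mark [start, start + nbytes) of the ordered extent as written */
static void mark_blks_done(struct ordered_extent *oe,
                           unsigned long long start,
                           unsigned long long nbytes)
{
        unsigned long long blk = (start - oe->file_offset) / BLOCKSIZE;
        unsigned long long nr = nbytes / BLOCKSIZE;

        while (nr--)
                oe->blocks_done |= 1ULL << blk++;
}

/* true once every block of the ordered extent has completed I/O */
static bool all_blks_done(const struct ordered_extent *oe)
{
        unsigned long long nr = oe->len / BLOCKSIZE;

        return oe->blocks_done == (1ULL << nr) - 1;
}

int main(void)
{
        struct ordered_extent oe = { .file_offset = 0,
                                     .len = 4 * BLOCKSIZE };

        mark_blks_done(&oe, 0, 2 * BLOCKSIZE);
        printf("all done? %d\n", all_blks_done(&oe));   /* 0 */
        mark_blks_done(&oe, 2 * BLOCKSIZE, 2 * BLOCKSIZE);
        printf("all done? %d\n", all_blks_done(&oe));   /* 1 */
        return 0;
}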

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/extent_io.c    |  19 +--
 fs/btrfs/extent_io.h    |   5 +-
 fs/btrfs/inode.c        | 363 ++++++++++++++++++++++++++++++++++--------------
 fs/btrfs/ordered-data.c |  19 +++
 fs/btrfs/ordered-data.h |   4 +
 5 files changed, 294 insertions(+), 116 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0832797..df6172c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4483,11 +4483,10 @@ int extent_invalidatepage(struct extent_io_tree *tree,
  * to drop the page.
  */
 static int try_release_extent_state(struct extent_map_tree *map,
-				    struct extent_io_tree *tree,
-				    struct page *page, gfp_t mask)
+				struct extent_io_tree *tree,
+				struct page *page, u64 start, u64 end,
+				gfp_t mask)
 {
-	u64 start = page_offset(page);
-	u64 end = start + PAGE_SIZE - 1;
 	int ret = 1;
 
 	if (test_range_bit(tree, start, end,
@@ -4521,12 +4520,12 @@ static int try_release_extent_state(struct extent_map_tree *map,
  * map records are removed
  */
 int try_release_extent_mapping(struct extent_map_tree *map,
-			       struct extent_io_tree *tree, struct page *page,
-			       gfp_t mask)
+			struct extent_io_tree *tree, struct page *page,
+			u64 start, u64 end, gfp_t mask)
 {
 	struct extent_map *em;
-	u64 start = page_offset(page);
-	u64 end = start + PAGE_SIZE - 1;
+	u64 orig_start = start;
+	u64 orig_end = end;
 
 	if (gfpflags_allow_blocking(mask) &&
 	    page->mapping->host->i_size > SZ_16M) {
@@ -4560,7 +4559,9 @@ int try_release_extent_mapping(struct extent_map_tree *map,
 			free_extent_map(em);
 		}
 	}
-	return try_release_extent_state(map, tree, page, mask);
+	return try_release_extent_state(map, tree, page,
+					orig_start, orig_end,
+					mask);
 }
 
 /*
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index ad5b000..491f9b4 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -276,8 +276,9 @@ typedef struct extent_map *(get_extent_t)(struct inode *inode,
 
 void extent_io_tree_init(struct extent_io_tree *tree, void *private_data);
 int try_release_extent_mapping(struct extent_map_tree *map,
-			       struct extent_io_tree *tree, struct page *page,
-			       gfp_t mask);
+			struct extent_io_tree *tree, struct page *page,
+			u64 start, u64 end,
+			gfp_t mask);
 int try_release_extent_buffer(struct page *page);
 int lock_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 		     struct extent_state **cached);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cf55622..03b9425 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3077,56 +3077,119 @@ static void finish_ordered_fn(struct btrfs_work *work)
 	btrfs_finish_ordered_io(ordered_extent);
 }
 
-static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
-				struct extent_state *state, int uptodate)
+static void mark_blks_io_complete(struct btrfs_ordered_extent *ordered,
+				u64 blk, u64 nr_blks, int uptodate)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = ordered->inode;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct btrfs_ordered_extent *ordered_extent = NULL;
 	struct btrfs_workqueue *wq;
 	btrfs_work_func_t func;
-	u64 ordered_start, ordered_end;
 	int done;
 
-	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
+	while (nr_blks--) {
+		if (test_and_set_bit(blk, ordered->blocks_done)) {
+			blk++;
+			continue;
+		}
 
-	ClearPagePrivate2(page);
-loop:
-	ordered_extent = btrfs_lookup_ordered_range(inode, start,
-						end - start + 1);
-	if (!ordered_extent)
-		goto out;
+		done = btrfs_dec_test_ordered_pending(inode, &ordered,
+						ordered->file_offset
+						+ (blk << inode->i_blkbits),
+						root->sectorsize,
+						uptodate);
+		if (done) {
+			if (btrfs_is_free_space_inode(inode)) {
+				wq = root->fs_info->endio_freespace_worker;
+				func = btrfs_freespace_write_helper;
+			} else {
+				wq = root->fs_info->endio_write_workers;
+				func = btrfs_endio_write_helper;
+			}
 
-	ordered_start = max_t(u64, start, ordered_extent->file_offset);
-	ordered_end = min_t(u64, end,
-			ordered_extent->file_offset + ordered_extent->len - 1);
-
-	done = btrfs_dec_test_ordered_pending(inode, &ordered_extent,
-					ordered_start,
-					ordered_end - ordered_start + 1,
-					uptodate);
-	if (done) {
-		if (btrfs_is_free_space_inode(inode)) {
-			wq = root->fs_info->endio_freespace_worker;
-			func = btrfs_freespace_write_helper;
-		} else {
-			wq = root->fs_info->endio_write_workers;
-			func = btrfs_endio_write_helper;
+			btrfs_init_work(&ordered->work, func,
+					finish_ordered_fn, NULL, NULL);
+			btrfs_queue_work(wq, &ordered->work);
 		}
 
-		btrfs_init_work(&ordered_extent->work, func,
-				finish_ordered_fn, NULL, NULL);
-		btrfs_queue_work(wq, &ordered_extent->work);
+		blk++;
 	}
+}
 
-	btrfs_put_ordered_extent(ordered_extent);
+int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
+				struct extent_state *state, int uptodate)
+{
+	struct inode *inode = page->mapping->host;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_ordered_extent *ordered_extent = NULL;
+	u64 blk, nr_blks;
+	int clear;
 
-	start = ordered_end + 1;
+	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
+
+	while (start < end) {
+		ordered_extent = btrfs_lookup_ordered_extent(inode, start);
+		if (!ordered_extent) {
+			start += root->sectorsize;
+			continue;
+		}
 
-	if (start < end)
-		goto loop;
+		blk = BTRFS_BYTES_TO_BLKS(root->fs_info,
+					start - ordered_extent->file_offset);
+
+		nr_blks = BTRFS_BYTES_TO_BLKS(root->fs_info,
+					min(end, ordered_extent->file_offset
+						+ ordered_extent->len - 1)
+					+ 1 - start);
+
+		ASSERT(nr_blks);
+
+		mark_blks_io_complete(ordered_extent, blk, nr_blks, uptodate);
+
+		start = ordered_extent->file_offset + ordered_extent->len;
+
+		btrfs_put_ordered_extent(ordered_extent);
+	}
+
+	start = page_offset(page);
+	end = start + PAGE_SIZE - 1;
+	clear = 1;
+
+	while (start < end) {
+		ordered_extent = btrfs_lookup_ordered_extent(inode, start);
+		if (!ordered_extent) {
+			start += root->sectorsize;
+			continue;
+		}
+
+		blk = BTRFS_BYTES_TO_BLKS(root->fs_info,
+					start - ordered_extent->file_offset);
+		nr_blks = BTRFS_BYTES_TO_BLKS(root->fs_info,
+					min(end, ordered_extent->file_offset
+						+ ordered_extent->len - 1)
+					+ 1 - start);
+
+		ASSERT(nr_blks);
+
+		while (nr_blks--) {
+			if (!test_bit(blk++, ordered_extent->blocks_done)) {
+				clear = 0;
+				break;
+			}
+		}
+
+		if (!clear) {
+			btrfs_put_ordered_extent(ordered_extent);
+			break;
+		}
+
+		start += ordered_extent->len;
+
+		btrfs_put_ordered_extent(ordered_extent);
+	}
+
+	if (clear)
+		ClearPagePrivate2(page);
 
-out:
 	return 0;
 }
 
@@ -8841,7 +8904,9 @@ btrfs_readpages(struct file *file, struct address_space *mapping,
 	return extent_readpages(tree, mapping, pages, nr_pages,
 				btrfs_get_extent);
 }
-static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
+
+static int __btrfs_releasepage(struct page *page, u64 start, u64 end,
+			gfp_t gfp_flags)
 {
 	struct extent_io_tree *tree;
 	struct extent_map_tree *map;
@@ -8849,33 +8914,149 @@ static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
 
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
 	map = &BTRFS_I(page->mapping->host)->extent_tree;
-	ret = try_release_extent_mapping(map, tree, page, gfp_flags);
-	if (ret == 1)
+
+	ret = try_release_extent_mapping(map, tree, page, start, end,
+					gfp_flags);
+	if ((ret == 1) && ((end - start + 1) == PAGE_SIZE))
 		clear_page_extent_mapped(page);
+	else
+		ret = 0;
 
 	return ret;
 }
 
 static int btrfs_releasepage(struct page *page, gfp_t gfp_flags)
 {
+	u64 start = page_offset(page);
+	u64 end = start + PAGE_SIZE - 1;
+
 	if (PageWriteback(page) || PageDirty(page))
 		return 0;
-	return __btrfs_releasepage(page, gfp_flags & GFP_NOFS);
+
+	return __btrfs_releasepage(page, start, end, gfp_flags & GFP_NOFS);
+}
+
+static void invalidate_ordered_extent_blocks(struct inode *inode,
+					struct btrfs_ordered_extent *ordered,
+					u64 locked_start, u64 locked_end,
+					u64 cur,
+					int inode_evicting)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_ordered_inode_tree *ordered_tree;
+	struct extent_io_tree *tree;
+	u64 blk, blk_done, nr_blks;
+	u64 end;
+	u64 new_len;
+
+	tree = &BTRFS_I(inode)->io_tree;
+
+	end = min(locked_end, ordered->file_offset + ordered->len - 1);
+
+	if (!inode_evicting) {
+		clear_extent_bit(tree, cur, end,
+				EXTENT_DIRTY | EXTENT_DELALLOC |
+				EXTENT_DO_ACCOUNTING |
+				EXTENT_DEFRAG, 1, 0, NULL,
+				GFP_NOFS);
+		unlock_extent(tree, locked_start, locked_end);
+	}
+
+
+	ordered_tree = &BTRFS_I(inode)->ordered_tree;
+	spin_lock_irq(&ordered_tree->lock);
+	set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
+	new_len = cur - ordered->file_offset;
+	if (new_len < ordered->truncated_len)
+		ordered->truncated_len = new_len;
+
+	blk = BTRFS_BYTES_TO_BLKS(root->fs_info,
+				cur - ordered->file_offset);
+	nr_blks = BTRFS_BYTES_TO_BLKS(root->fs_info, end + 1 - cur);
+
+	while (nr_blks--) {
+		blk_done = !test_and_set_bit(blk, ordered->blocks_done);
+		if (blk_done) {
+			spin_unlock_irq(&ordered_tree->lock);
+			if (btrfs_dec_test_ordered_pending(inode, &ordered,
+							ordered->file_offset + (blk << inode->i_blkbits),
+							root->sectorsize, 1))
+				btrfs_finish_ordered_io(ordered);
+
+			spin_lock_irq(&ordered_tree->lock);
+		}
+		blk++;
+	}
+
+	spin_unlock_irq(&ordered_tree->lock);
+
+	if (!inode_evicting)
+		lock_extent_bits(tree, locked_start, locked_end, NULL);
+}
+
+static int page_blocks_written(struct page *page)
+{
+	struct btrfs_ordered_extent *ordered;
+	struct btrfs_root *root;
+	struct inode *inode;
+	unsigned long outstanding_blk;
+	u64 page_start, page_end;
+	u64 blk, last_blk, nr_blks;
+	u64 cur;
+	u64 len;
+
+	inode = page->mapping->host;
+	root = BTRFS_I(inode)->root;
+
+	page_start = page_offset(page);
+	page_end = page_start + PAGE_SIZE - 1;
+
+	cur = page_start;
+	while (cur < page_end) {
+		ordered = btrfs_lookup_ordered_extent(inode, cur);
+		if (!ordered) {
+			cur += root->sectorsize;
+			continue;
+		}
+
+		blk = BTRFS_BYTES_TO_BLKS(root->fs_info,
+					cur - ordered->file_offset);
+		len = min(page_end, ordered->file_offset + ordered->len - 1)
+			- cur + 1;
+		nr_blks = BTRFS_BYTES_TO_BLKS(root->fs_info, len);
+
+		last_blk = blk + nr_blks - 1;
+
+		outstanding_blk = find_next_zero_bit(ordered->blocks_done,
+						BTRFS_BYTES_TO_BLKS(root->fs_info,
+								ordered->len),
+						blk);
+		if (outstanding_blk <= last_blk) {
+			btrfs_put_ordered_extent(ordered);
+			return 0;
+		}
+
+		btrfs_put_ordered_extent(ordered);
+		cur += len;
+	}
+
+	return 1;
 }
 
 static void btrfs_invalidatepage(struct page *page, unsigned int offset,
-				 unsigned int length)
+				unsigned int length)
 {
 	struct inode *inode = page->mapping->host;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct extent_io_tree *tree;
 	struct btrfs_ordered_extent *ordered;
-	struct extent_state *cached_state = NULL;
-	u64 page_start = page_offset(page);
-	u64 page_end = page_start + PAGE_SIZE - 1;
-	u64 start;
-	u64 end;
+	u64 start, end, cur;
+	u64 page_start, page_end;
 	int inode_evicting = inode->i_state & I_FREEING;
 
+	page_start = page_offset(page);
+	page_end = page_start + PAGE_SIZE - 1;
+
 	/*
 	 * we have the page locked, so new writeback can't start,
 	 * and the dirty bit won't be cleared while we are here.
@@ -8886,61 +9067,35 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	wait_on_page_writeback(page);
 
 	tree = &BTRFS_I(inode)->io_tree;
-	if (offset) {
+
+	start = round_up(offset, root->sectorsize);
+	end = round_down(offset + length, root->sectorsize) - 1;
+	if (end - start + 1 < root->sectorsize) {
 		btrfs_releasepage(page, GFP_NOFS);
 		return;
 	}
 
+	start = round_up(page_start + offset, root->sectorsize);
+	end = round_down(page_start + offset + length,
+			root->sectorsize) - 1;
+
 	if (!inode_evicting)
-		lock_extent_bits(tree, page_start, page_end, &cached_state);
-again:
-	start = page_start;
-	ordered = btrfs_lookup_ordered_range(inode, start,
-					page_end - start + 1);
-	if (ordered) {
-		end = min(page_end, ordered->file_offset + ordered->len - 1);
-		/*
-		 * IO on this page will never be started, so we need
-		 * to account for any ordered extents now
-		 */
-		if (!inode_evicting)
-			clear_extent_bit(tree, start, end,
-					 EXTENT_DIRTY | EXTENT_DELALLOC |
-					 EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
-					 EXTENT_DEFRAG, 1, 0, &cached_state,
-					 GFP_NOFS);
-		/*
-		 * whoever cleared the private bit is responsible
-		 * for the finish_ordered_io
-		 */
-		if (TestClearPagePrivate2(page)) {
-			struct btrfs_ordered_inode_tree *tree;
-			u64 new_len;
+		lock_extent_bits(tree, start, end, NULL);
 
-			tree = &BTRFS_I(inode)->ordered_tree;
+	cur = start;
+	while (cur < end) {
+		ordered = btrfs_lookup_ordered_extent(inode, cur);
+		if (!ordered) {
+			cur += root->sectorsize;
+			continue;
+		}
 
-			spin_lock_irq(&tree->lock);
-			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
-			new_len = start - ordered->file_offset;
-			if (new_len < ordered->truncated_len)
-				ordered->truncated_len = new_len;
-			spin_unlock_irq(&tree->lock);
+		invalidate_ordered_extent_blocks(inode, ordered,
+						start, end, cur,
+						inode_evicting);
 
-			if (btrfs_dec_test_ordered_pending(inode, &ordered,
-							   start,
-							   end - start + 1, 1))
-				btrfs_finish_ordered_io(ordered);
-		}
+		cur = min(end + 1, ordered->file_offset + ordered->len);
 		btrfs_put_ordered_extent(ordered);
-		if (!inode_evicting) {
-			cached_state = NULL;
-			lock_extent_bits(tree, start, end,
-					 &cached_state);
-		}
-
-		start = end + 1;
-		if (start < page_end)
-			goto again;
 	}
 
 	/*
@@ -8956,24 +9111,22 @@ again:
 	 */
 	btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE);
 
-	clear_page_blks_state(page, 1 << BLK_STATE_DIRTY, page_start,
-			page_end);
+	clear_page_blks_state(page, 1 << BLK_STATE_DIRTY, start, end);
 
-	if (!inode_evicting) {
-		clear_extent_bit(tree, page_start, page_end,
-				 EXTENT_LOCKED | EXTENT_DIRTY |
-				 EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
-				 EXTENT_DEFRAG, 1, 1,
-				 &cached_state, GFP_NOFS);
+	if (page_blocks_written(page))
+		ClearPagePrivate2(page);
 
-		__btrfs_releasepage(page, GFP_NOFS);
+	if (!inode_evicting) {
+		clear_extent_bit(tree, start, end,
+				EXTENT_LOCKED | EXTENT_DIRTY |
+				EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
+				EXTENT_DEFRAG, 1, 1, NULL, GFP_NOFS);
+		__btrfs_releasepage(page, start, end, GFP_NOFS);
 	}
 
-	ClearPageChecked(page);
-	if (PagePrivate(page)) {
-		ClearPagePrivate(page);
-		set_page_private(page, 0);
-		put_page(page);
+	if (!offset && length == PAGE_SIZE) {
+		ClearPageChecked(page);
+		clear_page_extent_mapped(page);
 	}
 }
 
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 3b78d38..18bdd7c 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -190,12 +190,27 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 	struct btrfs_ordered_inode_tree *tree;
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry;
+	u64 nr_longs;
+	u64 nr_blks;
 
 	tree = &BTRFS_I(inode)->ordered_tree;
 	entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
 	if (!entry)
 		return -ENOMEM;
 
+	nr_blks = BTRFS_BYTES_TO_BLKS(root->fs_info, len);
+	nr_longs = BITS_TO_LONGS(nr_blks);
+	if (nr_longs == 1) {
+		entry->blocks_done = &entry->blocks_bitmap;
+	} else {
+		entry->blocks_done = kzalloc(nr_longs * sizeof(unsigned long),
+					GFP_NOFS);
+		if (!entry->blocks_done) {
+			kmem_cache_free(btrfs_ordered_extent_cache, entry);
+			return -ENOMEM;
+		}
+	}
+
 	entry->file_offset = file_offset;
 	entry->start = start;
 	entry->len = len;
@@ -577,6 +592,10 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry)
 			list_del(&sum->list);
 			kfree(sum);
 		}
+
+		if (entry->blocks_done != &entry->blocks_bitmap)
+			kfree(entry->blocks_done);
+
 		kmem_cache_free(btrfs_ordered_extent_cache, entry);
 	}
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 4515077..2f66f4a 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -139,6 +139,10 @@ struct btrfs_ordered_extent {
 	struct completion completion;
 	struct btrfs_work flush_work;
 	struct list_head work_list;
+
+	/* bitmap to track the blocks that have been written to disk */
+	unsigned long *blocks_done;
+	unsigned long blocks_bitmap;
 };
 
 /*
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH V21 13/19] Btrfs: subpage-blocksize: btrfs_punch_hole: Fix uptodate blocks check
  2016-10-02 13:24 [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (11 preceding siblings ...)
  2016-10-02 13:24 ` [PATCH V21 12/19] Btrfs: subpage-blocksize: Explicitly track I/O status of blocks of an ordered extent Chandan Rajendra
@ 2016-10-02 13:24 ` Chandan Rajendra
  2016-10-02 13:24 ` [PATCH V21 14/19] Btrfs: subpage-blocksize: Fix file defragmentation code Chandan Rajendra
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

In the subpage-blocksize case, the file blocks to be punched may map
only part of a page. For file blocks inside such pages, we need to
check for the presence of the BLK_STATE_UPTODATE flag.
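
A small userspace sketch of the page-boundary arithmetic used below
(constants and values are made up; the kernel works with page_offset()
and PAGE_SHIFT): it computes the start/end page indexes of the punch
range and whether the leading and trailing pages are only partially
covered, which is exactly when the BLK_STATE_UPTODATE checks become
necessary.

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE       4096ULL
#define PAGE_SHIFT      12

int main(void)
{
        unsigned long long lockstart = 6144;    /* starts mid-page */
        unsigned long long lockend = 18431;     /* ends mid-page */

        unsigned long long start_index = lockstart >> PAGE_SHIFT;
        unsigned long long end_index = lockend >> PAGE_SHIFT;
        bool same_page = start_index == end_index;

        /* a partial leading page exists when lockstart is not aligned */
        bool head_partial = lockstart & (PAGE_SIZE - 1);
        /* a partial trailing page exists when lockend + 1 is not aligned */
        bool tail_partial = !same_page && ((lockend + 1) & (PAGE_SIZE - 1));

        /* prints "pages 1..4 head_partial=1 tail_partial=1" */
        printf("pages %llu..%llu head_partial=%d tail_partial=%d\n",
               start_index, end_index, head_partial, tail_partial);
        return 0;
}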

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/file.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 88 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 54602e6..6490e56 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2350,6 +2350,8 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	struct btrfs_path *path;
 	struct btrfs_block_rsv *rsv;
 	struct btrfs_trans_handle *trans;
+	struct address_space *mapping = inode->i_mapping;
+	pgoff_t start_index, end_index;
 	u64 lockstart;
 	u64 lockend;
 	u64 tail_start;
@@ -2362,6 +2364,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	int err = 0;
 	unsigned int rsv_count;
 	bool same_block;
+	bool same_page;
 	bool no_holes = btrfs_fs_incompat(root->fs_info, NO_HOLES);
 	u64 ino_size;
 	bool truncated_block = false;
@@ -2458,11 +2461,45 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 		goto out_only_mutex;
 	}
 
+	start_index = lockstart >> PAGE_SHIFT;
+	end_index = lockend >> PAGE_SHIFT;
+
+	same_page = lockstart >> PAGE_SHIFT
+		== lockend >> PAGE_SHIFT;
+
 	while (1) {
 		struct btrfs_ordered_extent *ordered;
+		struct page *start_page = NULL;
+		struct page *end_page = NULL;
+		u64 nr_pages;
+		int start_page_blks_uptodate;
+		int end_page_blks_uptodate;
 
 		truncate_pagecache_range(inode, lockstart, lockend);
 
+		if (lockstart & (PAGE_SIZE - 1)) {
+			start_page = find_or_create_page(mapping, start_index,
+							GFP_NOFS);
+			if (!start_page) {
+				inode_unlock(inode);
+				return -ENOMEM;
+			}
+		}
+
+		if (!same_page && ((lockend + 1) & (PAGE_SIZE - 1))) {
+			end_page = find_or_create_page(mapping, end_index,
+						GFP_NOFS);
+			if (!end_page) {
+				if (start_page) {
+					unlock_page(start_page);
+					put_page(start_page);
+				}
+				inode_unlock(inode);
+				return -ENOMEM;
+			}
+		}
+
+
 		lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend,
 				 &cached_state);
 		ordered = btrfs_lookup_first_ordered_extent(inode, lockend);
@@ -2472,18 +2509,68 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 		 * and nobody raced in and read a page in this range, if we did
 		 * we need to try again.
 		 */
+		nr_pages = round_up(lockend, PAGE_SIZE)
+			- round_down(lockstart, PAGE_SIZE);
+		nr_pages >>= PAGE_SHIFT;
+
+		start_page_blks_uptodate = 0;
+		end_page_blks_uptodate = 0;
+		if (root->sectorsize < PAGE_SIZE) {
+			u64 page_end;
+
+			page_end = round_down(lockstart, PAGE_SIZE)
+				+ PAGE_SIZE - 1;
+			page_end = min(page_end, lockend);
+			if (start_page
+				&& PagePrivate(start_page)
+				&& test_page_blks_state(start_page, 1 << BLK_STATE_UPTODATE,
+							lockstart, page_end, 0))
+				start_page_blks_uptodate = 1;
+			if (end_page
+				&& PagePrivate(end_page)
+				&& test_page_blks_state(end_page, 1 << BLK_STATE_UPTODATE,
+							page_offset(end_page), lockend, 0))
+				end_page_blks_uptodate = 1;
+		} else {
+			if (start_page && PagePrivate(start_page)
+				&& PageUptodate(start_page))
+				start_page_blks_uptodate = 1;
+			if (end_page && PagePrivate(end_page)
+				&& PageUptodate(end_page))
+				end_page_blks_uptodate = 1;
+		}
+
 		if ((!ordered ||
 		    (ordered->file_offset + ordered->len <= lockstart ||
 		     ordered->file_offset > lockend)) &&
-		     !btrfs_page_exists_in_range(inode, lockstart, lockend)) {
+		     (!start_page_blks_uptodate && !end_page_blks_uptodate &&
+			!(nr_pages > 2 && btrfs_page_exists_in_range(inode,
+					 round_up(lockstart, PAGE_SIZE),
+					 round_down(lockend, PAGE_SIZE) - 1)))) {
 			if (ordered)
 				btrfs_put_ordered_extent(ordered);
+			if (end_page) {
+				unlock_page(end_page);
+				put_page(end_page);
+			}
+			if (start_page) {
+				unlock_page(start_page);
+				put_page(start_page);
+			}
 			break;
 		}
 		if (ordered)
 			btrfs_put_ordered_extent(ordered);
 		unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart,
 				     lockend, &cached_state, GFP_NOFS);
+		if (end_page) {
+			unlock_page(end_page);
+			put_page(end_page);
+		}
+		if (start_page) {
+			unlock_page(start_page);
+			put_page(start_page);
+		}
 		ret = btrfs_wait_ordered_range(inode, lockstart,
 					       lockend - lockstart + 1);
 		if (ret) {
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH V21 14/19] Btrfs: subpage-blocksize: Fix file defragmentation code
  2016-10-02 13:24 [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (12 preceding siblings ...)
  2016-10-02 13:24 ` [PATCH V21 13/19] Btrfs: subpage-blocksize: btrfs_punch_hole: Fix uptodate blocks check Chandan Rajendra
@ 2016-10-02 13:24 ` Chandan Rajendra
  2016-10-02 13:24 ` [PATCH V21 15/19] Btrfs: subpage-blocksize: Enable dedupe ioctl Chandan Rajendra
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

This commit gets the file defragmentation code to work in the
subpage-blocksize scenario. It does this by keeping track of the page
offsets that mark block boundaries and passing them as arguments to
the functions that implement the defragmentation logic.
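
The index arithmetic this change relies on can be illustrated with a
standalone sketch (the constants below assume 64K pages and 4K blocks
purely for the demo): a defrag cluster is now described by a
(page index, in-page offset, block count) triple, and the number of
pages backing it is derived from that.

#include <stdio.h>

#define PAGE_SIZE       65536ULL  /* e.g. 64K pages on ppc64 */
#define PAGE_SHIFT      16
#define BLKBITS         12        /* 4K blocks */

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
        unsigned long long start_index = 3;     /* page index */
        unsigned long long pg_offset = 8192;    /* block-aligned in-page offset */
        unsigned long long blk_cnt = 20;        /* blocks in this cluster */

        /* first block of the defrag cluster */
        unsigned long long start_blk =
                ((start_index << PAGE_SHIFT) + pg_offset) >> BLKBITS;

        /* pages needed to cover pg_offset + blk_cnt blocks */
        unsigned long long page_cnt =
                DIV_ROUND_UP(pg_offset + (blk_cnt << BLKBITS), PAGE_SIZE);

        /* prints "start_blk=50 page_cnt=2" */
        printf("start_blk=%llu page_cnt=%llu\n", start_blk, page_cnt);
        return 0;
}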

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/ioctl.c | 198 ++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 136 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a222bad..4077fc1 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -902,12 +902,13 @@ out_unlock:
 static int check_defrag_in_cache(struct inode *inode, u64 offset, u32 thresh)
 {
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct extent_map *em = NULL;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
 	u64 end;
 
 	read_lock(&em_tree->lock);
-	em = lookup_extent_mapping(em_tree, offset, PAGE_SIZE);
+	em = lookup_extent_mapping(em_tree, offset, root->sectorsize);
 	read_unlock(&em_tree->lock);
 
 	if (em) {
@@ -997,7 +998,7 @@ static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start)
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct extent_map *em;
-	u64 len = PAGE_SIZE;
+	u64 len = BTRFS_I(inode)->root->sectorsize;
 
 	/*
 	 * hopefully we have this extent in the tree already, try without
@@ -1116,37 +1117,47 @@ out:
  * before calling this.
  */
 static int cluster_pages_for_defrag(struct inode *inode,
-				    struct page **pages,
-				    unsigned long start_index,
-				    unsigned long num_pages)
+				struct page **pages,
+				unsigned long start_index,
+				size_t pg_offset,
+				unsigned long num_blks)
 {
-	unsigned long file_end;
 	u64 isize = i_size_read(inode);
+	u64 start_blk;
+	u64 end_blk;
 	u64 page_start;
 	u64 page_end;
 	u64 page_cnt;
+	u64 blk_cnt;
 	int ret;
 	int i;
 	int i_done;
 	struct btrfs_ordered_extent *ordered;
 	struct extent_state *cached_state = NULL;
 	struct extent_io_tree *tree;
+	struct btrfs_root *root;
 	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
 
-	file_end = (isize - 1) >> PAGE_SHIFT;
-	if (!isize || start_index > file_end)
+	root = BTRFS_I(inode)->root;
+	start_blk = (start_index << PAGE_SHIFT) + pg_offset;
+	start_blk >>= inode->i_blkbits;
+	end_blk = (isize - 1) >> inode->i_blkbits;
+	if (!isize || start_blk > end_blk)
 		return 0;
 
-	page_cnt = min_t(u64, (u64)num_pages, (u64)file_end - start_index + 1);
+	blk_cnt = min_t(u64, (u64)num_blks, (u64)end_blk - start_blk + 1);
 
 	ret = btrfs_delalloc_reserve_space(inode,
-			start_index << PAGE_SHIFT,
-			page_cnt << PAGE_SHIFT);
+					start_blk << inode->i_blkbits,
+					blk_cnt << inode->i_blkbits);
 	if (ret)
 		return ret;
 	i_done = 0;
 	tree = &BTRFS_I(inode)->io_tree;
 
+	page_cnt = DIV_ROUND_UP(pg_offset + (blk_cnt << inode->i_blkbits),
+				PAGE_SIZE);
+
 	/* step one, lock all the pages */
 	for (i = 0; i < page_cnt; i++) {
 		struct page *page;
@@ -1157,12 +1168,22 @@ again:
 			break;
 
 		page_start = page_offset(page);
-		page_end = page_start + PAGE_SIZE - 1;
+
+		if (i == 0)
+			page_start += pg_offset;
+
+		if (i == page_cnt - 1) {
+			page_end = (start_index << PAGE_SHIFT) + pg_offset;
+			page_end += (blk_cnt << inode->i_blkbits) - 1;
+		} else {
+			page_end = page_offset(page) + PAGE_SIZE - 1;
+		}
+
 		while (1) {
 			lock_extent_bits(tree, page_start, page_end,
 					 &cached_state);
-			ordered = btrfs_lookup_ordered_extent(inode,
-							      page_start);
+			ordered = btrfs_lookup_ordered_range(inode, page_start,
+							page_end - page_start + 1);
 			unlock_extent_cached(tree, page_start, page_end,
 					     &cached_state, GFP_NOFS);
 			if (!ordered)
@@ -1201,7 +1222,7 @@ again:
 		}
 
 		pages[i] = page;
-		i_done++;
+		i_done += (page_end - page_start + 1) >> inode->i_blkbits;
 	}
 	if (!i_done || ret)
 		goto out;
@@ -1213,55 +1234,77 @@ again:
 	 * so now we have a nice long stream of locked
 	 * and up to date pages, lets wait on them
 	 */
-	for (i = 0; i < i_done; i++)
+	page_cnt = DIV_ROUND_UP(pg_offset + (i_done << inode->i_blkbits),
+				PAGE_SIZE);
+	for (i = 0; i < page_cnt; i++)
 		wait_on_page_writeback(pages[i]);
 
-	page_start = page_offset(pages[0]);
-	page_end = page_offset(pages[i_done - 1]) + PAGE_SIZE;
+	page_start = page_offset(pages[0]) + pg_offset;
+	page_end = page_start + (i_done << inode->i_blkbits) - 1;
 
 	lock_extent_bits(&BTRFS_I(inode)->io_tree,
-			 page_start, page_end - 1, &cached_state);
+			page_start, page_end, &cached_state);
 	clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start,
-			  page_end - 1, EXTENT_DIRTY | EXTENT_DELALLOC |
+			  page_end, EXTENT_DIRTY | EXTENT_DELALLOC |
 			  EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 0, 0,
 			  &cached_state, GFP_NOFS);
 
-	if (i_done != page_cnt) {
+	if (i_done != blk_cnt) {
 		spin_lock(&BTRFS_I(inode)->lock);
 		BTRFS_I(inode)->outstanding_extents++;
 		spin_unlock(&BTRFS_I(inode)->lock);
 		btrfs_delalloc_release_space(inode,
-				start_index << PAGE_SHIFT,
-				(page_cnt - i_done) << PAGE_SHIFT);
+					start_blk << inode->i_blkbits,
+					(blk_cnt - i_done) << inode->i_blkbits);
 	}
 
 
-	set_extent_defrag(&BTRFS_I(inode)->io_tree, page_start, page_end - 1,
-			  &cached_state);
+	set_extent_defrag(&BTRFS_I(inode)->io_tree, page_start, page_end,
+			&cached_state);
 
 	unlock_extent_cached(&BTRFS_I(inode)->io_tree,
-			     page_start, page_end - 1, &cached_state,
+			     page_start, page_end, &cached_state,
 			     GFP_NOFS);
 
-	for (i = 0; i < i_done; i++) {
+	for (i = 0; i < page_cnt; i++) {
 		clear_page_dirty_for_io(pages[i]);
 		ClearPageChecked(pages[i]);
 		set_page_extent_mapped(pages[i]);
+
+		page_start = page_offset(pages[i]);
+		if (i == 0)
+			page_start += pg_offset;
+
+		if (i == page_cnt - 1) {
+			page_end = page_offset(pages[0]) + pg_offset;
+			page_end += (i_done << inode->i_blkbits) - 1;
+		} else {
+			page_end = page_offset(pages[i]) + PAGE_SIZE - 1;
+		}
+
+		if (root->sectorsize < PAGE_SIZE)
+			set_page_blks_state(pages[i],
+					1 << BLK_STATE_UPTODATE | 1 << BLK_STATE_DIRTY,
+					page_start, page_end);
 		set_page_dirty(pages[i]);
 		unlock_page(pages[i]);
 		put_page(pages[i]);
 	}
 	return i_done;
 out:
-	for (i = 0; i < i_done; i++) {
-		unlock_page(pages[i]);
-		put_page(pages[i]);
+	if (i_done) {
+		page_cnt = DIV_ROUND_UP(pg_offset + (i_done << inode->i_blkbits),
+					PAGE_SIZE);
+		for (i = 0; i < page_cnt; i++) {
+			unlock_page(pages[i]);
+			put_page(pages[i]);
+		}
 	}
+
 	btrfs_delalloc_release_space(inode,
-			start_index << PAGE_SHIFT,
-			page_cnt << PAGE_SHIFT);
+				start_blk << inode->i_blkbits,
+				blk_cnt << inode->i_blkbits);
 	return ret;
-
 }
 
 int btrfs_defrag_file(struct inode *inode, struct file *file,
@@ -1270,19 +1313,24 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct file_ra_state *ra = NULL;
+	unsigned long first_off, last_off;
+	unsigned long first_block, last_block;
 	unsigned long last_index;
 	u64 isize = i_size_read(inode);
 	u64 last_len = 0;
 	u64 skip = 0;
 	u64 defrag_end = 0;
 	u64 newer_off = range->start;
+	u64 start;
+	u64 page_cnt;
 	unsigned long i;
 	unsigned long ra_index = 0;
+	size_t pg_offset;
 	int ret;
 	int defrag_count = 0;
 	int compress_type = BTRFS_COMPRESS_ZLIB;
 	u32 extent_thresh = range->extent_thresh;
-	unsigned long max_cluster = SZ_256K >> PAGE_SHIFT;
+	unsigned long max_cluster = SZ_256K >> inode->i_blkbits;
 	unsigned long cluster = max_cluster;
 	u64 new_align = ~((u64)SZ_128K - 1);
 	struct page **pages = NULL;
@@ -1316,8 +1364,14 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 		ra = &file->f_ra;
 	}
 
-	pages = kmalloc_array(max_cluster, sizeof(struct page *),
-			GFP_NOFS);
+	/*
+	 * In the subpage-blocksize scenario the first of "max_cluster" blocks
+	 * may start at a non-zero page offset. In such cases we need one
+	 * page more than what would be needed when the first block maps to
+	 * the first block of a page.
+	 */
+	page_cnt = (max_cluster >> (PAGE_SHIFT - inode->i_blkbits)) + 1;
+	pages = kmalloc_array(page_cnt, sizeof(struct page *), GFP_NOFS);
 	if (!pages) {
 		ret = -ENOMEM;
 		goto out_ra;
@@ -1325,12 +1379,15 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 
 	/* find the last page to defrag */
 	if (range->start + range->len > range->start) {
-		last_index = min_t(u64, isize - 1,
-			 range->start + range->len - 1) >> PAGE_SHIFT;
+		last_off = min_t(u64, isize - 1, range->start + range->len - 1);
 	} else {
-		last_index = (isize - 1) >> PAGE_SHIFT;
+		last_off = isize - 1;
 	}
 
+	last_off = round_up(last_off, root->sectorsize) - 1;
+	last_block = last_off >> inode->i_blkbits;
+	last_index = last_off >> PAGE_SHIFT;
+
 	if (newer_than) {
 		ret = find_new_extents(root, inode, newer_than,
 				       &newer_off, SZ_64K);
@@ -1340,14 +1397,20 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 			 * we always align our defrag to help keep
 			 * the extents in the file evenly spaced
 			 */
-			i = (newer_off & new_align) >> PAGE_SHIFT;
+			first_off = newer_off & new_align;
 		} else
 			goto out_ra;
 	} else {
-		i = range->start >> PAGE_SHIFT;
+		first_off = range->start;
 	}
+
+	first_off = round_down(first_off, root->sectorsize);
+	first_block = first_off >> inode->i_blkbits;
+	i = first_off >> PAGE_SHIFT;
+	pg_offset = first_off & (PAGE_SIZE - 1);
+
 	if (!max_to_defrag)
-		max_to_defrag = last_index - i + 1;
+		max_to_defrag = last_block - first_block + 1;
 
 	/*
 	 * make writeback starts from i, so the defrag range can be
@@ -1371,39 +1434,50 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 			break;
 		}
 
-		if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT,
-					 extent_thresh, &last_len, &skip,
-					 &defrag_end, range->flags &
-					 BTRFS_DEFRAG_RANGE_COMPRESS)) {
+		start = pg_offset + ((u64)i << PAGE_SHIFT);
+		if (!should_defrag_range(inode, start,
+					extent_thresh, &last_len, &skip,
+					&defrag_end, range->flags &
+					BTRFS_DEFRAG_RANGE_COMPRESS)) {
 			unsigned long next;
 			/*
 			 * the should_defrag function tells us how much to skip
 			 * bump our counter by the suggested amount
 			 */
-			next = DIV_ROUND_UP(skip, PAGE_SIZE);
-			i = max(i + 1, next);
+			next = max(skip, start + root->sectorsize);
+			next >>= inode->i_blkbits;
+
+			first_off = next << inode->i_blkbits;
+			i = first_off >> PAGE_SHIFT;
+			pg_offset = first_off & (PAGE_SIZE - 1);
 			continue;
 		}
 
 		if (!newer_than) {
-			cluster = (PAGE_ALIGN(defrag_end) >>
-				   PAGE_SHIFT) - i;
+			cluster = (defrag_end >> inode->i_blkbits)
+				- (start >> inode->i_blkbits);
+
 			cluster = min(cluster, max_cluster);
 		} else {
 			cluster = max_cluster;
 		}
 
-		if (i + cluster > ra_index) {
+		page_cnt = pg_offset + (cluster << inode->i_blkbits) - 1;
+		page_cnt = DIV_ROUND_UP(page_cnt, PAGE_SIZE);
+		if (i + page_cnt > ra_index) {
 			ra_index = max(i, ra_index);
 			btrfs_force_ra(inode->i_mapping, ra, file, ra_index,
-				       cluster);
-			ra_index += cluster;
+				       page_cnt);
+			ra_index += DIV_ROUND_UP(pg_offset +
+						(cluster << inode->i_blkbits),
+						PAGE_SIZE);
 		}
 
 		inode_lock(inode);
 		if (range->flags & BTRFS_DEFRAG_RANGE_COMPRESS)
 			BTRFS_I(inode)->force_compress = compress_type;
-		ret = cluster_pages_for_defrag(inode, pages, i, cluster);
+		ret = cluster_pages_for_defrag(inode, pages, i, pg_offset,
+					cluster);
 		if (ret < 0) {
 			inode_unlock(inode);
 			goto out_ra;
@@ -1418,29 +1492,29 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 			if (newer_off == (u64)-1)
 				break;
 
-			if (ret > 0)
-				i += ret;
-
 			newer_off = max(newer_off + 1,
-					(u64)i << PAGE_SHIFT);
+					start + (ret << inode->i_blkbits));
 
 			ret = find_new_extents(root, inode, newer_than,
 					       &newer_off, SZ_64K);
 			if (!ret) {
 				range->start = newer_off;
-				i = (newer_off & new_align) >> PAGE_SHIFT;
+				first_off = newer_off & new_align;
 			} else {
 				break;
 			}
 		} else {
 			if (ret > 0) {
-				i += ret;
-				last_len += ret << PAGE_SHIFT;
+				first_off = start + (ret << inode->i_blkbits);
+				last_len += ret << inode->i_blkbits;
 			} else {
-				i++;
+				first_off = start + root->sectorsize;
 				last_len = 0;
 			}
 		}
+		first_off = round_down(first_off, root->sectorsize);
+		i = first_off >> PAGE_SHIFT;
+		pg_offset = first_off & (PAGE_SIZE - 1);
 	}
 
 	if ((range->flags & BTRFS_DEFRAG_RANGE_START_IO)) {
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH V21 15/19] Btrfs: subpage-blocksize: Enable dedupe ioctl
  2016-10-02 13:24 [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (13 preceding siblings ...)
  2016-10-02 13:24 ` [PATCH V21 14/19] Btrfs: subpage-blocksize: Fix file defragmentation code Chandan Rajendra
@ 2016-10-02 13:24 ` Chandan Rajendra
  2016-10-02 13:24 ` [PATCH V21 16/19] Btrfs: subpage-blocksize: btrfs_clone: Flush dirty blocks of a page that do not map the clone range Chandan Rajendra
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

The function implementing the dedupe ioctl,
i.e. btrfs_ioctl_file_extent_same(), returns with an error in the
subpage-blocksize scenario. This was done because Btrfs had no code to
deal with block size < page size. This commit removes the restriction
since "block size < page size" is now supported.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/ioctl.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4077fc1..cf13029 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3321,21 +3321,11 @@ ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen,
 {
 	struct inode *src = file_inode(src_file);
 	struct inode *dst = file_inode(dst_file);
-	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
 	ssize_t res;
 
 	if (olen > BTRFS_MAX_DEDUPE_LEN)
 		olen = BTRFS_MAX_DEDUPE_LEN;
 
-	if (WARN_ON_ONCE(bs < PAGE_SIZE)) {
-		/*
-		 * Btrfs does not support blocksize < page_size. As a
-		 * result, btrfs_cmp_data() won't correctly handle
-		 * this situation without an update.
-		 */
-		return -EINVAL;
-	}
-
 	res = btrfs_extent_same(src, loff, olen, dst, dst_loff);
 	if (res)
 		return res;
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH V21 16/19] Btrfs: subpage-blocksize: btrfs_clone: Flush dirty blocks of a page that do not map the clone range
  2016-10-02 13:24 [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (14 preceding siblings ...)
  2016-10-02 13:24 ` [PATCH V21 15/19] Btrfs: subpage-blocksize: Enable dedupe ioctl Chandan Rajendra
@ 2016-10-02 13:24 ` Chandan Rajendra
  2016-10-02 13:24 ` [PATCH V21 17/19] Btrfs: subpage-blocksize: Make file extent relocate code subpage blocksize aware Chandan Rajendra
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

After cloning the required extents, we truncate all the pages that map
the file range being cloned. In the subpage-blocksize scenario, we
could have dirty blocks before and/or after the clone range in the
leading/trailing pages. Truncating these pages would lead to data
loss. Hence this commit forces such dirty blocks to be flushed to disk
before performing the clone operation.
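
A minimal sketch of the alignment checks added below (values are
arbitrary; the kernel additionally compares against i_size before
flushing): the leading page is flushed up to destoff - 1 and the
trailing page from dest_end + 1 to the end of its page.

#include <stdio.h>

#define PAGE_SIZE       4096ULL

#define round_down(x, a)        ((x) & ~((a) - 1))
#define round_up(x, a)          round_down((x) + (a) - 1, (a))

int main(void)
{
        unsigned long long destoff = 6144;      /* clone destination offset */
        unsigned long long len = 10000;         /* clone length */
        unsigned long long dest_end = destoff + len - 1;

        /* leading page: flush the dirty blocks in front of destoff */
        if (destoff & (PAGE_SIZE - 1))
                printf("flush [%llu, %llu]\n",
                       round_down(destoff, PAGE_SIZE), destoff - 1);

        /* trailing page: flush the dirty blocks behind dest_end */
        if ((dest_end + 1) & (PAGE_SIZE - 1))
                printf("flush [%llu, %llu]\n", dest_end + 1,
                       round_up(dest_end + 1, PAGE_SIZE) - 1);
        return 0;
}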

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/ioctl.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index cf13029..0fdc0a0 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3914,6 +3914,7 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 	int ret;
 	u64 len = olen;
 	u64 bs = root->fs_info->sb->s_blocksize;
+	u64 dest_end;
 	int same_inode = src == inode;
 
 	/*
@@ -3974,6 +3975,21 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 			goto out_unlock;
 	}
 
+	if ((round_down(destoff, PAGE_SIZE) < inode->i_size) &&
+		!IS_ALIGNED(destoff, PAGE_SIZE)) {
+		ret = filemap_write_and_wait_range(inode->i_mapping,
+					round_down(destoff, PAGE_SIZE),
+					destoff - 1);
+	}
+
+	dest_end = destoff + len - 1;
+	if ((dest_end < inode->i_size) &&
+		!IS_ALIGNED(dest_end + 1, PAGE_SIZE)) {
+		ret = filemap_write_and_wait_range(inode->i_mapping,
+					dest_end + 1,
+					round_up(dest_end, PAGE_SIZE));
+	}
+
 	if (destoff > inode->i_size) {
 		ret = btrfs_cont_expand(inode, inode->i_size, destoff);
 		if (ret)
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH V21 17/19] Btrfs: subpage-blocksize: Make file extent relocate code subpage blocksize aware
  2016-10-02 13:24 [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (15 preceding siblings ...)
  2016-10-02 13:24 ` [PATCH V21 16/19] Btrfs: subpage-blocksize: btrfs_clone: Flush dirty blocks of a page that do not map the clone range Chandan Rajendra
@ 2016-10-02 13:24 ` Chandan Rajendra
  2016-10-02 13:24 ` [PATCH V21 18/19] Btrfs: subpage-blocksize: __btrfs_lookup_bio_sums: Set offset when moving to a new bio_vec Chandan Rajendra
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

The file extent relocation code currently assumes the block size to be
the same as PAGE_SIZE. This commit adds code to support the
subpage-blocksize scenario.
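
A standalone sketch of the per-block boundary marking below (made-up
numbers; in the kernel the boundaries come from cluster->boundary[] and
the flag set is EXTENT_BOUNDARY in the io_tree): with
blocksize < PAGE_SIZE only the block that starts a new extent is
flagged, not the whole page.

#include <stdio.h>

#define BLOCKSIZE       4096ULL

int main(void)
{
        unsigned long long page_start = 0, page_end = 65535; /* one 64K page */
        unsigned long long boundary[] = { 8192, 40960 };     /* extent starts */
        unsigned int nr_bound = sizeof(boundary) / sizeof(boundary[0]);
        unsigned long long nr_blocks = (page_end + 1 - page_start) / BLOCKSIZE;
        unsigned long long block_start = page_start;
        unsigned int nr = 0;

        for (unsigned long long i = 0; i < nr_blocks; i++) {
                if (nr < nr_bound && block_start == boundary[nr]) {
                        /* flag just this block as an extent boundary */
                        printf("boundary block at %llu\n", block_start);
                        nr++;
                }
                block_start += BLOCKSIZE;
        }
        return 0;
}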

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/relocation.c | 90 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 71 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index f724fb5..75e51a3 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3114,14 +3114,19 @@ static int relocate_file_extent_cluster(struct inode *inode,
 {
 	u64 page_start;
 	u64 page_end;
+	u64 block_start;
 	u64 offset = BTRFS_I(inode)->index_cnt;
+	u64 blocksize = BTRFS_I(inode)->root->sectorsize;
+	u64 reserved_space;
 	unsigned long index;
 	unsigned long last_index;
 	struct page *page;
 	struct file_ra_state *ra;
 	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
+	int nr_blocks;
 	int nr = 0;
 	int ret = 0;
+	int i;
 
 	if (!cluster->nr)
 		return 0;
@@ -3141,13 +3146,19 @@ static int relocate_file_extent_cluster(struct inode *inode,
 	if (ret)
 		goto out;
 
+	page_start = cluster->start - offset;
+	page_end = min_t(u64, round_down(page_start, PAGE_SIZE) + PAGE_SIZE - 1,
+			cluster->end - offset);
+
 	index = (cluster->start - offset) >> PAGE_SHIFT;
 	last_index = (cluster->end - offset) >> PAGE_SHIFT;
 	while (index <= last_index) {
-		ret = btrfs_delalloc_reserve_metadata(inode, PAGE_SIZE);
+		reserved_space = page_end - page_start + 1;
+
+		ret = btrfs_delalloc_reserve_metadata(inode, reserved_space);
 		if (ret)
 			goto out;
-
+again:
 		page = find_lock_page(inode->i_mapping, index);
 		if (!page) {
 			page_cache_sync_readahead(inode->i_mapping,
@@ -3157,7 +3168,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
 						   mask);
 			if (!page) {
 				btrfs_delalloc_release_metadata(inode,
-							PAGE_SIZE);
+								reserved_space);
 				ret = -ENOMEM;
 				goto out;
 			}
@@ -3169,6 +3180,38 @@ static int relocate_file_extent_cluster(struct inode *inode,
 						   last_index + 1 - index);
 		}
 
+		if (PageDirty(page)) {
+			u64 pg_offset = page_offset(page);
+
+			unlock_page(page);
+			put_page(page);
+			ret = btrfs_fdatawrite_range(inode, pg_offset,
+							page_start - 1);
+			if (ret) {
+				btrfs_delalloc_release_metadata(inode,
+								reserved_space);
+				goto out;
+			}
+
+			ret = filemap_fdatawait_range(inode->i_mapping,
+						pg_offset, page_start - 1);
+			if (ret) {
+				btrfs_delalloc_release_metadata(inode,
+								reserved_space);
+				goto out;
+			}
+
+			goto again;
+		}
+
+		if (BTRFS_I(inode)->root->sectorsize < PAGE_SIZE) {
+			ClearPageUptodate(page);
+			if (page->private)
+				clear_page_blks_state(page,
+						1 << BLK_STATE_UPTODATE,
+						page_start, page_end);
+		}
+
 		if (!PageUptodate(page)) {
 			btrfs_readpage(NULL, page);
 			lock_page(page);
@@ -3176,35 +3219,40 @@ static int relocate_file_extent_cluster(struct inode *inode,
 				unlock_page(page);
 				put_page(page);
 				btrfs_delalloc_release_metadata(inode,
-							PAGE_SIZE);
+								reserved_space);
 				ret = -EIO;
 				goto out;
 			}
 		}
 
-		page_start = page_offset(page);
-		page_end = page_start + PAGE_SIZE - 1;
-
 		lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
 
 		set_page_extent_mapped(page);
 
-		if (nr < cluster->nr &&
-		    page_start + offset == cluster->boundary[nr]) {
-			set_extent_bits(&BTRFS_I(inode)->io_tree,
-					page_start, page_end,
-					EXTENT_BOUNDARY);
-			nr++;
+		nr_blocks = (page_end + 1 - page_start) >> inode->i_blkbits;
+
+		block_start = page_start;
+		for (i = 0; i < nr_blocks; i++) {
+			if (nr < cluster->nr &&
+				block_start + offset == cluster->boundary[nr]) {
+				set_extent_bits(&BTRFS_I(inode)->io_tree,
+						block_start, block_start + blocksize - 1,
+						EXTENT_BOUNDARY);
+				nr++;
+			}
+
+			block_start += blocksize;
 		}
 
 		btrfs_set_extent_delalloc(inode, page_start, page_end, NULL);
-		set_page_blks_state(page,
-				1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
-				page_start, page_end);
-		set_page_dirty(page);
+		if (blocksize < PAGE_SIZE)
+			set_page_blks_state(page,
+					1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
+					page_start, page_end);
 
-		unlock_extent(&BTRFS_I(inode)->io_tree,
-			      page_start, page_end);
+		unlock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
+
+		set_page_dirty(page);
 		unlock_page(page);
 		put_page(page);
 
@@ -3212,6 +3260,10 @@ static int relocate_file_extent_cluster(struct inode *inode,
 		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
 						inode->i_sb);
 		btrfs_throttle(BTRFS_I(inode)->root);
+
+		page_start = page_end + 1;
+		page_end = min_t(u64, page_start + PAGE_SIZE - 1,
+				cluster->end - offset);
 	}
 	WARN_ON(nr != cluster->nr);
 out:
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH V21 18/19] Btrfs: subpage-blocksize: __btrfs_lookup_bio_sums: Set offset when moving to a new bio_vec
  2016-10-02 13:24 [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (16 preceding siblings ...)
  2016-10-02 13:24 ` [PATCH V21 17/19] Btrfs: subpage-blocksize: Make file extent relocate code subpage blocksize aware Chandan Rajendra
@ 2016-10-02 13:24 ` Chandan Rajendra
  2016-10-02 13:24 ` [PATCH V21 19/19] Btrfs: subpage-blocksize: Disable compression Chandan Rajendra
  2017-06-19 10:19 ` [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
  19 siblings, 0 replies; 21+ messages in thread
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

In __btrfs_lookup_bio_sums() we set the file offset value at the
beginning of every iteration of the while loop. This is incorrect since
the blocks mapped by the current bvec->bv_page might not yet have been
completely processed.

This commit fixes the issue by setting the file offset value when we
move to the next bvec of the bio.
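
A simplified userspace model of the fix (the real code derives the
offset from page_offset(bvec->bv_page) + bvec->bv_offset; the numbers
here are invented): the offset advances block by block within a bvec
and is reset only when the walk moves to the next bvec, not on every
loop iteration.

#include <stdio.h>

#define BLOCKSIZE       2048ULL

struct bvec {
        unsigned long long page_off;    /* stand-in for page offset */
        unsigned long long len;
};

int main(void)
{
        struct bvec bvecs[] = { { 0, 4096 }, { 4096, 4096 } };
        unsigned int nr_bvecs = sizeof(bvecs) / sizeof(bvecs[0]);
        unsigned int i = 0;
        unsigned long long offset = bvecs[0].page_off;
        unsigned long long left = bvecs[0].len;

        while (i < nr_bvecs) {
                /* one csum lookup per block: 0, 2048, 4096, 6144 */
                printf("csum lookup at offset %llu\n", offset);
                offset += BLOCKSIZE;
                left -= BLOCKSIZE;
                /* reset the offset only when moving to the next bvec */
                if (!left && ++i < nr_bvecs) {
                        offset = bvecs[i].page_off;
                        left = bvecs[i].len;
                }
        }
        return 0;
}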

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/file-item.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index d0d571c..8fc09c1 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -222,11 +222,11 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
 	disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
 	if (dio)
 		offset = logical_offset;
+	else
+		offset = page_offset(bvec->bv_page) + bvec->bv_offset;
 
 	page_bytes_left = bvec->bv_len;
 	while (bio_index < bio->bi_vcnt) {
-		if (!dio)
-			offset = page_offset(bvec->bv_page) + bvec->bv_offset;
 		count = btrfs_find_ordered_sum(inode, offset, disk_bytenr,
 					       (u32 *)csum, nblocks);
 		if (count)
@@ -301,6 +301,9 @@ found:
 					goto done;
 				}
 				bvec++;
+				if (!dio)
+					offset = page_offset(bvec->bv_page)
+						+ bvec->bv_offset;
 				page_bytes_left = bvec->bv_len;
 			}
 
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH V21 19/19] Btrfs: subpage-blocksize: Disable compression
  2016-10-02 13:24 [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (17 preceding siblings ...)
  2016-10-02 13:24 ` [PATCH V21 18/19] Btrfs: subpage-blocksize: __btrfs_lookup_bio_sums: Set offset when moving to a new bio_vec Chandan Rajendra
@ 2016-10-02 13:24 ` Chandan Rajendra
  2017-06-19 10:19 ` [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
  19 siblings, 0 replies; 21+ messages in thread
From: Chandan Rajendra @ 2016-10-02 13:24 UTC (permalink / raw)
  To: clm, jbacik, dsterba; +Cc: Chandan Rajendra, linux-btrfs

The subpage-blocksize patchset does not yet support compression. Hence,
the kernel might crash when executing compression code in the
subpage-blocksize scenario. This commit disables enabling compression
both at mount time and when the user invokes the
'chattr +c <filename>' command.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/ioctl.c |  8 +++++++-
 fs/btrfs/super.c | 19 +++++++++++++++++++
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0fdc0a0..862d97b 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -322,6 +322,11 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 	} else if (flags & FS_COMPR_FL) {
 		const char *comp;
 
+		if (root->sectorsize < PAGE_SIZE) {
+			ret = -EINVAL;
+			goto out_drop;
+		}
+
 		ip->flags |= BTRFS_INODE_COMPRESS;
 		ip->flags &= ~BTRFS_INODE_NOCOMPRESS;
 
@@ -1342,7 +1347,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 		return -EINVAL;
 
 	if (range->flags & BTRFS_DEFRAG_RANGE_COMPRESS) {
-		if (range->compress_type > BTRFS_COMPRESS_TYPES)
+		if ((range->compress_type > BTRFS_COMPRESS_TYPES)
+			|| (root->sectorsize < PAGE_SIZE))
 			return -EINVAL;
 		if (range->compress_type)
 			compress_type = range->compress_type;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 73a1d8d..3a2e9d7 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -392,6 +392,17 @@ static const match_table_t tokens = {
 	{Opt_err, NULL},
 };
 
+static int can_enable_compression(struct btrfs_fs_info *fs_info)
+{
+	if (btrfs_super_sectorsize(fs_info->super_copy) < PAGE_SIZE) {
+		btrfs_err(fs_info,
+			"Compression is not supported for subpage-blocksize");
+		return 0;
+	}
+
+	return 1;
+}
+
 /*
  * Regular mount options parser.  Everything that is needed only when
  * reading in a new superblock is parsed here.
@@ -502,6 +513,10 @@ int btrfs_parse_options(struct btrfs_root *root, char *options,
 			if (token == Opt_compress ||
 			    token == Opt_compress_force ||
 			    strcmp(args[0].from, "zlib") == 0) {
+				if (!can_enable_compression(info)) {
+					ret = -EINVAL;
+					goto out;
+				}
 				compress_type = "zlib";
 				info->compress_type = BTRFS_COMPRESS_ZLIB;
 				btrfs_set_opt(info->mount_opt, COMPRESS);
@@ -509,6 +524,10 @@ int btrfs_parse_options(struct btrfs_root *root, char *options,
 				btrfs_clear_opt(info->mount_opt, NODATASUM);
 				no_compress = 0;
 			} else if (strcmp(args[0].from, "lzo") == 0) {
+				if (!can_enable_compression(info)) {
+					ret = -EINVAL;
+					goto out;
+				}
 				compress_type = "lzo";
 				info->compress_type = BTRFS_COMPRESS_LZO;
 				btrfs_set_opt(info->mount_opt, COMPRESS);
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size
  2016-10-02 13:24 [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (18 preceding siblings ...)
  2016-10-02 13:24 ` [PATCH V21 19/19] Btrfs: subpage-blocksize: Disable compression Chandan Rajendra
@ 2017-06-19 10:19 ` Chandan Rajendra
  19 siblings, 0 replies; 21+ messages in thread
From: Chandan Rajendra @ 2017-06-19 10:19 UTC (permalink / raw)
  To: clm; +Cc: jbacik, dsterba, linux-btrfs, Nikolay Borisov

On Sunday, October 2, 2016 6:54:09 PM IST Chandan Rajendra wrote:
> Btrfs assumes block size to be the same as the machine's page
> size. This would mean that a Btrfs instance created on a 4k page size
> machine (e.g. x86) will not be mountable on machines with larger page
> sizes (e.g. PPC64/AARCH64). This patchset aims to resolve this
> incompatibility.
> 
> This patchset continues with the work posted previously at
> http://marc.info/?l=linux-btrfs&m=146760691422240&w=2
> 
> This patchset is based on top of Josef's
> 1. Metadata throttling in writeback patches
> 2. Kill the btree inode patches

Hi Josef,

Did you get a chance to work on the above-listed patchsets?

Please let me know when you have a fairly working solution uploaded to
your Linux git tree. I could then use it to rebase my patchset and
start testing the code base.

I have put in a lot of time and effort to get the subpage-blocksize
patchset into its current form, and rebasing and retesting it across
kernel releases also consumes considerable time. It would be great to
have it merged into the mainline kernel soon. Once that is done, I will
work on getting other Btrfs features (scrub, compression, etc.) to work
in the subpage-blocksize scenario.

> The major change in this version is the usage of kmalloc()-ed memory for
> holding metadata blocks whose size is less than the machine's page size. This
> vastly reduces the complexity of extent buffer mangement (Thanks to Josef's
> "Kill the btree inode patches").
> 
> When writing back dirty extent buffers, we currently track the corresponding
> extent buffers using the pointer at page->private. With kmalloc-ed() memory
> this isn't possible and hence we track the first extent buffer under writeback
> using bio->bi_private. Also, For kmalloc-ed() extent buffers this patchset
> currently limits the number of dirty extent buffers in a "write" bio to
> 1. This limit will be removed in a future patchset.
> 
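
To make the bio->bi_private scheme above concrete, here is a minimal
sketch (hypothetical helper names, simplified against the 4.8-era block
API; not the actual patch code):

static void subpage_eb_write_end_io(struct bio *bio)
{
	/* With kmalloc()-ed buffers there is no page->private to
	 * consult, so recover the extent buffer stashed at submit
	 * time. */
	struct extent_buffer *eb = bio->bi_private;

	clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
	smp_mb__after_atomic();
	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
	bio_put(bio);
}

static void subpage_eb_write_submit(struct extent_buffer *eb,
				    struct bio *bio)
{
	/* Only one dirty extent buffer per "write" bio for now, so a
	 * single pointer suffices to find it on completion. */
	bio->bi_private = eb;
	bio->bi_end_io = subpage_eb_write_end_io;
	bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
	submit_bio(bio);
}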

-- 
chandan



end of thread, other threads:[~2017-06-19 10:19 UTC | newest]

Thread overview: 21+ messages
2016-10-02 13:24 [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 01/19] Btrfs: subpage-blocksize: extent_clear_unlock_delalloc: Prevent page from being unlocked more than once Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 02/19] Btrfs: subpage-blocksize: Make sure delalloc range intersects with the locked page's range Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 03/19] Btrfs: subpage-blocksize: Use PG_Uptodate flag to track block uptodate status Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 04/19] Btrfs: Remove extent_io_tree's track_uptodate member Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 05/19] Btrfs: subpage-blocksize: Fix whole page read Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 06/19] Btrfs: subpage-blocksize: Fix whole page write Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 07/19] Btrfs: subpage-blocksize: Use kmalloc()-ed memory to hold metadata blocks Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 08/19] Btrfs: subpage-blocksize: Execute sanity tests on all possible block sizes Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 09/19] Btrfs: subpage-blocksize: Compute free space tree BITMAP_RANGE based on sectorsize Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 10/19] Btrfs: subpage-blocksize: Allow mounting filesystems where sectorsize < PAGE_SIZE Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 11/19] Btrfs: subpage-blocksize: Deal with partial ordered extent allocations Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 12/19] Btrfs: subpage-blocksize: Explicitly track I/O status of blocks of an ordered extent Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 13/19] Btrfs: subpage-blocksize: btrfs_punch_hole: Fix uptodate blocks check Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 14/19] Btrfs: subpage-blocksize: Fix file defragmentation code Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 15/19] Btrfs: subpage-blocksize: Enable dedupe ioctl Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 16/19] Btrfs: subpage-blocksize: btrfs_clone: Flush dirty blocks of a page that do not map the clone range Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 17/19] Btrfs: subpage-blocksize: Make file extent relocate code subpage blocksize aware Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 18/19] Btrfs: subpage-blocksize: __btrfs_lookup_bio_sums: Set offset when moving to a new bio_vec Chandan Rajendra
2016-10-02 13:24 ` [PATCH V21 19/19] Btrfs: subpage-blocksize: Disable compression Chandan Rajendra
2017-06-19 10:19 ` [PATCH V21 00/19] Allow I/O on blocks whose size is less than page size Chandan Rajendra
