* [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size.
@ 2015-06-01 15:22 Chandan Rajendra
  2015-06-01 15:22 ` [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read Chandan Rajendra
                   ` (20 more replies)
  0 siblings, 21 replies; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

Btrfs assumes the block size to be the same as the machine's page
size. This means that a Btrfs instance created on a 4k page size
machine (e.g. x86) will not be mountable on machines with larger page
sizes (e.g. PPC64/AARCH64). This patchset aims to resolve this
incompatibility.

Based on the discussion during the Btrfs BoF meetup at the Vault
conference, this patchset now tracks the Dirty, Uptodate and I/O status
of each block of a page in a bitmap pointed to by page->private.
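
Concretely, patch 01 below attaches a small private structure to each
page; the following is reproduced from its extent_io.h hunk (not new
code). Every block of a page gets one bit per state in the bstate
bitmap, and io_lock serializes the end-I/O updates to it:

	enum blk_state {
		BLK_STATE_UPTODATE,
		BLK_STATE_DIRTY,
		BLK_STATE_IO,
		BLK_NR_STATE,
	};

	struct btrfs_page_private {
		spinlock_t io_lock;
		unsigned long bstate[BLK_STATE_NR_LONGS];
	};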

The patchset is based on the contents of the linux-btrfs/next branch
as of 07 May 2015 (i.e. commit
e082f56313f374d723b0366978ddb062c8fe79ea). I have also added the
"Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode" commit
(from linux-btrfs/for-linus-4.1) to prevent a failure when executing
xfstests' generic/311.

I have reverted the upstream commit "btrfs: fix lockups from
btrfs_clear_path_blocking" (f82c458a2c3ffb94b431fc6ad791a79df1b3713e),
since it led to soft lockups when the "Btrfs: subpagesize-blocksize:
Prevent writes to an extent buffer when PG_writeback flag is set" patch
is applied.

The commits for the Btrfs kernel module can be found at
https://github.com/chandanr/linux/tree/btrfs/subpagesize-blocksize.

The commits for Btrfs-progs can be found at
https://github.com/chandanr/btrfs-progs/tree/btrfs/subpagesize-blocksize.

xfstests' generic tests were run on an x86_64 machine with the
following setups:
1. 4k data block size and 16k metadata block size
   The following tests failed:
   - 018 (Defrag failure)
   - 019 (Fails on linux-btrfs/next branch)
   - 038
   - 224 (Fails on linux-btrfs/next branch)
   - 251 (Fails on linux-btrfs/next branch)
   - 256 (The test never completes, but it does not trigger a soft
          lockup or hung task timeout either)
   - 324 (Defrag failure)
2. 2k data block size and 2k metadata block size
   Apart from the tests listed above, the following tests also failed:
   - 029
   - 030

The following is a list of known TODO items which will be implemented in
future revisions of this patchset:
1. Split the trivial/non-controversial patches and mail them for
   upstream inclusion.
2. Get xfstests' generic tests to run successfully on both 2k and 4k
   block sizes (with both 4k and 64k page sizes).
3. Create separate caches for 'extent buffer head' and 'extent buffer'.
4. Add 'leak list' tracking for 'extent buffer' instances.
5. Rename EXTENT_BUFFER_TREE_REF and EXTENT_BUFFER_IN_TREE to
   EXTENT_BUFFER_HEAD_TREE_REF and EXTENT_BUFFER_HEAD_IN_TREE
   respectively.

Chandan Rajendra (21):
  Btrfs: subpagesize-blocksize: Fix whole page read.
  Btrfs: subpagesize-blocksize: Fix whole page write.
  Btrfs: subpagesize-blocksize: __btrfs_buffered_write: Reserve/release
    extents aligned to block size.
  Btrfs: subpagesize-blocksize: Define extent_buffer_head.
  Btrfs: subpagesize-blocksize: Read tree blocks whose size is <
    PAGE_SIZE.
  Btrfs: subpagesize-blocksize: Write only dirty extent buffers
    belonging to a page
  Btrfs: subpagesize-blocksize: Allow mounting filesystems where
    sectorsize != PAGE_SIZE
  Btrfs: subpagesize-blocksize: Compute and look up csums based on
    sectorsized blocks.
  Btrfs: subpagesize-blocksize: Direct I/O read: Work on sectorsized
    blocks.
  Btrfs: subpagesize-blocksize: fallocate: Work with sectorsized units.
  Btrfs: subpagesize-blocksize: btrfs_page_mkwrite: Reserve space in
    sectorsized units.
  Btrfs: subpagesize-blocksize: Search for all ordered extents that
    could span across a page.
  Btrfs: subpagesize-blocksize: Deal with partial ordered extent
    allocations.
  Btrfs: subpagesize-blocksize: Explicitly Track I/O status of blocks of
    an ordered extent.
  Btrfs: subpagesize-blocksize: Revert commit
    fc4adbff823f76577ece26dcb88bf6f8392dbd43.
  Btrfs: subpagesize-blocksize: Prevent writes to an extent buffer when
    PG_writeback flag is set.
  Btrfs: subpagesize-blocksize: Use (eb->start, seq) as search key for
    tree modification log.
  Btrfs: subpagesize-blocksize: btrfs_submit_direct_hook: Handle
    map_length < bio vector length
  Revert "btrfs: fix lockups from btrfs_clear_path_blocking"
  Btrfs: subpagesize-blocksize: Limit inline extents to
    root->sectorsize.
  Btrfs: subpagesize-blocksize: Fix block size returned to user space.

 fs/btrfs/backref.c           |    2 +-
 fs/btrfs/btrfs_inode.h       |    2 -
 fs/btrfs/ctree.c             |   71 ++-
 fs/btrfs/ctree.h             |    8 +-
 fs/btrfs/disk-io.c           |  169 +++--
 fs/btrfs/disk-io.h           |    3 +
 fs/btrfs/extent-tree.c       |   17 +-
 fs/btrfs/extent_io.c         | 1447 +++++++++++++++++++++++++++++-------------
 fs/btrfs/extent_io.h         |   73 ++-
 fs/btrfs/file-item.c         |   87 ++-
 fs/btrfs/file.c              |  107 +++-
 fs/btrfs/inode.c             |  602 +++++++++++++-----
 fs/btrfs/locking.c           |   24 +-
 fs/btrfs/locking.h           |    2 -
 fs/btrfs/ordered-data.c      |   17 +
 fs/btrfs/ordered-data.h      |    4 +
 fs/btrfs/volumes.c           |    2 +-
 include/trace/events/btrfs.h |    2 +-
 18 files changed, 1842 insertions(+), 797 deletions(-)

-- 
2.1.0



* [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read.
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-06-19  4:45   ` Liu Bo
  2015-06-01 15:22 ` [RFC PATCH V11 02/21] Btrfs: subpagesize-blocksize: Fix whole page write Chandan Rajendra
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

In the subpagesize-blocksize scenario, a page can contain multiple
blocks. This patch handles reading data from files in such cases.

To track the status of the individual blocks of a page, this patch uses
a bitmap pointed to by page->private.
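
For reference, the bitmap layout works as follows: each block of a page
gets BLK_NR_STATE consecutive bits, so the bit recording a given state
of a given block sits at 'blk * BLK_NR_STATE + state'. The helper below
is a hypothetical illustration of that mapping and is not part of this
patch; modify_page_blks_state() in the diff is the authoritative
version:

	/* Illustration only: map a file offset and a block state to its
	 * bit index within btrfs_page_private->bstate. */
	static inline unsigned long blk_state_bit(struct inode *inode,
						  u64 file_offset,
						  enum blk_state state)
	{
		/* index of the block within its page */
		u64 blk = (file_offset & (PAGE_CACHE_SIZE - 1))
				>> inode->i_blkbits;

		return blk * BLK_NR_STATE + state;
	}

With a 2k block size and a 64k page size, a page holds 32 blocks and
hence needs 32 * BLK_NR_STATE = 96 state bits, i.e. two unsigned longs
on a 64-bit machine; this is what the BLK_STATE_NR_LONGS computation in
the extent_io.h hunk evaluates to.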

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/extent_io.c | 301 +++++++++++++++++++++++++++++++++------------------
 fs/btrfs/extent_io.h |  28 ++++-
 fs/btrfs/inode.c     |  13 +--
 3 files changed, 224 insertions(+), 118 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 782f3bc..d37badb 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1325,6 +1325,88 @@ int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
 				cached_state, mask);
 }
 
+static int modify_page_blks_state(struct page *page,
+				unsigned long blk_states,
+				u64 start, u64 end, int set)
+{
+	struct inode *inode = page->mapping->host;
+	unsigned long *bitmap;
+	unsigned long state;
+	u64 nr_blks;
+	u64 blk;
+
+	BUG_ON(!PagePrivate(page));
+
+	bitmap = ((struct btrfs_page_private *)page->private)->bstate;
+
+	blk = (start & (PAGE_CACHE_SIZE - 1)) >> inode->i_blkbits;
+	nr_blks = (end - start + 1) >> inode->i_blkbits;
+
+	while (nr_blks--) {
+		state = find_next_bit(&blk_states, BLK_NR_STATE, 0);
+
+		while (state < BLK_NR_STATE) {
+			if (set)
+				set_bit((blk * BLK_NR_STATE) + state, bitmap);
+			else
+				clear_bit((blk * BLK_NR_STATE) + state, bitmap);
+
+			state = find_next_bit(&blk_states, BLK_NR_STATE,
+					state + 1);
+		}
+
+		++blk;
+	}
+
+	return 0;
+}
+
+int set_page_blks_state(struct page *page, unsigned long blk_states,
+			u64 start, u64 end)
+{
+	return modify_page_blks_state(page, blk_states, start, end, 1);
+}
+
+int clear_page_blks_state(struct page *page, unsigned long blk_states,
+			u64 start, u64 end)
+{
+	return modify_page_blks_state(page, blk_states, start, end, 0);
+}
+
+int test_page_blks_state(struct page *page, enum blk_state blk_state,
+			u64 start, u64 end, int check_all)
+{
+	struct inode *inode = page->mapping->host;
+	unsigned long *bitmap;
+	unsigned long blk;
+	u64 nr_blks;
+	int found = 0;
+
+	BUG_ON(!PagePrivate(page));
+
+	bitmap = ((struct btrfs_page_private *)page->private)->bstate;
+
+	blk = (start & (PAGE_CACHE_SIZE - 1)) >> inode->i_blkbits;
+	nr_blks = (end - start + 1) >> inode->i_blkbits;
+
+	while (nr_blks--) {
+		if (test_bit((blk * BLK_NR_STATE) + blk_state, bitmap)) {
+			if (!check_all)
+				return 1;
+			found = 1;
+		} else if (check_all) {
+			return 0;
+		}
+
+		++blk;
+	}
+
+	if (!check_all && !found)
+		return 0;
+
+	return 1;
+}
+
 /*
  * either insert or lock state struct between start and end use mask to tell
  * us if waiting is desired.
@@ -1982,14 +2064,22 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
  * helper function to set a given page up to date if all the
  * extents in the tree for that page are up to date
  */
-static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
+static void check_page_uptodate(struct page *page)
 {
 	u64 start = page_offset(page);
 	u64 end = start + PAGE_CACHE_SIZE - 1;
-	if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL))
+	if (test_page_blks_state(page, BLK_STATE_UPTODATE, start, end, 1))
 		SetPageUptodate(page);
 }
 
+static int page_read_complete(struct page *page)
+{
+	u64 start = page_offset(page);
+	u64 end = start + PAGE_CACHE_SIZE - 1;
+
+	return !test_page_blks_state(page, BLK_STATE_IO, start, end, 0);
+}
+
 int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
 	int ret;
@@ -2311,7 +2401,9 @@ int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
 	 *	a) deliver good data to the caller
 	 *	b) correct the bad sectors on disk
 	 */
-	if (failed_bio->bi_vcnt > 1) {
+	if ((failed_bio->bi_vcnt > 1)
+		|| (failed_bio->bi_io_vec->bv_len
+			> BTRFS_I(inode)->root->sectorsize)) {
 		/*
 		 * to fulfill b), we need to know the exact failing sectors, as
 		 * we don't want to rewrite any more than the failed ones. thus,
@@ -2520,18 +2612,6 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
 	bio_put(bio);
 }
 
-static void
-endio_readpage_release_extent(struct extent_io_tree *tree, u64 start, u64 len,
-			      int uptodate)
-{
-	struct extent_state *cached = NULL;
-	u64 end = start + len - 1;
-
-	if (uptodate && tree->track_uptodate)
-		set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
-	unlock_extent_cached(tree, start, end, &cached, GFP_ATOMIC);
-}
-
 /*
  * after a readpage IO is done, we need to:
  * clear the uptodate bits on error
@@ -2548,14 +2628,16 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 	struct bio_vec *bvec;
 	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
 	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	struct extent_state *cached = NULL;
+	struct btrfs_page_private *pg_private;
 	struct extent_io_tree *tree;
+	unsigned long flags;
 	u64 offset = 0;
 	u64 start;
 	u64 end;
-	u64 len;
-	u64 extent_start = 0;
-	u64 extent_len = 0;
+	int nr_sectors;
 	int mirror;
+	int unlock;
 	int ret;
 	int i;
 
@@ -2565,54 +2647,31 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 	bio_for_each_segment_all(bvec, bio, i) {
 		struct page *page = bvec->bv_page;
 		struct inode *inode = page->mapping->host;
+		struct btrfs_root *root = BTRFS_I(inode)->root;
 
 		pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, "
 			 "mirror=%u\n", (u64)bio->bi_iter.bi_sector, err,
 			 io_bio->mirror_num);
 		tree = &BTRFS_I(inode)->io_tree;
 
-		/* We always issue full-page reads, but if some block
-		 * in a page fails to read, blk_update_request() will
-		 * advance bv_offset and adjust bv_len to compensate.
-		 * Print a warning for nonzero offsets, and an error
-		 * if they don't add up to a full page.  */
-		if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) {
-			if (bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE)
-				btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
-				   "partial page read in btrfs with offset %u and length %u",
-					bvec->bv_offset, bvec->bv_len);
-			else
-				btrfs_info(BTRFS_I(page->mapping->host)->root->fs_info,
-				   "incomplete page read in btrfs with offset %u and "
-				   "length %u",
-					bvec->bv_offset, bvec->bv_len);
-		}
-
-		start = page_offset(page);
-		end = start + bvec->bv_offset + bvec->bv_len - 1;
-		len = bvec->bv_len;
-
+		start = page_offset(page) + bvec->bv_offset;
+		end = start + bvec->bv_len - 1;
+		nr_sectors = bvec->bv_len >> inode->i_sb->s_blocksize_bits;
 		mirror = io_bio->mirror_num;
-		if (likely(uptodate && tree->ops &&
-			   tree->ops->readpage_end_io_hook)) {
+
+next_block:
+		if (likely(uptodate)) {
 			ret = tree->ops->readpage_end_io_hook(io_bio, offset,
-							      page, start, end,
-							      mirror);
+							page, start,
+							start + root->sectorsize - 1,
+							mirror);
 			if (ret)
 				uptodate = 0;
 			else
 				clean_io_failure(inode, start, page, 0);
 		}
 
-		if (likely(uptodate))
-			goto readpage_ok;
-
-		if (tree->ops && tree->ops->readpage_io_failed_hook) {
-			ret = tree->ops->readpage_io_failed_hook(page, mirror);
-			if (!ret && !err &&
-			    test_bit(BIO_UPTODATE, &bio->bi_flags))
-				uptodate = 1;
-		} else {
+		if (!uptodate) {
 			/*
 			 * The generic bio_readpage_error handles errors the
 			 * following way: If possible, new read requests are
@@ -2623,61 +2682,63 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 			 * can't handle the error it will return -EIO and we
 			 * remain responsible for that page.
 			 */
-			ret = bio_readpage_error(bio, offset, page, start, end,
-						 mirror);
+			ret = bio_readpage_error(bio, offset, page,
+						start, start + root->sectorsize - 1,
+						mirror);
 			if (ret == 0) {
-				uptodate =
-					test_bit(BIO_UPTODATE, &bio->bi_flags);
+				uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
 				if (err)
 					uptodate = 0;
-				offset += len;
-				continue;
+				offset += root->sectorsize;
+				if (--nr_sectors) {
+					start += root->sectorsize;
+					goto next_block;
+				} else {
+					continue;
+				}
 			}
 		}
-readpage_ok:
-		if (likely(uptodate)) {
-			loff_t i_size = i_size_read(inode);
-			pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
-			unsigned off;
-
-			/* Zero out the end if this page straddles i_size */
-			off = i_size & (PAGE_CACHE_SIZE-1);
-			if (page->index == end_index && off)
-				zero_user_segment(page, off, PAGE_CACHE_SIZE);
-			SetPageUptodate(page);
+
+		if (uptodate) {
+			set_page_blks_state(page, 1 << BLK_STATE_UPTODATE, start,
+					start + root->sectorsize - 1);
+			check_page_uptodate(page);
 		} else {
 			ClearPageUptodate(page);
 			SetPageError(page);
 		}
-		unlock_page(page);
-		offset += len;
-
-		if (unlikely(!uptodate)) {
-			if (extent_len) {
-				endio_readpage_release_extent(tree,
-							      extent_start,
-							      extent_len, 1);
-				extent_start = 0;
-				extent_len = 0;
-			}
-			endio_readpage_release_extent(tree, start,
-						      end - start + 1, 0);
-		} else if (!extent_len) {
-			extent_start = start;
-			extent_len = end + 1 - start;
-		} else if (extent_start + extent_len == start) {
-			extent_len += end + 1 - start;
-		} else {
-			endio_readpage_release_extent(tree, extent_start,
-						      extent_len, uptodate);
-			extent_start = start;
-			extent_len = end + 1 - start;
+
+		offset += root->sectorsize;
+
+		if (--nr_sectors) {
+			clear_page_blks_state(page, 1 << BLK_STATE_IO,
+					start, start + root->sectorsize - 1);
+			clear_extent_bit(tree, start, start + root->sectorsize - 1,
+					EXTENT_LOCKED, 1, 0, &cached, GFP_ATOMIC);
+			start += root->sectorsize;
+			goto next_block;
 		}
+
+		WARN_ON(!PagePrivate(page));
+
+		pg_private = (struct btrfs_page_private *)page->private;
+
+		spin_lock_irqsave(&pg_private->io_lock, flags);
+
+		clear_page_blks_state(page, 1 << BLK_STATE_IO,
+				start, start + root->sectorsize - 1);
+
+		unlock = page_read_complete(page);
+
+		spin_unlock_irqrestore(&pg_private->io_lock, flags);
+
+		clear_extent_bit(tree, start, start + root->sectorsize - 1,
+				EXTENT_LOCKED, 1, 0, &cached, GFP_ATOMIC);
+
+		if (unlock)
+			unlock_page(page);
 	}
 
-	if (extent_len)
-		endio_readpage_release_extent(tree, extent_start, extent_len,
-					      uptodate);
 	if (io_bio->end_io)
 		io_bio->end_io(io_bio, err);
 	bio_put(bio);
@@ -2859,13 +2920,36 @@ static void attach_extent_buffer_page(struct extent_buffer *eb,
 	}
 }
 
-void set_page_extent_mapped(struct page *page)
+int set_page_extent_mapped(struct page *page)
 {
+	struct btrfs_page_private *pg_private;
+
 	if (!PagePrivate(page)) {
+		pg_private = kzalloc(sizeof(*pg_private), GFP_NOFS);
+		if (!pg_private)
+			return -ENOMEM;
+
+		spin_lock_init(&pg_private->io_lock);
+
 		SetPagePrivate(page);
 		page_cache_get(page);
-		set_page_private(page, EXTENT_PAGE_PRIVATE);
+
+		set_page_private(page, (unsigned long)pg_private);
+	}
+
+	return 0;
+}
+
+int clear_page_extent_mapped(struct page *page)
+{
+	if (PagePrivate(page)) {
+		kfree((struct btrfs_page_private *)(page->private));
+		ClearPagePrivate(page);
+		set_page_private(page, 0);
+		page_cache_release(page);
 	}
+
+	return 0;
 }
 
 static struct extent_map *
@@ -2909,6 +2993,7 @@ static int __do_readpage(struct extent_io_tree *tree,
 			 unsigned long *bio_flags, int rw)
 {
 	struct inode *inode = page->mapping->host;
+	struct extent_state *cached = NULL;
 	u64 start = page_offset(page);
 	u64 page_end = start + PAGE_CACHE_SIZE - 1;
 	u64 end;
@@ -2964,8 +3049,8 @@ static int __do_readpage(struct extent_io_tree *tree,
 			memset(userpage + pg_offset, 0, iosize);
 			flush_dcache_page(page);
 			kunmap_atomic(userpage);
-			set_extent_uptodate(tree, cur, cur + iosize - 1,
-					    &cached, GFP_NOFS);
+			set_page_blks_state(page, 1 << BLK_STATE_UPTODATE, cur,
+					cur + iosize - 1);
 			if (!parent_locked)
 				unlock_extent_cached(tree, cur,
 						     cur + iosize - 1,
@@ -3017,8 +3102,8 @@ static int __do_readpage(struct extent_io_tree *tree,
 			flush_dcache_page(page);
 			kunmap_atomic(userpage);
 
-			set_extent_uptodate(tree, cur, cur + iosize - 1,
-					    &cached, GFP_NOFS);
+			set_page_blks_state(page, 1 << BLK_STATE_UPTODATE, cur,
+					cur + iosize - 1);
 			unlock_extent_cached(tree, cur, cur + iosize - 1,
 			                     &cached, GFP_NOFS);
 			cur = cur + iosize;
@@ -3026,9 +3111,9 @@ static int __do_readpage(struct extent_io_tree *tree,
 			continue;
 		}
 		/* the get_extent function already copied into the page */
-		if (test_range_bit(tree, cur, cur_end,
-				   EXTENT_UPTODATE, 1, NULL)) {
-			check_page_uptodate(tree, page);
+		if (test_page_blks_state(page, BLK_STATE_UPTODATE, cur,
+						cur_end, 1)) {
+			check_page_uptodate(page);
 			if (!parent_locked)
 				unlock_extent(tree, cur, cur + iosize - 1);
 			cur = cur + iosize;
@@ -3048,6 +3133,9 @@ static int __do_readpage(struct extent_io_tree *tree,
 		}
 
 		pnr -= page->index;
+
+		set_page_blks_state(page, 1 << BLK_STATE_IO, cur,
+				cur + iosize - 1);
 		ret = submit_extent_page(rw, tree, page,
 					 sector, disk_io_size, pg_offset,
 					 bdev, bio, pnr,
@@ -3059,8 +3147,11 @@ static int __do_readpage(struct extent_io_tree *tree,
 			*bio_flags = this_bio_flag;
 		} else {
 			SetPageError(page);
+			clear_page_blks_state(page, 1 << BLK_STATE_IO, cur,
+					cur + iosize - 1);
 			if (!parent_locked)
-				unlock_extent(tree, cur, cur + iosize - 1);
+				unlock_extent_cached(tree, cur, cur + iosize - 1,
+						&cached, GFP_NOFS);
 		}
 		cur = cur + iosize;
 		pg_offset += iosize;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index c668f36..541b40a 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -51,11 +51,22 @@
 #define PAGE_SET_PRIVATE2	(1 << 4)
 #define PAGE_SET_ERROR		(1 << 5)
 
+enum blk_state {
+	BLK_STATE_UPTODATE,
+	BLK_STATE_DIRTY,
+	BLK_STATE_IO,
+	BLK_NR_STATE,
+};
+
 /*
- * page->private values.  Every page that is controlled by the extent
- * map has page->private set to one.
- */
-#define EXTENT_PAGE_PRIVATE 1
+ * The maximum number of blocks per page (i.e. 32) occurs when using 2k
+ * as the block size and 64k as the page size.
+ */
+#define BLK_STATE_NR_LONGS DIV_ROUND_UP(BLK_NR_STATE * 32, BITS_PER_LONG)
+struct btrfs_page_private {
+	spinlock_t io_lock;
+	unsigned long bstate[BLK_STATE_NR_LONGS];
+};
 
 struct extent_state;
 struct btrfs_root;
@@ -259,7 +270,14 @@ int extent_readpages(struct extent_io_tree *tree,
 int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		__u64 start, __u64 len, get_extent_t *get_extent);
 int get_state_private(struct extent_io_tree *tree, u64 start, u64 *private);
-void set_page_extent_mapped(struct page *page);
+int set_page_extent_mapped(struct page *page);
+int clear_page_extent_mapped(struct page *page);
+int set_page_blks_state(struct page *page, unsigned long blk_states,
+			u64 start, u64 end);
+int clear_page_blks_state(struct page *page, unsigned long blk_states,
+			u64 start, u64 end);
+int test_page_blks_state(struct page *page, enum blk_state blk_state,
+			u64 start, u64 end, int check_all);
 
 struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 					  u64 start);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0020b56..8262f83 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6622,7 +6622,6 @@ struct extent_map *btrfs_get_extent(struct inode *inode, struct page *page,
 	struct btrfs_key found_key;
 	struct extent_map *em = NULL;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
-	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct btrfs_trans_handle *trans = NULL;
 	const bool new_inline = !page || create;
 
@@ -6800,8 +6799,8 @@ next:
 			kunmap(page);
 			btrfs_mark_buffer_dirty(leaf);
 		}
-		set_extent_uptodate(io_tree, em->start,
-				    extent_map_end(em) - 1, NULL, GFP_NOFS);
+		set_page_blks_state(page, 1 << BLK_STATE_UPTODATE, em->start,
+				extent_map_end(em) - 1);
 		goto insert;
 	}
 not_found:
@@ -8392,11 +8391,9 @@ static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
 	map = &BTRFS_I(page->mapping->host)->extent_tree;
 	ret = try_release_extent_mapping(map, tree, page, gfp_flags);
-	if (ret == 1) {
-		ClearPagePrivate(page);
-		set_page_private(page, 0);
-		page_cache_release(page);
-	}
+	if (ret == 1)
+		clear_page_extent_mapped(page);
+
 	return ret;
 }
 
-- 
2.1.0



* [RFC PATCH V11 02/21] Btrfs: subpagesize-blocksize: Fix whole page write.
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
  2015-06-01 15:22 ` [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-06-26  9:50   ` Liu Bo
  2015-06-01 15:22 ` [RFC PATCH V11 03/21] Btrfs: subpagesize-blocksize: __btrfs_buffered_write: Reserve/release extents aligned to block size Chandan Rajendra
                   ` (18 subsequent siblings)
  20 siblings, 1 reply; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

In the subpagesize-blocksize scenario, a page can contain multiple
blocks. This patch handles writing data to files in such cases.

Also, when setting EXTENT_DELALLOC we no longer set the EXTENT_UPTODATE
bit on the extent_io_tree, since uptodate status is now tracked by the
bitmap pointed to by page->private.
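
In short, and as a condensed view of the hunks below (not new code):
blocks are marked Dirty and Uptodate in the bitmap when a page is
dirtied, and write completion clears the per-block I/O state under
btrfs_page_private->io_lock, ending page writeback only once every
block of the page has finished:

	/* On dirtying, in btrfs_dirty_pages(): */
	set_page_blks_state(p,
			1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
			start, end);
	set_page_dirty(p);

	/* On write completion, per block, in end_bio_extent_writepage(): */
	spin_lock_irqsave(&pg_private->io_lock, flags);
	clear_page_blks_state(page, 1 << BLK_STATE_IO, start, end);
	clear_writeback = page_write_complete(page);
	spin_unlock_irqrestore(&pg_private->io_lock, flags);

	if (clear_writeback)
		end_page_writeback(page);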

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/extent_io.c | 141 +++++++++++++++++++++++----------------------------
 fs/btrfs/file.c      |  16 ++++++
 fs/btrfs/inode.c     |  58 ++++++++++++++++-----
 3 files changed, 125 insertions(+), 90 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d37badb..3736ab5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1283,9 +1283,8 @@ int clear_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 int set_extent_delalloc(struct extent_io_tree *tree, u64 start, u64 end,
 			struct extent_state **cached_state, gfp_t mask)
 {
-	return set_extent_bit(tree, start, end,
-			      EXTENT_DELALLOC | EXTENT_UPTODATE,
-			      NULL, cached_state, mask);
+	return set_extent_bit(tree, start, end, EXTENT_DELALLOC,
+			NULL, cached_state, mask);
 }
 
 int set_extent_defrag(struct extent_io_tree *tree, u64 start, u64 end,
@@ -1498,25 +1497,6 @@ int extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end)
 	return 0;
 }
 
-/*
- * helper function to set both pages and extents in the tree writeback
- */
-static int set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
-{
-	unsigned long index = start >> PAGE_CACHE_SHIFT;
-	unsigned long end_index = end >> PAGE_CACHE_SHIFT;
-	struct page *page;
-
-	while (index <= end_index) {
-		page = find_get_page(tree->mapping, index);
-		BUG_ON(!page); /* Pages should be in the extent_io_tree */
-		set_page_writeback(page);
-		page_cache_release(page);
-		index++;
-	}
-	return 0;
-}
-
 /* find the first state struct with 'bits' set after 'start', and
  * return it.  tree->lock must be held.  NULL will returned if
  * nothing was found after 'start'
@@ -2080,6 +2060,14 @@ static int page_read_complete(struct page *page)
 	return !test_page_blks_state(page, BLK_STATE_IO, start, end, 0);
 }
 
+static int page_write_complete(struct page *page)
+{
+	u64 start = page_offset(page);
+	u64 end = start + PAGE_CACHE_SIZE - 1;
+
+	return !test_page_blks_state(page, BLK_STATE_IO, start, end, 0);
+}
+
 int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
 	int ret;
@@ -2575,38 +2563,37 @@ int end_extent_writepage(struct page *page, int err, u64 start, u64 end)
  */
 static void end_bio_extent_writepage(struct bio *bio, int err)
 {
+	struct btrfs_page_private *pg_private;
 	struct bio_vec *bvec;
+	unsigned long flags;
 	u64 start;
 	u64 end;
+	int clear_writeback;
 	int i;
 
 	bio_for_each_segment_all(bvec, bio, i) {
 		struct page *page = bvec->bv_page;
 
-		/* We always issue full-page reads, but if some block
-		 * in a page fails to read, blk_update_request() will
-		 * advance bv_offset and adjust bv_len to compensate.
-		 * Print a warning for nonzero offsets, and an error
-		 * if they don't add up to a full page.  */
-		if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) {
-			if (bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE)
-				btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
-				   "partial page write in btrfs with offset %u and length %u",
-					bvec->bv_offset, bvec->bv_len);
-			else
-				btrfs_info(BTRFS_I(page->mapping->host)->root->fs_info,
-				   "incomplete page write in btrfs with offset %u and "
-				   "length %u",
-					bvec->bv_offset, bvec->bv_len);
-		}
+		start = page_offset(page) + bvec->bv_offset;
+		end = start + bvec->bv_len - 1;
 
-		start = page_offset(page);
-		end = start + bvec->bv_offset + bvec->bv_len - 1;
+		pg_private = (struct btrfs_page_private *)page->private;
+
+		spin_lock_irqsave(&pg_private->io_lock, flags);
 
-		if (end_extent_writepage(page, err, start, end))
+		if (end_extent_writepage(page, err, start, end)) {
+			spin_unlock_irqrestore(&pg_private->io_lock, flags);
 			continue;
+		}
 
-		end_page_writeback(page);
+		clear_page_blks_state(page, 1 << BLK_STATE_IO, start, end);
+
+		clear_writeback = page_write_complete(page);
+
+		spin_unlock_irqrestore(&pg_private->io_lock, flags);
+
+		if (clear_writeback)
+			end_page_writeback(page);
 	}
 
 	bio_put(bio);
@@ -3417,10 +3404,9 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 	u64 block_start;
 	u64 iosize;
 	sector_t sector;
-	struct extent_state *cached_state = NULL;
 	struct extent_map *em;
 	struct block_device *bdev;
-	size_t pg_offset = 0;
+	size_t pg_offset;
 	size_t blocksize;
 	int ret = 0;
 	int nr = 0;
@@ -3467,8 +3453,16 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 							 page_end, NULL, 1);
 			break;
 		}
-		em = epd->get_extent(inode, page, pg_offset, cur,
-				     end - cur + 1, 1);
+
+		pg_offset = cur & (PAGE_CACHE_SIZE - 1);
+
+		if (!test_page_blks_state(page, BLK_STATE_DIRTY, cur,
+						cur + blocksize - 1, 1)) {
+			cur += blocksize;
+			continue;
+		}
+
+		em = epd->get_extent(inode, page, pg_offset, cur, blocksize, 1);
 		if (IS_ERR_OR_NULL(em)) {
 			SetPageError(page);
 			ret = PTR_ERR_OR_ZERO(em);
@@ -3479,7 +3473,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 		em_end = extent_map_end(em);
 		BUG_ON(em_end <= cur);
 		BUG_ON(end < cur);
-		iosize = min(em_end - cur, end - cur + 1);
+		iosize = min_t(u64, em_end - cur, blocksize);
 		iosize = ALIGN(iosize, blocksize);
 		sector = (em->block_start + extent_offset) >> 9;
 		bdev = em->bdev;
@@ -3488,32 +3482,20 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 		free_extent_map(em);
 		em = NULL;
 
-		/*
-		 * compressed and inline extents are written through other
-		 * paths in the FS
-		 */
-		if (compressed || block_start == EXTENT_MAP_HOLE ||
-		    block_start == EXTENT_MAP_INLINE) {
-			/*
-			 * end_io notification does not happen here for
-			 * compressed extents
-			 */
-			if (!compressed && tree->ops &&
-			    tree->ops->writepage_end_io_hook)
-				tree->ops->writepage_end_io_hook(page, cur,
-							 cur + iosize - 1,
-							 NULL, 1);
-			else if (compressed) {
-				/* we don't want to end_page_writeback on
-				 * a compressed extent.  this happens
-				 * elsewhere
-				 */
-				nr++;
-			}
+		BUG_ON(compressed);
+		BUG_ON(block_start == EXTENT_MAP_INLINE);
 
-			cur += iosize;
-			pg_offset += iosize;
-			continue;
+		if (block_start == EXTENT_MAP_HOLE) {
+			if (test_page_blks_state(page, BLK_STATE_UPTODATE, cur,
+							cur + iosize - 1, 1)) {
+				clear_page_blks_state(page,
+						1 << BLK_STATE_DIRTY, cur,
+						cur + iosize - 1);
+				cur += iosize;
+				continue;
+			} else {
+				BUG();
+			}
 		}
 
 		if (tree->ops && tree->ops->writepage_io_hook) {
@@ -3527,7 +3509,13 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 		} else {
 			unsigned long max_nr = (i_size >> PAGE_CACHE_SHIFT) + 1;
 
-			set_range_writeback(tree, cur, cur + iosize - 1);
+			clear_page_blks_state(page, 1 << BLK_STATE_DIRTY, cur,
+					cur + iosize - 1);
+			set_page_writeback(page);
+
+			set_page_blks_state(page, 1 << BLK_STATE_IO, cur,
+					cur + iosize - 1);
+
 			if (!PageWriteback(page)) {
 				btrfs_err(BTRFS_I(inode)->root->fs_info,
 					   "page %lu not writeback, cur %llu end %llu",
@@ -3542,17 +3530,14 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 			if (ret)
 				SetPageError(page);
 		}
-		cur = cur + iosize;
-		pg_offset += iosize;
+
+		cur += iosize;
 		nr++;
 	}
 done:
 	*nr_ret = nr;
 
 done_unlocked:
-
-	/* drop our reference on any cached states */
-	free_extent_state(cached_state);
 	return ret;
 }
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 23b6e03..cbe6381 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -495,6 +495,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 	u64 num_bytes;
 	u64 start_pos;
 	u64 end_of_last_block;
+	u64 start;
+	u64 end;
+	u64 page_end;
 	u64 end_pos = pos + write_bytes;
 	loff_t isize = i_size_read(inode);
 
@@ -507,11 +510,24 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 	if (err)
 		return err;
 
+	start = start_pos;
+
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = pages[i];
 		SetPageUptodate(p);
 		ClearPageChecked(p);
+
+		end = page_end = page_offset(p) + PAGE_CACHE_SIZE - 1;
+
+		if (i == num_pages - 1)
+			end = min_t(u64, page_end, end_of_last_block);
+
+		set_page_blks_state(p,
+				1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
+				start, end);
 		set_page_dirty(p);
+
+		start = page_end + 1;
 	}
 
 	/*
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8262f83..ac6a3f3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1995,6 +1995,11 @@ again:
 	 }
 
 	btrfs_set_extent_delalloc(inode, page_start, page_end, &cached_state);
+
+	set_page_blks_state(page,
+			1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
+			page_start, page_end);
+
 	ClearPageChecked(page);
 	set_page_dirty(page);
 out:
@@ -2984,26 +2989,48 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
 	struct btrfs_ordered_extent *ordered_extent = NULL;
 	struct btrfs_workqueue *wq;
 	btrfs_work_func_t func;
+	u64 ordered_start, ordered_end;
+	int done;
 
 	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
 
 	ClearPagePrivate2(page);
-	if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
-					    end - start + 1, uptodate))
-		return 0;
+loop:
+	ordered_extent = btrfs_lookup_ordered_range(inode, start,
+						end - start + 1);
+	if (!ordered_extent)
+		goto out;
 
-	if (btrfs_is_free_space_inode(inode)) {
-		wq = root->fs_info->endio_freespace_worker;
-		func = btrfs_freespace_write_helper;
-	} else {
-		wq = root->fs_info->endio_write_workers;
-		func = btrfs_endio_write_helper;
+	ordered_start = max_t(u64, start, ordered_extent->file_offset);
+	ordered_end = min_t(u64, end,
+			ordered_extent->file_offset + ordered_extent->len - 1);
+
+	done = btrfs_dec_test_ordered_pending(inode, &ordered_extent,
+					ordered_start,
+					ordered_end - ordered_start + 1,
+					uptodate);
+	if (done) {
+		if (btrfs_is_free_space_inode(inode)) {
+			wq = root->fs_info->endio_freespace_worker;
+			func = btrfs_freespace_write_helper;
+		} else {
+			wq = root->fs_info->endio_write_workers;
+			func = btrfs_endio_write_helper;
+		}
+
+		btrfs_init_work(&ordered_extent->work, func,
+				finish_ordered_fn, NULL, NULL);
+		btrfs_queue_work(wq, &ordered_extent->work);
 	}
 
-	btrfs_init_work(&ordered_extent->work, func, finish_ordered_fn, NULL,
-			NULL);
-	btrfs_queue_work(wq, &ordered_extent->work);
+	btrfs_put_ordered_extent(ordered_extent);
+
+	start = ordered_end + 1;
+
+	if (start < end)
+		goto loop;
 
+out:
 	return 0;
 }
 
@@ -4601,6 +4628,9 @@ again:
 		goto out_unlock;
 	}
 
+	set_page_blks_state(page, 1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
+			page_start, page_end);
+
 	if (offset != PAGE_CACHE_SIZE) {
 		if (!len)
 			len = PAGE_CACHE_SIZE - offset;
@@ -8590,6 +8620,10 @@ again:
 		ret = VM_FAULT_SIGBUS;
 		goto out_unlock;
 	}
+
+	set_page_blks_state(page, 1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
+			page_start, end);
+
 	ret = 0;
 
 	/* page is wholly or partially inside EOF */
-- 
2.1.0



* [RFC PATCH V11 03/21] Btrfs: subpagesize-blocksize: __btrfs_buffered_write: Reserve/release extents aligned to block size.
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
  2015-06-01 15:22 ` [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read Chandan Rajendra
  2015-06-01 15:22 ` [RFC PATCH V11 02/21] Btrfs: subpagesize-blocksize: Fix whole page write Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-06-01 15:22 ` [RFC PATCH V11 04/21] Btrfs: subpagesize-blocksize: Define extent_buffer_head Chandan Rajendra
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

Currently, the code reserves/releases extents in multiples of
PAGE_CACHE_SIZE units. Fix this to reserve/release space in units
aligned to the filesystem block size (i.e. root->sectorsize) instead.
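
As a worked example, assume a 2k block size on a 64k page size machine
(the numbers are illustrative, not from a test run). A 100-byte write
at pos = 0 now reserves a single block instead of a whole page:

	sector_offset = pos & (root->sectorsize - 1);	/* = 0 */
	reserve_bytes = ALIGN(write_bytes + sector_offset,
			      root->sectorsize);	/* = ALIGN(100, 2048)
							   = 2048 */

	/* versus the old, page-granular computation: */
	reserve_bytes = num_pages << PAGE_CACHE_SHIFT;	/* = 1 << 16 = 65536 */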

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/file.c | 40 ++++++++++++++++++++++++++++------------
 1 file changed, 28 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index cbe6381..287192fb 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1378,18 +1378,21 @@ fail:
 static noinline int
 lock_and_cleanup_extent_if_need(struct inode *inode, struct page **pages,
 				size_t num_pages, loff_t pos,
+				size_t write_bytes,
 				u64 *lockstart, u64 *lockend,
 				struct extent_state **cached_state)
 {
+	struct btrfs_root *root = BTRFS_I(inode)->root;
 	u64 start_pos;
 	u64 last_pos;
 	int i;
 	int ret = 0;
 
-	start_pos = pos & ~((u64)PAGE_CACHE_SIZE - 1);
-	last_pos = start_pos + ((u64)num_pages << PAGE_CACHE_SHIFT) - 1;
+	start_pos = pos & ~((u64)root->sectorsize - 1);
+	last_pos = start_pos
+		+ ALIGN(pos + write_bytes - start_pos, root->sectorsize) - 1;
 
 	if (start_pos < inode->i_size) {
 		struct btrfs_ordered_extent *ordered;
 		lock_extent_bits(&BTRFS_I(inode)->io_tree,
 				 start_pos, last_pos, 0, cached_state);
@@ -1505,6 +1508,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 
 	while (iov_iter_count(i) > 0) {
 		size_t offset = pos & (PAGE_CACHE_SIZE - 1);
+		size_t sector_offset;
 		size_t write_bytes = min(iov_iter_count(i),
 					 nrptrs * (size_t)PAGE_CACHE_SIZE -
 					 offset);
@@ -1513,6 +1517,8 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 		size_t reserve_bytes;
 		size_t dirty_pages;
 		size_t copied;
+		size_t dirty_sectors;
+		size_t num_sectors;
 
 		WARN_ON(num_pages > nrptrs);
 
@@ -1525,8 +1531,11 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 			break;
 		}
 
-		reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
+		sector_offset = pos & (root->sectorsize - 1);
+		reserve_bytes = ALIGN(write_bytes + sector_offset, root->sectorsize);
+
 		ret = btrfs_check_data_free_space(inode, reserve_bytes, write_bytes);
+
 		if (ret == -ENOSPC &&
 		    (BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
 					      BTRFS_INODE_PREALLOC))) {
@@ -1539,7 +1548,9 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 				 */
 				num_pages = DIV_ROUND_UP(write_bytes + offset,
 							 PAGE_CACHE_SIZE);
-				reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
+				reserve_bytes = ALIGN(write_bytes + sector_offset,
+						root->sectorsize);
+
 				ret = 0;
 			} else {
 				ret = -ENOSPC;
@@ -1574,8 +1585,8 @@ again:
 			break;
 
 		ret = lock_and_cleanup_extent_if_need(inode, pages, num_pages,
-						      pos, &lockstart, &lockend,
-						      &cached_state);
+						pos, write_bytes, &lockstart, &lockend,
+						&cached_state);
 		if (ret < 0) {
 			if (ret == -EAGAIN)
 				goto again;
@@ -1611,9 +1622,14 @@ again:
 		 * we still have an outstanding extent for the chunk we actually
 		 * managed to copy.
 		 */
-		if (num_pages > dirty_pages) {
-			release_bytes = (num_pages - dirty_pages) <<
-				PAGE_CACHE_SHIFT;
+		num_sectors = reserve_bytes >> inode->i_blkbits;
+		dirty_sectors = round_up(copied + sector_offset,
+					root->sectorsize);
+		dirty_sectors >>= inode->i_blkbits;
+
+		if (num_sectors > dirty_sectors) {
+			release_bytes = (write_bytes - copied)
+				& ~((u64)root->sectorsize - 1);
 			if (copied > 0) {
 				spin_lock(&BTRFS_I(inode)->lock);
 				BTRFS_I(inode)->outstanding_extents++;
@@ -1627,7 +1643,7 @@ again:
 							     release_bytes);
 		}
 
-		release_bytes = dirty_pages << PAGE_CACHE_SHIFT;
+		release_bytes = ALIGN(copied + sector_offset, root->sectorsize);
 
 		if (copied > 0)
 			ret = btrfs_dirty_pages(root, inode, pages,
@@ -1649,7 +1665,7 @@ again:
 		if (only_release_metadata && copied > 0) {
 			lockstart = round_down(pos, root->sectorsize);
 			lockend = lockstart +
-				(dirty_pages << PAGE_CACHE_SHIFT) - 1;
+				ALIGN(copied, root->sectorsize) - 1;
 
 			set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
 				       lockend, EXTENT_NORESERVE, NULL,
-- 
2.1.0



* [RFC PATCH V11 04/21] Btrfs: subpagesize-blocksize: Define extent_buffer_head.
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (2 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 03/21] Btrfs: subpagesize-blocksize: __btrfs_buffered_write: Reserve/release extents aligned to block size Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-07-01 14:33   ` Liu Bo
  2015-06-01 15:22 ` [RFC PATCH V11 05/21] Btrfs: subpagesize-blocksize: Read tree blocks whose size is < PAGE_SIZE Chandan Rajendra
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

In order to handle multiple extent buffers per page, we first need a
way to manage all the extent buffers that are attached to a page.

This patch creates a new data structure, 'struct extent_buffer_head',
and moves the fields that are common to all the extent buffers of a
page from 'struct extent_buffer' to 'struct extent_buffer_head'.

Also, this patch moves the EXTENT_BUFFER_TREE_REF, EXTENT_BUFFER_DUMMY
and EXTENT_BUFFER_IN_TREE flags from extent_buffer->bflags to
extent_buffer_head->bflags.
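
Roughly, the resulting layout looks like this. This is a simplified
sketch pieced together from the hunks in this patch; the exact
definitions live in the extent_io.h changes, which this posting only
summarizes, and some members are omitted:

	struct extent_buffer_head {
		unsigned long bflags;	/* TREE_REF, DUMMY, IN_TREE */
		spinlock_t refs_lock;
		atomic_t refs;		/* shared by all ebs of the page */
		struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
		struct btrfs_fs_info *fs_info;
		struct list_head leak_list;
		struct extent_buffer eb;	/* first buffer of the page */
	};

	struct extent_buffer {
		u64 start;
		unsigned long len;
		unsigned long ebflags;	/* UPTODATE, DIRTY, WRITEBACK, ... */
		struct extent_buffer_head *ebh;
		struct extent_buffer *eb_next;	/* next buffer in the page */
		atomic_t io_pages;
		int read_mirror;
		/* ... */
	};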

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/backref.c           |   2 +-
 fs/btrfs/ctree.c             |   2 +-
 fs/btrfs/ctree.h             |   6 +-
 fs/btrfs/disk-io.c           |  73 ++++---
 fs/btrfs/extent-tree.c       |   6 +-
 fs/btrfs/extent_io.c         | 469 ++++++++++++++++++++++++++++---------------
 fs/btrfs/extent_io.h         |  39 +++-
 fs/btrfs/volumes.c           |   2 +-
 include/trace/events/btrfs.h |   2 +-
 9 files changed, 392 insertions(+), 209 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 9de772e..b4d911c 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1372,7 +1372,7 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
 		eb = path->nodes[0];
 		/* make sure we can use eb after releasing the path */
 		if (eb != eb_in) {
-			atomic_inc(&eb->refs);
+			atomic_inc(&eb_head(eb)->refs);
 			btrfs_tree_read_lock(eb);
 			btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
 		}
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 0f11ebc..b28f14d 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -159,7 +159,7 @@ struct extent_buffer *btrfs_root_node(struct btrfs_root *root)
 		 * the inc_not_zero dance and if it doesn't work then
 		 * synchronize_rcu and try again.
 		 */
-		if (atomic_inc_not_zero(&eb->refs)) {
+		if (atomic_inc_not_zero(&eb_head(eb)->refs)) {
 			rcu_read_unlock();
 			break;
 		}
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6f364e1..2bc3e0e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2320,14 +2320,16 @@ static inline void btrfs_set_token_##name(struct extent_buffer *eb,	\
 #define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits)		\
 static inline u##bits btrfs_##name(struct extent_buffer *eb)		\
 {									\
-	type *p = page_address(eb->pages[0]);				\
+	type *p = page_address(eb_head(eb)->pages[0]) +			\
+				(eb->start & (PAGE_CACHE_SIZE -1));	\
 	u##bits res = le##bits##_to_cpu(p->member);			\
 	return res;							\
 }									\
 static inline void btrfs_set_##name(struct extent_buffer *eb,		\
 				    u##bits val)			\
 {									\
-	type *p = page_address(eb->pages[0]);				\
+	type *p = page_address(eb_head(eb)->pages[0]) +			\
+				(eb->start & (PAGE_CACHE_SIZE -1));	\
 	p->member = cpu_to_le##bits(val);				\
 }
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2ef9a4b..51fe2ec 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -368,9 +368,10 @@ static int verify_parent_transid(struct extent_io_tree *io_tree,
 		ret = 0;
 		goto out;
 	}
+
 	printk_ratelimited(KERN_ERR
 	    "BTRFS (device %s): parent transid verify failed on %llu wanted %llu found %llu\n",
-			eb->fs_info->sb->s_id, eb->start,
+			eb_head(eb)->fs_info->sb->s_id, eb->start,
 			parent_transid, btrfs_header_generation(eb));
 	ret = 1;
 
@@ -445,7 +446,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 	int mirror_num = 0;
 	int failed_mirror = 0;
 
-	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags);
 	io_tree = &BTRFS_I(root->fs_info->btree_inode)->io_tree;
 	while (1) {
 		ret = read_extent_buffer_pages(io_tree, eb, start,
@@ -464,7 +465,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 		 * there is no reason to read the other copies, they won't be
 		 * any less wrong.
 		 */
-		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags))
+		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags))
 			break;
 
 		num_copies = btrfs_num_copies(root->fs_info,
@@ -622,7 +623,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 		goto err;
 
 	eb->read_mirror = mirror;
-	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
+	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->ebflags)) {
 		ret = -EIO;
 		goto err;
 	}
@@ -631,13 +632,14 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	if (found_start != eb->start) {
 		printk_ratelimited(KERN_ERR "BTRFS (device %s): bad tree block start "
 			       "%llu %llu\n",
-			       eb->fs_info->sb->s_id, found_start, eb->start);
+				eb_head(eb)->fs_info->sb->s_id, found_start,
+				eb->start);
 		ret = -EIO;
 		goto err;
 	}
 	if (check_tree_block_fsid(root->fs_info, eb)) {
 		printk_ratelimited(KERN_ERR "BTRFS (device %s): bad fsid on block %llu\n",
-			       eb->fs_info->sb->s_id, eb->start);
+			       eb_head(eb)->fs_info->sb->s_id, eb->start);
 		ret = -EIO;
 		goto err;
 	}
@@ -664,7 +666,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	 * return -EIO.
 	 */
 	if (found_level == 0 && check_leaf(root, eb)) {
-		set_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
+		set_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags);
 		ret = -EIO;
 	}
 
@@ -672,7 +674,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 		set_extent_buffer_uptodate(eb);
 err:
 	if (reads_done &&
-	    test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
+	    test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->ebflags))
 		btree_readahead_hook(root, eb, eb->start, ret);
 
 	if (ret) {
@@ -695,10 +697,10 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
 
 	eb = (struct extent_buffer *)page->private;
-	set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
+	set_bit(EXTENT_BUFFER_READ_ERR, &eb->ebflags);
 	eb->read_mirror = failed_mirror;
 	atomic_dec(&eb->io_pages);
-	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
+	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->ebflags))
 		btree_readahead_hook(root, eb, eb->start, -EIO);
 	return -EIO;	/* we fixed nothing */
 }
@@ -1047,13 +1049,24 @@ static int btree_set_page_dirty(struct page *page)
 {
 #ifdef DEBUG
 	struct extent_buffer *eb;
+	int dirty = 0;
 
 	BUG_ON(!PagePrivate(page));
 	eb = (struct extent_buffer *)page->private;
 	BUG_ON(!eb);
-	BUG_ON(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
-	BUG_ON(!atomic_read(&eb->refs));
-	btrfs_assert_tree_locked(eb);
+
+	do {
+		dirty = test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
+		if (dirty)
+			break;
+	} while ((eb = eb->eb_next) != NULL);
+
+	BUG_ON(!dirty);
+
+	eb = (struct extent_buffer *)page->private;
+	BUG_ON(!atomic_read(&(eb_head(eb)->refs)));
+
+	btrfs_assert_tree_locked(eb);
 #endif
 	return __set_page_dirty_nobuffers(page);
 }
@@ -1094,7 +1107,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr,
 	if (!buf)
 		return 0;
 
-	set_bit(EXTENT_BUFFER_READAHEAD, &buf->bflags);
+	set_bit(EXTENT_BUFFER_READAHEAD, &buf->ebflags);
 
 	ret = read_extent_buffer_pages(io_tree, buf, 0, WAIT_PAGE_LOCK,
 				       btree_get_extent, mirror_num);
@@ -1103,7 +1116,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr,
 		return ret;
 	}
 
-	if (test_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags)) {
+	if (test_bit(EXTENT_BUFFER_CORRUPT, &buf->ebflags)) {
 		free_extent_buffer(buf);
 		return -EIO;
 	} else if (extent_buffer_uptodate(buf)) {
@@ -1131,14 +1144,16 @@ struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
 
 int btrfs_write_tree_block(struct extent_buffer *buf)
 {
-	return filemap_fdatawrite_range(buf->pages[0]->mapping, buf->start,
+	return filemap_fdatawrite_range(eb_head(buf)->pages[0]->mapping,
+					buf->start,
 					buf->start + buf->len - 1);
 }
 
 int btrfs_wait_tree_block_writeback(struct extent_buffer *buf)
 {
-	return filemap_fdatawait_range(buf->pages[0]->mapping,
-				       buf->start, buf->start + buf->len - 1);
+	return filemap_fdatawait_range(eb_head(buf)->pages[0]->mapping,
+					buf->start,
+					buf->start + buf->len - 1);
 }
 
 struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
@@ -1168,7 +1183,8 @@ void clean_tree_block(struct btrfs_trans_handle *trans,
 	    fs_info->running_transaction->transid) {
 		btrfs_assert_tree_locked(buf);
 
-		if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &buf->bflags)) {
+		if (test_and_clear_bit(EXTENT_BUFFER_DIRTY,
+						&buf->ebflags)) {
 			__percpu_counter_add(&fs_info->dirty_metadata_bytes,
 					     -buf->len,
 					     fs_info->dirty_metadata_batch);
@@ -2798,9 +2814,10 @@ int open_ctree(struct super_block *sb,
 					   btrfs_super_chunk_root(disk_super),
 					   generation);
 	if (!chunk_root->node ||
-	    !test_bit(EXTENT_BUFFER_UPTODATE, &chunk_root->node->bflags)) {
+		!test_bit(EXTENT_BUFFER_UPTODATE,
+			&chunk_root->node->ebflags)) {
 		printk(KERN_ERR "BTRFS: failed to read chunk root on %s\n",
-		       sb->s_id);
+			sb->s_id);
 		goto fail_tree_roots;
 	}
 	btrfs_set_root_node(&chunk_root->root_item, chunk_root->node);
@@ -2835,7 +2852,8 @@ retry_root_backup:
 					  btrfs_super_root(disk_super),
 					  generation);
 	if (!tree_root->node ||
-	    !test_bit(EXTENT_BUFFER_UPTODATE, &tree_root->node->bflags)) {
+		!test_bit(EXTENT_BUFFER_UPTODATE,
+			&tree_root->node->ebflags)) {
 		printk(KERN_WARNING "BTRFS: failed to read tree root on %s\n",
 		       sb->s_id);
 
@@ -3786,7 +3804,7 @@ int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
 			  int atomic)
 {
 	int ret;
-	struct inode *btree_inode = buf->pages[0]->mapping->host;
+	struct inode *btree_inode = eb_head(buf)->pages[0]->mapping->host;
 
 	ret = extent_buffer_uptodate(buf);
 	if (!ret)
@@ -3816,10 +3834,10 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
 	 * enabled.  Normal people shouldn't be marking dummy buffers as dirty
 	 * outside of the sanity tests.
 	 */
-	if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, &buf->bflags)))
+	if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, &eb_head(buf)->bflags)))
 		return;
 #endif
-	root = BTRFS_I(buf->pages[0]->mapping->host)->root;
+	root = BTRFS_I(eb_head(buf)->pages[0]->mapping->host)->root;
 	btrfs_assert_tree_locked(buf);
 	if (transid != root->fs_info->generation)
 		WARN(1, KERN_CRIT "btrfs transid mismatch buffer %llu, "
@@ -3874,7 +3892,8 @@ void btrfs_btree_balance_dirty_nodelay(struct btrfs_root *root)
 
 int btrfs_read_buffer(struct extent_buffer *buf, u64 parent_transid)
 {
-	struct btrfs_root *root = BTRFS_I(buf->pages[0]->mapping->host)->root;
+	struct btrfs_root *root =
+			BTRFS_I(eb_head(buf)->pages[0]->mapping->host)->root;
 	return btree_read_extent_buffer_pages(root, buf, 0, parent_transid);
 }
 
@@ -4185,7 +4204,7 @@ static int btrfs_destroy_marked_extents(struct btrfs_root *root,
 			wait_on_extent_buffer_writeback(eb);
 
 			if (test_and_clear_bit(EXTENT_BUFFER_DIRTY,
-					       &eb->bflags))
+					       &eb->ebflags))
 				clear_extent_buffer_dirty(eb);
 			free_extent_buffer_stale(eb);
 		}
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1eef4ee..b93a922 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6450,7 +6450,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 			goto out;
 		}
 
-		WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
+		WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->ebflags));
 
 		btrfs_add_free_space(cache, buf->start, buf->len);
 		btrfs_update_reserved_bytes(cache, buf->len, RESERVE_FREE, 0);
@@ -6468,7 +6468,7 @@ out:
 	 * Deleting the buffer, clear the corrupt flag since it doesn't matter
 	 * anymore.
 	 */
-	clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
+	clear_bit(EXTENT_BUFFER_CORRUPT, &buf->ebflags);
 }
 
 /* Can return -ENOMEM */
@@ -7444,7 +7444,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	btrfs_set_buffer_lockdep_class(root->root_key.objectid, buf, level);
 	btrfs_tree_lock(buf);
 	clean_tree_block(trans, root->fs_info, buf);
-	clear_bit(EXTENT_BUFFER_STALE, &buf->bflags);
+	clear_bit(EXTENT_BUFFER_STALE, &buf->ebflags);
 
 	btrfs_set_lock_blocking(buf);
 	btrfs_set_buffer_uptodate(buf);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3736ab5..a7e715a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -61,6 +61,7 @@ void btrfs_leak_debug_check(void)
 {
 	struct extent_state *state;
 	struct extent_buffer *eb;
+	struct extent_buffer_head *ebh;
 
 	while (!list_empty(&states)) {
 		state = list_entry(states.next, struct extent_state, leak_list);
@@ -73,12 +74,17 @@ void btrfs_leak_debug_check(void)
 	}
 
 	while (!list_empty(&buffers)) {
-		eb = list_entry(buffers.next, struct extent_buffer, leak_list);
-		printk(KERN_ERR "BTRFS: buffer leak start %llu len %lu "
-		       "refs %d\n",
-		       eb->start, eb->len, atomic_read(&eb->refs));
-		list_del(&eb->leak_list);
-		kmem_cache_free(extent_buffer_cache, eb);
+		ebh = list_entry(buffers.next, struct extent_buffer_head, leak_list);
+		printk(KERN_ERR "btrfs buffer leak ");
+
+		eb = &ebh->eb;
+		do {
+			printk(KERN_CONT "eb %p %llu:%lu ", eb, eb->start, eb->len);
+		} while ((eb = eb->eb_next) != NULL);
+
+		printk(KERN_CONT "refs %d\n", atomic_read(&ebh->refs));
+		list_del(&ebh->leak_list);
+		kmem_cache_free(extent_buffer_cache, ebh);
 	}
 }
 
@@ -149,7 +155,7 @@ int __init extent_io_init(void)
 		return -ENOMEM;
 
 	extent_buffer_cache = kmem_cache_create("btrfs_extent_buffer",
-			sizeof(struct extent_buffer), 0,
+			sizeof(struct extent_buffer_head), 0,
 			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
 	if (!extent_buffer_cache)
 		goto free_state_cache;
@@ -2170,7 +2176,7 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 		return -EROFS;
 
 	for (i = 0; i < num_pages; i++) {
-		struct page *p = eb->pages[i];
+		struct page *p = eb_head(eb)->pages[i];
 
 		ret = repair_io_failure(root->fs_info->btree_inode, start,
 					PAGE_CACHE_SIZE, start, p,
@@ -3625,8 +3631,8 @@ done_unlocked:
 
 void wait_on_extent_buffer_writeback(struct extent_buffer *eb)
 {
-	wait_on_bit_io(&eb->bflags, EXTENT_BUFFER_WRITEBACK,
-		       TASK_UNINTERRUPTIBLE);
+	wait_on_bit_io(&eb->ebflags, EXTENT_BUFFER_WRITEBACK,
+		    TASK_UNINTERRUPTIBLE);
 }
 
 static noinline_for_stack int
@@ -3644,7 +3650,7 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
 		btrfs_tree_lock(eb);
 	}
 
-	if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) {
+	if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags)) {
 		btrfs_tree_unlock(eb);
 		if (!epd->sync_io)
 			return 0;
@@ -3655,7 +3661,7 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
 		while (1) {
 			wait_on_extent_buffer_writeback(eb);
 			btrfs_tree_lock(eb);
-			if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags))
+			if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags))
 				break;
 			btrfs_tree_unlock(eb);
 		}
@@ -3666,17 +3672,17 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
 	 * under IO since we can end up having no IO bits set for a short period
 	 * of time.
 	 */
-	spin_lock(&eb->refs_lock);
-	if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
-		set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
-		spin_unlock(&eb->refs_lock);
+	spin_lock(&eb_head(eb)->refs_lock);
+	if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags)) {
+		set_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags);
+		spin_unlock(&eb_head(eb)->refs_lock);
 		btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
 		__percpu_counter_add(&fs_info->dirty_metadata_bytes,
 				     -eb->len,
 				     fs_info->dirty_metadata_batch);
 		ret = 1;
 	} else {
-		spin_unlock(&eb->refs_lock);
+		spin_unlock(&eb_head(eb)->refs_lock);
 	}
 
 	btrfs_tree_unlock(eb);
@@ -3686,7 +3692,7 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
 
 	num_pages = num_extent_pages(eb->start, eb->len);
 	for (i = 0; i < num_pages; i++) {
-		struct page *p = eb->pages[i];
+		struct page *p = eb_head(eb)->pages[i];
 
 		if (!trylock_page(p)) {
 			if (!flush) {
@@ -3702,18 +3708,19 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
 
 static void end_extent_buffer_writeback(struct extent_buffer *eb)
 {
-	clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags);
 	smp_mb__after_atomic();
-	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
+	wake_up_bit(&eb->ebflags, EXTENT_BUFFER_WRITEBACK);
 }
 
 static void set_btree_ioerr(struct page *page)
 {
 	struct extent_buffer *eb = (struct extent_buffer *)page->private;
-	struct btrfs_inode *btree_ino = BTRFS_I(eb->fs_info->btree_inode);
+	struct extent_buffer_head *ebh = eb_head(eb);
+	struct btrfs_inode *btree_ino = BTRFS_I(ebh->fs_info->btree_inode);
 
 	SetPageError(page);
-	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
+	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->ebflags))
 		return;
 
 	/*
@@ -3782,7 +3789,7 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 		BUG_ON(!eb);
 		done = atomic_dec_and_test(&eb->io_pages);
 
-		if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
+		if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->ebflags)) {
 			ClearPageUptodate(page);
 			set_btree_ioerr(page);
 		}
@@ -3811,14 +3818,14 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
 	int ret = 0;
 
-	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->ebflags);
 	num_pages = num_extent_pages(eb->start, eb->len);
 	atomic_set(&eb->io_pages, num_pages);
 	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
 		bio_flags = EXTENT_BIO_TREE_LOG;
 
 	for (i = 0; i < num_pages; i++) {
-		struct page *p = eb->pages[i];
+		struct page *p = eb_head(eb)->pages[i];
 
 		clear_page_dirty_for_io(p);
 		set_page_writeback(p);
@@ -3842,7 +3849,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 
 	if (unlikely(ret)) {
 		for (; i < num_pages; i++) {
-			struct page *p = eb->pages[i];
+			struct page *p = eb_head(eb)->pages[i];
 			clear_page_dirty_for_io(p);
 			unlock_page(p);
 		}
@@ -4605,17 +4612,36 @@ out:
 	return ret;
 }
 
-static void __free_extent_buffer(struct extent_buffer *eb)
+static void __free_extent_buffer(struct extent_buffer_head *ebh)
 {
-	btrfs_leak_debug_del(&eb->leak_list);
-	kmem_cache_free(extent_buffer_cache, eb);
+	struct extent_buffer *eb, *next_eb;
+
+	btrfs_leak_debug_del(&ebh->leak_list);
+
+	eb = ebh->eb.eb_next;
+	while (eb) {
+		next_eb = eb->eb_next;
+		kfree(eb);
+		eb = next_eb;
+	}
+
+	kmem_cache_free(extent_buffer_cache, ebh);
 }
 
 int extent_buffer_under_io(struct extent_buffer *eb)
 {
-	return (atomic_read(&eb->io_pages) ||
-		test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags) ||
-		test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
+	struct extent_buffer_head *ebh = eb->ebh;
+	int dirty_or_writeback = 0;
+
+	for (eb = &ebh->eb; eb; eb = eb->eb_next) {
+		if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags)
+			|| test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags)) {
+			dirty_or_writeback = 1;
+			break;
+		}
+	}
+
+	return (atomic_read(&ebh->io_bvecs) || dirty_or_writeback);
 }
 
 /*
@@ -4625,7 +4651,8 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
 {
 	unsigned long index;
 	struct page *page;
-	int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
+	struct extent_buffer_head *ebh = eb_head(eb);
+	int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &ebh->bflags);
 
 	BUG_ON(extent_buffer_under_io(eb));
 
@@ -4634,8 +4661,10 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
 		return;
 
 	do {
+		struct extent_buffer *e;
+
 		index--;
-		page = eb->pages[index];
+		page = ebh->pages[index];
 		if (page && mapped) {
 			spin_lock(&page->mapping->private_lock);
 			/*
@@ -4646,8 +4675,10 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
 			 * this eb.
 			 */
 			if (PagePrivate(page) &&
-			    page->private == (unsigned long)eb) {
-				BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
+				page->private == (unsigned long)(&ebh->eb)) {
+				for (e = &ebh->eb; e; e = e->eb_next)
+					BUG_ON(test_bit(EXTENT_BUFFER_DIRTY,
+								&e->ebflags));
 				BUG_ON(PageDirty(page));
 				BUG_ON(PageWriteback(page));
 				/*
@@ -4675,22 +4706,18 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
 static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
 {
 	btrfs_release_extent_buffer_page(eb);
-	__free_extent_buffer(eb);
+	__free_extent_buffer(eb_head(eb));
 }
 
-static struct extent_buffer *
-__alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
-		      unsigned long len)
+static void __init_extent_buffer(struct extent_buffer *eb,
+				struct extent_buffer_head *ebh,
+				u64 start,
+				unsigned long len)
 {
-	struct extent_buffer *eb = NULL;
-
-	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS);
-	if (eb == NULL)
-		return NULL;
 	eb->start = start;
 	eb->len = len;
-	eb->fs_info = fs_info;
-	eb->bflags = 0;
+	eb->ebh = ebh;
+	eb->eb_next = NULL;
 	rwlock_init(&eb->lock);
 	atomic_set(&eb->write_locks, 0);
 	atomic_set(&eb->read_locks, 0);
@@ -4701,12 +4728,26 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 	eb->lock_nested = 0;
 	init_waitqueue_head(&eb->write_lock_wq);
 	init_waitqueue_head(&eb->read_lock_wq);
+}
 
-	btrfs_leak_debug_add(&eb->leak_list, &buffers);
+static struct extent_buffer *
+__alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
+		      unsigned long len)
+{
+	struct extent_buffer_head *ebh = NULL;
+	struct extent_buffer *eb = NULL;
+	int i;
+
+	ebh = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS);
+	if (ebh == NULL)
+		return NULL;
+	ebh->fs_info = fs_info;
+	ebh->bflags = 0;
+	btrfs_leak_debug_add(&ebh->leak_list, &buffers);
 
-	spin_lock_init(&eb->refs_lock);
-	atomic_set(&eb->refs, 1);
-	atomic_set(&eb->io_pages, 0);
+	spin_lock_init(&ebh->refs_lock);
+	atomic_set(&ebh->refs, 1);
+	atomic_set(&ebh->io_bvecs, 0);
 
 	/*
 	 * Sanity checks, currently the maximum is 64k covered by 16x 4k pages
@@ -4715,6 +4756,29 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 		> MAX_INLINE_EXTENT_BUFFER_SIZE);
 	BUG_ON(len > MAX_INLINE_EXTENT_BUFFER_SIZE);
 
+	if (len < PAGE_CACHE_SIZE) {
+		struct extent_buffer *cur_eb, *prev_eb;
+		int ebs_per_page = PAGE_CACHE_SIZE / len;
+		u64 st = start & ~(PAGE_CACHE_SIZE - 1);
+
+		prev_eb = NULL;
+		cur_eb = &ebh->eb;
+		for (i = 0; i < ebs_per_page; i++, st += len) {
+			if (prev_eb) {
+				cur_eb = kzalloc(sizeof(*eb), GFP_NOFS);
+				if (!cur_eb) {
+					/* frees the chain built so far */
+					__free_extent_buffer(ebh);
+					return NULL;
+				}
+				prev_eb->eb_next = cur_eb;
+			}
+			__init_extent_buffer(cur_eb, ebh, st, len);
+			prev_eb = cur_eb;
+			if (st == start)
+				eb = cur_eb;
+		}
+		BUG_ON(!eb);
+	} else {
+		eb = &ebh->eb;
+		__init_extent_buffer(eb, ebh, start, len);
+	}
+
 	return eb;
 }
 
@@ -4725,7 +4789,8 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
 	struct extent_buffer *new;
 	unsigned long num_pages = num_extent_pages(src->start, src->len);
 
-	new = __alloc_extent_buffer(src->fs_info, src->start, src->len);
+	new = __alloc_extent_buffer(eb_head(src)->fs_info, src->start,
+				src->len);
 	if (new == NULL)
 		return NULL;
 
@@ -4735,15 +4800,16 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
 			btrfs_release_extent_buffer(new);
 			return NULL;
 		}
-		attach_extent_buffer_page(new, p);
+		attach_extent_buffer_page(&(eb_head(new)->eb), p);
 		WARN_ON(PageDirty(p));
 		SetPageUptodate(p);
-		new->pages[i] = p;
+		eb_head(new)->pages[i] = p;
 	}
 
+	set_bit(EXTENT_BUFFER_UPTODATE, &new->ebflags);
+	set_bit(EXTENT_BUFFER_DUMMY, &eb_head(new)->bflags);
+
 	copy_extent_buffer(new, src, 0, 0, src->len);
-	set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
-	set_bit(EXTENT_BUFFER_DUMMY, &new->bflags);
 
 	return new;
 }
@@ -4772,19 +4838,19 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
 		return NULL;
 
 	for (i = 0; i < num_pages; i++) {
-		eb->pages[i] = alloc_page(GFP_NOFS);
-		if (!eb->pages[i])
+		eb_head(eb)->pages[i] = alloc_page(GFP_NOFS);
+		if (!eb_head(eb)->pages[i])
 			goto err;
 	}
 	set_extent_buffer_uptodate(eb);
 	btrfs_set_header_nritems(eb, 0);
-	set_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
+	set_bit(EXTENT_BUFFER_DUMMY, &eb_head(eb)->bflags);
 
 	return eb;
 err:
 	for (; i > 0; i--)
-		__free_page(eb->pages[i - 1]);
-	__free_extent_buffer(eb);
+		__free_page(eb_head(eb)->pages[i - 1]);
+	__free_extent_buffer(eb_head(eb));
 	return NULL;
 }
 
@@ -4811,14 +4877,15 @@ static void check_buffer_tree_ref(struct extent_buffer *eb)
 	 * So bump the ref count first, then set the bit.  If someone
 	 * beat us to it, drop the ref we added.
 	 */
-	refs = atomic_read(&eb->refs);
-	if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
+	refs = atomic_read(&eb_head(eb)->refs);
+	if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF,
+					&eb_head(eb)->bflags))
 		return;
 
-	spin_lock(&eb->refs_lock);
-	if (!test_and_set_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
-		atomic_inc(&eb->refs);
-	spin_unlock(&eb->refs_lock);
+	spin_lock(&eb_head(eb)->refs_lock);
+	if (!test_and_set_bit(EXTENT_BUFFER_TREE_REF, &eb_head(eb)->bflags))
+		atomic_inc(&eb_head(eb)->refs);
+	spin_unlock(&eb_head(eb)->refs_lock);
 }
 
 static void mark_extent_buffer_accessed(struct extent_buffer *eb,
@@ -4830,7 +4897,7 @@ static void mark_extent_buffer_accessed(struct extent_buffer *eb,
 
 	num_pages = num_extent_pages(eb->start, eb->len);
 	for (i = 0; i < num_pages; i++) {
-		struct page *p = eb->pages[i];
+		struct page *p = eb_head(eb)->pages[i];
 
 		if (p != accessed)
 			mark_page_accessed(p);
@@ -4840,15 +4907,24 @@ static void mark_extent_buffer_accessed(struct extent_buffer *eb,
 struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
 					 u64 start)
 {
+	struct extent_buffer_head *ebh;
 	struct extent_buffer *eb;
 
 	rcu_read_lock();
-	eb = radix_tree_lookup(&fs_info->buffer_radix,
-			       start >> PAGE_CACHE_SHIFT);
-	if (eb && atomic_inc_not_zero(&eb->refs)) {
+	ebh = radix_tree_lookup(&fs_info->buffer_radix,
+				start >> PAGE_CACHE_SHIFT);
+	if (ebh && atomic_inc_not_zero(&ebh->refs)) {
 		rcu_read_unlock();
-		mark_extent_buffer_accessed(eb, NULL);
-		return eb;
+
+		eb = &ebh->eb;
+		do {
+			if (eb->start == start) {
+				mark_extent_buffer_accessed(eb, NULL);
+				return eb;
+			}
+		} while ((eb = eb->eb_next) != NULL);
+
+		BUG();
 	}
 	rcu_read_unlock();
 
@@ -4909,7 +4985,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	unsigned long num_pages = num_extent_pages(start, len);
 	unsigned long i;
 	unsigned long index = start >> PAGE_CACHE_SHIFT;
-	struct extent_buffer *eb;
+	struct extent_buffer *eb, *cur_eb;
 	struct extent_buffer *exists = NULL;
 	struct page *p;
 	struct address_space *mapping = fs_info->btree_inode->i_mapping;
@@ -4939,12 +5015,18 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 			 * overwrite page->private.
 			 */
 			exists = (struct extent_buffer *)p->private;
-			if (atomic_inc_not_zero(&exists->refs)) {
+			if (atomic_inc_not_zero(&eb_head(exists)->refs)) {
 				spin_unlock(&mapping->private_lock);
 				unlock_page(p);
 				page_cache_release(p);
-				mark_extent_buffer_accessed(exists, p);
-				goto free_eb;
+				do {
+					if (exists->start == start) {
+						mark_extent_buffer_accessed(exists, p);
+						goto free_eb;
+					}
+				} while ((exists = exists->eb_next) != NULL);
+
+				BUG();
 			}
 
 			/*
@@ -4955,10 +5037,11 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 			WARN_ON(PageDirty(p));
 			page_cache_release(p);
 		}
-		attach_extent_buffer_page(eb, p);
+		attach_extent_buffer_page(&(eb_head(eb)->eb), p);
 		spin_unlock(&mapping->private_lock);
 		WARN_ON(PageDirty(p));
-		eb->pages[i] = p;
+		mark_page_accessed(p);
+		eb_head(eb)->pages[i] = p;
 		if (!PageUptodate(p))
 			uptodate = 0;
 
@@ -4967,16 +5050,22 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		 * and why we unlock later
 		 */
 	}
-	if (uptodate)
-		set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
+	if (uptodate) {
+		cur_eb = &(eb_head(eb)->eb);
+		do {
+			set_bit(EXTENT_BUFFER_UPTODATE, &cur_eb->ebflags);
+		} while ((cur_eb = cur_eb->eb_next) != NULL);
+	}
 again:
 	ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
-	if (ret)
+	if (ret) {
+		exists = NULL;
 		goto free_eb;
+	}
 
 	spin_lock(&fs_info->buffer_lock);
 	ret = radix_tree_insert(&fs_info->buffer_radix,
-				start >> PAGE_CACHE_SHIFT, eb);
+				start >> PAGE_CACHE_SHIFT, eb_head(eb));
 	spin_unlock(&fs_info->buffer_lock);
 	radix_tree_preload_end();
 	if (ret == -EEXIST) {
@@ -4988,7 +5077,7 @@ again:
 	}
 	/* add one reference for the tree */
 	check_buffer_tree_ref(eb);
-	set_bit(EXTENT_BUFFER_IN_TREE, &eb->bflags);
+	set_bit(EXTENT_BUFFER_IN_TREE, &eb_head(eb)->bflags);
 
 	/*
 	 * there is a race where release page may have
@@ -4999,114 +5088,131 @@ again:
 	 * after the extent buffer is in the radix tree so
 	 * it doesn't get lost
 	 */
-	SetPageChecked(eb->pages[0]);
+	SetPageChecked(eb_head(eb)->pages[0]);
 	for (i = 1; i < num_pages; i++) {
-		p = eb->pages[i];
+		p = eb_head(eb)->pages[i];
 		ClearPageChecked(p);
 		unlock_page(p);
 	}
-	unlock_page(eb->pages[0]);
+	unlock_page(eb_head(eb)->pages[0]);
 	return eb;
 
 free_eb:
 	for (i = 0; i < num_pages; i++) {
-		if (eb->pages[i])
-			unlock_page(eb->pages[i]);
+		if (eb_head(eb)->pages[i])
+			unlock_page(eb_head(eb)->pages[i]);
 	}
 
-	WARN_ON(!atomic_dec_and_test(&eb->refs));
+	WARN_ON(!atomic_dec_and_test(&eb_head(eb)->refs));
 	btrfs_release_extent_buffer(eb);
 	return exists;
 }
 
 static inline void btrfs_release_extent_buffer_rcu(struct rcu_head *head)
 {
-	struct extent_buffer *eb =
-			container_of(head, struct extent_buffer, rcu_head);
+	struct extent_buffer_head *ebh =
+			container_of(head, struct extent_buffer_head, rcu_head);
 
-	__free_extent_buffer(eb);
+	__free_extent_buffer(ebh);
 }
 
 /* Expects to have eb->eb_lock already held */
-static int release_extent_buffer(struct extent_buffer *eb)
+static int release_extent_buffer(struct extent_buffer_head *ebh)
 {
-	WARN_ON(atomic_read(&eb->refs) == 0);
-	if (atomic_dec_and_test(&eb->refs)) {
-		if (test_and_clear_bit(EXTENT_BUFFER_IN_TREE, &eb->bflags)) {
-			struct btrfs_fs_info *fs_info = eb->fs_info;
+	WARN_ON(atomic_read(&ebh->refs) == 0);
+	if (atomic_dec_and_test(&ebh->refs)) {
+		if (test_and_clear_bit(EXTENT_BUFFER_IN_TREE, &ebh->bflags)) {
+			struct btrfs_fs_info *fs_info = ebh->fs_info;
 
-			spin_unlock(&eb->refs_lock);
+			spin_unlock(&ebh->refs_lock);
 
 			spin_lock(&fs_info->buffer_lock);
 			radix_tree_delete(&fs_info->buffer_radix,
-					  eb->start >> PAGE_CACHE_SHIFT);
+					ebh->eb.start >> PAGE_CACHE_SHIFT);
 			spin_unlock(&fs_info->buffer_lock);
 		} else {
-			spin_unlock(&eb->refs_lock);
+			spin_unlock(&ebh->refs_lock);
 		}
 
 		/* Should be safe to release our pages at this point */
-		btrfs_release_extent_buffer_page(eb);
+		btrfs_release_extent_buffer_page(&ebh->eb);
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
-		if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags))) {
-			__free_extent_buffer(eb);
+		if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, &ebh->bflags))) {
+			__free_extent_buffer(ebh);
 			return 1;
 		}
 #endif
-		call_rcu(&eb->rcu_head, btrfs_release_extent_buffer_rcu);
+		call_rcu(&ebh->rcu_head, btrfs_release_extent_buffer_rcu);
 		return 1;
 	}
-	spin_unlock(&eb->refs_lock);
+	spin_unlock(&ebh->refs_lock);
 
 	return 0;
 }
 
 void free_extent_buffer(struct extent_buffer *eb)
 {
+	struct extent_buffer_head *ebh;
 	int refs;
 	int old;
 	if (!eb)
 		return;
 
+	ebh = eb_head(eb);
 	while (1) {
-		refs = atomic_read(&eb->refs);
+		refs = atomic_read(&ebh->refs);
 		if (refs <= 3)
 			break;
-		old = atomic_cmpxchg(&eb->refs, refs, refs - 1);
+		old = atomic_cmpxchg(&ebh->refs, refs, refs - 1);
 		if (old == refs)
 			return;
 	}
 
-	spin_lock(&eb->refs_lock);
-	if (atomic_read(&eb->refs) == 2 &&
-	    test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags))
-		atomic_dec(&eb->refs);
+	spin_lock(&ebh->refs_lock);
+	if (atomic_read(&ebh->refs) == 2 &&
+	    test_bit(EXTENT_BUFFER_DUMMY, &ebh->bflags))
+		atomic_dec(&ebh->refs);
 
-	if (atomic_read(&eb->refs) == 2 &&
-	    test_bit(EXTENT_BUFFER_STALE, &eb->bflags) &&
+	if (atomic_read(&ebh->refs) == 2 &&
+	    test_bit(EXTENT_BUFFER_STALE, &eb->ebflags) &&
 	    !extent_buffer_under_io(eb) &&
-	    test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
-		atomic_dec(&eb->refs);
+	    test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &ebh->bflags))
+		atomic_dec(&ebh->refs);
 
 	/*
 	 * I know this is terrible, but it's temporary until we stop tracking
 	 * the uptodate bits and such for the extent buffers.
 	 */
-	release_extent_buffer(eb);
+	release_extent_buffer(ebh);
 }
 
 void free_extent_buffer_stale(struct extent_buffer *eb)
 {
+	struct extent_buffer_head *ebh;
 	if (!eb)
 		return;
 
-	spin_lock(&eb->refs_lock);
-	set_bit(EXTENT_BUFFER_STALE, &eb->bflags);
+	ebh = eb_head(eb);
+	spin_lock(&ebh->refs_lock);
+
+	set_bit(EXTENT_BUFFER_STALE, &eb->ebflags);
+	if (atomic_read(&ebh->refs) == 2 && !extent_buffer_under_io(eb) &&
+	    test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &ebh->bflags))
+		atomic_dec(&ebh->refs);
 
-	if (atomic_read(&eb->refs) == 2 && !extent_buffer_under_io(eb) &&
-	    test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
-		atomic_dec(&eb->refs);
-	release_extent_buffer(eb);
+	release_extent_buffer(ebh);
+}
+
+static int page_ebs_clean(struct extent_buffer_head *ebh)
+{
+	struct extent_buffer *eb = &ebh->eb;
+
+	do {
+		if (test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags))
+			return 0;
+	} while ((eb = eb->eb_next) != NULL);
+
+	return 1;
 }
 
 void clear_extent_buffer_dirty(struct extent_buffer *eb)
@@ -5117,8 +5223,11 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb)
 
 	num_pages = num_extent_pages(eb->start, eb->len);
 
+	if (eb->len < PAGE_CACHE_SIZE && !page_ebs_clean(eb_head(eb)))
+		return;
+
 	for (i = 0; i < num_pages; i++) {
-		page = eb->pages[i];
+		page = eb_head(eb)->pages[i];
 		if (!PageDirty(page))
 			continue;
 
@@ -5136,7 +5245,7 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb)
 		ClearPageError(page);
 		unlock_page(page);
 	}
-	WARN_ON(atomic_read(&eb->refs) == 0);
+	WARN_ON(atomic_read(&eb_head(eb)->refs) == 0);
 }
 
 int set_extent_buffer_dirty(struct extent_buffer *eb)
@@ -5147,14 +5256,14 @@ int set_extent_buffer_dirty(struct extent_buffer *eb)
 
 	check_buffer_tree_ref(eb);
 
-	was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags);
+	was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
 
 	num_pages = num_extent_pages(eb->start, eb->len);
-	WARN_ON(atomic_read(&eb->refs) == 0);
-	WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
+	WARN_ON(atomic_read(&eb_head(eb)->refs) == 0);
+	WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb_head(eb)->bflags));
 
 	for (i = 0; i < num_pages; i++)
-		set_page_dirty(eb->pages[i]);
+		set_page_dirty(eb_head(eb)->pages[i]);
 	return was_dirty;
 }
 
@@ -5164,10 +5273,12 @@ int clear_extent_buffer_uptodate(struct extent_buffer *eb)
 	struct page *page;
 	unsigned long num_pages;
 
-	clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
+	if (!eb || !eb_head(eb))
+		return 0;
+	clear_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags);
 	num_pages = num_extent_pages(eb->start, eb->len);
 	for (i = 0; i < num_pages; i++) {
-		page = eb->pages[i];
+		page = eb_head(eb)->pages[i];
 		if (page)
 			ClearPageUptodate(page);
 	}
@@ -5176,22 +5287,43 @@ int clear_extent_buffer_uptodate(struct extent_buffer *eb)
 
 int set_extent_buffer_uptodate(struct extent_buffer *eb)
 {
+	struct extent_buffer_head *ebh;
 	unsigned long i;
 	struct page *page;
 	unsigned long num_pages;
+	int uptodate;
 
-	set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
-	num_pages = num_extent_pages(eb->start, eb->len);
-	for (i = 0; i < num_pages; i++) {
-		page = eb->pages[i];
-		SetPageUptodate(page);
+	ebh = eb->ebh;
+
+	set_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags);
+	if (eb->len < PAGE_CACHE_SIZE) {
+		eb = &(eb_head(eb)->eb);
+		uptodate = 1;
+		do {
+			if (!test_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags)) {
+				uptodate = 0;
+				break;
+			}
+		} while ((eb = eb->eb_next) != NULL);
+
+		if (uptodate) {
+			page = ebh->pages[0];
+			SetPageUptodate(page);
+		}
+	} else {
+		num_pages = num_extent_pages(eb->start, eb->len);
+		for (i = 0; i < num_pages; i++) {
+			page = ebh->pages[i];
+			SetPageUptodate(page);
+		}
 	}
+
 	return 0;
 }
 
 int extent_buffer_uptodate(struct extent_buffer *eb)
 {
-	return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
+	return test_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags);
 }
 
 int read_extent_buffer_pages(struct extent_io_tree *tree,
@@ -5210,7 +5342,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 	struct bio *bio = NULL;
 	unsigned long bio_flags = 0;
 
-	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
+	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags))
 		return 0;
 
 	if (start) {
@@ -5223,7 +5355,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 
 	num_pages = num_extent_pages(eb->start, eb->len);
 	for (i = start_i; i < num_pages; i++) {
-		page = eb->pages[i];
+		page = eb_head(eb)->pages[i];
 		if (wait == WAIT_NONE) {
 			if (!trylock_page(page))
 				goto unlock_exit;
@@ -5238,15 +5370,15 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 	}
 	if (all_uptodate) {
 		if (start_i == 0)
-			set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
+			set_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags);
 		goto unlock_exit;
 	}
 
-	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->ebflags);
 	eb->read_mirror = 0;
 	atomic_set(&eb->io_pages, num_reads);
 	for (i = start_i; i < num_pages; i++) {
-		page = eb->pages[i];
+		page = eb_head(eb)->pages[i];
 		if (!PageUptodate(page)) {
 			ClearPageError(page);
 			err = __extent_read_full_page(tree, page,
@@ -5271,7 +5403,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 		return ret;
 
 	for (i = start_i; i < num_pages; i++) {
-		page = eb->pages[i];
+		page = eb_head(eb)->pages[i];
 		wait_on_page_locked(page);
 		if (!PageUptodate(page))
 			ret = -EIO;
@@ -5282,7 +5414,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 unlock_exit:
 	i = start_i;
 	while (locked_pages > 0) {
-		page = eb->pages[i];
+		page = eb_head(eb)->pages[i];
 		i++;
 		unlock_page(page);
 		locked_pages--;
@@ -5308,7 +5440,7 @@ void read_extent_buffer(struct extent_buffer *eb, void *dstv,
 	offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
 
 	while (len > 0) {
-		page = eb->pages[i];
+		page = eb_head(eb)->pages[i];
 
 		cur = min(len, (PAGE_CACHE_SIZE - offset));
 		kaddr = page_address(page);
@@ -5340,7 +5472,7 @@ int read_extent_buffer_to_user(struct extent_buffer *eb, void __user *dstv,
 	offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
 
 	while (len > 0) {
-		page = eb->pages[i];
+		page = eb_head(eb)->pages[i];
 
 		cur = min(len, (PAGE_CACHE_SIZE - offset));
 		kaddr = page_address(page);
@@ -5389,7 +5521,7 @@ int map_private_extent_buffer(struct extent_buffer *eb, unsigned long start,
 		return -EINVAL;
 	}
 
-	p = eb->pages[i];
+	p = eb_head(eb)->pages[i];
 	kaddr = page_address(p);
 	*map = kaddr + offset;
 	*map_len = PAGE_CACHE_SIZE - offset;
@@ -5415,7 +5547,7 @@ int memcmp_extent_buffer(struct extent_buffer *eb, const void *ptrv,
 	offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
 
 	while (len > 0) {
-		page = eb->pages[i];
+		page = eb_head(eb)->pages[i];
 
 		cur = min(len, (PAGE_CACHE_SIZE - offset));
 
@@ -5445,12 +5577,12 @@ void write_extent_buffer(struct extent_buffer *eb, const void *srcv,
 
 	WARN_ON(start > eb->len);
 	WARN_ON(start + len > eb->start + eb->len);
+	WARN_ON(!test_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags));
 
 	offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
 
 	while (len > 0) {
-		page = eb->pages[i];
-		WARN_ON(!PageUptodate(page));
+		page = eb_head(eb)->pages[i];
 
 		cur = min(len, PAGE_CACHE_SIZE - offset);
 		kaddr = page_address(page);
@@ -5478,9 +5610,10 @@ void memset_extent_buffer(struct extent_buffer *eb, char c,
 
 	offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
 
+	WARN_ON(!test_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags));
+
 	while (len > 0) {
-		page = eb->pages[i];
-		WARN_ON(!PageUptodate(page));
+		page = eb_head(eb)->pages[i];
 
 		cur = min(len, PAGE_CACHE_SIZE - offset);
 		kaddr = page_address(page);
@@ -5509,9 +5642,10 @@ void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src,
 	offset = (start_offset + dst_offset) &
 		(PAGE_CACHE_SIZE - 1);
 
+	WARN_ON(!test_bit(EXTENT_BUFFER_UPTODATE, &dst->ebflags));
+
 	while (len > 0) {
-		page = dst->pages[i];
-		WARN_ON(!PageUptodate(page));
+		page = eb_head(dst)->pages[i];
 
 		cur = min(len, (unsigned long)(PAGE_CACHE_SIZE - offset));
 
@@ -5588,8 +5722,9 @@ void memcpy_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 		cur = min_t(unsigned long, cur,
 			(unsigned long)(PAGE_CACHE_SIZE - dst_off_in_page));
 
-		copy_pages(dst->pages[dst_i], dst->pages[src_i],
-			   dst_off_in_page, src_off_in_page, cur);
+		copy_pages(eb_head(dst)->pages[dst_i],
+			eb_head(dst)->pages[src_i],
+			dst_off_in_page, src_off_in_page, cur);
 
 		src_offset += cur;
 		dst_offset += cur;
@@ -5634,9 +5769,10 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 
 		cur = min_t(unsigned long, len, src_off_in_page + 1);
 		cur = min(cur, dst_off_in_page + 1);
-		copy_pages(dst->pages[dst_i], dst->pages[src_i],
-			   dst_off_in_page - cur + 1,
-			   src_off_in_page - cur + 1, cur);
+		copy_pages(eb_head(dst)->pages[dst_i],
+			eb_head(dst)->pages[src_i],
+			dst_off_in_page - cur + 1,
+			src_off_in_page - cur + 1, cur);
 
 		dst_end -= cur;
 		src_end -= cur;
@@ -5646,6 +5782,7 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 
 int try_release_extent_buffer(struct page *page)
 {
+	struct extent_buffer_head *ebh;
 	struct extent_buffer *eb;
 
 	/*
@@ -5661,14 +5798,15 @@ int try_release_extent_buffer(struct page *page)
 	eb = (struct extent_buffer *)page->private;
 	BUG_ON(!eb);
 
+	ebh = eb->ebh;
 	/*
 	 * This is a little awful but should be ok, we need to make sure that
 	 * the eb doesn't disappear out from under us while we're looking at
 	 * this page.
 	 */
-	spin_lock(&eb->refs_lock);
-	if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
-		spin_unlock(&eb->refs_lock);
+	spin_lock(&ebh->refs_lock);
+	if (atomic_read(&ebh->refs) != 1 || extent_buffer_under_io(eb)) {
+		spin_unlock(&ebh->refs_lock);
 		spin_unlock(&page->mapping->private_lock);
 		return 0;
 	}
@@ -5678,10 +5816,11 @@ int try_release_extent_buffer(struct page *page)
 	 * If tree ref isn't set then we know the ref on this eb is a real ref,
 	 * so just return, this page will likely be freed soon anyway.
 	 */
-	if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
-		spin_unlock(&eb->refs_lock);
+	if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &ebh->bflags)) {
+		spin_unlock(&ebh->refs_lock);
 		return 0;
 	}
 
-	return release_extent_buffer(eb);
+	return release_extent_buffer(ebh);
 }
+
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 541b40a..8fe5ac3 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -131,17 +131,17 @@ struct extent_state {
 
 #define INLINE_EXTENT_BUFFER_PAGES 16
 #define MAX_INLINE_EXTENT_BUFFER_SIZE (INLINE_EXTENT_BUFFER_PAGES * PAGE_CACHE_SIZE)
+
+/* Forward declaration */
+struct extent_buffer_head;
+
 struct extent_buffer {
 	u64 start;
 	unsigned long len;
-	unsigned long bflags;
-	struct btrfs_fs_info *fs_info;
-	spinlock_t refs_lock;
-	atomic_t refs;
-	atomic_t io_pages;
+	unsigned long ebflags;
+	struct extent_buffer_head *ebh;
+	struct extent_buffer *eb_next;
 	int read_mirror;
-	struct rcu_head rcu_head;
-	pid_t lock_owner;
 
 	/* count of read lock holders on the extent buffer */
 	atomic_t write_locks;
@@ -154,6 +154,8 @@ struct extent_buffer {
 	/* >= 0 if eb belongs to a log tree, -1 otherwise */
 	short log_index;
 
+	pid_t lock_owner;
+
 	/* protects write locks */
 	rwlock_t lock;
 
@@ -166,7 +168,20 @@ struct extent_buffer {
 	 * to unlock
 	 */
 	wait_queue_head_t read_lock_wq;
+	wait_queue_head_t lock_wq;
+};
+
+struct extent_buffer_head {
+	unsigned long bflags;
+	struct btrfs_fs_info *fs_info;
+	spinlock_t refs_lock;
+	atomic_t refs;
+	atomic_t io_bvecs;
+	struct rcu_head rcu_head;
+
 	struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
+
+	struct extent_buffer eb;
 #ifdef CONFIG_BTRFS_DEBUG
 	struct list_head leak_list;
 #endif
@@ -183,6 +198,14 @@ static inline int extent_compress_type(unsigned long bio_flags)
 	return bio_flags >> EXTENT_BIO_FLAG_SHIFT;
 }
 
+/*
+ * return the extent_buffer_head that contains the extent buffer provided.
+ */
+static inline struct extent_buffer_head *eb_head(struct extent_buffer *eb)
+{
+	return eb->ebh;
+}
 struct extent_map_tree;
 
 typedef struct extent_map *(get_extent_t)(struct inode *inode,
@@ -304,7 +327,7 @@ static inline unsigned long num_extent_pages(u64 start, u64 len)
 
 static inline void extent_buffer_get(struct extent_buffer *eb)
 {
-	atomic_inc(&eb->refs);
+	atomic_inc(&eb_head(eb)->refs);
 }
 
 int memcmp_extent_buffer(struct extent_buffer *eb, const void *ptrv,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8bcd2a0..9c8eb4a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6282,7 +6282,7 @@ int btrfs_read_sys_array(struct btrfs_root *root)
 	 * to silence the warning eg. on PowerPC 64.
 	 */
 	if (PAGE_CACHE_SIZE > BTRFS_SUPER_INFO_SIZE)
-		SetPageUptodate(sb->pages[0]);
+		SetPageUptodate(eb_head(sb)->pages[0]);
 
 	write_extent_buffer(sb, super_copy, 0, BTRFS_SUPER_INFO_SIZE);
 	array_size = btrfs_super_sys_array_size(super_copy);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 1faecea..283bbe7 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -699,7 +699,7 @@ TRACE_EVENT(btrfs_cow_block,
 	TP_fast_assign(
 		__entry->root_objectid	= root->root_key.objectid;
 		__entry->buf_start	= buf->start;
-		__entry->refs		= atomic_read(&buf->refs);
+		__entry->refs		= atomic_read(&eb_head(buf)->refs);
 		__entry->cow_start	= cow->start;
 		__entry->buf_level	= btrfs_header_level(buf);
 		__entry->cow_level	= btrfs_header_level(cow);
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH V11 05/21] Btrfs: subpagesize-blocksize: Read tree blocks whose size is < PAGE_SIZE.
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (3 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 04/21] Btrfs: subpagesize-blocksize: Define extent_buffer_head Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-07-01 14:40   ` Liu Bo
  2015-06-01 15:22 ` [RFC PATCH V11 06/21] Btrfs: subpagesize-blocksize: Write only dirty extent buffers belonging to a page Chandan Rajendra
                   ` (15 subsequent siblings)
  20 siblings, 1 reply; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

In the case of subpagesize-blocksize, this patch makes it possible to read
only a single metadata block from the disk instead of all the metadata blocks
that map into a page.
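
As a rough illustration, the following standalone sketch (compilable
userspace C; struct eb and find_eb() are simplified stand-ins for the kernel
structures, not the actual API) models what the read completion path does:
walk the page's chain of extent buffers via eb_next and pick out the single
block that covers the completed range.

	#include <assert.h>
	#include <stdio.h>

	struct eb {
		unsigned long long start;	/* logical start of the block */
		unsigned long len;		/* block size, < PAGE_SIZE here */
		struct eb *eb_next;		/* next block mapped by this page */
	};

	/* Walk the page's chain and return the block covering 'offset'. */
	static struct eb *find_eb(struct eb *first, unsigned long long offset)
	{
		struct eb *cur;

		for (cur = first; cur; cur = cur->eb_next)
			if (cur->start <= offset &&
			    offset < cur->start + cur->len)
				return cur;
		return NULL;
	}

	int main(void)
	{
		/* One 4K page backing two 2K metadata blocks. */
		struct eb b1 = { 16384, 2048, NULL };
		struct eb b0 = { 14336, 2048, &b1 };

		assert(find_eb(&b0, 14336) == &b0);
		assert(find_eb(&b0, 16384) == &b1);
		printf("read completes for block at %llu\n",
		       find_eb(&b0, 16384)->start);
		return 0;
	}

In the kernel code the same walk appears in verify_extent_buffer_read() and
end_bio_extent_buffer_readpage(), which must map a completed bio_vec back to
exactly one extent buffer.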

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/disk-io.c   |  65 +++++++++++++++++++------------
 fs/btrfs/disk-io.h   |   3 ++
 fs/btrfs/extent_io.c | 108 +++++++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 144 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 51fe2ec..b794e33 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -597,28 +597,41 @@ static noinline int check_leaf(struct btrfs_root *root,
 	return 0;
 }
 
-static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
-				      u64 phy_offset, struct page *page,
-				      u64 start, u64 end, int mirror)
+int verify_extent_buffer_read(struct btrfs_io_bio *io_bio,
+			struct page *page,
+			u64 start, u64 end, int mirror)
 {
-	u64 found_start;
-	int found_level;
+	struct address_space *mapping = (io_bio->bio).bi_io_vec->bv_page->mapping;
+	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
+	struct extent_buffer_head *ebh;
 	struct extent_buffer *eb;
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
-	int ret = 0;
+	unsigned long num_pages;
+	unsigned long i;
+	u64 found_start;
+	int found_level;
 	int reads_done;
+	int ret = 0;
 
 	if (!page->private)
 		goto out;
 
 	eb = (struct extent_buffer *)page->private;
+	do {
+		if ((eb->start <= start) && (eb->start + eb->len - 1 > start))
+			break;
+	} while ((eb = eb->eb_next) != NULL);
+
+	BUG_ON(!eb);
+
+	ebh = eb_head(eb);
 
 	/* the pending IO might have been the only thing that kept this buffer
 	 * in memory.  Make sure we have a ref for all this other checks
 	 */
 	extent_buffer_get(eb);
 
-	reads_done = atomic_dec_and_test(&eb->io_pages);
+	reads_done = atomic_dec_and_test(&ebh->io_bvecs);
 	if (!reads_done)
 		goto err;
 
@@ -683,28 +696,34 @@ err:
 		 * again, we have to make sure it has something
 		 * to decrement
 		 */
-		atomic_inc(&eb->io_pages);
+		atomic_inc(&eb_head(eb)->io_bvecs);
 		clear_extent_buffer_uptodate(eb);
 	}
+
+	/*
+	 * We never read more than one extent buffer from a page at a
+	 * time, so unlocking the page here should be fine.
+	 */
+	if (reads_done) {
+		num_pages = num_extent_pages(eb->start, eb->len);
+		for (i = 0; i < num_pages; i++) {
+			page = eb_head(eb)->pages[i];
+			unlock_page(page);
+		}
+
+		/*
+		 * We need not check whether extent_io_tree->track_uptodate
+		 * is set, since this function only deals with extent
+		 * buffers.
+		 */
+		unlock_extent(tree, eb->start, eb->start + eb->len - 1);
+	}
+
 	free_extent_buffer(eb);
 out:
 	return ret;
 }
 
-static int btree_io_failed_hook(struct page *page, int failed_mirror)
-{
-	struct extent_buffer *eb;
-	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
-
-	eb = (struct extent_buffer *)page->private;
-	set_bit(EXTENT_BUFFER_READ_ERR, &eb->ebflags);
-	eb->read_mirror = failed_mirror;
-	atomic_dec(&eb->io_pages);
-	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->ebflags))
-		btree_readahead_hook(root, eb, eb->start, -EIO);
-	return -EIO;	/* we fixed nothing */
-}
-
 static void end_workqueue_bio(struct bio *bio, int err)
 {
 	struct btrfs_end_io_wq *end_io_wq = bio->bi_private;
@@ -4349,8 +4368,6 @@ static int btrfs_cleanup_transaction(struct btrfs_root *root)
 }
 
 static const struct extent_io_ops btree_extent_io_ops = {
-	.readpage_end_io_hook = btree_readpage_end_io_hook,
-	.readpage_io_failed_hook = btree_io_failed_hook,
 	.submit_bio_hook = btree_submit_bio_hook,
 	/* note we're sharing with inode.c for the merge bio hook */
 	.merge_bio_hook = btrfs_merge_bio_hook,
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index d4cbfee..c69076c 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -111,6 +111,9 @@ static inline void btrfs_put_fs_root(struct btrfs_root *root)
 		kfree(root);
 }
 
+int verify_extent_buffer_read(struct btrfs_io_bio *io_bio,
+			struct page *page,
+			u64 start, u64 end, int mirror);
 void btrfs_mark_buffer_dirty(struct extent_buffer *buf);
 int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
 			  int atomic);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a7e715a..76a6e39 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -14,6 +14,7 @@
 #include "extent_io.h"
 #include "extent_map.h"
 #include "ctree.h"
+#include "disk-io.h"
 #include "btrfs_inode.h"
 #include "volumes.h"
 #include "check-integrity.h"
@@ -2179,7 +2180,7 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 		struct page *p = eb_head(eb)->pages[i];
 
 		ret = repair_io_failure(root->fs_info->btree_inode, start,
-					PAGE_CACHE_SIZE, start, p,
+					eb->len, start, p,
 					start - page_offset(p), mirror_num);
 		if (ret)
 			break;
@@ -3706,6 +3707,77 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
 	return ret;
 }
 
+static void end_bio_extent_buffer_readpage(struct bio *bio, int err)
+{
+	struct address_space *mapping = bio->bi_io_vec->bv_page->mapping;
+	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	struct extent_buffer *eb;
+	struct btrfs_root *root;
+	struct bio_vec *bvec;
+	struct page *page;
+	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	u64 start;
+	u64 end;
+	int mirror;
+	int ret;
+	int i;
+
+	if (err)
+		uptodate = 0;
+
+	bio_for_each_segment_all(bvec, bio, i) {
+		page = bvec->bv_page;
+		root = BTRFS_I(page->mapping->host)->root;
+
+		start = page_offset(page) + bvec->bv_offset;
+		end = start + bvec->bv_len - 1;
+
+		if (!page->private) {
+			unlock_page(page);
+			unlock_extent(tree, start, end);
+			continue;
+		}
+
+		eb = (struct extent_buffer *)page->private;
+
+		do {
+			/*
+			 * read_extent_buffer_pages() does not start I/O on
+			 * PG_uptodate pages, so the bio may map only part
+			 * of the extent buffer.
+			 */
+			if ((eb->start <= start) && (eb->start + eb->len - 1 > start))
+				break;
+		} while ((eb = eb->eb_next) != NULL);
+
+		BUG_ON(!eb);
+
+		mirror = io_bio->mirror_num;
+
+		if (uptodate) {
+			ret = verify_extent_buffer_read(io_bio, page, start,
+							end, mirror);
+			if (ret)
+				uptodate = 0;
+		}
+
+		if (!uptodate) {
+			set_bit(EXTENT_BUFFER_READ_ERR, &eb->ebflags);
+			eb->read_mirror = mirror;
+			atomic_dec(&eb_head(eb)->io_bvecs);
+			if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD,
+						&eb->ebflags))
+				btree_readahead_hook(root, eb, eb->start,
+						-EIO);
+			ClearPageUptodate(page);
+			SetPageError(page);
+		}
+	}
+
+	bio_put(bio);
+}
+
 static void end_extent_buffer_writeback(struct extent_buffer *eb)
 {
 	clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags);
@@ -5330,6 +5402,9 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 			     struct extent_buffer *eb, u64 start, int wait,
 			     get_extent_t *get_extent, int mirror_num)
 {
+	struct inode *inode = tree->mapping->host;
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+	struct extent_state *cached_state = NULL;
 	unsigned long i;
 	unsigned long start_i;
 	struct page *page;
@@ -5376,15 +5451,31 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 
 	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->ebflags);
 	eb->read_mirror = 0;
-	atomic_set(&eb->io_pages, num_reads);
+	atomic_set(&eb_head(eb)->io_bvecs, num_reads);
 	for (i = start_i; i < num_pages; i++) {
 		page = eb_head(eb)->pages[i];
 		if (!PageUptodate(page)) {
 			ClearPageError(page);
-			err = __extent_read_full_page(tree, page,
-						      get_extent, &bio,
-						      mirror_num, &bio_flags,
-						      READ | REQ_META);
+			if (eb->len < PAGE_CACHE_SIZE) {
+				lock_extent_bits(tree, eb->start, eb->start + eb->len - 1, 0,
+							&cached_state);
+				err = submit_extent_page(READ | REQ_META, tree,
+							page, eb->start >> 9,
+							eb->len, eb->start - page_offset(page),
+							fs_info->fs_devices->latest_bdev,
+							&bio, -1, end_bio_extent_buffer_readpage,
+							mirror_num, bio_flags, bio_flags);
+			} else {
+				lock_extent_bits(tree, page_offset(page),
+						page_offset(page) + PAGE_CACHE_SIZE - 1,
+						0, &cached_state);
+				err = submit_extent_page(READ | REQ_META, tree,
+							page, page_offset(page) >> 9,
+							PAGE_CACHE_SIZE, 0,
+							fs_info->fs_devices->latest_bdev,
+							&bio, -1, end_bio_extent_buffer_readpage,
+							mirror_num, bio_flags, bio_flags);
+			}
 			if (err)
 				ret = err;
 		} else {
@@ -5405,10 +5496,11 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 	for (i = start_i; i < num_pages; i++) {
 		page = eb_head(eb)->pages[i];
 		wait_on_page_locked(page);
-		if (!PageUptodate(page))
-			ret = -EIO;
 	}
 
+	if (!extent_buffer_uptodate(eb))
+		ret = -EIO;
+
 	return ret;
 
 unlock_exit:
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH V11 06/21] Btrfs: subpagesize-blocksize: Write only dirty extent buffers belonging to a page
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (4 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 05/21] Btrfs: subpagesize-blocksize: Read tree blocks whose size is < PAGE_SIZE Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-06-01 15:22 ` [RFC PATCH V11 07/21] Btrfs: subpagesize-blocksize: Allow mounting filesystems where sectorsize != PAGE_SIZE Chandan Rajendra
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

For the subpagesize-blocksize scenario, this patch adds the ability to write a
single extent buffer to the disk.
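
The idea, sketched below in standalone form (compilable userspace C with
made-up values; in the real code this is the ebs_to_write bitmap built by
btree_write_cache_pages()), is to record which blocks of a page are dirty in
a bitmap and submit I/O only for those, instead of writing the whole page:

	#include <stdio.h>

	#define BLOCKS_PER_PAGE 2	/* e.g. a 4K page holding 2K blocks */

	struct eb {
		unsigned long long start;
		int dirty;
	};

	int main(void)
	{
		struct eb page_ebs[BLOCKS_PER_PAGE] = {
			{ 14336, 1 },
			{ 16384, 0 },
		};
		unsigned long ebs_to_write = 0;
		int i;

		/* First pass: record which blocks in the page are dirty. */
		for (i = 0; i < BLOCKS_PER_PAGE; i++)
			if (page_ebs[i].dirty)
				ebs_to_write |= 1UL << i;

		/* Second pass: submit I/O only for the recorded blocks. */
		for (i = 0; i < BLOCKS_PER_PAGE; i++)
			if (ebs_to_write & (1UL << i))
				printf("write block at %llu\n",
				       page_ebs[i].start);
		return 0;
	}

This is also why lock_extent_buffer_for_io() now distinguishes blocks it
could mark for writeback from blocks still under a previous writeback
(return value 2), which are simply redirtied.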

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/disk-io.c   |  21 ++--
 fs/btrfs/extent_io.c | 279 +++++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 239 insertions(+), 61 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b794e33..9800888 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -499,17 +499,24 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 
 static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
 {
-	u64 start = page_offset(page);
-	u64 found_start;
 	struct extent_buffer *eb;
+	u64 found_start;
 
 	eb = (struct extent_buffer *)page->private;
-	if (page != eb->pages[0])
-		return 0;
-	found_start = btrfs_header_bytenr(eb);
-	if (WARN_ON(found_start != start || !PageUptodate(page)))
+	if (page != eb_head(eb)->pages[0])
 		return 0;
-	csum_tree_block(fs_info, eb, 0);
+
+	do {
+		if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags))
+			continue;
+		if (WARN_ON(!test_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags)))
+			continue;
+		found_start = btrfs_header_bytenr(eb);
+		if (WARN_ON(found_start != eb->start))
+			return 0;
+		csum_tree_block(fs_info, eb, 0);
+	} while ((eb = eb->eb_next) != NULL);
+
 	return 0;
 }
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 76a6e39..14b4e05 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3636,29 +3636,49 @@ void wait_on_extent_buffer_writeback(struct extent_buffer *eb)
 		    TASK_UNINTERRUPTIBLE);
 }
 
-static noinline_for_stack int
-lock_extent_buffer_for_io(struct extent_buffer *eb,
-			  struct btrfs_fs_info *fs_info,
-			  struct extent_page_data *epd)
+static void lock_extent_buffer_pages(struct extent_buffer_head *ebh,
+				struct extent_page_data *epd)
 {
+	struct extent_buffer *eb = &ebh->eb;
 	unsigned long i, num_pages;
-	int flush = 0;
+
+	num_pages = num_extent_pages(eb->start, eb->len);
+	for (i = 0; i < num_pages; i++) {
+		struct page *p = ebh->pages[i];
+		if (!trylock_page(p)) {
+			flush_write_bio(epd);
+			lock_page(p);
+		}
+	}
+}
+
+static noinline_for_stack int
+lock_extent_buffer_for_io(struct extent_buffer *eb,
+			struct btrfs_fs_info *fs_info,
+			struct extent_page_data *epd)
+{
+	int dirty;
 	int ret = 0;
 
 	if (!btrfs_try_tree_write_lock(eb)) {
-		flush = 1;
 		flush_write_bio(epd);
 		btrfs_tree_lock(eb);
 	}
 
 	if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags)) {
+		dirty = test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
 		btrfs_tree_unlock(eb);
-		if (!epd->sync_io)
-			return 0;
-		if (!flush) {
-			flush_write_bio(epd);
-			flush = 1;
+		if (!epd->sync_io) {
+			if (!dirty)
+				return 1;
+			else
+				return 2;
 		}
+
+		flush_write_bio(epd);
+
 		while (1) {
 			wait_on_extent_buffer_writeback(eb);
 			btrfs_tree_lock(eb);
@@ -3681,29 +3701,14 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
 		__percpu_counter_add(&fs_info->dirty_metadata_bytes,
 				     -eb->len,
 				     fs_info->dirty_metadata_batch);
-		ret = 1;
+		ret = 0;
 	} else {
 		spin_unlock(&eb_head(eb)->refs_lock);
+		ret = 1;
 	}
 
 	btrfs_tree_unlock(eb);
 
-	if (!ret)
-		return ret;
-
-	num_pages = num_extent_pages(eb->start, eb->len);
-	for (i = 0; i < num_pages; i++) {
-		struct page *p = eb_head(eb)->pages[i];
-
-		if (!trylock_page(p)) {
-			if (!flush) {
-				flush_write_bio(epd);
-				flush = 1;
-			}
-			lock_page(p);
-		}
-	}
-
 	return ret;
 }
 
@@ -3785,9 +3790,8 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
 	wake_up_bit(&eb->ebflags, EXTENT_BUFFER_WRITEBACK);
 }
 
-static void set_btree_ioerr(struct page *page)
+static void set_btree_ioerr(struct extent_buffer *eb, struct page *page)
 {
-	struct extent_buffer *eb = (struct extent_buffer *)page->private;
 	struct extent_buffer_head *ebh = eb_head(eb);
 	struct btrfs_inode *btree_ino = BTRFS_I(ebh->fs_info->btree_inode);
 
@@ -3848,7 +3852,7 @@ static void set_btree_ioerr(struct page *page)
 	}
 }
 
-static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
+static void end_bio_subpagesize_blocksize_ebh_writepage(struct bio *bio, int err)
 {
 	struct bio_vec *bvec;
 	struct extent_buffer *eb;
@@ -3856,14 +3860,55 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 
 	bio_for_each_segment_all(bvec, bio, i) {
 		struct page *page = bvec->bv_page;
+		u64 start, end;
+
+		eb = (struct extent_buffer *)page->private;
+		BUG_ON(!eb);
+		start = page_offset(page) + bvec->bv_offset;
+		end = start + bvec->bv_len - 1;
+
+		do {
+			if (!(eb->start >= start
+					&& (eb->start + eb->len) <= (end + 1))) {
+				continue;
+			}
+
+			done = atomic_dec_and_test(&eb_head(eb)->io_bvecs);
+
+			if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->ebflags)) {
+				ClearPageUptodate(page);
+				set_btree_ioerr(eb, page);
+			}
+
+			if (done)
+				end_page_writeback(page);
+
+			end_extent_buffer_writeback(eb);
+
+		} while ((eb = eb->eb_next) != NULL);
+
+	}
+
+	bio_put(bio);
+}
+
+static void end_bio_regular_ebh_writepage(struct bio *bio, int err)
+{
+	struct extent_buffer *eb;
+	struct bio_vec *bvec;
+	int i, done;
+
+	bio_for_each_segment_all(bvec, bio, i) {
+		struct page *page = bvec->bv_page;
 
 		eb = (struct extent_buffer *)page->private;
 		BUG_ON(!eb);
-		done = atomic_dec_and_test(&eb->io_pages);
+
+		done = atomic_dec_and_test(&eb_head(eb)->io_bvecs);
 
 		if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->ebflags)) {
 			ClearPageUptodate(page);
-			set_btree_ioerr(page);
+			set_btree_ioerr(eb, page);
 		}
 
 		end_page_writeback(page);
@@ -3877,14 +3922,17 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 	bio_put(bio);
 }
 
-static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
-			struct btrfs_fs_info *fs_info,
-			struct writeback_control *wbc,
-			struct extent_page_data *epd)
+
+static noinline_for_stack int
+write_regular_ebh(struct extent_buffer_head *ebh,
+		struct btrfs_fs_info *fs_info,
+		struct writeback_control *wbc,
+		struct extent_page_data *epd)
 {
 	struct block_device *bdev = fs_info->fs_devices->latest_bdev;
 	struct extent_io_tree *tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
-	u64 offset = eb->start;
+	struct extent_buffer *eb = &ebh->eb;
+	u64 offset = eb->start & ~(PAGE_CACHE_SIZE - 1);
 	unsigned long i, num_pages;
 	unsigned long bio_flags = 0;
 	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
@@ -3892,7 +3940,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 
 	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->ebflags);
 	num_pages = num_extent_pages(eb->start, eb->len);
-	atomic_set(&eb->io_pages, num_pages);
+	atomic_set(&eb_head(eb)->io_bvecs, num_pages);
 	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
 		bio_flags = EXTENT_BIO_TREE_LOG;
 
@@ -3903,13 +3951,14 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 		set_page_writeback(p);
 		ret = submit_extent_page(rw, tree, p, offset >> 9,
 					 PAGE_CACHE_SIZE, 0, bdev, &epd->bio,
-					 -1, end_bio_extent_buffer_writepage,
+					-1, end_bio_regular_ebh_writepage,
 					 0, epd->bio_flags, bio_flags);
 		epd->bio_flags = bio_flags;
 		if (ret) {
-			set_btree_ioerr(p);
+			set_btree_ioerr(eb, p);
 			end_page_writeback(p);
-			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
+			if (atomic_sub_and_test(num_pages - i,
+							&eb_head(eb)->io_bvecs))
 				end_extent_buffer_writeback(eb);
 			ret = -EIO;
 			break;
@@ -3930,12 +3979,84 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	return ret;
 }
 
+static int write_subpagesize_blocksize_ebh(struct extent_buffer_head *ebh,
+					struct btrfs_fs_info *fs_info,
+					struct writeback_control *wbc,
+					struct extent_page_data *epd,
+					unsigned long ebs_to_write)
+{
+	struct block_device *bdev = fs_info->fs_devices->latest_bdev;
+	struct extent_io_tree *tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
+	struct extent_buffer *eb;
+	struct page *p;
+	u64 offset;
+	unsigned long i;
+	unsigned long bio_flags = 0;
+	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
+	int ret = 0, err = 0;
+
+	eb = &ebh->eb;
+	p = ebh->pages[0];
+	clear_page_dirty_for_io(p);
+	set_page_writeback(p);
+	i = 0;
+	do {
+		if (!test_bit(i++, &ebs_to_write))
+			continue;
+
+		clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->ebflags);
+		atomic_inc(&eb_head(eb)->io_bvecs);
+
+		if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
+			bio_flags = EXTENT_BIO_TREE_LOG;
+
+		offset = eb->start - page_offset(p);
+
+		ret = submit_extent_page(rw, tree, p, eb->start >> 9,
+					eb->len, offset,
+					bdev, &epd->bio, -1,
+					end_bio_subpagesize_blocksize_ebh_writepage,
+					0, epd->bio_flags, bio_flags);
+		epd->bio_flags = bio_flags;
+		if (ret) {
+			set_btree_ioerr(eb, p);
+			atomic_dec(&eb_head(eb)->io_bvecs);
+			end_extent_buffer_writeback(eb);
+			err = -EIO;
+		}
+	} while ((eb = eb->eb_next) != NULL);
+
+	if (!err)
+		update_nr_written(p, wbc, 1);
+
+	unlock_page(p);
+
+	return ret;
+}
+
+static void redirty_extent_buffer_pages_for_writepage(struct extent_buffer *eb,
+						struct writeback_control *wbc)
+{
+	unsigned long i, num_pages;
+	struct page *p;
+
+	num_pages = num_extent_pages(eb->start, eb->len);
+	for (i = 0; i < num_pages; i++) {
+		p = eb_head(eb)->pages[i];
+		redirty_page_for_writepage(wbc, p);
+	return;
+}
+
 int btree_write_cache_pages(struct address_space *mapping,
-				   struct writeback_control *wbc)
+			struct writeback_control *wbc)
 {
 	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
 	struct btrfs_fs_info *fs_info = BTRFS_I(mapping->host)->root->fs_info;
-	struct extent_buffer *eb, *prev_eb = NULL;
+	struct extent_buffer *eb;
+	struct extent_buffer_head *ebh, *prev_ebh = NULL;
 	struct extent_page_data epd = {
 		.bio = NULL,
 		.tree = tree,
@@ -3946,6 +4067,7 @@ int btree_write_cache_pages(struct address_space *mapping,
 	int ret = 0;
 	int done = 0;
 	int nr_to_write_done = 0;
+	unsigned long ebs_to_write, dirty_ebs;
 	struct pagevec pvec;
 	int nr_pages;
 	pgoff_t index;
@@ -3972,7 +4094,7 @@ retry:
 	while (!done && !nr_to_write_done && (index <= end) &&
 	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
-		unsigned i;
+		unsigned i, j;
 
 		scanned = 1;
 		for (i = 0; i < nr_pages; i++) {
@@ -4004,30 +4126,79 @@ retry:
 				continue;
 			}
 
-			if (eb == prev_eb) {
+			ebh = eb_head(eb);
+			if (ebh == prev_ebh) {
 				spin_unlock(&mapping->private_lock);
 				continue;
 			}
 
-			ret = atomic_inc_not_zero(&eb->refs);
+			ret = atomic_inc_not_zero(&ebh->refs);
 			spin_unlock(&mapping->private_lock);
 			if (!ret)
 				continue;
 
-			prev_eb = eb;
-			ret = lock_extent_buffer_for_io(eb, fs_info, &epd);
-			if (!ret) {
-				free_extent_buffer(eb);
+			prev_ebh = ebh;
+
+			j = 0;
+			ebs_to_write = dirty_ebs = 0;
+			eb = &ebh->eb;
+			do {
+				BUG_ON(j >= BITS_PER_LONG);
+
+				ret = lock_extent_buffer_for_io(eb, fs_info, &epd);
+				switch (ret) {
+				case 0:
+					/*
+					 * EXTENT_BUFFER_DIRTY was set and we
+					 * were able to clear it.
+					 */
+					set_bit(j, &ebs_to_write);
+					break;
+				case 2:
+					/*
+					 * EXTENT_BUFFER_DIRTY was set, but we
+					 * were unable to clear
+					 * EXTENT_BUFFER_WRITEBACK, which was
+					 * set before we got the extent buffer
+					 * locked.
+					 */
+					set_bit(j, &dirty_ebs);
+					break;
+				default:
+					/* EXTENT_BUFFER_DIRTY wasn't set. */
+					break;
+				}
+				++j;
+			} while ((eb = eb->eb_next) != NULL);
+
+			ret = 0;
+
+			if (!ebs_to_write) {
+				free_extent_buffer(&ebh->eb);
 				continue;
 			}
 
-			ret = write_one_eb(eb, fs_info, wbc, &epd);
+			/*
+			 * Now that we know that at least one of the extent
+			 * buffers belonging to the extent buffer head must
+			 * be written to disk, lock the extent buffer head's
+			 * pages.
+			 */
+			lock_extent_buffer_pages(ebh, &epd);
+
+			if (ebh->eb.len < PAGE_CACHE_SIZE) {
+				ret = write_subpagesize_blocksize_ebh(ebh, fs_info, wbc, &epd, ebs_to_write);
+				if (dirty_ebs) {
+					redirty_extent_buffer_pages_for_writepage(&ebh->eb, wbc);
+				}
+			} else {
+				ret = write_regular_ebh(ebh, fs_info, wbc, &epd);
+			}
+
 			if (ret) {
 				done = 1;
-				free_extent_buffer(eb);
+				free_extent_buffer(&ebh->eb);
 				break;
 			}
-			free_extent_buffer(eb);
+			free_extent_buffer(&ebh->eb);
 
 			/*
 			 * the filesystem may choose to bump up nr_to_write.
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH V11 07/21] Btrfs: subpagesize-blocksize: Allow mounting filesystems where sectorsize != PAGE_SIZE
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (5 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 06/21] Btrfs: subpagesize-blocksize: Write only dirty extent buffers belonging to a page Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-06-01 15:22 ` [RFC PATCH V11 08/21] Btrfs: subpagesize-blocksize: Compute and look up csums based on sectorsized blocks Chandan Rajendra
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

This patch allows mounting filesystems whose block size is smaller than
PAGE_SIZE.
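
For instance, a filesystem with 2K blocks could now be created and mounted
on a machine with 4K pages along the lines of the commands below (the device
name is a placeholder):

	mkfs.btrfs -s 2048 -n 2048 /dev/sdX
	mount /dev/sdX /mnt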

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/disk-io.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9800888..5dea6b4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2816,12 +2816,6 @@ int open_ctree(struct super_block *sb,
 		goto fail_sb_buffer;
 	}
 
-	if (sectorsize != PAGE_SIZE) {
-		printk(KERN_ERR "BTRFS: incompatible sector size (%lu) "
-		       "found on %s\n", (unsigned long)sectorsize, sb->s_id);
-		goto fail_sb_buffer;
-	}
-
 	mutex_lock(&fs_info->chunk_mutex);
 	ret = btrfs_read_sys_array(tree_root);
 	mutex_unlock(&fs_info->chunk_mutex);
@@ -3963,13 +3957,13 @@ static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
 	 * Check the lower bound, the alignment and other constraints are
 	 * checked later.
 	 */
-	if (btrfs_super_nodesize(sb) < 4096) {
-		printk(KERN_ERR "BTRFS: nodesize too small: %u < 4096\n",
+	if (btrfs_super_nodesize(sb) < 2048) {
+		printk(KERN_ERR "BTRFS: nodesize too small: %u < 2048\n",
 				btrfs_super_nodesize(sb));
 		ret = -EINVAL;
 	}
-	if (btrfs_super_sectorsize(sb) < 4096) {
-		printk(KERN_ERR "BTRFS: sectorsize too small: %u < 4096\n",
+	if (btrfs_super_sectorsize(sb) < 2048) {
+		printk(KERN_ERR "BTRFS: sectorsize too small: %u < 2048\n",
 				btrfs_super_sectorsize(sb));
 		ret = -EINVAL;
 	}
-- 
2.1.0



* [RFC PATCH V11 08/21] Btrfs: subpagesize-blocksize: Compute and look up csums based on sectorsized blocks.
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (6 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 07/21] Btrfs: subpagesize-blocksize: Allow mounting filesystems where sectorsize != PAGE_SIZE Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-07-01 14:37   ` Liu Bo
  2015-06-01 15:22 ` [RFC PATCH V11 09/21] Btrfs: subpagesize-blocksize: Direct I/O read: Work " Chandan Rajendra
                   ` (12 subsequent siblings)
  20 siblings, 1 reply; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

Checksums are applicable to sectorsize units. The current code uses the
bio_vec's bv_len units to compute and look up checksums. This works only on
machines where sectorsize == PAGE_SIZE. This patch makes the checksum
computation and lookup code work with sectorsize units.
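
As an illustration, here is a minimal user-space sketch of the new
per-sector walk over a single bio_vec (the sizes below are assumed values
chosen for illustration, not taken from the patch):

  #include <stdio.h>

  int main(void)
  {
  	unsigned int bv_len = 4096;		/* one page-sized bio_vec */
  	unsigned int sectorsize = 2048;		/* subpagesize blocksize */
  	unsigned int blocksize_bits = 11;	/* log2(sectorsize) */
  	int nr_sectors = (bv_len + sectorsize - 1) >> blocksize_bits;
  	unsigned long long offset = 0;
  	int i;

  	for (i = 0; i < nr_sectors; i++) {
  		/* one checksum per sector, not one per bio_vec */
  		printf("csum block: offset=%llu len=%u\n", offset, sectorsize);
  		offset += sectorsize;
  	}
  	return 0;
  }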

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/file-item.c | 87 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 54 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 58ece65..65ab9c3 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -172,6 +172,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
 	u64 item_start_offset = 0;
 	u64 item_last_offset = 0;
 	u64 disk_bytenr;
+	u64 page_bytes_left;
 	u32 diff;
 	int nblocks;
 	int bio_index = 0;
@@ -220,6 +221,8 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
 	disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
 	if (dio)
 		offset = logical_offset;
+
+	page_bytes_left = bvec->bv_len;
 	while (bio_index < bio->bi_vcnt) {
 		if (!dio)
 			offset = page_offset(bvec->bv_page) + bvec->bv_offset;
@@ -243,7 +246,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
 				if (BTRFS_I(inode)->root->root_key.objectid ==
 				    BTRFS_DATA_RELOC_TREE_OBJECTID) {
 					set_extent_bits(io_tree, offset,
-						offset + bvec->bv_len - 1,
+						offset + root->sectorsize - 1,
 						EXTENT_NODATASUM, GFP_NOFS);
 				} else {
 					btrfs_info(BTRFS_I(inode)->root->fs_info,
@@ -281,11 +284,17 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
 found:
 		csum += count * csum_size;
 		nblocks -= count;
-		bio_index += count;
+
 		while (count--) {
-			disk_bytenr += bvec->bv_len;
-			offset += bvec->bv_len;
-			bvec++;
+			disk_bytenr += root->sectorsize;
+			offset += root->sectorsize;
+			page_bytes_left -= root->sectorsize;
+			if (!page_bytes_left) {
+				bio_index++;
+				bvec++;
+				page_bytes_left = bvec->bv_len;
+			}
+
 		}
 	}
 	btrfs_free_path(path);
@@ -432,6 +441,8 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
 	struct bio_vec *bvec = bio->bi_io_vec;
 	int bio_index = 0;
 	int index;
+	int nr_sectors;
+	int i;
 	unsigned long total_bytes = 0;
 	unsigned long this_sum_bytes = 0;
 	u64 offset;
@@ -459,41 +470,51 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
 		if (!contig)
 			offset = page_offset(bvec->bv_page) + bvec->bv_offset;
 
-		if (offset >= ordered->file_offset + ordered->len ||
-		    offset < ordered->file_offset) {
-			unsigned long bytes_left;
-			sums->len = this_sum_bytes;
-			this_sum_bytes = 0;
-			btrfs_add_ordered_sum(inode, ordered, sums);
-			btrfs_put_ordered_extent(ordered);
+		data = kmap_atomic(bvec->bv_page);
 
-			bytes_left = bio->bi_iter.bi_size - total_bytes;
 
-			sums = kzalloc(btrfs_ordered_sum_size(root, bytes_left),
-				       GFP_NOFS);
-			BUG_ON(!sums); /* -ENOMEM */
-			sums->len = bytes_left;
-			ordered = btrfs_lookup_ordered_extent(inode, offset);
-			BUG_ON(!ordered); /* Logic error */
-			sums->bytenr = ((u64)bio->bi_iter.bi_sector << 9) +
-				       total_bytes;
-			index = 0;
+		nr_sectors = (bvec->bv_len + root->sectorsize - 1)
+			>> root->fs_info->sb->s_blocksize_bits;
+
+
+		for (i = 0; i < nr_sectors; i++) {
+			if (offset >= ordered->file_offset + ordered->len ||
+				offset < ordered->file_offset) {
+				unsigned long bytes_left;
+				sums->len = this_sum_bytes;
+				this_sum_bytes = 0;
+				btrfs_add_ordered_sum(inode, ordered, sums);
+				btrfs_put_ordered_extent(ordered);
+
+				bytes_left = bio->bi_iter.bi_size - total_bytes;
+
+				sums = kzalloc(btrfs_ordered_sum_size(root, bytes_left),
+					GFP_NOFS);
+				BUG_ON(!sums); /* -ENOMEM */
+				sums->len = bytes_left;
+				ordered = btrfs_lookup_ordered_extent(inode, offset);
+				BUG_ON(!ordered); /* Logic error */
+				sums->bytenr = ((u64)bio->bi_iter.bi_sector << 9) +
+					total_bytes;
+				index = 0;
+			}
+
+			sums->sums[index] = ~(u32)0;
+			sums->sums[index]
+				= btrfs_csum_data(data + bvec->bv_offset + (i * root->sectorsize),
+						sums->sums[index],
+						root->sectorsize);
+			btrfs_csum_final(sums->sums[index],
+					(char *)(sums->sums + index));
+			index++;
+			offset += root->sectorsize;
+			this_sum_bytes += root->sectorsize;
+			total_bytes += root->sectorsize;
 		}
 
-		data = kmap_atomic(bvec->bv_page);
-		sums->sums[index] = ~(u32)0;
-		sums->sums[index] = btrfs_csum_data(data + bvec->bv_offset,
-						    sums->sums[index],
-						    bvec->bv_len);
 		kunmap_atomic(data);
-		btrfs_csum_final(sums->sums[index],
-				 (char *)(sums->sums + index));
 
 		bio_index++;
-		index++;
-		total_bytes += bvec->bv_len;
-		this_sum_bytes += bvec->bv_len;
-		offset += bvec->bv_len;
 		bvec++;
 	}
 	this_sum_bytes = 0;
-- 
2.1.0



* [RFC PATCH V11 09/21] Btrfs: subpagesize-blocksize: Direct I/O read: Work on sectorsized blocks.
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (7 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 08/21] Btrfs: subpagesize-blocksize: Compute and look up csums based on sectorsized blocks Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-07-01 14:45   ` Liu Bo
  2015-06-01 15:22 ` [RFC PATCH V11 10/21] Btrfs: subpagesize-blocksize: fallocate: Work with sectorsized units Chandan Rajendra
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

The direct I/O read's endio and corresponding repair functions work on
page-sized blocks. Fix them to work on sectorsized blocks.
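
A minimal user-space sketch of how the reworked repair loop splits one
bio_vec into sectorsized retry units (all values below are assumptions for
illustration only):

  #include <stdio.h>

  int main(void)
  {
  	unsigned int bv_len = 4096, bv_offset = 0;	/* one bio_vec */
  	unsigned int sectorsize = 2048;
  	unsigned int blocksize_bits = 11;
  	unsigned long long start = 8192;		/* logical offset */
  	int nr_sectors = bv_len >> blocksize_bits;
  	unsigned int pgoff = bv_offset;

  	while (nr_sectors--) {
  		/* each retry bio now covers exactly one sector */
  		printf("repair: pgoff=%u range=[%llu, %llu]\n",
  		       pgoff, start, start + sectorsize - 1);
  		start += sectorsize;
  		pgoff += sectorsize;
  	}
  	return 0;
  }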

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/inode.c | 94 ++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 71 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ac6a3f3..958e4e6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7643,9 +7643,9 @@ static int btrfs_check_dio_repairable(struct inode *inode,
 }
 
 static int dio_read_error(struct inode *inode, struct bio *failed_bio,
-			  struct page *page, u64 start, u64 end,
-			  int failed_mirror, bio_end_io_t *repair_endio,
-			  void *repair_arg)
+			struct page *page, unsigned int pgoff,
+			u64 start, u64 end, int failed_mirror,
+			bio_end_io_t *repair_endio, void *repair_arg)
 {
 	struct io_failure_record *failrec;
 	struct bio *bio;
@@ -7666,7 +7666,9 @@ static int dio_read_error(struct inode *inode, struct bio *failed_bio,
 		return -EIO;
 	}
 
-	if (failed_bio->bi_vcnt > 1)
+	if ((failed_bio->bi_vcnt > 1)
+		|| (failed_bio->bi_io_vec->bv_len
+			> BTRFS_I(inode)->root->sectorsize))
 		read_mode = READ_SYNC | REQ_FAILFAST_DEV;
 	else
 		read_mode = READ_SYNC;
@@ -7674,7 +7676,7 @@ static int dio_read_error(struct inode *inode, struct bio *failed_bio,
 	isector = start - btrfs_io_bio(failed_bio)->logical;
 	isector >>= inode->i_sb->s_blocksize_bits;
 	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
-				      0, isector, repair_endio, repair_arg);
+				pgoff, isector, repair_endio, repair_arg);
 	if (!bio) {
 		free_io_failure(inode, failrec);
 		return -EIO;
@@ -7704,12 +7706,17 @@ struct btrfs_retry_complete {
 static void btrfs_retry_endio_nocsum(struct bio *bio, int err)
 {
 	struct btrfs_retry_complete *done = bio->bi_private;
+	struct inode *inode;
 	struct bio_vec *bvec;
 	int i;
 
 	if (err)
 		goto end;
 
+	BUG_ON(bio->bi_vcnt != 1);
+	inode = bio->bi_io_vec->bv_page->mapping->host;
+	BUG_ON(bio->bi_io_vec->bv_len != BTRFS_I(inode)->root->sectorsize);
+
 	done->uptodate = 1;
 	bio_for_each_segment_all(bvec, bio, i)
 		clean_io_failure(done->inode, done->start, bvec->bv_page, 0);
@@ -7724,22 +7731,30 @@ static int __btrfs_correct_data_nocsum(struct inode *inode,
 	struct bio_vec *bvec;
 	struct btrfs_retry_complete done;
 	u64 start;
+	unsigned int pgoff;
+	u32 sectorsize;
+	int nr_sectors;
 	int i;
 	int ret;
 
+	sectorsize = BTRFS_I(inode)->root->sectorsize;
+
 	start = io_bio->logical;
 	done.inode = inode;
 
 	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
-try_again:
+		nr_sectors = bvec->bv_len >> inode->i_sb->s_blocksize_bits;
+		pgoff = bvec->bv_offset;
+
+next_block_or_try_again:
 		done.uptodate = 0;
 		done.start = start;
 		init_completion(&done.done);
 
-		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
-				     start + bvec->bv_len - 1,
-				     io_bio->mirror_num,
-				     btrfs_retry_endio_nocsum, &done);
+		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page,
+				pgoff, start, start + sectorsize - 1,
+				io_bio->mirror_num,
+				btrfs_retry_endio_nocsum, &done);
 		if (ret)
 			return ret;
 
@@ -7747,10 +7762,15 @@ try_again:
 
 		if (!done.uptodate) {
 			/* We might have another mirror, so try again */
-			goto try_again;
+			goto next_block_or_try_again;
 		}
 
-		start += bvec->bv_len;
+		start += sectorsize;
+
+		if (--nr_sectors) {
+			pgoff += sectorsize;
+			goto next_block_or_try_again;
+		}
 	}
 
 	return 0;
@@ -7760,7 +7780,9 @@ static void btrfs_retry_endio(struct bio *bio, int err)
 {
 	struct btrfs_retry_complete *done = bio->bi_private;
 	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	struct inode *inode;
 	struct bio_vec *bvec;
+	u64 start;
 	int uptodate;
 	int ret;
 	int i;
@@ -7769,13 +7791,20 @@ static void btrfs_retry_endio(struct bio *bio, int err)
 		goto end;
 
 	uptodate = 1;
+
+	start = done->start;
+
+	BUG_ON(bio->bi_vcnt != 1);
+	inode = bio->bi_io_vec->bv_page->mapping->host;
+	BUG_ON(bio->bi_io_vec->bv_len != BTRFS_I(inode)->root->sectorsize);
+
 	bio_for_each_segment_all(bvec, bio, i) {
 		ret = __readpage_endio_check(done->inode, io_bio, i,
-					     bvec->bv_page, 0,
-					     done->start, bvec->bv_len);
+					bvec->bv_page, bvec->bv_offset,
+					done->start, bvec->bv_len);
 		if (!ret)
 			clean_io_failure(done->inode, done->start,
-					 bvec->bv_page, 0);
+					bvec->bv_page, bvec->bv_offset);
 		else
 			uptodate = 0;
 	}
@@ -7793,16 +7822,30 @@ static int __btrfs_subio_endio_read(struct inode *inode,
 	struct btrfs_retry_complete done;
 	u64 start;
 	u64 offset = 0;
+	u32 sectorsize;
+	int nr_sectors;
+	unsigned int pgoff;
+	int csum_pos;
 	int i;
 	int ret;
+	unsigned char blocksize_bits;
+
+	blocksize_bits = inode->i_sb->s_blocksize_bits;
+	sectorsize = BTRFS_I(inode)->root->sectorsize;
 
 	err = 0;
 	start = io_bio->logical;
 	done.inode = inode;
 
 	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
-		ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
-					     0, start, bvec->bv_len);
+		nr_sectors = bvec->bv_len >> blocksize_bits;
+		pgoff = bvec->bv_offset;
+next_block:
+		csum_pos = offset >> blocksize_bits;
+
+		ret = __readpage_endio_check(inode, io_bio, csum_pos,
+					bvec->bv_page, pgoff, start,
+					sectorsize);
 		if (likely(!ret))
 			goto next;
 try_again:
@@ -7810,10 +7853,10 @@ try_again:
 		done.start = start;
 		init_completion(&done.done);
 
-		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
-				     start + bvec->bv_len - 1,
-				     io_bio->mirror_num,
-				     btrfs_retry_endio, &done);
+		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page,
+				pgoff, start, start + sectorsize - 1,
+				io_bio->mirror_num,
+				btrfs_retry_endio, &done);
 		if (ret) {
 			err = ret;
 			goto next;
@@ -7826,8 +7869,13 @@ try_again:
 			goto try_again;
 		}
 next:
-		offset += bvec->bv_len;
-		start += bvec->bv_len;
+		offset += sectorsize;
+		start += sectorsize;
+
+		if (--nr_sectors) {
+			pgoff += sectorsize;
+			goto next_block;
+		}
 	}
 
 	return err;
-- 
2.1.0



* [RFC PATCH V11 10/21] Btrfs: subpagesize-blocksize: fallocate: Work with sectorsized units.
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (8 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 09/21] Btrfs: subpagesize-blocksize: Direct I/O read: Work " Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-06-01 15:22 ` [RFC PATCH V11 11/21] Btrfs: subpagesize-blocksize: btrfs_page_mkwrite: Reserve space in " Chandan Rajendra
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

This commit makes fallocate work with sectorsized units. While at it, it
also changes btrfs_truncate_page() to truncate sectorsized blocks instead of
pages. Hence the function has been renamed to btrfs_truncate_block().
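
The zeroing range is now derived from the block containing "from" rather
than from the page. A standalone sketch of that arithmetic (the offset and
block size are assumed example values):

  #include <stdio.h>

  int main(void)
  {
  	unsigned long long from = 5120;		/* truncate offset */
  	unsigned int blocksize = 2048;
  	unsigned long long block_start =
  		from & ~(unsigned long long)(blocksize - 1);
  	unsigned long long block_end = block_start + blocksize - 1;
  	unsigned int offset = from & (blocksize - 1);

  	/* the old code always zeroed within a PAGE_CACHE_SIZE range */
  	printf("zero within block [%llu, %llu], offset in block = %u\n",
  	       block_start, block_end, offset);
  	return 0;
  }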

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/file.c  | 47 +++++++++++++++++++++++++----------------------
 fs/btrfs/inode.c | 55 +++++++++++++++++++++++++++++--------------------------
 3 files changed, 55 insertions(+), 49 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2bc3e0e..3e535f1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3896,7 +3896,7 @@ int btrfs_unlink_subvol(struct btrfs_trans_handle *trans,
 			struct btrfs_root *root,
 			struct inode *dir, u64 objectid,
 			const char *name, int name_len);
-int btrfs_truncate_page(struct inode *inode, loff_t from, loff_t len,
+int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
 			int front);
 int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 			       struct btrfs_root *root,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 287192fb..9600410a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2297,23 +2297,26 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	u64 tail_len;
 	u64 orig_start = offset;
 	u64 cur_offset;
+	unsigned char blocksize_bits;
 	u64 min_size = btrfs_calc_trunc_metadata_size(root, 1);
 	u64 drop_end;
 	int ret = 0;
 	int err = 0;
 	int rsv_count;
-	bool same_page;
+	bool same_block;
 	bool no_holes = btrfs_fs_incompat(root->fs_info, NO_HOLES);
 	u64 ino_size;
-	bool truncated_page = false;
+	bool truncated_block = false;
 	bool updated_inode = false;
 
+	blocksize_bits = inode->i_sb->s_blocksize_bits;
+
 	ret = btrfs_wait_ordered_range(inode, offset, len);
 	if (ret)
 		return ret;
 
 	mutex_lock(&inode->i_mutex);
-	ino_size = round_up(inode->i_size, PAGE_CACHE_SIZE);
+	ino_size = round_up(inode->i_size, root->sectorsize);
 	ret = find_first_non_hole(inode, &offset, &len);
 	if (ret < 0)
 		goto out_only_mutex;
@@ -2326,31 +2329,30 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	lockstart = round_up(offset, BTRFS_I(inode)->root->sectorsize);
 	lockend = round_down(offset + len,
 			     BTRFS_I(inode)->root->sectorsize) - 1;
-	same_page = ((offset >> PAGE_CACHE_SHIFT) ==
-		    ((offset + len - 1) >> PAGE_CACHE_SHIFT));
-
+	same_block = ((offset >> blocksize_bits)
+		== ((offset + len - 1) >> blocksize_bits));
 	/*
-	 * We needn't truncate any page which is beyond the end of the file
+	 * We needn't truncate any block which is beyond the end of the file
 	 * because we are sure there is no data there.
 	 */
 	/*
-	 * Only do this if we are in the same page and we aren't doing the
-	 * entire page.
+	 * Only do this if we are in the same block and we aren't doing the
+	 * entire block.
 	 */
-	if (same_page && len < PAGE_CACHE_SIZE) {
+	if (same_block && len < root->sectorsize) {
 		if (offset < ino_size) {
-			truncated_page = true;
-			ret = btrfs_truncate_page(inode, offset, len, 0);
+			truncated_block = true;
+			ret = btrfs_truncate_block(inode, offset, len, 0);
 		} else {
 			ret = 0;
 		}
 		goto out_only_mutex;
 	}
 
-	/* zero back part of the first page */
+	/* zero back part of the first block */
 	if (offset < ino_size) {
-		truncated_page = true;
-		ret = btrfs_truncate_page(inode, offset, 0, 0);
+		truncated_block = true;
+		ret = btrfs_truncate_block(inode, offset, 0, 0);
 		if (ret) {
 			mutex_unlock(&inode->i_mutex);
 			return ret;
@@ -2385,9 +2387,10 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 		if (!ret) {
 			/* zero the front end of the last page */
 			if (tail_start + tail_len < ino_size) {
-				truncated_page = true;
-				ret = btrfs_truncate_page(inode,
-						tail_start + tail_len, 0, 1);
+				truncated_block = true;
+				ret = btrfs_truncate_block(inode,
+							tail_start + tail_len,
+							0, 1);
 				if (ret)
 					goto out_only_mutex;
 			}
@@ -2554,7 +2557,7 @@ out:
 	unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend,
 			     &cached_state, GFP_NOFS);
 out_only_mutex:
-	if (!updated_inode && truncated_page && !ret && !err) {
+	if (!updated_inode && truncated_block && !ret && !err) {
 		/*
 		 * If we only end up zeroing part of a page, we still need to
 		 * update the inode item, so that all the time fields are
@@ -2622,10 +2625,10 @@ static long btrfs_fallocate(struct file *file, int mode,
 	} else {
 		/*
 		 * If we are fallocating from the end of the file onward we
-		 * need to zero out the end of the page if i_size lands in the
-		 * middle of a page.
+		 * need to zero out the end of the block if i_size lands in the
+		 * middle of a block.
 		 */
-		ret = btrfs_truncate_page(inode, inode->i_size, 0, 0);
+		ret = btrfs_truncate_block(inode, inode->i_size, 0, 0);
 		if (ret)
 			goto out;
 	}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 958e4e6..9486e61 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4540,17 +4540,17 @@ error:
 }
 
 /*
- * btrfs_truncate_page - read, zero a chunk and write a page
+ * btrfs_truncate_block - read, zero a chunk and write a block
  * @inode - inode that we're zeroing
  * @from - the offset to start zeroing
  * @len - the length to zero, 0 to zero the entire range respective to the
  *	offset
  * @front - zero up to the offset instead of from the offset on
  *
- * This will find the page for the "from" offset and cow the page and zero the
+ * This will find the block for the "from" offset and cow the block and zero the
  * part we want to zero.  This is used with truncate and hole punching.
  */
-int btrfs_truncate_page(struct inode *inode, loff_t from, loff_t len,
+int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
 			int front)
 {
 	struct address_space *mapping = inode->i_mapping;
@@ -4561,30 +4561,30 @@ int btrfs_truncate_page(struct inode *inode, loff_t from, loff_t len,
 	char *kaddr;
 	u32 blocksize = root->sectorsize;
 	pgoff_t index = from >> PAGE_CACHE_SHIFT;
-	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	unsigned offset = from & (blocksize - 1);
 	struct page *page;
 	gfp_t mask = btrfs_alloc_write_mask(mapping);
 	int ret = 0;
-	u64 page_start;
-	u64 page_end;
+	u64 block_start;
+	u64 block_end;
 
 	if ((offset & (blocksize - 1)) == 0 &&
 	    (!len || ((len & (blocksize - 1)) == 0)))
 		goto out;
-	ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
+	ret = btrfs_delalloc_reserve_space(inode, blocksize);
 	if (ret)
 		goto out;
 
 again:
 	page = find_or_create_page(mapping, index, mask);
 	if (!page) {
-		btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+		btrfs_delalloc_release_space(inode, blocksize);
 		ret = -ENOMEM;
 		goto out;
 	}
 
-	page_start = page_offset(page);
-	page_end = page_start + PAGE_CACHE_SIZE - 1;
+	block_start = round_down(from, blocksize);
+	block_end = block_start + blocksize - 1;
 
 	if (!PageUptodate(page)) {
 		ret = btrfs_readpage(NULL, page);
@@ -4601,12 +4601,12 @@ again:
 	}
 	wait_on_page_writeback(page);
 
-	lock_extent_bits(io_tree, page_start, page_end, 0, &cached_state);
+	lock_extent_bits(io_tree, block_start, block_end, 0, &cached_state);
 	set_page_extent_mapped(page);
 
-	ordered = btrfs_lookup_ordered_extent(inode, page_start);
+	ordered = btrfs_lookup_ordered_extent(inode, block_start);
 	if (ordered) {
-		unlock_extent_cached(io_tree, page_start, page_end,
+		unlock_extent_cached(io_tree, block_start, block_end,
 				     &cached_state, GFP_NOFS);
 		unlock_page(page);
 		page_cache_release(page);
@@ -4615,41 +4615,44 @@ again:
 		goto again;
 	}
 
-	clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start, page_end,
+	clear_extent_bit(&BTRFS_I(inode)->io_tree, block_start, block_end,
 			  EXTENT_DIRTY | EXTENT_DELALLOC |
 			  EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
 			  0, 0, &cached_state, GFP_NOFS);
 
-	ret = btrfs_set_extent_delalloc(inode, page_start, page_end,
+	ret = btrfs_set_extent_delalloc(inode, block_start, block_end,
 					&cached_state);
 	if (ret) {
-		unlock_extent_cached(io_tree, page_start, page_end,
+		unlock_extent_cached(io_tree, block_start, block_end,
 				     &cached_state, GFP_NOFS);
 		goto out_unlock;
 	}
 
+
 	set_page_blks_state(page, 1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
-			page_start, page_end);
+			block_start, block_end);
 
-	if (offset != PAGE_CACHE_SIZE) {
+	if (offset != blocksize) {
 		if (!len)
-			len = PAGE_CACHE_SIZE - offset;
+			len = blocksize - offset;
 		kaddr = kmap(page);
 		if (front)
-			memset(kaddr, 0, offset);
+			memset(kaddr + (block_start - page_offset(page)),
+				0, offset);
 		else
-			memset(kaddr + offset, 0, len);
+			memset(kaddr + (block_start - page_offset(page)) + offset,
+				0, len);
 		flush_dcache_page(page);
 		kunmap(page);
 	}
 	ClearPageChecked(page);
 	set_page_dirty(page);
-	unlock_extent_cached(io_tree, page_start, page_end, &cached_state,
+	unlock_extent_cached(io_tree, block_start, block_end, &cached_state,
 			     GFP_NOFS);
 
 out_unlock:
 	if (ret)
-		btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+		btrfs_delalloc_release_space(inode, blocksize);
 	unlock_page(page);
 	page_cache_release(page);
 out:
@@ -4720,11 +4723,11 @@ int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size)
 	int err = 0;
 
 	/*
-	 * If our size started in the middle of a page we need to zero out the
-	 * rest of the page before we expand the i_size, otherwise we could
+	 * If our size started in the middle of a block we need to zero out the
+	 * rest of the block before we expand the i_size, otherwise we could
 	 * expose stale data.
 	 */
-	err = btrfs_truncate_page(inode, oldsize, 0, 0);
+	err = btrfs_truncate_block(inode, oldsize, 0, 0);
 	if (err)
 		return err;
 
-- 
2.1.0



* [RFC PATCH V11 11/21] Btrfs: subpagesize-blocksize: btrfs_page_mkwrite: Reserve space in sectorsized units.
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (9 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 10/21] Btrfs: subpagesize-blocksize: fallocate: Work with sectorsized units Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-07-06  3:18   ` Liu Bo
  2015-06-01 15:22 ` [RFC PATCH V11 12/21] Btrfs: subpagesize-blocksize: Search for all ordered extents that could span across a page Chandan Rajendra
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

In the subpagesize-blocksize scenario, if i_size occurs in a block which is
not the last block in the page, then the space to be reserved must be
computed by rounding (i_size - page_start) up to the block size rather than
assuming a whole page.
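
A standalone sketch of the reservation arithmetic (i_size, page and block
sizes below are assumed example values):

  #include <stdio.h>

  int main(void)
  {
  	unsigned long long size = 9216;		/* i_size */
  	unsigned long long page_start = 8192;	/* page containing i_size */
  	unsigned int sectorsize = 2048;
  	unsigned int page_size = 4096;
  	unsigned long long reserved_space =
  		(size - page_start + sectorsize - 1) &
  		~(unsigned long long)(sectorsize - 1);

  	/* 2048 bytes stay reserved; the other 2048 are released */
  	printf("reserved=%llu released=%llu\n",
  	       reserved_space, page_size - reserved_space);
  	return 0;
  }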

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/inode.c | 36 +++++++++++++++++++++++++++++++-----
 1 file changed, 31 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9486e61..e9bab73 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8601,11 +8601,24 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	loff_t size;
 	int ret;
 	int reserved = 0;
+	u64 reserved_space;
 	u64 page_start;
 	u64 page_end;
+	u64 end;
+
+	reserved_space = PAGE_CACHE_SIZE;
 
 	sb_start_pagefault(inode->i_sb);
-	ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
+
+	/*
+	 * Reserving delalloc space after obtaining the page lock can lead to
+	 * deadlock. For example, if a dirty page is locked by this function
+	 * and the call to btrfs_delalloc_reserve_space() ends up triggering
+	 * dirty page write out, then the btrfs_writepage() function could
+	 * end up waiting indefinitely to get a lock on the page currently
+	 * being processed by the btrfs_page_mkwrite() function.
+	 */
+	ret  = btrfs_delalloc_reserve_space(inode, reserved_space);
 	if (!ret) {
 		ret = file_update_time(vma->vm_file);
 		reserved = 1;
@@ -8626,6 +8639,7 @@ again:
 	size = i_size_read(inode);
 	page_start = page_offset(page);
 	page_end = page_start + PAGE_CACHE_SIZE - 1;
+	end = page_end;
 
 	if ((page->mapping != inode->i_mapping) ||
 	    (page_start >= size)) {
@@ -8641,7 +8655,7 @@ again:
 	 * we can't set the delalloc bits if there are pending ordered
 	 * extents.  Drop our locks and wait for them to finish
 	 */
-	ordered = btrfs_lookup_ordered_extent(inode, page_start);
+	ordered = btrfs_lookup_ordered_range(inode, page_start, page_end);
 	if (ordered) {
 		unlock_extent_cached(io_tree, page_start, page_end,
 				     &cached_state, GFP_NOFS);
@@ -8651,6 +8665,18 @@ again:
 		goto again;
 	}
 
+	if (page->index == ((size - 1) >> PAGE_CACHE_SHIFT)) {
+		reserved_space = round_up(size - page_start, root->sectorsize);
+		if (reserved_space < PAGE_CACHE_SIZE) {
+			end = page_start + reserved_space - 1;
+			spin_lock(&BTRFS_I(inode)->lock);
+			BTRFS_I(inode)->outstanding_extents++;
+			spin_unlock(&BTRFS_I(inode)->lock);
+			btrfs_delalloc_release_space(inode,
+						PAGE_CACHE_SIZE - reserved_space);
+		}
+	}
+
 	/*
 	 * XXX - page_mkwrite gets called every time the page is dirtied, even
 	 * if it was already dirty, so for space accounting reasons we need to
@@ -8658,12 +8684,12 @@ again:
 	 * is probably a better way to do this, but for now keep consistent with
 	 * prepare_pages in the normal write path.
 	 */
-	clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start, page_end,
+	clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start, end,
 			  EXTENT_DIRTY | EXTENT_DELALLOC |
 			  EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
 			  0, 0, &cached_state, GFP_NOFS);
 
-	ret = btrfs_set_extent_delalloc(inode, page_start, page_end,
+	ret = btrfs_set_extent_delalloc(inode, page_start, end,
 					&cached_state);
 	if (ret) {
 		unlock_extent_cached(io_tree, page_start, page_end,
@@ -8706,7 +8732,7 @@ out_unlock:
 	}
 	unlock_page(page);
 out:
-	btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+	btrfs_delalloc_release_space(inode, reserved_space);
 out_noreserve:
 	sb_end_pagefault(inode->i_sb);
 	return ret;
-- 
2.1.0



* [RFC PATCH V11 12/21] Btrfs: subpagesize-blocksize: Search for all ordered extents that could span across a page.
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (10 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 11/21] Btrfs: subpagesize-blocksize: btrfs_page_mkwrite: Reserve space in " Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-07-01 14:47   ` Liu Bo
  2015-06-01 15:22 ` [RFC PATCH V11 13/21] Btrfs: subpagesize-blocksize: Deal with partial ordered extent allocations Chandan Rajendra
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

In the subpagesize-blocksize scenario it is not sufficient to search using
just the first byte of the page to make sure that no ordered extents are
present across the page. Fix this by searching across the whole range
covered by the page.
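
A standalone sketch of why probing a single byte is not enough (the offsets
below are assumptions chosen to show the difference):

  #include <stdio.h>

  int main(void)
  {
  	unsigned long long page_start = 4096, page_size = 4096;
  	/* ordered extent covering only the second 2k block of the page */
  	unsigned long long oe_start = 6144, oe_len = 2048;

  	/* old check: does an ordered extent cover byte page_start? */
  	int old_hit = oe_start <= page_start &&
  		      page_start < oe_start + oe_len;
  	/* new check: does one overlap [page_start, page_start + page_size)? */
  	int new_hit = oe_start < page_start + page_size &&
  		      page_start < oe_start + oe_len;

  	printf("old=%d new=%d\n", old_hit, new_hit);	/* old=0 new=1 */
  	return 0;
  }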

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/extent_io.c | 3 ++-
 fs/btrfs/inode.c     | 4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 14b4e05..0b017e1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3244,7 +3244,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
 
 	while (1) {
 		lock_extent(tree, start, end);
-		ordered = btrfs_lookup_ordered_extent(inode, start);
+		ordered = btrfs_lookup_ordered_range(inode, start,
+						PAGE_CACHE_SIZE);
 		if (!ordered)
 			break;
 		unlock_extent(tree, start, end);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e9bab73..8b4aaed 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1976,7 +1976,7 @@ again:
 	if (PagePrivate2(page))
 		goto out;
 
-	ordered = btrfs_lookup_ordered_extent(inode, page_start);
+	ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_CACHE_SIZE);
 	if (ordered) {
 		unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start,
 				     page_end, &cached_state, GFP_NOFS);
@@ -8513,7 +8513,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 
 	if (!inode_evicting)
 		lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
-	ordered = btrfs_lookup_ordered_extent(inode, page_start);
+	ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_CACHE_SIZE);
 	if (ordered) {
 		/*
 		 * IO on this page will never be started, so we need
-- 
2.1.0



* [RFC PATCH V11 13/21] Btrfs: subpagesize-blocksize: Deal with partial ordered extent allocations.
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (11 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 12/21] Btrfs: subpagesize-blocksize: Search for all ordered extents that could span across a page Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-07-06 10:06   ` Liu Bo
  2015-06-01 15:22 ` [RFC PATCH V11 14/21] Btrfs: subpagesize-blocksize: Explicitly Track I/O status of blocks of an ordered extent Chandan Rajendra
                   ` (7 subsequent siblings)
  20 siblings, 1 reply; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

In the subpagesize-blocksize scenario, extent allocations for only some of
the dirty blocks of a page can succeed, while allocations for the rest of
the blocks can fail. This patch allows I/O against such partially allocated
ordered extents to be submitted.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/extent_io.c | 27 ++++++++++++++-------------
 fs/btrfs/inode.c     | 35 ++++++++++++++++++++++-------------
 2 files changed, 36 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0b017e1..0110abc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1850,17 +1850,23 @@ int extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
 			if (page_ops & PAGE_SET_PRIVATE2)
 				SetPagePrivate2(pages[i]);
 
+			if (page_ops & PAGE_SET_ERROR)
+				SetPageError(pages[i]);
+
 			if (pages[i] == locked_page) {
 				page_cache_release(pages[i]);
 				continue;
 			}
-			if (page_ops & PAGE_CLEAR_DIRTY)
+
+			if ((page_ops & PAGE_CLEAR_DIRTY)
+				&& !PagePrivate2(pages[i]))
 				clear_page_dirty_for_io(pages[i]);
-			if (page_ops & PAGE_SET_WRITEBACK)
+			if ((page_ops & PAGE_SET_WRITEBACK)
+				&& !PagePrivate2(pages[i]))
 				set_page_writeback(pages[i]);
-			if (page_ops & PAGE_SET_ERROR)
-				SetPageError(pages[i]);
-			if (page_ops & PAGE_END_WRITEBACK)
+
+			if ((page_ops & PAGE_END_WRITEBACK)
+				&& !PagePrivate2(pages[i]))
 				end_page_writeback(pages[i]);
 			if (page_ops & PAGE_UNLOCK)
 				unlock_page(pages[i]);
@@ -2550,7 +2556,7 @@ int end_extent_writepage(struct page *page, int err, u64 start, u64 end)
 			uptodate = 0;
 	}
 
-	if (!uptodate) {
+	if (!uptodate || PageError(page)) {
 		ClearPageUptodate(page);
 		SetPageError(page);
 		ret = ret < 0 ? ret : -EIO;
@@ -3340,7 +3346,6 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode,
 					       nr_written);
 		/* File system has been set read-only */
 		if (ret) {
-			SetPageError(page);
 			/* fill_delalloc should be return < 0 for error
 			 * but just in case, we use > 0 here meaning the
 			 * IO is started, so we don't want to return > 0
@@ -3561,7 +3566,6 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
 	struct inode *inode = page->mapping->host;
 	struct extent_page_data *epd = data;
 	u64 start = page_offset(page);
-	u64 page_end = start + PAGE_CACHE_SIZE - 1;
 	int ret;
 	int nr = 0;
 	size_t pg_offset = 0;
@@ -3606,7 +3610,7 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
 	ret = writepage_delalloc(inode, page, wbc, epd, start, &nr_written);
 	if (ret == 1)
 		goto done_unlocked;
-	if (ret)
+	if (ret && !PagePrivate2(page))
 		goto done;
 
 	ret = __extent_writepage_io(inode, page, wbc, epd,
@@ -3620,10 +3624,7 @@ done:
 		set_page_writeback(page);
 		end_page_writeback(page);
 	}
-	if (PageError(page)) {
-		ret = ret < 0 ? ret : -EIO;
-		end_extent_writepage(page, ret, start, page_end);
-	}
+
 	unlock_page(page);
 	return ret;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8b4aaed..bff60c6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -925,6 +925,8 @@ static noinline int cow_file_range(struct inode *inode,
 	struct btrfs_key ins;
 	struct extent_map *em;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+	struct btrfs_ordered_extent *ordered;
+	unsigned long page_ops, extent_ops;
 	int ret = 0;
 
 	if (btrfs_is_free_space_inode(inode)) {
@@ -969,8 +971,6 @@ static noinline int cow_file_range(struct inode *inode,
 	btrfs_drop_extent_cache(inode, start, start + num_bytes - 1, 0);
 
 	while (disk_num_bytes > 0) {
-		unsigned long op;
-
 		cur_alloc_size = disk_num_bytes;
 		ret = btrfs_reserve_extent(root, cur_alloc_size,
 					   root->sectorsize, 0, alloc_hint,
@@ -1023,7 +1023,7 @@ static noinline int cow_file_range(struct inode *inode,
 			ret = btrfs_reloc_clone_csums(inode, start,
 						      cur_alloc_size);
 			if (ret)
-				goto out_drop_extent_cache;
+				goto out_remove_ordered_extent;
 		}
 
 		if (disk_num_bytes < cur_alloc_size)
@@ -1036,13 +1036,12 @@ static noinline int cow_file_range(struct inode *inode,
 		 * Do set the Private2 bit so we know this page was properly
 		 * setup for writepage
 		 */
-		op = unlock ? PAGE_UNLOCK : 0;
-		op |= PAGE_SET_PRIVATE2;
-
+		page_ops = unlock ? PAGE_UNLOCK : 0;
+		page_ops |= PAGE_SET_PRIVATE2;
+		extent_ops = EXTENT_LOCKED | EXTENT_DELALLOC;
 		extent_clear_unlock_delalloc(inode, start,
-					     start + ram_size - 1, locked_page,
-					     EXTENT_LOCKED | EXTENT_DELALLOC,
-					     op);
+					start + ram_size - 1, locked_page,
+					extent_ops, page_ops);
 		disk_num_bytes -= cur_alloc_size;
 		num_bytes -= cur_alloc_size;
 		alloc_hint = ins.objectid + ins.offset;
@@ -1051,16 +1050,26 @@ static noinline int cow_file_range(struct inode *inode,
 out:
 	return ret;
 
+out_remove_ordered_extent:
+	ordered = btrfs_lookup_ordered_extent(inode, ins.objectid);
+	BUG_ON(!ordered);
+	btrfs_remove_ordered_extent(inode, ordered);
+
 out_drop_extent_cache:
 	btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0);
+
 out_reserve:
 	btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1);
+
 out_unlock:
+	page_ops = unlock ? PAGE_UNLOCK : 0;
+	page_ops |= PAGE_CLEAR_DIRTY | PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK
+		| PAGE_SET_ERROR;
+	extent_ops = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING
+		| EXTENT_DEFRAG;
+
 	extent_clear_unlock_delalloc(inode, start, end, locked_page,
-				     EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
-				     EXTENT_DELALLOC | EXTENT_DEFRAG,
-				     PAGE_UNLOCK | PAGE_CLEAR_DIRTY |
-				     PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK);
+				extent_ops, page_ops);
 	goto out;
 }
 
-- 
2.1.0



* [RFC PATCH V11 14/21] Btrfs: subpagesize-blocksize: Explicitly Track I/O status of blocks of an ordered extent.
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (12 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 13/21] Btrfs: subpagesize-blocksize: Deal with partial ordered extent allocations Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-07-20  8:34   ` Liu Bo
  2015-06-01 15:22 ` [RFC PATCH V11 15/21] Btrfs: subpagesize-blocksize: Revert commit fc4adbff823f76577ece26dcb88bf6f8392dbd43 Chandan Rajendra
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

In the subpagesize-blocksize scenario a page can have more than one block.
So, in addition to the PagePrivate2 flag, we have to track the I/O status of
each block of a page to reliably mark the ordered extent as complete.
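
A standalone sketch of the per-block bitmap idea behind blocks_done (sizes
are assumed example values; the kernel code uses set_bit() and friends on a
bitmap of unsigned longs):

  #include <stdio.h>

  int main(void)
  {
  	unsigned int blocksize_bits = 11;	/* 2k blocks */
  	unsigned long long oe_len = 8192;	/* ordered extent length */
  	unsigned long nr_blks = oe_len >> blocksize_bits;	/* 4 blocks */
  	unsigned long blocks_done = 0;		/* one long suffices here */
  	unsigned long blk;

  	/* write endio completes the block at extent offset 4096 */
  	blk = 4096 >> blocksize_bits;		/* block index 2 */
  	blocks_done |= 1UL << blk;

  	/* the ordered extent finishes only when every bit is set */
  	printf("block %lu done, extent complete=%d\n", blk,
  	       blocks_done == (1UL << nr_blks) - 1);
  	return 0;
  }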

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/extent_io.c    |  19 +--
 fs/btrfs/extent_io.h    |   5 +-
 fs/btrfs/inode.c        | 346 +++++++++++++++++++++++++++++++++++-------------
 fs/btrfs/ordered-data.c |  17 +++
 fs/btrfs/ordered-data.h |   4 +
 5 files changed, 287 insertions(+), 104 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0110abc..55f900a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4545,11 +4545,10 @@ int extent_invalidatepage(struct extent_io_tree *tree,
  * to drop the page.
  */
 static int try_release_extent_state(struct extent_map_tree *map,
-				    struct extent_io_tree *tree,
-				    struct page *page, gfp_t mask)
+				struct extent_io_tree *tree,
+				struct page *page, u64 start, u64 end,
+				gfp_t mask)
 {
-	u64 start = page_offset(page);
-	u64 end = start + PAGE_CACHE_SIZE - 1;
 	int ret = 1;
 
 	if (test_range_bit(tree, start, end,
@@ -4583,12 +4582,12 @@ static int try_release_extent_state(struct extent_map_tree *map,
  * map records are removed
  */
 int try_release_extent_mapping(struct extent_map_tree *map,
-			       struct extent_io_tree *tree, struct page *page,
-			       gfp_t mask)
+			struct extent_io_tree *tree, struct page *page,
+			u64 start, u64 end, gfp_t mask)
 {
 	struct extent_map *em;
-	u64 start = page_offset(page);
-	u64 end = start + PAGE_CACHE_SIZE - 1;
+	u64 orig_start = start;
+	u64 orig_end = end;
 
 	if ((mask & __GFP_WAIT) &&
 	    page->mapping->host->i_size > 16 * 1024 * 1024) {
@@ -4622,7 +4621,9 @@ int try_release_extent_mapping(struct extent_map_tree *map,
 			free_extent_map(em);
 		}
 	}
-	return try_release_extent_state(map, tree, page, mask);
+	return try_release_extent_state(map, tree, page,
+					orig_start, orig_end,
+					mask);
 }
 
 /*
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 8fe5ac3..c629e53 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -217,8 +217,9 @@ typedef struct extent_map *(get_extent_t)(struct inode *inode,
 void extent_io_tree_init(struct extent_io_tree *tree,
 			 struct address_space *mapping);
 int try_release_extent_mapping(struct extent_map_tree *map,
-			       struct extent_io_tree *tree, struct page *page,
-			       gfp_t mask);
+			struct extent_io_tree *tree, struct page *page,
+			u64 start, u64 end,
+			gfp_t mask);
 int try_release_extent_buffer(struct page *page);
 int lock_extent(struct extent_io_tree *tree, u64 start, u64 end);
 int lock_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bff60c6..bfffc62 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2990,56 +2990,115 @@ static void finish_ordered_fn(struct btrfs_work *work)
 	btrfs_finish_ordered_io(ordered_extent);
 }
 
-static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
-				struct extent_state *state, int uptodate)
+static void mark_blks_io_complete(struct btrfs_ordered_extent *ordered,
+				u64 blk, u64 nr_blks, int uptodate)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = ordered->inode;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct btrfs_ordered_extent *ordered_extent = NULL;
 	struct btrfs_workqueue *wq;
 	btrfs_work_func_t func;
-	u64 ordered_start, ordered_end;
 	int done;
 
-	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
+	while (nr_blks--) {
+		if (test_and_set_bit(blk, ordered->blocks_done)) {
+			blk++;
+			continue;
+		}
 
-	ClearPagePrivate2(page);
-loop:
-	ordered_extent = btrfs_lookup_ordered_range(inode, start,
-						end - start + 1);
-	if (!ordered_extent)
-		goto out;
+		done = btrfs_dec_test_ordered_pending(inode, &ordered,
+						ordered->file_offset
+						+ (blk << inode->i_sb->s_blocksize_bits),
+						root->sectorsize,
+						uptodate);
+		if (done) {
+			if (btrfs_is_free_space_inode(inode)) {
+				wq = root->fs_info->endio_freespace_worker;
+				func = btrfs_freespace_write_helper;
+			} else {
+				wq = root->fs_info->endio_write_workers;
+				func = btrfs_endio_write_helper;
+			}
 
-	ordered_start = max_t(u64, start, ordered_extent->file_offset);
-	ordered_end = min_t(u64, end,
-			ordered_extent->file_offset + ordered_extent->len - 1);
-
-	done = btrfs_dec_test_ordered_pending(inode, &ordered_extent,
-					ordered_start,
-					ordered_end - ordered_start + 1,
-					uptodate);
-	if (done) {
-		if (btrfs_is_free_space_inode(inode)) {
-			wq = root->fs_info->endio_freespace_worker;
-			func = btrfs_freespace_write_helper;
-		} else {
-			wq = root->fs_info->endio_write_workers;
-			func = btrfs_endio_write_helper;
+			btrfs_init_work(&ordered->work, func,
+					finish_ordered_fn, NULL, NULL);
+			btrfs_queue_work(wq, &ordered->work);
 		}
 
-		btrfs_init_work(&ordered_extent->work, func,
-				finish_ordered_fn, NULL, NULL);
-		btrfs_queue_work(wq, &ordered_extent->work);
+		blk++;
 	}
+}
 
-	btrfs_put_ordered_extent(ordered_extent);
+int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
+				struct extent_state *state, int uptodate)
+{
+	struct inode *inode = page->mapping->host;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_ordered_extent *ordered_extent = NULL;
+	u64 blk, nr_blks;
+	int clear;
 
-	start = ordered_end + 1;
+	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
 
-	if (start < end)
-		goto loop;
+	while (start < end) {
+		ordered_extent = btrfs_lookup_ordered_extent(inode, start);
+		if (!ordered_extent) {
+			start += root->sectorsize;
+			continue;
+		}
+
+		blk = (start - ordered_extent->file_offset)
+			>> inode->i_sb->s_blocksize_bits;
+
+		nr_blks = (min(end, ordered_extent->file_offset + ordered_extent->len - 1)
+			+ 1 - start) >> inode->i_sb->s_blocksize_bits;
+
+		BUG_ON(!nr_blks);
+
+		mark_blks_io_complete(ordered_extent, blk, nr_blks, uptodate);
+
+		start = ordered_extent->file_offset + ordered_extent->len;
+
+		btrfs_put_ordered_extent(ordered_extent);
+	}
+
+	start = page_offset(page);
+	end = start + PAGE_CACHE_SIZE - 1;
+	clear = 1;
+
+	while (start < end) {
+		ordered_extent = btrfs_lookup_ordered_extent(inode, start);
+		if (!ordered_extent) {
+			start += root->sectorsize;
+			continue;
+		}
+
+		blk = (start - ordered_extent->file_offset)
+			>> inode->i_sb->s_blocksize_bits;
+		nr_blks = (min(end, ordered_extent->file_offset + ordered_extent->len - 1)
+			+ 1  - start) >> inode->i_sb->s_blocksize_bits;
+
+		BUG_ON(!nr_blks);
+
+		while (nr_blks--) {
+			if (!test_bit(blk++, ordered_extent->blocks_done)) {
+				clear = 0;
+				break;
+			}
+		}
+
+		if (!clear) {
+			btrfs_put_ordered_extent(ordered_extent);
+			break;
+		}
+
+		start += ordered_extent->len;
+
+		btrfs_put_ordered_extent(ordered_extent);
+	}
+
+	if (clear)
+		ClearPagePrivate2(page);
 
-out:
 	return 0;
 }
 
@@ -8472,7 +8531,9 @@ btrfs_readpages(struct file *file, struct address_space *mapping,
 	return extent_readpages(tree, mapping, pages, nr_pages,
 				btrfs_get_extent);
 }
-static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
+
+static int __btrfs_releasepage(struct page *page, u64 start, u64 end,
+			gfp_t gfp_flags)
 {
 	struct extent_io_tree *tree;
 	struct extent_map_tree *map;
@@ -8480,31 +8541,149 @@ static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
 
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
 	map = &BTRFS_I(page->mapping->host)->extent_tree;
-	ret = try_release_extent_mapping(map, tree, page, gfp_flags);
-	if (ret == 1)
+
+	ret = try_release_extent_mapping(map, tree, page, start, end,
+					gfp_flags);
+	if ((ret == 1) && ((end - start + 1) == PAGE_CACHE_SIZE)) {
 		clear_page_extent_mapped(page);
+	} else {
+		ret = 0;
+	}
 
 	return ret;
 }
 
 static int btrfs_releasepage(struct page *page, gfp_t gfp_flags)
 {
+	u64 start = page_offset(page);
+	u64 end = start + PAGE_CACHE_SIZE - 1;
+
 	if (PageWriteback(page) || PageDirty(page))
 		return 0;
-	return __btrfs_releasepage(page, gfp_flags & GFP_NOFS);
+
+	return __btrfs_releasepage(page, start, end, gfp_flags & GFP_NOFS);
+}
+
+static void invalidate_ordered_extent_blocks(struct inode *inode,
+					struct btrfs_ordered_extent *ordered,
+					u64 locked_start, u64 locked_end,
+					u64 cur,
+					int inode_evicting)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_ordered_inode_tree *ordered_tree;
+	struct extent_io_tree *tree;
+	u64 blk, blk_done, nr_blks;
+	u64 end;
+	u64 new_len;
+
+	tree = &BTRFS_I(inode)->io_tree;
+
+	end = min(locked_end, ordered->file_offset + ordered->len - 1);
+
+	if (!inode_evicting) {
+		clear_extent_bit(tree, cur, end,
+				EXTENT_DIRTY | EXTENT_DELALLOC |
+				EXTENT_DO_ACCOUNTING |
+				EXTENT_DEFRAG, 1, 0, NULL,
+				GFP_NOFS);
+		unlock_extent(tree, locked_start, locked_end);
+	}
+
+
+	ordered_tree = &BTRFS_I(inode)->ordered_tree;
+	spin_lock_irq(&ordered_tree->lock);
+	set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
+	new_len = cur - ordered->file_offset;
+	if (new_len < ordered->truncated_len)
+		ordered->truncated_len = new_len;
+
+	blk = (cur - ordered->file_offset) >> inode->i_sb->s_blocksize_bits;
+	nr_blks = (end + 1 - cur) >> inode->i_sb->s_blocksize_bits;
+
+	while (nr_blks--) {
+		blk_done = !test_and_set_bit(blk, ordered->blocks_done);
+		if (blk_done) {
+			spin_unlock_irq(&ordered_tree->lock);
+			if (btrfs_dec_test_ordered_pending(inode, &ordered,
+								ordered->file_offset + (blk << inode->i_sb->s_blocksize_bits),
+								root->sectorsize,
+								1))
+				btrfs_finish_ordered_io(ordered);
+
+			spin_lock_irq(&ordered_tree->lock);
+		}
+		blk++;
+	}
+
+	spin_unlock_irq(&ordered_tree->lock);
+
+	if (!inode_evicting)
+		lock_extent_bits(tree, locked_start, locked_end, 0, NULL);
+}
+
+static int page_blocks_written(struct page *page)
+{
+	struct btrfs_ordered_extent *ordered;
+	struct btrfs_root *root;
+	struct inode *inode;
+	unsigned long outstanding_blk;
+	u64 page_start, page_end;
+	u64 blk, last_blk, nr_blks;
+	u64 cur;
+	u64 len;
+
+	inode = page->mapping->host;
+	root = BTRFS_I(inode)->root;
+
+	page_start = page_offset(page);
+	page_end = page_start + PAGE_CACHE_SIZE - 1;
+
+	cur = page_start;
+	while (cur < page_end) {
+		ordered = btrfs_lookup_ordered_extent(inode, cur);
+		if (!ordered) {
+			cur += root->sectorsize;
+			continue;
+		}
+
+		blk = (cur - ordered->file_offset)
+			>> inode->i_sb->s_blocksize_bits;
+		len = min(page_end, ordered->file_offset + ordered->len - 1)
+			- cur + 1;
+		nr_blks = len >> inode->i_sb->s_blocksize_bits;
+
+		last_blk = blk + nr_blks - 1;
+
+		outstanding_blk = find_next_zero_bit(ordered->blocks_done,
+						ordered->len >> inode->i_sb->s_blocksize_bits,
+						blk);
+		if (outstanding_blk <= last_blk) {
+			btrfs_put_ordered_extent(ordered);
+			return 0;
+		}
+
+		btrfs_put_ordered_extent(ordered);
+		cur += len;
+	}
+
+	return 1;
 }
 
 static void btrfs_invalidatepage(struct page *page, unsigned int offset,
-				 unsigned int length)
+				unsigned int length)
 {
 	struct inode *inode = page->mapping->host;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct extent_io_tree *tree;
 	struct btrfs_ordered_extent *ordered;
-	struct extent_state *cached_state = NULL;
-	u64 page_start = page_offset(page);
-	u64 page_end = page_start + PAGE_CACHE_SIZE - 1;
+	u64 start, end, cur;
+	u64 page_start, page_end;
 	int inode_evicting = inode->i_state & I_FREEING;
 
+	page_start = page_offset(page);
+	page_end = page_start + PAGE_CACHE_SIZE - 1;
+
 	/*
 	 * we have the page locked, so new writeback can't start,
 	 * and the dirty bit won't be cleared while we are here.
@@ -8515,73 +8694,54 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	wait_on_page_writeback(page);
 
 	tree = &BTRFS_I(inode)->io_tree;
-	if (offset) {
+
+	start = round_up(offset, root->sectorsize);
+	end = round_down(offset + length, root->sectorsize) - 1;
+	if (end - start + 1 < root->sectorsize) {
 		btrfs_releasepage(page, GFP_NOFS);
 		return;
 	}
 
+	start = round_up(page_start + offset, root->sectorsize);
+	end = round_down(page_start + offset + length,
+			root->sectorsize) - 1;
+
 	if (!inode_evicting)
-		lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
-	ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_CACHE_SIZE);
-	if (ordered) {
-		/*
-		 * IO on this page will never be started, so we need
-		 * to account for any ordered extents now
-		 */
-		if (!inode_evicting)
-			clear_extent_bit(tree, page_start, page_end,
-					 EXTENT_DIRTY | EXTENT_DELALLOC |
-					 EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
-					 EXTENT_DEFRAG, 1, 0, &cached_state,
-					 GFP_NOFS);
-		/*
-		 * whoever cleared the private bit is responsible
-		 * for the finish_ordered_io
-		 */
-		if (TestClearPagePrivate2(page)) {
-			struct btrfs_ordered_inode_tree *tree;
-			u64 new_len;
+		lock_extent_bits(tree, start, end, 0, NULL);
 
-			tree = &BTRFS_I(inode)->ordered_tree;
+	cur = start;
+	while (cur < end) {
+		ordered = btrfs_lookup_ordered_extent(inode, cur);
+		if (!ordered) {
+			cur += root->sectorsize;
+			continue;
+		}
 
-			spin_lock_irq(&tree->lock);
-			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
-			new_len = page_start - ordered->file_offset;
-			if (new_len < ordered->truncated_len)
-				ordered->truncated_len = new_len;
-			spin_unlock_irq(&tree->lock);
+		invalidate_ordered_extent_blocks(inode, ordered,
+						start, end, cur,
+						inode_evicting);
 
-			if (btrfs_dec_test_ordered_pending(inode, &ordered,
-							   page_start,
-							   PAGE_CACHE_SIZE, 1))
-				btrfs_finish_ordered_io(ordered);
-		}
+		cur = min(end + 1, ordered->file_offset + ordered->len);
 		btrfs_put_ordered_extent(ordered);
-		if (!inode_evicting) {
-			cached_state = NULL;
-			lock_extent_bits(tree, page_start, page_end, 0,
-					 &cached_state);
-		}
 	}
 
-	if (!inode_evicting) {
-		clear_extent_bit(tree, page_start, page_end,
-				 EXTENT_LOCKED | EXTENT_DIRTY |
-				 EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
-				 EXTENT_DEFRAG, 1, 1,
-				 &cached_state, GFP_NOFS);
+	if (page_blocks_written(page))
+		ClearPagePrivate2(page);
 
-		__btrfs_releasepage(page, GFP_NOFS);
+	if (!inode_evicting) {
+		clear_extent_bit(tree, start, end,
+				EXTENT_LOCKED | EXTENT_DIRTY |
+				EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
+				EXTENT_DEFRAG, 1, 1, NULL, GFP_NOFS);
 	}
 
-	ClearPageChecked(page);
-	if (PagePrivate(page)) {
-		ClearPagePrivate(page);
-		set_page_private(page, 0);
-		page_cache_release(page);
+	if (!offset && length == PAGE_CACHE_SIZE) {
+		WARN_ON(!__btrfs_releasepage(page, start, end, GFP_NOFS));
+		ClearPageChecked(page);
 	}
 }
 
+
 /*
  * btrfs_page_mkwrite() is not allowed to change the file size as it gets
  * called from a page fault handler when a page is first dirtied. Hence we must
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 157cc54..8e614ca 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -189,12 +189,25 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 	struct btrfs_ordered_inode_tree *tree;
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry;
+	u64 nr_longs;
 
 	tree = &BTRFS_I(inode)->ordered_tree;
 	entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
 	if (!entry)
 		return -ENOMEM;
 
+	nr_longs = BITS_TO_LONGS(len >> inode->i_sb->s_blocksize_bits);
+	if (nr_longs == 1) {
+		entry->blocks_done = &entry->blocks_bitmap;
+	} else {
+		entry->blocks_done = kzalloc(nr_longs * sizeof(unsigned long),
+					GFP_NOFS);
+		if (!entry->blocks_done) {
+			kmem_cache_free(btrfs_ordered_extent_cache, entry);
+			return -ENOMEM;
+		}
+	}
+
 	entry->file_offset = file_offset;
 	entry->start = start;
 	entry->len = len;
@@ -553,6 +566,10 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry)
 			list_del(&sum->list);
 			kfree(sum);
 		}
+
+		if (entry->blocks_done != &entry->blocks_bitmap)
+			kfree(entry->blocks_done);
+
 		kmem_cache_free(btrfs_ordered_extent_cache, entry);
 	}
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index e96cd4c..4b3356a 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -140,6 +140,10 @@ struct btrfs_ordered_extent {
 	struct completion completion;
 	struct btrfs_work flush_work;
 	struct list_head work_list;
+
+	/* bitmap to track the blocks that have been written to disk */
+	unsigned long *blocks_done;
+	unsigned long blocks_bitmap;
 };
 
 /*
-- 
2.1.0



* [RFC PATCH V11 15/21] Btrfs: subpagesize-blocksize: Revert commit fc4adbff823f76577ece26dcb88bf6f8392dbd43.
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (13 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 14/21] Btrfs: subpagesize-blocksize: Explicitly Track I/O status of blocks of an ordered extent Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-06-01 15:22 ` [RFC PATCH V11 16/21] Btrfs: subpagesize-blocksize: Prevent writes to an extent buffer when PG_writeback flag is set Chandan Rajendra
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

In the subpagesize-blocksize scenario we have multiple blocks in a page.
Checking for the existence of a page in the page cache is not a sufficient
check, since we could be truncating only a subset of the blocks mapped by
the page.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/btrfs_inode.h | 2 --
 fs/btrfs/file.c        | 4 +++-
 fs/btrfs/inode.c       | 3 ++-
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 0ef5cc1..2bf8043 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -323,6 +323,4 @@ static inline void btrfs_inode_resume_unlocked_dio(struct inode *inode)
 		  &BTRFS_I(inode)->runtime_flags);
 }
 
-bool btrfs_page_exists_in_range(struct inode *inode, loff_t start, loff_t end);
-
 #endif
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9600410a..cc9feed 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2419,7 +2419,9 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 		if ((!ordered ||
 		    (ordered->file_offset + ordered->len <= lockstart ||
 		     ordered->file_offset > lockend)) &&
-		     !btrfs_page_exists_in_range(inode, lockstart, lockend)) {
+		     !test_range_bit(&BTRFS_I(inode)->io_tree, lockstart,
+				     lockend, EXTENT_UPTODATE, 0,
+				     cached_state)) {
 			if (ordered)
 				btrfs_put_ordered_extent(ordered);
 			break;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bfffc62..03faff0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7281,7 +7281,8 @@ out:
 	return ret;
 }
 
-bool btrfs_page_exists_in_range(struct inode *inode, loff_t start, loff_t end)
+static bool btrfs_page_exists_in_range(struct inode *inode, loff_t start,
+				loff_t end)
 {
 	struct radix_tree_root *root = &inode->i_mapping->page_tree;
 	int found = false;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH V11 16/21] Btrfs: subpagesize-blocksize: Prevent writes to an extent buffer when PG_writeback flag is set.
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (14 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 15/21] Btrfs: subpagesize-blocksize: Revert commit fc4adbff823f76577ece26dcb88bf6f8392dbd43 Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-06-01 15:22 ` [RFC PATCH V11 17/21] Btrfs: subpagesize-blocksize: Use (eb->start, seq) as search key for tree modification log Chandan Rajendra
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

In the non-subpagesize-blocksize scenario, the BTRFS_HEADER_FLAG_WRITTEN flag
prevents Btrfs from writing into an extent buffer whose pages are under
writeback. This facility isn't sufficient in the subpagesize-blocksize
scenario, since more than one extent buffer can be mapped to a page.

Hence this patch adds a new flag (i.e. EXTENT_BUFFER_HEAD_WRITEBACK) and
corresponding code to track the writeback status of the page and to prevent
writes to any of the extent buffers mapped to the page while writeback is
going on.
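
The new flag follows the usual wait_on_bit_io()/wake_up_bit() pairing;
condensed from the hunks below, the two halves look roughly like this:

	/* writer side: block until the extent buffer head leaves writeback */
	if (test_bit(EXTENT_BUFFER_HEAD_WRITEBACK, &ebh->bflags)) {
		btrfs_set_lock_blocking(buf);	/* drop the spinning lock */
		wait_on_bit_io(&ebh->bflags, EXTENT_BUFFER_HEAD_WRITEBACK,
			       TASK_UNINTERRUPTIBLE);
	}

	/* endio side: clear the bit, then wake any waiters */
	clear_bit(EXTENT_BUFFER_HEAD_WRITEBACK, &ebh->bflags);
	smp_mb__after_atomic();	/* order the clear before the waitqueue check */
	wake_up_bit(&ebh->bflags, EXTENT_BUFFER_HEAD_WRITEBACK);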

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/ctree.c       |  21 ++++++-
 fs/btrfs/extent-tree.c |  11 ++++
 fs/btrfs/extent_io.c   | 150 ++++++++++++++++++++++++++++++++++++++++---------
 fs/btrfs/extent_io.h   |   1 +
 4 files changed, 155 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index b28f14d..ba6fbb0 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1535,6 +1535,7 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans,
 		    struct extent_buffer *parent, int parent_slot,
 		    struct extent_buffer **cow_ret)
 {
+	struct extent_buffer_head *ebh = eb_head(buf);
 	u64 search_start;
 	int ret;
 
@@ -1548,6 +1549,14 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans,
 		       trans->transid, root->fs_info->generation);
 
 	if (!should_cow_block(trans, root, buf)) {
+		if (test_bit(EXTENT_BUFFER_HEAD_WRITEBACK, &ebh->bflags)) {
+			if (parent)
+				btrfs_set_lock_blocking(parent);
+			btrfs_set_lock_blocking(buf);
+			wait_on_bit_io(&ebh->bflags,
+				EXTENT_BUFFER_HEAD_WRITEBACK,
+				TASK_UNINTERRUPTIBLE);
+		}
 		*cow_ret = buf;
 		return 0;
 	}
@@ -2665,6 +2674,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root
 		      *root, struct btrfs_key *key, struct btrfs_path *p, int
 		      ins_len, int cow)
 {
+	struct extent_buffer_head *ebh;
 	struct extent_buffer *b;
 	int slot;
 	int ret;
@@ -2767,8 +2777,17 @@ again:
 			 * then we don't want to set the path blocking,
 			 * so we test it here
 			 */
-			if (!should_cow_block(trans, root, b))
+			if (!should_cow_block(trans, root, b)) {
+				ebh = eb_head(b);
+				if (test_bit(EXTENT_BUFFER_HEAD_WRITEBACK,
+						&ebh->bflags)) {
+					btrfs_set_path_blocking(p);
+					wait_on_bit_io(&ebh->bflags,
+						EXTENT_BUFFER_HEAD_WRITEBACK,
+						TASK_UNINTERRUPTIBLE);
+				}
 				goto cow_done;
+			}
 
 			/*
 			 * must have write locks on this node and the
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b93a922..fc324b8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7435,14 +7435,25 @@ static struct extent_buffer *
 btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		      u64 bytenr, int level)
 {
+	struct extent_buffer_head *ebh;
 	struct extent_buffer *buf;
 
 	buf = btrfs_find_create_tree_block(root, bytenr);
 	if (!buf)
 		return ERR_PTR(-ENOMEM);
+
+	ebh = eb_head(buf);
 	btrfs_set_header_generation(buf, trans->transid);
 	btrfs_set_buffer_lockdep_class(root->root_key.objectid, buf, level);
 	btrfs_tree_lock(buf);
+
+	if (test_bit(EXTENT_BUFFER_HEAD_WRITEBACK,
+			&ebh->bflags)) {
+		btrfs_set_lock_blocking(buf);
+		wait_on_bit_io(&ebh->bflags, EXTENT_BUFFER_HEAD_WRITEBACK,
+			TASK_UNINTERRUPTIBLE);
+	}
+
 	clean_tree_block(trans, root->fs_info, buf);
 	clear_bit(EXTENT_BUFFER_STALE, &buf->ebflags);
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 55f900a..1ae1059 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3638,6 +3638,52 @@ void wait_on_extent_buffer_writeback(struct extent_buffer *eb)
 		    TASK_UNINTERRUPTIBLE);
 }
 
+static void lock_extent_buffers(struct extent_buffer_head *ebh,
+				struct extent_page_data *epd)
+{
+	struct extent_buffer *locked_eb = NULL;
+	struct extent_buffer *eb;
+again:
+	eb = &ebh->eb;
+	do {
+		if (eb == locked_eb)
+			continue;
+
+		if (!btrfs_try_tree_write_lock(eb))
+			goto backoff;
+
+	} while ((eb = eb->eb_next) != NULL);
+
+	return;
+
+backoff:
+	if (locked_eb && (locked_eb->start > eb->start))
+		btrfs_tree_unlock(locked_eb);
+
+	locked_eb = eb;
+
+	eb = &ebh->eb;
+	while (eb != locked_eb) {
+		btrfs_tree_unlock(eb);
+		eb = eb->eb_next;
+	}
+
+	flush_write_bio(epd);
+
+	btrfs_tree_lock(locked_eb);
+
+	goto again;
+}
+
+static void unlock_extent_buffers(struct extent_buffer_head *ebh)
+{
+	struct extent_buffer *eb = &ebh->eb;
+
+	do {
+		btrfs_tree_unlock(eb);
+	} while ((eb = eb->eb_next) != NULL);
+}
+
 static void lock_extent_buffer_pages(struct extent_buffer_head *ebh,
 				struct extent_page_data *epd)
 {
@@ -3657,21 +3703,17 @@ static void lock_extent_buffer_pages(struct extent_buffer_head *ebh,
 }
 
 static int noinline_for_stack
-lock_extent_buffer_for_io(struct extent_buffer *eb,
+mark_extent_buffer_writeback(struct extent_buffer *eb,
 			struct btrfs_fs_info *fs_info,
 			struct extent_page_data *epd)
 {
+	struct extent_buffer_head *ebh = eb_head(eb);
+	struct extent_buffer *cur;
 	int dirty;
 	int ret = 0;
 
-	if (!btrfs_try_tree_write_lock(eb)) {
-		flush_write_bio(epd);
-		btrfs_tree_lock(eb);
-	}
-
 	if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags)) {
 		dirty = test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
-		btrfs_tree_unlock(eb);
 		if (!epd->sync_io) {
 			if (!dirty)
 				return 1;
@@ -3679,15 +3721,23 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
 				return 2;
 		}
 
+		cur = &ebh->eb;
+		do {
+			btrfs_set_lock_blocking(cur);
+		} while ((cur = cur->eb_next) != NULL);
+
 		flush_write_bio(epd);
 
 		while (1) {
 			wait_on_extent_buffer_writeback(eb);
-			btrfs_tree_lock(eb);
 			if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags))
 				break;
-			btrfs_tree_unlock(eb);
 		}
+
+		cur = &ebh->eb;
+		do {
+			btrfs_clear_lock_blocking(cur);
+		} while ((cur = cur->eb_next) != NULL);
 	}
 
 	/*
@@ -3695,22 +3745,20 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
 	 * under IO since we can end up having no IO bits set for a short period
 	 * of time.
 	 */
-	spin_lock(&eb_head(eb)->refs_lock);
+	spin_lock(&ebh->refs_lock);
 	if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags)) {
 		set_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags);
-		spin_unlock(&eb_head(eb)->refs_lock);
+		spin_unlock(&ebh->refs_lock);
 		btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
 		__percpu_counter_add(&fs_info->dirty_metadata_bytes,
 				     -eb->len,
 				     fs_info->dirty_metadata_batch);
 		ret = 0;
 	} else {
-		spin_unlock(&eb_head(eb)->refs_lock);
+		spin_unlock(&ebh->refs_lock);
 		ret = 1;
 	}
 
-	btrfs_tree_unlock(eb);
-
 	return ret;
 }
 
@@ -3856,8 +3904,8 @@ static void set_btree_ioerr(struct extent_buffer *eb, struct page *page)
 
 static void end_bio_subpagesize_blocksize_ebh_writepage(struct bio *bio, int err)
 {
-	struct bio_vec *bvec;
 	struct extent_buffer *eb;
+	struct bio_vec *bvec;
 	int i, done;
 
 	bio_for_each_segment_all(bvec, bio, i) {
@@ -3887,6 +3935,15 @@ static void end_bio_subpagesize_blocksize_ebh_writepage(struct bio *bio, int err
 
 			end_extent_buffer_writeback(eb);
 
+			if (done) {
+				struct extent_buffer_head *ebh = eb_head(eb);
+
+				clear_bit(EXTENT_BUFFER_HEAD_WRITEBACK,
+					&ebh->bflags);
+				smp_mb__after_atomic();
+				wake_up_bit(&ebh->bflags,
+					EXTENT_BUFFER_HEAD_WRITEBACK);
+			}
 		} while ((eb = eb->eb_next) != NULL);
 
 	}
@@ -3896,6 +3953,7 @@ static void end_bio_subpagesize_blocksize_ebh_writepage(struct bio *bio, int err
 
 static void end_bio_regular_ebh_writepage(struct bio *bio, int err)
 {
+	struct extent_buffer_head *ebh;
 	struct extent_buffer *eb;
 	struct bio_vec *bvec;
 	int i, done;
@@ -3906,7 +3964,9 @@ static void end_bio_regular_ebh_writepage(struct bio *bio, int err)
 		eb = (struct extent_buffer *)page->private;
 		BUG_ON(!eb);
 
-		done = atomic_dec_and_test(&eb_head(eb)->io_bvecs);
+		ebh = eb_head(eb);
+
+		done = atomic_dec_and_test(&ebh->io_bvecs);
 
 		if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->ebflags)) {
 			ClearPageUptodate(page);
@@ -3919,6 +3979,10 @@ static void end_bio_regular_ebh_writepage(struct bio *bio, int err)
 			continue;
 
 		end_extent_buffer_writeback(eb);
+
+		clear_bit(EXTENT_BUFFER_HEAD_WRITEBACK, &ebh->bflags);
+		smp_mb__after_atomic();
+		wake_up_bit(&ebh->bflags, EXTENT_BUFFER_HEAD_WRITEBACK);
 	}
 
 	bio_put(bio);
@@ -3960,8 +4024,14 @@ write_regular_ebh(struct extent_buffer_head *ebh,
 			set_btree_ioerr(eb, p);
 			end_page_writeback(p);
 			if (atomic_sub_and_test(num_pages - i,
-							&eb_head(eb)->io_bvecs))
+							&ebh->io_bvecs)) {
 				end_extent_buffer_writeback(eb);
+				clear_bit(EXTENT_BUFFER_HEAD_WRITEBACK,
+					&ebh->bflags);
+				smp_mb__after_atomic();
+				wake_up_bit(&ebh->bflags,
+					EXTENT_BUFFER_HEAD_WRITEBACK);
+			}
 			ret = -EIO;
 			break;
 		}
@@ -3995,6 +4065,7 @@ static int write_subpagesize_blocksize_ebh(struct extent_buffer_head *ebh,
 	unsigned long i;
 	unsigned long bio_flags = 0;
 	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
+	int nr_eb_submitted = 0;
 	int ret = 0, err = 0;
 
 	eb = &ebh->eb;
@@ -4007,7 +4078,7 @@ static int write_subpagesize_blocksize_ebh(struct extent_buffer_head *ebh,
 			continue;
 
 		clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->ebflags);
-		atomic_inc(&eb_head(eb)->io_bvecs);
+		atomic_inc(&ebh->io_bvecs);
 
 		if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
 			bio_flags = EXTENT_BIO_TREE_LOG;
@@ -4025,6 +4096,8 @@ static int write_subpagesize_blocksize_ebh(struct extent_buffer_head *ebh,
 			atomic_dec(&eb_head(eb)->io_bvecs);
 			end_extent_buffer_writeback(eb);
 			err = -EIO;
+		} else {
+			++nr_eb_submitted;
 		}
 	} while ((eb = eb->eb_next) != NULL);
 
@@ -4032,6 +4105,12 @@ static int write_subpagesize_blocksize_ebh(struct extent_buffer_head *ebh,
 		update_nr_written(p, wbc, 1);
 	}
 
+	if (!nr_eb_submitted) {
+		clear_bit(EXTENT_BUFFER_HEAD_WRITEBACK, &ebh->bflags);
+		smp_mb__after_atomic();
+		wake_up_bit(&ebh->bflags, EXTENT_BUFFER_HEAD_WRITEBACK);
+	}
+
 	unlock_page(p);
 
 	return ret;
@@ -4143,24 +4222,31 @@ retry:
 
 			j = 0;
 			ebs_to_write = dirty_ebs = 0;
+
+			lock_extent_buffers(ebh, &epd);
+
+			set_bit(EXTENT_BUFFER_HEAD_WRITEBACK, &ebh->bflags);
+
 			eb = &ebh->eb;
 			do {
 				BUG_ON(j >= BITS_PER_LONG);
 
-				ret = lock_extent_buffer_for_io(eb, fs_info, &epd);
+				ret = mark_extent_buffer_writeback(eb, fs_info,
+								&epd);
 				switch (ret) {
 				case 0:
 					/*
-					  EXTENT_BUFFER_DIRTY was set and we were able to
-					  clear it.
+					  EXTENT_BUFFER_DIRTY was set and we were
+					  able to clear it.
 					*/
 					set_bit(j, &ebs_to_write);
 					break;
 				case 2:
 					/*
-					  EXTENT_BUFFER_DIRTY was set, but we were unable
-					  to clear EXTENT_BUFFER_WRITEBACK that was set
-					  before we got the extent buffer locked.
+					  EXTENT_BUFFER_DIRTY was set, but we were
+					  unable to clear EXTENT_BUFFER_WRITEBACK
+					  that was set before we got the extent
+					  buffer locked.
 					 */
 					set_bit(j, &dirty_ebs);
 				default:
@@ -4174,22 +4260,32 @@ retry:
 
 			ret = 0;
 
+			unlock_extent_buffers(ebh);
+
 			if (!ebs_to_write) {
+				clear_bit(EXTENT_BUFFER_HEAD_WRITEBACK,
+					&ebh->bflags);
+				smp_mb__after_atomic();
+				wake_up_bit(&ebh->bflags,
+					EXTENT_BUFFER_HEAD_WRITEBACK);
 				free_extent_buffer(&ebh->eb);
 				continue;
 			}
 
 			/*
-			  Now that we know that atleast one of the extent buffer
+			  Now that we know that atleast one of the extent buffers
 			  belonging to the extent buffer head must be written to
 			  the disk, lock the extent_buffer_head's pages.
 			 */
 			lock_extent_buffer_pages(ebh, &epd);
 
 			if (ebh->eb.len < PAGE_CACHE_SIZE) {
-				ret = write_subpagesize_blocksize_ebh(ebh, fs_info, wbc, &epd, ebs_to_write);
+				ret = write_subpagesize_blocksize_ebh(ebh, fs_info,
+								wbc, &epd,
+								ebs_to_write);
 				if (dirty_ebs) {
-					redirty_extent_buffer_pages_for_writepage(&ebh->eb, wbc);
+					redirty_extent_buffer_pages_for_writepage(&ebh->eb,
+										wbc);
 				}
 			} else {
 				ret = write_regular_ebh(ebh, fs_info, wbc, &epd);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index c629e53..cbc7d73 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -42,6 +42,7 @@
 #define EXTENT_BUFFER_DUMMY 9
 #define EXTENT_BUFFER_IN_TREE 10
 #define EXTENT_BUFFER_WRITE_ERR 11    /* write IO error */
+#define EXTENT_BUFFER_HEAD_WRITEBACK 12
 
 /* these are flags for extent_clear_unlock_delalloc */
 #define PAGE_UNLOCK		(1 << 0)
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH V11 17/21] Btrfs: subpagesize-blocksize: Use (eb->start, seq) as search key for tree modification log.
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (15 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 16/21] Btrfs: subpagesize-blocksize: Prevent writes to an extent buffer when PG_writeback flag is set Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-07-20 14:46   ` Liu Bo
  2015-06-01 15:22 ` [RFC PATCH V11 18/21] Btrfs: subpagesize-blocksize: btrfs_submit_direct_hook: Handle map_length < bio vector length Chandan Rajendra
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

In the subpagesize-blocksize scenario, a page can map multiple extent buffers,
and hence using (page index, seq) as the search key is incorrect. For example,
searching through the tree modification log can return an entry associated
with the first extent buffer mapped by the page (if such an entry exists),
when we are actually searching for entries associated with extent buffers
mapped at position 2 or later in the page.
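
A quick worked example of the collision (hypothetical addresses), assuming a
64K page size (PAGE_CACHE_SHIFT == 16) and 16K metadata blocks:

	/* four extent buffers sharing one page */
	u64 eb_start[] = { 0x10000, 0x14000, 0x18000, 0x1c000 };

	/*
	 * old key: eb->start >> PAGE_CACHE_SHIFT evaluates to 1 for all
	 * four buffers, so a tree-mod-log search for the buffer at
	 * 0x14000 can land on an element logged for the one at 0x10000.
	 *
	 * new key: eb->start itself (0x10000, 0x14000, ...) is unique
	 * per extent buffer, so a lookup cannot cross buffers.
	 */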

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/ctree.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index ba6fbb0..47310d3 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -311,7 +311,7 @@ struct tree_mod_root {
 
 struct tree_mod_elem {
 	struct rb_node node;
-	u64 index;		/* shifted logical */
+	u64 logical;
 	u64 seq;
 	enum mod_log_op op;
 
@@ -435,11 +435,11 @@ void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info,
 
 /*
  * key order of the log:
- *       index -> sequence
+ *       node/leaf start address -> sequence
  *
- * the index is the shifted logical of the *new* root node for root replace
- * operations, or the shifted logical of the affected block for all other
- * operations.
+ * The 'start address' is the logical address of the *new* root node
+ * for root replace operations, or the logical address of the affected
+ * block for all other operations.
  *
  * Note: must be called with write lock (tree_mod_log_write_lock).
  */
@@ -460,9 +460,9 @@ __tree_mod_log_insert(struct btrfs_fs_info *fs_info, struct tree_mod_elem *tm)
 	while (*new) {
 		cur = container_of(*new, struct tree_mod_elem, node);
 		parent = *new;
-		if (cur->index < tm->index)
+		if (cur->logical < tm->logical)
 			new = &((*new)->rb_left);
-		else if (cur->index > tm->index)
+		else if (cur->logical > tm->logical)
 			new = &((*new)->rb_right);
 		else if (cur->seq < tm->seq)
 			new = &((*new)->rb_left);
@@ -523,7 +523,7 @@ alloc_tree_mod_elem(struct extent_buffer *eb, int slot,
 	if (!tm)
 		return NULL;
 
-	tm->index = eb->start >> PAGE_CACHE_SHIFT;
+	tm->logical = eb->start;
 	if (op != MOD_LOG_KEY_ADD) {
 		btrfs_node_key(eb, &tm->key, slot);
 		tm->blockptr = btrfs_node_blockptr(eb, slot);
@@ -588,7 +588,7 @@ tree_mod_log_insert_move(struct btrfs_fs_info *fs_info,
 		goto free_tms;
 	}
 
-	tm->index = eb->start >> PAGE_CACHE_SHIFT;
+	tm->logical = eb->start;
 	tm->slot = src_slot;
 	tm->move.dst_slot = dst_slot;
 	tm->move.nr_items = nr_items;
@@ -699,7 +699,7 @@ tree_mod_log_insert_root(struct btrfs_fs_info *fs_info,
 		goto free_tms;
 	}
 
-	tm->index = new_root->start >> PAGE_CACHE_SHIFT;
+	tm->logical = new_root->start;
 	tm->old_root.logical = old_root->start;
 	tm->old_root.level = btrfs_header_level(old_root);
 	tm->generation = btrfs_header_generation(old_root);
@@ -739,16 +739,15 @@ __tree_mod_log_search(struct btrfs_fs_info *fs_info, u64 start, u64 min_seq,
 	struct rb_node *node;
 	struct tree_mod_elem *cur = NULL;
 	struct tree_mod_elem *found = NULL;
-	u64 index = start >> PAGE_CACHE_SHIFT;
 
 	tree_mod_log_read_lock(fs_info);
 	tm_root = &fs_info->tree_mod_log;
 	node = tm_root->rb_node;
 	while (node) {
 		cur = container_of(node, struct tree_mod_elem, node);
-		if (cur->index < index) {
+		if (cur->logical < start) {
 			node = node->rb_left;
-		} else if (cur->index > index) {
+		} else if (cur->logical > start) {
 			node = node->rb_right;
 		} else if (cur->seq < min_seq) {
 			node = node->rb_left;
@@ -1228,9 +1227,10 @@ __tree_mod_log_oldest_root(struct btrfs_fs_info *fs_info,
 		return NULL;
 
 	/*
-	 * the very last operation that's logged for a root is the replacement
-	 * operation (if it is replaced at all). this has the index of the *new*
-	 * root, making it the very first operation that's logged for this root.
+	 * the very last operation that's logged for a root is the
+	 * replacement operation (if it is replaced at all). this has
+	 * the logical address of the *new* root, making it the very
+	 * first operation that's logged for this root.
 	 */
 	while (1) {
 		tm = tree_mod_log_search_oldest(fs_info, root_logical,
@@ -1334,7 +1334,7 @@ __tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct extent_buffer *eb,
 		if (!next)
 			break;
 		tm = container_of(next, struct tree_mod_elem, node);
-		if (tm->index != first_tm->index)
+		if (tm->logical != first_tm->logical)
 			break;
 	}
 	tree_mod_log_read_unlock(fs_info);
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH V11 18/21] Btrfs: subpagesize-blocksize: btrfs_submit_direct_hook: Handle map_length < bio vector length
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (16 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 17/21] Btrfs: subpagesize-blocksize: Use (eb->start, seq) as search key for tree modification log Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-06-01 15:22 ` [RFC PATCH V11 19/21] Revert "btrfs: fix lockups from btrfs_clear_path_blocking" Chandan Rajendra
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

In the subpagesize-blocksize scenario, map_length can be less than the length
of a bio vector. Such a condition may cause btrfs_submit_direct_hook() to
submit a zero-length bio. Fix this by adding pages to the bio one block at a
time.
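
In outline, the fix stops treating a bio_vec as atomic and advances through it
one block at a time (a condensed sketch of the change, not the verbatim hunk):

	nr_sectors = bvec->bv_len >> inode->i_sb->s_blocksize_bits;
	for (i = 0; i < nr_sectors; i++) {
		if (map_length < submit_len + blocksize) {
			/* stripe boundary inside the bvec: submit what we
			 * have and start a new bio, never a 0-length one */
			break;
		}
		bio_add_page(bio, bvec->bv_page, blocksize,
			     bvec->bv_offset + i * blocksize);
		submit_len += blocksize;
	}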

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/inode.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 03faff0..1684339 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8193,9 +8193,11 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	u64 file_offset = dip->logical_offset;
 	u64 submit_len = 0;
 	u64 map_length;
-	int nr_pages = 0;
-	int ret;
+	u32 blocksize = root->sectorsize;
 	int async_submit = 0;
+	int nr_sectors;
+	int ret;
+	int i;
 
 	map_length = orig_bio->bi_iter.bi_size;
 	ret = btrfs_map_block(root->fs_info, rw, start_sector << 9,
@@ -8225,9 +8227,12 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	atomic_inc(&dip->pending_bios);
 
 	while (bvec <= (orig_bio->bi_io_vec + orig_bio->bi_vcnt - 1)) {
-		if (map_length < submit_len + bvec->bv_len ||
-		    bio_add_page(bio, bvec->bv_page, bvec->bv_len,
-				 bvec->bv_offset) < bvec->bv_len) {
+		nr_sectors = bvec->bv_len >> inode->i_sb->s_blocksize_bits;
+		i = 0;
+next_block:
+		if (unlikely(map_length < submit_len + blocksize ||
+		    bio_add_page(bio, bvec->bv_page, blocksize,
+			    bvec->bv_offset + (i * blocksize)) < blocksize)) {
 			/*
 			 * inc the count before we submit the bio so
 			 * we know the end IO handler won't happen before
@@ -8248,7 +8253,6 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 			file_offset += submit_len;
 
 			submit_len = 0;
-			nr_pages = 0;
 
 			bio = btrfs_dio_bio_alloc(orig_bio->bi_bdev,
 						  start_sector, GFP_NOFS);
@@ -8266,9 +8270,14 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 				bio_put(bio);
 				goto out_err;
 			}
+
+			goto next_block;
 		} else {
-			submit_len += bvec->bv_len;
-			nr_pages++;
+			submit_len += blocksize;
+			if (--nr_sectors) {
+				i++;
+				goto next_block;
+			}
 			bvec++;
 		}
 	}
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH V11 19/21] Revert "btrfs: fix lockups from btrfs_clear_path_blocking"
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (17 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 18/21] Btrfs: subpagesize-blocksize: btrfs_submit_direct_hook: Handle map_length < bio vector length Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-06-01 15:22 ` [RFC PATCH V11 20/21] Btrfs: subpagesize-blocksize: Limit inline extents to root->sectorsize Chandan Rajendra
  2015-06-01 15:22 ` [RFC PATCH V11 21/21] Btrfs: subpagesize-blocksize: Fix block size returned to user space Chandan Rajendra
  20 siblings, 0 replies; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

This reverts commit f82c458a2c3ffb94b431fc6ad791a79df1b3713e.
---
 fs/btrfs/ctree.c   | 14 ++++++++++++--
 fs/btrfs/locking.c | 24 +++---------------------
 fs/btrfs/locking.h |  2 --
 3 files changed, 15 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 47310d3..594130f 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -80,6 +80,13 @@ noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
 {
 	int i;
 
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	/* lockdep really cares that we take all of these spinlocks
+	 * in the right order.  If any of the locks in the path are not
+	 * currently blocking, it is going to complain.  So, make really
+	 * really sure by forcing the path to blocking before we clear
+	 * the path blocking.
+	 */
 	if (held) {
 		btrfs_set_lock_blocking_rw(held, held_rw);
 		if (held_rw == BTRFS_WRITE_LOCK)
@@ -88,6 +95,7 @@ noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
 			held_rw = BTRFS_READ_LOCK_BLOCKING;
 	}
 	btrfs_set_path_blocking(p);
+#endif
 
 	for (i = BTRFS_MAX_LEVEL - 1; i >= 0; i--) {
 		if (p->nodes[i] && p->locks[i]) {
@@ -99,8 +107,10 @@ noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
 		}
 	}
 
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
 	if (held)
 		btrfs_clear_lock_blocking_rw(held, held_rw);
+#endif
 }
 
 /* this also releases the path */
@@ -2898,7 +2908,7 @@ cow_done:
 					}
 					p->locks[level] = BTRFS_WRITE_LOCK;
 				} else {
-					err = btrfs_tree_read_lock_atomic(b);
+					err = btrfs_try_tree_read_lock(b);
 					if (!err) {
 						btrfs_set_path_blocking(p);
 						btrfs_tree_read_lock(b);
@@ -3030,7 +3040,7 @@ again:
 			}
 
 			level = btrfs_header_level(b);
-			err = btrfs_tree_read_lock_atomic(b);
+			err = btrfs_try_tree_read_lock(b);
 			if (!err) {
 				btrfs_set_path_blocking(p);
 				btrfs_tree_read_lock(b);
diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index f8229ef..5665d21 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -128,26 +128,6 @@ again:
 }
 
 /*
- * take a spinning read lock.
- * returns 1 if we get the read lock and 0 if we don't
- * this won't wait for blocking writers
- */
-int btrfs_tree_read_lock_atomic(struct extent_buffer *eb)
-{
-	if (atomic_read(&eb->blocking_writers))
-		return 0;
-
-	read_lock(&eb->lock);
-	if (atomic_read(&eb->blocking_writers)) {
-		read_unlock(&eb->lock);
-		return 0;
-	}
-	atomic_inc(&eb->read_locks);
-	atomic_inc(&eb->spinning_readers);
-	return 1;
-}
-
-/*
  * returns 1 if we get the read lock and 0 if we don't
  * this won't wait for blocking writers
  */
@@ -178,7 +158,9 @@ int btrfs_try_tree_write_lock(struct extent_buffer *eb)
 	    atomic_read(&eb->blocking_readers))
 		return 0;
 
-	write_lock(&eb->lock);
+	if (!write_trylock(&eb->lock))
+		return 0;
+
 	if (atomic_read(&eb->blocking_writers) ||
 	    atomic_read(&eb->blocking_readers)) {
 		write_unlock(&eb->lock);
diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
index c44a9d5..b81e0e9 100644
--- a/fs/btrfs/locking.h
+++ b/fs/btrfs/locking.h
@@ -35,8 +35,6 @@ void btrfs_clear_lock_blocking_rw(struct extent_buffer *eb, int rw);
 void btrfs_assert_tree_locked(struct extent_buffer *eb);
 int btrfs_try_tree_read_lock(struct extent_buffer *eb);
 int btrfs_try_tree_write_lock(struct extent_buffer *eb);
-int btrfs_tree_read_lock_atomic(struct extent_buffer *eb);
-
 
 static inline void btrfs_tree_unlock_rw(struct extent_buffer *eb, int rw)
 {
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH V11 20/21] Btrfs: subpagesize-blocksize: Limit inline extents to root->sectorsize.
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (18 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 19/21] Revert "btrfs: fix lockups from btrfs_clear_path_blocking" Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  2015-06-01 15:22 ` [RFC PATCH V11 21/21] Btrfs: subpagesize-blocksize: Fix block size returned to user space Chandan Rajendra
  20 siblings, 0 replies; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

cow_file_range_inline() limits the size of an inline extent to
PAGE_CACHE_SIZE. This breaks in subpagesize-blocksize scenarios. Fix this by
comparing against root->sectorsize.
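
For instance (illustrative numbers): with 2K blocks on a 64K-page machine, the
PAGE_CACHE_SIZE bound would happily inline a 3000-byte file, i.e. an extent
spanning more than one block, which the rest of the code does not expect.
Bounding by root->sectorsize keeps the decision independent of the host's page
size; condensed, the guard becomes:

	if (start > 0 ||
	    actual_end > root->sectorsize ||	/* was PAGE_CACHE_SIZE */
	    data_len > BTRFS_MAX_INLINE_DATA_SIZE(root))
		return 1;	/* cannot inline */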

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1684339..4d42123 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -258,7 +258,7 @@ static noinline int cow_file_range_inline(struct btrfs_root *root,
 		data_len = compressed_size;
 
 	if (start > 0 ||
-	    actual_end > PAGE_CACHE_SIZE ||
+	    actual_end > root->sectorsize ||
 	    data_len > BTRFS_MAX_INLINE_DATA_SIZE(root) ||
 	    (!compressed_size &&
 	    (actual_end & (root->sectorsize - 1)) == 0) ||
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH V11 21/21] Btrfs: subpagesize-blocksize: Fix block size returned to user space.
  2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
                   ` (19 preceding siblings ...)
  2015-06-01 15:22 ` [RFC PATCH V11 20/21] Btrfs: subpagesize-blocksize: Limit inline extents to root->sectorsize Chandan Rajendra
@ 2015-06-01 15:22 ` Chandan Rajendra
  20 siblings, 0 replies; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-01 15:22 UTC (permalink / raw)
  To: clm, jbacik, dsterba, bo.li.liu; +Cc: Chandan Rajendra, linux-btrfs, chandan

btrfs_getattr() returns PAGE_CACHE_SIZE as the block size. Since
generic_fillattr() already does the right thing (by obtaining block size
from inode->i_blkbits), just remove the statement from btrfs_getattr.
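
For reference, generic_fillattr() derives the value from the inode along these
lines (paraphrased from the generic code of this era), which is already
correct for any block size:

	stat->blksize = (1 << inode->i_blkbits);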

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
 fs/btrfs/inode.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4d42123..3c35430 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9302,7 +9302,6 @@ static int btrfs_getattr(struct vfsmount *mnt,
 
 	generic_fillattr(inode, stat);
 	stat->dev = BTRFS_I(inode)->root->anon_dev;
-	stat->blksize = PAGE_CACHE_SIZE;
 
 	spin_lock(&BTRFS_I(inode)->lock);
 	delalloc_bytes = BTRFS_I(inode)->delalloc_bytes;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read.
  2015-06-01 15:22 ` [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read Chandan Rajendra
@ 2015-06-19  4:45   ` Liu Bo
  2015-06-19  9:45     ` Chandan Rajendra
  0 siblings, 1 reply; 47+ messages in thread
From: Liu Bo @ 2015-06-19  4:45 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Mon, Jun 01, 2015 at 08:52:36PM +0530, Chandan Rajendra wrote:
> For the subpagesize-blocksize scenario, a page can contain multiple
> blocks. In such cases, this patch handles reading data from files.
> 
> To track the status of individual blocks of a page, this patch makes use of a
> bitmap pointed to by page->private.

Started going through the patchset; it's not easy though.

Several comments follow.

> 
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> ---
>  fs/btrfs/extent_io.c | 301 +++++++++++++++++++++++++++++++++------------------
>  fs/btrfs/extent_io.h |  28 ++++-
>  fs/btrfs/inode.c     |  13 +--
>  3 files changed, 224 insertions(+), 118 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 782f3bc..d37badb 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1325,6 +1325,88 @@ int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
>  				cached_state, mask);
>  }
>  
> +static int modify_page_blks_state(struct page *page,
> +				unsigned long blk_states,
> +				u64 start, u64 end, int set)
> +{
> +	struct inode *inode = page->mapping->host;
> +	unsigned long *bitmap;
> +	unsigned long state;
> +	u64 nr_blks;
> +	u64 blk;
> +
> +	BUG_ON(!PagePrivate(page));
> +
> +	bitmap = ((struct btrfs_page_private *)page->private)->bstate;
> +
> +	blk = (start & (PAGE_CACHE_SIZE - 1)) >> inode->i_blkbits;
> +	nr_blks = (end - start + 1) >> inode->i_blkbits;
> +
> +	while (nr_blks--) {
> +		state = find_next_bit(&blk_states, BLK_NR_STATE, 0);

Looks like we don't need to do find_next_bit for every block.

> +
> +		while (state < BLK_NR_STATE) {
> +			if (set)
> +				set_bit((blk * BLK_NR_STATE) + state, bitmap);
> +			else
> +				clear_bit((blk * BLK_NR_STATE) + state, bitmap);
> +
> +			state = find_next_bit(&blk_states, BLK_NR_STATE,
> +					state + 1);
> +		}
> +
> +		++blk;
> +	}
> +
> +	return 0;
> +}
> +
> +int set_page_blks_state(struct page *page, unsigned long blk_states,
> +			u64 start, u64 end)
> +{
> +	return modify_page_blks_state(page, blk_states, start, end, 1);
> +}
> +
> +int clear_page_blks_state(struct page *page, unsigned long blk_states,
> +			u64 start, u64 end)
> +{
> +	return modify_page_blks_state(page, blk_states, start, end, 0);
> +}
> +
> +int test_page_blks_state(struct page *page, enum blk_state blk_state,
> +			u64 start, u64 end, int check_all)
> +{
> +	struct inode *inode = page->mapping->host;
> +	unsigned long *bitmap;
> +	unsigned long blk;
> +	u64 nr_blks;
> +	int found = 0;
> +
> +	BUG_ON(!PagePrivate(page));
> +
> +	bitmap = ((struct btrfs_page_private *)page->private)->bstate;
> +
> +	blk = (start & (PAGE_CACHE_SIZE - 1)) >> inode->i_blkbits;
> +	nr_blks = (end - start + 1) >> inode->i_blkbits;
> +
> +	while (nr_blks--) {
> +		if (test_bit((blk * BLK_NR_STATE) + blk_state, bitmap)) {
> +			if (!check_all)
> +				return 1;
> +			found = 1;
> +		} else if (check_all) {
> +			return 0;
> +		}
> +
> +		++blk;
> +	}
> +
> +	if (!check_all && !found)
> +		return 0;
> +
> +	return 1;
> +}
> +
>  /*
>   * either insert or lock state struct between start and end use mask to tell
>   * us if waiting is desired.
> @@ -1982,14 +2064,22 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
>   * helper function to set a given page up to date if all the
>   * extents in the tree for that page are up to date
>   */
> -static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
> +static void check_page_uptodate(struct page *page)
>  {
>  	u64 start = page_offset(page);
>  	u64 end = start + PAGE_CACHE_SIZE - 1;
> -	if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL))
> +	if (test_page_blks_state(page, BLK_STATE_UPTODATE, start, end, 1))
>  		SetPageUptodate(page);
>  }
>  
> +static int page_read_complete(struct page *page)
> +{
> +	u64 start = page_offset(page);
> +	u64 end = start + PAGE_CACHE_SIZE - 1;
> +
> +	return !test_page_blks_state(page, BLK_STATE_IO, start, end, 0);
> +}
> +
>  int free_io_failure(struct inode *inode, struct io_failure_record *rec)
>  {
>  	int ret;
> @@ -2311,7 +2401,9 @@ int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
>  	 *	a) deliver good data to the caller
>  	 *	b) correct the bad sectors on disk
>  	 */
> -	if (failed_bio->bi_vcnt > 1) {
> +	if ((failed_bio->bi_vcnt > 1)
> +		|| (failed_bio->bi_io_vec->bv_len
> +			> BTRFS_I(inode)->root->sectorsize)) {
>  		/*
>  		 * to fulfill b), we need to know the exact failing sectors, as
>  		 * we don't want to rewrite any more than the failed ones. thus,
> @@ -2520,18 +2612,6 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
>  	bio_put(bio);
>  }
>  
> -static void
> -endio_readpage_release_extent(struct extent_io_tree *tree, u64 start, u64 len,
> -			      int uptodate)
> -{
> -	struct extent_state *cached = NULL;
> -	u64 end = start + len - 1;
> -
> -	if (uptodate && tree->track_uptodate)
> -		set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
> -	unlock_extent_cached(tree, start, end, &cached, GFP_ATOMIC);
> -}
> -
>  /*
>   * after a readpage IO is done, we need to:
>   * clear the uptodate bits on error
> @@ -2548,14 +2628,16 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
>  	struct bio_vec *bvec;
>  	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
>  	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> +	struct extent_state *cached = NULL;
> +	struct btrfs_page_private *pg_private;
>  	struct extent_io_tree *tree;
> +	unsigned long flags;
>  	u64 offset = 0;
>  	u64 start;
>  	u64 end;
> -	u64 len;
> -	u64 extent_start = 0;
> -	u64 extent_len = 0;
> +	int nr_sectors;
>  	int mirror;
> +	int unlock;
>  	int ret;
>  	int i;
>  
> @@ -2565,54 +2647,31 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
>  	bio_for_each_segment_all(bvec, bio, i) {
>  		struct page *page = bvec->bv_page;
>  		struct inode *inode = page->mapping->host;
> +		struct btrfs_root *root = BTRFS_I(inode)->root;
>  
>  		pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, "
>  			 "mirror=%u\n", (u64)bio->bi_iter.bi_sector, err,
>  			 io_bio->mirror_num);
>  		tree = &BTRFS_I(inode)->io_tree;
>  
> -		/* We always issue full-page reads, but if some block
> -		 * in a page fails to read, blk_update_request() will
> -		 * advance bv_offset and adjust bv_len to compensate.
> -		 * Print a warning for nonzero offsets, and an error
> -		 * if they don't add up to a full page.  */
> -		if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) {
> -			if (bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE)
> -				btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
> -				   "partial page read in btrfs with offset %u and length %u",
> -					bvec->bv_offset, bvec->bv_len);
> -			else
> -				btrfs_info(BTRFS_I(page->mapping->host)->root->fs_info,
> -				   "incomplete page read in btrfs with offset %u and "
> -				   "length %u",
> -					bvec->bv_offset, bvec->bv_len);
> -		}
> -
> -		start = page_offset(page);
> -		end = start + bvec->bv_offset + bvec->bv_len - 1;
> -		len = bvec->bv_len;
> -
> +		start = page_offset(page) + bvec->bv_offset;
> +		end = start + bvec->bv_len - 1;
> +		nr_sectors = bvec->bv_len >> inode->i_sb->s_blocksize_bits;
>  		mirror = io_bio->mirror_num;
> -		if (likely(uptodate && tree->ops &&
> -			   tree->ops->readpage_end_io_hook)) {
> +
> +next_block:
> +		if (likely(uptodate)) {

Any reason for killing (tree->ops && tree->ops->readpage_end_io_hook)?

>  			ret = tree->ops->readpage_end_io_hook(io_bio, offset,
> -							      page, start, end,
> -							      mirror);
> +							page, start,
> +							start + root->sectorsize - 1,
> +							mirror);
>  			if (ret)
>  				uptodate = 0;
>  			else
>  				clean_io_failure(inode, start, page, 0);
>  		}
>  
> -		if (likely(uptodate))
> -			goto readpage_ok;
> -
> -		if (tree->ops && tree->ops->readpage_io_failed_hook) {
> -			ret = tree->ops->readpage_io_failed_hook(page, mirror);
> -			if (!ret && !err &&
> -			    test_bit(BIO_UPTODATE, &bio->bi_flags))
> -				uptodate = 1;
> -		} else {
> +		if (!uptodate) {
>  			/*
>  			 * The generic bio_readpage_error handles errors the
>  			 * following way: If possible, new read requests are
> @@ -2623,61 +2682,63 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
>  			 * can't handle the error it will return -EIO and we
>  			 * remain responsible for that page.
>  			 */
> -			ret = bio_readpage_error(bio, offset, page, start, end,
> -						 mirror);
> +			ret = bio_readpage_error(bio, offset, page,
> +						start, start + root->sectorsize - 1,
> +						mirror);
>  			if (ret == 0) {
> -				uptodate =
> -					test_bit(BIO_UPTODATE, &bio->bi_flags);
> +				uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
>  				if (err)
>  					uptodate = 0;
> -				offset += len;
> -				continue;
> +				offset += root->sectorsize;
> +				if (--nr_sectors) {
> +					start += root->sectorsize;
> +					goto next_block;
> +				} else {
> +					continue;
> +				}
>  			}
>  		}
> -readpage_ok:
> -		if (likely(uptodate)) {
> -			loff_t i_size = i_size_read(inode);
> -			pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
> -			unsigned off;
> -
> -			/* Zero out the end if this page straddles i_size */
> -			off = i_size & (PAGE_CACHE_SIZE-1);
> -			if (page->index == end_index && off)
> -				zero_user_segment(page, off, PAGE_CACHE_SIZE);
> -			SetPageUptodate(page);
> +
> +		if (uptodate) {
> +			set_page_blks_state(page, 1 << BLK_STATE_UPTODATE, start,
> +					start + root->sectorsize - 1);
> +			check_page_uptodate(page);
>  		} else {
>  			ClearPageUptodate(page);
>  			SetPageError(page);
>  		}
> -		unlock_page(page);
> -		offset += len;
> -
> -		if (unlikely(!uptodate)) {
> -			if (extent_len) {
> -				endio_readpage_release_extent(tree,
> -							      extent_start,
> -							      extent_len, 1);
> -				extent_start = 0;
> -				extent_len = 0;
> -			}
> -			endio_readpage_release_extent(tree, start,
> -						      end - start + 1, 0);
> -		} else if (!extent_len) {
> -			extent_start = start;
> -			extent_len = end + 1 - start;
> -		} else if (extent_start + extent_len == start) {
> -			extent_len += end + 1 - start;
> -		} else {
> -			endio_readpage_release_extent(tree, extent_start,
> -						      extent_len, uptodate);
> -			extent_start = start;
> -			extent_len = end + 1 - start;
> +
> +		offset += root->sectorsize;
> +
> +		if (--nr_sectors) {
> +			clear_page_blks_state(page, 1 << BLK_STATE_IO,
> +					start, start + root->sectorsize - 1);

private->io_lock is not acquired here, but it is below.

IIUC, this can be protected by EXTENT_LOCKED.

Thanks,

-liubo

> +			clear_extent_bit(tree, start, start + root->sectorsize - 1,
> +					EXTENT_LOCKED, 1, 0, &cached, GFP_ATOMIC);
> +			start += root->sectorsize;
> +			goto next_block;
>  		}
> +
> +		WARN_ON(!PagePrivate(page));
> +
> +		pg_private = (struct btrfs_page_private *)page->private;
> +
> +		spin_lock_irqsave(&pg_private->io_lock, flags);
> +
> +		clear_page_blks_state(page, 1 << BLK_STATE_IO,
> +				start, start + root->sectorsize - 1);
> +
> +		unlock = page_read_complete(page);
> +
> +		spin_unlock_irqrestore(&pg_private->io_lock, flags);
> +
> +		clear_extent_bit(tree, start, start + root->sectorsize - 1,
> +				EXTENT_LOCKED, 1, 0, &cached, GFP_ATOMIC);
> +
> +		if (unlock)
> +			unlock_page(page);
>  	}
>  
> -	if (extent_len)
> -		endio_readpage_release_extent(tree, extent_start, extent_len,
> -					      uptodate);
>  	if (io_bio->end_io)
>  		io_bio->end_io(io_bio, err);
>  	bio_put(bio);
> @@ -2859,13 +2920,36 @@ static void attach_extent_buffer_page(struct extent_buffer *eb,
>  	}
>  }
>  
> -void set_page_extent_mapped(struct page *page)
> +int set_page_extent_mapped(struct page *page)
>  {
> +	struct btrfs_page_private *pg_private;
> +
>  	if (!PagePrivate(page)) {
> +		pg_private = kzalloc(sizeof(*pg_private), GFP_NOFS);
> +		if (!pg_private)
> +			return -ENOMEM;
> +
> +		spin_lock_init(&pg_private->io_lock);
> +
>  		SetPagePrivate(page);
>  		page_cache_get(page);
> -		set_page_private(page, EXTENT_PAGE_PRIVATE);
> +
> +		set_page_private(page, (unsigned long)pg_private);
> +	}
> +
> +	return 0;
> +}
> +
> +int clear_page_extent_mapped(struct page *page)
> +{
> +	if (PagePrivate(page)) {
> +		kfree((struct btrfs_page_private *)(page->private));
> +		ClearPagePrivate(page);
> +		set_page_private(page, 0);
> +		page_cache_release(page);
>  	}
> +
> +	return 0;
>  }
>  
>  static struct extent_map *
> @@ -2909,6 +2993,7 @@ static int __do_readpage(struct extent_io_tree *tree,
>  			 unsigned long *bio_flags, int rw)
>  {
>  	struct inode *inode = page->mapping->host;
> +	struct extent_state *cached = NULL;
>  	u64 start = page_offset(page);
>  	u64 page_end = start + PAGE_CACHE_SIZE - 1;
>  	u64 end;
> @@ -2964,8 +3049,8 @@ static int __do_readpage(struct extent_io_tree *tree,
>  			memset(userpage + pg_offset, 0, iosize);
>  			flush_dcache_page(page);
>  			kunmap_atomic(userpage);
> -			set_extent_uptodate(tree, cur, cur + iosize - 1,
> -					    &cached, GFP_NOFS);
> +			set_page_blks_state(page, 1 << BLK_STATE_UPTODATE, cur,
> +					cur + iosize - 1);
>  			if (!parent_locked)
>  				unlock_extent_cached(tree, cur,
>  						     cur + iosize - 1,
> @@ -3017,8 +3102,8 @@ static int __do_readpage(struct extent_io_tree *tree,
>  			flush_dcache_page(page);
>  			kunmap_atomic(userpage);
>  
> -			set_extent_uptodate(tree, cur, cur + iosize - 1,
> -					    &cached, GFP_NOFS);
> +			set_page_blks_state(page, 1 << BLK_STATE_UPTODATE, cur,
> +					cur + iosize - 1);
>  			unlock_extent_cached(tree, cur, cur + iosize - 1,
>  			                     &cached, GFP_NOFS);
>  			cur = cur + iosize;
> @@ -3026,9 +3111,9 @@ static int __do_readpage(struct extent_io_tree *tree,
>  			continue;
>  		}
>  		/* the get_extent function already copied into the page */
> -		if (test_range_bit(tree, cur, cur_end,
> -				   EXTENT_UPTODATE, 1, NULL)) {
> -			check_page_uptodate(tree, page);
> +		if (test_page_blks_state(page, BLK_STATE_UPTODATE, cur,
> +						cur_end, 1)) {
> +			check_page_uptodate(page);
>  			if (!parent_locked)
>  				unlock_extent(tree, cur, cur + iosize - 1);
>  			cur = cur + iosize;
> @@ -3048,6 +3133,9 @@ static int __do_readpage(struct extent_io_tree *tree,
>  		}
>  
>  		pnr -= page->index;
> +
> +		set_page_blks_state(page, 1 << BLK_STATE_IO, cur,
> +				cur + iosize - 1);
>  		ret = submit_extent_page(rw, tree, page,
>  					 sector, disk_io_size, pg_offset,
>  					 bdev, bio, pnr,
> @@ -3059,8 +3147,11 @@ static int __do_readpage(struct extent_io_tree *tree,
>  			*bio_flags = this_bio_flag;
>  		} else {
>  			SetPageError(page);
> +			clear_page_blks_state(page, 1 << BLK_STATE_IO, cur,
> +					cur + iosize - 1);
>  			if (!parent_locked)
> -				unlock_extent(tree, cur, cur + iosize - 1);
> +				unlock_extent_cached(tree, cur, cur + iosize - 1,
> +						&cached, GFP_NOFS);
>  		}
>  		cur = cur + iosize;
>  		pg_offset += iosize;
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index c668f36..541b40a 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -51,11 +51,22 @@
>  #define PAGE_SET_PRIVATE2	(1 << 4)
>  #define PAGE_SET_ERROR		(1 << 5)
>  
> +enum blk_state {
> +	BLK_STATE_UPTODATE,
> +	BLK_STATE_DIRTY,
> +	BLK_STATE_IO,
> +	BLK_NR_STATE,
> +};
> +
>  /*
> - * page->private values.  Every page that is controlled by the extent
> - * map has page->private set to one.
> - */
> -#define EXTENT_PAGE_PRIVATE 1
> +  The maximum number of blocks per page (i.e. 32) occurs when using 2k
> +  as the block size and having 64k as the page size.
> +*/
> +#define BLK_STATE_NR_LONGS DIV_ROUND_UP(BLK_NR_STATE * 32, BITS_PER_LONG)
> +struct btrfs_page_private {
> +	spinlock_t io_lock;
> +	unsigned long bstate[BLK_STATE_NR_LONGS];
> +};
>  
>  struct extent_state;
>  struct btrfs_root;
> @@ -259,7 +270,14 @@ int extent_readpages(struct extent_io_tree *tree,
>  int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>  		__u64 start, __u64 len, get_extent_t *get_extent);
>  int get_state_private(struct extent_io_tree *tree, u64 start, u64 *private);
> -void set_page_extent_mapped(struct page *page);
> +int set_page_extent_mapped(struct page *page);
> +int clear_page_extent_mapped(struct page *page);
> +int set_page_blks_state(struct page *page, unsigned long blk_states,
> + 			u64 start, u64 end);
> +int clear_page_blks_state(struct page *page, unsigned long blk_states,
> + 			u64 start, u64 end);
> +int test_page_blks_state(struct page *page, enum blk_state blk_state,
> +			u64 start, u64 end, int check_all);
>  
>  struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>  					  u64 start);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0020b56..8262f83 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6622,7 +6622,6 @@ struct extent_map *btrfs_get_extent(struct inode *inode, struct page *page,
>  	struct btrfs_key found_key;
>  	struct extent_map *em = NULL;
>  	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
> -	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
>  	struct btrfs_trans_handle *trans = NULL;
>  	const bool new_inline = !page || create;
>  
> @@ -6800,8 +6799,8 @@ next:
>  			kunmap(page);
>  			btrfs_mark_buffer_dirty(leaf);
>  		}
> -		set_extent_uptodate(io_tree, em->start,
> -				    extent_map_end(em) - 1, NULL, GFP_NOFS);
> +		set_page_blks_state(page, 1 << BLK_STATE_UPTODATE, em->start,
> +				extent_map_end(em) - 1);
>  		goto insert;
>  	}
>  not_found:
> @@ -8392,11 +8391,9 @@ static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
>  	tree = &BTRFS_I(page->mapping->host)->io_tree;
>  	map = &BTRFS_I(page->mapping->host)->extent_tree;
>  	ret = try_release_extent_mapping(map, tree, page, gfp_flags);
> -	if (ret == 1) {
> -		ClearPagePrivate(page);
> -		set_page_private(page, 0);
> -		page_cache_release(page);
> -	}
> +	if (ret == 1)
> +		clear_page_extent_mapped(page);
> +
>  	return ret;
>  }
>  
> -- 
> 2.1.0
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read.
  2015-06-19  4:45   ` Liu Bo
@ 2015-06-19  9:45     ` Chandan Rajendra
  2015-06-23  8:37       ` Liu Bo
  2016-02-10 10:39       ` David Sterba
  0 siblings, 2 replies; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-19  9:45 UTC (permalink / raw)
  To: bo.li.liu; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Friday 19 Jun 2015 12:45:37 Liu Bo wrote:
> On Mon, Jun 01, 2015 at 08:52:36PM +0530, Chandan Rajendra wrote:
> > For the subpagesize-blocksize scenario, a page can contain multiple
> > blocks. In such cases, this patch handles reading data from files.
> > 
> > To track the status of individual blocks of a page, this patch makes use
> > of a bitmap pointed to by page->private.
> 
> Started going through the patchset; it's not easy though.
> 
> Several comments follow.

Thanks for the review comments, Liu.

> > +static int modify_page_blks_state(struct page *page,
> > +				unsigned long blk_states,
> > +				u64 start, u64 end, int set)
> > +{
> > +	struct inode *inode = page->mapping->host;
> > +	unsigned long *bitmap;
> > +	unsigned long state;
> > +	u64 nr_blks;
> > +	u64 blk;
> > +
> > +	BUG_ON(!PagePrivate(page));
> > +
> > +	bitmap = ((struct btrfs_page_private *)page->private)->bstate;
> > +
> > +	blk = (start & (PAGE_CACHE_SIZE - 1)) >> inode->i_blkbits;
> > +	nr_blks = (end - start + 1) >> inode->i_blkbits;
> > +
> > +	while (nr_blks--) {
> > +		state = find_next_bit(&blk_states, BLK_NR_STATE, 0);
> 
> Looks like we don't need to do find_next_bit for every block.

Yes, I agree. The find_next_bit() invocation in the outer loop can be moved
outside the loop.
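
Something like the following rearrangement, i.e. hoisting the initial
find_next_bit() call out of the per-block loop (an untested sketch):

	state = find_next_bit(&blk_states, BLK_NR_STATE, 0);

	while (nr_blks--) {
		unsigned long s = state;

		while (s < BLK_NR_STATE) {
			if (set)
				set_bit((blk * BLK_NR_STATE) + s, bitmap);
			else
				clear_bit((blk * BLK_NR_STATE) + s, bitmap);

			s = find_next_bit(&blk_states, BLK_NR_STATE, s + 1);
		}
		++blk;
	}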
> 
> > +
> > +		while (state < BLK_NR_STATE) {
> > +			if (set)
> > +				set_bit((blk * BLK_NR_STATE) + state, bitmap);
> > +			else
> > +				clear_bit((blk * BLK_NR_STATE) + state, bitmap);
> > +
> > +			state = find_next_bit(&blk_states, BLK_NR_STATE,
> > +					state + 1);
> > +		}
> > +
> > +		++blk;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > 
> >  /*
> >  
> >   * after a readpage IO is done, we need to:
> >   * clear the uptodate bits on error
> > 
> > @@ -2548,14 +2628,16 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
> >  	struct bio_vec *bvec;
> >  	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
> >  	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> > 
> > +	struct extent_state *cached = NULL;
> > +	struct btrfs_page_private *pg_private;
> > 
> >  	struct extent_io_tree *tree;
> > 
> > +	unsigned long flags;
> > 
> >  	u64 offset = 0;
> >  	u64 start;
> >  	u64 end;
> > 
> > -	u64 len;
> > -	u64 extent_start = 0;
> > -	u64 extent_len = 0;
> > +	int nr_sectors;
> > 
> >  	int mirror;
> > 
> > +	int unlock;
> > 
> >  	int ret;
> >  	int i;
> > 
> > @@ -2565,54 +2647,31 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
> >  	bio_for_each_segment_all(bvec, bio, i) {
> >  	
> >  		struct page *page = bvec->bv_page;
> >  		struct inode *inode = page->mapping->host;
> > 
> > +		struct btrfs_root *root = BTRFS_I(inode)->root;
> > 
> >  		pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, "
> >  		
> >  			 "mirror=%u\n", (u64)bio->bi_iter.bi_sector, err,
> >  			 io_bio->mirror_num);
> >  		
> >  		tree = &BTRFS_I(inode)->io_tree;
> > 
> > -		/* We always issue full-page reads, but if some block
> > -		 * in a page fails to read, blk_update_request() will
> > -		 * advance bv_offset and adjust bv_len to compensate.
> > -		 * Print a warning for nonzero offsets, and an error
> > -		 * if they don't add up to a full page.  */
> > -		if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) {
> > -			if (bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE)
> > -				btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
> > -				   "partial page read in btrfs with offset %u and length %u",
> > -					bvec->bv_offset, bvec->bv_len);
> > -			else
> > -				btrfs_info(BTRFS_I(page->mapping->host)->root->fs_info,
> > -				   "incomplete page read in btrfs with offset %u and "
> > -				   "length %u",
> > -					bvec->bv_offset, bvec->bv_len);
> > -		}
> > -
> > -		start = page_offset(page);
> > -		end = start + bvec->bv_offset + bvec->bv_len - 1;
> > -		len = bvec->bv_len;
> > -
> > +		start = page_offset(page) + bvec->bv_offset;
> > +		end = start + bvec->bv_len - 1;
> > +		nr_sectors = bvec->bv_len >> inode->i_sb->s_blocksize_bits;
> > 
> >  		mirror = io_bio->mirror_num;
> > 
> > -		if (likely(uptodate && tree->ops &&
> > -			   tree->ops->readpage_end_io_hook)) {
> > +
> > +next_block:
> > +		if (likely(uptodate)) {
> 
> Any reason for killing (tree->ops && tree->ops->readpage_end_io_hook)?

In the subpagesize-blocksize scenario, for extent buffers we need the ability
to read a single extent buffer rather than the complete contents of the page
containing it. Similarly, in the corresponding endio function we need to
verify a single extent buffer rather than the contents of the full page.
Hence I ended up removing the btree_readpage_end_io_hook() and
btree_io_failed_hook() functions and had the verification functions invoked
directly by the endio function.

So, since the data read-page code was the only one left to have
extent_io_tree->ops->readpage_end_io_hook set, I removed the code that checks
for its existence. I now realize that this is not the right thing to do; I
will restore the condition check to its original state.

> 
> >  			ret = tree->ops->readpage_end_io_hook(io_bio, offset,
> > 
> > -							      page, start, end,
> > -							      mirror);
> > +							page, start,
> > +							start + root->sectorsize - 1,
> > +							mirror);
> > 
> >  			if (ret)
> >  			
> >  				uptodate = 0;
> >  			
> >  			else
> >  			
> >  				clean_io_failure(inode, start, page, 0);
> >  		
> >  		}
> > 
> > -		if (likely(uptodate))
> > -			goto readpage_ok;
> > -
> > -		if (tree->ops && tree->ops->readpage_io_failed_hook) {
> > -			ret = tree->ops->readpage_io_failed_hook(page, mirror);
> > -			if (!ret && !err &&
> > -			    test_bit(BIO_UPTODATE, &bio->bi_flags))
> > -				uptodate = 1;
> > -		} else {
> > +		if (!uptodate) {
> > 
> >  			/*
> >  			
> >  			 * The generic bio_readpage_error handles errors the
> >  			 * following way: If possible, new read requests are
> > 
> > @@ -2623,61 +2682,63 @@ static void end_bio_extent_readpage(struct bio
> > *bio, int err)> 
> >  			 * can't handle the error it will return -EIO and we
> >  			 * remain responsible for that page.
> >  			 */
> > 
> > -			ret = bio_readpage_error(bio, offset, page, start, end,
> > -						 mirror);
> > +			ret = bio_readpage_error(bio, offset, page,
> > +						start, start + root->sectorsize - 1,
> > +						mirror);
> > 
> >  			if (ret == 0) {
> > 
> > -				uptodate =
> > -					test_bit(BIO_UPTODATE, &bio->bi_flags);
> > +				uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
> > 
> >  				if (err)
> >  				
> >  					uptodate = 0;
> > 
> > -				offset += len;
> > -				continue;
> > +				offset += root->sectorsize;
> > +				if (--nr_sectors) {
> > +					start += root->sectorsize;
> > +					goto next_block;
> > +				} else {
> > +					continue;
> > +				}
> > 
> >  			}
> >  		
> >  		}
> > 
> > -readpage_ok:
> > -		if (likely(uptodate)) {
> > -			loff_t i_size = i_size_read(inode);
> > -			pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
> > -			unsigned off;
> > -
> > -			/* Zero out the end if this page straddles i_size */
> > -			off = i_size & (PAGE_CACHE_SIZE-1);
> > -			if (page->index == end_index && off)
> > -				zero_user_segment(page, off, PAGE_CACHE_SIZE);
> > -			SetPageUptodate(page);
> > +
> > +		if (uptodate) {
> > +			set_page_blks_state(page, 1 << BLK_STATE_UPTODATE, start,
> > +					start + root->sectorsize - 1);
> > +			check_page_uptodate(page);
> > 
> >  		} else {
> >  		
> >  			ClearPageUptodate(page);
> >  			SetPageError(page);
> >  		
> >  		}
> > 
> > -		unlock_page(page);
> > -		offset += len;
> > -
> > -		if (unlikely(!uptodate)) {
> > -			if (extent_len) {
> > -				endio_readpage_release_extent(tree,
> > -							      extent_start,
> > -							      extent_len, 1);
> > -				extent_start = 0;
> > -				extent_len = 0;
> > -			}
> > -			endio_readpage_release_extent(tree, start,
> > -						      end - start + 1, 0);
> > -		} else if (!extent_len) {
> > -			extent_start = start;
> > -			extent_len = end + 1 - start;
> > -		} else if (extent_start + extent_len == start) {
> > -			extent_len += end + 1 - start;
> > -		} else {
> > -			endio_readpage_release_extent(tree, extent_start,
> > -						      extent_len, uptodate);
> > -			extent_start = start;
> > -			extent_len = end + 1 - start;
> > +
> > +		offset += root->sectorsize;
> > +
> > +		if (--nr_sectors) {
> > +			clear_page_blks_state(page, 1 << BLK_STATE_IO,
> > +					start, start + root->sectorsize - 1);
> 
> private->io_lock is not acquired here but is below.
> 
> IIUC, this can be protected by EXTENT_LOCKED.
>

private->io_lock plays the same role as BH_Uptodate_Lock (see
end_buffer_async_read()) i.e. without the io_lock we may end up in the
following situation,

NOTE: Assume 64k page size and 4k block size. Also assume that the first 12
blocks of the page are contiguous while the next 4 blocks are contiguous. When
reading the page we end up submitting two "logical address space" bios. So
end_bio_extent_readpage function is invoked twice (once for each bio).

|-------------------------+-------------------------+-------------|
| Task A                  | Task B                  | Task C      |
|-------------------------+-------------------------+-------------|
| end_bio_extent_readpage |                         |             |
| process block 0         |                         |             |
| - clear BLK_STATE_IO    |                         |             |
| - page_read_complete    |                         |             |
| process block 1         |                         |             |
| ...                     |                         |             |
| ...                     |                         |             |
| ...                     | end_bio_extent_readpage |             |
| ...                     | process block 0         |             |
| ...                     | - clear BLK_STATE_IO    |             |
| ...                     | - page_read_complete    |             |
| ...                     | process block 1         |             |
| ...                     | ...                     |             |
| process block 11        | process block 3         |             |
| - clear BLK_STATE_IO    | - clear BLK_STATE_IO    |             |
| - page_read_complete    | - page_read_complete    |             |
|   - returns true        |   - returns true        |             |
|   - unlock_page()       |                         |             |
|                         |                         | lock_page() |
|                         |   - unlock_page()       |             |
|-------------------------+-------------------------+-------------|

So we end up incorrectly unlocking the page twice, and "Task C" ends up working
on an unlocked page. Hence private->io_lock makes sure that only one of the
tasks gets "true" as the return value when page_read_complete() is invoked. As
an optimization, the patch takes the io_lock only when the nr_sectors counter
reaches 0 (i.e. when the last block of the bio_vec is being processed).
Please let me know if my analysis is incorrect.

Also, I noticed that page_read_complete() and page_write_complete() can be
replaced by just one function i.e. page_io_complete().
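
To make that concrete, here is a minimal sketch of the merged helper and of
the locked completion check on the read side (assuming the helpers and
struct btrfs_page_private from this patchset; the exact call site is an
assumption):

static int page_io_complete(struct page *page)
{
	u64 start = page_offset(page);
	u64 end = start + PAGE_CACHE_SIZE - 1;

	/* true only once no block of the page has BLK_STATE_IO set */
	return !test_page_blks_state(page, BLK_STATE_IO, start, end, 0);
}

	/* read endio, last block of the bio_vec: serialize against other
	 * tasks completing blocks of the same page */
	pg_private = (struct btrfs_page_private *)page->private;
	spin_lock_irqsave(&pg_private->io_lock, flags);
	clear_page_blks_state(page, 1 << BLK_STATE_IO, start,
			start + root->sectorsize - 1);
	unlock = page_io_complete(page);
	spin_unlock_irqrestore(&pg_private->io_lock, flags);
	if (unlock)
		unlock_page(page);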


-- 
chandan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read.
  2015-06-19  9:45     ` Chandan Rajendra
@ 2015-06-23  8:37       ` Liu Bo
  2016-02-10 10:44         ` David Sterba
  2016-02-10 10:39       ` David Sterba
  1 sibling, 1 reply; 47+ messages in thread
From: Liu Bo @ 2015-06-23  8:37 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Fri, Jun 19, 2015 at 03:15:01PM +0530, Chandan Rajendra wrote:
> On Friday 19 Jun 2015 12:45:37 Liu Bo wrote:
> > On Mon, Jun 01, 2015 at 08:52:36PM +0530, Chandan Rajendra wrote:
> > > For the subpagesize-blocksize scenario, a page can contain multiple
> > > blocks. In such cases, this patch handles reading data from files.
> > > 
> > > To track the status of individual blocks of a page, this patch makes use
> > > of a bitmap pointed to by page->private.
> > 
> > Start going through the patchset, it's not easy though.
> > 
> > Several comments are following.
> 
> Thanks for the review comments Liu.
> 
> > > +static int modify_page_blks_state(struct page *page,
> > > +				unsigned long blk_states,
> > > +				u64 start, u64 end, int set)
> > > +{
> > > +	struct inode *inode = page->mapping->host;
> > > +	unsigned long *bitmap;
> > > +	unsigned long state;
> > > +	u64 nr_blks;
> > > +	u64 blk;
> > > +
> > > +	BUG_ON(!PagePrivate(page));
> > > +
> > > +	bitmap = ((struct btrfs_page_private *)page->private)->bstate;
> > > +
> > > +	blk = (start & (PAGE_CACHE_SIZE - 1)) >> inode->i_blkbits;
> > > +	nr_blks = (end - start + 1) >> inode->i_blkbits;
> > > +
> > > +	while (nr_blks--) {
> > > +		state = find_next_bit(&blk_states, BLK_NR_STATE, 0);
> > 
> > Looks like we don't need to do find_next_bit for every block.
> 
> Yes, I agree. The find_next_bit() invocation in the outer loop can be moved
> outside the loop.
> > 
> > > +
> > > +		while (state < BLK_NR_STATE) {
> > > +			if (set)
> > > +				set_bit((blk * BLK_NR_STATE) + state, bitmap);
> > > +			else
> > > +				clear_bit((blk * BLK_NR_STATE) + state, 
> bitmap);
> > > +
> > > +			state = find_next_bit(&blk_states, BLK_NR_STATE,
> > > +					state + 1);
> > > +		}
> > > +
> > > +		++blk;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > 
> > >  /*
> > >  
> > >   * after a readpage IO is done, we need to:
> > >   * clear the uptodate bits on error
> > > 
> > > @@ -2548,14 +2628,16 @@ static void end_bio_extent_readpage(struct bio
> > > *bio, int err)> 
> > >  	struct bio_vec *bvec;
> > >  	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
> > >  	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> > > 
> > > +	struct extent_state *cached = NULL;
> > > +	struct btrfs_page_private *pg_private;
> > > 
> > >  	struct extent_io_tree *tree;
> > > 
> > > +	unsigned long flags;
> > > 
> > >  	u64 offset = 0;
> > >  	u64 start;
> > >  	u64 end;
> > > 
> > > -	u64 len;
> > > -	u64 extent_start = 0;
> > > -	u64 extent_len = 0;
> > > +	int nr_sectors;
> > > 
> > >  	int mirror;
> > > 
> > > +	int unlock;
> > > 
> > >  	int ret;
> > >  	int i;
> > > 
> > > @@ -2565,54 +2647,31 @@ static void end_bio_extent_readpage(struct bio
> > > *bio, int err)> 
> > >  	bio_for_each_segment_all(bvec, bio, i) {
> > >  	
> > >  		struct page *page = bvec->bv_page;
> > >  		struct inode *inode = page->mapping->host;
> > > 
> > > +		struct btrfs_root *root = BTRFS_I(inode)->root;
> > > 
> > >  		pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, "
> > >  		
> > >  			 "mirror=%u\n", (u64)bio->bi_iter.bi_sector, err,
> > >  			 io_bio->mirror_num);
> > >  		
> > >  		tree = &BTRFS_I(inode)->io_tree;
> > > 
> > > -		/* We always issue full-page reads, but if some block
> > > -		 * in a page fails to read, blk_update_request() will
> > > -		 * advance bv_offset and adjust bv_len to compensate.
> > > -		 * Print a warning for nonzero offsets, and an error
> > > -		 * if they don't add up to a full page.  */
> > > -		if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) {
> > > -			if (bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE)
> > > -				btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
> > > -				   "partial page read in btrfs with offset %u and length %u",
> > > -					bvec->bv_offset, bvec->bv_len);
> > > -			else
> > > -				btrfs_info(BTRFS_I(page->mapping->host)->root->fs_info,
> > > -				   "incomplete page read in btrfs with offset %u and "
> > > -				   "length %u",
> > > -					bvec->bv_offset, bvec->bv_len);
> > > -		}
> > > -
> > > -		start = page_offset(page);
> > > -		end = start + bvec->bv_offset + bvec->bv_len - 1;
> > > -		len = bvec->bv_len;
> > > -
> > > +		start = page_offset(page) + bvec->bv_offset;
> > > +		end = start + bvec->bv_len - 1;
> > > +		nr_sectors = bvec->bv_len >> inode->i_sb->s_blocksize_bits;
> > > 
> > >  		mirror = io_bio->mirror_num;
> > > 
> > > -		if (likely(uptodate && tree->ops &&
> > > -			   tree->ops->readpage_end_io_hook)) {
> > > +
> > > +next_block:
> > > +		if (likely(uptodate)) {
> > 
> > Any reason of killing (tree->ops && tree->ops->readpage_end_io_hook)?
> 
> In the subpagesize-blocksize scenario, for extent buffers we need the ability
> to read just a single extent buffer rather than the complete contents of the
> page containing it. Similarly, in the corresponding endio function we need to
> verify a single extent buffer rather than the full page. Hence I ended up
> removing the btree_readpage_end_io_hook() and btree_io_failed_hook() functions
> and had the verification functions invoked directly by the endio function.
> 
> Since the data "read page" code path was then the only one left with
> extent_io_tree->ops->readpage_end_io_hook set, I removed the check for its
> existence. Now I realize that was not the right thing to do; I will restore
> the condition check to its original state.
> 
> > 
> > >  			ret = tree->ops->readpage_end_io_hook(io_bio, offset,
> > > 
> > > -							      page, start, end,
> > > -							      mirror);
> > > +							page, start,
> > > +							start + root->sectorsize - 1,
> > > +							mirror);
> > > 
> > >  			if (ret)
> > >  			
> > >  				uptodate = 0;
> > >  			
> > >  			else
> > >  			
> > >  				clean_io_failure(inode, start, page, 0);
> > >  		
> > >  		}
> > > 
> > > -		if (likely(uptodate))
> > > -			goto readpage_ok;
> > > -
> > > -		if (tree->ops && tree->ops->readpage_io_failed_hook) {
> > > -			ret = tree->ops->readpage_io_failed_hook(page, mirror);
> > > -			if (!ret && !err &&
> > > -			    test_bit(BIO_UPTODATE, &bio->bi_flags))
> > > -				uptodate = 1;
> > > -		} else {
> > > +		if (!uptodate) {
> > > 
> > >  			/*
> > >  			
> > >  			 * The generic bio_readpage_error handles errors the
> > >  			 * following way: If possible, new read requests are
> > > 
> > > @@ -2623,61 +2682,63 @@ static void end_bio_extent_readpage(struct bio
> > > *bio, int err)> 
> > >  			 * can't handle the error it will return -EIO and we
> > >  			 * remain responsible for that page.
> > >  			 */
> > > 
> > > -			ret = bio_readpage_error(bio, offset, page, start, end,
> > > -						 mirror);
> > > +			ret = bio_readpage_error(bio, offset, page,
> > > +						start, start + root->sectorsize - 1,
> > > +						mirror);
> > > 
> > >  			if (ret == 0) {
> > > 
> > > -				uptodate =
> > > -					test_bit(BIO_UPTODATE, &bio->bi_flags);
> > > +				uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
> > > 
> > >  				if (err)
> > >  				
> > >  					uptodate = 0;
> > > 
> > > -				offset += len;
> > > -				continue;
> > > +				offset += root->sectorsize;
> > > +				if (--nr_sectors) {
> > > +					start += root->sectorsize;
> > > +					goto next_block;
> > > +				} else {
> > > +					continue;
> > > +				}
> > > 
> > >  			}
> > >  		
> > >  		}
> > > 
> > > -readpage_ok:
> > > -		if (likely(uptodate)) {
> > > -			loff_t i_size = i_size_read(inode);
> > > -			pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
> > > -			unsigned off;
> > > -
> > > -			/* Zero out the end if this page straddles i_size */
> > > -			off = i_size & (PAGE_CACHE_SIZE-1);
> > > -			if (page->index == end_index && off)
> > > -				zero_user_segment(page, off, PAGE_CACHE_SIZE);
> > > -			SetPageUptodate(page);
> > > +
> > > +		if (uptodate) {
> > > +			set_page_blks_state(page, 1 << BLK_STATE_UPTODATE, start,
> > > +					start + root->sectorsize - 1);
> > > +			check_page_uptodate(page);
> > > 
> > >  		} else {
> > >  		
> > >  			ClearPageUptodate(page);
> > >  			SetPageError(page);
> > >  		
> > >  		}
> > > 
> > > -		unlock_page(page);
> > > -		offset += len;
> > > -
> > > -		if (unlikely(!uptodate)) {
> > > -			if (extent_len) {
> > > -				endio_readpage_release_extent(tree,
> > > -							      extent_start,
> > > -							      extent_len, 1);
> > > -				extent_start = 0;
> > > -				extent_len = 0;
> > > -			}
> > > -			endio_readpage_release_extent(tree, start,
> > > -						      end - start + 1, 0);
> > > -		} else if (!extent_len) {
> > > -			extent_start = start;
> > > -			extent_len = end + 1 - start;
> > > -		} else if (extent_start + extent_len == start) {
> > > -			extent_len += end + 1 - start;
> > > -		} else {
> > > -			endio_readpage_release_extent(tree, extent_start,
> > > -						      extent_len, uptodate);
> > > -			extent_start = start;
> > > -			extent_len = end + 1 - start;
> > > +
> > > +		offset += root->sectorsize;
> > > +
> > > +		if (--nr_sectors) {
> > > +			clear_page_blks_state(page, 1 << BLK_STATE_IO,
> > > +					start, start + root->sectorsize - 1);
> > 
> > private->io_lock is not acquired here but is below.
> > 
> > IIUC, this can be protected by EXTENT_LOCKED.
> >
> 
> private->io_lock plays the same role as BH_Uptodate_Lock (see
> end_buffer_async_read()) i.e. without the io_lock we may end up in the
> following situation,
> 
> NOTE: Assume 64k page size and 4k block size. Also assume that the first 12
> blocks of the page are contiguous while the next 4 blocks are contiguous. When
> reading the page we end up submitting two "logical address space" bios. So
> end_bio_extent_readpage function is invoked twice (once for each bio).
> 
> |-------------------------+-------------------------+-------------|
> | Task A                  | Task B                  | Task C      |
> |-------------------------+-------------------------+-------------|
> | end_bio_extent_readpage |                         |             |
> | process block 0         |                         |             |
> | - clear BLK_STATE_IO    |                         |             |
> | - page_read_complete    |                         |             |
> | process block 1         |                         |             |
> | ...                     |                         |             |
> | ...                     |                         |             |
> | ...                     | end_bio_extent_readpage |             |
> | ...                     | process block 0         |             |
> | ...                     | - clear BLK_STATE_IO    |             |
> | ...                     | - page_read_complete    |             |
> | ...                     | process block 1         |             |
> | ...                     | ...                     |             |
> | process block 11        | process block 3         |             |
> | - clear BLK_STATE_IO    | - clear BLK_STATE_IO    |             |
> | - page_read_complete    | - page_read_complete    |             |
> |   - returns true        |   - returns true        |             |
> |   - unlock_page()       |                         |             |
> |                         |                         | lock_page() |
> |                         |   - unlock_page()       |             |
> |-------------------------+-------------------------+-------------|
> 
> So we end up incorrectly unlocking the page twice, and "Task C" ends up working
> on an unlocked page. Hence private->io_lock makes sure that only one of the
> tasks gets "true" as the return value when page_read_complete() is invoked. As
> an optimization, the patch takes the io_lock only when the nr_sectors counter
> reaches 0 (i.e. when the last block of the bio_vec is being processed).
> Please let me know if my analysis is incorrect.

Thanks for the nice explanation, it looks reasonable to me.

Thanks,

-liubo

> 
> Also, I noticed that page_read_complete() and page_write_complete() can be
> replaced by just one function i.e. page_io_complete().
> 
> 
> -- 
> chandan
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 02/21] Btrfs: subpagesize-blocksize: Fix whole page write.
  2015-06-01 15:22 ` [RFC PATCH V11 02/21] Btrfs: subpagesize-blocksize: Fix whole page write Chandan Rajendra
@ 2015-06-26  9:50   ` Liu Bo
  2015-06-29  8:54     ` Chandan Rajendra
  0 siblings, 1 reply; 47+ messages in thread
From: Liu Bo @ 2015-06-26  9:50 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Mon, Jun 01, 2015 at 08:52:37PM +0530, Chandan Rajendra wrote:
> For the subpagesize-blocksize scenario, a page can contain multiple
> blocks. In such cases, this patch handles writing data to files.
> 
> Also, When setting EXTENT_DELALLOC, we no longer set EXTENT_UPTODATE bit on
> the extent_io_tree since uptodate status is being tracked by the bitmap
> pointed to by page->private.

To be honest, I'm not sure why we set the EXTENT_UPTODATE bit for data, as we
don't check for that bit at all right now; correct me if I'm wrong.

> 
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> ---
>  fs/btrfs/extent_io.c | 141 +++++++++++++++++++++++----------------------------
>  fs/btrfs/file.c      |  16 ++++++
>  fs/btrfs/inode.c     |  58 ++++++++++++++++-----
>  3 files changed, 125 insertions(+), 90 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index d37badb..3736ab5 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1283,9 +1283,8 @@ int clear_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
>  int set_extent_delalloc(struct extent_io_tree *tree, u64 start, u64 end,
>  			struct extent_state **cached_state, gfp_t mask)
>  {
> -	return set_extent_bit(tree, start, end,
> -			      EXTENT_DELALLOC | EXTENT_UPTODATE,
> -			      NULL, cached_state, mask);
> +	return set_extent_bit(tree, start, end, EXTENT_DELALLOC,
> +			NULL, cached_state, mask);
>  }
>  
>  int set_extent_defrag(struct extent_io_tree *tree, u64 start, u64 end,
> @@ -1498,25 +1497,6 @@ int extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end)
>  	return 0;
>  }
>  
> -/*
> - * helper function to set both pages and extents in the tree writeback
> - */
> -static int set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
> -{
> -	unsigned long index = start >> PAGE_CACHE_SHIFT;
> -	unsigned long end_index = end >> PAGE_CACHE_SHIFT;
> -	struct page *page;
> -
> -	while (index <= end_index) {
> -		page = find_get_page(tree->mapping, index);
> -		BUG_ON(!page); /* Pages should be in the extent_io_tree */
> -		set_page_writeback(page);
> -		page_cache_release(page);
> -		index++;
> -	}
> -	return 0;
> -}
> -
>  /* find the first state struct with 'bits' set after 'start', and
>   * return it.  tree->lock must be held.  NULL will returned if
>   * nothing was found after 'start'
> @@ -2080,6 +2060,14 @@ static int page_read_complete(struct page *page)
>  	return !test_page_blks_state(page, BLK_STATE_IO, start, end, 0);
>  }
>  
> +static int page_write_complete(struct page *page)
> +{
> +	u64 start = page_offset(page);
> +	u64 end = start + PAGE_CACHE_SIZE - 1;
> +
> +	return !test_page_blks_state(page, BLK_STATE_IO, start, end, 0);
> +}
> +
>  int free_io_failure(struct inode *inode, struct io_failure_record *rec)
>  {
>  	int ret;
> @@ -2575,38 +2563,37 @@ int end_extent_writepage(struct page *page, int err, u64 start, u64 end)
>   */
>  static void end_bio_extent_writepage(struct bio *bio, int err)
>  {
> +	struct btrfs_page_private *pg_private;
>  	struct bio_vec *bvec;
> +	unsigned long flags;
>  	u64 start;
>  	u64 end;
> +	int clear_writeback;
>  	int i;
>  
>  	bio_for_each_segment_all(bvec, bio, i) {
>  		struct page *page = bvec->bv_page;
>  
> -		/* We always issue full-page reads, but if some block
> -		 * in a page fails to read, blk_update_request() will
> -		 * advance bv_offset and adjust bv_len to compensate.
> -		 * Print a warning for nonzero offsets, and an error
> -		 * if they don't add up to a full page.  */
> -		if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) {
> -			if (bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE)
> -				btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
> -				   "partial page write in btrfs with offset %u and length %u",
> -					bvec->bv_offset, bvec->bv_len);
> -			else
> -				btrfs_info(BTRFS_I(page->mapping->host)->root->fs_info,
> -				   "incomplete page write in btrfs with offset %u and "
> -				   "length %u",
> -					bvec->bv_offset, bvec->bv_len);
> -		}
> +		start = page_offset(page) + bvec->bv_offset;
> +		end = start + bvec->bv_len - 1;
>  
> -		start = page_offset(page);
> -		end = start + bvec->bv_offset + bvec->bv_len - 1;
> +		pg_private = (struct btrfs_page_private *)page->private;
> +
> +		spin_lock_irqsave(&pg_private->io_lock, flags);
>  
> -		if (end_extent_writepage(page, err, start, end))
> +		if (end_extent_writepage(page, err, start, end)) {
> +			spin_unlock_irqrestore(&pg_private->io_lock, flags);
>  			continue;
> +		}
>  
> -		end_page_writeback(page);
> +		clear_page_blks_state(page, 1 << BLK_STATE_IO, start, end);
> +
> +		clear_writeback = page_write_complete(page);
> +
> +		spin_unlock_irqrestore(&pg_private->io_lock, flags);
> +
> +		if (clear_writeback)
> +			end_page_writeback(page);
>  	}
>  
>  	bio_put(bio);
> @@ -3417,10 +3404,9 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
>  	u64 block_start;
>  	u64 iosize;
>  	sector_t sector;
> -	struct extent_state *cached_state = NULL;
>  	struct extent_map *em;
>  	struct block_device *bdev;
> -	size_t pg_offset = 0;
> +	size_t pg_offset;
>  	size_t blocksize;
>  	int ret = 0;
>  	int nr = 0;
> @@ -3467,8 +3453,16 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
>  							 page_end, NULL, 1);
>  			break;
>  		}
> -		em = epd->get_extent(inode, page, pg_offset, cur,
> -				     end - cur + 1, 1);
> +
> +		pg_offset = cur & (PAGE_CACHE_SIZE - 1);
> +
> +		if (!test_page_blks_state(page, BLK_STATE_DIRTY, cur,
> +						cur + blocksize - 1, 1)) {
> +			cur += blocksize;
> +			continue;
> +		}

If we don't check this, the get_extent() below will return a HOLE (block_start
== EXTENT_MAP_HOLE) and we can still move on to the next block, so we wouldn't
need to maintain this BLK_STATE_DIRTY bit at all.

> +
> +		em = epd->get_extent(inode, page, pg_offset, cur, blocksize, 1);
>  		if (IS_ERR_OR_NULL(em)) {
>  			SetPageError(page);
>  			ret = PTR_ERR_OR_ZERO(em);
> @@ -3479,7 +3473,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
>  		em_end = extent_map_end(em);
>  		BUG_ON(em_end <= cur);
>  		BUG_ON(end < cur);
> -		iosize = min(em_end - cur, end - cur + 1);
> +		iosize = min_t(u64, em_end - cur, blocksize);
>  		iosize = ALIGN(iosize, blocksize);

This limits us to one block per loop iteration; if two blocks are contiguous,
it should be fine to write them out together.
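
Something along the lines below could merge contiguous dirty blocks into one
submission (a sketch of this suggestion, not code from the patch; the exact
bounds checks are assumptions):

		u64 next = cur + blocksize;

		iosize = blocksize;
		/* merge following blocks that are dirty, still inside this
		 * page and still covered by the same extent map */
		while (next + blocksize - 1 <= end &&
		       next + blocksize <= em_end &&
		       test_page_blks_state(page, BLK_STATE_DIRTY, next,
					next + blocksize - 1, 1)) {
			iosize += blocksize;
			next += blocksize;
		}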

>  		sector = (em->block_start + extent_offset) >> 9;
>  		bdev = em->bdev;
> @@ -3488,32 +3482,20 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
>  		free_extent_map(em);
>  		em = NULL;
>  
> -		/*
> -		 * compressed and inline extents are written through other
> -		 * paths in the FS
> -		 */
> -		if (compressed || block_start == EXTENT_MAP_HOLE ||
> -		    block_start == EXTENT_MAP_INLINE) {
> -			/*
> -			 * end_io notification does not happen here for
> -			 * compressed extents
> -			 */
> -			if (!compressed && tree->ops &&
> -			    tree->ops->writepage_end_io_hook)
> -				tree->ops->writepage_end_io_hook(page, cur,
> -							 cur + iosize - 1,
> -							 NULL, 1);
> -			else if (compressed) {
> -				/* we don't want to end_page_writeback on
> -				 * a compressed extent.  this happens
> -				 * elsewhere
> -				 */
> -				nr++;
> -			}
> +		BUG_ON(compressed);
> +		BUG_ON(block_start == EXTENT_MAP_INLINE);
>  
> -			cur += iosize;
> -			pg_offset += iosize;
> -			continue;
> +		if (block_start == EXTENT_MAP_HOLE) {
> +			if (test_page_blks_state(page, BLK_STATE_UPTODATE, cur,
> +							cur + iosize - 1, 1)) {
> +				clear_page_blks_state(page,
> +						1 << BLK_STATE_DIRTY, cur,
> +						cur + iosize - 1);
> +				cur += iosize;
> +				continue;
> +			} else {
> +				BUG();
> +			}
>  		}
>  
>  		if (tree->ops && tree->ops->writepage_io_hook) {
> @@ -3527,7 +3509,13 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
>  		} else {
>  			unsigned long max_nr = (i_size >> PAGE_CACHE_SHIFT) + 1;
>  
> -			set_range_writeback(tree, cur, cur + iosize - 1);
> +			clear_page_blks_state(page, 1 << BLK_STATE_DIRTY, cur,
> +					cur + iosize - 1);
> +			set_page_writeback(page);
> +
> +			set_page_blks_state(page, 1 << BLK_STATE_IO, cur,
> +					cur + iosize - 1);
> +
>  			if (!PageWriteback(page)) {
>  				btrfs_err(BTRFS_I(inode)->root->fs_info,
>  					   "page %lu not writeback, cur %llu end %llu",
> @@ -3542,17 +3530,14 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
>  			if (ret)
>  				SetPageError(page);
>  		}
> -		cur = cur + iosize;
> -		pg_offset += iosize;
> +
> +		cur += iosize;
>  		nr++;
>  	}
>  done:
>  	*nr_ret = nr;
>  
>  done_unlocked:
> -
> -	/* drop our reference on any cached states */
> -	free_extent_state(cached_state);
>  	return ret;
>  }
>  
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 23b6e03..cbe6381 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -495,6 +495,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
>  	u64 num_bytes;
>  	u64 start_pos;
>  	u64 end_of_last_block;
> +	u64 start;
> +	u64 end;
> +	u64 page_end;
>  	u64 end_pos = pos + write_bytes;
>  	loff_t isize = i_size_read(inode);
>  
> @@ -507,11 +510,24 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
>  	if (err)
>  		return err;
>  
> +	start = start_pos;
> +
>  	for (i = 0; i < num_pages; i++) {
>  		struct page *p = pages[i];
>  		SetPageUptodate(p);
>  		ClearPageChecked(p);
> +
> +		end = page_end = page_offset(p) + PAGE_CACHE_SIZE - 1;
> +
> +		if (i == num_pages - 1)
> +			end = min_t(u64, page_end, end_of_last_block);
> +
> +		set_page_blks_state(p,
> +				1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
> +				start, end);
>  		set_page_dirty(p);
> +
> +		start = page_end + 1;

This is not the usual way; page_end is unnecessary, (start += PAGE_CACHE_SIZE) should work.

>  	}
>  
>  	/*
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8262f83..ac6a3f3 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1995,6 +1995,11 @@ again:
>  	 }
>  
>  	btrfs_set_extent_delalloc(inode, page_start, page_end, &cached_state);
> +
> +	set_page_blks_state(page,
> +			1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
> +			page_start, page_end);
> +
>  	ClearPageChecked(page);
>  	set_page_dirty(page);
>  out:
> @@ -2984,26 +2989,48 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
>  	struct btrfs_ordered_extent *ordered_extent = NULL;
>  	struct btrfs_workqueue *wq;
>  	btrfs_work_func_t func;
> +	u64 ordered_start, ordered_end;
> +	int done;
>  
>  	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
>  
>  	ClearPagePrivate2(page);
> -	if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
> -					    end - start + 1, uptodate))
> -		return 0;
> +loop:
> +	ordered_extent = btrfs_lookup_ordered_range(inode, start,
> +						end - start + 1);
> +	if (!ordered_extent)
> +		goto out;
>  
> -	if (btrfs_is_free_space_inode(inode)) {
> -		wq = root->fs_info->endio_freespace_worker;
> -		func = btrfs_freespace_write_helper;
> -	} else {
> -		wq = root->fs_info->endio_write_workers;
> -		func = btrfs_endio_write_helper;
> +	ordered_start = max_t(u64, start, ordered_extent->file_offset);
> +	ordered_end = min_t(u64, end,
> +			ordered_extent->file_offset + ordered_extent->len - 1);
> +
> +	done = btrfs_dec_test_ordered_pending(inode, &ordered_extent,
> +					ordered_start,
> +					ordered_end - ordered_start + 1,
> +					uptodate);
> +	if (done) {
> +		if (btrfs_is_free_space_inode(inode)) {
> +			wq = root->fs_info->endio_freespace_worker;
> +			func = btrfs_freespace_write_helper;
> +		} else {
> +			wq = root->fs_info->endio_write_workers;
> +			func = btrfs_endio_write_helper;
> +		}
> +
> +		btrfs_init_work(&ordered_extent->work, func,
> +				finish_ordered_fn, NULL, NULL);
> +		btrfs_queue_work(wq, &ordered_extent->work);
>  	}
>  
> -	btrfs_init_work(&ordered_extent->work, func, finish_ordered_fn, NULL,
> -			NULL);
> -	btrfs_queue_work(wq, &ordered_extent->work);
> +	btrfs_put_ordered_extent(ordered_extent);
> +
> +	start = ordered_end + 1;
> +
> +	if (start < end)
> +		goto loop;
>  
> +out:

I saw this patch puts a BUG_ON(block_start == EXTENT_MAP_INLINE); in
writepage(), but I didn't see code disabling inline data in patch 01 or
patch 02. Anyway, I think we can avoid the above search for ordered extents
within a single page if we enable inline data.

Thanks,

-liubo

>  	return 0;
>  }
>  
> @@ -4601,6 +4628,9 @@ again:
>  		goto out_unlock;
>  	}
>  
> +	set_page_blks_state(page, 1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
> +			page_start, page_end);
> +
>  	if (offset != PAGE_CACHE_SIZE) {
>  		if (!len)
>  			len = PAGE_CACHE_SIZE - offset;
> @@ -8590,6 +8620,10 @@ again:
>  		ret = VM_FAULT_SIGBUS;
>  		goto out_unlock;
>  	}
> +
> +	set_page_blks_state(page, 1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
> +			page_start, end);
> +
>  	ret = 0;
>  
>  	/* page is wholly or partially inside EOF */
> -- 
> 2.1.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 02/21] Btrfs: subpagesize-blocksize: Fix whole page write.
  2015-06-26  9:50   ` Liu Bo
@ 2015-06-29  8:54     ` Chandan Rajendra
  2015-07-01 14:27       ` Liu Bo
  0 siblings, 1 reply; 47+ messages in thread
From: Chandan Rajendra @ 2015-06-29  8:54 UTC (permalink / raw)
  To: bo.li.liu; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Friday 26 Jun 2015 17:50:54 Liu Bo wrote:
> On Mon, Jun 01, 2015 at 08:52:37PM +0530, Chandan Rajendra wrote:
> > For the subpagesize-blocksize scenario, a page can contain multiple
> > blocks. In such cases, this patch handles writing data to files.
> > 
> > Also, When setting EXTENT_DELALLOC, we no longer set EXTENT_UPTODATE bit
> > on
> > the extent_io_tree since uptodate status is being tracked by the bitmap
> > pointed to by page->private.
> 
> To be honest, I'm not sure why we set the EXTENT_UPTODATE bit for data, as we
> don't check for that bit at all right now; correct me if I'm wrong.

Yes, I didn't find any code using the EXTENT_UPTODATE flag either. That is
probably because we could get away with referring to the page's PG_uptodate
flag in the blocksize == pagesize scenario. But for the subpagesize-blocksize
scenario we need BLK_STATE_UPTODATE to determine whether a page's PG_uptodate
flag can be set.
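
To make that explicit, check_page_uptodate() in this series boils down to
something like the following (a sketch; the actual helper may differ in
detail):

static void check_page_uptodate(struct page *page)
{
	u64 start = page_offset(page);
	u64 end = start + PAGE_CACHE_SIZE - 1;

	/* SetPageUptodate() only once every block of the page has
	 * BLK_STATE_UPTODATE set in the bitmap at page->private */
	if (test_page_blks_state(page, BLK_STATE_UPTODATE, start, end, 1))
		SetPageUptodate(page);
}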

> 
> > Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> > ---
> > 
> >  fs/btrfs/extent_io.c | 141 +++++++++++++++++++++++----------------------------
> >  fs/btrfs/file.c      |  16 ++++++
> >  fs/btrfs/inode.c     |  58 ++++++++++++++++-----
> >  3 files changed, 125 insertions(+), 90 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index d37badb..3736ab5 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -1283,9 +1283,8 @@ int clear_extent_bits(struct extent_io_tree *tree,
> > u64 start, u64 end,> 
> >  int set_extent_delalloc(struct extent_io_tree *tree, u64 start, u64 end,
> >  
> >  			struct extent_state **cached_state, gfp_t mask)
> >  
> >  {
> > 
> > -	return set_extent_bit(tree, start, end,
> > -			      EXTENT_DELALLOC | EXTENT_UPTODATE,
> > -			      NULL, cached_state, mask);
> > +	return set_extent_bit(tree, start, end, EXTENT_DELALLOC,
> > +			NULL, cached_state, mask);
> > 
> >  }
> >  
> >  int set_extent_defrag(struct extent_io_tree *tree, u64 start, u64 end,
> > 
> > @@ -1498,25 +1497,6 @@ int extent_range_redirty_for_io(struct inode
> > *inode, u64 start, u64 end)> 
> >  	return 0;
> >  
> >  }
> > 
> > -/*
> > - * helper function to set both pages and extents in the tree writeback
> > - */
> > -static int set_range_writeback(struct extent_io_tree *tree, u64 start,
> > u64 end) -{
> > -	unsigned long index = start >> PAGE_CACHE_SHIFT;
> > -	unsigned long end_index = end >> PAGE_CACHE_SHIFT;
> > -	struct page *page;
> > -
> > -	while (index <= end_index) {
> > -		page = find_get_page(tree->mapping, index);
> > -		BUG_ON(!page); /* Pages should be in the extent_io_tree */
> > -		set_page_writeback(page);
> > -		page_cache_release(page);
> > -		index++;
> > -	}
> > -	return 0;
> > -}
> > -
> > 
> >  /* find the first state struct with 'bits' set after 'start', and
> >  
> >   * return it.  tree->lock must be held.  NULL will returned if
> >   * nothing was found after 'start'
> > 
> > @@ -2080,6 +2060,14 @@ static int page_read_complete(struct page *page)
> > 
> >  	return !test_page_blks_state(page, BLK_STATE_IO, start, end, 0);
> >  
> >  }
> > 
> > +static int page_write_complete(struct page *page)
> > +{
> > +	u64 start = page_offset(page);
> > +	u64 end = start + PAGE_CACHE_SIZE - 1;
> > +
> > +	return !test_page_blks_state(page, BLK_STATE_IO, start, end, 0);
> > +}
> > +
> > 
> >  int free_io_failure(struct inode *inode, struct io_failure_record *rec)
> >  {
> >  
> >  	int ret;
> > 
> > @@ -2575,38 +2563,37 @@ int end_extent_writepage(struct page *page, int
> > err, u64 start, u64 end)> 
> >   */
> >  
> >  static void end_bio_extent_writepage(struct bio *bio, int err)
> >  {
> > 
> > +	struct btrfs_page_private *pg_private;
> > 
> >  	struct bio_vec *bvec;
> > 
> > +	unsigned long flags;
> > 
> >  	u64 start;
> >  	u64 end;
> > 
> > +	int clear_writeback;
> > 
> >  	int i;
> >  	
> >  	bio_for_each_segment_all(bvec, bio, i) {
> >  	
> >  		struct page *page = bvec->bv_page;
> > 
> > -		/* We always issue full-page reads, but if some block
> > -		 * in a page fails to read, blk_update_request() will
> > -		 * advance bv_offset and adjust bv_len to compensate.
> > -		 * Print a warning for nonzero offsets, and an error
> > -		 * if they don't add up to a full page.  */
> > -		if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) {
> > -			if (bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE)
> > -				btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
> > -				   "partial page write in btrfs with offset %u and length %u",
> > -					bvec->bv_offset, bvec->bv_len);
> > -			else
> > -				btrfs_info(BTRFS_I(page->mapping->host)->root->fs_info,
> > -				   "incomplete page write in btrfs with offset %u and "
> > -				   "length %u",
> > -					bvec->bv_offset, bvec->bv_len);
> > -		}
> > +		start = page_offset(page) + bvec->bv_offset;
> > +		end = start + bvec->bv_len - 1;
> > 
> > -		start = page_offset(page);
> > -		end = start + bvec->bv_offset + bvec->bv_len - 1;
> > +		pg_private = (struct btrfs_page_private *)page->private;
> > +
> > +		spin_lock_irqsave(&pg_private->io_lock, flags);
> > 
> > -		if (end_extent_writepage(page, err, start, end))
> > +		if (end_extent_writepage(page, err, start, end)) {
> > +			spin_unlock_irqrestore(&pg_private->io_lock, flags);
> > 
> >  			continue;
> > 
> > +		}
> > 
> > -		end_page_writeback(page);
> > +		clear_page_blks_state(page, 1 << BLK_STATE_IO, start, end);
> > +
> > +		clear_writeback = page_write_complete(page);
> > +
> > +		spin_unlock_irqrestore(&pg_private->io_lock, flags);
> > +
> > +		if (clear_writeback)
> > +			end_page_writeback(page);
> > 
> >  	}
> >  	
> >  	bio_put(bio);
> > 
> > @@ -3417,10 +3404,9 @@ static noinline_for_stack int
> > __extent_writepage_io(struct inode *inode,> 
> >  	u64 block_start;
> >  	u64 iosize;
> >  	sector_t sector;
> > 
> > -	struct extent_state *cached_state = NULL;
> > 
> >  	struct extent_map *em;
> >  	struct block_device *bdev;
> > 
> > -	size_t pg_offset = 0;
> > +	size_t pg_offset;
> > 
> >  	size_t blocksize;
> >  	int ret = 0;
> >  	int nr = 0;
> > 
> > @@ -3467,8 +3453,16 @@ static noinline_for_stack int
> > __extent_writepage_io(struct inode *inode,> 
> >  							 page_end, NULL, 1);
> >  			
> >  			break;
> >  		
> >  		}
> > 
> > -		em = epd->get_extent(inode, page, pg_offset, cur,
> > -				     end - cur + 1, 1);
> > +
> > +		pg_offset = cur & (PAGE_CACHE_SIZE - 1);
> > +
> > +		if (!test_page_blks_state(page, BLK_STATE_DIRTY, cur,
> > +						cur + blocksize - 1, 1)) {
> > +			cur += blocksize;
> > +			continue;
> > +		}
> 
> > If we don't check this, the get_extent() below will return a HOLE
> > (block_start == EXTENT_MAP_HOLE) and we can still move on to the next block,
> > so we wouldn't need to maintain this BLK_STATE_DIRTY bit at all.

Sorry, I am not sure if I understood your comment correctly. Are you
suggesting that *page blocks* that are not dirty are always holes?

Let's assume a 64k page whose contents are within i_size and none of the
blocks of the page map to a file hole. Also assume 4k as the block size. Say,
the userspace writes to the "block 0" of the page. The corresponding code in
__btrfs_buffered_write() reads up the complete page into the inode's page
cache and then marks "block 0" of the page as BLK_STATE_DIRTY. Next, the
userspace seeks and writes to "block 4" of the page. In this case, since the
page has PG_uptodate flag already set we don't read the data from the disk
again. We simply go ahead and mark "block 4" as BLK_STATE_DIRTY. As can be
seen in the example scenario, the blocks 1, 2 and 3 are not holes and hence
btrfs_get_extent() would end up returning values other than EXTENT_MAP_HOLE
for em->block_start.
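
To illustrate the scenario above (64k page, 4k blocks, all blocks within
i_size, no holes on disk):

|-----------+-------+-------+-------+-------|
| block     | 0     | 1-3   | 4     | 5-15  |
| bitmap    | DIRTY | clean | DIRTY | clean |
| on disk   | data  | data  | data  | data  |
|-----------+-------+-------+-------+-------|

Blocks 1-3 are clean but still map to real extents, so only the
BLK_STATE_DIRTY bits tell writeback that they can be skipped.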

> 
> > +
> > +		em = epd->get_extent(inode, page, pg_offset, cur, blocksize, 1);
> > 
> >  		if (IS_ERR_OR_NULL(em)) {
> >  		
> >  			SetPageError(page);
> >  			ret = PTR_ERR_OR_ZERO(em);
> > 
> > @@ -3479,7 +3473,7 @@ static noinline_for_stack int
> > __extent_writepage_io(struct inode *inode,> 
> >  		em_end = extent_map_end(em);
> >  		BUG_ON(em_end <= cur);
> >  		BUG_ON(end < cur);
> > 
> > -		iosize = min(em_end - cur, end - cur + 1);
> > +		iosize = min_t(u64, em_end - cur, blocksize);
> > 
> >  		iosize = ALIGN(iosize, blocksize);
> 
> > This limits us to one block per loop iteration; if two blocks are contiguous,
> > it should be fine to write them out together.

Yes, I agree. I will fix this up in one of the next versions of the
patchset. Thanks for pointing it out.

> 
> >  		sector = (em->block_start + extent_offset) >> 9;
> >  		bdev = em->bdev;
> > 
> > @@ -3488,32 +3482,20 @@ static noinline_for_stack int
> > __extent_writepage_io(struct inode *inode,> 
> >  		free_extent_map(em);
> >  		em = NULL;
> > 
> > -		/*
> > -		 * compressed and inline extents are written through other
> > -		 * paths in the FS
> > -		 */
> > -		if (compressed || block_start == EXTENT_MAP_HOLE ||
> > -		    block_start == EXTENT_MAP_INLINE) {
> > -			/*
> > -			 * end_io notification does not happen here for
> > -			 * compressed extents
> > -			 */
> > -			if (!compressed && tree->ops &&
> > -			    tree->ops->writepage_end_io_hook)
> > -				tree->ops->writepage_end_io_hook(page, cur,
> > -							 cur + iosize - 1,
> > -							 NULL, 1);
> > -			else if (compressed) {
> > -				/* we don't want to end_page_writeback on
> > -				 * a compressed extent.  this happens
> > -				 * elsewhere
> > -				 */
> > -				nr++;
> > -			}
> > +		BUG_ON(compressed);
> > +		BUG_ON(block_start == EXTENT_MAP_INLINE);
> > 
> > -			cur += iosize;
> > -			pg_offset += iosize;
> > -			continue;
> > +		if (block_start == EXTENT_MAP_HOLE) {
> > +			if (test_page_blks_state(page, BLK_STATE_UPTODATE, cur,
> > +							cur + iosize - 1, 1)) {
> > +				clear_page_blks_state(page,
> > +						1 << BLK_STATE_DIRTY, cur,
> > +						cur + iosize - 1);
> > +				cur += iosize;
> > +				continue;
> > +			} else {
> > +				BUG();
> > +			}
> > 
> >  		}
> >  		
> >  		if (tree->ops && tree->ops->writepage_io_hook) {
> > 
> > @@ -3527,7 +3509,13 @@ static noinline_for_stack int
> > __extent_writepage_io(struct inode *inode,> 
> >  		} else {
> >  		
> >  			unsigned long max_nr = (i_size >> PAGE_CACHE_SHIFT) + 1;
> > 
> > -			set_range_writeback(tree, cur, cur + iosize - 1);
> > +			clear_page_blks_state(page, 1 << BLK_STATE_DIRTY, cur,
> > +					cur + iosize - 1);
> > +			set_page_writeback(page);
> > +
> > +			set_page_blks_state(page, 1 << BLK_STATE_IO, cur,
> > +					cur + iosize - 1);
> > +
> > 
> >  			if (!PageWriteback(page)) {
> >  			
> >  				btrfs_err(BTRFS_I(inode)->root->fs_info,
> >  				
> >  					   "page %lu not writeback, cur %llu end %llu",
> > 
> > @@ -3542,17 +3530,14 @@ static noinline_for_stack int
> > __extent_writepage_io(struct inode *inode,> 
> >  			if (ret)
> >  			
> >  				SetPageError(page);
> >  		
> >  		}
> > 
> > -		cur = cur + iosize;
> > -		pg_offset += iosize;
> > +
> > +		cur += iosize;
> > 
> >  		nr++;
> >  	
> >  	}
> >  
> >  done:
> >  	*nr_ret = nr;
> >  
> >  done_unlocked:
> > -
> > -	/* drop our reference on any cached states */
> > -	free_extent_state(cached_state);
> > 
> >  	return ret;
> >  
> >  }
> > 
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 23b6e03..cbe6381 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -495,6 +495,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct
> > inode *inode,> 
> >  	u64 num_bytes;
> >  	u64 start_pos;
> >  	u64 end_of_last_block;
> > 
> > +	u64 start;
> > +	u64 end;
> > +	u64 page_end;
> > 
> >  	u64 end_pos = pos + write_bytes;
> >  	loff_t isize = i_size_read(inode);
> > 
> > @@ -507,11 +510,24 @@ int btrfs_dirty_pages(struct btrfs_root *root,
> > struct inode *inode,> 
> >  	if (err)
> >  	
> >  		return err;
> > 
> > +	start = start_pos;
> > +
> > 
> >  	for (i = 0; i < num_pages; i++) {
> >  	
> >  		struct page *p = pages[i];
> >  		SetPageUptodate(p);
> >  		ClearPageChecked(p);
> > 
> > +
> > +		end = page_end = page_offset(p) + PAGE_CACHE_SIZE - 1;
> > +
> > +		if (i == num_pages - 1)
> > +			end = min_t(u64, page_end, end_of_last_block);
> > +
> > +		set_page_blks_state(p,
> > +				1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
> > +				start, end);
> > 
> >  		set_page_dirty(p);
> > 
> > +
> > +		start = page_end + 1;
> 
> This is not the usual way; page_end is unnecessary, (start +=
> PAGE_CACHE_SIZE) should work.

"start" may not always be set to a file offset that is a multiple of page
size. If the userspace dirties say "block 4" of 64k page, then start will be
set to 16384. Hence in such cases, "start += PAGE_CACHE_SIZE" would yield an
incorrect value.
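
A concrete example, assuming a 64k page size and a write that dirties only
block 4 of the first page (so start_pos = 16384):

	page 0: start = start_pos = 16384
	        end = page_end = 65535, next start = page_end + 1 = 65536
	page 1: start = 65536, i.e. the page boundary, as required

With "start += PAGE_CACHE_SIZE" we would instead get start = 16384 + 65536 =
81920, and the blocks in the range 65536-81919 of the second page would never
be marked BLK_STATE_DIRTY.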

> >  	}
> >  	
> >  	/*
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 8262f83..ac6a3f3 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > 
> > @@ -1995,6 +1995,11 @@ again:
> >  	 }
> >  	
> >  	btrfs_set_extent_delalloc(inode, page_start, page_end, &cached_state);
> > 
> > +
> > +	set_page_blks_state(page,
> > +			1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
> > +			page_start, page_end);
> > +
> > 
> >  	ClearPageChecked(page);
> >  	set_page_dirty(page);
> >  
> >  out:
> > @@ -2984,26 +2989,48 @@ static int btrfs_writepage_end_io_hook(struct page
> > *page, u64 start, u64 end,> 
> >  	struct btrfs_ordered_extent *ordered_extent = NULL;
> >  	struct btrfs_workqueue *wq;
> >  	btrfs_work_func_t func;
> > 
> > +	u64 ordered_start, ordered_end;
> > +	int done;
> > 
> >  	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
> >  	
> >  	ClearPagePrivate2(page);
> > 
> > -	if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
> > -					    end - start + 1, uptodate))
> > -		return 0;
> > +loop:
> > +	ordered_extent = btrfs_lookup_ordered_range(inode, start,
> > +						end - start + 1);
> > +	if (!ordered_extent)
> > +		goto out;
> > 
> > -	if (btrfs_is_free_space_inode(inode)) {
> > -		wq = root->fs_info->endio_freespace_worker;
> > -		func = btrfs_freespace_write_helper;
> > -	} else {
> > -		wq = root->fs_info->endio_write_workers;
> > -		func = btrfs_endio_write_helper;
> > +	ordered_start = max_t(u64, start, ordered_extent->file_offset);
> > +	ordered_end = min_t(u64, end,
> > +			ordered_extent->file_offset + ordered_extent->len - 1);
> > +
> > +	done = btrfs_dec_test_ordered_pending(inode, &ordered_extent,
> > +					ordered_start,
> > +					ordered_end - ordered_start + 1,
> > +					uptodate);
> > +	if (done) {
> > +		if (btrfs_is_free_space_inode(inode)) {
> > +			wq = root->fs_info->endio_freespace_worker;
> > +			func = btrfs_freespace_write_helper;
> > +		} else {
> > +			wq = root->fs_info->endio_write_workers;
> > +			func = btrfs_endio_write_helper;
> > +		}
> > +
> > +		btrfs_init_work(&ordered_extent->work, func,
> > +				finish_ordered_fn, NULL, NULL);
> > +		btrfs_queue_work(wq, &ordered_extent->work);
> > 
> >  	}
> > 
> > -	btrfs_init_work(&ordered_extent->work, func, finish_ordered_fn, NULL,
> > -			NULL);
> > -	btrfs_queue_work(wq, &ordered_extent->work);
> > +	btrfs_put_ordered_extent(ordered_extent);
> > +
> > +	start = ordered_end + 1;
> > +
> > +	if (start < end)
> > +		goto loop;
> 
> > +out:
> I saw this patch puts a BUG_ON(block_start == EXTENT_MAP_INLINE); in
> writepage(), but I didn't see code disabling inline data in patch 01 or
> patch 02. Anyway, I think we can avoid the above search for ordered extents
> within a single page if we enable inline data.

For inline extents, the call chain __extent_writepage => writepage_delalloc =>
run_delalloc_range => cow_file_range => cow_file_range_inline should write the
block's content into the appropriate location in the btree leaf. Hence
__extent_writepage_io() should never get invoked for files with inline
extents. The BUG_ON(block_start == EXTENT_MAP_INLINE) just makes this
explicit and also helps in debugging.

However, Liu, I am not sure how we could avoid looping across the ordered
extents in the above code. Could you please elaborate on that?

-- 
chandan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 02/21] Btrfs: subpagesize-blocksize: Fix whole page write.
  2015-06-29  8:54     ` Chandan Rajendra
@ 2015-07-01 14:27       ` Liu Bo
  0 siblings, 0 replies; 47+ messages in thread
From: Liu Bo @ 2015-07-01 14:27 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Mon, Jun 29, 2015 at 02:24:18PM +0530, Chandan Rajendra wrote:
> On Friday 26 Jun 2015 17:50:54 Liu Bo wrote:
> > On Mon, Jun 01, 2015 at 08:52:37PM +0530, Chandan Rajendra wrote:
> > > For the subpagesize-blocksize scenario, a page can contain multiple
> > > blocks. In such cases, this patch handles writing data to files.
> > > 
> > > Also, When setting EXTENT_DELALLOC, we no longer set EXTENT_UPTODATE bit
> > > on
> > > the extent_io_tree since uptodate status is being tracked by the bitmap
> > > pointed to by page->private.
> > 
> > To be honest, I'm not sure why we set the EXTENT_UPTODATE bit for data, as we
> > don't check for that bit at all right now; correct me if I'm wrong.
> 
> Yes, I didn't find any code using the EXTENT_UPTODATE flag either. That is
> probably because we could get away with referring to the page's PG_uptodate
> flag in the blocksize == pagesize scenario. But for the subpagesize-blocksize
> scenario we need BLK_STATE_UPTODATE to determine whether a page's PG_uptodate
> flag can be set.
> 
> > 
> > > Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> > > ---
> > > 
> > >  fs/btrfs/extent_io.c | 141 +++++++++++++++++++++++----------------------------
> > >  fs/btrfs/file.c      |  16 ++++++
> > >  fs/btrfs/inode.c     |  58 ++++++++++++++++-----
> > >  3 files changed, 125 insertions(+), 90 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > index d37badb..3736ab5 100644
> > > --- a/fs/btrfs/extent_io.c
> > > +++ b/fs/btrfs/extent_io.c
> > > @@ -1283,9 +1283,8 @@ int clear_extent_bits(struct extent_io_tree *tree,
> > > u64 start, u64 end,> 
> > >  int set_extent_delalloc(struct extent_io_tree *tree, u64 start, u64 end,
> > >  
> > >  			struct extent_state **cached_state, gfp_t mask)
> > >  
> > >  {
> > > 
> > > -	return set_extent_bit(tree, start, end,
> > > -			      EXTENT_DELALLOC | EXTENT_UPTODATE,
> > > -			      NULL, cached_state, mask);
> > > +	return set_extent_bit(tree, start, end, EXTENT_DELALLOC,
> > > +			NULL, cached_state, mask);
> > > 
> > >  }
> > >  
> > >  int set_extent_defrag(struct extent_io_tree *tree, u64 start, u64 end,
> > > 
> > > @@ -1498,25 +1497,6 @@ int extent_range_redirty_for_io(struct inode
> > > *inode, u64 start, u64 end)> 
> > >  	return 0;
> > >  
> > >  }
> > > 
> > > -/*
> > > - * helper function to set both pages and extents in the tree writeback
> > > - */
> > > -static int set_range_writeback(struct extent_io_tree *tree, u64 start,
> > > u64 end) -{
> > > -	unsigned long index = start >> PAGE_CACHE_SHIFT;
> > > -	unsigned long end_index = end >> PAGE_CACHE_SHIFT;
> > > -	struct page *page;
> > > -
> > > -	while (index <= end_index) {
> > > -		page = find_get_page(tree->mapping, index);
> > > -		BUG_ON(!page); /* Pages should be in the extent_io_tree */
> > > -		set_page_writeback(page);
> > > -		page_cache_release(page);
> > > -		index++;
> > > -	}
> > > -	return 0;
> > > -}
> > > -
> > > 
> > >  /* find the first state struct with 'bits' set after 'start', and
> > >  
> > >   * return it.  tree->lock must be held.  NULL will returned if
> > >   * nothing was found after 'start'
> > > 
> > > @@ -2080,6 +2060,14 @@ static int page_read_complete(struct page *page)
> > > 
> > >  	return !test_page_blks_state(page, BLK_STATE_IO, start, end, 0);
> > >  
> > >  }
> > > 
> > > +static int page_write_complete(struct page *page)
> > > +{
> > > +	u64 start = page_offset(page);
> > > +	u64 end = start + PAGE_CACHE_SIZE - 1;
> > > +
> > > +	return !test_page_blks_state(page, BLK_STATE_IO, start, end, 0);
> > > +}
> > > +
> > > 
> > >  int free_io_failure(struct inode *inode, struct io_failure_record *rec)
> > >  {
> > >  
> > >  	int ret;
> > > 
> > > @@ -2575,38 +2563,37 @@ int end_extent_writepage(struct page *page, int
> > > err, u64 start, u64 end)> 
> > >   */
> > >  
> > >  static void end_bio_extent_writepage(struct bio *bio, int err)
> > >  {
> > > 
> > > +	struct btrfs_page_private *pg_private;
> > > 
> > >  	struct bio_vec *bvec;
> > > 
> > > +	unsigned long flags;
> > > 
> > >  	u64 start;
> > >  	u64 end;
> > > 
> > > +	int clear_writeback;
> > > 
> > >  	int i;
> > >  	
> > >  	bio_for_each_segment_all(bvec, bio, i) {
> > >  	
> > >  		struct page *page = bvec->bv_page;
> > > 
> > > -		/* We always issue full-page reads, but if some block
> > > -		 * in a page fails to read, blk_update_request() will
> > > -		 * advance bv_offset and adjust bv_len to compensate.
> > > -		 * Print a warning for nonzero offsets, and an error
> > > -		 * if they don't add up to a full page.  */
> > > -		if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) {
> > > -			if (bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE)
> > > -				btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
> > > -				   "partial page write in btrfs with offset %u and length %u",
> > > -					bvec->bv_offset, bvec->bv_len);
> > > -			else
> > > -				btrfs_info(BTRFS_I(page->mapping->host)->root->fs_info,
> > > -				   "incomplete page write in btrfs with offset %u and "
> > > -				   "length %u",
> > > -					bvec->bv_offset, bvec->bv_len);
> > > -		}
> > > +		start = page_offset(page) + bvec->bv_offset;
> > > +		end = start + bvec->bv_len - 1;
> > > 
> > > -		start = page_offset(page);
> > > -		end = start + bvec->bv_offset + bvec->bv_len - 1;
> > > +		pg_private = (struct btrfs_page_private *)page->private;
> > > +
> > > +		spin_lock_irqsave(&pg_private->io_lock, flags);
> > > 
> > > -		if (end_extent_writepage(page, err, start, end))
> > > +		if (end_extent_writepage(page, err, start, end)) {
> > > +			spin_unlock_irqrestore(&pg_private->io_lock, flags);
> > > 
> > >  			continue;
> > > 
> > > +		}
> > > 
> > > -		end_page_writeback(page);
> > > +		clear_page_blks_state(page, 1 << BLK_STATE_IO, start, end);
> > > +
> > > +		clear_writeback = page_write_complete(page);
> > > +
> > > +		spin_unlock_irqrestore(&pg_private->io_lock, flags);
> > > +
> > > +		if (clear_writeback)
> > > +			end_page_writeback(page);
> > > 
> > >  	}
> > >  	
> > >  	bio_put(bio);
> > > 
> > > @@ -3417,10 +3404,9 @@ static noinline_for_stack int
> > > __extent_writepage_io(struct inode *inode,> 
> > >  	u64 block_start;
> > >  	u64 iosize;
> > >  	sector_t sector;
> > > 
> > > -	struct extent_state *cached_state = NULL;
> > > 
> > >  	struct extent_map *em;
> > >  	struct block_device *bdev;
> > > 
> > > -	size_t pg_offset = 0;
> > > +	size_t pg_offset;
> > > 
> > >  	size_t blocksize;
> > >  	int ret = 0;
> > >  	int nr = 0;
> > > 
> > > @@ -3467,8 +3453,16 @@ static noinline_for_stack int
> > > __extent_writepage_io(struct inode *inode,> 
> > >  							 page_end, NULL, 1);
> > >  			
> > >  			break;
> > >  		
> > >  		}
> > > 
> > > -		em = epd->get_extent(inode, page, pg_offset, cur,
> > > -				     end - cur + 1, 1);
> > > +
> > > +		pg_offset = cur & (PAGE_CACHE_SIZE - 1);
> > > +
> > > +		if (!test_page_blks_state(page, BLK_STATE_DIRTY, cur,
> > > +						cur + blocksize - 1, 1)) {
> > > +			cur += blocksize;
> > > +			continue;
> > > +		}
> > 
> > If we don't check this, the get_extent() below will return a HOLE
> > (block_start == EXTENT_MAP_HOLE) and we can still move on to the next block,
> > so we wouldn't need to maintain this BLK_STATE_DIRTY bit at all.
> 
> Sorry, I am not sure if I understood your comment correctly. Are you
> suggesting that *page blocks* that are not dirty are always holes?
> 
> Let's assume a 64k page whose contents are within i_size and none of the
> blocks of the page map to a file hole. Also assume 4k as the block size. Say,
> the userspace writes to the "block 0" of the page. The corresponding code in
> __btrfs_buffered_write() reads up the complete page into the inode's page
> cache and then marks "block 0" of the page as BLK_STATE_DIRTY. Next, the
> userspace seeks and writes to "block 4" of the page. In this case, since the
> page has PG_uptodate flag already set we don't read the data from the disk
> again. We simply go ahead and mark "block 4" as BLK_STATE_DIRTY. As can be
> seen in the example scenario, the blocks 1, 2 and 3 are not holes and hence
> btrfs_get_extent() would end up returning values other than EXTENT_MAP_HOLE
> for em->block_start.

I see it now; this is a bit subtle at first glance.

> 
> > 
> > > +
> > > +		em = epd->get_extent(inode, page, pg_offset, cur, blocksize, 1);
> > > 
> > >  		if (IS_ERR_OR_NULL(em)) {
> > >  		
> > >  			SetPageError(page);
> > >  			ret = PTR_ERR_OR_ZERO(em);
> > > 
> > > @@ -3479,7 +3473,7 @@ static noinline_for_stack int
> > > __extent_writepage_io(struct inode *inode,> 
> > >  		em_end = extent_map_end(em);
> > >  		BUG_ON(em_end <= cur);
> > >  		BUG_ON(end < cur);
> > > 
> > > -		iosize = min(em_end - cur, end - cur + 1);
> > > +		iosize = min_t(u64, em_end - cur, blocksize);
> > > 
> > >  		iosize = ALIGN(iosize, blocksize);
> > 
> > This limits us to one block per loop iteration; if two blocks are contiguous,
> > it should be fine to write them out together.
> 
> Yes, I agree. I will fix this up in one of the next versions of the
> patchset. Thanks for pointing it out.

OK.

> 
> > 
> > >  		sector = (em->block_start + extent_offset) >> 9;
> > >  		bdev = em->bdev;
> > > 
> > > @@ -3488,32 +3482,20 @@ static noinline_for_stack int
> > > __extent_writepage_io(struct inode *inode,> 
> > >  		free_extent_map(em);
> > >  		em = NULL;
> > > 
> > > -		/*
> > > -		 * compressed and inline extents are written through other
> > > -		 * paths in the FS
> > > -		 */
> > > -		if (compressed || block_start == EXTENT_MAP_HOLE ||
> > > -		    block_start == EXTENT_MAP_INLINE) {
> > > -			/*
> > > -			 * end_io notification does not happen here for
> > > -			 * compressed extents
> > > -			 */
> > > -			if (!compressed && tree->ops &&
> > > -			    tree->ops->writepage_end_io_hook)
> > > -				tree->ops->writepage_end_io_hook(page, cur,
> > > -							 cur + iosize - 1,
> > > -							 NULL, 1);
> > > -			else if (compressed) {
> > > -				/* we don't want to end_page_writeback on
> > > -				 * a compressed extent.  this happens
> > > -				 * elsewhere
> > > -				 */
> > > -				nr++;
> > > -			}
> > > +		BUG_ON(compressed);
> > > +		BUG_ON(block_start == EXTENT_MAP_INLINE);
> > > 
> > > -			cur += iosize;
> > > -			pg_offset += iosize;
> > > -			continue;
> > > +		if (block_start == EXTENT_MAP_HOLE) {
> > > +			if (test_page_blks_state(page, BLK_STATE_UPTODATE, cur,
> > > +							cur + iosize - 1, 1)) {
> > > +				clear_page_blks_state(page,
> > > +						1 << BLK_STATE_DIRTY, cur,
> > > +						cur + iosize - 1);
> > > +				cur += iosize;
> > > +				continue;
> > > +			} else {
> > > +				BUG();
> > > +			}
> > > 
> > >  		}
> > >  		
> > >  		if (tree->ops && tree->ops->writepage_io_hook) {
> > > 
> > > @@ -3527,7 +3509,13 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
> > >  		} else {
> > >  		
> > >  			unsigned long max_nr = (i_size >> PAGE_CACHE_SHIFT) + 1;
> > > 
> > > -			set_range_writeback(tree, cur, cur + iosize - 1);
> > > +			clear_page_blks_state(page, 1 << BLK_STATE_DIRTY, cur,
> > > +					cur + iosize - 1);
> > > +			set_page_writeback(page);
> > > +
> > > +			set_page_blks_state(page, 1 << BLK_STATE_IO, cur,
> > > +					cur + iosize - 1);
> > > +
> > > 
> > >  			if (!PageWriteback(page)) {
> > >  			
> > >  				btrfs_err(BTRFS_I(inode)->root->fs_info,
> > >  				
> > >  					   "page %lu not writeback, cur %llu end %llu",
> > > 
> > > @@ -3542,17 +3530,14 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
> > >  			if (ret)
> > >  			
> > >  				SetPageError(page);
> > >  		
> > >  		}
> > > 
> > > -		cur = cur + iosize;
> > > -		pg_offset += iosize;
> > > +
> > > +		cur += iosize;
> > > 
> > >  		nr++;
> > >  	
> > >  	}
> > >  
> > >  done:
> > >  	*nr_ret = nr;
> > >  
> > >  done_unlocked:
> > > -
> > > -	/* drop our reference on any cached states */
> > > -	free_extent_state(cached_state);
> > > 
> > >  	return ret;
> > >  
> > >  }
> > > 
> > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > index 23b6e03..cbe6381 100644
> > > --- a/fs/btrfs/file.c
> > > +++ b/fs/btrfs/file.c
> > > @@ -495,6 +495,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
> > >  	u64 num_bytes;
> > >  	u64 start_pos;
> > >  	u64 end_of_last_block;
> > > 
> > > +	u64 start;
> > > +	u64 end;
> > > +	u64 page_end;
> > > 
> > >  	u64 end_pos = pos + write_bytes;
> > >  	loff_t isize = i_size_read(inode);
> > > 
> > > @@ -507,11 +510,24 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
> > >  	if (err)
> > >  	
> > >  		return err;
> > > 
> > > +	start = start_pos;
> > > +
> > > 
> > >  	for (i = 0; i < num_pages; i++) {
> > >  	
> > >  		struct page *p = pages[i];
> > >  		SetPageUptodate(p);
> > >  		ClearPageChecked(p);
> > > 
> > > +
> > > +		end = page_end = page_offset(p) + PAGE_CACHE_SIZE - 1;
> > > +
> > > +		if (i == num_pages - 1)
> > > +			end = min_t(u64, page_end, end_of_last_block);
> > > +
> > > +		set_page_blks_state(p,
> > > +				1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
> > > +				start, end);
> > > 
> > >  		set_page_dirty(p);
> > > 
> > > +
> > > +		start = page_end + 1;
> > 
> > This is not the usual way; page_end is unnecessary, and
> > (start += PAGE_CACHE_SIZE) should work.
> 
> "start" may not always be set to a file offset that is a multiple of page
> size. If the userspace dirties say "block 4" of 64k page, then start will be
> set to 16384. Hence in such cases, "start += PAGE_CACHE_SIZE" would yield an
> incorrect value.
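> 
> For example, with start_pos == 16384 on the first 64k page,
> "start += PAGE_CACHE_SIZE" would give 81920, whereas the next page
> actually begins at file offset 65536; "start = page_end + 1" yields the
> correct value.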

Right.

> 
> > >  	}
> > >  	
> > >  	/*
> > > 
> > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > index 8262f83..ac6a3f3 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > 
> > > @@ -1995,6 +1995,11 @@ again:
> > >  	 }
> > >  	
> > >  	btrfs_set_extent_delalloc(inode, page_start, page_end, &cached_state);
> > > 
> > > +
> > > +	set_page_blks_state(page,
> > > +			1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
> > > +			page_start, page_end);
> > > +
> > > 
> > >  	ClearPageChecked(page);
> > >  	set_page_dirty(page);
> > >  
> > >  out:
> > > @@ -2984,26 +2989,48 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
> > >  	struct btrfs_ordered_extent *ordered_extent = NULL;
> > >  	struct btrfs_workqueue *wq;
> > >  	btrfs_work_func_t func;
> > > 
> > > +	u64 ordered_start, ordered_end;
> > > +	int done;
> > > 
> > >  	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
> > >  	
> > >  	ClearPagePrivate2(page);
> > > 
> > > -	if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
> > > -					    end - start + 1, uptodate))
> > > -		return 0;
> > > +loop:
> > > +	ordered_extent = btrfs_lookup_ordered_range(inode, start,
> > > +						end - start + 1);
> > > +	if (!ordered_extent)
> > > +		goto out;
> > > 
> > > -	if (btrfs_is_free_space_inode(inode)) {
> > > -		wq = root->fs_info->endio_freespace_worker;
> > > -		func = btrfs_freespace_write_helper;
> > > -	} else {
> > > -		wq = root->fs_info->endio_write_workers;
> > > -		func = btrfs_endio_write_helper;
> > > +	ordered_start = max_t(u64, start, ordered_extent->file_offset);
> > > +	ordered_end = min_t(u64, end,
> > > +			ordered_extent->file_offset + ordered_extent->len - 1);
> > > +
> > > +	done = btrfs_dec_test_ordered_pending(inode, &ordered_extent,
> > > +					ordered_start,
> > > +					ordered_end - ordered_start + 1,
> > > +					uptodate);
> > > +	if (done) {
> > > +		if (btrfs_is_free_space_inode(inode)) {
> > > +			wq = root->fs_info->endio_freespace_worker;
> > > +			func = btrfs_freespace_write_helper;
> > > +		} else {
> > > +			wq = root->fs_info->endio_write_workers;
> > > +			func = btrfs_endio_write_helper;
> > > +		}
> > > +
> > > +		btrfs_init_work(&ordered_extent->work, func,
> > > +				finish_ordered_fn, NULL, NULL);
> > > +		btrfs_queue_work(wq, &ordered_extent->work);
> > > 
> > >  	}
> > > 
> > > -	btrfs_init_work(&ordered_extent->work, func, finish_ordered_fn, NULL,
> > > -			NULL);
> > > -	btrfs_queue_work(wq, &ordered_extent->work);
> > > +	btrfs_put_ordered_extent(ordered_extent);
> > > +
> > > +	start = ordered_end + 1;
> > > +
> > > +	if (start < end)
> > > +		goto loop;
> > 
> > > +out:
> > I saw this puts a BUG_ON(block_start == EXTENT_MAP_INLINE); in writepage(),
> > but I didn't see code that disables inline data in patch 01 or patch 02.
> > Anyway, I think we could avoid the above search for ordered_extents within
> > a single page if we enable inline data.
> 
> For inline extents, the call chain __extent_writepage => writepage_delalloc =>
> run_delalloc_range => cow_file_range => cow_file_range_inline should write the
> block's content into the appropriate location in the btree leaf. Hence
> __extent_writepage_io() should never get invoked for files with inline
> extents. The BUG_ON(block_start == EXTENT_MAP_INLINE) just makes this
> explicit and also helps in debugging.

Yes, that's right, thanks for the explanation.

> 
> Liu, however, I am not sure we could avoid looping across ordered
> extents in the above code. Could you please elaborate on that?

Given that a page may span two ordered extents (in cow_file_range(), an
ENOSPC can split a contiguous range into two ordered extents), the above
loop makes sure we don't miss either of them.
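
For example (4k blocks in a 64k page): if cow_file_range() hits ENOSPC
after creating an ordered extent for [0, 16k) and a second pass covers
[16k, 64k), btrfs_writepage_end_io_hook(page, 0, 65535) has to dec/test
both ordered extents; the loop above does that by advancing start to
ordered_end + 1 until start passes end.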

Thanks,

-liubo

> 
> -- 
> chandan
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 04/21] Btrfs: subpagesize-blocksize: Define extent_buffer_head.
  2015-06-01 15:22 ` [RFC PATCH V11 04/21] Btrfs: subpagesize-blocksize: Define extent_buffer_head Chandan Rajendra
@ 2015-07-01 14:33   ` Liu Bo
  0 siblings, 0 replies; 47+ messages in thread
From: Liu Bo @ 2015-07-01 14:33 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Mon, Jun 01, 2015 at 08:52:39PM +0530, Chandan Rajendra wrote:
> In order to handle multiple extent buffers per page, first we need to create a
> way to handle all the extent buffers that are attached to a page.
> 
> This patch creates a new data structure, 'struct extent_buffer_head', and
> moves fields that are common to all extent buffers in a page from 'struct
> extent_buffer' to 'struct extent_buffer_head'.

This makes the extent buffers in a page share @refs on the ebh, and it may
cause memory pressure since they may not be freed even after
EXTENT_BUFFER_STALE is set, but I guess that's the penalty we have to pay
for this approach.

Others look good.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo

> 
> Also, this patch moves the EXTENT_BUFFER_TREE_REF, EXTENT_BUFFER_DUMMY and
> EXTENT_BUFFER_IN_TREE flags from extent_buffer->ebflags to
> extent_buffer_head->bflags.
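> 
> With 16k tree blocks on a 64k page, the result looks roughly like the
> following (an illustrative sketch, not literal code from the patch):
> 
> 	page->private ---> extent_buffer_head
> 	                     .pages[], .refs, .bflags  (shared by all ebs)
> 	                     .eb                        (block at offset 0)
> 	                       .eb_next -> eb (16k) -> eb (32k) -> eb (48k)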
> 
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> ---
>  fs/btrfs/backref.c           |   2 +-
>  fs/btrfs/ctree.c             |   2 +-
>  fs/btrfs/ctree.h             |   6 +-
>  fs/btrfs/disk-io.c           |  73 ++++---
>  fs/btrfs/extent-tree.c       |   6 +-
>  fs/btrfs/extent_io.c         | 469 ++++++++++++++++++++++++++++---------------
>  fs/btrfs/extent_io.h         |  39 +++-
>  fs/btrfs/volumes.c           |   2 +-
>  include/trace/events/btrfs.h |   2 +-
>  9 files changed, 392 insertions(+), 209 deletions(-)
> 
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 9de772e..b4d911c 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -1372,7 +1372,7 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
>  		eb = path->nodes[0];
>  		/* make sure we can use eb after releasing the path */
>  		if (eb != eb_in) {
> -			atomic_inc(&eb->refs);
> +			atomic_inc(&eb_head(eb)->refs);
>  			btrfs_tree_read_lock(eb);
>  			btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
>  		}
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 0f11ebc..b28f14d 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -159,7 +159,7 @@ struct extent_buffer *btrfs_root_node(struct btrfs_root *root)
>  		 * the inc_not_zero dance and if it doesn't work then
>  		 * synchronize_rcu and try again.
>  		 */
> -		if (atomic_inc_not_zero(&eb->refs)) {
> +		if (atomic_inc_not_zero(&eb_head(eb)->refs)) {
>  			rcu_read_unlock();
>  			break;
>  		}
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 6f364e1..2bc3e0e 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2320,14 +2320,16 @@ static inline void btrfs_set_token_##name(struct extent_buffer *eb,	\
>  #define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits)		\
>  static inline u##bits btrfs_##name(struct extent_buffer *eb)		\
>  {									\
> -	type *p = page_address(eb->pages[0]);				\
> +	type *p = page_address(eb_head(eb)->pages[0]) +			\
> +				(eb->start & (PAGE_CACHE_SIZE -1));	\
>  	u##bits res = le##bits##_to_cpu(p->member);			\
>  	return res;							\
>  }									\
>  static inline void btrfs_set_##name(struct extent_buffer *eb,		\
>  				    u##bits val)			\
>  {									\
> -	type *p = page_address(eb->pages[0]);				\
> +	type *p = page_address(eb_head(eb)->pages[0]) +			\
> +				(eb->start & (PAGE_CACHE_SIZE -1));	\
>  	p->member = cpu_to_le##bits(val);				\
>  }
>  
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 2ef9a4b..51fe2ec 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -368,9 +368,10 @@ static int verify_parent_transid(struct extent_io_tree *io_tree,
>  		ret = 0;
>  		goto out;
>  	}
> +
>  	printk_ratelimited(KERN_ERR
>  	    "BTRFS (device %s): parent transid verify failed on %llu wanted %llu found %llu\n",
> -			eb->fs_info->sb->s_id, eb->start,
> +			eb_head(eb)->fs_info->sb->s_id, eb->start,
>  			parent_transid, btrfs_header_generation(eb));
>  	ret = 1;
>  
> @@ -445,7 +446,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
>  	int mirror_num = 0;
>  	int failed_mirror = 0;
>  
> -	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
> +	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags);
>  	io_tree = &BTRFS_I(root->fs_info->btree_inode)->io_tree;
>  	while (1) {
>  		ret = read_extent_buffer_pages(io_tree, eb, start,
> @@ -464,7 +465,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
>  		 * there is no reason to read the other copies, they won't be
>  		 * any less wrong.
>  		 */
> -		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags))
> +		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags))
>  			break;
>  
>  		num_copies = btrfs_num_copies(root->fs_info,
> @@ -622,7 +623,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
>  		goto err;
>  
>  	eb->read_mirror = mirror;
> -	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
> +	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->ebflags)) {
>  		ret = -EIO;
>  		goto err;
>  	}
> @@ -631,13 +632,14 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
>  	if (found_start != eb->start) {
>  		printk_ratelimited(KERN_ERR "BTRFS (device %s): bad tree block start "
>  			       "%llu %llu\n",
> -			       eb->fs_info->sb->s_id, found_start, eb->start);
> +				eb_head(eb)->fs_info->sb->s_id, found_start,
> +				eb->start);
>  		ret = -EIO;
>  		goto err;
>  	}
>  	if (check_tree_block_fsid(root->fs_info, eb)) {
>  		printk_ratelimited(KERN_ERR "BTRFS (device %s): bad fsid on block %llu\n",
> -			       eb->fs_info->sb->s_id, eb->start);
> +			       eb_head(eb)->fs_info->sb->s_id, eb->start);
>  		ret = -EIO;
>  		goto err;
>  	}
> @@ -664,7 +666,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
>  	 * return -EIO.
>  	 */
>  	if (found_level == 0 && check_leaf(root, eb)) {
> -		set_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
> +		set_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags);
>  		ret = -EIO;
>  	}
>  
> @@ -672,7 +674,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
>  		set_extent_buffer_uptodate(eb);
>  err:
>  	if (reads_done &&
> -	    test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
> +	    test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->ebflags))
>  		btree_readahead_hook(root, eb, eb->start, ret);
>  
>  	if (ret) {
> @@ -695,10 +697,10 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
>  	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
>  
>  	eb = (struct extent_buffer *)page->private;
> -	set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
> +	set_bit(EXTENT_BUFFER_READ_ERR, &eb->ebflags);
>  	eb->read_mirror = failed_mirror;
>  	atomic_dec(&eb->io_pages);
> -	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
> +	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->ebflags))
>  		btree_readahead_hook(root, eb, eb->start, -EIO);
>  	return -EIO;	/* we fixed nothing */
>  }
> @@ -1047,13 +1049,24 @@ static int btree_set_page_dirty(struct page *page)
>  {
>  #ifdef DEBUG
>  	struct extent_buffer *eb;
> +	int i, dirty = 0;
>  
>  	BUG_ON(!PagePrivate(page));
>  	eb = (struct extent_buffer *)page->private;
>  	BUG_ON(!eb);
> -	BUG_ON(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
> -	BUG_ON(!atomic_read(&eb->refs));
> -	btrfs_assert_tree_locked(eb);
> +
> +	do {
> +		dirty = test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
> +		if (dirty)
> +			break;
> +	} while ((eb = eb->eb_next) != NULL);
> +
> +	BUG_ON(!dirty);
> +
> +	eb = (struct extent_buffer *)page->private;
> +	BUG_ON(!atomic_read(&(eb_head(eb)->refs)));
> +
> +	btrfs_assert_tree_locked(eb);
>  #endif
>  	return __set_page_dirty_nobuffers(page);
>  }
> @@ -1094,7 +1107,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr,
>  	if (!buf)
>  		return 0;
>  
> -	set_bit(EXTENT_BUFFER_READAHEAD, &buf->bflags);
> +	set_bit(EXTENT_BUFFER_READAHEAD, &buf->ebflags);
>  
>  	ret = read_extent_buffer_pages(io_tree, buf, 0, WAIT_PAGE_LOCK,
>  				       btree_get_extent, mirror_num);
> @@ -1103,7 +1116,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr,
>  		return ret;
>  	}
>  
> -	if (test_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags)) {
> +	if (test_bit(EXTENT_BUFFER_CORRUPT, &buf->ebflags)) {
>  		free_extent_buffer(buf);
>  		return -EIO;
>  	} else if (extent_buffer_uptodate(buf)) {
> @@ -1131,14 +1144,16 @@ struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
>  
>  int btrfs_write_tree_block(struct extent_buffer *buf)
>  {
> -	return filemap_fdatawrite_range(buf->pages[0]->mapping, buf->start,
> +	return filemap_fdatawrite_range(eb_head(buf)->pages[0]->mapping,
> +					buf->start,
>  					buf->start + buf->len - 1);
>  }
>  
>  int btrfs_wait_tree_block_writeback(struct extent_buffer *buf)
>  {
> -	return filemap_fdatawait_range(buf->pages[0]->mapping,
> -				       buf->start, buf->start + buf->len - 1);
> +	return filemap_fdatawait_range(eb_head(buf)->pages[0]->mapping,
> +					buf->start,
> +					buf->start + buf->len - 1);
>  }
>  
>  struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
> @@ -1168,7 +1183,8 @@ void clean_tree_block(struct btrfs_trans_handle *trans,
>  	    fs_info->running_transaction->transid) {
>  		btrfs_assert_tree_locked(buf);
>  
> -		if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &buf->bflags)) {
> +		if (test_and_clear_bit(EXTENT_BUFFER_DIRTY,
> +						&buf->ebflags)) {
>  			__percpu_counter_add(&fs_info->dirty_metadata_bytes,
>  					     -buf->len,
>  					     fs_info->dirty_metadata_batch);
> @@ -2798,9 +2814,10 @@ int open_ctree(struct super_block *sb,
>  					   btrfs_super_chunk_root(disk_super),
>  					   generation);
>  	if (!chunk_root->node ||
> -	    !test_bit(EXTENT_BUFFER_UPTODATE, &chunk_root->node->bflags)) {
> +		!test_bit(EXTENT_BUFFER_UPTODATE,
> +			&chunk_root->node->ebflags)) {
>  		printk(KERN_ERR "BTRFS: failed to read chunk root on %s\n",
> -		       sb->s_id);
> +			sb->s_id);
>  		goto fail_tree_roots;
>  	}
>  	btrfs_set_root_node(&chunk_root->root_item, chunk_root->node);
> @@ -2835,7 +2852,8 @@ retry_root_backup:
>  					  btrfs_super_root(disk_super),
>  					  generation);
>  	if (!tree_root->node ||
> -	    !test_bit(EXTENT_BUFFER_UPTODATE, &tree_root->node->bflags)) {
> +		!test_bit(EXTENT_BUFFER_UPTODATE,
> +			&tree_root->node->ebflags)) {
>  		printk(KERN_WARNING "BTRFS: failed to read tree root on %s\n",
>  		       sb->s_id);
>  
> @@ -3786,7 +3804,7 @@ int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
>  			  int atomic)
>  {
>  	int ret;
> -	struct inode *btree_inode = buf->pages[0]->mapping->host;
> +	struct inode *btree_inode = eb_head(buf)->pages[0]->mapping->host;
>  
>  	ret = extent_buffer_uptodate(buf);
>  	if (!ret)
> @@ -3816,10 +3834,10 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
>  	 * enabled.  Normal people shouldn't be marking dummy buffers as dirty
>  	 * outside of the sanity tests.
>  	 */
> -	if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, &buf->bflags)))
> +	if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, &eb_head(buf)->bflags)))
>  		return;
>  #endif
> -	root = BTRFS_I(buf->pages[0]->mapping->host)->root;
> +	root = BTRFS_I(eb_head(buf)->pages[0]->mapping->host)->root;
>  	btrfs_assert_tree_locked(buf);
>  	if (transid != root->fs_info->generation)
>  		WARN(1, KERN_CRIT "btrfs transid mismatch buffer %llu, "
> @@ -3874,7 +3892,8 @@ void btrfs_btree_balance_dirty_nodelay(struct btrfs_root *root)
>  
>  int btrfs_read_buffer(struct extent_buffer *buf, u64 parent_transid)
>  {
> -	struct btrfs_root *root = BTRFS_I(buf->pages[0]->mapping->host)->root;
> +	struct btrfs_root *root =
> +			BTRFS_I(eb_head(buf)->pages[0]->mapping->host)->root;
>  	return btree_read_extent_buffer_pages(root, buf, 0, parent_transid);
>  }
>  
> @@ -4185,7 +4204,7 @@ static int btrfs_destroy_marked_extents(struct btrfs_root *root,
>  			wait_on_extent_buffer_writeback(eb);
>  
>  			if (test_and_clear_bit(EXTENT_BUFFER_DIRTY,
> -					       &eb->bflags))
> +					       &eb->ebflags))
>  				clear_extent_buffer_dirty(eb);
>  			free_extent_buffer_stale(eb);
>  		}
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 1eef4ee..b93a922 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -6450,7 +6450,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
>  			goto out;
>  		}
>  
> -		WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
> +		WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->ebflags));
>  
>  		btrfs_add_free_space(cache, buf->start, buf->len);
>  		btrfs_update_reserved_bytes(cache, buf->len, RESERVE_FREE, 0);
> @@ -6468,7 +6468,7 @@ out:
>  	 * Deleting the buffer, clear the corrupt flag since it doesn't matter
>  	 * anymore.
>  	 */
> -	clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
> +	clear_bit(EXTENT_BUFFER_CORRUPT, &buf->ebflags);
>  }
>  
>  /* Can return -ENOMEM */
> @@ -7444,7 +7444,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>  	btrfs_set_buffer_lockdep_class(root->root_key.objectid, buf, level);
>  	btrfs_tree_lock(buf);
>  	clean_tree_block(trans, root->fs_info, buf);
> -	clear_bit(EXTENT_BUFFER_STALE, &buf->bflags);
> +	clear_bit(EXTENT_BUFFER_STALE, &buf->ebflags);
>  
>  	btrfs_set_lock_blocking(buf);
>  	btrfs_set_buffer_uptodate(buf);
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 3736ab5..a7e715a 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -61,6 +61,7 @@ void btrfs_leak_debug_check(void)
>  {
>  	struct extent_state *state;
>  	struct extent_buffer *eb;
> +	struct extent_buffer_head *ebh;
>  
>  	while (!list_empty(&states)) {
>  		state = list_entry(states.next, struct extent_state, leak_list);
> @@ -73,12 +74,17 @@ void btrfs_leak_debug_check(void)
>  	}
>  
>  	while (!list_empty(&buffers)) {
> -		eb = list_entry(buffers.next, struct extent_buffer, leak_list);
> -		printk(KERN_ERR "BTRFS: buffer leak start %llu len %lu "
> -		       "refs %d\n",
> -		       eb->start, eb->len, atomic_read(&eb->refs));
> -		list_del(&eb->leak_list);
> -		kmem_cache_free(extent_buffer_cache, eb);
> +		ebh = list_entry(buffers.next, struct extent_buffer_head, leak_list);
> +		printk(KERN_ERR "btrfs buffer leak ");
> +
> +		eb = &ebh->eb;
> +		do {
> +			printk(KERN_ERR "eb %p %llu:%lu ", eb, eb->start, eb->len);
> +		} while ((eb = eb->eb_next) != NULL);
> +
> +		printk(KERN_ERR "refs %d\n", atomic_read(&ebh->refs));
> +		list_del(&ebh->leak_list);
> +		kmem_cache_free(extent_buffer_cache, ebh);
>  	}
>  }
>  
> @@ -149,7 +155,7 @@ int __init extent_io_init(void)
>  		return -ENOMEM;
>  
>  	extent_buffer_cache = kmem_cache_create("btrfs_extent_buffer",
> -			sizeof(struct extent_buffer), 0,
> +			sizeof(struct extent_buffer_head), 0,
>  			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
>  	if (!extent_buffer_cache)
>  		goto free_state_cache;
> @@ -2170,7 +2176,7 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
>  		return -EROFS;
>  
>  	for (i = 0; i < num_pages; i++) {
> -		struct page *p = eb->pages[i];
> +		struct page *p = eb_head(eb)->pages[i];
>  
>  		ret = repair_io_failure(root->fs_info->btree_inode, start,
>  					PAGE_CACHE_SIZE, start, p,
> @@ -3625,8 +3631,8 @@ done_unlocked:
>  
>  void wait_on_extent_buffer_writeback(struct extent_buffer *eb)
>  {
> -	wait_on_bit_io(&eb->bflags, EXTENT_BUFFER_WRITEBACK,
> -		       TASK_UNINTERRUPTIBLE);
> +	wait_on_bit_io(&eb->ebflags, EXTENT_BUFFER_WRITEBACK,
> +		    TASK_UNINTERRUPTIBLE);
>  }
>  
>  static noinline_for_stack int
> @@ -3644,7 +3650,7 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
>  		btrfs_tree_lock(eb);
>  	}
>  
> -	if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) {
> +	if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags)) {
>  		btrfs_tree_unlock(eb);
>  		if (!epd->sync_io)
>  			return 0;
> @@ -3655,7 +3661,7 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
>  		while (1) {
>  			wait_on_extent_buffer_writeback(eb);
>  			btrfs_tree_lock(eb);
> -			if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags))
> +			if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags))
>  				break;
>  			btrfs_tree_unlock(eb);
>  		}
> @@ -3666,17 +3672,17 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
>  	 * under IO since we can end up having no IO bits set for a short period
>  	 * of time.
>  	 */
> -	spin_lock(&eb->refs_lock);
> -	if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
> -		set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
> -		spin_unlock(&eb->refs_lock);
> +	spin_lock(&eb_head(eb)->refs_lock);
> +	if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags)) {
> +		set_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags);
> +		spin_unlock(&eb_head(eb)->refs_lock);
>  		btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
>  		__percpu_counter_add(&fs_info->dirty_metadata_bytes,
>  				     -eb->len,
>  				     fs_info->dirty_metadata_batch);
>  		ret = 1;
>  	} else {
> -		spin_unlock(&eb->refs_lock);
> +		spin_unlock(&eb_head(eb)->refs_lock);
>  	}
>  
>  	btrfs_tree_unlock(eb);
> @@ -3686,7 +3692,7 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
>  
>  	num_pages = num_extent_pages(eb->start, eb->len);
>  	for (i = 0; i < num_pages; i++) {
> -		struct page *p = eb->pages[i];
> +		struct page *p = eb_head(eb)->pages[i];
>  
>  		if (!trylock_page(p)) {
>  			if (!flush) {
> @@ -3702,18 +3708,19 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
>  
>  static void end_extent_buffer_writeback(struct extent_buffer *eb)
>  {
> -	clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
> +	clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags);
>  	smp_mb__after_atomic();
> -	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
> +	wake_up_bit(&eb->ebflags, EXTENT_BUFFER_WRITEBACK);
>  }
>  
>  static void set_btree_ioerr(struct page *page)
>  {
>  	struct extent_buffer *eb = (struct extent_buffer *)page->private;
> -	struct btrfs_inode *btree_ino = BTRFS_I(eb->fs_info->btree_inode);
> +	struct extent_buffer_head *ebh = eb_head(eb);
> +	struct btrfs_inode *btree_ino = BTRFS_I(ebh->fs_info->btree_inode);
>  
>  	SetPageError(page);
> -	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
> +	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->ebflags))
>  		return;
>  
>  	/*
> @@ -3782,7 +3789,7 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
>  		BUG_ON(!eb);
>  		done = atomic_dec_and_test(&eb->io_pages);
>  
> -		if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
> +		if (err || test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->ebflags)) {
>  			ClearPageUptodate(page);
>  			set_btree_ioerr(page);
>  		}
> @@ -3811,14 +3818,14 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>  	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
>  	int ret = 0;
>  
> -	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
> +	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->ebflags);
>  	num_pages = num_extent_pages(eb->start, eb->len);
>  	atomic_set(&eb->io_pages, num_pages);
>  	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
>  		bio_flags = EXTENT_BIO_TREE_LOG;
>  
>  	for (i = 0; i < num_pages; i++) {
> -		struct page *p = eb->pages[i];
> +		struct page *p = eb_head(eb)->pages[i];
>  
>  		clear_page_dirty_for_io(p);
>  		set_page_writeback(p);
> @@ -3842,7 +3849,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>  
>  	if (unlikely(ret)) {
>  		for (; i < num_pages; i++) {
> -			struct page *p = eb->pages[i];
> +			struct page *p = eb_head(eb)->pages[i];
>  			clear_page_dirty_for_io(p);
>  			unlock_page(p);
>  		}
> @@ -4605,17 +4612,36 @@ out:
>  	return ret;
>  }
>  
> -static void __free_extent_buffer(struct extent_buffer *eb)
> +static void __free_extent_buffer(struct extent_buffer_head *ebh)
>  {
> -	btrfs_leak_debug_del(&eb->leak_list);
> -	kmem_cache_free(extent_buffer_cache, eb);
> +	struct extent_buffer *eb, *next_eb;
> +
> +	btrfs_leak_debug_del(&ebh->leak_list);
> +
> +	eb = ebh->eb.eb_next;
> +	while (eb) {
> +		next_eb = eb->eb_next;
> +		kfree(eb);
> +		eb = next_eb;
> +	}
> +
> +	kmem_cache_free(extent_buffer_cache, ebh);
>  }
>  
>  int extent_buffer_under_io(struct extent_buffer *eb)
>  {
> -	return (atomic_read(&eb->io_pages) ||
> -		test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags) ||
> -		test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
> +	struct extent_buffer_head *ebh = eb->ebh;
> +	int dirty_or_writeback = 0;
> +
> +	for (eb = &ebh->eb; eb; eb = eb->eb_next) {
> +		if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags)
> +			|| test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags)) {
> +			dirty_or_writeback = 1;
> +			break;
> +		}
> +	}
> +
> +	return (atomic_read(&ebh->io_bvecs) || dirty_or_writeback);
>  }
>  
>  /*
> @@ -4625,7 +4651,8 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
>  {
>  	unsigned long index;
>  	struct page *page;
> -	int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
> +	struct extent_buffer_head *ebh = eb_head(eb);
> +	int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &ebh->bflags);
>  
>  	BUG_ON(extent_buffer_under_io(eb));
>  
> @@ -4634,8 +4661,10 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
>  		return;
>  
>  	do {
> +		struct extent_buffer *e;
> +
>  		index--;
> -		page = eb->pages[index];
> +		page = ebh->pages[index];
>  		if (page && mapped) {
>  			spin_lock(&page->mapping->private_lock);
>  			/*
> @@ -4646,8 +4675,10 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
>  			 * this eb.
>  			 */
>  			if (PagePrivate(page) &&
> -			    page->private == (unsigned long)eb) {
> -				BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
> +				page->private == (unsigned long)(&ebh->eb)) {
> +				for (e = &ebh->eb; e; e = e->eb_next)
> +					BUG_ON(test_bit(EXTENT_BUFFER_DIRTY,
> +								&e->ebflags));
>  				BUG_ON(PageDirty(page));
>  				BUG_ON(PageWriteback(page));
>  				/*
> @@ -4675,22 +4706,18 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
>  static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
>  {
>  	btrfs_release_extent_buffer_page(eb);
> -	__free_extent_buffer(eb);
> +	__free_extent_buffer(eb_head(eb));
>  }
>  
> -static struct extent_buffer *
> -__alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
> -		      unsigned long len)
> +static void __init_extent_buffer(struct extent_buffer *eb,
> +				struct extent_buffer_head *ebh,
> +				u64 start,
> +				unsigned long len)
>  {
> -	struct extent_buffer *eb = NULL;
> -
> -	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS);
> -	if (eb == NULL)
> -		return NULL;
>  	eb->start = start;
>  	eb->len = len;
> -	eb->fs_info = fs_info;
> -	eb->bflags = 0;
> +	eb->ebh = ebh;
> +	eb->eb_next = NULL;
>  	rwlock_init(&eb->lock);
>  	atomic_set(&eb->write_locks, 0);
>  	atomic_set(&eb->read_locks, 0);
> @@ -4701,12 +4728,26 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
>  	eb->lock_nested = 0;
>  	init_waitqueue_head(&eb->write_lock_wq);
>  	init_waitqueue_head(&eb->read_lock_wq);
> +}
>  
> -	btrfs_leak_debug_add(&eb->leak_list, &buffers);
> +static struct extent_buffer *
> +__alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
> +		      unsigned long len)
> +{
> +	struct extent_buffer_head *ebh = NULL;
> +	struct extent_buffer *eb = NULL;
> +	int i;
> +
> +	ebh = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS);
> +	if (ebh == NULL)
> +		return NULL;
> +	ebh->fs_info = fs_info;
> +	ebh->bflags = 0;
> +	btrfs_leak_debug_add(&ebh->leak_list, &buffers);
>  
> -	spin_lock_init(&eb->refs_lock);
> -	atomic_set(&eb->refs, 1);
> -	atomic_set(&eb->io_pages, 0);
> +	spin_lock_init(&ebh->refs_lock);
> +	atomic_set(&ebh->refs, 1);
> +	atomic_set(&ebh->io_bvecs, 0);
>  
>  	/*
>  	 * Sanity checks, currently the maximum is 64k covered by 16x 4k pages
> @@ -4715,6 +4756,29 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
>  		> MAX_INLINE_EXTENT_BUFFER_SIZE);
>  	BUG_ON(len > MAX_INLINE_EXTENT_BUFFER_SIZE);
>  
> +	if (len < PAGE_CACHE_SIZE) {
> +		struct extent_buffer *cur_eb, *prev_eb;
> +		int ebs_per_page = PAGE_CACHE_SIZE / len;
> +		u64 st = start & ~(PAGE_CACHE_SIZE - 1);
> +
> +		prev_eb = NULL;
> +		cur_eb = &ebh->eb;
> +		for (i = 0; i < ebs_per_page; i++, st += len) {
> +			if (prev_eb) {
> +				cur_eb = kzalloc(sizeof(*eb), GFP_NOFS);
> +				prev_eb->eb_next = cur_eb;
> +			}
> +			__init_extent_buffer(cur_eb, ebh, st, len);
> +			prev_eb = cur_eb;
> +			if (st == start)
> +				eb = cur_eb;
> +		}
> +		BUG_ON(!eb);
> +	} else {
> +		eb = &ebh->eb;
> +		__init_extent_buffer(eb, ebh, start, len);
> +	}
> +
>  	return eb;
>  }
>  
> @@ -4725,7 +4789,8 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
>  	struct extent_buffer *new;
>  	unsigned long num_pages = num_extent_pages(src->start, src->len);
>  
> -	new = __alloc_extent_buffer(src->fs_info, src->start, src->len);
> +	new = __alloc_extent_buffer(eb_head(src)->fs_info, src->start,
> +				src->len);
>  	if (new == NULL)
>  		return NULL;
>  
> @@ -4735,15 +4800,16 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
>  			btrfs_release_extent_buffer(new);
>  			return NULL;
>  		}
> -		attach_extent_buffer_page(new, p);
> +		attach_extent_buffer_page(&(eb_head(new)->eb), p);
>  		WARN_ON(PageDirty(p));
>  		SetPageUptodate(p);
> -		new->pages[i] = p;
> +		eb_head(new)->pages[i] = p;
>  	}
>  
> +	set_bit(EXTENT_BUFFER_UPTODATE, &new->ebflags);
> +	set_bit(EXTENT_BUFFER_DUMMY, &eb_head(new)->bflags);
> +
>  	copy_extent_buffer(new, src, 0, 0, src->len);
> -	set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
> -	set_bit(EXTENT_BUFFER_DUMMY, &new->bflags);
>  
>  	return new;
>  }
> @@ -4772,19 +4838,19 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
>  		return NULL;
>  
>  	for (i = 0; i < num_pages; i++) {
> -		eb->pages[i] = alloc_page(GFP_NOFS);
> -		if (!eb->pages[i])
> +		eb_head(eb)->pages[i] = alloc_page(GFP_NOFS);
> +		if (!eb_head(eb)->pages[i])
>  			goto err;
>  	}
>  	set_extent_buffer_uptodate(eb);
>  	btrfs_set_header_nritems(eb, 0);
> -	set_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
> +	set_bit(EXTENT_BUFFER_DUMMY, &eb_head(eb)->bflags);
>  
>  	return eb;
>  err:
>  	for (; i > 0; i--)
> -		__free_page(eb->pages[i - 1]);
> -	__free_extent_buffer(eb);
> +		__free_page(eb_head(eb)->pages[i - 1]);
> +	__free_extent_buffer(eb_head(eb));
>  	return NULL;
>  }
>  
> @@ -4811,14 +4877,15 @@ static void check_buffer_tree_ref(struct extent_buffer *eb)
>  	 * So bump the ref count first, then set the bit.  If someone
>  	 * beat us to it, drop the ref we added.
>  	 */
> -	refs = atomic_read(&eb->refs);
> -	if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> +	refs = atomic_read(&eb_head(eb)->refs);
> +	if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF,
> +					&eb_head(eb)->bflags))
>  		return;
>  
> -	spin_lock(&eb->refs_lock);
> -	if (!test_and_set_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> -		atomic_inc(&eb->refs);
> -	spin_unlock(&eb->refs_lock);
> +	spin_lock(&eb_head(eb)->refs_lock);
> +	if (!test_and_set_bit(EXTENT_BUFFER_TREE_REF, &eb_head(eb)->bflags))
> +		atomic_inc(&eb_head(eb)->refs);
> +	spin_unlock(&eb_head(eb)->refs_lock);
>  }
>  
>  static void mark_extent_buffer_accessed(struct extent_buffer *eb,
> @@ -4830,7 +4897,7 @@ static void mark_extent_buffer_accessed(struct extent_buffer *eb,
>  
>  	num_pages = num_extent_pages(eb->start, eb->len);
>  	for (i = 0; i < num_pages; i++) {
> -		struct page *p = eb->pages[i];
> +		struct page *p = eb_head(eb)->pages[i];
>  
>  		if (p != accessed)
>  			mark_page_accessed(p);
> @@ -4840,15 +4907,24 @@ static void mark_extent_buffer_accessed(struct extent_buffer *eb,
>  struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
>  					 u64 start)
>  {
> +	struct extent_buffer_head *ebh;
>  	struct extent_buffer *eb;
>  
>  	rcu_read_lock();
> -	eb = radix_tree_lookup(&fs_info->buffer_radix,
> -			       start >> PAGE_CACHE_SHIFT);
> -	if (eb && atomic_inc_not_zero(&eb->refs)) {
> +	ebh = radix_tree_lookup(&fs_info->buffer_radix,
> +				start >> PAGE_CACHE_SHIFT);
> +	if (ebh && atomic_inc_not_zero(&ebh->refs)) {
>  		rcu_read_unlock();
> -		mark_extent_buffer_accessed(eb, NULL);
> -		return eb;
> +
> +		eb = &ebh->eb;
> +		do {
> +			if (eb->start == start) {
> +				mark_extent_buffer_accessed(eb, NULL);
> +				return eb;
> +			}
> +		} while ((eb = eb->eb_next) != NULL);
> +
> +		BUG();
>  	}
>  	rcu_read_unlock();
>  
> @@ -4909,7 +4985,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>  	unsigned long num_pages = num_extent_pages(start, len);
>  	unsigned long i;
>  	unsigned long index = start >> PAGE_CACHE_SHIFT;
> -	struct extent_buffer *eb;
> +	struct extent_buffer *eb, *cur_eb;
>  	struct extent_buffer *exists = NULL;
>  	struct page *p;
>  	struct address_space *mapping = fs_info->btree_inode->i_mapping;
> @@ -4939,12 +5015,18 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>  			 * overwrite page->private.
>  			 */
>  			exists = (struct extent_buffer *)p->private;
> -			if (atomic_inc_not_zero(&exists->refs)) {
> +			if (atomic_inc_not_zero(&eb_head(exists)->refs)) {
>  				spin_unlock(&mapping->private_lock);
>  				unlock_page(p);
>  				page_cache_release(p);
> -				mark_extent_buffer_accessed(exists, p);
> -				goto free_eb;
> +				do {
> +					if (exists->start == start) {
> +						mark_extent_buffer_accessed(exists, p);
> +						goto free_eb;
> +					}
> +				} while ((exists = exists->eb_next) != NULL);
> +
> +				BUG();
>  			}
>  
>  			/*
> @@ -4955,10 +5037,11 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>  			WARN_ON(PageDirty(p));
>  			page_cache_release(p);
>  		}
> -		attach_extent_buffer_page(eb, p);
> +		attach_extent_buffer_page(&(eb_head(eb)->eb), p);
>  		spin_unlock(&mapping->private_lock);
>  		WARN_ON(PageDirty(p));
> -		eb->pages[i] = p;
> +		mark_page_accessed(p);
> +		eb_head(eb)->pages[i] = p;
>  		if (!PageUptodate(p))
>  			uptodate = 0;
>  
> @@ -4967,16 +5050,22 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>  		 * and why we unlock later
>  		 */
>  	}
> -	if (uptodate)
> -		set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
> +	if (uptodate) {
> +		cur_eb = &(eb_head(eb)->eb);
> +		do {
> +			set_bit(EXTENT_BUFFER_UPTODATE, &cur_eb->ebflags);
> +		} while ((cur_eb = cur_eb->eb_next) != NULL);
> +	}
>  again:
>  	ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
> -	if (ret)
> +	if (ret) {
> +		exists = NULL;
>  		goto free_eb;
> +	}
>  
>  	spin_lock(&fs_info->buffer_lock);
>  	ret = radix_tree_insert(&fs_info->buffer_radix,
> -				start >> PAGE_CACHE_SHIFT, eb);
> +				start >> PAGE_CACHE_SHIFT, eb_head(eb));
>  	spin_unlock(&fs_info->buffer_lock);
>  	radix_tree_preload_end();
>  	if (ret == -EEXIST) {
> @@ -4988,7 +5077,7 @@ again:
>  	}
>  	/* add one reference for the tree */
>  	check_buffer_tree_ref(eb);
> -	set_bit(EXTENT_BUFFER_IN_TREE, &eb->bflags);
> +	set_bit(EXTENT_BUFFER_IN_TREE, &eb_head(eb)->bflags);
>  
>  	/*
>  	 * there is a race where release page may have
> @@ -4999,114 +5088,131 @@ again:
>  	 * after the extent buffer is in the radix tree so
>  	 * it doesn't get lost
>  	 */
> -	SetPageChecked(eb->pages[0]);
> +	SetPageChecked(eb_head(eb)->pages[0]);
>  	for (i = 1; i < num_pages; i++) {
> -		p = eb->pages[i];
> +		p = eb_head(eb)->pages[i];
>  		ClearPageChecked(p);
>  		unlock_page(p);
>  	}
> -	unlock_page(eb->pages[0]);
> +	unlock_page(eb_head(eb)->pages[0]);
>  	return eb;
>  
>  free_eb:
>  	for (i = 0; i < num_pages; i++) {
> -		if (eb->pages[i])
> -			unlock_page(eb->pages[i]);
> +		if (eb_head(eb)->pages[i])
> +			unlock_page(eb_head(eb)->pages[i]);
>  	}
>  
> -	WARN_ON(!atomic_dec_and_test(&eb->refs));
> +	WARN_ON(!atomic_dec_and_test(&eb_head(eb)->refs));
>  	btrfs_release_extent_buffer(eb);
>  	return exists;
>  }
>  
>  static inline void btrfs_release_extent_buffer_rcu(struct rcu_head *head)
>  {
> -	struct extent_buffer *eb =
> -			container_of(head, struct extent_buffer, rcu_head);
> +	struct extent_buffer_head *ebh =
> +			container_of(head, struct extent_buffer_head, rcu_head);
>  
> -	__free_extent_buffer(eb);
> +	__free_extent_buffer(ebh);
>  }
>  
>  /* Expects to have eb->eb_lock already held */
> -static int release_extent_buffer(struct extent_buffer *eb)
> +static int release_extent_buffer(struct extent_buffer_head *ebh)
>  {
> -	WARN_ON(atomic_read(&eb->refs) == 0);
> -	if (atomic_dec_and_test(&eb->refs)) {
> -		if (test_and_clear_bit(EXTENT_BUFFER_IN_TREE, &eb->bflags)) {
> -			struct btrfs_fs_info *fs_info = eb->fs_info;
> +	WARN_ON(atomic_read(&ebh->refs) == 0);
> +	if (atomic_dec_and_test(&ebh->refs)) {
> +		if (test_and_clear_bit(EXTENT_BUFFER_IN_TREE, &ebh->bflags)) {
> +			struct btrfs_fs_info *fs_info = ebh->fs_info;
>  
> -			spin_unlock(&eb->refs_lock);
> +			spin_unlock(&ebh->refs_lock);
>  
>  			spin_lock(&fs_info->buffer_lock);
>  			radix_tree_delete(&fs_info->buffer_radix,
> -					  eb->start >> PAGE_CACHE_SHIFT);
> +					ebh->eb.start >> PAGE_CACHE_SHIFT);
>  			spin_unlock(&fs_info->buffer_lock);
>  		} else {
> -			spin_unlock(&eb->refs_lock);
> +			spin_unlock(&ebh->refs_lock);
>  		}
>  
>  		/* Should be safe to release our pages at this point */
> -		btrfs_release_extent_buffer_page(eb);
> +		btrfs_release_extent_buffer_page(&ebh->eb);
>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> -		if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags))) {
> -			__free_extent_buffer(eb);
> +		if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, &ebh->bflags))) {
> +			__free_extent_buffer(ebh);
>  			return 1;
>  		}
>  #endif
> -		call_rcu(&eb->rcu_head, btrfs_release_extent_buffer_rcu);
> +		call_rcu(&ebh->rcu_head, btrfs_release_extent_buffer_rcu);
>  		return 1;
>  	}
> -	spin_unlock(&eb->refs_lock);
> +	spin_unlock(&ebh->refs_lock);
>  
>  	return 0;
>  }
>  
>  void free_extent_buffer(struct extent_buffer *eb)
>  {
> +	struct extent_buffer_head *ebh;
>  	int refs;
>  	int old;
>  	if (!eb)
>  		return;
>  
> +	ebh = eb_head(eb);
>  	while (1) {
> -		refs = atomic_read(&eb->refs);
> +		refs = atomic_read(&ebh->refs);
>  		if (refs <= 3)
>  			break;
> -		old = atomic_cmpxchg(&eb->refs, refs, refs - 1);
> +		old = atomic_cmpxchg(&ebh->refs, refs, refs - 1);
>  		if (old == refs)
>  			return;
>  	}
>  
> -	spin_lock(&eb->refs_lock);
> -	if (atomic_read(&eb->refs) == 2 &&
> -	    test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags))
> -		atomic_dec(&eb->refs);
> +	spin_lock(&ebh->refs_lock);
> +	if (atomic_read(&ebh->refs) == 2 &&
> +	    test_bit(EXTENT_BUFFER_DUMMY, &ebh->bflags))
> +		atomic_dec(&ebh->refs);
>  
> -	if (atomic_read(&eb->refs) == 2 &&
> -	    test_bit(EXTENT_BUFFER_STALE, &eb->bflags) &&
> +	if (atomic_read(&ebh->refs) == 2 &&
> +	    test_bit(EXTENT_BUFFER_STALE, &eb->ebflags) &&
>  	    !extent_buffer_under_io(eb) &&
> -	    test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> -		atomic_dec(&eb->refs);
> +	    test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &ebh->bflags))
> +		atomic_dec(&ebh->refs);
>  
>  	/*
>  	 * I know this is terrible, but it's temporary until we stop tracking
>  	 * the uptodate bits and such for the extent buffers.
>  	 */
> -	release_extent_buffer(eb);
> +	release_extent_buffer(ebh);
>  }
>  
>  void free_extent_buffer_stale(struct extent_buffer *eb)
>  {
> +	struct extent_buffer_head *ebh;
>  	if (!eb)
>  		return;
>  
> -	spin_lock(&eb->refs_lock);
> -	set_bit(EXTENT_BUFFER_STALE, &eb->bflags);
> +	ebh = eb_head(eb);
> +	spin_lock(&ebh->refs_lock);
> +
> +	set_bit(EXTENT_BUFFER_STALE, &eb->ebflags);
> +	if (atomic_read(&ebh->refs) == 2 && !extent_buffer_under_io(eb) &&
> +	    test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &ebh->bflags))
> +		atomic_dec(&ebh->refs);
>  
> -	if (atomic_read(&eb->refs) == 2 && !extent_buffer_under_io(eb) &&
> -	    test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> -		atomic_dec(&eb->refs);
> -	release_extent_buffer(eb);
> +	release_extent_buffer(ebh);
> +}
> +
> +static int page_ebs_clean(struct extent_buffer_head *ebh)
> +{
> +	struct extent_buffer *eb = &ebh->eb;
> +
> +	do {
> +		if (test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags))
> +			return 0;
> +	} while ((eb = eb->eb_next) != NULL);
> +
> +	return 1;
>  }
>  
>  void clear_extent_buffer_dirty(struct extent_buffer *eb)
> @@ -5117,8 +5223,11 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb)
>  
>  	num_pages = num_extent_pages(eb->start, eb->len);
>  
> +	if (eb->len < PAGE_CACHE_SIZE && !page_ebs_clean(eb_head(eb)))
> +		return;
> +
>  	for (i = 0; i < num_pages; i++) {
> -		page = eb->pages[i];
> +		page = eb_head(eb)->pages[i];
>  		if (!PageDirty(page))
>  			continue;
>  
> @@ -5136,7 +5245,7 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb)
>  		ClearPageError(page);
>  		unlock_page(page);
>  	}
> -	WARN_ON(atomic_read(&eb->refs) == 0);
> +	WARN_ON(atomic_read(&eb_head(eb)->refs) == 0);
>  }
>  
>  int set_extent_buffer_dirty(struct extent_buffer *eb)
> @@ -5147,14 +5256,14 @@ int set_extent_buffer_dirty(struct extent_buffer *eb)
>  
>  	check_buffer_tree_ref(eb);
>  
> -	was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags);
> +	was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
>  
>  	num_pages = num_extent_pages(eb->start, eb->len);
> -	WARN_ON(atomic_read(&eb->refs) == 0);
> -	WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
> +	WARN_ON(atomic_read(&eb_head(eb)->refs) == 0);
> +	WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb_head(eb)->bflags));
>  
>  	for (i = 0; i < num_pages; i++)
> -		set_page_dirty(eb->pages[i]);
> +		set_page_dirty(eb_head(eb)->pages[i]);
>  	return was_dirty;
>  }
>  
> @@ -5164,10 +5273,12 @@ int clear_extent_buffer_uptodate(struct extent_buffer *eb)
>  	struct page *page;
>  	unsigned long num_pages;
>  
> -	clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
> +	if (!eb || !eb_head(eb))
> +		return 0;
> +	clear_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags);
>  	num_pages = num_extent_pages(eb->start, eb->len);
>  	for (i = 0; i < num_pages; i++) {
> -		page = eb->pages[i];
> +		page = eb_head(eb)->pages[i];
>  		if (page)
>  			ClearPageUptodate(page);
>  	}
> @@ -5176,22 +5287,43 @@ int clear_extent_buffer_uptodate(struct extent_buffer *eb)
>  
>  int set_extent_buffer_uptodate(struct extent_buffer *eb)
>  {
> +	struct extent_buffer_head *ebh;
>  	unsigned long i;
>  	struct page *page;
>  	unsigned long num_pages;
> +	int uptodate;
>  
> -	set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
> -	num_pages = num_extent_pages(eb->start, eb->len);
> -	for (i = 0; i < num_pages; i++) {
> -		page = eb->pages[i];
> -		SetPageUptodate(page);
> +	ebh = eb->ebh;
> +
> +	set_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags);
> +	if (eb->len < PAGE_CACHE_SIZE) {
> +		eb = &(eb_head(eb)->eb);
> +		uptodate = 1;
> +		do {
> +			if (!test_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags)) {
> +				uptodate = 0;
> +				break;
> +			}
> +		} while ((eb = eb->eb_next) != NULL);
> +
> +		if (uptodate) {
> +			page = ebh->pages[0];
> +			SetPageUptodate(page);
> +		}
> +	} else {
> +		num_pages = num_extent_pages(eb->start, eb->len);
> +		for (i = 0; i < num_pages; i++) {
> +			page = ebh->pages[i];
> +			SetPageUptodate(page);
> +		}
>  	}
> +
>  	return 0;
>  }
>  
>  int extent_buffer_uptodate(struct extent_buffer *eb)
>  {
> -	return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
> +	return test_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags);
>  }
>  
>  int read_extent_buffer_pages(struct extent_io_tree *tree,
> @@ -5210,7 +5342,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
>  	struct bio *bio = NULL;
>  	unsigned long bio_flags = 0;
>  
> -	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
> +	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags))
>  		return 0;
>  
>  	if (start) {
> @@ -5223,7 +5355,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
>  
>  	num_pages = num_extent_pages(eb->start, eb->len);
>  	for (i = start_i; i < num_pages; i++) {
> -		page = eb->pages[i];
> +		page = eb_head(eb)->pages[i];
>  		if (wait == WAIT_NONE) {
>  			if (!trylock_page(page))
>  				goto unlock_exit;
> @@ -5238,15 +5370,15 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
>  	}
>  	if (all_uptodate) {
>  		if (start_i == 0)
> -			set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
> +			set_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags);
>  		goto unlock_exit;
>  	}
>  
> -	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
> +	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->ebflags);
>  	eb->read_mirror = 0;
>  	atomic_set(&eb->io_pages, num_reads);
>  	for (i = start_i; i < num_pages; i++) {
> -		page = eb->pages[i];
> +		page = eb_head(eb)->pages[i];
>  		if (!PageUptodate(page)) {
>  			ClearPageError(page);
>  			err = __extent_read_full_page(tree, page,
> @@ -5271,7 +5403,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
>  		return ret;
>  
>  	for (i = start_i; i < num_pages; i++) {
> -		page = eb->pages[i];
> +		page = eb_head(eb)->pages[i];
>  		wait_on_page_locked(page);
>  		if (!PageUptodate(page))
>  			ret = -EIO;
> @@ -5282,7 +5414,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
>  unlock_exit:
>  	i = start_i;
>  	while (locked_pages > 0) {
> -		page = eb->pages[i];
> +		page = eb_head(eb)->pages[i];
>  		i++;
>  		unlock_page(page);
>  		locked_pages--;
> @@ -5308,7 +5440,7 @@ void read_extent_buffer(struct extent_buffer *eb, void *dstv,
>  	offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
>  
>  	while (len > 0) {
> -		page = eb->pages[i];
> +		page = eb_head(eb)->pages[i];
>  
>  		cur = min(len, (PAGE_CACHE_SIZE - offset));
>  		kaddr = page_address(page);
> @@ -5340,7 +5472,7 @@ int read_extent_buffer_to_user(struct extent_buffer *eb, void __user *dstv,
>  	offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
>  
>  	while (len > 0) {
> -		page = eb->pages[i];
> +		page = eb_head(eb)->pages[i];
>  
>  		cur = min(len, (PAGE_CACHE_SIZE - offset));
>  		kaddr = page_address(page);
> @@ -5389,7 +5521,7 @@ int map_private_extent_buffer(struct extent_buffer *eb, unsigned long start,
>  		return -EINVAL;
>  	}
>  
> -	p = eb->pages[i];
> +	p = eb_head(eb)->pages[i];
>  	kaddr = page_address(p);
>  	*map = kaddr + offset;
>  	*map_len = PAGE_CACHE_SIZE - offset;
> @@ -5415,7 +5547,7 @@ int memcmp_extent_buffer(struct extent_buffer *eb, const void *ptrv,
>  	offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
>  
>  	while (len > 0) {
> -		page = eb->pages[i];
> +		page = eb_head(eb)->pages[i];
>  
>  		cur = min(len, (PAGE_CACHE_SIZE - offset));
>  
> @@ -5445,12 +5577,12 @@ void write_extent_buffer(struct extent_buffer *eb, const void *srcv,
>  
>  	WARN_ON(start > eb->len);
>  	WARN_ON(start + len > eb->start + eb->len);
> +	WARN_ON(!test_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags));
>  
>  	offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
>  
>  	while (len > 0) {
> -		page = eb->pages[i];
> -		WARN_ON(!PageUptodate(page));
> +		page = eb_head(eb)->pages[i];
>  
>  		cur = min(len, PAGE_CACHE_SIZE - offset);
>  		kaddr = page_address(page);
> @@ -5478,9 +5610,10 @@ void memset_extent_buffer(struct extent_buffer *eb, char c,
>  
>  	offset = (start_offset + start) & (PAGE_CACHE_SIZE - 1);
>  
> +	WARN_ON(!test_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags));
> +
>  	while (len > 0) {
> -		page = eb->pages[i];
> -		WARN_ON(!PageUptodate(page));
> +		page = eb_head(eb)->pages[i];
>  
>  		cur = min(len, PAGE_CACHE_SIZE - offset);
>  		kaddr = page_address(page);
> @@ -5509,9 +5642,10 @@ void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src,
>  	offset = (start_offset + dst_offset) &
>  		(PAGE_CACHE_SIZE - 1);
>  
> +	WARN_ON(!test_bit(EXTENT_BUFFER_UPTODATE, &dst->ebflags));
> +
>  	while (len > 0) {
> -		page = dst->pages[i];
> -		WARN_ON(!PageUptodate(page));
> +		page = eb_head(dst)->pages[i];
>  
>  		cur = min(len, (unsigned long)(PAGE_CACHE_SIZE - offset));
>  
> @@ -5588,8 +5722,9 @@ void memcpy_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
>  		cur = min_t(unsigned long, cur,
>  			(unsigned long)(PAGE_CACHE_SIZE - dst_off_in_page));
>  
> -		copy_pages(dst->pages[dst_i], dst->pages[src_i],
> -			   dst_off_in_page, src_off_in_page, cur);
> +		copy_pages(eb_head(dst)->pages[dst_i],
> +			eb_head(dst)->pages[src_i],
> +			dst_off_in_page, src_off_in_page, cur);
>  
>  		src_offset += cur;
>  		dst_offset += cur;
> @@ -5634,9 +5769,10 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
>  
>  		cur = min_t(unsigned long, len, src_off_in_page + 1);
>  		cur = min(cur, dst_off_in_page + 1);
> -		copy_pages(dst->pages[dst_i], dst->pages[src_i],
> -			   dst_off_in_page - cur + 1,
> -			   src_off_in_page - cur + 1, cur);
> +		copy_pages(eb_head(dst)->pages[dst_i],
> +			eb_head(dst)->pages[src_i],
> +			dst_off_in_page - cur + 1,
> +			src_off_in_page - cur + 1, cur);
>  
>  		dst_end -= cur;
>  		src_end -= cur;
> @@ -5646,6 +5782,7 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
>  
>  int try_release_extent_buffer(struct page *page)
>  {
> +	struct extent_buffer_head *ebh;
>  	struct extent_buffer *eb;
>  
>  	/*
> @@ -5661,14 +5798,15 @@ int try_release_extent_buffer(struct page *page)
>  	eb = (struct extent_buffer *)page->private;
>  	BUG_ON(!eb);
>  
> +	ebh = eb->ebh;
>  	/*
>  	 * This is a little awful but should be ok, we need to make sure that
>  	 * the eb doesn't disappear out from under us while we're looking at
>  	 * this page.
>  	 */
> -	spin_lock(&eb->refs_lock);
> -	if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
> -		spin_unlock(&eb->refs_lock);
> +	spin_lock(&ebh->refs_lock);
> +	if (atomic_read(&ebh->refs) != 1 || extent_buffer_under_io(eb)) {
> +		spin_unlock(&ebh->refs_lock);
>  		spin_unlock(&page->mapping->private_lock);
>  		return 0;
>  	}
> @@ -5678,10 +5816,11 @@ int try_release_extent_buffer(struct page *page)
>  	 * If tree ref isn't set then we know the ref on this eb is a real ref,
>  	 * so just return, this page will likely be freed soon anyway.
>  	 */
> -	if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
> -		spin_unlock(&eb->refs_lock);
> +	if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &ebh->bflags)) {
> +		spin_unlock(&ebh->refs_lock);
>  		return 0;
>  	}
>  
> -	return release_extent_buffer(eb);
> +	return release_extent_buffer(ebh);
>  }
> +
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index 541b40a..8fe5ac3 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -131,17 +131,17 @@ struct extent_state {
>  
>  #define INLINE_EXTENT_BUFFER_PAGES 16
>  #define MAX_INLINE_EXTENT_BUFFER_SIZE (INLINE_EXTENT_BUFFER_PAGES * PAGE_CACHE_SIZE)
> +
> +/* Forward declaration */
> +struct extent_buffer_head;
> +
>  struct extent_buffer {
>  	u64 start;
>  	unsigned long len;
> -	unsigned long bflags;
> -	struct btrfs_fs_info *fs_info;
> -	spinlock_t refs_lock;
> -	atomic_t refs;
> -	atomic_t io_pages;
> +	unsigned long ebflags;
> +	struct extent_buffer_head *ebh;
> +	struct extent_buffer *eb_next;
>  	int read_mirror;
> -	struct rcu_head rcu_head;
> -	pid_t lock_owner;
>  
>  	/* count of read lock holders on the extent buffer */
>  	atomic_t write_locks;
> @@ -154,6 +154,8 @@ struct extent_buffer {
>  	/* >= 0 if eb belongs to a log tree, -1 otherwise */
>  	short log_index;
>  
> +	pid_t lock_owner;
> +
>  	/* protects write locks */
>  	rwlock_t lock;
>  
> @@ -166,7 +168,20 @@ struct extent_buffer {
>  	 * to unlock
>  	 */
>  	wait_queue_head_t read_lock_wq;
> +	wait_queue_head_t lock_wq;
> +};
> +
> +struct extent_buffer_head {
> +	unsigned long bflags;
> +	struct btrfs_fs_info *fs_info;
> +	spinlock_t refs_lock;
> +	atomic_t refs;
> +	atomic_t io_bvecs;
> +	struct rcu_head rcu_head;
> +
>  	struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
> +
> +	struct extent_buffer eb;
>  #ifdef CONFIG_BTRFS_DEBUG
>  	struct list_head leak_list;
>  #endif
> @@ -183,6 +198,14 @@ static inline int extent_compress_type(unsigned long bio_flags)
>  	return bio_flags >> EXTENT_BIO_FLAG_SHIFT;
>  }
>  
> +/*
> + * return the extent_buffer_head that contains the extent buffer provided.
> + */
> +static inline struct extent_buffer_head *eb_head(struct extent_buffer *eb)
> +{
> +	return eb->ebh;
> +
> +}
>  struct extent_map_tree;
>  
>  typedef struct extent_map *(get_extent_t)(struct inode *inode,
> @@ -304,7 +327,7 @@ static inline unsigned long num_extent_pages(u64 start, u64 len)
>  
>  static inline void extent_buffer_get(struct extent_buffer *eb)
>  {
> -	atomic_inc(&eb->refs);
> +	atomic_inc(&eb_head(eb)->refs);
>  }
>  
>  int memcmp_extent_buffer(struct extent_buffer *eb, const void *ptrv,
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 8bcd2a0..9c8eb4a 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6282,7 +6282,7 @@ int btrfs_read_sys_array(struct btrfs_root *root)
>  	 * to silence the warning eg. on PowerPC 64.
>  	 */
>  	if (PAGE_CACHE_SIZE > BTRFS_SUPER_INFO_SIZE)
> -		SetPageUptodate(sb->pages[0]);
> +		SetPageUptodate(eb_head(sb)->pages[0]);
>  
>  	write_extent_buffer(sb, super_copy, 0, BTRFS_SUPER_INFO_SIZE);
>  	array_size = btrfs_super_sys_array_size(super_copy);
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index 1faecea..283bbe7 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -699,7 +699,7 @@ TRACE_EVENT(btrfs_cow_block,
>  	TP_fast_assign(
>  		__entry->root_objectid	= root->root_key.objectid;
>  		__entry->buf_start	= buf->start;
> -		__entry->refs		= atomic_read(&buf->refs);
> +		__entry->refs		= atomic_read(&eb_head(buf)->refs);
>  		__entry->cow_start	= cow->start;
>  		__entry->buf_level	= btrfs_header_level(buf);
>  		__entry->cow_level	= btrfs_header_level(cow);
> -- 
> 2.1.0
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 08/21] Btrfs: subpagesize-blocksize: Compute and look up csums based on sectorsized blocks.
  2015-06-01 15:22 ` [RFC PATCH V11 08/21] Btrfs: subpagesize-blocksize: Compute and look up csums based on sectorsized blocks Chandan Rajendra
@ 2015-07-01 14:37   ` Liu Bo
  0 siblings, 0 replies; 47+ messages in thread
From: Liu Bo @ 2015-07-01 14:37 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Mon, Jun 01, 2015 at 08:52:43PM +0530, Chandan Rajendra wrote:
> Checksums are applicable to sectorsize units. The current code uses
> bio->bv_len units to compute and look up checksums. This works on machines
> where sectorsize == PAGE_SIZE. This patch makes the checksum computation and
> lookup code work with sectorsize units.
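
Put differently, the walk now advances through each bio_vec in sectorsize
steps instead of consuming bv_len in one shot. A quick user-space
illustration of the arithmetic (toy code only, not the kernel API):

	#include <stdio.h>

	int main(void)
	{
		unsigned int page_size = 4096;   /* bv_len of a full-page bvec */
		unsigned int sectorsize = 2048;  /* fs block size < page size */
		unsigned long long disk_bytenr = 1048576; /* arbitrary start */
		unsigned int nr_sectors = (page_size + sectorsize - 1) / sectorsize;
		unsigned int i;

		/* one checksum per sectorsize block, so two per 4K bvec here */
		for (i = 0; i < nr_sectors; i++) {
			printf("csum block %u: disk_bytenr %llu, page offset %u\n",
			       i, disk_bytenr, i * sectorsize);
			disk_bytenr += sectorsize;
		}
		return 0;
	}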

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo

> 
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> ---
>  fs/btrfs/file-item.c | 87 ++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 54 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 58ece65..65ab9c3 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -172,6 +172,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
>  	u64 item_start_offset = 0;
>  	u64 item_last_offset = 0;
>  	u64 disk_bytenr;
> +	u64 page_bytes_left;
>  	u32 diff;
>  	int nblocks;
>  	int bio_index = 0;
> @@ -220,6 +221,8 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
>  	disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
>  	if (dio)
>  		offset = logical_offset;
> +
> +	page_bytes_left = bvec->bv_len;
>  	while (bio_index < bio->bi_vcnt) {
>  		if (!dio)
>  			offset = page_offset(bvec->bv_page) + bvec->bv_offset;
> @@ -243,7 +246,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
>  				if (BTRFS_I(inode)->root->root_key.objectid ==
>  				    BTRFS_DATA_RELOC_TREE_OBJECTID) {
>  					set_extent_bits(io_tree, offset,
> -						offset + bvec->bv_len - 1,
> +						offset + root->sectorsize - 1,
>  						EXTENT_NODATASUM, GFP_NOFS);
>  				} else {
>  					btrfs_info(BTRFS_I(inode)->root->fs_info,
> @@ -281,11 +284,17 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
>  found:
>  		csum += count * csum_size;
>  		nblocks -= count;
> -		bio_index += count;
> +
>  		while (count--) {
> -			disk_bytenr += bvec->bv_len;
> -			offset += bvec->bv_len;
> -			bvec++;
> +			disk_bytenr += root->sectorsize;
> +			offset += root->sectorsize;
> +			page_bytes_left -= root->sectorsize;
> +			if (!page_bytes_left) {
> +				bio_index++;
> +				bvec++;
> +				page_bytes_left = bvec->bv_len;
> +			}
> +
>  		}
>  	}
>  	btrfs_free_path(path);
> @@ -432,6 +441,8 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
>  	struct bio_vec *bvec = bio->bi_io_vec;
>  	int bio_index = 0;
>  	int index;
> +	int nr_sectors;
> +	int i;
>  	unsigned long total_bytes = 0;
>  	unsigned long this_sum_bytes = 0;
>  	u64 offset;
> @@ -459,41 +470,51 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
>  		if (!contig)
>  			offset = page_offset(bvec->bv_page) + bvec->bv_offset;
>  
> -		if (offset >= ordered->file_offset + ordered->len ||
> -		    offset < ordered->file_offset) {
> -			unsigned long bytes_left;
> -			sums->len = this_sum_bytes;
> -			this_sum_bytes = 0;
> -			btrfs_add_ordered_sum(inode, ordered, sums);
> -			btrfs_put_ordered_extent(ordered);
> +		data = kmap_atomic(bvec->bv_page);
>  
> -			bytes_left = bio->bi_iter.bi_size - total_bytes;
>  
> -			sums = kzalloc(btrfs_ordered_sum_size(root, bytes_left),
> -				       GFP_NOFS);
> -			BUG_ON(!sums); /* -ENOMEM */
> -			sums->len = bytes_left;
> -			ordered = btrfs_lookup_ordered_extent(inode, offset);
> -			BUG_ON(!ordered); /* Logic error */
> -			sums->bytenr = ((u64)bio->bi_iter.bi_sector << 9) +
> -				       total_bytes;
> -			index = 0;
> +		nr_sectors = (bvec->bv_len + root->sectorsize - 1)
> +			>> root->fs_info->sb->s_blocksize_bits;
> +
> +
> +		for (i = 0; i < nr_sectors; i++) {
> +			if (offset >= ordered->file_offset + ordered->len ||
> +				offset < ordered->file_offset) {
> +				unsigned long bytes_left;
> +				sums->len = this_sum_bytes;
> +				this_sum_bytes = 0;
> +				btrfs_add_ordered_sum(inode, ordered, sums);
> +				btrfs_put_ordered_extent(ordered);
> +
> +				bytes_left = bio->bi_iter.bi_size - total_bytes;
> +
> +				sums = kzalloc(btrfs_ordered_sum_size(root, bytes_left),
> +					GFP_NOFS);
> +				BUG_ON(!sums); /* -ENOMEM */
> +				sums->len = bytes_left;
> +				ordered = btrfs_lookup_ordered_extent(inode, offset);
> +				BUG_ON(!ordered); /* Logic error */
> +				sums->bytenr = ((u64)bio->bi_iter.bi_sector << 9) +
> +					total_bytes;
> +				index = 0;
> +			}
> +
> +			sums->sums[index] = ~(u32)0;
> +			sums->sums[index]
> +				= btrfs_csum_data(data + bvec->bv_offset + (i * root->sectorsize),
> +						sums->sums[index],
> +						root->sectorsize);
> +			btrfs_csum_final(sums->sums[index],
> +					(char *)(sums->sums + index));
> +			index++;
> +			offset += root->sectorsize;
> +			this_sum_bytes += root->sectorsize;
> +			total_bytes += root->sectorsize;
>  		}
>  
> -		data = kmap_atomic(bvec->bv_page);
> -		sums->sums[index] = ~(u32)0;
> -		sums->sums[index] = btrfs_csum_data(data + bvec->bv_offset,
> -						    sums->sums[index],
> -						    bvec->bv_len);
>  		kunmap_atomic(data);
> -		btrfs_csum_final(sums->sums[index],
> -				 (char *)(sums->sums + index));
>  
>  		bio_index++;
> -		index++;
> -		total_bytes += bvec->bv_len;
> -		this_sum_bytes += bvec->bv_len;
> -		offset += bvec->bv_len;
>  		bvec++;
>  	}
>  	this_sum_bytes = 0;
> -- 
> 2.1.0
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 05/21] Btrfs: subpagesize-blocksize: Read tree blocks whose size is < PAGE_SIZE.
  2015-06-01 15:22 ` [RFC PATCH V11 05/21] Btrfs: subpagesize-blocksize: Read tree blocks whose size is < PAGE_SIZE Chandan Rajendra
@ 2015-07-01 14:40   ` Liu Bo
  2015-07-03 10:02     ` Chandan Rajendra
  0 siblings, 1 reply; 47+ messages in thread
From: Liu Bo @ 2015-07-01 14:40 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Mon, Jun 01, 2015 at 08:52:40PM +0530, Chandan Rajendra wrote:
> In the case of subpagesize-blocksize, this patch makes it possible to read
> only a single metadata block from the disk instead of all the metadata blocks
> that map into a page.

I'm a bit curious about how much benefit is gained from reading a single block
rather than reading the whole page.

Thanks,

-liubo

> 
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> ---
>  fs/btrfs/disk-io.c   |  65 +++++++++++++++++++------------
>  fs/btrfs/disk-io.h   |   3 ++
>  fs/btrfs/extent_io.c | 108 +++++++++++++++++++++++++++++++++++++++++++++++----
>  3 files changed, 144 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 51fe2ec..b794e33 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -597,28 +597,41 @@ static noinline int check_leaf(struct btrfs_root *root,
>  	return 0;
>  }
>  
> -static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
> -				      u64 phy_offset, struct page *page,
> -				      u64 start, u64 end, int mirror)
> +int verify_extent_buffer_read(struct btrfs_io_bio *io_bio,
> +			struct page *page,
> +			u64 start, u64 end, int mirror)
>  {
> -	u64 found_start;
> -	int found_level;
> +	struct address_space *mapping = (io_bio->bio).bi_io_vec->bv_page->mapping;
> +	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
> +	struct extent_buffer_head *ebh;
>  	struct extent_buffer *eb;
>  	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
> -	int ret = 0;
> +	unsigned long num_pages;
> +	unsigned long i;
> +	u64 found_start;
> +	int found_level;
>  	int reads_done;
> +	int ret = 0;
>  
>  	if (!page->private)
>  		goto out;
>  
>  	eb = (struct extent_buffer *)page->private;
> +	do {
> +		if ((eb->start <= start) && (eb->start + eb->len - 1 > start))
> +			break;
> +	} while ((eb = eb->eb_next) != NULL);
> +
> +	BUG_ON(!eb);
> +
> +	ebh = eb_head(eb);
>  
>  	/* the pending IO might have been the only thing that kept this buffer
>  	 * in memory.  Make sure we have a ref for all this other checks
>  	 */
>  	extent_buffer_get(eb);
>  
> -	reads_done = atomic_dec_and_test(&eb->io_pages);
> +	reads_done = atomic_dec_and_test(&ebh->io_bvecs);
>  	if (!reads_done)
>  		goto err;
>  
> @@ -683,28 +696,34 @@ err:
>  		 * again, we have to make sure it has something
>  		 * to decrement
>  		 */
> -		atomic_inc(&eb->io_pages);
> +		atomic_inc(&eb_head(eb)->io_bvecs);
>  		clear_extent_buffer_uptodate(eb);
>  	}
> +
> +	/*
> +	  We never read more than one extent buffer from a page at a time.
> +	  So unlocking the page here should be fine.
> +	 */
> +	if (reads_done) {
> +		num_pages = num_extent_pages(eb->start, eb->len);
> +		for (i = 0; i < num_pages; i++) {
> +			page = eb_head(eb)->pages[i];
> +			unlock_page(page);
> +		}
> +
> +		/*
> +		  We don't need to add a check to see if
> +		  extent_io_tree->track_uptodate is set or not, Since
> +		  this function only deals with extent buffers.
> +		*/
> +		unlock_extent(tree, eb->start, eb->start + eb->len - 1);
> +	}
> +
>  	free_extent_buffer(eb);
>  out:
>  	return ret;
>  }
>  
> -static int btree_io_failed_hook(struct page *page, int failed_mirror)
> -{
> -	struct extent_buffer *eb;
> -	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
> -
> -	eb = (struct extent_buffer *)page->private;
> -	set_bit(EXTENT_BUFFER_READ_ERR, &eb->ebflags);
> -	eb->read_mirror = failed_mirror;
> -	atomic_dec(&eb->io_pages);
> -	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->ebflags))
> -		btree_readahead_hook(root, eb, eb->start, -EIO);
> -	return -EIO;	/* we fixed nothing */
> -}
> -
>  static void end_workqueue_bio(struct bio *bio, int err)
>  {
>  	struct btrfs_end_io_wq *end_io_wq = bio->bi_private;
> @@ -4349,8 +4368,6 @@ static int btrfs_cleanup_transaction(struct btrfs_root *root)
>  }
>  
>  static const struct extent_io_ops btree_extent_io_ops = {
> -	.readpage_end_io_hook = btree_readpage_end_io_hook,
> -	.readpage_io_failed_hook = btree_io_failed_hook,
>  	.submit_bio_hook = btree_submit_bio_hook,
>  	/* note we're sharing with inode.c for the merge bio hook */
>  	.merge_bio_hook = btrfs_merge_bio_hook,
> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> index d4cbfee..c69076c 100644
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -111,6 +111,9 @@ static inline void btrfs_put_fs_root(struct btrfs_root *root)
>  		kfree(root);
>  }
>  
> +int verify_extent_buffer_read(struct btrfs_io_bio *io_bio,
> +			struct page *page,
> +			u64 start, u64 end, int mirror);
>  void btrfs_mark_buffer_dirty(struct extent_buffer *buf);
>  int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
>  			  int atomic);
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index a7e715a..76a6e39 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -14,6 +14,7 @@
>  #include "extent_io.h"
>  #include "extent_map.h"
>  #include "ctree.h"
> +#include "disk-io.h"
>  #include "btrfs_inode.h"
>  #include "volumes.h"
>  #include "check-integrity.h"
> @@ -2179,7 +2180,7 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
>  		struct page *p = eb_head(eb)->pages[i];
>  
>  		ret = repair_io_failure(root->fs_info->btree_inode, start,
> -					PAGE_CACHE_SIZE, start, p,
> +					eb->len, start, p,
>  					start - page_offset(p), mirror_num);
>  		if (ret)
>  			break;
> @@ -3706,6 +3707,77 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
>  	return ret;
>  }
>  
> +static void end_bio_extent_buffer_readpage(struct bio *bio, int err)
> +{
> +	struct address_space *mapping = bio->bi_io_vec->bv_page->mapping;
> +	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
> +	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> +	struct extent_buffer *eb;
> +	struct btrfs_root *root;
> +	struct bio_vec *bvec;
> +	struct page *page;
> +	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
> +	u64 start;
> +	u64 end;
> +	int mirror;
> +	int ret;
> +	int i;
> +
> +	if (err)
> +		uptodate = 0;
> +
> +	bio_for_each_segment_all(bvec, bio, i) {
> +		page = bvec->bv_page;
> +		root = BTRFS_I(page->mapping->host)->root;
> +
> +		start = page_offset(page) + bvec->bv_offset;
> +		end = start + bvec->bv_len - 1;
> +
> +		if (!page->private) {
> +			unlock_page(page);
> +			unlock_extent(tree, start, end);
> +			continue;
> +		}
> +
> +		eb = (struct extent_buffer *)page->private;
> +
> +		do {
> +			/*
> +			  read_extent_buffer_pages() does not start
> +			  I/O on PG_uptodate pages. Hence the bio may
> +			  map only part of the extent buffer.
> +			 */
> +			if ((eb->start <= start) && (eb->start + eb->len - 1 > start))
> +				break;
> +		} while ((eb = eb->eb_next) != NULL);
> +
> +		BUG_ON(!eb);
> +
> +		mirror = io_bio->mirror_num;
> +
> +		if (uptodate) {
> +			ret = verify_extent_buffer_read(io_bio, page, start,
> +							end, mirror);
> +			if (ret)
> +				uptodate = 0;
> +		}
> +
> +		if (!uptodate) {
> +			set_bit(EXTENT_BUFFER_READ_ERR, &eb->ebflags);
> +			eb->read_mirror = mirror;
> +			atomic_dec(&eb_head(eb)->io_bvecs);
> +			if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD,
> +						&eb->ebflags))
> +				btree_readahead_hook(root, eb, eb->start,
> +						-EIO);
> +			ClearPageUptodate(page);
> +			SetPageError(page);
> +		}
> +	}
> +
> +	bio_put(bio);
> +}
> +
>  static void end_extent_buffer_writeback(struct extent_buffer *eb)
>  {
>  	clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags);
> @@ -5330,6 +5402,9 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
>  			     struct extent_buffer *eb, u64 start, int wait,
>  			     get_extent_t *get_extent, int mirror_num)
>  {
> +	struct inode *inode = tree->mapping->host;
> +	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> +	struct extent_state *cached_state = NULL;
>  	unsigned long i;
>  	unsigned long start_i;
>  	struct page *page;
> @@ -5376,15 +5451,31 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
>  
>  	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->ebflags);
>  	eb->read_mirror = 0;
> -	atomic_set(&eb->io_pages, num_reads);
> +	atomic_set(&eb_head(eb)->io_bvecs, num_reads);
>  	for (i = start_i; i < num_pages; i++) {
>  		page = eb_head(eb)->pages[i];
>  		if (!PageUptodate(page)) {
>  			ClearPageError(page);
> -			err = __extent_read_full_page(tree, page,
> -						      get_extent, &bio,
> -						      mirror_num, &bio_flags,
> -						      READ | REQ_META);
> +			if (eb->len < PAGE_CACHE_SIZE) {
> +				lock_extent_bits(tree, eb->start, eb->start + eb->len - 1, 0,
> +							&cached_state);
> +				err = submit_extent_page(READ | REQ_META, tree,
> +							page, eb->start >> 9,
> +							eb->len, eb->start - page_offset(page),
> +							fs_info->fs_devices->latest_bdev,
> +							&bio, -1, end_bio_extent_buffer_readpage,
> +							mirror_num, bio_flags, bio_flags);
> +			} else {
> +				lock_extent_bits(tree, page_offset(page),
> +						page_offset(page) + PAGE_CACHE_SIZE - 1,
> +						0, &cached_state);
> +				err = submit_extent_page(READ | REQ_META, tree,
> +							page, page_offset(page) >> 9,
> +							PAGE_CACHE_SIZE, 0,
> +							fs_info->fs_devices->latest_bdev,
> +							&bio, -1, end_bio_extent_buffer_readpage,
> +							mirror_num, bio_flags, bio_flags);
> +			}
>  			if (err)
>  				ret = err;
>  		} else {
> @@ -5405,10 +5496,11 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
>  	for (i = start_i; i < num_pages; i++) {
>  		page = eb_head(eb)->pages[i];
>  		wait_on_page_locked(page);
> -		if (!PageUptodate(page))
> -			ret = -EIO;
>  	}
>  
> +	if (!extent_buffer_uptodate(eb))
> +		ret = -EIO;
> +
>  	return ret;
>  
>  unlock_exit:
> -- 
> 2.1.0
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 09/21] Btrfs: subpagesize-blocksize: Direct I/O read: Work on sectorsized blocks.
  2015-06-01 15:22 ` [RFC PATCH V11 09/21] Btrfs: subpagesize-blocksize: Direct I/O read: Work " Chandan Rajendra
@ 2015-07-01 14:45   ` Liu Bo
  2015-07-03 10:05     ` Chandan Rajendra
  0 siblings, 1 reply; 47+ messages in thread
From: Liu Bo @ 2015-07-01 14:45 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Mon, Jun 01, 2015 at 08:52:44PM +0530, Chandan Rajendra wrote:
> The direct I/O read's endio and corresponding repair functions work on
> page-sized blocks. Fix this.
> 
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> ---
>  fs/btrfs/inode.c | 94 ++++++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 71 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index ac6a3f3..958e4e6 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7643,9 +7643,9 @@ static int btrfs_check_dio_repairable(struct inode *inode,
>  }
>  
>  static int dio_read_error(struct inode *inode, struct bio *failed_bio,
> -			  struct page *page, u64 start, u64 end,
> -			  int failed_mirror, bio_end_io_t *repair_endio,
> -			  void *repair_arg)
> +			struct page *page, unsigned int pgoff,
> +			u64 start, u64 end, int failed_mirror,
> +			bio_end_io_t *repair_endio, void *repair_arg)
>  {
>  	struct io_failure_record *failrec;
>  	struct bio *bio;
> @@ -7666,7 +7666,9 @@ static int dio_read_error(struct inode *inode, struct bio *failed_bio,
>  		return -EIO;
>  	}
>  
> -	if (failed_bio->bi_vcnt > 1)
> +	if ((failed_bio->bi_vcnt > 1)
> +		|| (failed_bio->bi_io_vec->bv_len
> +			> BTRFS_I(inode)->root->sectorsize))
>  		read_mode = READ_SYNC | REQ_FAILFAST_DEV;
>  	else
>  		read_mode = READ_SYNC;
> @@ -7674,7 +7676,7 @@ static int dio_read_error(struct inode *inode, struct bio *failed_bio,
>  	isector = start - btrfs_io_bio(failed_bio)->logical;
>  	isector >>= inode->i_sb->s_blocksize_bits;
>  	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
> -				      0, isector, repair_endio, repair_arg);
> +				pgoff, isector, repair_endio, repair_arg);
>  	if (!bio) {
>  		free_io_failure(inode, failrec);
>  		return -EIO;
> @@ -7704,12 +7706,17 @@ struct btrfs_retry_complete {
>  static void btrfs_retry_endio_nocsum(struct bio *bio, int err)
>  {
>  	struct btrfs_retry_complete *done = bio->bi_private;
> +	struct inode *inode;
>  	struct bio_vec *bvec;
>  	int i;
>  
>  	if (err)
>  		goto end;
>  
> +	BUG_ON(bio->bi_vcnt != 1);
> +	inode = bio->bi_io_vec->bv_page->mapping->host;
> +	BUG_ON(bio->bi_io_vec->bv_len != BTRFS_I(inode)->root->sectorsize);
> +
>  	done->uptodate = 1;
>  	bio_for_each_segment_all(bvec, bio, i)
>  		clean_io_failure(done->inode, done->start, bvec->bv_page, 0);
> @@ -7724,22 +7731,30 @@ static int __btrfs_correct_data_nocsum(struct inode *inode,
>  	struct bio_vec *bvec;
>  	struct btrfs_retry_complete done;
>  	u64 start;
> +	unsigned int pgoff;
> +	u32 sectorsize;
> +	int nr_sectors;
>  	int i;
>  	int ret;
>  
> +	sectorsize = BTRFS_I(inode)->root->sectorsize;
> +
>  	start = io_bio->logical;
>  	done.inode = inode;
>  
>  	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
> -try_again:
> +		nr_sectors = bvec->bv_len >> inode->i_sb->s_blocksize_bits;
> +		pgoff = bvec->bv_offset;
> +
> +next_block_or_try_again:
>  		done.uptodate = 0;
>  		done.start = start;
>  		init_completion(&done.done);
>  
> -		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
> -				     start + bvec->bv_len - 1,
> -				     io_bio->mirror_num,
> -				     btrfs_retry_endio_nocsum, &done);
> +		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page,
> +				pgoff, start, start + sectorsize - 1,
> +				io_bio->mirror_num,
> +				btrfs_retry_endio_nocsum, &done);
>  		if (ret)
>  			return ret;
>  
> @@ -7747,10 +7762,15 @@ try_again:
>  
>  		if (!done.uptodate) {
>  			/* We might have another mirror, so try again */
> -			goto try_again;
> +			goto next_block_or_try_again;
>  		}
>  
> -		start += bvec->bv_len;
> +		start += sectorsize;
> +
> +		if (nr_sectors--) {
> +			pgoff += sectorsize;
> +			goto next_block_or_try_again;
> +		}
>  	}
>  
>  	return 0;
> @@ -7760,7 +7780,9 @@ static void btrfs_retry_endio(struct bio *bio, int err)
>  {
>  	struct btrfs_retry_complete *done = bio->bi_private;
>  	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> +	struct inode * inode;
>  	struct bio_vec *bvec;
> +	u64 start;
>  	int uptodate;
>  	int ret;
>  	int i;
> @@ -7769,13 +7791,20 @@ static void btrfs_retry_endio(struct bio *bio, int err)
>  		goto end;
>  
>  	uptodate = 1;
> +
> +	start = done->start;
> +
> +	BUG_ON(bio->bi_vcnt != 1);
> +	inode = bio->bi_io_vec->bv_page->mapping->host;
> +	BUG_ON(bio->bi_io_vec->bv_len != BTRFS_I(inode)->root->sectorsize);
> +
>  	bio_for_each_segment_all(bvec, bio, i) {
>  		ret = __readpage_endio_check(done->inode, io_bio, i,
> -					     bvec->bv_page, 0,
> -					     done->start, bvec->bv_len);
> +					bvec->bv_page, bvec->bv_offset,
> +					done->start, bvec->bv_len);
>  		if (!ret)
>  			clean_io_failure(done->inode, done->start,
> -					 bvec->bv_page, 0);
> +					bvec->bv_page, bvec->bv_offset);
>  		else
>  			uptodate = 0;
>  	}
> @@ -7793,16 +7822,30 @@ static int __btrfs_subio_endio_read(struct inode *inode,
>  	struct btrfs_retry_complete done;
>  	u64 start;
>  	u64 offset = 0;
> +	u32 sectorsize;
> +	int nr_sectors;
> +	unsigned int pgoff;
> +	int csum_pos;
>  	int i;
>  	int ret;
> +	unsigned char blocksize_bits;
> +
> +	blocksize_bits = inode->i_sb->s_blocksize_bits;
> +	sectorsize = BTRFS_I(inode)->root->sectorsize;
>  
>  	err = 0;
>  	start = io_bio->logical;
>  	done.inode = inode;
>  
>  	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
> -		ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
> -					     0, start, bvec->bv_len);
> +		nr_sectors = bvec->bv_len >> blocksize_bits;
> +		pgoff = bvec->bv_offset;
> +next_block:
> +		csum_pos = offset >> blocksize_bits;
> +
> +		ret = __readpage_endio_check(inode, io_bio, csum_pos,
> +					bvec->bv_page, pgoff, start,
> +					sectorsize);
>  		if (likely(!ret))
>  			goto next;
>  try_again:
> @@ -7810,10 +7853,10 @@ try_again:
>  		done.start = start;
>  		init_completion(&done.done);
>  
> -		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
> -				     start + bvec->bv_len - 1,
> -				     io_bio->mirror_num,
> -				     btrfs_retry_endio, &done);
> +		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page,
> +				pgoff, start, start + sectorsize - 1,
> +				io_bio->mirror_num,
> +				btrfs_retry_endio, &done);
>  		if (ret) {
>  			err = ret;
>  			goto next;
> @@ -7826,8 +7869,13 @@ try_again:
>  			goto try_again;
>  		}
>  next:
> -		offset += bvec->bv_len;
> -		start += bvec->bv_len;
> +		offset += sectorsize;
> +		start += sectorsize;
> +

It'd be better to put an ASSERT(nr_sectors) in case some crazy things
happen.
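
Something along these lines at the top of the bvec loop, I guess (only a
sketch of the placement):

	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
		nr_sectors = bvec->bv_len >> blocksize_bits;
		ASSERT(nr_sectors); /* a zero here would mean a bogus bvec */
		pgoff = bvec->bv_offset;
		...
	}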

Thanks,

-liubo
> +		if (--nr_sectors) {
> +			pgoff += sectorsize;
> +			goto next_block;
> +		}
>  	}
>  
>  	return err;
> -- 
> 2.1.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 12/21] Btrfs: subpagesize-blocksize: Search for all ordered extents that could span across a page.
  2015-06-01 15:22 ` [RFC PATCH V11 12/21] Btrfs: subpagesize-blocksize: Search for all ordered extents that could span across a page Chandan Rajendra
@ 2015-07-01 14:47   ` Liu Bo
  2015-07-03 10:08     ` Chandan Rajendra
  0 siblings, 1 reply; 47+ messages in thread
From: Liu Bo @ 2015-07-01 14:47 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Mon, Jun 01, 2015 at 08:52:47PM +0530, Chandan Rajendra wrote:
> In the subpagesize-blocksize scenario it is not sufficient to search using the
> first byte of the page to make sure that there are no ordered extents
> present across the page. Fix this.
> 
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> ---
>  fs/btrfs/extent_io.c | 3 ++-
>  fs/btrfs/inode.c     | 4 ++--
>  2 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 14b4e05..0b017e1 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3244,7 +3244,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
>  
>  	while (1) {
>  		lock_extent(tree, start, end);
> -		ordered = btrfs_lookup_ordered_extent(inode, start);
> +		ordered = btrfs_lookup_ordered_range(inode, start,
> +						PAGE_CACHE_SIZE);

A minor suggestion: it'd be better to include the new prototype in the
same patch, which will benefit later cherry-picking or reverting.

Thanks,

-liubo
>  		if (!ordered)
>  			break;
>  		unlock_extent(tree, start, end);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e9bab73..8b4aaed 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1976,7 +1976,7 @@ again:
>  	if (PagePrivate2(page))
>  		goto out;
>  
> -	ordered = btrfs_lookup_ordered_extent(inode, page_start);
> +	ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_CACHE_SIZE);
>  	if (ordered) {
>  		unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start,
>  				     page_end, &cached_state, GFP_NOFS);
> @@ -8513,7 +8513,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
>  
>  	if (!inode_evicting)
>  		lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
> -	ordered = btrfs_lookup_ordered_extent(inode, page_start);
> +	ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_CACHE_SIZE);
>  	if (ordered) {
>  		/*
>  		 * IO on this page will never be started, so we need
> -- 
> 2.1.0
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 05/21] Btrfs: subpagesize-blocksize: Read tree blocks whose size is < PAGE_SIZE.
  2015-07-01 14:40   ` Liu Bo
@ 2015-07-03 10:02     ` Chandan Rajendra
  0 siblings, 0 replies; 47+ messages in thread
From: Chandan Rajendra @ 2015-07-03 10:02 UTC (permalink / raw)
  To: bo.li.liu; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Wednesday 01 Jul 2015 22:40:08 Liu Bo wrote:
> On Mon, Jun 01, 2015 at 08:52:40PM +0530, Chandan Rajendra wrote:
> > In the case of subpagesize-blocksize, this patch makes it possible to read
> > only a single metadata block from the disk instead of all the metadata
> > blocks that map into a page.
> 
> I'm a bit curious about how much benefit is gained from reading a single
> block rather reading the whole page.
>
If we submit the whole page for reading, we will not know the extent buffer
for which we actually initiated the read operation. In the
subpagesize-blocksize scenario, some of the "other" parts of the page may not
even contain valid data (since they may actually be "free" space). In
end_bio_extent_buffer_readpage(), we need to verify the contents of the
extent buffer that was read, hence we need to know the "actual" extent buffer
that was submitted for read I/O. This information can be obtained from the
[bvec->bv_offset, bvec->bv_offset + bvec->bv_len - 1] combination.
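
A toy model of that lookup over the eb_next chain (plain user-space C that
mirrors the loop in the patch, not the kernel structures):

	#include <stdio.h>
	#include <stddef.h>

	struct toy_eb {
		unsigned long long start;
		unsigned long len;
		struct toy_eb *eb_next;
	};

	/* pick the extent buffer whose byte range covers 'start' */
	static struct toy_eb *find_eb(struct toy_eb *eb, unsigned long long start)
	{
		do {
			if (eb->start <= start && eb->start + eb->len - 1 > start)
				return eb;
		} while ((eb = eb->eb_next) != NULL);
		return NULL;
	}

	int main(void)
	{
		/* two 2K tree blocks sharing one 4K page */
		struct toy_eb b = { 2048, 2048, NULL };
		struct toy_eb a = { 0, 2048, &b };

		/* a bvec at bv_offset 2048 resolves to the second buffer */
		printf("eb start = %llu\n", find_eb(&a, 2048)->start);
		return 0;
	}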

On the other hand, I found an issue with the patchset. For the
subpagesize-blocksize scenario, set_extent_buffer_uptodate() sets the
PG_uptodate flag only when all the extent buffers present in the page have
EXTENT_BUFFER_UPTODATE set. For some pages this can never happen, because one
or more parts of the page may be "free" space. Currently this isn't causing
any issue, since the check "test_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags)" in
read_extent_buffer_pages() causes control to return from the function without
initiating any read I/O operation. This needs to be fixed.
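
To spell the problem out, the page-level flag is in effect the logical AND
over every block slot that maps into the page, and a free-space slot
contributes a permanent "false". Schematically (toy code, not the kernel
logic):

	#include <stdbool.h>

	/* the page can become uptodate only if every block mapping into
	 * it is; a slot that is free space never does, so neither does
	 * the page */
	static bool page_uptodate(const bool block_uptodate[], int nr_blocks)
	{
		int i;

		for (i = 0; i < nr_blocks; i++)
			if (!block_uptodate[i])
				return false;
		return true;
	}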

> Thanks,
> 
> -liubo

-- 
chandan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 09/21] Btrfs: subpagesize-blocksize: Direct I/O read: Work on sectorsized blocks.
  2015-07-01 14:45   ` Liu Bo
@ 2015-07-03 10:05     ` Chandan Rajendra
  0 siblings, 0 replies; 47+ messages in thread
From: Chandan Rajendra @ 2015-07-03 10:05 UTC (permalink / raw)
  To: bo.li.liu; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Wednesday 01 Jul 2015 22:45:00 Liu Bo wrote:
> On Mon, Jun 01, 2015 at 08:52:44PM +0530, Chandan Rajendra wrote:
> > The direct I/O read's endio and corresponding repair functions work on
> > page sized blocks. Fix this.
> > 
> > Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> > ---
> >  try_again:
> > @@ -7810,10 +7853,10 @@ try_again:
> >  		done.start = start;
> >  		init_completion(&done.done);
> > 
> > -		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
> > -				     start + bvec->bv_len - 1,
> > -				     io_bio->mirror_num,
> > -				     btrfs_retry_endio, &done);
> > +		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page,
> > +				pgoff, start, start + sectorsize - 1,
> > +				io_bio->mirror_num,
> > +				btrfs_retry_endio, &done);
> > 
> >  		if (ret) {
> >  		
> >  			err = ret;
> >  			goto next;
> > 
> > @@ -7826,8 +7869,13 @@ try_again:
> >  			goto try_again;
> >  		
> >  		}
> >  
> >  next:
> > -		offset += bvec->bv_len;
> > -		start += bvec->bv_len;
> > +		offset += sectorsize;
> > +		start += sectorsize;
> > +
> 
> It'd better to put a ASSERT(nr_sectors) in case some crazy things
> happen.
> 

Yes, I will add that statement in a future version of the patchset.

> 
> > +		if (--nr_sectors) {
> > +			pgoff += sectorsize;
> > +			goto next_block;
> > +		}
> > 
> >  	}
> >  	
> >  	return err;

-- 
chandan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 12/21] Btrfs: subpagesize-blocksize: Search for all ordered extents that could span across a page.
  2015-07-01 14:47   ` Liu Bo
@ 2015-07-03 10:08     ` Chandan Rajendra
  2015-07-06  3:17       ` Liu Bo
  0 siblings, 1 reply; 47+ messages in thread
From: Chandan Rajendra @ 2015-07-03 10:08 UTC (permalink / raw)
  To: bo.li.liu; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Wednesday 01 Jul 2015 22:47:10 Liu Bo wrote:
> On Mon, Jun 01, 2015 at 08:52:47PM +0530, Chandan Rajendra wrote:
> > In subpagesize-blocksize scenario it is not sufficient to search using the
> > first byte of the page to make sure that there are no ordered extents
> > present across the page. Fix this.
> > 
> > Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> > ---
> > 
> >  fs/btrfs/extent_io.c | 3 ++-
> >  fs/btrfs/inode.c     | 4 ++--
> >  2 files changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 14b4e05..0b017e1 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -3244,7 +3244,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
> >  	while (1) {
> >  	
> >  		lock_extent(tree, start, end);
> > 
> > -		ordered = btrfs_lookup_ordered_extent(inode, start);
> > +		ordered = btrfs_lookup_ordered_range(inode, start,
> > +						PAGE_CACHE_SIZE);
> 
> A minor suggestion, it'd be better to include the new prototype in the
> same patch, which will be benefit to later cherry-picking or reverting.
> 

Liu, the definition of btrfs_lookup_ordered_range() is already part of
the mainline kernel.

> 
> >  		if (!ordered)
> >  		
> >  			break;
> >  		
> >  		unlock_extent(tree, start, end);
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index e9bab73..8b4aaed 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > 
> > @@ -1976,7 +1976,7 @@ again:
> >  	if (PagePrivate2(page))
> >  	
> >  		goto out;
> > 
> > -	ordered = btrfs_lookup_ordered_extent(inode, page_start);
> > +	ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_CACHE_SIZE);
> > 
> >  	if (ordered) {
> >  	
> >  		unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start,
> >  		
> >  				     page_end, &cached_state, GFP_NOFS);
> > 
> > @@ -8513,7 +8513,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
> > 
> >  	if (!inode_evicting)
> >  	
> >  		lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
> > 
> > -	ordered = btrfs_lookup_ordered_extent(inode, page_start);
> > +	ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_CACHE_SIZE);
> > 
> >  	if (ordered) {
> >  	
> >  		/*
> >  		
> >  		 * IO on this page will never be started, so we need

-- 
chandan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 12/21] Btrfs: subpagesize-blocksize: Search for all ordered extents that could span across a page.
  2015-07-03 10:08     ` Chandan Rajendra
@ 2015-07-06  3:17       ` Liu Bo
  2015-07-06 10:49         ` Chandan Rajendra
  0 siblings, 1 reply; 47+ messages in thread
From: Liu Bo @ 2015-07-06  3:17 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Fri, Jul 03, 2015 at 03:38:00PM +0530, Chandan Rajendra wrote:
> On Wednesday 01 Jul 2015 22:47:10 Liu Bo wrote:
> > On Mon, Jun 01, 2015 at 08:52:47PM +0530, Chandan Rajendra wrote:
> > > In subpagesize-blocksize scenario it is not sufficient to search using the
> > > first byte of the page to make sure that there are no ordered extents
> > > present across the page. Fix this.
> > > 
> > > Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> > > ---
> > > 
> > >  fs/btrfs/extent_io.c | 3 ++-
> > >  fs/btrfs/inode.c     | 4 ++--
> > >  2 files changed, 4 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > index 14b4e05..0b017e1 100644
> > > --- a/fs/btrfs/extent_io.c
> > > +++ b/fs/btrfs/extent_io.c
> > > @@ -3244,7 +3244,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
> > > 
> > >  	while (1) {
> > >  	
> > >  		lock_extent(tree, start, end);
> > > 
> > > -		ordered = btrfs_lookup_ordered_extent(inode, start);
> > > +		ordered = btrfs_lookup_ordered_range(inode, start,
> > > +						PAGE_CACHE_SIZE);
> > 
> > A minor suggestion, it'd be better to include the new prototype in the
> > same patch, which will be benefit to later cherry-picking or reverting.
> > 
> 
> Liu, The definition of btrfs_lookup_ordered_range() is already part of
> the mainline kernel.

Ah, I didn't recognize the difference between btrfs_lookup_ordered_extent and
btrfs_lookup_ordered_range, sorry.

> 
> > 
> > >  		if (!ordered)
> > >  		
> > >  			break;
> > >  		
> > >  		unlock_extent(tree, start, end);
> > > 
> > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > index e9bab73..8b4aaed 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > 
> > > @@ -1976,7 +1976,7 @@ again:
> > >  	if (PagePrivate2(page))
> > >  	
> > >  		goto out;
> > > 
> > > -	ordered = btrfs_lookup_ordered_extent(inode, page_start);
> > > +	ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_CACHE_SIZE);
> > > 
> > >  	if (ordered) {
> > >  	
> > >  		unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start,
> > >  		
> > >  				     page_end, &cached_state, GFP_NOFS);
> > > 
> > > @@ -8513,7 +8513,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
> > > 
> > >  	if (!inode_evicting)
> > >  	
> > >  		lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
> > > 
> > > -	ordered = btrfs_lookup_ordered_extent(inode, page_start);
> > > +	ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_CACHE_SIZE);

It's possible for a page to hold two (or more) ordered extents here, so a
while loop is necessary to ensure that every ordered extent is processed
properly.
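
That is, instead of a single lookup, something like the following (just a
sketch built on the existing btrfs_lookup_ordered_range() and
btrfs_put_ordered_extent() helpers):

	u64 start = page_start;
	struct btrfs_ordered_extent *ordered;

	while (start <= page_end &&
	       (ordered = btrfs_lookup_ordered_range(inode, start,
						     page_end - start + 1))) {
		/* ... handle this ordered extent ... */
		start = ordered->file_offset + ordered->len;
		btrfs_put_ordered_extent(ordered);
	}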


Thanks,

-liubo

> > > 
> > >  	if (ordered) {
> > >  	
> > >  		/*
> > >  		
> > >  		 * IO on this page will never be started, so we need
> 
> -- 
> chandan
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 11/21] Btrfs: subpagesize-blocksize: btrfs_page_mkwrite: Reserve space in sectorsized units.
  2015-06-01 15:22 ` [RFC PATCH V11 11/21] Btrfs: subpagesize-blocksize: btrfs_page_mkwrite: Reserve space in " Chandan Rajendra
@ 2015-07-06  3:18   ` Liu Bo
  0 siblings, 0 replies; 47+ messages in thread
From: Liu Bo @ 2015-07-06  3:18 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Mon, Jun 01, 2015 at 08:52:46PM +0530, Chandan Rajendra wrote:
> In the subpagesize-blocksize scenario, if i_size falls in a block which is not
> the last block in the page, then the space to be reserved should be calculated
> appropriately.
> 
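
To make the calculation concrete (illustrative numbers only): with a 4K page
and a 2K sectorsize, if i_size falls 1500 bytes into its last page, then

	reserved_space = round_up(size - page_start, root->sectorsize);
	/* = round_up(1500, 2048) = 2048, so 4096 - 2048 = 2048 bytes
	   of the page's reservation get released */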

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> ---
>  fs/btrfs/inode.c | 36 +++++++++++++++++++++++++++++++-----
>  1 file changed, 31 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9486e61..e9bab73 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8601,11 +8601,24 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  	loff_t size;
>  	int ret;
>  	int reserved = 0;
> +	u64 reserved_space;
>  	u64 page_start;
>  	u64 page_end;
> +	u64 end;
> +
> +	reserved_space = PAGE_CACHE_SIZE;
>  
>  	sb_start_pagefault(inode->i_sb);
> -	ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
> +
> +	/*
> +	  Reserving delalloc space after obtaining the page lock can lead to
> +	  deadlock. For example, if a dirty page is locked by this function
> +	  and the call to btrfs_delalloc_reserve_space() ends up triggering
> +	  dirty page write out, then the btrfs_writepage() function could
> +	  end up waiting indefinitely to get a lock on the page currently
> +	  being processed by btrfs_page_mkwrite() function.
> +	 */
> +	ret  = btrfs_delalloc_reserve_space(inode, reserved_space);
>  	if (!ret) {
>  		ret = file_update_time(vma->vm_file);
>  		reserved = 1;
> @@ -8626,6 +8639,7 @@ again:
>  	size = i_size_read(inode);
>  	page_start = page_offset(page);
>  	page_end = page_start + PAGE_CACHE_SIZE - 1;
> +	end = page_end;
>  
>  	if ((page->mapping != inode->i_mapping) ||
>  	    (page_start >= size)) {
> @@ -8641,7 +8655,7 @@ again:
>  	 * we can't set the delalloc bits if there are pending ordered
>  	 * extents.  Drop our locks and wait for them to finish
>  	 */
> -	ordered = btrfs_lookup_ordered_extent(inode, page_start);
> +	ordered = btrfs_lookup_ordered_range(inode, page_start, page_end);
>  	if (ordered) {
>  		unlock_extent_cached(io_tree, page_start, page_end,
>  				     &cached_state, GFP_NOFS);
> @@ -8651,6 +8665,18 @@ again:
>  		goto again;
>  	}
>  
> +	if (page->index == ((size - 1) >> PAGE_CACHE_SHIFT)) {
> +		reserved_space = round_up(size - page_start, root->sectorsize);
> +		if (reserved_space < PAGE_CACHE_SIZE) {
> +			end = page_start + reserved_space - 1;
> +			spin_lock(&BTRFS_I(inode)->lock);
> +			BTRFS_I(inode)->outstanding_extents++;
> +			spin_unlock(&BTRFS_I(inode)->lock);
> +			btrfs_delalloc_release_space(inode,
> +						PAGE_CACHE_SIZE - reserved_space);
> +		}
> +	}
> +
>  	/*
>  	 * XXX - page_mkwrite gets called every time the page is dirtied, even
>  	 * if it was already dirty, so for space accounting reasons we need to
> @@ -8658,12 +8684,12 @@ again:
>  	 * is probably a better way to do this, but for now keep consistent with
>  	 * prepare_pages in the normal write path.
>  	 */
> -	clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start, page_end,
> +	clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start, end,
>  			  EXTENT_DIRTY | EXTENT_DELALLOC |
>  			  EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
>  			  0, 0, &cached_state, GFP_NOFS);
>  
> -	ret = btrfs_set_extent_delalloc(inode, page_start, page_end,
> +	ret = btrfs_set_extent_delalloc(inode, page_start, end,
>  					&cached_state);
>  	if (ret) {
>  		unlock_extent_cached(io_tree, page_start, page_end,
> @@ -8706,7 +8732,7 @@ out_unlock:
>  	}
>  	unlock_page(page);
>  out:
> -	btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
> +	btrfs_delalloc_release_space(inode, reserved_space);
>  out_noreserve:
>  	sb_end_pagefault(inode->i_sb);
>  	return ret;
> -- 
> 2.1.0
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 13/21] Btrfs: subpagesize-blocksize: Deal with partial ordered extent allocations.
  2015-06-01 15:22 ` [RFC PATCH V11 13/21] Btrfs: subpagesize-blocksize: Deal with partial ordered extent allocations Chandan Rajendra
@ 2015-07-06 10:06   ` Liu Bo
  2015-07-07 13:38     ` Chandan Rajendra
  0 siblings, 1 reply; 47+ messages in thread
From: Liu Bo @ 2015-07-06 10:06 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Mon, Jun 01, 2015 at 08:52:48PM +0530, Chandan Rajendra wrote:
> In the subpagesize-blocksize scenario, extent allocations for only some of the
> dirty blocks of a page can succeed, while allocation for the rest of the blocks
> can fail. This patch allows I/O against such partially allocated ordered
> extents to be submitted.
> 
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> ---
>  fs/btrfs/extent_io.c | 27 ++++++++++++++-------------
>  fs/btrfs/inode.c     | 35 ++++++++++++++++++++++-------------
>  2 files changed, 36 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 0b017e1..0110abc 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1850,17 +1850,23 @@ int extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
>  			if (page_ops & PAGE_SET_PRIVATE2)
>  				SetPagePrivate2(pages[i]);
>  
> +			if (page_ops & PAGE_SET_ERROR)
> +				SetPageError(pages[i]);
> +
>  			if (pages[i] == locked_page) {
>  				page_cache_release(pages[i]);
>  				continue;
>  			}
> -			if (page_ops & PAGE_CLEAR_DIRTY)
> +
> +			if ((page_ops & PAGE_CLEAR_DIRTY)
> +				&& !PagePrivate2(pages[i]))
>  				clear_page_dirty_for_io(pages[i]);
> -			if (page_ops & PAGE_SET_WRITEBACK)
> +			if ((page_ops & PAGE_SET_WRITEBACK)
> +				&& !PagePrivate2(pages[i]))
>  				set_page_writeback(pages[i]);
> -			if (page_ops & PAGE_SET_ERROR)
> -				SetPageError(pages[i]);
> -			if (page_ops & PAGE_END_WRITEBACK)
> +
> +			if ((page_ops & PAGE_END_WRITEBACK)
> +				&& !PagePrivate2(pages[i]))
>  				end_page_writeback(pages[i]);
>  			if (page_ops & PAGE_UNLOCK)
>  				unlock_page(pages[i]);
> @@ -2550,7 +2556,7 @@ int end_extent_writepage(struct page *page, int err, u64 start, u64 end)
>  			uptodate = 0;
>  	}
>  
> -	if (!uptodate) {
> +	if (!uptodate || PageError(page)) {
>  		ClearPageUptodate(page);
>  		SetPageError(page);
>  		ret = ret < 0 ? ret : -EIO;
> @@ -3340,7 +3346,6 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode,
>  					       nr_written);
>  		/* File system has been set read-only */
>  		if (ret) {
> -			SetPageError(page);
>  			/* fill_delalloc should be return < 0 for error
>  			 * but just in case, we use > 0 here meaning the
>  			 * IO is started, so we don't want to return > 0
> @@ -3561,7 +3566,6 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
>  	struct inode *inode = page->mapping->host;
>  	struct extent_page_data *epd = data;
>  	u64 start = page_offset(page);
> -	u64 page_end = start + PAGE_CACHE_SIZE - 1;
>  	int ret;
>  	int nr = 0;
>  	size_t pg_offset = 0;
> @@ -3606,7 +3610,7 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
>  	ret = writepage_delalloc(inode, page, wbc, epd, start, &nr_written);
>  	if (ret == 1)
>  		goto done_unlocked;
> -	if (ret)
> +	if (ret && !PagePrivate2(page))
>  		goto done;
>  
>  	ret = __extent_writepage_io(inode, page, wbc, epd,
> @@ -3620,10 +3624,7 @@ done:
>  		set_page_writeback(page);
>  		end_page_writeback(page);
>  	}
> -	if (PageError(page)) {
> -		ret = ret < 0 ? ret : -EIO;
> -		end_extent_writepage(page, ret, start, page_end);
> -	}
> +
>  	unlock_page(page);
>  	return ret;
>  
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8b4aaed..bff60c6 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -925,6 +925,8 @@ static noinline int cow_file_range(struct inode *inode,
>  	struct btrfs_key ins;
>  	struct extent_map *em;
>  	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
> +	struct btrfs_ordered_extent *ordered;
> +	unsigned long page_ops, extent_ops;
>  	int ret = 0;
>  
>  	if (btrfs_is_free_space_inode(inode)) {
> @@ -969,8 +971,6 @@ static noinline int cow_file_range(struct inode *inode,
>  	btrfs_drop_extent_cache(inode, start, start + num_bytes - 1, 0);
>  
>  	while (disk_num_bytes > 0) {
> -		unsigned long op;
> -
>  		cur_alloc_size = disk_num_bytes;
>  		ret = btrfs_reserve_extent(root, cur_alloc_size,
>  					   root->sectorsize, 0, alloc_hint,
> @@ -1023,7 +1023,7 @@ static noinline int cow_file_range(struct inode *inode,
>  			ret = btrfs_reloc_clone_csums(inode, start,
>  						      cur_alloc_size);
>  			if (ret)
> -				goto out_drop_extent_cache;
> +				goto out_remove_ordered_extent;
>  		}
>  
>  		if (disk_num_bytes < cur_alloc_size)
> @@ -1036,13 +1036,12 @@ static noinline int cow_file_range(struct inode *inode,
>  		 * Do set the Private2 bit so we know this page was properly
>  		 * setup for writepage
>  		 */
> -		op = unlock ? PAGE_UNLOCK : 0;
> -		op |= PAGE_SET_PRIVATE2;
> -
> +		page_ops = unlock ? PAGE_UNLOCK : 0;
> +		page_ops |= PAGE_SET_PRIVATE2;
> +		extent_ops = EXTENT_LOCKED | EXTENT_DELALLOC;
>  		extent_clear_unlock_delalloc(inode, start,
> -					     start + ram_size - 1, locked_page,
> -					     EXTENT_LOCKED | EXTENT_DELALLOC,
> -					     op);
> +					start + ram_size - 1, locked_page,
> +					extent_ops, page_ops);
>  		disk_num_bytes -= cur_alloc_size;
>  		num_bytes -= cur_alloc_size;
>  		alloc_hint = ins.objectid + ins.offset;
> @@ -1051,16 +1050,26 @@ static noinline int cow_file_range(struct inode *inode,
>  out:
>  	return ret;
>  
> +out_remove_ordered_extent:
> +	ordered = btrfs_lookup_ordered_extent(inode, ins.objectid);
> +	BUG_ON(!ordered);
> +	btrfs_remove_ordered_extent(inode, ordered);
> +

There are two problems here:

1. ins.objectid refers to a block address, while
btrfs_lookup_ordered_extent() expects a file offset.

2. Removing the ordered extent is not enough to clean it up: not only
does this ordered extent remain in memory, but the reserved space
accounting also needs to be released.

If we have to do it this way, I'd copy how btrfs_finish_ordered_io()
does the cleanup job; however, I'd prefer to call end_extent_writepage()
directly here to keep things as simple as possible.
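
Something along these lines is what I have in mind (a completely
untested sketch; note that end_extent_writepage() takes a file offset
range, so it would be 'start' here, not the block address in
ins.objectid):

out_remove_ordered_extent:
	/*
	 * Untested sketch: hand the failed range to the writeback
	 * error path so the ordered extent accounting is unwound the
	 * same way a failed writepage would unwind it.
	 */
	end_extent_writepage(locked_page, ret, start,
			     start + ram_size - 1);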

Thanks,

-liubo

>  out_drop_extent_cache:
>  	btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0);
> +
>  out_reserve:
>  	btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1);
> +
>  out_unlock:
> +	page_ops = unlock ? PAGE_UNLOCK : 0;
> +	page_ops |= PAGE_CLEAR_DIRTY | PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK
> +		| PAGE_SET_ERROR;
> +	extent_ops = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING
> +		| EXTENT_DEFRAG;
> +
>  	extent_clear_unlock_delalloc(inode, start, end, locked_page,
> -				     EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
> -				     EXTENT_DELALLOC | EXTENT_DEFRAG,
> -				     PAGE_UNLOCK | PAGE_CLEAR_DIRTY |
> -				     PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK);
> +				extent_ops, page_ops);
>  	goto out;
>  }
>  
> -- 
> 2.1.0
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 12/21] Btrfs: subpagesize-blocksize: Search for all ordered extents that could span across a page.
  2015-07-06  3:17       ` Liu Bo
@ 2015-07-06 10:49         ` Chandan Rajendra
  0 siblings, 0 replies; 47+ messages in thread
From: Chandan Rajendra @ 2015-07-06 10:49 UTC (permalink / raw)
  To: bo.li.liu; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Monday 06 Jul 2015 11:17:38 Liu Bo wrote:
> On Fri, Jul 03, 2015 at 03:38:00PM +0530, Chandan Rajendra wrote:
> > On Wednesday 01 Jul 2015 22:47:10 Liu Bo wrote:
> > > On Mon, Jun 01, 2015 at 08:52:47PM +0530, Chandan Rajendra wrote:
> > > > In subpagesize-blocksize scenario it is not sufficient to search using
> > > > the
> > > > first byte of the page to make sure that there are no ordered extents
> > > > present across the page. Fix this.
> > > > 
> > > > Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> > > > ---
> > > > 
> > > >  fs/btrfs/extent_io.c | 3 ++-
> > > >  fs/btrfs/inode.c     | 4 ++--
> > > >  2 files changed, 4 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > > index 14b4e05..0b017e1 100644
> > > > --- a/fs/btrfs/extent_io.c
> > > > +++ b/fs/btrfs/extent_io.c
> > > > @@ -3244,7 +3244,8 @@ static int __extent_read_full_page(struct
> > > > extent_io_tree *tree,>
> > > > 
> > > >  	while (1) {
> > > >  	
> > > >  		lock_extent(tree, start, end);
> > > > 
> > > > -		ordered = btrfs_lookup_ordered_extent(inode, start);
> > > > +		ordered = btrfs_lookup_ordered_range(inode, start,
> > > > +						PAGE_CACHE_SIZE);
> > > 
> > > A minor suggestion, it'd be better to include the new prototype in the
> > > same patch, which will be benefit to later cherry-picking or reverting.
> > 
> > Liu, The definition of btrfs_lookup_ordered_range() is already part of
> > the mainline kernel.
> 
> Ah, I didn't recognize the difference between btrfs_lookup_ordered_extent()
> and btrfs_lookup_ordered_range(), sorry.
> 
> > > >  		if (!ordered)
> > > >  		
> > > >  			break;
> > > >  		
> > > >  		unlock_extent(tree, start, end);
> > > > 
> > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > > index e9bab73..8b4aaed 100644
> > > > --- a/fs/btrfs/inode.c
> > > > +++ b/fs/btrfs/inode.c
> > > > 
> > > > @@ -1976,7 +1976,7 @@ again:
> > > >  	if (PagePrivate2(page))
> > > >  	
> > > >  		goto out;
> > > > 
> > > > -	ordered = btrfs_lookup_ordered_extent(inode, page_start);
> > > > +	ordered = btrfs_lookup_ordered_range(inode, page_start,
> > > > PAGE_CACHE_SIZE);
> > > > 
> > > >  	if (ordered) {
> > > >  		unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start,
> > > >  				     page_end, &cached_state, GFP_NOFS);
> > > > 
> > > > @@ -8513,7 +8513,7 @@ static void btrfs_invalidatepage(struct page
> > > > *page,
> > > > unsigned int offset,>
> > > > 
> > > >  	if (!inode_evicting)
> > > >  		lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
> > 
> > > > -	ordered = btrfs_lookup_ordered_extent(inode, page_start);
> > > > +	ordered = btrfs_lookup_ordered_range(inode, page_start,
> > > > PAGE_CACHE_SIZE);
> 
> It's possible for a page to hold two (or more) ordered extents here, so a
> while loop is necessary to ensure that every ordered extent is processed
> properly.
>
>
Liu, sorry, I had introduced the loop in the patch
"[RFC PATCH V11 14/21] Btrfs: subpagesize-blocksize: Explicitly Track I/O
status of blocks of an ordered extent". I will pull the loop into this patch
for the next version of the patchset.
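
For reference, the loop as it currently stands in patch 14's
btrfs_writepage_end_io_hook() is roughly of this shape (the per-block
processing is elided):

	while (start < end) {
		ordered = btrfs_lookup_ordered_extent(inode, start);
		if (!ordered) {
			/* No ordered extent covers this block; move on. */
			start += root->sectorsize;
			continue;
		}
		/* ... process the blocks covered by 'ordered' ... */
		start = ordered->file_offset + ordered->len;
		btrfs_put_ordered_extent(ordered);
	}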

> Thanks,
> 
> -liubo
> 
> > > >  	if (ordered) {
> > > >  	
> > > >  		/*
> > > >  		
> > > >  		 * IO on this page will never be started, so we need

-- 
chandan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 13/21] Btrfs: subpagesize-blocksize: Deal with partial ordered extent allocations.
  2015-07-06 10:06   ` Liu Bo
@ 2015-07-07 13:38     ` Chandan Rajendra
  0 siblings, 0 replies; 47+ messages in thread
From: Chandan Rajendra @ 2015-07-07 13:38 UTC (permalink / raw)
  To: bo.li.liu; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Monday 06 Jul 2015 18:06:33 Liu Bo wrote:
> On Mon, Jun 01, 2015 at 08:52:48PM +0530, Chandan Rajendra wrote:
> > In subpagesize-blocksize scenario, extent allocations for only some of the
> > dirty blocks of a page can succeed, while allocation for rest of the
> > blocks
> > can fail. This patch allows I/O against such partially allocated ordered
> > extents to be submitted.
> > 
> > Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> > ---
> > 
> >  fs/btrfs/extent_io.c | 27 ++++++++++++++-------------
> >  fs/btrfs/inode.c     | 35 ++++++++++++++++++++++-------------
> >  2 files changed, 36 insertions(+), 26 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 0b017e1..0110abc 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -1850,17 +1850,23 @@ int extent_clear_unlock_delalloc(struct inode
> > *inode, u64 start, u64 end,> 
> >  			if (page_ops & PAGE_SET_PRIVATE2)
> >  			
> >  				SetPagePrivate2(pages[i]);
> > 
> > +			if (page_ops & PAGE_SET_ERROR)
> > +				SetPageError(pages[i]);
> > +
> > 
> >  			if (pages[i] == locked_page) {
> >  			
> >  				page_cache_release(pages[i]);
> >  				continue;
> >  			
> >  			}
> > 
> > -			if (page_ops & PAGE_CLEAR_DIRTY)
> > +
> > +			if ((page_ops & PAGE_CLEAR_DIRTY)
> > +				&& !PagePrivate2(pages[i]))
> > 
> >  				clear_page_dirty_for_io(pages[i]);
> > 
> > -			if (page_ops & PAGE_SET_WRITEBACK)
> > +			if ((page_ops & PAGE_SET_WRITEBACK)
> > +				&& !PagePrivate2(pages[i]))
> > 
> >  				set_page_writeback(pages[i]);
> > 
> > -			if (page_ops & PAGE_SET_ERROR)
> > -				SetPageError(pages[i]);
> > -			if (page_ops & PAGE_END_WRITEBACK)
> > +
> > +			if ((page_ops & PAGE_END_WRITEBACK)
> > +				&& !PagePrivate2(pages[i]))
> > 
> >  				end_page_writeback(pages[i]);
> >  			
> >  			if (page_ops & PAGE_UNLOCK)
> >  			
> >  				unlock_page(pages[i]);
> > 
> > @@ -2550,7 +2556,7 @@ int end_extent_writepage(struct page *page, int err,
> > u64 start, u64 end)> 
> >  			uptodate = 0;
> >  	
> >  	}
> > 
> > -	if (!uptodate) {
> > +	if (!uptodate || PageError(page)) {
> > 
> >  		ClearPageUptodate(page);
> >  		SetPageError(page);
> >  		ret = ret < 0 ? ret : -EIO;
> > 
> > @@ -3340,7 +3346,6 @@ static noinline_for_stack int
> > writepage_delalloc(struct inode *inode,> 
> >  					       nr_written);
> >  		
> >  		/* File system has been set read-only */
> >  		if (ret) {
> > 
> > -			SetPageError(page);
> > 
> >  			/* fill_delalloc should be return < 0 for error
> >  			
> >  			 * but just in case, we use > 0 here meaning the
> >  			 * IO is started, so we don't want to return > 0
> > 
> > @@ -3561,7 +3566,6 @@ static int __extent_writepage(struct page *page,
> > struct writeback_control *wbc,> 
> >  	struct inode *inode = page->mapping->host;
> >  	struct extent_page_data *epd = data;
> >  	u64 start = page_offset(page);
> > 
> > -	u64 page_end = start + PAGE_CACHE_SIZE - 1;
> > 
> >  	int ret;
> >  	int nr = 0;
> >  	size_t pg_offset = 0;
> > 
> > @@ -3606,7 +3610,7 @@ static int __extent_writepage(struct page *page,
> > struct writeback_control *wbc,> 
> >  	ret = writepage_delalloc(inode, page, wbc, epd, start, &nr_written);
> >  	if (ret == 1)
> >  	
> >  		goto done_unlocked;
> > 
> > -	if (ret)
> > +	if (ret && !PagePrivate2(page))
> > 
> >  		goto done;
> >  	
> >  	ret = __extent_writepage_io(inode, page, wbc, epd,
> > 
> > @@ -3620,10 +3624,7 @@ done:
> >  		set_page_writeback(page);
> >  		end_page_writeback(page);
> >  	
> >  	}
> > 
> > -	if (PageError(page)) {
> > -		ret = ret < 0 ? ret : -EIO;
> > -		end_extent_writepage(page, ret, start, page_end);
> > -	}
> > +
> > 
> >  	unlock_page(page);
> >  	return ret;
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 8b4aaed..bff60c6 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -925,6 +925,8 @@ static noinline int cow_file_range(struct inode
> > *inode,
> > 
> >  	struct btrfs_key ins;
> >  	struct extent_map *em;
> >  	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
> > 
> > +	struct btrfs_ordered_extent *ordered;
> > +	unsigned long page_ops, extent_ops;
> > 
> >  	int ret = 0;
> >  	
> >  	if (btrfs_is_free_space_inode(inode)) {
> > 
> > @@ -969,8 +971,6 @@ static noinline int cow_file_range(struct inode
> > *inode,
> > 
> >  	btrfs_drop_extent_cache(inode, start, start + num_bytes - 1, 0);
> >  	
> >  	while (disk_num_bytes > 0) {
> > 
> > -		unsigned long op;
> > -
> > 
> >  		cur_alloc_size = disk_num_bytes;
> >  		ret = btrfs_reserve_extent(root, cur_alloc_size,
> >  		
> >  					   root->sectorsize, 0, alloc_hint,
> > 
> > @@ -1023,7 +1023,7 @@ static noinline int cow_file_range(struct inode
> > *inode,> 
> >  			ret = btrfs_reloc_clone_csums(inode, start,
> >  			
> >  						      cur_alloc_size);
> >  			
> >  			if (ret)
> > 
> > -				goto out_drop_extent_cache;
> > +				goto out_remove_ordered_extent;
> > 
> >  		}
> >  		
> >  		if (disk_num_bytes < cur_alloc_size)
> > 
> > @@ -1036,13 +1036,12 @@ static noinline int cow_file_range(struct inode
> > *inode,> 
> >  		 * Do set the Private2 bit so we know this page was properly
> >  		 * setup for writepage
> >  		 */
> > 
> > -		op = unlock ? PAGE_UNLOCK : 0;
> > -		op |= PAGE_SET_PRIVATE2;
> > -
> > +		page_ops = unlock ? PAGE_UNLOCK : 0;
> > +		page_ops |= PAGE_SET_PRIVATE2;
> > +		extent_ops = EXTENT_LOCKED | EXTENT_DELALLOC;
> > 
> >  		extent_clear_unlock_delalloc(inode, start,
> > 
> > -					     start + ram_size - 1, locked_page,
> > -					     EXTENT_LOCKED | EXTENT_DELALLOC,
> > -					     op);
> > +					start + ram_size - 1, locked_page,
> > +					extent_ops, page_ops);
> > 
> >  		disk_num_bytes -= cur_alloc_size;
> >  		num_bytes -= cur_alloc_size;
> >  		alloc_hint = ins.objectid + ins.offset;
> > 
> > @@ -1051,16 +1050,26 @@ static noinline int cow_file_range(struct inode
> > *inode,> 
> >  out:
> >  	return ret;
> > 
> > +out_remove_ordered_extent:
> > +	ordered = btrfs_lookup_ordered_extent(inode, ins.objectid);
> > +	BUG_ON(!ordered);
> > +	btrfs_remove_ordered_extent(inode, ordered);
> > +
> 
> There are two problems here:
> 
> 1. ins.objectid refers to a block address, while
> btrfs_lookup_ordered_extent() expects a file offset.
>
Ah, that has most probably saved me from hours of debugging. Thanks a lot for
pointing it out.

> 2. Removing the ordered extent is not enough to clean it up: not only
> does this ordered extent remain in memory, but the reserved space
> accounting also needs to be released.
> 
> If we have to do it this way, I'd copy how btrfs_finish_ordered_io()
> does the cleanup job; however, I'd prefer to call end_extent_writepage()
> directly here to keep things as simple as possible.
>
Yes, calling end_extent_writepage() looks to be the easiest and correct way to
do this. I will test it and include the change in the next version of the
patchset.

> Thanks,
> 
> -liubo
> 
> >  out_drop_extent_cache:
> >  	btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0);
> > 
> > +
> > 
> >  out_reserve:
> >  	btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1);
> > 
> > +
> > 
> >  out_unlock:
> > +	page_ops = unlock ? PAGE_UNLOCK : 0;
> > +	page_ops |= PAGE_CLEAR_DIRTY | PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK
> > +		| PAGE_SET_ERROR;
> > +	extent_ops = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING
> > +		| EXTENT_DEFRAG;
> > +
> > 
> >  	extent_clear_unlock_delalloc(inode, start, end, locked_page,
> > 
> > -				     EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
> > -				     EXTENT_DELALLOC | EXTENT_DEFRAG,
> > -				     PAGE_UNLOCK | PAGE_CLEAR_DIRTY |
> > -				     PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK);
> > +				extent_ops, page_ops);
> > 
> >  	goto out;
> >  
> >  }

-- 
chandan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 14/21] Btrfs: subpagesize-blocksize: Explicitly Track I/O status of blocks of an ordered extent.
  2015-06-01 15:22 ` [RFC PATCH V11 14/21] Btrfs: subpagesize-blocksize: Explicitly Track I/O status of blocks of an ordered extent Chandan Rajendra
@ 2015-07-20  8:34   ` Liu Bo
  2015-07-20 12:54     ` Chandan Rajendra
  0 siblings, 1 reply; 47+ messages in thread
From: Liu Bo @ 2015-07-20  8:34 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Mon, Jun 01, 2015 at 08:52:49PM +0530, Chandan Rajendra wrote:
> In subpagesize-blocksize scenario a page can have more than one block. So
> in addition to PagePrivate2 flag, we would have to track the I/O status of
> each block of a page to reliably mark the ordered extent as complete.
> 
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> ---
>  fs/btrfs/extent_io.c    |  19 +--
>  fs/btrfs/extent_io.h    |   5 +-
>  fs/btrfs/inode.c        | 346 +++++++++++++++++++++++++++++++++++-------------
>  fs/btrfs/ordered-data.c |  17 +++
>  fs/btrfs/ordered-data.h |   4 +
>  5 files changed, 287 insertions(+), 104 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 0110abc..55f900a 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4545,11 +4545,10 @@ int extent_invalidatepage(struct extent_io_tree *tree,
>   * to drop the page.
>   */
>  static int try_release_extent_state(struct extent_map_tree *map,
> -				    struct extent_io_tree *tree,
> -				    struct page *page, gfp_t mask)
> +				struct extent_io_tree *tree,
> +				struct page *page, u64 start, u64 end,
> +				gfp_t mask)
>  {
> -	u64 start = page_offset(page);
> -	u64 end = start + PAGE_CACHE_SIZE - 1;
>  	int ret = 1;
>  
>  	if (test_range_bit(tree, start, end,
> @@ -4583,12 +4582,12 @@ static int try_release_extent_state(struct extent_map_tree *map,
>   * map records are removed
>   */
>  int try_release_extent_mapping(struct extent_map_tree *map,
> -			       struct extent_io_tree *tree, struct page *page,
> -			       gfp_t mask)
> +			struct extent_io_tree *tree, struct page *page,
> +			u64 start, u64 end, gfp_t mask)
>  {
>  	struct extent_map *em;
> -	u64 start = page_offset(page);
> -	u64 end = start + PAGE_CACHE_SIZE - 1;
> +	u64 orig_start = start;
> +	u64 orig_end = end;
>  
>  	if ((mask & __GFP_WAIT) &&
>  	    page->mapping->host->i_size > 16 * 1024 * 1024) {
> @@ -4622,7 +4621,9 @@ int try_release_extent_mapping(struct extent_map_tree *map,
>  			free_extent_map(em);
>  		}
>  	}
> -	return try_release_extent_state(map, tree, page, mask);
> +	return try_release_extent_state(map, tree, page,
> +					orig_start, orig_end,
> +					mask);
>  }
>  
>  /*
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index 8fe5ac3..c629e53 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -217,8 +217,9 @@ typedef struct extent_map *(get_extent_t)(struct inode *inode,
>  void extent_io_tree_init(struct extent_io_tree *tree,
>  			 struct address_space *mapping);
>  int try_release_extent_mapping(struct extent_map_tree *map,
> -			       struct extent_io_tree *tree, struct page *page,
> -			       gfp_t mask);
> +			struct extent_io_tree *tree, struct page *page,
> +			u64 start, u64 end,
> +			gfp_t mask);
>  int try_release_extent_buffer(struct page *page);
>  int lock_extent(struct extent_io_tree *tree, u64 start, u64 end);
>  int lock_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index bff60c6..bfffc62 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2990,56 +2990,115 @@ static void finish_ordered_fn(struct btrfs_work *work)
>  	btrfs_finish_ordered_io(ordered_extent);
>  }
>  
> -static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
> -				struct extent_state *state, int uptodate)
> +static void mark_blks_io_complete(struct btrfs_ordered_extent *ordered,
> +				u64 blk, u64 nr_blks, int uptodate)
>  {
> -	struct inode *inode = page->mapping->host;
> +	struct inode *inode = ordered->inode;
>  	struct btrfs_root *root = BTRFS_I(inode)->root;
> -	struct btrfs_ordered_extent *ordered_extent = NULL;
>  	struct btrfs_workqueue *wq;
>  	btrfs_work_func_t func;
> -	u64 ordered_start, ordered_end;
>  	int done;
>  
> -	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
> +	while (nr_blks--) {
> +		if (test_and_set_bit(blk, ordered->blocks_done)) {
> +			blk++;
> +			continue;
> +		}
>  
> -	ClearPagePrivate2(page);
> -loop:
> -	ordered_extent = btrfs_lookup_ordered_range(inode, start,
> -						end - start + 1);
> -	if (!ordered_extent)
> -		goto out;
> +		done = btrfs_dec_test_ordered_pending(inode, &ordered,
> +						ordered->file_offset
> +						+ (blk << inode->i_sb->s_blocksize_bits),
> +						root->sectorsize,
> +						uptodate);
> +		if (done) {
> +			if (btrfs_is_free_space_inode(inode)) {
> +				wq = root->fs_info->endio_freespace_worker;
> +				func = btrfs_freespace_write_helper;
> +			} else {
> +				wq = root->fs_info->endio_write_workers;
> +				func = btrfs_endio_write_helper;
> +			}
>  
> -	ordered_start = max_t(u64, start, ordered_extent->file_offset);
> -	ordered_end = min_t(u64, end,
> -			ordered_extent->file_offset + ordered_extent->len - 1);
> -
> -	done = btrfs_dec_test_ordered_pending(inode, &ordered_extent,
> -					ordered_start,
> -					ordered_end - ordered_start + 1,
> -					uptodate);
> -	if (done) {
> -		if (btrfs_is_free_space_inode(inode)) {
> -			wq = root->fs_info->endio_freespace_worker;
> -			func = btrfs_freespace_write_helper;
> -		} else {
> -			wq = root->fs_info->endio_write_workers;
> -			func = btrfs_endio_write_helper;
> +			btrfs_init_work(&ordered->work, func,
> +					finish_ordered_fn, NULL, NULL);
> +			btrfs_queue_work(wq, &ordered->work);
>  		}
>  
> -		btrfs_init_work(&ordered_extent->work, func,
> -				finish_ordered_fn, NULL, NULL);
> -		btrfs_queue_work(wq, &ordered_extent->work);
> +		blk++;
>  	}
> +}
>  
> -	btrfs_put_ordered_extent(ordered_extent);
> +int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
> +				struct extent_state *state, int uptodate)
> +{
> +	struct inode *inode = page->mapping->host;
> +	struct btrfs_root *root = BTRFS_I(inode)->root;
> +	struct btrfs_ordered_extent *ordered_extent = NULL;
> +	u64 blk, nr_blks;
> +	int clear;
>  
> -	start = ordered_end + 1;
> +	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
>  
> -	if (start < end)
> -		goto loop;
> +	while (start < end) {
> +		ordered_extent = btrfs_lookup_ordered_extent(inode, start);
> +		if (!ordered_extent) {
> +			start += root->sectorsize;
> +			continue;
> +		}
> +
> +		blk = (start - ordered_extent->file_offset)
> +			>> inode->i_sb->s_blocksize_bits;
> +
> +		nr_blks = (min(end, ordered_extent->file_offset + ordered_extent->len - 1)
> +			+ 1 - start) >> inode->i_sb->s_blocksize_bits;
> +
> +		BUG_ON(!nr_blks);
> +
> +		mark_blks_io_complete(ordered_extent, blk, nr_blks, uptodate);

Range [start, end] is surely contiguous, so why are we processing blocks
one by one in mark_blks_io_complete()?

Same question for invalidatepage().

Thanks,

-liubo

> +
> +		start = ordered_extent->file_offset + ordered_extent->len;
> +
> +		btrfs_put_ordered_extent(ordered_extent);
> +	}
> +
> +	start = page_offset(page);
> +	end = start + PAGE_CACHE_SIZE - 1;
> +	clear = 1;
> +
> +	while (start < end) {
> +		ordered_extent = btrfs_lookup_ordered_extent(inode, start);
> +		if (!ordered_extent) {
> +			start += root->sectorsize;
> +			continue;
> +		}
> +
> +		blk = (start - ordered_extent->file_offset)
> +			>> inode->i_sb->s_blocksize_bits;
> +		nr_blks = (min(end, ordered_extent->file_offset + ordered_extent->len - 1)
> +			+ 1  - start) >> inode->i_sb->s_blocksize_bits;
> +
> +		BUG_ON(!nr_blks);
> +
> +		while (nr_blks--) {
> +			if (!test_bit(blk++, ordered_extent->blocks_done)) {
> +				clear = 0;
> +				break;
> +			}
> +		}
> +
> +		if (!clear) {
> +			btrfs_put_ordered_extent(ordered_extent);
> +			break;
> +		}
> +
> +		start += ordered_extent->len;
> +
> +		btrfs_put_ordered_extent(ordered_extent);
> +	}
> +
> +	if (clear)
> +		ClearPagePrivate2(page);
>  
> -out:
>  	return 0;
>  }
>  
> @@ -8472,7 +8531,9 @@ btrfs_readpages(struct file *file, struct address_space *mapping,
>  	return extent_readpages(tree, mapping, pages, nr_pages,
>  				btrfs_get_extent);
>  }
> -static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
> +
> +static int __btrfs_releasepage(struct page *page, u64 start, u64 end,
> +			gfp_t gfp_flags)
>  {
>  	struct extent_io_tree *tree;
>  	struct extent_map_tree *map;
> @@ -8480,31 +8541,149 @@ static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
>  
>  	tree = &BTRFS_I(page->mapping->host)->io_tree;
>  	map = &BTRFS_I(page->mapping->host)->extent_tree;
> -	ret = try_release_extent_mapping(map, tree, page, gfp_flags);
> -	if (ret == 1)
> +
> +	ret = try_release_extent_mapping(map, tree, page, start, end,
> +					gfp_flags);
> +	if ((ret == 1) && ((end - start + 1) == PAGE_CACHE_SIZE)) {
>  		clear_page_extent_mapped(page);
> +	} else {
> +		ret = 0;
> +	}
>  
>  	return ret;
>  }
>  
>  static int btrfs_releasepage(struct page *page, gfp_t gfp_flags)
>  {
> +	u64 start = page_offset(page);
> +	u64 end = start + PAGE_CACHE_SIZE - 1;
> +
>  	if (PageWriteback(page) || PageDirty(page))
>  		return 0;
> -	return __btrfs_releasepage(page, gfp_flags & GFP_NOFS);
> +
> +	return __btrfs_releasepage(page, start, end, gfp_flags & GFP_NOFS);
> +}
> +
> +static void invalidate_ordered_extent_blocks(struct inode *inode,
> +					struct btrfs_ordered_extent *ordered,
> +					u64 locked_start, u64 locked_end,
> +					u64 cur,
> +					int inode_evicting)
> +{
> +	struct btrfs_root *root = BTRFS_I(inode)->root;
> +	struct btrfs_ordered_inode_tree *ordered_tree;
> +	struct extent_io_tree *tree;
> +	u64 blk, blk_done, nr_blks;
> +	u64 end;
> +	u64 new_len;
> +
> +	tree = &BTRFS_I(inode)->io_tree;
> +
> +	end = min(locked_end, ordered->file_offset + ordered->len - 1);
> +
> +	if (!inode_evicting) {
> +		clear_extent_bit(tree, cur, end,
> +				EXTENT_DIRTY | EXTENT_DELALLOC |
> +				EXTENT_DO_ACCOUNTING |
> +				EXTENT_DEFRAG, 1, 0, NULL,
> +				GFP_NOFS);
> +		unlock_extent(tree, locked_start, locked_end);
> +	}
> +
> +
> +	ordered_tree = &BTRFS_I(inode)->ordered_tree;
> +	spin_lock_irq(&ordered_tree->lock);
> +	set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> +	new_len = cur - ordered->file_offset;
> +	if (new_len < ordered->truncated_len)
> +		ordered->truncated_len = new_len;
> +
> +	blk = (cur - ordered->file_offset) >> inode->i_sb->s_blocksize_bits;
> +	nr_blks = (end + 1 - cur) >> inode->i_sb->s_blocksize_bits;
> +
> +	while (nr_blks--) {
> +		blk_done = !test_and_set_bit(blk, ordered->blocks_done);
> +		if (blk_done) {
> +			spin_unlock_irq(&ordered_tree->lock);
> +			if (btrfs_dec_test_ordered_pending(inode, &ordered,
> +								ordered->file_offset + (blk << inode->i_sb->s_blocksize_bits),
> +								root->sectorsize,
> +								1))
> +				btrfs_finish_ordered_io(ordered);
> +
> +			spin_lock_irq(&ordered_tree->lock);
> +		}
> +		blk++;
> +	}
> +
> +	spin_unlock_irq(&ordered_tree->lock);
> +
> +	if (!inode_evicting)
> +		lock_extent_bits(tree, locked_start, locked_end, 0, NULL);
> +}
> +
> +static int page_blocks_written(struct page *page)
> +{
> +	struct btrfs_ordered_extent *ordered;
> +	struct btrfs_root *root;
> +	struct inode *inode;
> +	unsigned long outstanding_blk;
> +	u64 page_start, page_end;
> +	u64 blk, last_blk, nr_blks;
> +	u64 cur;
> +	u64 len;
> +
> +	inode = page->mapping->host;
> +	root = BTRFS_I(inode)->root;
> +
> +	page_start = page_offset(page);
> +	page_end = page_start + PAGE_CACHE_SIZE - 1;
> +
> +	cur = page_start;
> +	while (cur < page_end) {
> +		ordered = btrfs_lookup_ordered_extent(inode, cur);
> +		if (!ordered) {
> +			cur += root->sectorsize;
> +			continue;
> +		}
> +
> +		blk = (cur - ordered->file_offset)
> +			>> inode->i_sb->s_blocksize_bits;
> +		len = min(page_end, ordered->file_offset + ordered->len - 1)
> +			- cur + 1;
> +		nr_blks = len >> inode->i_sb->s_blocksize_bits;
> +
> +		last_blk = blk + nr_blks - 1;
> +
> +		outstanding_blk = find_next_zero_bit(ordered->blocks_done,
> +						ordered->len >> inode->i_sb->s_blocksize_bits,
> +						blk);
> +		if (outstanding_blk <= last_blk) {
> +			btrfs_put_ordered_extent(ordered);
> +			return 0;
> +		}
> +
> +		btrfs_put_ordered_extent(ordered);
> +		cur += len;
> +	}
> +
> +	return 1;
>  }
>  
>  static void btrfs_invalidatepage(struct page *page, unsigned int offset,
> -				 unsigned int length)
> +				unsigned int length)
>  {
>  	struct inode *inode = page->mapping->host;
> +	struct btrfs_root *root = BTRFS_I(inode)->root;
>  	struct extent_io_tree *tree;
>  	struct btrfs_ordered_extent *ordered;
> -	struct extent_state *cached_state = NULL;
> -	u64 page_start = page_offset(page);
> -	u64 page_end = page_start + PAGE_CACHE_SIZE - 1;
> +	u64 start, end, cur;
> +	u64 page_start, page_end;
>  	int inode_evicting = inode->i_state & I_FREEING;
>  
> +	page_start = page_offset(page);
> +	page_end = page_start + PAGE_CACHE_SIZE - 1;
> +
>  	/*
>  	 * we have the page locked, so new writeback can't start,
>  	 * and the dirty bit won't be cleared while we are here.
> @@ -8515,73 +8694,54 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
>  	wait_on_page_writeback(page);
>  
>  	tree = &BTRFS_I(inode)->io_tree;
> -	if (offset) {
> +
> +	start = round_up(offset, root->sectorsize);
> +	end = round_down(offset + length, root->sectorsize) - 1;
> +	if (end - start + 1 < root->sectorsize) {
>  		btrfs_releasepage(page, GFP_NOFS);
>  		return;
>  	}
>  
> +	start = round_up(page_start + offset, root->sectorsize);
> +	end = round_down(page_start + offset + length,
> +			root->sectorsize) - 1;
> +
>  	if (!inode_evicting)
> -		lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
> -	ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_CACHE_SIZE);
> -	if (ordered) {
> -		/*
> -		 * IO on this page will never be started, so we need
> -		 * to account for any ordered extents now
> -		 */
> -		if (!inode_evicting)
> -			clear_extent_bit(tree, page_start, page_end,
> -					 EXTENT_DIRTY | EXTENT_DELALLOC |
> -					 EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
> -					 EXTENT_DEFRAG, 1, 0, &cached_state,
> -					 GFP_NOFS);
> -		/*
> -		 * whoever cleared the private bit is responsible
> -		 * for the finish_ordered_io
> -		 */
> -		if (TestClearPagePrivate2(page)) {
> -			struct btrfs_ordered_inode_tree *tree;
> -			u64 new_len;
> +		lock_extent_bits(tree, start, end, 0, NULL);
>  
> -			tree = &BTRFS_I(inode)->ordered_tree;
> +	cur = start;
> +	while (cur < end) {
> +		ordered = btrfs_lookup_ordered_extent(inode, cur);
> +		if (!ordered) {
> +			cur += root->sectorsize;
> +			continue;
> +		}
>  
> -			spin_lock_irq(&tree->lock);
> -			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> -			new_len = page_start - ordered->file_offset;
> -			if (new_len < ordered->truncated_len)
> -				ordered->truncated_len = new_len;
> -			spin_unlock_irq(&tree->lock);
> +		invalidate_ordered_extent_blocks(inode, ordered,
> +						start, end, cur,
> +						inode_evicting);
>  
> -			if (btrfs_dec_test_ordered_pending(inode, &ordered,
> -							   page_start,
> -							   PAGE_CACHE_SIZE, 1))
> -				btrfs_finish_ordered_io(ordered);
> -		}
> +		cur = min(end + 1, ordered->file_offset + ordered->len);
>  		btrfs_put_ordered_extent(ordered);
> -		if (!inode_evicting) {
> -			cached_state = NULL;
> -			lock_extent_bits(tree, page_start, page_end, 0,
> -					 &cached_state);
> -		}
>  	}
>  
> -	if (!inode_evicting) {
> -		clear_extent_bit(tree, page_start, page_end,
> -				 EXTENT_LOCKED | EXTENT_DIRTY |
> -				 EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
> -				 EXTENT_DEFRAG, 1, 1,
> -				 &cached_state, GFP_NOFS);
> +	if (page_blocks_written(page))
> +		ClearPagePrivate2(page);
>  
> -		__btrfs_releasepage(page, GFP_NOFS);
> +	if (!inode_evicting) {
> +		clear_extent_bit(tree, start, end,
> +				EXTENT_LOCKED | EXTENT_DIRTY |
> +				EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
> +				EXTENT_DEFRAG, 1, 1, NULL, GFP_NOFS);
>  	}
>  
> -	ClearPageChecked(page);
> -	if (PagePrivate(page)) {
> -		ClearPagePrivate(page);
> -		set_page_private(page, 0);
> -		page_cache_release(page);
> +	if (!offset && length == PAGE_CACHE_SIZE) {
> +		WARN_ON(!__btrfs_releasepage(page, start, end, GFP_NOFS));
> +		ClearPageChecked(page);
>  	}
>  }
>  
> +
>  /*
>   * btrfs_page_mkwrite() is not allowed to change the file size as it gets
>   * called from a page fault handler when a page is first dirtied. Hence we must
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 157cc54..8e614ca 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -189,12 +189,25 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
>  	struct btrfs_ordered_inode_tree *tree;
>  	struct rb_node *node;
>  	struct btrfs_ordered_extent *entry;
> +	u64 nr_longs;
>  
>  	tree = &BTRFS_I(inode)->ordered_tree;
>  	entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
>  	if (!entry)
>  		return -ENOMEM;
>  
> +	nr_longs = BITS_TO_LONGS(len >> inode->i_sb->s_blocksize_bits);
> +	if (nr_longs == 1) {
> +		entry->blocks_done = &entry->blocks_bitmap;
> +	} else {
> +		entry->blocks_done = kzalloc(nr_longs * sizeof(unsigned long),
> +					GFP_NOFS);
> +		if (!entry->blocks_done) {
> +			kmem_cache_free(btrfs_ordered_extent_cache, entry);
> +			return -ENOMEM;
> +		}
> +	}
> +
>  	entry->file_offset = file_offset;
>  	entry->start = start;
>  	entry->len = len;
> @@ -553,6 +566,10 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry)
>  			list_del(&sum->list);
>  			kfree(sum);
>  		}
> +
> +		if (entry->blocks_done != &entry->blocks_bitmap)
> +			kfree(entry->blocks_done);
> +
>  		kmem_cache_free(btrfs_ordered_extent_cache, entry);
>  	}
>  }
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index e96cd4c..4b3356a 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -140,6 +140,10 @@ struct btrfs_ordered_extent {
>  	struct completion completion;
>  	struct btrfs_work flush_work;
>  	struct list_head work_list;
> +
> +	/* bitmap to track the blocks that have been written to disk */
> +	unsigned long *blocks_done;
> +	unsigned long blocks_bitmap;
>  };
>  
>  /*
> -- 
> 2.1.0
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 14/21] Btrfs: subpagesize-blocksize: Explicitly Track I/O status of blocks of an ordered extent.
  2015-07-20  8:34   ` Liu Bo
@ 2015-07-20 12:54     ` Chandan Rajendra
  0 siblings, 0 replies; 47+ messages in thread
From: Chandan Rajendra @ 2015-07-20 12:54 UTC (permalink / raw)
  To: bo.li.liu; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Monday 20 Jul 2015 16:34:35 Liu Bo wrote:
> On Mon, Jun 01, 2015 at 08:52:49PM +0530, Chandan Rajendra wrote:
> > In subpagesize-blocksize scenario a page can have more than one block. So
> > in addition to PagePrivate2 flag, we would have to track the I/O status of
> > each block of a page to reliably mark the ordered extent as complete.
> > 
> > Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> > ---
> > 
> >  fs/btrfs/extent_io.c    |  19 +--
> >  fs/btrfs/extent_io.h    |   5 +-
> >  fs/btrfs/inode.c        | 346 +++++++++++++++++++++++++++++++++++-------------
> >  fs/btrfs/ordered-data.c |  17 +++
> >  fs/btrfs/ordered-data.h |   4 +
> >  5 files changed, 287 insertions(+), 104 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 0110abc..55f900a 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -4545,11 +4545,10 @@ int extent_invalidatepage(struct extent_io_tree
> > *tree,> 
> >   * to drop the page.
> >   */
> >  
> >  static int try_release_extent_state(struct extent_map_tree *map,
> > 
> > -				    struct extent_io_tree *tree,
> > -				    struct page *page, gfp_t mask)
> > +				struct extent_io_tree *tree,
> > +				struct page *page, u64 start, u64 end,
> > +				gfp_t mask)
> > 
> >  {
> > 
> > -	u64 start = page_offset(page);
> > -	u64 end = start + PAGE_CACHE_SIZE - 1;
> > 
> >  	int ret = 1;
> >  	
> >  	if (test_range_bit(tree, start, end,
> > 
> > @@ -4583,12 +4582,12 @@ static int try_release_extent_state(struct
> > extent_map_tree *map,> 
> >   * map records are removed
> >   */
> >  
> >  int try_release_extent_mapping(struct extent_map_tree *map,
> > 
> > -			       struct extent_io_tree *tree, struct page *page,
> > -			       gfp_t mask)
> > +			struct extent_io_tree *tree, struct page *page,
> > +			u64 start, u64 end, gfp_t mask)
> > 
> >  {
> >  
> >  	struct extent_map *em;
> > 
> > -	u64 start = page_offset(page);
> > -	u64 end = start + PAGE_CACHE_SIZE - 1;
> > +	u64 orig_start = start;
> > +	u64 orig_end = end;
> > 
> >  	if ((mask & __GFP_WAIT) &&
> >  	
> >  	    page->mapping->host->i_size > 16 * 1024 * 1024) {
> > 
> > @@ -4622,7 +4621,9 @@ int try_release_extent_mapping(struct
> > extent_map_tree *map,> 
> >  			free_extent_map(em);
> >  		
> >  		}
> >  	
> >  	}
> > 
> > -	return try_release_extent_state(map, tree, page, mask);
> > +	return try_release_extent_state(map, tree, page,
> > +					orig_start, orig_end,
> > +					mask);
> > 
> >  }
> >  
> >  /*
> > 
> > diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> > index 8fe5ac3..c629e53 100644
> > --- a/fs/btrfs/extent_io.h
> > +++ b/fs/btrfs/extent_io.h
> > @@ -217,8 +217,9 @@ typedef struct extent_map *(get_extent_t)(struct inode
> > *inode,> 
> >  void extent_io_tree_init(struct extent_io_tree *tree,
> >  
> >  			 struct address_space *mapping);
> >  
> >  int try_release_extent_mapping(struct extent_map_tree *map,
> > 
> > -			       struct extent_io_tree *tree, struct page *page,
> > -			       gfp_t mask);
> > +			struct extent_io_tree *tree, struct page *page,
> > +			u64 start, u64 end,
> > +			gfp_t mask);
> > 
> >  int try_release_extent_buffer(struct page *page);
> >  int lock_extent(struct extent_io_tree *tree, u64 start, u64 end);
> >  int lock_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index bff60c6..bfffc62 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -2990,56 +2990,115 @@ static void finish_ordered_fn(struct btrfs_work
> > *work)> 
> >  	btrfs_finish_ordered_io(ordered_extent);
> >  
> >  }
> > 
> > -static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
> > -				struct extent_state *state, int uptodate)
> > +static void mark_blks_io_complete(struct btrfs_ordered_extent *ordered,
> > +				u64 blk, u64 nr_blks, int uptodate)
> > 
> >  {
> > 
> > -	struct inode *inode = page->mapping->host;
> > +	struct inode *inode = ordered->inode;
> > 
> >  	struct btrfs_root *root = BTRFS_I(inode)->root;
> > 
> > -	struct btrfs_ordered_extent *ordered_extent = NULL;
> > 
> >  	struct btrfs_workqueue *wq;
> >  	btrfs_work_func_t func;
> > 
> > -	u64 ordered_start, ordered_end;
> > 
> >  	int done;
> > 
> > -	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
> > +	while (nr_blks--) {
> > +		if (test_and_set_bit(blk, ordered->blocks_done)) {
> > +			blk++;
> > +			continue;
> > +		}
> > 
> > -	ClearPagePrivate2(page);
> > -loop:
> > -	ordered_extent = btrfs_lookup_ordered_range(inode, start,
> > -						end - start + 1);
> > -	if (!ordered_extent)
> > -		goto out;
> > +		done = btrfs_dec_test_ordered_pending(inode, &ordered,
> > +						ordered->file_offset
> > +						+ (blk << inode->i_sb->s_blocksize_bits),
> > +						root->sectorsize,
> > +						uptodate);
> > +		if (done) {
> > +			if (btrfs_is_free_space_inode(inode)) {
> > +				wq = root->fs_info->endio_freespace_worker;
> > +				func = btrfs_freespace_write_helper;
> > +			} else {
> > +				wq = root->fs_info->endio_write_workers;
> > +				func = btrfs_endio_write_helper;
> > +			}
> > 
> > -	ordered_start = max_t(u64, start, ordered_extent->file_offset);
> > -	ordered_end = min_t(u64, end,
> > -			ordered_extent->file_offset + ordered_extent->len - 1);
> > -
> > -	done = btrfs_dec_test_ordered_pending(inode, &ordered_extent,
> > -					ordered_start,
> > -					ordered_end - ordered_start + 1,
> > -					uptodate);
> > -	if (done) {
> > -		if (btrfs_is_free_space_inode(inode)) {
> > -			wq = root->fs_info->endio_freespace_worker;
> > -			func = btrfs_freespace_write_helper;
> > -		} else {
> > -			wq = root->fs_info->endio_write_workers;
> > -			func = btrfs_endio_write_helper;
> > +			btrfs_init_work(&ordered->work, func,
> > +					finish_ordered_fn, NULL, NULL);
> > +			btrfs_queue_work(wq, &ordered->work);
> > 
> >  		}
> > 
> > -		btrfs_init_work(&ordered_extent->work, func,
> > -				finish_ordered_fn, NULL, NULL);
> > -		btrfs_queue_work(wq, &ordered_extent->work);
> > +		blk++;
> > 
> >  	}
> > 
> > +}
> > 
> > -	btrfs_put_ordered_extent(ordered_extent);
> > +int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
> > +				struct extent_state *state, int uptodate)
> > +{
> > +	struct inode *inode = page->mapping->host;
> > +	struct btrfs_root *root = BTRFS_I(inode)->root;
> > +	struct btrfs_ordered_extent *ordered_extent = NULL;
> > +	u64 blk, nr_blks;
> > +	int clear;
> > 
> > -	start = ordered_end + 1;
> > +	trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
> > 
> > -	if (start < end)
> > -		goto loop;
> > +	while (start < end) {
> > +		ordered_extent = btrfs_lookup_ordered_extent(inode, start);
> > +		if (!ordered_extent) {
> > +			start += root->sectorsize;
> > +			continue;
> > +		}
> > +
> > +		blk = (start - ordered_extent->file_offset)
> > +			>> inode->i_sb->s_blocksize_bits;
> > +
> > +		nr_blks = (min(end, ordered_extent->file_offset + ordered_extent->len - 1)
> > +			+ 1 - start) >> inode->i_sb->s_blocksize_bits;
> > +
> > +		BUG_ON(!nr_blks);
> > +
> > +		mark_blks_io_complete(ordered_extent, blk, nr_blks, uptodate);
> 
> Range [start, end] is surely contiguous, so why are we processing blocks
> one by one in mark_blks_io_complete()?
>
Liu, thanks for pointing it out. We can actually get rid of the loop in
mark_blks_io_complete() and set the bits (corresponding to the blocks in the
range [start, end]) in btrfs_ordered_extent->blocks_done using bitmap_set().
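
That is, something like the following untested sketch, which assumes
the writeback completion path never hands us blocks that were already
marked done:

	/* Mark the whole contiguous range of blocks done in one go. */
	bitmap_set(ordered->blocks_done, blk, nr_blks);

	done = btrfs_dec_test_ordered_pending(inode, &ordered,
			ordered->file_offset
			+ (blk << inode->i_sb->s_blocksize_bits),
			nr_blks << inode->i_sb->s_blocksize_bits,
			uptodate);
	if (done) {
		/* Queue finish_ordered_fn exactly as before. */
		btrfs_init_work(&ordered->work, func,
				finish_ordered_fn, NULL, NULL);
		btrfs_queue_work(wq, &ordered->work);
	}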

> Same question for invalidatepage().
Unfortunately for btrfs_invalidatepage(), we need to loop across the blocks
sequentially. Consider the following file operations:

1. Write blocks [0, 7] to a file. Assume all eight blocks are part of the same
   ordered extent.
2. Punch a hole starting at block 4 and spanning two blocks in length.
   Here btrfs_invalidatepage() gets invoked and hence
   btrfs_ordered_extent->bytes_left gets decremented by (2 * sectorsize).
3. Punch a hole starting at block 3 and spanning two blocks in length. Again,
   btrfs_invalidatepage() gets invoked and hence
   btrfs_ordered_extent->bytes_left gets decremented by (2 * sectorsize). This
   isn't correct, since block 4 was already accounted for in step 2.

Hence we will have to check each block's completion status before invoking
btrfs_dec_test_ordered_pending().
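
So the per-block guard in invalidate_ordered_extent_blocks() has to
stay; roughly (locking and the btrfs_finish_ordered_io() call elided):

	while (nr_blks--) {
		/*
		 * Only the first invalidation of a block may decrement
		 * the ordered extent's pending count; a later hole
		 * punch overlapping the same block must skip it.
		 */
		if (!test_and_set_bit(blk, ordered->blocks_done))
			btrfs_dec_test_ordered_pending(inode, &ordered,
				ordered->file_offset
				+ (blk << inode->i_sb->s_blocksize_bits),
				root->sectorsize, 1);
		blk++;
	}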

> 
> Thanks,
> 
> -liubo
> 
> > +
> > +		start = ordered_extent->file_offset + ordered_extent->len;
> > +
> > +		btrfs_put_ordered_extent(ordered_extent);
> > +	}
> > +
> > +	start = page_offset(page);
> > +	end = start + PAGE_CACHE_SIZE - 1;
> > +	clear = 1;
> > +
> > +	while (start < end) {
> > +		ordered_extent = btrfs_lookup_ordered_extent(inode, start);
> > +		if (!ordered_extent) {
> > +			start += root->sectorsize;
> > +			continue;
> > +		}
> > +
> > +		blk = (start - ordered_extent->file_offset)
> > +			>> inode->i_sb->s_blocksize_bits;
> > +		nr_blks = (min(end, ordered_extent->file_offset + ordered_extent->len - 1)
> > +			+ 1  - start) >> inode->i_sb->s_blocksize_bits;
> > +
> > +		BUG_ON(!nr_blks);
> > +
> > +		while (nr_blks--) {
> > +			if (!test_bit(blk++, ordered_extent->blocks_done)) {
> > +				clear = 0;
> > +				break;
> > +			}
> > +		}
> > +
> > +		if (!clear) {
> > +			btrfs_put_ordered_extent(ordered_extent);
> > +			break;
> > +		}
> > +
> > +		start += ordered_extent->len;
> > +
> > +		btrfs_put_ordered_extent(ordered_extent);
> > +	}
> > +
> > +	if (clear)
> > +		ClearPagePrivate2(page);
> > 
> > -out:
> >  	return 0;
> >  
> >  }
> > 
> > @@ -8472,7 +8531,9 @@ btrfs_readpages(struct file *file, struct
> > address_space *mapping,> 
> >  	return extent_readpages(tree, mapping, pages, nr_pages,
> >  	
> >  				btrfs_get_extent);
> >  
> >  }
> > 
> > -static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
> > +
> > +static int __btrfs_releasepage(struct page *page, u64 start, u64 end,
> > +			gfp_t gfp_flags)
> > 
> >  {
> >  
> >  	struct extent_io_tree *tree;
> >  	struct extent_map_tree *map;
> > 
> > @@ -8480,31 +8541,149 @@ static int __btrfs_releasepage(struct page *page,
> > gfp_t gfp_flags)> 
> >  	tree = &BTRFS_I(page->mapping->host)->io_tree;
> >  	map = &BTRFS_I(page->mapping->host)->extent_tree;
> > 
> > -	ret = try_release_extent_mapping(map, tree, page, gfp_flags);
> > -	if (ret == 1)
> > +
> > +	ret = try_release_extent_mapping(map, tree, page, start, end,
> > +					gfp_flags);
> > +	if ((ret == 1) && ((end - start + 1) == PAGE_CACHE_SIZE)) {
> > 
> >  		clear_page_extent_mapped(page);
> > 
> > +	} else {
> > +		ret = 0;
> > +	}
> > 
> >  	return ret;
> >  
> >  }
> >  
> >  static int btrfs_releasepage(struct page *page, gfp_t gfp_flags)
> >  {
> > 
> > +	u64 start = page_offset(page);
> > +	u64 end = start + PAGE_CACHE_SIZE - 1;
> > +
> > 
> >  	if (PageWriteback(page) || PageDirty(page))
> >  	
> >  		return 0;
> > 
> > -	return __btrfs_releasepage(page, gfp_flags & GFP_NOFS);
> > +
> > +	return __btrfs_releasepage(page, start, end, gfp_flags & GFP_NOFS);
> > +}
> > +
> > +static void invalidate_ordered_extent_blocks(struct inode *inode,
> > +					struct btrfs_ordered_extent *ordered,
> > +					u64 locked_start, u64 locked_end,
> > +					u64 cur,
> > +					int inode_evicting)
> > +{
> > +	struct btrfs_root *root = BTRFS_I(inode)->root;
> > +	struct btrfs_ordered_inode_tree *ordered_tree;
> > +	struct extent_io_tree *tree;
> > +	u64 blk, blk_done, nr_blks;
> > +	u64 end;
> > +	u64 new_len;
> > +
> > +	tree = &BTRFS_I(inode)->io_tree;
> > +
> > +	end = min(locked_end, ordered->file_offset + ordered->len - 1);
> > +
> > +	if (!inode_evicting) {
> > +		clear_extent_bit(tree, cur, end,
> > +				EXTENT_DIRTY | EXTENT_DELALLOC |
> > +				EXTENT_DO_ACCOUNTING |
> > +				EXTENT_DEFRAG, 1, 0, NULL,
> > +				GFP_NOFS);
> > +		unlock_extent(tree, locked_start, locked_end);
> > +	}
> > +
> > +
> > +	ordered_tree = &BTRFS_I(inode)->ordered_tree;
> > +	spin_lock_irq(&ordered_tree->lock);
> > +	set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> > +	new_len = cur - ordered->file_offset;
> > +	if (new_len < ordered->truncated_len)
> > +		ordered->truncated_len = new_len;
> > +
> > +	blk = (cur - ordered->file_offset) >> inode->i_sb->s_blocksize_bits;
> > +	nr_blks = (end + 1 - cur) >> inode->i_sb->s_blocksize_bits;
> > +
> > +	while (nr_blks--) {
> > +		blk_done = !test_and_set_bit(blk, ordered->blocks_done);
> > +		if (blk_done) {
> > +			spin_unlock_irq(&ordered_tree->lock);
> > +			if (btrfs_dec_test_ordered_pending(inode, &ordered,
> > +								ordered->file_offset + (blk << inode->i_sb->s_blocksize_bits),
> > +								root->sectorsize,
> > +								1))
> > +				btrfs_finish_ordered_io(ordered);
> > +
> > +			spin_lock_irq(&ordered_tree->lock);
> > +		}
> > +		blk++;
> > +	}
> > +
> > +	spin_unlock_irq(&ordered_tree->lock);
> > +
> > +	if (!inode_evicting)
> > +		lock_extent_bits(tree, locked_start, locked_end, 0, NULL);
> > +}
> > +
> > +static int page_blocks_written(struct page *page)
> > +{
> > +	struct btrfs_ordered_extent *ordered;
> > +	struct btrfs_root *root;
> > +	struct inode *inode;
> > +	unsigned long outstanding_blk;
> > +	u64 page_start, page_end;
> > +	u64 blk, last_blk, nr_blks;
> > +	u64 cur;
> > +	u64 len;
> > +
> > +	inode = page->mapping->host;
> > +	root = BTRFS_I(inode)->root;
> > +
> > +	page_start = page_offset(page);
> > +	page_end = page_start + PAGE_CACHE_SIZE - 1;
> > +
> > +	cur = page_start;
> > +	while (cur < page_end) {
> > +		ordered = btrfs_lookup_ordered_extent(inode, cur);
> > +		if (!ordered) {
> > +			cur += root->sectorsize;
> > +			continue;
> > +		}
> > +
> > +		blk = (cur - ordered->file_offset)
> > +			>> inode->i_sb->s_blocksize_bits;
> > +		len = min(page_end, ordered->file_offset + ordered->len - 1)
> > +			- cur + 1;
> > +		nr_blks = len >> inode->i_sb->s_blocksize_bits;
> > +
> > +		last_blk = blk + nr_blks - 1;
> > +
> > +		outstanding_blk = find_next_zero_bit(ordered->blocks_done,
> > +						ordered->len >> inode->i_sb->s_blocksize_bits,
> > +						blk);
> > +		if (outstanding_blk <= last_blk) {
> > +			btrfs_put_ordered_extent(ordered);
> > +			return 0;
> > +		}
> > +
> > +		btrfs_put_ordered_extent(ordered);
> > +		cur += len;
> > +	}
> > +
> > +	return 1;
> > 
> >  }
> >  
> >  static void btrfs_invalidatepage(struct page *page, unsigned int offset,
> > 
> > -				 unsigned int length)
> > +				unsigned int length)
> > 
> >  {
> >  
> >  	struct inode *inode = page->mapping->host;
> > 
> > +	struct btrfs_root *root = BTRFS_I(inode)->root;
> > 
> >  	struct extent_io_tree *tree;
> >  	struct btrfs_ordered_extent *ordered;
> > 
> > -	struct extent_state *cached_state = NULL;
> > -	u64 page_start = page_offset(page);
> > -	u64 page_end = page_start + PAGE_CACHE_SIZE - 1;
> > +	u64 start, end, cur;
> > +	u64 page_start, page_end;
> > 
> >  	int inode_evicting = inode->i_state & I_FREEING;
> > 
> > +	page_start = page_offset(page);
> > +	page_end = page_start + PAGE_CACHE_SIZE - 1;
> > +
> > 
> >  	/*
> >  	
> >  	 * we have the page locked, so new writeback can't start,
> >  	 * and the dirty bit won't be cleared while we are here.
> > 
> > @@ -8515,73 +8694,54 @@ static void btrfs_invalidatepage(struct page
> > *page, unsigned int offset,> 
> >  	wait_on_page_writeback(page);
> >  	
> >  	tree = &BTRFS_I(inode)->io_tree;
> > 
> > -	if (offset) {
> > +
> > +	start = round_up(offset, root->sectorsize);
> > +	end = round_down(offset + length, root->sectorsize) - 1;
> > +	if (end - start + 1 < root->sectorsize) {
> > 
> >  		btrfs_releasepage(page, GFP_NOFS);
> >  		return;
> >  	
> >  	}
> > 
> > +	start = round_up(page_start + offset, root->sectorsize);
> > +	end = round_down(page_start + offset + length,
> > +			root->sectorsize) - 1;
> > +
> > 
> >  	if (!inode_evicting)
> > 
> > -		lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
> > -	ordered = btrfs_lookup_ordered_range(inode, page_start,
> > PAGE_CACHE_SIZE);
> > -	if (ordered) {
> > -		/*
> > -		 * IO on this page will never be started, so we need
> > -		 * to account for any ordered extents now
> > -		 */
> > -		if (!inode_evicting)
> > -			clear_extent_bit(tree, page_start, page_end,
> > -					 EXTENT_DIRTY | EXTENT_DELALLOC |
> > -					 EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
> > -					 EXTENT_DEFRAG, 1, 0, &cached_state,
> > -					 GFP_NOFS);
> > -		/*
> > -		 * whoever cleared the private bit is responsible
> > -		 * for the finish_ordered_io
> > -		 */
> > -		if (TestClearPagePrivate2(page)) {
> > -			struct btrfs_ordered_inode_tree *tree;
> > -			u64 new_len;
> > +		lock_extent_bits(tree, start, end, 0, NULL);
> > 
> > -			tree = &BTRFS_I(inode)->ordered_tree;
> > +	cur = start;
> > +	while (cur < end) {
> > +		ordered = btrfs_lookup_ordered_extent(inode, cur);
> > +		if (!ordered) {
> > +			cur += root->sectorsize;
> > +			continue;
> > +		}
> > 
> > -			spin_lock_irq(&tree->lock);
> > -			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> > -			new_len = page_start - ordered->file_offset;
> > -			if (new_len < ordered->truncated_len)
> > -				ordered->truncated_len = new_len;
> > -			spin_unlock_irq(&tree->lock);
> > +		invalidate_ordered_extent_blocks(inode, ordered,
> > +						start, end, cur,
> > +						inode_evicting);
> > 
> > -			if (btrfs_dec_test_ordered_pending(inode, &ordered,
> > -							   page_start,
> > -							   PAGE_CACHE_SIZE, 
1))
> > -				btrfs_finish_ordered_io(ordered);
> > -		}
> > +		cur = min(end + 1, ordered->file_offset + ordered->len);
> > 
> >  		btrfs_put_ordered_extent(ordered);
> > 
> > -		if (!inode_evicting) {
> > -			cached_state = NULL;
> > -			lock_extent_bits(tree, page_start, page_end, 0,
> > -					 &cached_state);
> > -		}
> > 
> >  	}
> > 
> > -	if (!inode_evicting) {
> > -		clear_extent_bit(tree, page_start, page_end,
> > -				 EXTENT_LOCKED | EXTENT_DIRTY |
> > -				 EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
> > -				 EXTENT_DEFRAG, 1, 1,
> > -				 &cached_state, GFP_NOFS);
> > +	if (page_blocks_written(page))
> > +		ClearPagePrivate2(page);
> > 
> > -		__btrfs_releasepage(page, GFP_NOFS);
> > +	if (!inode_evicting) {
> > +		clear_extent_bit(tree, start, end,
> > +				EXTENT_LOCKED | EXTENT_DIRTY |
> > +				EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
> > +				EXTENT_DEFRAG, 1, 1, NULL, GFP_NOFS);
> > 
> >  	}
> > 
> > -	ClearPageChecked(page);
> > -	if (PagePrivate(page)) {
> > -		ClearPagePrivate(page);
> > -		set_page_private(page, 0);
> > -		page_cache_release(page);
> > +	if (!offset && length == PAGE_CACHE_SIZE) {
> > +		WARN_ON(!__btrfs_releasepage(page, start, end, GFP_NOFS));
> > +		ClearPageChecked(page);
> > 
> >  	}
> >  
> >  }
> > 
> > +
> > 
> >  /*
> >  
> >   * btrfs_page_mkwrite() is not allowed to change the file size as it gets
> >   * called from a page fault handler when a page is first dirtied. Hence
> >   we must> 
> > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> > index 157cc54..8e614ca 100644
> > --- a/fs/btrfs/ordered-data.c
> > +++ b/fs/btrfs/ordered-data.c
> > @@ -189,12 +189,25 @@ static int __btrfs_add_ordered_extent(struct inode
> > *inode, u64 file_offset,> 
> >  	struct btrfs_ordered_inode_tree *tree;
> >  	struct rb_node *node;
> >  	struct btrfs_ordered_extent *entry;
> > 
> > +	u64 nr_longs;
> > 
> >  	tree = &BTRFS_I(inode)->ordered_tree;
> >  	entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
> >  	if (!entry)
> >  	
> >  		return -ENOMEM;
> > 
> > +	nr_longs = BITS_TO_LONGS(len >> inode->i_sb->s_blocksize_bits);
> > +	if (nr_longs == 1) {
> > +		entry->blocks_done = &entry->blocks_bitmap;
> > +	} else {
> > +		entry->blocks_done = kzalloc(nr_longs * sizeof(unsigned long),
> > +					GFP_NOFS);
> > +		if (!entry->blocks_done) {
> > +			kmem_cache_free(btrfs_ordered_extent_cache, entry);
> > +			return -ENOMEM;
> > +		}
> > +	}
> > +
> > 
> >  	entry->file_offset = file_offset;
> >  	entry->start = start;
> >  	entry->len = len;
> > 
> > @@ -553,6 +566,10 @@ void btrfs_put_ordered_extent(struct
> > btrfs_ordered_extent *entry)> 
> >  			list_del(&sum->list);
> >  			kfree(sum);
> >  		
> >  		}
> > 
> > +
> > +		if (entry->blocks_done != &entry->blocks_bitmap)
> > +			kfree(entry->blocks_done);
> > +
> > 
> >  		kmem_cache_free(btrfs_ordered_extent_cache, entry);
> >  	
> >  	}
> >  
> >  }
> > 
> > diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> > index e96cd4c..4b3356a 100644
> > --- a/fs/btrfs/ordered-data.h
> > +++ b/fs/btrfs/ordered-data.h
> > @@ -140,6 +140,10 @@ struct btrfs_ordered_extent {
> > 
> >  	struct completion completion;
> >  	struct btrfs_work flush_work;
> >  	struct list_head work_list;
> > 
> > +
> > +	/* bitmap to track the blocks that have been written to disk */
> > +	unsigned long *blocks_done;
> > +	unsigned long blocks_bitmap;
> > 
> >  };
> >  
> >  /*

-- 
chandan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH V11 17/21] Btrfs: subpagesize-blocksize: Use (eb->start, seq) as search key for tree modification log.
  2015-06-01 15:22 ` [RFC PATCH V11 17/21] Btrfs: subpagesize-blocksize: Use (eb->start, seq) as search key for tree modification log Chandan Rajendra
@ 2015-07-20 14:46   ` Liu Bo
  0 siblings, 0 replies; 47+ messages in thread
From: Liu Bo @ 2015-07-20 14:46 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: clm, jbacik, dsterba, linux-btrfs, chandan

On Mon, Jun 01, 2015 at 08:52:52PM +0530, Chandan Rajendra wrote:
> In the subpagesize-blocksize scenario a page can map multiple extent
> buffers, and hence using (page index, seq) as the search key is incorrect.
> For example, searching through the tree modification log can return an
> entry associated with the first extent buffer mapped by the page (if such
> an entry exists), when we are actually searching for entries associated
> with extent buffers that are mapped at position 2 or more in the page.
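
(To make the collision concrete with illustrative numbers: on a machine
with 64K pages and a 16K metadata block size, the extent buffers at
0x10000, 0x14000, 0x18000 and 0x1c000 all live in the page at index 1,
so the old key (eb->start >> PAGE_CACHE_SHIFT, seq) is the same for all
four of them, while the new key (eb->start, seq) keeps them distinct.)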

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo

> 
> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
> ---
>  fs/btrfs/ctree.c | 34 +++++++++++++++++-----------------
>  1 file changed, 17 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index ba6fbb0..47310d3 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -311,7 +311,7 @@ struct tree_mod_root {
>  
>  struct tree_mod_elem {
>  	struct rb_node node;
> -	u64 index;		/* shifted logical */
> +	u64 logical;
>  	u64 seq;
>  	enum mod_log_op op;
>  
> @@ -435,11 +435,11 @@ void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info,
>  
>  /*
>   * key order of the log:
> - *       index -> sequence
> + *       node/leaf start address -> sequence
>   *
> - * the index is the shifted logical of the *new* root node for root replace
> - * operations, or the shifted logical of the affected block for all other
> - * operations.
> + * The 'start address' is the logical address of the *new* root node
> + * for root replace operations, or the logical address of the affected
> + * block for all other operations.
>   *
>   * Note: must be called with write lock (tree_mod_log_write_lock).
>   */
> @@ -460,9 +460,9 @@ __tree_mod_log_insert(struct btrfs_fs_info *fs_info, struct tree_mod_elem *tm)
>  	while (*new) {
>  		cur = container_of(*new, struct tree_mod_elem, node);
>  		parent = *new;
> -		if (cur->index < tm->index)
> +		if (cur->logical < tm->logical)
>  			new = &((*new)->rb_left);
> -		else if (cur->index > tm->index)
> +		else if (cur->logical > tm->logical)
>  			new = &((*new)->rb_right);
>  		else if (cur->seq < tm->seq)
>  			new = &((*new)->rb_left);
> @@ -523,7 +523,7 @@ alloc_tree_mod_elem(struct extent_buffer *eb, int slot,
>  	if (!tm)
>  		return NULL;
>  
> -	tm->index = eb->start >> PAGE_CACHE_SHIFT;
> +	tm->logical = eb->start;
>  	if (op != MOD_LOG_KEY_ADD) {
>  		btrfs_node_key(eb, &tm->key, slot);
>  		tm->blockptr = btrfs_node_blockptr(eb, slot);
> @@ -588,7 +588,7 @@ tree_mod_log_insert_move(struct btrfs_fs_info *fs_info,
>  		goto free_tms;
>  	}
>  
> -	tm->index = eb->start >> PAGE_CACHE_SHIFT;
> +	tm->logical = eb->start;
>  	tm->slot = src_slot;
>  	tm->move.dst_slot = dst_slot;
>  	tm->move.nr_items = nr_items;
> @@ -699,7 +699,7 @@ tree_mod_log_insert_root(struct btrfs_fs_info *fs_info,
>  		goto free_tms;
>  	}
>  
> -	tm->index = new_root->start >> PAGE_CACHE_SHIFT;
> +	tm->logical = new_root->start;
>  	tm->old_root.logical = old_root->start;
>  	tm->old_root.level = btrfs_header_level(old_root);
>  	tm->generation = btrfs_header_generation(old_root);
> @@ -739,16 +739,15 @@ __tree_mod_log_search(struct btrfs_fs_info *fs_info, u64 start, u64 min_seq,
>  	struct rb_node *node;
>  	struct tree_mod_elem *cur = NULL;
>  	struct tree_mod_elem *found = NULL;
> -	u64 index = start >> PAGE_CACHE_SHIFT;
>  
>  	tree_mod_log_read_lock(fs_info);
>  	tm_root = &fs_info->tree_mod_log;
>  	node = tm_root->rb_node;
>  	while (node) {
>  		cur = container_of(node, struct tree_mod_elem, node);
> -		if (cur->index < index) {
> +		if (cur->logical < start) {
>  			node = node->rb_left;
> -		} else if (cur->index > index) {
> +		} else if (cur->logical > start) {
>  			node = node->rb_right;
>  		} else if (cur->seq < min_seq) {
>  			node = node->rb_left;
> @@ -1228,9 +1227,10 @@ __tree_mod_log_oldest_root(struct btrfs_fs_info *fs_info,
>  		return NULL;
>  
>  	/*
> -	 * the very last operation that's logged for a root is the replacement
> -	 * operation (if it is replaced at all). this has the index of the *new*
> -	 * root, making it the very first operation that's logged for this root.
> +	 * the very last operation that's logged for a root is the
> +	 * replacement operation (if it is replaced at all). this has
> +	 * the logical address of the *new* root, making it the very
> +	 * first operation that's logged for this root.
>  	 */
>  	while (1) {
>  		tm = tree_mod_log_search_oldest(fs_info, root_logical,
> @@ -1334,7 +1334,7 @@ __tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct extent_buffer *eb,
>  		if (!next)
>  			break;
>  		tm = container_of(next, struct tree_mod_elem, node);
> -		if (tm->index != first_tm->index)
> +		if (tm->logical != first_tm->logical)
>  			break;
>  	}
>  	tree_mod_log_read_unlock(fs_info);
> -- 
> 2.1.0
> 

* Re: [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read.
  2015-06-19  9:45     ` Chandan Rajendra
  2015-06-23  8:37       ` Liu Bo
@ 2016-02-10 10:39       ` David Sterba
  2016-02-11  5:42         ` Chandan Rajendra
  1 sibling, 1 reply; 47+ messages in thread
From: David Sterba @ 2016-02-10 10:39 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: bo.li.liu, clm, jbacik, dsterba, linux-btrfs, chandan

On Fri, Jun 19, 2015 at 03:15:01PM +0530, Chandan Rajendra wrote:
> > private->io_lock is not acquired here, but it is below.
> > 
> > IIUC, this can be protected by EXTENT_LOCKED.
> >
> 
> private->io_lock plays the same role as BH_Uptodate_Lock (see
> end_buffer_async_read()), i.e. without the io_lock we may end up in the
> following situation:
> 
> NOTE: Assume a 64k page size and a 4k block size. Also assume that the first
> 12 blocks of the page form one contiguous extent on disk while the next 4
> blocks form a second contiguous extent. When reading the page we end up
> submitting two "logical address space" bios, so the end_bio_extent_readpage
> function is invoked twice (once for each bio).
> 
> |-------------------------+-------------------------+-------------|
> | Task A                  | Task B                  | Task C      |
> |-------------------------+-------------------------+-------------|
> | end_bio_extent_readpage |                         |             |
> | process block 0         |                         |             |
> | - clear BLK_STATE_IO    |                         |             |
> | - page_read_complete    |                         |             |
> | process block 1         |                         |             |
> | ...                     |                         |             |
> | ...                     |                         |             |
> | ...                     | end_bio_extent_readpage |             |
> | ...                     | process block 0         |             |
> | ...                     | - clear BLK_STATE_IO    |             |
> | ...                     | - page_read_complete    |             |
> | ...                     | process block 1         |             |
> | ...                     | ...                     |             |
> | process block 11        | process block 3         |             |
> | - clear BLK_STATE_IO    | - clear BLK_STATE_IO    |             |
> | - page_read_complete    | - page_read_complete    |             |
> |   - returns true        |   - returns true        |             |
> |   - unlock_page()       |                         |             |
> |                         |                         | lock_page() |
> |                         |   - unlock_page()       |             |
> |-------------------------+-------------------------+-------------|
> 
> So we end up incorrectly unlocking the page twice and "Task C" ends up working
> on an unlocked page. Hence private->io_lock makes sure that only one of the tasks
> gets "true" as the return value when page_read_complete() is invoked. As an
> optimization the patch gets the io_lock only when nr_sectors counter reaches
> the value 0 (i.e. when the last block of the bio_vec is being processed).
> Please let me know if my analysis was incorrect.
> 
> Also, I noticed that page_read_complete() and page_write_complete() can be
> replaced by just one function, i.e. page_io_complete().

The explanations and the table would be good in the changelog and as
comments. I think we'll need to consider the smaller blocks more often
so some examples and locking rules would be useful, e.g. documented in
this file.
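
A minimal sketch of the rule the quoted table describes, with assumed names
(the struct and helper below are illustrative, not the patch's exact code):
clearing a block's in-flight bit and testing whether it was the last one must
happen atomically under private->io_lock, so that exactly one bio completion
path sees "no I/O pending" and unlocks the page. The patch optimizes this by
taking io_lock only for the last block of a bio_vec; the sketch shows the
unoptimized rule.

	/* Sketch: per-page private state; io_lock serializes bio completions. */
	struct btrfs_page_private {
		spinlock_t io_lock;
		unsigned long io_bitmap;	/* one bit per block still under I/O */
	};

	/*
	 * Called on the bio completion path. The clear and the emptiness
	 * test form one critical section, so exactly one caller (the one
	 * clearing the last bit) returns true.
	 */
	static bool page_read_complete(struct page *page, unsigned int block_nr)
	{
		struct btrfs_page_private *pg_private =
			(struct btrfs_page_private *)page->private;
		bool no_io_pending;

		spin_lock(&pg_private->io_lock);
		__clear_bit(block_nr, &pg_private->io_bitmap);
		no_io_pending = (pg_private->io_bitmap == 0);
		spin_unlock(&pg_private->io_lock);

		return no_io_pending;	/* only that caller may unlock_page() */
	}

Without the lock, Task A and Task B from the table could each clear their
bit, then both observe an empty bitmap and both unlock the page.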

* Re: [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read.
  2015-06-23  8:37       ` Liu Bo
@ 2016-02-10 10:44         ` David Sterba
  0 siblings, 0 replies; 47+ messages in thread
From: David Sterba @ 2016-02-10 10:44 UTC (permalink / raw)
  To: Liu Bo; +Cc: Chandan Rajendra, clm, jbacik, dsterba, linux-btrfs, chandan

On Tue, Jun 23, 2015 at 04:37:48PM +0800, Liu Bo wrote:
...
> > | - clear BLK_STATE_IO    | - clear BLK_STATE_IO    |             |
> > | - page_read_complete    | - page_read_complete    |             |
> > |   - returns true        |   - returns true        |             |
> > |   - unlock_page()       |                         |             |
> > |                         |                         | lock_page() |
> > |                         |   - unlock_page()       |             |
> > |-------------------------+-------------------------+-------------|
> > 
> > So we end up incorrectly unlocking the page twice and "Task C" ends up working
> > on an unlocked page. So private->io_lock makes sure that only one of the tasks
> > gets "true" as the return value when page_read_complete() is invoked. As an
> > optimization the patch gets the io_lock only when nr_sectors counter reaches
> > the value 0 (i.e. when the last block of the bio_vec is being processed).
> > Please let me know if my analysis was incorrect.
> 
> Thanks for the nice explanation, it looks reasonable to me.

Please don't hesitate to add your Reviewed-by if you spent time on the patch
and think it's OK; this really helps when making decisions about merging.

* Re: [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read.
  2016-02-10 10:39       ` David Sterba
@ 2016-02-11  5:42         ` Chandan Rajendra
  0 siblings, 0 replies; 47+ messages in thread
From: Chandan Rajendra @ 2016-02-11  5:42 UTC (permalink / raw)
  To: dsterba; +Cc: bo.li.liu, clm, jbacik, linux-btrfs, chandan

On Wednesday 10 Feb 2016 11:39:25 David Sterba wrote:
> 
> The explanations and the table would be good in the changelog and as
> comments. I think we'll need to consider the smaller blocks more often
> so some examples and locking rules would be useful, eg. documented in
> this file.

David, I agree.  As suggested, I will add the documentation to the commit
message and as comments in the code.

-- 
chandan



Thread overview: 47+ messages
2015-06-01 15:22 [RFC PATCH V11 00/21] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size Chandan Rajendra
2015-06-01 15:22 ` [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read Chandan Rajendra
2015-06-19  4:45   ` Liu Bo
2015-06-19  9:45     ` Chandan Rajendra
2015-06-23  8:37       ` Liu Bo
2016-02-10 10:44         ` David Sterba
2016-02-10 10:39       ` David Sterba
2016-02-11  5:42         ` Chandan Rajendra
2015-06-01 15:22 ` [RFC PATCH V11 02/21] Btrfs: subpagesize-blocksize: Fix whole page write Chandan Rajendra
2015-06-26  9:50   ` Liu Bo
2015-06-29  8:54     ` Chandan Rajendra
2015-07-01 14:27       ` Liu Bo
2015-06-01 15:22 ` [RFC PATCH V11 03/21] Btrfs: subpagesize-blocksize: __btrfs_buffered_write: Reserve/release extents aligned to block size Chandan Rajendra
2015-06-01 15:22 ` [RFC PATCH V11 04/21] Btrfs: subpagesize-blocksize: Define extent_buffer_head Chandan Rajendra
2015-07-01 14:33   ` Liu Bo
2015-06-01 15:22 ` [RFC PATCH V11 05/21] Btrfs: subpagesize-blocksize: Read tree blocks whose size is < PAGE_SIZE Chandan Rajendra
2015-07-01 14:40   ` Liu Bo
2015-07-03 10:02     ` Chandan Rajendra
2015-06-01 15:22 ` [RFC PATCH V11 06/21] Btrfs: subpagesize-blocksize: Write only dirty extent buffers belonging to a page Chandan Rajendra
2015-06-01 15:22 ` [RFC PATCH V11 07/21] Btrfs: subpagesize-blocksize: Allow mounting filesystems where sectorsize != PAGE_SIZE Chandan Rajendra
2015-06-01 15:22 ` [RFC PATCH V11 08/21] Btrfs: subpagesize-blocksize: Compute and look up csums based on sectorsized blocks Chandan Rajendra
2015-07-01 14:37   ` Liu Bo
2015-06-01 15:22 ` [RFC PATCH V11 09/21] Btrfs: subpagesize-blocksize: Direct I/O read: Work " Chandan Rajendra
2015-07-01 14:45   ` Liu Bo
2015-07-03 10:05     ` Chandan Rajendra
2015-06-01 15:22 ` [RFC PATCH V11 10/21] Btrfs: subpagesize-blocksize: fallocate: Work with sectorsized units Chandan Rajendra
2015-06-01 15:22 ` [RFC PATCH V11 11/21] Btrfs: subpagesize-blocksize: btrfs_page_mkwrite: Reserve space in " Chandan Rajendra
2015-07-06  3:18   ` Liu Bo
2015-06-01 15:22 ` [RFC PATCH V11 12/21] Btrfs: subpagesize-blocksize: Search for all ordered extents that could span across a page Chandan Rajendra
2015-07-01 14:47   ` Liu Bo
2015-07-03 10:08     ` Chandan Rajendra
2015-07-06  3:17       ` Liu Bo
2015-07-06 10:49         ` Chandan Rajendra
2015-06-01 15:22 ` [RFC PATCH V11 13/21] Btrfs: subpagesize-blocksize: Deal with partial ordered extent allocations Chandan Rajendra
2015-07-06 10:06   ` Liu Bo
2015-07-07 13:38     ` Chandan Rajendra
2015-06-01 15:22 ` [RFC PATCH V11 14/21] Btrfs: subpagesize-blocksize: Explicitly Track I/O status of blocks of an ordered extent Chandan Rajendra
2015-07-20  8:34   ` Liu Bo
2015-07-20 12:54     ` Chandan Rajendra
2015-06-01 15:22 ` [RFC PATCH V11 15/21] Btrfs: subpagesize-blocksize: Revert commit fc4adbff823f76577ece26dcb88bf6f8392dbd43 Chandan Rajendra
2015-06-01 15:22 ` [RFC PATCH V11 16/21] Btrfs: subpagesize-blocksize: Prevent writes to an extent buffer when PG_writeback flag is set Chandan Rajendra
2015-06-01 15:22 ` [RFC PATCH V11 17/21] Btrfs: subpagesize-blocksize: Use (eb->start, seq) as search key for tree modification log Chandan Rajendra
2015-07-20 14:46   ` Liu Bo
2015-06-01 15:22 ` [RFC PATCH V11 18/21] Btrfs: subpagesize-blocksize: btrfs_submit_direct_hook: Handle map_length < bio vector length Chandan Rajendra
2015-06-01 15:22 ` [RFC PATCH V11 19/21] Revert "btrfs: fix lockups from btrfs_clear_path_blocking" Chandan Rajendra
2015-06-01 15:22 ` [RFC PATCH V11 20/21] Btrfs: subpagesize-blockssize: Limit inline extents to root->sectorsize Chandan Rajendra
2015-06-01 15:22 ` [RFC PATCH V11 21/21] Btrfs: subpagesize-blocksize: Fix block size returned to user space Chandan Rajendra
