* [PATCH 0/7] Patches to support subpagesize blocksize
@ 2013-12-11 23:38 Chandra Seetharaman
From: Chandra Seetharaman @ 2013-12-11 23:38 UTC (permalink / raw)
  To: linux-btrfs, chandra_pdx; +Cc: Chandra Seetharaman

In btrfs, the blocksize, the basic IO size of the filesystem,
cannot currently be smaller than PAGE_SIZE.

But some 64-bit architectures, like PPC64 and ARM64, have a default
PAGE_SIZE of 64K, which means filesystems created on these systems
end up with a blocksize of 64K.

This works fine as long as the filesystem is created and used on
these systems.

In other words, one cannot create a filesystem on some other
architecture and use it on PPC64 or ARM64, and vice versa.

Another restriction is that ext? filesystems on these architectures
cannot be converted to btrfs, since ext? filesystems have a blocksize
of 4K.

Sometime last year, Wade Cline posted a patch
(http://lwn.net/Articles/529682/). I started testing it and found
many locking/race issues, so I changed the logic and created an
extent_buffer_head that holds an array of the extent buffers that
belong to a page.
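
For reference, here is a simplified sketch of the resulting layout
(field names follow patch 1; locking and debug fields omitted):

	/* One per page; carries state shared by all buffers in it. */
	struct extent_buffer_head {
		unsigned long bflags;
		atomic_t refs;
		atomic_t io_pages;
		struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
		/* one slot per block that can live in the page */
		struct extent_buffer extent_buf[MAX_EXTENT_BUFFERS_PER_PAGE];
	};

With a 64K PAGE_SIZE and a 4K blocksize, one extent_buffer_head
covers one page and up to 16 of its extent buffers.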

There are a few wrinkles in this patchset: some xfstests fail, which
could be due to me handling the blocksize and PAGE_SIZE incorrectly
somewhere in these patches.

I would like to get some feedback and review comments.

Thanks,

Chandra

---

Chandra Seetharaman (7):
  btrfs: subpagesize-blocksize: Define extent_buffer_head
  btrfs: subpagesize-blocksize: Use a global alignment for size
  btrfs: subpagesize-blocksize: Handle small extent maps properly
  btrfs: subpagesize-blocksize: Handle iosize properly in submit_extent_page()
  btrfs: subpagesize-blocksize: handle checksum calculations properly
  btrfs: subpagesize-blocksize: Handle relocation clusters appropriately
  btrfs: subpagesize-blocksize: Allow mounting filesystems where sectorsize != PAGE_SIZE

 fs/btrfs/backref.c           |   6 +-
 fs/btrfs/btrfs_inode.h       |   7 +
 fs/btrfs/compression.c       |   3 +-
 fs/btrfs/ctree.c             |   2 +-
 fs/btrfs/ctree.h             |   6 +-
 fs/btrfs/disk-io.c           | 115 ++++++-----
 fs/btrfs/extent-tree.c       |  18 +-
 fs/btrfs/extent_io.c         | 449 ++++++++++++++++++++++++++-----------------
 fs/btrfs/extent_io.h         |  55 ++++--
 fs/btrfs/file-item.c         |  45 ++++-
 fs/btrfs/file.c              |  15 +-
 fs/btrfs/inode.c             |  75 +++++---
 fs/btrfs/ioctl.c             |   6 +-
 fs/btrfs/ordered-data.c      |   2 +-
 fs/btrfs/relocation.c        |   6 +-
 fs/btrfs/tree-log.c          |   2 +-
 fs/btrfs/volumes.c           |   2 +-
 include/trace/events/btrfs.h |   2 +-
 18 files changed, 515 insertions(+), 301 deletions(-)

-- 
1.7.12.4



* [PATCH 1/7] btrfs: subpagesize-blocksize: Define extent_buffer_head
From: Chandra Seetharaman @ 2013-12-11 23:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Chandra Seetharaman

In order to handle multiple extent buffers per page, we first need
a way to track all the extent buffers that are attached to a page.

This patch creates a new data structure, extent_buffer_head, and
moves the fields that are common to all extent buffers in a page
from struct extent_buffer to the new structure.

It also adds the changes needed to handle the
multiple-extent-buffers-per-page case.
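
Getting from an extent buffer back to its head is pure pointer
arithmetic over the extent_buf[] array; this is a condensed sketch of
the eb_head() helper the patch adds to extent_io.h:

	static inline struct extent_buffer_head *eb_head(struct extent_buffer *eb)
	{
		/* index of this block within its page */
		int index = (eb->start & (PAGE_CACHE_SIZE - 1)) >> (ffs(eb->len) - 1);
		/* step back to extent_buf[0], then to the containing head */
		struct extent_buffer *eb_base = eb - index;

		return (struct extent_buffer_head *)((char *)eb_base -
			offsetof(struct extent_buffer_head, extent_buf));
	}

Callers that used to touch per-page state through the buffer now go
through the head, e.g. atomic_inc(&eb_head(eb)->refs) instead of
atomic_inc(&eb->refs).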

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
---
 fs/btrfs/backref.c           |   6 +-
 fs/btrfs/ctree.c             |   2 +-
 fs/btrfs/ctree.h             |   6 +-
 fs/btrfs/disk-io.c           | 109 +++++++----
 fs/btrfs/extent-tree.c       |   6 +-
 fs/btrfs/extent_io.c         | 429 +++++++++++++++++++++++++++----------------
 fs/btrfs/extent_io.h         |  55 ++++--
 fs/btrfs/volumes.c           |   2 +-
 include/trace/events/btrfs.h |   2 +-
 9 files changed, 390 insertions(+), 227 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 3775947..af1943f 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1283,7 +1283,7 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
 		eb = path->nodes[0];
 		/* make sure we can use eb after releasing the path */
 		if (eb != eb_in) {
-			atomic_inc(&eb->refs);
+			atomic_inc(&eb_head(eb)->refs);
 			btrfs_tree_read_lock(eb);
 			btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
 		}
@@ -1616,7 +1616,7 @@ static int iterate_inode_refs(u64 inum, struct btrfs_root *fs_root,
 		slot = path->slots[0];
 		eb = path->nodes[0];
 		/* make sure we can use eb after releasing the path */
-		atomic_inc(&eb->refs);
+		atomic_inc(&eb_head(eb)->refs);
 		btrfs_tree_read_lock(eb);
 		btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
 		btrfs_release_path(path);
@@ -1676,7 +1676,7 @@ static int iterate_inode_extrefs(u64 inum, struct btrfs_root *fs_root,
 		slot = path->slots[0];
 		eb = path->nodes[0];
 		/* make sure we can use eb after releasing the path */
-		atomic_inc(&eb->refs);
+		atomic_inc(&eb_head(eb)->refs);
 
 		btrfs_tree_read_lock(eb);
 		btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 316136b..611b27e 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -170,7 +170,7 @@ struct extent_buffer *btrfs_root_node(struct btrfs_root *root)
 		 * the inc_not_zero dance and if it doesn't work then
 		 * synchronize_rcu and try again.
 		 */
-		if (atomic_inc_not_zero(&eb->refs)) {
+		if (atomic_inc_not_zero(&eb_head(eb)->refs)) {
 			rcu_read_unlock();
 			break;
 		}
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 54ab861..02de448 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2106,14 +2106,16 @@ static inline void btrfs_set_token_##name(struct extent_buffer *eb,	\
 #define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits)		\
 static inline u##bits btrfs_##name(struct extent_buffer *eb)		\
 {									\
-	type *p = page_address(eb->pages[0]);				\
+	type *p = page_address(eb_head(eb)->pages[0]) +			\
+				(eb->start & (PAGE_CACHE_SIZE -1));	\
 	u##bits res = le##bits##_to_cpu(p->member);			\
 	return res;							\
 }									\
 static inline void btrfs_set_##name(struct extent_buffer *eb,		\
 				    u##bits val)			\
 {									\
-	type *p = page_address(eb->pages[0]);				\
+	type *p = page_address(eb_head(eb)->pages[0]) +			\
+				(eb->start & (PAGE_CACHE_SIZE -1));	\
 	p->member = cpu_to_le##bits(val);				\
 }
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8072cfa..ca1526d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -411,7 +411,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 	int mirror_num = 0;
 	int failed_mirror = 0;
 
-	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_CORRUPT, &eb_head(eb)->bflags);
 	io_tree = &BTRFS_I(root->fs_info->btree_inode)->io_tree;
 	while (1) {
 		ret = read_extent_buffer_pages(io_tree, eb, start,
@@ -430,7 +430,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 		 * there is no reason to read the other copies, they won't be
 		 * any less wrong.
 		 */
-		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags))
+		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb_head(eb)->bflags))
 			break;
 
 		num_copies = btrfs_num_copies(root->fs_info,
@@ -440,7 +440,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 
 		if (!failed_mirror) {
 			failed = 1;
-			failed_mirror = eb->read_mirror;
+			failed_mirror = eb_head(eb)->read_mirror;
 		}
 
 		mirror_num++;
@@ -465,19 +465,22 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 static int csum_dirty_buffer(struct btrfs_root *root, struct page *page)
 {
 	struct extent_io_tree *tree;
-	u64 start = page_offset(page);
 	u64 found_start;
 	struct extent_buffer *eb;
+	struct extent_buffer_head *eb_head;
 
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
 
-	eb = (struct extent_buffer *)page->private;
-	if (page != eb->pages[0])
+	eb_head = (struct extent_buffer_head *)page->private;
+	if (page != eb_head->pages[0])
 		return 0;
-	found_start = btrfs_header_bytenr(eb);
-	if (WARN_ON(found_start != start || !PageUptodate(page)))
+	if (WARN_ON(!PageUptodate(page)))
 		return 0;
-	csum_tree_block(root, eb, 0);
+	for (eb = &eb_head->extent_buf[0]; eb->start; eb++) {
+		found_start = btrfs_header_bytenr(eb);
+		if (found_start == eb->start)
+			csum_tree_block(root, eb, 0);
+	}
 	return 0;
 }
 
@@ -575,25 +578,34 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	struct extent_buffer *eb;
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
 	int ret = 0;
-	int reads_done;
+	int reads_done = 0;
+	struct extent_buffer_head *eb_head;
 
 	if (!page->private)
 		goto out;
 
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
-	eb = (struct extent_buffer *)page->private;
+	eb_head = (struct extent_buffer_head *)page->private;
+
+	/* Get the eb corresponding to this IO */
+	eb = eb_head->io_eb;
+	if (!eb) {
+		ret = -EIO;
+		goto err;
+	}
+	eb_head->io_eb = NULL;
 
 	/* the pending IO might have been the only thing that kept this buffer
 	 * in memory.  Make sure we have a ref for all this other checks
 	 */
 	extent_buffer_get(eb);
 
-	reads_done = atomic_dec_and_test(&eb->io_pages);
+	reads_done = atomic_dec_and_test(&eb_head->io_pages);
 	if (!reads_done)
 		goto err;
 
-	eb->read_mirror = mirror;
-	if (test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
+	eb_head->read_mirror = mirror;
+	if (test_bit(EXTENT_BUFFER_IOERR, &eb_head->bflags)) {
 		ret = -EIO;
 		goto err;
 	}
@@ -635,7 +647,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	 * return -EIO.
 	 */
 	if (found_level == 0 && check_leaf(root, eb)) {
-		set_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
+		set_bit(EXTENT_BUFFER_CORRUPT, &eb_head->bflags);
 		ret = -EIO;
 	}
 
@@ -643,7 +655,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 		set_extent_buffer_uptodate(eb);
 err:
 	if (reads_done &&
-	    test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
+	    test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb_head->bflags))
 		btree_readahead_hook(root, eb, eb->start, ret);
 
 	if (ret) {
@@ -652,7 +664,7 @@ err:
 		 * again, we have to make sure it has something
 		 * to decrement
 		 */
-		atomic_inc(&eb->io_pages);
+		atomic_inc(&eb_head->io_pages);
 		clear_extent_buffer_uptodate(eb);
 	}
 	free_extent_buffer(eb);
@@ -662,15 +674,22 @@ out:
 
 static int btree_io_failed_hook(struct page *page, int failed_mirror)
 {
+	struct extent_buffer_head *eb_head
+			=  (struct extent_buffer_head *)page->private;
 	struct extent_buffer *eb;
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
 
-	eb = (struct extent_buffer *)page->private;
-	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
-	eb->read_mirror = failed_mirror;
-	atomic_dec(&eb->io_pages);
-	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
+	set_bit(EXTENT_BUFFER_IOERR, &eb_head->bflags);
+	eb_head->read_mirror = failed_mirror;
+	atomic_dec(&eb_head->io_pages);
+	/* Get the eb corresponding to this IO */
+	eb = eb_head->io_eb;
+	if (!eb)
+		goto out;
+	eb_head->io_eb = NULL;
+	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb_head->bflags))
 		btree_readahead_hook(root, eb, eb->start, -EIO);
+out:
 	return -EIO;	/* we fixed nothing */
 }
 
@@ -1021,14 +1040,20 @@ static void btree_invalidatepage(struct page *page, unsigned int offset,
 static int btree_set_page_dirty(struct page *page)
 {
 #ifdef DEBUG
+	struct extent_buffer_head *ebh;
 	struct extent_buffer *eb;
+	int i, dirty = 0;
 
 	BUG_ON(!PagePrivate(page));
-	eb = (struct extent_buffer *)page->private;
-	BUG_ON(!eb);
-	BUG_ON(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
-	BUG_ON(!atomic_read(&eb->refs));
-	btrfs_assert_tree_locked(eb);
+	ebh = (struct extent_buffer_head *)page->private;
+	BUG_ON(!ebh);
+	for (i = 0; i < MAX_EXTENT_BUFFERS_PER_PAGE && !dirty; i++) {
+		eb = &ebh->extent_buf[i];
+		dirty = test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
+	}
+	BUG_ON(dirty);
+	BUG_ON(!atomic_read(&ebh->refs));
+	btrfs_assert_tree_locked(&ebh->extent_buf[0]);
 #endif
 	return __set_page_dirty_nobuffers(page);
 }
@@ -1072,7 +1097,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr, u32 blocksize,
 	if (!buf)
 		return 0;
 
-	set_bit(EXTENT_BUFFER_READAHEAD, &buf->bflags);
+	set_bit(EXTENT_BUFFER_READAHEAD, &eb_head(buf)->bflags);
 
 	ret = read_extent_buffer_pages(io_tree, buf, 0, WAIT_PAGE_LOCK,
 				       btree_get_extent, mirror_num);
@@ -1081,7 +1106,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr, u32 blocksize,
 		return ret;
 	}
 
-	if (test_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags)) {
+	if (test_bit(EXTENT_BUFFER_CORRUPT, &eb_head(buf)->bflags)) {
 		free_extent_buffer(buf);
 		return -EIO;
 	} else if (extent_buffer_uptodate(buf)) {
@@ -1115,14 +1140,16 @@ struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
 
 int btrfs_write_tree_block(struct extent_buffer *buf)
 {
-	return filemap_fdatawrite_range(buf->pages[0]->mapping, buf->start,
+	return filemap_fdatawrite_range(eb_head(buf)->pages[0]->mapping,
+					buf->start,
 					buf->start + buf->len - 1);
 }
 
 int btrfs_wait_tree_block_writeback(struct extent_buffer *buf)
 {
-	return filemap_fdatawait_range(buf->pages[0]->mapping,
-				       buf->start, buf->start + buf->len - 1);
+	return filemap_fdatawait_range(eb_head(buf)->pages[0]->mapping,
+					buf->start,
+					buf->start + buf->len - 1);
 }
 
 struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
@@ -1153,7 +1180,8 @@ void clean_tree_block(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	    fs_info->running_transaction->transid) {
 		btrfs_assert_tree_locked(buf);
 
-		if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &buf->bflags)) {
+		if (test_and_clear_bit(EXTENT_BUFFER_DIRTY,
+						&buf->ebflags)) {
 			__percpu_counter_add(&fs_info->dirty_metadata_bytes,
 					     -buf->len,
 					     fs_info->dirty_metadata_batch);
@@ -2613,7 +2641,8 @@ int open_ctree(struct super_block *sb,
 					   btrfs_super_chunk_root(disk_super),
 					   blocksize, generation);
 	if (!chunk_root->node ||
-	    !test_bit(EXTENT_BUFFER_UPTODATE, &chunk_root->node->bflags)) {
+	    !test_bit(EXTENT_BUFFER_UPTODATE,
+					&eb_head(chunk_root->node)->bflags)) {
 		printk(KERN_WARNING "btrfs: failed to read chunk root on %s\n",
 		       sb->s_id);
 		goto fail_tree_roots;
@@ -2652,7 +2681,8 @@ retry_root_backup:
 					  btrfs_super_root(disk_super),
 					  blocksize, generation);
 	if (!tree_root->node ||
-	    !test_bit(EXTENT_BUFFER_UPTODATE, &tree_root->node->bflags)) {
+	    !test_bit(EXTENT_BUFFER_UPTODATE,
+					&eb_head(tree_root->node)->bflags)) {
 		printk(KERN_WARNING "btrfs: failed to read tree root on %s\n",
 		       sb->s_id);
 
@@ -3619,7 +3649,7 @@ int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
 			  int atomic)
 {
 	int ret;
-	struct inode *btree_inode = buf->pages[0]->mapping->host;
+	struct inode *btree_inode = eb_head(buf)->pages[0]->mapping->host;
 
 	ret = extent_buffer_uptodate(buf);
 	if (!ret)
@@ -3652,7 +3682,7 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
 	if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, &buf->bflags)))
 		return;
 #endif
-	root = BTRFS_I(buf->pages[0]->mapping->host)->root;
+	root = BTRFS_I(eb_head(buf)->pages[0]->mapping->host)->root;
 	btrfs_assert_tree_locked(buf);
 	if (transid != root->fs_info->generation)
 		WARN(1, KERN_CRIT "btrfs transid mismatch buffer %llu, "
@@ -3701,7 +3731,8 @@ void btrfs_btree_balance_dirty_nodelay(struct btrfs_root *root)
 
 int btrfs_read_buffer(struct extent_buffer *buf, u64 parent_transid)
 {
-	struct btrfs_root *root = BTRFS_I(buf->pages[0]->mapping->host)->root;
+	struct btrfs_root *root =
+			BTRFS_I(eb_head(buf)->pages[0]->mapping->host)->root;
 	return btree_read_extent_buffer_pages(root, buf, 0, parent_transid);
 }
 
@@ -3938,7 +3969,7 @@ static int btrfs_destroy_marked_extents(struct btrfs_root *root,
 			wait_on_extent_buffer_writeback(eb);
 
 			if (test_and_clear_bit(EXTENT_BUFFER_DIRTY,
-					       &eb->bflags))
+					       &eb->ebflags))
 				clear_extent_buffer_dirty(eb);
 			free_extent_buffer_stale(eb);
 		}
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 45d98d0..79cf87f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6019,7 +6019,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 			goto out;
 		}
 
-		WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
+		WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->ebflags));
 
 		btrfs_add_free_space(cache, buf->start, buf->len);
 		btrfs_update_reserved_bytes(cache, buf->len, RESERVE_FREE);
@@ -6036,7 +6036,7 @@ out:
 	 * Deleting the buffer, clear the corrupt flag since it doesn't matter
 	 * anymore.
 	 */
-	clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
+	clear_bit(EXTENT_BUFFER_CORRUPT, &eb_head(buf)->bflags);
 	btrfs_put_block_group(cache);
 }
 
@@ -6910,7 +6910,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	btrfs_set_buffer_lockdep_class(root->root_key.objectid, buf, level);
 	btrfs_tree_lock(buf);
 	clean_tree_block(trans, root, buf);
-	clear_bit(EXTENT_BUFFER_STALE, &buf->bflags);
+	clear_bit(EXTENT_BUFFER_STALE, &eb_head(buf)->bflags);
 
 	btrfs_set_lock_blocking(buf);
 	btrfs_set_buffer_uptodate(buf);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ff43802..a1a849b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -54,8 +54,10 @@ void btrfs_leak_debug_del(struct list_head *entry)
 static inline
 void btrfs_leak_debug_check(void)
 {
+	int i;
 	struct extent_state *state;
 	struct extent_buffer *eb;
+	struct extent_buffer_head *ebh;
 
 	while (!list_empty(&states)) {
 		state = list_entry(states.next, struct extent_state, leak_list);
@@ -68,12 +70,17 @@ void btrfs_leak_debug_check(void)
 	}
 
 	while (!list_empty(&buffers)) {
-		eb = list_entry(buffers.next, struct extent_buffer, leak_list);
-		printk(KERN_ERR "btrfs buffer leak start %llu len %lu "
-		       "refs %d\n",
-		       eb->start, eb->len, atomic_read(&eb->refs));
-		list_del(&eb->leak_list);
-		kmem_cache_free(extent_buffer_cache, eb);
+		ebh = list_entry(buffers.next, struct extent_buffer_head, leak_list);
+		printk(KERN_ERR "btrfs buffer leak ");
+		for (i = 0; i < MAX_EXTENT_BUFFERS_PER_PAGE; i++) {
+			eb = &ebh->extent_buf[i];
+			if (!eb->start)
+				break;
+			printk(KERN_ERR "eb %p %llu:%lu ", eb, eb->start, eb->len);
+		}
+		printk(KERN_ERR "refs %d\n", atomic_read(&ebh->refs));
+		list_del(&ebh->leak_list);
+		kmem_cache_free(extent_buffer_cache, ebh);
 	}
 }
 
@@ -136,7 +143,7 @@ int __init extent_io_init(void)
 		return -ENOMEM;
 
 	extent_buffer_cache = kmem_cache_create("btrfs_extent_buffer",
-			sizeof(struct extent_buffer), 0,
+			sizeof(struct extent_buffer_head), 0,
 			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
 	if (!extent_buffer_cache)
 		goto free_state_cache;
@@ -2023,7 +2030,7 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 			 int mirror_num)
 {
-	u64 start = eb->start;
+	u64 start = eb_head(eb)->extent_buf[0].start;
 	unsigned long i, num_pages = num_extent_pages(eb->start, eb->len);
 	int ret = 0;
 
@@ -2680,15 +2687,15 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree,
 	return ret;
 }
 
-static void attach_extent_buffer_page(struct extent_buffer *eb,
+static void attach_extent_buffer_page(struct extent_buffer_head *ebh,
 				      struct page *page)
 {
 	if (!PagePrivate(page)) {
 		SetPagePrivate(page);
 		page_cache_get(page);
-		set_page_private(page, (unsigned long)eb);
+		set_page_private(page, (unsigned long)ebh);
 	} else {
-		WARN_ON(page->private != (unsigned long)eb);
+		WARN_ON(page->private != (unsigned long)ebh);
 	}
 }
 
@@ -3327,17 +3334,19 @@ static int eb_wait(void *word)
 
 void wait_on_extent_buffer_writeback(struct extent_buffer *eb)
 {
-	wait_on_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK, eb_wait,
+	wait_on_bit(&eb_head(eb)->bflags, EXTENT_BUFFER_WRITEBACK, eb_wait,
 		    TASK_UNINTERRUPTIBLE);
 }
 
-static int lock_extent_buffer_for_io(struct extent_buffer *eb,
+static int lock_extent_buffer_for_io(struct extent_buffer_head *ebh,
 				     struct btrfs_fs_info *fs_info,
 				     struct extent_page_data *epd)
 {
 	unsigned long i, num_pages;
 	int flush = 0;
+	bool dirty = false, dirty_arr[MAX_EXTENT_BUFFERS_PER_PAGE];
 	int ret = 0;
+	struct extent_buffer *eb = &ebh->extent_buf[0], *ebtemp;
 
 	if (!btrfs_try_tree_write_lock(eb)) {
 		flush = 1;
@@ -3345,7 +3354,7 @@ static int lock_extent_buffer_for_io(struct extent_buffer *eb,
 		btrfs_tree_lock(eb);
 	}
 
-	if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) {
+	if (test_bit(EXTENT_BUFFER_WRITEBACK, &ebh->bflags)) {
 		btrfs_tree_unlock(eb);
 		if (!epd->sync_io)
 			return 0;
@@ -3356,7 +3365,7 @@ static int lock_extent_buffer_for_io(struct extent_buffer *eb,
 		while (1) {
 			wait_on_extent_buffer_writeback(eb);
 			btrfs_tree_lock(eb);
-			if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags))
+			if (!test_bit(EXTENT_BUFFER_WRITEBACK, &ebh->bflags))
 				break;
 			btrfs_tree_unlock(eb);
 		}
@@ -3367,17 +3376,27 @@ static int lock_extent_buffer_for_io(struct extent_buffer *eb,
 	 * under IO since we can end up having no IO bits set for a short period
 	 * of time.
 	 */
-	spin_lock(&eb->refs_lock);
-	if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
-		set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
-		spin_unlock(&eb->refs_lock);
-		btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
-		__percpu_counter_add(&fs_info->dirty_metadata_bytes,
-				     -eb->len,
+	spin_lock(&ebh->refs_lock);
+	for (i = 0; i < MAX_EXTENT_BUFFERS_PER_PAGE; i++) {
+		ebtemp = &ebh->extent_buf[i];
+		dirty_arr[i] |= test_and_clear_bit(EXTENT_BUFFER_DIRTY, &ebtemp->ebflags);
+		dirty = dirty || dirty_arr[i];
+	}
+	if (dirty) {
+		set_bit(EXTENT_BUFFER_WRITEBACK, &ebh->bflags);
+		spin_unlock(&ebh->refs_lock);
+		for (i = 0; i < MAX_EXTENT_BUFFERS_PER_PAGE; i++) {
+			if (dirty_arr[i] == false)
+				continue;
+			ebtemp = &ebh->extent_buf[i];
+			btrfs_set_header_flag(ebtemp, BTRFS_HEADER_FLAG_WRITTEN);
+			__percpu_counter_add(&fs_info->dirty_metadata_bytes,
+				     -ebtemp->len,
 				     fs_info->dirty_metadata_batch);
+		}
 		ret = 1;
 	} else {
-		spin_unlock(&eb->refs_lock);
+		spin_unlock(&ebh->refs_lock);
 	}
 
 	btrfs_tree_unlock(eb);
@@ -3401,30 +3420,30 @@ static int lock_extent_buffer_for_io(struct extent_buffer *eb,
 	return ret;
 }
 
-static void end_extent_buffer_writeback(struct extent_buffer *eb)
+static void end_extent_buffer_writeback(struct extent_buffer_head *ebh)
 {
-	clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_WRITEBACK, &ebh->bflags);
 	smp_mb__after_clear_bit();
-	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
+	wake_up_bit(&ebh->bflags, EXTENT_BUFFER_WRITEBACK);
 }
 
 static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 {
 	int uptodate = err == 0;
 	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
-	struct extent_buffer *eb;
+	struct extent_buffer_head *ebh;
 	int done;
 
 	do {
 		struct page *page = bvec->bv_page;
 
 		bvec--;
-		eb = (struct extent_buffer *)page->private;
-		BUG_ON(!eb);
-		done = atomic_dec_and_test(&eb->io_pages);
+		ebh = (struct extent_buffer_head *)page->private;
+		BUG_ON(!ebh);
+		done = atomic_dec_and_test(&ebh->io_pages);
 
-		if (!uptodate || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+		if (!uptodate || test_bit(EXTENT_BUFFER_IOERR, &ebh->bflags)) {
+			set_bit(EXTENT_BUFFER_IOERR, &ebh->bflags);
 			ClearPageUptodate(page);
 			SetPageError(page);
 		}
@@ -3434,7 +3453,7 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
 		if (!done)
 			continue;
 
-		end_extent_buffer_writeback(eb);
+		end_extent_buffer_writeback(ebh);
 	} while (bvec >= bio->bi_io_vec);
 
 	bio_put(bio);
@@ -3447,15 +3466,15 @@ static int write_one_eb(struct extent_buffer *eb,
 			struct extent_page_data *epd)
 {
 	struct block_device *bdev = fs_info->fs_devices->latest_bdev;
-	u64 offset = eb->start;
+	u64 offset = eb->start & ~(PAGE_CACHE_SIZE - 1);
 	unsigned long i, num_pages;
 	unsigned long bio_flags = 0;
 	int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
 	int ret = 0;
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+	clear_bit(EXTENT_BUFFER_IOERR, &eb_head(eb)->bflags);
 	num_pages = num_extent_pages(eb->start, eb->len);
-	atomic_set(&eb->io_pages, num_pages);
+	atomic_set(&eb_head(eb)->io_pages, num_pages);
 	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
 		bio_flags = EXTENT_BIO_TREE_LOG;
 
@@ -3464,16 +3483,17 @@ static int write_one_eb(struct extent_buffer *eb,
 
 		clear_page_dirty_for_io(p);
 		set_page_writeback(p);
-		ret = submit_extent_page(rw, eb->tree, p, offset >> 9,
+		ret = submit_extent_page(rw, eb_head(eb)->tree, p, offset >> 9,
 					 PAGE_CACHE_SIZE, 0, bdev, &epd->bio,
 					 -1, end_bio_extent_buffer_writepage,
 					 0, epd->bio_flags, bio_flags);
 		epd->bio_flags = bio_flags;
 		if (ret) {
-			set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
+			set_bit(EXTENT_BUFFER_IOERR, &eb_head(eb)->bflags);
 			SetPageError(p);
-			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
-				end_extent_buffer_writeback(eb);
+			if (atomic_sub_and_test(num_pages - i,
+							&eb_head(eb)->io_pages))
+				end_extent_buffer_writeback(eb_head(eb));
 			ret = -EIO;
 			break;
 		}
@@ -3497,7 +3517,8 @@ int btree_write_cache_pages(struct address_space *mapping,
 {
 	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
 	struct btrfs_fs_info *fs_info = BTRFS_I(mapping->host)->root->fs_info;
-	struct extent_buffer *eb, *prev_eb = NULL;
+	struct extent_buffer *eb;
+	struct extent_buffer_head *ebh, *prev_ebh = NULL;
 	struct extent_page_data epd = {
 		.bio = NULL,
 		.tree = tree,
@@ -3554,30 +3575,31 @@ retry:
 				continue;
 			}
 
-			eb = (struct extent_buffer *)page->private;
+			ebh = (struct extent_buffer_head *)page->private;
 
 			/*
 			 * Shouldn't happen and normally this would be a BUG_ON
 			 * but no sense in crashing the users box for something
 			 * we can survive anyway.
 			 */
-			if (WARN_ON(!eb)) {
+			if (WARN_ON(!ebh)) {
 				spin_unlock(&mapping->private_lock);
 				continue;
 			}
 
-			if (eb == prev_eb) {
+			if (ebh == prev_ebh) {
 				spin_unlock(&mapping->private_lock);
 				continue;
 			}
 
-			ret = atomic_inc_not_zero(&eb->refs);
+			ret = atomic_inc_not_zero(&ebh->refs);
 			spin_unlock(&mapping->private_lock);
 			if (!ret)
 				continue;
 
-			prev_eb = eb;
-			ret = lock_extent_buffer_for_io(eb, fs_info, &epd);
+			eb = &ebh->extent_buf[0];
+			prev_ebh = ebh;
+			ret = lock_extent_buffer_for_io(ebh, fs_info, &epd);
 			if (!ret) {
 				free_extent_buffer(eb);
 				continue;
@@ -4257,17 +4279,23 @@ out:
 	return ret;
 }
 
-static void __free_extent_buffer(struct extent_buffer *eb)
+static void __free_extent_buffer(struct extent_buffer_head *ebh)
 {
-	btrfs_leak_debug_del(&eb->leak_list);
-	kmem_cache_free(extent_buffer_cache, eb);
+	btrfs_leak_debug_del(&ebh->leak_list);
+	kmem_cache_free(extent_buffer_cache, ebh);
 }
 
-static int extent_buffer_under_io(struct extent_buffer *eb)
+static int extent_buffer_under_io(struct extent_buffer_head *ebh)
 {
-	return (atomic_read(&eb->io_pages) ||
-		test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags) ||
-		test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
+	int i, dirty = 0;
+	struct extent_buffer *eb;
+
+	for (i = 0; i < MAX_EXTENT_BUFFERS_PER_PAGE && !dirty; i++) {
+		eb = &ebh->extent_buf[i];
+		dirty = test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
+	}
+	return (dirty || atomic_read(&ebh->io_pages) ||
+		test_bit(EXTENT_BUFFER_WRITEBACK, &ebh->bflags));
 }
 
 /*
@@ -4279,9 +4307,10 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb,
 	unsigned long index;
 	unsigned long num_pages;
 	struct page *page;
-	int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
+	struct extent_buffer_head *ebh = eb_head(eb);
+	int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &ebh->bflags);
 
-	BUG_ON(extent_buffer_under_io(eb));
+	BUG_ON(extent_buffer_under_io(ebh));
 
 	num_pages = num_extent_pages(eb->start, eb->len);
 	index = start_idx + num_pages;
@@ -4301,8 +4330,8 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb,
 			 * this eb.
 			 */
 			if (PagePrivate(page) &&
-			    page->private == (unsigned long)eb) {
-				BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
+			    page->private == (unsigned long)ebh) {
+				BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags));
 				BUG_ON(PageDirty(page));
 				BUG_ON(PageWriteback(page));
 				/*
@@ -4330,23 +4359,14 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb,
 static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
 {
 	btrfs_release_extent_buffer_page(eb, 0);
-	__free_extent_buffer(eb);
+	__free_extent_buffer(eb_head(eb));
 }
 
-static struct extent_buffer *__alloc_extent_buffer(struct extent_io_tree *tree,
-						   u64 start,
-						   unsigned long len,
-						   gfp_t mask)
+static void __init_extent_buffer(struct extent_buffer *eb, u64 start,
+				unsigned long len)
 {
-	struct extent_buffer *eb = NULL;
-
-	eb = kmem_cache_zalloc(extent_buffer_cache, mask);
-	if (eb == NULL)
-		return NULL;
 	eb->start = start;
 	eb->len = len;
-	eb->tree = tree;
-	eb->bflags = 0;
 	rwlock_init(&eb->lock);
 	atomic_set(&eb->write_locks, 0);
 	atomic_set(&eb->read_locks, 0);
@@ -4357,12 +4377,27 @@ static struct extent_buffer *__alloc_extent_buffer(struct extent_io_tree *tree,
 	eb->lock_nested = 0;
 	init_waitqueue_head(&eb->write_lock_wq);
 	init_waitqueue_head(&eb->read_lock_wq);
+}
+
+static struct extent_buffer *__alloc_extent_buffer(struct extent_io_tree *tree,
+						   u64 start,
+						   unsigned long len,
+						   gfp_t mask)
+{
+	struct extent_buffer_head *ebh = NULL;
+	struct extent_buffer *eb = NULL;
+	int i, index = -1;
 
-	btrfs_leak_debug_add(&eb->leak_list, &buffers);
+	ebh = kmem_cache_zalloc(extent_buffer_cache, mask);
+	if (ebh == NULL)
+		return NULL;
+	ebh->tree = tree;
+	ebh->bflags = 0;
+	btrfs_leak_debug_add(&ebh->leak_list, &buffers);
 
-	spin_lock_init(&eb->refs_lock);
-	atomic_set(&eb->refs, 1);
-	atomic_set(&eb->io_pages, 0);
+	spin_lock_init(&ebh->refs_lock);
+	atomic_set(&ebh->refs, 1);
+	atomic_set(&ebh->io_pages, 0);
 
 	/*
 	 * Sanity checks, currently the maximum is 64k covered by 16x 4k pages
@@ -4371,6 +4406,34 @@ static struct extent_buffer *__alloc_extent_buffer(struct extent_io_tree *tree,
 		> MAX_INLINE_EXTENT_BUFFER_SIZE);
 	BUG_ON(len > MAX_INLINE_EXTENT_BUFFER_SIZE);
 
+	if (len < PAGE_CACHE_SIZE) {
+		u64 st = start & ~(PAGE_CACHE_SIZE - 1);
+		unsigned long totlen = 0;
+		/*
+		 * Make sure we have enough room to fit extent buffers
+		 * that belong a single page in a single extent_buffer_head.
+		 * If this BUG_ON is tripped, then it means either the
+		 * blocksize, i.e len, is too small or we need to increase
+		 * MAX_EXTENT_BUFFERS_PER_PAGE.
+		 */
+		BUG_ON(len * MAX_EXTENT_BUFFERS_PER_PAGE < PAGE_CACHE_SIZE);
+
+		for (i = 0; i < MAX_EXTENT_BUFFERS_PER_PAGE
+				&& totlen < PAGE_CACHE_SIZE ;
+				i++, st += len, totlen += len) {
+			__init_extent_buffer(&ebh->extent_buf[i], st, len);
+			if (st == start) {
+				index = i;
+				eb = &ebh->extent_buf[i];
+			}
+
+		}
+		BUG_ON(!eb);
+	} else {
+		eb = &ebh->extent_buf[0];
+		__init_extent_buffer(eb, start, len);
+	}
+
 	return eb;
 }
 
@@ -4391,15 +4454,15 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
 			btrfs_release_extent_buffer(new);
 			return NULL;
 		}
-		attach_extent_buffer_page(new, p);
+		attach_extent_buffer_page(eb_head(new), p);
 		WARN_ON(PageDirty(p));
 		SetPageUptodate(p);
-		new->pages[i] = p;
+		eb_head(new)->pages[i] = p;
 	}
 
 	copy_extent_buffer(new, src, 0, 0, src->len);
-	set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
-	set_bit(EXTENT_BUFFER_DUMMY, &new->bflags);
+	set_bit(EXTENT_BUFFER_UPTODATE, &eb_head(new)->bflags);
+	set_bit(EXTENT_BUFFER_DUMMY, &eb_head(new)->bflags);
 
 	return new;
 }
@@ -4415,19 +4478,19 @@ struct extent_buffer *alloc_dummy_extent_buffer(u64 start, unsigned long len)
 		return NULL;
 
 	for (i = 0; i < num_pages; i++) {
-		eb->pages[i] = alloc_page(GFP_NOFS);
-		if (!eb->pages[i])
+		eb_head(eb)->pages[i] = alloc_page(GFP_NOFS);
+		if (!eb_head(eb)->pages[i])
 			goto err;
 	}
 	set_extent_buffer_uptodate(eb);
 	btrfs_set_header_nritems(eb, 0);
-	set_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
+	set_bit(EXTENT_BUFFER_DUMMY, &eb_head(eb)->bflags);
 
 	return eb;
 err:
 	for (; i > 0; i--)
-		__free_page(eb->pages[i - 1]);
-	__free_extent_buffer(eb);
+		__free_page(eb_head(eb)->pages[i - 1]);
+	__free_extent_buffer(eb_head(eb));
 	return NULL;
 }
 
@@ -4454,14 +4517,15 @@ static void check_buffer_tree_ref(struct extent_buffer *eb)
 	 * So bump the ref count first, then set the bit.  If someone
 	 * beat us to it, drop the ref we added.
 	 */
-	refs = atomic_read(&eb->refs);
-	if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
+	refs = atomic_read(&eb_head(eb)->refs);
+	if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF,
+						&eb_head(eb)->bflags))
 		return;
 
-	spin_lock(&eb->refs_lock);
-	if (!test_and_set_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
-		atomic_inc(&eb->refs);
-	spin_unlock(&eb->refs_lock);
+	spin_lock(&eb_head(eb)->refs_lock);
+	if (!test_and_set_bit(EXTENT_BUFFER_TREE_REF, &eb_head(eb)->bflags))
+		atomic_inc(&eb_head(eb)->refs);
+	spin_unlock(&eb_head(eb)->refs_lock);
 }
 
 static void mark_extent_buffer_accessed(struct extent_buffer *eb)
@@ -4481,13 +4545,22 @@ struct extent_buffer *find_extent_buffer(struct extent_io_tree *tree,
 					 		u64 start)
 {
 	struct extent_buffer *eb;
+	struct extent_buffer_head *ebh;
+	int i = 0;
 
 	rcu_read_lock();
-	eb = radix_tree_lookup(&tree->buffer, start >> PAGE_CACHE_SHIFT);
-	if (eb && atomic_inc_not_zero(&eb->refs)) {
+	ebh = radix_tree_lookup(&tree->buffer, start >> PAGE_CACHE_SHIFT);
+	if (ebh && atomic_inc_not_zero(&ebh->refs)) {
 		rcu_read_unlock();
-		mark_extent_buffer_accessed(eb);
-		return eb;
+
+		do {
+			eb = &ebh->extent_buf[i++];
+			if (eb->start == start) {
+				mark_extent_buffer_accessed(eb);
+				return eb;
+			}
+		} while (i < MAX_EXTENT_BUFFERS_PER_PAGE);
+		BUG();
 	}
 	rcu_read_unlock();
 
@@ -4500,8 +4573,8 @@ struct extent_buffer *alloc_extent_buffer(struct extent_io_tree *tree,
 	unsigned long num_pages = num_extent_pages(start, len);
 	unsigned long i;
 	unsigned long index = start >> PAGE_CACHE_SHIFT;
-	struct extent_buffer *eb;
-	struct extent_buffer *exists = NULL;
+	struct extent_buffer *eb, *old_eb = NULL;
+	struct extent_buffer_head *exists = NULL;
 	struct page *p;
 	struct address_space *mapping = tree->mapping;
 	int uptodate = 1;
@@ -4530,13 +4603,20 @@ struct extent_buffer *alloc_extent_buffer(struct extent_io_tree *tree,
 			 * we can just return that one, else we know we can just
 			 * overwrite page->private.
 			 */
-			exists = (struct extent_buffer *)p->private;
+			exists = (struct extent_buffer_head *)p->private;
 			if (atomic_inc_not_zero(&exists->refs)) {
+				int j = 0;
 				spin_unlock(&mapping->private_lock);
 				unlock_page(p);
 				page_cache_release(p);
-				mark_extent_buffer_accessed(exists);
-				goto free_eb;
+				do {
+					old_eb = &exists->extent_buf[j++];
+					if (old_eb->start == start) {
+						mark_extent_buffer_accessed(old_eb);
+						goto free_eb;
+					}
+				} while (j < MAX_EXTENT_BUFFERS_PER_PAGE);
+				BUG();
 			}
 
 			/*
@@ -4547,11 +4627,11 @@ struct extent_buffer *alloc_extent_buffer(struct extent_io_tree *tree,
 			WARN_ON(PageDirty(p));
 			page_cache_release(p);
 		}
-		attach_extent_buffer_page(eb, p);
+		attach_extent_buffer_page(eb_head(eb), p);
 		spin_unlock(&mapping->private_lock);
 		WARN_ON(PageDirty(p));
 		mark_page_accessed(p);
-		eb->pages[i] = p;
+		eb_head(eb)->pages[i] = p;
 		if (!PageUptodate(p))
 			uptodate = 0;
 
@@ -4561,19 +4641,20 @@ struct extent_buffer *alloc_extent_buffer(struct extent_io_tree *tree,
 		 */
 	}
 	if (uptodate)
-		set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
+		set_bit(EXTENT_BUFFER_UPTODATE, &eb_head(eb)->bflags);
 again:
 	ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
 	if (ret)
 		goto free_eb;
 
 	spin_lock(&tree->buffer_lock);
-	ret = radix_tree_insert(&tree->buffer, start >> PAGE_CACHE_SHIFT, eb);
+	ret = radix_tree_insert(&tree->buffer,
+				start >> PAGE_CACHE_SHIFT, eb_head(eb));
 	spin_unlock(&tree->buffer_lock);
 	radix_tree_preload_end();
 	if (ret == -EEXIST) {
-		exists = find_extent_buffer(tree, start);
-		if (exists)
+		old_eb = find_extent_buffer(tree, start);
+		if (old_eb)
 			goto free_eb;
 		else
 			goto again;
@@ -4590,58 +4671,58 @@ again:
 	 * after the extent buffer is in the radix tree so
 	 * it doesn't get lost
 	 */
-	SetPageChecked(eb->pages[0]);
+	SetPageChecked(eb_head(eb)->pages[0]);
 	for (i = 1; i < num_pages; i++) {
 		p = extent_buffer_page(eb, i);
 		ClearPageChecked(p);
 		unlock_page(p);
 	}
-	unlock_page(eb->pages[0]);
+	unlock_page(eb_head(eb)->pages[0]);
 	return eb;
 
 free_eb:
 	for (i = 0; i < num_pages; i++) {
-		if (eb->pages[i])
-			unlock_page(eb->pages[i]);
+		if (eb_head(eb)->pages[i])
+			unlock_page(eb_head(eb)->pages[i]);
 	}
 
-	WARN_ON(!atomic_dec_and_test(&eb->refs));
+	WARN_ON(!atomic_dec_and_test(&eb_head(eb)->refs));
 	btrfs_release_extent_buffer(eb);
-	return exists;
+	return old_eb;
 }
 
 static inline void btrfs_release_extent_buffer_rcu(struct rcu_head *head)
 {
-	struct extent_buffer *eb =
-			container_of(head, struct extent_buffer, rcu_head);
+	struct extent_buffer_head *ebh =
+			container_of(head, struct extent_buffer_head, rcu_head);
 
-	__free_extent_buffer(eb);
+	__free_extent_buffer(ebh);
 }
 
 /* Expects to have eb->eb_lock already held */
-static int release_extent_buffer(struct extent_buffer *eb)
+static int release_extent_buffer(struct extent_buffer_head *ebh)
 {
-	WARN_ON(atomic_read(&eb->refs) == 0);
-	if (atomic_dec_and_test(&eb->refs)) {
-		if (test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags)) {
-			spin_unlock(&eb->refs_lock);
+	WARN_ON(atomic_read(&ebh->refs) == 0);
+	if (atomic_dec_and_test(&ebh->refs)) {
+		if (test_bit(EXTENT_BUFFER_DUMMY, &ebh->bflags)) {
+			spin_unlock(&ebh->refs_lock);
 		} else {
-			struct extent_io_tree *tree = eb->tree;
+			struct extent_io_tree *tree = ebh->tree;
 
-			spin_unlock(&eb->refs_lock);
+			spin_unlock(&ebh->refs_lock);
 
 			spin_lock(&tree->buffer_lock);
 			radix_tree_delete(&tree->buffer,
-					  eb->start >> PAGE_CACHE_SHIFT);
+				ebh->extent_buf[0].start >> PAGE_CACHE_SHIFT);
 			spin_unlock(&tree->buffer_lock);
 		}
 
 		/* Should be safe to release our pages at this point */
-		btrfs_release_extent_buffer_page(eb, 0);
-		call_rcu(&eb->rcu_head, btrfs_release_extent_buffer_rcu);
+		btrfs_release_extent_buffer_page(&ebh->extent_buf[0], 0);
+		call_rcu(&ebh->rcu_head, btrfs_release_extent_buffer_rcu);
 		return 1;
 	}
-	spin_unlock(&eb->refs_lock);
+	spin_unlock(&ebh->refs_lock);
 
 	return 0;
 }
@@ -4650,48 +4731,52 @@ void free_extent_buffer(struct extent_buffer *eb)
 {
 	int refs;
 	int old;
+	struct extent_buffer_head *ebh;
 	if (!eb)
 		return;
 
+	ebh = eb_head(eb);
 	while (1) {
-		refs = atomic_read(&eb->refs);
+		refs = atomic_read(&ebh->refs);
 		if (refs <= 3)
 			break;
-		old = atomic_cmpxchg(&eb->refs, refs, refs - 1);
+		old = atomic_cmpxchg(&ebh->refs, refs, refs - 1);
 		if (old == refs)
 			return;
 	}
 
-	spin_lock(&eb->refs_lock);
-	if (atomic_read(&eb->refs) == 2 &&
-	    test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags))
-		atomic_dec(&eb->refs);
+	spin_lock(&ebh->refs_lock);
+	if (atomic_read(&ebh->refs) == 2 &&
+	    test_bit(EXTENT_BUFFER_DUMMY, &ebh->bflags))
+		atomic_dec(&ebh->refs);
 
-	if (atomic_read(&eb->refs) == 2 &&
-	    test_bit(EXTENT_BUFFER_STALE, &eb->bflags) &&
-	    !extent_buffer_under_io(eb) &&
-	    test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
-		atomic_dec(&eb->refs);
+	if (atomic_read(&ebh->refs) == 2 &&
+	    test_bit(EXTENT_BUFFER_STALE, &ebh->bflags) &&
+	    !extent_buffer_under_io(ebh) &&
+	    test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &ebh->bflags))
+		atomic_dec(&ebh->refs);
 
 	/*
 	 * I know this is terrible, but it's temporary until we stop tracking
 	 * the uptodate bits and such for the extent buffers.
 	 */
-	release_extent_buffer(eb);
+	release_extent_buffer(ebh);
 }
 
 void free_extent_buffer_stale(struct extent_buffer *eb)
 {
+	struct extent_buffer_head *ebh;
 	if (!eb)
 		return;
 
-	spin_lock(&eb->refs_lock);
-	set_bit(EXTENT_BUFFER_STALE, &eb->bflags);
+	ebh = eb_head(eb);
+	spin_lock(&ebh->refs_lock);
+	set_bit(EXTENT_BUFFER_STALE, &ebh->bflags);
 
-	if (atomic_read(&eb->refs) == 2 && !extent_buffer_under_io(eb) &&
-	    test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
-		atomic_dec(&eb->refs);
-	release_extent_buffer(eb);
+	if (atomic_read(&ebh->refs) == 2 && !extent_buffer_under_io(ebh) &&
+	    test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &ebh->bflags))
+		atomic_dec(&ebh->refs);
+	release_extent_buffer(ebh);
 }
 
 void clear_extent_buffer_dirty(struct extent_buffer *eb)
@@ -4721,7 +4806,7 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb)
 		ClearPageError(page);
 		unlock_page(page);
 	}
-	WARN_ON(atomic_read(&eb->refs) == 0);
+	WARN_ON(atomic_read(&eb_head(eb)->refs) == 0);
 }
 
 int set_extent_buffer_dirty(struct extent_buffer *eb)
@@ -4732,11 +4817,11 @@ int set_extent_buffer_dirty(struct extent_buffer *eb)
 
 	check_buffer_tree_ref(eb);
 
-	was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags);
+	was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
 
 	num_pages = num_extent_pages(eb->start, eb->len);
-	WARN_ON(atomic_read(&eb->refs) == 0);
-	WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
+	WARN_ON(atomic_read(&eb_head(eb)->refs) == 0);
+	WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb_head(eb)->bflags));
 
 	for (i = 0; i < num_pages; i++)
 		set_page_dirty(extent_buffer_page(eb, i));
@@ -4749,7 +4834,9 @@ int clear_extent_buffer_uptodate(struct extent_buffer *eb)
 	struct page *page;
 	unsigned long num_pages;
 
-	clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
+	if (!eb || !eb_head(eb))
+		return 0;
+	clear_bit(EXTENT_BUFFER_UPTODATE, &eb_head(eb)->bflags);
 	num_pages = num_extent_pages(eb->start, eb->len);
 	for (i = 0; i < num_pages; i++) {
 		page = extent_buffer_page(eb, i);
@@ -4765,7 +4852,7 @@ int set_extent_buffer_uptodate(struct extent_buffer *eb)
 	struct page *page;
 	unsigned long num_pages;
 
-	set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
+	set_bit(EXTENT_BUFFER_UPTODATE, &eb_head(eb)->bflags);
 	num_pages = num_extent_pages(eb->start, eb->len);
 	for (i = 0; i < num_pages; i++) {
 		page = extent_buffer_page(eb, i);
@@ -4776,7 +4863,7 @@ int set_extent_buffer_uptodate(struct extent_buffer *eb)
 
 int extent_buffer_uptodate(struct extent_buffer *eb)
 {
-	return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
+	return test_bit(EXTENT_BUFFER_UPTODATE, &eb_head(eb)->bflags);
 }
 
 int read_extent_buffer_pages(struct extent_io_tree *tree,
@@ -4794,8 +4881,9 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 	unsigned long num_reads = 0;
 	struct bio *bio = NULL;
 	unsigned long bio_flags = 0;
+	struct extent_buffer_head *ebh = eb_head(eb);
 
-	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
+	if (test_bit(EXTENT_BUFFER_UPTODATE, &ebh->bflags))
 		return 0;
 
 	if (start) {
@@ -4806,6 +4894,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 		start_i = 0;
 	}
 
+recheck:
 	num_pages = num_extent_pages(eb->start, eb->len);
 	for (i = start_i; i < num_pages; i++) {
 		page = extent_buffer_page(eb, i);
@@ -4823,13 +4912,26 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 	}
 	if (all_uptodate) {
 		if (start_i == 0)
-			set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
+			set_bit(EXTENT_BUFFER_UPTODATE, &ebh->bflags);
 		goto unlock_exit;
 	}
 
-	clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
-	eb->read_mirror = 0;
-	atomic_set(&eb->io_pages, num_reads);
+	if (eb_head(eb)->io_eb) {
+		all_uptodate = 1;
+		i = start_i;
+		while (locked_pages > 0) {
+			page = extent_buffer_page(eb, i);
+			i++;
+			unlock_page(page);
+			locked_pages--;
+		}
+		goto recheck;
+	}
+	BUG_ON(eb_head(eb)->io_eb);
+	eb_head(eb)->io_eb = eb;
+	clear_bit(EXTENT_BUFFER_IOERR, &ebh->bflags);
+	ebh->read_mirror = 0;
+	atomic_set(&ebh->io_pages, num_reads);
 	for (i = start_i; i < num_pages; i++) {
 		page = extent_buffer_page(eb, i);
 		if (!PageUptodate(page)) {
@@ -5196,7 +5298,7 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 
 int try_release_extent_buffer(struct page *page)
 {
-	struct extent_buffer *eb;
+	struct extent_buffer_head *ebh;
 
 	/*
 	 * We need to make sure noboody is attaching this page to an eb right
@@ -5208,17 +5310,17 @@ int try_release_extent_buffer(struct page *page)
 		return 1;
 	}
 
-	eb = (struct extent_buffer *)page->private;
-	BUG_ON(!eb);
+	ebh = (struct extent_buffer_head *)page->private;
+	BUG_ON(!ebh);
 
 	/*
 	 * This is a little awful but should be ok, we need to make sure that
 	 * the eb doesn't disappear out from under us while we're looking at
 	 * this page.
 	 */
-	spin_lock(&eb->refs_lock);
-	if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
-		spin_unlock(&eb->refs_lock);
+	spin_lock(&ebh->refs_lock);
+	if (atomic_read(&ebh->refs) != 1 || extent_buffer_under_io(ebh)) {
+		spin_unlock(&ebh->refs_lock);
 		spin_unlock(&page->mapping->private_lock);
 		return 0;
 	}
@@ -5228,10 +5330,11 @@ int try_release_extent_buffer(struct page *page)
 	 * If tree ref isn't set then we know the ref on this eb is a real ref,
 	 * so just return, this page will likely be freed soon anyway.
 	 */
-	if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
-		spin_unlock(&eb->refs_lock);
+	if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &ebh->bflags)) {
+		spin_unlock(&ebh->refs_lock);
 		return 0;
 	}
 
-	return release_extent_buffer(eb);
+	return release_extent_buffer(ebh);
 }
+
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 19620c5..b56de28 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -124,19 +124,12 @@ struct extent_state {
 
 #define INLINE_EXTENT_BUFFER_PAGES 16
 #define MAX_INLINE_EXTENT_BUFFER_SIZE (INLINE_EXTENT_BUFFER_PAGES * PAGE_CACHE_SIZE)
+#define MAX_EXTENT_BUFFERS_PER_PAGE 16
+
 struct extent_buffer {
 	u64 start;
 	unsigned long len;
-	unsigned long map_start;
-	unsigned long map_len;
-	unsigned long bflags;
-	struct extent_io_tree *tree;
-	spinlock_t refs_lock;
-	atomic_t refs;
-	atomic_t io_pages;
-	int read_mirror;
-	struct rcu_head rcu_head;
-	pid_t lock_owner;
+	unsigned long ebflags;
 
 	/* count of read lock holders on the extent buffer */
 	atomic_t write_locks;
@@ -147,6 +140,8 @@ struct extent_buffer {
 	atomic_t spinning_writers;
 	int lock_nested;
 
+	pid_t lock_owner;
+
 	/* protects write locks */
 	rwlock_t lock;
 
@@ -160,7 +155,21 @@ struct extent_buffer {
 	 */
 	wait_queue_head_t read_lock_wq;
 	wait_queue_head_t lock_wq;
+};
+
+struct extent_buffer_head {
+	unsigned long bflags;
+	struct extent_io_tree *tree;
+	spinlock_t refs_lock;
+	atomic_t refs;
+	atomic_t io_pages;
+	int read_mirror;
+	struct rcu_head rcu_head;
+
 	struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
+
+	struct extent_buffer extent_buf[MAX_EXTENT_BUFFERS_PER_PAGE];
+	struct extent_buffer *io_eb; /* eb that submitted the current I/O */
 #ifdef CONFIG_BTRFS_DEBUG
 	struct list_head leak_list;
 #endif
@@ -177,6 +186,24 @@ static inline int extent_compress_type(unsigned long bio_flags)
 	return bio_flags >> EXTENT_BIO_FLAG_SHIFT;
 }
 
+/*
+ * return the extent_buffer_head that contains the extent buffer provided.
+ */
+static inline struct extent_buffer_head *eb_head(struct extent_buffer *eb)
+{
+	int start, index;
+	struct extent_buffer_head *ebh;
+	struct extent_buffer *eb_base;
+
+	BUG_ON(!eb);
+	start = eb->start & (PAGE_CACHE_SIZE - 1);
+	index = start >> (ffs(eb->len) - 1);
+	eb_base = eb - index;
+	ebh = (struct extent_buffer_head *)
+		((char *) eb_base - offsetof(struct extent_buffer_head, extent_buf));
+	return ebh;
+
+}
 struct extent_map_tree;
 
 typedef struct extent_map *(get_extent_t)(struct inode *inode,
@@ -288,15 +315,15 @@ static inline unsigned long num_extent_pages(u64 start, u64 len)
 		(start >> PAGE_CACHE_SHIFT);
 }
 
-static inline struct page *extent_buffer_page(struct extent_buffer *eb,
-					      unsigned long i)
+static inline struct page *extent_buffer_page(
+			struct extent_buffer *eb, unsigned long i)
 {
-	return eb->pages[i];
+	return eb_head(eb)->pages[i];
 }
 
 static inline void extent_buffer_get(struct extent_buffer *eb)
 {
-	atomic_inc(&eb->refs);
+	atomic_inc(&eb_head(eb)->refs);
 }
 
 int memcmp_extent_buffer(struct extent_buffer *eb, const void *ptrv,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 92303f4..37b2698 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5921,7 +5921,7 @@ int btrfs_read_sys_array(struct btrfs_root *root)
 	 * to silence the warning eg. on PowerPC 64.
 	 */
 	if (PAGE_CACHE_SIZE > BTRFS_SUPER_INFO_SIZE)
-		SetPageUptodate(sb->pages[0]);
+		SetPageUptodate(eb_head(sb)->pages[0]);
 
 	write_extent_buffer(sb, super_copy, 0, BTRFS_SUPER_INFO_SIZE);
 	array_size = btrfs_super_sys_array_size(super_copy);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 4832d75..ceb194f 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -694,7 +694,7 @@ TRACE_EVENT(btrfs_cow_block,
 	TP_fast_assign(
 		__entry->root_objectid	= root->root_key.objectid;
 		__entry->buf_start	= buf->start;
-		__entry->refs		= atomic_read(&buf->refs);
+		__entry->refs		= atomic_read(&eb_head(buf)->refs);
 		__entry->cow_start	= cow->start;
 		__entry->buf_level	= btrfs_header_level(buf);
 		__entry->cow_level	= btrfs_header_level(cow);
-- 
1.7.12.4



* [PATCH 2/7] btrfs: subpagesize-blocksize: Use a global alignment for size
From: Chandra Seetharaman @ 2013-12-11 23:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Chandra Seetharaman

In order to handle a blocksize that is smaller than PAGE_SIZE, we
need to align all IO to PAGE_SIZE.

This patch defines a new helper, btrfs_align_size(), that calculates
the alignment size based on the sectorsize and PAGE_SIZE, and uses it
in the appropriate places.
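
The helper effectively picks max(sectorsize, PAGE_CACHE_SIZE); a
minimal sketch of the behaviour, equivalent to the hunk below:

	static inline u64 btrfs_align_size(struct inode *inode)
	{
		u64 sectorsize = BTRFS_I(inode)->root->sectorsize;

		/* never align IO to less than a full page */
		return max_t(u64, sectorsize, (u64)PAGE_CACHE_SIZE);
	}

For example, with a 4K sectorsize on a 64K-page machine,
ALIGN(len, btrfs_align_size(inode)) now rounds len up to a 64K
boundary instead of a 4K one.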

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
---
 fs/btrfs/btrfs_inode.h  |  7 +++++++
 fs/btrfs/compression.c  |  3 ++-
 fs/btrfs/extent-tree.c  | 12 ++++++------
 fs/btrfs/extent_io.c    | 17 ++++++-----------
 fs/btrfs/file.c         | 15 +++++++--------
 fs/btrfs/inode.c        | 41 ++++++++++++++++++++++-------------------
 fs/btrfs/ioctl.c        |  6 +++---
 fs/btrfs/ordered-data.c |  2 +-
 fs/btrfs/tree-log.c     |  2 +-
 9 files changed, 55 insertions(+), 50 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index ac0b39d..eee994f 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -280,4 +280,11 @@ static inline void btrfs_inode_resume_unlocked_dio(struct inode *inode)
 		  &BTRFS_I(inode)->runtime_flags);
 }
 
+static inline u64 btrfs_align_size(struct inode *inode)
+{
+	if (BTRFS_I(inode)->root->sectorsize < PAGE_CACHE_SIZE)
+		return (u64)PAGE_CACHE_SIZE;
+	else
+		return (u64)BTRFS_I(inode)->root->sectorsize;
+}
 #endif
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 1499b27..259f2c5 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -89,9 +89,10 @@ static inline int compressed_bio_size(struct btrfs_root *root,
 				      unsigned long disk_size)
 {
 	u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
+	int align_size = max_t(size_t, root->sectorsize, PAGE_CACHE_SIZE);
 
 	return sizeof(struct compressed_bio) +
-		((disk_size + root->sectorsize - 1) / root->sectorsize) *
+		((disk_size + align_size - 1) / align_size) *
 		csum_size;
 }
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 79cf87f..621af18 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3617,8 +3617,8 @@ int btrfs_check_data_free_space(struct inode *inode, u64 bytes)
 	u64 used;
 	int ret = 0, committed = 0, alloc_chunk = 1;
 
-	/* make sure bytes are sectorsize aligned */
-	bytes = ALIGN(bytes, root->sectorsize);
+	/* make sure bytes are appropriately aligned */
+	bytes = ALIGN(bytes, btrfs_align_size(inode));
 
 	if (btrfs_is_free_space_inode(inode)) {
 		committed = 1;
@@ -3726,8 +3726,8 @@ void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes)
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_space_info *data_sinfo;
 
-	/* make sure bytes are sectorsize aligned */
-	bytes = ALIGN(bytes, root->sectorsize);
+	/* make sure bytes are appropriately aligned */
+	bytes = ALIGN(bytes, btrfs_align_size(inode));
 
 	data_sinfo = root->fs_info->data_sinfo;
 	spin_lock(&data_sinfo->lock);
@@ -4988,7 +4988,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 	if (delalloc_lock)
 		mutex_lock(&BTRFS_I(inode)->delalloc_mutex);
 
-	num_bytes = ALIGN(num_bytes, root->sectorsize);
+	num_bytes = ALIGN(num_bytes, btrfs_align_size(inode));
 
 	spin_lock(&BTRFS_I(inode)->lock);
 	BTRFS_I(inode)->outstanding_extents++;
@@ -5126,7 +5126,7 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes)
 	u64 to_free = 0;
 	unsigned dropped;
 
-	num_bytes = ALIGN(num_bytes, root->sectorsize);
+	num_bytes = ALIGN(num_bytes, btrfs_align_size(inode));
 	spin_lock(&BTRFS_I(inode)->lock);
 	dropped = drop_outstanding_extent(inode);
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a1a849b..e1992ed 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2766,7 +2766,7 @@ static int __do_readpage(struct extent_io_tree *tree,
 	size_t pg_offset = 0;
 	size_t iosize;
 	size_t disk_io_size;
-	size_t blocksize = inode->i_sb->s_blocksize;
+	size_t blocksize = btrfs_align_size(inode);
 	unsigned long this_bio_flag = *bio_flags & EXTENT_BIO_PARENT_LOCKED;
 
 	set_page_extent_mapped(page);
@@ -3078,7 +3078,6 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
 	int ret;
 	int nr = 0;
 	size_t pg_offset = 0;
-	size_t blocksize;
 	loff_t i_size = i_size_read(inode);
 	unsigned long end_index = i_size >> PAGE_CACHE_SHIFT;
 	u64 nr_delalloc;
@@ -3218,8 +3217,6 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
 		goto done;
 	}
 
-	blocksize = inode->i_sb->s_blocksize;
-
 	while (cur <= end) {
 		if (cur >= last_byte) {
 			if (tree->ops && tree->ops->writepage_end_io_hook)
@@ -3238,7 +3235,7 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
 		BUG_ON(extent_map_end(em) <= cur);
 		BUG_ON(end < cur);
 		iosize = min(extent_map_end(em) - cur, end - cur + 1);
-		iosize = ALIGN(iosize, blocksize);
+		iosize = ALIGN(iosize, btrfs_align_size(inode));
 		sector = (em->block_start + extent_offset) >> 9;
 		bdev = em->bdev;
 		block_start = em->block_start;
@@ -3934,9 +3931,8 @@ int extent_invalidatepage(struct extent_io_tree *tree,
 	struct extent_state *cached_state = NULL;
 	u64 start = page_offset(page);
 	u64 end = start + PAGE_CACHE_SIZE - 1;
-	size_t blocksize = page->mapping->host->i_sb->s_blocksize;
 
-	start += ALIGN(offset, blocksize);
+	start += ALIGN(offset, btrfs_align_size(page->mapping->host));
 	if (start > end)
 		return 0;
 
@@ -4044,7 +4040,6 @@ static struct extent_map *get_extent_skip_holes(struct inode *inode,
 						u64 last,
 						get_extent_t *get_extent)
 {
-	u64 sectorsize = BTRFS_I(inode)->root->sectorsize;
 	struct extent_map *em;
 	u64 len;
 
@@ -4055,7 +4050,7 @@ static struct extent_map *get_extent_skip_holes(struct inode *inode,
 		len = last - offset;
 		if (len == 0)
 			break;
-		len = ALIGN(len, sectorsize);
+		len = ALIGN(len, btrfs_align_size(inode));
 		em = get_extent(inode, NULL, 0, offset, len, 0);
 		if (IS_ERR_OR_NULL(em))
 			return em;
@@ -4119,8 +4114,8 @@ int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		return -ENOMEM;
 	path->leave_spinning = 1;
 
-	start = ALIGN(start, BTRFS_I(inode)->root->sectorsize);
-	len = ALIGN(len, BTRFS_I(inode)->root->sectorsize);
+	start = ALIGN(start, btrfs_align_size(inode));
+	len = ALIGN(len, btrfs_align_size(inode));
 
 	/*
 	 * lookup the last file extent.  We're not using i_size here
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 82d0342..1861322 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -505,8 +505,8 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 	u64 end_pos = pos + write_bytes;
 	loff_t isize = i_size_read(inode);
 
-	start_pos = pos & ~((u64)root->sectorsize - 1);
-	num_bytes = ALIGN(write_bytes + pos - start_pos, root->sectorsize);
+	start_pos = pos & ~(btrfs_align_size(inode) - 1);
+	num_bytes = ALIGN(write_bytes + pos - start_pos, btrfs_align_size(inode));
 
 	end_of_last_block = start_pos + num_bytes - 1;
 	err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
@@ -889,7 +889,7 @@ next_slot:
 				inode_sub_bytes(inode,
 						extent_end - key.offset);
 				extent_end = ALIGN(extent_end,
-						   root->sectorsize);
+						   btrfs_align_size(inode));
 			} else if (update_refs && disk_bytenr > 0) {
 				ret = btrfs_free_extent(trans, root,
 						disk_bytenr, num_bytes, 0,
@@ -1254,7 +1254,7 @@ static noinline int prepare_pages(struct btrfs_root *root, struct file *file,
 	u64 start_pos;
 	u64 last_pos;
 
-	start_pos = pos & ~((u64)root->sectorsize - 1);
+	start_pos = pos & ~((u64)btrfs_align_size(inode) - 1);
 	last_pos = ((u64)index + num_pages) << PAGE_CACHE_SHIFT;
 
 again:
@@ -2263,11 +2263,10 @@ static long btrfs_fallocate(struct file *file, int mode,
 	u64 alloc_hint = 0;
 	u64 locked_end;
 	struct extent_map *em;
-	int blocksize = BTRFS_I(inode)->root->sectorsize;
 	int ret;
 
-	alloc_start = round_down(offset, blocksize);
-	alloc_end = round_up(offset + len, blocksize);
+	alloc_start = round_down(offset, btrfs_align_size(inode));
+	alloc_end = round_up(offset + len, btrfs_align_size(inode));
 
 	/* Make sure we aren't being give some crap mode */
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
@@ -2367,7 +2366,7 @@ static long btrfs_fallocate(struct file *file, int mode,
 		}
 		last_byte = min(extent_map_end(em), alloc_end);
 		actual_end = min_t(u64, extent_map_end(em), offset + len);
-		last_byte = ALIGN(last_byte, blocksize);
+		last_byte = ALIGN(last_byte, btrfs_align_size(inode));
 
 		if (em->block_start == EXTENT_MAP_HOLE ||
 		    (cur_offset >= inode->i_size &&
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f1a7744..c79c9cd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -239,7 +239,7 @@ static noinline int cow_file_range_inline(struct btrfs_root *root,
 	u64 isize = i_size_read(inode);
 	u64 actual_end = min(end + 1, isize);
 	u64 inline_len = actual_end - start;
-	u64 aligned_end = ALIGN(end, root->sectorsize);
+	u64 aligned_end = ALIGN(end, btrfs_align_size(inode));
 	u64 data_len = inline_len;
 	int ret;
 
@@ -354,7 +354,6 @@ static noinline int compress_file_range(struct inode *inode,
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	u64 num_bytes;
-	u64 blocksize = root->sectorsize;
 	u64 actual_end;
 	u64 isize = i_size_read(inode);
 	int ret = 0;
@@ -407,8 +406,8 @@ again:
 	 * a compressed extent to 128k.
 	 */
 	total_compressed = min(total_compressed, max_uncompressed);
-	num_bytes = ALIGN(end - start + 1, blocksize);
-	num_bytes = max(blocksize,  num_bytes);
+	num_bytes = ALIGN(end - start + 1, btrfs_align_size(inode));
+	num_bytes = max(btrfs_align_size(inode),  num_bytes);
 	total_in = 0;
 	ret = 0;
 
@@ -508,7 +507,7 @@ cont:
 		 * up to a block size boundary so the allocator does sane
 		 * things
 		 */
-		total_compressed = ALIGN(total_compressed, blocksize);
+		total_compressed = ALIGN(total_compressed, btrfs_align_size(inode));
 
 		/*
 		 * one last check to make sure the compression is really a
@@ -837,7 +836,6 @@ static noinline int cow_file_range(struct inode *inode,
 	unsigned long ram_size;
 	u64 disk_num_bytes;
 	u64 cur_alloc_size;
-	u64 blocksize = root->sectorsize;
 	struct btrfs_key ins;
 	struct extent_map *em;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
@@ -848,8 +846,8 @@ static noinline int cow_file_range(struct inode *inode,
 		return -EINVAL;
 	}
 
-	num_bytes = ALIGN(end - start + 1, blocksize);
-	num_bytes = max(blocksize,  num_bytes);
+	num_bytes = ALIGN(end - start + 1, btrfs_align_size(inode));
+	num_bytes = max(btrfs_align_size(inode), num_bytes);
 	disk_num_bytes = num_bytes;
 
 	/* if this is a small write inside eof, kick off defrag */
@@ -1263,7 +1261,7 @@ next_slot:
 		} else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
 			extent_end = found_key.offset +
 				btrfs_file_extent_inline_len(leaf, fi);
-			extent_end = ALIGN(extent_end, root->sectorsize);
+			extent_end = ALIGN(extent_end, btrfs_align_size(inode));
 		} else {
 			BUG_ON(1);
 		}
@@ -1389,6 +1387,12 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 	int ret;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 
+	if (inode->i_sb->s_blocksize < PAGE_CACHE_SIZE) {
+		start &= ~(PAGE_CACHE_SIZE - 1);
+		end = max_t(u64, start + PAGE_CACHE_SIZE - 1, end);
+	}
+
+
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) {
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 1, nr_written);
@@ -3894,7 +3898,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 	 */
 	if (root->ref_cows || root == root->fs_info->tree_root)
 		btrfs_drop_extent_cache(inode, ALIGN(new_size,
-					root->sectorsize), (u64)-1, 0);
+					btrfs_align_size(inode)), (u64)-1, 0);
 
 	/*
 	 * This function is also used to drop the items in the log tree before
@@ -3980,7 +3984,7 @@ search_again:
 					btrfs_file_extent_num_bytes(leaf, fi);
 				extent_num_bytes = ALIGN(new_size -
 						found_key.offset,
-						root->sectorsize);
+						btrfs_align_size(inode));
 				btrfs_set_file_extent_num_bytes(leaf, fi,
 							 extent_num_bytes);
 				num_dec = (orig_num_bytes -
@@ -4217,8 +4221,8 @@ int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size)
 	struct extent_map *em = NULL;
 	struct extent_state *cached_state = NULL;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
-	u64 hole_start = ALIGN(oldsize, root->sectorsize);
-	u64 block_end = ALIGN(size, root->sectorsize);
+	u64 hole_start = ALIGN(oldsize, btrfs_align_size(inode));
+	u64 block_end = ALIGN(size, btrfs_align_size(inode));
 	u64 last_byte;
 	u64 cur_offset;
 	u64 hole_size;
@@ -4261,7 +4265,7 @@ int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size)
 			break;
 		}
 		last_byte = min(extent_map_end(em), block_end);
-		last_byte = ALIGN(last_byte , root->sectorsize);
+		last_byte = ALIGN(last_byte , btrfs_align_size(inode));
 		if (!test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
 			struct extent_map *hole_em;
 			hole_size = last_byte - cur_offset;
@@ -6001,7 +6005,7 @@ again:
 	} else if (found_type == BTRFS_FILE_EXTENT_INLINE) {
 		size_t size;
 		size = btrfs_file_extent_inline_len(leaf, item);
-		extent_end = ALIGN(extent_start + size, root->sectorsize);
+		extent_end = ALIGN(extent_start + size, btrfs_align_size(inode));
 	}
 next:
 	if (start >= extent_end) {
@@ -6074,7 +6078,7 @@ next:
 		copy_size = min_t(u64, PAGE_CACHE_SIZE - pg_offset,
 				size - extent_offset);
 		em->start = extent_start + extent_offset;
-		em->len = ALIGN(copy_size, root->sectorsize);
+		em->len = ALIGN(copy_size, btrfs_align_size(inode));
 		em->orig_block_len = em->len;
 		em->orig_start = em->start;
 		if (compress_type) {
@@ -7967,7 +7971,6 @@ static int btrfs_getattr(struct vfsmount *mnt,
 {
 	u64 delalloc_bytes;
 	struct inode *inode = dentry->d_inode;
-	u32 blocksize = inode->i_sb->s_blocksize;
 
 	generic_fillattr(inode, stat);
 	stat->dev = BTRFS_I(inode)->root->anon_dev;
@@ -7976,8 +7979,8 @@ static int btrfs_getattr(struct vfsmount *mnt,
 	spin_lock(&BTRFS_I(inode)->lock);
 	delalloc_bytes = BTRFS_I(inode)->delalloc_bytes;
 	spin_unlock(&BTRFS_I(inode)->lock);
-	stat->blocks = (ALIGN(inode_get_bytes(inode), blocksize) +
-			ALIGN(delalloc_bytes, blocksize)) >> 9;
+	stat->blocks = (ALIGN(inode_get_bytes(inode), btrfs_align_size(inode)) +
+			ALIGN(delalloc_bytes, btrfs_align_size(inode))) >> 9;
 	return 0;
 }
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a111622..c41e342 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2631,7 +2631,7 @@ static int btrfs_cmp_data(struct inode *src, u64 loff, struct inode *dst,
 
 static int extent_same_check_offsets(struct inode *inode, u64 off, u64 len)
 {
-	u64 bs = BTRFS_I(inode)->root->fs_info->sb->s_blocksize;
+	u64 bs = btrfs_align_size(inode);
 
 	if (off + len > inode->i_size || off + len < off)
 		return -EINVAL;
@@ -2698,7 +2698,7 @@ static long btrfs_ioctl_file_extent_same(struct file *file,
 	int i;
 	int ret;
 	unsigned long size;
-	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
+	u64 bs = btrfs_align_size(src);
 	bool is_admin = capable(CAP_SYS_ADMIN);
 
 	if (!(file->f_mode & FMODE_READ))
@@ -3111,7 +3111,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 	struct inode *src;
 	int ret;
 	u64 len = olen;
-	u64 bs = root->fs_info->sb->s_blocksize;
+	u64 bs = btrfs_align_size(inode);
 	int same_inode = 0;
 
 	/*
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 69582d5..8d703e8 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -936,7 +936,7 @@ int btrfs_ordered_update_i_size(struct inode *inode, u64 offset,
 				     ordered->file_offset +
 				     ordered->truncated_len);
 	} else {
-		offset = ALIGN(offset, BTRFS_I(inode)->root->sectorsize);
+		offset = ALIGN(offset, btrfs_align_size(inode));
 	}
 	disk_i_size = BTRFS_I(inode)->disk_i_size;
 
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 9f7fc51..455d288 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -572,7 +572,7 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
 	} else if (found_type == BTRFS_FILE_EXTENT_INLINE) {
 		size = btrfs_file_extent_inline_len(eb, item);
 		nbytes = btrfs_file_extent_ram_bytes(eb, item);
-		extent_end = ALIGN(start + size, root->sectorsize);
+		extent_end = ALIGN(start + size, btrfs_align_size(inode));
 	} else {
 		ret = 0;
 		goto out;
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread
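
The btrfs_align_size() helper that the hunks above substitute for
root->sectorsize is introduced by this patch's header changes, which are
not quoted here; the exact definition below is an assumption
reconstructed from the call sites, on the reading that it returns the
larger of the filesystem blocksize and the page size:

static inline u64 btrfs_align_size(struct inode *inode)
{
	/*
	 * Assumed definition: when blocksize < PAGE_CACHE_SIZE, data IO
	 * still has to stay page-granular, so align to the page size;
	 * otherwise keep aligning to the blocksize as before.
	 */
	if (inode->i_sb->s_blocksize < PAGE_CACHE_SIZE)
		return (u64)PAGE_CACHE_SIZE;
	return (u64)inode->i_sb->s_blocksize;
}

With that reading, every ALIGN(x, root->sectorsize) in the data path
becomes ALIGN(x, btrfs_align_size(inode)), which degenerates to the old
behaviour whenever blocksize >= PAGE_CACHE_SIZE.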

* [PATCH 3/7] btrfs: subpagesize-blocksize: Handle small extent maps properly
  2013-12-11 23:38 [PATCH 0/7] Patches to support subpagesize blocksize Chandra Seetharaman
  2013-12-11 23:38 ` [PATCH 1/7] btrfs: subpagesize-blocksize: Define extent_buffer_head Chandra Seetharaman
  2013-12-11 23:38 ` [PATCH 2/7] btrfs: subpagesize-blocksize: Use a global alignment for size Chandra Seetharaman
@ 2013-12-11 23:38 ` Chandra Seetharaman
  2013-12-11 23:38 ` [PATCH 4/7] btrfs: subpagesize-blocksize: Handle iosize properly in submit_extent_page() Chandra Seetharaman
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Chandra Seetharaman @ 2013-12-11 23:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Chandra Seetharaman

This patch makes sure that the extent maps handled are at least
PAGE_CACHE_SIZE in size for the subpagesize-blocksize case.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
---
 fs/btrfs/inode.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c79c9cd..c0c18ca 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6036,7 +6036,11 @@ next:
 	if (found_type == BTRFS_FILE_EXTENT_REG ||
 	    found_type == BTRFS_FILE_EXTENT_PREALLOC) {
 		em->start = extent_start;
-		em->len = extent_end - extent_start;
+		if (inode->i_sb->s_blocksize < PAGE_CACHE_SIZE &&
+				em->len < PAGE_CACHE_SIZE)
+			em->len = PAGE_CACHE_SIZE;
+		else
+			em->len = extent_end - extent_start;
 		em->orig_start = extent_start -
 				 btrfs_file_extent_offset(leaf, item);
 		em->orig_block_len = btrfs_file_extent_disk_num_bytes(leaf,
@@ -6077,6 +6081,8 @@ next:
 		extent_offset = page_offset(page) + pg_offset - extent_start;
 		copy_size = min_t(u64, PAGE_CACHE_SIZE - pg_offset,
 				size - extent_offset);
+		if (inode->i_sb->s_blocksize < PAGE_CACHE_SIZE)
+			copy_size = max_t(u64, copy_size, PAGE_CACHE_SIZE);
 		em->start = extent_start + extent_offset;
 		em->len = ALIGN(copy_size, btrfs_align_size(inode));
 		em->orig_block_len = em->len;
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread
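
The clamp above is open-coded at both call sites; expressed as a
stand-alone helper (hypothetical, not part of the patch), the rule is:

static inline u64 btrfs_min_em_len(struct inode *inode, u64 len)
{
	/*
	 * With blocksize < PAGE_CACHE_SIZE, an extent map shorter than
	 * a page would let the page-granular IO paths run past the end
	 * of the map, so round it up to a full page.
	 */
	if (inode->i_sb->s_blocksize < PAGE_CACHE_SIZE &&
	    len < PAGE_CACHE_SIZE)
		return (u64)PAGE_CACHE_SIZE;
	return len;
}

so the two hunks reduce to em->len = btrfs_min_em_len(inode, extent_end -
extent_start) and copy_size = btrfs_min_em_len(inode, copy_size). Note
the first hunk as posted tests em->len before assigning it; presumably
the new length, extent_end - extent_start, was meant to be tested.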

* [PATCH 4/7] btrfs: subpagesize-blocksize: Handle iosize properly in submit_extent_page()
  2013-12-11 23:38 [PATCH 0/7] Patches to support subpagesize blocksize Chandra Seetharaman
                   ` (2 preceding siblings ...)
  2013-12-11 23:38 ` [PATCH 3/7] btrfs: subpagesize-blocksize: Handle small extent maps properly Chandra Seetharaman
@ 2013-12-11 23:38 ` Chandra Seetharaman
  2013-12-11 23:38 ` [PATCH 5/7] btrfs: subpagesize-blocksize: handle checksum calculations properly Chandra Seetharaman
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Chandra Seetharaman @ 2013-12-11 23:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Chandra Seetharaman

For the subpagesize-blocksize case, make sure that the IO submitted through
submit_extent_page() is at least PAGE_CACHE_SIZE in size.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
---
 fs/btrfs/extent_io.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e1992ed..2cf2a3b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2647,6 +2647,9 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree,
 	int old_compressed = prev_bio_flags & EXTENT_BIO_COMPRESSED;
 	size_t page_size = min_t(size_t, size, PAGE_CACHE_SIZE);
 
+	if (page->mapping->host->i_sb->s_blocksize < PAGE_CACHE_SIZE)
+		page_size = PAGE_CACHE_SIZE;
+
 	if (bio_ret && *bio_ret) {
 		bio = *bio_ret;
 		if (old_compressed)
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 5/7] btrfs: subpagesize-blocksize: handle checksum calculations properly
  2013-12-11 23:38 [PATCH 0/7] Patches to support subpagesize blocksize Chandra Seetharaman
                   ` (3 preceding siblings ...)
  2013-12-11 23:38 ` [PATCH 4/7] btrfs: subpagesize-blocksize: Handle iosize properly in submit_extent_page() Chandra Seetharaman
@ 2013-12-11 23:38 ` Chandra Seetharaman
  2013-12-11 23:38 ` [PATCH 6/7] btrfs: subpagesize-blocksize: Handle relocation clusters appropriately Chandra Seetharaman
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Chandra Seetharaman @ 2013-12-11 23:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Chandra Seetharaman

With subpagesize-blocksize, the IO is done in pages but checksums are
calculated in blocks.

This patch makes sure the checksums are calculated, stored, and verified
at the proper indexes within the page.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
---
 fs/btrfs/file-item.c | 45 ++++++++++++++++++++++++++++++++++++---------
 fs/btrfs/inode.c     | 26 ++++++++++++++++++++------
 2 files changed, 56 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 6f38488..d75bda3 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -175,14 +175,16 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
 	u32 diff;
 	int nblocks;
 	int bio_index = 0;
-	int count;
+	int count, bvec_count;
 	u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
+	unsigned int blocks_per_bvec;
 
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
 
 	nblocks = bio->bi_size >> inode->i_sb->s_blocksize_bits;
+	blocks_per_bvec =  bvec->bv_len >> inode->i_sb->s_blocksize_bits;
 	if (!dst) {
 		if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
 			btrfs_bio->csum_allocated = kmalloc(nblocks * csum_size,
@@ -221,8 +223,10 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
 	if (dio)
 		offset = logical_offset;
 	while (bio_index < bio->bi_vcnt) {
+		bvec_count = 0;
 		if (!dio)
 			offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+same_bvec:
 		count = btrfs_find_ordered_sum(inode, offset, disk_bytenr,
 					       (u32 *)csum, nblocks);
 		if (count)
@@ -281,12 +285,26 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
 found:
 		csum += count * csum_size;
 		nblocks -= count;
-		while (count--) {
+		while (count >= blocks_per_bvec) {
+			count -= blocks_per_bvec;
 			disk_bytenr += bvec->bv_len;
 			offset += bvec->bv_len;
 			bio_index++;
 			bvec++;
 		}
+			
+		if (count) {
+			while (count--) {
+				bvec_count++;
+				disk_bytenr += inode->i_sb->s_blocksize;
+				offset += inode->i_sb->s_blocksize;
+			}
+			if (bvec_count == blocks_per_bvec) {
+				bio_index++;
+				bvec++;
+			} else
+				goto same_bvec;
+		}
 	}
 	btrfs_free_path(path);
 	return 0;
@@ -444,7 +462,8 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
 	int index;
 	unsigned long total_bytes = 0;
 	unsigned long this_sum_bytes = 0;
-	u64 offset;
+	u64 offset, pg_offset;
+	size_t csum_size;
 
 	WARN_ON(bio->bi_vcnt <= 0);
 	sums = kzalloc(btrfs_ordered_sum_size(root, bio->bi_size), GFP_NOFS);
@@ -489,17 +508,25 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
 			index = 0;
 		}
 
+		if ((inode->i_sb->s_blocksize) < bvec->bv_len)
+			csum_size = inode->i_sb->s_blocksize;
+		else
+			csum_size = bvec->bv_len;
 		data = kmap_atomic(bvec->bv_page);
-		sums->sums[index] = ~(u32)0;
-		sums->sums[index] = btrfs_csum_data(data + bvec->bv_offset,
+		pg_offset = bvec->bv_offset;
+		while (pg_offset < bvec->bv_offset + bvec->bv_len) {
+			sums->sums[index] = ~(u32)0;
+			sums->sums[index] = btrfs_csum_data(data + pg_offset,
 						    sums->sums[index],
-						    bvec->bv_len);
+						    csum_size);
+			btrfs_csum_final(sums->sums[index],
+					(char *)(sums->sums + index));
+			index++;
+			pg_offset += csum_size;
+		}
 		kunmap_atomic(data);
-		btrfs_csum_final(sums->sums[index],
-				 (char *)(sums->sums + index));
 
 		bio_index++;
-		index++;
 		total_bytes += bvec->bv_len;
 		this_sum_bytes += bvec->bv_len;
 		offset += bvec->bv_len;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c0c18ca..a87d0d0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2781,6 +2781,7 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	u32 csum_expected;
 	u32 csum = ~(u32)0;
+	u64 total_len, csum_len, csum_index;
 	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
 	                              DEFAULT_RATELIMIT_BURST);
 
@@ -2799,14 +2800,27 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 		return 0;
 	}
 
-	phy_offset >>= inode->i_sb->s_blocksize_bits;
-	csum_expected = *(((u32 *)io_bio->csum) + phy_offset);
+	total_len = end - start + 1;
+	if (inode->i_sb->s_blocksize < PAGE_CACHE_SIZE)
+		csum_len = inode->i_sb->s_blocksize;
+	else
+		csum_len = end - start + 1;
+
+	csum_index = phy_offset >> inode->i_sb->s_blocksize_bits;
 
 	kaddr = kmap_atomic(page);
-	csum = btrfs_csum_data(kaddr + offset, csum,  end - start + 1);
-	btrfs_csum_final(csum, (char *)&csum);
-	if (csum != csum_expected)
-		goto zeroit;
+	while (total_len > 0) {
+		csum_expected = *(((u32 *)io_bio->csum) + csum_index);
+		csum = ~(u32)0;
+		csum = btrfs_csum_data(kaddr + offset, csum, csum_len);
+		btrfs_csum_final(csum, (char *)&csum);
+		if (csum != csum_expected)
+			goto zeroit;
+		offset += csum_len;
+		total_len -= csum_len;
+		csum_index += 1;
+		phy_offset += csum_len;
+	}
 
 	kunmap_atomic(kaddr);
 good:
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread
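
For reference, the read-side loop above pulled out into a free-standing
function (a sketch only; it assumes, as this patch does, that
io_bio->csum holds one u32 checksum per filesystem block, and the
function name is made up):

static int btrfs_verify_block_csums(struct inode *inode,
				    struct btrfs_io_bio *io_bio,
				    char *kaddr, u64 offset,
				    u64 phy_offset, u64 total_len)
{
	u64 csum_index = phy_offset >> inode->i_sb->s_blocksize_bits;

	while (total_len > 0) {
		/* one checksum covers at most one filesystem block */
		u64 csum_len = min_t(u64, (u64)inode->i_sb->s_blocksize,
				     total_len);
		u32 csum_expected = *(((u32 *)io_bio->csum) + csum_index);
		u32 csum = ~(u32)0;

		csum = btrfs_csum_data(kaddr + offset, csum, csum_len);
		btrfs_csum_final(csum, (char *)&csum);
		if (csum != csum_expected)
			return -EIO;

		offset += csum_len;
		total_len -= csum_len;
		csum_index++;
	}
	return 0;
}

With blocksize == PAGE_CACHE_SIZE the loop runs once over the whole
range, which is exactly the old single-checksum behaviour.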

* [PATCH 6/7] btrfs: subpagesize-blocksize: Handle relocation clusters appropriately
  2013-12-11 23:38 [PATCH 0/7] Patches to support subpagesize blocksize Chandra Seetharaman
                   ` (4 preceding siblings ...)
  2013-12-11 23:38 ` [PATCH 5/7] btrfs: subpagesize-blocksize: handle checksum calculations properly Chandra Seetharaman
@ 2013-12-11 23:38 ` Chandra Seetharaman
  2013-12-11 23:38 ` [PATCH 7/7] btrfs: subpagesize-blocksize: Allow mounting filesystems where sectorsize != PAGE_SIZE Chandra Seetharaman
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Chandra Seetharaman @ 2013-12-11 23:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Chandra Seetharaman

Relocation cluster boundaries are at block granularity; hence, in the
subpagesize-blocksize case, we need to make sure the data in the page
is handled correctly with respect to the cluster boundary.

This patch does that.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
---
 fs/btrfs/relocation.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index ce459a7..fb5752a 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3149,11 +3149,13 @@ static int relocate_file_extent_cluster(struct inode *inode,
 		set_page_extent_mapped(page);
 
 		if (nr < cluster->nr &&
-		    page_start + offset == cluster->boundary[nr]) {
+		    page_start + offset <= cluster->boundary[nr] &&
+		    page_end + offset >= cluster->boundary[nr]) {
 			set_extent_bits(&BTRFS_I(inode)->io_tree,
 					page_start, page_end,
 					EXTENT_BOUNDARY, GFP_NOFS);
-			nr++;
+			while (page_end + offset < cluster->boundary[nr])
+				nr++;
 		}
 
 		btrfs_set_extent_delalloc(inode, page_start, page_end, NULL);
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread
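
A worked example of the boundary handling above, with made-up numbers
(4K blocks, 64K pages):

/*
 * page range (after adding offset):  [0x10000, 0x1ffff]
 * cluster->boundary[]:               { 0x10000, 0x13000, 0x25000, ... }
 *
 * Both 0x10000 and 0x13000 fall inside the page, so the old equality
 * test (page_start + offset == cluster->boundary[nr]) would only catch
 * the first one. The new range test marks the whole page
 * EXTENT_BOUNDARY once, and nr is then meant to advance past every
 * boundary the page covers, leaving it pointing at 0x25000. The posted
 * while-condition looks inverted for that purpose; presumably it
 * should loop while cluster->boundary[nr] <= page_end + offset.
 */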

* [PATCH 7/7] btrfs: subpagesize-blocksize: Allow mounting filesystems where sectorsize != PAGE_SIZE
  2013-12-11 23:38 [PATCH 0/7] Patches to support subpagesize blocksize Chandra Seetharaman
                   ` (5 preceding siblings ...)
  2013-12-11 23:38 ` [PATCH 6/7] btrfs: subpagesize-blocksize: Handle relocation clusters appropriately Chandra Seetharaman
@ 2013-12-11 23:38 ` Chandra Seetharaman
  2013-12-13  1:07   ` David Sterba
  2013-12-12 20:40 ` [PATCH 0/7] Patches to support subpagesize blocksize Josef Bacik
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Chandra Seetharaman @ 2013-12-11 23:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Chandra Seetharaman

This is the final patch of the series that allows filesystems with
blocksize smaller than PAGE_SIZE.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
---
 fs/btrfs/disk-io.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ca1526d..d9bd450 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2615,12 +2615,6 @@ int open_ctree(struct super_block *sb,
 		goto fail_sb_buffer;
 	}
 
-	if (sectorsize != PAGE_SIZE) {
-		printk(KERN_WARNING "btrfs: Incompatible sector size(%lu) "
-		       "found on %s\n", (unsigned long)sectorsize, sb->s_id);
-		goto fail_sb_buffer;
-	}
-
 	mutex_lock(&fs_info->chunk_mutex);
 	ret = btrfs_read_sys_array(tree_root);
 	mutex_unlock(&fs_info->chunk_mutex);
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/7] Patches to support subpagesize blocksize
  2013-12-11 23:38 [PATCH 0/7] Patches to support subpagesize blocksize Chandra Seetharaman
                   ` (6 preceding siblings ...)
  2013-12-11 23:38 ` [PATCH 7/7] btrfs: subpagesize-blocksize: Allow mounting filesystems where sectorsize != PAGE_SIZE Chandra Seetharaman
@ 2013-12-12 20:40 ` Josef Bacik
  2013-12-13  1:17 ` David Sterba
  2013-12-13 18:39 ` Josef Bacik
  9 siblings, 0 replies; 23+ messages in thread
From: Josef Bacik @ 2013-12-12 20:40 UTC (permalink / raw)
  To: Chandra Seetharaman, linux-btrfs, chandra_pdx


On 12/11/2013 06:38 PM, Chandra Seetharaman wrote:
> In btrfs, blocksize, the basic IO size of the filesystem, has been
> more than PAGE_SIZE.
>
> But, some 64 bit architures, like PPC64 and ARM64 have the default
> PAGE_SIZE as 64K, which means the filesystems handled in these
> architectures are with a blocksize of 64K.
>
> This works fine as long as you create and use the filesystems within
> these systems.
>
> In other words, one cannot create a filesystem in some other architecture
> and use that filesystem in PPC64 or ARM64, and vice versa.,
>
> Another restriction is that we cannot use ext? filesystems in these
> architectures as btrfs filesystems, since ext? filesystems have a blocksize
> of 4K.
>
> Sometime last year, Wade Cline posted a patch(http://lwn.net/Articles/529682/).
> I started testing it, and found many locking/race issues. So, I changed the
> logic and created an extent_buffer_head that holds an array of extent buffers that
> belong to a page.
>
> There are few wrinkles in this patchset, like some xfstests are failing, which
> could be due to me doing something incorrectly w.r.t how the blocksize and
> PAGE_SIZE are used in these patched.
>
> Would like to get some feedback, review comments.
>

So I hate this whole approach, but that's not your fault ;).  We already 
keep track of what we need in the extent_buffer, adding a whole other 
layer of abstraction onto that will make me a very unhappy person.  The 
biggest problem with sub-page size block sizes is knowing when the page 
is really dirty or really clean.  For the most part we've done away with 
most of the tracking of the actual page state for metadata, we use flags 
on the EB for this.  We still depend on the page state for things like 
btree_write_cache_pages and being able to write out the transaction, but 
we can just replace that logic with setting the same tags in the extent 
buffer radix tree.  Then the only part we need to figure out is how to 
do the balance_dirty_pages() dance appropriately.  I'd be half tempted 
to just do account_page_dirtied() every time we mark an extent buffer 
dirty and then just abuse the metadata BDI's min_ratio/max_ratio to make 
sure it's properly adjusted for how many extent buffers per page there 
are and see how that works, we should be able to adjust it so we're 
flushing as much as normal.  This should be simpler to implement and 
touch less stuff.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 23+ messages in thread
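
As a rough sketch of the dirty-accounting idea above (none of this is
in the posted patchset; the helper name and the use of the existing
fs_info->buffer_radix tree are assumptions):

static void set_extent_buffer_dirty_tagged(struct btrfs_fs_info *fs_info,
					   struct extent_buffer *eb)
{
	if (test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags))
		return;	/* already dirty, already accounted */

	/*
	 * Mirror the per-eb dirty state as a tag in the extent buffer
	 * radix tree, so btree_write_cache_pages() can find dirty ebs
	 * directly instead of trusting the page state.
	 */
	spin_lock(&fs_info->buffer_lock);
	radix_tree_tag_set(&fs_info->buffer_radix,
			   eb->start >> PAGE_CACHE_SHIFT,
			   PAGECACHE_TAG_DIRTY);
	spin_unlock(&fs_info->buffer_lock);

	/*
	 * Feed balance_dirty_pages() throttling; the locking around
	 * the mapping is hand-waved here.
	 */
	account_page_dirtied(eb->pages[0], eb->pages[0]->mapping);
}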

* Re: [PATCH 7/7] btrfs: subpagesize-blocksize: Allow mounting filesystems where sectorsize != PAGE_SIZE
  2013-12-11 23:38 ` [PATCH 7/7] btrfs: subpagesize-blocksize: Allow mounting filesystems where sectorsize != PAGE_SIZE Chandra Seetharaman
@ 2013-12-13  1:07   ` David Sterba
  2013-12-16 12:50     ` saeed bishara
  0 siblings, 1 reply; 23+ messages in thread
From: David Sterba @ 2013-12-13  1:07 UTC (permalink / raw)
  To: Chandra Seetharaman; +Cc: linux-btrfs

On Wed, Dec 11, 2013 at 05:38:42PM -0600, Chandra Seetharaman wrote:
> This is the final patch of the series that allows filesystems with
> blocksize smaller than PAGE_SIZE.

> -	if (sectorsize != PAGE_SIZE) {

You've implemented the sectorsize < PAGE_SIZE part (multiple extent
buffers per page), so the check should stay as:

	if (sectorsize > PAGE_SIZE) {

> -		printk(KERN_WARNING "btrfs: Incompatible sector size(%lu) "
> -		       "found on %s\n", (unsigned long)sectorsize, sb->s_id);
> -		goto fail_sb_buffer;
> -	}
> -
>  	mutex_lock(&fs_info->chunk_mutex);
>  	ret = btrfs_read_sys_array(tree_root);
>  	mutex_unlock(&fs_info->chunk_mutex);

^ permalink raw reply	[flat|nested] 23+ messages in thread
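
Concretely, keeping David's suggested condition means the hunk only
relaxes the check rather than deleting it, reusing the removed message:

	if (sectorsize > PAGE_SIZE) {
		printk(KERN_WARNING "btrfs: Incompatible sector size(%lu) "
		       "found on %s\n", (unsigned long)sectorsize, sb->s_id);
		goto fail_sb_buffer;
	}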

* Re: [PATCH 0/7] Patches to support subpagesize blocksize
  2013-12-11 23:38 [PATCH 0/7] Patches to support subpagesize blocksize Chandra Seetharaman
                   ` (7 preceding siblings ...)
  2013-12-12 20:40 ` [PATCH 0/7] Patches to support subpagesize blocksize Josef Bacik
@ 2013-12-13  1:17 ` David Sterba
  2013-12-13 15:17   ` Chandra Seetharaman
  2013-12-13 18:39 ` Josef Bacik
  9 siblings, 1 reply; 23+ messages in thread
From: David Sterba @ 2013-12-13  1:17 UTC (permalink / raw)
  To: Chandra Seetharaman; +Cc: linux-btrfs, chandra_pdx

On Wed, Dec 11, 2013 at 05:38:35PM -0600, Chandra Seetharaman wrote:
> In other words, one cannot create a filesystem in some other architecture
> and use that filesystem in PPC64 or ARM64, and vice versa.,

For full compatibility with any blocksize on an arch with any pagesize,
you'd need to implement the case when sectorsize is larger than
pagesize. Your patchset does the "4k sector/64k page", but I haven't
noticed the "64k sector/4k page" counterpart.

> Sometime last year, Wade Cline posted a patch(http://lwn.net/Articles/529682/).
> I started testing it, and found many locking/race issues. So, I changed the
> logic and created an extent_buffer_head that holds an array of extent buffers that
> belong to a page.
> 
> There are few wrinkles in this patchset, like some xfstests are failing, which
> could be due to me doing something incorrectly w.r.t how the blocksize and
> PAGE_SIZE are used in these patched.

How does it handle compression? The current code relies on

  compression block == page size

but should rather use the sectorsize. That might be one of the reasons
why xfstests fail.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/7] Patches to support subpagesize blocksize
  2013-12-13  1:17 ` David Sterba
@ 2013-12-13 15:17   ` Chandra Seetharaman
  2013-12-13 15:58     ` David Sterba
  0 siblings, 1 reply; 23+ messages in thread
From: Chandra Seetharaman @ 2013-12-13 15:17 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs

On Fri, 2013-12-13 at 02:17 +0100, David Sterba wrote:

Hi David,

> On Wed, Dec 11, 2013 at 05:38:35PM -0600, Chandra Seetharaman wrote:
> > In other words, one cannot create a filesystem in some other architecture
> > and use that filesystem in PPC64 or ARM64, and vice versa.,
> 
> For a full compatibility with any blocksize on arch with any pagesize
> you'd need to implement the case when sectorsize is larger than
> pagesize. Your patchset does the "4k sector/64k page", but I haven't
> noticed the "64k sector/4k page" counterpart.

My objective was to make btrfs filesystems from other arches directly
usable on PPC64.

Nevertheless, IIUC, btrfs currently supports such a case. Each extent
buffer can currently have up to INLINE_EXTENT_BUFFER_PAGES (16) pages.

-------
#define INLINE_EXTENT_BUFFER_PAGES 16

struct extent_buffer {
        :
        :
        struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
        :
};
--------

No ?

> 
> > Sometime last year, Wade Cline posted a patch(http://lwn.net/Articles/529682/).
> > I started testing it, and found many locking/race issues. So, I changed the
> > logic and created an extent_buffer_head that holds an array of extent buffers that
> > belong to a page.
> > 
> > There are few wrinkles in this patchset, like some xfstests are failing, which
> > could be due to me doing something incorrectly w.r.t how the blocksize and
> > PAGE_SIZE are used in these patched.
> 
> How does it handle compression? The current code relies on
> 
>   compression block == page size
> 
> but should rather use the sectorsize. That might be one of the reasons
> why xfstests fail.

Thanks for this information. I will look at the code more closely with
this in mind.

There are some issues with relocation too. Is there a similar assumption
in that code path too?

> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/7] Patches to support subpagesize blocksize
  2013-12-13 15:17   ` Chandra Seetharaman
@ 2013-12-13 15:58     ` David Sterba
  0 siblings, 0 replies; 23+ messages in thread
From: David Sterba @ 2013-12-13 15:58 UTC (permalink / raw)
  To: Chandra Seetharaman; +Cc: linux-btrfs

On Fri, Dec 13, 2013 at 09:17:00AM -0600, Chandra Seetharaman wrote:
> > For full compatibility with any blocksize on an arch with any pagesize,
> > you'd need to implement the case when sectorsize is larger than
> > pagesize. Your patchset does the "4k sector/64k page", but I haven't
> > noticed the "64k sector/4k page" counterpart.
> 
> My objective was to make btrfs filesystems from other arches directly
> usable on PPC64.

Ok then.

> Nevertheless, IIUC, btrfs currently supports such a case. Each extent
> buffer can currently have up to INLINE_EXTENT_BUFFER_PAGES (16) pages.
> 
> -------
> #define INLINE_EXTENT_BUFFER_PAGES 16
> 
> struct extent_buffer {
>         :
>         :
>         struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
>         :
> };
> --------
> 
> No ?

Just for metadata blocks.

> There are some issues with relocation too. Is there a similar assumption
> in that code path too?

I don't know. If yes, then it's different from the compression issues,
because there are some hardwired assumptions about the binary format of
the compressed data, while relocation uses the common code.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/7] Patches to support subpagesize blocksize
  2013-12-11 23:38 [PATCH 0/7] Patches to support subpagesize blocksize Chandra Seetharaman
                   ` (8 preceding siblings ...)
  2013-12-13  1:17 ` David Sterba
@ 2013-12-13 18:39 ` Josef Bacik
  2013-12-13 22:09   ` Chandra Seetharaman
  2014-01-08 20:06   ` Chandra Seetharaman
  9 siblings, 2 replies; 23+ messages in thread
From: Josef Bacik @ 2013-12-13 18:39 UTC (permalink / raw)
  To: Chandra Seetharaman, linux-btrfs, chandra_pdx


On 12/11/2013 06:38 PM, Chandra Seetharaman wrote:
> In btrfs, blocksize, the basic IO size of the filesystem, has been
> more than PAGE_SIZE.
>
> But, some 64 bit architures, like PPC64 and ARM64 have the default
> PAGE_SIZE as 64K, which means the filesystems handled in these
> architectures are with a blocksize of 64K.
>
> This works fine as long as you create and use the filesystems within
> these systems.
>
> In other words, one cannot create a filesystem in some other architecture
> and use that filesystem in PPC64 or ARM64, and vice versa.,
>
> Another restriction is that we cannot use ext? filesystems in these
> architectures as btrfs filesystems, since ext? filesystems have a blocksize
> of 4K.
>
> Sometime last year, Wade Cline posted a patch(http://lwn.net/Articles/529682/).
> I started testing it, and found many locking/race issues. So, I changed the
> logic and created an extent_buffer_head that holds an array of extent buffers that
> belong to a page.
>
> There are few wrinkles in this patchset, like some xfstests are failing, which
> could be due to me doing something incorrectly w.r.t how the blocksize and
> PAGE_SIZE are used in these patched.
>
> Would like to get some feedback, review comments.
>

Ok, so after talking about it more on IRC and with Chris, I think 
we have a way forward here.

1) Add an extent_buffer_head that embeds an extent_buffer, and in the 
extent_buffer_head track the state of the whole page.  So this is where 
we have a linked list of all the extent_buffers on the page, and we can 
keep track of the number of extent_buffers that are dirty/not, so we can 
be sure to set the page state and everything right.

2) Set page->private to the first extent_buffer like we currently do.  
Then we just have checks in the endio stuff to see if the eb we found is 
the one for our current range (ie bv_offset == 0) and if not do a 
linear search through the extent_buffers on the extent_buffer_head part 
to get the right one.

We have to do this because we need to be able to track IO for each of 
the extent_buffer's independently of each other in case a page spans a 
block_group.

Hopefully that makes sense, this way you don't have to futz with any of 
my crazier long term goals of no longer using pagecache or any of that 
mess.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 23+ messages in thread
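
Point 1 above, read literally, gives roughly this shape (a sketch; the
field names are placeholders, not taken from the patchset):

struct extent_buffer {
	u64 start;
	unsigned long len;
	unsigned long ebflags;		/* per-buffer DIRTY/UPTODATE bits */
	struct extent_buffer *eb_next;	/* next buffer on the same page */
	/* per-buffer locks, refs, ... */
};

struct extent_buffer_head {
	unsigned long bflags;		/* page-wide state */
	atomic_t refs;
	atomic_t io_pages;
	int read_mirror;
	atomic_t nr_dirty_ebs;		/* how many buffers are dirty */
	struct extent_buffer eb;	/* first buffer, embedded */
};

page->private then points at the head (equivalently at its embedded
first buffer), and the endio code walks eb.eb_next whenever bv_offset
selects a later buffer, which matches point 2.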

* Re: [PATCH 0/7] Patches to support subpagesize blocksize
  2013-12-13 18:39 ` Josef Bacik
@ 2013-12-13 22:09   ` Chandra Seetharaman
  2014-01-08 20:06   ` Chandra Seetharaman
  1 sibling, 0 replies; 23+ messages in thread
From: Chandra Seetharaman @ 2013-12-13 22:09 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Fri, 2013-12-13 at 13:39 -0500, Josef Bacik wrote:
> On 12/11/2013 06:38 PM, Chandra Seetharaman wrote:
> > In btrfs, blocksize, the basic IO size of the filesystem, has been
> > more than PAGE_SIZE.
> >
> > But, some 64 bit architures, like PPC64 and ARM64 have the default
> > PAGE_SIZE as 64K, which means the filesystems handled in these
> > architectures are with a blocksize of 64K.
> >
> > This works fine as long as you create and use the filesystems within
> > these systems.
> >
> > In other words, one cannot create a filesystem in some other architecture
> > and use that filesystem in PPC64 or ARM64, and vice versa.,
> >
> > Another restriction is that we cannot use ext? filesystems in these
> > architectures as btrfs filesystems, since ext? filesystems have a blocksize
> > of 4K.
> >
> > Sometime last year, Wade Cline posted a patch(http://lwn.net/Articles/529682/).
> > I started testing it, and found many locking/race issues. So, I changed the
> > logic and created an extent_buffer_head that holds an array of extent buffers that
> > belong to a page.
> >
> > There are few wrinkles in this patchset, like some xfstests are failing, which
> > could be due to me doing something incorrectly w.r.t how the blocksize and
> > PAGE_SIZE are used in these patched.
> >
> > Would like to get some feedback, review comments.
> >
> 
> Ok so the more we talked about it on IRC and talking with Chris I think 
> we have a way forward here.
> 
> 1) Add an extent_buffer_head that embeds an extent_buffer, and in the 
> extent_buffer_head track the state of the whole page.  So this is where 
> we have a linked list of all the extent_buffers on the page, we can keep 
> track of the number of extent_buffers that are dirty/not so we can be 
> sure to set the page state and everything right.

Let me see if I understand you correctly:

In my patch I have,
-----------
extent_buffer {
        // buffer specific data
}; 

extent_buffer_head {
        // page wide data
        extent_buffer *extent_buf[];
};
--------------
You are suggesting to make it
------------
extent_buffer {
        // buffer specific data
	extent_buffer *ebuf_next; 
}; 

extent_buffer_head {
        // page wide data
        extent_buffer ebuf_first;
        extent_buffer *ebuf_next;
};
-----------
correct? If yes, then, IMO, the code might look more convoluted, as we
have to take care of two different situations, isn't it?

> 
> 2) Set page->private to the first extent_buffer like we currently do.  
> Then we just have checks in the endio stuff to see if the eb we found is 
> the one for our currently range (ie bv_offset == 0) and if not do a 
> linear search through the extent_buffers on the extent_buffer_head part 
> to get the right one.
> 
> We have to do this because we need to be able to track IO for each of 
> the extent_buffer's independently of each other in case a page spans a 
> block_group.
> 
> Hopefully that makes sense, this way you don't have to futz with any of 
> my crazier long term goals of no longer using pagecache or any of that 
> mess.  Thanks,

Yeah, that would be good :)
> 
> Josef
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/7] btrfs: subpagesize-blocksize: Define extent_buffer_head
  2013-12-11 23:38 ` [PATCH 1/7] btrfs: subpagesize-blocksize: Define extent_buffer_head Chandra Seetharaman
@ 2013-12-16 12:32   ` saeed bishara
  2013-12-16 16:17     ` Chandra Seetharaman
  0 siblings, 1 reply; 23+ messages in thread
From: saeed bishara @ 2013-12-16 12:32 UTC (permalink / raw)
  To: Chandra Seetharaman; +Cc: linux-btrfs

On Thu, Dec 12, 2013 at 1:38 AM, Chandra Seetharaman
<sekharan@us.ibm.com> wrote:
> In order to handle multiple extent buffers per page, first we
> need to create a way to handle all the extent buffers that
> are attached to a page.
>
> This patch creates a new data structure eb_head, and moves
> fields that are common to all extent buffers in a page from
> extent buffer to eb_head.
>
> This also adds changes that are needed to handle multiple
> extent buffers per page case.
>
> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
> ---
>  fs/btrfs/backref.c           |   6 +-
>  fs/btrfs/ctree.c             |   2 +-
>  fs/btrfs/ctree.h             |   6 +-
>  fs/btrfs/disk-io.c           | 109 +++++++----
>  fs/btrfs/extent-tree.c       |   6 +-
>  fs/btrfs/extent_io.c         | 429 +++++++++++++++++++++++++++----------------
>  fs/btrfs/extent_io.h         |  55 ++++--
>  fs/btrfs/volumes.c           |   2 +-
>  include/trace/events/btrfs.h |   2 +-
>  9 files changed, 390 insertions(+), 227 deletions(-)
>
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 3775947..af1943f 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -1283,7 +1283,7 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
>                 eb = path->nodes[0];
>                 /* make sure we can use eb after releasing the path */
>                 if (eb != eb_in) {
> -                       atomic_inc(&eb->refs);
> +                       atomic_inc(&eb_head(eb)->refs);
>                         btrfs_tree_read_lock(eb);
>                         btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
>                 }
> @@ -1616,7 +1616,7 @@ static int iterate_inode_refs(u64 inum, struct btrfs_root *fs_root,
>                 slot = path->slots[0];
>                 eb = path->nodes[0];
>                 /* make sure we can use eb after releasing the path */
> -               atomic_inc(&eb->refs);
> +               atomic_inc(&eb_head(eb)->refs);
>                 btrfs_tree_read_lock(eb);
>                 btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
>                 btrfs_release_path(path);
> @@ -1676,7 +1676,7 @@ static int iterate_inode_extrefs(u64 inum, struct btrfs_root *fs_root,
>                 slot = path->slots[0];
>                 eb = path->nodes[0];
>                 /* make sure we can use eb after releasing the path */
> -               atomic_inc(&eb->refs);
> +               atomic_inc(&eb_head(eb)->refs);
>
>                 btrfs_tree_read_lock(eb);
>                 btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 316136b..611b27e 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -170,7 +170,7 @@ struct extent_buffer *btrfs_root_node(struct btrfs_root *root)
>                  * the inc_not_zero dance and if it doesn't work then
>                  * synchronize_rcu and try again.
>                  */
> -               if (atomic_inc_not_zero(&eb->refs)) {
> +               if (atomic_inc_not_zero(&eb_head(eb)->refs)) {
>                         rcu_read_unlock();
>                         break;
>                 }
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 54ab861..02de448 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2106,14 +2106,16 @@ static inline void btrfs_set_token_##name(struct extent_buffer *eb,     \
>  #define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits)            \
>  static inline u##bits btrfs_##name(struct extent_buffer *eb)           \
>  {                                                                      \
> -       type *p = page_address(eb->pages[0]);                           \
> +       type *p = page_address(eb_head(eb)->pages[0]) +                 \
> +                               (eb->start & (PAGE_CACHE_SIZE -1));     \
you can use ~PAGE_CACHE_MASK instead of PAGE_CACHE_SIZE - 1
>         u##bits res = le##bits##_to_cpu(p->member);                     \
>         return res;                                                     \
>  }                                                                      \
>  static inline void btrfs_set_##name(struct extent_buffer *eb,          \
>                                     u##bits val)                        \
>  {                                                                      \
> -       type *p = page_address(eb->pages[0]);                           \
> +       type *p = page_address(eb_head(eb)->pages[0]) +                 \
> +                               (eb->start & (PAGE_CACHE_SIZE -1));     \
>         p->member = cpu_to_le##bits(val);                               \
>  }
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 8072cfa..ca1526d 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -411,7 +411,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
>         int mirror_num = 0;
>         int failed_mirror = 0;
>
> -       clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
> +       clear_bit(EXTENT_BUFFER_CORRUPT, &eb_head(eb)->bflags);
>         io_tree = &BTRFS_I(root->fs_info->btree_inode)->io_tree;
>         while (1) {
>                 ret = read_extent_buffer_pages(io_tree, eb, start,
> @@ -430,7 +430,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
>                  * there is no reason to read the other copies, they won't be
>                  * any less wrong.
>                  */
> -               if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags))
> +               if (test_bit(EXTENT_BUFFER_CORRUPT, &eb_head(eb)->bflags))
>                         break;
>
>                 num_copies = btrfs_num_copies(root->fs_info,
> @@ -440,7 +440,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
>
>                 if (!failed_mirror) {
>                         failed = 1;
> -                       failed_mirror = eb->read_mirror;
> +                       failed_mirror = eb_head(eb)->read_mirror;
>                 }
>
>                 mirror_num++;
> @@ -465,19 +465,22 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
>  static int csum_dirty_buffer(struct btrfs_root *root, struct page *page)
>  {
>         struct extent_io_tree *tree;
> -       u64 start = page_offset(page);
>         u64 found_start;
>         struct extent_buffer *eb;
> +       struct extent_buffer_head *eb_head;
>
>         tree = &BTRFS_I(page->mapping->host)->io_tree;
>
> -       eb = (struct extent_buffer *)page->private;
> -       if (page != eb->pages[0])
> +       eb_head = (struct extent_buffer_head *)page->private;
> +       if (page != eb_head->pages[0])
>                 return 0;
> -       found_start = btrfs_header_bytenr(eb);
> -       if (WARN_ON(found_start != start || !PageUptodate(page)))
> +       if (WARN_ON(!PageUptodate(page)))
>                 return 0;
> -       csum_tree_block(root, eb, 0);
> +       for (eb = &eb_head->extent_buf[0]; eb->start; eb++) {
> +               found_start = btrfs_header_bytenr(eb);
> +               if (found_start == eb->start)
> +                       csum_tree_block(root, eb, 0);
> +       }
>         return 0;
>  }
>
> @@ -575,25 +578,34 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
>         struct extent_buffer *eb;
>         struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
>         int ret = 0;
> -       int reads_done;
> +       int reads_done = 0;
> +       struct extent_buffer_head *eb_head;
>
>         if (!page->private)
>                 goto out;
>
>         tree = &BTRFS_I(page->mapping->host)->io_tree;
> -       eb = (struct extent_buffer *)page->private;
> +       eb_head = (struct extent_buffer_head *)page->private;
> +
> +       /* Get the eb corresponding to this IO */
> +       eb = eb_head->io_eb;
> +       if (!eb) {
> +               ret = -EIO;
> +               goto err;
> +       }
> +       eb_head->io_eb = NULL;
>
>         /* the pending IO might have been the only thing that kept this buffer
>          * in memory.  Make sure we have a ref for all this other checks
>          */
>         extent_buffer_get(eb);
>
> -       reads_done = atomic_dec_and_test(&eb->io_pages);
> +       reads_done = atomic_dec_and_test(&eb_head->io_pages);
>         if (!reads_done)
>                 goto err;
>
> -       eb->read_mirror = mirror;
> -       if (test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
> +       eb_head->read_mirror = mirror;
> +       if (test_bit(EXTENT_BUFFER_IOERR, &eb_head->bflags)) {
>                 ret = -EIO;
>                 goto err;
>         }
> @@ -635,7 +647,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
>          * return -EIO.
>          */
>         if (found_level == 0 && check_leaf(root, eb)) {
> -               set_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
> +               set_bit(EXTENT_BUFFER_CORRUPT, &eb_head->bflags);
>                 ret = -EIO;
>         }
>
> @@ -643,7 +655,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
>                 set_extent_buffer_uptodate(eb);
>  err:
>         if (reads_done &&
> -           test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
> +           test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb_head->bflags))
>                 btree_readahead_hook(root, eb, eb->start, ret);
>
>         if (ret) {
> @@ -652,7 +664,7 @@ err:
>                  * again, we have to make sure it has something
>                  * to decrement
>                  */
> -               atomic_inc(&eb->io_pages);
> +               atomic_inc(&eb_head->io_pages);
>                 clear_extent_buffer_uptodate(eb);
>         }
>         free_extent_buffer(eb);
> @@ -662,15 +674,22 @@ out:
>
>  static int btree_io_failed_hook(struct page *page, int failed_mirror)
>  {
> +       struct extent_buffer_head *eb_head
> +                       =  (struct extent_buffer_head *)page->private;
>         struct extent_buffer *eb;
>         struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
>
> -       eb = (struct extent_buffer *)page->private;
> -       set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
> -       eb->read_mirror = failed_mirror;
> -       atomic_dec(&eb->io_pages);
> -       if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
> +       set_bit(EXTENT_BUFFER_IOERR, &eb_head->bflags);
> +       eb_head->read_mirror = failed_mirror;
> +       atomic_dec(&eb_head->io_pages);
> +       /* Get the eb corresponding to this IO */
> +       eb = eb_head->io_eb;
> +       if (!eb)
> +               goto out;
> +       eb_head->io_eb = NULL;
> +       if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb_head->bflags))
>                 btree_readahead_hook(root, eb, eb->start, -EIO);
> +out:
>         return -EIO;    /* we fixed nothing */
>  }
>
> @@ -1021,14 +1040,20 @@ static void btree_invalidatepage(struct page *page, unsigned int offset,
>  static int btree_set_page_dirty(struct page *page)
>  {
>  #ifdef DEBUG
> +       struct extent_buffer_head *ebh;
>         struct extent_buffer *eb;
> +       int i, dirty = 0;
>
>         BUG_ON(!PagePrivate(page));
> -       eb = (struct extent_buffer *)page->private;
> -       BUG_ON(!eb);
> -       BUG_ON(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
> -       BUG_ON(!atomic_read(&eb->refs));
> -       btrfs_assert_tree_locked(eb);
> +       ebh = (struct extent_buffer_head *)page->private;
> +       BUG_ON(!ebh);
> +       for (i = 0; i < MAX_EXTENT_BUFFERS_PER_PAGE && !dirty; i++) {
> +               eb = &ebh->extent_buf[i];
> +               dirty = test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
> +       }
> +       BUG_ON(dirty);
> +       BUG_ON(!atomic_read(&ebh->refs));
> +       btrfs_assert_tree_locked(&ebh->extent_buf[0]);
>  #endif
>         return __set_page_dirty_nobuffers(page);
>  }
> @@ -1072,7 +1097,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr, u32 blocksize,
>         if (!buf)
>                 return 0;
>
> -       set_bit(EXTENT_BUFFER_READAHEAD, &buf->bflags);
> +       set_bit(EXTENT_BUFFER_READAHEAD, &eb_head(buf)->bflags);
>
>         ret = read_extent_buffer_pages(io_tree, buf, 0, WAIT_PAGE_LOCK,
>                                        btree_get_extent, mirror_num);
> @@ -1081,7 +1106,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr, u32 blocksize,
>                 return ret;
>         }
>
> -       if (test_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags)) {
> +       if (test_bit(EXTENT_BUFFER_CORRUPT, &eb_head(buf)->bflags)) {
>                 free_extent_buffer(buf);
>                 return -EIO;
>         } else if (extent_buffer_uptodate(buf)) {
> @@ -1115,14 +1140,16 @@ struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
>
>  int btrfs_write_tree_block(struct extent_buffer *buf)
>  {
> -       return filemap_fdatawrite_range(buf->pages[0]->mapping, buf->start,
> +       return filemap_fdatawrite_range(eb_head(buf)->pages[0]->mapping,
> +                                       buf->start,
>                                         buf->start + buf->len - 1);
>  }
>
>  int btrfs_wait_tree_block_writeback(struct extent_buffer *buf)
>  {
> -       return filemap_fdatawait_range(buf->pages[0]->mapping,
> -                                      buf->start, buf->start + buf->len - 1);
> +       return filemap_fdatawait_range(eb_head(buf)->pages[0]->mapping,
> +                                       buf->start,
> +                                       buf->start + buf->len - 1);
>  }
>
>  struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
> @@ -1153,7 +1180,8 @@ void clean_tree_block(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>             fs_info->running_transaction->transid) {
>                 btrfs_assert_tree_locked(buf);
>
> -               if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &buf->bflags)) {
> +               if (test_and_clear_bit(EXTENT_BUFFER_DIRTY,
> +                                               &buf->ebflags)) {
>                         __percpu_counter_add(&fs_info->dirty_metadata_bytes,
>                                              -buf->len,
>                                              fs_info->dirty_metadata_batch);
> @@ -2613,7 +2641,8 @@ int open_ctree(struct super_block *sb,
>                                            btrfs_super_chunk_root(disk_super),
>                                            blocksize, generation);
>         if (!chunk_root->node ||
> -           !test_bit(EXTENT_BUFFER_UPTODATE, &chunk_root->node->bflags)) {
> +           !test_bit(EXTENT_BUFFER_UPTODATE,
> +                                       &eb_head(chunk_root->node)->bflags)) {
>                 printk(KERN_WARNING "btrfs: failed to read chunk root on %s\n",
>                        sb->s_id);
>                 goto fail_tree_roots;
> @@ -2652,7 +2681,8 @@ retry_root_backup:
>                                           btrfs_super_root(disk_super),
>                                           blocksize, generation);
>         if (!tree_root->node ||
> -           !test_bit(EXTENT_BUFFER_UPTODATE, &tree_root->node->bflags)) {
> +           !test_bit(EXTENT_BUFFER_UPTODATE,
> +                                       &eb_head(tree_root->node)->bflags)) {
>                 printk(KERN_WARNING "btrfs: failed to read tree root on %s\n",
>                        sb->s_id);
>
> @@ -3619,7 +3649,7 @@ int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
>                           int atomic)
>  {
>         int ret;
> -       struct inode *btree_inode = buf->pages[0]->mapping->host;
> +       struct inode *btree_inode = eb_head(buf)->pages[0]->mapping->host;
>
>         ret = extent_buffer_uptodate(buf);
>         if (!ret)
> @@ -3652,7 +3682,7 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
>         if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, &buf->bflags)))
>                 return;
>  #endif
> -       root = BTRFS_I(buf->pages[0]->mapping->host)->root;
> +       root = BTRFS_I(eb_head(buf)->pages[0]->mapping->host)->root;
>         btrfs_assert_tree_locked(buf);
>         if (transid != root->fs_info->generation)
>                 WARN(1, KERN_CRIT "btrfs transid mismatch buffer %llu, "
> @@ -3701,7 +3731,8 @@ void btrfs_btree_balance_dirty_nodelay(struct btrfs_root *root)
>
>  int btrfs_read_buffer(struct extent_buffer *buf, u64 parent_transid)
>  {
> -       struct btrfs_root *root = BTRFS_I(buf->pages[0]->mapping->host)->root;
> +       struct btrfs_root *root =
> +                       BTRFS_I(eb_head(buf)->pages[0]->mapping->host)->root;
>         return btree_read_extent_buffer_pages(root, buf, 0, parent_transid);
>  }
>
> @@ -3938,7 +3969,7 @@ static int btrfs_destroy_marked_extents(struct btrfs_root *root,
>                         wait_on_extent_buffer_writeback(eb);
>
>                         if (test_and_clear_bit(EXTENT_BUFFER_DIRTY,
> -                                              &eb->bflags))
> +                                              &eb->ebflags))
>                                 clear_extent_buffer_dirty(eb);
>                         free_extent_buffer_stale(eb);
>                 }
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 45d98d0..79cf87f 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -6019,7 +6019,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
>                         goto out;
>                 }
>
> -               WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
> +               WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->ebflags));
>
>                 btrfs_add_free_space(cache, buf->start, buf->len);
>                 btrfs_update_reserved_bytes(cache, buf->len, RESERVE_FREE);
> @@ -6036,7 +6036,7 @@ out:
>          * Deleting the buffer, clear the corrupt flag since it doesn't matter
>          * anymore.
>          */
> -       clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
> +       clear_bit(EXTENT_BUFFER_CORRUPT, &eb_head(buf)->bflags);
>         btrfs_put_block_group(cache);
>  }
>
> @@ -6910,7 +6910,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>         btrfs_set_buffer_lockdep_class(root->root_key.objectid, buf, level);
>         btrfs_tree_lock(buf);
>         clean_tree_block(trans, root, buf);
> -       clear_bit(EXTENT_BUFFER_STALE, &buf->bflags);
> +       clear_bit(EXTENT_BUFFER_STALE, &eb_head(buf)->bflags);
>
>         btrfs_set_lock_blocking(buf);
>         btrfs_set_buffer_uptodate(buf);
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index ff43802..a1a849b 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -54,8 +54,10 @@ void btrfs_leak_debug_del(struct list_head *entry)
>  static inline
>  void btrfs_leak_debug_check(void)
>  {
> +       int i;
>         struct extent_state *state;
>         struct extent_buffer *eb;
> +       struct extent_buffer_head *ebh;
>
>         while (!list_empty(&states)) {
>                 state = list_entry(states.next, struct extent_state, leak_list);
> @@ -68,12 +70,17 @@ void btrfs_leak_debug_check(void)
>         }
>
>         while (!list_empty(&buffers)) {
> -               eb = list_entry(buffers.next, struct extent_buffer, leak_list);
> -               printk(KERN_ERR "btrfs buffer leak start %llu len %lu "
> -                      "refs %d\n",
> -                      eb->start, eb->len, atomic_read(&eb->refs));
> -               list_del(&eb->leak_list);
> -               kmem_cache_free(extent_buffer_cache, eb);
> +               ebh = list_entry(buffers.next, struct extent_buffer_head, leak_list);
> +               printk(KERN_ERR "btrfs buffer leak ");
> +               for (i = 0; i < MAX_EXTENT_BUFFERS_PER_PAGE; i++) {
> +                       eb = &ebh->extent_buf[i];
> +                       if (!eb->start)
> +                               break;
> +                       printk(KERN_ERR "eb %p %llu:%lu ", eb, eb->start, eb->len);
> +               }
> +               printk(KERN_ERR "refs %d\n", atomic_read(&ebh->refs));
> +               list_del(&ebh->leak_list);
> +               kmem_cache_free(extent_buffer_cache, ebh);
>         }
>  }
>
> @@ -136,7 +143,7 @@ int __init extent_io_init(void)
>                 return -ENOMEM;
>
>         extent_buffer_cache = kmem_cache_create("btrfs_extent_buffer",
> -                       sizeof(struct extent_buffer), 0,
> +                       sizeof(struct extent_buffer_head), 0,
>                         SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
>         if (!extent_buffer_cache)
>                 goto free_state_cache;
> @@ -2023,7 +2030,7 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
>  int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
>                          int mirror_num)
>  {
> -       u64 start = eb->start;
> +       u64 start = eb_head(eb)->extent_buf[0].start;
>         unsigned long i, num_pages = num_extent_pages(eb->start, eb->len);
>         int ret = 0;
>
> @@ -2680,15 +2687,15 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree,
>         return ret;
>  }
>
> -static void attach_extent_buffer_page(struct extent_buffer *eb,
> +static void attach_extent_buffer_page(struct extent_buffer_head *ebh,
>                                       struct page *page)
>  {
>         if (!PagePrivate(page)) {
>                 SetPagePrivate(page);
>                 page_cache_get(page);
> -               set_page_private(page, (unsigned long)eb);
> +               set_page_private(page, (unsigned long)ebh);
>         } else {
> -               WARN_ON(page->private != (unsigned long)eb);
> +               WARN_ON(page->private != (unsigned long)ebh);
>         }
>  }
>
> @@ -3327,17 +3334,19 @@ static int eb_wait(void *word)
>
>  void wait_on_extent_buffer_writeback(struct extent_buffer *eb)
>  {
> -       wait_on_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK, eb_wait,
> +       wait_on_bit(&eb_head(eb)->bflags, EXTENT_BUFFER_WRITEBACK, eb_wait,
>                     TASK_UNINTERRUPTIBLE);
>  }
>
> -static int lock_extent_buffer_for_io(struct extent_buffer *eb,
> +static int lock_extent_buffer_for_io(struct extent_buffer_head *ebh,
>                                      struct btrfs_fs_info *fs_info,
>                                      struct extent_page_data *epd)
>  {
>         unsigned long i, num_pages;
>         int flush = 0;
> +       bool dirty = false, dirty_arr[MAX_EXTENT_BUFFERS_PER_PAGE];
>         int ret = 0;
> +       struct extent_buffer *eb = &ebh->extent_buf[0], *ebtemp;
>
>         if (!btrfs_try_tree_write_lock(eb)) {
>                 flush = 1;
> @@ -3345,7 +3354,7 @@ static int lock_extent_buffer_for_io(struct extent_buffer *eb,
>                 btrfs_tree_lock(eb);
>         }
>
> -       if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) {
> +       if (test_bit(EXTENT_BUFFER_WRITEBACK, &ebh->bflags)) {
>                 btrfs_tree_unlock(eb);
>                 if (!epd->sync_io)
>                         return 0;
> @@ -3356,7 +3365,7 @@ static int lock_extent_buffer_for_io(struct extent_buffer *eb,
>                 while (1) {
>                         wait_on_extent_buffer_writeback(eb);
>                         btrfs_tree_lock(eb);
> -                       if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags))
> +                       if (!test_bit(EXTENT_BUFFER_WRITEBACK, &ebh->bflags))
>                                 break;
>                         btrfs_tree_unlock(eb);
>                 }
> @@ -3367,17 +3376,27 @@ static int lock_extent_buffer_for_io(struct extent_buffer *eb,
>          * under IO since we can end up having no IO bits set for a short period
>          * of time.
>          */
> -       spin_lock(&eb->refs_lock);
> -       if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
> -               set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
> -               spin_unlock(&eb->refs_lock);
> -               btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
> -               __percpu_counter_add(&fs_info->dirty_metadata_bytes,
> -                                    -eb->len,
> +       spin_lock(&ebh->refs_lock);
> +       for (i = 0; i < MAX_EXTENT_BUFFERS_PER_PAGE; i++) {
> +               ebtemp = &ebh->extent_buf[i];
> +               dirty_arr[i] |= test_and_clear_bit(EXTENT_BUFFER_DIRTY, &ebtemp->ebflags);
dirty_arr isn't initialized, so this "|=" ORs into uninitialized stack
data; changing the "|=" to "=" fixed a crash I was hitting when doing
writes.
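Something like this (untested) is what I mean — either zero the array
up front:

	memset(dirty_arr, 0, sizeof(dirty_arr));

or, since each slot is written exactly once per pass through this
loop, just assign instead of ORing:

	dirty_arr[i] = test_and_clear_bit(EXTENT_BUFFER_DIRTY,
					  &ebtemp->ebflags);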
> +               dirty = dirty || dirty_arr[i];
> +       }
> +       if (dirty) {
> +               set_bit(EXTENT_BUFFER_WRITEBACK, &ebh->bflags);
> +               spin_unlock(&ebh->refs_lock);
> +               for (i = 0; i < MAX_EXTENT_BUFFERS_PER_PAGE; i++) {
> +                       if (dirty_arr[i] == false)
> +                               continue;
> +                       ebtemp = &ebh->extent_buf[i];
> +                       btrfs_set_header_flag(ebtemp, BTRFS_HEADER_FLAG_WRITTEN);
> +                       __percpu_counter_add(&fs_info->dirty_metadata_bytes,
> +                                    -ebtemp->len,
>                                      fs_info->dirty_metadata_batch);
> +               }
>                 ret = 1;
>         } else {
> -               spin_unlock(&eb->refs_lock);
> +               spin_unlock(&ebh->refs_lock);
>         }
>
>         btrfs_tree_unlock(eb);
> @@ -3401,30 +3420,30 @@ static int lock_extent_buffer_for_io(struct extent_buffer *eb,
>         return ret;
>  }
>
> -static void end_extent_buffer_writeback(struct extent_buffer *eb)
> +static void end_extent_buffer_writeback(struct extent_buffer_head *ebh)
>  {
> -       clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
> +       clear_bit(EXTENT_BUFFER_WRITEBACK, &ebh->bflags);
>         smp_mb__after_clear_bit();
> -       wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
> +       wake_up_bit(&ebh->bflags, EXTENT_BUFFER_WRITEBACK);
>  }
>
>  static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
>  {
>         int uptodate = err == 0;
>         struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
> -       struct extent_buffer *eb;
> +       struct extent_buffer_head *ebh;
>         int done;
>
>         do {
>                 struct page *page = bvec->bv_page;
>
>                 bvec--;
> -               eb = (struct extent_buffer *)page->private;
> -               BUG_ON(!eb);
> -               done = atomic_dec_and_test(&eb->io_pages);
> +               ebh = (struct extent_buffer_head *)page->private;
> +               BUG_ON(!ebh);
> +               done = atomic_dec_and_test(&ebh->io_pages);
>
> -               if (!uptodate || test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
> -                       set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
> +               if (!uptodate || test_bit(EXTENT_BUFFER_IOERR, &ebh->bflags)) {
> +                       set_bit(EXTENT_BUFFER_IOERR, &ebh->bflags);
>                         ClearPageUptodate(page);
>                         SetPageError(page);
>                 }
> @@ -3434,7 +3453,7 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
>                 if (!done)
>                         continue;
>
> -               end_extent_buffer_writeback(eb);
> +               end_extent_buffer_writeback(ebh);
>         } while (bvec >= bio->bi_io_vec);
>
>         bio_put(bio);
> @@ -3447,15 +3466,15 @@ static int write_one_eb(struct extent_buffer *eb,
>                         struct extent_page_data *epd)
>  {
>         struct block_device *bdev = fs_info->fs_devices->latest_bdev;
> -       u64 offset = eb->start;
> +       u64 offset = eb->start & ~(PAGE_CACHE_SIZE - 1);
>         unsigned long i, num_pages;
>         unsigned long bio_flags = 0;
>         int rw = (epd->sync_io ? WRITE_SYNC : WRITE) | REQ_META;
>         int ret = 0;
>
> -       clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
> +       clear_bit(EXTENT_BUFFER_IOERR, &eb_head(eb)->bflags);
>         num_pages = num_extent_pages(eb->start, eb->len);
> -       atomic_set(&eb->io_pages, num_pages);
> +       atomic_set(&eb_head(eb)->io_pages, num_pages);
>         if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
>                 bio_flags = EXTENT_BIO_TREE_LOG;
>
> @@ -3464,16 +3483,17 @@ static int write_one_eb(struct extent_buffer *eb,
>
>                 clear_page_dirty_for_io(p);
>                 set_page_writeback(p);
> -               ret = submit_extent_page(rw, eb->tree, p, offset >> 9,
> +               ret = submit_extent_page(rw, eb_head(eb)->tree, p, offset >> 9,
>                                          PAGE_CACHE_SIZE, 0, bdev, &epd->bio,
>                                          -1, end_bio_extent_buffer_writepage,
>                                          0, epd->bio_flags, bio_flags);
>                 epd->bio_flags = bio_flags;
>                 if (ret) {
> -                       set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
> +                       set_bit(EXTENT_BUFFER_IOERR, &eb_head(eb)->bflags);
>                         SetPageError(p);
> -                       if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
> -                               end_extent_buffer_writeback(eb);
> +                       if (atomic_sub_and_test(num_pages - i,
> +                                                       &eb_head(eb)->io_pages))
> +                               end_extent_buffer_writeback(eb_head(eb));
>                         ret = -EIO;
>                         break;
>                 }
> @@ -3497,7 +3517,8 @@ int btree_write_cache_pages(struct address_space *mapping,
>  {
>         struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
>         struct btrfs_fs_info *fs_info = BTRFS_I(mapping->host)->root->fs_info;
> -       struct extent_buffer *eb, *prev_eb = NULL;
> +       struct extent_buffer *eb;
> +       struct extent_buffer_head *ebh, *prev_ebh = NULL;
>         struct extent_page_data epd = {
>                 .bio = NULL,
>                 .tree = tree,
> @@ -3554,30 +3575,31 @@ retry:
>                                 continue;
>                         }
>
> -                       eb = (struct extent_buffer *)page->private;
> +                       ebh = (struct extent_buffer_head *)page->private;
>
>                         /*
>                          * Shouldn't happen and normally this would be a BUG_ON
>                          * but no sense in crashing the user's box for something
>                          * we can survive anyway.
>                          */
> -                       if (WARN_ON(!eb)) {
> +                       if (WARN_ON(!ebh)) {
>                                 spin_unlock(&mapping->private_lock);
>                                 continue;
>                         }
>
> -                       if (eb == prev_eb) {
> +                       if (ebh == prev_ebh) {
>                                 spin_unlock(&mapping->private_lock);
>                                 continue;
>                         }
>
> -                       ret = atomic_inc_not_zero(&eb->refs);
> +                       ret = atomic_inc_not_zero(&ebh->refs);
>                         spin_unlock(&mapping->private_lock);
>                         if (!ret)
>                                 continue;
>
> -                       prev_eb = eb;
> -                       ret = lock_extent_buffer_for_io(eb, fs_info, &epd);
> +                       eb = &ebh->extent_buf[0];
> +                       prev_ebh = ebh;
> +                       ret = lock_extent_buffer_for_io(ebh, fs_info, &epd);
>                         if (!ret) {
>                                 free_extent_buffer(eb);
>                                 continue;
> @@ -4257,17 +4279,23 @@ out:
>         return ret;
>  }
>
> -static void __free_extent_buffer(struct extent_buffer *eb)
> +static void __free_extent_buffer(struct extent_buffer_head *ebh)
>  {
> -       btrfs_leak_debug_del(&eb->leak_list);
> -       kmem_cache_free(extent_buffer_cache, eb);
> +       btrfs_leak_debug_del(&ebh->leak_list);
> +       kmem_cache_free(extent_buffer_cache, ebh);
>  }
>
> -static int extent_buffer_under_io(struct extent_buffer *eb)
> +static int extent_buffer_under_io(struct extent_buffer_head *ebh)
>  {
> -       return (atomic_read(&eb->io_pages) ||
> -               test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags) ||
> -               test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
> +       int i, dirty = 0;
> +       struct extent_buffer *eb;
> +
> +       for (i = 0; i < MAX_EXTENT_BUFFERS_PER_PAGE && !dirty; i++) {
> +               eb = &ebh->extent_buf[i];
> +               dirty = test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
> +       }
> +       return (dirty || atomic_read(&ebh->io_pages) ||
> +               test_bit(EXTENT_BUFFER_WRITEBACK, &ebh->bflags));
>  }
>
>  /*
> @@ -4279,9 +4307,10 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb,
>         unsigned long index;
>         unsigned long num_pages;
>         struct page *page;
> -       int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
> +       struct extent_buffer_head *ebh = eb_head(eb);
> +       int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &ebh->bflags);
>
> -       BUG_ON(extent_buffer_under_io(eb));
> +       BUG_ON(extent_buffer_under_io(ebh));
>
>         num_pages = num_extent_pages(eb->start, eb->len);
>         index = start_idx + num_pages;
> @@ -4301,8 +4330,8 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb,
>                          * this eb.
>                          */
>                         if (PagePrivate(page) &&
> -                           page->private == (unsigned long)eb) {
> -                               BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
> +                           page->private == (unsigned long)ebh) {
> +                               BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags));
>                                 BUG_ON(PageDirty(page));
>                                 BUG_ON(PageWriteback(page));
>                                 /*
> @@ -4330,23 +4359,14 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb,
>  static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
>  {
>         btrfs_release_extent_buffer_page(eb, 0);
> -       __free_extent_buffer(eb);
> +       __free_extent_buffer(eb_head(eb));
>  }
>
> -static struct extent_buffer *__alloc_extent_buffer(struct extent_io_tree *tree,
> -                                                  u64 start,
> -                                                  unsigned long len,
> -                                                  gfp_t mask)
> +static void __init_extent_buffer(struct extent_buffer *eb, u64 start,
> +                               unsigned long len)
>  {
> -       struct extent_buffer *eb = NULL;
> -
> -       eb = kmem_cache_zalloc(extent_buffer_cache, mask);
> -       if (eb == NULL)
> -               return NULL;
>         eb->start = start;
>         eb->len = len;
> -       eb->tree = tree;
> -       eb->bflags = 0;
>         rwlock_init(&eb->lock);
>         atomic_set(&eb->write_locks, 0);
>         atomic_set(&eb->read_locks, 0);
> @@ -4357,12 +4377,27 @@ static struct extent_buffer *__alloc_extent_buffer(struct extent_io_tree *tree,
>         eb->lock_nested = 0;
>         init_waitqueue_head(&eb->write_lock_wq);
>         init_waitqueue_head(&eb->read_lock_wq);
> +}
> +
> +static struct extent_buffer *__alloc_extent_buffer(struct extent_io_tree *tree,
> +                                                  u64 start,
> +                                                  unsigned long len,
> +                                                  gfp_t mask)
> +{
> +       struct extent_buffer_head *ebh = NULL;
> +       struct extent_buffer *eb = NULL;
> +       int i, index = -1;
>
> -       btrfs_leak_debug_add(&eb->leak_list, &buffers);
> +       ebh = kmem_cache_zalloc(extent_buffer_cache, mask);
> +       if (ebh == NULL)
> +               return NULL;
> +       ebh->tree = tree;
> +       ebh->bflags = 0;
> +       btrfs_leak_debug_add(&ebh->leak_list, &buffers);
>
> -       spin_lock_init(&eb->refs_lock);
> -       atomic_set(&eb->refs, 1);
> -       atomic_set(&eb->io_pages, 0);
> +       spin_lock_init(&ebh->refs_lock);
> +       atomic_set(&ebh->refs, 1);
> +       atomic_set(&ebh->io_pages, 0);
>
>         /*
>          * Sanity checks, currently the maximum is 64k covered by 16x 4k pages
> @@ -4371,6 +4406,34 @@ static struct extent_buffer *__alloc_extent_buffer(struct extent_io_tree *tree,
>                 > MAX_INLINE_EXTENT_BUFFER_SIZE);
>         BUG_ON(len > MAX_INLINE_EXTENT_BUFFER_SIZE);
>
> +       if (len < PAGE_CACHE_SIZE) {
> +               u64 st = start & ~(PAGE_CACHE_SIZE - 1);
> +               unsigned long totlen = 0;
> +               /*
> +                * Make sure we have enough room to fit extent buffers
> +                * that belong to a single page in a single extent_buffer_head.
> +                * If this BUG_ON is tripped, then it means either the
> +                * blocksize, i.e len, is too small or we need to increase
> +                * MAX_EXTENT_BUFFERS_PER_PAGE.
> +                */
> +               BUG_ON(len * MAX_EXTENT_BUFFERS_PER_PAGE < PAGE_CACHE_SIZE);
> +
> +               for (i = 0; i < MAX_EXTENT_BUFFERS_PER_PAGE
> +                               && totlen < PAGE_CACHE_SIZE ;
> +                               i++, st += len, totlen += len) {
> +                       __init_extent_buffer(&ebh->extent_buf[i], st, len);
> +                       if (st == start) {
> +                               index = i;
> +                               eb = &ebh->extent_buf[i];
> +                       }
> +
> +               }
> +               BUG_ON(!eb);
> +       } else {
> +               eb = &ebh->extent_buf[0];
> +               __init_extent_buffer(eb, start, len);
> +       }
> +
>         return eb;
>  }
>
> @@ -4391,15 +4454,15 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
>                         btrfs_release_extent_buffer(new);
>                         return NULL;
>                 }
> -               attach_extent_buffer_page(new, p);
> +               attach_extent_buffer_page(eb_head(new), p);
>                 WARN_ON(PageDirty(p));
>                 SetPageUptodate(p);
> -               new->pages[i] = p;
> +               eb_head(new)->pages[i] = p;
>         }
>
>         copy_extent_buffer(new, src, 0, 0, src->len);
> -       set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
> -       set_bit(EXTENT_BUFFER_DUMMY, &new->bflags);
> +       set_bit(EXTENT_BUFFER_UPTODATE, &eb_head(new)->bflags);
> +       set_bit(EXTENT_BUFFER_DUMMY, &eb_head(new)->bflags);
>
>         return new;
>  }
> @@ -4415,19 +4478,19 @@ struct extent_buffer *alloc_dummy_extent_buffer(u64 start, unsigned long len)
>                 return NULL;
>
>         for (i = 0; i < num_pages; i++) {
> -               eb->pages[i] = alloc_page(GFP_NOFS);
> -               if (!eb->pages[i])
> +               eb_head(eb)->pages[i] = alloc_page(GFP_NOFS);
> +               if (!eb_head(eb)->pages[i])
>                         goto err;
>         }
>         set_extent_buffer_uptodate(eb);
>         btrfs_set_header_nritems(eb, 0);
> -       set_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
> +       set_bit(EXTENT_BUFFER_DUMMY, &eb_head(eb)->bflags);
>
>         return eb;
>  err:
>         for (; i > 0; i--)
> -               __free_page(eb->pages[i - 1]);
> -       __free_extent_buffer(eb);
> +               __free_page(eb_head(eb)->pages[i - 1]);
> +       __free_extent_buffer(eb_head(eb));
>         return NULL;
>  }
>
> @@ -4454,14 +4517,15 @@ static void check_buffer_tree_ref(struct extent_buffer *eb)
>          * So bump the ref count first, then set the bit.  If someone
>          * beat us to it, drop the ref we added.
>          */
> -       refs = atomic_read(&eb->refs);
> -       if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> +       refs = atomic_read(&eb_head(eb)->refs);
> +       if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF,
> +                                               &eb_head(eb)->bflags))
>                 return;
>
> -       spin_lock(&eb->refs_lock);
> -       if (!test_and_set_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> -               atomic_inc(&eb->refs);
> -       spin_unlock(&eb->refs_lock);
> +       spin_lock(&eb_head(eb)->refs_lock);
> +       if (!test_and_set_bit(EXTENT_BUFFER_TREE_REF, &eb_head(eb)->bflags))
> +               atomic_inc(&eb_head(eb)->refs);
> +       spin_unlock(&eb_head(eb)->refs_lock);
>  }
>
>  static void mark_extent_buffer_accessed(struct extent_buffer *eb)
> @@ -4481,13 +4545,22 @@ struct extent_buffer *find_extent_buffer(struct extent_io_tree *tree,
>                                                         u64 start)
>  {
>         struct extent_buffer *eb;
> +       struct extent_buffer_head *ebh;
> +       int i = 0;
>
>         rcu_read_lock();
> -       eb = radix_tree_lookup(&tree->buffer, start >> PAGE_CACHE_SHIFT);
> -       if (eb && atomic_inc_not_zero(&eb->refs)) {
> +       ebh = radix_tree_lookup(&tree->buffer, start >> PAGE_CACHE_SHIFT);
> +       if (ebh && atomic_inc_not_zero(&ebh->refs)) {
>                 rcu_read_unlock();
> -               mark_extent_buffer_accessed(eb);
> -               return eb;
> +
> +               do {
> +                       eb = &ebh->extent_buf[i++];
> +                       if (eb->start == start) {
> +                               mark_extent_buffer_accessed(eb);
> +                               return eb;
> +                       }
> +               } while (i < MAX_EXTENT_BUFFERS_PER_PAGE);
> +               BUG();
>         }
>         rcu_read_unlock();
>
> @@ -4500,8 +4573,8 @@ struct extent_buffer *alloc_extent_buffer(struct extent_io_tree *tree,
>         unsigned long num_pages = num_extent_pages(start, len);
>         unsigned long i;
>         unsigned long index = start >> PAGE_CACHE_SHIFT;
> -       struct extent_buffer *eb;
> -       struct extent_buffer *exists = NULL;
> +       struct extent_buffer *eb, *old_eb = NULL;
> +       struct extent_buffer_head *exists = NULL;
>         struct page *p;
>         struct address_space *mapping = tree->mapping;
>         int uptodate = 1;
> @@ -4530,13 +4603,20 @@ struct extent_buffer *alloc_extent_buffer(struct extent_io_tree *tree,
>                          * we can just return that one, else we know we can just
>                          * overwrite page->private.
>                          */
> -                       exists = (struct extent_buffer *)p->private;
> +                       exists = (struct extent_buffer_head *)p->private;
>                         if (atomic_inc_not_zero(&exists->refs)) {
> +                               int j = 0;
>                                 spin_unlock(&mapping->private_lock);
>                                 unlock_page(p);
>                                 page_cache_release(p);
> -                               mark_extent_buffer_accessed(exists);
> -                               goto free_eb;
> +                               do {
> +                                       old_eb = &exists->extent_buf[j++];
> +                                       if (old_eb->start == start) {
> +                                               mark_extent_buffer_accessed(old_eb);
> +                                               goto free_eb;
> +                                       }
> +                               } while (j < MAX_EXTENT_BUFFERS_PER_PAGE);
> +                               BUG();
>                         }
>
>                         /*
> @@ -4547,11 +4627,11 @@ struct extent_buffer *alloc_extent_buffer(struct extent_io_tree *tree,
>                         WARN_ON(PageDirty(p));
>                         page_cache_release(p);
>                 }
> -               attach_extent_buffer_page(eb, p);
> +               attach_extent_buffer_page(eb_head(eb), p);
>                 spin_unlock(&mapping->private_lock);
>                 WARN_ON(PageDirty(p));
>                 mark_page_accessed(p);
> -               eb->pages[i] = p;
> +               eb_head(eb)->pages[i] = p;
>                 if (!PageUptodate(p))
>                         uptodate = 0;
>
> @@ -4561,19 +4641,20 @@ struct extent_buffer *alloc_extent_buffer(struct extent_io_tree *tree,
>                  */
>         }
>         if (uptodate)
> -               set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
> +               set_bit(EXTENT_BUFFER_UPTODATE, &eb_head(eb)->bflags);
>  again:
>         ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
>         if (ret)
>                 goto free_eb;
>
>         spin_lock(&tree->buffer_lock);
> -       ret = radix_tree_insert(&tree->buffer, start >> PAGE_CACHE_SHIFT, eb);
> +       ret = radix_tree_insert(&tree->buffer,
> +                               start >> PAGE_CACHE_SHIFT, eb_head(eb));
>         spin_unlock(&tree->buffer_lock);
>         radix_tree_preload_end();
>         if (ret == -EEXIST) {
> -               exists = find_extent_buffer(tree, start);
> -               if (exists)
> +               old_eb = find_extent_buffer(tree, start);
> +               if (old_eb)
>                         goto free_eb;
>                 else
>                         goto again;
> @@ -4590,58 +4671,58 @@ again:
>          * after the extent buffer is in the radix tree so
>          * it doesn't get lost
>          */
> -       SetPageChecked(eb->pages[0]);
> +       SetPageChecked(eb_head(eb)->pages[0]);
>         for (i = 1; i < num_pages; i++) {
>                 p = extent_buffer_page(eb, i);
>                 ClearPageChecked(p);
>                 unlock_page(p);
>         }
> -       unlock_page(eb->pages[0]);
> +       unlock_page(eb_head(eb)->pages[0]);
>         return eb;
>
>  free_eb:
>         for (i = 0; i < num_pages; i++) {
> -               if (eb->pages[i])
> -                       unlock_page(eb->pages[i]);
> +               if (eb_head(eb)->pages[i])
> +                       unlock_page(eb_head(eb)->pages[i]);
>         }
>
> -       WARN_ON(!atomic_dec_and_test(&eb->refs));
> +       WARN_ON(!atomic_dec_and_test(&eb_head(eb)->refs));
>         btrfs_release_extent_buffer(eb);
> -       return exists;
> +       return old_eb;
>  }
>
>  static inline void btrfs_release_extent_buffer_rcu(struct rcu_head *head)
>  {
> -       struct extent_buffer *eb =
> -                       container_of(head, struct extent_buffer, rcu_head);
> +       struct extent_buffer_head *ebh =
> +                       container_of(head, struct extent_buffer_head, rcu_head);
>
> -       __free_extent_buffer(eb);
> +       __free_extent_buffer(ebh);
>  }
>
>  /* Expects to have eb->eb_lock already held */
> -static int release_extent_buffer(struct extent_buffer *eb)
> +static int release_extent_buffer(struct extent_buffer_head *ebh)
>  {
> -       WARN_ON(atomic_read(&eb->refs) == 0);
> -       if (atomic_dec_and_test(&eb->refs)) {
> -               if (test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags)) {
> -                       spin_unlock(&eb->refs_lock);
> +       WARN_ON(atomic_read(&ebh->refs) == 0);
> +       if (atomic_dec_and_test(&ebh->refs)) {
> +               if (test_bit(EXTENT_BUFFER_DUMMY, &ebh->bflags)) {
> +                       spin_unlock(&ebh->refs_lock);
>                 } else {
> -                       struct extent_io_tree *tree = eb->tree;
> +                       struct extent_io_tree *tree = ebh->tree;
>
> -                       spin_unlock(&eb->refs_lock);
> +                       spin_unlock(&ebh->refs_lock);
>
>                         spin_lock(&tree->buffer_lock);
>                         radix_tree_delete(&tree->buffer,
> -                                         eb->start >> PAGE_CACHE_SHIFT);
> +                               ebh->extent_buf[0].start >> PAGE_CACHE_SHIFT);
>                         spin_unlock(&tree->buffer_lock);
>                 }
>
>                 /* Should be safe to release our pages at this point */
> -               btrfs_release_extent_buffer_page(eb, 0);
> -               call_rcu(&eb->rcu_head, btrfs_release_extent_buffer_rcu);
> +               btrfs_release_extent_buffer_page(&ebh->extent_buf[0], 0);
> +               call_rcu(&ebh->rcu_head, btrfs_release_extent_buffer_rcu);
>                 return 1;
>         }
> -       spin_unlock(&eb->refs_lock);
> +       spin_unlock(&ebh->refs_lock);
>
>         return 0;
>  }
> @@ -4650,48 +4731,52 @@ void free_extent_buffer(struct extent_buffer *eb)
>  {
>         int refs;
>         int old;
> +       struct extent_buffer_head *ebh;
>         if (!eb)
>                 return;
>
> +       ebh = eb_head(eb);
>         while (1) {
> -               refs = atomic_read(&eb->refs);
> +               refs = atomic_read(&ebh->refs);
>                 if (refs <= 3)
>                         break;
> -               old = atomic_cmpxchg(&eb->refs, refs, refs - 1);
> +               old = atomic_cmpxchg(&ebh->refs, refs, refs - 1);
>                 if (old == refs)
>                         return;
>         }
>
> -       spin_lock(&eb->refs_lock);
> -       if (atomic_read(&eb->refs) == 2 &&
> -           test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags))
> -               atomic_dec(&eb->refs);
> +       spin_lock(&ebh->refs_lock);
> +       if (atomic_read(&ebh->refs) == 2 &&
> +           test_bit(EXTENT_BUFFER_DUMMY, &ebh->bflags))
> +               atomic_dec(&ebh->refs);
>
> -       if (atomic_read(&eb->refs) == 2 &&
> -           test_bit(EXTENT_BUFFER_STALE, &eb->bflags) &&
> -           !extent_buffer_under_io(eb) &&
> -           test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> -               atomic_dec(&eb->refs);
> +       if (atomic_read(&ebh->refs) == 2 &&
> +           test_bit(EXTENT_BUFFER_STALE, &ebh->bflags) &&
> +           !extent_buffer_under_io(ebh) &&
> +           test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &ebh->bflags))
> +               atomic_dec(&ebh->refs);
>
>         /*
>          * I know this is terrible, but it's temporary until we stop tracking
>          * the uptodate bits and such for the extent buffers.
>          */
> -       release_extent_buffer(eb);
> +       release_extent_buffer(ebh);
>  }
>
>  void free_extent_buffer_stale(struct extent_buffer *eb)
>  {
> +       struct extent_buffer_head *ebh;
>         if (!eb)
>                 return;
>
> -       spin_lock(&eb->refs_lock);
> -       set_bit(EXTENT_BUFFER_STALE, &eb->bflags);
> +       ebh = eb_head(eb);
> +       spin_lock(&ebh->refs_lock);
> +       set_bit(EXTENT_BUFFER_STALE, &ebh->bflags);
>
> -       if (atomic_read(&eb->refs) == 2 && !extent_buffer_under_io(eb) &&
> -           test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> -               atomic_dec(&eb->refs);
> -       release_extent_buffer(eb);
> +       if (atomic_read(&ebh->refs) == 2 && !extent_buffer_under_io(ebh) &&
> +           test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &ebh->bflags))
> +               atomic_dec(&ebh->refs);
> +       release_extent_buffer(ebh);
>  }
>
>  void clear_extent_buffer_dirty(struct extent_buffer *eb)
> @@ -4721,7 +4806,7 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb)
>                 ClearPageError(page);
>                 unlock_page(page);
>         }
> -       WARN_ON(atomic_read(&eb->refs) == 0);
> +       WARN_ON(atomic_read(&eb_head(eb)->refs) == 0);
>  }
>
>  int set_extent_buffer_dirty(struct extent_buffer *eb)
> @@ -4732,11 +4817,11 @@ int set_extent_buffer_dirty(struct extent_buffer *eb)
>
>         check_buffer_tree_ref(eb);
>
> -       was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags);
> +       was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
>
>         num_pages = num_extent_pages(eb->start, eb->len);
> -       WARN_ON(atomic_read(&eb->refs) == 0);
> -       WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
> +       WARN_ON(atomic_read(&eb_head(eb)->refs) == 0);
> +       WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb_head(eb)->bflags));
>
>         for (i = 0; i < num_pages; i++)
>                 set_page_dirty(extent_buffer_page(eb, i));
> @@ -4749,7 +4834,9 @@ int clear_extent_buffer_uptodate(struct extent_buffer *eb)
>         struct page *page;
>         unsigned long num_pages;
>
> -       clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
> +       if (!eb || !eb_head(eb))
> +               return 0;
> +       clear_bit(EXTENT_BUFFER_UPTODATE, &eb_head(eb)->bflags);
>         num_pages = num_extent_pages(eb->start, eb->len);
>         for (i = 0; i < num_pages; i++) {
>                 page = extent_buffer_page(eb, i);
> @@ -4765,7 +4852,7 @@ int set_extent_buffer_uptodate(struct extent_buffer *eb)
>         struct page *page;
>         unsigned long num_pages;
>
> -       set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
> +       set_bit(EXTENT_BUFFER_UPTODATE, &eb_head(eb)->bflags);
>         num_pages = num_extent_pages(eb->start, eb->len);
>         for (i = 0; i < num_pages; i++) {
>                 page = extent_buffer_page(eb, i);
> @@ -4776,7 +4863,7 @@ int set_extent_buffer_uptodate(struct extent_buffer *eb)
>
>  int extent_buffer_uptodate(struct extent_buffer *eb)
>  {
> -       return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
> +       return test_bit(EXTENT_BUFFER_UPTODATE, &eb_head(eb)->bflags);
>  }
>
>  int read_extent_buffer_pages(struct extent_io_tree *tree,
> @@ -4794,8 +4881,9 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
>         unsigned long num_reads = 0;
>         struct bio *bio = NULL;
>         unsigned long bio_flags = 0;
> +       struct extent_buffer_head *ebh = eb_head(eb);
>
> -       if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
> +       if (test_bit(EXTENT_BUFFER_UPTODATE, &ebh->bflags))
>                 return 0;
>
>         if (start) {
> @@ -4806,6 +4894,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
>                 start_i = 0;
>         }
>
> +recheck:
>         num_pages = num_extent_pages(eb->start, eb->len);
>         for (i = start_i; i < num_pages; i++) {
>                 page = extent_buffer_page(eb, i);
> @@ -4823,13 +4912,26 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
>         }
>         if (all_uptodate) {
>                 if (start_i == 0)
> -                       set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
> +                       set_bit(EXTENT_BUFFER_UPTODATE, &ebh->bflags);
>                 goto unlock_exit;
>         }
>
> -       clear_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
> -       eb->read_mirror = 0;
> -       atomic_set(&eb->io_pages, num_reads);
> +       if (eb_head(eb)->io_eb) {
> +               all_uptodate = 1;
> +               i = start_i;
> +               while (locked_pages > 0) {
> +                       page = extent_buffer_page(eb, i);
> +                       i++;
> +                       unlock_page(page);
> +                       locked_pages--;
> +               }
> +               goto recheck;
> +       }
> +       BUG_ON(eb_head(eb)->io_eb);
> +       eb_head(eb)->io_eb = eb;
> +       clear_bit(EXTENT_BUFFER_IOERR, &ebh->bflags);
> +       ebh->read_mirror = 0;
> +       atomic_set(&ebh->io_pages, num_reads);
>         for (i = start_i; i < num_pages; i++) {
>                 page = extent_buffer_page(eb, i);
>                 if (!PageUptodate(page)) {
> @@ -5196,7 +5298,7 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
>
>  int try_release_extent_buffer(struct page *page)
>  {
> -       struct extent_buffer *eb;
> +       struct extent_buffer_head *ebh;
>
>         /*
>          * We need to make sure nobody is attaching this page to an eb right
> @@ -5208,17 +5310,17 @@ int try_release_extent_buffer(struct page *page)
>                 return 1;
>         }
>
> -       eb = (struct extent_buffer *)page->private;
> -       BUG_ON(!eb);
> +       ebh = (struct extent_buffer_head *)page->private;
> +       BUG_ON(!ebh);
>
>         /*
>          * This is a little awful but should be ok, we need to make sure that
>          * the eb doesn't disappear out from under us while we're looking at
>          * this page.
>          */
> -       spin_lock(&eb->refs_lock);
> -       if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
> -               spin_unlock(&eb->refs_lock);
> +       spin_lock(&ebh->refs_lock);
> +       if (atomic_read(&ebh->refs) != 1 || extent_buffer_under_io(ebh)) {
> +               spin_unlock(&ebh->refs_lock);
>                 spin_unlock(&page->mapping->private_lock);
>                 return 0;
>         }
> @@ -5228,10 +5330,11 @@ int try_release_extent_buffer(struct page *page)
>          * If tree ref isn't set then we know the ref on this eb is a real ref,
>          * so just return, this page will likely be freed soon anyway.
>          */
> -       if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
> -               spin_unlock(&eb->refs_lock);
> +       if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &ebh->bflags)) {
> +               spin_unlock(&ebh->refs_lock);
>                 return 0;
>         }
>
> -       return release_extent_buffer(eb);
> +       return release_extent_buffer(ebh);
>  }
> +
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index 19620c5..b56de28 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -124,19 +124,12 @@ struct extent_state {
>
>  #define INLINE_EXTENT_BUFFER_PAGES 16
>  #define MAX_INLINE_EXTENT_BUFFER_SIZE (INLINE_EXTENT_BUFFER_PAGES * PAGE_CACHE_SIZE)
> +#define MAX_EXTENT_BUFFERS_PER_PAGE 16
> +
>  struct extent_buffer {
>         u64 start;
>         unsigned long len;
> -       unsigned long map_start;
> -       unsigned long map_len;
> -       unsigned long bflags;
> -       struct extent_io_tree *tree;
> -       spinlock_t refs_lock;
> -       atomic_t refs;
> -       atomic_t io_pages;
> -       int read_mirror;
> -       struct rcu_head rcu_head;
> -       pid_t lock_owner;
> +       unsigned long ebflags;
>
>         /* count of read lock holders on the extent buffer */
>         atomic_t write_locks;
> @@ -147,6 +140,8 @@ struct extent_buffer {
>         atomic_t spinning_writers;
>         int lock_nested;
>
> +       pid_t lock_owner;
> +
>         /* protects write locks */
>         rwlock_t lock;
>
> @@ -160,7 +155,21 @@ struct extent_buffer {
>          */
>         wait_queue_head_t read_lock_wq;
>         wait_queue_head_t lock_wq;
> +};
> +
> +struct extent_buffer_head {
> +       unsigned long bflags;
> +       struct extent_io_tree *tree;
> +       spinlock_t refs_lock;
> +       atomic_t refs;
> +       atomic_t io_pages;
> +       int read_mirror;
> +       struct rcu_head rcu_head;
> +
>         struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
> +
> +       struct extent_buffer extent_buf[MAX_EXTENT_BUFFERS_PER_PAGE];
> +       struct extent_buffer *io_eb; /* eb that submitted the current I/O */
>  #ifdef CONFIG_BTRFS_DEBUG
>         struct list_head leak_list;
>  #endif
> @@ -177,6 +186,24 @@ static inline int extent_compress_type(unsigned long bio_flags)
>         return bio_flags >> EXTENT_BIO_FLAG_SHIFT;
>  }
>
> +/*
> + * return the extent_buffer_head that contains the extent buffer provided.
> + */
> +static inline struct extent_buffer_head *eb_head(struct extent_buffer *eb)
> +{
> +       int start, index;
> +       struct extent_buffer_head *ebh;
> +       struct extent_buffer *eb_base;
> +
> +       BUG_ON(!eb);
> +       start = eb->start & (PAGE_CACHE_SIZE - 1);
> +       index = start >> (ffs(eb->len) - 1);
> +       eb_base = eb - index;
> +       ebh = (struct extent_buffer_head *)
> +               ((char *) eb_base - offsetof(struct extent_buffer_head, extent_buf));
> +       return ebh;
> +
> +}
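Just to sanity-check the pointer arithmetic above (it assumes eb->len,
the blocksize, is a power of two, which the ffs() trick relies on):
with 64K pages and a 4K blocksize, ffs(eb->len) - 1 == 12, so for a
hypothetical eb->start of 0x25003000:

	start   = 0x25003000 & (PAGE_CACHE_SIZE - 1);	/* = 0x3000 */
	index   = 0x3000 >> 12;				/* = 3 */
	eb_base = eb - 3;				/* = &ebh->extent_buf[0] */

and the offsetof() subtraction then recovers the containing
extent_buffer_head.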
>  struct extent_map_tree;
>
>  typedef struct extent_map *(get_extent_t)(struct inode *inode,
> @@ -288,15 +315,15 @@ static inline unsigned long num_extent_pages(u64 start, u64 len)
>                 (start >> PAGE_CACHE_SHIFT);
>  }
>
> -static inline struct page *extent_buffer_page(struct extent_buffer *eb,
> -                                             unsigned long i)
> +static inline struct page *extent_buffer_page(
> +                       struct extent_buffer *eb, unsigned long i)
>  {
> -       return eb->pages[i];
> +       return eb_head(eb)->pages[i];
>  }
>
>  static inline void extent_buffer_get(struct extent_buffer *eb)
>  {
> -       atomic_inc(&eb->refs);
> +       atomic_inc(&eb_head(eb)->refs);
>  }
>
>  int memcmp_extent_buffer(struct extent_buffer *eb, const void *ptrv,
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 92303f4..37b2698 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5921,7 +5921,7 @@ int btrfs_read_sys_array(struct btrfs_root *root)
>          * to silence the warning e.g. on PowerPC 64.
>          */
>         if (PAGE_CACHE_SIZE > BTRFS_SUPER_INFO_SIZE)
> -               SetPageUptodate(sb->pages[0]);
> +               SetPageUptodate(eb_head(sb)->pages[0]);
>
>         write_extent_buffer(sb, super_copy, 0, BTRFS_SUPER_INFO_SIZE);
>         array_size = btrfs_super_sys_array_size(super_copy);
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index 4832d75..ceb194f 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -694,7 +694,7 @@ TRACE_EVENT(btrfs_cow_block,
>         TP_fast_assign(
>                 __entry->root_objectid  = root->root_key.objectid;
>                 __entry->buf_start      = buf->start;
> -               __entry->refs           = atomic_read(&buf->refs);
> +               __entry->refs           = atomic_read(&eb_head(buf)->refs);
>                 __entry->cow_start      = cow->start;
>                 __entry->buf_level      = btrfs_header_level(buf);
>                 __entry->cow_level      = btrfs_header_level(cow);
> --
> 1.7.12.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/7] btrfs: subpagesize-blocksize: Use a global alignment for size
  2013-12-11 23:38 ` [PATCH 2/7] btrfs: subpagesize-blocksize: Use a global alignment for size Chandra Seetharaman
@ 2013-12-16 12:33   ` saeed bishara
  2013-12-16 14:48     ` David Sterba
  0 siblings, 1 reply; 23+ messages in thread
From: saeed bishara @ 2013-12-16 12:33 UTC (permalink / raw)
  To: Chandra Seetharaman; +Cc: linux-btrfs

On Thu, Dec 12, 2013 at 1:38 AM, Chandra Seetharaman
<sekharan@us.ibm.com> wrote:
> In order to handle a blocksize that is smaller than
> PAGE_SIZE, we need to align all IOs to PAGE_SIZE.
>
> This patch defines a new helper, btrfs_align_size(), that
> calculates the alignment size based on the sectorsize,
> and uses it at the appropriate places.
>
> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
> ---
>  fs/btrfs/btrfs_inode.h  |  7 +++++++
>  fs/btrfs/compression.c  |  3 ++-
>  fs/btrfs/extent-tree.c  | 12 ++++++------
>  fs/btrfs/extent_io.c    | 17 ++++++-----------
>  fs/btrfs/file.c         | 15 +++++++--------
>  fs/btrfs/inode.c        | 41 ++++++++++++++++++++++-------------------
>  fs/btrfs/ioctl.c        |  6 +++---
>  fs/btrfs/ordered-data.c |  2 +-
>  fs/btrfs/tree-log.c     |  2 +-
>  9 files changed, 55 insertions(+), 50 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index ac0b39d..eee994f 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -280,4 +280,11 @@ static inline void btrfs_inode_resume_unlocked_dio(struct inode *inode)
>                   &BTRFS_I(inode)->runtime_flags);
>  }
>
> +static inline u64 btrfs_align_size(struct inode *inode)
> +{
> +       if (BTRFS_I(inode)->root->sectorsize < PAGE_CACHE_SIZE)
> +               return (u64)PAGE_CACHE_SIZE;
> +       else
> +               return (u64)BTRFS_I(inode)->root->sectorsize;
> +}
For performance, wouldn't it be worth storing this value somewhere
instead of recalculating it on every call?
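E.g. (completely untested, and the field name is made up) compute it
once wherever sectorsize itself is set up:

	/* at mount time, once sectorsize is known */
	root->align_size = max_t(u64, root->sectorsize, PAGE_CACHE_SIZE);

after which the helper collapses to a plain load:

	static inline u64 btrfs_align_size(struct inode *inode)
	{
		return BTRFS_I(inode)->root->align_size;
	}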
>  #endif
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 1499b27..259f2c5 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -89,9 +89,10 @@ static inline int compressed_bio_size(struct btrfs_root *root,
>                                       unsigned long disk_size)
>  {
>         u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
> +       int align_size = max_t(size_t, root->sectorsize, PAGE_CACHE_SIZE);
>
>         return sizeof(struct compressed_bio) +
> -               ((disk_size + root->sectorsize - 1) / root->sectorsize) *
> +               ((disk_size + align_size - 1) / align_size) *
>                 csum_size;
>  }
>
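To make the size change here concrete: with a 4K sectorsize on a
64K-page machine, align_size is 64K, so a 128K disk_size now reserves
(131072 + 65536 - 1) / 65536 = 2 csum slots in the compressed_bio,
where the old per-sector math reserved 131072 / 4096 = 32.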
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 79cf87f..621af18 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3617,8 +3617,8 @@ int btrfs_check_data_free_space(struct inode *inode, u64 bytes)
>         u64 used;
>         int ret = 0, committed = 0, alloc_chunk = 1;
>
> -       /* make sure bytes are sectorsize aligned */
> -       bytes = ALIGN(bytes, root->sectorsize);
> +       /* make sure bytes are appropriately aligned */
> +       bytes = ALIGN(bytes, btrfs_align_size(inode));
>
>         if (btrfs_is_free_space_inode(inode)) {
>                 committed = 1;
> @@ -3726,8 +3726,8 @@ void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes)
>         struct btrfs_root *root = BTRFS_I(inode)->root;
>         struct btrfs_space_info *data_sinfo;
>
> -       /* make sure bytes are sectorsize aligned */
> -       bytes = ALIGN(bytes, root->sectorsize);
> +       /* make sure bytes are appropriately aligned */
> +       bytes = ALIGN(bytes, btrfs_align_size(inode));
>
>         data_sinfo = root->fs_info->data_sinfo;
>         spin_lock(&data_sinfo->lock);
> @@ -4988,7 +4988,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
>         if (delalloc_lock)
>                 mutex_lock(&BTRFS_I(inode)->delalloc_mutex);
>
> -       num_bytes = ALIGN(num_bytes, root->sectorsize);
> +       num_bytes = ALIGN(num_bytes, btrfs_align_size(inode));
>
>         spin_lock(&BTRFS_I(inode)->lock);
>         BTRFS_I(inode)->outstanding_extents++;
> @@ -5126,7 +5126,7 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes)
>         u64 to_free = 0;
>         unsigned dropped;
>
> -       num_bytes = ALIGN(num_bytes, root->sectorsize);
> +       num_bytes = ALIGN(num_bytes, btrfs_align_size(inode));
>         spin_lock(&BTRFS_I(inode)->lock);
>         dropped = drop_outstanding_extent(inode);
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index a1a849b..e1992ed 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2766,7 +2766,7 @@ static int __do_readpage(struct extent_io_tree *tree,
>         size_t pg_offset = 0;
>         size_t iosize;
>         size_t disk_io_size;
> -       size_t blocksize = inode->i_sb->s_blocksize;
> +       size_t blocksize = btrfs_align_size(inode);
>         unsigned long this_bio_flag = *bio_flags & EXTENT_BIO_PARENT_LOCKED;
>
>         set_page_extent_mapped(page);
> @@ -3078,7 +3078,6 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
>         int ret;
>         int nr = 0;
>         size_t pg_offset = 0;
> -       size_t blocksize;
>         loff_t i_size = i_size_read(inode);
>         unsigned long end_index = i_size >> PAGE_CACHE_SHIFT;
>         u64 nr_delalloc;
> @@ -3218,8 +3217,6 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
>                 goto done;
>         }
>
> -       blocksize = inode->i_sb->s_blocksize;
> -
>         while (cur <= end) {
>                 if (cur >= last_byte) {
>                         if (tree->ops && tree->ops->writepage_end_io_hook)
> @@ -3238,7 +3235,7 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
>                 BUG_ON(extent_map_end(em) <= cur);
>                 BUG_ON(end < cur);
>                 iosize = min(extent_map_end(em) - cur, end - cur + 1);
> -               iosize = ALIGN(iosize, blocksize);
> +               iosize = ALIGN(iosize, btrfs_align_size(inode));
>                 sector = (em->block_start + extent_offset) >> 9;
>                 bdev = em->bdev;
>                 block_start = em->block_start;
> @@ -3934,9 +3931,8 @@ int extent_invalidatepage(struct extent_io_tree *tree,
>         struct extent_state *cached_state = NULL;
>         u64 start = page_offset(page);
>         u64 end = start + PAGE_CACHE_SIZE - 1;
> -       size_t blocksize = page->mapping->host->i_sb->s_blocksize;
>
> -       start += ALIGN(offset, blocksize);
> +       start += ALIGN(offset, btrfs_align_size(page->mapping->host));
>         if (start > end)
>                 return 0;
>
> @@ -4044,7 +4040,6 @@ static struct extent_map *get_extent_skip_holes(struct inode *inode,
>                                                 u64 last,
>                                                 get_extent_t *get_extent)
>  {
> -       u64 sectorsize = BTRFS_I(inode)->root->sectorsize;
>         struct extent_map *em;
>         u64 len;
>
> @@ -4055,7 +4050,7 @@ static struct extent_map *get_extent_skip_holes(struct inode *inode,
>                 len = last - offset;
>                 if (len == 0)
>                         break;
> -               len = ALIGN(len, sectorsize);
> +               len = ALIGN(len, btrfs_align_size(inode));
>                 em = get_extent(inode, NULL, 0, offset, len, 0);
>                 if (IS_ERR_OR_NULL(em))
>                         return em;
> @@ -4119,8 +4114,8 @@ int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>                 return -ENOMEM;
>         path->leave_spinning = 1;
>
> -       start = ALIGN(start, BTRFS_I(inode)->root->sectorsize);
> -       len = ALIGN(len, BTRFS_I(inode)->root->sectorsize);
> +       start = ALIGN(start, btrfs_align_size(inode));
> +       len = ALIGN(len, btrfs_align_size(inode));
>
>         /*
>          * lookup the last file extent.  We're not using i_size here
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 82d0342..1861322 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -505,8 +505,8 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
>         u64 end_pos = pos + write_bytes;
>         loff_t isize = i_size_read(inode);
>
> -       start_pos = pos & ~((u64)root->sectorsize - 1);
> -       num_bytes = ALIGN(write_bytes + pos - start_pos, root->sectorsize);
> +       start_pos = pos & ~(btrfs_align_size(inode) - 1);
> +       num_bytes = ALIGN(write_bytes + pos - start_pos, btrfs_align_size(inode));
>
>         end_of_last_block = start_pos + num_bytes - 1;
>         err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
> @@ -889,7 +889,7 @@ next_slot:
>                                 inode_sub_bytes(inode,
>                                                 extent_end - key.offset);
>                                 extent_end = ALIGN(extent_end,
> -                                                  root->sectorsize);
> +                                                  btrfs_align_size(inode));
>                         } else if (update_refs && disk_bytenr > 0) {
>                                 ret = btrfs_free_extent(trans, root,
>                                                 disk_bytenr, num_bytes, 0,
> @@ -1254,7 +1254,7 @@ static noinline int prepare_pages(struct btrfs_root *root, struct file *file,
>         u64 start_pos;
>         u64 last_pos;
>
> -       start_pos = pos & ~((u64)root->sectorsize - 1);
> +       start_pos = pos & ~((u64)btrfs_align_size(inode) - 1);
>         last_pos = ((u64)index + num_pages) << PAGE_CACHE_SHIFT;
>
>  again:
> @@ -2263,11 +2263,10 @@ static long btrfs_fallocate(struct file *file, int mode,
>         u64 alloc_hint = 0;
>         u64 locked_end;
>         struct extent_map *em;
> -       int blocksize = BTRFS_I(inode)->root->sectorsize;
>         int ret;
>
> -       alloc_start = round_down(offset, blocksize);
> -       alloc_end = round_up(offset + len, blocksize);
> +       alloc_start = round_down(offset, btrfs_align_size(inode));
> +       alloc_end = round_up(offset + len, btrfs_align_size(inode));
>
>         /* Make sure we aren't being given some crap mode */
>         if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> @@ -2367,7 +2366,7 @@ static long btrfs_fallocate(struct file *file, int mode,
>                 }
>                 last_byte = min(extent_map_end(em), alloc_end);
>                 actual_end = min_t(u64, extent_map_end(em), offset + len);
> -               last_byte = ALIGN(last_byte, blocksize);
> +               last_byte = ALIGN(last_byte, btrfs_align_size(inode));
>
>                 if (em->block_start == EXTENT_MAP_HOLE ||
>                     (cur_offset >= inode->i_size &&
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index f1a7744..c79c9cd 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -239,7 +239,7 @@ static noinline int cow_file_range_inline(struct btrfs_root *root,
>         u64 isize = i_size_read(inode);
>         u64 actual_end = min(end + 1, isize);
>         u64 inline_len = actual_end - start;
> -       u64 aligned_end = ALIGN(end, root->sectorsize);
> +       u64 aligned_end = ALIGN(end, btrfs_align_size(inode));
>         u64 data_len = inline_len;
>         int ret;
>
> @@ -354,7 +354,6 @@ static noinline int compress_file_range(struct inode *inode,
>  {
>         struct btrfs_root *root = BTRFS_I(inode)->root;
>         u64 num_bytes;
> -       u64 blocksize = root->sectorsize;
>         u64 actual_end;
>         u64 isize = i_size_read(inode);
>         int ret = 0;
> @@ -407,8 +406,8 @@ again:
>          * a compressed extent to 128k.
>          */
>         total_compressed = min(total_compressed, max_uncompressed);
> -       num_bytes = ALIGN(end - start + 1, blocksize);
> -       num_bytes = max(blocksize,  num_bytes);
> +       num_bytes = ALIGN(end - start + 1, btrfs_align_size(inode));
> +       num_bytes = max(btrfs_align_size(inode),  num_bytes);
>         total_in = 0;
>         ret = 0;
>
> @@ -508,7 +507,7 @@ cont:
>                  * up to a block size boundary so the allocator does sane
>                  * things
>                  */
> -               total_compressed = ALIGN(total_compressed, blocksize);
> +               total_compressed = ALIGN(total_compressed, btrfs_align_size(inode));
>
>                 /*
>                  * one last check to make sure the compression is really a
> @@ -837,7 +836,6 @@ static noinline int cow_file_range(struct inode *inode,
>         unsigned long ram_size;
>         u64 disk_num_bytes;
>         u64 cur_alloc_size;
> -       u64 blocksize = root->sectorsize;
>         struct btrfs_key ins;
>         struct extent_map *em;
>         struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
> @@ -848,8 +846,8 @@ static noinline int cow_file_range(struct inode *inode,
>                 return -EINVAL;
>         }
>
> -       num_bytes = ALIGN(end - start + 1, blocksize);
> -       num_bytes = max(blocksize,  num_bytes);
> +       num_bytes = ALIGN(end - start + 1, btrfs_align_size(inode));
> +       num_bytes = max(btrfs_align_size(inode), num_bytes);
>         disk_num_bytes = num_bytes;
>
>         /* if this is a small write inside eof, kick off defrag */
> @@ -1263,7 +1261,7 @@ next_slot:
>                 } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
>                         extent_end = found_key.offset +
>                                 btrfs_file_extent_inline_len(leaf, fi);
> -                       extent_end = ALIGN(extent_end, root->sectorsize);
> +                       extent_end = ALIGN(extent_end, btrfs_align_size(inode));
>                 } else {
>                         BUG_ON(1);
>                 }
> @@ -1389,6 +1387,12 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
>         int ret;
>         struct btrfs_root *root = BTRFS_I(inode)->root;
>
> +       if (inode->i_sb->s_blocksize < PAGE_CACHE_SIZE) {
> +               start &= ~(PAGE_CACHE_SIZE - 1);
> +               end = max_t(u64, start + PAGE_CACHE_SIZE - 1, end);
> +       }
> +
> +
>         if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) {
>                 ret = run_delalloc_nocow(inode, locked_page, start, end,
>                                          page_started, 1, nr_written);
> @@ -3894,7 +3898,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
>          */
>         if (root->ref_cows || root == root->fs_info->tree_root)
>                 btrfs_drop_extent_cache(inode, ALIGN(new_size,
> -                                       root->sectorsize), (u64)-1, 0);
> +                                       btrfs_align_size(inode)), (u64)-1, 0);
>
>         /*
>          * This function is also used to drop the items in the log tree before
> @@ -3980,7 +3984,7 @@ search_again:
>                                         btrfs_file_extent_num_bytes(leaf, fi);
>                                 extent_num_bytes = ALIGN(new_size -
>                                                 found_key.offset,
> -                                               root->sectorsize);
> +                                               btrfs_align_size(inode));
>                                 btrfs_set_file_extent_num_bytes(leaf, fi,
>                                                          extent_num_bytes);
>                                 num_dec = (orig_num_bytes -
> @@ -4217,8 +4221,8 @@ int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size)
>         struct extent_map *em = NULL;
>         struct extent_state *cached_state = NULL;
>         struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
> -       u64 hole_start = ALIGN(oldsize, root->sectorsize);
> -       u64 block_end = ALIGN(size, root->sectorsize);
> +       u64 hole_start = ALIGN(oldsize, btrfs_align_size(inode));
> +       u64 block_end = ALIGN(size, btrfs_align_size(inode));
>         u64 last_byte;
>         u64 cur_offset;
>         u64 hole_size;
> @@ -4261,7 +4265,7 @@ int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size)
>                         break;
>                 }
>                 last_byte = min(extent_map_end(em), block_end);
> -               last_byte = ALIGN(last_byte , root->sectorsize);
> +               last_byte = ALIGN(last_byte , btrfs_align_size(inode));
>                 if (!test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
>                         struct extent_map *hole_em;
>                         hole_size = last_byte - cur_offset;
> @@ -6001,7 +6005,7 @@ again:
>         } else if (found_type == BTRFS_FILE_EXTENT_INLINE) {
>                 size_t size;
>                 size = btrfs_file_extent_inline_len(leaf, item);
> -               extent_end = ALIGN(extent_start + size, root->sectorsize);
> +               extent_end = ALIGN(extent_start + size, btrfs_align_size(inode));
>         }
>  next:
>         if (start >= extent_end) {
> @@ -6074,7 +6078,7 @@ next:
>                 copy_size = min_t(u64, PAGE_CACHE_SIZE - pg_offset,
>                                 size - extent_offset);
>                 em->start = extent_start + extent_offset;
> -               em->len = ALIGN(copy_size, root->sectorsize);
> +               em->len = ALIGN(copy_size, btrfs_align_size(inode));
>                 em->orig_block_len = em->len;
>                 em->orig_start = em->start;
>                 if (compress_type) {
> @@ -7967,7 +7971,6 @@ static int btrfs_getattr(struct vfsmount *mnt,
>  {
>         u64 delalloc_bytes;
>         struct inode *inode = dentry->d_inode;
> -       u32 blocksize = inode->i_sb->s_blocksize;
>
>         generic_fillattr(inode, stat);
>         stat->dev = BTRFS_I(inode)->root->anon_dev;
> @@ -7976,8 +7979,8 @@ static int btrfs_getattr(struct vfsmount *mnt,
>         spin_lock(&BTRFS_I(inode)->lock);
>         delalloc_bytes = BTRFS_I(inode)->delalloc_bytes;
>         spin_unlock(&BTRFS_I(inode)->lock);
> -       stat->blocks = (ALIGN(inode_get_bytes(inode), blocksize) +
> -                       ALIGN(delalloc_bytes, blocksize)) >> 9;
> +       stat->blocks = (ALIGN(inode_get_bytes(inode), btrfs_align_size(inode)) +
> +                       ALIGN(delalloc_bytes, btrfs_align_size(inode))) >> 9;
>         return 0;
>  }
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index a111622..c41e342 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2631,7 +2631,7 @@ static int btrfs_cmp_data(struct inode *src, u64 loff, struct inode *dst,
>
>  static int extent_same_check_offsets(struct inode *inode, u64 off, u64 len)
>  {
> -       u64 bs = BTRFS_I(inode)->root->fs_info->sb->s_blocksize;
> +       u64 bs = btrfs_align_size(inode);
>
>         if (off + len > inode->i_size || off + len < off)
>                 return -EINVAL;
> @@ -2698,7 +2698,7 @@ static long btrfs_ioctl_file_extent_same(struct file *file,
>         int i;
>         int ret;
>         unsigned long size;
> -       u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
> +       u64 bs = btrfs_align_size(src);
>         bool is_admin = capable(CAP_SYS_ADMIN);
>
>         if (!(file->f_mode & FMODE_READ))
> @@ -3111,7 +3111,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
>         struct inode *src;
>         int ret;
>         u64 len = olen;
> -       u64 bs = root->fs_info->sb->s_blocksize;
> +       u64 bs = btrfs_align_size(inode);
>         int same_inode = 0;
>
>         /*
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 69582d5..8d703e8 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -936,7 +936,7 @@ int btrfs_ordered_update_i_size(struct inode *inode, u64 offset,
>                                      ordered->file_offset +
>                                      ordered->truncated_len);
>         } else {
> -               offset = ALIGN(offset, BTRFS_I(inode)->root->sectorsize);
> +               offset = ALIGN(offset, btrfs_align_size(inode));
>         }
>         disk_i_size = BTRFS_I(inode)->disk_i_size;
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 9f7fc51..455d288 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -572,7 +572,7 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
>         } else if (found_type == BTRFS_FILE_EXTENT_INLINE) {
>                 size = btrfs_file_extent_inline_len(eb, item);
>                 nbytes = btrfs_file_extent_ram_bytes(eb, item);
> -               extent_end = ALIGN(start + size, root->sectorsize);
> +               extent_end = ALIGN(start + size, btrfs_align_size(inode));
>         } else {
>                 ret = 0;
>                 goto out;
> --
> 1.7.12.4
>

* Re: [PATCH 7/7] btrfs: subpagesize-blocksize: Allow mounting filesystems where sectorsize != PAGE_SIZE
  2013-12-13  1:07   ` David Sterba
@ 2013-12-16 12:50     ` saeed bishara
  0 siblings, 0 replies; 23+ messages in thread
From: saeed bishara @ 2013-12-16 12:50 UTC (permalink / raw)
  To: dsterba, Chandra Seetharaman, linux-btrfs

On Fri, Dec 13, 2013 at 3:07 AM, David Sterba <dsterba@suse.cz> wrote:
> On Wed, Dec 11, 2013 at 05:38:42PM -0600, Chandra Seetharaman wrote:
>> This is the final patch of the series that allows filesystems with
>> blocksize smaller than the PAGE_SIZE.
>
>> -     if (sectorsize != PAGE_SIZE) {
>
> You've implemented the sectorsize < PAGE_SIZE part (multiple extent
> buffers per page), so the check should stay as:
>
>         if (sectorsize > PAGE_SIZE) {
>
Please add a check that PAGE_SIZE/sectorsize <= MAX_EXTENT_BUFFERS_PER_PAGE.
Also, can you please add a kernel log message when the filesystem uses a
subpage block size?
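For illustration, a rough sketch of what that validation could look like in
open_ctree(). MAX_EXTENT_BUFFERS_PER_PAGE is the limit introduced in patch 1,
and the exact log message text here is made up:

        if (sectorsize > PAGE_SIZE ||
            PAGE_SIZE / sectorsize > MAX_EXTENT_BUFFERS_PER_PAGE) {
                printk(KERN_WARNING "btrfs: incompatible sector size(%lu) "
                       "found on %s\n", (unsigned long)sectorsize, sb->s_id);
                goto fail_sb_buffer;
        }
        if (sectorsize < PAGE_SIZE)
                printk(KERN_INFO "btrfs: using subpage blocksize %lu "
                       "(page size %lu) on %s\n", (unsigned long)sectorsize,
                       PAGE_SIZE, sb->s_id);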

>> -             printk(KERN_WARNING "btrfs: Incompatible sector size(%lu) "
>> -                    "found on %s\n", (unsigned long)sectorsize, sb->s_id);
>> -             goto fail_sb_buffer;
>> -     }
>> -
>>       mutex_lock(&fs_info->chunk_mutex);
>>       ret = btrfs_read_sys_array(tree_root);
>>       mutex_unlock(&fs_info->chunk_mutex);

* Re: [PATCH 2/7] btrfs: subpagesize-blocksize: Use a global alignment for size
  2013-12-16 12:33   ` saeed bishara
@ 2013-12-16 14:48     ` David Sterba
  2013-12-16 16:18       ` Chandra Seetharaman
  0 siblings, 1 reply; 23+ messages in thread
From: David Sterba @ 2013-12-16 14:48 UTC (permalink / raw)
  To: saeed bishara; +Cc: Chandra Seetharaman, linux-btrfs

On Mon, Dec 16, 2013 at 02:33:11PM +0200, saeed bishara wrote:
> On Thu, Dec 12, 2013 at 1:38 AM, Chandra Seetharaman
> <sekharan@us.ibm.com> wrote:
> > In order to handle a blocksize that is smaller than the
> > PAGE_SIZE, we need align all IOs to PAGE_SIZE.
> >
> > This patch defines a new macro btrfs_align_size() that
> > calculates the alignment size based on the sectorsize
> > and uses it at appropriate places.
> >
> > Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
> > ---
> >  fs/btrfs/btrfs_inode.h  |  7 +++++++
> >  fs/btrfs/compression.c  |  3 ++-
> >  fs/btrfs/extent-tree.c  | 12 ++++++------
> >  fs/btrfs/extent_io.c    | 17 ++++++-----------
> >  fs/btrfs/file.c         | 15 +++++++--------
> >  fs/btrfs/inode.c        | 41 ++++++++++++++++++++++-------------------
> >  fs/btrfs/ioctl.c        |  6 +++---
> >  fs/btrfs/ordered-data.c |  2 +-
> >  fs/btrfs/tree-log.c     |  2 +-
> >  9 files changed, 55 insertions(+), 50 deletions(-)
> >
> > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> > index ac0b39d..eee994f 100644
> > --- a/fs/btrfs/btrfs_inode.h
> > +++ b/fs/btrfs/btrfs_inode.h
> > @@ -280,4 +280,11 @@ static inline void btrfs_inode_resume_unlocked_dio(struct inode *inode)
> >                   &BTRFS_I(inode)->runtime_flags);
> >  }
> >
> > +static inline u64 btrfs_align_size(struct inode *inode)
> > +{
> > +       if (BTRFS_I(inode)->root->sectorsize < PAGE_CACHE_SIZE)
> > +               return (u64)PAGE_CACHE_SIZE;
> > +       else
> > +               return (u64)BTRFS_I(inode)->root->sectorsize;
> > +}
> for performance, isn't it worth storing this value instead of
> recalculating it each time?

I agree, it would be better to add a corresponding field to fs_info,
initialized as proposed above.
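For illustration, a sketch of that suggestion. The field name align_size and
its exact home in fs_info are assumptions, not taken from any posted patch:

        /* in open_ctree(), once sectorsize is known: */
        fs_info->align_size = max_t(u64, sectorsize, PAGE_CACHE_SIZE);

        /* btrfs_align_size() then reduces to a single load: */
        static inline u64 btrfs_align_size(struct inode *inode)
        {
                return BTRFS_I(inode)->root->fs_info->align_size;
        }

This keeps the per-call branch out of hot paths and gives one place to
adjust if the alignment policy ever changes.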

* Re: [PATCH 1/7] btrfs: subpagesize-blocksize: Define extent_buffer_head
  2013-12-16 12:32   ` saeed bishara
@ 2013-12-16 16:17     ` Chandra Seetharaman
  2013-12-17 15:35       ` David Sterba
  0 siblings, 1 reply; 23+ messages in thread
From: Chandra Seetharaman @ 2013-12-16 16:17 UTC (permalink / raw)
  To: saeed bishara; +Cc: linux-btrfs

On Mon, 2013-12-16 at 14:32 +0200, saeed bishara wrote:
> On Thu, Dec 12, 2013 at 1:38 AM, Chandra Seetharaman
> <sekharan@us.ibm.com> wrote:
> > In order to handle multiple extent buffers per page, first we
> > need to create a way to handle all the extent buffers that
> > are attached to a page.
> >
> > This patch creates a new data structure eb_head, and moves
> > fields that are common to all extent buffers in a page from
> > extent buffer to eb_head.
> >
> > This also adds changes that are needed to handle multiple
> > extent buffers per page case.
> >
> > Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
> > ---

<snip>

> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 54ab861..02de448 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -2106,14 +2106,16 @@ static inline void btrfs_set_token_##name(struct extent_buffer *eb,     \
> >  #define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits)            \
> >  static inline u##bits btrfs_##name(struct extent_buffer *eb)           \
> >  {                                                                      \
> > -       type *p = page_address(eb->pages[0]);                           \
> > +       type *p = page_address(eb_head(eb)->pages[0]) +                 \
> > +                               (eb->start & (PAGE_CACHE_SIZE -1));     \
> you can use PAGE_CACHE_MASK instead of PAGE_CACHE_SIZE - 1

PAGE_CACHE_MASK gets the page-aligned part of the value, not the offset
within the page, i.e. it is defined as

#define PAGE_MASK (~(PAGE_SIZE-1))

> >         u##bits res = le##bits##_to_cpu(p->member);                     \
> >         return res;                                                     \

<snip>
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index ff43802..a1a849b 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c

<snip>

> > @@ -3367,17 +3376,27 @@ static int lock_extent_buffer_for_io(struct extent_buffer *eb,
> >          * under IO since we can end up having no IO bits set for a short period
> >          * of time.
> >          */
> > -       spin_lock(&eb->refs_lock);
> > -       if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
> > -               set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
> > -               spin_unlock(&eb->refs_lock);
> > -               btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
> > -               __percpu_counter_add(&fs_info->dirty_metadata_bytes,
> > -                                    -eb->len,
> > +       spin_lock(&ebh->refs_lock);
> > +       for (i = 0; i < MAX_EXTENT_BUFFERS_PER_PAGE; i++) {
> > +               ebtemp = &ebh->extent_buf[i];
> > +               dirty_arr[i] |= test_and_clear_bit(EXTENT_BUFFER_DIRTY, &ebtemp->ebflags);
> dirty_arr wasn't initialized; changing the "|=" to "=" fixed a crash
> when doing writes

I realized this after posting the patch; it's now fixed in my tree.

Thanks
> > +               dirty = dirty || dirty_arr[i];
> > +       }
> > + 
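For illustration, a minimal sketch of the fix saeed describes: the array is
explicitly zeroed and plain assignment is used in the loop. The identifiers
come from the quoted patch; the array type is an assumption:

        int dirty_arr[MAX_EXTENT_BUFFERS_PER_PAGE] = { 0 };
        int dirty = 0;

        for (i = 0; i < MAX_EXTENT_BUFFERS_PER_PAGE; i++) {
                ebtemp = &ebh->extent_buf[i];
                /* '=' rather than '|=': there is no prior value to accumulate */
                dirty_arr[i] = test_and_clear_bit(EXTENT_BUFFER_DIRTY,
                                                  &ebtemp->ebflags);
                dirty = dirty || dirty_arr[i];
        }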

<snip>
> > 1.7.12.4
> >

* Re: [PATCH 2/7] btrfs: subpagesize-blocksize: Use a global alignment for size
  2013-12-16 14:48     ` David Sterba
@ 2013-12-16 16:18       ` Chandra Seetharaman
  0 siblings, 0 replies; 23+ messages in thread
From: Chandra Seetharaman @ 2013-12-16 16:18 UTC (permalink / raw)
  To: dsterba; +Cc: saeed bishara, linux-btrfs

On Mon, 2013-12-16 at 15:48 +0100, David Sterba wrote:
> On Mon, Dec 16, 2013 at 02:33:11PM +0200, saeed bishara wrote:
> > On Thu, Dec 12, 2013 at 1:38 AM, Chandra Seetharaman
> > <sekharan@us.ibm.com> wrote:
> > > In order to handle a blocksize that is smaller than the
> > > PAGE_SIZE, we need align all IOs to PAGE_SIZE.
> > >
> > > This patch defines a new macro btrfs_align_size() that
> > > calculates the alignment size based on the sectorsize
> > > and uses it at appropriate places.
> > >
> > > Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
> > > ---
> > >  fs/btrfs/btrfs_inode.h  |  7 +++++++
> > >  fs/btrfs/compression.c  |  3 ++-
> > >  fs/btrfs/extent-tree.c  | 12 ++++++------
> > >  fs/btrfs/extent_io.c    | 17 ++++++-----------
> > >  fs/btrfs/file.c         | 15 +++++++--------
> > >  fs/btrfs/inode.c        | 41 ++++++++++++++++++++++-------------------
> > >  fs/btrfs/ioctl.c        |  6 +++---
> > >  fs/btrfs/ordered-data.c |  2 +-
> > >  fs/btrfs/tree-log.c     |  2 +-
> > >  9 files changed, 55 insertions(+), 50 deletions(-)
> > >
> > > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> > > index ac0b39d..eee994f 100644
> > > --- a/fs/btrfs/btrfs_inode.h
> > > +++ b/fs/btrfs/btrfs_inode.h
> > > @@ -280,4 +280,11 @@ static inline void btrfs_inode_resume_unlocked_dio(struct inode *inode)
> > >                   &BTRFS_I(inode)->runtime_flags);
> > >  }
> > >
> > > +static inline u64 btrfs_align_size(struct inode *inode)
> > > +{
> > > +       if (BTRFS_I(inode)->root->sectorsize < PAGE_CACHE_SIZE)
> > > +               return (u64)PAGE_CACHE_SIZE;
> > > +       else
> > > +               return (u64)BTRFS_I(inode)->root->sectorsize;
> > > +}
> > for performance, isn't it worth storing this value instead of
> > recalculating it each time?

Good suggestion, Saeed. Will do.
> 
> I agree, it would be better to add a corresponding field to fs_info,
> initialized as proposed above.

Good idea, David. Will do.
> 



* Re: [PATCH 1/7] btrfs: subpagesize-blocksize: Define extent_buffer_head
  2013-12-16 16:17     ` Chandra Seetharaman
@ 2013-12-17 15:35       ` David Sterba
  0 siblings, 0 replies; 23+ messages in thread
From: David Sterba @ 2013-12-17 15:35 UTC (permalink / raw)
  To: Chandra Seetharaman; +Cc: saeed bishara, linux-btrfs

On Mon, Dec 16, 2013 at 10:17:18AM -0600, Chandra Seetharaman wrote:
> On Mon, 2013-12-16 at 14:32 +0200, saeed bishara wrote:
> > On Thu, Dec 12, 2013 at 1:38 AM, Chandra Seetharaman
> > <sekharan@us.ibm.com> wrote:
> > > In order to handle multiple extent buffers per page, first we
> > > need to create a way to handle all the extent buffers that
> > > are attached to a page.
> > >
> > > This patch creates a new data structure eb_head, and moves
> > > fields that are common to all extent buffers in a page from
> > > extent buffer to eb_head.
> > >
> > > This also adds changes that are needed to handle multiple
> > > extent buffers per page case.
> > >
> > > Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
> > > ---
> 
> <snip>
> 
> > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > > index 54ab861..02de448 100644
> > > --- a/fs/btrfs/ctree.h
> > > +++ b/fs/btrfs/ctree.h
> > > @@ -2106,14 +2106,16 @@ static inline void btrfs_set_token_##name(struct extent_buffer *eb,     \
> > >  #define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits)            \
> > >  static inline u##bits btrfs_##name(struct extent_buffer *eb)           \
> > >  {                                                                      \
> > > -       type *p = page_address(eb->pages[0]);                           \
> > > +       type *p = page_address(eb_head(eb)->pages[0]) +                 \
> > > +                               (eb->start & (PAGE_CACHE_SIZE -1));     \
> > you can use PAGE_CACHE_MASK instead of PAGE_CACHE_SIZE - 1
> 
> PAGE_CACHE_MASK gets the page-aligned part of the value, not the offset
> within the page, i.e. it is defined as
> 
> #define PAGE_MASK (~(PAGE_SIZE-1))

Use ~PAGE_CACHE_MASK to get the offset. It's common, though not obvious
at first.
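A quick illustration, assuming a 4K page size:

        u64 start = 0x12345;

        u64 page_part = start & PAGE_CACHE_MASK;   /* 0x12000 */
        u64 offset    = start & ~PAGE_CACHE_MASK;  /* 0x00345, same as
                                                      start & (PAGE_CACHE_SIZE - 1) */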

* Re: [PATCH 0/7] Patches to support subpagesize blocksize
  2013-12-13 18:39 ` Josef Bacik
  2013-12-13 22:09   ` Chandra Seetharaman
@ 2014-01-08 20:06   ` Chandra Seetharaman
  1 sibling, 0 replies; 23+ messages in thread
From: Chandra Seetharaman @ 2014-01-08 20:06 UTC (permalink / raw)
  To: linux-btrfs

Hello All,

I hit some random corruption issues when I ran some heavy IO with
these patches.

Found out that the function clean_tree_block() is the problem.

IIUC, this function is used to drop a dirty extent buffer when it no
longer needs to be written to the disk.

In my case, the extent buffer covers only part of a page, and I do not
know how to handle this situation.

Is my understanding correct? Does anyone have any suggestions for
circumventing this issue?

Thanks & regards,

Chandra
On Fri, 2013-12-13 at 13:39 -0500, Josef Bacik wrote:
> On 12/11/2013 06:38 PM, Chandra Seetharaman wrote:
> > In btrfs, blocksize, the basic IO size of the filesystem, has been
> > more than PAGE_SIZE.
> >
> > But, some 64 bit architures, like PPC64 and ARM64 have the default
> > PAGE_SIZE as 64K, which means the filesystems handled in these
> > architectures are with a blocksize of 64K.
> >
> > This works fine as long as you create and use the filesystems within
> > these systems.
> >
> > In other words, one cannot create a filesystem in some other architecture
> > and use that filesystem in PPC64 or ARM64, and vice versa.,
> >
> > Another restriction is that we cannot use ext? filesystems in these
> > architectures as btrfs filesystems, since ext? filesystems have a blocksize
> > of 4K.
> >
> > Sometime last year, Wade Cline posted a patch(http://lwn.net/Articles/529682/).
> > I started testing it, and found many locking/race issues. So, I changed the
> > logic and created an extent_buffer_head that holds an array of extent buffers that
> > belong to a page.
> >
> > There are few wrinkles in this patchset, like some xfstests are failing, which
> > could be due to me doing something incorrectly w.r.t how the blocksize and
> > PAGE_SIZE are used in these patched.
> >
> > Would like to get some feedback, review comments.
> >
> 
> OK, after we talked it over more on IRC and with Chris, I think we have
> a way forward here.
> 
> 1) Add an extent_buffer_head that embeds an extent_buffer, and in the
> extent_buffer_head track the state of the whole page.  This is where we
> keep a linked list of all the extent_buffers on the page and track how
> many of them are dirty, so we can be sure to set the page state and
> everything else correctly.
> 
> 2) Set page->private to the first extent_buffer like we currently do.
> Then we just have checks in the endio code to see if the eb we found is
> the one for our current range (ie bv_offset == 0), and if not, do a
> linear search through the extent_buffers on the extent_buffer_head to
> get the right one.
> 
> We have to do this because we need to be able to track IO for each of
> the extent_buffers independently of the others, in case a page spans a
> block_group.
> 
> Hopefully that makes sense; this way you don't have to futz with any of
> my crazier long-term goals, like no longer using the pagecache, or any
> of that mess.  Thanks,
> 
> Josef
> 
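For reference, a rough sketch of the shape Josef describes. Every name below
is illustrative; the embedded list member and the offset math are assumptions,
not code from any posted patch:

        struct extent_buffer_head {
                spinlock_t refs_lock;           /* page-wide state */
                atomic_t dirty_ebs;             /* # of dirty buffers on the page */
                struct list_head buffers;       /* all extent_buffers on the page */
                struct extent_buffer eb;        /* first buffer, embedded */
        };

        /* endio: page->private points at the first extent_buffer; if the
         * completed range is not at offset 0, walk the list for the right one.
         */
        static struct extent_buffer *eb_for_offset(struct page *page,
                                                   unsigned int bv_offset)
        {
                struct extent_buffer *eb = (struct extent_buffer *)page->private;
                struct extent_buffer_head *ebh = eb_head(eb);

                if (bv_offset == 0)
                        return eb;
                list_for_each_entry(eb, &ebh->buffers, list)
                        if ((eb->start & (PAGE_CACHE_SIZE - 1)) == bv_offset)
                                return eb;
                return NULL;
        }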



Thread overview: 23+ messages
2013-12-11 23:38 [PATCH 0/7] Patches to support subpagesize blocksize Chandra Seetharaman
2013-12-11 23:38 ` [PATCH 1/7] btrfs: subpagesize-blocksize: Define extent_buffer_head Chandra Seetharaman
2013-12-16 12:32   ` saeed bishara
2013-12-16 16:17     ` Chandra Seetharaman
2013-12-17 15:35       ` David Sterba
2013-12-11 23:38 ` [PATCH 2/7] btrfs: subpagesize-blocksize: Use a global alignment for size Chandra Seetharaman
2013-12-16 12:33   ` saeed bishara
2013-12-16 14:48     ` David Sterba
2013-12-16 16:18       ` Chandra Seetharaman
2013-12-11 23:38 ` [PATCH 3/7] btrfs: subpagesize-blocksize: Handle small extent maps properly Chandra Seetharaman
2013-12-11 23:38 ` [PATCH 4/7] btrfs: subpagesize-blocksize: Handle iosize properly in submit_extent_page() Chandra Seetharaman
2013-12-11 23:38 ` [PATCH 5/7] btrfs: subpagesize-blocksize: handle checksum calculations properly Chandra Seetharaman
2013-12-11 23:38 ` [PATCH 6/7] btrfs: subpagesize-blocksize: Handle relocation clusters appropriately Chandra Seetharaman
2013-12-11 23:38 ` [PATCH 7/7] btrfs: subpagesize-blocksize: Allow mounting filesystems where sectorsize != PAGE_SIZE Chandra Seetharaman
2013-12-13  1:07   ` David Sterba
2013-12-16 12:50     ` saeed bishara
2013-12-12 20:40 ` [PATCH 0/7] Patches to support subpagesize blocksize Josef Bacik
2013-12-13  1:17 ` David Sterba
2013-12-13 15:17   ` Chandra Seetharaman
2013-12-13 15:58     ` David Sterba
2013-12-13 18:39 ` Josef Bacik
2013-12-13 22:09   ` Chandra Seetharaman
2014-01-08 20:06   ` Chandra Seetharaman
