linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v3 00/11] Metadata specific accounting and dirty writeout
@ 2017-12-11 21:55 Josef Bacik
  2017-12-11 21:55 ` [PATCH v3 01/10] remove mapping from balance_dirty_pages*() Josef Bacik
                   ` (9 more replies)
  0 siblings, 10 replies; 31+ messages in thread
From: Josef Bacik @ 2017-12-11 21:55 UTC (permalink / raw)
  To: hannes, linux-mm, akpm, jack, linux-fsdevel, kernel-team, linux-btrfs

FYI, patches 8-10 are there purely so people can see how I intend to use this.
They are large changes that need to go through the btrfs tree and will
undoubtedly change a lot.  My goal is for patches 1-7 to go through Andrew via
the mm tree; once they have landed I will work out the details of the btrfs
patches with the other btrfs developers and merge them via that tree.  I'm not
asking for reviews on those; Jan just mentioned that it would be easier to tell
what I was trying to do if he could see how I intended to use it.

v2->v3:
- addressed issues brought up by Jan in the actual node metadata bytes
  accounting patch.
- collapsed the fprop patch that converted everything to bytes into the patch
  that converted the wb usage of fprop stuff to bytes.

-- Original message --
These patches add support for metadata accounting and dirty handling in a
generic way.  For dirty metadata, ext4 and xfs are currently limited by their
journal size, which allows them to handle dirty metadata flushing in a
relatively easy way.  Btrfs does not have this limiting factor; we can have as
much dirty metadata on the system as we have memory, so we allocate all of our
metadata pages from a dummy inode so that we can call balance_dirty_pages() on
it and make sure we don't overwhelm the system with dirty metadata pages.

The problem with this is that it severely limits our ability to do things like
support sub-pagesize blocksizes.  Btrfs also supports metadata blocksizes
larger than the page size, which makes keeping track of our metadata and its
pages particularly tricky.  We have the inode mapping with our pages, and we
have another radix tree for our actual metadata buffers.  This double
accounting leads to some fun shenanigans around reclaim and evicting pages we
know we are done using.

To solve this we would like to switch to a scheme like the one xfs has, where
our metadata structures are tied into the slab shrinking code and we simply use
alloc_page() for our pages, or kmalloc() once we add sub-pagesize blocksizes.
In order to do this we need infrastructure in place to make sure we still don't
overwhelm the system with dirty metadata pages.
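
For reference, tying metadata buffers into the slab shrinking code means
registering a shrinker, roughly like the hypothetical sketch below.  The
count_objects/scan_objects callbacks and register_shrinker() are the standard
kernel shrinker API; the two btrfs-flavoured helper functions are made up and
error handling is omitted:

    static unsigned long btree_buffers_count(struct shrinker *shrink,
                                             struct shrink_control *sc)
    {
            /* how many clean, unpinned metadata buffers could we drop? */
            return nr_reclaimable_metadata_buffers();       /* hypothetical */
    }

    static unsigned long btree_buffers_scan(struct shrinker *shrink,
                                            struct shrink_control *sc)
    {
            /* drop up to sc->nr_to_scan of them, report how many went */
            return drop_metadata_buffers(sc->nr_to_scan);   /* hypothetical */
    }

    static struct shrinker btree_shrinker = {
            .count_objects  = btree_buffers_count,
            .scan_objects   = btree_buffers_scan,
            .seeks          = DEFAULT_SEEKS,
    };

    /* registered once at mount time */
    register_shrinker(&btree_shrinker);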

Enter these patches.  Because metadata is tracked in units other than the page
size, we need to convert a bunch of our existing counters to bytes.  From there
I've added various counters for metadata, to keep track of the overall metadata
bytes, how many of them are dirty and how many are under writeback.  I've also
added a super operation to handle the dirty writeback, which is going to be
handled mostly inside the fs since we will need a little more smarts around
what we write back.
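
As a rough sketch of what this enables (both helpers are added later in the
series: account_metadata_dirtied() in patch 5 and the two-argument
balance_dirty_pages_ratelimited() in patch 1; 'page' and 'blocksize' are
hypothetical), a filesystem can dirty a metadata block that has no
address_space and still be throttled:

    /* dirty one metadata block of 'blocksize' bytes backed by 'page' */
    account_metadata_dirtied(page, sb->s_bdi, blocksize);

    /* throttle against the global dirty limits, no mapping required */
    balance_dirty_pages_ratelimited(sb->s_bdi, sb);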

The last three patches are just there to show how we use the infrastructure
from the first 8 patches.  The actual kill-btree_inode patch is pretty big;
unfortunately, ripping out all of the pagecache based handling and replacing it
with the new infrastructure has to be done whole-hog and can't be broken up any
more than it already has been without making it un-bisectable.

Thanks,

Josef


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 01/10] remove mapping from balance_dirty_pages*()
  2017-12-11 21:55 [PATCH v3 00/11] Metadata specific accounting and dirty writeout Josef Bacik
@ 2017-12-11 21:55 ` Josef Bacik
  2017-12-11 21:55 ` [PATCH v3 02/10] writeback: convert WB_WRITTEN/WB_DIRTIED counters to bytes Josef Bacik
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Josef Bacik @ 2017-12-11 21:55 UTC (permalink / raw)
  To: hannes, linux-mm, akpm, jack, linux-fsdevel, kernel-team, linux-btrfs
  Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

The only reason we pass in the mapping is to get the inode in order to see if
cgroup writeback is enabled, and even then it only checks the bdi and a super
block flag.  balance_dirty_pages() doesn't even use the mapping.  Since
balance_dirty_pages*() works at the bdi level, just pass in the bdi and super
block directly so we can avoid using the mapping.  This will allow us to still
use balance_dirty_pages() for dirty metadata pages that are not backed by an
address_space.
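
For illustration (not part of the diff below), a typical call site changes
like this; inode_to_bdi() and i_sb are the existing ways to get from an inode
to its bdi and super block:

    /* before: needs an address_space */
    balance_dirty_pages_ratelimited(inode->i_mapping);

    /* after: only the bdi and super block are needed */
    balance_dirty_pages_ratelimited(inode_to_bdi(inode), inode->i_sb);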

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 drivers/mtd/devices/block2mtd.c | 12 ++++++++----
 fs/btrfs/disk-io.c              |  3 ++-
 fs/btrfs/file.c                 |  3 ++-
 fs/btrfs/ioctl.c                |  3 ++-
 fs/btrfs/relocation.c           |  3 ++-
 fs/buffer.c                     |  3 ++-
 fs/iomap.c                      |  6 ++++--
 fs/ntfs/attrib.c                | 11 ++++++++---
 fs/ntfs/file.c                  |  4 ++--
 include/linux/backing-dev.h     | 29 +++++++++++++++++++++++------
 include/linux/writeback.h       |  4 +++-
 mm/filemap.c                    |  4 +++-
 mm/memory.c                     |  5 ++++-
 mm/page-writeback.c             | 15 +++++++--------
 14 files changed, 72 insertions(+), 33 deletions(-)

diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c
index 7c887f111a7d..7892d0b9fcb0 100644
--- a/drivers/mtd/devices/block2mtd.c
+++ b/drivers/mtd/devices/block2mtd.c
@@ -52,7 +52,8 @@ static struct page *page_read(struct address_space *mapping, int index)
 /* erase a specified part of the device */
 static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len)
 {
-	struct address_space *mapping = dev->blkdev->bd_inode->i_mapping;
+	struct inode *inode = dev->blkdev->bd_inode;
+	struct address_space *mapping = inode->i_mapping;
 	struct page *page;
 	int index = to >> PAGE_SHIFT;	// page index
 	int pages = len >> PAGE_SHIFT;
@@ -71,7 +72,8 @@ static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len)
 				memset(page_address(page), 0xff, PAGE_SIZE);
 				set_page_dirty(page);
 				unlock_page(page);
-				balance_dirty_pages_ratelimited(mapping);
+				balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+								inode->i_sb);
 				break;
 			}
 
@@ -141,7 +143,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf,
 		loff_t to, size_t len, size_t *retlen)
 {
 	struct page *page;
-	struct address_space *mapping = dev->blkdev->bd_inode->i_mapping;
+	struct inode *inode = dev->blkdev->bd_inode;
+	struct address_space *mapping = inode->i_mapping;
 	int index = to >> PAGE_SHIFT;	// page index
 	int offset = to & ~PAGE_MASK;	// page offset
 	int cpylen;
@@ -162,7 +165,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf,
 			memcpy(page_address(page) + offset, buf, cpylen);
 			set_page_dirty(page);
 			unlock_page(page);
-			balance_dirty_pages_ratelimited(mapping);
+			balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+							inode->i_sb);
 		}
 		put_page(page);
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 689b9913ccb5..8b6df7688d52 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4150,7 +4150,8 @@ static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info,
 	ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
 				     BTRFS_DIRTY_METADATA_THRESH);
 	if (ret > 0) {
-		balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping);
+		balance_dirty_pages_ratelimited(fs_info->sb->s_bdi,
+						fs_info->sb);
 	}
 }
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ab1c38f2dd8c..4bc6cd6509be 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1779,7 +1779,8 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 
 		cond_resched();
 
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		if (dirty_pages < (fs_info->nodesize >> PAGE_SHIFT) + 1)
 			btrfs_btree_balance_dirty(fs_info);
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 6a07d4e12fd2..ec92fb5e2b51 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1368,7 +1368,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 		}
 
 		defrag_count += ret;
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		inode_unlock(inode);
 
 		if (newer_than) {
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 4cf2eb67eba6..9f31c5e6c0e5 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3278,7 +3278,8 @@ static int relocate_file_extent_cluster(struct inode *inode,
 
 		index++;
 		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		btrfs_throttle(fs_info);
 	}
 	WARN_ON(nr != cluster->nr);
diff --git a/fs/buffer.c b/fs/buffer.c
index 170df856bdb9..36be326a316c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2421,7 +2421,8 @@ static int cont_expand_zero(struct file *file, struct address_space *mapping,
 		BUG_ON(err != len);
 		err = 0;
 
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 
 		if (unlikely(fatal_signal_pending(current))) {
 			err = -EINTR;
diff --git a/fs/iomap.c b/fs/iomap.c
index 269b24a01f32..0eb1ec680f87 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -223,7 +223,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		written += copied;
 		length -= copied;
 
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 	} while (iov_iter_count(i) && length);
 
 	return written ? written : status;
@@ -305,7 +306,8 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		written += status;
 		length -= status;
 
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 	} while (length);
 
 	return written;
diff --git a/fs/ntfs/attrib.c b/fs/ntfs/attrib.c
index 44a39a099b54..d85368dd82e7 100644
--- a/fs/ntfs/attrib.c
+++ b/fs/ntfs/attrib.c
@@ -25,6 +25,7 @@
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/writeback.h>
+#include <linux/backing-dev.h>
 
 #include "attrib.h"
 #include "debug.h"
@@ -2493,6 +2494,7 @@ s64 ntfs_attr_extend_allocation(ntfs_inode *ni, s64 new_alloc_size,
 int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 {
 	ntfs_volume *vol = ni->vol;
+	struct inode *inode = VFS_I(ni);
 	struct address_space *mapping;
 	struct page *page;
 	u8 *kaddr;
@@ -2545,7 +2547,8 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		kunmap_atomic(kaddr);
 		set_page_dirty(page);
 		put_page(page);
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		cond_resched();
 		if (idx == end)
 			goto done;
@@ -2586,7 +2589,8 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		/* Finally unlock and release the page. */
 		unlock_page(page);
 		put_page(page);
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		cond_resched();
 	}
 	/* If there is a last partial page, need to do it the slow way. */
@@ -2603,7 +2607,8 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		kunmap_atomic(kaddr);
 		set_page_dirty(page);
 		put_page(page);
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		cond_resched();
 	}
 done:
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 331910fa8442..77b04be4a157 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -276,7 +276,7 @@ static int ntfs_attr_extend_initialized(ntfs_inode *ni, const s64 new_init_size)
 		 * number of pages we read and make dirty in the case of sparse
 		 * files.
 		 */
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(vi), vi->i_sb);
 		cond_resched();
 	} while (++index < end_index);
 	read_lock_irqsave(&ni->size_lock, flags);
@@ -1913,7 +1913,7 @@ static ssize_t ntfs_perform_write(struct file *file, struct iov_iter *i,
 		iov_iter_advance(i, copied);
 		pos += copied;
 		written += copied;
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(vi), vi->i_sb);
 		if (fatal_signal_pending(current)) {
 			status = -EINTR;
 			break;
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 854e1bdd0b2a..14e266d12620 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -228,8 +228,9 @@ void wb_blkcg_offline(struct blkcg *blkcg);
 int inode_congested(struct inode *inode, int cong_bits);
 
 /**
- * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
- * @inode: inode of interest
+ * bdi_cgwb_enabled - test whether cgroup writeback is enabled on a filesystem
+ * @bdi: the bdi we care about
+ * @sb: the super for the bdi
  *
  * cgroup writeback requires support from both the bdi and filesystem.
  * Also, both memcg and iocg have to be on the default hierarchy.  Test
@@ -238,15 +239,25 @@ int inode_congested(struct inode *inode, int cong_bits);
  * Note that the test result may change dynamically on the same inode
  * depending on how memcg and iocg are configured.
  */
-static inline bool inode_cgwb_enabled(struct inode *inode)
+static inline bool bdi_cgwb_enabled(struct backing_dev_info *bdi,
+				    struct super_block *sb)
 {
-	struct backing_dev_info *bdi = inode_to_bdi(inode);
-
 	return cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
 		cgroup_subsys_on_dfl(io_cgrp_subsys) &&
 		bdi_cap_account_dirty(bdi) &&
 		(bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
-		(inode->i_sb->s_iflags & SB_I_CGROUPWB);
+		(sb->s_iflags & SB_I_CGROUPWB);
+}
+
+/**
+ * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
+ * @inode: inode of interest
+ *
+ * Does the inode have cgroup writeback support.
+ */
+static inline bool inode_cgwb_enabled(struct inode *inode)
+{
+	return bdi_cgwb_enabled(inode_to_bdi(inode), inode->i_sb);
 }
 
 /**
@@ -389,6 +400,12 @@ static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
+static inline bool bdi_cgwb_enabled(struct backing_dev_info *bdi,
+				    struct super_block *sb)
+{
+	return false;
+}
+
 static inline bool inode_cgwb_enabled(struct inode *inode)
 {
 	return false;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d5815794416c..fa799a4a7755 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -376,7 +376,9 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
 
 void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time);
-void balance_dirty_pages_ratelimited(struct address_space *mapping);
+void page_writeback_init(void);
+void balance_dirty_pages_ratelimited(struct backing_dev_info *bdi,
+				     struct super_block *sb);
 bool wb_over_bg_thresh(struct bdi_writeback *wb);
 
 typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
diff --git a/mm/filemap.c b/mm/filemap.c
index 870971e20967..5ea4878e9c78 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2971,6 +2971,7 @@ ssize_t generic_perform_write(struct file *file,
 				struct iov_iter *i, loff_t pos)
 {
 	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	long status = 0;
 	ssize_t written = 0;
@@ -3044,7 +3045,8 @@ ssize_t generic_perform_write(struct file *file,
 		pos += copied;
 		written += copied;
 
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 	} while (iov_iter_count(i));
 
 	return written ? written : status;
diff --git a/mm/memory.c b/mm/memory.c
index ec4e15494901..86f31b3d54c6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -70,6 +70,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/dax.h>
 #include <linux/oom.h>
+#include <linux/backing-dev.h>
 
 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -2391,11 +2392,13 @@ static void fault_dirty_shared_page(struct vm_area_struct *vma,
 	unlock_page(page);
 
 	if ((dirtied || page_mkwrite) && mapping) {
+		struct inode *inode = mapping->host;
 		/*
 		 * Some device drivers do not set page.mapping
 		 * but still dirty their pages
 		 */
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 	}
 
 	if (!page_mkwrite)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0b9c5cbe8eba..1a47d4296750 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1559,8 +1559,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
  * If we're over `background_thresh' then the writeback threads are woken to
  * perform some writeout.
  */
-static void balance_dirty_pages(struct address_space *mapping,
-				struct bdi_writeback *wb,
+static void balance_dirty_pages(struct bdi_writeback *wb,
 				unsigned long pages_dirtied)
 {
 	struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
@@ -1850,7 +1849,8 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
 
 /**
  * balance_dirty_pages_ratelimited - balance dirty memory state
- * @mapping: address_space which was dirtied
+ * @bdi: the bdi that was dirtied
+ * @sb: the super block that was dirtied
  *
  * Processes which are dirtying memory should call in here once for each page
  * which was newly dirtied.  The function will periodically check the system's
@@ -1861,10 +1861,9 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
  * limit we decrease the ratelimiting by a lot, to prevent individual processes
  * from overshooting the limit by (ratelimit_pages) each.
  */
-void balance_dirty_pages_ratelimited(struct address_space *mapping)
+void balance_dirty_pages_ratelimited(struct backing_dev_info *bdi,
+				     struct super_block *sb)
 {
-	struct inode *inode = mapping->host;
-	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	struct bdi_writeback *wb = NULL;
 	int ratelimit;
 	int *p;
@@ -1872,7 +1871,7 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
 	if (!bdi_cap_account_dirty(bdi))
 		return;
 
-	if (inode_cgwb_enabled(inode))
+	if (bdi_cgwb_enabled(bdi, sb))
 		wb = wb_get_create_current(bdi, GFP_KERNEL);
 	if (!wb)
 		wb = &bdi->wb;
@@ -1910,7 +1909,7 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
 	preempt_enable();
 
 	if (unlikely(current->nr_dirtied >= ratelimit))
-		balance_dirty_pages(mapping, wb, current->nr_dirtied);
+		balance_dirty_pages(wb, current->nr_dirtied);
 
 	wb_put(wb);
 }
-- 
2.7.5


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 02/10] writeback: convert WB_WRITTEN/WB_DIRTIED counters to bytes
  2017-12-11 21:55 [PATCH v3 00/11] Metadata specific accounting and dirty writeout Josef Bacik
  2017-12-11 21:55 ` [PATCH v3 01/10] remove mapping from balance_dirty_pages*() Josef Bacik
@ 2017-12-11 21:55 ` Josef Bacik
  2017-12-11 21:55 ` [PATCH v3 03/10] lib: add a __fprop_add_percpu_max Josef Bacik
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Josef Bacik @ 2017-12-11 21:55 UTC (permalink / raw)
  To: hannes, linux-mm, akpm, jack, linux-fsdevel, kernel-team, linux-btrfs
  Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

These counters constantly go up and are used for bandwidth calculations.  It
isn't important what units they are in, as long as the two of them are
consistent, so convert them to count bytes written/dirtied, and allow the
metadata accounting code to update the counters as well.
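
For illustration (not part of the diff below), after this patch writeout is
accounted in bytes, so a pagecache completion passes PAGE_SIZE while a
metadata completion can pass its actual block size ('nodesize' below is a
hypothetical variable standing in for the filesystem's metadata block size):

    wb_writeout_add(&bdi->wb, PAGE_SIZE);   /* one pagecache page written */
    wb_writeout_add(&bdi->wb, nodesize);    /* one metadata block written */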

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/fuse/file.c                   |  4 ++--
 include/linux/backing-dev-defs.h |  4 ++--
 include/linux/backing-dev.h      |  2 +-
 mm/backing-dev.c                 |  9 +++++----
 mm/page-writeback.c              | 20 ++++++++++----------
 5 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index cb7dff5c45d7..67e7c4fac28d 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1471,7 +1471,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 	for (i = 0; i < req->num_pages; i++) {
 		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
 		dec_node_page_state(req->pages[i], NR_WRITEBACK_TEMP);
-		wb_writeout_inc(&bdi->wb);
+		wb_writeout_add(&bdi->wb, PAGE_SIZE);
 	}
 	wake_up(&fi->page_waitq);
 }
@@ -1776,7 +1776,7 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req,
 
 		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
 		dec_node_page_state(page, NR_WRITEBACK_TEMP);
-		wb_writeout_inc(&bdi->wb);
+		wb_writeout_add(&bdi->wb, PAGE_SIZE);
 		fuse_writepage_free(fc, new_req);
 		fuse_request_free(new_req);
 		goto out;
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 866c433e7d32..ded45ac2cec7 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -36,8 +36,8 @@ typedef int (congested_fn)(void *, int);
 enum wb_stat_item {
 	WB_RECLAIMABLE,
 	WB_WRITEBACK,
-	WB_DIRTIED,
-	WB_WRITTEN,
+	WB_DIRTIED_BYTES,
+	WB_WRITTEN_BYTES,
 	NR_WB_STAT_ITEMS
 };
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 14e266d12620..39b8dc486ea7 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -89,7 +89,7 @@ static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item item)
 	return percpu_counter_sum_positive(&wb->stat[item]);
 }
 
-extern void wb_writeout_inc(struct bdi_writeback *wb);
+extern void wb_writeout_add(struct bdi_writeback *wb, long bytes);
 
 /*
  * maximal error of a stat counter.
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e19606bb41a0..62a332a91b38 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -68,14 +68,15 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 	wb_thresh = wb_calc_thresh(wb, dirty_thresh);
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
+#define BtoK(x) ((x) >> 10)
 	seq_printf(m,
 		   "BdiWriteback:       %10lu kB\n"
 		   "BdiReclaimable:     %10lu kB\n"
 		   "BdiDirtyThresh:     %10lu kB\n"
 		   "DirtyThresh:        %10lu kB\n"
 		   "BackgroundThresh:   %10lu kB\n"
-		   "BdiDirtied:         %10lu kB\n"
-		   "BdiWritten:         %10lu kB\n"
+		   "BdiDirtiedBytes:    %10lu kB\n"
+		   "BdiWrittenBytes:    %10lu kB\n"
 		   "BdiWriteBandwidth:  %10lu kBps\n"
 		   "b_dirty:            %10lu\n"
 		   "b_io:               %10lu\n"
@@ -88,8 +89,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   K(wb_thresh),
 		   K(dirty_thresh),
 		   K(background_thresh),
-		   (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
-		   (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
+		   (unsigned long) BtoK(wb_stat(wb, WB_DIRTIED_BYTES)),
+		   (unsigned long) BtoK(wb_stat(wb, WB_WRITTEN_BYTES)),
 		   (unsigned long) K(wb->write_bandwidth),
 		   nr_dirty,
 		   nr_io,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1a47d4296750..e4563645749a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -597,11 +597,11 @@ static void wb_domain_writeout_inc(struct wb_domain *dom,
  * Increment @wb's writeout completion count and the global writeout
  * completion count. Called from test_clear_page_writeback().
  */
-static inline void __wb_writeout_inc(struct bdi_writeback *wb)
+static inline void __wb_writeout_add(struct bdi_writeback *wb, long bytes)
 {
 	struct wb_domain *cgdom;
 
-	inc_wb_stat(wb, WB_WRITTEN);
+	__add_wb_stat(wb, WB_WRITTEN_BYTES, bytes);
 	wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
 			       wb->bdi->max_prop_frac);
 
@@ -611,15 +611,15 @@ static inline void __wb_writeout_inc(struct bdi_writeback *wb)
 				       wb->bdi->max_prop_frac);
 }
 
-void wb_writeout_inc(struct bdi_writeback *wb)
+void wb_writeout_add(struct bdi_writeback *wb, long bytes)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
-	__wb_writeout_inc(wb);
+	__wb_writeout_add(wb, bytes);
 	local_irq_restore(flags);
 }
-EXPORT_SYMBOL_GPL(wb_writeout_inc);
+EXPORT_SYMBOL_GPL(wb_writeout_add);
 
 /*
  * On idle system, we can be called long after we scheduled because we use
@@ -1362,8 +1362,8 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
 
-	dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED]);
-	written = percpu_counter_read(&wb->stat[WB_WRITTEN]);
+	dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED_BYTES]) >> PAGE_SHIFT;
+	written = percpu_counter_read(&wb->stat[WB_WRITTEN_BYTES]) >> PAGE_SHIFT;
 
 	/*
 	 * Skip quiet periods when disk bandwidth is under-utilized.
@@ -2435,7 +2435,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 		__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		__inc_node_page_state(page, NR_DIRTIED);
 		inc_wb_stat(wb, WB_RECLAIMABLE);
-		inc_wb_stat(wb, WB_DIRTIED);
+		__add_wb_stat(wb, WB_DIRTIED_BYTES, PAGE_SIZE);
 		task_io_account_write(PAGE_SIZE);
 		current->nr_dirtied++;
 		this_cpu_inc(bdp_ratelimits);
@@ -2522,7 +2522,7 @@ void account_page_redirty(struct page *page)
 		wb = unlocked_inode_to_wb_begin(inode, &locked);
 		current->nr_dirtied--;
 		dec_node_page_state(page, NR_DIRTIED);
-		dec_wb_stat(wb, WB_DIRTIED);
+		__add_wb_stat(wb, WB_DIRTIED_BYTES, -(long)PAGE_SIZE);
 		unlocked_inode_to_wb_end(inode, locked);
 	}
 }
@@ -2744,7 +2744,7 @@ int test_clear_page_writeback(struct page *page)
 				struct bdi_writeback *wb = inode_to_wb(inode);
 
 				dec_wb_stat(wb, WB_WRITEBACK);
-				__wb_writeout_inc(wb);
+				__wb_writeout_add(wb, PAGE_SIZE);
 			}
 		}
 
-- 
2.7.5


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 03/10] lib: add a __fprop_add_percpu_max
  2017-12-11 21:55 [PATCH v3 00/11] Metadata specific accounting and dirty writeout Josef Bacik
  2017-12-11 21:55 ` [PATCH v3 01/10] remove mapping from balance_dirty_pages*() Josef Bacik
  2017-12-11 21:55 ` [PATCH v3 02/10] writeback: convert WB_WRITTEN/WB_DIRTIED counters to bytes Josef Bacik
@ 2017-12-11 21:55 ` Josef Bacik
  2017-12-19  7:25   ` Jan Kara
  2017-12-11 21:55 ` [PATCH v3 04/10] writeback: convert the flexible prop stuff to bytes Josef Bacik
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Josef Bacik @ 2017-12-11 21:55 UTC (permalink / raw)
  To: hannes, linux-mm, akpm, jack, linux-fsdevel, kernel-team, linux-btrfs
  Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

This helper allows us to add an arbitrary amount to the fprop
structures.
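
For illustration (not part of the diff below), the relationship between the
old and new helpers after this patch; 'nr' stands for the arbitrary amount
being added:

    /* the old helper becomes a wrapper that adds a single event */
    __fprop_inc_percpu_max(p, pl, max_frac);

    /* the new helper can add any amount, e.g. a byte count */
    __fprop_add_percpu_max(p, pl, nr, max_frac);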

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 include/linux/flex_proportions.h | 11 +++++++++--
 lib/flex_proportions.c           |  9 +++++----
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/flex_proportions.h b/include/linux/flex_proportions.h
index 0d348e011a6e..9f88684bf0a0 100644
--- a/include/linux/flex_proportions.h
+++ b/include/linux/flex_proportions.h
@@ -83,8 +83,8 @@ struct fprop_local_percpu {
 int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp);
 void fprop_local_destroy_percpu(struct fprop_local_percpu *pl);
 void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl);
-void __fprop_inc_percpu_max(struct fprop_global *p, struct fprop_local_percpu *pl,
-			    int max_frac);
+void __fprop_add_percpu_max(struct fprop_global *p, struct fprop_local_percpu *pl,
+			    unsigned long nr, int max_frac);
 void fprop_fraction_percpu(struct fprop_global *p,
 	struct fprop_local_percpu *pl, unsigned long *numerator,
 	unsigned long *denominator);
@@ -99,4 +99,11 @@ void fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl)
 	local_irq_restore(flags);
 }
 
+static inline
+void __fprop_inc_percpu_max(struct fprop_global *p,
+			    struct fprop_local_percpu *pl, int max_frac)
+{
+	__fprop_add_percpu_max(p, pl, 1, max_frac);
+}
+
 #endif
diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c
index 2cc1f94e03a1..31003989d34a 100644
--- a/lib/flex_proportions.c
+++ b/lib/flex_proportions.c
@@ -255,8 +255,9 @@ void fprop_fraction_percpu(struct fprop_global *p,
  * Like __fprop_inc_percpu() except that event is counted only if the given
  * type has fraction smaller than @max_frac/FPROP_FRAC_BASE
  */
-void __fprop_inc_percpu_max(struct fprop_global *p,
-			    struct fprop_local_percpu *pl, int max_frac)
+void __fprop_add_percpu_max(struct fprop_global *p,
+			    struct fprop_local_percpu *pl, unsigned long nr,
+			    int max_frac)
 {
 	if (unlikely(max_frac < FPROP_FRAC_BASE)) {
 		unsigned long numerator, denominator;
@@ -267,6 +268,6 @@ void __fprop_inc_percpu_max(struct fprop_global *p,
 			return;
 	} else
 		fprop_reflect_period_percpu(p, pl);
-	percpu_counter_add_batch(&pl->events, 1, PROP_BATCH);
-	percpu_counter_add(&p->events, 1);
+	percpu_counter_add_batch(&pl->events, nr, PROP_BATCH);
+	percpu_counter_add(&p->events, nr);
 }
-- 
2.7.5


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 04/10] writeback: convert the flexible prop stuff to bytes
  2017-12-11 21:55 [PATCH v3 00/11] Metadata specific accounting and dirty writeout Josef Bacik
                   ` (2 preceding siblings ...)
  2017-12-11 21:55 ` [PATCH v3 03/10] lib: add a __fprop_add_percpu_max Josef Bacik
@ 2017-12-11 21:55 ` Josef Bacik
  2017-12-11 21:55 ` [PATCH v3 05/10] writeback: add counters for metadata usage Josef Bacik
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Josef Bacik @ 2017-12-11 21:55 UTC (permalink / raw)
  To: hannes, linux-mm, akpm, jack, linux-fsdevel, kernel-team, linux-btrfs
  Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

The flexible proportions were all page based, but now that we are doing
metadata writeout that can be smaller or larger than the page size, we need
to account for this in bytes instead of in numbers of pages.
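
For illustration (taken from the diff below), the writeout completion path
now feeds the flex proportions a byte count instead of a single page event:

    __add_wb_stat(wb, WB_WRITTEN_BYTES, bytes);
    wb_domain_writeout_add(&global_wb_domain, &wb->completions, bytes,
                           wb->bdi->max_prop_frac);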

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 lib/flex_proportions.c |  2 +-
 mm/page-writeback.c    | 10 +++++-----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c
index 31003989d34a..fd95791a2c93 100644
--- a/lib/flex_proportions.c
+++ b/lib/flex_proportions.c
@@ -166,7 +166,7 @@ void fprop_fraction_single(struct fprop_global *p,
 /*
  * ---- PERCPU ----
  */
-#define PROP_BATCH (8*(1+ilog2(nr_cpu_ids)))
+#define PROP_BATCH (8*PAGE_SIZE*(1+ilog2(nr_cpu_ids)))
 
 int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp)
 {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e4563645749a..2a1994194cc1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -574,11 +574,11 @@ static unsigned long wp_next_time(unsigned long cur_time)
 	return cur_time;
 }
 
-static void wb_domain_writeout_inc(struct wb_domain *dom,
+static void wb_domain_writeout_add(struct wb_domain *dom,
 				   struct fprop_local_percpu *completions,
-				   unsigned int max_prop_frac)
+				   long bytes, unsigned int max_prop_frac)
 {
-	__fprop_inc_percpu_max(&dom->completions, completions,
+	__fprop_add_percpu_max(&dom->completions, completions, bytes,
 			       max_prop_frac);
 	/* First event after period switching was turned off? */
 	if (unlikely(!dom->period_time)) {
@@ -602,12 +602,12 @@ static inline void __wb_writeout_add(struct bdi_writeback *wb, long bytes)
 	struct wb_domain *cgdom;
 
 	__add_wb_stat(wb, WB_WRITTEN_BYTES, bytes);
-	wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
+	wb_domain_writeout_add(&global_wb_domain, &wb->completions, bytes,
 			       wb->bdi->max_prop_frac);
 
 	cgdom = mem_cgroup_wb_domain(wb);
 	if (cgdom)
-		wb_domain_writeout_inc(cgdom, wb_memcg_completions(wb),
+		wb_domain_writeout_add(cgdom, wb_memcg_completions(wb), bytes,
 				       wb->bdi->max_prop_frac);
 }
 
-- 
2.7.5


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 05/10] writeback: add counters for metadata usage
  2017-12-11 21:55 [PATCH v3 00/11] Metadata specific accounting and dirty writeout Josef Bacik
                   ` (3 preceding siblings ...)
  2017-12-11 21:55 ` [PATCH v3 04/10] writeback: convert the flexible prop stuff to bytes Josef Bacik
@ 2017-12-11 21:55 ` Josef Bacik
  2017-12-19  7:52   ` Jan Kara
  2017-12-11 21:55 ` [PATCH v3 06/10] writeback: introduce super_operations->write_metadata Josef Bacik
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Josef Bacik @ 2017-12-11 21:55 UTC (permalink / raw)
  To: hannes, linux-mm, akpm, jack, linux-fsdevel, kernel-team, linux-btrfs
  Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

Btrfs has no bound except memory on the amount of dirty memory that we have in
use for metadata.  Historically we have used a special inode so we could take
advantage of the balance_dirty_pages() throttling that comes with using
pagecache.  However, as we'd like to support different blocksizes, it would be
nice to not have to rely on pagecache but still get the balance_dirty_pages()
throttling without having to do it ourselves.

So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES.  These are
node and bdi_writeback counters to keep track of how many bytes we have in
flight for metadata.  We need to count in bytes as blocksizes could be
fractions of the page size.  We simply convert the bytes to a number of pages
where that is needed for the throttling.

Also introduce NR_METADATA_BYTES so we can keep track of the total amount of
memory used for metadata on the system.  This is also needed so things like
dirty throttling know that this is dirtyable and easily reclaimed memory as
well.
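
For illustration (not part of the diff below), a rough sketch of how a
filesystem would drive these counters over the life of a single dirty
metadata block; the four helpers are added by this patch, while 'page', 'bdi'
and 'blocksize' are hypothetical:

    /* the block was modified in memory */
    account_metadata_dirtied(page, bdi, blocksize);

    /* writeback picks it up and the IO is submitted */
    account_metadata_writeback(page, bdi, blocksize);

    /* on IO completion the writeback bytes are dropped and credited
     * to the bdi's writeout completions */
    account_metadata_end_writeback(page, bdi, blocksize);

    /* alternatively, if the block is cleaned without being written
     * (e.g. it is freed), undo the dirty accounting instead */
    account_metadata_cleaned(page, bdi, blocksize);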

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/base/node.c              |   8 +++
 fs/fs-writeback.c                |   2 +
 fs/proc/meminfo.c                |   8 +++
 include/linux/backing-dev-defs.h |   2 +
 include/linux/mm.h               |   9 +++
 include/linux/mmzone.h           |  10 +++
 include/trace/events/writeback.h |  13 +++-
 mm/backing-dev.c                 |   4 ++
 mm/page-writeback.c              | 142 +++++++++++++++++++++++++++++++++++----
 mm/page_alloc.c                  |  20 ++++--
 mm/util.c                        |   1 +
 mm/vmscan.c                      |   3 +-
 mm/vmstat.c                      |  10 +++
 13 files changed, 211 insertions(+), 21 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 3855902f2c5b..a39cecc8957a 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -51,6 +51,8 @@ static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
 static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
+#define BtoK(x) ((x) >> 10)
+
 static ssize_t node_read_meminfo(struct device *dev,
 			struct device_attribute *attr, char *buf)
 {
@@ -99,7 +101,10 @@ static ssize_t node_read_meminfo(struct device *dev,
 #endif
 	n += sprintf(buf + n,
 		       "Node %d Dirty:          %8lu kB\n"
+		       "Node %d MetadataDirty:	%8lu kB\n"
 		       "Node %d Writeback:      %8lu kB\n"
+		       "Node %d MetaWriteback:  %8lu kB\n"
+		       "Node %d Metadata:       %8lu kB\n"
 		       "Node %d FilePages:      %8lu kB\n"
 		       "Node %d Mapped:         %8lu kB\n"
 		       "Node %d AnonPages:      %8lu kB\n"
@@ -119,8 +124,11 @@ static ssize_t node_read_meminfo(struct device *dev,
 #endif
 			,
 		       nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
+		       nid, BtoK(node_page_state(pgdat, NR_METADATA_DIRTY_BYTES)),
 		       nid, K(node_page_state(pgdat, NR_WRITEBACK)),
+		       nid, BtoK(node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES)),
+		       nid, BtoK(node_page_state(pgdat, NR_METADATA_BYTES)),
 		       nid, K(node_page_state(pgdat, NR_FILE_PAGES)),
 		       nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
 		       nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
 		       nid, K(i.sharedram),
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 245c430a2e41..987448ed7698 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1814,6 +1814,7 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
 	return work;
 }
 
+#define BtoP(x) ((x) >> PAGE_SHIFT)
 /*
  * Add in the number of potentially dirty inodes, because each inode
  * write can dirty pagecache in the underlying blockdev.
@@ -1822,6 +1823,7 @@ static unsigned long get_nr_dirty_pages(void)
 {
 	return global_node_page_state(NR_FILE_DIRTY) +
 		global_node_page_state(NR_UNSTABLE_NFS) +
+		BtoP(global_node_page_state(NR_METADATA_DIRTY_BYTES)) +
 		get_nr_dirty_inodes();
 }
 
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index cdd979724c74..fa1fd24a4d99 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -42,6 +42,8 @@ static void show_val_kb(struct seq_file *m, const char *s, unsigned long num)
 	seq_write(m, " kB\n", 4);
 }
 
+#define BtoP(x) ((x) >> PAGE_SHIFT)
+
 static int meminfo_proc_show(struct seq_file *m, void *v)
 {
 	struct sysinfo i;
@@ -71,6 +73,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	show_val_kb(m, "Buffers:        ", i.bufferram);
 	show_val_kb(m, "Cached:         ", cached);
 	show_val_kb(m, "SwapCached:     ", total_swapcache_pages());
+	show_val_kb(m, "Metadata:       ",
+		    BtoP(global_node_page_state(NR_METADATA_BYTES)));
 	show_val_kb(m, "Active:         ", pages[LRU_ACTIVE_ANON] +
 					   pages[LRU_ACTIVE_FILE]);
 	show_val_kb(m, "Inactive:       ", pages[LRU_INACTIVE_ANON] +
@@ -98,8 +102,12 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	show_val_kb(m, "SwapFree:       ", i.freeswap);
 	show_val_kb(m, "Dirty:          ",
 		    global_node_page_state(NR_FILE_DIRTY));
+	show_val_kb(m, "MetadataDirty:  ",
+		    BtoP(global_node_page_state(NR_METADATA_DIRTY_BYTES)));
 	show_val_kb(m, "Writeback:      ",
 		    global_node_page_state(NR_WRITEBACK));
+	show_val_kb(m, "MetaWriteback:  ",
+		    BtoP(global_node_page_state(NR_METADATA_WRITEBACK_BYTES)));
 	show_val_kb(m, "AnonPages:      ",
 		    global_node_page_state(NR_ANON_MAPPED));
 	show_val_kb(m, "Mapped:         ",
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index ded45ac2cec7..78c65e2910dc 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -38,6 +38,8 @@ enum wb_stat_item {
 	WB_WRITEBACK,
 	WB_DIRTIED_BYTES,
 	WB_WRITTEN_BYTES,
+	WB_METADATA_DIRTY_BYTES,
+	WB_METADATA_WRITEBACK_BYTES,
 	NR_WB_STAT_ITEMS
 };
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f8c10d336e42..e14ada96af25 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -32,6 +32,7 @@ struct file_ra_state;
 struct user_struct;
 struct writeback_control;
 struct bdi_writeback;
+struct backing_dev_info;
 
 void init_mm_internals(void);
 
@@ -1428,6 +1429,14 @@ int redirty_page_for_writepage(struct writeback_control *wbc,
 void account_page_dirtied(struct page *page, struct address_space *mapping);
 void account_page_cleaned(struct page *page, struct address_space *mapping,
 			  struct bdi_writeback *wb);
+void account_metadata_dirtied(struct page *page, struct backing_dev_info *bdi,
+			      long bytes);
+void account_metadata_cleaned(struct page *page, struct backing_dev_info *bdi,
+			      long bytes);
+void account_metadata_writeback(struct page *page,
+				struct backing_dev_info *bdi, long bytes);
+void account_metadata_end_writeback(struct page *page,
+				    struct backing_dev_info *bdi, long bytes);
 int set_page_dirty(struct page *page);
 int set_page_dirty_lock(struct page *page);
 void cancel_dirty_page(struct page *page);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 356a814e7c8e..48de090f5a07 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -179,9 +179,19 @@ enum node_stat_item {
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
+	NR_METADATA_DIRTY_BYTES,	/* Metadata dirty bytes */
+	NR_METADATA_WRITEBACK_BYTES,	/* Metadata writeback bytes */
+	NR_METADATA_BYTES,	/* total metadata bytes in use. */
 	NR_VM_NODE_STAT_ITEMS
 };
 
+static inline int is_bytes_node_stat(enum node_stat_item item)
+{
+	return (item == NR_METADATA_DIRTY_BYTES ||
+		item == NR_METADATA_WRITEBACK_BYTES ||
+		item == NR_METADATA_BYTES);
+}
+
 /*
  * We do arithmetic on the LRU lists in various places in the code,
  * so it is important to keep the active lists LRU_ACTIVE higher in
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 9b57f014d79d..989cdae363db 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -390,6 +390,8 @@ TRACE_EVENT(writeback_queue_io,
 	)
 );
 
+#define BtoP(x) ((x) >> PAGE_SHIFT)
+
 TRACE_EVENT(global_dirty_state,
 
 	TP_PROTO(unsigned long background_thresh,
@@ -402,7 +404,9 @@ TRACE_EVENT(global_dirty_state,
 
 	TP_STRUCT__entry(
 		__field(unsigned long,	nr_dirty)
+		__field(unsigned long,	nr_metadata_dirty)
 		__field(unsigned long,	nr_writeback)
+		__field(unsigned long,	nr_metadata_writeback)
 		__field(unsigned long,	nr_unstable)
 		__field(unsigned long,	background_thresh)
 		__field(unsigned long,	dirty_thresh)
@@ -413,7 +417,9 @@ TRACE_EVENT(global_dirty_state,
 
 	TP_fast_assign(
 		__entry->nr_dirty	= global_node_page_state(NR_FILE_DIRTY);
+		__entry->nr_metadata_dirty = BtoP(global_node_page_state(NR_METADATA_DIRTY_BYTES));
 		__entry->nr_writeback	= global_node_page_state(NR_WRITEBACK);
+		__entry->nr_metadata_writeback = BtoP(global_node_page_state(NR_METADATA_WRITEBACK_BYTES));
 		__entry->nr_unstable	= global_node_page_state(NR_UNSTABLE_NFS);
 		__entry->nr_dirtied	= global_node_page_state(NR_DIRTIED);
 		__entry->nr_written	= global_node_page_state(NR_WRITTEN);
@@ -424,7 +430,8 @@ TRACE_EVENT(global_dirty_state,
 
 	TP_printk("dirty=%lu writeback=%lu unstable=%lu "
 		  "bg_thresh=%lu thresh=%lu limit=%lu "
-		  "dirtied=%lu written=%lu",
+		  "dirtied=%lu written=%lu metadata_dirty=%lu "
+		  "metadata_writeback=%lu",
 		  __entry->nr_dirty,
 		  __entry->nr_writeback,
 		  __entry->nr_unstable,
@@ -432,7 +439,9 @@ TRACE_EVENT(global_dirty_state,
 		  __entry->dirty_thresh,
 		  __entry->dirty_limit,
 		  __entry->nr_dirtied,
-		  __entry->nr_written
+		  __entry->nr_written,
+		  __entry->nr_metadata_dirty,
+		  __entry->nr_metadata_writeback
 	)
 );
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 62a332a91b38..0aad67bc0898 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -77,6 +77,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   "BackgroundThresh:   %10lu kB\n"
 		   "BdiDirtiedBytes:    %10lu kB\n"
 		   "BdiWrittenBytes:    %10lu kB\n"
+		   "BdiMetadataDirty:   %10lu kB\n"
+		   "BdiMetaWriteback:	%10lu kB\n"
 		   "BdiWriteBandwidth:  %10lu kBps\n"
 		   "b_dirty:            %10lu\n"
 		   "b_io:               %10lu\n"
@@ -91,6 +93,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   K(background_thresh),
 		   (unsigned long) BtoK(wb_stat(wb, WB_DIRTIED_BYTES)),
 		   (unsigned long) BtoK(wb_stat(wb, WB_WRITTEN_BYTES)),
+		   (unsigned long) BtoK(wb_stat(wb, WB_METADATA_DIRTY_BYTES)),
+		   (unsigned long) BtoK(wb_stat(wb, WB_METADATA_WRITEBACK_BYTES)),
 		   (unsigned long) K(wb->write_bandwidth),
 		   nr_dirty,
 		   nr_io,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2a1994194cc1..044aaa1ab090 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -249,6 +249,8 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
 
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
+#define BtoP(x) ((x) >> PAGE_SHIFT)
+
 /*
  * In a memory zone, there is a certain amount of pages we consider
  * available for the page cache, which is essentially the number of
@@ -297,6 +299,7 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)
 
 	nr_pages += node_page_state(pgdat, NR_INACTIVE_FILE);
 	nr_pages += node_page_state(pgdat, NR_ACTIVE_FILE);
+	nr_pages += node_page_state(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT;
 
 	return nr_pages;
 }
@@ -373,6 +376,7 @@ static unsigned long global_dirtyable_memory(void)
 
 	x += global_node_page_state(NR_INACTIVE_FILE);
 	x += global_node_page_state(NR_ACTIVE_FILE);
+	x += global_node_page_state(NR_METADATA_BYTES) >> PAGE_SHIFT;
 
 	if (!vm_highmem_is_dirtyable)
 		x -= highmem_dirtyable_memory(x);
@@ -381,6 +385,30 @@ static unsigned long global_dirtyable_memory(void)
 }
 
 /**
+ * global_dirty_memory - the number of globally dirty pages
+ *
+ * Returns the global number of pages that are dirty in pagecache and metadata.
+ */
+static unsigned long global_dirty_memory(void)
+{
+	return global_node_page_state(NR_FILE_DIRTY) +
+		global_node_page_state(NR_UNSTABLE_NFS) +
+		(global_node_page_state(NR_METADATA_DIRTY_BYTES) >> PAGE_SHIFT);
+}
+
+/**
+ * global_writeback_memory - the number of pages under writeback globally
+ *
+ * Returns the global number of pages under writeback both in pagecache and in
+ * metadata.
+ */
+static unsigned long global_writeback_memory(void)
+{
+	return global_node_page_state(NR_WRITEBACK) +
+		(global_node_page_state(NR_METADATA_WRITEBACK_BYTES) >> PAGE_SHIFT);
+}
+
+/**
  * domain_dirty_limits - calculate thresh and bg_thresh for a wb_domain
  * @dtc: dirty_throttle_control of interest
  *
@@ -492,6 +520,7 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat)
 	return dirty;
 }
 
+
 /**
  * node_dirty_ok - tells whether a node is within its dirty limits
  * @pgdat: the node to check
@@ -507,6 +536,8 @@ bool node_dirty_ok(struct pglist_data *pgdat)
 	nr_pages += node_page_state(pgdat, NR_FILE_DIRTY);
 	nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS);
 	nr_pages += node_page_state(pgdat, NR_WRITEBACK);
+	nr_pages += BtoP(node_page_state(pgdat, NR_METADATA_DIRTY_BYTES));
+	nr_pages += BtoP(node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES));
 
 	return nr_pages <= limit;
 }
@@ -1514,7 +1545,7 @@ static long wb_min_pause(struct bdi_writeback *wb,
 static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
 {
 	struct bdi_writeback *wb = dtc->wb;
-	unsigned long wb_reclaimable;
+	unsigned long wb_reclaimable, wb_writeback;
 
 	/*
 	 * wb_thresh is not treated as some limiting factor as
@@ -1544,12 +1575,17 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
 	 * deltas.
 	 */
 	if (dtc->wb_thresh < 2 * wb_stat_error(wb)) {
-		wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
-		dtc->wb_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK);
+		wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE) +
+			BtoP(wb_stat_sum(wb, WB_METADATA_DIRTY_BYTES));
+		wb_writeback = wb_stat_sum(wb, WB_WRITEBACK) +
+			BtoP(wb_stat_sum(wb, WB_METADATA_WRITEBACK_BYTES));
 	} else {
-		wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE);
-		dtc->wb_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK);
+		wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE) +
+			BtoP(wb_stat(wb, WB_METADATA_DIRTY_BYTES));
+		wb_writeback = wb_stat(wb, WB_WRITEBACK) +
+			BtoP(wb_stat(wb, WB_METADATA_WRITEBACK_BYTES));
 	}
+	dtc->wb_dirty = wb_reclaimable + wb_writeback;
 }
 
 /*
@@ -1594,10 +1630,9 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
-					global_node_page_state(NR_UNSTABLE_NFS);
+		nr_reclaimable = global_dirty_memory();
 		gdtc->avail = global_dirtyable_memory();
-		gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);
+		gdtc->dirty = nr_reclaimable + global_writeback_memory();
 
 		domain_dirty_limits(gdtc);
 
@@ -1929,20 +1964,22 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 	struct dirty_throttle_control * const gdtc = &gdtc_stor;
 	struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ?
 						     &mdtc_stor : NULL;
+	unsigned long wb_reclaimable;
 
 	/*
 	 * Similar to balance_dirty_pages() but ignores pages being written
 	 * as we're trying to decide whether to put more under writeback.
 	 */
 	gdtc->avail = global_dirtyable_memory();
-	gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) +
-		      global_node_page_state(NR_UNSTABLE_NFS);
+	gdtc->dirty = global_dirty_memory();
 	domain_dirty_limits(gdtc);
 
 	if (gdtc->dirty > gdtc->bg_thresh)
 		return true;
 
-	if (wb_stat(wb, WB_RECLAIMABLE) >
+	wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE) +
+		BtoP(wb_stat(wb, WB_METADATA_DIRTY_BYTES));
+	if (wb_reclaimable >
 	    wb_calc_thresh(gdtc->wb, gdtc->bg_thresh))
 		return true;
 
@@ -1957,7 +1994,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 		if (mdtc->dirty > mdtc->bg_thresh)
 			return true;
 
-		if (wb_stat(wb, WB_RECLAIMABLE) >
+		if (wb_reclaimable >
 		    wb_calc_thresh(mdtc->wb, mdtc->bg_thresh))
 			return true;
 	}
@@ -1979,8 +2016,7 @@ int dirty_writeback_centisecs_handler(struct ctl_table *table, int write,
 void laptop_mode_timer_fn(unsigned long data)
 {
 	struct request_queue *q = (struct request_queue *)data;
-	int nr_pages = global_node_page_state(NR_FILE_DIRTY) +
-		global_node_page_state(NR_UNSTABLE_NFS);
+	int nr_pages = global_dirty_memory();
 	struct bdi_writeback *wb;
 
 	/*
@@ -2444,6 +2480,84 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 EXPORT_SYMBOL(account_page_dirtied);
 
 /*
+ * account_metadata_dirtied
+ * @page - the page being dirtied
+ * @bdi - the bdi that owns this page
+ * @bytes - the number of bytes being dirtied
+ *
+ * Do the dirty page accounting for metadata pages that aren't backed by an
+ * address_space.
+ */
+void account_metadata_dirtied(struct page *page, struct backing_dev_info *bdi,
+			      long bytes)
+{
+	mod_node_page_state(page_pgdat(page), NR_METADATA_DIRTY_BYTES,
+			    bytes);
+	__add_wb_stat(&bdi->wb, WB_DIRTIED_BYTES, bytes);
+	__add_wb_stat(&bdi->wb, WB_METADATA_DIRTY_BYTES, bytes);
+	current->nr_dirtied++;
+	task_io_account_write(bytes);
+	this_cpu_inc(bdp_ratelimits);
+}
+EXPORT_SYMBOL(account_metadata_dirtied);
+
+/*
+ * account_metadata_cleaned
+ * @page - the page being cleaned
+ * @bdi - the bdi that owns this page
+ * @bytes - the number of bytes cleaned
+ *
+ * Called on a no longer dirty metadata page.
+ */
+void account_metadata_cleaned(struct page *page, struct backing_dev_info *bdi,
+			      long bytes)
+{
+	mod_node_page_state(page_pgdat(page), NR_METADATA_DIRTY_BYTES,
+			    -bytes);
+	__add_wb_stat(&bdi->wb, WB_METADATA_DIRTY_BYTES, -bytes);
+	task_io_account_cancelled_write(bytes);
+}
+EXPORT_SYMBOL(account_metadata_cleaned);
+
+/*
+ * account_metadata_writeback
+ * @page - the page being marked as writeback
+ * @bdi - the bdi that owns this page
+ * @bytes - the number of bytes we are submitting for writeback
+ *
+ * Called on a metadata page that has been marked writeback.
+ */
+void account_metadata_writeback(struct page *page,
+				struct backing_dev_info *bdi, long bytes)
+{
+	__add_wb_stat(&bdi->wb, WB_METADATA_DIRTY_BYTES, -bytes);
+	mod_node_page_state(page_pgdat(page), NR_METADATA_DIRTY_BYTES,
+					 -bytes);
+	__add_wb_stat(&bdi->wb, WB_METADATA_WRITEBACK_BYTES, bytes);
+	mod_node_page_state(page_pgdat(page), NR_METADATA_WRITEBACK_BYTES,
+					 bytes);
+}
+EXPORT_SYMBOL(account_metadata_writeback);
+
+/*
+ * account_metadata_end_writeback
+ * @page - the page we are ending writeback on
+ * @bdi - the bdi that owns this page
+ * @bytes - the number of bytes that just ended writeback
+ *
+ * Called on a metadata page that has completed writeback.
+ */
+void account_metadata_end_writeback(struct page *page,
+				    struct backing_dev_info *bdi, long bytes)
+{
+	__add_wb_stat(&bdi->wb, WB_METADATA_WRITEBACK_BYTES, -bytes);
+	mod_node_page_state(page_pgdat(page), NR_METADATA_WRITEBACK_BYTES,
+			    -bytes);
+	__wb_writeout_add(&bdi->wb, bytes);
+}
+EXPORT_SYMBOL(account_metadata_end_writeback);
+
+/*
  * Helper function for deaccounting dirty page without writeback.
  *
  * Caller must hold lock_page_memcg().
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c841af88836a..aab0dd6aa8d7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4640,6 +4640,8 @@ static bool show_mem_node_skip(unsigned int flags, int nid, nodemask_t *nodemask
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
+#define BtoK(x) ((x) >> 10)
+#define BtoP(x) ((x) >> PAGE_SHIFT)
 
 static void show_migration_types(unsigned char type)
 {
@@ -4694,10 +4696,11 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 
 	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
 		" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
-		" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
-		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
-		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
-		" free:%lu free_pcp:%lu free_cma:%lu\n",
+		" unevictable:%lu metadata:%lu dirty:%lu metadata_dirty:%lu\n"
+		" writeback:%lu unstable:%lu metadata_writeback:%lu\n"
+		" slab_reclaimable:%lu slab_unreclaimable:%lu mapped:%lu\n"
+		" shmem:%lu pagetables:%lu bounce:%lu free:%lu free_pcp:%lu\n"
+	        " free_cma:%lu\n",
 		global_node_page_state(NR_ACTIVE_ANON),
 		global_node_page_state(NR_INACTIVE_ANON),
 		global_node_page_state(NR_ISOLATED_ANON),
@@ -4705,9 +4708,12 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		global_node_page_state(NR_INACTIVE_FILE),
 		global_node_page_state(NR_ISOLATED_FILE),
 		global_node_page_state(NR_UNEVICTABLE),
+		BtoP(global_node_page_state(NR_METADATA_BYTES)),
 		global_node_page_state(NR_FILE_DIRTY),
+		BtoP(global_node_page_state(NR_METADATA_DIRTY_BYTES)),
 		global_node_page_state(NR_WRITEBACK),
 		global_node_page_state(NR_UNSTABLE_NFS),
+		BtoP(global_node_page_state(NR_METADATA_WRITEBACK_BYTES)),
 		global_node_page_state(NR_SLAB_RECLAIMABLE),
 		global_node_page_state(NR_SLAB_UNRECLAIMABLE),
 		global_node_page_state(NR_FILE_MAPPED),
@@ -4730,9 +4736,12 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			" unevictable:%lukB"
 			" isolated(anon):%lukB"
 			" isolated(file):%lukB"
+			" metadata:%lukB"
 			" mapped:%lukB"
 			" dirty:%lukB"
+			" metadata_dirty:%lukB"
 			" writeback:%lukB"
+			" metadata_writeback:%lukB"
 			" shmem:%lukB"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			" shmem_thp: %lukB"
@@ -4751,9 +4760,12 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(node_page_state(pgdat, NR_UNEVICTABLE)),
 			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
 			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
+			BtoK(node_page_state(pgdat, NR_METADATA_BYTES)),
 			K(node_page_state(pgdat, NR_FILE_MAPPED)),
 			K(node_page_state(pgdat, NR_FILE_DIRTY)),
+			BtoK(node_page_state(pgdat, NR_METADATA_DIRTY_BYTES)),
 			K(node_page_state(pgdat, NR_WRITEBACK)),
+			BtoK(node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES)),
 			K(node_page_state(pgdat, NR_SHMEM)),
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			K(node_page_state(pgdat, NR_SHMEM_THPS) * HPAGE_PMD_NR),
diff --git a/mm/util.c b/mm/util.c
index 34e57fae959d..681d62631ee0 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -616,6 +616,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
 		free = global_zone_page_state(NR_FREE_PAGES);
 		free += global_node_page_state(NR_FILE_PAGES);
+		free += global_node_page_state(NR_METADATA_BYTES) >> PAGE_SHIFT;
 
 		/*
 		 * shmem pages shouldn't be counted as free in this
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 13d711dd8776..8589a718a9c2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -225,7 +225,8 @@ unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
 
 	nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
 	     node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
-	     node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
+	     node_page_state_snapshot(pgdat, NR_ISOLATED_FILE) +
+	     (node_page_state_snapshot(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT);
 
 	if (get_nr_swap_pages() > 0)
 		nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4bb13e72ac97..0b32e6381590 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -273,6 +273,13 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 
 	t = __this_cpu_read(pcp->stat_threshold);
 
+	/*
+	 * If this item is counted in bytes and not pages adjust the threshold
+	 * accordingly.
+	 */
+	if (is_bytes_node_stat(item))
+		t <<= PAGE_SHIFT;
+
 	if (unlikely(x > t || x < -t)) {
 		node_page_state_add(x, pgdat, item);
 		x = 0;
@@ -1090,6 +1097,9 @@ const char * const vmstat_text[] = {
 	"nr_vmscan_immediate_reclaim",
 	"nr_dirtied",
 	"nr_written",
+	"nr_metadata_dirty_bytes",
+	"nr_metadata_writeback_bytes",
+	"nr_metadata_bytes",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.7.5


* [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2017-12-11 21:55 [PATCH v3 00/11] Metadata specific accouting and dirty writeout Josef Bacik
                   ` (4 preceding siblings ...)
  2017-12-11 21:55 ` [PATCH v3 05/10] writeback: add counters for metadata usage Josef Bacik
@ 2017-12-11 21:55 ` Josef Bacik
  2017-12-11 23:36   ` Dave Chinner
  2017-12-19 12:21   ` Jan Kara
  2017-12-11 21:55 ` [PATCH v3 07/10] export radix_tree_iter_tag_set Josef Bacik
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 31+ messages in thread
From: Josef Bacik @ 2017-12-11 21:55 UTC (permalink / raw)
  To: hannes, linux-mm, akpm, jack, linux-fsdevel, kernel-team, linux-btrfs
  Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

Now that we have metadata counters in the VM, we need to provide a way to kick
writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
allows file systems to write back whatever dirty metadata they need to, based
on the writeback needs of the system.  Since there is no inode to key off of,
we keep a list of dirty super blocks on the bdi.  From there we can find any
dirty sb's on the bdi we are currently doing writeback on and call into their
->write_metadata callback.
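
A filesystem opts in by filling in the new callback and adding its super
block to the bdi's dirty_sb_list at mount time (patch 8 does the latter in
open_ctree()).  The sketch below only illustrates the expected shape; the
foo_* names are placeholders, and btrfs's real implementation shows up in
patch 8 as btrfs_write_ebs():

static void foo_write_metadata(struct super_block *sb,
                               struct writeback_control *wbc)
{
        struct foo_fs_info *fs_info = sb->s_fs_info;

        /*
         * Submit dirty metadata until the nr_to_write budget from the
         * flusher is used up.  The helper is expected to decrement
         * wbc->nr_to_write as it submits pages; writeback_sb_metadata()
         * uses whatever is left to account how much work was done.
         */
        while (wbc->nr_to_write > 0) {
                long submitted = foo_write_some_metadata(fs_info, wbc);

                if (submitted <= 0)
                        break;
        }
}

static const struct super_operations foo_super_ops = {
        /* ... the usual callbacks ... */
        .write_metadata = foo_write_metadata,
};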

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tejun Heo <tj@kernel.org>
---
 fs/fs-writeback.c                | 72 ++++++++++++++++++++++++++++++++++++----
 fs/super.c                       |  6 ++++
 include/linux/backing-dev-defs.h |  2 ++
 include/linux/fs.h               |  4 +++
 mm/backing-dev.c                 |  2 ++
 5 files changed, 80 insertions(+), 6 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 987448ed7698..fba703dff678 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
 	return pages;
 }
 
+static long writeback_sb_metadata(struct super_block *sb,
+				  struct bdi_writeback *wb,
+				  struct wb_writeback_work *work)
+{
+	struct writeback_control wbc = {
+		.sync_mode		= work->sync_mode,
+		.tagged_writepages	= work->tagged_writepages,
+		.for_kupdate		= work->for_kupdate,
+		.for_background		= work->for_background,
+		.for_sync		= work->for_sync,
+		.range_cyclic		= work->range_cyclic,
+		.range_start		= 0,
+		.range_end		= LLONG_MAX,
+	};
+	long write_chunk;
+
+	write_chunk = writeback_chunk_size(wb, work);
+	wbc.nr_to_write = write_chunk;
+	sb->s_op->write_metadata(sb, &wbc);
+	work->nr_pages -= write_chunk - wbc.nr_to_write;
+
+	return write_chunk - wbc.nr_to_write;
+}
+
+
 /*
  * Write a portion of b_io inodes which belong to @sb.
  *
@@ -1505,6 +1530,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 	unsigned long start_time = jiffies;
 	long write_chunk;
 	long wrote = 0;  /* count both pages and inodes */
+	bool done = false;
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1621,12 +1647,18 @@ static long writeback_sb_inodes(struct super_block *sb,
 		 * background threshold and other termination conditions.
 		 */
 		if (wrote) {
-			if (time_is_before_jiffies(start_time + HZ / 10UL))
-				break;
-			if (work->nr_pages <= 0)
+			if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+			    work->nr_pages <= 0) {
+				done = true;
 				break;
+			}
 		}
 	}
+	if (!done && sb->s_op->write_metadata) {
+		spin_unlock(&wb->list_lock);
+		wrote += writeback_sb_metadata(sb, wb, work);
+		spin_lock(&wb->list_lock);
+	}
 	return wrote;
 }
 
@@ -1635,6 +1667,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 {
 	unsigned long start_time = jiffies;
 	long wrote = 0;
+	bool done = false;
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1654,12 +1687,39 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 
 		/* refer to the same tests at the end of writeback_sb_inodes */
 		if (wrote) {
-			if (time_is_before_jiffies(start_time + HZ / 10UL))
-				break;
-			if (work->nr_pages <= 0)
+			if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+			    work->nr_pages <= 0) {
+				done = true;
 				break;
+			}
 		}
 	}
+
+	if (!done && wb_stat(wb, WB_METADATA_DIRTY_BYTES)) {
+		LIST_HEAD(list);
+
+		spin_unlock(&wb->list_lock);
+		spin_lock(&wb->bdi->sb_list_lock);
+		list_splice_init(&wb->bdi->dirty_sb_list, &list);
+		while (!list_empty(&list)) {
+			struct super_block *sb;
+
+			sb = list_first_entry(&list, struct super_block,
+					      s_bdi_dirty_list);
+			list_move_tail(&sb->s_bdi_dirty_list,
+				       &wb->bdi->dirty_sb_list);
+			if (!sb->s_op->write_metadata)
+				continue;
+			if (!trylock_super(sb))
+				continue;
+			spin_unlock(&wb->bdi->sb_list_lock);
+			wrote += writeback_sb_metadata(sb, wb, work);
+			spin_lock(&wb->bdi->sb_list_lock);
+			up_read(&sb->s_umount);
+		}
+		spin_unlock(&wb->bdi->sb_list_lock);
+		spin_lock(&wb->list_lock);
+	}
 	/* Leave any unwritten inodes on b_io */
 	return wrote;
 }
diff --git a/fs/super.c b/fs/super.c
index 166c4ee0d0ed..2290bef486a3 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -214,6 +214,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	spin_lock_init(&s->s_inode_list_lock);
 	INIT_LIST_HEAD(&s->s_inodes_wb);
 	spin_lock_init(&s->s_inode_wblist_lock);
+	INIT_LIST_HEAD(&s->s_bdi_dirty_list);
 
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
@@ -446,6 +447,11 @@ void generic_shutdown_super(struct super_block *sb)
 	spin_unlock(&sb_lock);
 	up_write(&sb->s_umount);
 	if (sb->s_bdi != &noop_backing_dev_info) {
+		if (!list_empty(&sb->s_bdi_dirty_list)) {
+			spin_lock(&sb->s_bdi->sb_list_lock);
+			list_del_init(&sb->s_bdi_dirty_list);
+			spin_unlock(&sb->s_bdi->sb_list_lock);
+		}
 		bdi_put(sb->s_bdi);
 		sb->s_bdi = &noop_backing_dev_info;
 	}
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 78c65e2910dc..a961f9a51a38 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -176,6 +176,8 @@ struct backing_dev_info {
 
 	struct timer_list laptop_mode_wb_timer;
 
+	spinlock_t sb_list_lock;
+	struct list_head dirty_sb_list;
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *debug_dir;
 	struct dentry *debug_stats;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 339e73742e73..298a28eaed2b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1440,6 +1440,8 @@ struct super_block {
 
 	spinlock_t		s_inode_wblist_lock;
 	struct list_head	s_inodes_wb;	/* writeback inodes */
+
+	struct list_head        s_bdi_dirty_list;
 } __randomize_layout;
 
 /* Helper functions so that in most cases filesystems will
@@ -1830,6 +1832,8 @@ struct super_operations {
 				  struct shrink_control *);
 	long (*free_cached_objects)(struct super_block *,
 				    struct shrink_control *);
+	void (*write_metadata)(struct super_block *sb,
+			       struct writeback_control *wbc);
 };
 
 /*
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 0aad67bc0898..e3aa4e0dd15e 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -839,6 +839,8 @@ static int bdi_init(struct backing_dev_info *bdi)
 	bdi->max_prop_frac = FPROP_FRAC_BASE;
 	INIT_LIST_HEAD(&bdi->bdi_list);
 	INIT_LIST_HEAD(&bdi->wb_list);
+	INIT_LIST_HEAD(&bdi->dirty_sb_list);
+	spin_lock_init(&bdi->sb_list_lock);
 	init_waitqueue_head(&bdi->wb_waitq);
 
 	ret = cgwb_bdi_init(bdi);
-- 
2.7.5


* [PATCH v3 07/10] export radix_tree_iter_tag_set
  2017-12-11 21:55 [PATCH v3 00/11] Metadata specific accouting and dirty writeout Josef Bacik
                   ` (5 preceding siblings ...)
  2017-12-11 21:55 ` [PATCH v3 06/10] writeback: introduce super_operations->write_metadata Josef Bacik
@ 2017-12-11 21:55 ` Josef Bacik
  2017-12-11 21:55 ` [PATCH v3 08/10] Btrfs: kill the btree_inode Josef Bacik
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Josef Bacik @ 2017-12-11 21:55 UTC (permalink / raw)
  To: hannes, linux-mm, akpm, jack, linux-fsdevel, kernel-team, linux-btrfs
  Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

We use this in btrfs to tag dirty extent buffers in the metadata radix tree
for writeback.
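
The export is needed so a module can retag entries while iterating its own
radix tree; patch 8 does exactly this in tag_ebs_for_writeback() to promote
dirty extent buffers to TOWRITE for data-integrity writeback.  A minimal
sketch of that pattern (the tree and locking here are generic, not btrfs
code):

static void tag_dirty_for_towrite(struct radix_tree_root *root,
                                  spinlock_t *lock,
                                  unsigned long start, unsigned long end)
{
        struct radix_tree_iter iter;
        void **slot;

        spin_lock_irq(lock);
        radix_tree_for_each_tagged(slot, root, &iter, start,
                                   PAGECACHE_TAG_DIRTY) {
                if (iter.index > end)
                        break;
                /* retag in place without restarting the iteration */
                radix_tree_iter_tag_set(root, &iter, PAGECACHE_TAG_TOWRITE);
        }
        spin_unlock_irq(lock);
}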

Acked-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 lib/radix-tree.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 8b1feca1230a..0c1cde9fcb69 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1459,6 +1459,7 @@ void radix_tree_iter_tag_set(struct radix_tree_root *root,
 {
 	node_tag_set(root, iter->node, tag, iter_offset(iter));
 }
+EXPORT_SYMBOL(radix_tree_iter_tag_set);
 
 static void node_tag_clear(struct radix_tree_root *root,
 				struct radix_tree_node *node,
-- 
2.7.5


* [PATCH v3 08/10] Btrfs: kill the btree_inode
  2017-12-11 21:55 [PATCH v3 00/11] Metadata specific accouting and dirty writeout Josef Bacik
                   ` (6 preceding siblings ...)
  2017-12-11 21:55 ` [PATCH v3 07/10] export radix_tree_iter_tag_set Josef Bacik
@ 2017-12-11 21:55 ` Josef Bacik
  2017-12-11 21:55 ` [PATCH v3 09/10] btrfs: rework end io for extent buffer reads Josef Bacik
  2017-12-11 21:55 ` [PATCH v3 10/10] btrfs: add NR_METADATA_BYTES accounting Josef Bacik
  9 siblings, 0 replies; 31+ messages in thread
From: Josef Bacik @ 2017-12-11 21:55 UTC (permalink / raw)
  To: hannes, linux-mm, akpm, jack, linux-fsdevel, kernel-team, linux-btrfs
  Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

In order to more efficiently support sub-page blocksizes we need to stop
allocating pages from pagecache for our metadata.  Instead switch to using the
account_metadata* counters to keep the system aware of how much dirty metadata
we have, and use the ->free_cached_objects super operation to handle freeing up
extent buffers.  This greatly simplifies how we deal with extent buffers, since
we no longer have to tie page cache reclamation to the extent buffer code.
This will also allow us to simply kmalloc() our data for sub-page blocksizes.
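
The super.c hunk that wires this up isn't shown below, so here is a rough
sketch of the shape such a hookup takes: count the extent buffers sitting on
eb_info->lru_list in ->nr_cached_objects and let ->free_cached_objects walk
the lru to drop clean, unreferenced buffers.  Only the list_lru plumbing and
btrfs_sb() come from the patch; the function bodies and the eb_lru_isolate()
callback are assumptions for illustration:

static long btrfs_nr_cached_objects(struct super_block *sb,
                                    struct shrink_control *sc)
{
        struct btrfs_fs_info *fs_info = btrfs_sb(sb);

        return list_lru_shrink_count(&fs_info->eb_info->lru_list, sc);
}

static long btrfs_free_cached_objects(struct super_block *sb,
                                      struct shrink_control *sc)
{
        struct btrfs_fs_info *fs_info = btrfs_sb(sb);

        /*
         * eb_lru_isolate() (assumed) would skip buffers that are dirty,
         * under IO or still referenced, and free the rest.
         */
        return list_lru_shrink_walk(&fs_info->eb_info->lru_list, sc,
                                    eb_lru_isolate, NULL);
}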

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/btrfs_inode.h                 |   1 -
 fs/btrfs/ctree.c                       |  18 +-
 fs/btrfs/ctree.h                       |  17 +-
 fs/btrfs/dir-item.c                    |   2 +-
 fs/btrfs/disk-io.c                     | 386 ++++----------
 fs/btrfs/extent-tree.c                 |  14 +-
 fs/btrfs/extent_io.c                   | 906 ++++++++++++++++++---------------
 fs/btrfs/extent_io.h                   |  51 +-
 fs/btrfs/inode.c                       |   6 +-
 fs/btrfs/print-tree.c                  |  13 +-
 fs/btrfs/reada.c                       |   2 +-
 fs/btrfs/root-tree.c                   |   2 +-
 fs/btrfs/super.c                       |  31 +-
 fs/btrfs/tests/btrfs-tests.c           |  36 +-
 fs/btrfs/tests/extent-buffer-tests.c   |   3 +-
 fs/btrfs/tests/extent-io-tests.c       |   4 +-
 fs/btrfs/tests/free-space-tree-tests.c |   3 +-
 fs/btrfs/tests/inode-tests.c           |   4 +-
 fs/btrfs/tests/qgroup-tests.c          |   3 +-
 fs/btrfs/transaction.c                 |  13 +-
 20 files changed, 744 insertions(+), 771 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index f9c6887a8b6c..24582650622d 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -241,7 +241,6 @@ static inline u64 btrfs_ino(const struct btrfs_inode *inode)
 	u64 ino = inode->location.objectid;
 
 	/*
-	 * !ino: btree_inode
 	 * type == BTRFS_ROOT_ITEM_KEY: subvol dir
 	 */
 	if (!ino || inode->location.type == BTRFS_ROOT_ITEM_KEY)
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 531e0a8645b0..3c6610b5d0d3 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1361,7 +1361,8 @@ tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct btrfs_path *path,
 
 	if (tm->op == MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
 		BUG_ON(tm->slot != 0);
-		eb_rewin = alloc_dummy_extent_buffer(fs_info, eb->start);
+		eb_rewin = alloc_dummy_extent_buffer(fs_info->eb_info,
+						     eb->start, eb->len);
 		if (!eb_rewin) {
 			btrfs_tree_read_unlock_blocking(eb);
 			free_extent_buffer(eb);
@@ -1444,7 +1445,8 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
 	} else if (old_root) {
 		btrfs_tree_read_unlock(eb_root);
 		free_extent_buffer(eb_root);
-		eb = alloc_dummy_extent_buffer(fs_info, logical);
+		eb = alloc_dummy_extent_buffer(root->fs_info->eb_info, logical,
+					       root->fs_info->nodesize);
 	} else {
 		btrfs_set_lock_blocking_rw(eb_root, BTRFS_READ_LOCK);
 		eb = btrfs_clone_extent_buffer(eb_root);
@@ -1675,7 +1677,7 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
 			continue;
 		}
 
-		cur = find_extent_buffer(fs_info, blocknr);
+		cur = find_extent_buffer(fs_info->eb_info, blocknr);
 		if (cur)
 			uptodate = btrfs_buffer_uptodate(cur, gen, 0);
 		else
@@ -1748,7 +1750,7 @@ static noinline int generic_bin_search(struct extent_buffer *eb,
 	int err;
 
 	if (low > high) {
-		btrfs_err(eb->fs_info,
+		btrfs_err(eb->eb_info->fs_info,
 		 "%s: low (%d) > high (%d) eb %llu owner %llu level %d",
 			  __func__, low, high, eb->start,
 			  btrfs_header_owner(eb), btrfs_header_level(eb));
@@ -2260,7 +2262,7 @@ static void reada_for_search(struct btrfs_fs_info *fs_info,
 
 	search = btrfs_node_blockptr(node, slot);
 	blocksize = fs_info->nodesize;
-	eb = find_extent_buffer(fs_info, search);
+	eb = find_extent_buffer(fs_info->eb_info, search);
 	if (eb) {
 		free_extent_buffer(eb);
 		return;
@@ -2319,7 +2321,7 @@ static noinline void reada_for_balance(struct btrfs_fs_info *fs_info,
 	if (slot > 0) {
 		block1 = btrfs_node_blockptr(parent, slot - 1);
 		gen = btrfs_node_ptr_generation(parent, slot - 1);
-		eb = find_extent_buffer(fs_info, block1);
+		eb = find_extent_buffer(fs_info->eb_info, block1);
 		/*
 		 * if we get -eagain from btrfs_buffer_uptodate, we
 		 * don't want to return eagain here.  That will loop
@@ -2332,7 +2334,7 @@ static noinline void reada_for_balance(struct btrfs_fs_info *fs_info,
 	if (slot + 1 < nritems) {
 		block2 = btrfs_node_blockptr(parent, slot + 1);
 		gen = btrfs_node_ptr_generation(parent, slot + 1);
-		eb = find_extent_buffer(fs_info, block2);
+		eb = find_extent_buffer(fs_info->eb_info, block2);
 		if (eb && btrfs_buffer_uptodate(eb, gen, 1) != 0)
 			block2 = 0;
 		free_extent_buffer(eb);
@@ -2450,7 +2452,7 @@ read_block_for_search(struct btrfs_root *root, struct btrfs_path *p,
 	blocknr = btrfs_node_blockptr(b, slot);
 	gen = btrfs_node_ptr_generation(b, slot);
 
-	tmp = find_extent_buffer(fs_info, blocknr);
+	tmp = find_extent_buffer(fs_info->eb_info, blocknr);
 	if (tmp) {
 		/* first we do an atomic uptodate check */
 		if (btrfs_buffer_uptodate(tmp, gen, 1) > 0) {
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4ffbe9f07cf7..a7c764a1ee48 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -40,6 +40,7 @@
 #include <linux/sizes.h>
 #include <linux/dynamic_debug.h>
 #include <linux/refcount.h>
+#include <linux/list_lru.h>
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
@@ -701,6 +702,7 @@ struct btrfs_device;
 struct btrfs_fs_devices;
 struct btrfs_balance_control;
 struct btrfs_delayed_root;
+struct btrfs_eb_info;
 
 #define BTRFS_FS_BARRIER			1
 #define BTRFS_FS_CLOSING_START			2
@@ -818,7 +820,7 @@ struct btrfs_fs_info {
 	struct btrfs_super_block *super_copy;
 	struct btrfs_super_block *super_for_commit;
 	struct super_block *sb;
-	struct inode *btree_inode;
+	struct btrfs_eb_info *eb_info;
 	struct mutex tree_log_mutex;
 	struct mutex transaction_kthread_mutex;
 	struct mutex cleaner_mutex;
@@ -1060,10 +1062,6 @@ struct btrfs_fs_info {
 	/* readahead works cnt */
 	atomic_t reada_works_cnt;
 
-	/* Extent buffer radix tree */
-	spinlock_t buffer_lock;
-	struct radix_tree_root buffer_radix;
-
 	/* next backup root to be overwritten */
 	int backup_root_index;
 
@@ -1563,7 +1561,7 @@ static inline void btrfs_set_device_total_bytes(struct extent_buffer *eb,
 {
 	BUILD_BUG_ON(sizeof(u64) !=
 		     sizeof(((struct btrfs_dev_item *)0))->total_bytes);
-	WARN_ON(!IS_ALIGNED(val, eb->fs_info->sectorsize));
+	WARN_ON(!IS_ALIGNED(val, eb->eb_info->fs_info->sectorsize));
 	btrfs_set_64(eb, s, offsetof(struct btrfs_dev_item, total_bytes), val);
 }
 
@@ -2962,6 +2960,10 @@ static inline int btrfs_need_cleaner_sleep(struct btrfs_fs_info *fs_info)
 
 static inline void free_fs_info(struct btrfs_fs_info *fs_info)
 {
+	if (fs_info->eb_info) {
+		list_lru_destroy(&fs_info->eb_info->lru_list);
+		kfree(fs_info->eb_info);
+	}
 	kfree(fs_info->balance_ctl);
 	kfree(fs_info->delayed_root);
 	kfree(fs_info->extent_root);
@@ -3185,9 +3187,6 @@ int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *new_root,
 			     struct btrfs_root *parent_root,
 			     u64 new_dirid);
-int btrfs_merge_bio_hook(struct page *page, unsigned long offset,
-			 size_t size, struct bio *bio,
-			 unsigned long bio_flags);
 void btrfs_set_range_writeback(void *private_data, u64 start, u64 end);
 int btrfs_page_mkwrite(struct vm_fault *vmf);
 int btrfs_readpage(struct file *file, struct page *page);
diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index 41cb9196eaa8..f5782523e723 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -496,7 +496,7 @@ int verify_dir_item(struct btrfs_fs_info *fs_info,
 bool btrfs_is_name_len_valid(struct extent_buffer *leaf, int slot,
 			     unsigned long start, u16 name_len)
 {
-	struct btrfs_fs_info *fs_info = leaf->fs_info;
+	struct btrfs_fs_info *fs_info = leaf->eb_info->fs_info;
 	struct btrfs_key key;
 	u32 read_start;
 	u32 read_end;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8b6df7688d52..d9d69e181942 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -215,56 +215,6 @@ void btrfs_set_buffer_lockdep_class(u64 objectid, struct extent_buffer *eb,
 
 #endif
 
-/*
- * extents on the btree inode are pretty simple, there's one extent
- * that covers the entire device
- */
-static struct extent_map *btree_get_extent(struct btrfs_inode *inode,
-		struct page *page, size_t pg_offset, u64 start, u64 len,
-		int create)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
-	struct extent_map_tree *em_tree = &inode->extent_tree;
-	struct extent_map *em;
-	int ret;
-
-	read_lock(&em_tree->lock);
-	em = lookup_extent_mapping(em_tree, start, len);
-	if (em) {
-		em->bdev = fs_info->fs_devices->latest_bdev;
-		read_unlock(&em_tree->lock);
-		goto out;
-	}
-	read_unlock(&em_tree->lock);
-
-	em = alloc_extent_map();
-	if (!em) {
-		em = ERR_PTR(-ENOMEM);
-		goto out;
-	}
-	em->start = 0;
-	em->len = (u64)-1;
-	em->block_len = (u64)-1;
-	em->block_start = 0;
-	em->bdev = fs_info->fs_devices->latest_bdev;
-
-	write_lock(&em_tree->lock);
-	ret = add_extent_mapping(em_tree, em, 0);
-	if (ret == -EEXIST) {
-		free_extent_map(em);
-		em = lookup_extent_mapping(em_tree, start, len);
-		if (!em)
-			em = ERR_PTR(-EIO);
-	} else if (ret) {
-		free_extent_map(em);
-		em = ERR_PTR(ret);
-	}
-	write_unlock(&em_tree->lock);
-
-out:
-	return em;
-}
-
 u32 btrfs_csum_data(const char *data, u32 seed, size_t len)
 {
 	return btrfs_crc32c(seed, data, len);
@@ -346,11 +296,11 @@ static int csum_tree_block(struct btrfs_fs_info *fs_info,
  * detect blocks that either didn't get written at all or got written
  * in the wrong place.
  */
-static int verify_parent_transid(struct extent_io_tree *io_tree,
-				 struct extent_buffer *eb, u64 parent_transid,
+static int verify_parent_transid(struct extent_buffer *eb, u64 parent_transid,
 				 int atomic)
 {
 	struct extent_state *cached_state = NULL;
+	struct extent_io_tree *io_tree = &eb->eb_info->io_tree;
 	int ret;
 	bool need_lock = (current->journal_info == BTRFS_SEND_TRANS_STUB);
 
@@ -372,7 +322,7 @@ static int verify_parent_transid(struct extent_io_tree *io_tree,
 		ret = 0;
 		goto out;
 	}
-	btrfs_err_rl(eb->fs_info,
+	btrfs_err_rl(eb->eb_info->fs_info,
 		"parent transid verify failed on %llu wanted %llu found %llu",
 			eb->start,
 			parent_transid, btrfs_header_generation(eb));
@@ -443,7 +393,6 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
 					  struct extent_buffer *eb,
 					  u64 parent_transid)
 {
-	struct extent_io_tree *io_tree;
 	int failed = 0;
 	int ret;
 	int num_copies = 0;
@@ -451,13 +400,10 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
 	int failed_mirror = 0;
 
 	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
-	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
 	while (1) {
-		ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
-					       btree_get_extent, mirror_num);
+		ret = read_extent_buffer_pages(eb, WAIT_COMPLETE, mirror_num);
 		if (!ret) {
-			if (!verify_parent_transid(io_tree, eb,
-						   parent_transid, 0))
+			if (!verify_parent_transid(eb, parent_transid, 0))
 				break;
 			else
 				ret = -EIO;
@@ -502,24 +448,11 @@ static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
 
 static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
 {
-	u64 start = page_offset(page);
-	u64 found_start;
 	struct extent_buffer *eb;
 
 	eb = (struct extent_buffer *)page->private;
 	if (page != eb->pages[0])
 		return 0;
-
-	found_start = btrfs_header_bytenr(eb);
-	/*
-	 * Please do not consolidate these warnings into a single if.
-	 * It is useful to know what went wrong.
-	 */
-	if (WARN_ON(found_start != start))
-		return -EUCLEAN;
-	if (WARN_ON(!PageUptodate(page)))
-		return -EUCLEAN;
-
 	ASSERT(memcmp_extent_buffer(eb, fs_info->fsid,
 			btrfs_header_fsid(), BTRFS_FSID_SIZE) == 0);
 
@@ -829,8 +762,8 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	u64 found_start;
 	int found_level;
 	struct extent_buffer *eb;
-	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
-	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_root *root;
+	struct btrfs_fs_info *fs_info;
 	int ret = 0;
 	int reads_done;
 
@@ -843,6 +776,8 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	 * in memory.  Make sure we have a ref for all this other checks
 	 */
 	extent_buffer_get(eb);
+	fs_info = eb->eb_info->fs_info;
+	root = fs_info->tree_root;
 
 	reads_done = atomic_dec_and_test(&eb->io_pages);
 	if (!reads_done)
@@ -906,11 +841,19 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 		/*
 		 * our io error hook is going to dec the io pages
 		 * again, we have to make sure it has something
-		 * to decrement
+		 * to decrement.
+		 *
+		 * TODO: Kill this, we've re-arranged how this works now so we
+		 * don't need to do this io_pages dance.
 		 */
 		atomic_inc(&eb->io_pages);
 		clear_extent_buffer_uptodate(eb);
 	}
+	if (reads_done) {
+		clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
+		smp_mb__after_atomic();
+		wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
+	}
 	free_extent_buffer(eb);
 out:
 	return ret;
@@ -1075,16 +1018,14 @@ blk_status_t btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 	return 0;
 }
 
-static blk_status_t btree_csum_one_bio(struct bio *bio)
+static blk_status_t btree_csum_one_bio(struct btrfs_fs_info *fs_info, struct bio *bio)
 {
 	struct bio_vec *bvec;
-	struct btrfs_root *root;
 	int i, ret = 0;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, i) {
-		root = BTRFS_I(bvec->bv_page->mapping->host)->root;
-		ret = csum_dirty_buffer(root->fs_info, bvec->bv_page);
+		ret = csum_dirty_buffer(fs_info, bvec->bv_page);
 		if (ret)
 			break;
 	}
@@ -1096,25 +1037,26 @@ static blk_status_t __btree_submit_bio_start(void *private_data, struct bio *bio
 					     int mirror_num, unsigned long bio_flags,
 					     u64 bio_offset)
 {
+	struct btrfs_eb_info *eb_info = private_data;
 	/*
 	 * when we're called for a write, we're already in the async
 	 * submission context.  Just jump into btrfs_map_bio
 	 */
-	return btree_csum_one_bio(bio);
+	return btree_csum_one_bio(eb_info->fs_info, bio);
 }
 
 static blk_status_t __btree_submit_bio_done(void *private_data, struct bio *bio,
 					    int mirror_num, unsigned long bio_flags,
 					    u64 bio_offset)
 {
-	struct inode *inode = private_data;
-	blk_status_t ret;
+	struct btrfs_eb_info *eb_info = private_data;
+	int ret;
 
 	/*
 	 * when we're called for a write, we're already in the async
 	 * submission context.  Just jump into btrfs_map_bio
 	 */
-	ret = btrfs_map_bio(btrfs_sb(inode->i_sb), bio, mirror_num, 1);
+	ret = btrfs_map_bio(eb_info->fs_info, bio, mirror_num, 1);
 	if (ret) {
 		bio->bi_status = ret;
 		bio_endio(bio);
@@ -1122,9 +1064,10 @@ static blk_status_t __btree_submit_bio_done(void *private_data, struct bio *bio,
 	return ret;
 }
 
-static int check_async_write(struct btrfs_inode *bi)
+static int check_async_write(void)
 {
-	if (atomic_read(&bi->sync_writers))
+	/* If we are fsync we can be under a trans handle. */
+	if (current->journal_info)
 		return 0;
 #ifdef CONFIG_X86
 	if (static_cpu_has(X86_FEATURE_XMM4_2))
@@ -1137,9 +1080,9 @@ static blk_status_t btree_submit_bio_hook(void *private_data, struct bio *bio,
 					  int mirror_num, unsigned long bio_flags,
 					  u64 bio_offset)
 {
-	struct inode *inode = private_data;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	int async = check_async_write(BTRFS_I(inode));
+	struct btrfs_eb_info *eb_info = private_data;
+	struct btrfs_fs_info *fs_info = eb_info->fs_info;
+	int async = check_async_write();
 	blk_status_t ret;
 
 	if (bio_op(bio) != REQ_OP_WRITE) {
@@ -1153,7 +1096,7 @@ static blk_status_t btree_submit_bio_hook(void *private_data, struct bio *bio,
 			goto out_w_error;
 		ret = btrfs_map_bio(fs_info, bio, mirror_num, 0);
 	} else if (!async) {
-		ret = btree_csum_one_bio(bio);
+		ret = btree_csum_one_bio(eb_info->fs_info, bio);
 		if (ret)
 			goto out_w_error;
 		ret = btrfs_map_bio(fs_info, bio, mirror_num, 0);
@@ -1178,118 +1121,14 @@ static blk_status_t btree_submit_bio_hook(void *private_data, struct bio *bio,
 	return ret;
 }
 
-#ifdef CONFIG_MIGRATION
-static int btree_migratepage(struct address_space *mapping,
-			struct page *newpage, struct page *page,
-			enum migrate_mode mode)
-{
-	/*
-	 * we can't safely write a btree page from here,
-	 * we haven't done the locking hook
-	 */
-	if (PageDirty(page))
-		return -EAGAIN;
-	/*
-	 * Buffers may be managed in a filesystem specific way.
-	 * We must have no buffers or drop them.
-	 */
-	if (page_has_private(page) &&
-	    !try_to_release_page(page, GFP_KERNEL))
-		return -EAGAIN;
-	return migrate_page(mapping, newpage, page, mode);
-}
-#endif
-
-
-static int btree_writepages(struct address_space *mapping,
-			    struct writeback_control *wbc)
-{
-	struct btrfs_fs_info *fs_info;
-	int ret;
-
-	if (wbc->sync_mode == WB_SYNC_NONE) {
-
-		if (wbc->for_kupdate)
-			return 0;
-
-		fs_info = BTRFS_I(mapping->host)->root->fs_info;
-		/* this is a bit racy, but that's ok */
-		ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
-					     BTRFS_DIRTY_METADATA_THRESH);
-		if (ret < 0)
-			return 0;
-	}
-	return btree_write_cache_pages(mapping, wbc);
-}
-
-static int btree_readpage(struct file *file, struct page *page)
-{
-	struct extent_io_tree *tree;
-	tree = &BTRFS_I(page->mapping->host)->io_tree;
-	return extent_read_full_page(tree, page, btree_get_extent, 0);
-}
-
-static int btree_releasepage(struct page *page, gfp_t gfp_flags)
-{
-	if (PageWriteback(page) || PageDirty(page))
-		return 0;
-
-	return try_release_extent_buffer(page);
-}
-
-static void btree_invalidatepage(struct page *page, unsigned int offset,
-				 unsigned int length)
-{
-	struct extent_io_tree *tree;
-	tree = &BTRFS_I(page->mapping->host)->io_tree;
-	extent_invalidatepage(tree, page, offset);
-	btree_releasepage(page, GFP_NOFS);
-	if (PagePrivate(page)) {
-		btrfs_warn(BTRFS_I(page->mapping->host)->root->fs_info,
-			   "page private not zero on page %llu",
-			   (unsigned long long)page_offset(page));
-		ClearPagePrivate(page);
-		set_page_private(page, 0);
-		put_page(page);
-	}
-}
-
-static int btree_set_page_dirty(struct page *page)
-{
-#ifdef DEBUG
-	struct extent_buffer *eb;
-
-	BUG_ON(!PagePrivate(page));
-	eb = (struct extent_buffer *)page->private;
-	BUG_ON(!eb);
-	BUG_ON(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
-	BUG_ON(!atomic_read(&eb->refs));
-	btrfs_assert_tree_locked(eb);
-#endif
-	return __set_page_dirty_nobuffers(page);
-}
-
-static const struct address_space_operations btree_aops = {
-	.readpage	= btree_readpage,
-	.writepages	= btree_writepages,
-	.releasepage	= btree_releasepage,
-	.invalidatepage = btree_invalidatepage,
-#ifdef CONFIG_MIGRATION
-	.migratepage	= btree_migratepage,
-#endif
-	.set_page_dirty = btree_set_page_dirty,
-};
-
 void readahead_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr)
 {
 	struct extent_buffer *buf = NULL;
-	struct inode *btree_inode = fs_info->btree_inode;
 
 	buf = btrfs_find_create_tree_block(fs_info, bytenr);
 	if (IS_ERR(buf))
 		return;
-	read_extent_buffer_pages(&BTRFS_I(btree_inode)->io_tree,
-				 buf, WAIT_NONE, btree_get_extent, 0);
+	read_extent_buffer_pages(buf, WAIT_NONE, 0);
 	free_extent_buffer(buf);
 }
 
@@ -1297,8 +1136,6 @@ int reada_tree_block_flagged(struct btrfs_fs_info *fs_info, u64 bytenr,
 			 int mirror_num, struct extent_buffer **eb)
 {
 	struct extent_buffer *buf = NULL;
-	struct inode *btree_inode = fs_info->btree_inode;
-	struct extent_io_tree *io_tree = &BTRFS_I(btree_inode)->io_tree;
 	int ret;
 
 	buf = btrfs_find_create_tree_block(fs_info, bytenr);
@@ -1307,8 +1144,7 @@ int reada_tree_block_flagged(struct btrfs_fs_info *fs_info, u64 bytenr,
 
 	set_bit(EXTENT_BUFFER_READAHEAD, &buf->bflags);
 
-	ret = read_extent_buffer_pages(io_tree, buf, WAIT_PAGE_LOCK,
-				       btree_get_extent, mirror_num);
+	ret = read_extent_buffer_pages(buf, WAIT_PAGE_LOCK, mirror_num);
 	if (ret) {
 		free_extent_buffer(buf);
 		return ret;
@@ -1330,21 +1166,22 @@ struct extent_buffer *btrfs_find_create_tree_block(
 						u64 bytenr)
 {
 	if (btrfs_is_testing(fs_info))
-		return alloc_test_extent_buffer(fs_info, bytenr);
+		return alloc_test_extent_buffer(fs_info->eb_info, bytenr,
+						fs_info->nodesize);
 	return alloc_extent_buffer(fs_info, bytenr);
 }
 
 
 int btrfs_write_tree_block(struct extent_buffer *buf)
 {
-	return filemap_fdatawrite_range(buf->pages[0]->mapping, buf->start,
-					buf->start + buf->len - 1);
+	return btree_write_range(buf->eb_info->fs_info, buf->start,
+				 buf->start + buf->len - 1);
 }
 
 void btrfs_wait_tree_block_writeback(struct extent_buffer *buf)
 {
-	filemap_fdatawait_range(buf->pages[0]->mapping,
-			        buf->start, buf->start + buf->len - 1);
+	btree_wait_range(buf->eb_info->fs_info, buf->start,
+			 buf->start + buf->len - 1);
 }
 
 struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr,
@@ -1372,15 +1209,10 @@ void clean_tree_block(struct btrfs_fs_info *fs_info,
 	if (btrfs_header_generation(buf) ==
 	    fs_info->running_transaction->transid) {
 		btrfs_assert_tree_locked(buf);
-
-		if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &buf->bflags)) {
+		if (clear_extent_buffer_dirty(buf))
 			percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
 						 -buf->len,
 						 fs_info->dirty_metadata_batch);
-			/* ugh, clear_extent_buffer_dirty needs to lock the page */
-			btrfs_set_lock_blocking(buf);
-			clear_extent_buffer_dirty(buf);
-		}
 	}
 }
 
@@ -2412,31 +2244,20 @@ static void btrfs_init_balance(struct btrfs_fs_info *fs_info)
 	init_waitqueue_head(&fs_info->balance_wait_q);
 }
 
-static void btrfs_init_btree_inode(struct btrfs_fs_info *fs_info)
+int btrfs_init_eb_info(struct btrfs_fs_info *fs_info)
 {
-	struct inode *inode = fs_info->btree_inode;
-
-	inode->i_ino = BTRFS_BTREE_INODE_OBJECTID;
-	set_nlink(inode, 1);
-	/*
-	 * we set the i_size on the btree inode to the max possible int.
-	 * the real end of the address space is determined by all of
-	 * the devices in the system
-	 */
-	inode->i_size = OFFSET_MAX;
-	inode->i_mapping->a_ops = &btree_aops;
-
-	RB_CLEAR_NODE(&BTRFS_I(inode)->rb_node);
-	extent_io_tree_init(&BTRFS_I(inode)->io_tree, inode);
-	BTRFS_I(inode)->io_tree.track_uptodate = 0;
-	extent_map_tree_init(&BTRFS_I(inode)->extent_tree);
-
-	BTRFS_I(inode)->io_tree.ops = &btree_extent_io_ops;
-
-	BTRFS_I(inode)->root = fs_info->tree_root;
-	memset(&BTRFS_I(inode)->location, 0, sizeof(struct btrfs_key));
-	set_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags);
-	btrfs_insert_inode_hash(inode);
+	struct btrfs_eb_info *eb_info = fs_info->eb_info;
+
+	eb_info->fs_info = fs_info;
+	extent_io_tree_init(&eb_info->io_tree, eb_info);
+	eb_info->io_tree.track_uptodate = 0;
+	eb_info->io_tree.ops = &btree_extent_io_ops;
+	extent_io_tree_init(&eb_info->io_failure_tree, eb_info);
+	INIT_RADIX_TREE(&eb_info->buffer_radix, GFP_ATOMIC);
+	spin_lock_init(&eb_info->buffer_lock);
+	if (list_lru_init(&eb_info->lru_list))
+		return -ENOMEM;
+	return 0;
 }
 
 static void btrfs_init_dev_replace_locks(struct btrfs_fs_info *fs_info)
@@ -2725,7 +2546,6 @@ int open_ctree(struct super_block *sb,
 	}
 
 	INIT_RADIX_TREE(&fs_info->fs_roots_radix, GFP_ATOMIC);
-	INIT_RADIX_TREE(&fs_info->buffer_radix, GFP_ATOMIC);
 	INIT_LIST_HEAD(&fs_info->trans_list);
 	INIT_LIST_HEAD(&fs_info->dead_roots);
 	INIT_LIST_HEAD(&fs_info->delayed_iputs);
@@ -2739,7 +2559,6 @@ int open_ctree(struct super_block *sb,
 	spin_lock_init(&fs_info->tree_mod_seq_lock);
 	spin_lock_init(&fs_info->super_lock);
 	spin_lock_init(&fs_info->qgroup_op_lock);
-	spin_lock_init(&fs_info->buffer_lock);
 	spin_lock_init(&fs_info->unused_bgs_lock);
 	rwlock_init(&fs_info->tree_mod_log_lock);
 	mutex_init(&fs_info->unused_bg_unpin_mutex);
@@ -2785,18 +2604,11 @@ int open_ctree(struct super_block *sb,
 	INIT_LIST_HEAD(&fs_info->ordered_roots);
 	spin_lock_init(&fs_info->ordered_root_lock);
 
-	fs_info->btree_inode = new_inode(sb);
-	if (!fs_info->btree_inode) {
-		err = -ENOMEM;
-		goto fail_bio_counter;
-	}
-	mapping_set_gfp_mask(fs_info->btree_inode->i_mapping, GFP_NOFS);
-
 	fs_info->delayed_root = kmalloc(sizeof(struct btrfs_delayed_root),
 					GFP_KERNEL);
 	if (!fs_info->delayed_root) {
 		err = -ENOMEM;
-		goto fail_iput;
+		goto fail_alloc;
 	}
 	btrfs_init_delayed_root(fs_info->delayed_root);
 
@@ -2810,7 +2622,15 @@ int open_ctree(struct super_block *sb,
 	sb->s_blocksize = BTRFS_BDEV_BLOCKSIZE;
 	sb->s_blocksize_bits = blksize_bits(BTRFS_BDEV_BLOCKSIZE);
 
-	btrfs_init_btree_inode(fs_info);
+	fs_info->eb_info = kzalloc(sizeof(struct btrfs_eb_info), GFP_KERNEL);
+	if (!fs_info->eb_info) {
+		err = -ENOMEM;
+		goto fail_alloc;
+	}
+	if (btrfs_init_eb_info(fs_info)) {
+		err = -ENOMEM;
+		goto fail_alloc;
+	}
 
 	spin_lock_init(&fs_info->block_group_cache_lock);
 	fs_info->block_group_cache_tree = RB_ROOT;
@@ -3243,6 +3063,14 @@ int open_ctree(struct super_block *sb,
 	if (sb_rdonly(sb))
 		return 0;
 
+	/*
+	 * We need to make sure we are on the bdi's dirty list so we get
+	 * writeback requests for our fs properly.
+	 */
+	spin_lock(&sb->s_bdi->sb_list_lock);
+	list_add_tail(&sb->s_bdi->dirty_sb_list, &sb->s_bdi_dirty_list);
+	spin_unlock(&sb->s_bdi->sb_list_lock);
+
 	if (btrfs_test_opt(fs_info, CLEAR_CACHE) &&
 	    btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE)) {
 		clear_free_space_tree = 1;
@@ -3346,7 +3174,8 @@ int open_ctree(struct super_block *sb,
 	 * make sure we're done with the btree inode before we stop our
 	 * kthreads
 	 */
-	filemap_write_and_wait(fs_info->btree_inode->i_mapping);
+	btree_write_range(fs_info, 0, (u64)-1);
+	btree_wait_range(fs_info, 0, (u64)-1);
 
 fail_sysfs:
 	btrfs_sysfs_remove_mounted(fs_info);
@@ -3359,17 +3188,12 @@ int open_ctree(struct super_block *sb,
 
 fail_tree_roots:
 	free_root_pointers(fs_info, 1);
-	invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
-
+	btrfs_invalidate_eb_info(fs_info->eb_info);
 fail_sb_buffer:
 	btrfs_stop_all_workers(fs_info);
 	btrfs_free_block_groups(fs_info);
 fail_alloc:
-fail_iput:
 	btrfs_mapping_tree_free(&fs_info->mapping_tree);
-
-	iput(fs_info->btree_inode);
-fail_bio_counter:
 	percpu_counter_destroy(&fs_info->bio_counter);
 fail_delalloc_bytes:
 	percpu_counter_destroy(&fs_info->delalloc_bytes);
@@ -4041,7 +3865,6 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 	 * we must make sure there is not any read request to
 	 * submit after we stopping all workers.
 	 */
-	invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
 	btrfs_stop_all_workers(fs_info);
 
 	btrfs_free_block_groups(fs_info);
@@ -4049,8 +3872,6 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 	clear_bit(BTRFS_FS_OPEN, &fs_info->flags);
 	free_root_pointers(fs_info, 1);
 
-	iput(fs_info->btree_inode);
-
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 	if (btrfs_test_opt(fs_info, CHECK_INTEGRITY))
 		btrfsic_unmount(fs_info->fs_devices);
@@ -4059,6 +3880,8 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 	btrfs_close_devices(fs_info->fs_devices);
 	btrfs_mapping_tree_free(&fs_info->mapping_tree);
 
+	btrfs_invalidate_eb_info(fs_info->eb_info);
+
 	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
 	percpu_counter_destroy(&fs_info->delalloc_bytes);
 	percpu_counter_destroy(&fs_info->bio_counter);
@@ -4084,14 +3907,12 @@ int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
 			  int atomic)
 {
 	int ret;
-	struct inode *btree_inode = buf->pages[0]->mapping->host;
 
 	ret = extent_buffer_uptodate(buf);
 	if (!ret)
 		return ret;
 
-	ret = verify_parent_transid(&BTRFS_I(btree_inode)->io_tree, buf,
-				    parent_transid, atomic);
+	ret = verify_parent_transid(buf, parent_transid, atomic);
 	if (ret == -EAGAIN)
 		return ret;
 	return !ret;
@@ -4113,8 +3934,8 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
 	if (unlikely(test_bit(EXTENT_BUFFER_DUMMY, &buf->bflags)))
 		return;
 #endif
-	root = BTRFS_I(buf->pages[0]->mapping->host)->root;
-	fs_info = root->fs_info;
+	fs_info = buf->eb_info->fs_info;
+	root = fs_info->tree_root;
 	btrfs_assert_tree_locked(buf);
 	if (transid != fs_info->generation)
 		WARN(1, KERN_CRIT "btrfs transid mismatch buffer %llu, found %llu running %llu\n",
@@ -4140,6 +3961,7 @@ static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info,
 	 * this code, they end up stuck in balance_dirty_pages forever
 	 */
 	int ret;
+	struct super_block *sb = fs_info->sb;
 
 	if (current->flags & PF_MEMALLOC)
 		return;
@@ -4149,10 +3971,8 @@ static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info,
 
 	ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
 				     BTRFS_DIRTY_METADATA_THRESH);
-	if (ret > 0) {
-		balance_dirty_pages_ratelimited(fs_info->sb->s_bdi,
-						fs_info->sb);
-	}
+	if (ret > 0)
+		balance_dirty_pages_ratelimited(sb->s_bdi, sb);
 }
 
 void btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info)
@@ -4167,9 +3987,7 @@ void btrfs_btree_balance_dirty_nodelay(struct btrfs_fs_info *fs_info)
 
 int btrfs_read_buffer(struct extent_buffer *buf, u64 parent_transid)
 {
-	struct btrfs_root *root = BTRFS_I(buf->pages[0]->mapping->host)->root;
-	struct btrfs_fs_info *fs_info = root->fs_info;
-
+	struct btrfs_fs_info *fs_info = buf->eb_info->fs_info;
 	return btree_read_extent_buffer_pages(fs_info, buf, parent_transid);
 }
 
@@ -4513,15 +4331,12 @@ static int btrfs_destroy_marked_extents(struct btrfs_fs_info *fs_info,
 
 		clear_extent_bits(dirty_pages, start, end, mark);
 		while (start <= end) {
-			eb = find_extent_buffer(fs_info, start);
+			eb = find_extent_buffer(fs_info->eb_info, start);
 			start += fs_info->nodesize;
 			if (!eb)
 				continue;
 			wait_on_extent_buffer_writeback(eb);
-
-			if (test_and_clear_bit(EXTENT_BUFFER_DIRTY,
-					       &eb->bflags))
-				clear_extent_buffer_dirty(eb);
+			clear_extent_buffer_dirty(eb);
 			free_extent_buffer_stale(eb);
 		}
 	}
@@ -4710,16 +4525,37 @@ static int btrfs_cleanup_transaction(struct btrfs_fs_info *fs_info)
 
 static struct btrfs_fs_info *btree_fs_info(void *private_data)
 {
-	struct inode *inode = private_data;
-	return btrfs_sb(inode->i_sb);
+	struct btrfs_eb_info *eb_info = private_data;
+	return eb_info->fs_info;
+}
+
+static int btree_merge_bio_hook(struct page *page, unsigned long offset,
+				size_t size, struct bio *bio,
+				unsigned long bio_flags)
+{
+	struct extent_buffer *eb = (struct extent_buffer *)page->private;
+	struct btrfs_fs_info *fs_info = eb->eb_info->fs_info;
+	u64 logical = (u64)bio->bi_iter.bi_sector << 9;
+	u64 length = 0;
+	u64 map_length;
+	int ret;
+
+	length = bio->bi_iter.bi_size;
+	map_length = length;
+	ret = btrfs_map_block(fs_info, bio_op(bio), logical, &map_length,
+			      NULL, 0);
+	if (ret < 0)
+		return ret;
+	if (map_length < length + size)
+		return 1;
+	return 0;
 }
 
 static const struct extent_io_ops btree_extent_io_ops = {
 	/* mandatory callbacks */
 	.submit_bio_hook = btree_submit_bio_hook,
 	.readpage_end_io_hook = btree_readpage_end_io_hook,
-	/* note we're sharing with inode.c for the merge bio hook */
-	.merge_bio_hook = btrfs_merge_bio_hook,
+	.merge_bio_hook = btree_merge_bio_hook,
 	.readpage_io_failed_hook = btree_io_failed_hook,
 	.set_range_writeback = btrfs_set_range_writeback,
 	.tree_fs_info = btree_fs_info,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0bdc10b453b9..a48fb3abed0c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1158,28 +1158,30 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
 			if (type == BTRFS_TREE_BLOCK_REF_KEY)
 				return type;
 			if (type == BTRFS_SHARED_BLOCK_REF_KEY) {
-				ASSERT(eb->fs_info);
+				ASSERT(eb->eb_info);
 				/*
 				 * Every shared one has parent tree
 				 * block, which must be aligned to
 				 * nodesize.
 				 */
 				if (offset &&
-				    IS_ALIGNED(offset, eb->fs_info->nodesize))
+				    IS_ALIGNED(offset,
+					       eb->eb_info->fs_info->nodesize))
 					return type;
 			}
 		} else if (is_data == BTRFS_REF_TYPE_DATA) {
 			if (type == BTRFS_EXTENT_DATA_REF_KEY)
 				return type;
 			if (type == BTRFS_SHARED_DATA_REF_KEY) {
-				ASSERT(eb->fs_info);
+				ASSERT(eb->eb_info->fs_info);
 				/*
 				 * Every shared one has parent tree
 				 * block, which must be aligned to
 				 * nodesize.
 				 */
 				if (offset &&
-				    IS_ALIGNED(offset, eb->fs_info->nodesize))
+				    IS_ALIGNED(offset,
+					       eb->eb_info->fs_info->nodesize))
 					return type;
 			}
 		} else {
@@ -1189,7 +1191,7 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
 	}
 
 	btrfs_print_leaf((struct extent_buffer *)eb);
-	btrfs_err(eb->fs_info, "eb %llu invalid extent inline ref type %d",
+	btrfs_err(eb->eb_info->fs_info, "eb %llu invalid extent inline ref type %d",
 		  eb->start, type);
 	WARN_ON(1);
 
@@ -8731,7 +8733,7 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 	bytenr = btrfs_node_blockptr(path->nodes[level], path->slots[level]);
 	blocksize = fs_info->nodesize;
 
-	next = find_extent_buffer(fs_info, bytenr);
+	next = find_extent_buffer(fs_info->eb_info, bytenr);
 	if (!next) {
 		next = btrfs_find_create_tree_block(fs_info, bytenr);
 		if (IS_ERR(next))
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0538bf85adc3..bb10dc6f4e41 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2768,6 +2768,7 @@ static int submit_extent_page(unsigned int opf, struct extent_io_tree *tree,
 			      int mirror_num,
 			      unsigned long prev_bio_flags,
 			      unsigned long bio_flags,
+			      enum rw_hint io_hint,
 			      bool force_bio_submit)
 {
 	int ret = 0;
@@ -2804,7 +2805,7 @@ static int submit_extent_page(unsigned int opf, struct extent_io_tree *tree,
 	bio_add_page(bio, page, page_size, offset);
 	bio->bi_end_io = end_io_func;
 	bio->bi_private = tree;
-	bio->bi_write_hint = page->mapping->host->i_write_hint;
+	bio->bi_write_hint = io_hint;
 	bio->bi_opf = opf;
 	if (wbc) {
 		wbc_init_bio(wbc, bio);
@@ -3065,7 +3066,7 @@ static int __do_readpage(struct extent_io_tree *tree,
 					 bdev, bio,
 					 end_bio_extent_readpage, mirror_num,
 					 *bio_flags,
-					 this_bio_flag,
+					 this_bio_flag, inode->i_write_hint,
 					 force_bio_submit);
 		if (!ret) {
 			nr++;
@@ -3433,7 +3434,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 					 page, sector, iosize, pg_offset,
 					 bdev, &epd->bio,
 					 end_bio_extent_writepage,
-					 0, 0, 0, false);
+					 0, 0, 0, inode->i_write_hint, false);
 		if (ret) {
 			SetPageError(page);
 			if (PageWriteback(page))
@@ -3539,7 +3540,7 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
 			  struct btrfs_fs_info *fs_info,
 			  struct extent_page_data *epd)
 {
-	unsigned long i, num_pages;
+	struct btrfs_eb_info *eb_info = fs_info->eb_info;
 	int flush = 0;
 	int ret = 0;
 
@@ -3586,37 +3587,42 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
 
 	btrfs_tree_unlock(eb);
 
-	if (!ret)
-		return ret;
-
-	num_pages = num_extent_pages(eb->start, eb->len);
-	for (i = 0; i < num_pages; i++) {
-		struct page *p = eb->pages[i];
-
-		if (!trylock_page(p)) {
-			if (!flush) {
-				flush_write_bio(epd);
-				flush = 1;
-			}
-			lock_page(p);
-		}
+	/*
+	 * We cleared dirty on this buffer, we need to adjust the radix tags.
+	 * We do the actual page accounting in write_one_eb.
+	 */
+	if (ret) {
+		spin_lock_irq(&eb_info->buffer_lock);
+		radix_tree_tag_set(&eb_info->buffer_radix, eb_index(eb),
+				   PAGECACHE_TAG_WRITEBACK);
+		radix_tree_tag_clear(&eb_info->buffer_radix, eb_index(eb),
+				     PAGECACHE_TAG_DIRTY);
+		radix_tree_tag_clear(&eb_info->buffer_radix, eb_index(eb),
+				     PAGECACHE_TAG_TOWRITE);
+		spin_unlock_irq(&eb_info->buffer_lock);
 	}
-
 	return ret;
 }
 
 static void end_extent_buffer_writeback(struct extent_buffer *eb)
 {
-	clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
-	smp_mb__after_atomic();
-	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
+	if (test_and_clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) {
+		struct btrfs_eb_info *eb_info = eb->eb_info;
+		unsigned long flags;
+
+		spin_lock_irqsave(&eb_info->buffer_lock, flags);
+		radix_tree_tag_clear(&eb_info->buffer_radix, eb_index(eb),
+				     PAGECACHE_TAG_WRITEBACK);
+		spin_unlock_irqrestore(&eb_info->buffer_lock, flags);
+		wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
+	}
 }
 
 static void set_btree_ioerr(struct page *page)
 {
 	struct extent_buffer *eb = (struct extent_buffer *)page->private;
+	struct btrfs_fs_info *fs_info = eb->eb_info->fs_info;
 
-	SetPageError(page);
 	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
 		return;
 
@@ -3625,8 +3631,7 @@ static void set_btree_ioerr(struct page *page)
 	 * failed, increment the counter transaction->eb_write_errors.
 	 * We do this because while the transaction is running and before it's
 	 * committing (when we call filemap_fdata[write|wait]_range against
-	 * the btree inode), we might have
-	 * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
+	 * the btree inode), we might have write_metadata() called - if it
 	 * returns an error or an error happens during writeback, when we're
 	 * committing the transaction we wouldn't know about it, since the pages
 	 * can be no longer dirty nor marked anymore for writeback (if a
@@ -3660,13 +3665,13 @@ static void set_btree_ioerr(struct page *page)
 	 */
 	switch (eb->log_index) {
 	case -1:
-		set_bit(BTRFS_FS_BTREE_ERR, &eb->fs_info->flags);
+		set_bit(BTRFS_FS_BTREE_ERR, &fs_info->flags);
 		break;
 	case 0:
-		set_bit(BTRFS_FS_LOG1_ERR, &eb->fs_info->flags);
+		set_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags);
 		break;
 	case 1:
-		set_bit(BTRFS_FS_LOG2_ERR, &eb->fs_info->flags);
+		set_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags);
 		break;
 	default:
 		BUG(); /* unexpected, logic error */
@@ -3682,22 +3687,20 @@ static void end_bio_extent_buffer_writepage(struct bio *bio)
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, i) {
 		struct page *page = bvec->bv_page;
+		struct super_block *sb;
 
 		eb = (struct extent_buffer *)page->private;
 		BUG_ON(!eb);
 		done = atomic_dec_and_test(&eb->io_pages);
 
 		if (bio->bi_status ||
-		    test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
-			ClearPageUptodate(page);
+		    test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
 			set_btree_ioerr(page);
-		}
-
-		end_page_writeback(page);
 
+		sb = eb->eb_info->fs_info->sb;
+		account_metadata_end_writeback(page, sb->s_bdi, PAGE_SIZE);
 		if (!done)
 			continue;
-
 		end_extent_buffer_writeback(eb);
 	}
 
@@ -3710,7 +3713,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 			struct extent_page_data *epd)
 {
 	struct block_device *bdev = fs_info->fs_devices->latest_bdev;
-	struct extent_io_tree *tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
+	struct extent_io_tree *tree = &fs_info->eb_info->io_tree;
 	u64 offset = eb->start;
 	u32 nritems;
 	unsigned long i, num_pages;
@@ -3741,44 +3744,105 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = eb->pages[i];
 
-		clear_page_dirty_for_io(p);
-		set_page_writeback(p);
 		ret = submit_extent_page(REQ_OP_WRITE | write_flags, tree, wbc,
 					 p, offset >> 9, PAGE_SIZE, 0, bdev,
 					 &epd->bio,
 					 end_bio_extent_buffer_writepage,
-					 0, 0, 0, false);
+					 0, 0, 0, 0, false);
 		if (ret) {
 			set_btree_ioerr(p);
-			if (PageWriteback(p))
-				end_page_writeback(p);
 			if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
 				end_extent_buffer_writeback(eb);
 			ret = -EIO;
 			break;
 		}
+		account_metadata_writeback(p, fs_info->sb->s_bdi, PAGE_SIZE);
 		offset += PAGE_SIZE;
 		update_nr_written(wbc, 1);
-		unlock_page(p);
 	}
 
-	if (unlikely(ret)) {
-		for (; i < num_pages; i++) {
-			struct page *p = eb->pages[i];
-			clear_page_dirty_for_io(p);
-			unlock_page(p);
-		}
+	return ret;
+}
+
+static void tag_ebs_for_writeback(struct btrfs_eb_info *eb_info, pgoff_t start,
+				  pgoff_t end)
+{
+#define EB_TAG_BATCH 4096
+	unsigned long tagged = 0;
+	struct radix_tree_iter iter;
+	void **slot;
+
+	spin_lock_irq(&eb_info->buffer_lock);
+	radix_tree_for_each_tagged(slot, &eb_info->buffer_radix, &iter, start,
+				   PAGECACHE_TAG_DIRTY) {
+		if (iter.index > end)
+			break;
+		radix_tree_iter_tag_set(&eb_info->buffer_radix, &iter,
+					PAGECACHE_TAG_TOWRITE);
+		tagged++;
+		if ((tagged % EB_TAG_BATCH) != 0)
+			continue;
+		slot = radix_tree_iter_resume(slot, &iter);
+		spin_unlock_irq(&eb_info->buffer_lock);
+		cond_resched();
+		spin_lock_irq(&eb_info->buffer_lock);
 	}
+	spin_unlock_irq(&eb_info->buffer_lock);
+}
+
+static unsigned eb_lookup_tag(struct btrfs_eb_info *eb_info,
+			      struct extent_buffer **ebs, pgoff_t *index,
+			      int tag, unsigned nr)
+{
+	struct radix_tree_iter iter;
+	void **slot;
+	unsigned ret = 0;
 
+	if (unlikely(!nr))
+		return 0;
+
+	rcu_read_lock();
+	radix_tree_for_each_tagged(slot, &eb_info->buffer_radix, &iter, *index,
+				   tag) {
+		struct extent_buffer *eb;
+repeat:
+		eb = radix_tree_deref_slot(slot);
+		if (unlikely(!eb))
+			continue;
+
+		if (radix_tree_exception(eb)) {
+			if (radix_tree_deref_retry(eb)) {
+				slot = radix_tree_iter_retry(&iter);
+				continue;
+			}
+			continue;
+		}
+
+		if (unlikely(!atomic_inc_not_zero(&eb->refs)))
+			continue;
+
+		if (unlikely(eb != *slot)) {
+			free_extent_buffer(eb);
+			goto repeat;
+		}
+
+		ebs[ret] = eb;
+		if (++ret == nr)
+			break;
+	}
+	rcu_read_unlock();
+	if (ret)
+		*index = (ebs[ret - 1]->start >> PAGE_SHIFT) + 1;
 	return ret;
 }
 
-int btree_write_cache_pages(struct address_space *mapping,
+#define EBVEC_SIZE 16
+static int btree_write_cache_pages(struct btrfs_fs_info *fs_info,
 				   struct writeback_control *wbc)
 {
-	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
-	struct btrfs_fs_info *fs_info = BTRFS_I(mapping->host)->root->fs_info;
-	struct extent_buffer *eb, *prev_eb = NULL;
+	struct btrfs_eb_info *eb_info = fs_info->eb_info;
+	struct extent_io_tree *tree = &eb_info->io_tree;
+	struct extent_buffer *eb;
 	struct extent_page_data epd = {
 		.bio = NULL,
 		.tree = tree,
@@ -3788,16 +3852,16 @@ int btree_write_cache_pages(struct address_space *mapping,
 	int ret = 0;
 	int done = 0;
 	int nr_to_write_done = 0;
-	struct pagevec pvec;
-	int nr_pages;
+	struct extent_buffer *ebs[EBVEC_SIZE];
+	int nr_ebs;
 	pgoff_t index;
 	pgoff_t end;		/* Inclusive */
+	pgoff_t done_index = 0;
 	int scanned = 0;
 	int tag;
 
-	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
-		index = mapping->writeback_index; /* Start from prev offset */
+		index = eb_info->writeback_index; /* Start from prev offset */
 		end = -1;
 	} else {
 		index = wbc->range_start >> PAGE_SHIFT;
@@ -3810,53 +3874,27 @@ int btree_write_cache_pages(struct address_space *mapping,
 		tag = PAGECACHE_TAG_DIRTY;
 retry:
 	if (wbc->sync_mode == WB_SYNC_ALL)
-		tag_pages_for_writeback(mapping, index, end);
+		tag_ebs_for_writeback(fs_info->eb_info, index, end);
 	while (!done && !nr_to_write_done && (index <= end) &&
-	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+	       (nr_ebs = eb_lookup_tag(eb_info, ebs, &index, tag,
+			min(end - index, (pgoff_t)EBVEC_SIZE-1) + 1))) {
 		unsigned i;
 
 		scanned = 1;
-		for (i = 0; i < nr_pages; i++) {
-			struct page *page = pvec.pages[i];
-
-			if (!PagePrivate(page))
-				continue;
-
-			if (!wbc->range_cyclic && page->index > end) {
-				done = 1;
-				break;
-			}
-
-			spin_lock(&mapping->private_lock);
-			if (!PagePrivate(page)) {
-				spin_unlock(&mapping->private_lock);
-				continue;
-			}
-
-			eb = (struct extent_buffer *)page->private;
-
-			/*
-			 * Shouldn't happen and normally this would be a BUG_ON
-			 * but no sense in crashing the users box for something
-			 * we can survive anyway.
-			 */
-			if (WARN_ON(!eb)) {
-				spin_unlock(&mapping->private_lock);
+		for (i = 0; i < nr_ebs; i++) {
+			eb = ebs[i];
+			if (done) {
+				free_extent_buffer(eb);
 				continue;
 			}
 
-			if (eb == prev_eb) {
-				spin_unlock(&mapping->private_lock);
+			if (!wbc->range_cyclic && eb->start > wbc->range_end) {
+				done = 1;
+				free_extent_buffer(eb);
 				continue;
 			}
 
-			ret = atomic_inc_not_zero(&eb->refs);
-			spin_unlock(&mapping->private_lock);
-			if (!ret)
-				continue;
-
-			prev_eb = eb;
+			done_index = eb_index(eb);
 			ret = lock_extent_buffer_for_io(eb, fs_info, &epd);
 			if (!ret) {
 				free_extent_buffer(eb);
@@ -3864,12 +3902,11 @@ int btree_write_cache_pages(struct address_space *mapping,
 			}
 
 			ret = write_one_eb(eb, fs_info, wbc, &epd);
+			free_extent_buffer(eb);
 			if (ret) {
 				done = 1;
-				free_extent_buffer(eb);
-				break;
+				continue;
 			}
-			free_extent_buffer(eb);
 
 			/*
 			 * the filesystem may choose to bump up nr_to_write.
@@ -3878,7 +3915,6 @@ int btree_write_cache_pages(struct address_space *mapping,
 			 */
 			nr_to_write_done = wbc->nr_to_write <= 0;
 		}
-		pagevec_release(&pvec);
 		cond_resched();
 	}
 	if (!scanned && !done) {
@@ -3890,10 +3926,77 @@ int btree_write_cache_pages(struct address_space *mapping,
 		index = 0;
 		goto retry;
 	}
+	if (wbc->range_cyclic)
+		fs_info->eb_info->writeback_index = done_index;
 	flush_write_bio(&epd);
 	return ret;
 }
 
+void btrfs_write_ebs(struct super_block *sb, struct writeback_control *wbc)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+	btree_write_cache_pages(fs_info, wbc);
+}
+
+static int __btree_write_range(struct btrfs_fs_info *fs_info, u64 start,
+			       u64 end, int sync_mode)
+{
+	struct writeback_control wbc = {
+		.sync_mode = sync_mode,
+		.nr_to_write = LONG_MAX,
+		.range_start = start,
+		.range_end = end,
+	};
+
+	return btree_write_cache_pages(fs_info, &wbc);
+}
+
+void btree_flush(struct btrfs_fs_info *fs_info)
+{
+	__btree_write_range(fs_info, 0, (u64)-1, WB_SYNC_NONE);
+}
+
+int btree_write_range(struct btrfs_fs_info *fs_info, u64 start, u64 end)
+{
+	return __btree_write_range(fs_info, start, end, WB_SYNC_ALL);
+}
+
+int btree_wait_range(struct btrfs_fs_info *fs_info, u64 start, u64 end)
+{
+	struct extent_buffer *ebs[EBVEC_SIZE];
+	pgoff_t index = start >> PAGE_SHIFT;
+	pgoff_t end_index = end >> PAGE_SHIFT;
+	unsigned nr_ebs;
+	int ret = 0;
+
+	if (end < start)
+		return ret;
+
+	while ((index <= end) &&
+	       (nr_ebs = eb_lookup_tag(fs_info->eb_info, ebs, &index,
+				       PAGECACHE_TAG_WRITEBACK,
+				       min(end_index - index,
+					   (pgoff_t)EBVEC_SIZE-1) + 1)) != 0) {
+		unsigned i;
+
+		for (i = 0; i < nr_ebs; i++) {
+			struct extent_buffer *eb = ebs[i];
+
+			if (eb->start > end) {
+				free_extent_buffer(eb);
+				continue;
+			}
+
+			wait_on_extent_buffer_writeback(eb);
+			if (test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
+				ret = -EIO;
+			free_extent_buffer(eb);
+		}
+		cond_resched();
+	}
+	return ret;
+}
+
 /**
  * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
  * @mapping: address space structure to write
@@ -4680,7 +4783,6 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
 {
 	unsigned long index;
 	struct page *page;
-	int mapped = !test_bit(EXTENT_BUFFER_DUMMY, &eb->bflags);
 
 	BUG_ON(extent_buffer_under_io(eb));
 
@@ -4688,39 +4790,21 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
 	if (index == 0)
 		return;
 
+	ASSERT(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
 	do {
 		index--;
 		page = eb->pages[index];
 		if (!page)
 			continue;
-		if (mapped)
-			spin_lock(&page->mapping->private_lock);
-		/*
-		 * We do this since we'll remove the pages after we've
-		 * removed the eb from the radix tree, so we could race
-		 * and have this page now attached to the new eb.  So
-		 * only clear page_private if it's still connected to
-		 * this eb.
-		 */
-		if (PagePrivate(page) &&
-		    page->private == (unsigned long)eb) {
-			BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
-			BUG_ON(PageDirty(page));
-			BUG_ON(PageWriteback(page));
-			/*
-			 * We need to make sure we haven't be attached
-			 * to a new eb.
-			 */
-			ClearPagePrivate(page);
-			set_page_private(page, 0);
-			/* One for the page private */
-			put_page(page);
-		}
+		ASSERT(PagePrivate(page));
+		ASSERT(page->private == (unsigned long)eb);
+		ClearPagePrivate(page);
+		set_page_private(page, 0);
 
-		if (mapped)
-			spin_unlock(&page->mapping->private_lock);
+		/* Once for the page private. */
+		put_page(page);
 
-		/* One for when we allocated the page */
+		/* Once for the alloc_page. */
 		put_page(page);
 	} while (index != 0);
 }
@@ -4735,7 +4819,7 @@ static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
 }
 
 static struct extent_buffer *
-__alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
+__alloc_extent_buffer(struct btrfs_eb_info *eb_info, u64 start,
 		      unsigned long len)
 {
 	struct extent_buffer *eb = NULL;
@@ -4743,7 +4827,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
 	eb->start = start;
 	eb->len = len;
-	eb->fs_info = fs_info;
+	eb->eb_info = eb_info;
 	eb->bflags = 0;
 	rwlock_init(&eb->lock);
 	atomic_set(&eb->write_locks, 0);
@@ -4755,6 +4839,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 	eb->lock_nested = 0;
 	init_waitqueue_head(&eb->write_lock_wq);
 	init_waitqueue_head(&eb->read_lock_wq);
+	INIT_LIST_HEAD(&eb->lru);
 
 	btrfs_leak_debug_add(&eb->leak_list, &buffers);
 
@@ -4779,7 +4864,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
 	struct extent_buffer *new;
 	unsigned long num_pages = num_extent_pages(src->start, src->len);
 
-	new = __alloc_extent_buffer(src->fs_info, src->start, src->len);
+	new = __alloc_extent_buffer(src->eb_info, src->start, src->len);
 	if (new == NULL)
 		return NULL;
 
@@ -4790,8 +4875,6 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
 			return NULL;
 		}
 		attach_extent_buffer_page(new, p);
-		WARN_ON(PageDirty(p));
-		SetPageUptodate(p);
 		new->pages[i] = p;
 		copy_page(page_address(p), page_address(src->pages[i]));
 	}
@@ -4802,8 +4885,8 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
 	return new;
 }
 
-struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
-						  u64 start, unsigned long len)
+struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_eb_info *eb_info,
+						u64 start, unsigned long len)
 {
 	struct extent_buffer *eb;
 	unsigned long num_pages;
@@ -4811,7 +4894,7 @@ struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
 
 	num_pages = num_extent_pages(start, len);
 
-	eb = __alloc_extent_buffer(fs_info, start, len);
+	eb = __alloc_extent_buffer(eb_info, start, len);
 	if (!eb)
 		return NULL;
 
@@ -4819,6 +4902,7 @@ struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
 		eb->pages[i] = alloc_page(GFP_NOFS);
 		if (!eb->pages[i])
 			goto err;
+		attach_extent_buffer_page(eb, eb->pages[i]);
 	}
 	set_extent_buffer_uptodate(eb);
 	btrfs_set_header_nritems(eb, 0);
@@ -4826,18 +4910,10 @@ struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
 
 	return eb;
 err:
-	for (; i > 0; i--)
-		__free_page(eb->pages[i - 1]);
-	__free_extent_buffer(eb);
+	btrfs_release_extent_buffer(eb);
 	return NULL;
 }
 
-struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
-						u64 start)
-{
-	return __alloc_dummy_extent_buffer(fs_info, start, fs_info->nodesize);
-}
-
 static void check_buffer_tree_ref(struct extent_buffer *eb)
 {
 	int refs;
@@ -4887,13 +4963,13 @@ static void mark_extent_buffer_accessed(struct extent_buffer *eb,
 	}
 }
 
-struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
+struct extent_buffer *find_extent_buffer(struct btrfs_eb_info *eb_info,
 					 u64 start)
 {
 	struct extent_buffer *eb;
 
 	rcu_read_lock();
-	eb = radix_tree_lookup(&fs_info->buffer_radix,
+	eb = radix_tree_lookup(&eb_info->buffer_radix,
 			       start >> PAGE_SHIFT);
 	if (eb && atomic_inc_not_zero(&eb->refs)) {
 		rcu_read_unlock();
@@ -4925,30 +5001,30 @@ struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
 }
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
-struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
-					u64 start)
+struct extent_buffer *alloc_test_extent_buffer(struct btrfs_eb_info *eb_info,
+					       u64 start, u32 nodesize)
 {
 	struct extent_buffer *eb, *exists = NULL;
 	int ret;
 
-	eb = find_extent_buffer(fs_info, start);
+	eb = find_extent_buffer(eb_info, start);
 	if (eb)
 		return eb;
-	eb = alloc_dummy_extent_buffer(fs_info, start);
+	eb = alloc_dummy_extent_buffer(eb_info, start, nodesize);
 	if (!eb)
 		return NULL;
-	eb->fs_info = fs_info;
+	eb->eb_info = eb_info;
 again:
 	ret = radix_tree_preload(GFP_NOFS);
 	if (ret)
 		goto free_eb;
-	spin_lock(&fs_info->buffer_lock);
-	ret = radix_tree_insert(&fs_info->buffer_radix,
+	spin_lock_irq(&eb_info->buffer_lock);
+	ret = radix_tree_insert(&eb_info->buffer_radix,
 				start >> PAGE_SHIFT, eb);
-	spin_unlock(&fs_info->buffer_lock);
+	spin_unlock_irq(&eb_info->buffer_lock);
 	radix_tree_preload_end();
 	if (ret == -EEXIST) {
-		exists = find_extent_buffer(fs_info, start);
+		exists = find_extent_buffer(eb_info, start);
 		if (exists)
 			goto free_eb;
 		else
@@ -4964,6 +5040,7 @@ struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
 	 * bump the ref count again.
 	 */
 	atomic_inc(&eb->refs);
+	set_extent_buffer_uptodate(eb);
 	return eb;
 free_eb:
 	btrfs_release_extent_buffer(eb);
@@ -4977,12 +5054,10 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	unsigned long len = fs_info->nodesize;
 	unsigned long num_pages = num_extent_pages(start, len);
 	unsigned long i;
-	unsigned long index = start >> PAGE_SHIFT;
 	struct extent_buffer *eb;
 	struct extent_buffer *exists = NULL;
 	struct page *p;
-	struct address_space *mapping = fs_info->btree_inode->i_mapping;
-	int uptodate = 1;
+	struct btrfs_eb_info *eb_info = fs_info->eb_info;
 	int ret;
 
 	if (!IS_ALIGNED(start, fs_info->sectorsize)) {
@@ -4990,62 +5065,23 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		return ERR_PTR(-EINVAL);
 	}
 
-	eb = find_extent_buffer(fs_info, start);
+	eb = find_extent_buffer(eb_info, start);
 	if (eb)
 		return eb;
 
-	eb = __alloc_extent_buffer(fs_info, start, len);
+	eb = __alloc_extent_buffer(eb_info, start, len);
 	if (!eb)
 		return ERR_PTR(-ENOMEM);
 
-	for (i = 0; i < num_pages; i++, index++) {
-		p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
+	for (i = 0; i < num_pages; i++) {
+		p = alloc_page(GFP_NOFS|__GFP_NOFAIL);
 		if (!p) {
 			exists = ERR_PTR(-ENOMEM);
 			goto free_eb;
 		}
-
-		spin_lock(&mapping->private_lock);
-		if (PagePrivate(p)) {
-			/*
-			 * We could have already allocated an eb for this page
-			 * and attached one so lets see if we can get a ref on
-			 * the existing eb, and if we can we know it's good and
-			 * we can just return that one, else we know we can just
-			 * overwrite page->private.
-			 */
-			exists = (struct extent_buffer *)p->private;
-			if (atomic_inc_not_zero(&exists->refs)) {
-				spin_unlock(&mapping->private_lock);
-				unlock_page(p);
-				put_page(p);
-				mark_extent_buffer_accessed(exists, p);
-				goto free_eb;
-			}
-			exists = NULL;
-
-			/*
-			 * Do this so attach doesn't complain and we need to
-			 * drop the ref the old guy had.
-			 */
-			ClearPagePrivate(p);
-			WARN_ON(PageDirty(p));
-			put_page(p);
-		}
 		attach_extent_buffer_page(eb, p);
-		spin_unlock(&mapping->private_lock);
-		WARN_ON(PageDirty(p));
 		eb->pages[i] = p;
-		if (!PageUptodate(p))
-			uptodate = 0;
-
-		/*
-		 * see below about how we avoid a nasty race with release page
-		 * and why we unlock later
-		 */
 	}
-	if (uptodate)
-		set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
 again:
 	ret = radix_tree_preload(GFP_NOFS);
 	if (ret) {
@@ -5053,13 +5089,13 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		goto free_eb;
 	}
 
-	spin_lock(&fs_info->buffer_lock);
-	ret = radix_tree_insert(&fs_info->buffer_radix,
+	spin_lock_irq(&eb_info->buffer_lock);
+	ret = radix_tree_insert(&eb_info->buffer_radix,
 				start >> PAGE_SHIFT, eb);
-	spin_unlock(&fs_info->buffer_lock);
+	spin_unlock_irq(&eb_info->buffer_lock);
 	radix_tree_preload_end();
 	if (ret == -EEXIST) {
-		exists = find_extent_buffer(fs_info, start);
+		exists = find_extent_buffer(eb_info, start);
 		if (exists)
 			goto free_eb;
 		else
@@ -5069,31 +5105,10 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	check_buffer_tree_ref(eb);
 	set_bit(EXTENT_BUFFER_IN_TREE, &eb->bflags);
 
-	/*
-	 * there is a race where release page may have
-	 * tried to find this extent buffer in the radix
-	 * but failed.  It will tell the VM it is safe to
-	 * reclaim the, and it will clear the page private bit.
-	 * We must make sure to set the page private bit properly
-	 * after the extent buffer is in the radix tree so
-	 * it doesn't get lost
-	 */
-	SetPageChecked(eb->pages[0]);
-	for (i = 1; i < num_pages; i++) {
-		p = eb->pages[i];
-		ClearPageChecked(p);
-		unlock_page(p);
-	}
-	unlock_page(eb->pages[0]);
 	return eb;
 
 free_eb:
 	WARN_ON(!atomic_dec_and_test(&eb->refs));
-	for (i = 0; i < num_pages; i++) {
-		if (eb->pages[i])
-			unlock_page(eb->pages[i]);
-	}
-
 	btrfs_release_extent_buffer(eb);
 	return exists;
 }
@@ -5109,17 +5124,19 @@ static inline void btrfs_release_extent_buffer_rcu(struct rcu_head *head)
 /* Expects to have eb->eb_lock already held */
 static int release_extent_buffer(struct extent_buffer *eb)
 {
+	struct btrfs_eb_info *eb_info = eb->eb_info;
+
 	WARN_ON(atomic_read(&eb->refs) == 0);
 	if (atomic_dec_and_test(&eb->refs)) {
+		if (eb_info)
+			list_lru_del(&eb_info->lru_list, &eb->lru);
 		if (test_and_clear_bit(EXTENT_BUFFER_IN_TREE, &eb->bflags)) {
-			struct btrfs_fs_info *fs_info = eb->fs_info;
-
 			spin_unlock(&eb->refs_lock);
 
-			spin_lock(&fs_info->buffer_lock);
-			radix_tree_delete(&fs_info->buffer_radix,
-					  eb->start >> PAGE_SHIFT);
-			spin_unlock(&fs_info->buffer_lock);
+			spin_lock_irq(&eb_info->buffer_lock);
+			radix_tree_delete(&eb_info->buffer_radix,
+					  eb_index(eb));
+			spin_unlock_irq(&eb_info->buffer_lock);
 		} else {
 			spin_unlock(&eb->refs_lock);
 		}
@@ -5134,6 +5151,8 @@ static int release_extent_buffer(struct extent_buffer *eb)
 #endif
 		call_rcu(&eb->rcu_head, btrfs_release_extent_buffer_rcu);
 		return 1;
+	} else if (eb_info && atomic_read(&eb->refs) == 1) {
+		list_lru_add(&eb_info->lru_list, &eb->lru);
 	}
 	spin_unlock(&eb->refs_lock);
 
@@ -5167,10 +5186,6 @@ void free_extent_buffer(struct extent_buffer *eb)
 	    test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
 		atomic_dec(&eb->refs);
 
-	/*
-	 * I know this is terrible, but it's temporary until we stop tracking
-	 * the uptodate bits and such for the extent buffers.
-	 */
 	release_extent_buffer(eb);
 }
 
@@ -5188,82 +5203,156 @@ void free_extent_buffer_stale(struct extent_buffer *eb)
 	release_extent_buffer(eb);
 }
 
-void clear_extent_buffer_dirty(struct extent_buffer *eb)
+long btrfs_nr_ebs(struct super_block *sb, struct shrink_control *sc)
 {
-	unsigned long i;
-	unsigned long num_pages;
-	struct page *page;
+	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+	struct btrfs_eb_info *eb_info = fs_info->eb_info;
 
-	num_pages = num_extent_pages(eb->start, eb->len);
+	return list_lru_shrink_count(&eb_info->lru_list, sc);
+}
 
-	for (i = 0; i < num_pages; i++) {
-		page = eb->pages[i];
-		if (!PageDirty(page))
-			continue;
+static enum lru_status eb_lru_isolate(struct list_head *item,
+				      struct list_lru_one *lru,
+				      spinlock_t *lru_lock, void *arg)
+{
+	struct list_head *freeable = (struct list_head *)arg;
+	struct extent_buffer *eb = container_of(item, struct extent_buffer,
+						lru);
+	enum lru_status ret;
+	int refs;
 
-		lock_page(page);
-		WARN_ON(!PagePrivate(page));
+	if (!spin_trylock(&eb->refs_lock))
+		return LRU_SKIP;
 
-		clear_page_dirty_for_io(page);
-		spin_lock_irq(&page->mapping->tree_lock);
-		if (!PageDirty(page)) {
-			radix_tree_tag_clear(&page->mapping->page_tree,
-						page_index(page),
-						PAGECACHE_TAG_DIRTY);
-		}
-		spin_unlock_irq(&page->mapping->tree_lock);
-		ClearPageError(page);
-		unlock_page(page);
+	if (extent_buffer_under_io(eb)) {
+		ret = LRU_ROTATE;
+		goto out;
+	}
+
+	refs = atomic_read(&eb->refs);
+	/* We can race with somebody freeing us, just skip if this happens. */
+	if (refs == 0) {
+		ret = LRU_SKIP;
+		goto out;
+	}
+
+	/* Eb is in use, don't kill it. */
+	if (refs > 1) {
+		ret = LRU_ROTATE;
+		goto out;
+	}
+
+	/*
+	 * If we don't clear the TREE_REF flag then this eb is going to
+	 * disappear soon anyway.  Otherwise we become responsible for dropping
+	 * the last ref on this eb and we know it'll survive until we call
+	 * dispose_list.
+	 */
+	if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
+		ret = LRU_SKIP;
+		goto out;
+	}
+	list_lru_isolate_move(lru, &eb->lru, freeable);
+	ret = LRU_REMOVED;
+out:
+	spin_unlock(&eb->refs_lock);
+	return ret;
+}
+
+static void dispose_list(struct list_head *list)
+{
+	struct extent_buffer *eb;
+
+	while (!list_empty(list)) {
+		eb = list_first_entry(list, struct extent_buffer, lru);
+
+		spin_lock(&eb->refs_lock);
+		list_del_init(&eb->lru);
+		spin_unlock(&eb->refs_lock);
+		free_extent_buffer(eb);
+		cond_resched();
 	}
+}
+
+long btrfs_free_ebs(struct super_block *sb, struct shrink_control *sc)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+	struct btrfs_eb_info *eb_info = fs_info->eb_info;
+	LIST_HEAD(freeable);
+	long freed;
+
+	freed = list_lru_shrink_walk(&eb_info->lru_list, sc, eb_lru_isolate,
+				     &freeable);
+	dispose_list(&freeable);
+	return freed;
+}
+
+void btrfs_invalidate_eb_info(struct btrfs_eb_info *eb_info)
+{
+	LIST_HEAD(freeable);
+
+	/*
+	 * We should be able to free all the extent buffers at this point, if we
+	 * can't there's a problem and we should complain loudly about it.
+	 */
+	do {
+		list_lru_walk(&eb_info->lru_list, eb_lru_isolate, &freeable, LONG_MAX);
+	} while (WARN_ON(list_lru_count(&eb_info->lru_list)));
+	dispose_list(&freeable);
+	synchronize_rcu();
+}
+
+int clear_extent_buffer_dirty(struct extent_buffer *eb)
+{
+	struct btrfs_eb_info *eb_info = eb->eb_info;
+	struct super_block *sb = eb_info->fs_info->sb;
+	unsigned long num_pages;
+
+	if (!test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags))
+		return 0;
+
+	spin_lock_irq(&eb_info->buffer_lock);
+	radix_tree_tag_clear(&eb_info->buffer_radix, eb_index(eb),
+			     PAGECACHE_TAG_DIRTY);
+	spin_unlock_irq(&eb_info->buffer_lock);
+
+	num_pages = num_extent_pages(eb->start, eb->len);
+	account_metadata_cleaned(eb->pages[0], sb->s_bdi, eb->len);
 	WARN_ON(atomic_read(&eb->refs) == 0);
+	return 1;
 }
 
 int set_extent_buffer_dirty(struct extent_buffer *eb)
 {
-	unsigned long i;
+	struct btrfs_eb_info *eb_info = eb->eb_info;
+	struct super_block *sb = eb_info->fs_info->sb;
 	unsigned long num_pages;
 	int was_dirty = 0;
 
 	check_buffer_tree_ref(eb);
 
-	was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags);
-
-	num_pages = num_extent_pages(eb->start, eb->len);
 	WARN_ON(atomic_read(&eb->refs) == 0);
 	WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
+	if (test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags))
+		return 1;
 
-	for (i = 0; i < num_pages; i++)
-		set_page_dirty(eb->pages[i]);
+	num_pages = num_extent_pages(eb->start, eb->len);
+	account_metadata_dirtied(eb->pages[0], sb->s_bdi, eb->len);
+	spin_lock_irq(&eb_info->buffer_lock);
+	radix_tree_tag_set(&eb_info->buffer_radix, eb_index(eb),
+			   PAGECACHE_TAG_DIRTY);
+	spin_unlock_irq(&eb_info->buffer_lock);
 	return was_dirty;
 }
 
 void clear_extent_buffer_uptodate(struct extent_buffer *eb)
 {
-	unsigned long i;
-	struct page *page;
-	unsigned long num_pages;
-
 	clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
-	num_pages = num_extent_pages(eb->start, eb->len);
-	for (i = 0; i < num_pages; i++) {
-		page = eb->pages[i];
-		if (page)
-			ClearPageUptodate(page);
-	}
 }
 
 void set_extent_buffer_uptodate(struct extent_buffer *eb)
 {
-	unsigned long i;
-	struct page *page;
-	unsigned long num_pages;
-
 	set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
-	num_pages = num_extent_pages(eb->start, eb->len);
-	for (i = 0; i < num_pages; i++) {
-		page = eb->pages[i];
-		SetPageUptodate(page);
-	}
 }
 
 int extent_buffer_uptodate(struct extent_buffer *eb)
@@ -5271,112 +5360,165 @@ int extent_buffer_uptodate(struct extent_buffer *eb)
 	return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
 }
 
-int read_extent_buffer_pages(struct extent_io_tree *tree,
-			     struct extent_buffer *eb, int wait,
-			     get_extent_t *get_extent, int mirror_num)
+static void end_bio_extent_buffer_readpage(struct bio *bio)
 {
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	struct extent_io_tree *tree = NULL;
+	struct bio_vec *bvec;
+	u64 unlock_start = 0, unlock_len = 0;
+	int mirror_num = io_bio->mirror_num;
+	int uptodate = !bio->bi_status;
+	int i, ret;
+
+	bio_for_each_segment_all(bvec, bio, i) {
+		struct page *page = bvec->bv_page;
+		struct btrfs_eb_info *eb_info;
+		struct extent_buffer *eb;
+
+		eb = (struct extent_buffer *)page->private;
+		if (WARN_ON(!eb))
+			continue;
+
+		eb_info = eb->eb_info;
+		if (!tree)
+			tree = &eb_info->io_tree;
+		if (uptodate) {
+			/*
+			 * btree_readpage_end_io_hook doesn't care about
+			 * start/end so just pass 0.  We'll kill this later.
+			 */
+			ret = tree->ops->readpage_end_io_hook(io_bio, 0,
+							      page, 0, 0,
+							      mirror_num);
+			if (ret) {
+				uptodate = 0;
+			} else {
+				u64 start = eb->start;
+				int c, num_pages;
+
+				num_pages = num_extent_pages(eb->start,
+							     eb->len);
+				for (c = 0; c < num_pages; c++) {
+					if (eb->pages[c] == page)
+						break;
+					start += PAGE_SIZE;
+				}
+				clean_io_failure(eb_info->fs_info,
+						 &eb_info->io_failure_tree,
+						 tree, start, page, 0, 0);
+			}
+		}
+		/*
+		 * We never fix anything in btree_io_failed_hook.
+		 *
+		 * TODO: rework the io failed hook to not assume we can fix
+		 * anything.
+		 */
+		if (!uptodate)
+			tree->ops->readpage_io_failed_hook(page, mirror_num);
+
+		if (unlock_start == 0) {
+			unlock_start = eb->start;
+			unlock_len = PAGE_SIZE;
+		} else {
+			unlock_len += PAGE_SIZE;
+		}
+	}
+
+	if (unlock_start)
+		unlock_extent(tree, unlock_start,
+			      unlock_start + unlock_len - 1);
+	if (io_bio->end_io)
+		io_bio->end_io(io_bio, blk_status_to_errno(bio->bi_status));
+	bio_put(bio);
+}
+
+int read_extent_buffer_pages(struct extent_buffer *eb, int wait,
+			     int mirror_num)
+{
+	struct btrfs_eb_info *eb_info = eb->eb_info;
+	struct extent_io_tree *io_tree = &eb_info->io_tree;
+	struct block_device *bdev = eb_info->fs_info->fs_devices->latest_bdev;
+	struct bio *bio = NULL;
+	u64 offset = eb->start;
+	u64 unlock_start = 0, unlock_len = 0;
 	unsigned long i;
 	struct page *page;
 	int err;
 	int ret = 0;
-	int locked_pages = 0;
-	int all_uptodate = 1;
 	unsigned long num_pages;
-	unsigned long num_reads = 0;
-	struct bio *bio = NULL;
-	unsigned long bio_flags = 0;
 
 	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
 		return 0;
 
-	num_pages = num_extent_pages(eb->start, eb->len);
-	for (i = 0; i < num_pages; i++) {
-		page = eb->pages[i];
-		if (wait == WAIT_NONE) {
-			if (!trylock_page(page))
-				goto unlock_exit;
-		} else {
-			lock_page(page);
-		}
-		locked_pages++;
-	}
-	/*
-	 * We need to firstly lock all pages to make sure that
-	 * the uptodate bit of our pages won't be affected by
-	 * clear_extent_buffer_uptodate().
-	 */
-	for (i = 0; i < num_pages; i++) {
-		page = eb->pages[i];
-		if (!PageUptodate(page)) {
-			num_reads++;
-			all_uptodate = 0;
-		}
-	}
-
-	if (all_uptodate) {
-		set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
-		goto unlock_exit;
+	if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags)) {
+		if (wait != WAIT_COMPLETE)
+			return 0;
+		wait_on_bit_io(&eb->bflags, EXTENT_BUFFER_READING,
+			       TASK_UNINTERRUPTIBLE);
+		if (!test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
+			ret = -EIO;
+		return ret;
 	}
 
+	lock_extent(io_tree, eb->start, eb->start + eb->len - 1);
+	num_pages = num_extent_pages(eb->start, eb->len);
 	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = 0;
-	atomic_set(&eb->io_pages, num_reads);
+	atomic_set(&eb->io_pages, num_pages);
 	for (i = 0; i < num_pages; i++) {
 		page = eb->pages[i];
-
-		if (!PageUptodate(page)) {
-			if (ret) {
-				atomic_dec(&eb->io_pages);
-				unlock_page(page);
-				continue;
+		if (ret) {
+			unlock_len += PAGE_SIZE;
+			if (atomic_dec_and_test(&eb->io_pages)) {
+				clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
+				smp_mb__after_atomic();
+				wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
 			}
+			continue;
+		}
 
-			ClearPageError(page);
-			err = __extent_read_full_page(tree, page,
-						      get_extent, &bio,
-						      mirror_num, &bio_flags,
-						      REQ_META);
-			if (err) {
-				ret = err;
-				/*
-				 * We use &bio in above __extent_read_full_page,
-				 * so we ensure that if it returns error, the
-				 * current page fails to add itself to bio and
-				 * it's been unlocked.
-				 *
-				 * We must dec io_pages by ourselves.
-				 */
-				atomic_dec(&eb->io_pages);
+		err = submit_extent_page(REQ_OP_READ | REQ_META, io_tree, NULL,
+					 page, offset >> 9, PAGE_SIZE, 0, bdev,
+					 &bio, end_bio_extent_buffer_readpage,
+					 mirror_num, 0, 0, 0, false);
+		if (err) {
+			ret = err;
+			/*
+			 * We use &bio in above submit_extent_page
+			 * so we ensure that if it returns error, the
+			 * current page fails to add itself to bio and
+			 * it's been unlocked.
+			 *
+			 * We must dec io_pages by ourselves.
+			 */
+			if (atomic_dec_and_test(&eb->io_pages)) {
+				clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
+				smp_mb__after_atomic();
+				wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
 			}
-		} else {
-			unlock_page(page);
+			unlock_start = eb->start;
+			unlock_len = PAGE_SIZE;
 		}
+		offset += PAGE_SIZE;
 	}
 
 	if (bio) {
-		err = submit_one_bio(bio, mirror_num, bio_flags);
+		err = submit_one_bio(bio, mirror_num, 0);
 		if (err)
 			return err;
 	}
 
+	if (ret && unlock_start)
+		unlock_extent(io_tree, unlock_start,
+			      unlock_start + unlock_len - 1);
 	if (ret || wait != WAIT_COMPLETE)
 		return ret;
 
-	for (i = 0; i < num_pages; i++) {
-		page = eb->pages[i];
-		wait_on_page_locked(page);
-		if (!PageUptodate(page))
-			ret = -EIO;
-	}
-
-	return ret;
-
-unlock_exit:
-	while (locked_pages > 0) {
-		locked_pages--;
-		page = eb->pages[locked_pages];
-		unlock_page(page);
-	}
+	wait_on_bit_io(&eb->bflags, EXTENT_BUFFER_READING,
+		       TASK_UNINTERRUPTIBLE);
+	if (!test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
+		ret = -EIO;
 	return ret;
 }
 
@@ -5533,7 +5675,6 @@ void write_extent_buffer_chunk_tree_uuid(struct extent_buffer *eb,
 {
 	char *kaddr;
 
-	WARN_ON(!PageUptodate(eb->pages[0]));
 	kaddr = page_address(eb->pages[0]);
 	memcpy(kaddr + offsetof(struct btrfs_header, chunk_tree_uuid), srcv,
 			BTRFS_FSID_SIZE);
@@ -5543,7 +5684,6 @@ void write_extent_buffer_fsid(struct extent_buffer *eb, const void *srcv)
 {
 	char *kaddr;
 
-	WARN_ON(!PageUptodate(eb->pages[0]));
 	kaddr = page_address(eb->pages[0]);
 	memcpy(kaddr + offsetof(struct btrfs_header, fsid), srcv,
 			BTRFS_FSID_SIZE);
@@ -5567,7 +5707,6 @@ void write_extent_buffer(struct extent_buffer *eb, const void *srcv,
 
 	while (len > 0) {
 		page = eb->pages[i];
-		WARN_ON(!PageUptodate(page));
 
 		cur = min(len, PAGE_SIZE - offset);
 		kaddr = page_address(page);
@@ -5597,7 +5736,6 @@ void memzero_extent_buffer(struct extent_buffer *eb, unsigned long start,
 
 	while (len > 0) {
 		page = eb->pages[i];
-		WARN_ON(!PageUptodate(page));
 
 		cur = min(len, PAGE_SIZE - offset);
 		kaddr = page_address(page);
@@ -5642,7 +5780,6 @@ void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src,
 
 	while (len > 0) {
 		page = dst->pages[i];
-		WARN_ON(!PageUptodate(page));
 
 		cur = min(len, (unsigned long)(PAGE_SIZE - offset));
 
@@ -5745,7 +5882,6 @@ int extent_buffer_test_bit(struct extent_buffer *eb, unsigned long start,
 
 	eb_bitmap_offset(eb, start, nr, &i, &offset);
 	page = eb->pages[i];
-	WARN_ON(!PageUptodate(page));
 	kaddr = page_address(page);
 	return 1U & (kaddr[offset] >> (nr & (BITS_PER_BYTE - 1)));
 }
@@ -5770,7 +5906,6 @@ void extent_buffer_bitmap_set(struct extent_buffer *eb, unsigned long start,
 
 	eb_bitmap_offset(eb, start, pos, &i, &offset);
 	page = eb->pages[i];
-	WARN_ON(!PageUptodate(page));
 	kaddr = page_address(page);
 
 	while (len >= bits_to_set) {
@@ -5781,7 +5916,6 @@ void extent_buffer_bitmap_set(struct extent_buffer *eb, unsigned long start,
 		if (++offset >= PAGE_SIZE && len > 0) {
 			offset = 0;
 			page = eb->pages[++i];
-			WARN_ON(!PageUptodate(page));
 			kaddr = page_address(page);
 		}
 	}
@@ -5812,7 +5946,6 @@ void extent_buffer_bitmap_clear(struct extent_buffer *eb, unsigned long start,
 
 	eb_bitmap_offset(eb, start, pos, &i, &offset);
 	page = eb->pages[i];
-	WARN_ON(!PageUptodate(page));
 	kaddr = page_address(page);
 
 	while (len >= bits_to_clear) {
@@ -5823,7 +5956,6 @@ void extent_buffer_bitmap_clear(struct extent_buffer *eb, unsigned long start,
 		if (++offset >= PAGE_SIZE && len > 0) {
 			offset = 0;
 			page = eb->pages[++i];
-			WARN_ON(!PageUptodate(page));
 			kaddr = page_address(page);
 		}
 	}
@@ -5864,7 +5996,7 @@ static void copy_pages(struct page *dst_page, struct page *src_page,
 void memcpy_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 			   unsigned long src_offset, unsigned long len)
 {
-	struct btrfs_fs_info *fs_info = dst->fs_info;
+	struct btrfs_fs_info *fs_info = dst->eb_info->fs_info;
 	size_t cur;
 	size_t dst_off_in_page;
 	size_t src_off_in_page;
@@ -5911,7 +6043,7 @@ void memcpy_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 			   unsigned long src_offset, unsigned long len)
 {
-	struct btrfs_fs_info *fs_info = dst->fs_info;
+	struct btrfs_fs_info *fs_info = dst->eb_info->fs_info;
 	size_t cur;
 	size_t dst_off_in_page;
 	size_t src_off_in_page;
@@ -5957,45 +6089,3 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 		len -= cur;
 	}
 }
-
-int try_release_extent_buffer(struct page *page)
-{
-	struct extent_buffer *eb;
-
-	/*
-	 * We need to make sure nobody is attaching this page to an eb right
-	 * now.
-	 */
-	spin_lock(&page->mapping->private_lock);
-	if (!PagePrivate(page)) {
-		spin_unlock(&page->mapping->private_lock);
-		return 1;
-	}
-
-	eb = (struct extent_buffer *)page->private;
-	BUG_ON(!eb);
-
-	/*
-	 * This is a little awful but should be ok, we need to make sure that
-	 * the eb doesn't disappear out from under us while we're looking at
-	 * this page.
-	 */
-	spin_lock(&eb->refs_lock);
-	if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
-		spin_unlock(&eb->refs_lock);
-		spin_unlock(&page->mapping->private_lock);
-		return 0;
-	}
-	spin_unlock(&page->mapping->private_lock);
-
-	/*
-	 * If tree ref isn't set then we know the ref on this eb is a real ref,
-	 * so just return, this page will likely be freed soon anyway.
-	 */
-	if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
-		spin_unlock(&eb->refs_lock);
-		return 0;
-	}
-
-	return release_extent_buffer(eb);
-}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 861dacb371c7..f18cbce1f2f1 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -47,6 +47,8 @@
 #define EXTENT_BUFFER_DUMMY 9
 #define EXTENT_BUFFER_IN_TREE 10
 #define EXTENT_BUFFER_WRITE_ERR 11    /* write IO error */
+#define EXTENT_BUFFER_MIXED_PAGES 12	/* the pages span multiple zones or numa nodes. */
+#define EXTENT_BUFFER_READING 13 /* currently reading this eb. */
 
 /* these are flags for __process_pages_contig */
 #define PAGE_UNLOCK		(1 << 0)
@@ -160,13 +162,25 @@ struct extent_state {
 #endif
 };
 
+struct btrfs_eb_info {
+	struct btrfs_fs_info *fs_info;
+	struct extent_io_tree io_tree;
+	struct extent_io_tree io_failure_tree;
+
+	/* Extent buffer radix tree */
+	spinlock_t buffer_lock;
+	struct radix_tree_root buffer_radix;
+	struct list_lru lru_list;
+	pgoff_t writeback_index;
+};
+
 #define INLINE_EXTENT_BUFFER_PAGES 16
 #define MAX_INLINE_EXTENT_BUFFER_SIZE (INLINE_EXTENT_BUFFER_PAGES * PAGE_SIZE)
 struct extent_buffer {
 	u64 start;
 	unsigned long len;
 	unsigned long bflags;
-	struct btrfs_fs_info *fs_info;
+	struct btrfs_eb_info *eb_info;
 	spinlock_t refs_lock;
 	atomic_t refs;
 	atomic_t io_pages;
@@ -201,6 +215,7 @@ struct extent_buffer {
 #ifdef CONFIG_BTRFS_DEBUG
 	struct list_head leak_list;
 #endif
+	struct list_head lru;
 };
 
 /*
@@ -408,8 +423,6 @@ int extent_writepages(struct extent_io_tree *tree,
 		      struct address_space *mapping,
 		      get_extent_t *get_extent,
 		      struct writeback_control *wbc);
-int btree_write_cache_pages(struct address_space *mapping,
-			    struct writeback_control *wbc);
 int extent_readpages(struct extent_io_tree *tree,
 		     struct address_space *mapping,
 		     struct list_head *pages, unsigned nr_pages,
@@ -420,21 +433,18 @@ void set_page_extent_mapped(struct page *page);
 
 struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 					  u64 start);
-struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
-						  u64 start, unsigned long len);
-struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
-						u64 start);
+struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_eb_info *eb_info,
+						u64 start, unsigned long len);
 struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src);
-struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
+struct extent_buffer *find_extent_buffer(struct btrfs_eb_info *eb_info,
 					 u64 start);
 void free_extent_buffer(struct extent_buffer *eb);
 void free_extent_buffer_stale(struct extent_buffer *eb);
 #define WAIT_NONE	0
 #define WAIT_COMPLETE	1
 #define WAIT_PAGE_LOCK	2
-int read_extent_buffer_pages(struct extent_io_tree *tree,
-			     struct extent_buffer *eb, int wait,
-			     get_extent_t *get_extent, int mirror_num);
+int read_extent_buffer_pages(struct extent_buffer *eb, int wait,
+			     int mirror_num);
 void wait_on_extent_buffer_writeback(struct extent_buffer *eb);
 
 static inline unsigned long num_extent_pages(u64 start, u64 len)
@@ -448,6 +458,11 @@ static inline void extent_buffer_get(struct extent_buffer *eb)
 	atomic_inc(&eb->refs);
 }
 
+static inline unsigned long eb_index(struct extent_buffer *eb)
+{
+	return eb->start >> PAGE_SHIFT;
+}
+
 int memcmp_extent_buffer(const struct extent_buffer *eb, const void *ptrv,
 			 unsigned long start, unsigned long len);
 void read_extent_buffer(const struct extent_buffer *eb, void *dst,
@@ -478,7 +493,7 @@ void extent_buffer_bitmap_set(struct extent_buffer *eb, unsigned long start,
 			      unsigned long pos, unsigned long len);
 void extent_buffer_bitmap_clear(struct extent_buffer *eb, unsigned long start,
 				unsigned long pos, unsigned long len);
-void clear_extent_buffer_dirty(struct extent_buffer *eb);
+int clear_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_dirty(struct extent_buffer *eb);
 void set_extent_buffer_uptodate(struct extent_buffer *eb);
 void clear_extent_buffer_uptodate(struct extent_buffer *eb);
@@ -512,6 +527,14 @@ int clean_io_failure(struct btrfs_fs_info *fs_info,
 void end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int repair_eb_io_failure(struct btrfs_fs_info *fs_info,
 			 struct extent_buffer *eb, int mirror_num);
+void btree_flush(struct btrfs_fs_info *fs_info);
+int btree_write_range(struct btrfs_fs_info *fs_info, u64 start, u64 end);
+int btree_wait_range(struct btrfs_fs_info *fs_info, u64 start, u64 end);
+long btrfs_free_ebs(struct super_block *sb, struct shrink_control *sc);
+long btrfs_nr_ebs(struct super_block *sb, struct shrink_control *sc);
+void btrfs_write_ebs(struct super_block *sb, struct writeback_control *wbc);
+void btrfs_invalidate_eb_info(struct btrfs_eb_info *eb_info);
+int btrfs_init_eb_info(struct btrfs_fs_info *fs_info);
 
 /*
  * When IO fails, either with EIO or csum verification fails, we
@@ -552,6 +575,6 @@ noinline u64 find_lock_delalloc_range(struct inode *inode,
 				      struct page *locked_page, u64 *start,
 				      u64 *end, u64 max_bytes);
 #endif
-struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
-					       u64 start);
+struct extent_buffer *alloc_test_extent_buffer(struct btrfs_eb_info *eb_info,
+					       u64 start, u32 nodesize);
 #endif
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 46b5632a7c6d..27bc64fb6d3e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1877,9 +1877,9 @@ static void btrfs_clear_bit_hook(void *private_data,
  * return 0 if page can be merged to bio
  * return error otherwise
  */
-int btrfs_merge_bio_hook(struct page *page, unsigned long offset,
-			 size_t size, struct bio *bio,
-			 unsigned long bio_flags)
+static int btrfs_merge_bio_hook(struct page *page, unsigned long offset,
+				size_t size, struct bio *bio,
+				unsigned long bio_flags)
 {
 	struct inode *inode = page->mapping->host;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
index 569205e651c7..f912c8166d94 100644
--- a/fs/btrfs/print-tree.c
+++ b/fs/btrfs/print-tree.c
@@ -102,6 +102,7 @@ static void print_extent_item(struct extent_buffer *eb, int slot, int type)
 	ptr = (unsigned long)iref;
 	end = (unsigned long)ei + item_size;
 	while (ptr < end) {
+		struct btrfs_fs_info *fs_info = eb->eb_info->fs_info;
 		iref = (struct btrfs_extent_inline_ref *)ptr;
 		type = btrfs_extent_inline_ref_type(eb, iref);
 		offset = btrfs_extent_inline_ref_offset(eb, iref);
@@ -116,9 +117,9 @@ static void print_extent_item(struct extent_buffer *eb, int slot, int type)
 			 * offset is supposed to be a tree block which
 			 * must be aligned to nodesize.
 			 */
-			if (!IS_ALIGNED(offset, eb->fs_info->nodesize))
+			if (!IS_ALIGNED(offset, fs_info->nodesize))
 				pr_info("\t\t\t(parent %llu is NOT ALIGNED to nodesize %llu)\n",
-					offset, (unsigned long long)eb->fs_info->nodesize);
+					offset, (unsigned long long)fs_info->nodesize);
 			break;
 		case BTRFS_EXTENT_DATA_REF_KEY:
 			dref = (struct btrfs_extent_data_ref *)(&iref->offset);
@@ -132,9 +133,9 @@ static void print_extent_item(struct extent_buffer *eb, int slot, int type)
 			 * offset is supposed to be a tree block which
 			 * must be aligned to nodesize.
 			 */
-			if (!IS_ALIGNED(offset, eb->fs_info->nodesize))
+			if (!IS_ALIGNED(offset, fs_info->nodesize))
 				pr_info("\t\t\t(parent %llu is NOT ALIGNED to nodesize %llu)\n",
-				     offset, (unsigned long long)eb->fs_info->nodesize);
+				     offset, (unsigned long long)fs_info->nodesize);
 			break;
 		default:
 			pr_cont("(extent %llu has INVALID ref type %d)\n",
@@ -199,7 +200,7 @@ void btrfs_print_leaf(struct extent_buffer *l)
 	if (!l)
 		return;
 
-	fs_info = l->fs_info;
+	fs_info = l->eb_info->fs_info;
 	nr = btrfs_header_nritems(l);
 
 	btrfs_info(fs_info, "leaf %llu total ptrs %d free space %d",
@@ -347,7 +348,7 @@ void btrfs_print_tree(struct extent_buffer *c)
 
 	if (!c)
 		return;
-	fs_info = c->fs_info;
+	fs_info = c->eb_info->fs_info;
 	nr = btrfs_header_nritems(c);
 	level = btrfs_header_level(c);
 	if (level == 0) {
diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index ab852b8e3e37..c6244890085f 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -210,7 +210,7 @@ static void __readahead_hook(struct btrfs_fs_info *fs_info,
 
 int btree_readahead_hook(struct extent_buffer *eb, int err)
 {
-	struct btrfs_fs_info *fs_info = eb->fs_info;
+	struct btrfs_fs_info *fs_info = eb->eb_info->fs_info;
 	int ret = 0;
 	struct reada_extent *re;
 
diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
index 3338407ef0f0..e40bd9a910dd 100644
--- a/fs/btrfs/root-tree.c
+++ b/fs/btrfs/root-tree.c
@@ -45,7 +45,7 @@ static void btrfs_read_root_item(struct extent_buffer *eb, int slot,
 	if (!need_reset && btrfs_root_generation(item)
 		!= btrfs_root_generation_v2(item)) {
 		if (btrfs_root_generation_v2(item) != 0) {
-			btrfs_warn(eb->fs_info,
+			btrfs_warn(eb->eb_info->fs_info,
 					"mismatching generation and generation_v2 found in root item. This root was probably mounted with an older kernel. Resetting all new fields.");
 		}
 		need_reset = 1;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8e74f7029e12..3b5fe791639d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1198,7 +1198,7 @@ int btrfs_sync_fs(struct super_block *sb, int wait)
 	trace_btrfs_sync_fs(fs_info, wait);
 
 	if (!wait) {
-		filemap_flush(fs_info->btree_inode->i_mapping);
+		btree_flush(fs_info);
 		return 0;
 	}
 
@@ -2284,19 +2284,22 @@ static int btrfs_show_devname(struct seq_file *m, struct dentry *root)
 }
 
 static const struct super_operations btrfs_super_ops = {
-	.drop_inode	= btrfs_drop_inode,
-	.evict_inode	= btrfs_evict_inode,
-	.put_super	= btrfs_put_super,
-	.sync_fs	= btrfs_sync_fs,
-	.show_options	= btrfs_show_options,
-	.show_devname	= btrfs_show_devname,
-	.write_inode	= btrfs_write_inode,
-	.alloc_inode	= btrfs_alloc_inode,
-	.destroy_inode	= btrfs_destroy_inode,
-	.statfs		= btrfs_statfs,
-	.remount_fs	= btrfs_remount,
-	.freeze_fs	= btrfs_freeze,
-	.unfreeze_fs	= btrfs_unfreeze,
+	.drop_inode		= btrfs_drop_inode,
+	.evict_inode		= btrfs_evict_inode,
+	.put_super		= btrfs_put_super,
+	.sync_fs		= btrfs_sync_fs,
+	.show_options		= btrfs_show_options,
+	.show_devname		= btrfs_show_devname,
+	.write_inode		= btrfs_write_inode,
+	.alloc_inode		= btrfs_alloc_inode,
+	.destroy_inode		= btrfs_destroy_inode,
+	.statfs			= btrfs_statfs,
+	.remount_fs		= btrfs_remount,
+	.freeze_fs		= btrfs_freeze,
+	.unfreeze_fs		= btrfs_unfreeze,
+	.nr_cached_objects	= btrfs_nr_ebs,
+	.free_cached_objects	= btrfs_free_ebs,
+	.write_metadata		= btrfs_write_ebs,
 };
 
 static const struct file_operations btrfs_ctl_fops = {
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index d3f25376a0f8..dbf05b2ab9ee 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -102,15 +102,32 @@ struct btrfs_fs_info *btrfs_alloc_dummy_fs_info(u32 nodesize, u32 sectorsize)
 
 	fs_info->nodesize = nodesize;
 	fs_info->sectorsize = sectorsize;
+	fs_info->eb_info = kzalloc(sizeof(struct btrfs_eb_info),
+				   GFP_KERNEL);
+	if (!fs_info->eb_info) {
+		kfree(fs_info->fs_devices);
+		kfree(fs_info->super_copy);
+		kfree(fs_info);
+		return NULL;
+	}
+
+	if (btrfs_init_eb_info(fs_info)) {
+		kfree(fs_info->eb_info);
+		kfree(fs_info->fs_devices);
+		kfree(fs_info->super_copy);
+		kfree(fs_info);
+		return NULL;
+	}
 
 	if (init_srcu_struct(&fs_info->subvol_srcu)) {
+		list_lru_destroy(&fs_info->eb_info->lru_list);
+		kfree(fs_info->eb_info);
 		kfree(fs_info->fs_devices);
 		kfree(fs_info->super_copy);
 		kfree(fs_info);
 		return NULL;
 	}
 
-	spin_lock_init(&fs_info->buffer_lock);
 	spin_lock_init(&fs_info->qgroup_lock);
 	spin_lock_init(&fs_info->qgroup_op_lock);
 	spin_lock_init(&fs_info->super_lock);
@@ -126,7 +143,6 @@ struct btrfs_fs_info *btrfs_alloc_dummy_fs_info(u32 nodesize, u32 sectorsize)
 	INIT_LIST_HEAD(&fs_info->dirty_qgroups);
 	INIT_LIST_HEAD(&fs_info->dead_roots);
 	INIT_LIST_HEAD(&fs_info->tree_mod_seq_list);
-	INIT_RADIX_TREE(&fs_info->buffer_radix, GFP_ATOMIC);
 	INIT_RADIX_TREE(&fs_info->fs_roots_radix, GFP_ATOMIC);
 	extent_io_tree_init(&fs_info->freed_extents[0], NULL);
 	extent_io_tree_init(&fs_info->freed_extents[1], NULL);
@@ -140,6 +156,7 @@ struct btrfs_fs_info *btrfs_alloc_dummy_fs_info(u32 nodesize, u32 sectorsize)
 
 void btrfs_free_dummy_fs_info(struct btrfs_fs_info *fs_info)
 {
+	struct btrfs_eb_info *eb_info;
 	struct radix_tree_iter iter;
 	void **slot;
 
@@ -150,13 +167,14 @@ void btrfs_free_dummy_fs_info(struct btrfs_fs_info *fs_info)
 			      &fs_info->fs_state)))
 		return;
 
+	eb_info = fs_info->eb_info;
 	test_mnt->mnt_sb->s_fs_info = NULL;
 
-	spin_lock(&fs_info->buffer_lock);
-	radix_tree_for_each_slot(slot, &fs_info->buffer_radix, &iter, 0) {
+	spin_lock_irq(&eb_info->buffer_lock);
+	radix_tree_for_each_slot(slot, &eb_info->buffer_radix, &iter, 0) {
 		struct extent_buffer *eb;
 
-		eb = radix_tree_deref_slot_protected(slot, &fs_info->buffer_lock);
+		eb = radix_tree_deref_slot_protected(slot, &eb_info->buffer_lock);
 		if (!eb)
 			continue;
 		/* Shouldn't happen but that kind of thinking creates CVE's */
@@ -166,15 +184,17 @@ void btrfs_free_dummy_fs_info(struct btrfs_fs_info *fs_info)
 			continue;
 		}
 		slot = radix_tree_iter_resume(slot, &iter);
-		spin_unlock(&fs_info->buffer_lock);
+		spin_unlock_irq(&eb_info->buffer_lock);
 		free_extent_buffer_stale(eb);
-		spin_lock(&fs_info->buffer_lock);
+		spin_lock_irq(&eb_info->buffer_lock);
 	}
-	spin_unlock(&fs_info->buffer_lock);
+	spin_unlock_irq(&eb_info->buffer_lock);
 
 	btrfs_free_qgroup_config(fs_info);
 	btrfs_free_fs_roots(fs_info);
 	cleanup_srcu_struct(&fs_info->subvol_srcu);
+	list_lru_destroy(&eb_info->lru_list);
+	kfree(fs_info->eb_info);
 	kfree(fs_info->super_copy);
 	kfree(fs_info->fs_devices);
 	kfree(fs_info);
diff --git a/fs/btrfs/tests/extent-buffer-tests.c b/fs/btrfs/tests/extent-buffer-tests.c
index b9142c614114..9a264b81a7b4 100644
--- a/fs/btrfs/tests/extent-buffer-tests.c
+++ b/fs/btrfs/tests/extent-buffer-tests.c
@@ -61,7 +61,8 @@ static int test_btrfs_split_item(u32 sectorsize, u32 nodesize)
 		goto out;
 	}
 
-	path->nodes[0] = eb = alloc_dummy_extent_buffer(fs_info, nodesize);
+	path->nodes[0] = eb = alloc_dummy_extent_buffer(fs_info->eb_info, 0,
+							nodesize);
 	if (!eb) {
 		test_msg("Could not allocate dummy buffer\n");
 		ret = -ENOMEM;
diff --git a/fs/btrfs/tests/extent-io-tests.c b/fs/btrfs/tests/extent-io-tests.c
index d06b1c931d05..600c01ddf0d0 100644
--- a/fs/btrfs/tests/extent-io-tests.c
+++ b/fs/btrfs/tests/extent-io-tests.c
@@ -406,7 +406,7 @@ static int test_eb_bitmaps(u32 sectorsize, u32 nodesize)
 		return -ENOMEM;
 	}
 
-	eb = __alloc_dummy_extent_buffer(fs_info, 0, len);
+	eb = alloc_dummy_extent_buffer(NULL, 0, len);
 	if (!eb) {
 		test_msg("Couldn't allocate test extent buffer\n");
 		kfree(bitmap);
@@ -419,7 +419,7 @@ static int test_eb_bitmaps(u32 sectorsize, u32 nodesize)
 
 	/* Do it over again with an extent buffer which isn't page-aligned. */
 	free_extent_buffer(eb);
-	eb = __alloc_dummy_extent_buffer(NULL, nodesize / 2, len);
+	eb = alloc_dummy_extent_buffer(NULL, nodesize / 2, len);
 	if (!eb) {
 		test_msg("Couldn't allocate test extent buffer\n");
 		kfree(bitmap);
diff --git a/fs/btrfs/tests/free-space-tree-tests.c b/fs/btrfs/tests/free-space-tree-tests.c
index 8444a018cca2..afba937f4365 100644
--- a/fs/btrfs/tests/free-space-tree-tests.c
+++ b/fs/btrfs/tests/free-space-tree-tests.c
@@ -474,7 +474,8 @@ static int run_test(test_func_t test_func, int bitmaps, u32 sectorsize,
 	root->fs_info->free_space_root = root;
 	root->fs_info->tree_root = root;
 
-	root->node = alloc_test_extent_buffer(root->fs_info, nodesize);
+	root->node = alloc_test_extent_buffer(fs_info->eb_info, nodesize,
+					      nodesize);
 	if (!root->node) {
 		test_msg("Couldn't allocate dummy buffer\n");
 		ret = -ENOMEM;
diff --git a/fs/btrfs/tests/inode-tests.c b/fs/btrfs/tests/inode-tests.c
index 11c77eafde00..486aa7fbfce2 100644
--- a/fs/btrfs/tests/inode-tests.c
+++ b/fs/btrfs/tests/inode-tests.c
@@ -261,7 +261,7 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		goto out;
 	}
 
-	root->node = alloc_dummy_extent_buffer(fs_info, nodesize);
+	root->node = alloc_dummy_extent_buffer(fs_info->eb_info, 0, nodesize);
 	if (!root->node) {
 		test_msg("Couldn't allocate dummy buffer\n");
 		goto out;
@@ -867,7 +867,7 @@ static int test_hole_first(u32 sectorsize, u32 nodesize)
 		goto out;
 	}
 
-	root->node = alloc_dummy_extent_buffer(fs_info, nodesize);
+	root->node = alloc_dummy_extent_buffer(fs_info->eb_info, 0, nodesize);
 	if (!root->node) {
 		test_msg("Couldn't allocate dummy buffer\n");
 		goto out;
diff --git a/fs/btrfs/tests/qgroup-tests.c b/fs/btrfs/tests/qgroup-tests.c
index 0f4ce970d195..0ba27cd9ae4c 100644
--- a/fs/btrfs/tests/qgroup-tests.c
+++ b/fs/btrfs/tests/qgroup-tests.c
@@ -486,7 +486,8 @@ int btrfs_test_qgroups(u32 sectorsize, u32 nodesize)
 	 * Can't use bytenr 0, some things freak out
 	 * *cough*backref walking code*cough*
 	 */
-	root->node = alloc_test_extent_buffer(root->fs_info, nodesize);
+	root->node = alloc_test_extent_buffer(fs_info->eb_info, nodesize,
+					      nodesize);
 	if (!root->node) {
 		test_msg("Couldn't allocate dummy buffer\n");
 		ret = -ENOMEM;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 9fed8c67b6e8..5df3963c413e 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -293,8 +293,7 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
 	INIT_LIST_HEAD(&cur_trans->deleted_bgs);
 	spin_lock_init(&cur_trans->dropped_roots_lock);
 	list_add_tail(&cur_trans->list, &fs_info->trans_list);
-	extent_io_tree_init(&cur_trans->dirty_pages,
-			     fs_info->btree_inode);
+	extent_io_tree_init(&cur_trans->dirty_pages, NULL);
 	fs_info->generation++;
 	cur_trans->transid = fs_info->generation;
 	fs_info->running_transaction = cur_trans;
@@ -944,12 +943,10 @@ int btrfs_write_marked_extents(struct btrfs_fs_info *fs_info,
 {
 	int err = 0;
 	int werr = 0;
-	struct address_space *mapping = fs_info->btree_inode->i_mapping;
 	struct extent_state *cached_state = NULL;
 	u64 start = 0;
 	u64 end;
 
-	atomic_inc(&BTRFS_I(fs_info->btree_inode)->sync_writers);
 	while (!find_first_extent_bit(dirty_pages, start, &start, &end,
 				      mark, &cached_state)) {
 		bool wait_writeback = false;
@@ -975,17 +972,16 @@ int btrfs_write_marked_extents(struct btrfs_fs_info *fs_info,
 			wait_writeback = true;
 		}
 		if (!err)
-			err = filemap_fdatawrite_range(mapping, start, end);
+			err = btree_write_range(fs_info, start, end);
 		if (err)
 			werr = err;
 		else if (wait_writeback)
-			werr = filemap_fdatawait_range(mapping, start, end);
+			werr = btree_wait_range(fs_info, start, end);
 		free_extent_state(cached_state);
 		cached_state = NULL;
 		cond_resched();
 		start = end + 1;
 	}
-	atomic_dec(&BTRFS_I(fs_info->btree_inode)->sync_writers);
 	return werr;
 }
 
@@ -1000,7 +996,6 @@ static int __btrfs_wait_marked_extents(struct btrfs_fs_info *fs_info,
 {
 	int err = 0;
 	int werr = 0;
-	struct address_space *mapping = fs_info->btree_inode->i_mapping;
 	struct extent_state *cached_state = NULL;
 	u64 start = 0;
 	u64 end;
@@ -1021,7 +1016,7 @@ static int __btrfs_wait_marked_extents(struct btrfs_fs_info *fs_info,
 		if (err == -ENOMEM)
 			err = 0;
 		if (!err)
-			err = filemap_fdatawait_range(mapping, start, end);
+			err = btree_wait_range(fs_info, start, end);
 		if (err)
 			werr = err;
 		free_extent_state(cached_state);
-- 
2.7.5


* [PATCH v3 09/10] btrfs: rework end io for extent buffer reads
  2017-12-11 21:55 [PATCH v3 00/11] Metadata specific accouting and dirty writeout Josef Bacik
                   ` (7 preceding siblings ...)
  2017-12-11 21:55 ` [PATCH v3 08/10] Btrfs: kill the btree_inode Josef Bacik
@ 2017-12-11 21:55 ` Josef Bacik
  2017-12-11 21:55 ` [PATCH v3 10/10] btrfs: add NR_METADATA_BYTES accounting Josef Bacik
  9 siblings, 0 replies; 31+ messages in thread
From: Josef Bacik @ 2017-12-11 21:55 UTC (permalink / raw)
  To: hannes, linux-mm, akpm, jack, linux-fsdevel, kernel-team, linux-btrfs
  Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

Now that the only things keeping an eb alive are io_pages and its
refcount, we need to hold the eb ref for the entire end io call so it
can't be removed out from underneath us.  The hooks also make no sense
for us now, so rework this to be cleaner.
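
As a rough userspace sketch of that pattern (the buf type and the
buf_get()/buf_put() helpers below are made-up names, not btrfs APIs):
the completion handler pins the buffer before touching it and drops the
pin only after its last access, so a concurrent release can't free the
buffer mid-callback.

/* Illustrative userspace model of "hold a ref across end io", not kernel code. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct buf {
	atomic_int refs;	/* lifetime refcount */
	atomic_int io_pages;	/* pages still in flight for this read */
	bool uptodate;
};

static void buf_get(struct buf *b)
{
	atomic_fetch_add(&b->refs, 1);
}

static void buf_put(struct buf *b)
{
	/* fetch_sub returns the old value: 1 means we held the last ref */
	if (atomic_fetch_sub(&b->refs, 1) == 1)
		free(b);
}

/* Called once per completed page, like the per-bvec loop in the end io handler. */
static void end_one_page(struct buf *b, bool ok)
{
	bool reads_done;

	buf_get(b);				/* pin for the whole callback */
	reads_done = atomic_fetch_sub(&b->io_pages, 1) == 1;
	if (reads_done)
		b->uptodate = ok;		/* simplified: real code validates every page */
	buf_put(b);				/* safe to drop: last touch was above */
}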

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/disk-io.c   | 63 ++++---------------------------------------------
 fs/btrfs/disk-io.h   |  1 +
 fs/btrfs/extent_io.c | 66 +++++++++++++++++++++++++++-------------------------
 3 files changed, 40 insertions(+), 90 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d9d69e181942..1a890f8c78c8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -755,33 +755,13 @@ static int check_node(struct btrfs_root *root, struct extent_buffer *node)
 	return ret;
 }
 
-static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
-				      u64 phy_offset, struct page *page,
-				      u64 start, u64 end, int mirror)
+int btrfs_extent_buffer_end_read(struct extent_buffer *eb, int mirror)
 {
+	struct btrfs_fs_info *fs_info = eb->eb_info->fs_info;
+	struct btrfs_root *root = fs_info->tree_root;
 	u64 found_start;
 	int found_level;
-	struct extent_buffer *eb;
-	struct btrfs_root *root;
-	struct btrfs_fs_info *fs_info;
 	int ret = 0;
-	int reads_done;
-
-	if (!page->private)
-		goto out;
-
-	eb = (struct extent_buffer *)page->private;
-
-	/* the pending IO might have been the only thing that kept this buffer
-	 * in memory.  Make sure we have a ref for all this other checks
-	 */
-	extent_buffer_get(eb);
-	fs_info = eb->eb_info->fs_info;
-	root = fs_info->tree_root;
-
-	reads_done = atomic_dec_and_test(&eb->io_pages);
-	if (!reads_done)
-		goto err;
 
 	eb->read_mirror = mirror;
 	if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
@@ -833,45 +813,14 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	if (!ret)
 		set_extent_buffer_uptodate(eb);
 err:
-	if (reads_done &&
-	    test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
+	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
 		btree_readahead_hook(eb, ret);
 
-	if (ret) {
-		/*
-		 * our io error hook is going to dec the io pages
-		 * again, we have to make sure it has something
-		 * to decrement.
-		 *
-		 * TODO: Kill this, we've re-arranged how this works now so we
-		 * don't need to do this io_pages dance.
-		 */
-		atomic_inc(&eb->io_pages);
+	if (ret)
 		clear_extent_buffer_uptodate(eb);
-	}
-	if (reads_done) {
-		clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
-		smp_mb__after_atomic();
-		wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
-	}
-	free_extent_buffer(eb);
-out:
 	return ret;
 }
 
-static int btree_io_failed_hook(struct page *page, int failed_mirror)
-{
-	struct extent_buffer *eb;
-
-	eb = (struct extent_buffer *)page->private;
-	set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
-	eb->read_mirror = failed_mirror;
-	atomic_dec(&eb->io_pages);
-	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
-		btree_readahead_hook(eb, -EIO);
-	return -EIO;	/* we fixed nothing */
-}
-
 static void end_workqueue_bio(struct bio *bio)
 {
 	struct btrfs_end_io_wq *end_io_wq = bio->bi_private;
@@ -4554,9 +4503,7 @@ static int btree_merge_bio_hook(struct page *page, unsigned long offset,
 static const struct extent_io_ops btree_extent_io_ops = {
 	/* mandatory callbacks */
 	.submit_bio_hook = btree_submit_bio_hook,
-	.readpage_end_io_hook = btree_readpage_end_io_hook,
 	.merge_bio_hook = btree_merge_bio_hook,
-	.readpage_io_failed_hook = btree_io_failed_hook,
 	.set_range_writeback = btrfs_set_range_writeback,
 	.tree_fs_info = btree_fs_info,
 
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 7f7c35d6347a..e1f4fef91547 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -152,6 +152,7 @@ int btree_lock_page_hook(struct page *page, void *data,
 int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags);
 int __init btrfs_end_io_wq_init(void);
 void btrfs_end_io_wq_exit(void);
+int btrfs_extent_buffer_end_read(struct extent_buffer *eb, int mirror);
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 void btrfs_init_lockdep(void);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bb10dc6f4e41..e11372455fb0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -20,6 +20,7 @@
 #include "locking.h"
 #include "rcu-string.h"
 #include "backref.h"
+#include "disk-io.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -5360,6 +5361,14 @@ int extent_buffer_uptodate(struct extent_buffer *eb)
 	return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
 }
 
+static void mark_eb_failed(struct extent_buffer *eb, int failed_mirror)
+{
+	set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
+	eb->read_mirror = failed_mirror;
+	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
+		btree_readahead_hook(eb, -EIO);
+}
+
 static void end_bio_extent_buffer_readpage(struct bio *bio)
 {
 	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
@@ -5368,12 +5377,13 @@ static void end_bio_extent_buffer_readpage(struct bio *bio)
 	u64 unlock_start = 0, unlock_len = 0;
 	int mirror_num = io_bio->mirror_num;
 	int uptodate = !bio->bi_status;
-	int i, ret;
+	int i;
 
 	bio_for_each_segment_all(bvec, bio, i) {
 		struct page *page = bvec->bv_page;
 		struct btrfs_eb_info *eb_info;
 		struct extent_buffer *eb;
+		int reads_done;
 
 		eb = (struct extent_buffer *)page->private;
 		if (WARN_ON(!eb))
@@ -5382,41 +5392,33 @@ static void end_bio_extent_buffer_readpage(struct bio *bio)
 		eb_info = eb->eb_info;
 		if (!tree)
 			tree = &eb_info->io_tree;
+		extent_buffer_get(eb);
+		reads_done = atomic_dec_and_test(&eb->io_pages);
 		if (uptodate) {
-			/*
-			 * btree_readpage_end_io_hook doesn't care about
-			 * start/end so just pass 0.  We'll kill this later.
-			 */
-			ret = tree->ops->readpage_end_io_hook(io_bio, 0,
-							      page, 0, 0,
-							      mirror_num);
-			if (ret) {
-				uptodate = 0;
-			} else {
-				u64 start = eb->start;
-				int c, num_pages;
-
-				num_pages = num_extent_pages(eb->start,
-							     eb->len);
-				for (c = 0; c < num_pages; c++) {
-					if (eb->pages[c] == page)
-						break;
-					start += PAGE_SIZE;
-				}
-				clean_io_failure(eb_info->fs_info,
-						 &eb_info->io_failure_tree,
-						 tree, start, page, 0, 0);
+			u64 start = eb->start;
+			int c, num_pages;
+
+			num_pages = num_extent_pages(eb->start,
+						     eb->len);
+			for (c = 0; c < num_pages; c++) {
+				if (eb->pages[c] == page)
+					break;
+				start += PAGE_SIZE;
 			}
+			clean_io_failure(eb_info->fs_info,
+					 &eb_info->io_failure_tree,
+					 tree, start, page, 0, 0);
 		}
-		/*
-		 * We never fix anything in btree_io_failed_hook.
-		 *
-		 * TODO: rework the io failed hook to not assume we can fix
-		 * anything.
-		 */
+		if (reads_done && btrfs_extent_buffer_end_read(eb, mirror_num))
+			uptodate = 0;
 		if (!uptodate)
-			tree->ops->readpage_io_failed_hook(page, mirror_num);
-
+			mark_eb_failed(eb, mirror_num);
+		if (reads_done) {
+			clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
+			smp_mb__after_atomic();
+			wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
+		}
+		free_extent_buffer(eb);
 		if (unlock_start == 0) {
 			unlock_start = eb->start;
 			unlock_len = PAGE_SIZE;
-- 
2.7.5


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 10/10] btrfs: add NR_METADATA_BYTES accounting
  2017-12-11 21:55 [PATCH v3 00/11] Metadata specific accouting and dirty writeout Josef Bacik
                   ` (8 preceding siblings ...)
  2017-12-11 21:55 ` [PATCH v3 09/10] btrfs: rework end io for extent buffer reads Josef Bacik
@ 2017-12-11 21:55 ` Josef Bacik
  9 siblings, 0 replies; 31+ messages in thread
From: Josef Bacik @ 2017-12-11 21:55 UTC (permalink / raw)
  To: hannes, linux-mm, akpm, jack, linux-fsdevel, kernel-team, linux-btrfs
  Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

Now that we have these counters, account for the private pages we
allocate in NR_METADATA_BYTES.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent_io.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e11372455fb0..7536352f424d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4802,6 +4802,8 @@ static void btrfs_release_extent_buffer_page(struct extent_buffer *eb)
 		ClearPagePrivate(page);
 		set_page_private(page, 0);
 
+		mod_node_page_state(page_pgdat(page), NR_METADATA_BYTES,
+				    -(long)PAGE_SIZE);
 		/* Once for the page private. */
 		put_page(page);
 
@@ -5081,6 +5083,8 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 			goto free_eb;
 		}
 		attach_extent_buffer_page(eb, p);
+		mod_node_page_state(page_pgdat(p), NR_METADATA_BYTES,
+				    PAGE_SIZE);
 		eb->pages[i] = p;
 	}
 again:
-- 
2.7.5


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2017-12-11 21:55 ` [PATCH v3 06/10] writeback: introduce super_operations->write_metadata Josef Bacik
@ 2017-12-11 23:36   ` Dave Chinner
  2017-12-12 18:05     ` Josef Bacik
  2017-12-19 12:21   ` Jan Kara
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2017-12-11 23:36 UTC (permalink / raw)
  To: Josef Bacik
  Cc: hannes, linux-mm, akpm, jack, linux-fsdevel, kernel-team,
	linux-btrfs, Josef Bacik

On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> From: Josef Bacik <jbacik@fb.com>
> 
> Now that we have metadata counters in the VM, we need to provide a way to kick
> writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
> allows file systems to deal with writing back any dirty metadata we need based
> on the writeback needs of the system.  Since there is no inode to key off of we
> need a list in the bdi for dirty super blocks to be added.  From there we can
> find any dirty sb's on the bdi we are currently doing writeback on and call into
> their ->write_metadata callback.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Tejun Heo <tj@kernel.org>
> ---
>  fs/fs-writeback.c                | 72 ++++++++++++++++++++++++++++++++++++----
>  fs/super.c                       |  6 ++++
>  include/linux/backing-dev-defs.h |  2 ++
>  include/linux/fs.h               |  4 +++
>  mm/backing-dev.c                 |  2 ++
>  5 files changed, 80 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 987448ed7698..fba703dff678 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
>  	return pages;
>  }
>  
> +static long writeback_sb_metadata(struct super_block *sb,
> +				  struct bdi_writeback *wb,
> +				  struct wb_writeback_work *work)
> +{
> +	struct writeback_control wbc = {
> +		.sync_mode		= work->sync_mode,
> +		.tagged_writepages	= work->tagged_writepages,
> +		.for_kupdate		= work->for_kupdate,
> +		.for_background		= work->for_background,
> +		.for_sync		= work->for_sync,
> +		.range_cyclic		= work->range_cyclic,
> +		.range_start		= 0,
> +		.range_end		= LLONG_MAX,
> +	};
> +	long write_chunk;
> +
> +	write_chunk = writeback_chunk_size(wb, work);
> +	wbc.nr_to_write = write_chunk;
> +	sb->s_op->write_metadata(sb, &wbc);
> +	work->nr_pages -= write_chunk - wbc.nr_to_write;
> +
> +	return write_chunk - wbc.nr_to_write;

Ok, writeback_chunk_size() returns a page count. We've already gone
through the "metadata is not page sized" dance on the dirty
accounting side, so how are we supposed to use pages to account for
metadata writeback?

And, from what I can tell, if work->sync_mode = WB_SYNC_ALL or
work->tagged_writepages is set, this will basically tell us to flush
the entire dirty metadata cache because write_chunk will get set to
LONG_MAX.
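
For reference, the chunk sizing this feeds off looks roughly like this
(paraphrased sketch, not the verbatim upstream helper; the _sketch suffix
and the simplified background branch are approximations):

static long writeback_chunk_size_sketch(struct bdi_writeback *wb,
					struct wb_writeback_work *work)
{
	/* sync(2)-style and tagged writeback ask for "everything" */
	if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
		return LONG_MAX;

	/* background/kupdate writeback gets a bounded, bandwidth-derived chunk */
	return min_t(long, wb->avg_write_bandwidth / 2, work->nr_pages);
}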

IOWs, this would appear to me to change sync() behaviour quite
dramatically on filesystems where ->write_metadata is implemented.
That is, instead of leaving all the metadata dirty in memory and
just forcing the journal to stable storage, filesystems will be told
to also write back all their dirty metadata before sync() returns,
even though it is not necessary to provide correct sync()
semantics....

Mind you, writeback invocation is so convoluted now that I could easily
be misinterpreting this code, but it does seem to me like this
code is going to have some unintended behaviours....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2017-12-11 23:36   ` Dave Chinner
@ 2017-12-12 18:05     ` Josef Bacik
  2017-12-12 22:20       ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: Josef Bacik @ 2017-12-12 18:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Josef Bacik, hannes, linux-mm, akpm, jack, linux-fsdevel,
	kernel-team, linux-btrfs, Josef Bacik

On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > From: Josef Bacik <jbacik@fb.com>
> > 
> > Now that we have metadata counters in the VM, we need to provide a way to kick
> > writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
> > allows file systems to deal with writing back any dirty metadata we need based
> > on the writeback needs of the system.  Since there is no inode to key off of we
> > need a list in the bdi for dirty super blocks to be added.  From there we can
> > find any dirty sb's on the bdi we are currently doing writeback on and call into
> > their ->write_metadata callback.
> > 
> > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > Reviewed-by: Jan Kara <jack@suse.cz>
> > Reviewed-by: Tejun Heo <tj@kernel.org>
> > ---
> >  fs/fs-writeback.c                | 72 ++++++++++++++++++++++++++++++++++++----
> >  fs/super.c                       |  6 ++++
> >  include/linux/backing-dev-defs.h |  2 ++
> >  include/linux/fs.h               |  4 +++
> >  mm/backing-dev.c                 |  2 ++
> >  5 files changed, 80 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 987448ed7698..fba703dff678 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
> >  	return pages;
> >  }
> >  
> > +static long writeback_sb_metadata(struct super_block *sb,
> > +				  struct bdi_writeback *wb,
> > +				  struct wb_writeback_work *work)
> > +{
> > +	struct writeback_control wbc = {
> > +		.sync_mode		= work->sync_mode,
> > +		.tagged_writepages	= work->tagged_writepages,
> > +		.for_kupdate		= work->for_kupdate,
> > +		.for_background		= work->for_background,
> > +		.for_sync		= work->for_sync,
> > +		.range_cyclic		= work->range_cyclic,
> > +		.range_start		= 0,
> > +		.range_end		= LLONG_MAX,
> > +	};
> > +	long write_chunk;
> > +
> > +	write_chunk = writeback_chunk_size(wb, work);
> > +	wbc.nr_to_write = write_chunk;
> > +	sb->s_op->write_metadata(sb, &wbc);
> > +	work->nr_pages -= write_chunk - wbc.nr_to_write;
> > +
> > +	return write_chunk - wbc.nr_to_write;
> 
> Ok, writeback_chunk_size() returns a page count. We've already gone
> through the "metadata is not page sized" dance on the dirty
> accounting side, so how are we supposed to use pages to account for
> metadata writeback?
> 

This is just one of those things that's going to be slightly shitty.  It's the
same for memory reclaim: all of those places use pages, so we just take
METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough.
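
For illustration, the conversion amounts to something like this (sketch
only; node_dirty_metadata_pages() is a made-up helper name, and
NR_METADATA_DIRTY_BYTES is the node counter added earlier in the series):

static unsigned long node_dirty_metadata_pages(struct pglist_data *pgdat)
{
	/* byte-based metadata counter, rounded down to whole pages */
	return node_page_state(pgdat, NR_METADATA_DIRTY_BYTES) >> PAGE_SHIFT;
}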

> And, from what I can tell, if work->sync_mode = WB_SYNC_ALL or
> work->tagged_writepages is set, this will basically tell us to flush
> the entire dirty metadata cache because write_chunk will get set to
> LONG_MAX.
> 
> IOWs, this would appear to me to change sync() behaviour quite
> dramatically on filesystems where ->write_metadata is implemented.
> That is, instead of leaving all the metadata dirty in memory and
> just forcing the journal to stable storage, filesystems will be told
> to also write back all their dirty metadata before sync() returns,
> even though it is not necessary to provide correct sync()
> semantics....

Well for btrfs that's exactly what we have currently since it's just backed by
an inode.  Obviously this is different for journaled fs'es, but I assumed that
in your case you would either not use this part of the infrastructure or simply
ignore WB_SYNC_ALL and use WB_SYNC_NONE as a way to be nice under memory
pressure or whatever.

> 
> Mind you, writeback invocation is so convoluted now I could easily
> be mis-interpretting this code, but it does seem to me like this
> code is going to have some unintended behaviours....
> 

I don't think so, because right now this behavior is exactly what btrfs has
currently with its inode setup.  I didn't really think the journaled use case
through since you guys are already rate limited by the journal.  If you want
to start using this stuff, what would you like to see done instead?  Thanks,

Josef


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2017-12-12 18:05     ` Josef Bacik
@ 2017-12-12 22:20       ` Dave Chinner
  2017-12-12 23:59         ` Josef Bacik
  2017-12-19 12:07         ` Jan Kara
  0 siblings, 2 replies; 31+ messages in thread
From: Dave Chinner @ 2017-12-12 22:20 UTC (permalink / raw)
  To: Josef Bacik
  Cc: hannes, linux-mm, akpm, jack, linux-fsdevel, kernel-team,
	linux-btrfs, Josef Bacik

On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote:
> On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > > From: Josef Bacik <jbacik@fb.com>
> > > 
> > > Now that we have metadata counters in the VM, we need to provide a way to kick
> > > writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
> > > allows file systems to deal with writing back any dirty metadata we need based
> > > on the writeback needs of the system.  Since there is no inode to key off of we
> > > need a list in the bdi for dirty super blocks to be added.  From there we can
> > > find any dirty sb's on the bdi we are currently doing writeback on and call into
> > > their ->write_metadata callback.
> > > 
> > > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > Reviewed-by: Tejun Heo <tj@kernel.org>
> > > ---
> > >  fs/fs-writeback.c                | 72 ++++++++++++++++++++++++++++++++++++----
> > >  fs/super.c                       |  6 ++++
> > >  include/linux/backing-dev-defs.h |  2 ++
> > >  include/linux/fs.h               |  4 +++
> > >  mm/backing-dev.c                 |  2 ++
> > >  5 files changed, 80 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 987448ed7698..fba703dff678 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
> > >  	return pages;
> > >  }
> > >  
> > > +static long writeback_sb_metadata(struct super_block *sb,
> > > +				  struct bdi_writeback *wb,
> > > +				  struct wb_writeback_work *work)
> > > +{
> > > +	struct writeback_control wbc = {
> > > +		.sync_mode		= work->sync_mode,
> > > +		.tagged_writepages	= work->tagged_writepages,
> > > +		.for_kupdate		= work->for_kupdate,
> > > +		.for_background		= work->for_background,
> > > +		.for_sync		= work->for_sync,
> > > +		.range_cyclic		= work->range_cyclic,
> > > +		.range_start		= 0,
> > > +		.range_end		= LLONG_MAX,
> > > +	};
> > > +	long write_chunk;
> > > +
> > > +	write_chunk = writeback_chunk_size(wb, work);
> > > +	wbc.nr_to_write = write_chunk;
> > > +	sb->s_op->write_metadata(sb, &wbc);
> > > +	work->nr_pages -= write_chunk - wbc.nr_to_write;
> > > +
> > > +	return write_chunk - wbc.nr_to_write;
> > 
> > Ok, writeback_chunk_size() returns a page count. We've already gone
> > through the "metadata is not page sized" dance on the dirty
> > accounting side, so how are we supposed to use pages to account for
> > metadata writeback?
> > 
> 
> This is just one of those things that's going to be slightly shitty.  It's the
> same for memory reclaim, all of those places use pages so we just take
> METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough.

Ok, so that isn't exactly easy to deal with, because all our
metadata writeback is based on log sequence number targets (i.e. how
far to push the tail of the log towards the current head). We've
got no real idea how pages/bytes actually map to an LSN target
because while we might account a full buffer as dirty for memory
reclaim purposes (up to 64k in size), we might have only logged 128
bytes of it.

i.e. if we are asked to push 2MB of metadata and we treat that as
2MB of log space (i.e. push target of tail LSN + 2MB) we could have
logged several tens of megabytes of dirty metadata in that LSN
range and have to flush it all. OTOH, if the buffers are fully
logged, then that same target might only flush 1.5MB of metadata
once all the log overhead is taken into account.

So there's a fairly large disconnect between the "flush N bytes of
metadata" API and the "push to a target LSN" that XFS uses for
flushing metadata in aged order. I'm betting that extN and other
filesystems might have similar mismatches with their journal
flushing...

> > And, from what I can tell, if work->sync_mode = WB_SYNC_ALL or
> > work->tagged_writepages is set, this will basically tell us to flush
> > the entire dirty metadata cache because write_chunk will get set to
> > LONG_MAX.
> > 
> > IOWs, this would appear to me to change sync() behaviour quite
> > dramatically on filesystems where ->write_metadata is implemented.
> > That is, instead of leaving all the metadata dirty in memory and
> > just forcing the journal to stable storage, filesystems will be told
> > to also write back all their dirty metadata before sync() returns,
> > even though it is not necessary to provide correct sync()
> > semantics....
> 
> Well for btrfs that's exactly what we have currently since it's just backed by
> an inode.

Hmmmm. That explains a lot.

Seems to me that btrfs is the odd one out here, so I'm not sure a
mechanism primarily designed for btrfs is going to work
generically....

> Obviously this is different for journaled fs'es, but I assumed that
> in your case you would either not use this part of the infrastructure or simply
> ignore WB_SYNC_ALL and use WB_SYNC_NONE as a way to be nice under memory
> pressure or whatever.

I don't think that designing an interface on the assumption that other
filesystems will abuse it until it works for them is a great process
to follow...

> > Mind you, writeback invocation is so convoluted now I could easily
> > be mis-interpretting this code, but it does seem to me like this
> > code is going to have some unintended behaviours....
> > 
> 
> I don't think so, because right now this behavior is exactly what btrfs has
> currently with it's inode setup.  I didn't really think the journaled use case
> out since you guys are already rate limited by the journal.

We are?

XFS is rate limited by metadata writeback, not journal throughput.
Yes, journal space is limited by the metadata writeback rate, but
journalling itself is not the bottleneck.

> If you would want
> to start using this stuff what would you like to see done instead?  Thanks,

If this is all about reacting to memory pressure, then writeback is
not the mechanism that should drive this writeback. Reacting to
memory pressure is what shrinkers are for, and XFS already triggers
metadata writeback on memory pressure. Hence I don't see how this
writeback mechanism would help us if we have to abuse it to infer
"memory pressure is occurring".

What I was hoping for was this interface to be a mechanism to drive
periodic background metadata writeback from the VFS so that when we
start to run out of memory the VFS has already started to ramp up
the rate of metadata writeback so we don't have huge amounts of dirty
metadata to write back during superblock shrinker based reclaim.

i.e. it works more like dirty background data writeback: it gets the
amount of work to do from the amount of dirty metadata associated
with the bdi and doesn't actually do anything when operations like
sync() are run, because there isn't a need to write back metadata in
those operations.

IOWs, treating metadata like it's one great big data inode doesn't
seem to me to be the right abstraction to use for this - in most
filesystems it's a bunch of objects with a complex dependency tree
and unknown write ordering, not an inode full of data that can be
sequentially written.

Maybe we need multiple ops with well defined behaviours. e.g.
->writeback_metadata() for background writeback, ->sync_metadata() for
sync based operations. That way different filesystems can ignore the
parts they don't need simply by not implementing those operations,
and the writeback code doesn't need to try to cater for all
operations through the one op. The writeback code should be cleaner,
the filesystem code should be cleaner, and we can tailor the work
guidelines for each operation separately so there's less mismatch
between what writeback is asking and how filesystems track dirty
metadata...
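
As a purely hypothetical sketch of that split (the operation names are
from the suggestion above; the signatures are guesses, not something this
series defines):

struct super_operations {
	/* ... existing operations ... */

	/* background, best-effort cleaning driven by dirty/memory pressure */
	long (*writeback_metadata)(struct super_block *sb,
				   struct writeback_control *wbc);

	/* integrity writeback for sync(2)/syncfs(2); may be a no-op */
	int (*sync_metadata)(struct super_block *sb, int wait);
};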

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2017-12-12 22:20       ` Dave Chinner
@ 2017-12-12 23:59         ` Josef Bacik
  2017-12-19 12:07         ` Jan Kara
  1 sibling, 0 replies; 31+ messages in thread
From: Josef Bacik @ 2017-12-12 23:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Josef Bacik, hannes, linux-mm, akpm, jack, linux-fsdevel,
	kernel-team, linux-btrfs, Josef Bacik

On Wed, Dec 13, 2017 at 09:20:04AM +1100, Dave Chinner wrote:
> On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote:
> > On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> > > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > > > From: Josef Bacik <jbacik@fb.com>
> > > > 
> > > > Now that we have metadata counters in the VM, we need to provide a way to kick
> > > > writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
> > > > allows file systems to deal with writing back any dirty metadata we need based
> > > > on the writeback needs of the system.  Since there is no inode to key off of we
> > > > need a list in the bdi for dirty super blocks to be added.  From there we can
> > > > find any dirty sb's on the bdi we are currently doing writeback on and call into
> > > > their ->write_metadata callback.
> > > > 
> > > > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > > Reviewed-by: Tejun Heo <tj@kernel.org>
> > > > ---
> > > >  fs/fs-writeback.c                | 72 ++++++++++++++++++++++++++++++++++++----
> > > >  fs/super.c                       |  6 ++++
> > > >  include/linux/backing-dev-defs.h |  2 ++
> > > >  include/linux/fs.h               |  4 +++
> > > >  mm/backing-dev.c                 |  2 ++
> > > >  5 files changed, 80 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > index 987448ed7698..fba703dff678 100644
> > > > --- a/fs/fs-writeback.c
> > > > +++ b/fs/fs-writeback.c
> > > > @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
> > > >  	return pages;
> > > >  }
> > > >  
> > > > +static long writeback_sb_metadata(struct super_block *sb,
> > > > +				  struct bdi_writeback *wb,
> > > > +				  struct wb_writeback_work *work)
> > > > +{
> > > > +	struct writeback_control wbc = {
> > > > +		.sync_mode		= work->sync_mode,
> > > > +		.tagged_writepages	= work->tagged_writepages,
> > > > +		.for_kupdate		= work->for_kupdate,
> > > > +		.for_background		= work->for_background,
> > > > +		.for_sync		= work->for_sync,
> > > > +		.range_cyclic		= work->range_cyclic,
> > > > +		.range_start		= 0,
> > > > +		.range_end		= LLONG_MAX,
> > > > +	};
> > > > +	long write_chunk;
> > > > +
> > > > +	write_chunk = writeback_chunk_size(wb, work);
> > > > +	wbc.nr_to_write = write_chunk;
> > > > +	sb->s_op->write_metadata(sb, &wbc);
> > > > +	work->nr_pages -= write_chunk - wbc.nr_to_write;
> > > > +
> > > > +	return write_chunk - wbc.nr_to_write;
> > > 
> > > Ok, writeback_chunk_size() returns a page count. We've already gone
> > > through the "metadata is not page sized" dance on the dirty
> > > accounting side, so how are we supposed to use pages to account for
> > > metadata writeback?
> > > 
> > 
> > This is just one of those things that's going to be slightly shitty.  It's the
> > same for memory reclaim, all of those places use pages so we just take
> > METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough.
> 
> Ok, so that isn't exactly easy to deal with, because all our
> metadata writeback is based on log sequence number targets (i.e. how
> far to push the tail of the log towards the current head). We've
> actually got no idea how pages/bytes actually map to a LSN target
> because while we might account a full buffer as dirty for memory
> reclaim purposes (up to 64k in size), we might have only logged 128
> bytes of it.
> 
> i.e. if we are asked to push 2MB of metadata and we treat that as
> 2MB of log space (i.e. push target of tail LSN + 2MB) we could have
> logged several tens of megabytes of dirty metadata in that LSN
> range and have to flush it all. OTOH, if the buffers are fully
> logged, then that same target might only flush 1.5MB of metadata
> once all the log overhead is taken into account.
> 
> So there's a fairly large disconnect between the "flush N bytes of
> metadata" API and the "push to a target LSN" that XFS uses for
> flushing metadata in aged order. I'm betting that extN and otehr
> filesystems might have similar mismatches with their journal
> flushing...
> 

If there's not a correlation then there's no sense in xfs using this.  If btrfs
has 16GiB of dirty metadata then that's exactly how much we have to write out,
which is what this is designed for.

> > > And, from what I can tell, if work->sync_mode = WB_SYNC_ALL or
> > > work->tagged_writepages is set, this will basically tell us to flush
> > > the entire dirty metadata cache because write_chunk will get set to
> > > LONG_MAX.
> > > 
> > > IOWs, this would appear to me to change sync() behaviour quite
> > > dramatically on filesystems where ->write_metadata is implemented.
> > > That is, instead of leaving all the metadata dirty in memory and
> > > just forcing the journal to stable storage, filesystems will be told
> > > to also write back all their dirty metadata before sync() returns,
> > > even though it is not necessary to provide correct sync()
> > > semantics....
> > 
> > Well for btrfs that's exactly what we have currently since it's just backed by
> > an inode.
> 
> Hmmmm. That explains a lot.
> 
> Seems to me that btrfs is the odd one out here, so I'm not sure a
> mechanism primarily designed for btrfs is going to work
> generically....
> 

The generic stuff is very lightweight specifically because we don't need a whole
lot, just a way to get all of the balance_dirty_pages() logic without duplicating
it internally in btrfs.

> > Obviously this is different for journaled fs'es, but I assumed that
> > in your case you would either not use this part of the infrastructure or simply
> > ignore WB_SYNC_ALL and use WB_SYNC_NONE as a way to be nice under memory
> > pressure or whatever.
> 
> I don't think that designing an interface with the assumption other
> filesystems will abuse it until it works for them is a great process
> to follow...
> 

Again, I'm not really designing it with your stuff in mind.  ext* and xfs already
handle dirty metadata fine; btrfs is the odd man out, so we need a little extra.
It would be cool to at least use the accounting part of it in xfs and ext* so we
could see how much of the system memory is in use by metadata, but I imagine the
dirty metadata tracking is going to be mostly useless for you guys.

> > > Mind you, writeback invocation is so convoluted now I could easily
> > > be mis-interpretting this code, but it does seem to me like this
> > > code is going to have some unintended behaviours....
> > > 
> > 
> > I don't think so, because right now this behavior is exactly what btrfs has
> > currently with it's inode setup.  I didn't really think the journaled use case
> > out since you guys are already rate limited by the journal.
> 
> We are?
> 
> XFS is rate limited by metadata writeback, not journal throughput.
> Yes, journal space is limited by the metadata writeback rate, but
> journalling itself is not the bottleneck.
> 

I'm not saying "rate limited" as in xfs sucks because journal.  I'm saying your
dirty metadata foot print is limited by your journal size, so you aren't going
to have gigabytes of dirty metadata sitting around needing to be flushed (I
assume, I'm going on previous discussions with you about this.)

> > If you would want
> > to start using this stuff what would you like to see done instead?  Thanks,
> 
> If this is all about reacting to memory pressure, then writeback is
> not the mechanism that should drive this writeback. Reacting to
> memory pressure is what shrinkers are for, and XFS already triggers
> metadata writeback on memory pressure. Hence I don't see how this
> writeback mechanism would help us if we have to abuse it to infer
> "memory pressure occurring"
> 

This isn't reacting to memory pressure; it's reacting to dirty pressure.  Btrfs
is only limited by system memory for its metadata, so I want all the benefits of
years of work on balance_dirty_pages() without having to duplicate the effort
internally in btrfs.  This is how I'm going about doing it.

> What I was hoping for was this interface to be a mechanism to drive
> periodic background metadata writeback from the VFS so that when we
> start to run out of memory the VFS has already started to ramp up
> the rate of metadata writeback so we don't have huge amounts of dirty
> metadata to write back during superblock shrinker based reclaim.
> 
> i.e. it works more like dirty background data writeback, get's the
> amount of work to do from the amount of dirty metadata associated
> with the bdi and doesn't actually do anything when operations like
> sync() are run because there isn't a need to writeback metadata in
> those operations.
> 
> IOWs, treating metadata like it's one great big data inode doesn't
> seem to me to be the right abstraction to use for this - in most
> fileystems it's a bunch of objects with a complex dependency tree
> and unknown write ordering, not an inode full of data that can be
> sequentially written.

But this isn't dictating what to write out, just how much we need to undirty.
How the fs wants to write stuff out is completely up to the file system; I
specifically made it as generic as possible so we could do whatever we felt like
with the numbers we got.  This work gives you exactly what you want: a callback
when balance_dirty_pages() is telling us that, hey, we have too much dirty memory
in use on the system.

> 
> Maybe we need multiple ops with well defined behaviours. e.g.
> ->writeback_metadata() for background writeback, ->sync_metadata() for
> sync based operations. That way different filesystems can ignore the
> parts they don't need simply by not implementing those operations,
> and the writeback code doesn't need to try to cater for all
> operations through the one op. The writeback code should be cleaner,
> the filesystem code should be cleaner, and we can tailor the work
> guidelines for each operation separately so there's less mismatch
> between what writeback is asking and how filesystems track dirty
> metadata...
> 

So I don't mind adding new things or changing things around, but this is just getting
us the same behavior that I mentioned before, only at a higher level.  We want
the balance_dirty_pages() stuff to be able to dip into metadata writeback via
the method that I've implemented here.  Basically do data writeback, and if we
didn't do enough do some metadata writeback.  With what you've proposed we would
keep that and instead of doing ->write_metadata() when we have SYNC_ALL we'd
just do ->sync_metadata() and let the fs figure out what to do, which is what I
was suggesting fs'es do.

The problem is there's a disconnect between what btrfs and ext4 do with their
dirty metadata and what xfs does.  Ext4 is going to log entire blocks into the
journal, so there's a 1:1 mapping of dirty metadata to what's going to be
written out.  So telling it "write x pages" worth of metadata is going to be
somewhat useful.  That's not the case for xfs, and I'm not sure what a good way
to accommodate you would look like.  My first thought is a ratio, but man, trying
to change how we dealt with slab ratios made me want to suck-start a shotgun, so I
don't really want to do something like that again.

How would you prefer to get information to act on from upper layers?  Personally
I feel like the generic writeback stuff already gives us enough info and we can
figure out what we want to do from there.  Thanks,

Josef

ps: I'm going to try and stay up for a while so we can hash this out now instead
of switching back and forth through our timezones.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 03/10] lib: add a __fprop_add_percpu_max
  2017-12-11 21:55 ` [PATCH v3 03/10] lib: add a __fprop_add_percpu_max Josef Bacik
@ 2017-12-19  7:25   ` Jan Kara
  0 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2017-12-19  7:25 UTC (permalink / raw)
  To: Josef Bacik
  Cc: hannes, linux-mm, akpm, jack, linux-fsdevel, kernel-team,
	linux-btrfs, Josef Bacik

On Mon 11-12-17 16:55:28, Josef Bacik wrote:
> From: Josef Bacik <jbacik@fb.com>
> 
> This helper allows us to add an arbitrary amount to the fprop
> structures.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/flex_proportions.h | 11 +++++++++--
>  lib/flex_proportions.c           |  9 +++++----
>  2 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/flex_proportions.h b/include/linux/flex_proportions.h
> index 0d348e011a6e..9f88684bf0a0 100644
> --- a/include/linux/flex_proportions.h
> +++ b/include/linux/flex_proportions.h
> @@ -83,8 +83,8 @@ struct fprop_local_percpu {
>  int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp);
>  void fprop_local_destroy_percpu(struct fprop_local_percpu *pl);
>  void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl);
> -void __fprop_inc_percpu_max(struct fprop_global *p, struct fprop_local_percpu *pl,
> -			    int max_frac);
> +void __fprop_add_percpu_max(struct fprop_global *p, struct fprop_local_percpu *pl,
> +			    unsigned long nr, int max_frac);
>  void fprop_fraction_percpu(struct fprop_global *p,
>  	struct fprop_local_percpu *pl, unsigned long *numerator,
>  	unsigned long *denominator);
> @@ -99,4 +99,11 @@ void fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl)
>  	local_irq_restore(flags);
>  }
>  
> +static inline
> +void __fprop_inc_percpu_max(struct fprop_global *p,
> +			    struct fprop_local_percpu *pl, int max_frac)
> +{
> +	__fprop_add_percpu_max(p, pl, 1, max_frac);
> +}
> +
>  #endif
> diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c
> index 2cc1f94e03a1..31003989d34a 100644
> --- a/lib/flex_proportions.c
> +++ b/lib/flex_proportions.c
> @@ -255,8 +255,9 @@ void fprop_fraction_percpu(struct fprop_global *p,
>   * Like __fprop_inc_percpu() except that event is counted only if the given
>   * type has fraction smaller than @max_frac/FPROP_FRAC_BASE
>   */
> -void __fprop_inc_percpu_max(struct fprop_global *p,
> -			    struct fprop_local_percpu *pl, int max_frac)
> +void __fprop_add_percpu_max(struct fprop_global *p,
> +			    struct fprop_local_percpu *pl, unsigned long nr,
> +			    int max_frac)
>  {
>  	if (unlikely(max_frac < FPROP_FRAC_BASE)) {
>  		unsigned long numerator, denominator;
> @@ -267,6 +268,6 @@ void __fprop_inc_percpu_max(struct fprop_global *p,
>  			return;
>  	} else
>  		fprop_reflect_period_percpu(p, pl);
> -	percpu_counter_add_batch(&pl->events, 1, PROP_BATCH);
> -	percpu_counter_add(&p->events, 1);
> +	percpu_counter_add_batch(&pl->events, nr, PROP_BATCH);
> +	percpu_counter_add(&p->events, nr);
>  }
> -- 
> 2.7.5
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 05/10] writeback: add counters for metadata usage
  2017-12-11 21:55 ` [PATCH v3 05/10] writeback: add counters for metadata usage Josef Bacik
@ 2017-12-19  7:52   ` Jan Kara
  0 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2017-12-19  7:52 UTC (permalink / raw)
  To: Josef Bacik
  Cc: hannes, linux-mm, akpm, jack, linux-fsdevel, kernel-team,
	linux-btrfs, Josef Bacik

On Mon 11-12-17 16:55:30, Josef Bacik wrote:
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 356a814e7c8e..48de090f5a07 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -179,9 +179,19 @@ enum node_stat_item {
>  	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
>  	NR_DIRTIED,		/* page dirtyings since bootup */
>  	NR_WRITTEN,		/* page writings since bootup */
> +	NR_METADATA_DIRTY_BYTES,	/* Metadata dirty bytes */
> +	NR_METADATA_WRITEBACK_BYTES,	/* Metadata writeback bytes */
> +	NR_METADATA_BYTES,	/* total metadata bytes in use. */
>  	NR_VM_NODE_STAT_ITEMS
>  };

Please add here something like: "Warning: These counters will overflow on
32-bit machines if we ever have more than 2G of metadata on such a machine!
But the kernel won't be able to address that easily either, so it should not
be a real issue."

> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4bb13e72ac97..0b32e6381590 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -273,6 +273,13 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
>  
>  	t = __this_cpu_read(pcp->stat_threshold);
>  
> +	/*
> +	 * If this item is counted in bytes and not pages adjust the threshold
> +	 * accordingly.
> +	 */
> +	if (is_bytes_node_stat(item))
> +		t <<= PAGE_SHIFT;
> +
>  	if (unlikely(x > t || x < -t)) {
>  		node_page_state_add(x, pgdat, item);
>  		x = 0;

This is wrong. The per-cpu counters are stored in s8 so you cannot just
bump the threshold. I would just ignore the PCP counters for metadata (I
don't think they are that critical for performance for metadata tracking)
and add to the comment I've suggested above: "Also note that updates to
these counters won't be batched using per-cpu counters since the updates
are generally larger than the counter threshold."
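
Concretely, that might look something like this in __mod_node_page_state()
(rough sketch paraphrasing the existing function; is_bytes_node_stat() is
the helper introduced by the patch under review):

void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
			   long delta)
{
	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
	s8 __percpu *p = pcp->vm_node_stat_diff + item;
	long x;
	long t;

	/*
	 * Byte-based items: deltas are usually far larger than the s8
	 * per-cpu threshold, so skip the per-cpu batching entirely and
	 * fold them straight into the node counter.
	 */
	if (is_bytes_node_stat(item)) {
		node_page_state_add(delta, pgdat, item);
		return;
	}

	x = delta + __this_cpu_read(*p);
	t = __this_cpu_read(pcp->stat_threshold);

	if (unlikely(x > t || x < -t)) {
		node_page_state_add(x, pgdat, item);
		x = 0;
	}
	__this_cpu_write(*p, x);
}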

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2017-12-12 22:20       ` Dave Chinner
  2017-12-12 23:59         ` Josef Bacik
@ 2017-12-19 12:07         ` Jan Kara
  2017-12-19 21:35           ` Dave Chinner
  1 sibling, 1 reply; 31+ messages in thread
From: Jan Kara @ 2017-12-19 12:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Josef Bacik, hannes, linux-mm, akpm, jack, linux-fsdevel,
	kernel-team, linux-btrfs, Josef Bacik

On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote:
> > On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> > > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > > > From: Josef Bacik <jbacik@fb.com>
> > > > 
> > > > Now that we have metadata counters in the VM, we need to provide a way to kick
> > > > writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
> > > > allows file systems to deal with writing back any dirty metadata we need based
> > > > on the writeback needs of the system.  Since there is no inode to key off of we
> > > > need a list in the bdi for dirty super blocks to be added.  From there we can
> > > > find any dirty sb's on the bdi we are currently doing writeback on and call into
> > > > their ->write_metadata callback.
> > > > 
> > > > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > > Reviewed-by: Tejun Heo <tj@kernel.org>
> > > > ---
> > > >  fs/fs-writeback.c                | 72 ++++++++++++++++++++++++++++++++++++----
> > > >  fs/super.c                       |  6 ++++
> > > >  include/linux/backing-dev-defs.h |  2 ++
> > > >  include/linux/fs.h               |  4 +++
> > > >  mm/backing-dev.c                 |  2 ++
> > > >  5 files changed, 80 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > index 987448ed7698..fba703dff678 100644
> > > > --- a/fs/fs-writeback.c
> > > > +++ b/fs/fs-writeback.c
> > > > @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
> > > >  	return pages;
> > > >  }
> > > >  
> > > > +static long writeback_sb_metadata(struct super_block *sb,
> > > > +				  struct bdi_writeback *wb,
> > > > +				  struct wb_writeback_work *work)
> > > > +{
> > > > +	struct writeback_control wbc = {
> > > > +		.sync_mode		= work->sync_mode,
> > > > +		.tagged_writepages	= work->tagged_writepages,
> > > > +		.for_kupdate		= work->for_kupdate,
> > > > +		.for_background		= work->for_background,
> > > > +		.for_sync		= work->for_sync,
> > > > +		.range_cyclic		= work->range_cyclic,
> > > > +		.range_start		= 0,
> > > > +		.range_end		= LLONG_MAX,
> > > > +	};
> > > > +	long write_chunk;
> > > > +
> > > > +	write_chunk = writeback_chunk_size(wb, work);
> > > > +	wbc.nr_to_write = write_chunk;
> > > > +	sb->s_op->write_metadata(sb, &wbc);
> > > > +	work->nr_pages -= write_chunk - wbc.nr_to_write;
> > > > +
> > > > +	return write_chunk - wbc.nr_to_write;
> > > 
> > > Ok, writeback_chunk_size() returns a page count. We've already gone
> > > through the "metadata is not page sized" dance on the dirty
> > > accounting side, so how are we supposed to use pages to account for
> > > metadata writeback?
> > > 
> > 
> > This is just one of those things that's going to be slightly shitty.  It's the
> > same for memory reclaim, all of those places use pages so we just take
> > METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough.
> 
> Ok, so that isn't exactly easy to deal with, because all our
> metadata writeback is based on log sequence number targets (i.e. how
> far to push the tail of the log towards the current head). We've
> actually got no idea how pages/bytes actually map to a LSN target
> because while we might account a full buffer as dirty for memory
> reclaim purposes (up to 64k in size), we might have only logged 128
> bytes of it.
> 
> i.e. if we are asked to push 2MB of metadata and we treat that as
> 2MB of log space (i.e. push target of tail LSN + 2MB) we could have
> logged several tens of megabytes of dirty metadata in that LSN
> range and have to flush it all. OTOH, if the buffers are fully
> logged, then that same target might only flush 1.5MB of metadata
> once all the log overhead is taken into account.
> 
> So there's a fairly large disconnect between the "flush N bytes of
> metadata" API and the "push to a target LSN" that XFS uses for
> flushing metadata in aged order. I'm betting that extN and otehr
> filesystems might have similar mismatches with their journal
> flushing...

Well, for ext4 it isn't as bad since we do full block logging only. So if
we are asked to flush N pages, we can easily translate that to a number of fs
blocks and flush that many from the oldest transaction.

Couldn't XFS just track how much it has cleaned (from a reclaim perspective)
when pushing items from the AIL (which is what I suppose XFS would do in
response to a metadata writeback request) and just stop pushing when it has
cleaned as much as it was asked to?
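
For the full-block-logging case the translation is trivial, something like
(sketch only; the helper name is made up):

static unsigned long wbc_pages_to_fs_blocks(struct super_block *sb,
					    struct writeback_control *wbc)
{
	/* one page covers PAGE_SIZE / blocksize fully-logged fs blocks */
	return (unsigned long)wbc->nr_to_write <<
		(PAGE_SHIFT - sb->s_blocksize_bits);
}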

> > > And, from what I can tell, if work->sync_mode = WB_SYNC_ALL or
> > > work->tagged_writepages is set, this will basically tell us to flush
> > > the entire dirty metadata cache because write_chunk will get set to
> > > LONG_MAX.
> > > 
> > > IOWs, this would appear to me to change sync() behaviour quite
> > > dramatically on filesystems where ->write_metadata is implemented.
> > > That is, instead of leaving all the metadata dirty in memory and
> > > just forcing the journal to stable storage, filesystems will be told
> > > to also write back all their dirty metadata before sync() returns,
> > > even though it is not necessary to provide correct sync()
> > > semantics....
> > 
> > Well for btrfs that's exactly what we have currently since it's just backed by
> > an inode.
> 
> Hmmmm. That explains a lot.
> 
> Seems to me that btrfs is the odd one out here, so I'm not sure a
> mechanism primarily designed for btrfs is going to work
> generically....

For the record, ext4 is currently behaving the way btrfs is as well (at least in
practice). We expose committed but not yet checkpointed transaction data
(the equivalent of XFS's AIL, AFAICT) in the block device's page cache as dirty
buffers. Thus calls like sync_blockdev(), which are made as part of
sync(2), will result in flushing all of those metadata buffers to disk
(although, as you rightly point out, it is not strictly necessary for
the correctness of sync(2)).

> > Obviously this is different for journaled fs'es, but I assumed that
> > in your case you would either not use this part of the infrastructure or simply
> > ignore WB_SYNC_ALL and use WB_SYNC_NONE as a way to be nice under memory
> > pressure or whatever.
> 
> I don't think that designing an interface with the assumption other
> filesystems will abuse it until it works for them is a great process
> to follow...
> 
> > > Mind you, writeback invocation is so convoluted now I could easily
> > > be mis-interpretting this code, but it does seem to me like this
> > > code is going to have some unintended behaviours....
> > > 
> > 
> > I don't think so, because right now this behavior is exactly what btrfs has
> > currently with it's inode setup.  I didn't really think the journaled use case
> > out since you guys are already rate limited by the journal.
> 
> We are?
> 
> XFS is rate limited by metadata writeback, not journal throughput.
> Yes, journal space is limited by the metadata writeback rate, but
> journalling itself is not the bottleneck.
> 
> > If you would want
> > to start using this stuff what would you like to see done instead?  Thanks,
> 
> If this is all about reacting to memory pressure, then writeback is
> not the mechanism that should drive this writeback. Reacting to
> memory pressure is what shrinkers are for, and XFS already triggers
> metadata writeback on memory pressure. Hence I don't see how this
> writeback mechanism would help us if we have to abuse it to infer
> "memory pressure occurring"
> 
> What I was hoping for was this interface to be a mechanism to drive
> periodic background metadata writeback from the VFS so that when we
> start to run out of memory the VFS has already started to ramp up
> the rate of metadata writeback so we don't have huge amounts of dirty
> metadata to write back during superblock shrinker based reclaim.

Yeah, that's where I'd like this patch set to end up as well, and I believe
Josef does too - he wants to prevent too much dirty metadata from
accumulating after all, and background writeback of metadata is a part of
that.

> i.e. it works more like dirty background data writeback, get's the
> amount of work to do from the amount of dirty metadata associated
> with the bdi and doesn't actually do anything when operations like
> sync() are run because there isn't a need to writeback metadata in
> those operations.
> 
> IOWs, treating metadata like it's one great big data inode doesn't
> seem to me to be the right abstraction to use for this - in most
> fileystems it's a bunch of objects with a complex dependency tree
> and unknown write ordering, not an inode full of data that can be
> sequentially written.
> 
> Maybe we need multiple ops with well defined behaviours. e.g.
> ->writeback_metadata() for background writeback, ->sync_metadata() for
> sync based operations. That way different filesystems can ignore the
> parts they don't need simply by not implementing those operations,
> and the writeback code doesn't need to try to cater for all
> operations through the one op. The writeback code should be cleaner,
> the filesystem code should be cleaner, and we can tailor the work
> guidelines for each operation separately so there's less mismatch
> between what writeback is asking and how filesystems track dirty
> metadata...

I agree that writeback for memory cleaning and writeback for data integrity
are two very different things, especially for metadata. In fact, for data
integrity writeback we already have the ->sync_fs operation, so the
functionality gets duplicated there. What we could do is call
->write_metadata from writeback_sb_inodes() only when
work->for_kupdate or work->for_background is set. That way ->write_metadata
would be called only for memory cleaning purposes.
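
In terms of the writeback_sb_inodes() hunk in the patch, that amounts to
something like (sketch of the suggestion, untested):

	if (!done && sb->s_op->write_metadata &&
	    (work->for_kupdate || work->for_background)) {
		/* memory-cleaning writeback only; sync(2) goes via ->sync_fs */
		spin_unlock(&wb->list_lock);
		wrote += writeback_sb_metadata(sb, wb, work);
		spin_lock(&wb->list_lock);
	}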

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2017-12-11 21:55 ` [PATCH v3 06/10] writeback: introduce super_operations->write_metadata Josef Bacik
  2017-12-11 23:36   ` Dave Chinner
@ 2017-12-19 12:21   ` Jan Kara
  1 sibling, 0 replies; 31+ messages in thread
From: Jan Kara @ 2017-12-19 12:21 UTC (permalink / raw)
  To: Josef Bacik
  Cc: hannes, linux-mm, akpm, jack, linux-fsdevel, kernel-team,
	linux-btrfs, Josef Bacik

On Mon 11-12-17 16:55:31, Josef Bacik wrote:
> @@ -1621,12 +1647,18 @@ static long writeback_sb_inodes(struct super_block *sb,
>  		 * background threshold and other termination conditions.
>  		 */
>  		if (wrote) {
> -			if (time_is_before_jiffies(start_time + HZ / 10UL))
> -				break;
> -			if (work->nr_pages <= 0)
> +			if (time_is_before_jiffies(start_time + HZ / 10UL) ||
> +			    work->nr_pages <= 0) {
> +				done = true;
>  				break;
> +			}
>  		}
>  	}
> +	if (!done && sb->s_op->write_metadata) {
> +		spin_unlock(&wb->list_lock);
> +		wrote += writeback_sb_metadata(sb, wb, work);
> +		spin_lock(&wb->list_lock);
> +	}
>  	return wrote;
>  }

One thing I've noticed when looking at this patch again: this duplicates the
metadata writeback done in __writeback_inodes_wb(). So you probably need a
new helper function like writeback_sb() that calls writeback_sb_inodes()
and handles the metadata writeback, and call that from wb_writeback() instead
of writeback_sb_inodes() directly.
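
A minimal sketch of such a helper (the writeback_sb() name is just the
suggestion above; the details are obviously up for grabs):

static long writeback_sb(struct super_block *sb, struct bdi_writeback *wb,
			 struct wb_writeback_work *work)
{
	long wrote = writeback_sb_inodes(sb, wb, work);

	/* one shared metadata pass instead of duplicating it per caller */
	if (work->nr_pages > 0 && sb->s_op->write_metadata) {
		spin_unlock(&wb->list_lock);
		wrote += writeback_sb_metadata(sb, wb, work);
		spin_lock(&wb->list_lock);
	}
	return wrote;
}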

								Honza

> @@ -1635,6 +1667,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
>  {
>  	unsigned long start_time = jiffies;
>  	long wrote = 0;
> +	bool done = false;
>  
>  	while (!list_empty(&wb->b_io)) {
>  		struct inode *inode = wb_inode(wb->b_io.prev);
> @@ -1654,12 +1687,39 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
>  
>  		/* refer to the same tests at the end of writeback_sb_inodes */
>  		if (wrote) {
> -			if (time_is_before_jiffies(start_time + HZ / 10UL))
> -				break;
> -			if (work->nr_pages <= 0)
> +			if (time_is_before_jiffies(start_time + HZ / 10UL) ||
> +			    work->nr_pages <= 0) {
> +				done = true;
>  				break;
> +			}
>  		}
>  	}
> +
> +	if (!done && wb_stat(wb, WB_METADATA_DIRTY_BYTES)) {
> +		LIST_HEAD(list);
> +
> +		spin_unlock(&wb->list_lock);
> +		spin_lock(&wb->bdi->sb_list_lock);
> +		list_splice_init(&wb->bdi->dirty_sb_list, &list);
> +		while (!list_empty(&list)) {
> +			struct super_block *sb;
> +
> +			sb = list_first_entry(&list, struct super_block,
> +					      s_bdi_dirty_list);
> +			list_move_tail(&sb->s_bdi_dirty_list,
> +				       &wb->bdi->dirty_sb_list);
> +			if (!sb->s_op->write_metadata)
> +				continue;
> +			if (!trylock_super(sb))
> +				continue;
> +			spin_unlock(&wb->bdi->sb_list_lock);
> +			wrote += writeback_sb_metadata(sb, wb, work);
> +			spin_lock(&wb->bdi->sb_list_lock);
> +			up_read(&sb->s_umount);
> +		}
> +		spin_unlock(&wb->bdi->sb_list_lock);
> +		spin_lock(&wb->list_lock);
> +	}
>  	/* Leave any unwritten inodes on b_io */
>  	return wrote;
>  }
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2017-12-19 12:07         ` Jan Kara
@ 2017-12-19 21:35           ` Dave Chinner
  2017-12-20 14:30             ` Jan Kara
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2017-12-19 21:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Josef Bacik, hannes, linux-mm, akpm, linux-fsdevel, kernel-team,
	linux-btrfs, Josef Bacik

On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote:
> > > On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> > > > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > > This is just one of those things that's going to be slightly shitty.  It's the
> > > same for memory reclaim, all of those places use pages so we just take
> > > METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough.
> > 
> > Ok, so that isn't exactly easy to deal with, because all our
> > metadata writeback is based on log sequence number targets (i.e. how
> > far to push the tail of the log towards the current head). We've
> > actually got no idea how pages/bytes actually map to a LSN target
> > because while we might account a full buffer as dirty for memory
> > reclaim purposes (up to 64k in size), we might have only logged 128
> > bytes of it.
> > 
> > i.e. if we are asked to push 2MB of metadata and we treat that as
> > 2MB of log space (i.e. push target of tail LSN + 2MB) we could have
> > logged several tens of megabytes of dirty metadata in that LSN
> > range and have to flush it all. OTOH, if the buffers are fully
> > logged, then that same target might only flush 1.5MB of metadata
> > once all the log overhead is taken into account.
> > 
> > So there's a fairly large disconnect between the "flush N bytes of
> > metadata" API and the "push to a target LSN" that XFS uses for
> > flushing metadata in aged order. I'm betting that extN and otehr
> > filesystems might have similar mismatches with their journal
> > flushing...
> 
> Well, for ext4 it isn't as bad since we do full block logging only. So if
> we are asked to flush N pages, we can easily translate that to number of fs
> blocks and flush that many from the oldest transaction.
> 
> Couldn't XFS just track how much it has cleaned (from reclaim perspective)
> when pushing items from AIL (which is what I suppose XFS would do in
> response to metadata writeback request) and just stop pushing when it has
> cleaned as much as it was asked to?

If only it were that simple :/

To start with, flushing the dirty objects (such as inodes) to their
backing buffers does not mean the object is clean once the
writeback completes. XFS has decoupled in-memory objects from their
backing buffers via logical object logging rather than physical
buffer logging, so an object can be modified and dirtied again while
its inode buffer is being written back. Hence if we just count
things like "buffer size written" it's not actually a correct
account of the amount of dirty metadata we've cleaned. If we don't
get that right, it'll result in accounting errors and incorrect
behaviour.

The bigger problem, however, is that we have no channel to return
flush information from the AIL pushing to whatever caller asked for
the push. Pushing metadata is completely decoupled from every other
subsystem. i.e. the caller asked the xfsaild to push to a specific
LSN (e.g. to free up a certain amount of log space for new
transactions), and *nothing* has any idea of how much metadata we'll
need to write to push the tail of the log to that LSN.

It's also completely asynchronous - there's no mechanism for waiting
on a push to a specific LSN. Anything that needs a specific amount
of log space to be available waits in ordered ticket queues on the
log tail moving forwards. The only interface that has access to
the log tail ticket waiting is the transaction reservation
subsystem, which cannot be used during metadata writeback because
that's a guaranteed deadlock vector....

Saying "just account for bytes written" assumes directly connected,
synchronous dispatch metadata writeback infrastructure which we
simply don't have in XFS. "just clean this many bytes" doesn't
really fit at all because we have no way of referencing that to the
distance we need to push the tail of the log. An interface that
tells us "clean this percentage of dirty metadata" is much more
useful because we can map that easily to a log sequence number
based push target....
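
For illustration only (not part of the posted series, and ignoring that
real XFS LSNs are cycle/block pairs rather than plain offsets), the
arithmetic behind such a percentage-to-push-target mapping could look
roughly like this:

	/*
	 * Illustrative sketch: given the current log tail and head
	 * (treated as simple byte offsets for this example) and a
	 * request to clean pct% of the dirty metadata, compute how
	 * far to push the tail of the log.
	 */
	static u64 example_metadata_push_target(u64 tail, u64 head,
						unsigned int pct)
	{
		u64 dirty_range = head - tail;

		/* push the tail forward by pct% of the dirty range */
		return tail + div_u64(dirty_range * pct, 100);
	}

The real mapping would of course go through the existing AIL push
machinery; the point is only that a fractional request translates
directly into a push distance.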

> > IOWs, treating metadata like it's one great big data inode doesn't
> > seem to me to be the right abstraction to use for this - in most
> > fileystems it's a bunch of objects with a complex dependency tree
> > and unknown write ordering, not an inode full of data that can be
> > sequentially written.
> > 
> > Maybe we need multiple ops with well defined behaviours. e.g.
> > ->writeback_metadata() for background writeback, ->sync_metadata() for
> > sync based operations. That way different filesystems can ignore the
> > parts they don't need simply by not implementing those operations,
> > and the writeback code doesn't need to try to cater for all
> > operations through the one op. The writeback code should be cleaner,
> > the filesystem code should be cleaner, and we can tailor the work
> > guidelines for each operation separately so there's less mismatch
> > between what writeback is asking and how filesystems track dirty
> > metadata...
> 
> I agree that writeback for memory cleaning and writeback for data integrity
> are two very different things especially for metadata. In fact for data
> integrity writeback we already have ->sync_fs operation so there the
> functionality gets duplicated. What we could do is that in
> writeback_sb_inodes() we'd call ->write_metadata only when
> work->for_kupdate or work->for_background is set. That way ->write_metadata
> would be called only for memory cleaning purposes.

That makes sense, but I still think we need a better indication of
how much writeback we need to do than just "writeback this chunk of
pages". That "writeback a chunk" interface is necessary to share
writeback bandwidth across numerous data inodes so that we don't
starve any one inode of writeback bandwidth. That's unnecessary for
metadata writeback on a superblock - we don't need to share that
bandwidth around hundreds or thousands of inodes. What we actually
need to know is how much writeback we need to do as a total of all
the dirty metadata on the superblock.

Sure, that's not ideal for btrfs and maybe ext4, but we can write a
simple generic helper that converts "flush X percent of dirty
metadata" to a page/byte chunk as the current code does. Doing it
this way allows filesystems to completely internalise the accounting
that needs to be done, rather than trying to hack around a
writeback accounting interface with large impedance mismatches to
how the filesystem accounts for dirty metadata and/or tracks
writeback progress.
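
As a rough sketch of that helper (hypothetical code, not from the
series; WB_METADATA_DIRTY_BYTES is the per-wb counter added earlier in
the patch set, wb_stat() is the existing accessor):

	/*
	 * Hypothetical helper: convert a "flush pct% of your dirty
	 * metadata" request into a page count that a filesystem
	 * tracking dirty metadata page-by-page (btrfs, ext4) can feed
	 * to its existing writeback path.
	 */
	static long example_metadata_pct_to_pages(struct bdi_writeback *wb,
						  unsigned int pct)
	{
		u64 dirty = wb_stat(wb, WB_METADATA_DIRTY_BYTES);

		return div_u64(dirty * pct, 100) >> PAGE_SHIFT;
	}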

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2017-12-19 21:35           ` Dave Chinner
@ 2017-12-20 14:30             ` Jan Kara
  2018-01-02 16:13               ` Josef Bacik
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2017-12-20 14:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Josef Bacik, hannes, linux-mm, akpm, linux-fsdevel,
	kernel-team, linux-btrfs, Josef Bacik

On Wed 20-12-17 08:35:05, Dave Chinner wrote:
> On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> > On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > > On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote:
> > > > On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> > > > > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > > > This is just one of those things that's going to be slightly shitty.  It's the
> > > > same for memory reclaim, all of those places use pages so we just take
> > > > METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough.
> > > 
> > > Ok, so that isn't exactly easy to deal with, because all our
> > > metadata writeback is based on log sequence number targets (i.e. how
> > > far to push the tail of the log towards the current head). We've
> > > actually got no idea how pages/bytes actually map to a LSN target
> > > because while we might account a full buffer as dirty for memory
> > > reclaim purposes (up to 64k in size), we might have only logged 128
> > > bytes of it.
> > > 
> > > i.e. if we are asked to push 2MB of metadata and we treat that as
> > > 2MB of log space (i.e. push target of tail LSN + 2MB) we could have
> > > logged several tens of megabytes of dirty metadata in that LSN
> > > range and have to flush it all. OTOH, if the buffers are fully
> > > logged, then that same target might only flush 1.5MB of metadata
> > > once all the log overhead is taken into account.
> > > 
> > > So there's a fairly large disconnect between the "flush N bytes of
> > > metadata" API and the "push to a target LSN" that XFS uses for
> > > flushing metadata in aged order. I'm betting that extN and otehr
> > > filesystems might have similar mismatches with their journal
> > > flushing...
> > 
> > Well, for ext4 it isn't as bad since we do full block logging only. So if
> > we are asked to flush N pages, we can easily translate that to number of fs
> > blocks and flush that many from the oldest transaction.
> > 
> > Couldn't XFS just track how much it has cleaned (from reclaim perspective)
> > when pushing items from AIL (which is what I suppose XFS would do in
> > response to metadata writeback request) and just stop pushing when it has
> > cleaned as much as it was asked to?
> 
> If only it were that simple :/
> 
> To start with, flushing the dirty objects (such as inodes) to their
> backing buffers do not mean the the object is clean once the
> writeback completes. XFS has decoupled in-memory objects with
> logical object logging rather than logging physical buffers, and
> so can be modified and dirtied while the inode buffer
> is being written back. Hence if we just count things like "buffer
> size written" it's not actually a correct account of the amount of
> dirty metadata we've cleaned. If we don't get that right, it'll
> result in accounting errors and incorrect behaviour.
> 
> The bigger problem, however, is that we have no channel to return
> flush information from the AIL pushing to whatever caller asked for
> the push. Pushing metadata is completely decoupled from every other
> subsystem. i.e. the caller asked the xfsaild to push to a specific
> LSN (e.g. to free up a certain amount of log space for new
> transactions), and *nothing* has any idea of how much metadata we'll
> need to write to push the tail of the log to that LSN.
> 
> It's also completely asynchronous - there's no mechanism for waiting
> on a push to a specific LSN. Anything that needs a specific amount
> of log space to be available waits in ordered ticket queues on the
> log tail moving forwards. The only interfaces that have access to
> the log tail ticket waiting is the transaction reservation
> subsystem, which cannot be used during metadata writeback because
> that's a guaranteed deadlock vector....
> 
> Saying "just account for bytes written" assumes directly connected,
> synchronous dispatch metadata writeback infrastructure which we
> simply don't have in XFS. "just clean this many bytes" doesn't
> really fit at all because we have no way of referencing that to the
> distance we need to push the tail of the log. An interface that
> tells us "clean this percentage of dirty metadata" is much more
> useful because we can map that easily to a log sequence number
> based push target....

OK, understood.

> > > IOWs, treating metadata like it's one great big data inode doesn't
> > > seem to me to be the right abstraction to use for this - in most
> > > fileystems it's a bunch of objects with a complex dependency tree
> > > and unknown write ordering, not an inode full of data that can be
> > > sequentially written.
> > > 
> > > Maybe we need multiple ops with well defined behaviours. e.g.
> > > ->writeback_metadata() for background writeback, ->sync_metadata() for
> > > sync based operations. That way different filesystems can ignore the
> > > parts they don't need simply by not implementing those operations,
> > > and the writeback code doesn't need to try to cater for all
> > > operations through the one op. The writeback code should be cleaner,
> > > the filesystem code should be cleaner, and we can tailor the work
> > > guidelines for each operation separately so there's less mismatch
> > > between what writeback is asking and how filesystems track dirty
> > > metadata...
> > 
> > I agree that writeback for memory cleaning and writeback for data integrity
> > are two very different things especially for metadata. In fact for data
> > integrity writeback we already have ->sync_fs operation so there the
> > functionality gets duplicated. What we could do is that in
> > writeback_sb_inodes() we'd call ->write_metadata only when
> > work->for_kupdate or work->for_background is set. That way ->write_metadata
> > would be called only for memory cleaning purposes.
> 
> That makes sense, but I still think we need a better indication of
> how much writeback we need to do than just "writeback this chunk of
> pages". That "writeback a chunk" interface is necessary to share
> writeback bandwidth across numerous data inodes so that we don't
> starve any one inode of writeback bandwidth. That's unnecessary for
> metadata writeback on a superblock - we don't need to share that
> bandwidth around hundreds or thousands of inodes. What we actually
> need to know is how much writeback we need to do as a total of all
> the dirty metadata on the superblock.
> 
> Sure, that's not ideal for btrfs and mayext4, but we can write a
> simple generic helper that converts "flush X percent of dirty
> metadata" to a page/byte chunk as the current code does. DOing it
> this way allows filesystems to completely internalise the accounting
> that needs to be done, rather than trying to hack around a
> writeback accounting interface with large impedance mismatches to
> how the filesystem accounts for dirty metadata and/or tracks
> writeback progress.

Let me think out loud on how we could tie this into how memory cleaning
writeback currently works - the one with for_background == 1 which is
generally used to get the amount of dirty pages in the system under control.
We have a queue of inodes to write; we iterate over this queue and ask each
inode to write some amount (e.g. 64 MB - the exact amount depends on measured
writeback bandwidth etc.). Some amount from that inode gets written and we
continue with the next inode in the queue (putting this one at the end of the
queue if it still has dirty pages). We do this until:

a) the number of dirty pages in the system is below the background dirty
   limit and the number of dirty pages for this device is below the
   background dirty limit for this device,
b) we run out of dirty inodes on this device, or
c) someone queues a different type of writeback

And we need to somehow incorporate metadata writeback into this loop. I see
two questions here:

1) When / how often should we ask for metadata writeback?
2) How much to ask to write in one go?

The second question is especially tricky in the presence of completely
async metadata flushing in XFS - we can ask to write, say, half of the dirty
metadata, but then we have no idea whether the next observation of the dirty
metadata counters already reflects that part of the metadata being under
writeback / cleaned, or whether xfsaild hasn't even started working and
pushing more makes no sense. Partly, this could be dealt with by telling the
filesystem
"metadata dirty target" - i.e. "get your dirty metadata counters below X"
- and whether we communicate that in bytes, pages, or a fraction of
current dirty metadata counter value is a detail I don't have a strong
opinion on now. And the fact is the amount written by the filesystem
doesn't have to be very accurate anyway - we basically just want to make
some forward progress with writing metadata, don't want that to take too
long (so that other writeback from the thread isn't stalled), and if
writeback code is unhappy about the state of counters next time it looks,
it will ask the filesystem again...

This gets me directly to another problem with the async nature of XFS metadata
writeback: it could get the writeback thread into a busyloop - we are
supposed to terminate memory cleaning writeback only once the dirty
counters are below the limit, and if dirty metadata is causing the counters
to be over the limit, we would just ask XFS in a loop to get metadata below
the target. I suppose XFS could just return "nothing written" from its
->write_metadata operation and in such a case we could sleep a bit before
going for another writeback loop (the same thing happens when a filesystem
reports all inodes are locked / busy and it cannot write back anything). But
it's getting a bit ugly - and is it really better than somehow waiting inside
XFS for metadata writeback to occur?  Any idea Dave?

Regarding question 1): what Josef does is that once we have gone through all
the queued inodes and written some amount from each one, we go and ask the fs
to write some metadata. Then we again go to write the inodes that are still
dirty. That is somewhat rough but I guess it is fine for now.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2017-12-20 14:30             ` Jan Kara
@ 2018-01-02 16:13               ` Josef Bacik
  2018-01-03  2:32                 ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: Josef Bacik @ 2018-01-02 16:13 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Josef Bacik, hannes, linux-mm, akpm, linux-fsdevel,
	kernel-team, linux-btrfs, Josef Bacik

On Wed, Dec 20, 2017 at 03:30:55PM +0100, Jan Kara wrote:
> On Wed 20-12-17 08:35:05, Dave Chinner wrote:
> > On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> > > On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > > > On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote:
> > > > > On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> > > > > > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > > > > This is just one of those things that's going to be slightly shitty.  It's the
> > > > > same for memory reclaim, all of those places use pages so we just take
> > > > > METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough.
> > > > 
> > > > Ok, so that isn't exactly easy to deal with, because all our
> > > > metadata writeback is based on log sequence number targets (i.e. how
> > > > far to push the tail of the log towards the current head). We've
> > > > actually got no idea how pages/bytes actually map to a LSN target
> > > > because while we might account a full buffer as dirty for memory
> > > > reclaim purposes (up to 64k in size), we might have only logged 128
> > > > bytes of it.
> > > > 
> > > > i.e. if we are asked to push 2MB of metadata and we treat that as
> > > > 2MB of log space (i.e. push target of tail LSN + 2MB) we could have
> > > > logged several tens of megabytes of dirty metadata in that LSN
> > > > range and have to flush it all. OTOH, if the buffers are fully
> > > > logged, then that same target might only flush 1.5MB of metadata
> > > > once all the log overhead is taken into account.
> > > > 
> > > > So there's a fairly large disconnect between the "flush N bytes of
> > > > metadata" API and the "push to a target LSN" that XFS uses for
> > > > flushing metadata in aged order. I'm betting that extN and otehr
> > > > filesystems might have similar mismatches with their journal
> > > > flushing...
> > > 
> > > Well, for ext4 it isn't as bad since we do full block logging only. So if
> > > we are asked to flush N pages, we can easily translate that to number of fs
> > > blocks and flush that many from the oldest transaction.
> > > 
> > > Couldn't XFS just track how much it has cleaned (from reclaim perspective)
> > > when pushing items from AIL (which is what I suppose XFS would do in
> > > response to metadata writeback request) and just stop pushing when it has
> > > cleaned as much as it was asked to?
> > 
> > If only it were that simple :/
> > 
> > To start with, flushing the dirty objects (such as inodes) to their
> > backing buffers do not mean the the object is clean once the
> > writeback completes. XFS has decoupled in-memory objects with
> > logical object logging rather than logging physical buffers, and
> > so can be modified and dirtied while the inode buffer
> > is being written back. Hence if we just count things like "buffer
> > size written" it's not actually a correct account of the amount of
> > dirty metadata we've cleaned. If we don't get that right, it'll
> > result in accounting errors and incorrect behaviour.
> > 
> > The bigger problem, however, is that we have no channel to return
> > flush information from the AIL pushing to whatever caller asked for
> > the push. Pushing metadata is completely decoupled from every other
> > subsystem. i.e. the caller asked the xfsaild to push to a specific
> > LSN (e.g. to free up a certain amount of log space for new
> > transactions), and *nothing* has any idea of how much metadata we'll
> > need to write to push the tail of the log to that LSN.
> > 
> > It's also completely asynchronous - there's no mechanism for waiting
> > on a push to a specific LSN. Anything that needs a specific amount
> > of log space to be available waits in ordered ticket queues on the
> > log tail moving forwards. The only interfaces that have access to
> > the log tail ticket waiting is the transaction reservation
> > subsystem, which cannot be used during metadata writeback because
> > that's a guaranteed deadlock vector....
> > 
> > Saying "just account for bytes written" assumes directly connected,
> > synchronous dispatch metadata writeback infrastructure which we
> > simply don't have in XFS. "just clean this many bytes" doesn't
> > really fit at all because we have no way of referencing that to the
> > distance we need to push the tail of the log. An interface that
> > tells us "clean this percentage of dirty metadata" is much more
> > useful because we can map that easily to a log sequence number
> > based push target....
> 
> OK, understood.
> 
> > > > IOWs, treating metadata like it's one great big data inode doesn't
> > > > seem to me to be the right abstraction to use for this - in most
> > > > fileystems it's a bunch of objects with a complex dependency tree
> > > > and unknown write ordering, not an inode full of data that can be
> > > > sequentially written.
> > > > 
> > > > Maybe we need multiple ops with well defined behaviours. e.g.
> > > > ->writeback_metadata() for background writeback, ->sync_metadata() for
> > > > sync based operations. That way different filesystems can ignore the
> > > > parts they don't need simply by not implementing those operations,
> > > > and the writeback code doesn't need to try to cater for all
> > > > operations through the one op. The writeback code should be cleaner,
> > > > the filesystem code should be cleaner, and we can tailor the work
> > > > guidelines for each operation separately so there's less mismatch
> > > > between what writeback is asking and how filesystems track dirty
> > > > metadata...
> > > 
> > > I agree that writeback for memory cleaning and writeback for data integrity
> > > are two very different things especially for metadata. In fact for data
> > > integrity writeback we already have ->sync_fs operation so there the
> > > functionality gets duplicated. What we could do is that in
> > > writeback_sb_inodes() we'd call ->write_metadata only when
> > > work->for_kupdate or work->for_background is set. That way ->write_metadata
> > > would be called only for memory cleaning purposes.
> > 
> > That makes sense, but I still think we need a better indication of
> > how much writeback we need to do than just "writeback this chunk of
> > pages". That "writeback a chunk" interface is necessary to share
> > writeback bandwidth across numerous data inodes so that we don't
> > starve any one inode of writeback bandwidth. That's unnecessary for
> > metadata writeback on a superblock - we don't need to share that
> > bandwidth around hundreds or thousands of inodes. What we actually
> > need to know is how much writeback we need to do as a total of all
> > the dirty metadata on the superblock.
> > 
> > Sure, that's not ideal for btrfs and mayext4, but we can write a
> > simple generic helper that converts "flush X percent of dirty
> > metadata" to a page/byte chunk as the current code does. DOing it
> > this way allows filesystems to completely internalise the accounting
> > that needs to be done, rather than trying to hack around a
> > writeback accounting interface with large impedance mismatches to
> > how the filesystem accounts for dirty metadata and/or tracks
> > writeback progress.
> 
> Let me think loud on how we could tie this into how memory cleaning
> writeback currently works - the one with for_background == 1 which is
> generally used to get amount of dirty pages in the system under control.
> We have a queue of inodes to write, we iterate over this queue and ask each
> inode to write some amount (e.g. 64 M - exact amount depends on measured
> writeback bandwidth etc.). Some amount from that inode gets written and we
> continue with the next inode in the queue (put this one at the end of the
> queue if it still has dirty pages). We do this until:
> 
> a) the number of dirty pages in the system is below background dirty limit
>    and the number dirty pages for this device is below background dirty
>    limit for this device.
> b) run out of dirty inodes on this device
> c) someone queues different type of writeback
> 
> And we need to somehow incorporate metadata writeback into this loop. I see
> two questions here:
> 
> 1) When / how often should we ask for metadata writeback?
> 2) How much to ask to write in one go?
> 
> The second question is especially tricky in the presence of completely
> async metadata flushing in XFS - we can ask to write say half of dirty
> metadata but then we have no idea whether the next observation of dirty
> metadata counters is with that part of metadata already under writeback /
> cleaned or whether xfsaild didn't even start working and pushing more has
> no sense. Partly, this could be dealt with by telling the filesystem
> "metadata dirty target" - i.e. "get your dirty metadata counters below X"
> - and whether we communicate that in bytes, pages, or a fraction of
> current dirty metadata counter value is a detail I don't have a strong
> opinion on now. And the fact is the amount written by the filesystem
> doesn't have to be very accurate anyway - we basically just want to make
> some forward progress with writing metadata, don't want that to take too
> long (so that other writeback from the thread isn't stalled), and if
> writeback code is unhappy about the state of counters next time it looks,
> it will ask the filesystem again...
> 
> This gets me directly to another problem with async nature of XFS metadata
> writeback. That is that it could get writeback thread into busyloop - we
> are supposed to terminate memory cleaning writeback only once dirty
> counters are below limit and in case dirty metadata is causing counters to
> be over limit, we would just ask in a loop XFS to get metadata below the
> target. I suppose XFS could just return "nothing written" from its
> ->write_metadata operation and in such case we could sleep a bit before
> going for another writeback loop (the same thing happens when filesystem
> reports all inodes are locked / busy and it cannot writeback anything). But
> it's getting a bit ugly and is it really better than somehow waiting inside
> XFS for metadata writeback to occur?  Any idea Dave?
> 
> Regarding question 1). What Josef does is that once we went through all
> queued inodes and wrote some amount from each one, we'd go and ask fs to
> write some metadata. And then we'll again go to write inodes that are still
> dirty. That is somewhat rough but I guess it is fine for now.
> 

Alright I'm back from vacation so am sufficiently hungover to try and figure
this out.  Btrfs and ext4 account their dirty metadata directly and reclaim it
like inodes, xfs doesn't.  Btrfs does do something similar to what xfs does with
delayed updates, but we just use the enospc logic to trigger when to update the
metadata blocks, and then those just get written out via the dirty balancing
stuff.  Since xfs doesn't have a direct way to tie that together, you'd rather
we'd have some sort of ratio so you know you need to flush dirty inodes, correct
Dave?

I don't think this is solvable for xfs.  The whole VM is built around
pages/bytes.  The only place we have this ratio thing is in slab reclaim, and we
only have to worry about actual memory pressure there because we have a nice
external trigger: we're out of pages.

For dirty throttling we have to know how much we're pushing and how much we need
to push, and that _requires_ bytes/pages.  And not just "we can only send you
bytes/pages to reclaim" - the throttling stuff has all of its accounting
in bytes/pages, so putting arbitrary object counts into this logic is not
going to be straightforward.  The system administrator sets their dirty limits
to absolute numbers or % of total memory.  If xfs can't account for its metadata
this way then I don't think it can use any sort of infrastructure we provide in
the current framework.  We'd have to completely overhaul the dirty throttling
stuff for it to work, and even then we'd still need the bandwidth of the device,
which means knowing how many bytes we're writing out.

I have an alternative proposal.  We keep these patches the way they are and use
them for btrfs and ext4, since our actual metadata pool is tracked similarly to
inodes.

Then to accommodate xfs (and btrfs and ext4) we have a separate throttling system
that is fs-object based.  I know xfs and btrfs bypass the mark_inode_dirty()
stuff, but there's no reason we couldn't strap some logic in there to account
for how many dirty objects we have lying around.  Then we need to come up with a
way we want to limit the objects.  I feel like this part is going to be very fs
specific: btrfs will have its enospc checks, ext4 and xfs will have log space
checks.  We add this into balance_dirty_pages(), so we benefit from the
per-process rate limiting, and we just have super->balance_metadata(); inside
balance_metadata() the fs takes into account the current dirty object usage and
how much space we have left and then does its throttling job.  If there's
plenty of space or whatever, it's just a no-op and returns.
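
To sketch what I mean (hypothetical code - the hook and its signature
are made up here, nothing like it exists yet):

	/*
	 * Hypothetical sketch of the proposed hook: called from
	 * balance_dirty_pages() so the fs can throttle on its own
	 * dirty-object accounting - btrfs on its enospc state,
	 * ext4/xfs on log space - and return immediately when there
	 * is plenty of room.
	 */
	static void example_balance_metadata(struct address_space *mapping)
	{
		struct super_block *sb = mapping->host->i_sb;

		if (sb->s_op->balance_metadata)	/* hypothetical s_op member */
			sb->s_op->balance_metadata(sb);
	}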

This would couple nicely with what I've done in these patches, as
balance_metadata() would simply move our in-memory updates into the buffers
themselves, making them dirty.  Then the actual dirty metadata block stuff
could see if it's time to do writeout.

And this would mean xfs doesn't do writeout from the slab reclamation
path, which Facebook constantly has to patch out because it just makes xfs
unusable in our environments.

From here we could then extend the fs object dirty balancing to have more
generic logic for when we need to flush, but I feel like that's going to be a
much larger project than just providing a callback.  This doesn't get us
anything super new or fancy, but could even be used in the memory reclaim path
in a less critical area to flush dirty metadata rather than the slab reclaim
path that xfs currently uses.  And then we can build more complicated things
from there.  What do you think of this?

Josef


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2018-01-02 16:13               ` Josef Bacik
@ 2018-01-03  2:32                 ` Dave Chinner
  2018-01-03 13:59                   ` Jan Kara
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2018-01-03  2:32 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Jan Kara, hannes, linux-mm, akpm, linux-fsdevel, kernel-team,
	linux-btrfs, Josef Bacik

On Tue, Jan 02, 2018 at 11:13:06AM -0500, Josef Bacik wrote:
> On Wed, Dec 20, 2017 at 03:30:55PM +0100, Jan Kara wrote:
> > On Wed 20-12-17 08:35:05, Dave Chinner wrote:
> > > On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> > > > On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > > > > IOWs, treating metadata like it's one great big data inode doesn't
> > > > > seem to me to be the right abstraction to use for this - in most
> > > > > fileystems it's a bunch of objects with a complex dependency tree
> > > > > and unknown write ordering, not an inode full of data that can be
> > > > > sequentially written.
> > > > > 
> > > > > Maybe we need multiple ops with well defined behaviours. e.g.
> > > > > ->writeback_metadata() for background writeback, ->sync_metadata() for
> > > > > sync based operations. That way different filesystems can ignore the
> > > > > parts they don't need simply by not implementing those operations,
> > > > > and the writeback code doesn't need to try to cater for all
> > > > > operations through the one op. The writeback code should be cleaner,
> > > > > the filesystem code should be cleaner, and we can tailor the work
> > > > > guidelines for each operation separately so there's less mismatch
> > > > > between what writeback is asking and how filesystems track dirty
> > > > > metadata...
> > > > 
> > > > I agree that writeback for memory cleaning and writeback for data integrity
> > > > are two very different things especially for metadata. In fact for data
> > > > integrity writeback we already have ->sync_fs operation so there the
> > > > functionality gets duplicated. What we could do is that in
> > > > writeback_sb_inodes() we'd call ->write_metadata only when
> > > > work->for_kupdate or work->for_background is set. That way ->write_metadata
> > > > would be called only for memory cleaning purposes.
> > > 
> > > That makes sense, but I still think we need a better indication of
> > > how much writeback we need to do than just "writeback this chunk of
> > > pages". That "writeback a chunk" interface is necessary to share
> > > writeback bandwidth across numerous data inodes so that we don't
> > > starve any one inode of writeback bandwidth. That's unnecessary for
> > > metadata writeback on a superblock - we don't need to share that
> > > bandwidth around hundreds or thousands of inodes. What we actually
> > > need to know is how much writeback we need to do as a total of all
> > > the dirty metadata on the superblock.
> > > 
> > > Sure, that's not ideal for btrfs and mayext4, but we can write a
> > > simple generic helper that converts "flush X percent of dirty
> > > metadata" to a page/byte chunk as the current code does. DOing it
> > > this way allows filesystems to completely internalise the accounting
> > > that needs to be done, rather than trying to hack around a
> > > writeback accounting interface with large impedance mismatches to
> > > how the filesystem accounts for dirty metadata and/or tracks
> > > writeback progress.
> > 
> > Let me think loud on how we could tie this into how memory cleaning
> > writeback currently works - the one with for_background == 1 which is
> > generally used to get amount of dirty pages in the system under control.
> > We have a queue of inodes to write, we iterate over this queue and ask each
> > inode to write some amount (e.g. 64 M - exact amount depends on measured

It's a maximum of 1024 pages per inode.

> > writeback bandwidth etc.). Some amount from that inode gets written and we
> > continue with the next inode in the queue (put this one at the end of the
> > queue if it still has dirty pages). We do this until:
> > 
> > a) the number of dirty pages in the system is below background dirty limit
> >    and the number dirty pages for this device is below background dirty
> >    limit for this device.
> > b) run out of dirty inodes on this device
> > c) someone queues different type of writeback
> > 
> > And we need to somehow incorporate metadata writeback into this loop. I see
> > two questions here:
> > 
> > 1) When / how often should we ask for metadata writeback?
> > 2) How much to ask to write in one go?
> > 
> > The second question is especially tricky in the presence of completely
> > async metadata flushing in XFS - we can ask to write say half of dirty
> > metadata but then we have no idea whether the next observation of dirty
> > metadata counters is with that part of metadata already under writeback /
> > cleaned or whether xfsaild didn't even start working and pushing more has
> > no sense.

Well, like with ext4, we've also got to consider that a bunch of the
recently dirtied metadata (e.g. from delalloc, EOF updates on IO
completion, etc) is still pinned in memory because the
journal has not been flushed/checkpointed. Hence we should not be
attempting to write back metadata we've dirtied as a result of
writing data in the background writeback loop.

That greatly simplifies what we need to consider here. That is, we
just need to sample the ratio of dirty metadata to clean metadata
before we start data writeback, and we calculate the amount of
metadata writeback we should trigger from there. We only need to
do this *once* per background writeback scan for a superblock
as there is no need for sharing bandwidth between lots of data
inodes - there's only one metadata inode for ext4/btrfs, and XFS is
completely async....

> > Partly, this could be dealt with by telling the filesystem
> > "metadata dirty target" - i.e. "get your dirty metadata counters below X"
> > - and whether we communicate that in bytes, pages, or a fraction of
> > current dirty metadata counter value is a detail I don't have a strong
> > opinion on now. And the fact is the amount written by the filesystem
> > doesn't have to be very accurate anyway - we basically just want to make
> > some forward progress with writing metadata, don't want that to take too
> > long (so that other writeback from the thread isn't stalled), and if
> > writeback code is unhappy about the state of counters next time it looks,
> > it will ask the filesystem again...

Right. The problem is communicating "how much" to the filesystem in
a useful manner....

> > This gets me directly to another problem with async nature of XFS metadata
> > writeback. That is that it could get writeback thread into busyloop - we
> > are supposed to terminate memory cleaning writeback only once dirty
> > counters are below limit and in case dirty metadata is causing counters to
> > be over limit, we would just ask in a loop XFS to get metadata below the
> > target. I suppose XFS could just return "nothing written" from its
> > ->write_metadata operation and in such case we could sleep a bit before
> > going for another writeback loop (the same thing happens when filesystem
> > reports all inodes are locked / busy and it cannot writeback anything). But
> > it's getting a bit ugly and is it really better than somehow waiting inside
> > XFS for metadata writeback to occur?  Any idea Dave?

I tend to think that the whole point of background writeback is to
do it asynchronously and keep the IO pipe full by avoiding blocking
on any specific object. i.e. if we can't do writeback from this
object, then skip it and do it from the next....

I think we could probably block ->write_metadata if necessary via a
completion/wakeup style notification when a specific LSN is reached
by the log tail, but realistically if there's any amount of data
needing to be written it'll throttle data writes because the IO
pipeline is being kept full by background metadata writes....
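
To show the sort of mechanism I mean (purely illustrative, none of this
exists - the waiter structure and the call sites are made up, only the
completion API is real):

	/*
	 * Illustrative sketch of a completion-style wait on the log
	 * tail: a waiter records the LSN it needs, and the AIL push
	 * completion path wakes it once the tail has moved past it.
	 */
	struct example_lsn_waiter {
		u64			target_lsn;
		struct completion	done;
	};

	static void example_wait_for_tail_lsn(struct example_lsn_waiter *w)
	{
		init_completion(&w->done);
		/* ... add w to a list the AIL push completion code scans ... */
		wait_for_completion(&w->done);
	}

	/* called from the (hypothetical) AIL push completion path */
	static void example_tail_moved(struct example_lsn_waiter *w, u64 tail_lsn)
	{
		if (tail_lsn >= w->target_lsn)
			complete(&w->done);
	}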

> > Regarding question 1). What Josef does is that once we went through all
> > queued inodes and wrote some amount from each one, we'd go and ask fs to
> > write some metadata. And then we'll again go to write inodes that are still
> > dirty. That is somewhat rough but I guess it is fine for now.
> > 
> 
> Alright I'm back from vacation so am sufficiently hungover to try and figure
> this out.  Btrfs and ext4 account their dirty metadata directly and reclaim it
> like inodes, xfs doesn't.

Terminology: "reclaim" is not what we do when accounting for
writeback IO completion.

And we've already been through the accounting side of things - we
can add that to XFS once it's converted to byte-based accounting.

> Btrfs does do something similar to what xfs does with
> delayed updates, but we just use the enospc logic to trigger when to update the
> metadata blocks, and then those just get written out via the dirty balancing
> stuff.  Since xfs doesn't have a direct way to tie that together, you'd rather
> we'd have some sort of ratio so you know you need to flush dirty inodes, correct
> Dave?

Again, terminology: We don't "need to flush dirty inodes" in XFS,
we need to flush /metadata objects/.

> I don't think this is solvable for xfs.  The whole vm is around pages/bytes.
> The only place we have this ratio thing is in slab reclaim, and we only have to
> worry about actual memory pressure there because we have a nice external
> trigger, we're out of pages.

We don't need all of the complexity of slab reclaim, though. That's
a complete red herring.

All that is needed is for the writeback API to tell us "flush X% of
your dirty metadata".  We will have cached data and metadata in
bytes and dirty cached data and metadata in bytes at the generic
writeback level - it's not at all difficult to turn that into a
flush ratio. e.g. take the amount we are over the dirty metadata
background threshold, request writeback for that amount of metadata
as a percentage of the overall dirty metadata.
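
Roughly (hypothetical helper - the metadata background threshold
parameter is made up; WB_METADATA_DIRTY_BYTES is the counter from the
series):

	/*
	 * Illustrative only: request writeback for the amount by which
	 * dirty metadata exceeds its background threshold, expressed
	 * as a percentage of all dirty metadata on the wb.
	 */
	static unsigned int example_metadata_flush_pct(struct bdi_writeback *wb,
						       u64 meta_bg_thresh)
	{
		u64 dirty = wb_stat(wb, WB_METADATA_DIRTY_BYTES);

		if (dirty <= meta_bg_thresh)
			return 0;

		return div64_u64((dirty - meta_bg_thresh) * 100, dirty);
	}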

> For dirty throttling we have to know how much we're pushing and how much we need
> to push, and that _requires_ bytes/pages.

Dirty throttling does not need to know how much work you've asked
the filesystem to do. It does its own accounting of bytes/pages
being cleaned based on the accounting updates from the filesystem
metadata object IO completion routines. That is what needs to be in
bytes/pages for dirty throttling to work.

> And not like "we can only send you
> bytes/pages to reclaim" but like the throttling stuff has all of it's accounting
> in bytes/pages, so putting in arbitrary object counts into this logic is not
> going to be straightforward.  The system administrator sets their dirty limits
> to absolute numbers or % of total memory.  If xfs can't account for its metadata
> this way then I don't think it can use any sort of infrastructure we provide in
> the current framework.

XFS will account for clean/dirty metadata in bytes, just like btrfs
and ext4 will do. We've already been over this and *solved that
problem*.

But really, though, I'm fed up with having to fight time and time
again over simple changes to core infrastructure that make it
generic rather than specifically tailored to the filesystem that
wants it first.  Merge whatever crap you need for btrfs and I'll
make it work for XFS later and leave what gets fed to btrfs
completely unchanged.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2018-01-03  2:32                 ` Dave Chinner
@ 2018-01-03 13:59                   ` Jan Kara
  2018-01-03 15:49                     ` Josef Bacik
  2018-01-04  1:32                     ` Dave Chinner
  0 siblings, 2 replies; 31+ messages in thread
From: Jan Kara @ 2018-01-03 13:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Josef Bacik, Jan Kara, hannes, linux-mm, akpm, linux-fsdevel,
	kernel-team, linux-btrfs, Josef Bacik

On Wed 03-01-18 13:32:19, Dave Chinner wrote:
> On Tue, Jan 02, 2018 at 11:13:06AM -0500, Josef Bacik wrote:
> > On Wed, Dec 20, 2017 at 03:30:55PM +0100, Jan Kara wrote:
> > > On Wed 20-12-17 08:35:05, Dave Chinner wrote:
> > > > On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> > > > > On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > > > > > IOWs, treating metadata like it's one great big data inode doesn't
> > > > > > seem to me to be the right abstraction to use for this - in most
> > > > > > fileystems it's a bunch of objects with a complex dependency tree
> > > > > > and unknown write ordering, not an inode full of data that can be
> > > > > > sequentially written.
> > > > > > 
> > > > > > Maybe we need multiple ops with well defined behaviours. e.g.
> > > > > > ->writeback_metadata() for background writeback, ->sync_metadata() for
> > > > > > sync based operations. That way different filesystems can ignore the
> > > > > > parts they don't need simply by not implementing those operations,
> > > > > > and the writeback code doesn't need to try to cater for all
> > > > > > operations through the one op. The writeback code should be cleaner,
> > > > > > the filesystem code should be cleaner, and we can tailor the work
> > > > > > guidelines for each operation separately so there's less mismatch
> > > > > > between what writeback is asking and how filesystems track dirty
> > > > > > metadata...
> > > > > 
> > > > > I agree that writeback for memory cleaning and writeback for data integrity
> > > > > are two very different things especially for metadata. In fact for data
> > > > > integrity writeback we already have ->sync_fs operation so there the
> > > > > functionality gets duplicated. What we could do is that in
> > > > > writeback_sb_inodes() we'd call ->write_metadata only when
> > > > > work->for_kupdate or work->for_background is set. That way ->write_metadata
> > > > > would be called only for memory cleaning purposes.
> > > > 
> > > > That makes sense, but I still think we need a better indication of
> > > > how much writeback we need to do than just "writeback this chunk of
> > > > pages". That "writeback a chunk" interface is necessary to share
> > > > writeback bandwidth across numerous data inodes so that we don't
> > > > starve any one inode of writeback bandwidth. That's unnecessary for
> > > > metadata writeback on a superblock - we don't need to share that
> > > > bandwidth around hundreds or thousands of inodes. What we actually
> > > > need to know is how much writeback we need to do as a total of all
> > > > the dirty metadata on the superblock.
> > > > 
> > > > Sure, that's not ideal for btrfs and mayext4, but we can write a
> > > > simple generic helper that converts "flush X percent of dirty
> > > > metadata" to a page/byte chunk as the current code does. DOing it
> > > > this way allows filesystems to completely internalise the accounting
> > > > that needs to be done, rather than trying to hack around a
> > > > writeback accounting interface with large impedance mismatches to
> > > > how the filesystem accounts for dirty metadata and/or tracks
> > > > writeback progress.
> > > 
> > > Let me think loud on how we could tie this into how memory cleaning
> > > writeback currently works - the one with for_background == 1 which is
> > > generally used to get amount of dirty pages in the system under control.
> > > We have a queue of inodes to write, we iterate over this queue and ask each
> > > inode to write some amount (e.g. 64 M - exact amount depends on measured
> 
> It's a maximum of 1024 pages per inode.

That's actually a minimum, not maximum, if I read the code in
writeback_chunk_size() right.

> > > writeback bandwidth etc.). Some amount from that inode gets written and we
> > > continue with the next inode in the queue (put this one at the end of the
> > > queue if it still has dirty pages). We do this until:
> > > 
> > > a) the number of dirty pages in the system is below background dirty limit
> > >    and the number dirty pages for this device is below background dirty
> > >    limit for this device.
> > > b) run out of dirty inodes on this device
> > > c) someone queues different type of writeback
> > > 
> > > And we need to somehow incorporate metadata writeback into this loop. I see
> > > two questions here:
> > > 
> > > 1) When / how often should we ask for metadata writeback?
> > > 2) How much to ask to write in one go?
> > > 
> > > The second question is especially tricky in the presence of completely
> > > async metadata flushing in XFS - we can ask to write say half of dirty
> > > metadata but then we have no idea whether the next observation of dirty
> > > metadata counters is with that part of metadata already under writeback /
> > > cleaned or whether xfsaild didn't even start working and pushing more has
> > > no sense.
> 
> Well, like with ext4, we've also got to consider that a bunch of the
> recently dirtied metadata (e.g. from delalloc, EOF updates on IO
> completion, etc) is still pinned in memory because the
> journal has not been flushed/checkpointed. Hence we should not be
> attempting to write back metadata we've dirtied as a result of
> writing data in the background writeback loop.

Agreed. Actually for ext4 I would not expose 'pinned' buffers as dirty to the
VM - the journalling layer already works that way and it works
well for us. But that's just a small technical detail and different
filesystems can decide differently.

> That greatly simplifies what we need to consider here. That is, we
> just need to sample the ratio of dirty metadata to clean metadata
> before we start data writeback, and we calculate the amount of
> metadata writeback we should trigger from there. We only need to
> do this *once* per background writeback scan for a superblock
> as there is no need for sharing bandwidth between lots of data
> inodes - there's only one metadata inode for ext4/btrfs, and XFS is
> completely async....

OK, agreed again.

> > > Partly, this could be dealt with by telling the filesystem
> > > "metadata dirty target" - i.e. "get your dirty metadata counters below X"
> > > - and whether we communicate that in bytes, pages, or a fraction of
> > > current dirty metadata counter value is a detail I don't have a strong
> > > opinion on now. And the fact is the amount written by the filesystem
> > > doesn't have to be very accurate anyway - we basically just want to make
> > > some forward progress with writing metadata, don't want that to take too
> > > long (so that other writeback from the thread isn't stalled), and if
> > > writeback code is unhappy about the state of counters next time it looks,
> > > it will ask the filesystem again...
> 
> Right. The problem is communicating "how much" to the filesystem in
> a useful manner....

Yep. I'm fine with communication in the form of 'write X% of your dirty
metadata'. That should be useful for XFS and as you mentioned in some
previous email, we can provide a helper function to compute the number of pages
to write (including some reasonable upper limit to bound the time spent in one
->write_metadata invocation) for ext4 and btrfs.

> > > This gets me directly to another problem with async nature of XFS metadata
> > > writeback. That is that it could get writeback thread into busyloop - we
> > > are supposed to terminate memory cleaning writeback only once dirty
> > > counters are below limit and in case dirty metadata is causing counters to
> > > be over limit, we would just ask in a loop XFS to get metadata below the
> > > target. I suppose XFS could just return "nothing written" from its
> > > ->write_metadata operation and in such case we could sleep a bit before
> > > going for another writeback loop (the same thing happens when filesystem
> > > reports all inodes are locked / busy and it cannot writeback anything). But
> > > it's getting a bit ugly and is it really better than somehow waiting inside
> > > XFS for metadata writeback to occur?  Any idea Dave?
> 
> I tend to think that the whole point of background writeback is to
> do it asynchronously and keep the IO pipe full by avoiding blocking
> on any specific object. i.e. if we can't do writeback from this
> object, then skip it and do it from the next....

Agreed.

> I think we could probably block ->write_metadata if necessary via a
> completion/wakeup style notification when a specific LSN is reached
> by the log tail, but realistically if there's any amount of data
> needing to be written it'll throttle data writes because the IO
> pipeline is being kept full by background metadata writes....

So the problem I'm concerned about is a corner case. Consider a situation
where you have no dirty data, only dirty metadata, but enough of it to
trigger background writeback. How should metadata writeback behave for XFS
in this case? Who should be responsible for making sure wb_writeback() does
not just loop invoking ->write_metadata() as fast as the CPU allows until
xfsaild makes enough progress?

Thinking about this today, I think this looping prevention belongs in
wb_writeback(). Sadly we don't have much info to decide how long to sleep
before trying more writeback, so we'd have to just sleep for
<some_magic_amount> if we found that no writeback happened in the last
writeback round before going through the whole writeback loop again. And
->write_metadata() for XFS would need to always return 0 (as in "no progress
made") to make sure this busyloop avoidance logic in wb_writeback()
triggers. ext4 and btrfs would return the number of bytes written from
->write_metadata (or just 1 would be enough to indicate that some progress in
metadata writeback was made and busyloop avoidance is not needed).
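
In the wb_writeback() retry loop the shape would be roughly this
(illustrative only - the 100ms backoff just stands in for
<some_magic_amount>, it is not a value from the series):

	/*
	 * If neither data nor metadata writeback made progress in this
	 * pass (XFS would report 0 because its AIL push is async), back
	 * off briefly instead of spinning on ->write_metadata().
	 */
	if (!progress) {
		/* (any locks held would have to be dropped around the sleep) */
		schedule_timeout_interruptible(msecs_to_jiffies(100));
		continue;
	}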

So overall I think I have a pretty clear idea of how this all should work to
make ->write_metadata useful for btrfs, XFS, and ext4, and we agree on the
plan.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2018-01-03 13:59                   ` Jan Kara
@ 2018-01-03 15:49                     ` Josef Bacik
  2018-01-03 16:26                       ` Jan Kara
  2018-01-04  1:32                     ` Dave Chinner
  1 sibling, 1 reply; 31+ messages in thread
From: Josef Bacik @ 2018-01-03 15:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Josef Bacik, hannes, linux-mm, akpm, linux-fsdevel,
	kernel-team, linux-btrfs, Josef Bacik

On Wed, Jan 03, 2018 at 02:59:21PM +0100, Jan Kara wrote:
> On Wed 03-01-18 13:32:19, Dave Chinner wrote:
> > On Tue, Jan 02, 2018 at 11:13:06AM -0500, Josef Bacik wrote:
> > > On Wed, Dec 20, 2017 at 03:30:55PM +0100, Jan Kara wrote:
> > > > On Wed 20-12-17 08:35:05, Dave Chinner wrote:
> > > > > On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> > > > > > On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > > > > > > IOWs, treating metadata like it's one great big data inode doesn't
> > > > > > > seem to me to be the right abstraction to use for this - in most
> > > > > > > fileystems it's a bunch of objects with a complex dependency tree
> > > > > > > and unknown write ordering, not an inode full of data that can be
> > > > > > > sequentially written.
> > > > > > > 
> > > > > > > Maybe we need multiple ops with well defined behaviours. e.g.
> > > > > > > ->writeback_metadata() for background writeback, ->sync_metadata() for
> > > > > > > sync based operations. That way different filesystems can ignore the
> > > > > > > parts they don't need simply by not implementing those operations,
> > > > > > > and the writeback code doesn't need to try to cater for all
> > > > > > > operations through the one op. The writeback code should be cleaner,
> > > > > > > the filesystem code should be cleaner, and we can tailor the work
> > > > > > > guidelines for each operation separately so there's less mismatch
> > > > > > > between what writeback is asking and how filesystems track dirty
> > > > > > > metadata...
> > > > > > 
> > > > > > I agree that writeback for memory cleaning and writeback for data integrity
> > > > > > are two very different things especially for metadata. In fact for data
> > > > > > integrity writeback we already have ->sync_fs operation so there the
> > > > > > functionality gets duplicated. What we could do is that in
> > > > > > writeback_sb_inodes() we'd call ->write_metadata only when
> > > > > > work->for_kupdate or work->for_background is set. That way ->write_metadata
> > > > > > would be called only for memory cleaning purposes.
> > > > > 
> > > > > That makes sense, but I still think we need a better indication of
> > > > > how much writeback we need to do than just "writeback this chunk of
> > > > > pages". That "writeback a chunk" interface is necessary to share
> > > > > writeback bandwidth across numerous data inodes so that we don't
> > > > > starve any one inode of writeback bandwidth. That's unnecessary for
> > > > > metadata writeback on a superblock - we don't need to share that
> > > > > bandwidth around hundreds or thousands of inodes. What we actually
> > > > > need to know is how much writeback we need to do as a total of all
> > > > > the dirty metadata on the superblock.
> > > > > 
> > > > > Sure, that's not ideal for btrfs and mayext4, but we can write a
> > > > > simple generic helper that converts "flush X percent of dirty
> > > > > metadata" to a page/byte chunk as the current code does. DOing it
> > > > > this way allows filesystems to completely internalise the accounting
> > > > > that needs to be done, rather than trying to hack around a
> > > > > writeback accounting interface with large impedance mismatches to
> > > > > how the filesystem accounts for dirty metadata and/or tracks
> > > > > writeback progress.
> > > > 
> > > > Let me think loud on how we could tie this into how memory cleaning
> > > > writeback currently works - the one with for_background == 1 which is
> > > > generally used to get amount of dirty pages in the system under control.
> > > > We have a queue of inodes to write, we iterate over this queue and ask each
> > > > inode to write some amount (e.g. 64 M - exact amount depends on measured
> > 
> > It's a maximum of 1024 pages per inode.
> 
> That's actually a minimum, not maximum, if I read the code in
> writeback_chunk_size() right.
> 
> > > > writeback bandwidth etc.). Some amount from that inode gets written and we
> > > > continue with the next inode in the queue (put this one at the end of the
> > > > queue if it still has dirty pages). We do this until:
> > > > 
> > > > a) the number of dirty pages in the system is below background dirty limit
> > > >    and the number dirty pages for this device is below background dirty
> > > >    limit for this device.
> > > > b) run out of dirty inodes on this device
> > > > c) someone queues different type of writeback
> > > > 
> > > > And we need to somehow incorporate metadata writeback into this loop. I see
> > > > two questions here:
> > > > 
> > > > 1) When / how often should we ask for metadata writeback?
> > > > 2) How much to ask to write in one go?
> > > > 
> > > > The second question is especially tricky in the presence of completely
> > > > async metadata flushing in XFS - we can ask to write say half of dirty
> > > > metadata but then we have no idea whether the next observation of dirty
> > > > metadata counters is with that part of metadata already under writeback /
> > > > cleaned or whether xfsaild didn't even start working and pushing more has
> > > > no sense.
> > 
> > Well, like with ext4, we've also got to consider that a bunch of the
> > recently dirtied metadata (e.g. from delalloc, EOF updates on IO
> > completion, etc) is still pinned in memory because the
> > journal has not been flushed/checkpointed. Hence we should not be
> > attempting to write back metadata we've dirtied as a result of
> > writing data in the background writeback loop.
> 
> Agreed. Actually for ext4 I would not expose 'pinned' buffers as dirty to
> VM - the journalling layer currently already works that way and it works
> well for us. But that's just a small technical detail and different
> filesystems can decide differently.
> 
> > That greatly simplifies what we need to consider here. That is, we
> > just need to sample the ratio of dirty metadata to clean metadata
> > before we start data writeback, and we calculate the amount of
> > metadata writeback we should trigger from there. We only need to
> > do this *once* per background writeback scan for a superblock
> > as there is no need for sharing bandwidth between lots of data
> > inodes - there's only one metadata inode for ext4/btrfs, and XFS is
> > completely async....
> 
> OK, agreed again.
> 
> > > > Partly, this could be dealt with by telling the filesystem
> > > > "metadata dirty target" - i.e. "get your dirty metadata counters below X"
> > > > - and whether we communicate that in bytes, pages, or a fraction of
> > > > current dirty metadata counter value is a detail I don't have a strong
> > > > opinion on now. And the fact is the amount written by the filesystem
> > > > doesn't have to be very accurate anyway - we basically just want to make
> > > > some forward progress with writing metadata, don't want that to take too
> > > > long (so that other writeback from the thread isn't stalled), and if
> > > > writeback code is unhappy about the state of counters next time it looks,
> > > > it will ask the filesystem again...
> > 
> > Right. The problem is communicating "how much" to the filesystem in
> > a useful manner....
> 
> Yep. I'm fine with communication in the form of 'write X% of your dirty
> metadata'. That should be useful for XFS and as you mentioned in some
> previous email, we can provide a helper function to compute number of pages
> to write (including some reasonable upper limit to bound time spent in one
> ->write_metadata invocation) for ext4 and btrfs.
> 
> > > > This gets me directly to another problem with async nature of XFS metadata
> > > > writeback. That is that it could get writeback thread into busyloop - we
> > > > are supposed to terminate memory cleaning writeback only once dirty
> > > > counters are below limit and in case dirty metadata is causing counters to
> > > > be over limit, we would just ask in a loop XFS to get metadata below the
> > > > target. I suppose XFS could just return "nothing written" from its
> > > > ->write_metadata operation and in such case we could sleep a bit before
> > > > going for another writeback loop (the same thing happens when filesystem
> > > > reports all inodes are locked / busy and it cannot writeback anything). But
> > > > it's getting a bit ugly and is it really better than somehow waiting inside
> > > > XFS for metadata writeback to occur?  Any idea Dave?
> > 
> > I tend to think that the whole point of background writeback is to
> > do it asynchronously and keep the IO pipe full by avoiding blocking
> > on any specific object. i.e. if we can't do writeback from this
> > object, then skip it and do it from the next....
> 
> Agreed.
> 
> > I think we could probably block ->write_metadata if necessary via a
> > completion/wakeup style notification when a specific LSN is reached
> > by the log tail, but realistically if there's any amount of data
> > needing to be written it'll throttle data writes because the IO
> > pipeline is being kept full by background metadata writes....
> 
> So the problem I'm concerned about is a corner case. Consider a situation
> when you have no dirty data, only dirty metadata but enough of them to
> trigger background writeback. How should metadata writeback behave for XFS
> in this case? Who should be responsible that wb_writeback() just does not
> loop invoking ->write_metadata() as fast as CPU allows until xfsaild makes
> enough progress?
> 
> Thinking about this today, I think this looping prevention belongs to
> wb_writeback(). Sadly we don't have much info to decide how long to sleep
> before trying more writeback so we'd have to just sleep for
> <some_magic_amount> if we found no writeback happened in the last writeback
> round before going through the whole writeback loop again. And
> ->write_metadata() for XFS would need to always return 0 (as in "no progress
> made") to make sure this busyloop avoidance logic in wb_writeback()
> triggers. ext4 and btrfs would return number of bytes written from
> ->write_metadata (or just 1 would be enough to indicate some progress in
> metadata writeback was made and busyloop avoidance is not needed).
> 
> So overall I think I have pretty clear idea on how this all should work to
> make ->write_metadata useful for btrfs, XFS, and ext4 and we agree on the
> plan.
> 

I'm glad you do, but I'm still confused.  I'm totally fine with sending a % to
the fs and letting it figure out what it wants; what I'm confused about is how
we get that % for xfs.  Since xfs doesn't mark its actual buffers dirty, it
wouldn't use account_metadata_dirtied() and its family, so how do we generate
this % for xfs?  Or am I misunderstanding, and you do plan to use those
helpers?  If you do plan to use them, then we just need to figure out what we
want the ratio to be of, and then you'll be happy, Dave?  I'm not trying to
argue with you, Dave; we're just in that "talking past each other" stage of
every email conversation we've ever had, and I'm trying to get to the "we both
understand what we're both saying and are happy again" stage.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2018-01-03 15:49                     ` Josef Bacik
@ 2018-01-03 16:26                       ` Jan Kara
  2018-01-03 16:29                         ` Josef Bacik
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2018-01-03 16:26 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Jan Kara, Dave Chinner, hannes, linux-mm, akpm, linux-fsdevel,
	kernel-team, linux-btrfs, Josef Bacik

On Wed 03-01-18 10:49:33, Josef Bacik wrote:
> On Wed, Jan 03, 2018 at 02:59:21PM +0100, Jan Kara wrote:
> > On Wed 03-01-18 13:32:19, Dave Chinner wrote:
> > > On Tue, Jan 02, 2018 at 11:13:06AM -0500, Josef Bacik wrote:
> > > > On Wed, Dec 20, 2017 at 03:30:55PM +0100, Jan Kara wrote:
> > > > > On Wed 20-12-17 08:35:05, Dave Chinner wrote:
> > > > > > On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> > > > > > > On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > > > > > > > IOWs, treating metadata like it's one great big data inode doesn't
> > > > > > > > seem to me to be the right abstraction to use for this - in most
> > > > > > > > fileystems it's a bunch of objects with a complex dependency tree
> > > > > > > > and unknown write ordering, not an inode full of data that can be
> > > > > > > > sequentially written.
> > > > > > > > 
> > > > > > > > Maybe we need multiple ops with well defined behaviours. e.g.
> > > > > > > > ->writeback_metadata() for background writeback, ->sync_metadata() for
> > > > > > > > sync based operations. That way different filesystems can ignore the
> > > > > > > > parts they don't need simply by not implementing those operations,
> > > > > > > > and the writeback code doesn't need to try to cater for all
> > > > > > > > operations through the one op. The writeback code should be cleaner,
> > > > > > > > the filesystem code should be cleaner, and we can tailor the work
> > > > > > > > guidelines for each operation separately so there's less mismatch
> > > > > > > > between what writeback is asking and how filesystems track dirty
> > > > > > > > metadata...
> > > > > > > 
> > > > > > > I agree that writeback for memory cleaning and writeback for data integrity
> > > > > > > are two very different things especially for metadata. In fact for data
> > > > > > > integrity writeback we already have ->sync_fs operation so there the
> > > > > > > functionality gets duplicated. What we could do is that in
> > > > > > > writeback_sb_inodes() we'd call ->write_metadata only when
> > > > > > > work->for_kupdate or work->for_background is set. That way ->write_metadata
> > > > > > > would be called only for memory cleaning purposes.
> > > > > > 
> > > > > > That makes sense, but I still think we need a better indication of
> > > > > > how much writeback we need to do than just "writeback this chunk of
> > > > > > pages". That "writeback a chunk" interface is necessary to share
> > > > > > writeback bandwidth across numerous data inodes so that we don't
> > > > > > starve any one inode of writeback bandwidth. That's unnecessary for
> > > > > > metadata writeback on a superblock - we don't need to share that
> > > > > > bandwidth around hundreds or thousands of inodes. What we actually
> > > > > > need to know is how much writeback we need to do as a total of all
> > > > > > the dirty metadata on the superblock.
> > > > > > 
> > > > > > Sure, that's not ideal for btrfs and mayext4, but we can write a
> > > > > > simple generic helper that converts "flush X percent of dirty
> > > > > > metadata" to a page/byte chunk as the current code does. DOing it
> > > > > > this way allows filesystems to completely internalise the accounting
> > > > > > that needs to be done, rather than trying to hack around a
> > > > > > writeback accounting interface with large impedance mismatches to
> > > > > > how the filesystem accounts for dirty metadata and/or tracks
> > > > > > writeback progress.
> > > > > 
> > > > > Let me think loud on how we could tie this into how memory cleaning
> > > > > writeback currently works - the one with for_background == 1 which is
> > > > > generally used to get amount of dirty pages in the system under control.
> > > > > We have a queue of inodes to write, we iterate over this queue and ask each
> > > > > inode to write some amount (e.g. 64 M - exact amount depends on measured
> > > 
> > > It's a maximum of 1024 pages per inode.
> > 
> > That's actually a minimum, not maximum, if I read the code in
> > writeback_chunk_size() right.
> > 
> > > > > writeback bandwidth etc.). Some amount from that inode gets written and we
> > > > > continue with the next inode in the queue (put this one at the end of the
> > > > > queue if it still has dirty pages). We do this until:
> > > > > 
> > > > > a) the number of dirty pages in the system is below background dirty limit
> > > > >    and the number dirty pages for this device is below background dirty
> > > > >    limit for this device.
> > > > > b) run out of dirty inodes on this device
> > > > > c) someone queues different type of writeback
> > > > > 
> > > > > And we need to somehow incorporate metadata writeback into this loop. I see
> > > > > two questions here:
> > > > > 
> > > > > 1) When / how often should we ask for metadata writeback?
> > > > > 2) How much to ask to write in one go?
> > > > > 
> > > > > The second question is especially tricky in the presence of completely
> > > > > async metadata flushing in XFS - we can ask to write say half of dirty
> > > > > metadata but then we have no idea whether the next observation of dirty
> > > > > metadata counters is with that part of metadata already under writeback /
> > > > > cleaned or whether xfsaild didn't even start working and pushing more has
> > > > > no sense.
> > > 
> > > Well, like with ext4, we've also got to consider that a bunch of the
> > > recently dirtied metadata (e.g. from delalloc, EOF updates on IO
> > > completion, etc) is still pinned in memory because the
> > > journal has not been flushed/checkpointed. Hence we should not be
> > > attempting to write back metadata we've dirtied as a result of
> > > writing data in the background writeback loop.
> > 
> > Agreed. Actually for ext4 I would not expose 'pinned' buffers as dirty to
> > VM - the journalling layer currently already works that way and it works
> > well for us. But that's just a small technical detail and different
> > filesystems can decide differently.
> > 
> > > That greatly simplifies what we need to consider here. That is, we
> > > just need to sample the ratio of dirty metadata to clean metadata
> > > before we start data writeback, and we calculate the amount of
> > > metadata writeback we should trigger from there. We only need to
> > > do this *once* per background writeback scan for a superblock
> > > as there is no need for sharing bandwidth between lots of data
> > > inodes - there's only one metadata inode for ext4/btrfs, and XFS is
> > > completely async....
> > 
> > OK, agreed again.
> > 
> > > > > Partly, this could be dealt with by telling the filesystem
> > > > > "metadata dirty target" - i.e. "get your dirty metadata counters below X"
> > > > > - and whether we communicate that in bytes, pages, or a fraction of
> > > > > current dirty metadata counter value is a detail I don't have a strong
> > > > > opinion on now. And the fact is the amount written by the filesystem
> > > > > doesn't have to be very accurate anyway - we basically just want to make
> > > > > some forward progress with writing metadata, don't want that to take too
> > > > > long (so that other writeback from the thread isn't stalled), and if
> > > > > writeback code is unhappy about the state of counters next time it looks,
> > > > > it will ask the filesystem again...
> > > 
> > > Right. The problem is communicating "how much" to the filesystem in
> > > a useful manner....
> > 
> > Yep. I'm fine with communication in the form of 'write X% of your dirty
> > metadata'. That should be useful for XFS and as you mentioned in some
> > previous email, we can provide a helper function to compute number of pages
> > to write (including some reasonable upper limit to bound time spent in one
> > ->write_metadata invocation) for ext4 and btrfs.
> > 
> > > > > This gets me directly to another problem with async nature of XFS metadata
> > > > > writeback. That is that it could get writeback thread into busyloop - we
> > > > > are supposed to terminate memory cleaning writeback only once dirty
> > > > > counters are below limit and in case dirty metadata is causing counters to
> > > > > be over limit, we would just ask in a loop XFS to get metadata below the
> > > > > target. I suppose XFS could just return "nothing written" from its
> > > > > ->write_metadata operation and in such case we could sleep a bit before
> > > > > going for another writeback loop (the same thing happens when filesystem
> > > > > reports all inodes are locked / busy and it cannot writeback anything). But
> > > > > it's getting a bit ugly and is it really better than somehow waiting inside
> > > > > XFS for metadata writeback to occur?  Any idea Dave?
> > > 
> > > I tend to think that the whole point of background writeback is to
> > > do it asynchronously and keep the IO pipe full by avoiding blocking
> > > on any specific object. i.e. if we can't do writeback from this
> > > object, then skip it and do it from the next....
> > 
> > Agreed.
> > 
> > > I think we could probably block ->write_metadata if necessary via a
> > > completion/wakeup style notification when a specific LSN is reached
> > > by the log tail, but realistically if there's any amount of data
> > > needing to be written it'll throttle data writes because the IO
> > > pipeline is being kept full by background metadata writes....
> > 
> > So the problem I'm concerned about is a corner case. Consider a situation
> > when you have no dirty data, only dirty metadata but enough of them to
> > trigger background writeback. How should metadata writeback behave for XFS
> > in this case? Who should be responsible that wb_writeback() just does not
> > loop invoking ->write_metadata() as fast as CPU allows until xfsaild makes
> > enough progress?
> > 
> > Thinking about this today, I think this looping prevention belongs to
> > wb_writeback(). Sadly we don't have much info to decide how long to sleep
> > before trying more writeback so we'd have to just sleep for
> > <some_magic_amount> if we found no writeback happened in the last writeback
> > round before going through the whole writeback loop again. And
> > ->write_metadata() for XFS would need to always return 0 (as in "no progress
> > made") to make sure this busyloop avoidance logic in wb_writeback()
> > triggers. ext4 and btrfs would return number of bytes written from
> > ->write_metadata (or just 1 would be enough to indicate some progress in
> > metadata writeback was made and busyloop avoidance is not needed).
> > 
> > So overall I think I have pretty clear idea on how this all should work to
> > make ->write_metadata useful for btrfs, XFS, and ext4 and we agree on the
> > plan.
> > 
> 
> I'm glad you do, I'm still confused.  I'm totally fine with sending a % to the
> fs to figure out what it wants, what I'm confused about is how to get that % for
> xfs?  Since xfs doesn't mark its actual buffers dirty, so wouldn't use
> account_metadata_dirtied and it's family, how do we generate this % for xfs?  Or
> am I misunderstanding and you do plan to use those helpers?

AFAIU he plans to use account_metadata_dirtied() & co. in XFS.

> If you do plan to use them, then we just need to figure out what we want
> the ratio to be of, and then you'll be happy Dave?

A reasonably natural dirty target would be to have dirty_background_ratio of
the total metadata amount dirty. We would have to be somewhat creative if
dirty_background_bytes is actually set instead of dirty_background_ratio and
use a ratio like dirty_background_bytes / (data + metadata amount), but it's
doable...
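
A minimal sketch of that computation (the helper name is made up;
dirty_background_ratio and dirty_background_bytes are the existing sysctls,
and the byte counts would come from the counters this series adds) could
look like:

static unsigned int metadata_dirty_target_pct(u64 data_bytes, u64 metadata_bytes)
{
        u64 total = data_bytes + metadata_bytes;

        /* An absolute byte limit has to be translated into a percentage. */
        if (dirty_background_bytes) {
                if (!total)
                        return 0;
                return min_t(u64, 100,
                             div64_u64(100 * (u64)dirty_background_bytes, total));
        }

        /* Otherwise the sysctl already is the percentage we want. */
        return dirty_background_ratio;
}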

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2018-01-03 16:26                       ` Jan Kara
@ 2018-01-03 16:29                         ` Josef Bacik
  2018-01-29  9:06                           ` Chandan Rajendra
  0 siblings, 1 reply; 31+ messages in thread
From: Josef Bacik @ 2018-01-03 16:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Josef Bacik, Dave Chinner, hannes, linux-mm, akpm, linux-fsdevel,
	kernel-team, linux-btrfs, Josef Bacik

On Wed, Jan 03, 2018 at 05:26:03PM +0100, Jan Kara wrote:
> On Wed 03-01-18 10:49:33, Josef Bacik wrote:
> > On Wed, Jan 03, 2018 at 02:59:21PM +0100, Jan Kara wrote:
> > > On Wed 03-01-18 13:32:19, Dave Chinner wrote:
> > > > On Tue, Jan 02, 2018 at 11:13:06AM -0500, Josef Bacik wrote:
> > > > > On Wed, Dec 20, 2017 at 03:30:55PM +0100, Jan Kara wrote:
> > > > > > On Wed 20-12-17 08:35:05, Dave Chinner wrote:
> > > > > > > On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> > > > > > > > On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > > > > > > > > IOWs, treating metadata like it's one great big data inode doesn't
> > > > > > > > > seem to me to be the right abstraction to use for this - in most
> > > > > > > > > fileystems it's a bunch of objects with a complex dependency tree
> > > > > > > > > and unknown write ordering, not an inode full of data that can be
> > > > > > > > > sequentially written.
> > > > > > > > > 
> > > > > > > > > Maybe we need multiple ops with well defined behaviours. e.g.
> > > > > > > > > ->writeback_metadata() for background writeback, ->sync_metadata() for
> > > > > > > > > sync based operations. That way different filesystems can ignore the
> > > > > > > > > parts they don't need simply by not implementing those operations,
> > > > > > > > > and the writeback code doesn't need to try to cater for all
> > > > > > > > > operations through the one op. The writeback code should be cleaner,
> > > > > > > > > the filesystem code should be cleaner, and we can tailor the work
> > > > > > > > > guidelines for each operation separately so there's less mismatch
> > > > > > > > > between what writeback is asking and how filesystems track dirty
> > > > > > > > > metadata...
> > > > > > > > 
> > > > > > > > I agree that writeback for memory cleaning and writeback for data integrity
> > > > > > > > are two very different things especially for metadata. In fact for data
> > > > > > > > integrity writeback we already have ->sync_fs operation so there the
> > > > > > > > functionality gets duplicated. What we could do is that in
> > > > > > > > writeback_sb_inodes() we'd call ->write_metadata only when
> > > > > > > > work->for_kupdate or work->for_background is set. That way ->write_metadata
> > > > > > > > would be called only for memory cleaning purposes.
> > > > > > > 
> > > > > > > That makes sense, but I still think we need a better indication of
> > > > > > > how much writeback we need to do than just "writeback this chunk of
> > > > > > > pages". That "writeback a chunk" interface is necessary to share
> > > > > > > writeback bandwidth across numerous data inodes so that we don't
> > > > > > > starve any one inode of writeback bandwidth. That's unnecessary for
> > > > > > > metadata writeback on a superblock - we don't need to share that
> > > > > > > bandwidth around hundreds or thousands of inodes. What we actually
> > > > > > > need to know is how much writeback we need to do as a total of all
> > > > > > > the dirty metadata on the superblock.
> > > > > > > 
> > > > > > > Sure, that's not ideal for btrfs and mayext4, but we can write a
> > > > > > > simple generic helper that converts "flush X percent of dirty
> > > > > > > metadata" to a page/byte chunk as the current code does. DOing it
> > > > > > > this way allows filesystems to completely internalise the accounting
> > > > > > > that needs to be done, rather than trying to hack around a
> > > > > > > writeback accounting interface with large impedance mismatches to
> > > > > > > how the filesystem accounts for dirty metadata and/or tracks
> > > > > > > writeback progress.
> > > > > > 
> > > > > > Let me think loud on how we could tie this into how memory cleaning
> > > > > > writeback currently works - the one with for_background == 1 which is
> > > > > > generally used to get amount of dirty pages in the system under control.
> > > > > > We have a queue of inodes to write, we iterate over this queue and ask each
> > > > > > inode to write some amount (e.g. 64 M - exact amount depends on measured
> > > > 
> > > > It's a maximum of 1024 pages per inode.
> > > 
> > > That's actually a minimum, not maximum, if I read the code in
> > > writeback_chunk_size() right.
> > > 
> > > > > > writeback bandwidth etc.). Some amount from that inode gets written and we
> > > > > > continue with the next inode in the queue (put this one at the end of the
> > > > > > queue if it still has dirty pages). We do this until:
> > > > > > 
> > > > > > a) the number of dirty pages in the system is below background dirty limit
> > > > > >    and the number dirty pages for this device is below background dirty
> > > > > >    limit for this device.
> > > > > > b) run out of dirty inodes on this device
> > > > > > c) someone queues different type of writeback
> > > > > > 
> > > > > > And we need to somehow incorporate metadata writeback into this loop. I see
> > > > > > two questions here:
> > > > > > 
> > > > > > 1) When / how often should we ask for metadata writeback?
> > > > > > 2) How much to ask to write in one go?
> > > > > > 
> > > > > > The second question is especially tricky in the presence of completely
> > > > > > async metadata flushing in XFS - we can ask to write say half of dirty
> > > > > > metadata but then we have no idea whether the next observation of dirty
> > > > > > metadata counters is with that part of metadata already under writeback /
> > > > > > cleaned or whether xfsaild didn't even start working and pushing more has
> > > > > > no sense.
> > > > 
> > > > Well, like with ext4, we've also got to consider that a bunch of the
> > > > recently dirtied metadata (e.g. from delalloc, EOF updates on IO
> > > > completion, etc) is still pinned in memory because the
> > > > journal has not been flushed/checkpointed. Hence we should not be
> > > > attempting to write back metadata we've dirtied as a result of
> > > > writing data in the background writeback loop.
> > > 
> > > Agreed. Actually for ext4 I would not expose 'pinned' buffers as dirty to
> > > VM - the journalling layer currently already works that way and it works
> > > well for us. But that's just a small technical detail and different
> > > filesystems can decide differently.
> > > 
> > > > That greatly simplifies what we need to consider here. That is, we
> > > > just need to sample the ratio of dirty metadata to clean metadata
> > > > before we start data writeback, and we calculate the amount of
> > > > metadata writeback we should trigger from there. We only need to
> > > > do this *once* per background writeback scan for a superblock
> > > > as there is no need for sharing bandwidth between lots of data
> > > > inodes - there's only one metadata inode for ext4/btrfs, and XFS is
> > > > completely async....
> > > 
> > > OK, agreed again.
> > > 
> > > > > > Partly, this could be dealt with by telling the filesystem
> > > > > > "metadata dirty target" - i.e. "get your dirty metadata counters below X"
> > > > > > - and whether we communicate that in bytes, pages, or a fraction of
> > > > > > current dirty metadata counter value is a detail I don't have a strong
> > > > > > opinion on now. And the fact is the amount written by the filesystem
> > > > > > doesn't have to be very accurate anyway - we basically just want to make
> > > > > > some forward progress with writing metadata, don't want that to take too
> > > > > > long (so that other writeback from the thread isn't stalled), and if
> > > > > > writeback code is unhappy about the state of counters next time it looks,
> > > > > > it will ask the filesystem again...
> > > > 
> > > > Right. The problem is communicating "how much" to the filesystem in
> > > > a useful manner....
> > > 
> > > Yep. I'm fine with communication in the form of 'write X% of your dirty
> > > metadata'. That should be useful for XFS and as you mentioned in some
> > > previous email, we can provide a helper function to compute number of pages
> > > to write (including some reasonable upper limit to bound time spent in one
> > > ->write_metadata invocation) for ext4 and btrfs.
> > > 
> > > > > > This gets me directly to another problem with async nature of XFS metadata
> > > > > > writeback. That is that it could get writeback thread into busyloop - we
> > > > > > are supposed to terminate memory cleaning writeback only once dirty
> > > > > > counters are below limit and in case dirty metadata is causing counters to
> > > > > > be over limit, we would just ask in a loop XFS to get metadata below the
> > > > > > target. I suppose XFS could just return "nothing written" from its
> > > > > > ->write_metadata operation and in such case we could sleep a bit before
> > > > > > going for another writeback loop (the same thing happens when filesystem
> > > > > > reports all inodes are locked / busy and it cannot writeback anything). But
> > > > > > it's getting a bit ugly and is it really better than somehow waiting inside
> > > > > > XFS for metadata writeback to occur?  Any idea Dave?
> > > > 
> > > > I tend to think that the whole point of background writeback is to
> > > > do it asynchronously and keep the IO pipe full by avoiding blocking
> > > > on any specific object. i.e. if we can't do writeback from this
> > > > object, then skip it and do it from the next....
> > > 
> > > Agreed.
> > > 
> > > > I think we could probably block ->write_metadata if necessary via a
> > > > completion/wakeup style notification when a specific LSN is reached
> > > > by the log tail, but realistically if there's any amount of data
> > > > needing to be written it'll throttle data writes because the IO
> > > > pipeline is being kept full by background metadata writes....
> > > 
> > > So the problem I'm concerned about is a corner case. Consider a situation
> > > when you have no dirty data, only dirty metadata but enough of them to
> > > trigger background writeback. How should metadata writeback behave for XFS
> > > in this case? Who should be responsible that wb_writeback() just does not
> > > loop invoking ->write_metadata() as fast as CPU allows until xfsaild makes
> > > enough progress?
> > > 
> > > Thinking about this today, I think this looping prevention belongs to
> > > wb_writeback(). Sadly we don't have much info to decide how long to sleep
> > > before trying more writeback so we'd have to just sleep for
> > > <some_magic_amount> if we found no writeback happened in the last writeback
> > > round before going through the whole writeback loop again. And
> > > ->write_metadata() for XFS would need to always return 0 (as in "no progress
> > > made") to make sure this busyloop avoidance logic in wb_writeback()
> > > triggers. ext4 and btrfs would return number of bytes written from
> > > ->write_metadata (or just 1 would be enough to indicate some progress in
> > > metadata writeback was made and busyloop avoidance is not needed).
> > > 
> > > So overall I think I have pretty clear idea on how this all should work to
> > > make ->write_metadata useful for btrfs, XFS, and ext4 and we agree on the
> > > plan.
> > > 
> > 
> > I'm glad you do, I'm still confused.  I'm totally fine with sending a % to the
> > fs to figure out what it wants, what I'm confused about is how to get that % for
> > xfs?  Since xfs doesn't mark its actual buffers dirty, so wouldn't use
> > account_metadata_dirtied and it's family, how do we generate this % for xfs?  Or
> > am I misunderstanding and you do plan to use those helpers?
> 
> AFAIU he plans to use account_metadata_dirtied() & co. in XFS.
> 
> > If you do plan to use them, then we just need to figure out what we want
> > the ratio to be of, and then you'll be happy Dave?
> 
> Reasonably natural dirty target would be dirty_background_ratio of total
> metadata amount to be dirty. We would have to be somewhat creative if
> dirty_background_bytes is actually set instead of dirty_background_ratio
> and use ratio like dirty_background_bytes / (data + metadata amount) but
> it's doable...
> 

Oh, OK, well if that's the case then I'll fix this up to be a ratio, test
everything, and send it along, probably early next week.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2018-01-03 13:59                   ` Jan Kara
  2018-01-03 15:49                     ` Josef Bacik
@ 2018-01-04  1:32                     ` Dave Chinner
  2018-01-04  9:10                       ` Jan Kara
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2018-01-04  1:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: Josef Bacik, hannes, linux-mm, akpm, linux-fsdevel, kernel-team,
	linux-btrfs, Josef Bacik

On Wed, Jan 03, 2018 at 02:59:21PM +0100, Jan Kara wrote:
> On Wed 03-01-18 13:32:19, Dave Chinner wrote:
> > I think we could probably block ->write_metadata if necessary via a
> > completion/wakeup style notification when a specific LSN is reached
> > by the log tail, but realistically if there's any amount of data
> > needing to be written it'll throttle data writes because the IO
> > pipeline is being kept full by background metadata writes....
> 
> So the problem I'm concerned about is a corner case. Consider a situation
> when you have no dirty data, only dirty metadata but enough of them to
> trigger background writeback. How should metadata writeback behave for XFS
> in this case? Who should be responsible that wb_writeback() just does not
> loop invoking ->write_metadata() as fast as CPU allows until xfsaild makes
> enough progress?
>
> Thinking about this today, I think this looping prevention belongs to
> wb_writeback().

Well, background data writeback can block in two ways. One is during
IO submission when the request queue is full; the other is when all
dirty inodes have had some work done on them and have all been moved
to b_more_io - wb_writeback() waits for the __I_SYNC bit to be cleared
on the last(?) inode on that list, hence backing off before
submitting more IO.

IOWs, there's a "during writeback" blocking mechanism as well as a
"between cycles" blocking mechanism.

> Sadly we don't have much info to decide how long to sleep
> before trying more writeback so we'd have to just sleep for
> <some_magic_amount> if we found no writeback happened in the last writeback
> round before going through the whole writeback loop again.

Right - I don't think we can provide a generic "between cycles"
blocking mechanism for XFS, but I'm pretty sure we can emulate a
"during writeback" blocking mechanism to avoid busy looping inside
the XFS code.

e.g. if we get a writeback call that asks for 5% to be written,
and we already have a metadata writeback target of 5% in place,
that means we should block for a while. That would emulate request
queue blocking and prevent busy looping in this case....
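
Purely as an illustration - m_meta_target_pct and m_meta_target_wait don't
exist in XFS today, xfsaild would be the one clearing the target and waking
the waitqueue once it gets there, and the percentage-based prototype is the
variant being discussed here rather than what the posted patches implement -
that could look something like:

long xfs_write_metadata(struct super_block *sb, unsigned int target_pct)
{
        struct xfs_mount        *mp = XFS_M(sb);

        if (target_pct <= READ_ONCE(mp->m_meta_target_pct)) {
                /*
                 * A push covering at least this much is already in
                 * flight; behave like a full request queue and wait a
                 * while for xfsaild to make progress instead of
                 * spinning straight back into this function.
                 */
                wait_event_timeout(mp->m_meta_target_wait,
                                   !READ_ONCE(mp->m_meta_target_pct),
                                   msecs_to_jiffies(100));
        } else {
                WRITE_ONCE(mp->m_meta_target_pct, target_pct);
                xfs_ail_push_all(mp->m_ail);
        }

        /* Non-zero return: tell the caller some progress was made. */
        return 1;
}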

> And
> ->write_metadata() for XFS would need to always return 0 (as in "no progress
> made") to make sure this busyloop avoidance logic in wb_writeback()
> triggers. ext4 and btrfs would return number of bytes written from
> ->write_metadata (or just 1 would be enough to indicate some progress in
> metadata writeback was made and busyloop avoidance is not needed).

Well, if we block for a little while, we can indicate that progress
has been made and this whole mess would go away, right?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2018-01-04  1:32                     ` Dave Chinner
@ 2018-01-04  9:10                       ` Jan Kara
  0 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2018-01-04  9:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Josef Bacik, hannes, linux-mm, akpm, linux-fsdevel,
	kernel-team, linux-btrfs, Josef Bacik

On Thu 04-01-18 12:32:07, Dave Chinner wrote:
> On Wed, Jan 03, 2018 at 02:59:21PM +0100, Jan Kara wrote:
> > On Wed 03-01-18 13:32:19, Dave Chinner wrote:
> > > I think we could probably block ->write_metadata if necessary via a
> > > completion/wakeup style notification when a specific LSN is reached
> > > by the log tail, but realistically if there's any amount of data
> > > needing to be written it'll throttle data writes because the IO
> > > pipeline is being kept full by background metadata writes....
> > 
> > So the problem I'm concerned about is a corner case. Consider a situation
> > when you have no dirty data, only dirty metadata but enough of them to
> > trigger background writeback. How should metadata writeback behave for XFS
> > in this case? Who should be responsible that wb_writeback() just does not
> > loop invoking ->write_metadata() as fast as CPU allows until xfsaild makes
> > enough progress?
> >
> > Thinking about this today, I think this looping prevention belongs to
> > wb_writeback().
> 
> Well, backgroudn data writeback can block in two ways. One is during
> IO submission when the request queue is full, the other is when all
> dirty inodes have had some work done on them and have all been moved
> to b_more_io - wb_writeback waits for the __I_SYNC bit to be cleared
> on the last(?) inode on that list, hence backing off before
> submitting more IO.
> 
> IOws, there's a "during writeback" blocking mechanism as well as a
> "between cycles" block mechanism.
> 
> > Sadly we don't have much info to decide how long to sleep
> > before trying more writeback so we'd have to just sleep for
> > <some_magic_amount> if we found no writeback happened in the last writeback
> > round before going through the whole writeback loop again.
> 
> Right - I don't think we can provide a generic "between cycles"
> blocking mechanism for XFS, but I'm pretty sure we can emulate a
> "during writeback" blocking mechanism to avoid busy looping inside
> the XFS code.
> 
> e.g. if we get a writeback call that asks for 5% to be written,
> and we already have a metadata writeback target of 5% in place,
> that means we should block for a while. That would emulate request
> queue blocking and prevent busy looping in this case....

If you can do this in XFS then fine; it saves some mess in the generic
code.

> > And
> > ->write_metadata() for XFS would need to always return 0 (as in "no progress
> > made") to make sure this busyloop avoidance logic in wb_writeback()
> > triggers. ext4 and btrfs would return number of bytes written from
> > ->write_metadata (or just 1 would be enough to indicate some progress in
> > metadata writeback was made and busyloop avoidance is not needed).
> 
> Well, if we block for a little while, we can indicate that progress
> has been made and this whole mess would go away, right?

Right. So let's just ignore the problem for the sake of Josef's patch set.
Once the patches land and XFS starts using the infrastructure, we will make
sure this is handled properly.
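
If the generic fallback ever does become necessary it could stay pretty
small - something like the sketch below, where the helper name and the pause
value are made up:

static void wb_metadata_backoff(long progress)
{
        /*
         * Nothing (data or metadata) was written in the last round;
         * back off for a bit instead of re-invoking ->write_metadata()
         * as fast as the CPU allows.  The pause length is arbitrary.
         */
        if (!progress)
                schedule_timeout_interruptible(HZ / 10);
}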

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2018-01-03 16:29                         ` Josef Bacik
@ 2018-01-29  9:06                           ` Chandan Rajendra
  2018-09-28  8:37                             ` Chandan Rajendra
  0 siblings, 1 reply; 31+ messages in thread
From: Chandan Rajendra @ 2018-01-29  9:06 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Jan Kara, Dave Chinner, hannes, linux-mm, akpm, linux-fsdevel,
	kernel-team, linux-btrfs, Josef Bacik

On Wednesday, January 3, 2018 9:59:24 PM IST Josef Bacik wrote:
> On Wed, Jan 03, 2018 at 05:26:03PM +0100, Jan Kara wrote:

> 
> Oh ok well if that's the case then I'll fix this up to be a ratio, test
> everything, and send it along probably early next week.  Thanks,
> 

Hi Josef,

Did you get a chance to work on the next version of this patchset?


-- 
chandan


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
  2018-01-29  9:06                           ` Chandan Rajendra
@ 2018-09-28  8:37                             ` Chandan Rajendra
  0 siblings, 0 replies; 31+ messages in thread
From: Chandan Rajendra @ 2018-09-28  8:37 UTC (permalink / raw)
  To: Chandan Rajendra, linux-mm
  Cc: Josef Bacik, akpm, linux-fsdevel, kernel-team, linux-btrfs

On Monday, January 29, 2018 2:36:15 PM IST Chandan Rajendra wrote:
> On Wednesday, January 3, 2018 9:59:24 PM IST Josef Bacik wrote:
> > On Wed, Jan 03, 2018 at 05:26:03PM +0100, Jan Kara wrote:
> 
> > 
> > Oh ok well if that's the case then I'll fix this up to be a ratio, test
> > everything, and send it along probably early next week.  Thanks,
> > 
> 
> Hi Josef,
> 
> Did you get a chance to work on the next version of this patchset?
> 
> 
> 

Josef, any updates on this and the "kill the btree_inode" patchset?

-- 
chandan

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2018-09-28  8:37 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-11 21:55 [PATCH v3 00/11] Metadata specific accouting and dirty writeout Josef Bacik
2017-12-11 21:55 ` [PATCH v3 01/10] remove mapping from balance_dirty_pages*() Josef Bacik
2017-12-11 21:55 ` [PATCH v3 02/10] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes Josef Bacik
2017-12-11 21:55 ` [PATCH v3 03/10] lib: add a __fprop_add_percpu_max Josef Bacik
2017-12-19  7:25   ` Jan Kara
2017-12-11 21:55 ` [PATCH v3 04/10] writeback: convert the flexible prop stuff to bytes Josef Bacik
2017-12-11 21:55 ` [PATCH v3 05/10] writeback: add counters for metadata usage Josef Bacik
2017-12-19  7:52   ` Jan Kara
2017-12-11 21:55 ` [PATCH v3 06/10] writeback: introduce super_operations->write_metadata Josef Bacik
2017-12-11 23:36   ` Dave Chinner
2017-12-12 18:05     ` Josef Bacik
2017-12-12 22:20       ` Dave Chinner
2017-12-12 23:59         ` Josef Bacik
2017-12-19 12:07         ` Jan Kara
2017-12-19 21:35           ` Dave Chinner
2017-12-20 14:30             ` Jan Kara
2018-01-02 16:13               ` Josef Bacik
2018-01-03  2:32                 ` Dave Chinner
2018-01-03 13:59                   ` Jan Kara
2018-01-03 15:49                     ` Josef Bacik
2018-01-03 16:26                       ` Jan Kara
2018-01-03 16:29                         ` Josef Bacik
2018-01-29  9:06                           ` Chandan Rajendra
2018-09-28  8:37                             ` Chandan Rajendra
2018-01-04  1:32                     ` Dave Chinner
2018-01-04  9:10                       ` Jan Kara
2017-12-19 12:21   ` Jan Kara
2017-12-11 21:55 ` [PATCH v3 07/10] export radix_tree_iter_tag_set Josef Bacik
2017-12-11 21:55 ` [PATCH v3 08/10] Btrfs: kill the btree_inode Josef Bacik
2017-12-11 21:55 ` [PATCH v3 09/10] btrfs: rework end io for extent buffer reads Josef Bacik
2017-12-11 21:55 ` [PATCH v3 10/10] btrfs: add NR_METADATA_BYTES accounting Josef Bacik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).