linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3][V2] Provide accounting for dirty metadata
@ 2016-08-22 17:34 Josef Bacik
  2016-08-22 17:35 ` [PATCH 1/3] remove mapping from balance_dirty_pages*() Josef Bacik
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Josef Bacik @ 2016-08-22 17:34 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel, kernel-team, jack, viro, dchinner,
	hch, linux-mm

Here is my updated set of patches for providing a way for fs'es to do their own
accounting for their dirty metadata pages.  The changes since V1 include

-Split the accounting + ->write_metadata patches out into their own patches.
-Added a few more account_metadata* helpers that I hadn't thought about
previously.
-Changed the bdi->sb_list to bdi->dirty_sb_list.  This is to avoid confusion
about the purpose of the list.  I do a splice of this list when processing it as
we have to drop the list lock and I didn't want to worry about umounts screwing
up the list while we were writing metadata.
-Added the dirty metadata counter to the various places we output those counters
(meminfo, oom messages, etc).

I've also actually changed btrfs to use these interfaces and have been testing
that code for almost a week now and have fixed up the various problems that
happend with V1 of this code.  Thanks,

Josef

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/3] remove mapping from balance_dirty_pages*()
  2016-08-22 17:34 [PATCH 0/3][V2] Provide accounting for dirty metadata Josef Bacik
@ 2016-08-22 17:35 ` Josef Bacik
  2016-08-22 18:29   ` kbuild test robot
  2016-08-22 17:35 ` [PATCH 2/3] writeback: allow for dirty metadata accounting Josef Bacik
  2016-08-22 17:35 ` [PATCH 3/3] writeback: introduce super_operations->write_metadata Josef Bacik
  2 siblings, 1 reply; 13+ messages in thread
From: Josef Bacik @ 2016-08-22 17:35 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel, kernel-team, jack, viro, dchinner,
	hch, linux-mm

The only reason we pass in the mapping is to get the inode in order to see if
writeback cgroups is enabled, and even then it only checks the bdi and a super
block flag.  balance_dirty_pages() doesn't even use the mapping.  Since
balance_dirty_pages*() works on a bdi level, just pass in the bdi and super
block directly so we can avoid using mapping.  This will allow us to still use
balance_dirty_pages for dirty metadata pages that are not backed by an
address_mapping.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 drivers/mtd/devices/block2mtd.c | 12 ++++++++----
 fs/btrfs/disk-io.c              |  4 ++--
 fs/btrfs/file.c                 |  3 ++-
 fs/btrfs/ioctl.c                |  3 ++-
 fs/btrfs/relocation.c           |  3 ++-
 fs/buffer.c                     |  3 ++-
 fs/iomap.c                      |  3 ++-
 fs/ntfs/attrib.c                | 10 +++++++---
 fs/ntfs/file.c                  |  4 ++--
 include/linux/backing-dev.h     | 29 +++++++++++++++++++++++------
 include/linux/writeback.h       |  3 ++-
 mm/filemap.c                    |  4 +++-
 mm/memory.c                     |  9 +++++++--
 mm/page-writeback.c             | 15 +++++++--------
 14 files changed, 71 insertions(+), 34 deletions(-)

diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c
index 7c887f1..7892d0b 100644
--- a/drivers/mtd/devices/block2mtd.c
+++ b/drivers/mtd/devices/block2mtd.c
@@ -52,7 +52,8 @@ static struct page *page_read(struct address_space *mapping, int index)
 /* erase a specified part of the device */
 static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len)
 {
-	struct address_space *mapping = dev->blkdev->bd_inode->i_mapping;
+	struct inode *inode = dev->blkdev->bd_inode;
+	struct address_space *mapping = inode->i_mapping;
 	struct page *page;
 	int index = to >> PAGE_SHIFT;	// page index
 	int pages = len >> PAGE_SHIFT;
@@ -71,7 +72,8 @@ static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len)
 				memset(page_address(page), 0xff, PAGE_SIZE);
 				set_page_dirty(page);
 				unlock_page(page);
-				balance_dirty_pages_ratelimited(mapping);
+				balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+								inode->i_sb);
 				break;
 			}
 
@@ -141,7 +143,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf,
 		loff_t to, size_t len, size_t *retlen)
 {
 	struct page *page;
-	struct address_space *mapping = dev->blkdev->bd_inode->i_mapping;
+	struct inode *inode = dev->blkdev->bd_inode;
+	struct address_space *mapping = inode->i_mapping;
 	int index = to >> PAGE_SHIFT;	// page index
 	int offset = to & ~PAGE_MASK;	// page offset
 	int cpylen;
@@ -162,7 +165,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf,
 			memcpy(page_address(page) + offset, buf, cpylen);
 			set_page_dirty(page);
 			unlock_page(page);
-			balance_dirty_pages_ratelimited(mapping);
+			balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+							inode->i_sb);
 		}
 		put_page(page);
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 87dad55..4034ad6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4024,8 +4024,8 @@ static void __btrfs_btree_balance_dirty(struct btrfs_root *root,
 	ret = percpu_counter_compare(&root->fs_info->dirty_metadata_bytes,
 				     BTRFS_DIRTY_METADATA_THRESH);
 	if (ret > 0) {
-		balance_dirty_pages_ratelimited(
-				   root->fs_info->btree_inode->i_mapping);
+		balance_dirty_pages_ratelimited(&root->fs_info->bdi,
+						root->fs_info->sb);
 	}
 }
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9404121..f060b08 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1686,7 +1686,8 @@ again:
 
 		cond_resched();
 
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		if (dirty_pages < (root->nodesize >> PAGE_SHIFT) + 1)
 			btrfs_btree_balance_dirty(root);
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 14ed1e9..a222bad 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1410,7 +1410,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 		}
 
 		defrag_count += ret;
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		inode_unlock(inode);
 
 		if (newer_than) {
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index b26a5ae..6e194a5 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3202,7 +3202,8 @@ static int relocate_file_extent_cluster(struct inode *inode,
 		put_page(page);
 
 		index++;
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		btrfs_throttle(BTRFS_I(inode)->root);
 	}
 	WARN_ON(nr != cluster->nr);
diff --git a/fs/buffer.c b/fs/buffer.c
index 9c8eb9b..9bbe30d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2386,7 +2386,8 @@ static int cont_expand_zero(struct file *file, struct address_space *mapping,
 		BUG_ON(err != len);
 		err = 0;
 
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 
 		if (unlikely(fatal_signal_pending(current))) {
 			err = -EINTR;
diff --git a/fs/iomap.c b/fs/iomap.c
index 48141b8..937e266 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -226,7 +226,8 @@ again:
 		written += copied;
 		length -= copied;
 
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 	} while (iov_iter_count(i) && length);
 
 	return written ? written : status;
diff --git a/fs/ntfs/attrib.c b/fs/ntfs/attrib.c
index 44a39a0..0a8a39e 100644
--- a/fs/ntfs/attrib.c
+++ b/fs/ntfs/attrib.c
@@ -2493,6 +2493,7 @@ conv_err_out:
 int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 {
 	ntfs_volume *vol = ni->vol;
+	struct inode *inode = VFS_I(ni);
 	struct address_space *mapping;
 	struct page *page;
 	u8 *kaddr;
@@ -2545,7 +2546,8 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		kunmap_atomic(kaddr);
 		set_page_dirty(page);
 		put_page(page);
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		cond_resched();
 		if (idx == end)
 			goto done;
@@ -2586,7 +2588,8 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		/* Finally unlock and release the page. */
 		unlock_page(page);
 		put_page(page);
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		cond_resched();
 	}
 	/* If there is a last partial page, need to do it the slow way. */
@@ -2603,7 +2606,8 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		kunmap_atomic(kaddr);
 		set_page_dirty(page);
 		put_page(page);
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		cond_resched();
 	}
 done:
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index f548629..66082eb 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -276,7 +276,7 @@ do_non_resident_extend:
 		 * number of pages we read and make dirty in the case of sparse
 		 * files.
 		 */
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(vi), vi->i_sb);
 		cond_resched();
 	} while (++index < end_index);
 	read_lock_irqsave(&ni->size_lock, flags);
@@ -1914,7 +1914,7 @@ again:
 		iov_iter_advance(i, copied);
 		pos += copied;
 		written += copied;
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(vi), vi->i_sb);
 		if (fatal_signal_pending(current)) {
 			status = -EINTR;
 			break;
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 491a917..089acf6 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -252,8 +252,9 @@ void wb_blkcg_offline(struct blkcg *blkcg);
 int inode_congested(struct inode *inode, int cong_bits);
 
 /**
- * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
- * @inode: inode of interest
+ * bdi_cgwb_enabled - test wether cgroup writeback is enabled on a filesystem
+ * @bdi: the bdi we care about
+ * @sb: the super for the bdi
  *
  * cgroup writeback requires support from both the bdi and filesystem.
  * Also, both memcg and iocg have to be on the default hierarchy.  Test
@@ -262,15 +263,25 @@ int inode_congested(struct inode *inode, int cong_bits);
  * Note that the test result may change dynamically on the same inode
  * depending on how memcg and iocg are configured.
  */
-static inline bool inode_cgwb_enabled(struct inode *inode)
+static inline bool bdi_cgwb_enabled(struct backing_dev_info *bdi,
+				    struct super_block *sb)
 {
-	struct backing_dev_info *bdi = inode_to_bdi(inode);
-
 	return cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
 		cgroup_subsys_on_dfl(io_cgrp_subsys) &&
 		bdi_cap_account_dirty(bdi) &&
 		(bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
-		(inode->i_sb->s_iflags & SB_I_CGROUPWB);
+		(sb->s_iflags & SB_I_CGROUPWB);
+}
+
+/**
+ * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
+ * @inode: inode of interest
+ *
+ * Does the inode have cgroup writeback support.
+ */
+static inline bool inode_cgwb_enabled(struct inode *inode)
+{
+	return bdi_cgwb_enabled(inode_to_bdi(inode), inode->i_sb);
 }
 
 /**
@@ -413,6 +424,12 @@ static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
+static inline bool bdi_cgwb_enabled(struct backing_dev_info *bdi,
+				    struct super_block *sb)
+{
+	return false;
+}
+
 static inline bool inode_cgwb_enabled(struct inode *inode)
 {
 	return false;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index fc1e16c..256ffc3 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -364,7 +364,8 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
 
 void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time);
 void page_writeback_init(void);
-void balance_dirty_pages_ratelimited(struct address_space *mapping);
+void balance_dirty_pages_ratelimited(struct backing_dev_info *bdi,
+				     struct super_block *sb);
 bool wb_over_bg_thresh(struct bdi_writeback *wb);
 
 typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
diff --git a/mm/filemap.c b/mm/filemap.c
index 3083ded..abb0e98 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2667,6 +2667,7 @@ ssize_t generic_perform_write(struct file *file,
 				struct iov_iter *i, loff_t pos)
 {
 	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	long status = 0;
 	ssize_t written = 0;
@@ -2746,7 +2747,8 @@ again:
 		pos += copied;
 		written += copied;
 
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 	} while (iov_iter_count(i));
 
 	return written ? written : status;
diff --git a/mm/memory.c b/mm/memory.c
index 83be99d..d43e73b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -64,6 +64,7 @@
 #include <linux/debugfs.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/dax.h>
+#include <linux/backing-dev.h>
 
 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -2105,11 +2106,13 @@ static inline int wp_page_reuse(struct fault_env *fe, pte_t orig_pte,
 		put_page(page);
 
 		if ((dirtied || page_mkwrite) && mapping) {
+			struct inode *inode = mapping->host;
 			/*
 			 * Some device drivers do not set page.mapping
 			 * but still dirty their pages
 			 */
-			balance_dirty_pages_ratelimited(mapping);
+			balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+							inode->i_sb);
 		}
 
 		if (!page_mkwrite)
@@ -3291,11 +3294,13 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff)
 	mapping = page_rmapping(fault_page);
 	unlock_page(fault_page);
 	if ((dirtied || vma->vm_ops->page_mkwrite) && mapping) {
+		struct inode *inode = mapping->host;
 		/*
 		 * Some device drivers do not set page.mapping but still
 		 * dirty their pages
 		 */
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 	}
 
 	if (!vma->vm_ops->page_mkwrite)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f4cd7d8..121a6e3 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1559,8 +1559,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
  * If we're over `background_thresh' then the writeback threads are woken to
  * perform some writeout.
  */
-static void balance_dirty_pages(struct address_space *mapping,
-				struct bdi_writeback *wb,
+static void balance_dirty_pages(struct bdi_writeback *wb,
 				unsigned long pages_dirtied)
 {
 	struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
@@ -1849,7 +1848,8 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
 
 /**
  * balance_dirty_pages_ratelimited - balance dirty memory state
- * @mapping: address_space which was dirtied
+ * @bdi: the bdi that was dirtied
+ * @sb: the super block that was dirtied
  *
  * Processes which are dirtying memory should call in here once for each page
  * which was newly dirtied.  The function will periodically check the system's
@@ -1860,10 +1860,9 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
  * limit we decrease the ratelimiting by a lot, to prevent individual processes
  * from overshooting the limit by (ratelimit_pages) each.
  */
-void balance_dirty_pages_ratelimited(struct address_space *mapping)
+void balance_dirty_pages_ratelimited(struct backing_dev_info *bdi,
+				     struct super_block *sb)
 {
-	struct inode *inode = mapping->host;
-	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	struct bdi_writeback *wb = NULL;
 	int ratelimit;
 	int *p;
@@ -1871,7 +1870,7 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
 	if (!bdi_cap_account_dirty(bdi))
 		return;
 
-	if (inode_cgwb_enabled(inode))
+	if (bdi_cgwb_enabled(bdi, sb))
 		wb = wb_get_create_current(bdi, GFP_KERNEL);
 	if (!wb)
 		wb = &bdi->wb;
@@ -1909,7 +1908,7 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
 	preempt_enable();
 
 	if (unlikely(current->nr_dirtied >= ratelimit))
-		balance_dirty_pages(mapping, wb, current->nr_dirtied);
+		balance_dirty_pages(wb, current->nr_dirtied);
 
 	wb_put(wb);
 }
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/3] writeback: allow for dirty metadata accounting
  2016-08-22 17:34 [PATCH 0/3][V2] Provide accounting for dirty metadata Josef Bacik
  2016-08-22 17:35 ` [PATCH 1/3] remove mapping from balance_dirty_pages*() Josef Bacik
@ 2016-08-22 17:35 ` Josef Bacik
  2016-09-09  8:17   ` Jan Kara
  2016-08-22 17:35 ` [PATCH 3/3] writeback: introduce super_operations->write_metadata Josef Bacik
  2 siblings, 1 reply; 13+ messages in thread
From: Josef Bacik @ 2016-08-22 17:35 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel, kernel-team, jack, viro, dchinner,
	hch, linux-mm

Provide a mechanism for file systems to indicate how much dirty metadata they
are holding.  This introduces a few things

1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY.
2) WB stat for dirty metadata.  This way we know if we need to try and call into
the file system to write out metadata.  This could potentially be used in the
future to make balancing of dirty pages smarter.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 arch/tile/mm/pgtable.c           |   3 +-
 drivers/base/node.c              |   2 +
 fs/fs-writeback.c                |   1 +
 fs/proc/meminfo.c                |   2 +
 include/linux/backing-dev-defs.h |   1 +
 include/linux/mm.h               |   7 +++
 include/linux/mmzone.h           |   1 +
 include/trace/events/writeback.h |   7 ++-
 mm/backing-dev.c                 |   2 +
 mm/page-writeback.c              | 100 +++++++++++++++++++++++++++++++++++++--
 mm/page_alloc.c                  |   7 ++-
 mm/vmscan.c                      |   3 +-
 12 files changed, 127 insertions(+), 9 deletions(-)

diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index 7cc6ee7..9543468 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -44,12 +44,13 @@ void show_mem(unsigned int filter)
 {
 	struct zone *zone;
 
-	pr_err("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n",
+	pr_err("Active:%lu inactive:%lu dirty:%lu metadata_dirty:%lu writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n",
 	       (global_node_page_state(NR_ACTIVE_ANON) +
 		global_node_page_state(NR_ACTIVE_FILE)),
 	       (global_node_page_state(NR_INACTIVE_ANON) +
 		global_node_page_state(NR_INACTIVE_FILE)),
 	       global_node_page_state(NR_FILE_DIRTY),
+	       global_node_page_state(NR_METADATA_DIRTY),
 	       global_node_page_state(NR_WRITEBACK),
 	       global_node_page_state(NR_UNSTABLE_NFS),
 	       global_page_state(NR_FREE_PAGES),
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f96..efc867b2 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -99,6 +99,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 #endif
 	n += sprintf(buf + n,
 		       "Node %d Dirty:          %8lu kB\n"
+		       "Node %d MetadataDirty:	%8lu kB\n"
 		       "Node %d Writeback:      %8lu kB\n"
 		       "Node %d FilePages:      %8lu kB\n"
 		       "Node %d Mapped:         %8lu kB\n"
@@ -119,6 +120,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 #endif
 			,
 		       nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
+		       nid, K(node_page_state(pgdat, NR_METADATA_DIRTY)),
 		       nid, K(node_page_state(pgdat, NR_WRITEBACK)),
 		       nid, K(node_page_state(pgdat, NR_FILE_PAGES)),
 		       nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 56c8fda..d329f89 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1809,6 +1809,7 @@ static unsigned long get_nr_dirty_pages(void)
 {
 	return global_node_page_state(NR_FILE_DIRTY) +
 		global_node_page_state(NR_UNSTABLE_NFS) +
+		global_node_page_state(NR_METADATA_DIRTY) +
 		get_nr_dirty_inodes();
 }
 
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 09e18fd..8ca094f 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -80,6 +80,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		"SwapTotal:      %8lu kB\n"
 		"SwapFree:       %8lu kB\n"
 		"Dirty:          %8lu kB\n"
+		"MetadataDirty:  %8lu kB\n"
 		"Writeback:      %8lu kB\n"
 		"AnonPages:      %8lu kB\n"
 		"Mapped:         %8lu kB\n"
@@ -139,6 +140,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		K(i.totalswap),
 		K(i.freeswap),
 		K(global_node_page_state(NR_FILE_DIRTY)),
+		K(global_node_page_state(NR_METADATA_DIRTY)),
 		K(global_node_page_state(NR_WRITEBACK)),
 		K(global_node_page_state(NR_ANON_MAPPED)),
 		K(global_node_page_state(NR_FILE_MAPPED)),
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 3f10307..1200aae 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -34,6 +34,7 @@ typedef int (congested_fn)(void *, int);
 enum wb_stat_item {
 	WB_RECLAIMABLE,
 	WB_WRITEBACK,
+	WB_METADATA_DIRTY,
 	WB_DIRTIED,
 	WB_WRITTEN,
 	NR_WB_STAT_ITEMS
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 08ed53e..5a3f626 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -31,6 +31,7 @@ struct file_ra_state;
 struct user_struct;
 struct writeback_control;
 struct bdi_writeback;
+struct backing_dev_info;
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES	/* Don't use mapnrs, do it properly */
 extern unsigned long max_mapnr;
@@ -1363,6 +1364,12 @@ int redirty_page_for_writepage(struct writeback_control *wbc,
 void account_page_dirtied(struct page *page, struct address_space *mapping);
 void account_page_cleaned(struct page *page, struct address_space *mapping,
 			  struct bdi_writeback *wb);
+void account_metadata_dirtied(struct page *page, struct backing_dev_info *bdi);
+void account_metadata_cleaned(struct page *page, struct backing_dev_info *bdi);
+void account_metadata_writeback(struct page *page,
+				struct backing_dev_info *bdi);
+void account_metadata_end_writeback(struct page *page,
+				    struct backing_dev_info *bdi);
 int set_page_dirty(struct page *page);
 int set_page_dirty_lock(struct page *page);
 void cancel_dirty_page(struct page *page);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f2e4e90..c4177ef 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -167,6 +167,7 @@ enum node_stat_item {
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
+	NR_METADATA_DIRTY,	/* Metadata dirty pages */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 2ccd9cc..c9f6427 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -402,6 +402,7 @@ TRACE_EVENT(global_dirty_state,
 
 	TP_STRUCT__entry(
 		__field(unsigned long,	nr_dirty)
+		__field(unsigned long,	nr_metadata_dirty)
 		__field(unsigned long,	nr_writeback)
 		__field(unsigned long,	nr_unstable)
 		__field(unsigned long,	background_thresh)
@@ -413,6 +414,7 @@ TRACE_EVENT(global_dirty_state,
 
 	TP_fast_assign(
 		__entry->nr_dirty	= global_node_page_state(NR_FILE_DIRTY);
+		__entry->nr_metadata_dirty = global_node_page_state(NR_METADATA_DIRTY);
 		__entry->nr_writeback	= global_node_page_state(NR_WRITEBACK);
 		__entry->nr_unstable	= global_node_page_state(NR_UNSTABLE_NFS);
 		__entry->nr_dirtied	= global_node_page_state(NR_DIRTIED);
@@ -424,7 +426,7 @@ TRACE_EVENT(global_dirty_state,
 
 	TP_printk("dirty=%lu writeback=%lu unstable=%lu "
 		  "bg_thresh=%lu thresh=%lu limit=%lu "
-		  "dirtied=%lu written=%lu",
+		  "dirtied=%lu written=%lu metadata_dirty=%lu",
 		  __entry->nr_dirty,
 		  __entry->nr_writeback,
 		  __entry->nr_unstable,
@@ -432,7 +434,8 @@ TRACE_EVENT(global_dirty_state,
 		  __entry->dirty_thresh,
 		  __entry->dirty_limit,
 		  __entry->nr_dirtied,
-		  __entry->nr_written
+		  __entry->nr_written,
+		  __entry->nr_metadata_dirty
 	)
 );
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index efe2377..b48d4e4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -78,6 +78,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   "BackgroundThresh:   %10lu kB\n"
 		   "BdiDirtied:         %10lu kB\n"
 		   "BdiWritten:         %10lu kB\n"
+		   "BdiMetadataDirty:   %10lu kB\n"
 		   "BdiWriteBandwidth:  %10lu kBps\n"
 		   "b_dirty:            %10lu\n"
 		   "b_io:               %10lu\n"
@@ -92,6 +93,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   K(background_thresh),
 		   (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
 		   (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
+		   (unsigned long) K(wb_stat(wb, WB_METADATA_DIRTY)),
 		   (unsigned long) K(wb->write_bandwidth),
 		   nr_dirty,
 		   nr_io,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 121a6e3..6a52723 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -506,6 +506,7 @@ bool node_dirty_ok(struct pglist_data *pgdat)
 	nr_pages += node_page_state(pgdat, NR_FILE_DIRTY);
 	nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS);
 	nr_pages += node_page_state(pgdat, NR_WRITEBACK);
+	nr_pages += node_page_state(pgdat, NR_METADATA_DIRTY);
 
 	return nr_pages <= limit;
 }
@@ -1595,7 +1596,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
 		 * been flushed to permanent storage.
 		 */
 		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
-					global_node_page_state(NR_UNSTABLE_NFS);
+				global_node_page_state(NR_UNSTABLE_NFS) +
+				global_node_page_state(NR_METADATA_DIRTY);
 		gdtc->avail = global_dirtyable_memory();
 		gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);
 
@@ -1935,7 +1937,8 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 	 */
 	gdtc->avail = global_dirtyable_memory();
 	gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) +
-		      global_node_page_state(NR_UNSTABLE_NFS);
+		      global_node_page_state(NR_UNSTABLE_NFS) +
+		      global_node_page_state(NR_METADATA_DIRTY);
 	domain_dirty_limits(gdtc);
 
 	if (gdtc->dirty > gdtc->bg_thresh)
@@ -2009,7 +2012,8 @@ void laptop_mode_timer_fn(unsigned long data)
 {
 	struct request_queue *q = (struct request_queue *)data;
 	int nr_pages = global_node_page_state(NR_FILE_DIRTY) +
-		global_node_page_state(NR_UNSTABLE_NFS);
+		global_node_page_state(NR_UNSTABLE_NFS) +
+		global_node_page_state(NR_METADATA_DIRTY);
 	struct bdi_writeback *wb;
 
 	/*
@@ -2473,6 +2477,96 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 EXPORT_SYMBOL(account_page_dirtied);
 
 /*
+ * account_metadata_dirtied
+ * @page - the page being dirited
+ * @bdi - the bdi that owns this page
+ *
+ * Do the dirty page accounting for metadata pages that aren't backed by an
+ * address_space.
+ */
+void account_metadata_dirtied(struct page *page, struct backing_dev_info *bdi)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__inc_node_page_state(page, NR_METADATA_DIRTY);
+	__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
+	__inc_node_page_state(page, NR_DIRTIED);
+	__inc_wb_stat(&bdi->wb, WB_RECLAIMABLE);
+	__inc_wb_stat(&bdi->wb, WB_DIRTIED);
+	__inc_wb_stat(&bdi->wb, WB_METADATA_DIRTY);
+	current->nr_dirtied++;
+	task_io_account_write(PAGE_SIZE);
+	this_cpu_inc(bdp_ratelimits);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(account_metadata_dirtied);
+
+/*
+ * account_metadata_cleaned
+ * @page - the page being cleaned
+ * @bdi - the bdi that owns this page
+ *
+ * Called on a no longer dirty metadata page.
+ */
+void account_metadata_cleaned(struct page *page, struct backing_dev_info *bdi)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__dec_node_page_state(page, NR_METADATA_DIRTY);
+	__dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
+	__dec_wb_stat(&bdi->wb, WB_RECLAIMABLE);
+	__dec_wb_stat(&bdi->wb, WB_METADATA_DIRTY);
+	task_io_account_cancelled_write(PAGE_SIZE);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(account_metadata_cleaned);
+
+/*
+ * account_metadata_writeback
+ * @page - the page being marked as writeback
+ * @bdi - the bdi that owns this page
+ *
+ * Called on a metadata page that has been marked writeback.
+ */
+void account_metadata_writeback(struct page *page,
+				struct backing_dev_info *bdi)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__inc_wb_stat(&bdi->wb, WB_WRITEBACK);
+	__inc_node_page_state(page, NR_WRITEBACK);
+	__dec_node_page_state(page, NR_METADATA_DIRTY);
+	__dec_wb_stat(&bdi->wb, WB_METADATA_DIRTY);
+	__dec_wb_stat(&bdi->wb, WB_RECLAIMABLE);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(account_metadata_writeback);
+
+/*
+ * account_metadata_end_writeback
+ * @page - the page we are ending writeback on
+ * @bdi - the bdi that owns this page
+ *
+ * Called on a metadata page that has completed writeback.
+ */
+void account_metadata_end_writeback(struct page *page,
+				    struct backing_dev_info *bdi)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__dec_wb_stat(&bdi->wb, WB_WRITEBACK);
+	__dec_node_page_state(page, NR_WRITEBACK);
+	__dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
+	__inc_node_page_state(page, NR_WRITTEN);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(account_metadata_end_writeback);
+
+/*
  * Helper function for deaccounting dirty page without writeback.
  *
  * Caller must hold lock_page_memcg().
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 39a372a..bc3523e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4218,8 +4218,8 @@ void show_free_areas(unsigned int filter)
 
 	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
 		" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
-		" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
-		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
+		" unevictable:%lu dirty:%lu metadata_dirty:%lu writeback:%lu\n"
+	        " unstable:%lu slab_reclaimable:%lu slab_unreclaimable:%lu\n"
 		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
 		" free:%lu free_pcp:%lu free_cma:%lu\n",
 		global_node_page_state(NR_ACTIVE_ANON),
@@ -4230,6 +4230,7 @@ void show_free_areas(unsigned int filter)
 		global_node_page_state(NR_ISOLATED_FILE),
 		global_node_page_state(NR_UNEVICTABLE),
 		global_node_page_state(NR_FILE_DIRTY),
+		global_node_page_state(NR_METADATA_DIRTY),
 		global_node_page_state(NR_WRITEBACK),
 		global_node_page_state(NR_UNSTABLE_NFS),
 		global_page_state(NR_SLAB_RECLAIMABLE),
@@ -4253,6 +4254,7 @@ void show_free_areas(unsigned int filter)
 			" isolated(file):%lukB"
 			" mapped:%lukB"
 			" dirty:%lukB"
+			" metadata_dirty:%lukB"
 			" writeback:%lukB"
 			" shmem:%lukB"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -4275,6 +4277,7 @@ void show_free_areas(unsigned int filter)
 			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
 			K(node_page_state(pgdat, NR_FILE_MAPPED)),
 			K(node_page_state(pgdat, NR_FILE_DIRTY)),
+			K(node_page_state(pgdat, NR_METADATA_DIRTY)),
 			K(node_page_state(pgdat, NR_WRITEBACK)),
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			K(node_page_state(pgdat, NR_SHMEM_THPS) * HPAGE_PMD_NR),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 374d95d..fb3eb62 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3714,7 +3714,8 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
 
 	/* If we can't clean pages, remove dirty pages from consideration */
 	if (!(node_reclaim_mode & RECLAIM_WRITE))
-		delta += node_page_state(pgdat, NR_FILE_DIRTY);
+		delta += node_page_state(pgdat, NR_FILE_DIRTY) +
+			node_page_state(pgdat, NR_METADATA_DIRTY);
 
 	/* Watch for any possible underflows due to delta */
 	if (unlikely(delta > nr_pagecache_reclaimable))
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/3] writeback: introduce super_operations->write_metadata
  2016-08-22 17:34 [PATCH 0/3][V2] Provide accounting for dirty metadata Josef Bacik
  2016-08-22 17:35 ` [PATCH 1/3] remove mapping from balance_dirty_pages*() Josef Bacik
  2016-08-22 17:35 ` [PATCH 2/3] writeback: allow for dirty metadata accounting Josef Bacik
@ 2016-08-22 17:35 ` Josef Bacik
  2016-09-09  8:29   ` Jan Kara
  2 siblings, 1 reply; 13+ messages in thread
From: Josef Bacik @ 2016-08-22 17:35 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel, kernel-team, jack, viro, dchinner,
	hch, linux-mm

Now that we have metadata counters in the VM, we need to provide a way to kick
writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
allows file systems to deal with writing back any dirty metadata we need based
on the writeback needs of the system.  Since there is no inode to key off of we
need a list in the bdi for dirty super blocks to be added.  From there we can
find any dirty sb's on the bdi we are currently doing writeback on and call into
their ->write_metadata callback.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/fs-writeback.c                | 58 +++++++++++++++++++++++++++++++++++++---
 fs/super.c                       |  7 +++++
 include/linux/backing-dev-defs.h |  2 ++
 include/linux/fs.h               |  4 +++
 mm/backing-dev.c                 |  1 +
 5 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d329f89..b7d8946 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1615,11 +1615,36 @@ static long writeback_sb_inodes(struct super_block *sb,
 	return wrote;
 }
 
+static long writeback_sb_metadata(struct super_block *sb,
+				  struct bdi_writeback *wb,
+				  struct wb_writeback_work *work)
+{
+	struct writeback_control wbc = {
+		.sync_mode		= work->sync_mode,
+		.tagged_writepages	= work->tagged_writepages,
+		.for_kupdate		= work->for_kupdate,
+		.for_background		= work->for_background,
+		.for_sync		= work->for_sync,
+		.range_cyclic		= work->range_cyclic,
+		.range_start		= 0,
+		.range_end		= LLONG_MAX,
+	};
+	long write_chunk;
+
+	write_chunk = writeback_chunk_size(wb, work);
+	wbc.nr_to_write = write_chunk;
+	sb->s_op->write_metadata(sb, &wbc);
+	work->nr_pages -= write_chunk - wbc.nr_to_write;
+
+	return write_chunk - wbc.nr_to_write;
+}
+
 static long __writeback_inodes_wb(struct bdi_writeback *wb,
 				  struct wb_writeback_work *work)
 {
 	unsigned long start_time = jiffies;
 	long wrote = 0;
+	bool done = false;
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1639,11 +1664,38 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 
 		/* refer to the same tests at the end of writeback_sb_inodes */
 		if (wrote) {
-			if (time_is_before_jiffies(start_time + HZ / 10UL))
-				break;
-			if (work->nr_pages <= 0)
+			if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+			    work->nr_pages <= 0) {
+				done = true;
 				break;
+			}
+		}
+	}
+
+	if (!done && wb_stat(wb, WB_METADATA_DIRTY)) {
+		LIST_HEAD(list);
+
+		spin_unlock(&wb->list_lock);
+		spin_lock(&wb->bdi->sb_list_lock);
+		list_splice_init(&wb->bdi->dirty_sb_list, &list);
+		while (!list_empty(&list)) {
+			struct super_block *sb;
+
+			sb = list_first_entry(&list, struct super_block,
+					      s_bdi_list);
+			list_move_tail(&sb->s_bdi_list,
+				       &wb->bdi->dirty_sb_list);
+			if (!sb->s_op->write_metadata)
+				continue;
+			if (!trylock_super(sb))
+				continue;
+			spin_unlock(&wb->bdi->sb_list_lock);
+			wrote += writeback_sb_metadata(sb, wb, work);
+			spin_lock(&wb->bdi->sb_list_lock);
+			up_read(&sb->s_umount);
 		}
+		spin_unlock(&wb->bdi->sb_list_lock);
+		spin_lock(&wb->list_lock);
 	}
 	/* Leave any unwritten inodes on b_io */
 	return wrote;
diff --git a/fs/super.c b/fs/super.c
index c2ff475..c1b1028 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -215,6 +215,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	spin_lock_init(&s->s_inode_list_lock);
 	INIT_LIST_HEAD(&s->s_inodes_wb);
 	spin_lock_init(&s->s_inode_wblist_lock);
+	INIT_LIST_HEAD(&s->s_bdi_list);
 
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
@@ -305,6 +306,8 @@ void deactivate_locked_super(struct super_block *s)
 {
 	struct file_system_type *fs = s->s_type;
 	if (atomic_dec_and_test(&s->s_active)) {
+		struct backing_dev_info *bdi = s->s_bdi;
+
 		cleancache_invalidate_fs(s);
 		unregister_shrinker(&s->s_shrink);
 		fs->kill_sb(s);
@@ -317,6 +320,10 @@ void deactivate_locked_super(struct super_block *s)
 		list_lru_destroy(&s->s_dentry_lru);
 		list_lru_destroy(&s->s_inode_lru);
 
+		spin_lock(&bdi->sb_list_lock);
+		list_del_init(&s->s_bdi_list);
+		spin_unlock(&bdi->sb_list_lock);
+
 		put_filesystem(fs);
 		put_super(s);
 	} else {
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 1200aae..ee6f27f 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -167,6 +167,8 @@ struct backing_dev_info {
 
 	struct timer_list laptop_mode_wb_timer;
 
+	spinlock_t sb_list_lock;
+	struct list_head dirty_sb_list;
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *debug_dir;
 	struct dentry *debug_stats;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f3f0b4c8..c063ac6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1430,6 +1430,8 @@ struct super_block {
 
 	spinlock_t		s_inode_wblist_lock;
 	struct list_head	s_inodes_wb;	/* writeback inodes */
+
+	struct list_head	s_bdi_list;
 };
 
 /* Helper functions so that in most cases filesystems will
@@ -1805,6 +1807,8 @@ struct super_operations {
 				  struct shrink_control *);
 	long (*free_cached_objects)(struct super_block *,
 				    struct shrink_control *);
+	void (*write_metadata)(struct super_block *sb,
+			       struct writeback_control *wbc);
 };
 
 /*
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b48d4e4..80ba94a 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -782,6 +782,7 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->max_prop_frac = FPROP_FRAC_BASE;
 	INIT_LIST_HEAD(&bdi->bdi_list);
 	INIT_LIST_HEAD(&bdi->wb_list);
+	INIT_LIST_HEAD(&bdi->dirty_sb_list);
 	init_waitqueue_head(&bdi->wb_waitq);
 
 	ret = cgwb_bdi_init(bdi);
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] remove mapping from balance_dirty_pages*()
  2016-08-22 17:35 ` [PATCH 1/3] remove mapping from balance_dirty_pages*() Josef Bacik
@ 2016-08-22 18:29   ` kbuild test robot
  0 siblings, 0 replies; 13+ messages in thread
From: kbuild test robot @ 2016-08-22 18:29 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kbuild-all, linux-btrfs, linux-fsdevel, kernel-team, jack, viro,
	dchinner, hch, linux-mm

[-- Attachment #1: Type: text/plain, Size: 3573 bytes --]

Hi Josef,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.8-rc3 next-20160822]
[cannot apply to linux/master]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
[Suggest to use git(>=2.9.0) format-patch --base=<commit> (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:    https://github.com/0day-ci/linux/commits/Josef-Bacik/Provide-accounting-for-dirty-metadata/20160823-014222
config: sparc64-allyesconfig (attached as .config)
compiler: sparc64-linux-gnu-gcc (Debian 5.4.0-6) 5.4.0 20160609
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=sparc64 

All error/warnings (new ones prefixed by >>):

   fs/ntfs/attrib.c: In function 'ntfs_attr_set':
>> fs/ntfs/attrib.c:2549:35: error: implicit declaration of function 'inode_to_bdi' [-Werror=implicit-function-declaration]
      balance_dirty_pages_ratelimited(inode_to_bdi(inode),
                                      ^
>> fs/ntfs/attrib.c:2549:35: warning: passing argument 1 of 'balance_dirty_pages_ratelimited' makes pointer from integer without a cast [-Wint-conversion]
   In file included from include/linux/memcontrol.h:30:0,
                    from include/linux/swap.h:8,
                    from fs/ntfs/attrib.c:26:
   include/linux/writeback.h:367:6: note: expected 'struct backing_dev_info *' but argument is of type 'int'
    void balance_dirty_pages_ratelimited(struct backing_dev_info *bdi,
         ^
   fs/ntfs/attrib.c:2591:35: warning: passing argument 1 of 'balance_dirty_pages_ratelimited' makes pointer from integer without a cast [-Wint-conversion]
      balance_dirty_pages_ratelimited(inode_to_bdi(inode),
                                      ^
   In file included from include/linux/memcontrol.h:30:0,
                    from include/linux/swap.h:8,
                    from fs/ntfs/attrib.c:26:
   include/linux/writeback.h:367:6: note: expected 'struct backing_dev_info *' but argument is of type 'int'
    void balance_dirty_pages_ratelimited(struct backing_dev_info *bdi,
         ^
   fs/ntfs/attrib.c:2609:35: warning: passing argument 1 of 'balance_dirty_pages_ratelimited' makes pointer from integer without a cast [-Wint-conversion]
      balance_dirty_pages_ratelimited(inode_to_bdi(inode),
                                      ^
   In file included from include/linux/memcontrol.h:30:0,
                    from include/linux/swap.h:8,
                    from fs/ntfs/attrib.c:26:
   include/linux/writeback.h:367:6: note: expected 'struct backing_dev_info *' but argument is of type 'int'
    void balance_dirty_pages_ratelimited(struct backing_dev_info *bdi,
         ^
   cc1: some warnings being treated as errors

vim +/inode_to_bdi +2549 fs/ntfs/attrib.c

  2543			kaddr = kmap_atomic(page);
  2544			memset(kaddr + start_ofs, val, size - start_ofs);
  2545			flush_dcache_page(page);
  2546			kunmap_atomic(kaddr);
  2547			set_page_dirty(page);
  2548			put_page(page);
> 2549			balance_dirty_pages_ratelimited(inode_to_bdi(inode),
  2550							inode->i_sb);
  2551			cond_resched();
  2552			if (idx == end)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 47053 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] writeback: allow for dirty metadata accounting
  2016-08-22 17:35 ` [PATCH 2/3] writeback: allow for dirty metadata accounting Josef Bacik
@ 2016-09-09  8:17   ` Jan Kara
  2016-09-12  0:46     ` Dave Chinner
  2016-09-12 14:56     ` Josef Bacik
  0 siblings, 2 replies; 13+ messages in thread
From: Jan Kara @ 2016-09-09  8:17 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, linux-fsdevel, kernel-team, jack, viro, dchinner,
	hch, linux-mm

On Mon 22-08-16 13:35:01, Josef Bacik wrote:
> Provide a mechanism for file systems to indicate how much dirty metadata they
> are holding.  This introduces a few things
> 
> 1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY.
> 2) WB stat for dirty metadata.  This way we know if we need to try and call into
> the file system to write out metadata.  This could potentially be used in the
> future to make balancing of dirty pages smarter.

So I'm curious about one thing: In the previous posting you have mentioned
that the main motivation for this work is to have a simple support for
sub-pagesize dirty metadata blocks that need tracking in btrfs. However you
do the dirty accounting at page granularity. What are your plans to handle
this mismatch?

The thing is you actually shouldn't miscount by too much as that could
upset some checks in mm checking how much dirty pages a node has directing
how reclaim should be done... But it's a question whether NR_METADATA_DIRTY
should be actually used in the checks in node_limits_ok() or in
node_pagecache_reclaimable() at all because once you start accounting dirty
slab objects, you are really on a thin ice...

> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 56c8fda..d329f89 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1809,6 +1809,7 @@ static unsigned long get_nr_dirty_pages(void)
>  {
>  	return global_node_page_state(NR_FILE_DIRTY) +
>  		global_node_page_state(NR_UNSTABLE_NFS) +
> +		global_node_page_state(NR_METADATA_DIRTY) +
>  		get_nr_dirty_inodes();

With my question is also connected this - when we have NR_METADATA_DIRTY,
we could just account dirty inodes there and get rid of this
get_nr_dirty_inodes() hack...

But actually getting this to work right to be able to track dirty inodes would
be useful on its own - some throlling of creation of dirty inodes would be
useful for several filesystems (ext4, xfs, ...).

> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 121a6e3..6a52723 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -506,6 +506,7 @@ bool node_dirty_ok(struct pglist_data *pgdat)
>  	nr_pages += node_page_state(pgdat, NR_FILE_DIRTY);
>  	nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS);
>  	nr_pages += node_page_state(pgdat, NR_WRITEBACK);
> +	nr_pages += node_page_state(pgdat, NR_METADATA_DIRTY);
>  
>  	return nr_pages <= limit;
>  }
> @@ -1595,7 +1596,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
>  		 * been flushed to permanent storage.
>  		 */
>  		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
> -					global_node_page_state(NR_UNSTABLE_NFS);
> +				global_node_page_state(NR_UNSTABLE_NFS) +
> +				global_node_page_state(NR_METADATA_DIRTY);
>  		gdtc->avail = global_dirtyable_memory();
>  		gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);
>  
> @@ -1935,7 +1937,8 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
>  	 */
>  	gdtc->avail = global_dirtyable_memory();
>  	gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) +
> -		      global_node_page_state(NR_UNSTABLE_NFS);
> +		      global_node_page_state(NR_UNSTABLE_NFS) +
> +		      global_node_page_state(NR_METADATA_DIRTY);
>  	domain_dirty_limits(gdtc);
>  
>  	if (gdtc->dirty > gdtc->bg_thresh)
> @@ -2009,7 +2012,8 @@ void laptop_mode_timer_fn(unsigned long data)
>  {
>  	struct request_queue *q = (struct request_queue *)data;
>  	int nr_pages = global_node_page_state(NR_FILE_DIRTY) +
> -		global_node_page_state(NR_UNSTABLE_NFS);
> +		global_node_page_state(NR_UNSTABLE_NFS) +
> +		global_node_page_state(NR_METADATA_DIRTY);
>  	struct bdi_writeback *wb;
>  
>  	/*
> @@ -2473,6 +2477,96 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
>  EXPORT_SYMBOL(account_page_dirtied);
>  
>  /*
> + * account_metadata_dirtied
> + * @page - the page being dirited
> + * @bdi - the bdi that owns this page
> + *
> + * Do the dirty page accounting for metadata pages that aren't backed by an
> + * address_space.
> + */
> +void account_metadata_dirtied(struct page *page, struct backing_dev_info *bdi)
> +{
> +	unsigned long flags;
> +

A bdi_cap_account_dirty() check here and in following functions?

> +	local_irq_save(flags);
> +	__inc_node_page_state(page, NR_METADATA_DIRTY);
> +	__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
> +	__inc_node_page_state(page, NR_DIRTIED);
> +	__inc_wb_stat(&bdi->wb, WB_RECLAIMABLE);
> +	__inc_wb_stat(&bdi->wb, WB_DIRTIED);
> +	__inc_wb_stat(&bdi->wb, WB_METADATA_DIRTY);
> +	current->nr_dirtied++;
> +	task_io_account_write(PAGE_SIZE);
> +	this_cpu_inc(bdp_ratelimits);
> +	local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL(account_metadata_dirtied);
> +
> +/*
> + * account_metadata_cleaned
> + * @page - the page being cleaned
> + * @bdi - the bdi that owns this page
> + *
> + * Called on a no longer dirty metadata page.
> + */
> +void account_metadata_cleaned(struct page *page, struct backing_dev_info *bdi)
> +{
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	__dec_node_page_state(page, NR_METADATA_DIRTY);
> +	__dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
> +	__dec_wb_stat(&bdi->wb, WB_RECLAIMABLE);
> +	__dec_wb_stat(&bdi->wb, WB_METADATA_DIRTY);
> +	task_io_account_cancelled_write(PAGE_SIZE);
> +	local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL(account_metadata_cleaned);
> +
> +/*
> + * account_metadata_writeback
> + * @page - the page being marked as writeback
> + * @bdi - the bdi that owns this page
> + *
> + * Called on a metadata page that has been marked writeback.
> + */
> +void account_metadata_writeback(struct page *page,
> +				struct backing_dev_info *bdi)
> +{
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	__inc_wb_stat(&bdi->wb, WB_WRITEBACK);
> +	__inc_node_page_state(page, NR_WRITEBACK);
> +	__dec_node_page_state(page, NR_METADATA_DIRTY);
> +	__dec_wb_stat(&bdi->wb, WB_METADATA_DIRTY);
> +	__dec_wb_stat(&bdi->wb, WB_RECLAIMABLE);
> +	local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL(account_metadata_writeback);
> +
> +/*
> + * account_metadata_end_writeback
> + * @page - the page we are ending writeback on
> + * @bdi - the bdi that owns this page
> + *
> + * Called on a metadata page that has completed writeback.
> + */
> +void account_metadata_end_writeback(struct page *page,
> +				    struct backing_dev_info *bdi)
> +{
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	__dec_wb_stat(&bdi->wb, WB_WRITEBACK);
> +	__dec_node_page_state(page, NR_WRITEBACK);
> +	__dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
> +	__inc_node_page_state(page, NR_WRITTEN);
> +	local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL(account_metadata_end_writeback);
> +
> +/*
>   * Helper function for deaccounting dirty page without writeback.
>   *
>   * Caller must hold lock_page_memcg().
...
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 374d95d..fb3eb62 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3714,7 +3714,8 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
>  
>  	/* If we can't clean pages, remove dirty pages from consideration */
>  	if (!(node_reclaim_mode & RECLAIM_WRITE))
> -		delta += node_page_state(pgdat, NR_FILE_DIRTY);
> +		delta += node_page_state(pgdat, NR_FILE_DIRTY) +
> +			node_page_state(pgdat, NR_METADATA_DIRTY);
>  
>  	/* Watch for any possible underflows due to delta */
>  	if (unlikely(delta > nr_pagecache_reclaimable))

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] writeback: introduce super_operations->write_metadata
  2016-08-22 17:35 ` [PATCH 3/3] writeback: introduce super_operations->write_metadata Josef Bacik
@ 2016-09-09  8:29   ` Jan Kara
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2016-09-09  8:29 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, linux-fsdevel, kernel-team, jack, viro, dchinner,
	hch, linux-mm

On Mon 22-08-16 13:35:02, Josef Bacik wrote:
> Now that we have metadata counters in the VM, we need to provide a way to kick
> writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
> allows file systems to deal with writing back any dirty metadata we need based
> on the writeback needs of the system.  Since there is no inode to key off of we
> need a list in the bdi for dirty super blocks to be added.  From there we can
> find any dirty sb's on the bdi we are currently doing writeback on and call into
> their ->write_metadata callback.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
...
> @@ -1639,11 +1664,38 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
>  
>  		/* refer to the same tests at the end of writeback_sb_inodes */
>  		if (wrote) {
> -			if (time_is_before_jiffies(start_time + HZ / 10UL))
> -				break;
> -			if (work->nr_pages <= 0)
> +			if (time_is_before_jiffies(start_time + HZ / 10UL) ||
> +			    work->nr_pages <= 0) {
> +				done = true;
>  				break;
> +			}
> +		}
> +	}
> +
> +	if (!done && wb_stat(wb, WB_METADATA_DIRTY)) {
> +		LIST_HEAD(list);
> +
> +		spin_unlock(&wb->list_lock);
> +		spin_lock(&wb->bdi->sb_list_lock);
> +		list_splice_init(&wb->bdi->dirty_sb_list, &list);
> +		while (!list_empty(&list)) {
> +			struct super_block *sb;
> +
> +			sb = list_first_entry(&list, struct super_block,
> +					      s_bdi_list);
> +			list_move_tail(&sb->s_bdi_list,
> +				       &wb->bdi->dirty_sb_list);
> +			if (!sb->s_op->write_metadata)
> +				continue;
> +			if (!trylock_super(sb))
> +				continue;
> +			spin_unlock(&wb->bdi->sb_list_lock);
> +			wrote += writeback_sb_metadata(sb, wb, work);
> +			spin_lock(&wb->bdi->sb_list_lock);
> +			up_read(&sb->s_umount);
>  		}
> +		spin_unlock(&wb->bdi->sb_list_lock);
> +		spin_lock(&wb->list_lock);
>  	}
>  	/* Leave any unwritten inodes on b_io */
>  	return wrote;

So this will hook metadata writeback into the periodic writeback but when
work->sb is set, metadata won't be written because in that case we call
writeback_sb_inodes() directly. So you need to call writeback_sb_metadata()
from wb_writeback() in that case as well. 

...

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f3f0b4c8..c063ac6 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1430,6 +1430,8 @@ struct super_block {
>  
>  	spinlock_t		s_inode_wblist_lock;
>  	struct list_head	s_inodes_wb;	/* writeback inodes */
> +
> +	struct list_head	s_bdi_list;

Maybe call this s_bdi_dirty_list?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] writeback: allow for dirty metadata accounting
  2016-09-09  8:17   ` Jan Kara
@ 2016-09-12  0:46     ` Dave Chinner
  2016-09-12  7:34       ` Jan Kara
  2016-09-12 14:56     ` Josef Bacik
  1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2016-09-12  0:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Josef Bacik, linux-btrfs, linux-fsdevel, kernel-team, jack, viro,
	dchinner, hch, linux-mm

On Fri, Sep 09, 2016 at 10:17:43AM +0200, Jan Kara wrote:
> On Mon 22-08-16 13:35:01, Josef Bacik wrote:
> > Provide a mechanism for file systems to indicate how much dirty metadata they
> > are holding.  This introduces a few things
> > 
> > 1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY.
> > 2) WB stat for dirty metadata.  This way we know if we need to try and call into
> > the file system to write out metadata.  This could potentially be used in the
> > future to make balancing of dirty pages smarter.
> 
> So I'm curious about one thing: In the previous posting you have mentioned
> that the main motivation for this work is to have a simple support for
> sub-pagesize dirty metadata blocks that need tracking in btrfs. However you
> do the dirty accounting at page granularity. What are your plans to handle
> this mismatch?
> 
> The thing is you actually shouldn't miscount by too much as that could
> upset some checks in mm checking how much dirty pages a node has directing
> how reclaim should be done... But it's a question whether NR_METADATA_DIRTY
> should be actually used in the checks in node_limits_ok() or in
> node_pagecache_reclaimable() at all because once you start accounting dirty
> slab objects, you are really on a thin ice...

The other thing I'm concerned about is that it's a btrfs-only thing,
which means having dirty btrfs metadata on a system with different
filesystems (e.g. btrfs root/home, XFS data) is going to affect how
memory balance and throttling is run on other filesystems. i.e. it's
going ot make a filesystem specific issue into a problem that
affects global behaviour.
> 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 56c8fda..d329f89 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -1809,6 +1809,7 @@ static unsigned long get_nr_dirty_pages(void)
> >  {
> >  	return global_node_page_state(NR_FILE_DIRTY) +
> >  		global_node_page_state(NR_UNSTABLE_NFS) +
> > +		global_node_page_state(NR_METADATA_DIRTY) +
> >  		get_nr_dirty_inodes();
> 
> With my question is also connected this - when we have NR_METADATA_DIRTY,
> we could just account dirty inodes there and get rid of this
> get_nr_dirty_inodes() hack...

Accounting of dirty inodes would have to applied to every
filesystem before that could be done, but....

> But actually getting this to work right to be able to track dirty inodes would
> be useful on its own - some throlling of creation of dirty inodes would be
> useful for several filesystems (ext4, xfs, ...).

... this relies on the VFS being able to track and control all
dirtying of inodes and metadata.

Which, it should be noted, cannot be done unconditionally because
some filesystems /explicitly avoid/ dirtying VFS inodes for anything
other than dirty data and provide no mechanism to the VFS for
writeback inodes or their related metadata. e.g. XFS, where all
metadata changes are transactional and so all dirty inode tracking
and writeback control is internal the to the XFS transaction
subsystem.

Adding an external throttle to dirtying of metadata doesn't make any
sense in this sort of architecture - in XFS we already have all the
throttles and expedited writeback triggers integrated into the
transaction subsystem (e.g transaction reservation limits, log space
limits, periodic background writeback, memory reclaim triggers,
etc). It's all so tightly integrated around the physical structure
of the filesystem I can't see any way to sanely abstract it to work
with a generic "dirty list" accounting and writeback engine at this
point...

I can see how tracking of information such as the global amount of
dirty metadata is useful for diagnostics, but I'm not convinced we
should be using it for globally scoped external control of deeply
integrated and highly specific internal filesystem functionality.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] writeback: allow for dirty metadata accounting
  2016-09-12  0:46     ` Dave Chinner
@ 2016-09-12  7:34       ` Jan Kara
  2016-09-12 14:24         ` Josef Bacik
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2016-09-12  7:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Josef Bacik, linux-btrfs, linux-fsdevel, kernel-team,
	jack, viro, dchinner, hch, linux-mm

On Mon 12-09-16 10:46:56, Dave Chinner wrote:
> On Fri, Sep 09, 2016 at 10:17:43AM +0200, Jan Kara wrote:
> > On Mon 22-08-16 13:35:01, Josef Bacik wrote:
> > > Provide a mechanism for file systems to indicate how much dirty metadata they
> > > are holding.  This introduces a few things
> > > 
> > > 1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY.
> > > 2) WB stat for dirty metadata.  This way we know if we need to try and call into
> > > the file system to write out metadata.  This could potentially be used in the
> > > future to make balancing of dirty pages smarter.
> > 
> > So I'm curious about one thing: In the previous posting you have mentioned
> > that the main motivation for this work is to have a simple support for
> > sub-pagesize dirty metadata blocks that need tracking in btrfs. However you
> > do the dirty accounting at page granularity. What are your plans to handle
> > this mismatch?
> > 
> > The thing is you actually shouldn't miscount by too much as that could
> > upset some checks in mm checking how much dirty pages a node has directing
> > how reclaim should be done... But it's a question whether NR_METADATA_DIRTY
> > should be actually used in the checks in node_limits_ok() or in
> > node_pagecache_reclaimable() at all because once you start accounting dirty
> > slab objects, you are really on a thin ice...
> 
> The other thing I'm concerned about is that it's a btrfs-only thing,
> which means having dirty btrfs metadata on a system with different
> filesystems (e.g. btrfs root/home, XFS data) is going to affect how
> memory balance and throttling is run on other filesystems. i.e. it's
> going ot make a filesystem specific issue into a problem that
> affects global behaviour.

Umm, I don't think it will be different than currently. Currently btrfs
dirty metadata is accounted as dirty page cache because they have this
virtual fs inode which owns all metadata pages. It is pretty similar to
e.g. ext2 where you have bdev inode which effectively owns all metadata
pages and these dirty pages account towards the dirty limits. For ext4
things are more complicated due to journaling and thus ext4 hides the fact
that a metadata page is dirty until the corresponding transaction is
committed.  But from that moment on dirty metadata is again just a dirty
pagecache page in the bdev inode.

So current Josef's patch just splits the counter in which btrfs metadata
pages would be accounted but effectively there should be no change in the
behavior. It is just a question whether this approach is workable in the
future when they'd like to track different objects than just pages in the
counter.

> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 56c8fda..d329f89 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -1809,6 +1809,7 @@ static unsigned long get_nr_dirty_pages(void)
> > >  {
> > >  	return global_node_page_state(NR_FILE_DIRTY) +
> > >  		global_node_page_state(NR_UNSTABLE_NFS) +
> > > +		global_node_page_state(NR_METADATA_DIRTY) +
> > >  		get_nr_dirty_inodes();
> > 
> > With my question is also connected this - when we have NR_METADATA_DIRTY,
> > we could just account dirty inodes there and get rid of this
> > get_nr_dirty_inodes() hack...
> 
> Accounting of dirty inodes would have to applied to every
> filesystem before that could be done, but....

Well, this particular hack of adding get_nr_dirty_inodes() into the result
of get_nr_dirty_pages() is there only so that we do writeback even if there
are only dirty inodes without dirty pages. Since XFS doesn't care about
writeback for dirty inodes, it would be fine regardless what we do here,
won't it?

> > But actually getting this to work right to be able to track dirty inodes would
> > be useful on its own - some throlling of creation of dirty inodes would be
> > useful for several filesystems (ext4, xfs, ...).
> 
> ... this relies on the VFS being able to track and control all
> dirtying of inodes and metadata.
> 
> Which, it should be noted, cannot be done unconditionally because
> some filesystems /explicitly avoid/ dirtying VFS inodes for anything
> other than dirty data and provide no mechanism to the VFS for
> writeback inodes or their related metadata. e.g. XFS, where all
> metadata changes are transactional and so all dirty inode tracking
> and writeback control is internal the to the XFS transaction
> subsystem.
> 
> Adding an external throttle to dirtying of metadata doesn't make any
> sense in this sort of architecture - in XFS we already have all the
> throttles and expedited writeback triggers integrated into the
> transaction subsystem (e.g transaction reservation limits, log space
> limits, periodic background writeback, memory reclaim triggers,
> etc). It's all so tightly integrated around the physical structure
> of the filesystem I can't see any way to sanely abstract it to work
> with a generic "dirty list" accounting and writeback engine at this
> point...

OK, thanks for providing the details about XFS. So my idea was (and Josef's
patches seem to be working towards this), that filesystems that decide to
use the generic throttling, would just account the dirty metadata in some
counter. That counter would be included in the limits checked by
balance_dirty_pages(). Filesystem that don't want to use generic throttling
would have the counter 0 and thus for them there'd be no difference. And in
appropriate points after metadata was dirtied, filesystems that care could
call balance_dirty_pages() to throttle the process creating dirty
metadata.

> I can see how tracking of information such as the global amount of
> dirty metadata is useful for diagnostics, but I'm not convinced we
> should be using it for globally scoped external control of deeply
> integrated and highly specific internal filesystem functionality.

You are right that when journalling comes in, things get more complex. But
btrfs already uses a scheme similar to the above and I believe ext4 could
be made to use a similar scheme as well. If you have something better for
XFS, then that's good for you and we should make sure we don't interfere
with that. Currently I don't think we will.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] writeback: allow for dirty metadata accounting
  2016-09-12  7:34       ` Jan Kara
@ 2016-09-12 14:24         ` Josef Bacik
  2016-09-12 23:01           ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Josef Bacik @ 2016-09-12 14:24 UTC (permalink / raw)
  To: Jan Kara, Dave Chinner
  Cc: linux-btrfs, linux-fsdevel, kernel-team, jack, viro, dchinner,
	hch, linux-mm

Dave your reply got eaten somewhere along the way for me, so all i have is this 
email.  I'm going to respond to your stuff here.

On 09/12/2016 03:34 AM, Jan Kara wrote:
> On Mon 12-09-16 10:46:56, Dave Chinner wrote:
>> On Fri, Sep 09, 2016 at 10:17:43AM +0200, Jan Kara wrote:
>>> On Mon 22-08-16 13:35:01, Josef Bacik wrote:
>>>> Provide a mechanism for file systems to indicate how much dirty metadata they
>>>> are holding.  This introduces a few things
>>>>
>>>> 1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY.
>>>> 2) WB stat for dirty metadata.  This way we know if we need to try and call into
>>>> the file system to write out metadata.  This could potentially be used in the
>>>> future to make balancing of dirty pages smarter.
>>>
>>> So I'm curious about one thing: In the previous posting you have mentioned
>>> that the main motivation for this work is to have a simple support for
>>> sub-pagesize dirty metadata blocks that need tracking in btrfs. However you
>>> do the dirty accounting at page granularity. What are your plans to handle
>>> this mismatch?
>>>
>>> The thing is you actually shouldn't miscount by too much as that could
>>> upset some checks in mm checking how much dirty pages a node has directing
>>> how reclaim should be done... But it's a question whether NR_METADATA_DIRTY
>>> should be actually used in the checks in node_limits_ok() or in
>>> node_pagecache_reclaimable() at all because once you start accounting dirty
>>> slab objects, you are really on a thin ice...
>>
>> The other thing I'm concerned about is that it's a btrfs-only thing,
>> which means having dirty btrfs metadata on a system with different
>> filesystems (e.g. btrfs root/home, XFS data) is going to affect how
>> memory balance and throttling is run on other filesystems. i.e. it's
>> going ot make a filesystem specific issue into a problem that
>> affects global behaviour.
>
> Umm, I don't think it will be different than currently. Currently btrfs
> dirty metadata is accounted as dirty page cache because they have this
> virtual fs inode which owns all metadata pages. It is pretty similar to
> e.g. ext2 where you have bdev inode which effectively owns all metadata
> pages and these dirty pages account towards the dirty limits. For ext4
> things are more complicated due to journaling and thus ext4 hides the fact
> that a metadata page is dirty until the corresponding transaction is
> committed.  But from that moment on dirty metadata is again just a dirty
> pagecache page in the bdev inode.
>
> So current Josef's patch just splits the counter in which btrfs metadata
> pages would be accounted but effectively there should be no change in the
> behavior. It is just a question whether this approach is workable in the
> future when they'd like to track different objects than just pages in the
> counter.

+1 to what Jan said.  Btrfs's dirty metadata is always going to affect any other 
file systems in the system, no matter how we deal with it.  In fact it's worse 
with our btree_inode approach as the dirtied_when thing will likely screw 
somebody and make us skip writing out dirty metadata when we want to.  At least 
with this framework in place we can start to make the throttling smarter, so say 
make us flush metadata if that is the bigger % of the dirty pages in the system. 
  All I do now is move the status quo around, we are no worse, and arguably 
better with these patches than we were without them.

>
>>>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>>>> index 56c8fda..d329f89 100644
>>>> --- a/fs/fs-writeback.c
>>>> +++ b/fs/fs-writeback.c
>>>> @@ -1809,6 +1809,7 @@ static unsigned long get_nr_dirty_pages(void)
>>>>  {
>>>>  	return global_node_page_state(NR_FILE_DIRTY) +
>>>>  		global_node_page_state(NR_UNSTABLE_NFS) +
>>>> +		global_node_page_state(NR_METADATA_DIRTY) +
>>>>  		get_nr_dirty_inodes();
>>>
>>> With my question is also connected this - when we have NR_METADATA_DIRTY,
>>> we could just account dirty inodes there and get rid of this
>>> get_nr_dirty_inodes() hack...
>>
>> Accounting of dirty inodes would have to applied to every
>> filesystem before that could be done, but....
>
> Well, this particular hack of adding get_nr_dirty_inodes() into the result
> of get_nr_dirty_pages() is there only so that we do writeback even if there
> are only dirty inodes without dirty pages. Since XFS doesn't care about
> writeback for dirty inodes, it would be fine regardless what we do here,
> won't it?
>
>>> But actually getting this to work right to be able to track dirty inodes would
>>> be useful on its own - some throlling of creation of dirty inodes would be
>>> useful for several filesystems (ext4, xfs, ...).
>>
>> ... this relies on the VFS being able to track and control all
>> dirtying of inodes and metadata.
>>
>> Which, it should be noted, cannot be done unconditionally because
>> some filesystems /explicitly avoid/ dirtying VFS inodes for anything
>> other than dirty data and provide no mechanism to the VFS for
>> writeback inodes or their related metadata. e.g. XFS, where all
>> metadata changes are transactional and so all dirty inode tracking
>> and writeback control is internal the to the XFS transaction
>> subsystem.
>>
>> Adding an external throttle to dirtying of metadata doesn't make any
>> sense in this sort of architecture - in XFS we already have all the
>> throttles and expedited writeback triggers integrated into the
>> transaction subsystem (e.g transaction reservation limits, log space
>> limits, periodic background writeback, memory reclaim triggers,
>> etc). It's all so tightly integrated around the physical structure
>> of the filesystem I can't see any way to sanely abstract it to work
>> with a generic "dirty list" accounting and writeback engine at this
>> point...
>
> OK, thanks for providing the details about XFS. So my idea was (and Josef's
> patches seem to be working towards this), that filesystems that decide to
> use the generic throttling, would just account the dirty metadata in some
> counter. That counter would be included in the limits checked by
> balance_dirty_pages(). Filesystem that don't want to use generic throttling
> would have the counter 0 and thus for them there'd be no difference. And in
> appropriate points after metadata was dirtied, filesystems that care could
> call balance_dirty_pages() to throttle the process creating dirty
> metadata.
>
>> I can see how tracking of information such as the global amount of
>> dirty metadata is useful for diagnostics, but I'm not convinced we
>> should be using it for globally scoped external control of deeply
>> integrated and highly specific internal filesystem functionality.
>
> You are right that when journalling comes in, things get more complex. But
> btrfs already uses a scheme similar to the above and I believe ext4 could
> be made to use a similar scheme as well. If you have something better for
> XFS, then that's good for you and we should make sure we don't interfere
> with that. Currently I don't think we will.

XFS doesn't have the problem that btrfs has, so I don't expect it to take 
advantage of this.  Your writeback throttling is tied to your journal 
transactions, which are already limited in size.  Btrfs on the other hand is 
only limited by the amount of memory in the system, which is why I want to 
leverage the global throttling code.  I went down this path years ago and tried 
to build the throttling intelligence into btrfs itself and that way lies 
dragons.  I cycled between OOM'ing the box and duplicating a bunch of the global 
counters inside btrfs.

Btrfs doesn't have the benefit of a arbitrary journal size constraint on it's 
dirty metadata, so we have to rely on the global limits to make sure we're kept 
in check.  The only thing my patches do is allow us to account for this 
separately and trigger writeback specifically.  Thanks,

Josef

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] writeback: allow for dirty metadata accounting
  2016-09-09  8:17   ` Jan Kara
  2016-09-12  0:46     ` Dave Chinner
@ 2016-09-12 14:56     ` Josef Bacik
  2016-09-12 23:18       ` Dave Chinner
  1 sibling, 1 reply; 13+ messages in thread
From: Josef Bacik @ 2016-09-12 14:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-btrfs, linux-fsdevel, kernel-team, jack, viro, dchinner,
	hch, linux-mm

On 09/09/2016 04:17 AM, Jan Kara wrote:
> On Mon 22-08-16 13:35:01, Josef Bacik wrote:
>> Provide a mechanism for file systems to indicate how much dirty metadata they
>> are holding.  This introduces a few things
>>
>> 1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY.
>> 2) WB stat for dirty metadata.  This way we know if we need to try and call into
>> the file system to write out metadata.  This could potentially be used in the
>> future to make balancing of dirty pages smarter.
>
> So I'm curious about one thing: In the previous posting you have mentioned
> that the main motivation for this work is to have a simple support for
> sub-pagesize dirty metadata blocks that need tracking in btrfs. However you
> do the dirty accounting at page granularity. What are your plans to handle
> this mismatch?

We already track how much dirty metadata we have internally in btrfs, I 
envisioned the subpage blocksize guys just calling the accounting ever N objects 
that were dirited in order to keep the accounting correct.  This is not great, 
but it was better than the hoops we needed to jump through to deal with the 
btree_inode and subpagesize blocksizes.

>
> The thing is you actually shouldn't miscount by too much as that could
> upset some checks in mm checking how much dirty pages a node has directing
> how reclaim should be done... But it's a question whether NR_METADATA_DIRTY
> should be actually used in the checks in node_limits_ok() or in
> node_pagecache_reclaimable() at all because once you start accounting dirty
> slab objects, you are really on a thin ice...

Agreed, this does get a bit ugly.

>
>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> index 56c8fda..d329f89 100644
>> --- a/fs/fs-writeback.c
>> +++ b/fs/fs-writeback.c
>> @@ -1809,6 +1809,7 @@ static unsigned long get_nr_dirty_pages(void)
>>  {
>>  	return global_node_page_state(NR_FILE_DIRTY) +
>>  		global_node_page_state(NR_UNSTABLE_NFS) +
>> +		global_node_page_state(NR_METADATA_DIRTY) +
>>  		get_nr_dirty_inodes();
>
> With my question is also connected this - when we have NR_METADATA_DIRTY,
> we could just account dirty inodes there and get rid of this
> get_nr_dirty_inodes() hack...
>
> But actually getting this to work right to be able to track dirty inodes would
> be useful on its own - some throlling of creation of dirty inodes would be
> useful for several filesystems (ext4, xfs, ...).

So I suppose what I could do is instead provide a callback for the vm to ask how 
many dirty objects we have in the file system, instead of adding another page 
counter.  That way the actual accounting is kept internal to the file system, 
and it gets rid of the weird mismatch when blocksize < pagesize.  Does that 
sound like a more acceptable approach?  Unfortunately I decided to do this work 
to make the blocksize < pagesize work easier, but then didn't actually think 
about how the accounting would interact with that case, because I'm an idiot.

I think that looping through all the sb's in the system would be kinda shitty 
for this tho, we want the "get number of dirty pages" part to be relatively 
fast.  What if I do something like the shrinker_control only for dirty objects. 
So the fs registers some dirty_objects_control, we call into each of those and 
get the counts from that.  Does that sound less crappy?  Thanks,

Josef

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] writeback: allow for dirty metadata accounting
  2016-09-12 14:24         ` Josef Bacik
@ 2016-09-12 23:01           ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2016-09-12 23:01 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Jan Kara, linux-btrfs, linux-fsdevel, kernel-team, jack, viro,
	dchinner, hch, linux-mm

On Mon, Sep 12, 2016 at 10:24:12AM -0400, Josef Bacik wrote:
> Dave your reply got eaten somewhere along the way for me, so all i
> have is this email.  I'm going to respond to your stuff here.

No worries, I'll do a 2-in-1 reply :P

> On 09/12/2016 03:34 AM, Jan Kara wrote:
> >On Mon 12-09-16 10:46:56, Dave Chinner wrote:
> >>On Fri, Sep 09, 2016 at 10:17:43AM +0200, Jan Kara wrote:
> >>>On Mon 22-08-16 13:35:01, Josef Bacik wrote:
> >>>>Provide a mechanism for file systems to indicate how much dirty metadata they
> >>>>are holding.  This introduces a few things
> >>>>
> >>>>1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY.
> >>>>2) WB stat for dirty metadata.  This way we know if we need to try and call into
> >>>>the file system to write out metadata.  This could potentially be used in the
> >>>>future to make balancing of dirty pages smarter.
> >>>
> >>>So I'm curious about one thing: In the previous posting you have mentioned
> >>>that the main motivation for this work is to have a simple support for
> >>>sub-pagesize dirty metadata blocks that need tracking in btrfs. However you
> >>>do the dirty accounting at page granularity. What are your plans to handle
> >>>this mismatch?
> >>>
> >>>The thing is you actually shouldn't miscount by too much as that could
> >>>upset some checks in mm checking how much dirty pages a node has directing
> >>>how reclaim should be done... But it's a question whether NR_METADATA_DIRTY
> >>>should be actually used in the checks in node_limits_ok() or in
> >>>node_pagecache_reclaimable() at all because once you start accounting dirty
> >>>slab objects, you are really on a thin ice...
> >>
> >>The other thing I'm concerned about is that it's a btrfs-only thing,
> >>which means having dirty btrfs metadata on a system with different
> >>filesystems (e.g. btrfs root/home, XFS data) is going to affect how
> >>memory balance and throttling is run on other filesystems. i.e. it's
> >>going ot make a filesystem specific issue into a problem that
> >>affects global behaviour.
> >
> >Umm, I don't think it will be different than currently. Currently btrfs
> >dirty metadata is accounted as dirty page cache because they have this
> >virtual fs inode which owns all metadata pages.

Yes, so effectively they are treated the same as file data pages
w.r.t. throttling, writeback and reclaim....

> >It is pretty similar to
> >e.g. ext2 where you have bdev inode which effectively owns all metadata
> >pages and these dirty pages account towards the dirty limits. For ext4
> >things are more complicated due to journaling and thus ext4 hides the fact
> >that a metadata page is dirty until the corresponding transaction is
> >committed.  But from that moment on dirty metadata is again just a dirty
> >pagecache page in the bdev inode.

Yeah, though those filesystems don't suffer from the uncontrolled
explosion of metadata that btrfs is suffering from, so simply
treating them as another dirty inode that needs flushing works just
fine.

> >So current Josef's patch just splits the counter in which btrfs metadata
> >pages would be accounted but effectively there should be no change in the
> >behavior.

Yup, I missed the addition to the node_pagecache_reclaimable() that
ensures reclaim sees the same number or dirty pages...

> >It is just a question whether this approach is workable in the
> >future when they'd like to track different objects than just pages in the
> >counter.

I don't think it can. Because the counters directly influences the
page lru reclaim scanning algorithms, it can only be used to
account for pages that are in the LRUs. Other objects like slab
objects need to be accounted for and reclaimed by the shrinker
infrastructure.

Accounting for metadata writeback is a different issue - it could
track slab objects if we wanted to, but the issue is that these are
often difficult to determine the amount of IO needed to clean them
so generic balancing is hard. (e.g. effect of inode write
clustering).

> +1 to what Jan said.  Btrfs's dirty metadata is always going to
> affect any other file systems in the system, no matter how we deal
> with it.  In fact it's worse with our btree_inode approach as the
> dirtied_when thing will likely screw somebody and make us skip
> writing out dirty metadata when we want to.

XFS takes care of metadata flushing with a periodic background work
controlled by /proc/sys/fs/xfs/xfssyncd_centisecs. We trigger both
background async inode reclaim and background dirty metadata
flushing from this (run on workqueues) if the system is idle or
hasn't had some other trigger fire to run these sooner.  It works
well enough that I can't remember the last time someone asked a
question about needing to tune this parameter, or had a problem that
required tuning it to fix....

> At least with this
> framework in place we can start to make the throttling smarter, so
> say make us flush metadata if that is the bigger % of the dirty
> pages in the system.  All I do now is move the status quo around, we
> are no worse, and arguably better with these patches than we were
> without them.

Agreed - I misread part of the patch, so I was a little off-target.

> >OK, thanks for providing the details about XFS. So my idea was (and Josef's
> >patches seem to be working towards this), that filesystems that decide to
> >use the generic throttling, would just account the dirty metadata in some
> >counter. That counter would be included in the limits checked by
> >balance_dirty_pages(). Filesystem that don't want to use generic throttling
> >would have the counter 0 and thus for them there'd be no difference. And in
> >appropriate points after metadata was dirtied, filesystems that care could
> >call balance_dirty_pages() to throttle the process creating dirty
> >metadata.
> >
> >>I can see how tracking of information such as the global amount of
> >>dirty metadata is useful for diagnostics, but I'm not convinced we
> >>should be using it for globally scoped external control of deeply
> >>integrated and highly specific internal filesystem functionality.
> >
> >You are right that when journalling comes in, things get more complex. But
> >btrfs already uses a scheme similar to the above and I believe ext4 could
> >be made to use a similar scheme as well. If you have something better for
> >XFS, then that's good for you and we should make sure we don't interfere
> >with that. Currently I don't think we will.
> 
> XFS doesn't have the problem that btrfs has, so I don't expect it to
> take advantage of this.  Your writeback throttling is tied to your
> journal transactions, which are already limited in size.  Btrfs on
> the other hand is only limited by the amount of memory in the
> system, which is why I want to leverage the global throttling code.
> Btrfs doesn't have the benefit of a arbitrary journal size
> constraint on it's dirty metadata, so we have to rely on the global
> limits to make sure we're kept in check.

Though you do have a transaction reservation system, so I would
expect that be the place where a dirty metadata throttling could be
applied sanely. i.e. if there's too much dirty metadata in the
system, then you can't get a reservation to dirty more until some
of the metadata has been cleaned.

> The only thing my patches
> do is allow us to account for this separately and trigger writeback
> specifically.  Thanks,

I'm not sure that global metadata accounting makes sense if there is
no mechanism provided for per-sb throttling based purely on metadata
demand. Sure, global dirty metadata might be a way of measuring how
much metadata all the active filesystems have created, but then
you're going to throttle filesystems with no dirty metadata because
some other filesystem is being a massive hog.

I suspect that for a well balanced system we are going to need some
kind of "in-between" solution that shares the dirty metadata limits
across all bdis somewhat fairly similar to file data. How that ties
into the bdi writeback rate estimates and other IO/dirty page
throttling feedback loops is an interesting question - to me this
isn't as obviously simple as it simply separating out metadata
accounting...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] writeback: allow for dirty metadata accounting
  2016-09-12 14:56     ` Josef Bacik
@ 2016-09-12 23:18       ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2016-09-12 23:18 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Jan Kara, linux-btrfs, linux-fsdevel, kernel-team, jack, viro,
	dchinner, hch, linux-mm

On Mon, Sep 12, 2016 at 10:56:04AM -0400, Josef Bacik wrote:
> I think that looping through all the sb's in the system would be
> kinda shitty for this tho, we want the "get number of dirty pages"
> part to be relatively fast.  What if I do something like the
> shrinker_control only for dirty objects. So the fs registers some
> dirty_objects_control, we call into each of those and get the counts
> from that.  Does that sound less crappy?  Thanks,

Hmmm - just an off-the-wall thought on this....

If you're going to do that, then why wouldn't you simply use a
"shrinker" to do the metadata writeback rather than having a hook to
count dirty objects to pass to some other writeback code that calls
a hook to write the metadata?

That way filesystems can also implement dirty accounting and
"writers" for each cache of objects they currently implement
shrinkers for. i.e. just expanding shrinkers to be able to "count
dirty objects" and "write dirty objects" so that we can tell
filesystems to write back all their different metadata caches
proportionally to the size of the page cache and it's dirty state.
The existing file data and inode writeback could then just be new
generic "superblock shrinker" operations, and the fs could have it's
own private metadata writeback similar to the private sb shrinker
callout we currently have...

And, in doing so, we might be able to completely hide memcg from the
writeback implementations similar to the way memcg is completely
hidden from the shrinker reclaim implementations...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2016-09-12 23:18 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-22 17:34 [PATCH 0/3][V2] Provide accounting for dirty metadata Josef Bacik
2016-08-22 17:35 ` [PATCH 1/3] remove mapping from balance_dirty_pages*() Josef Bacik
2016-08-22 18:29   ` kbuild test robot
2016-08-22 17:35 ` [PATCH 2/3] writeback: allow for dirty metadata accounting Josef Bacik
2016-09-09  8:17   ` Jan Kara
2016-09-12  0:46     ` Dave Chinner
2016-09-12  7:34       ` Jan Kara
2016-09-12 14:24         ` Josef Bacik
2016-09-12 23:01           ` Dave Chinner
2016-09-12 14:56     ` Josef Bacik
2016-09-12 23:18       ` Dave Chinner
2016-08-22 17:35 ` [PATCH 3/3] writeback: introduce super_operations->write_metadata Josef Bacik
2016-09-09  8:29   ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).