All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5][RESEND] Support for metadata specific accounting
@ 2016-10-25 18:41 Josef Bacik
  2016-10-25 18:41 ` [PATCH 1/5] remove mapping from balance_dirty_pages*() Josef Bacik
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Josef Bacik @ 2016-10-25 18:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

(Sending again as 5/5 got eaten and I used the wrong email address for Dave.)

(Dave again I apologize, for some reason our email server hates you and so I
didn't get your previous responses again, and didn't notice until I was looking
at the patchwork history for my previous submissions, so I'll try and answer all
your questions here, and then I'll be much more judicious about making sure I'm
getting your emails.)

This is my current batch of patches to add some better infrastructure for
handling metadata related memory.  There's been a few changes to these patches
since the last go around but the approach is essentially the same.

Changes since last go around:
-WB_WRITTEN/WB_DIRTIED are now being counted in bytes.  This is necessary to
deal with metadata that is smaller than page size.  Currently btrfs only does
things in pagesize increments, but with these patches it will allow us to easily
support sub-pagesize blocksizes so we need these counters to be in bytes.
-Added a METADATA_BYTES counter to account for the total number of bytes used
for metadata.  We need this because the VM adjusts the dirty ratios based on how
much memory it thinks we can dirty, which is just the amount of free memory
plus the pagecache.
-Patch 5, which is a doozy, and I will explain below.

This is a more complete approach to keeping track of memory that is in use for
metadata.  Previously Btrfs had been using a special inode to allocate all of
our metadata pages out of, so if we had a heavy metadata workload we still
benefited from the balance_dirty_pages() throttling which kept everything sane.
We want to kill this in order to make it easier to support pagesize > blocksize
support in btrfs, in much the same way XFS does this support.

One concern that was brought up was the global accounting vs one bad actor.  We
still need the global accounting so we can enforce the global dirty limits, but
we have the per-bdi accounting as well to make sure we are punishing the bad
actor and not all the other file systems that happened to be hooked into this
infrastructure.

The *REFERENCED patch is to fix a behavior I noticed when testing these patches
on a more memory-constrained system than I had previously used.  Now that the
eviction of metadata pages is controlled soley through the shrinker interface I
was seeing a big perf drop when we had to start doing memory reclaim.  This was
because most of the RAM on the box (10gig of 16gig) was used soley by icache and
dcache.  Because this workload was just create a file and never touch it again
most of this cache was easily evictable.  However because we by default send
every object in the icache/dcache through the LRU twice, it would take around
10,000 iterations through the shrinker before it would start to evict things
(there was ~10 million objects between the two caches).  The pagecache solves
this by putting everything on the INACTIVE list first, and then theres a two
part activation phase before it's moved to the active list, so you have to
really touch a page a lot before it is considered active.  However we assume
active first, which is very latency inducing when we want to start reclaiming
objects.  Making this change keeps us in line with what pagecache does and
eliminates the performance drop I was seeing.

Let me know what you think.  I've tried to keep this as minimal and sane as
possible so we can build on it in the future as we want to handle more nuanced
scenarios.  Also if we think this is a bad approach I won't feel as bad throwing
it out ;).  Thanks,

Josef

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/5] remove mapping from balance_dirty_pages*()
  2016-10-25 18:41 [PATCH 0/5][RESEND] Support for metadata specific accounting Josef Bacik
@ 2016-10-25 18:41 ` Josef Bacik
  2016-10-25 18:47   ` Tejun Heo
  2016-10-25 18:41 ` [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes Josef Bacik
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2016-10-25 18:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

The only reason we pass in the mapping is to get the inode in order to see if
writeback cgroups is enabled, and even then it only checks the bdi and a super
block flag.  balance_dirty_pages() doesn't even use the mapping.  Since
balance_dirty_pages*() works on a bdi level, just pass in the bdi and super
block directly so we can avoid using mapping.  This will allow us to still use
balance_dirty_pages for dirty metadata pages that are not backed by an
address_mapping.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 drivers/mtd/devices/block2mtd.c | 12 ++++++++----
 fs/btrfs/disk-io.c              |  4 ++--
 fs/btrfs/file.c                 |  3 ++-
 fs/btrfs/ioctl.c                |  3 ++-
 fs/btrfs/relocation.c           |  3 ++-
 fs/buffer.c                     |  3 ++-
 fs/iomap.c                      |  3 ++-
 fs/ntfs/attrib.c                | 10 +++++++---
 fs/ntfs/file.c                  |  4 ++--
 include/linux/backing-dev.h     | 29 +++++++++++++++++++++++------
 include/linux/writeback.h       |  3 ++-
 mm/filemap.c                    |  4 +++-
 mm/memory.c                     |  9 +++++++--
 mm/page-writeback.c             | 15 +++++++--------
 14 files changed, 71 insertions(+), 34 deletions(-)

diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c
index 7c887f1..7892d0b 100644
--- a/drivers/mtd/devices/block2mtd.c
+++ b/drivers/mtd/devices/block2mtd.c
@@ -52,7 +52,8 @@ static struct page *page_read(struct address_space *mapping, int index)
 /* erase a specified part of the device */
 static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len)
 {
-	struct address_space *mapping = dev->blkdev->bd_inode->i_mapping;
+	struct inode *inode = dev->blkdev->bd_inode;
+	struct address_space *mapping = inode->i_mapping;
 	struct page *page;
 	int index = to >> PAGE_SHIFT;	// page index
 	int pages = len >> PAGE_SHIFT;
@@ -71,7 +72,8 @@ static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len)
 				memset(page_address(page), 0xff, PAGE_SIZE);
 				set_page_dirty(page);
 				unlock_page(page);
-				balance_dirty_pages_ratelimited(mapping);
+				balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+								inode->i_sb);
 				break;
 			}
 
@@ -141,7 +143,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf,
 		loff_t to, size_t len, size_t *retlen)
 {
 	struct page *page;
-	struct address_space *mapping = dev->blkdev->bd_inode->i_mapping;
+	struct inode *inode = dev->blkdev->bd_inode;
+	struct address_space *mapping = inode->i_mapping;
 	int index = to >> PAGE_SHIFT;	// page index
 	int offset = to & ~PAGE_MASK;	// page offset
 	int cpylen;
@@ -162,7 +165,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf,
 			memcpy(page_address(page) + offset, buf, cpylen);
 			set_page_dirty(page);
 			unlock_page(page);
-			balance_dirty_pages_ratelimited(mapping);
+			balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+							inode->i_sb);
 		}
 		put_page(page);
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e720d3e..fa07e81 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4085,8 +4085,8 @@ static void __btrfs_btree_balance_dirty(struct btrfs_root *root,
 	ret = percpu_counter_compare(&root->fs_info->dirty_metadata_bytes,
 				     BTRFS_DIRTY_METADATA_THRESH);
 	if (ret > 0) {
-		balance_dirty_pages_ratelimited(
-				   root->fs_info->btree_inode->i_mapping);
+		balance_dirty_pages_ratelimited(&root->fs_info->bdi,
+						root->fs_info->sb);
 	}
 }
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 72a180d..cbefdc8 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1711,7 +1711,8 @@ again:
 
 		cond_resched();
 
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		if (dirty_pages < (root->nodesize >> PAGE_SHIFT) + 1)
 			btrfs_btree_balance_dirty(root);
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index af69129..f68692a 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1410,7 +1410,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 		}
 
 		defrag_count += ret;
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		inode_unlock(inode);
 
 		if (newer_than) {
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 0ec8ffa..de4f221 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3228,7 +3228,8 @@ static int relocate_file_extent_cluster(struct inode *inode,
 		put_page(page);
 
 		index++;
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		btrfs_throttle(BTRFS_I(inode)->root);
 	}
 	WARN_ON(nr != cluster->nr);
diff --git a/fs/buffer.c b/fs/buffer.c
index 9c8eb9b..9bbe30d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2386,7 +2386,8 @@ static int cont_expand_zero(struct file *file, struct address_space *mapping,
 		BUG_ON(err != len);
 		err = 0;
 
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 
 		if (unlikely(fatal_signal_pending(current))) {
 			err = -EINTR;
diff --git a/fs/iomap.c b/fs/iomap.c
index 706270f..9025c43 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -226,7 +226,8 @@ again:
 		written += copied;
 		length -= copied;
 
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 	} while (iov_iter_count(i) && length);
 
 	return written ? written : status;
diff --git a/fs/ntfs/attrib.c b/fs/ntfs/attrib.c
index 44a39a0..0a8a39e 100644
--- a/fs/ntfs/attrib.c
+++ b/fs/ntfs/attrib.c
@@ -2493,6 +2493,7 @@ conv_err_out:
 int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 {
 	ntfs_volume *vol = ni->vol;
+	struct inode *inode = VFS_I(ni);
 	struct address_space *mapping;
 	struct page *page;
 	u8 *kaddr;
@@ -2545,7 +2546,8 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		kunmap_atomic(kaddr);
 		set_page_dirty(page);
 		put_page(page);
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		cond_resched();
 		if (idx == end)
 			goto done;
@@ -2586,7 +2588,8 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		/* Finally unlock and release the page. */
 		unlock_page(page);
 		put_page(page);
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		cond_resched();
 	}
 	/* If there is a last partial page, need to do it the slow way. */
@@ -2603,7 +2606,8 @@ int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const u8 val)
 		kunmap_atomic(kaddr);
 		set_page_dirty(page);
 		put_page(page);
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 		cond_resched();
 	}
 done:
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index f548629..66082eb 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -276,7 +276,7 @@ do_non_resident_extend:
 		 * number of pages we read and make dirty in the case of sparse
 		 * files.
 		 */
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(vi), vi->i_sb);
 		cond_resched();
 	} while (++index < end_index);
 	read_lock_irqsave(&ni->size_lock, flags);
@@ -1914,7 +1914,7 @@ again:
 		iov_iter_advance(i, copied);
 		pos += copied;
 		written += copied;
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(vi), vi->i_sb);
 		if (fatal_signal_pending(current)) {
 			status = -EINTR;
 			break;
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 43b93a9..9eb2cf2 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -253,8 +253,9 @@ void wb_blkcg_offline(struct blkcg *blkcg);
 int inode_congested(struct inode *inode, int cong_bits);
 
 /**
- * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
- * @inode: inode of interest
+ * bdi_cgwb_enabled - test wether cgroup writeback is enabled on a filesystem
+ * @bdi: the bdi we care about
+ * @sb: the super for the bdi
  *
  * cgroup writeback requires support from both the bdi and filesystem.
  * Also, both memcg and iocg have to be on the default hierarchy.  Test
@@ -263,15 +264,25 @@ int inode_congested(struct inode *inode, int cong_bits);
  * Note that the test result may change dynamically on the same inode
  * depending on how memcg and iocg are configured.
  */
-static inline bool inode_cgwb_enabled(struct inode *inode)
+static inline bool bdi_cgwb_enabled(struct backing_dev_info *bdi,
+				    struct super_block *sb)
 {
-	struct backing_dev_info *bdi = inode_to_bdi(inode);
-
 	return cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
 		cgroup_subsys_on_dfl(io_cgrp_subsys) &&
 		bdi_cap_account_dirty(bdi) &&
 		(bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
-		(inode->i_sb->s_iflags & SB_I_CGROUPWB);
+		(sb->s_iflags & SB_I_CGROUPWB);
+}
+
+/**
+ * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
+ * @inode: inode of interest
+ *
+ * Does the inode have cgroup writeback support.
+ */
+static inline bool inode_cgwb_enabled(struct inode *inode)
+{
+	return bdi_cgwb_enabled(inode_to_bdi(inode), inode->i_sb);
 }
 
 /**
@@ -414,6 +425,12 @@ static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
+static inline bool bdi_cgwb_enabled(struct backing_dev_info *bdi,
+				    struct super_block *sb)
+{
+	return false;
+}
+
 static inline bool inode_cgwb_enabled(struct inode *inode)
 {
 	return false;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index fc1e16c..256ffc3 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -364,7 +364,8 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
 
 void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time);
 void page_writeback_init(void);
-void balance_dirty_pages_ratelimited(struct address_space *mapping);
+void balance_dirty_pages_ratelimited(struct backing_dev_info *bdi,
+				     struct super_block *sb);
 bool wb_over_bg_thresh(struct bdi_writeback *wb);
 
 typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287df..03fa60a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2667,6 +2667,7 @@ ssize_t generic_perform_write(struct file *file,
 				struct iov_iter *i, loff_t pos)
 {
 	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	long status = 0;
 	ssize_t written = 0;
@@ -2746,7 +2747,8 @@ again:
 		pos += copied;
 		written += copied;
 
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 	} while (iov_iter_count(i));
 
 	return written ? written : status;
diff --git a/mm/memory.c b/mm/memory.c
index 793fe0f..b47a73f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -64,6 +64,7 @@
 #include <linux/debugfs.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/dax.h>
+#include <linux/backing-dev.h>
 
 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -2105,11 +2106,13 @@ static inline int wp_page_reuse(struct fault_env *fe, pte_t orig_pte,
 		put_page(page);
 
 		if ((dirtied || page_mkwrite) && mapping) {
+			struct inode *inode = mapping->host;
 			/*
 			 * Some device drivers do not set page.mapping
 			 * but still dirty their pages
 			 */
-			balance_dirty_pages_ratelimited(mapping);
+			balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+							inode->i_sb);
 		}
 
 		if (!page_mkwrite)
@@ -3291,11 +3294,13 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff)
 	mapping = page_rmapping(fault_page);
 	unlock_page(fault_page);
 	if ((dirtied || vma->vm_ops->page_mkwrite) && mapping) {
+		struct inode *inode = mapping->host;
 		/*
 		 * Some device drivers do not set page.mapping but still
 		 * dirty their pages
 		 */
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+						inode->i_sb);
 	}
 
 	if (!vma->vm_ops->page_mkwrite)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f4cd7d8..121a6e3 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1559,8 +1559,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
  * If we're over `background_thresh' then the writeback threads are woken to
  * perform some writeout.
  */
-static void balance_dirty_pages(struct address_space *mapping,
-				struct bdi_writeback *wb,
+static void balance_dirty_pages(struct bdi_writeback *wb,
 				unsigned long pages_dirtied)
 {
 	struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
@@ -1849,7 +1848,8 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
 
 /**
  * balance_dirty_pages_ratelimited - balance dirty memory state
- * @mapping: address_space which was dirtied
+ * @bdi: the bdi that was dirtied
+ * @sb: the super block that was dirtied
  *
  * Processes which are dirtying memory should call in here once for each page
  * which was newly dirtied.  The function will periodically check the system's
@@ -1860,10 +1860,9 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
  * limit we decrease the ratelimiting by a lot, to prevent individual processes
  * from overshooting the limit by (ratelimit_pages) each.
  */
-void balance_dirty_pages_ratelimited(struct address_space *mapping)
+void balance_dirty_pages_ratelimited(struct backing_dev_info *bdi,
+				     struct super_block *sb)
 {
-	struct inode *inode = mapping->host;
-	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	struct bdi_writeback *wb = NULL;
 	int ratelimit;
 	int *p;
@@ -1871,7 +1870,7 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
 	if (!bdi_cap_account_dirty(bdi))
 		return;
 
-	if (inode_cgwb_enabled(inode))
+	if (bdi_cgwb_enabled(bdi, sb))
 		wb = wb_get_create_current(bdi, GFP_KERNEL);
 	if (!wb)
 		wb = &bdi->wb;
@@ -1909,7 +1908,7 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
 	preempt_enable();
 
 	if (unlikely(current->nr_dirtied >= ratelimit))
-		balance_dirty_pages(mapping, wb, current->nr_dirtied);
+		balance_dirty_pages(wb, current->nr_dirtied);
 
 	wb_put(wb);
 }
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes
  2016-10-25 18:41 [PATCH 0/5][RESEND] Support for metadata specific accounting Josef Bacik
  2016-10-25 18:41 ` [PATCH 1/5] remove mapping from balance_dirty_pages*() Josef Bacik
@ 2016-10-25 18:41 ` Josef Bacik
  2016-10-25 19:03   ` Tejun Heo
  2016-10-30 15:13   ` Jan Kara
  2016-10-25 18:41 ` [PATCH 3/5] writeback: add counters for metadata usage Josef Bacik
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 26+ messages in thread
From: Josef Bacik @ 2016-10-25 18:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

These are counters that constantly go up in order to do bandwidth calculations.
It isn't important what the units are in, as long as they are consistent between
the two of them, so convert them to count bytes written/dirtied, and allow the
metadata accounting stuff to change the counters as well.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/fuse/file.c                   |  4 ++--
 include/linux/backing-dev-defs.h |  4 ++--
 include/linux/backing-dev.h      |  2 +-
 mm/backing-dev.c                 |  9 +++++----
 mm/page-writeback.c              | 26 +++++++++++++++-----------
 5 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3988b43..81eee7e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1467,7 +1467,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 	for (i = 0; i < req->num_pages; i++) {
 		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
 		dec_node_page_state(req->pages[i], NR_WRITEBACK_TEMP);
-		wb_writeout_inc(&bdi->wb);
+		wb_writeout_add(&bdi->wb, PAGE_SIZE);
 	}
 	wake_up(&fi->page_waitq);
 }
@@ -1771,7 +1771,7 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req,
 
 		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
 		dec_node_page_state(page, NR_WRITEBACK_TEMP);
-		wb_writeout_inc(&bdi->wb);
+		wb_writeout_add(&bdi->wb, PAGE_SIZE);
 		fuse_writepage_free(fc, new_req);
 		fuse_request_free(new_req);
 		goto out;
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index c357f27..71ea5a6 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -34,8 +34,8 @@ typedef int (congested_fn)(void *, int);
 enum wb_stat_item {
 	WB_RECLAIMABLE,
 	WB_WRITEBACK,
-	WB_DIRTIED,
-	WB_WRITTEN,
+	WB_DIRTIED_BYTES,
+	WB_WRITTEN_BYTES,
 	NR_WB_STAT_ITEMS
 };
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 9eb2cf2..edddcb8 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -114,7 +114,7 @@ static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item item)
 	return sum;
 }
 
-extern void wb_writeout_inc(struct bdi_writeback *wb);
+extern void wb_writeout_add(struct bdi_writeback *wb, long bytes);
 
 /*
  * maximal error of a stat counter.
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8fde443..433db42 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -70,14 +70,15 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 	wb_thresh = wb_calc_thresh(wb, dirty_thresh);
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
+#define BtoK(x) ((x) >> 10)
 	seq_printf(m,
 		   "BdiWriteback:       %10lu kB\n"
 		   "BdiReclaimable:     %10lu kB\n"
 		   "BdiDirtyThresh:     %10lu kB\n"
 		   "DirtyThresh:        %10lu kB\n"
 		   "BackgroundThresh:   %10lu kB\n"
-		   "BdiDirtied:         %10lu kB\n"
-		   "BdiWritten:         %10lu kB\n"
+		   "BdiDirtiedBytes:    %10lu kB\n"
+		   "BdiWrittenBytes:    %10lu kB\n"
 		   "BdiWriteBandwidth:  %10lu kBps\n"
 		   "b_dirty:            %10lu\n"
 		   "b_io:               %10lu\n"
@@ -90,8 +91,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   K(wb_thresh),
 		   K(dirty_thresh),
 		   K(background_thresh),
-		   (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
-		   (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
+		   (unsigned long) BtoK(wb_stat(wb, WB_DIRTIED_BYTES)),
+		   (unsigned long) BtoK(wb_stat(wb, WB_WRITTEN_BYTES)),
 		   (unsigned long) K(wb->write_bandwidth),
 		   nr_dirty,
 		   nr_io,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 121a6e3..e09b3ad 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -596,11 +596,11 @@ static void wb_domain_writeout_inc(struct wb_domain *dom,
  * Increment @wb's writeout completion count and the global writeout
  * completion count. Called from test_clear_page_writeback().
  */
-static inline void __wb_writeout_inc(struct bdi_writeback *wb)
+static inline void __wb_writeout_inc(struct bdi_writeback *wb, long bytes)
 {
 	struct wb_domain *cgdom;
 
-	__inc_wb_stat(wb, WB_WRITTEN);
+	__add_wb_stat(wb, WB_WRITTEN_BYTES, bytes);
 	wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
 			       wb->bdi->max_prop_frac);
 
@@ -610,15 +610,15 @@ static inline void __wb_writeout_inc(struct bdi_writeback *wb)
 				       wb->bdi->max_prop_frac);
 }
 
-void wb_writeout_inc(struct bdi_writeback *wb)
+void wb_writeout_add(struct bdi_writeback *wb, long bytes)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
-	__wb_writeout_inc(wb);
+	__wb_writeout_inc(wb, bytes);
 	local_irq_restore(flags);
 }
-EXPORT_SYMBOL_GPL(wb_writeout_inc);
+EXPORT_SYMBOL_GPL(wb_writeout_add);
 
 /*
  * On idle system, we can be called long after we scheduled because we use
@@ -1362,8 +1362,8 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
 
-	dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED]);
-	written = percpu_counter_read(&wb->stat[WB_WRITTEN]);
+	dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED_BYTES]) >> PAGE_SHIFT;
+	written = percpu_counter_read(&wb->stat[WB_WRITTEN_BYTES]) >> PAGE_SHIFT;
 
 	/*
 	 * Skip quiet periods when disk bandwidth is under-utilized.
@@ -2464,7 +2464,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 		__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		__inc_node_page_state(page, NR_DIRTIED);
 		__inc_wb_stat(wb, WB_RECLAIMABLE);
-		__inc_wb_stat(wb, WB_DIRTIED);
+		__add_wb_stat(wb, WB_DIRTIED_BYTES, PAGE_SIZE);
 		task_io_account_write(PAGE_SIZE);
 		current->nr_dirtied++;
 		this_cpu_inc(bdp_ratelimits);
@@ -2547,12 +2547,16 @@ void account_page_redirty(struct page *page)
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
+		unsigned long flags;
 		bool locked;
 
 		wb = unlocked_inode_to_wb_begin(inode, &locked);
 		current->nr_dirtied--;
-		dec_node_page_state(page, NR_DIRTIED);
-		dec_wb_stat(wb, WB_DIRTIED);
+
+		local_irq_save(flags);
+		__dec_node_page_state(page, NR_DIRTIED);
+		__add_wb_stat(wb, WB_DIRTIED_BYTES, -(long)PAGE_SIZE);
+		local_irq_restore(flags);
 		unlocked_inode_to_wb_end(inode, locked);
 	}
 }
@@ -2772,7 +2776,7 @@ int test_clear_page_writeback(struct page *page)
 				struct bdi_writeback *wb = inode_to_wb(inode);
 
 				__dec_wb_stat(wb, WB_WRITEBACK);
-				__wb_writeout_inc(wb);
+				__wb_writeout_inc(wb, PAGE_SIZE);
 			}
 		}
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/5] writeback: add counters for metadata usage
  2016-10-25 18:41 [PATCH 0/5][RESEND] Support for metadata specific accounting Josef Bacik
  2016-10-25 18:41 ` [PATCH 1/5] remove mapping from balance_dirty_pages*() Josef Bacik
  2016-10-25 18:41 ` [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes Josef Bacik
@ 2016-10-25 18:41 ` Josef Bacik
  2016-10-25 19:50   ` Tejun Heo
  2016-10-30 15:36   ` Jan Kara
  2016-10-25 18:41 ` [PATCH 4/5] writeback: introduce super_operations->write_metadata Josef Bacik
  2016-10-25 18:41 ` [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list Josef Bacik
  4 siblings, 2 replies; 26+ messages in thread
From: Josef Bacik @ 2016-10-25 18:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

Btrfs has no bounds except memory on the amount of dirty memory that we have in
use for metadata.  Historically we have used a special inode so we could take
advantage of the balance_dirty_pages throttling that comes with using pagecache.
However as we'd like to support different blocksizes it would be nice to not
have to rely on pagecache, but still get the balance_dirty_pages throttling
without having to do it ourselves.

So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES.  These are
zone and bdi_writeback counters to keep track of how many bytes we have in
flight for METADATA.  We need to count in bytes as blocksizes could be
percentages of pagesize.  We simply convert the bytes to number of pages where
it is needed for the throttling.

Also introduce NR_METADATA_BYTES so we can keep track of the total amount of
pages used for metadata on the system.  This is also needed so things like dirty
throttling know that this is dirtyable memory as well and easily reclaimed.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 arch/tile/mm/pgtable.c           |   5 +-
 drivers/base/node.c              |   8 ++
 fs/fs-writeback.c                |   2 +
 fs/proc/meminfo.c                |   7 ++
 include/linux/backing-dev-defs.h |   2 +
 include/linux/mm.h               |   9 +++
 include/linux/mmzone.h           |   3 +
 include/trace/events/writeback.h |  13 +++-
 mm/backing-dev.c                 |   4 +
 mm/page-writeback.c              | 157 +++++++++++++++++++++++++++++++++++----
 mm/page_alloc.c                  |  20 ++++-
 mm/util.c                        |   2 +
 mm/vmscan.c                      |  19 ++++-
 mm/vmstat.c                      |   3 +
 14 files changed, 229 insertions(+), 25 deletions(-)

diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index 7cc6ee7..a159459 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -44,13 +44,16 @@ void show_mem(unsigned int filter)
 {
 	struct zone *zone;
 
-	pr_err("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n",
+	pr_err("Active:%lu inactive:%lu dirty:%lu metadata:%lu metadata_dirty:%lu writeback:%lu metadata_writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n",
 	       (global_node_page_state(NR_ACTIVE_ANON) +
 		global_node_page_state(NR_ACTIVE_FILE)),
 	       (global_node_page_state(NR_INACTIVE_ANON) +
 		global_node_page_state(NR_INACTIVE_FILE)),
 	       global_node_page_state(NR_FILE_DIRTY),
+	       global_node_page_state(NR_METADATA_BYTES) >> PAGE_SHIFT,
+	       global_node_page_state(NR_METADATA_DIRTY_BYTES) >> PAGE_SHIFT,
 	       global_node_page_state(NR_WRITEBACK),
+	       global_node_page_state(NR_METADATA_WRITEBACK_BYTES) >> PAGE_SHIFT,
 	       global_node_page_state(NR_UNSTABLE_NFS),
 	       global_page_state(NR_FREE_PAGES),
 	       (global_page_state(NR_SLAB_RECLAIMABLE) +
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f96..9d6c239 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -51,6 +51,8 @@ static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
 static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
+#define BtoK(x) ((x) >> 10)
+
 static ssize_t node_read_meminfo(struct device *dev,
 			struct device_attribute *attr, char *buf)
 {
@@ -99,7 +101,10 @@ static ssize_t node_read_meminfo(struct device *dev,
 #endif
 	n += sprintf(buf + n,
 		       "Node %d Dirty:          %8lu kB\n"
+		       "Node %d MetadataDirty:	%8lu kB\n"
 		       "Node %d Writeback:      %8lu kB\n"
+		       "Node %d MetaWriteback:  %8lu kB\n"
+		       "Node %d Metadata:       %8lu kB\n"
 		       "Node %d FilePages:      %8lu kB\n"
 		       "Node %d Mapped:         %8lu kB\n"
 		       "Node %d AnonPages:      %8lu kB\n"
@@ -119,8 +124,11 @@ static ssize_t node_read_meminfo(struct device *dev,
 #endif
 			,
 		       nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
+		       nid, BtoK(node_page_state(pgdat, NR_METADATA_DIRTY_BYTES)),
 		       nid, K(node_page_state(pgdat, NR_WRITEBACK)),
+		       nid, BtoK(node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES)),
 		       nid, K(node_page_state(pgdat, NR_FILE_PAGES)),
+		       nid, BtoK(node_page_state(pgdat, NR_METADATA_BYTES)),
 		       nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
 		       nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
 		       nid, K(i.sharedram),
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 05713a5..a5cb1dd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1802,6 +1802,7 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
 	return work;
 }
 
+#define BtoP(x) ((x) >> PAGE_SHIFT)
 /*
  * Add in the number of potentially dirty inodes, because each inode
  * write can dirty pagecache in the underlying blockdev.
@@ -1810,6 +1811,7 @@ static unsigned long get_nr_dirty_pages(void)
 {
 	return global_node_page_state(NR_FILE_DIRTY) +
 		global_node_page_state(NR_UNSTABLE_NFS) +
+		BtoP(global_node_page_state(NR_METADATA_DIRTY_BYTES)) +
 		get_nr_dirty_inodes();
 }
 
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index b9a8c81..72da154 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -36,6 +36,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
  * display in kilobytes.
  */
 #define K(x) ((x) << (PAGE_SHIFT - 10))
+#define BtoK(x) ((x) >> 10)
 	si_meminfo(&i);
 	si_swapinfo(&i);
 	committed = percpu_counter_read_positive(&vm_committed_as);
@@ -60,6 +61,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		"Buffers:        %8lu kB\n"
 		"Cached:         %8lu kB\n"
 		"SwapCached:     %8lu kB\n"
+		"Metadata:       %8lu kB\n"
 		"Active:         %8lu kB\n"
 		"Inactive:       %8lu kB\n"
 		"Active(anon):   %8lu kB\n"
@@ -80,7 +82,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		"SwapTotal:      %8lu kB\n"
 		"SwapFree:       %8lu kB\n"
 		"Dirty:          %8lu kB\n"
+		"MetadataDirty:  %8lu kB\n"
 		"Writeback:      %8lu kB\n"
+		"MetaWriteback:  %8lu kB\n"
 		"AnonPages:      %8lu kB\n"
 		"Mapped:         %8lu kB\n"
 		"Shmem:          %8lu kB\n"
@@ -119,6 +123,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		K(i.bufferram),
 		K(cached),
 		K(total_swapcache_pages()),
+		BtoK(global_node_page_state(NR_METADATA_BYTES)),
 		K(pages[LRU_ACTIVE_ANON]   + pages[LRU_ACTIVE_FILE]),
 		K(pages[LRU_INACTIVE_ANON] + pages[LRU_INACTIVE_FILE]),
 		K(pages[LRU_ACTIVE_ANON]),
@@ -139,7 +144,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		K(i.totalswap),
 		K(i.freeswap),
 		K(global_node_page_state(NR_FILE_DIRTY)),
+		BtoK(global_node_page_state(NR_METADATA_DIRTY_BYTES)),
 		K(global_node_page_state(NR_WRITEBACK)),
+		BtoK(global_node_page_state(NR_METADATA_WRITEBACK_BYTES)),
 		K(global_node_page_state(NR_ANON_MAPPED)),
 		K(global_node_page_state(NR_FILE_MAPPED)),
 		K(i.sharedram),
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 71ea5a6..b1f8f70 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -36,6 +36,8 @@ enum wb_stat_item {
 	WB_WRITEBACK,
 	WB_DIRTIED_BYTES,
 	WB_WRITTEN_BYTES,
+	WB_METADATA_DIRTY_BYTES,
+	WB_METADATA_WRITEBACK_BYTES,
 	NR_WB_STAT_ITEMS
 };
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef815b9..8b425ae 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -31,6 +31,7 @@ struct file_ra_state;
 struct user_struct;
 struct writeback_control;
 struct bdi_writeback;
+struct backing_dev_info;
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES	/* Don't use mapnrs, do it properly */
 extern unsigned long max_mapnr;
@@ -1363,6 +1364,14 @@ int redirty_page_for_writepage(struct writeback_control *wbc,
 void account_page_dirtied(struct page *page, struct address_space *mapping);
 void account_page_cleaned(struct page *page, struct address_space *mapping,
 			  struct bdi_writeback *wb);
+void account_metadata_dirtied(struct page *page, struct backing_dev_info *bdi,
+			      long bytes);
+void account_metadata_cleaned(struct page *page, struct backing_dev_info *bdi,
+			      long bytes);
+void account_metadata_writeback(struct page *page,
+				struct backing_dev_info *bdi, long bytes);
+void account_metadata_end_writeback(struct page *page,
+				    struct backing_dev_info *bdi, long bytes);
 int set_page_dirty(struct page *page);
 int set_page_dirty_lock(struct page *page);
 void cancel_dirty_page(struct page *page);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7f2ae99..89db46c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -169,6 +169,9 @@ enum node_stat_item {
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
+	NR_METADATA_DIRTY_BYTES,	/* Metadata dirty bytes */
+	NR_METADATA_WRITEBACK_BYTES,	/* Metadata writeback bytes */
+	NR_METADATA_BYTES,	/* total metadata bytes in use. */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 2ccd9cc..f97c8de 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -390,6 +390,8 @@ TRACE_EVENT(writeback_queue_io,
 	)
 );
 
+#define BtoP(x) ((x) >> PAGE_SHIFT)
+
 TRACE_EVENT(global_dirty_state,
 
 	TP_PROTO(unsigned long background_thresh,
@@ -402,7 +404,9 @@ TRACE_EVENT(global_dirty_state,
 
 	TP_STRUCT__entry(
 		__field(unsigned long,	nr_dirty)
+		__field(unsigned long,	nr_metadata_dirty)
 		__field(unsigned long,	nr_writeback)
+		__field(unsigned long,	nr_metadata_writeback)
 		__field(unsigned long,	nr_unstable)
 		__field(unsigned long,	background_thresh)
 		__field(unsigned long,	dirty_thresh)
@@ -413,7 +417,9 @@ TRACE_EVENT(global_dirty_state,
 
 	TP_fast_assign(
 		__entry->nr_dirty	= global_node_page_state(NR_FILE_DIRTY);
+		__entry->nr_metadata_dirty = BtoP(global_node_page_state(NR_METADATA_DIRTY_BYTES));
 		__entry->nr_writeback	= global_node_page_state(NR_WRITEBACK);
+		__entry->nr_metadata_dirty = BtoP(global_node_page_state(NR_METADATA_WRITEBACK_BYTES));
 		__entry->nr_unstable	= global_node_page_state(NR_UNSTABLE_NFS);
 		__entry->nr_dirtied	= global_node_page_state(NR_DIRTIED);
 		__entry->nr_written	= global_node_page_state(NR_WRITTEN);
@@ -424,7 +430,8 @@ TRACE_EVENT(global_dirty_state,
 
 	TP_printk("dirty=%lu writeback=%lu unstable=%lu "
 		  "bg_thresh=%lu thresh=%lu limit=%lu "
-		  "dirtied=%lu written=%lu",
+		  "dirtied=%lu written=%lu metadata_dirty=%lu "
+		  "metadata_writeback=%lu",
 		  __entry->nr_dirty,
 		  __entry->nr_writeback,
 		  __entry->nr_unstable,
@@ -432,7 +439,9 @@ TRACE_EVENT(global_dirty_state,
 		  __entry->dirty_thresh,
 		  __entry->dirty_limit,
 		  __entry->nr_dirtied,
-		  __entry->nr_written
+		  __entry->nr_written,
+		  __entry->nr_metadata_dirty,
+		  __entry->nr_metadata_writeback
 	)
 );
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 433db42..da3f68b 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -79,6 +79,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   "BackgroundThresh:   %10lu kB\n"
 		   "BdiDirtiedBytes:    %10lu kB\n"
 		   "BdiWrittenBytes:    %10lu kB\n"
+		   "BdiMetadataDirty:   %10lu kB\n"
+		   "BdiMetaWriteback:	%10lu kB\n"
 		   "BdiWriteBandwidth:  %10lu kBps\n"
 		   "b_dirty:            %10lu\n"
 		   "b_io:               %10lu\n"
@@ -93,6 +95,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   K(background_thresh),
 		   (unsigned long) BtoK(wb_stat(wb, WB_DIRTIED_BYTES)),
 		   (unsigned long) BtoK(wb_stat(wb, WB_WRITTEN_BYTES)),
+		   (unsigned long) BtoK(wb_stat(wb, WB_METADATA_DIRTY_BYTES)),
+		   (unsigned long) BtoK(wb_stat(wb, WB_METADATA_WRITEBACK_BYTES)),
 		   (unsigned long) K(wb->write_bandwidth),
 		   nr_dirty,
 		   nr_io,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e09b3ad..48faf1b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -296,6 +296,7 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)
 
 	nr_pages += node_page_state(pgdat, NR_INACTIVE_FILE);
 	nr_pages += node_page_state(pgdat, NR_ACTIVE_FILE);
+	nr_pages += node_page_state(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT;
 
 	return nr_pages;
 }
@@ -372,6 +373,7 @@ static unsigned long global_dirtyable_memory(void)
 
 	x += global_node_page_state(NR_INACTIVE_FILE);
 	x += global_node_page_state(NR_ACTIVE_FILE);
+	x += global_node_page_state(NR_METADATA_BYTES) >> PAGE_SHIFT;
 
 	if (!vm_highmem_is_dirtyable)
 		x -= highmem_dirtyable_memory(x);
@@ -380,6 +382,30 @@ static unsigned long global_dirtyable_memory(void)
 }
 
 /**
+ * global_dirty_memory - the number of globally dirty pages
+ *
+ * Returns the global number of pages that are dirty in pagecache and metadata.
+ */
+static unsigned long global_dirty_memory(void)
+{
+	return global_node_page_state(NR_FILE_DIRTY) +
+		global_node_page_state(NR_UNSTABLE_NFS) +
+		(global_node_page_state(NR_METADATA_DIRTY_BYTES) >> PAGE_SHIFT);
+}
+
+/**
+ * global_writeback_memory - the number of pages under writeback globally
+ *
+ * Returns the global number of pages under writeback both in pagecache and in
+ * metadata.
+ */
+static unsigned long global_writeback_memory(void)
+{
+	return global_node_page_state(NR_WRITEBACK) +
+		(global_node_page_state(NR_METADATA_WRITEBACK_BYTES) >> PAGE_SHIFT);
+}
+
+/**
  * domain_dirty_limits - calculate thresh and bg_thresh for a wb_domain
  * @dtc: dirty_throttle_control of interest
  *
@@ -1514,7 +1540,7 @@ static long wb_min_pause(struct bdi_writeback *wb,
 static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
 {
 	struct bdi_writeback *wb = dtc->wb;
-	unsigned long wb_reclaimable;
+	unsigned long wb_reclaimable, wb_writeback;
 
 	/*
 	 * wb_thresh is not treated as some limiting factor as
@@ -1544,12 +1570,17 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
 	 * deltas.
 	 */
 	if (dtc->wb_thresh < 2 * wb_stat_error(wb)) {
-		wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
-		dtc->wb_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK);
+		wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE) +
+			(wb_stat_sum(wb, WB_METADATA_DIRTY_BYTES) >> PAGE_SHIFT);
+		wb_writeback = wb_stat_sum(wb, WB_WRITEBACK) +
+			(wb_stat_sum(wb, WB_METADATA_WRITEBACK_BYTES) >> PAGE_SHIFT);
 	} else {
-		wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE);
-		dtc->wb_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK);
+		wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE) +
+			(wb_stat(wb, WB_METADATA_DIRTY_BYTES) >> PAGE_SHIFT);
+		wb_writeback = wb_stat(wb, WB_WRITEBACK) +
+			(wb_stat(wb, WB_METADATA_WRITEBACK_BYTES) >> PAGE_SHIFT);
 	}
+	dtc->wb_dirty = wb_reclaimable + wb_writeback;
 }
 
 /*
@@ -1594,10 +1625,9 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
-					global_node_page_state(NR_UNSTABLE_NFS);
+		nr_reclaimable = global_dirty_memory();
 		gdtc->avail = global_dirtyable_memory();
-		gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);
+		gdtc->dirty = nr_reclaimable + global_writeback_memory();
 
 		domain_dirty_limits(gdtc);
 
@@ -1928,20 +1958,22 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 	struct dirty_throttle_control * const gdtc = &gdtc_stor;
 	struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ?
 						     &mdtc_stor : NULL;
+	unsigned long wb_reclaimable;
 
 	/*
 	 * Similar to balance_dirty_pages() but ignores pages being written
 	 * as we're trying to decide whether to put more under writeback.
 	 */
 	gdtc->avail = global_dirtyable_memory();
-	gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) +
-		      global_node_page_state(NR_UNSTABLE_NFS);
+	gdtc->dirty = global_dirty_memory();
 	domain_dirty_limits(gdtc);
 
 	if (gdtc->dirty > gdtc->bg_thresh)
 		return true;
 
-	if (wb_stat(wb, WB_RECLAIMABLE) >
+	wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE) +
+		(wb_stat(wb, WB_METADATA_DIRTY_BYTES) >> PAGE_SHIFT);
+	if (wb_reclaimable >
 	    wb_calc_thresh(gdtc->wb, gdtc->bg_thresh))
 		return true;
 
@@ -1956,7 +1988,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 		if (mdtc->dirty > mdtc->bg_thresh)
 			return true;
 
-		if (wb_stat(wb, WB_RECLAIMABLE) >
+		if (wb_reclaimable >
 		    wb_calc_thresh(mdtc->wb, mdtc->bg_thresh))
 			return true;
 	}
@@ -1980,8 +2012,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
 
                 if (global_node_page_state(NR_UNSTABLE_NFS) +
-			global_node_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
+		    global_writeback_memory() <= dirty_thresh)
+			break;
                 congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
@@ -2008,8 +2040,7 @@ int dirty_writeback_centisecs_handler(struct ctl_table *table, int write,
 void laptop_mode_timer_fn(unsigned long data)
 {
 	struct request_queue *q = (struct request_queue *)data;
-	int nr_pages = global_node_page_state(NR_FILE_DIRTY) +
-		global_node_page_state(NR_UNSTABLE_NFS);
+	int nr_pages = global_dirty_memory();
 	struct bdi_writeback *wb;
 
 	/*
@@ -2473,6 +2504,100 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 EXPORT_SYMBOL(account_page_dirtied);
 
 /*
+ * account_metadata_dirtied
+ * @page - the page being dirited
+ * @bdi - the bdi that owns this page
+ * @bytes - the number of bytes being dirtied
+ *
+ * Do the dirty page accounting for metadata pages that aren't backed by an
+ * address_space.
+ */
+void account_metadata_dirtied(struct page *page, struct backing_dev_info *bdi,
+			      long bytes)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_node_page_state(page_pgdat(page), NR_METADATA_DIRTY_BYTES,
+			      bytes);
+	__add_wb_stat(&bdi->wb, WB_DIRTIED_BYTES, bytes);
+	__add_wb_stat(&bdi->wb, WB_METADATA_DIRTY_BYTES, bytes);
+	current->nr_dirtied++;
+	task_io_account_write(bytes);
+	this_cpu_inc(bdp_ratelimits);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(account_metadata_dirtied);
+
+/*
+ * account_metadata_cleaned
+ * @page - the page being cleaned
+ * @bdi - the bdi that owns this page
+ * @bytes - the number of bytes cleaned
+ *
+ * Called on a no longer dirty metadata page.
+ */
+void account_metadata_cleaned(struct page *page, struct backing_dev_info *bdi,
+			      long bytes)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_node_page_state(page_pgdat(page), NR_METADATA_DIRTY_BYTES,
+			      -bytes);
+	__add_wb_stat(&bdi->wb, WB_METADATA_DIRTY_BYTES, -bytes);
+	task_io_account_cancelled_write(bytes);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(account_metadata_cleaned);
+
+/*
+ * account_metadata_writeback
+ * @page - the page being marked as writeback
+ * @bdi - the bdi that owns this page
+ * @bytes - the number of bytes we are submitting for writeback
+ *
+ * Called on a metadata page that has been marked writeback.
+ */
+void account_metadata_writeback(struct page *page,
+				struct backing_dev_info *bdi, long bytes)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__add_wb_stat(&bdi->wb, WB_METADATA_DIRTY_BYTES, -bytes);
+	__mod_node_page_state(page_pgdat(page), NR_METADATA_DIRTY_BYTES,
+					 -bytes);
+	__add_wb_stat(&bdi->wb, WB_METADATA_WRITEBACK_BYTES, bytes);
+	__mod_node_page_state(page_pgdat(page), NR_METADATA_WRITEBACK_BYTES,
+					 bytes);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(account_metadata_writeback);
+
+/*
+ * account_metadata_end_writeback
+ * @page - the page we are ending writeback on
+ * @bdi - the bdi that owns this page
+ * @bytes - the number of bytes that just ended writeback
+ *
+ * Called on a metadata page that has completed writeback.
+ */
+void account_metadata_end_writeback(struct page *page,
+				    struct backing_dev_info *bdi, long bytes)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__add_wb_stat(&bdi->wb, WB_METADATA_WRITEBACK_BYTES, -bytes);
+	__mod_node_page_state(page_pgdat(page), NR_METADATA_WRITEBACK_BYTES,
+					 -bytes);
+	__wb_writeout_inc(&bdi->wb, bytes);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(account_metadata_end_writeback);
+
+/*
  * Helper function for deaccounting dirty page without writeback.
  *
  * Caller must hold lock_page_memcg().
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a2214c6..b8c9d93 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4113,6 +4113,8 @@ out:
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
+#define BtoK(x) ((x) >> 10)
+#define BtoP(x) ((x) >> PAGE_SHIFT)
 
 static void show_migration_types(unsigned char type)
 {
@@ -4167,10 +4169,11 @@ void show_free_areas(unsigned int filter)
 
 	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
 		" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
-		" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
-		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
-		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
-		" free:%lu free_pcp:%lu free_cma:%lu\n",
+		" unevictable:%lu metadata:%lu dirty:%lu metadata_dirty:%lu\n"
+		" writeback:%lu unstable:%lu metadata_writeback:%lu\n"
+		" slab_reclaimable:%lu slab_unreclaimable:%lu mapped:%lu\n"
+		" shmem:%lu pagetables:%lu bounce:%lu free:%lu free_pcp:%lu\n"
+	        " free_cma:%lu\n",
 		global_node_page_state(NR_ACTIVE_ANON),
 		global_node_page_state(NR_INACTIVE_ANON),
 		global_node_page_state(NR_ISOLATED_ANON),
@@ -4178,9 +4181,12 @@ void show_free_areas(unsigned int filter)
 		global_node_page_state(NR_INACTIVE_FILE),
 		global_node_page_state(NR_ISOLATED_FILE),
 		global_node_page_state(NR_UNEVICTABLE),
+		BtoP(global_node_page_state(NR_METADATA_BYTES)),
 		global_node_page_state(NR_FILE_DIRTY),
+		BtoP(global_node_page_state(NR_METADATA_DIRTY_BYTES)),
 		global_node_page_state(NR_WRITEBACK),
 		global_node_page_state(NR_UNSTABLE_NFS),
+		BtoP(global_node_page_state(NR_METADATA_WRITEBACK_BYTES)),
 		global_page_state(NR_SLAB_RECLAIMABLE),
 		global_page_state(NR_SLAB_UNRECLAIMABLE),
 		global_node_page_state(NR_FILE_MAPPED),
@@ -4200,9 +4206,12 @@ void show_free_areas(unsigned int filter)
 			" unevictable:%lukB"
 			" isolated(anon):%lukB"
 			" isolated(file):%lukB"
+			" metadata:%lukB"
 			" mapped:%lukB"
 			" dirty:%lukB"
+			" metadata_dirty:%lukB"
 			" writeback:%lukB"
+			" metadata_writeback:%lukB"
 			" shmem:%lukB"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			" shmem_thp: %lukB"
@@ -4222,9 +4231,12 @@ void show_free_areas(unsigned int filter)
 			K(node_page_state(pgdat, NR_UNEVICTABLE)),
 			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
 			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
+			BtoK(node_page_state(pgdat, NR_METADATA_BYTES)),
 			K(node_page_state(pgdat, NR_FILE_MAPPED)),
 			K(node_page_state(pgdat, NR_FILE_DIRTY)),
+			BtoK(node_page_state(pgdat, NR_METADATA_DIRTY_BYTES)),
 			K(node_page_state(pgdat, NR_WRITEBACK)),
+			BtoK(node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES)),
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			K(node_page_state(pgdat, NR_SHMEM_THPS) * HPAGE_PMD_NR),
 			K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)
diff --git a/mm/util.c b/mm/util.c
index 662cddf..e8b8b7f 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -548,6 +548,8 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 		 */
 		free += global_page_state(NR_SLAB_RECLAIMABLE);
 
+		free += global_page_state(NR_METADATA_BYTES) >> PAGE_SHIFT;
+
 		/*
 		 * Leave reserved pages. The pages are not for anonymous pages.
 		 */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fe8b71..322d6d4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -218,7 +218,8 @@ unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
 
 	nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
 	     node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
-	     node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
+	     node_page_state_snapshot(pgdat, NR_ISOLATED_FILE) +
+	     (node_page_state_snapshot(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT);
 
 	if (get_nr_swap_pages() > 0)
 		nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
@@ -3680,6 +3681,7 @@ static inline unsigned long node_unmapped_file_pages(struct pglist_data *pgdat)
 static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
 {
 	unsigned long nr_pagecache_reclaimable;
+	unsigned long nr_metadata_reclaimable;
 	unsigned long delta = 0;
 
 	/*
@@ -3701,7 +3703,20 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
 	if (unlikely(delta > nr_pagecache_reclaimable))
 		delta = nr_pagecache_reclaimable;
 
-	return nr_pagecache_reclaimable - delta;
+	nr_metadata_reclaimable =
+		node_page_state(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT;
+	/*
+	 * We don't do writeout through the shrinkers so subtract any
+	 * dirty/writeback metadata bytes from the reclaimable count.
+	 */
+	if (nr_metadata_reclaimable) {
+		unsigned long unreclaimable =
+			node_page_state(pgdat, NR_METADATA_DIRTY_BYTES) +
+			node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES);
+		unreclaimable >>= PAGE_SHIFT;
+		nr_metadata_reclaimable -= unreclaimable;
+	}
+	return nr_metadata_reclaimable + nr_pagecache_reclaimable - delta;
 }
 
 /*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 89cec42..b762e39 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -973,6 +973,9 @@ const char * const vmstat_text[] = {
 	"nr_vmscan_immediate_reclaim",
 	"nr_dirtied",
 	"nr_written",
+	"nr_metadata_dirty_bytes",
+	"nr_metadata_writeback_bytes",
+	"nr_metadata_bytes",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/5] writeback: introduce super_operations->write_metadata
  2016-10-25 18:41 [PATCH 0/5][RESEND] Support for metadata specific accounting Josef Bacik
                   ` (2 preceding siblings ...)
  2016-10-25 18:41 ` [PATCH 3/5] writeback: add counters for metadata usage Josef Bacik
@ 2016-10-25 18:41 ` Josef Bacik
  2016-10-25 20:00   ` Tejun Heo
  2016-10-25 18:41 ` [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list Josef Bacik
  4 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2016-10-25 18:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

Now that we have metadata counters in the VM, we need to provide a way to kick
writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
allows file systems to deal with writing back any dirty metadata we need based
on the writeback needs of the system.  Since there is no inode to key off of we
need a list in the bdi for dirty super blocks to be added.  From there we can
find any dirty sb's on the bdi we are currently doing writeback on and call into
their ->write_metadata callback.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                | 72 ++++++++++++++++++++++++++++++++++++----
 fs/super.c                       |  7 ++++
 include/linux/backing-dev-defs.h |  2 ++
 include/linux/fs.h               |  4 +++
 mm/backing-dev.c                 |  2 ++
 5 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a5cb1dd..8e392b2 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1465,6 +1465,31 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
 	return pages;
 }
 
+static long writeback_sb_metadata(struct super_block *sb,
+				  struct bdi_writeback *wb,
+				  struct wb_writeback_work *work)
+{
+	struct writeback_control wbc = {
+		.sync_mode		= work->sync_mode,
+		.tagged_writepages	= work->tagged_writepages,
+		.for_kupdate		= work->for_kupdate,
+		.for_background		= work->for_background,
+		.for_sync		= work->for_sync,
+		.range_cyclic		= work->range_cyclic,
+		.range_start		= 0,
+		.range_end		= LLONG_MAX,
+	};
+	long write_chunk;
+
+	write_chunk = writeback_chunk_size(wb, work);
+	wbc.nr_to_write = write_chunk;
+	sb->s_op->write_metadata(sb, &wbc);
+	work->nr_pages -= write_chunk - wbc.nr_to_write;
+
+	return write_chunk - wbc.nr_to_write;
+}
+
+
 /*
  * Write a portion of b_io inodes which belong to @sb.
  *
@@ -1491,6 +1516,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 	unsigned long start_time = jiffies;
 	long write_chunk;
 	long wrote = 0;  /* count both pages and inodes */
+	bool done = false;
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1607,12 +1633,18 @@ static long writeback_sb_inodes(struct super_block *sb,
 		 * background threshold and other termination conditions.
 		 */
 		if (wrote) {
-			if (time_is_before_jiffies(start_time + HZ / 10UL))
-				break;
-			if (work->nr_pages <= 0)
+			if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+			    work->nr_pages <= 0) {
+				done = true;
 				break;
+			}
 		}
 	}
+	if (!done && sb->s_op->write_metadata) {
+		spin_unlock(&wb->list_lock);
+		wrote += writeback_sb_metadata(sb, wb, work);
+		spin_lock(&wb->list_lock);
+	}
 	return wrote;
 }
 
@@ -1621,6 +1653,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 {
 	unsigned long start_time = jiffies;
 	long wrote = 0;
+	bool done = false;
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1640,12 +1673,39 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 
 		/* refer to the same tests at the end of writeback_sb_inodes */
 		if (wrote) {
-			if (time_is_before_jiffies(start_time + HZ / 10UL))
-				break;
-			if (work->nr_pages <= 0)
+			if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+			    work->nr_pages <= 0) {
+				done = true;
 				break;
+			}
 		}
 	}
+
+	if (!done && wb_stat(wb, WB_METADATA_DIRTY_BYTES)) {
+		LIST_HEAD(list);
+
+		spin_unlock(&wb->list_lock);
+		spin_lock(&wb->bdi->sb_list_lock);
+		list_splice_init(&wb->bdi->dirty_sb_list, &list);
+		while (!list_empty(&list)) {
+			struct super_block *sb;
+
+			sb = list_first_entry(&list, struct super_block,
+					      s_bdi_dirty_list);
+			list_move_tail(&sb->s_bdi_dirty_list,
+				       &wb->bdi->dirty_sb_list);
+			if (!sb->s_op->write_metadata)
+				continue;
+			if (!trylock_super(sb))
+				continue;
+			spin_unlock(&wb->bdi->sb_list_lock);
+			wrote += writeback_sb_metadata(sb, wb, work);
+			spin_lock(&wb->bdi->sb_list_lock);
+			up_read(&sb->s_umount);
+		}
+		spin_unlock(&wb->bdi->sb_list_lock);
+		spin_lock(&wb->list_lock);
+	}
 	/* Leave any unwritten inodes on b_io */
 	return wrote;
 }
diff --git a/fs/super.c b/fs/super.c
index c2ff475..eb32913 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -215,6 +215,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	spin_lock_init(&s->s_inode_list_lock);
 	INIT_LIST_HEAD(&s->s_inodes_wb);
 	spin_lock_init(&s->s_inode_wblist_lock);
+	INIT_LIST_HEAD(&s->s_bdi_dirty_list);
 
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
@@ -305,6 +306,8 @@ void deactivate_locked_super(struct super_block *s)
 {
 	struct file_system_type *fs = s->s_type;
 	if (atomic_dec_and_test(&s->s_active)) {
+		struct backing_dev_info *bdi = s->s_bdi;
+
 		cleancache_invalidate_fs(s);
 		unregister_shrinker(&s->s_shrink);
 		fs->kill_sb(s);
@@ -317,6 +320,10 @@ void deactivate_locked_super(struct super_block *s)
 		list_lru_destroy(&s->s_dentry_lru);
 		list_lru_destroy(&s->s_inode_lru);
 
+		spin_lock(&bdi->sb_list_lock);
+		list_del_init(&s->s_bdi_dirty_list);
+		spin_unlock(&bdi->sb_list_lock);
+
 		put_filesystem(fs);
 		put_super(s);
 	} else {
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index b1f8f70..8758d31 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -169,6 +169,8 @@ struct backing_dev_info {
 
 	struct timer_list laptop_mode_wb_timer;
 
+	spinlock_t sb_list_lock;
+	struct list_head dirty_sb_list;
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *debug_dir;
 	struct dentry *debug_stats;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 901e25d..0217820 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1431,6 +1431,8 @@ struct super_block {
 
 	spinlock_t		s_inode_wblist_lock;
 	struct list_head	s_inodes_wb;	/* writeback inodes */
+
+	struct list_head	s_bdi_dirty_list;
 };
 
 /* Helper functions so that in most cases filesystems will
@@ -1806,6 +1808,8 @@ struct super_operations {
 				  struct shrink_control *);
 	long (*free_cached_objects)(struct super_block *,
 				    struct shrink_control *);
+	void (*write_metadata)(struct super_block *sb,
+			       struct writeback_control *wbc);
 };
 
 /*
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index da3f68b..10d9286 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -785,6 +785,8 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->max_prop_frac = FPROP_FRAC_BASE;
 	INIT_LIST_HEAD(&bdi->bdi_list);
 	INIT_LIST_HEAD(&bdi->wb_list);
+	INIT_LIST_HEAD(&bdi->dirty_sb_list);
+	spin_lock_init(&bdi->sb_list_lock);
 	init_waitqueue_head(&bdi->wb_waitq);
 
 	ret = cgwb_bdi_init(bdi);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-25 18:41 [PATCH 0/5][RESEND] Support for metadata specific accounting Josef Bacik
                   ` (3 preceding siblings ...)
  2016-10-25 18:41 ` [PATCH 4/5] writeback: introduce super_operations->write_metadata Josef Bacik
@ 2016-10-25 18:41 ` Josef Bacik
  2016-10-25 22:01   ` Dave Chinner
  2016-10-25 22:44   ` Omar Sandoval
  4 siblings, 2 replies; 26+ messages in thread
From: Josef Bacik @ 2016-10-25 18:41 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

With anything that populates the inode/dentry cache with a lot of one time use
inodes we can really put a lot of pressure on the system for things we don't
need to keep in cache.  It takes two runs through the LRU to evict these one use
entries, and if you have a lot of memory you can end up with 10's of millions of
entries in the dcache or icache that have never actually been touched since they
were first instantiated, and it will take a lot of CPU and a lot of pressure to
evict all of them.

So instead do what we do with pagecache, only set the *REFERENCED flags if we
are being used after we've been put onto the LRU.  This makes a significant
difference in the system's ability to evict these useless cache entries.  With a
fs_mark workload that creates 40 million files we get the following results (all
in files/sec)

Btrfs			Patched		Unpatched
Average Files/sec:	72209.3		63254.2
p50 Files/sec:		70850		57560
p90 Files/sec:		68757		53085
p99 Files/sec:		68757		53085

XFS			Patched		Unpatched
Average Files/sec:	61025.5		60719.5
p50 Files/sec:		60107		59465
p90 Files/sec: 		59300		57966
p99 Files/sec: 		59227		57528

Ext4			Patched		Unpatched
Average Files/sec:	121785.4	119248.0
p50 Files/sec:		120852		119274
p90 Files/sec:		116703		112783
p99 Files/sec:		114393		104934

The reason Btrfs has a much larger improvement is because it holds a lot more
things in memory so benefits more from faster slab reclaim, but across the board
is an improvement for each of the file systems.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/dcache.c | 15 ++++++++++-----
 fs/inode.c  |  5 ++++-
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 5c7cc95..a558075 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -779,8 +779,6 @@ repeat:
 			goto kill_it;
 	}
 
-	if (!(dentry->d_flags & DCACHE_REFERENCED))
-		dentry->d_flags |= DCACHE_REFERENCED;
 	dentry_lru_add(dentry);
 
 	dentry->d_lockref.count--;
@@ -803,6 +801,13 @@ static inline void __dget_dlock(struct dentry *dentry)
 	dentry->d_lockref.count++;
 }
 
+static inline void __dget_dlock_reference(struct dentry *dentry)
+{
+	if (dentry->d_flags & DCACHE_LRU_LIST)
+		dentry->d_flags |= DCACHE_REFERENCED;
+	dentry->d_lockref.count++;
+}
+
 static inline void __dget(struct dentry *dentry)
 {
 	lockref_get(&dentry->d_lockref);
@@ -875,7 +880,7 @@ again:
 			    (alias->d_flags & DCACHE_DISCONNECTED)) {
 				discon_alias = alias;
 			} else {
-				__dget_dlock(alias);
+				__dget_dlock_reference(alias);
 				spin_unlock(&alias->d_lock);
 				return alias;
 			}
@@ -886,7 +891,7 @@ again:
 		alias = discon_alias;
 		spin_lock(&alias->d_lock);
 		if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
-			__dget_dlock(alias);
+			__dget_dlock_reference(alias);
 			spin_unlock(&alias->d_lock);
 			return alias;
 		}
@@ -2250,7 +2255,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
 		if (!d_same_name(dentry, parent, name))
 			goto next;
 
-		dentry->d_lockref.count++;
+		__dget_dlock_reference(dentry);
 		found = dentry;
 		spin_unlock(&dentry->d_lock);
 		break;
diff --git a/fs/inode.c b/fs/inode.c
index 7e3ef3a..5937d3a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -796,6 +796,8 @@ repeat:
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
+		if (!list_empty(&inode->i_lru))
+			inode->i_state |= I_REFERENCED;
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		return inode;
@@ -823,6 +825,8 @@ repeat:
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
+		if (!list_empty(&inode->i_lru))
+			inode->i_state |= I_REFERENCED;
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		return inode;
@@ -1463,7 +1467,6 @@ static void iput_final(struct inode *inode)
 		drop = generic_drop_inode(inode);
 
 	if (!drop && (sb->s_flags & MS_ACTIVE)) {
-		inode->i_state |= I_REFERENCED;
 		inode_add_lru(inode);
 		spin_unlock(&inode->i_lock);
 		return;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/5] remove mapping from balance_dirty_pages*()
  2016-10-25 18:41 ` [PATCH 1/5] remove mapping from balance_dirty_pages*() Josef Bacik
@ 2016-10-25 18:47   ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2016-10-25 18:47 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

On Tue, Oct 25, 2016 at 02:41:40PM -0400, Josef Bacik wrote:
> The only reason we pass in the mapping is to get the inode in order to see if
> writeback cgroups is enabled, and even then it only checks the bdi and a super
> block flag.  balance_dirty_pages() doesn't even use the mapping.  Since
> balance_dirty_pages*() works on a bdi level, just pass in the bdi and super
> block directly so we can avoid using mapping.  This will allow us to still use
> balance_dirty_pages for dirty metadata pages that are not backed by an
> address_mapping.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes
  2016-10-25 18:41 ` [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes Josef Bacik
@ 2016-10-25 19:03   ` Tejun Heo
  2016-10-25 19:09     ` Josef Bacik
  2016-10-30 15:13   ` Jan Kara
  1 sibling, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2016-10-25 19:03 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

Hello, Josef.

On Tue, Oct 25, 2016 at 02:41:41PM -0400, Josef Bacik wrote:
> These are counters that constantly go up in order to do bandwidth calculations.
> It isn't important what the units are in, as long as they are consistent between
> the two of them, so convert them to count bytes written/dirtied, and allow the
> metadata accounting stuff to change the counters as well.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Acked-by: Tejun Heo <tj@kernel.org>

A small nit below.

> @@ -2547,12 +2547,16 @@ void account_page_redirty(struct page *page)
>  	if (mapping && mapping_cap_account_dirty(mapping)) {
>  		struct inode *inode = mapping->host;
>  		struct bdi_writeback *wb;
> +		unsigned long flags;
>  		bool locked;
>  
>  		wb = unlocked_inode_to_wb_begin(inode, &locked);
>  		current->nr_dirtied--;
> -		dec_node_page_state(page, NR_DIRTIED);
> -		dec_wb_stat(wb, WB_DIRTIED);
> +
> +		local_irq_save(flags);
> +		__dec_node_page_state(page, NR_DIRTIED);
> +		__add_wb_stat(wb, WB_DIRTIED_BYTES, -(long)PAGE_SIZE);
> +		local_irq_restore(flags);

Hmmm... so, the explicit irq clustering is neutral or win as the code
currently stands but AFAICS that's just because add_wb_stat() doesn't
use the right percpu ops.  If we convert add_wb_stat() to use the
matching percpu ops, the above change would be more expensive at least
on x86.  Maybe just skip this part?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes
  2016-10-25 19:03   ` Tejun Heo
@ 2016-10-25 19:09     ` Josef Bacik
  0 siblings, 0 replies; 26+ messages in thread
From: Josef Bacik @ 2016-10-25 19:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

On 10/25/2016 03:03 PM, Tejun Heo wrote:
> Hello, Josef.
>
> On Tue, Oct 25, 2016 at 02:41:41PM -0400, Josef Bacik wrote:
>> These are counters that constantly go up in order to do bandwidth calculations.
>> It isn't important what the units are in, as long as they are consistent between
>> the two of them, so convert them to count bytes written/dirtied, and allow the
>> metadata accounting stuff to change the counters as well.
>>
>> Signed-off-by: Josef Bacik <jbacik@fb.com>
>
> Acked-by: Tejun Heo <tj@kernel.org>
>
> A small nit below.
>
>> @@ -2547,12 +2547,16 @@ void account_page_redirty(struct page *page)
>>  	if (mapping && mapping_cap_account_dirty(mapping)) {
>>  		struct inode *inode = mapping->host;
>>  		struct bdi_writeback *wb;
>> +		unsigned long flags;
>>  		bool locked;
>>
>>  		wb = unlocked_inode_to_wb_begin(inode, &locked);
>>  		current->nr_dirtied--;
>> -		dec_node_page_state(page, NR_DIRTIED);
>> -		dec_wb_stat(wb, WB_DIRTIED);
>> +
>> +		local_irq_save(flags);
>> +		__dec_node_page_state(page, NR_DIRTIED);
>> +		__add_wb_stat(wb, WB_DIRTIED_BYTES, -(long)PAGE_SIZE);
>> +		local_irq_restore(flags);
>
> Hmmm... so, the explicit irq clustering is neutral or win as the code
> currently stands but AFAICS that's just because add_wb_stat() doesn't
> use the right percpu ops.  If we convert add_wb_stat() to use the
> matching percpu ops, the above change would be more expensive at least
> on x86.  Maybe just skip this part?
>

Yeah I can convert it over and then just not do the irq dance.  I'll fix that up 
for the next go around.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] writeback: add counters for metadata usage
  2016-10-25 18:41 ` [PATCH 3/5] writeback: add counters for metadata usage Josef Bacik
@ 2016-10-25 19:50   ` Tejun Heo
  2016-10-26 15:20     ` Josef Bacik
  2016-10-30 15:36   ` Jan Kara
  1 sibling, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2016-10-25 19:50 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

Hello,

On Tue, Oct 25, 2016 at 02:41:42PM -0400, Josef Bacik wrote:
> Btrfs has no bounds except memory on the amount of dirty memory that we have in
> use for metadata.  Historically we have used a special inode so we could take
> advantage of the balance_dirty_pages throttling that comes with using pagecache.
> However as we'd like to support different blocksizes it would be nice to not
> have to rely on pagecache, but still get the balance_dirty_pages throttling
> without having to do it ourselves.
> 
> So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES.  These are
> zone and bdi_writeback counters to keep track of how many bytes we have in
> flight for METADATA.  We need to count in bytes as blocksizes could be
> percentages of pagesize.  We simply convert the bytes to number of pages where
> it is needed for the throttling.
> 
> Also introduce NR_METADATA_BYTES so we can keep track of the total amount of
> pages used for metadata on the system.  This is also needed so things like dirty
> throttling know that this is dirtyable memory as well and easily reclaimed.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Some nits.

It'd be nice to note that this patch just introduces new fields
without using them and thus doesn't cause any behavioral changes.

> @@ -51,6 +51,8 @@ static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
>  static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
>  
>  #define K(x) ((x) << (PAGE_SHIFT - 10))
> +#define BtoK(x) ((x) >> 10)

This would belong in a separate patch but any chance we can share
these definitions?  It's fine to have the definitions in a couple
places but these are getting duplicated in multiple spots and actually
getting confusing with K meaning pages to kilobytes.  I'm not sure how
it exactly should be tho.

> @@ -2473,6 +2504,100 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
>  EXPORT_SYMBOL(account_page_dirtied);
>  
>  /*

/**

> + * account_metadata_dirtied
> + * @page - the page being dirited
> + * @bdi - the bdi that owns this page
> + * @bytes - the number of bytes being dirtied
> + *
> + * Do the dirty page accounting for metadata pages that aren't backed by an
> + * address_space.
> + */
> +void account_metadata_dirtied(struct page *page, struct backing_dev_info *bdi,
> +			      long bytes)
> +{
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	__mod_node_page_state(page_pgdat(page), NR_METADATA_DIRTY_BYTES,
> +			      bytes);
> +	__add_wb_stat(&bdi->wb, WB_DIRTIED_BYTES, bytes);
> +	__add_wb_stat(&bdi->wb, WB_METADATA_DIRTY_BYTES, bytes);
> +	current->nr_dirtied++;
> +	task_io_account_write(bytes);
> +	this_cpu_inc(bdp_ratelimits);
> +	local_irq_restore(flags);

Again, I'm not sure about the explicit irq ops especially as some of
the counters are already irq safe.

> +}
> +EXPORT_SYMBOL(account_metadata_dirtied);
> +
> +/*

/**

> + * account_metadata_cleaned
> + * @page - the page being cleaned
> + * @bdi - the bdi that owns this page
> + * @bytes - the number of bytes cleaned
> + *
> + * Called on a no longer dirty metadata page.
> + */
> +void account_metadata_cleaned(struct page *page, struct backing_dev_info *bdi,
> +			      long bytes)
> +{
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	__mod_node_page_state(page_pgdat(page), NR_METADATA_DIRTY_BYTES,
> +			      -bytes);
> +	__add_wb_stat(&bdi->wb, WB_METADATA_DIRTY_BYTES, -bytes);
> +	task_io_account_cancelled_write(bytes);
> +	local_irq_restore(flags);

Ditto with irq and the following functions.

> @@ -3701,7 +3703,20 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
>  	if (unlikely(delta > nr_pagecache_reclaimable))
>  		delta = nr_pagecache_reclaimable;
>  
> -	return nr_pagecache_reclaimable - delta;
> +	nr_metadata_reclaimable =
> +		node_page_state(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT;
> +	/*
> +	 * We don't do writeout through the shrinkers so subtract any
> +	 * dirty/writeback metadata bytes from the reclaimable count.
> +	 */

Hmm... up until this point, the dirty metadata was handled the same
way as regular dirty data but it deviates here.  Is this right?  The
calculations in writeback code also assumes that the dirty pages are
reclaimable.  If this is inherently different, it'd be nice to explain
more explicitly why this is different from others.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/5] writeback: introduce super_operations->write_metadata
  2016-10-25 18:41 ` [PATCH 4/5] writeback: introduce super_operations->write_metadata Josef Bacik
@ 2016-10-25 20:00   ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2016-10-25 20:00 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

Hello,

On Tue, Oct 25, 2016 at 02:41:43PM -0400, Josef Bacik wrote:
> Now that we have metadata counters in the VM, we need to provide a way to kick
> writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
> allows file systems to deal with writing back any dirty metadata we need based
> on the writeback needs of the system.  Since there is no inode to key off of we
> need a list in the bdi for dirty super blocks to be added.  From there we can
> find any dirty sb's on the bdi we are currently doing writeback on and call into
> their ->write_metadata callback.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

Reviewed-by: Tejun Heo <tj@kernel.org>

> @@ -1491,6 +1516,7 @@ static long writeback_sb_inodes(struct super_block *sb,
>  	unsigned long start_time = jiffies;
>  	long write_chunk;
>  	long wrote = 0;  /* count both pages and inodes */
> +	bool done = false;
>  
>  	while (!list_empty(&wb->b_io)) {
>  		struct inode *inode = wb_inode(wb->b_io.prev);
> @@ -1607,12 +1633,18 @@ static long writeback_sb_inodes(struct super_block *sb,
>  		 * background threshold and other termination conditions.
>  		 */
>  		if (wrote) {
> -			if (time_is_before_jiffies(start_time + HZ / 10UL))
> -				break;
> -			if (work->nr_pages <= 0)
> +			if (time_is_before_jiffies(start_time + HZ / 10UL) ||
> +			    work->nr_pages <= 0) {
> +				done = true;
>  				break;
> +			}
>  		}
>  	}
> +	if (!done && sb->s_op->write_metadata) {
> +		spin_unlock(&wb->list_lock);
> +		wrote += writeback_sb_metadata(sb, wb, work);
> +		spin_lock(&wb->list_lock);

So, here and below, we're writing out metadata only after finishing
writing out data if the writeout budget is limited.  It'd be great to
document the behavior and why it's so in the callback definition.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-25 18:41 ` [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list Josef Bacik
@ 2016-10-25 22:01   ` Dave Chinner
  2016-10-25 23:36     ` Dave Chinner
  2016-10-26 15:11     ` Josef Bacik
  2016-10-25 22:44   ` Omar Sandoval
  1 sibling, 2 replies; 26+ messages in thread
From: Dave Chinner @ 2016-10-25 22:01 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, jack, linux-fsdevel, viro, hch, jweiner

On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> With anything that populates the inode/dentry cache with a lot of one time use
> inodes we can really put a lot of pressure on the system for things we don't
> need to keep in cache.  It takes two runs through the LRU to evict these one use
> entries, and if you have a lot of memory you can end up with 10's of millions of
> entries in the dcache or icache that have never actually been touched since they
> were first instantiated, and it will take a lot of CPU and a lot of pressure to
> evict all of them.
> 
> So instead do what we do with pagecache, only set the *REFERENCED flags if we
> are being used after we've been put onto the LRU.  This makes a significant
> difference in the system's ability to evict these useless cache entries.  With a
> fs_mark workload that creates 40 million files we get the following results (all
> in files/sec)

What's the workload, storage, etc?

> Btrfs			Patched		Unpatched
> Average Files/sec:	72209.3		63254.2
> p50 Files/sec:	70850		57560
> p90 Files/sec:	68757		53085
> p99 Files/sec:	68757		53085

So how much of this is from changing the dentry referenced
behaviour, and how much from the inode? Can you separate out the two
changes so we know which one is actually affecting reclaim
performance?

Indeed, I wonder if just changing the superblock shrinker
default_seeks for btrfs would have exactly the same impact because
that canbe used to exactly double the reclaim scan rate for the same
memory pressure.  If that doesn't change performance by a similar
amount (changing defaults seeks is the normal way of changing
shrinker balance), then more digging is required here to explain why
the referenced bits make such an impact to steady state
performance...

> XFS			Patched		Unpatched
> Average Files/sec:	61025.5		60719.5
> p50 Files/sec:	60107		59465
> p90 Files/sec:	59300		57966
> p99 Files/sec:	59227		57528

You made XFS never use I_REFERENCED at all (hint: not all
filesystems use find_inode/find_inode_fast()), so it's not clear
that the extra scanning (less than 1% difference in average
throughput) is actuallly the cause of things being slower in btrfs.

> The reason Btrfs has a much larger improvement is because it holds a lot more
> things in memory so benefits more from faster slab reclaim, but across the board
> is an improvement for each of the file systems.

Less than 1% for XFS and ~1.5% for ext4 is well within the
run-to-run variation of fsmark. It looks like it might be slightly
faster, but it's not a cut-and-dried win for anything other than
btrfs.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-25 18:41 ` [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list Josef Bacik
  2016-10-25 22:01   ` Dave Chinner
@ 2016-10-25 22:44   ` Omar Sandoval
  2016-10-26  4:17     ` [PATCH 5/5] " Andreas Dilger
  1 sibling, 1 reply; 26+ messages in thread
From: Omar Sandoval @ 2016-10-25 22:44 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> With anything that populates the inode/dentry cache with a lot of one time use
> inodes we can really put a lot of pressure on the system for things we don't
> need to keep in cache.  It takes two runs through the LRU to evict these one use
> entries, and if you have a lot of memory you can end up with 10's of millions of
> entries in the dcache or icache that have never actually been touched since they
> were first instantiated, and it will take a lot of CPU and a lot of pressure to
> evict all of them.
> 
> So instead do what we do with pagecache, only set the *REFERENCED flags if we
> are being used after we've been put onto the LRU.  This makes a significant
> difference in the system's ability to evict these useless cache entries.  With a
> fs_mark workload that creates 40 million files we get the following results (all
> in files/sec)
> 
> Btrfs			Patched		Unpatched
> Average Files/sec:	72209.3		63254.2
> p50 Files/sec:		70850		57560
> p90 Files/sec:		68757		53085
> p99 Files/sec:		68757		53085
> 
> XFS			Patched		Unpatched
> Average Files/sec:	61025.5		60719.5
> p50 Files/sec:		60107		59465
> p90 Files/sec: 		59300		57966
> p99 Files/sec: 		59227		57528
> 
> Ext4			Patched		Unpatched
> Average Files/sec:	121785.4	119248.0
> p50 Files/sec:		120852		119274
> p90 Files/sec:		116703		112783
> p99 Files/sec:		114393		104934
> 
> The reason Btrfs has a much larger improvement is because it holds a lot more
> things in memory so benefits more from faster slab reclaim, but across the board
> is an improvement for each of the file systems.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/dcache.c | 15 ++++++++++-----
>  fs/inode.c  |  5 ++++-
>  2 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 5c7cc95..a558075 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -779,8 +779,6 @@ repeat:
>  			goto kill_it;
>  	}
>  
> -	if (!(dentry->d_flags & DCACHE_REFERENCED))
> -		dentry->d_flags |= DCACHE_REFERENCED;
>  	dentry_lru_add(dentry);
>  
>  	dentry->d_lockref.count--;
> @@ -803,6 +801,13 @@ static inline void __dget_dlock(struct dentry *dentry)
>  	dentry->d_lockref.count++;
>  }
>  
> +static inline void __dget_dlock_reference(struct dentry *dentry)
> +{
> +	if (dentry->d_flags & DCACHE_LRU_LIST)
> +		dentry->d_flags |= DCACHE_REFERENCED;
> +	dentry->d_lockref.count++;
> +}
> +
>  static inline void __dget(struct dentry *dentry)
>  {
>  	lockref_get(&dentry->d_lockref);
> @@ -875,7 +880,7 @@ again:
>  			    (alias->d_flags & DCACHE_DISCONNECTED)) {
>  				discon_alias = alias;
>  			} else {
> -				__dget_dlock(alias);
> +				__dget_dlock_reference(alias);
>  				spin_unlock(&alias->d_lock);
>  				return alias;
>  			}
> @@ -886,7 +891,7 @@ again:
>  		alias = discon_alias;
>  		spin_lock(&alias->d_lock);
>  		if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
> -			__dget_dlock(alias);
> +			__dget_dlock_reference(alias);
>  			spin_unlock(&alias->d_lock);
>  			return alias;
>  		}
> @@ -2250,7 +2255,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
>  		if (!d_same_name(dentry, parent, name))
>  			goto next;
>  
> -		dentry->d_lockref.count++;
> +		__dget_dlock_reference(dentry);

This misses dentries that we get through __d_lookup_rcu(), so I think
your change made it so most things aren't getting DCACHE_REFERENCED set
at all. Maybe something like this instead?

diff --git a/fs/dcache.c b/fs/dcache.c
index 5c7cc95..d7a56a8 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -412,15 +412,6 @@ static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry,
 	list_lru_isolate_move(lru, &dentry->d_lru, list);
 }
 
-/*
- * dentry_lru_(add|del)_list) must be called with d_lock held.
- */
-static void dentry_lru_add(struct dentry *dentry)
-{
-	if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST)))
-		d_lru_add(dentry);
-}
-
 /**
  * d_drop - drop a dentry
  * @dentry: dentry to drop
@@ -779,9 +770,12 @@ void dput(struct dentry *dentry)
 			goto kill_it;
 	}
 
-	if (!(dentry->d_flags & DCACHE_REFERENCED))
-		dentry->d_flags |= DCACHE_REFERENCED;
-	dentry_lru_add(dentry);
+	if (likely(dentry->d_flags & DCACHE_LRU_LIST)) {
+		if (!(dentry->d_flags & DCACHE_REFERENCED))
+			dentry->d_flags |= DCACHE_REFERENCED;
+	} else {
+		d_lru_add(dentry);
+	}
 
 	dentry->d_lockref.count--;
 	spin_unlock(&dentry->d_lock);
diff --git a/fs/inode.c b/fs/inode.c
index 88110fd..16faca3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -798,6 +798,8 @@ static struct inode *find_inode(struct super_block *sb,
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
+		if (!list_empty(&inode->i_lru))
+			inode->i_state |= I_REFERENCED;
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		return inode;
@@ -825,6 +827,8 @@ static struct inode *find_inode_fast(struct super_block *sb,
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
+		if (!list_empty(&inode->i_lru))
+			inode->i_state |= I_REFERENCED;
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		return inode;
@@ -1492,7 +1496,6 @@ static void iput_final(struct inode *inode)
 		drop = generic_drop_inode(inode);
 
 	if (!drop && (sb->s_flags & MS_ACTIVE)) {
-		inode->i_state |= I_REFERENCED;
 		inode_add_lru(inode);
 		spin_unlock(&inode->i_lock);
 		return;

-- 
Omar

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-25 22:01   ` Dave Chinner
@ 2016-10-25 23:36     ` Dave Chinner
  2016-10-26 20:03       ` Josef Bacik
  2016-10-26 15:11     ` Josef Bacik
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2016-10-25 23:36 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, jack, linux-fsdevel, viro, hch, jweiner

On Wed, Oct 26, 2016 at 09:01:13AM +1100, Dave Chinner wrote:
> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> > With anything that populates the inode/dentry cache with a lot of one time use
> > inodes we can really put a lot of pressure on the system for things we don't
> > need to keep in cache.  It takes two runs through the LRU to evict these one use
> > entries, and if you have a lot of memory you can end up with 10's of millions of
> > entries in the dcache or icache that have never actually been touched since they
> > were first instantiated, and it will take a lot of CPU and a lot of pressure to
> > evict all of them.
> > 
> > So instead do what we do with pagecache, only set the *REFERENCED flags if we
> > are being used after we've been put onto the LRU.  This makes a significant
> > difference in the system's ability to evict these useless cache entries.  With a
> > fs_mark workload that creates 40 million files we get the following results (all
> > in files/sec)
> 
> What's the workload, storage, etc?
> 
> > Btrfs			Patched		Unpatched
> > Average Files/sec:	72209.3		63254.2
> > p50 Files/sec:	70850		57560
> > p90 Files/sec:	68757		53085
> > p99 Files/sec:	68757		53085
> 
> So how much of this is from changing the dentry referenced
> behaviour, and how much from the inode? Can you separate out the two
> changes so we know which one is actually affecting reclaim
> performance?

FWIW, I just went to run my usual zero-length file creation fsmark
test (16-way create, large sparse FS image on SSDs). XFS (with debug
enabled) takes 4m10s to run at an average of ~230k files/s, with a
std deviation of +/-1.7k files/s.

Unfortunately, btrfs turns that into more heat than it ever has done
before. It's only managing 35k files/s and the profile looks like
this:

  58.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   5.61%  [kernel]  [k] queued_write_lock_slowpath
   1.65%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   1.38%  [kernel]  [k] reschedule_interrupt
   1.08%  [kernel]  [k] _raw_spin_lock
   0.92%  [kernel]  [k] __radix_tree_lookup
   0.86%  [kernel]  [k] _raw_spin_lock_irqsave
   0.83%  [kernel]  [k] btrfs_set_lock_blocking_rw

I killed it because this would take too long to run.

I reduced the concurrency down to 4-way, spinlock contention went
down to about 40% of the CPU time.  I reduced the concurrency down
to 2 and saw about 16% of CPU time being spent in lock contention.

Throughput results:
				btrfs throughput
			2-way			4-way
unpatched	46938.1+/-2.8e+03	40273.4+/-3.1e+03
patched		45697.2+/-2.4e+03	49287.1+/-3e+03


So, 2-way has not improved. If changing referenced behaviour was an
obvious win for btrfs, we'd expect to see that here as well.
however, because 4-way improved by 20%, I think all we're seeing is
a slight change in lock contention levels inside btrfs. Indeed,
looking at the profiles I see that lock contention time was reduced
to around 32% of the total CPU used (down by about 20%):

  20.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   3.85%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   3.68%  [kernel]  [k] _raw_spin_lock
   3.40%  [kernel]  [k] queued_write_lock_slowpath
   .....

IOWs, the performance increase comes from the fact that changing
inode cache eviction patterns causes slightly less lock contention
in btrfs inode reclaim. IOWs, the problem that needs fixing is the
btrfs lock contention, not the VFS cache LRU algorithms.

Root cause analysis needs to be done properly before behavioural
changes like this are proposed, people!

-Dave.

PS: I caught this profile on unmount when the 8 million inodes
cached inodes were being reclaimed. evict_inodes() ignores the
referenced bit, so this gives a lot of insight into the work being
done by inode reclaim in a filesystem:

  18.54%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   9.43%  [kernel]  [k] rb_erase
   8.03%  [kernel]  [k] __btrfs_release_delayed_node
   7.23%  [kernel]  [k] _raw_spin_lock
   6.93%  [kernel]  [k] __list_del_entry
   4.35%  [kernel]  [k] __slab_free
   3.93%  [kernel]  [k] __mutex_lock_slowpath
   2.77%  [kernel]  [k] bit_waitqueue
   2.58%  [kernel]  [k] kmem_cache_alloc
   2.50%  [kernel]  [k] __radix_tree_lookup
   2.44%  [kernel]  [k] _raw_spin_lock_irq
   2.18%  [kernel]  [k] kmem_cache_free
   2.17%  [kernel]  [k] evict			<<<<<<<<<<<
   2.13%  [kernel]  [k] fsnotify_destroy_marks
   1.68%  [kernel]  [k] btrfs_remove_delayed_node
   1.61%  [kernel]  [k] __call_rcu.constprop.70
   1.50%  [kernel]  [k] __remove_inode_hash
   1.49%  [kernel]  [k] kmem_cache_alloc_trace
   1.39%  [kernel]  [k] ___might_sleep
   1.15%  [kernel]  [k] __memset
   1.12%  [kernel]  [k] __mutex_unlock_slowpath
   1.03%  [kernel]  [k] evict_inodes		<<<<<<<<<<
   1.02%  [kernel]  [k] cmpxchg_double_slab.isra.66
   0.93%  [kernel]  [k] free_extent_map
   0.83%  [kernel]  [k] _raw_write_lock
   0.69%  [kernel]  [k] __might_sleep
   0.56%  [kernel]  [k] _raw_spin_unlock

The VFS inode cache traversal to evict inodes is a very small part
of the CPU usage being recorded. btrfs lock traffic alone accounts
for more than 10x as much CPU usage as inode cache traversal....
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] don't set *REFERENCED unless we are on the lru list
  2016-10-25 22:44   ` Omar Sandoval
@ 2016-10-26  4:17     ` Andreas Dilger
  2016-10-26  5:24       ` Omar Sandoval
  0 siblings, 1 reply; 26+ messages in thread
From: Andreas Dilger @ 2016-10-26  4:17 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Josef Bacik, linux-btrfs, kernel-team, david, jack,
	linux-fsdevel, viro, hch, jweiner

[-- Attachment #1: Type: text/plain, Size: 6249 bytes --]

On Oct 25, 2016, at 4:44 PM, Omar Sandoval <osandov@osandov.com> wrote:
> 
> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
>> With anything that populates the inode/dentry cache with a lot of one time use
>> inodes we can really put a lot of pressure on the system for things we don't
>> need to keep in cache.  It takes two runs through the LRU to evict these one use
>> entries, and if you have a lot of memory you can end up with 10's of millions of
>> entries in the dcache or icache that have never actually been touched since they
>> were first instantiated, and it will take a lot of CPU and a lot of pressure to
>> evict all of them.
>> 
>> So instead do what we do with pagecache, only set the *REFERENCED flags if we
>> are being used after we've been put onto the LRU.  This makes a significant
>> difference in the system's ability to evict these useless cache entries.  With a
>> fs_mark workload that creates 40 million files we get the following results (all
>> in files/sec)
>> 
>> Btrfs			Patched		Unpatched
>> Average Files/sec:	72209.3		63254.2
>> p50 Files/sec:		70850		57560
>> p90 Files/sec:		68757		53085
>> p99 Files/sec:		68757		53085
>> 
>> XFS			Patched		Unpatched
>> Average Files/sec:	61025.5		60719.5
>> p50 Files/sec:		60107		59465
>> p90 Files/sec: 		59300		57966
>> p99 Files/sec: 		59227		57528
>> 
>> Ext4			Patched		Unpatched
>> Average Files/sec:	121785.4	119248.0
>> p50 Files/sec:		120852		119274
>> p90 Files/sec:		116703		112783
>> p99 Files/sec:		114393		104934
>> 
>> The reason Btrfs has a much larger improvement is because it holds a lot more
>> things in memory so benefits more from faster slab reclaim, but across the board
>> is an improvement for each of the file systems.
>> 
>> Signed-off-by: Josef Bacik <jbacik@fb.com>
>> ---
>> fs/dcache.c | 15 ++++++++++-----
>> fs/inode.c  |  5 ++++-
>> 2 files changed, 14 insertions(+), 6 deletions(-)
>> 
>> diff --git a/fs/dcache.c b/fs/dcache.c
>> index 5c7cc95..a558075 100644
>> --- a/fs/dcache.c
>> +++ b/fs/dcache.c
>> @@ -779,8 +779,6 @@ repeat:
>> 			goto kill_it;
>> 	}
>> 
>> -	if (!(dentry->d_flags & DCACHE_REFERENCED))
>> -		dentry->d_flags |= DCACHE_REFERENCED;
>> 	dentry_lru_add(dentry);
>> 
>> 	dentry->d_lockref.count--;
>> @@ -803,6 +801,13 @@ static inline void __dget_dlock(struct dentry *dentry)
>> 	dentry->d_lockref.count++;
>> }
>> 
>> +static inline void __dget_dlock_reference(struct dentry *dentry)
>> +{
>> +	if (dentry->d_flags & DCACHE_LRU_LIST)
>> +		dentry->d_flags |= DCACHE_REFERENCED;
>> +	dentry->d_lockref.count++;
>> +}
>> +
>> static inline void __dget(struct dentry *dentry)
>> {
>> 	lockref_get(&dentry->d_lockref);
>> @@ -875,7 +880,7 @@ again:
>> 			    (alias->d_flags & DCACHE_DISCONNECTED)) {
>> 				discon_alias = alias;
>> 			} else {
>> -				__dget_dlock(alias);
>> +				__dget_dlock_reference(alias);
>> 				spin_unlock(&alias->d_lock);
>> 				return alias;
>> 			}
>> @@ -886,7 +891,7 @@ again:
>> 		alias = discon_alias;
>> 		spin_lock(&alias->d_lock);
>> 		if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
>> -			__dget_dlock(alias);
>> +			__dget_dlock_reference(alias);
>> 			spin_unlock(&alias->d_lock);
>> 			return alias;
>> 		}
>> @@ -2250,7 +2255,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
>> 		if (!d_same_name(dentry, parent, name))
>> 			goto next;
>> 
>> -		dentry->d_lockref.count++;
>> +		__dget_dlock_reference(dentry);
> 
> This misses dentries that we get through __d_lookup_rcu(), so I think
> your change made it so most things aren't getting DCACHE_REFERENCED set
> at all. Maybe something like this instead?
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 5c7cc95..d7a56a8 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -412,15 +412,6 @@ static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry,
> 	list_lru_isolate_move(lru, &dentry->d_lru, list);
> }
> 
> -/*
> - * dentry_lru_(add|del)_list) must be called with d_lock held.
> - */
> -static void dentry_lru_add(struct dentry *dentry)
> -{
> -	if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST)))
> -		d_lru_add(dentry);
> -}
> -
> /**
>  * d_drop - drop a dentry
>  * @dentry: dentry to drop
> @@ -779,9 +770,12 @@ void dput(struct dentry *dentry)
> 			goto kill_it;
> 	}
> 
> -	if (!(dentry->d_flags & DCACHE_REFERENCED))
> -		dentry->d_flags |= DCACHE_REFERENCED;
> -	dentry_lru_add(dentry);
> +	if (likely(dentry->d_flags & DCACHE_LRU_LIST)) {
> +		if (!(dentry->d_flags & DCACHE_REFERENCED))
> +			dentry->d_flags |= DCACHE_REFERENCED;

There is no benefit to checking if DCACHE_REFERENCED is unset before
trying to set it.  You've already accessed the cacheline, so you can
avoid the branch here and just set it unconditionally.

Cheers, Andreas

> +	} else {
> +		d_lru_add(dentry);
> +	}
> 
> 	dentry->d_lockref.count--;
> 	spin_unlock(&dentry->d_lock);
> diff --git a/fs/inode.c b/fs/inode.c
> index 88110fd..16faca3 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -798,6 +798,8 @@ static struct inode *find_inode(struct super_block *sb,
> 			__wait_on_freeing_inode(inode);
> 			goto repeat;
> 		}
> +		if (!list_empty(&inode->i_lru))
> +			inode->i_state |= I_REFERENCED;
> 		__iget(inode);
> 		spin_unlock(&inode->i_lock);
> 		return inode;
> @@ -825,6 +827,8 @@ static struct inode *find_inode_fast(struct super_block *sb,
> 			__wait_on_freeing_inode(inode);
> 			goto repeat;
> 		}
> +		if (!list_empty(&inode->i_lru))
> +			inode->i_state |= I_REFERENCED;
> 		__iget(inode);
> 		spin_unlock(&inode->i_lock);
> 		return inode;
> @@ -1492,7 +1496,6 @@ static void iput_final(struct inode *inode)
> 		drop = generic_drop_inode(inode);
> 
> 	if (!drop && (sb->s_flags & MS_ACTIVE)) {
> -		inode->i_state |= I_REFERENCED;
> 		inode_add_lru(inode);
> 		spin_unlock(&inode->i_lock);
> 		return;
> 
> --
> Omar
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] don't set *REFERENCED unless we are on the lru list
  2016-10-26  4:17     ` [PATCH 5/5] " Andreas Dilger
@ 2016-10-26  5:24       ` Omar Sandoval
  0 siblings, 0 replies; 26+ messages in thread
From: Omar Sandoval @ 2016-10-26  5:24 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Josef Bacik, linux-btrfs, kernel-team, david, jack,
	linux-fsdevel, viro, hch, jweiner

On Tue, Oct 25, 2016 at 10:17:19PM -0600, Andreas Dilger wrote:
> On Oct 25, 2016, at 4:44 PM, Omar Sandoval <osandov@osandov.com> wrote:
> > 
> > On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> >> With anything that populates the inode/dentry cache with a lot of one time use
> >> inodes we can really put a lot of pressure on the system for things we don't
> >> need to keep in cache.  It takes two runs through the LRU to evict these one use
> >> entries, and if you have a lot of memory you can end up with 10's of millions of
> >> entries in the dcache or icache that have never actually been touched since they
> >> were first instantiated, and it will take a lot of CPU and a lot of pressure to
> >> evict all of them.
> >> 
> >> So instead do what we do with pagecache, only set the *REFERENCED flags if we
> >> are being used after we've been put onto the LRU.  This makes a significant
> >> difference in the system's ability to evict these useless cache entries.  With a
> >> fs_mark workload that creates 40 million files we get the following results (all
> >> in files/sec)
> >> 
> >> Btrfs			Patched		Unpatched
> >> Average Files/sec:	72209.3		63254.2
> >> p50 Files/sec:		70850		57560
> >> p90 Files/sec:		68757		53085
> >> p99 Files/sec:		68757		53085
> >> 
> >> XFS			Patched		Unpatched
> >> Average Files/sec:	61025.5		60719.5
> >> p50 Files/sec:		60107		59465
> >> p90 Files/sec: 		59300		57966
> >> p99 Files/sec: 		59227		57528
> >> 
> >> Ext4			Patched		Unpatched
> >> Average Files/sec:	121785.4	119248.0
> >> p50 Files/sec:		120852		119274
> >> p90 Files/sec:		116703		112783
> >> p99 Files/sec:		114393		104934
> >> 
> >> The reason Btrfs has a much larger improvement is because it holds a lot more
> >> things in memory so benefits more from faster slab reclaim, but across the board
> >> is an improvement for each of the file systems.
> >> 
> >> Signed-off-by: Josef Bacik <jbacik@fb.com>
> >> ---
> >> fs/dcache.c | 15 ++++++++++-----
> >> fs/inode.c  |  5 ++++-
> >> 2 files changed, 14 insertions(+), 6 deletions(-)
> >> 
> >> diff --git a/fs/dcache.c b/fs/dcache.c
> >> index 5c7cc95..a558075 100644
> >> --- a/fs/dcache.c
> >> +++ b/fs/dcache.c
> >> @@ -779,8 +779,6 @@ repeat:
> >> 			goto kill_it;
> >> 	}
> >> 
> >> -	if (!(dentry->d_flags & DCACHE_REFERENCED))
> >> -		dentry->d_flags |= DCACHE_REFERENCED;
> >> 	dentry_lru_add(dentry);
> >> 
> >> 	dentry->d_lockref.count--;
> >> @@ -803,6 +801,13 @@ static inline void __dget_dlock(struct dentry *dentry)
> >> 	dentry->d_lockref.count++;
> >> }
> >> 
> >> +static inline void __dget_dlock_reference(struct dentry *dentry)
> >> +{
> >> +	if (dentry->d_flags & DCACHE_LRU_LIST)
> >> +		dentry->d_flags |= DCACHE_REFERENCED;
> >> +	dentry->d_lockref.count++;
> >> +}
> >> +
> >> static inline void __dget(struct dentry *dentry)
> >> {
> >> 	lockref_get(&dentry->d_lockref);
> >> @@ -875,7 +880,7 @@ again:
> >> 			    (alias->d_flags & DCACHE_DISCONNECTED)) {
> >> 				discon_alias = alias;
> >> 			} else {
> >> -				__dget_dlock(alias);
> >> +				__dget_dlock_reference(alias);
> >> 				spin_unlock(&alias->d_lock);
> >> 				return alias;
> >> 			}
> >> @@ -886,7 +891,7 @@ again:
> >> 		alias = discon_alias;
> >> 		spin_lock(&alias->d_lock);
> >> 		if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
> >> -			__dget_dlock(alias);
> >> +			__dget_dlock_reference(alias);
> >> 			spin_unlock(&alias->d_lock);
> >> 			return alias;
> >> 		}
> >> @@ -2250,7 +2255,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
> >> 		if (!d_same_name(dentry, parent, name))
> >> 			goto next;
> >> 
> >> -		dentry->d_lockref.count++;
> >> +		__dget_dlock_reference(dentry);
> > 
> > This misses dentries that we get through __d_lookup_rcu(), so I think
> > your change made it so most things aren't getting DCACHE_REFERENCED set
> > at all. Maybe something like this instead?
> > 
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index 5c7cc95..d7a56a8 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -412,15 +412,6 @@ static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry,
> > 	list_lru_isolate_move(lru, &dentry->d_lru, list);
> > }
> > 
> > -/*
> > - * dentry_lru_(add|del)_list) must be called with d_lock held.
> > - */
> > -static void dentry_lru_add(struct dentry *dentry)
> > -{
> > -	if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST)))
> > -		d_lru_add(dentry);
> > -}
> > -
> > /**
> >  * d_drop - drop a dentry
> >  * @dentry: dentry to drop
> > @@ -779,9 +770,12 @@ void dput(struct dentry *dentry)
> > 			goto kill_it;
> > 	}
> > 
> > -	if (!(dentry->d_flags & DCACHE_REFERENCED))
> > -		dentry->d_flags |= DCACHE_REFERENCED;
> > -	dentry_lru_add(dentry);
> > +	if (likely(dentry->d_flags & DCACHE_LRU_LIST)) {
> > +		if (!(dentry->d_flags & DCACHE_REFERENCED))
> > +			dentry->d_flags |= DCACHE_REFERENCED;
> 
> There is no benefit to checking if DCACHE_REFERENCED is unset before
> trying to set it.  You've already accessed the cacheline, so you can
> avoid the branch here and just set it unconditionally.
> 
> Cheers, Andreas

We've accessed the cacheline, but it can still be shared, no? If we did
this unconditionally, we'd dirty the cacheline on every dput() as
opposed to only the first and second times it's been put.

-- 
Omar

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-25 22:01   ` Dave Chinner
  2016-10-25 23:36     ` Dave Chinner
@ 2016-10-26 15:11     ` Josef Bacik
  2016-10-27  0:30       ` Dave Chinner
  1 sibling, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2016-10-26 15:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-btrfs, kernel-team, jack, linux-fsdevel, viro, hch, jweiner

On 10/25/2016 06:01 PM, Dave Chinner wrote:
> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
>> With anything that populates the inode/dentry cache with a lot of one time use
>> inodes we can really put a lot of pressure on the system for things we don't
>> need to keep in cache.  It takes two runs through the LRU to evict these one use
>> entries, and if you have a lot of memory you can end up with 10's of millions of
>> entries in the dcache or icache that have never actually been touched since they
>> were first instantiated, and it will take a lot of CPU and a lot of pressure to
>> evict all of them.
>>
>> So instead do what we do with pagecache, only set the *REFERENCED flags if we
>> are being used after we've been put onto the LRU.  This makes a significant
>> difference in the system's ability to evict these useless cache entries.  With a
>> fs_mark workload that creates 40 million files we get the following results (all
>> in files/sec)
>
> What's the workload, storage, etc?

Oops sorry I thought I said it.  It's fs_mark creating 20 million empty files on 
a single NVME drive.

>
>> Btrfs			Patched		Unpatched
>> Average Files/sec:	72209.3		63254.2
>> p50 Files/sec:	70850		57560
>> p90 Files/sec:	68757		53085
>> p99 Files/sec:	68757		53085
>
> So how much of this is from changing the dentry referenced
> behaviour, and how much from the inode? Can you separate out the two
> changes so we know which one is actually affecting reclaim
> performance?
>
> Indeed, I wonder if just changing the superblock shrinker
> default_seeks for btrfs would have exactly the same impact because
> that canbe used to exactly double the reclaim scan rate for the same
> memory pressure.  If that doesn't change performance by a similar
> amount (changing defaults seeks is the normal way of changing
> shrinker balance), then more digging is required here to explain why
> the referenced bits make such an impact to steady state
> performance...
>

I'll tease out the impact of changing dcache vs icache vs both.  Yeah I'll 
reduce default_seeks and see what that turns out to be.

>> XFS			Patched		Unpatched
>> Average Files/sec:	61025.5		60719.5
>> p50 Files/sec:	60107		59465
>> p90 Files/sec:	59300		57966
>> p99 Files/sec:	59227		57528
>
> You made XFS never use I_REFERENCED at all (hint: not all
> filesystems use find_inode/find_inode_fast()), so it's not clear
> that the extra scanning (less than 1% difference in average
> throughput) is actuallly the cause of things being slower in btrfs.
>
>> The reason Btrfs has a much larger improvement is because it holds a lot more
>> things in memory so benefits more from faster slab reclaim, but across the board
>> is an improvement for each of the file systems.
>
> Less than 1% for XFS and ~1.5% for ext4 is well within the
> run-to-run variation of fsmark. It looks like it might be slightly
> faster, but it's not a cut-and-dried win for anything other than
> btrfs.
>

Sure the win in this benchmark is clearly benefiting btrfs the most, but I think 
the overall approach is sound and likely to help everybody in theory.  Inside FB 
we definitely have had problems where the memory pressure induced by some 
idi^H^H^Hprocess goes along and runs find / which causes us to evict real things 
that are being used rather than these one use inodes.  This sort of behavior 
could possibly be mitigated by this patch, but I haven't sat down to figure out 
a reliable way to mirror this workload to test that theory.  Thanks

Josef


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] writeback: add counters for metadata usage
  2016-10-25 19:50   ` Tejun Heo
@ 2016-10-26 15:20     ` Josef Bacik
  2016-10-26 15:49       ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2016-10-26 15:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

On 10/25/2016 03:50 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Oct 25, 2016 at 02:41:42PM -0400, Josef Bacik wrote:
>> Btrfs has no bounds except memory on the amount of dirty memory that we have in
>> use for metadata.  Historically we have used a special inode so we could take
>> advantage of the balance_dirty_pages throttling that comes with using pagecache.
>> However as we'd like to support different blocksizes it would be nice to not
>> have to rely on pagecache, but still get the balance_dirty_pages throttling
>> without having to do it ourselves.
>>
>> So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES.  These are
>> zone and bdi_writeback counters to keep track of how many bytes we have in
>> flight for METADATA.  We need to count in bytes as blocksizes could be
>> percentages of pagesize.  We simply convert the bytes to number of pages where
>> it is needed for the throttling.
>>
>> Also introduce NR_METADATA_BYTES so we can keep track of the total amount of
>> pages used for metadata on the system.  This is also needed so things like dirty
>> throttling know that this is dirtyable memory as well and easily reclaimed.
>>
>> Signed-off-by: Josef Bacik <jbacik@fb.com>
>
> Some nits.
>
> It'd be nice to note that this patch just introduces new fields
> without using them and thus doesn't cause any behavioral changes.
>
>> @@ -51,6 +51,8 @@ static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
>>  static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
>>
>>  #define K(x) ((x) << (PAGE_SHIFT - 10))
>> +#define BtoK(x) ((x) >> 10)
>
> This would belong in a separate patch but any chance we can share
> these definitions?  It's fine to have the definitions in a couple
> places but these are getting duplicated in multiple spots and actually
> getting confusing with K meaning pages to kilobytes.  I'm not sure how
> it exactly should be tho.
>
>> @@ -2473,6 +2504,100 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
>>  EXPORT_SYMBOL(account_page_dirtied);
>>
>>  /*
>
> /**
>
>> + * account_metadata_dirtied
>> + * @page - the page being dirited
>> + * @bdi - the bdi that owns this page
>> + * @bytes - the number of bytes being dirtied
>> + *
>> + * Do the dirty page accounting for metadata pages that aren't backed by an
>> + * address_space.
>> + */
>> +void account_metadata_dirtied(struct page *page, struct backing_dev_info *bdi,
>> +			      long bytes)
>> +{
>> +	unsigned long flags;
>> +
>> +	local_irq_save(flags);
>> +	__mod_node_page_state(page_pgdat(page), NR_METADATA_DIRTY_BYTES,
>> +			      bytes);
>> +	__add_wb_stat(&bdi->wb, WB_DIRTIED_BYTES, bytes);
>> +	__add_wb_stat(&bdi->wb, WB_METADATA_DIRTY_BYTES, bytes);
>> +	current->nr_dirtied++;
>> +	task_io_account_write(bytes);
>> +	this_cpu_inc(bdp_ratelimits);
>> +	local_irq_restore(flags);
>
> Again, I'm not sure about the explicit irq ops especially as some of
> the counters are already irq safe.
>
>> +}
>> +EXPORT_SYMBOL(account_metadata_dirtied);
>> +
>> +/*
>
> /**
>
>> + * account_metadata_cleaned
>> + * @page - the page being cleaned
>> + * @bdi - the bdi that owns this page
>> + * @bytes - the number of bytes cleaned
>> + *
>> + * Called on a no longer dirty metadata page.
>> + */
>> +void account_metadata_cleaned(struct page *page, struct backing_dev_info *bdi,
>> +			      long bytes)
>> +{
>> +	unsigned long flags;
>> +
>> +	local_irq_save(flags);
>> +	__mod_node_page_state(page_pgdat(page), NR_METADATA_DIRTY_BYTES,
>> +			      -bytes);
>> +	__add_wb_stat(&bdi->wb, WB_METADATA_DIRTY_BYTES, -bytes);
>> +	task_io_account_cancelled_write(bytes);
>> +	local_irq_restore(flags);
>
> Ditto with irq and the following functions.
>
>> @@ -3701,7 +3703,20 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
>>  	if (unlikely(delta > nr_pagecache_reclaimable))
>>  		delta = nr_pagecache_reclaimable;
>>
>> -	return nr_pagecache_reclaimable - delta;
>> +	nr_metadata_reclaimable =
>> +		node_page_state(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT;
>> +	/*
>> +	 * We don't do writeout through the shrinkers so subtract any
>> +	 * dirty/writeback metadata bytes from the reclaimable count.
>> +	 */
>
> Hmm... up until this point, the dirty metadata was handled the same
> way as regular dirty data but it deviates here.  Is this right?  The
> calculations in writeback code also assumes that the dirty pages are
> reclaimable.  If this is inherently different, it'd be nice to explain
> more explicitly why this is different from others.
>

So there is logic above this that subtracts out the NR_FILE_DIRTY from the file 
reclaimable if we can't do write's during reclaim.  Since we can always wait on 
writeback during reclaim it doesn't subtract out writeback.  I took this to mean 
that the general expectation of this function is to only count thing things that 
the shrinker can specifically reclaim itself, so I discounted anything under io 
since the slab shrinkers have no idea if its ok to write or not and so in the 
case of btrfs simply skip anything that is dirty or under writeback.  Does that 
make sense?  I'll fix up the other issues you pointed out.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] writeback: add counters for metadata usage
  2016-10-26 15:20     ` Josef Bacik
@ 2016-10-26 15:49       ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2016-10-26 15:49 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

Hello, Josef.

On Wed, Oct 26, 2016 at 11:20:16AM -0400, Josef Bacik wrote:
> > > @@ -3701,7 +3703,20 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
> > >  	if (unlikely(delta > nr_pagecache_reclaimable))
> > >  		delta = nr_pagecache_reclaimable;
> > > 
> > > -	return nr_pagecache_reclaimable - delta;
> > > +	nr_metadata_reclaimable =
> > > +		node_page_state(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT;
> > > +	/*
> > > +	 * We don't do writeout through the shrinkers so subtract any
> > > +	 * dirty/writeback metadata bytes from the reclaimable count.
> > > +	 */
> > 
> > Hmm... up until this point, the dirty metadata was handled the same
> > way as regular dirty data but it deviates here.  Is this right?  The
> > calculations in writeback code also assumes that the dirty pages are
> > reclaimable.  If this is inherently different, it'd be nice to explain
> > more explicitly why this is different from others.
> 
> So there is logic above this that subtracts out the NR_FILE_DIRTY from the
> file reclaimable if we can't do write's during reclaim.  Since we can always
> wait on writeback during reclaim it doesn't subtract out writeback.  I took
> this to mean that the general expectation of this function is to only count
> thing things that the shrinker can specifically reclaim itself, so I
> discounted anything under io since the slab shrinkers have no idea if its ok
> to write or not and so in the case of btrfs simply skip anything that is
> dirty or under writeback.  Does that make sense?  I'll fix up the other
> issues you pointed out.  Thanks,

Yeap, that makes sense to me.

Thanks for the explanation.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-25 23:36     ` Dave Chinner
@ 2016-10-26 20:03       ` Josef Bacik
  2016-10-26 22:20         ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2016-10-26 20:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-btrfs, kernel-team, jack, linux-fsdevel, viro, hch, jweiner

On 10/25/2016 07:36 PM, Dave Chinner wrote:
> On Wed, Oct 26, 2016 at 09:01:13AM +1100, Dave Chinner wrote:
>> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
>>> With anything that populates the inode/dentry cache with a lot of one time use
>>> inodes we can really put a lot of pressure on the system for things we don't
>>> need to keep in cache.  It takes two runs through the LRU to evict these one use
>>> entries, and if you have a lot of memory you can end up with 10's of millions of
>>> entries in the dcache or icache that have never actually been touched since they
>>> were first instantiated, and it will take a lot of CPU and a lot of pressure to
>>> evict all of them.
>>>
>>> So instead do what we do with pagecache, only set the *REFERENCED flags if we
>>> are being used after we've been put onto the LRU.  This makes a significant
>>> difference in the system's ability to evict these useless cache entries.  With a
>>> fs_mark workload that creates 40 million files we get the following results (all
>>> in files/sec)
>>
>> What's the workload, storage, etc?
>>
>>> Btrfs			Patched		Unpatched
>>> Average Files/sec:	72209.3		63254.2
>>> p50 Files/sec:	70850		57560
>>> p90 Files/sec:	68757		53085
>>> p99 Files/sec:	68757		53085
>>
>> So how much of this is from changing the dentry referenced
>> behaviour, and how much from the inode? Can you separate out the two
>> changes so we know which one is actually affecting reclaim
>> performance?

(Ok figured out the problem, apparently outlook thinks you are spam and has been 
filtering all your emails into its junk folder without giving them to me.  I've 
fixed this so now we should be able to get somewhere)

>
> FWIW, I just went to run my usual zero-length file creation fsmark
> test (16-way create, large sparse FS image on SSDs). XFS (with debug
> enabled) takes 4m10s to run at an average of ~230k files/s, with a
> std deviation of +/-1.7k files/s.
>
> Unfortunately, btrfs turns that into more heat than it ever has done
> before. It's only managing 35k files/s and the profile looks like
> this:
>
>   58.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>    5.61%  [kernel]  [k] queued_write_lock_slowpath
>    1.65%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
>    1.38%  [kernel]  [k] reschedule_interrupt
>    1.08%  [kernel]  [k] _raw_spin_lock
>    0.92%  [kernel]  [k] __radix_tree_lookup
>    0.86%  [kernel]  [k] _raw_spin_lock_irqsave
>    0.83%  [kernel]  [k] btrfs_set_lock_blocking_rw
>
> I killed it because this would take too long to run.
>
> I reduced the concurrency down to 4-way, spinlock contention went
> down to about 40% of the CPU time.  I reduced the concurrency down
> to 2 and saw about 16% of CPU time being spent in lock contention.
>
> Throughput results:
> 				btrfs throughput
> 			2-way			4-way
> unpatched	46938.1+/-2.8e+03	40273.4+/-3.1e+03
> patched		45697.2+/-2.4e+03	49287.1+/-3e+03
>
>
> So, 2-way has not improved. If changing referenced behaviour was an
> obvious win for btrfs, we'd expect to see that here as well.
> however, because 4-way improved by 20%, I think all we're seeing is
> a slight change in lock contention levels inside btrfs. Indeed,
> looking at the profiles I see that lock contention time was reduced
> to around 32% of the total CPU used (down by about 20%):
>
>   20.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>    3.85%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
>    3.68%  [kernel]  [k] _raw_spin_lock
>    3.40%  [kernel]  [k] queued_write_lock_slowpath
>    .....
>
> IOWs, the performance increase comes from the fact that changing
> inode cache eviction patterns causes slightly less lock contention
> in btrfs inode reclaim. IOWs, the problem that needs fixing is the
> btrfs lock contention, not the VFS cache LRU algorithms.
>
> Root cause analysis needs to be done properly before behavioural
> changes like this are proposed, people!
>

We are doing different things.  Yes when you do it all into the same subvol the 
lock contention is obviously much worse and the bottleneck, but that's not what 
I'm doing, I'm creating a subvol for each thread to reduce the lock contention 
on the root nodes.  If you retest by doing that then you will see the 
differences I was seeing.  Are you suggesting that I just made these numbers up? 
  Instead of assuming I'm an idiot and don't know how to root cause issues 
please instead ask what is different from your run versus my run.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-26 20:03       ` Josef Bacik
@ 2016-10-26 22:20         ` Dave Chinner
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2016-10-26 22:20 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, jack, linux-fsdevel, viro, hch, jweiner

On Wed, Oct 26, 2016 at 04:03:54PM -0400, Josef Bacik wrote:
> On 10/25/2016 07:36 PM, Dave Chinner wrote:
> >So, 2-way has not improved. If changing referenced behaviour was an
> >obvious win for btrfs, we'd expect to see that here as well.
> >however, because 4-way improved by 20%, I think all we're seeing is
> >a slight change in lock contention levels inside btrfs. Indeed,
> >looking at the profiles I see that lock contention time was reduced
> >to around 32% of the total CPU used (down by about 20%):
> >
> >  20.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
> >   3.85%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
> >   3.68%  [kernel]  [k] _raw_spin_lock
> >   3.40%  [kernel]  [k] queued_write_lock_slowpath
> >   .....
> >
> >IOWs, the performance increase comes from the fact that changing
> >inode cache eviction patterns causes slightly less lock contention
> >in btrfs inode reclaim. IOWs, the problem that needs fixing is the
> >btrfs lock contention, not the VFS cache LRU algorithms.
> >
> >Root cause analysis needs to be done properly before behavioural
> >changes like this are proposed, people!
> >
> 
> We are doing different things.

Well, no, we're doing the same thing. Except...

> Yes when you do it all into the same
> subvol the lock contention is obviously much worse and the
> bottleneck, but that's not what I'm doing, I'm creating a subvol for
> each thread to reduce the lock contention on the root nodes.

.. you have a non-default filesystem config that you didn't mention
even in response to a /direct question/. Seriously, this is a major
configuration detail that is necessary for anyone who wants to
replicate your results!

The difference is that I'm using is the /default/ btrfs filesystem
configuration and you're using a carefully contrived structure that
is designed to work around fundamental scalability issues the
benchmark normally exposes.

This raises another question: Does this subvol setup reflect
production configurations, or is it simply a means to get the
benchmark to put enough load on the inode cache to demonstrate the
effect of changing the reference bits?

> If you
> retest by doing that then you will see the differences I was seeing.
> Are you suggesting that I just made these numbers up? 

No, I'm suggesting that you haven't explained the problem or how to
reproduce it in sufficient detail.

You haven't explained why reducing scanning (which in and of iteself
has minimal overhead and is not visible to btrfs) has such a marked
effect on the behaviour of btrfs. You haven't given us details on
the workload or storage config so we can reproduce the numbers (and
still haven't!). You're arguing that numbers far below variablity
and accuracy measurement limits are significant (i.e. the patched
numbers are better for XFS and ext4). There's lots of information
you've chosen to omit that we need to know.

To fill in the gaps, I've done some analysis, provided perf numbers
and CPU profiles, and put them into an explanation that explains why
there is a difference in performance for a default btrfs config. I
note that you've not actually addressed that analysis and the
problems it uncovers, but instead just said "I'm using a different
configuration".

What about all the btrfs users that aren't using your configuration
that will see these problems? How does the scanning change actually
benefit them more than fixing the locking issues that exist? Memory
reclaim and LRU algorithms affect everyone, not just a benchmark
configuration. We need to know a lot more about the issue at hand
to be able to evaluate whether this is the /right solution for the
problem./

> Instead of
> assuming I'm an idiot and don't know how to root cause issues please
> instead ask what is different from your run versus my run.  Thanks,

I'm assuming you're a compentent engineer, Josef, who knows that
details about benchmarks and performance measurement are extremely
important, and that someone reviewing code needs to understand why
the behaviour change had the impact it did to be able to evaluate it
properly. I'm assuming that you also know that if you understand the
root cause of a problem, you put it in the commit message so that
everyone else knows it, too.

So if you know what the root cause of the btrfs problem that
reducing reclaim scanning avoids, then please explain it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-26 15:11     ` Josef Bacik
@ 2016-10-27  0:30       ` Dave Chinner
  2016-10-27 13:13         ` Josef Bacik
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2016-10-27  0:30 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, jack, linux-fsdevel, viro, hch, jweiner

On Wed, Oct 26, 2016 at 11:11:35AM -0400, Josef Bacik wrote:
> On 10/25/2016 06:01 PM, Dave Chinner wrote:
> >On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> >>With anything that populates the inode/dentry cache with a lot of one time use
> >>inodes we can really put a lot of pressure on the system for things we don't
> >>need to keep in cache.  It takes two runs through the LRU to evict these one use
> >>entries, and if you have a lot of memory you can end up with 10's of millions of
> >>entries in the dcache or icache that have never actually been touched since they
> >>were first instantiated, and it will take a lot of CPU and a lot of pressure to
> >>evict all of them.
> >>
> >>So instead do what we do with pagecache, only set the *REFERENCED flags if we
> >>are being used after we've been put onto the LRU.  This makes a significant
> >>difference in the system's ability to evict these useless cache entries.  With a
> >>fs_mark workload that creates 40 million files we get the following results (all
> >>in files/sec)
> >
> >What's the workload, storage, etc?
> 
> Oops sorry I thought I said it.  It's fs_mark creating 20 million
> empty files on a single NVME drive.

How big is the drive/filesystem (e.g. has impact on XFS allocation
concurrency)?  And multiple btrfs subvolumes, too, by the sound of
it. How did you set those up?  What about concurrency, directory
sizes, etc?  Can you post the fsmark command line as these details
do actually matter...

Getting the benchmark configuration to reproduce posted results
should not require playing 20 questions!

> >>The reason Btrfs has a much larger improvement is because it holds a lot more
> >>things in memory so benefits more from faster slab reclaim, but across the board
> >>is an improvement for each of the file systems.
> >
> >Less than 1% for XFS and ~1.5% for ext4 is well within the
> >run-to-run variation of fsmark. It looks like it might be slightly
> >faster, but it's not a cut-and-dried win for anything other than
> >btrfs.
> >
> 
> Sure the win in this benchmark is clearly benefiting btrfs the most,
> but I think the overall approach is sound and likely to help
> everybody in theory.

Yup, but without an explanation of why it makes such a big change to
btrfs, we can't really say what effect it's really going to have.
Why does cycling the inode a second time through the LRU make any
difference? Once we're in steady state on this workload, one or two
cycles through the LRU should make no difference at all to
performance because all the inodes are instantiated in identical
states (including the referenced bit) and so scanning treats every
inode identically. i.e. the reclaim rate (i.e. how often
evict_inode() is called) should be exactly the same and the only
difference is the length of time between the inode being put on the
LRU and when it is evicted.

Is there an order difference in reclaim as a result of earlier
reclaim? Is btrfs_evict_inode() blocking on a sleeping lock in btrfs
rather than contending on spinning locks? Is it having to wait for
transaction commits or some other metadata IO because reclaim is
happening earlier? i.e. The question I'm asking is what, exactly,
leads to such a marked performance improvement in steady state
behaviour?

I want to know because if there's behavioural changes in LRU reclaim
order having a significant effect on affecting btrfs, then there is
going to be some effects on other filesystems, too. Maybe not in
this benchmark, but we can't anticipate potential problems if we
don't understand exactly what is going on here.

> Inside FB we definitely have had problems
> where the memory pressure induced by some idi^H^H^Hprocess goes
> along and runs find / which causes us to evict real things that are
> being used rather than these one use inodes.

That's one of the problems the IO throttling in the XFS shrinker
tends to avoid. i.e. This is one of the specific cases we expect to see
on all production systems - backup applications are the common cause
of regular full filesystem traversals.

FWIW, there's an element of deja vu in this thread: that XFS inode
cache shrinker IO throttling is exactly what Chris proposed we gut
last week to solve some other FB memory reclaim problem that had no
explanation of the root cause.

(http://www.spinics.net/lists/linux-xfs/msg01541.html)

> This sort of behavior
> could possibly be mitigated by this patch, but I haven't sat down to
> figure out a reliable way to mirror this workload to test that
> theory.  Thanks

I use fsmark to create filesystems with tens of millions of small
files, then do things like run concurrent greps repeatedly over a
small portion of the directory structure (e.g. 1M cached inodes
and 4GB worth of cached data) and then run concurrent find
traversals across the entire filesystem to simulate this sort
use-once vs working set reclaim...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-27  0:30       ` Dave Chinner
@ 2016-10-27 13:13         ` Josef Bacik
  2016-10-28  3:48           ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2016-10-27 13:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-btrfs, kernel-team, jack, linux-fsdevel, viro, hch, jweiner

On 10/26/2016 08:30 PM, Dave Chinner wrote:
> On Wed, Oct 26, 2016 at 11:11:35AM -0400, Josef Bacik wrote:
>> On 10/25/2016 06:01 PM, Dave Chinner wrote:
>>> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
>>>> With anything that populates the inode/dentry cache with a lot of one time use
>>>> inodes we can really put a lot of pressure on the system for things we don't
>>>> need to keep in cache.  It takes two runs through the LRU to evict these one use
>>>> entries, and if you have a lot of memory you can end up with 10's of millions of
>>>> entries in the dcache or icache that have never actually been touched since they
>>>> were first instantiated, and it will take a lot of CPU and a lot of pressure to
>>>> evict all of them.
>>>>
>>>> So instead do what we do with pagecache, only set the *REFERENCED flags if we
>>>> are being used after we've been put onto the LRU.  This makes a significant
>>>> difference in the system's ability to evict these useless cache entries.  With a
>>>> fs_mark workload that creates 40 million files we get the following results (all
>>>> in files/sec)
>>>
>>> What's the workload, storage, etc?
>>
>> Oops sorry I thought I said it.  It's fs_mark creating 20 million
>> empty files on a single NVME drive.
>
> How big is the drive/filesystem (e.g. has impact on XFS allocation
> concurrency)?  And multiple btrfs subvolumes, too, by the sound of
> it. How did you set those up?  What about concurrency, directory
> sizes, etc?  Can you post the fsmark command line as these details
> do actually matter...
>
> Getting the benchmark configuration to reproduce posted results
> should not require playing 20 questions!

This is the disk

Disk /dev/nvme0n1: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors

This is the script

https://paste.fedoraproject.org/461874/73910147/1

It's on a 1 socket 8 core cpu with 16gib of ram.

>
>>>> The reason Btrfs has a much larger improvement is because it holds a lot more
>>>> things in memory so benefits more from faster slab reclaim, but across the board
>>>> is an improvement for each of the file systems.
>>>
>>> Less than 1% for XFS and ~1.5% for ext4 is well within the
>>> run-to-run variation of fsmark. It looks like it might be slightly
>>> faster, but it's not a cut-and-dried win for anything other than
>>> btrfs.
>>>
>>
>> Sure the win in this benchmark is clearly benefiting btrfs the most,
>> but I think the overall approach is sound and likely to help
>> everybody in theory.
>
> Yup, but without an explanation of why it makes such a big change to
> btrfs, we can't really say what effect it's really going to have.
> Why does cycling the inode a second time through the LRU make any
> difference? Once we're in steady state on this workload, one or two
> cycles through the LRU should make no difference at all to
> performance because all the inodes are instantiated in identical
> states (including the referenced bit) and so scanning treats every
> inode identically. i.e. the reclaim rate (i.e. how often
> evict_inode() is called) should be exactly the same and the only
> difference is the length of time between the inode being put on the
> LRU and when it is evicted.
>
> Is there an order difference in reclaim as a result of earlier
> reclaim? Is btrfs_evict_inode() blocking on a sleeping lock in btrfs
> rather than contending on spinning locks? Is it having to wait for
> transaction commits or some other metadata IO because reclaim is
> happening earlier? i.e. The question I'm asking is what, exactly,
> leads to such a marked performance improvement in steady state
> behaviour?

I would have seen this in my traces.  There's tooooons of places to improve 
btrfs's performance and behavior here no doubt.  But simply moving from 
pagecache to a slab shrinker shouldn't have drastically changed how we perform 
in this test.  I feel like the shrinkers need to be used differently, but I 
completely destroyed vmscan.c trying different approaches and none of them made 
a difference like this patch made.  From what I was seeing in my trace we were 
simply reclaiming less per kswapd scan iteration with the old approach vs. the 
new approach.

>
> I want to know because if there's behavioural changes in LRU reclaim
> order having a significant effect on affecting btrfs, then there is
> going to be some effects on other filesystems, too. Maybe not in
> this benchmark, but we can't anticipate potential problems if we
> don't understand exactly what is going on here.

So I'll just describe to you what I was seeing and maybe we can work out where 
we think the problem is.

1) We go at X speed until we fill up all of the memory with the various caches.
2) We lost about 15% when kswapd kicks in and it never recovered.

Doing tracing I was seeing that we would try to scan very small batches of 
objects, usually less than the batch size, because with btrfs not using 
pagecache for anything and the shrinker scanning being based on how much 
pagecache was scanned our scan totals would be very much smaller than the total 
cache size.  I originally thought this was the problem, but short of just 
forcing us to scan the whole cache every time nothing I did made any difference.

Once we'd scanned through the entire LRU at once suddenly we started reclaiming 
almost as many objects as we scanned, as opposed to < 10 items.  This is because 
of the whole referenced thing.  So we'd see a small perf bump when we did manage 
to do this, and then it would drop again once the lru was full of fresh inodes 
again.  So we'd get this saw-toothy sort of reclaim.

This is when I realized the whole REFERENCED thing was probably screwing us, so 
I went to look at what pagecache did to eliminate this problem.  They do two 
things, first start all pages on the inactive list, and second balance between 
scanning the active list vs inactive list based on the size of either list.  I 
rigged up something to do an active/inactive list inside of list_lru which again 
did basically nothing.  So I changed how the REFERENCED was marked to match how 
pagecache did things and viola, my performance was back.

I did all sorts of tracing to try and figure out what exactly it was about the 
reclaim that made everything so much slower, but with that many things going on 
it was hard to decide what was noise and what was actually causing the problem. 
Without this patch I see kswapd stay at higher CPU usage for longer, because it 
has to keep scanning things to satisfy the high watermark and it takes scanning 
multiple million object lists multiple times before it starts to make any 
progress towards reclaim.  This logically makes sense and is not ideal behavior.

>
>> Inside FB we definitely have had problems
>> where the memory pressure induced by some idi^H^H^Hprocess goes
>> along and runs find / which causes us to evict real things that are
>> being used rather than these one use inodes.
>
> That's one of the problems the IO throttling in the XFS shrinker
> tends to avoid. i.e. This is one of the specific cases we expect to see
> on all production systems - backup applications are the common cause
> of regular full filesystem traversals.
>
> FWIW, there's an element of deja vu in this thread: that XFS inode
> cache shrinker IO throttling is exactly what Chris proposed we gut
> last week to solve some other FB memory reclaim problem that had no
> explanation of the root cause.
>
> (https://urldefense.proofpoint.com/v2/url?u=http-3A__www.spinics.net_lists_linux-2Dxfs_msg01541.html&d=DQIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=sDzg6MvHymKOUgI8SFIm4Q&m=REEqbp9jT-ifN0oN83tK5X3BxEL76jmuULRORyO8Qeg&s=vlsX-YRw_2HCnXFThpOjyumOHu4JFDcSqssSQG1Rt40&e= )
>
>> This sort of behavior
>> could possibly be mitigated by this patch, but I haven't sat down to
>> figure out a reliable way to mirror this workload to test that
>> theory.  Thanks
>
> I use fsmark to create filesystems with tens of millions of small
> files, then do things like run concurrent greps repeatedly over a
> small portion of the directory structure (e.g. 1M cached inodes
> and 4GB worth of cached data) and then run concurrent find
> traversals across the entire filesystem to simulate this sort
> use-once vs working set reclaim...
>

Yeah that sounds reasonable, like I said I just haven't tried to test it as my 
numbers got bigger and I was happy with that.  I'll rig this up and see how it 
performs with my patch to see if there's a significant difference with FS's 
other than btrfs.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
  2016-10-27 13:13         ` Josef Bacik
@ 2016-10-28  3:48           ` Dave Chinner
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2016-10-28  3:48 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, jack, linux-fsdevel, viro, hch, jweiner

On Thu, Oct 27, 2016 at 09:13:44AM -0400, Josef Bacik wrote:
> On 10/26/2016 08:30 PM, Dave Chinner wrote:
> >On Wed, Oct 26, 2016 at 11:11:35AM -0400, Josef Bacik wrote:
> >>On 10/25/2016 06:01 PM, Dave Chinner wrote:
> >>>On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> >>>>With anything that populates the inode/dentry cache with a lot of one time use
> >>>>inodes we can really put a lot of pressure on the system for things we don't
> >>>>need to keep in cache.  It takes two runs through the LRU to evict these one use
> >>>>entries, and if you have a lot of memory you can end up with 10's of millions of
> >>>>entries in the dcache or icache that have never actually been touched since they
> >>>>were first instantiated, and it will take a lot of CPU and a lot of pressure to
> >>>>evict all of them.
> >>>>
> >>>>So instead do what we do with pagecache, only set the *REFERENCED flags if we
> >>>>are being used after we've been put onto the LRU.  This makes a significant
> >>>>difference in the system's ability to evict these useless cache entries.  With a
> >>>>fs_mark workload that creates 40 million files we get the following results (all
> >>>>in files/sec)
> >>>
> >>>What's the workload, storage, etc?
> >>
> >>Oops sorry I thought I said it.  It's fs_mark creating 20 million
> >>empty files on a single NVME drive.
> >
> >How big is the drive/filesystem (e.g. has impact on XFS allocation
> >concurrency)?  And multiple btrfs subvolumes, too, by the sound of
> >it. How did you set those up?  What about concurrency, directory
> >sizes, etc?  Can you post the fsmark command line as these details
> >do actually matter...
> >
> >Getting the benchmark configuration to reproduce posted results
> >should not require playing 20 questions!
> 
> This is the disk
> 
> Disk /dev/nvme0n1: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> 
> This is the script
> 
> https://paste.fedoraproject.org/461874/73910147/1

Link no workee. This does, though:

https://paste.fedoraproject.org/461874/73910147

> 
> It's on a 1 socket 8 core cpu with 16gib of ram.

Thanks!

> >Is there an order difference in reclaim as a result of earlier
> >reclaim? Is btrfs_evict_inode() blocking on a sleeping lock in btrfs
> >rather than contending on spinning locks? Is it having to wait for
> >transaction commits or some other metadata IO because reclaim is
> >happening earlier? i.e. The question I'm asking is what, exactly,
> >leads to such a marked performance improvement in steady state
> >behaviour?
> 
> I would have seen this in my traces.  There's tooooons of places to
> improve btrfs's performance and behavior here no doubt.  But simply
> moving from pagecache to a slab shrinker shouldn't have drastically
> changed how we perform in this test.

Not sure what you mean by this. Do you mean that because of the
preceeding patches that change the page cache accounting for btrfs
metadata there is now not enough pressure being placed on the btrfs
inode cache shrinker?

FWIW, this zero length file fsmark workload produces zero page
cache pressure on XFS. If the case is that btrfs is no longer using the pa

> I feel like the shrinkers need
> to be used differently, but I completely destroyed vmscan.c trying
> different approaches and none of them made a difference like this
> patch made.  From what I was seeing in my trace we were simply
> reclaiming less per kswapd scan iteration with the old approach vs.
> the new approach.

Yup, that's what default_seeks is supposed to be used to tune - the
relative pressure that should be placed on the slab compared to
everything else.

> >I want to know because if there's behavioural changes in LRU reclaim
> >order having a significant effect on affecting btrfs, then there is
> >going to be some effects on other filesystems, too. Maybe not in
> >this benchmark, but we can't anticipate potential problems if we
> >don't understand exactly what is going on here.
> 
> So I'll just describe to you what I was seeing and maybe we can work
> out where we think the problem is.
> 
> 1) We go at X speed until we fill up all of the memory with the various caches.
> 2) We lost about 15% when kswapd kicks in and it never recovered.

That's expected behaviour (i.e. that's exactly what I see when XFS
fill memory and reclaim gates new allocations). Memory reclaim does
not come for free.

> Doing tracing I was seeing that we would try to scan very small
> batches of objects, usually less than the batch size, because with
> btrfs not using pagecache for anything and the shrinker scanning
> being based on how much pagecache was scanned our scan totals would
> be very much smaller than the total cache size.  I originally
> thought this was the problem, but short of just forcing us to scan
> the whole cache every time nothing I did made any difference.

So this is from a kernel just running your patch 5 (the referenced
bit changes) and samples were taken for about a minute at steady
state after memory filled (about 20m files into a 16-way 50m file
create with subvols running at ~130k files/s in steady state) using:

# ~/trace-cmd/trace-cmd record -e mm_shrink_slab\*

There were this many shrinker invocations, but all of them came from
kswapd (no direct reclaim at all).

$ grep 0xffff88036a2774c0 t.t |grep shrink_slab_start |wc -l
1923

The amount of reclaim each shrinker invocation did was:

# grep 0xffff88036a2774c0 t.t  |awk -e '/mm_shrink_slab_end/ { print $23 }' |sort -n |uniq -c |sort -nr
    623 1025
    481 0
    172 3075
    145 2050
    131 6150
     82 12300
     54 7175
     40 4100
     40 11275
     34 5125
     32 14350
     30 13325
     19 23575
     18 24600
      7 15375
      4 8200
      3 22550
      3 10250
      2 47150
      2 16400
      1 9225
      1 25625
      1 14349
      1 11274

I don't see any partial batches there, but I also don't see enough
reclaim pressure being put on the inode cache. I'll come back to
this, but first lets look at what how XFS behaves, because it
doesn't create any page cache pressure on these workloads:


$ grep 0xffff880094aea4c0 t.t  |awk -e '/mm_shrink_slab_end/ { print $23 }' |sort -n |uniq -c |sort -nr
   1059 0
      5 1000
      4 7045
      4 1008
      3 7970
      3 7030
      2 996
      2 995
      2 962
      2 9005
      2 8961
      2 8937
      2 8062
      2 8049
.....

Very different look to thework szies. Sorting on the reclaim batch
sizes shows this:

      1 901131
      1 874354
      1 872911
      1 847679
      1 760696
      1 616491
      1 590231
      1 542313
      1 517882
      1 513400
      1 507431
      1 479111
      1 439889
      1 433734
      1 410811
      1 388127
.....

There are massive batches of inodes being reclaimed. Let's have a
look at that one of those traces more closely:

kswapd1-862   [001] 183735.856649: mm_shrink_slab_start:
	super_cache_scan+0x0 0xffff880094aea4c0: nid: 1
	objects to shrink 882936 gfp_flags GFP_KERNEL
	pgs_scanned 32 lru_pgs 6236
	cache items 2416122 delta 24792 total_scan 907728

So here we see only 32 pages were scanned in the page cache, of
a very small 6236 pages - there is no page cache to speak of, and
the 32 pages scanned is the defined minimum for the page cache.

However, the calculated delta is still reasonable at 24792, with the
cache size at 2416122 objects. It's reasonable because the delta is
almost exactly 1% of the cache size and that's a good reclaim batch
size for a single shrinker invocation.

However, total_scan is at 907728, which means there have been lots
of shrinker invocations that have not done work due to GFP_NOFS
context, and so that work has been deferred. kswapd is invoked with
GFP_KERNEL context, so we'd expect it to do most of that work right
now:

kswapd1-862   [009] 183737.582875: mm_shrink_slab_end:
	super_cache_scan+0x0 0xffff880094aea4c0: nid: 1
	unused scan count 882936 new scan count 464 total_scan 464
	last shrinker return val 874354

And it does - it freed 874354 objects from the cache, and the
deferred scan count returned to the shrinker was only 464. I'll also
point out that the reclaim rate here is about 500,000 inodes/s (per
kswapd) so we've got plenty of headroom and scalability here.

This is behaving exactly as we'd expect if there was no page cache
pressure and substantial GFP_NOFS allocations. THe delta calculation
is in the correct ballpark, the reclaim batch size is being
determined by the amount of work we are deferring to kswapd.

Just to show what GFP_NOFS windup from direct reclaim looks like:

fs_mark-30202 [006] 183695.757995: mm_shrink_slab_start:
	super_cache_scan+0x0 0xffff880094aea4c0: nid: 1
	objects to shrink 64064 gfp_flags GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK
	pgs_scanned 32 lru_pgs 6697
	cache items 2236078 delta 21365 total_scan 85429
fs_mark-30202 [006] 183695.757995: mm_shrink_slab_end:
	super_cache_scan+0x0 0xffff880094aea4c0: nid: 1
	unused scan count 64064 new scan count 85429 total_scan 85429
	last shrinker return val 0

incoming deferred work was 64064 objects, delta added was 21365
(again, 1% of total cache size), zero objects were freed, outgoing
deferred work in now 85429 objects.

So we can see that regardless of the invocation context of the
shrinker, XFS is asking for 1% of the cache to be removed. Knowing
this, let's go back to the btrfs traces and examine that:

kswapd1-862   [013] 185883.279076: mm_shrink_slab_start:
	super_cache_scan+0x0 0xffff8803329404c0: nid: 1
	objects to shrink 1006 gfp_flags GFP_KERNEL
	pgs_scanned 203 lru_pgs 251067
	cache items 1787431 delta 2890 total_scan 3896

Ok, the first thing to note here is that the delta is much smaller
than what we see with XFS. That is also no obvious deferred work,
which we'd expect because there is no direct reclaim records in the
trace for btrfs. It's pretty clear from this that each shrinker
invocation is asking for about 0.15% of the cache to be scanned -
there's much lower pressure being put on the shrinker when compared
to XFS. Hence it will take 5-10x as many invocations of the shrinker
to do the same amount of work.

Occasionally we see the page cache get scanned enough to bring
deltas up to 1% of the cache size:

kswapd1-862   [013] 185883.943647: mm_shrink_slab_start:
	super_cache_scan+0x0 0xffff8803329404c0: nid: 1
	objects to shrink 459 gfp_flags GFP_KERNEL
	pgs_scanned 1585 lru_pgs 249194
	cache items 1784345 delta 22698 total_scan 23157

See the difference here? pgs_scanned is much higher than the
trace from than half a second ago, and now the delta reaches that
~1% of cache size number. This is an outlier, though, but what it
indicates is that the problem is with the amount of page cache
scanning being done during reclaim.

There are two ways to deal with this: fix whatever problem is
causing the page cache to be underscanned, or make the
pgs_scanned:shrinker delta ratio larger. The later can be don via
changing the shrinker seek count or modifying
/proc/sys/vm/vfs_cache_pressure to make the cache seem 10x larger
than it really is....

> Once we'd scanned through the entire LRU at once suddenly we started
> reclaiming almost as many objects as we scanned, as opposed to < 10
> items.  This is because of the whole referenced thing.

That looks like a side effect of not generating enough pressure on
the shrinker.

> So we'd see
> a small perf bump when we did manage to do this, and then it would
> drop again once the lru was full of fresh inodes again.  So we'd get
> this saw-toothy sort of reclaim.

Yup - butin normal conditions this workload should result with a
relatively even mix of referenced and unreferenced inodes on the
LRU, so this behaviour should not happen.

Indeed, I just pulled the referenced patch out of the kernel and
reran the XFS tests, and I see regular "delta 24000, freed 13000"
traces. There are much fewer huge reclaim peaks (i.e. less GFP_NOFS
windup) and no two shrinker invocations freed the exact same number
of objects.

$ grep 0xffff88033a638cc0 t.t  |awk -e '/mm_shrink_slab_end/ { print $23 }' |sort -n |uniq -c |sort -nr -k2
1 381458
1 333756
1 90265
1 88664
1 82371
1 82340
1 81849
1 80772
1 79268

i.e. changing the referenced bit behaviour has made the XFS inode
shrinker too aggressive for this workload, so rebalancing would be
necessary...

At the same time, I changed btrfs to has "seeks = 1". This shows
/substantially/ more sustained and larger amounts of pressure being
put on the btrfs inode shrinker than the referenced patch:

$ grep 0xffff88004ed79cc0: b.t  |awk -e '/mm_shrink_slab_end/ { print $23 }' |sort -n |uniq -c |sort -nr -k2
1 59021
1 56271
1 53743
1 52885
1 47992
1 44442
1 41265
1 40718
1 36038
1 34872
....

.... but it still has a really, really long tail of small reclaim
attempts so it's better but still not perfect....

Unfortunately, because default_seeks is only 2, we're going to have
to change this value  and the constants in do_shrinker_slab() to
give us more ranging to increase reclaim pressure in this manner....

> I did all sorts of tracing to try and figure out what exactly it was
> about the reclaim that made everything so much slower, but with that
> many things going on it was hard to decide what was noise and what
> was actually causing the problem. Without this patch I see kswapd
> stay at higher CPU usage for longer, because it has to keep scanning
> things to satisfy the high watermark and it takes scanning multiple
> million object lists multiple times before it starts to make any
> progress towards reclaim.  This logically makes sense and is not
> ideal behavior.

I would expect the elevated kswapd CPU is due to the fact it is
being called so many more times to do the inode cache reclaim work.
It has to do more page cache scanning because the shrinker is not
removing the memory pressure, and so it's a downward spiral of
inefficiency. let's see what happens when we get the shrinker
pressure to reliably scan 1% of the cache on every invocation....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes
  2016-10-25 18:41 ` [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes Josef Bacik
  2016-10-25 19:03   ` Tejun Heo
@ 2016-10-30 15:13   ` Jan Kara
  1 sibling, 0 replies; 26+ messages in thread
From: Jan Kara @ 2016-10-30 15:13 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

On Tue 25-10-16 14:41:41, Josef Bacik wrote:
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 121a6e3..e09b3ad 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -596,11 +596,11 @@ static void wb_domain_writeout_inc(struct wb_domain *dom,
>   * Increment @wb's writeout completion count and the global writeout
>   * completion count. Called from test_clear_page_writeback().
>   */
> -static inline void __wb_writeout_inc(struct bdi_writeback *wb)
> +static inline void __wb_writeout_inc(struct bdi_writeback *wb, long bytes)

Please keep the names consistent - i.e. when you rename wb_writeout_inc to
wb_writeout_add, then you should do the same with __wb_writeout_inc...

>  {
>  	struct wb_domain *cgdom;
>  
> -	__inc_wb_stat(wb, WB_WRITTEN);
> +	__add_wb_stat(wb, WB_WRITTEN_BYTES, bytes);
>  	wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
>  			       wb->bdi->max_prop_frac);

Also I think you will need to change the per-domain writeback statistics to
bytes as well. Otherwise the proportions can get skewed.

Other than that the patch looks good to me.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/5] writeback: add counters for metadata usage
  2016-10-25 18:41 ` [PATCH 3/5] writeback: add counters for metadata usage Josef Bacik
  2016-10-25 19:50   ` Tejun Heo
@ 2016-10-30 15:36   ` Jan Kara
  1 sibling, 0 replies; 26+ messages in thread
From: Jan Kara @ 2016-10-30 15:36 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, kernel-team, david, jack, linux-fsdevel, viro, hch, jweiner

On Tue 25-10-16 14:41:42, Josef Bacik wrote:
> Btrfs has no bounds except memory on the amount of dirty memory that we have in
> use for metadata.  Historically we have used a special inode so we could take
> advantage of the balance_dirty_pages throttling that comes with using pagecache.
> However as we'd like to support different blocksizes it would be nice to not
> have to rely on pagecache, but still get the balance_dirty_pages throttling
> without having to do it ourselves.
> 
> So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES.  These are
> zone and bdi_writeback counters to keep track of how many bytes we have in
> flight for METADATA.  We need to count in bytes as blocksizes could be
> percentages of pagesize.  We simply convert the bytes to number of pages where
> it is needed for the throttling.
> 
> Also introduce NR_METADATA_BYTES so we can keep track of the total amount of
> pages used for metadata on the system.  This is also needed so things like dirty
> throttling know that this is dirtyable memory as well and easily reclaimed.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2016-10-31  1:09 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-25 18:41 [PATCH 0/5][RESEND] Support for metadata specific accounting Josef Bacik
2016-10-25 18:41 ` [PATCH 1/5] remove mapping from balance_dirty_pages*() Josef Bacik
2016-10-25 18:47   ` Tejun Heo
2016-10-25 18:41 ` [PATCH 2/5] writeback: convert WB_WRITTEN/WB_DIRITED counters to bytes Josef Bacik
2016-10-25 19:03   ` Tejun Heo
2016-10-25 19:09     ` Josef Bacik
2016-10-30 15:13   ` Jan Kara
2016-10-25 18:41 ` [PATCH 3/5] writeback: add counters for metadata usage Josef Bacik
2016-10-25 19:50   ` Tejun Heo
2016-10-26 15:20     ` Josef Bacik
2016-10-26 15:49       ` Tejun Heo
2016-10-30 15:36   ` Jan Kara
2016-10-25 18:41 ` [PATCH 4/5] writeback: introduce super_operations->write_metadata Josef Bacik
2016-10-25 20:00   ` Tejun Heo
2016-10-25 18:41 ` [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list Josef Bacik
2016-10-25 22:01   ` Dave Chinner
2016-10-25 23:36     ` Dave Chinner
2016-10-26 20:03       ` Josef Bacik
2016-10-26 22:20         ` Dave Chinner
2016-10-26 15:11     ` Josef Bacik
2016-10-27  0:30       ` Dave Chinner
2016-10-27 13:13         ` Josef Bacik
2016-10-28  3:48           ` Dave Chinner
2016-10-25 22:44   ` Omar Sandoval
2016-10-26  4:17     ` [PATCH 5/5] " Andreas Dilger
2016-10-26  5:24       ` Omar Sandoval

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.