* [PATCH RFC] fsio: filesystem io accounting cgroup
@ 2013-07-08  9:59 Konstantin Khlebnikov
  2013-07-10  4:43 ` Sha Zhengju
  0 siblings, 1 reply; 29+ messages in thread
From: Konstantin Khlebnikov @ 2013-07-08  9:59 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, linux-kernel

This is a proof of concept, just basic functionality for an IO controller.
This cgroup will control filesystem usage at the vfs layer; its main goal is
bandwidth control. It's supposed to be much more lightweight than memcg/blkio.

This patch shows an easy way of accounting pages in dirty/writeback state in
a per-inode manner. This is easier than doing it in memcg in a per-page manner.
The main idea is to keep on each inode a pointer (->i_fsio) to the cgroup which
owns the dirty data in that inode. It is set by fsio_account_page_dirtied() when
the first dirty tag appears in the inode. Relying on mapping tags gives us
locking for free; this patch doesn't add any new locks to hot paths.

Unlike blkio, this method works for all filesystems, not just disk-backed ones.
It is also able to handle writeback, because each inode has context which can
be used in the writeback thread to account io operations.

This is an early prototype. I have some plans for extra functionality, because
this accounting by itself is mostly useless, but it can be used as a basis for
more useful features.
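
For illustration, the whole interface added by this patch is four read-only
files. A session might look like this (a sketch, assuming the usual cgroupfs
mount and the standard '<subsys>.<name>' file naming; paths are hypothetical):

  # mount -t cgroup -o fsio none /sys/fs/cgroup/fsio
  # mkdir /sys/fs/cgroup/fsio/ct1
  # echo $$ > /sys/fs/cgroup/fsio/ct1/tasks
  # dd if=/dev/zero of=/tmp/testfile bs=1M count=16
  # cat /sys/fs/cgroup/fsio/ct1/fsio.dirty_bytes    # not yet written back
  # cat /sys/fs/cgroup/fsio/ct1/fsio.write_bytes    # written back + direct-io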

Planned improvements:
* Split bdi into several tiers and account them separately. For example:
  hdd/ssd/usb/nfs. In complicated containerized environments there might be
  different kinds of storage with different limits and billing. This is more
  useful than independent per-disk accounting and much easier to implement,
  because all per-tier structures are allocated before the disk appears.
* Add some hooks for accounting actually issued IO requests (iops).
* Implement bandwidth throttlers for each tier individually (bps and iops).
  This will be the tastiest feature. I already have a very effective prototype.
* Add a hook into balance_dirty_pages to limit the amount of dirty pages for
  each cgroup in each tier individually. This is required for accurate
  throttling: if we want to limit the speed of writeback we must also limit
  the amount of dirty pages, otherwise we have to inject an enormous delay
  after each sync().
* Implement filtered writeback requests for writing only data which belongs to
  a particular fsio cgroup (or cgroup subtree) to keep the dirty balance in the
  background.
* Implement a filtered 'sync': a special mode for sync() which syncs only
  filesystems which 'belong' to the current fsio cgroup. Each container should
  sync only its own filesystems. This could also be done in terms of
  'visibility' in vfsmount namespaces.

This patch applies on top of:
b26008c page_writeback: put account_page_redirty() after set_page_dirty()
80979bd page_writeback: get rid of account_size argument in cancel_dirty_page()
c575ef6 hugetlbfs: remove cancel_dirty_page() from truncate_huge_page()
b720923 nfs: remove redundant cancel_dirty_page() from nfs_wb_page_cancel()
4c21e52 mm: remove redundant dirty pages check from __delete_from_page_cache()

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: cgroups@vger.kernel.org
Cc: devel@openvz.org
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Sha Zhengju <handai.szj@gmail.com>
---
 block/blk-core.c              |    2 +
 fs/Makefile                   |    2 +
 fs/direct-io.c                |    2 +
 fs/fsio_cgroup.c              |  137 +++++++++++++++++++++++++++++++++++++++++
 fs/nfs/direct.c               |    2 +
 include/linux/cgroup_subsys.h |    6 ++
 include/linux/fs.h            |    3 +
 include/linux/fsio_cgroup.h   |  136 +++++++++++++++++++++++++++++++++++++++++
 init/Kconfig                  |    3 +
 mm/page-writeback.c           |    8 ++
 mm/readahead.c                |    2 +
 mm/truncate.c                 |    2 +
 12 files changed, 304 insertions(+), 1 deletion(-)
 create mode 100644 fs/fsio_cgroup.c
 create mode 100644 include/linux/fsio_cgroup.h

diff --git a/block/blk-core.c b/block/blk-core.c
index 93a18d1..5f4e6c9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,6 +26,7 @@
 #include <linux/swap.h>
 #include <linux/writeback.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/fsio_cgroup.h>
 #include <linux/fault-inject.h>
 #include <linux/list_sort.h>
 #include <linux/delay.h>
@@ -1866,6 +1867,7 @@ void submit_bio(int rw, struct bio *bio)
 			count_vm_events(PGPGOUT, count);
 		} else {
 			task_io_account_read(bio->bi_size);
+			fsio_account_read(bio->bi_size);
 			count_vm_events(PGPGIN, count);
 		}
 
diff --git a/fs/Makefile b/fs/Makefile
index 4fe6df3..d4432e7 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_FHANDLE)		+= fhandle.o
 
 obj-y				+= quota/
 
+obj-$(CONFIG_FSIO_CGROUP)	+= fsio_cgroup.o
+
 obj-$(CONFIG_PROC_FS)		+= proc/
 obj-$(CONFIG_SYSFS)		+= sysfs/
 obj-$(CONFIG_CONFIGFS_FS)	+= configfs/
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 7ab90f5..0fe99af 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -28,6 +28,7 @@
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/fsio_cgroup.h>
 #include <linux/bio.h>
 #include <linux/wait.h>
 #include <linux/err.h>
@@ -722,6 +723,7 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 		 * Read accounting is performed in submit_bio()
 		 */
 		task_io_account_write(len);
+		fsio_account_write(len);
 	}
 
 	/*
diff --git a/fs/fsio_cgroup.c b/fs/fsio_cgroup.c
new file mode 100644
index 0000000..5d5158f
--- /dev/null
+++ b/fs/fsio_cgroup.c
@@ -0,0 +1,137 @@
+#include <linux/fsio_cgroup.h>
+#include "internal.h"
+
+static inline struct fsio_cgroup *cgroup_fsio(struct cgroup *cgroup)
+{
+	return container_of(cgroup_subsys_state(cgroup, fsio_subsys_id),
+			    struct fsio_cgroup, css);
+}
+
+static void fsio_free(struct fsio_cgroup *fsio)
+{
+	percpu_counter_destroy(&fsio->read_bytes);
+	percpu_counter_destroy(&fsio->write_bytes);
+	percpu_counter_destroy(&fsio->nr_dirty);
+	percpu_counter_destroy(&fsio->nr_writeback);
+	kfree(fsio);
+}
+
+static struct cgroup_subsys_state *fsio_css_alloc(struct cgroup *cgroup)
+{
+	struct fsio_cgroup *fsio;
+
+	fsio = kzalloc(sizeof(struct fsio_cgroup), GFP_KERNEL);
+	if (!fsio)
+		return ERR_PTR(-ENOMEM);
+
+	if (percpu_counter_init(&fsio->read_bytes, 0) ||
+	    percpu_counter_init(&fsio->write_bytes, 0) ||
+	    percpu_counter_init(&fsio->nr_dirty, 0) ||
+	    percpu_counter_init(&fsio->nr_writeback, 0)) {
+		fsio_free(fsio);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	return &fsio->css;
+}
+
+/* switch all ->i_fsio references to the parent cgroup */
+static void fsio_switch_sb(struct super_block *sb, void *_fsio)
+{
+	struct fsio_cgroup *fsio = _fsio;
+	struct fsio_cgroup *parent = cgroup_fsio(fsio->css.cgroup->parent);
+	struct address_space *mapping;
+	struct inode *inode;
+
+	spin_lock(&inode_sb_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		mapping = inode->i_mapping;
+		if (mapping->i_fsio == fsio &&
+		    (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) ||
+		     mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))) {
+			spin_lock_irq(&mapping->tree_lock);
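+			/* recheck: ->i_fsio is updated under tree_lock */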
+			if (mapping->i_fsio == fsio)
+				mapping->i_fsio = parent;
+			spin_unlock_irq(&mapping->tree_lock);
+		}
+	}
+	spin_unlock(&inode_sb_list_lock);
+}
+
+static void fsio_css_free(struct cgroup *cgroup)
+{
+	struct fsio_cgroup *fsio = cgroup_fsio(cgroup);
+	struct fsio_cgroup *parent = cgroup_fsio(fsio->css.cgroup->parent);
+	u64 nr_dirty, nr_writeback, tmp;
+
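+	/*
+	 * Transfer the counters to the parent in two passes: move the current
+	 * sums first, switch the remaining ->i_fsio references to the parent,
+	 * then move whatever was accounted here while the switch was running.
+	 */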
+	nr_dirty = percpu_counter_sum_positive(&fsio->nr_dirty);
+	percpu_counter_add(&parent->nr_dirty, nr_dirty);
+
+	nr_writeback = percpu_counter_sum_positive(&fsio->nr_writeback);
+	percpu_counter_add(&parent->nr_writeback, nr_writeback);
+
+	iterate_supers(fsio_switch_sb, fsio);
+
+	tmp = percpu_counter_sum(&fsio->nr_dirty);
+	percpu_counter_add(&parent->nr_dirty, tmp - nr_dirty);
+
+	tmp = percpu_counter_sum(&fsio->nr_writeback);
+	percpu_counter_add(&parent->nr_writeback, tmp - nr_writeback);
+}
+
+static u64 fsio_get_read_bytes(struct cgroup *cgroup, struct cftype *cft)
+{
+	struct fsio_cgroup *fsio = cgroup_fsio(cgroup);
+
+	return percpu_counter_sum(&fsio->read_bytes);
+}
+
+static u64 fsio_get_write_bytes(struct cgroup *cgroup, struct cftype *cft)
+{
+	struct fsio_cgroup *fsio = cgroup_fsio(cgroup);
+
+	return percpu_counter_sum(&fsio->write_bytes);
+}
+
+static u64 fsio_get_dirty_bytes(struct cgroup *cgroup, struct cftype *cft)
+{
+	struct fsio_cgroup *fsio = cgroup_fsio(cgroup);
+
+	return percpu_counter_sum_positive(&fsio->nr_dirty) * PAGE_CACHE_SIZE;
+}
+
+static u64 fsio_get_writeback_bytes(struct cgroup *cgroup, struct cftype *cft)
+{
+	struct fsio_cgroup *fsio = cgroup_fsio(cgroup);
+
+	return percpu_counter_sum_positive(&fsio->nr_writeback) *
+		PAGE_CACHE_SIZE;
+}
+
+static struct cftype fsio_files[] = {
+	{
+		.name = "read_bytes",
+		.read_u64 = fsio_get_read_bytes,
+	},
+	{
+		.name = "write_bytes",
+		.read_u64 = fsio_get_write_bytes,
+	},
+	{
+		.name = "dirty_bytes",
+		.read_u64 = fsio_get_dirty_bytes,
+	},
+	{
+		.name = "writeback_bytes",
+		.read_u64 = fsio_get_writeback_bytes,
+	},
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys fsio_subsys = {
+	.name = "fsio",
+	.subsys_id = fsio_subsys_id,
+	.css_alloc = fsio_css_alloc,
+	.css_free = fsio_css_free,
+	.base_cftypes = fsio_files,
+};
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 0bd7a55..b8e61ee 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -46,6 +46,7 @@
 #include <linux/kref.h>
 #include <linux/slab.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/fsio_cgroup.h>
 #include <linux/module.h>
 
 #include <linux/nfs_fs.h>
@@ -987,6 +988,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
 		goto out;
 
 	task_io_account_write(count);
+	fsio_account_write(count);
 
 	retval = nfs_direct_write(iocb, iov, nr_segs, pos, count, uio);
 	if (retval > 0) {
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 6e7ec64..d16df6e 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -84,3 +84,9 @@ SUBSYS(bcache)
 #endif
 
 /* */
+
+#if IS_SUBSYS_ENABLED(CONFIG_FSIO_CGROUP)
+SUBSYS(fsio)
+#endif
+
+/* */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 99be011..8156d99 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -421,6 +421,9 @@ struct address_space {
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	void			*private_data;	/* ditto */
+#ifdef CONFIG_FSIO_CGROUP
+	struct fsio_cgroup	*i_fsio;	/* protected with tree_lock */
+#endif
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
diff --git a/include/linux/fsio_cgroup.h b/include/linux/fsio_cgroup.h
new file mode 100644
index 0000000..78aa53b
--- /dev/null
+++ b/include/linux/fsio_cgroup.h
@@ -0,0 +1,136 @@
+#ifndef _LINUX_FSIO_CGROUP_H
+#define _LINUX_FSIO_CGROUP_H
+
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/cgroup.h>
+#include <linux/workqueue.h>
+#include <linux/percpu_counter.h>
+
+struct fsio_cgroup {
+	struct cgroup_subsys_state css;
+	struct percpu_counter read_bytes;
+	struct percpu_counter write_bytes;
+	struct percpu_counter nr_dirty;
+	struct percpu_counter nr_writeback;
+};
+
+#ifdef CONFIG_FSIO_CGROUP
+
+static inline struct fsio_cgroup *task_fsio_cgroup(struct task_struct *task)
+{
+	return container_of(task_subsys_state(task, fsio_subsys_id),
+			    struct fsio_cgroup, css);
+}
+
+/*
+ * This accounts all reads, both cached and direct-io.
+ */
+static inline void fsio_account_read(unsigned long bytes)
+{
+	struct task_struct *task = current;
+	struct fsio_cgroup *fsio;
+
+	rcu_read_lock();
+	fsio = task_fsio_cgroup(task);
+	__percpu_counter_add(&fsio->read_bytes, bytes,
+			PAGE_CACHE_SIZE * percpu_counter_batch);
+	rcu_read_unlock();
+}
+
+/*
+ * This is used for accounting direct-io writes.
+ */
+static inline void fsio_account_write(unsigned long bytes)
+{
+	struct task_struct *task = current;
+	struct fsio_cgroup *fsio;
+
+	rcu_read_lock();
+	fsio = task_fsio_cgroup(task);
+	__percpu_counter_add(&fsio->write_bytes, bytes,
+			PAGE_CACHE_SIZE * percpu_counter_batch);
+	rcu_read_unlock();
+}
+
+/*
+ * This is called under mapping->tree_lock before setting the radix-tree tag;
+ * the page which caused this call is either locked (write) or mapped (unmap).
+ *
+ * If this is the first dirty page in the inode and there is no writeback then
+ * the current fsio cgroup becomes the owner of this inode. The following
+ * radix_tree_tag_set() will pin this pointer till the end of writeback or
+ * truncate. Otherwise mapping->i_fsio is already valid and points to the
+ * cgroup which owns this inode.
+ */
+static inline void fsio_account_page_dirtied(struct address_space *mapping)
+{
+	struct fsio_cgroup *fsio = mapping->i_fsio;
+
+	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) &&
+	    !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+		rcu_read_lock();
+		fsio = task_fsio_cgroup(current);
+		mapping->i_fsio = fsio;
+		rcu_read_unlock();
+	}
+
+	percpu_counter_inc(&fsio->nr_dirty);
+}
+
+/*
+ * This is called after clearing the dirty bit. The page here is locked and
+ * unmapped, thus the dirtying process is complete and mapping->i_fsio is
+ * valid. This state is stable because at this point the page is still in the
+ * mapping and tagged. We cannot reset mapping->i_fsio here even if that was
+ * the last dirty page, because this hook is called without tree_lock before
+ * removing the page from the page cache.
+ */
+static inline void fsio_cancel_dirty_page(struct address_space *mapping)
+{
+	percpu_counter_dec(&mapping->i_fsio->nr_dirty);
+}
+
+/*
+ * This is called after redirtying a page, thus nr_dirty will not fall to zero.
+ * No tree_lock; the page is locked and still tagged in the mapping.
+ */
+static inline void fsio_account_page_redirty(struct address_space *mapping)
+{
+	percpu_counter_dec(&mapping->i_fsio->nr_dirty);
+}
+
+/*
+ * This switches page accounting from dirty to writeback.
+ */
+static inline void fsio_set_page_writeback(struct address_space *mapping)
+{
+	struct fsio_cgroup *fsio = mapping->i_fsio;
+
+	percpu_counter_inc(&fsio->nr_writeback);
+	percpu_counter_dec(&fsio->nr_dirty);
+}
+
+/*
+ * Writeback is done; the mapping->i_fsio pointer may become invalid after this.
+ */
+static inline void fsio_clear_page_writeback(struct address_space *mapping)
+{
+	struct fsio_cgroup *fsio = mapping->i_fsio;
+
+	__percpu_counter_add(&fsio->write_bytes, PAGE_CACHE_SIZE,
+			PAGE_CACHE_SIZE * percpu_counter_batch);
+	percpu_counter_dec(&fsio->nr_writeback);
+}
+
+#else /* CONFIG_FSIO_CGROUP */
+
+static inline void fsio_account_read(unsigned long bytes) {}
+static inline void fsio_account_write(unsigned long bytes) {}
+static inline void fsio_account_page_dirtied(struct address_space *mapping) {}
+static inline void fsio_cancel_dirty_page(struct address_space *mapping) {}
+static inline void fsio_account_page_redirty(struct address_space *mapping) {}
+static inline void fsio_set_page_writeback(struct address_space *mapping) {}
+static inline void fsio_clear_page_writeback(struct address_space *mapping) {}
+
+#endif /* CONFIG_FSIO_CGROUP */
+
+#endif /* _LINUX_FSIO_CGROUP_H */
diff --git a/init/Kconfig b/init/Kconfig
index ea1be00..689b29a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1050,6 +1050,9 @@ config DEBUG_BLK_CGROUP
 	Enable some debugging help. Currently it exports additional stat
 	files in a cgroup which can be useful for debugging.
 
+config FSIO_CGROUP
+	bool "Filesystem IO controller"
+
 endif # CGROUPS
 
 config CHECKPOINT_RESTORE
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a599f38..bc39a36 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/fsio_cgroup.h>
 #include <linux/blkdev.h>
 #include <linux/mpage.h>
 #include <linux/rmap.h>
@@ -1994,6 +1995,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
 		task_io_account_write(PAGE_CACHE_SIZE);
+		fsio_account_page_dirtied(mapping);
 		current->nr_dirtied++;
 		this_cpu_inc(bdp_ratelimits);
 	}
@@ -2071,6 +2073,7 @@ void account_page_redirty(struct page *page)
 		current->nr_dirtied--;
 		dec_zone_page_state(page, NR_DIRTIED);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
+		fsio_account_page_redirty(mapping);
 	}
 }
 EXPORT_SYMBOL(account_page_redirty);
@@ -2242,6 +2245,7 @@ int test_clear_page_writeback(struct page *page)
 			if (bdi_cap_account_writeback(bdi)) {
 				__dec_bdi_stat(bdi, BDI_WRITEBACK);
 				__bdi_writeout_inc(bdi);
+				fsio_clear_page_writeback(mapping);
 			}
 		}
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -2270,8 +2274,10 @@ int test_set_page_writeback(struct page *page)
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-			if (bdi_cap_account_writeback(bdi))
+			if (bdi_cap_account_writeback(bdi)) {
 				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+				fsio_set_page_writeback(mapping);
+			}
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
diff --git a/mm/readahead.c b/mm/readahead.c
index 829a77c..530599c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -15,6 +15,7 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/fsio_cgroup.h>
 #include <linux/pagevec.h>
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
@@ -102,6 +103,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 			break;
 		}
 		task_io_account_read(PAGE_CACHE_SIZE);
+		fsio_account_read(PAGE_CACHE_SIZE);
 	}
 	return ret;
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index e212252..e84668e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -17,6 +17,7 @@
 #include <linux/highmem.h>
 #include <linux/pagevec.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/fsio_cgroup.h>
 #include <linux/buffer_head.h>	/* grr. try_to_release_page,
 				   do_invalidatepage */
 #include <linux/cleancache.h>
@@ -75,6 +76,7 @@ void cancel_dirty_page(struct page *page)
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
 			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
+			fsio_cancel_dirty_page(mapping);
 		}
 	}
 }



* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-08  9:59 [PATCH RFC] fsio: filesystem io accounting cgroup Konstantin Khlebnikov
@ 2013-07-10  4:43 ` Sha Zhengju
  2013-07-10  6:03   ` Konstantin Khlebnikov
  0 siblings, 1 reply; 29+ messages in thread
From: Sha Zhengju @ 2013-07-10  4:43 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: linux-mm, Andrew Morton, LKML

Hi,

On Mon, Jul 8, 2013 at 5:59 PM, Konstantin Khlebnikov
<khlebnikov@openvz.org> wrote:
> This is a proof of concept, just basic functionality for an IO controller.
> This cgroup will control filesystem usage at the vfs layer; its main goal is
> bandwidth control. It's supposed to be much more lightweight than memcg/blkio.
>
> This patch shows an easy way of accounting pages in dirty/writeback state in
> a per-inode manner. This is easier than doing it in memcg in a per-page manner.
> The main idea is to keep on each inode a pointer (->i_fsio) to the cgroup which
> owns the dirty data in that inode. It is set by fsio_account_page_dirtied() when
> the first dirty tag appears in the inode. Relying on mapping tags gives us
> locking for free; this patch doesn't add any new locks to hot paths.

While referring to the dirty/writeback numbers, what I care about is 'how
many dirty pages out of how much memory', and later one may use the
proportion to decide throttling or something else. So if you are talking
about the nr of dirty pages without memcg's amount of memory, I don't see
the meaning of a single number.

What's more, counting dirty/writeback stats in a per-inode manner can
bring inaccuracy in some situations: consider two tasks from different
fsio cgroups dirtying one file concurrently but being counted in only
one fsio cgroup's stats, or a task being moved to another fsio cgroup
after dirtying one file. Talking about task moving, it is the root
cause of adding the memcg locks in the page stat routines, since
there's a race window between 'modify cgroup owner' and 'update stats
using cgroup pointer'. But if you are going to handle task moves or
take care of ->i_fsio for better accuracy in the future, I'm afraid you
will also need some synchronization mechanism in hot paths. Maybe also
a new lock or mapping->tree_lock (which is already hot enough) IMHO.



--
Thanks,
Sha


* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-10  4:43 ` Sha Zhengju
@ 2013-07-10  6:03   ` Konstantin Khlebnikov
  2013-07-10  8:37     ` Sha Zhengju
  0 siblings, 1 reply; 29+ messages in thread
From: Konstantin Khlebnikov @ 2013-07-10  6:03 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, Andrew Morton, LKML, Vivek Goyal, Tejun Heo,
	Jens Axboe, devel, cgroups, Michal Hocko

Sha Zhengju wrote:
> Hi,
>
> On Mon, Jul 8, 2013 at 5:59 PM, Konstantin Khlebnikov
> <khlebnikov@openvz.org>  wrote:
>> >  This is a proof of concept, just basic functionality for an IO controller.
>> >  This cgroup will control filesystem usage at the vfs layer; its main goal is
>> >  bandwidth control. It's supposed to be much more lightweight than memcg/blkio.
>> >
>> >  This patch shows an easy way of accounting pages in dirty/writeback state in
>> >  a per-inode manner. This is easier than doing it in memcg in a per-page manner.
>> >  The main idea is to keep on each inode a pointer (->i_fsio) to the cgroup which
>> >  owns the dirty data in that inode. It is set by fsio_account_page_dirtied() when
>> >  the first dirty tag appears in the inode. Relying on mapping tags gives us
>> >  locking for free; this patch doesn't add any new locks to hot paths.
> While referring to the dirty/writeback numbers, what I care about is 'how
> many dirty pages out of how much memory', and later one may use the
> proportion to decide throttling or something else. So if you are talking
> about the nr of dirty pages without memcg's amount of memory, I don't see
> the meaning of a single number.

I'm planning to add some thresholds or limits to the fsio cgroup -- how many
dirty pages this cgroup may have. memcg is a completely different thing: memcg
controls data storage while fsio controls data flows. Memcg already handles
too much, and I just don't want to add yet more unrelated stuff to it.
Otherwise we will end up with one single controller which handles all possible
resources, because they are all related in some cases.

>
> What's more, counting dirty/writeback stats in a per-inode manner can
> bring inaccuracy in some situations: consider two tasks from different
> fsio cgroups dirtying one file concurrently but being counted in only
> one fsio cgroup's stats, or a task being moved to another fsio cgroup
> after dirtying one file. Talking about task moving, it is the root
> cause of adding the memcg locks in the page stat routines, since
> there's a race window between 'modify cgroup owner' and 'update stats
> using cgroup pointer'. But if you are going to handle task moves or
> take care of ->i_fsio for better accuracy in the future, I'm afraid you
> will also need some synchronization mechanism in hot paths. Maybe also
> a new lock or mapping->tree_lock (which is already hot enough) IMHO.

Yes, per-inode accounting is less accurate. But this approach works really
well in real life. I don't want to add new locks and lose performance just
to fix accuracy for some artificial cases.

to Tejun:
BTW I don't like the volatility of a task's cgroup pointers. I'd like to forbid
moving tasks between cgroups except for 'current'; the existing behavior can be
kept with the help of task_work: instead of an external change of task->cgroups
we can schedule a task_work into the task and change that pointer in 'current'
context. That would save us a lot of rcu_lock/unlock and atomic operations in
grabbing temporary pointers to the current cgroup, because current->cgroups
would be stable. I don't think that external cross-cgroup task migration is
really performance critical. Currently I don't know what to do with kernel
threads and workqueues, but anyway this problem doesn't look unsolvable.



* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-10  6:03   ` Konstantin Khlebnikov
@ 2013-07-10  8:37     ` Sha Zhengju
  0 siblings, 0 replies; 29+ messages in thread
From: Sha Zhengju @ 2013-07-10  8:37 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, Andrew Morton, LKML, Vivek Goyal, Tejun Heo,
	Jens Axboe, devel, cgroups, Michal Hocko, Greg Thelen

On Wed, Jul 10, 2013 at 2:03 PM, Konstantin Khlebnikov
<khlebnikov@openvz.org> wrote:
> Sha Zhengju wrote:
>>
>> Hi,
>>
>> On Mon, Jul 8, 2013 at 5:59 PM, Konstantin Khlebnikov
>> <khlebnikov@openvz.org>  wrote:
>>>
>>> >  This is a proof of concept, just basic functionality for an IO
>>> >  controller. This cgroup will control filesystem usage at the vfs
>>> >  layer; its main goal is bandwidth control. It's supposed to be much
>>> >  more lightweight than memcg/blkio.
>>> >
>>> >  This patch shows an easy way of accounting pages in dirty/writeback
>>> >  state in a per-inode manner. This is easier than doing it in memcg in
>>> >  a per-page manner. The main idea is to keep on each inode a pointer
>>> >  (->i_fsio) to the cgroup which owns the dirty data in that inode. It
>>> >  is set by fsio_account_page_dirtied() when the first dirty tag appears
>>> >  in the inode. Relying on mapping tags gives us locking for free; this
>>> >  patch doesn't add any new locks to hot paths.
>>
>> While referring to the dirty/writeback numbers, what I care about is 'how
>> many dirty pages out of how much memory', and later one may use the
>> proportion to decide throttling or something else. So if you are talking
>> about the nr of dirty pages without memcg's amount of memory, I don't see
>> the meaning of a single number.
>
>
> I'm planning to add some thresholds or limits to the fsio cgroup -- how many
> dirty pages this cgroup may have. memcg is a completely different thing:
> memcg controls data storage while fsio controls data flows. Memcg already
> handles too much, and I just don't want to add yet more unrelated stuff to
> it. Otherwise we will end up with one single

I don't think handling dirty pages is something unrelated to memcg. See
the problem met by others using memcg:
https://lists.linux-foundation.org/pipermail/containers/2013-June/032917.html
Memcg dirty/writeback accounting is also only antipasto; the long-term
work is memcg-aware dirty throttling and possibly a per-memcg flusher.
Greg has already done some work here.


> controller which handles all possible resources, because they are all
> related in some cases.



--
Thanks,
Sha


* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
       [not found]                             ` <20130710030955.GA3569@redhat.com>
@ 2013-07-10  3:50                               ` Tejun Heo
  0 siblings, 0 replies; 29+ messages in thread
From: Tejun Heo @ 2013-07-10  3:50 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Konstantin Khlebnikov, linux-mm, linux-kernel, Michal Hocko,
	cgroups, Andrew Morton, Sha Zhengju, devel, Jens Axboe

Hello,

On Tue, Jul 09, 2013 at 11:09:55PM -0400, Vivek Goyal wrote:
> Stacking drivers are pretty important ones and we expect throttling
> to work with them. By throttling bio, a single hook worked both for
> request based drivers and bio based drivers.

Oh yeah, sure, we have them working now, so there's no way to break
them but that doesn't mean it's a good overall design.  I don't have a
good answer for this one.  The root cause is having the distinction
between bio and rq based drivers.  With the right constructs, I
suspect we probably could have done away with bio based drivers, but,
well, that's all history now.

> So looks like for bio based drivers you want bio throttling and for
> request based drviers, request throttling and define a separate hook
> in blk_queue_bio(). A generic hook probably can check the type of request
> queue and not throttle bio if it is request based queue and ultimately
> request queue based hook will throttle it.
> 
> So in a cgroup we blkio.throttle.io_serviced will have stats for
> bio/request depending on type of device.
> 
> And we will need to modify throttling logic so that it can handle
> both bio and request throttling. Not sure how much of code can be
> shared for bio/request throttling.

I'm not sure how much (de)multiplexing and sharing we'd be doing but
I'm afraid there's gonna need to be some.  We really can't use the
same logic for SSDs and rotating rusts after all and it probably would
be best to avoid contaminating SSD paths with lots of guesstimating
logics necessary for rotating rusts.

> I am not sure about request based multipath driver and it might
> require some special handling.

If it's not supported now, I'll be happy with just leaving it alone
and telling mp users to configure the underlying queues.

> Is it roughly inline with what you have been thinking.

I'm hoping to keep it somewhat manageable at least.  I wouldn't mind
leaving stacking driver and cfq-iosched support as they are while only
supporting SSD devices with new code.  It's all pie in the sky at this
point and none of this matters before we fix the bdi and writeback
issue anyway.

Thanks.

-- 
tejun


* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-09 18:35                           ` Vivek Goyal
@ 2013-07-09 20:54                             ` Konstantin Khlebnikov
  0 siblings, 0 replies; 29+ messages in thread
From: Konstantin Khlebnikov @ 2013-07-09 20:54 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tejun Heo, linux-mm, linux-kernel, Michal Hocko, cgroups,
	Andrew Morton, Sha Zhengju, devel, Jens Axboe

Vivek Goyal wrote:
> On Tue, Jul 09, 2013 at 09:42:57PM +0400, Konstantin Khlebnikov wrote:
>
> [..]
>>> So what kind of priority inversion are you facing with blkcg and how would
>>> you avoid it with your implementation?
>>>
>>> I know that serialization can happen at the filesystem level while trying
>>> to commit the journal. But I think the same thing will happen with your
>>> implementation too.
>>
>> Yes, metadata changes are serialized and they depend on data commits, thus
>> the block layer cannot delay write requests without introducing nasty
>> priority inversions.
>
> Tejun had some thoughts on how to solve this problem. I don't
> remember the details though. Tejun?
>
>> Cached read requests cannot be delayed at all.
>
> Who wants to delay reads which are coming out of the cache? That sounds
> like a mis-feature.

Nope, I'm talking about reading into the page cache. If a page is already here
but still not uptodate because its read request was delayed, then a following
cached read will wait for this delayed request too.

>
>> All solutions either break throttling or add PI. So the block layer is
>> just the wrong place for this.
>
> Well, implementing throttling at the block layer can allow you to cache
> writes so that the application does not see the delay for small writes; at
> the same time it protects against that burst being visible on the device and
> impacting other IO going to the device.
>
> Not sure how much it matters, but at least this was one discussion
> point in the past. Implementing it at the device level provides better
> control when it comes to avoiding interference from bursty buffered
> writes.
>
>>
>>>
>>> One simple way of avoiding that will be to throttle IO even earlier,
>>> but that means we do not take advantage of the writeback cache and
>>> buffered writes will slow down.
>>
>> If we want to control writeback speed we must also control the size of the
>> dirty set. There are several possibilities: we can either start writeback
>> earlier, or, when the dirty set exceeds some threshold, start charging that
>> dirty memory to the throttler and slow down all tasks which generate it.
>> Because the dirty memory is charged and accounted, we can write it out
>> without delays.
>
> Ok, so this is equivalent to allowing bursty IO. Admit a burst of IO
> (the dirty set) and then apply throttling rules. The dirty set can be
> flushed without throttling if sync requires that, but future admission of
> IO will be delayed. That can avoid PI problems arising due to file system
> journaling.
>
> We have discussed implementing throttling at a higher layer in the past
> too. Various proof-of-concept implementations have been posted to do
> throttling in a higher layer.
>
> blk-throttle: Throttle buffered WRITEs in balance_dirty_pages()
> https://lkml.org/lkml/2011/6/28/243
>
> buffered write IO controller in balance_dirty_pages()
> https://lkml.org/lkml/2012/3/28/275
>
> Andrea Righi had posted some proof of concept implementations too.
>
> None of these implementations ever made any progress. Tejun always
> liked the idea of doing throttling at lower layers and then generating
> back pressure on the bdi, which in turn controls the size of the dirty set.

Ok, thanks. I'll see them. I have made similar 'blance_dirty_pages'-like engine
for our commercial product and works perfectly for several years. At the same
time prioritization in CFQ never worked for me, and I've give up trying to fix it.

My idea is to do the accounting at a lower layer and inject delays into the
tasks that generate that pressure. We can avoid PI because the delays can be
injected at safe places where tasks don't hold any locks. I've found that the
recently added 'task_work' interface fits perfectly for injecting delays into
tasks; it's probably overkill, but I really like how it works.
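
Roughly, a minimal sketch of that kind of delay injection, assuming the
3.10-era task_work API (the fsio_* names and the fixed delay are hypothetical,
not from the posted patch):

#include <linux/sched.h>
#include <linux/task_work.h>
#include <linux/slab.h>
#include <linux/delay.h>

struct fsio_delay_work {
	struct callback_head work;
	unsigned long delay_ms;
};

/* runs when the task returns to userspace: no kernel locks are held
 * there, so sleeping cannot create a priority inversion */
static void fsio_delay_fn(struct callback_head *head)
{
	struct fsio_delay_work *dw =
		container_of(head, struct fsio_delay_work, work);

	msleep(dw->delay_ms);
	kfree(dw);
}

static int fsio_inject_delay(struct task_struct *task, unsigned long delay_ms)
{
	struct fsio_delay_work *dw = kmalloc(sizeof(*dw), GFP_ATOMIC);

	if (!dw)
		return -ENOMEM;
	dw->delay_ms = delay_ms;
	init_task_work(&dw->work, fsio_delay_fn);
	/* 'true' asks for TIF_NOTIFY_RESUME so the work runs promptly */
	return task_work_add(task, &dw->work, true);
}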

>
> To me, solving the issue of priority inversion in file systems is the
> important one. If we can't solve it reasonably with the existing
> mechanism, it does make a case for why throttling at a higher level
> might be interesting.
>
>>
>>>
>>> So I am curious how you would take care of this serialization issue.
>>>
>>> Also, for the throttlers you are planning to implement, what kind of
>>> throttling do they provide? Is the throttling rate per cgroup, per file
>>> per cgroup, or will the rules be per bdi per cgroup, or something else?
>>
>> Currently I'm thinking about per-cgroup x per-tier rules. Each bdi will be
>> assigned to some tier. It's flexible enough and solves the chicken-and-egg
>> problem: when a disk appears it will be assigned to the default tier and
>> can be reassigned later.
>
> Ok, this is a completely orthogonal issue. It has nothing to do with whether
> to apply throttling at the block layer or at a higher layer.
>
> To solve the chicken-and-egg problem we need to take the help of user space
> here, rather than relying on the kernel to store the rules and apply them
> when devices show up.
>
> Also, how would you create rules for assigning a bdi to a tier? How would
> you identify a bdi uniquely in a persistent manner?

That's the point: I don't want to do this in the kernel. I'll let userspace
configure it. It can precreate a bunch of tiers, configure them, and assign a
bdi to a tier before mounting the filesystem, or switch them at runtime. Or it
could somehow tell the kernel that all 'nfs' bdis must be placed in one tier
while all 'usb' bdis go in another. Sounds ambitious... ok, let's set this
question aside for a while.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-09 17:42                         ` Konstantin Khlebnikov
@ 2013-07-09 18:35                           ` Vivek Goyal
  2013-07-09 20:54                             ` Konstantin Khlebnikov
  0 siblings, 1 reply; 29+ messages in thread
From: Vivek Goyal @ 2013-07-09 18:35 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Tejun Heo, linux-mm, linux-kernel, Michal Hocko, cgroups,
	Andrew Morton, Sha Zhengju, devel, Jens Axboe

On Tue, Jul 09, 2013 at 09:42:57PM +0400, Konstantin Khlebnikov wrote:

[..]
> >So what kind of priority inversion are you facing with blkcg, and how
> >would you avoid it with your implementation?
> >
> >I know that serialization can happen at the filesystem level while trying
> >to commit the journal. But I think the same thing will happen with your
> >implementation too.
> 
> Yes, metadata changes are serialized and they depend on data commits,
> thus the block layer cannot delay write requests without introducing nasty
> priority inversions.

Tejun had some thoughts on how to solve this problem. I don't
remember the details though. Tejun?

> Cached read requests cannot be delayed at all.

Who wants to delay reads that are coming out of the cache? That sounds
like a mis-feature.

> All solutions either
> break throttling or add PI, so the block layer is just the wrong place for this.

Well, implementing throttling at the block layer lets you cache writes
so that the application does not see the delay for small writes; at the
same time it protects against that burst being visible on the device and
impacting other IO going to the device.

Not sure how much it matters, but at least this was one discussion
point in the past. Implementing it at the device level provides better
control when it comes to avoiding interference from bursty buffered
writes.

> 
> >
> >One simple way of avoiding that would be to throttle IO even earlier,
> >but that means we do not take advantage of the writeback cache and
> >buffered writes will slow down.
> 
> If we want to control writeback speed we must also control the size of the
> dirty set. There are several possibilities: we can either start writeback
> earlier, or, once the dirty set exceeds some threshold, start charging that
> dirty memory to the throttler and slow down all tasks that generate it.
> Because the dirty memory is charged and accounted, we can write it out
> without delays.

Ok, so this is equivalent to allowing bursty IO. Admit a burst of IO
(the dirty set) and then apply throttling rules. The dirty set can be
flushed without throttling if sync requires it, but future admission of
IO will be delayed. That can avoid PI problems arising from file system
journaling.

We have discussed implementing throttling at a higher layer in the past
too.  Various proof-of-concept implementations have been posted to do
throttling at a higher layer.

blk-throttle: Throttle buffered WRITEs in balance_dirty_pages()
https://lkml.org/lkml/2011/6/28/243

buffered write IO controller in balance_dirty_pages()
https://lkml.org/lkml/2012/3/28/275

Andrea Righi had posted some proof of concept implementations too.

None of these implementations ever made any progress. Tejun always
liked the idea of doing throttling at lower layers and then generating
back pressure on the bdi, which in turn controls the size of the dirty set.

To me, solving the issue of priority inversion in file systems is the
important one. If we can't solve it reasonably with the existing
mechanism, it does make a case for why throttling at a higher level
might be interesting.

> 
> >
> >So I am curious how you would take care of this serialization issue.
> >
> >Also, for the throttlers you are planning to implement, what kind of
> >throttling do they provide? Is the throttling rate per cgroup, per file
> >per cgroup, or will the rules be per bdi per cgroup, or something else?
> 
> Currently I'm thinking about per-cgroup x per-tier rules. Each bdi will be
> assigned to some tier. It's flexible enough and solves the chicken-and-egg
> problem: when a disk appears it will be assigned to the default tier and
> can be reassigned later.

Ok, this is a completely orthogonal issue. It has nothing to do with whether
to apply throttling at the block layer or at a higher layer.

To solve the chicken-and-egg problem we need to take the help of user space
here, rather than relying on the kernel to store the rules and apply them
when devices show up.

Also, how would you create rules for assigning a bdi to a tier? How would
you identify a bdi uniquely in a persistent manner?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-09 15:06                       ` Vivek Goyal
@ 2013-07-09 17:42                         ` Konstantin Khlebnikov
  2013-07-09 18:35                           ` Vivek Goyal
  0 siblings, 1 reply; 29+ messages in thread
From: Konstantin Khlebnikov @ 2013-07-09 17:42 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tejun Heo, linux-mm, linux-kernel, Michal Hocko, cgroups,
	Andrew Morton, Sha Zhengju, devel, Jens Axboe

Vivek Goyal wrote:
> On Tue, Jul 09, 2013 at 06:35:54PM +0400, Konstantin Khlebnikov wrote:
>
> [..]
>> I'm not interested in QoS or proportional control. Let schedulers do it.
>> I want just bandwidth control. I don't want to write a new scheduler
>> or extend some of existing one. I want implement simple and lightweight
>> accounting and add couple of throttlers on top of that.
>> It can be easily done without violation of that hierarchical design.
>>
>> The same problem already has happened with cpu scheduler. It has really
>> complicated rate limiter which is actually useless in the real world because
>> it triggers all possible priority inversions since it puts bunch of tasks into
>> deep sleep while some of them may hold kernel locks. Perfect.
>>
>> QoS and scheduling policy are good thing, but rate-limiting must be separated
>> and done only in places where it doesn't leads to these problems.
>
> So what kind of priority inversion are you facing with blkcg, and how
> would you avoid it with your implementation?
>
> I know that serialization can happen at the filesystem level while trying
> to commit the journal. But I think the same thing will happen with your
> implementation too.

Yes, metadata changes are serialized and they depend on data commits,
thus the block layer cannot delay write requests without introducing nasty
priority inversions. Cached read requests cannot be delayed at all. All
solutions either break throttling or add PI, so the block layer is just the
wrong place for this.

>
> One simple way of avoiding that would be to throttle IO even earlier,
> but that means we do not take advantage of the writeback cache and
> buffered writes will slow down.

If we want to control writeback speed we must also control the size of the
dirty set. There are several possibilities: we can either start writeback
earlier, or, once the dirty set exceeds some threshold, start charging that
dirty memory to the throttler and slow down all tasks that generate it.
Because the dirty memory is charged and accounted, we can write it out
without delays.
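
Roughly like this (a conceptual sketch only; dirty_threshold and
fsio_throttler_charge() are hypothetical and not part of the posted patch,
while struct fsio_cgroup and its nr_dirty counter are):

#include <linux/percpu_counter.h>

/* hypothetical throttler entry point, not in the posted patch */
void fsio_throttler_charge(struct fsio_cgroup *fsio, long nr_pages);

static void fsio_account_dirty_sketch(struct fsio_cgroup *fsio, long nr_pages)
{
	percpu_counter_add(&fsio->nr_dirty, nr_pages);

	if (percpu_counter_read_positive(&fsio->nr_dirty) >
	    fsio->dirty_threshold) {
		/* charge the pages to the throttler now and stall the
		 * dirtier, so writeback can later write them out without
		 * injecting any delay of its own */
		fsio_throttler_charge(fsio, nr_pages);
	}
}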

>
> So I am curious how you would take care of this serialization issue.
>
> Also, for the throttlers you are planning to implement, what kind of
> throttling do they provide? Is the throttling rate per cgroup, per file
> per cgroup, or will the rules be per bdi per cgroup, or something else?

Currently I'm thinking about per-cgroup x per-tier rules. Each bdi will be
assigned to some tier. It's flexible enough and solves the chicken-and-egg
problem: when a disk appears it will be assigned to the default tier and can
be reassigned later.
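
Something along these lines (a rough sketch; all the names, the tier count
and the bdi->fsio_tier field are hypothetical):

#include <linux/types.h>
#include <linux/backing-dev.h>

#define FSIO_MAX_TIERS	8	/* e.g. hdd/ssd/usb/nfs plus spares */

struct fsio_tier_limits {
	u64	bps_limit;	/* bytes per second, 0 == unlimited */
	u64	iops_limit;	/* IOs per second, 0 == unlimited */
};

/* one set of limits per tier in each cgroup: the rules exist before any
 * disk shows up, the bdi only selects which column applies to it */
struct fsio_cgroup_rules {
	struct fsio_tier_limits	tier[FSIO_MAX_TIERS];
};

static inline struct fsio_tier_limits *
fsio_limits(struct fsio_cgroup_rules *rules, struct backing_dev_info *bdi)
{
	return &rules->tier[bdi->fsio_tier];	/* hypothetical field */
}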

>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-09 15:39 ` Theodore Ts'o
@ 2013-07-09 17:12   ` Konstantin Khlebnikov
  0 siblings, 0 replies; 29+ messages in thread
From: Konstantin Khlebnikov @ 2013-07-09 17:12 UTC (permalink / raw)
  To: Theodore Ts'o, linux-mm, linux-kernel, Michal Hocko, cgroups,
	Andrew Morton, Sha Zhengju, devel

Theodore Ts'o wrote:
> Another major problem with this concept is that it lumps all I/O's
> into a single cgroup.  So I/O's from pseudo filesystems (such as
> reading from /sys/kernel/debug/tracing/trace_pipe), networked file
> systems such as NFS, and I/O to various different block devices all
> get counted in a single per-cgroup limit.
>
> This doesn't seem terribly useful to me.  Network resources and block
> resources are quite different, and counting pseudo file systems and
> ram disks makes no sense at all.

Yep, I know. I've already mentioned this as the first planned improvement:

|* Split bdis into several tiers and account them separately, for example:
|  hdd/ssd/usb/nfs. In complicated containerized environments there might be
|  different kinds of storage with different limits and billing. This is more
|  useful than independent per-disk accounting and much easier to implement,
|  because all per-tier structures are allocated before any disk appears.

Accounting each BDI separately isn't very useful either, so I've chosen
something in the middle.

>
> Regards,
>
> 					- Ted
>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-08 10:01 Konstantin Khlebnikov
  2013-07-08 17:00 ` Tejun Heo
  2013-07-08 18:11 ` Vivek Goyal
@ 2013-07-09 15:39 ` Theodore Ts'o
  2013-07-09 17:12   ` Konstantin Khlebnikov
  2 siblings, 1 reply; 29+ messages in thread
From: Theodore Ts'o @ 2013-07-09 15:39 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, Michal Hocko, cgroups, Andrew Morton,
	Sha Zhengju, devel

Another major problem with this concept is that it lumps all I/O's
into a single cgroup.  So I/O's from pseudo filesystems (such as
reading from /sys/kernel/debug/tracing/trace_pipe), networked file
systems such as NFS, and I/O to various different block devices all
get counted in a single per-cgroup limit.

This doesn't seem terribly useful to me.  Network resources and block
resources are quite different, and counting pseudo file systems and
ram disks makes no sense at all.

Regards,

					- Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-09 14:54                         ` Vivek Goyal
@ 2013-07-09 15:08                           ` Tejun Heo
       [not found]                             ` <20130710030955.GA3569@redhat.com>
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2013-07-09 15:08 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Konstantin Khlebnikov, linux-mm, linux-kernel, Michal Hocko,
	cgroups, Andrew Morton, Sha Zhengju, devel, Jens Axboe

Hello, Vivek.

On Tue, Jul 09, 2013 at 10:54:30AM -0400, Vivek Goyal wrote:
> It is not clear whether counting bios or counting requests is the right
> thing to do here. It depends where you are trying to throttle. Bio-based
> drivers have no request and need a throttling mechanism too, so keeping
> it common for both kind of makes sense.

It gets weird because we may end up with wildly disagreeing statistics
from the queue and the resource management.  It should have been part of
the request_queue, not something sitting on top.  Note that with
multi-queue support, we're unlikely to need bio based drivers except
for the stacking ones.

> Ok, so first of all you agree that time slice management is not a
> requirement for fast devices.

Not fast, but consistent.

> So time slice management is a problem even on slow devices which implement
> NCQ. IIRC, in the beginning even CFQ was doing some kind of request
> management (and not time slice management), and it later switched to
> time slice management in an effort to provide better fairness (if somebody
> is doing random IO and seeks take more time, the process should be
> accounted for it).
>
> But ideal time slice accounting requires driving a queue depth of 1,
> and for any non-sequential IO that kills performance.

Yeap, complete control only works with qd == 1 and even then write
buffering will throw you off.  But even w/ qd > 1 and write buffering,
time slice is fundamentally the right thing to manage for disks, rather
than iops - e.g. you want to group IOs from the same issuer in the same
time slice, even if the time accounting for that is not accurate, so
that you can size the slice according to the operating characteristics
of the device and do things like idling in between.

> Seriously, time slice accounting is one way of managing the resource. The
> same disk resource can be divided proportionally by counting either iops
> or the amount of IO done (bandwidth).

In practice, bio iops based proportional control becomes almost
completely worthless if you have any mix of random and sequential
accesses.  cfq wouldn't be accurate but it'd be *far* closer than
anything based on iops.

> If we count iops or bandwidth, it might not be the fairest way of doing
> things on rotational media, but it should also provide more accurate
> results in the case of NCQ. When multiple requests have been dispatched
> to the disk we have no idea which request consumed how much disk time,
> so there is no way to account for it properly. Iops- or bandwidth-based
> accounting will work just fine even with NCQ.

Sure, if iops or bw is what you explicitly want to control with hard
limits, it's fine, but doing proportional control with that on a
rotating disk is just silly.

> So you want this generic block layer proportional implementation to
> do time slice management?
>
> I thought we talked about this implementation using some kind of
> token-based mechanism so that it scales better on faster devices, and
> on slower devices one will continue to use CFQ.

I want to leave rotating disk proportional control to cfq-iosched for
as long as it matters and do iops / bw based things in the generic
layer.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-09 14:35                     ` Konstantin Khlebnikov
  2013-07-09 14:42                       ` Tejun Heo
@ 2013-07-09 15:06                       ` Vivek Goyal
  2013-07-09 17:42                         ` Konstantin Khlebnikov
  1 sibling, 1 reply; 29+ messages in thread
From: Vivek Goyal @ 2013-07-09 15:06 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Tejun Heo, linux-mm, linux-kernel, Michal Hocko, cgroups,
	Andrew Morton, Sha Zhengju, devel, Jens Axboe

On Tue, Jul 09, 2013 at 06:35:54PM +0400, Konstantin Khlebnikov wrote:

[..]
> I'm not interested in QoS or proportional control. Let the schedulers do
> that. I just want bandwidth control. I don't want to write a new scheduler
> or extend an existing one; I want to implement simple, lightweight
> accounting and add a couple of throttlers on top of it. That can easily
> be done without violating the hierarchical design.
>
> The same problem has already happened with the cpu scheduler. It has a
> really complicated rate limiter which is actually useless in the real world
> because it triggers all possible priority inversions, since it puts a bunch
> of tasks into deep sleep while some of them may hold kernel locks. Perfect.
>
> QoS and scheduling policy are good things, but rate-limiting must be
> separated out and done only in places where it doesn't lead to these
> problems.

So what kind of priority inversion are you facing with blkcg, and how
would you avoid it with your implementation?

I know that serialization can happen at the filesystem level while trying
to commit the journal. But I think the same thing will happen with your
implementation too.

One simple way of avoiding that would be to throttle IO even earlier,
but that means we do not take advantage of the writeback cache and
buffered writes will slow down.

So I am curious how you would take care of this serialization issue.

Also, for the throttlers you are planning to implement, what kind of
throttling do they provide? Is the throttling rate per cgroup, per file
per cgroup, or will the rules be per bdi per cgroup, or something else?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-09 14:29                       ` Tejun Heo
@ 2013-07-09 14:54                         ` Vivek Goyal
  2013-07-09 15:08                           ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread
From: Vivek Goyal @ 2013-07-09 14:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Konstantin Khlebnikov, linux-mm, linux-kernel, Michal Hocko,
	cgroups, Andrew Morton, Sha Zhengju, devel, Jens Axboe

On Tue, Jul 09, 2013 at 07:29:08AM -0700, Tejun Heo wrote:
> On Tue, Jul 09, 2013 at 10:18:33AM -0400, Vivek Goyal wrote:
> > For implementing throttling one does not as such have to do time
> > slice management on the queue.  For providing constructs like IOPS
> > or bandwidth throttling, one just needs to put one throttling knob
> > in the cgroup pipe, irrespective of time slice management on the
> > backing device/network.
> 
> We should be providing a comprehensive mechanism to be used from
> userland, not something which serves pieces of specialized
> requirements here and there.  blkio is already a mess with the
> capability changing depending on which elevator is in use and
> blk-throttle counting bios instead of merged requests making iops
> control a bit silly.  We need to clean that up, not add more mess on
> top.

It is not clear whether counting bios or counting requests is the right
thing to do here. It depends where you are trying to throttle. Bio-based
drivers have no request and need a throttling mechanism too, so keeping
it common for both kind of makes sense.

> 
> > Also time slice management is one way of managing the backend resource.
> > CFQ did that and it works only for slow devices. For faster devices
> > we anyway need some kind of token mechanism instead of keeping track
> > of time.
> 
> No, it is the *right* resource to manage for rotating devices if you
> want any sort of meaningful proportional resource distribution.  It's
> not something one dreams up out of the blue but something which arises
> from the fundamental operating characteristics of the device.  For
> SSDs, iops is good enough as their latency profile is consistent
> enough, but doing so with rotating disks doesn't yield anything useful.

Ok, so first of all you agree that time slice management is not a
requirement for fast devices.

Secondly, even for slow devices, time slice management practically works
only if NCQ is not implemented in the device, or NCQ is not being used
because CFQ is not dispatching more requests.

So even in CFQ, time slice accounting works only for sequential IO. For
anybody doing random IO there is no notion of a time slice. We allow
dispatching requests from multiple queues at the same time, and then
we don't have a way to count time.

So time slice management is a problem even on slow devices which implement
NCQ. IIRC, in the beginning even CFQ was doing some kind of request
management (and not time slice management), and it later switched to
time slice management in an effort to provide better fairness (if somebody
is doing random IO and seeks take more time, the process should be
accounted for it).

But ideal time slice accounting requires driving a queue depth of 1,
and for any non-sequential IO that kills performance.

> 
> > So I don't think trying to manage time slice is the requirement here.
> 
> For a cgroup resource controller, it *is* a frigging requirement to
> control the right fundamental resource at the right place where the
> resource resides and can be fully controlled.  Nobody should have any
> other impression.

Seriously, time slice accounting is one way of managing the resource. The
same disk resource can be divided proportionally by counting either iops
or the amount of IO done (bandwidth).

If we count iops or bandwidth, it might not be the fairest way of doing
things on rotational media, but it should also provide more accurate
results in the case of NCQ. When multiple requests have been dispatched
to the disk we have no idea which request consumed how much disk time,
so there is no way to account for it properly. Iops- or bandwidth-based
accounting will work just fine even with NCQ.

> 
> > > and by the time you've implemented proper hierarchy support and
> > > proportional control, yours isn't gonna be that simple either.
> >
> > I suspect he is not planning to do any proportional control at that
> > layer, just a throttling mechanism.
> 
> blkio should be able to do proportional control in general.  The fact
> that we aren't able to do that except when cfq-iosched is in use is a
> problem which needs to be fixed.  It's not a free-for-all pass for
> creating more broken stuff.

So you want this generic block layer proportional implementation to
do time slice management?

I thought we talked about this implementation using some kind of
token-based mechanism so that it scales better on faster devices, and on
slower devices one will continue to use CFQ.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-09 14:35                     ` Konstantin Khlebnikov
@ 2013-07-09 14:42                       ` Tejun Heo
  2013-07-09 15:06                       ` Vivek Goyal
  1 sibling, 0 replies; 29+ messages in thread
From: Tejun Heo @ 2013-07-09 14:42 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Vivek Goyal, linux-mm, linux-kernel, Michal Hocko, cgroups,
	Andrew Morton, Sha Zhengju, devel, Jens Axboe

On Tue, Jul 09, 2013 at 06:35:54PM +0400, Konstantin Khlebnikov wrote:
> I'm not interested in QoS or proportional control. Let the schedulers do
> that. I just want bandwidth control. I don't want to write a new scheduler

Well, I'm not interested in adding more half-assed stuff on top of the
existing mess.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-09 13:45                   ` Tejun Heo
  2013-07-09 14:18                     ` Vivek Goyal
@ 2013-07-09 14:35                     ` Konstantin Khlebnikov
  2013-07-09 14:42                       ` Tejun Heo
  2013-07-09 15:06                       ` Vivek Goyal
  1 sibling, 2 replies; 29+ messages in thread
From: Konstantin Khlebnikov @ 2013-07-09 14:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vivek Goyal, linux-mm, linux-kernel, Michal Hocko, cgroups,
	Andrew Morton, Sha Zhengju, devel, Jens Axboe

Tejun Heo wrote:
> On Tue, Jul 09, 2013 at 05:43:10PM +0400, Konstantin Khlebnikov wrote:
>> My concept is a cgroup which would control io operations at the vfs
>> layer for all filesystems.  It will account for and manage IO
>> operations. I've found a really lightweight technique for accounting
>> and throttling which doesn't introduce new locks or priority
>> inversions (which are the major problem in all existing throttlers,
>> including the cpu cgroup rate limiter).  So I've tried to keep the
>> code as small, clean and sane as possible, while you guys are trying
>> to push me into the block layer =)
>
> You're trying to implement QoS in the place where you don't have
> control of the queue itself.  You aren't even managing the right type
> of resource for disks, which is time slice rather than iops or
> bandwidth, and by the time you've implemented proper hierarchy support
> and proportional control, yours isn't gonna be that simple either.  The
> root problem is bdi failing to propagate pressure from the actual
> queue upwards.  Fix that.
>

I'm not interested in QoS or proportional control. Let the schedulers do
that. I just want bandwidth control. I don't want to write a new scheduler
or extend an existing one; I want to implement simple, lightweight
accounting and add a couple of throttlers on top of it. That can easily
be done without violating the hierarchical design.

The same problem has already happened with the cpu scheduler. It has a
really complicated rate limiter which is actually useless in the real world
because it triggers all possible priority inversions, since it puts a bunch
of tasks into deep sleep while some of them may hold kernel locks. Perfect.

QoS and scheduling policy are good things, but rate-limiting must be
separated out and done only in places where it doesn't lead to these
problems.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-09 14:18                     ` Vivek Goyal
@ 2013-07-09 14:29                       ` Tejun Heo
  2013-07-09 14:54                         ` Vivek Goyal
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2013-07-09 14:29 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Konstantin Khlebnikov, linux-mm, linux-kernel, Michal Hocko,
	cgroups, Andrew Morton, Sha Zhengju, devel, Jens Axboe

On Tue, Jul 09, 2013 at 10:18:33AM -0400, Vivek Goyal wrote:
> For implementing throttling one as such does not have to do time
> slice management on the queue.  For providing constructs like IOPS
> or bandwidth throttling, one just need to put one throttling knob 
> in the cgroup pipe irrespective of time slice management on the
> backing device/network.

We should be providing a comprehensive mechanism to be used from
userland, not something which serves pieces of specialized
requirements here and there.  blkio is already a mess with the
capability changing depending on which elevator is in use and
blk-throttle counting bios instead of merged requests making iops
control a bit silly.  We need to clean that up, not add more mess on
top.

> Also time slice management is one way of managing the backend resource.
> CFQ did that and it works only for slow devices. For faster devices
> we anyway need some kind of token mechanism instead of keeping track
> of time.

No, it is the *right* resource to manage for rotating devices if you
want any sort of meaningful proportional resource distribution.  It's
not something one dreams up out of the blue but something which arises
from the fundamental operating characteristics of the device.  For
SSDs, iops is good enough as their latency profile is consistent
enough, but doing so with rotating disks doesn't yield anything useful.

> So I don't think trying to manage time slice is the requirement here.

For a cgroup resource controller, it *is* a frigging requirement to
control the right fundamental resource at the right place where the
resource resides and can be fully controlled.  Nobody should have any
other impression.

> > and by the time you've implemented proper hierarchy support and
> > proportional control, yours isn't gonna be that simple either.
>
> I suspect he is not planning to do any proportional control at that
> layer, just a throttling mechanism.

blkio should be able to do proportional control in general.  The fact
that we aren't able to do that except when cfq-iosched is in use is a
problem which needs to be fixed.  It's not a free-for-all pass for
creating more broken stuff.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-09 13:45                   ` Tejun Heo
@ 2013-07-09 14:18                     ` Vivek Goyal
  2013-07-09 14:29                       ` Tejun Heo
  2013-07-09 14:35                     ` Konstantin Khlebnikov
  1 sibling, 1 reply; 29+ messages in thread
From: Vivek Goyal @ 2013-07-09 14:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Konstantin Khlebnikov, linux-mm, linux-kernel, Michal Hocko,
	cgroups, Andrew Morton, Sha Zhengju, devel, Jens Axboe

On Tue, Jul 09, 2013 at 06:45:58AM -0700, Tejun Heo wrote:
> On Tue, Jul 09, 2013 at 05:43:10PM +0400, Konstantin Khlebnikov wrote:
> > My concept is a cgroup which would control io operations at the vfs
> > layer for all filesystems.  It will account for and manage IO
> > operations. I've found a really lightweight technique for accounting
> > and throttling which doesn't introduce new locks or priority
> > inversions (which are the major problem in all existing throttlers,
> > including the cpu cgroup rate limiter).  So I've tried to keep the
> > code as small, clean and sane as possible, while you guys are trying
> > to push me into the block layer =)
> 
> You're trying to implement QoS in the place where you don't have
> control of the queue itself.

> You aren't even managing the right type
> of resource for disks, which is time slice rather than iops or
> bandwidth

For implementing throttling one does not as such have to do time
slice management on the queue.  For providing constructs like IOPS
or bandwidth throttling, one just needs to put one throttling knob
in the cgroup pipe, irrespective of time slice management on the
backing device/network.

Also time slice management is one way of managing the backend resource.
CFQ did that and it works only for slow devices. For faster devices
we anyway need some kind of token mechanism instead of keeping track
of time.

So I don't think trying to manage time slice is the requirement here.

> and by the time you've implemented proper hierarchy support and
> proportional control, yours isn't gonna be that simple either.

I suspect he is not planning to do any proportional control at that
layer, just a throttling mechanism.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-09 13:43                 ` Konstantin Khlebnikov
@ 2013-07-09 13:45                   ` Tejun Heo
  2013-07-09 14:18                     ` Vivek Goyal
  2013-07-09 14:35                     ` Konstantin Khlebnikov
  0 siblings, 2 replies; 29+ messages in thread
From: Tejun Heo @ 2013-07-09 13:45 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Vivek Goyal, linux-mm, linux-kernel, Michal Hocko, cgroups,
	Andrew Morton, Sha Zhengju, devel, Jens Axboe

On Tue, Jul 09, 2013 at 05:43:10PM +0400, Konstantin Khlebnikov wrote:
> My concept it cgroup which would control io operation on vfs layer
> for all filesystems.  It will account and manage IO operations. I've
> found really lightweight technique for accounting and throttling
> which don't introduce new locks or priority inversions (which is
> major problem in all existing throttlers, including cpu cgroup rate
> limiter) So, I've tried to keep code smaller, cleaner and saner as
> possible while you guys are trying to push me into the block layer
> =)

You're trying to implement QoS in the place where you don't have
control of the queue itself.  You aren't even managing the right type
of resource for disks, which is time slice rather than iops or
bandwidth, and by the time you've implemented proper hierarchy support
and proportional control, yours isn't gonna be that simple either.  The
root problem is bdi failing to propagate pressure from the actual
queue upwards.  Fix that.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-09 13:16               ` Tejun Heo
@ 2013-07-09 13:43                 ` Konstantin Khlebnikov
  2013-07-09 13:45                   ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread
From: Konstantin Khlebnikov @ 2013-07-09 13:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vivek Goyal, linux-mm, linux-kernel, Michal Hocko, cgroups,
	Andrew Morton, Sha Zhengju, devel, Jens Axboe

Tejun Heo wrote:
> On Tue, Jul 09, 2013 at 06:16:05AM -0700, Tejun Heo wrote:
>> On Tue, Jul 09, 2013 at 05:15:14PM +0400, Konstantin Khlebnikov wrote:
>>> blkio controls block devices, not filesystems or superblocks or bdi or pagecache.
>>> It's all about the block layer and nothing more. Am I right?
>>>
>>> So, you want to link some completely unrelated subsystems like NFS into the block layer?
>>
>> Heh, yeah, sure, network QoS is completely unrelated to sockets too,
>> right?
>
> And, no, blkio wouldn't have anything to do with NFS.  Where did you
> get that idea?
>

My concept is a cgroup which would control io operations at the vfs layer for all
filesystems. It will account for and manage IO operations. I've found a really
lightweight technique for accounting and throttling which doesn't introduce new locks
or priority inversions (which are the major problem in all existing throttlers,
including the cpu cgroup rate limiter). So I've tried to keep the code as small, clean
and sane as possible, while you guys are trying to push me into the block layer =)
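
The core of that technique, as a simplified sketch (condensed from the RFC
patch; task_fsio() is a hypothetical lookup of the caller's fsio cgroup):

/* called with mapping->tree_lock already held by the dirtying path,
 * before the new dirty tag is set: if no dirty tag exists yet, the
 * current task is the first dirtier and its cgroup takes ownership of
 * the inode - no new locks anywhere in the hot path */
static void fsio_account_page_dirtied_sketch(struct address_space *mapping)
{
	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
		mapping->i_fsio = task_fsio(current);
	percpu_counter_inc(&mapping->i_fsio->nr_dirty);
}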

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-09 13:16             ` Tejun Heo
@ 2013-07-09 13:16               ` Tejun Heo
  2013-07-09 13:43                 ` Konstantin Khlebnikov
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2013-07-09 13:16 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Vivek Goyal, linux-mm, linux-kernel, Michal Hocko, cgroups,
	Andrew Morton, Sha Zhengju, devel, Jens Axboe

On Tue, Jul 09, 2013 at 06:16:05AM -0700, Tejun Heo wrote:
> On Tue, Jul 09, 2013 at 05:15:14PM +0400, Konstantin Khlebnikov wrote:
> > blkio controls block devices, not filesystems or superblocks or bdi or pagecache.
> > It's all about the block layer and nothing more. Am I right?
> >
> > So, you want to link some completely unrelated subsystems like NFS into the block layer?
> 
> Heh, yeah, sure, network QoS is completely unrelated to sockets too,
> right?

And, no, blkio wouldn't have anything to do with NFS.  Where did you
get that idea?

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-09 13:15           ` Konstantin Khlebnikov
@ 2013-07-09 13:16             ` Tejun Heo
  2013-07-09 13:16               ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2013-07-09 13:16 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Vivek Goyal, linux-mm, linux-kernel, Michal Hocko, cgroups,
	Andrew Morton, Sha Zhengju, devel, Jens Axboe

On Tue, Jul 09, 2013 at 05:15:14PM +0400, Konstantin Khlebnikov wrote:
> blkio controls block devices, not filesystems or superblocks or bdi or pagecache.
> It's all about the block layer and nothing more. Am I right?
>
> So, you want to link some completely unrelated subsystems like NFS into the block layer?

Heh, yeah, sure, network QoS is completely unrelated to sockets too,
right?

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-09 12:57         ` Tejun Heo
@ 2013-07-09 13:15           ` Konstantin Khlebnikov
  2013-07-09 13:16             ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread
From: Konstantin Khlebnikov @ 2013-07-09 13:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vivek Goyal, linux-mm, linux-kernel, Michal Hocko, cgroups,
	Andrew Morton, Sha Zhengju, devel, Jens Axboe

Tejun Heo wrote:
> Hello,
>
> On Tue, Jul 09, 2013 at 12:28:15PM +0400, Konstantin Khlebnikov wrote:
>> Yep, blkio has plenty of problems and flaws, and I don't get how it's related
>> to the vfs layer, dirty set control, and non-disk or network-backed filesystems.
>> Any problem can be fixed by introducing a new abstraction layer, except too many
>> abstraction levels. Cgroup is a pluggable subsystem; blkio has its own plugins
>> and is built on top of the io scheduler plugin. All this stuff has always worked
>
> What does that have to do with anything?
>
>> with block devices. Now you suggest handling all filesystems in this stack.
>> I think binding them to an unrelated cgroup is a gross layering violation.
>
> How is blkio unrelated to filesystems mounted on block devices?
> You're suggesting a duplicate solution which can't be complete.

blkio controls block devices, not filesystems or superblocks or bdi or pagecache.
It's all about the block layer and nothing more. Am I right?

So, you want to link some completely unrelated subsystems like NFS into the block layer?

>
>> NFS cannot be controlled only by network throttlers, because we
>> cannot slow down the writeback process when it happens; we must slow
>> down the tasks that generate the dirty memory.
>
> That's exactly the same problem why blkio doesn't work for async IOs
> right now, so if you're interested in the area, please contribute to
> fixing that problem.
>
>> Plus, it's close to impossible to separate several workloads if they
>> share one NFS sb.
>
> Again, it's the same problem as with blkio.  We need separate pressure
> channels on the bdi for each cgroup.
>
> Thanks.
>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-09  8:28       ` Konstantin Khlebnikov
@ 2013-07-09 12:57         ` Tejun Heo
  2013-07-09 13:15           ` Konstantin Khlebnikov
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2013-07-09 12:57 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Vivek Goyal, linux-mm, linux-kernel, Michal Hocko, cgroups,
	Andrew Morton, Sha Zhengju, devel, Jens Axboe

Hello,

On Tue, Jul 09, 2013 at 12:28:15PM +0400, Konstantin Khlebnikov wrote:
> Yep, blkio has plenty of problems and flaws, and I don't get how it's related
> to the vfs layer, dirty set control, and non-disk or network-backed filesystems.
> Any problem can be fixed by introducing a new abstraction layer, except too many
> abstraction levels. Cgroup is a pluggable subsystem; blkio has its own plugins
> and is built on top of the io scheduler plugin. All this stuff has always worked

What does that have to do with anything?

> with block devices. Now you suggest handling all filesystems in this stack.
> I think binding them to an unrelated cgroup is a gross layering violation.

How is blkio unrelated to filesystems mounted on block devices?
You're suggesting a duplicate solution which can't be complete.

> NFS cannot be controlled only by network throttlers, because we
> cannot slow down the writeback process when it happens; we must slow
> down the tasks that generate the dirty memory.

That's exactly the same problem why blkio doesn't work for async IOs
right now, so if you're interested in the area, please contribute to
fixing that problem.

> Plus, it's close to impossible to separate several workloads if they
> share one NFS sb.

Again, it's the same problem as with blkio.  We need separate pressure
channels on the bdi for each cgroup.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-08 17:56     ` Tejun Heo
@ 2013-07-09  8:28       ` Konstantin Khlebnikov
  2013-07-09 12:57         ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread
From: Konstantin Khlebnikov @ 2013-07-09  8:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vivek Goyal, linux-mm, linux-kernel, Michal Hocko, cgroups,
	Andrew Morton, Sha Zhengju, devel, Jens Axboe

Tejun Heo wrote:
> Hello, Vivek.
>
> On Mon, Jul 08, 2013 at 01:52:01PM -0400, Vivek Goyal wrote:
>>> Again, a problem to be fixed in the stack rather than patched up from
>>> above.  The right thing to do is to propagate pressure through the bdi
>>> properly and let whatever is backing the bdi generate the appropriate
>>> amount of pressure, be that disk or network.
>>
>> Ok, so use the network controller for controlling the IO rate on NFS? I
>> tried it once and it did not work. I think it had problems related
>> to losing the context info as the IO propagated through the stack. So
>> we will have to fix that too.
>
> But that's a similar problem to what we have with blkcg anyway - losing
> the dirtier information by the time writeback comes down through the bdi.
> It might not be exactly the same and might need some impedance matching
> on the network side, but I don't see any fundamental differences.
>
> Thanks.
>

Yep, blkio has plenty of problems and flaws, and I don't get how it's related
to the vfs layer, dirty set control, and non-disk or network-backed filesystems.
Any problem can be fixed by introducing a new abstraction layer, except too many
abstraction levels. Cgroup is a pluggable subsystem; blkio has its own plugins
and is built on top of the io scheduler plugin. All this stuff has always worked
with block devices. Now you suggest handling all filesystems in this stack.
I think binding them to an unrelated cgroup is a gross layering violation.

NFS cannot be controlled only by network throttlers, because we cannot slow down
the writeback process when it happens; we must slow down the tasks that generate
the dirty memory. Plus, it's close to impossible to separate several workloads if
they share one NFS sb.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-08 10:01 Konstantin Khlebnikov
  2013-07-08 17:00 ` Tejun Heo
@ 2013-07-08 18:11 ` Vivek Goyal
  2013-07-09 15:39 ` Theodore Ts'o
  2 siblings, 0 replies; 29+ messages in thread
From: Vivek Goyal @ 2013-07-08 18:11 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, Michal Hocko, cgroups, Andrew Morton,
	Sha Zhengju, devel, Greg Thelen

On Mon, Jul 08, 2013 at 02:01:39PM +0400, Konstantin Khlebnikov wrote:
> RESEND: fix CC
> 
> This is a proof of concept, just basic functionality for an IO controller.
> This cgroup will control filesystem usage at the vfs layer; its main goal is
> bandwidth control. It's supposed to be much more lightweight than memcg/blkio.
>
> This patch shows an easy way to account pages in the dirty/writeback state in
> a per-inode manner. This is easier than doing it in memcg in a per-page manner.

In the past other developers (CC Greg Thelen <gthelen@google.com>) have posted
patches where the inode was associated with a memcg for writeback accounting.

So accounting was per inode, not per page. We lose granularity in the
process if two processes in different cgroups are doing IO to the same file,
but it was considered a good enough approximation in general.

> The main idea is to keep on each inode a pointer (->i_fsio) to the cgroup
> which owns the dirty data in that inode. It's set by
> fsio_account_page_dirtied() when the first dirty tag appears in the inode.
> Relying on the mapping tags gives us locking for free; this patch doesn't
> add any new locks to hot paths.
>
> Unlike blkio, this method works for all filesystems, not just disk-backed
> ones. It's also able to handle writeback, because each inode has a context
> which can be used in the writeback thread to account io operations.
>
> This is an early prototype; I have some plans for extra functionality,
> because this accounting by itself is mostly useless, but it can be used as
> a basis for more useful features.
>
> Planned improvements:
> * Split bdis into several tiers and account them separately, for example:
>   hdd/ssd/usb/nfs. In complicated containerized environments there might be
>   different kinds of storage with different limits and billing. This is more
>   useful than independent per-disk accounting and much easier to implement,
>   because all per-tier structures are allocated before any disk appears.

What does this mean? Within a cgroup there are different IO rate rules
depending on what is backing the bdi? If yes, then can't this info be
exported to user space (if it is not already), so that user space can
set the rules accordingly per disk?

W.r.t. the issue of being able to apply rules only when a disk appears, I
think we will need a solution for this in user space which applies the
rules when the disk shows up.  I think systemd is planning to take care of
this to some extent, where one can specify IO bandwidth limits and these
limits get applied when the disk shows up. I am not sure whether it allows
different limits for different disks or not.


> * Add some hooks for accounting actually issued IO requests (iops).
> * Implement bandwidth throttlers for each tier individually (bps and iops).
>   This will be the tastiest feature. I already have a very effective prototype.

Can you please give some details here, and also explain why blkcg cannot
do the same?

> * Add a hook into balance_dirty_pages to limit the amount of dirty pages for
>   each cgroup in each tier individually. This is required for accurate
>   throttling, because if we want to limit the speed of writeback we must also
>   limit the amount of dirty pages; otherwise we have to inject an enormous
>   delay after each sync().

So there is a separate knob for limiting the buffered write rate, and it
is not accounted towards direct writes etc.?

Anyway, limiting dirty pages per memory cgroup will be required even if
we use blkcg.

> * Implement filtered writeback requests for writing only the data which
>   belongs to a particular fsio cgroup (or cgroup tree), to keep the dirty
>   balance in the background.

I think Greg had done similar modifications, putting inodes onto
cgroup-specific writeback lists and modifying the writeback logic.

So quite a few pieces w.r.t. balance_dirty_pages() and inode tagging are
common even if we support writeback limits using blkcg. The only question
we will have to figure out is why all this functionality should be
implemented with a new cgroup controller instead of enhancing blkcg.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-08 17:52   ` Vivek Goyal
@ 2013-07-08 17:56     ` Tejun Heo
  2013-07-09  8:28       ` Konstantin Khlebnikov
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2013-07-08 17:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Konstantin Khlebnikov, linux-mm, linux-kernel, Michal Hocko,
	cgroups, Andrew Morton, Sha Zhengju, devel, Jens Axboe

Hello, Vivek.

On Mon, Jul 08, 2013 at 01:52:01PM -0400, Vivek Goyal wrote:
> > Again, a problem to be fixed in the stack rather than patched up from
> > above.  The right thing to do is to propagate pressure through the bdi
> > properly and let whatever is backing the bdi generate the appropriate
> > amount of pressure, be that disk or network.
> 
> Ok, so use the network controller for controlling the IO rate on NFS? I
> tried it once and it did not work. I think it had problems related
> to losing the context info as the IO propagated through the stack. So
> we will have to fix that too.

But that's a similar problem to what we have with blkcg anyway - losing
the dirtier information by the time writeback comes down through the bdi.
It might not be exactly the same and might need some impedance matching
on the network side, but I don't see any fundamental differences.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-08 17:00 ` Tejun Heo
@ 2013-07-08 17:52   ` Vivek Goyal
  2013-07-08 17:56     ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread
From: Vivek Goyal @ 2013-07-08 17:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Konstantin Khlebnikov, linux-mm, linux-kernel, Michal Hocko,
	cgroups, Andrew Morton, Sha Zhengju, devel, Jens Axboe

On Mon, Jul 08, 2013 at 10:00:47AM -0700, Tejun Heo wrote:
> (cc'ing Vivek and Jens)
> 
> Hello,
> 
> On Mon, Jul 08, 2013 at 02:01:39PM +0400, Konstantin Khlebnikov wrote:
> > This is a proof of concept, just basic functionality for an IO controller.
> > This cgroup will control filesystem usage at the vfs layer; its main goal is
> > bandwidth control. It's supposed to be much more lightweight than memcg/blkio.
> 
> While blkcg is pretty heavy handed right now, there's no inherent
> reason for it to be that way.  The right thing to do would be updating
> blkcg to be light-weight rather than adding yet another controller.
> Also, all controllers should support full hierarchy.

Agreed.

Looks like he is looking to implement only IO throttling with max upper
limits in the fsio controller. And I thought the IO throttling part of blkcg
was pretty lightweight. Konstantin, is that not the case? Or do you find even
the throttling functionality to be heavyweight? If you have ideas to make
it lighter weight, we can always change it.

> 
> > Unlike blkio, this method works for all filesystems, not just disk-backed
> > ones. It's also able to handle writeback, because each inode has a context
> > which can be used in the writeback thread to account io operations.
> 
> Again, a problem to be fixed in the stack rather than patched up from
> above.  The right thing to do is to propagate pressure through the bdi
> properly and let whatever is backing the bdi generate the appropriate
> amount of pressure, be that disk or network.

Ok, so use the network controller for controlling the IO rate on NFS? I
tried it once and it did not work. I think it had problems related
to losing the context info as the IO propagated through the stack. So
we will have to fix that too.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC] fsio: filesystem io accounting cgroup
  2013-07-08 10:01 Konstantin Khlebnikov
@ 2013-07-08 17:00 ` Tejun Heo
  2013-07-08 17:52   ` Vivek Goyal
  2013-07-08 18:11 ` Vivek Goyal
  2013-07-09 15:39 ` Theodore Ts'o
  2 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2013-07-08 17:00 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, linux-kernel, Michal Hocko, cgroups, Andrew Morton,
	Sha Zhengju, devel, Vivek Goyal, Jens Axboe

(cc'ing Vivek and Jens)

Hello,

On Mon, Jul 08, 2013 at 02:01:39PM +0400, Konstantin Khlebnikov wrote:
> This is a proof of concept, just basic functionality for an IO controller.
> This cgroup will control filesystem usage at the vfs layer; its main goal is
> bandwidth control. It's supposed to be much more lightweight than memcg/blkio.

While blkcg is pretty heavy handed right now, there's no inherent
reason for it to be that way.  The right thing to do would be updating
blkcg to be light-weight rather than adding yet another controller.
Also, all controllers should support full hierarchy.

> Unlike blkio, this method works for all filesystems, not just disk-backed
> ones. It's also able to handle writeback, because each inode has a context
> which can be used in the writeback thread to account io operations.

Again, a problem to be fixed in the stack rather than patched up from
above.  The right thing to do is to propagate pressure through the bdi
properly and let whatever is backing the bdi generate the appropriate
amount of pressure, be that disk or network.

> This is an early prototype; I have some plans for extra functionality,
> because this accounting by itself is mostly useless, but it can be used as
> a basis for more useful features.

I'm afraid it'd have a very low chance of making it upstream.  If this area
of work is interesting / important, please look into improving blkcg
rather than working around it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH RFC] fsio: filesystem io accounting cgroup
@ 2013-07-08 10:01 Konstantin Khlebnikov
  2013-07-08 17:00 ` Tejun Heo
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Konstantin Khlebnikov @ 2013-07-08 10:01 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, cgroups, Andrew Morton, Sha Zhengju, devel

RESEND: fix CC

This is a proof of concept, just basic functionality for an IO controller.
This cgroup will control filesystem usage at the vfs layer; its main goal is
bandwidth control. It's supposed to be much more lightweight than memcg/blkio.

This patch shows an easy way to account pages in the dirty/writeback state in
a per-inode manner. This is easier than doing it in memcg in a per-page manner.
The main idea is to keep on each inode a pointer (->i_fsio) to the cgroup
which owns the dirty data in that inode. It's set by
fsio_account_page_dirtied() when the first dirty tag appears in the inode.
Relying on the mapping tags gives us locking for free; this patch doesn't add
any new locks to hot paths.

Unlike blkio, this method works for all filesystems, not just disk-backed
ones. It's also able to handle writeback, because each inode has a context
which can be used in the writeback thread to account io operations.

This is an early prototype; I have some plans for extra functionality,
because this accounting by itself is mostly useless, but it can be used as
a basis for more useful features.

Planned improvements:
* Split bdis into several tiers and account them separately, for example:
  hdd/ssd/usb/nfs. In complicated containerized environments there might be
  different kinds of storage with different limits and billing. This is more
  useful than independent per-disk accounting and much easier to implement,
  because all per-tier structures are allocated before any disk appears.
* Add some hooks for accounting actually issued IO requests (iops).
* Implement bandwidth throttlers for each tier individually (bps and iops).
  This will be the tastiest feature. I already have a very effective prototype.
* Add a hook into balance_dirty_pages to limit the amount of dirty pages for
  each cgroup in each tier individually. This is required for accurate
  throttling, because if we want to limit the speed of writeback we must also
  limit the amount of dirty pages; otherwise we have to inject an enormous
  delay after each sync().
* Implement filtered writeback requests for writing only the data which
  belongs to a particular fsio cgroup (or cgroup tree), to keep the dirty
  balance in the background.
* Implement a filtered 'sync', a special mode for sync() which syncs only the
  filesystems that 'belong' to the current fsio cgroup. Each container should
  sync only its own filesystems. This can also be expressed in terms of
  'visibility' in vfsmount namespaces.

This patch lays on top of this:
b26008c page_writeback: put account_page_redirty() after set_page_dirty()
80979bd page_writeback: get rid of account_size argument in cancel_dirty_page()
c575ef6 hugetlbfs: remove cancel_dirty_page() from truncate_huge_page()
b720923 nfs: remove redundant cancel_dirty_page() from nfs_wb_page_cancel()
4c21e52 mm: remove redundant dirty pages check from __delete_from_page_cache()

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: cgroups@vger.kernel.org
Cc: devel@openvz.org
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Sha Zhengju <handai.szj@gmail.com>
---
 block/blk-core.c              |    2 +
 fs/Makefile                   |    2 +
 fs/direct-io.c                |    2 +
 fs/fsio_cgroup.c              |  137 +++++++++++++++++++++++++++++++++++++++++
 fs/nfs/direct.c               |    2 +
 include/linux/cgroup_subsys.h |    6 ++
 include/linux/fs.h            |    3 +
 include/linux/fsio_cgroup.h   |  136 +++++++++++++++++++++++++++++++++++++++++
 init/Kconfig                  |    3 +
 mm/page-writeback.c           |    8 ++
 mm/readahead.c                |    2 +
 mm/truncate.c                 |    2 +
 12 files changed, 304 insertions(+), 1 deletion(-)
 create mode 100644 fs/fsio_cgroup.c
 create mode 100644 include/linux/fsio_cgroup.h

diff --git a/block/blk-core.c b/block/blk-core.c
index 93a18d1..5f4e6c9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,6 +26,7 @@
 #include <linux/swap.h>
 #include <linux/writeback.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/fsio_cgroup.h>
 #include <linux/fault-inject.h>
 #include <linux/list_sort.h>
 #include <linux/delay.h>
@@ -1866,6 +1867,7 @@ void submit_bio(int rw, struct bio *bio)
 			count_vm_events(PGPGOUT, count);
 		} else {
 			task_io_account_read(bio->bi_size);
+			fsio_account_read(bio->bi_size);
 			count_vm_events(PGPGIN, count);
 		}
 
diff --git a/fs/Makefile b/fs/Makefile
index 4fe6df3..d4432e7 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_FHANDLE)		+= fhandle.o
 
 obj-y				+= quota/
 
+obj-$(CONFIG_FSIO_CGROUP)	+= fsio_cgroup.o
+
 obj-$(CONFIG_PROC_FS)		+= proc/
 obj-$(CONFIG_SYSFS)		+= sysfs/
 obj-$(CONFIG_CONFIGFS_FS)	+= configfs/
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 7ab90f5..0fe99af 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -28,6 +28,7 @@
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/fsio_cgroup.h>
 #include <linux/bio.h>
 #include <linux/wait.h>
 #include <linux/err.h>
@@ -722,6 +723,7 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 		 * Read accounting is performed in submit_bio()
 		 */
 		task_io_account_write(len);
+		fsio_account_write(len);
 	}
 
 	/*
diff --git a/fs/fsio_cgroup.c b/fs/fsio_cgroup.c
new file mode 100644
index 0000000..5d5158f
--- /dev/null
+++ b/fs/fsio_cgroup.c
@@ -0,0 +1,137 @@
+#include <linux/fsio_cgroup.h>
+#include "internal.h"
+
+static inline struct fsio_cgroup *cgroup_fsio(struct cgroup *cgroup)
+{
+	return container_of(cgroup_subsys_state(cgroup, fsio_subsys_id),
+			    struct fsio_cgroup, css);
+}
+
+static void fsio_free(struct fsio_cgroup *fsio)
+{
+	percpu_counter_destroy(&fsio->read_bytes);
+	percpu_counter_destroy(&fsio->write_bytes);
+	percpu_counter_destroy(&fsio->nr_dirty);
+	percpu_counter_destroy(&fsio->nr_writeback);
+	kfree(fsio);
+}
+
+static struct cgroup_subsys_state *fsio_css_alloc(struct cgroup *cgroup)
+{
+	struct fsio_cgroup *fsio;
+
+	fsio = kzalloc(sizeof(struct fsio_cgroup), GFP_KERNEL);
+	if (!fsio)
+		return ERR_PTR(-ENOMEM);
+
+	if (percpu_counter_init(&fsio->read_bytes, 0) ||
+	    percpu_counter_init(&fsio->write_bytes, 0) ||
+	    percpu_counter_init(&fsio->nr_dirty, 0) ||
+	    percpu_counter_init(&fsio->nr_writeback, 0)) {
+		fsio_free(fsio);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	return &fsio->css;
+}
+
+/* switch all ->i_fsio references to the parent cgroup */
+static void fsio_switch_sb(struct super_block *sb, void *_fsio)
+{
+	struct fsio_cgroup *fsio = _fsio;
+	struct fsio_cgroup *parent = cgroup_fsio(fsio->css.cgroup->parent);
+	struct address_space *mapping;
+	struct inode *inode;
+
+	spin_lock(&inode_sb_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		mapping = inode->i_mapping;
+		if (mapping->i_fsio == fsio &&
+		    (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) ||
+		     mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))) {
+			spin_lock_irq(&mapping->tree_lock);
+			if (mapping->i_fsio == fsio)
+				mapping->i_fsio = parent;
+			spin_unlock_irq(&mapping->tree_lock);
+		}
+	}
+	spin_unlock(&inode_sb_list_lock);
+}
+
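+/*
+ * Move this group's dirty/writeback charges to the parent: transfer the
+ * counted sums first, repoint the inodes to the parent, then transfer
+ * whatever was accounted concurrently in between.
+ */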
+static void fsio_css_free(struct cgroup *cgroup)
+{
+	struct fsio_cgroup *fsio = cgroup_fsio(cgroup);
+	struct fsio_cgroup *parent = cgroup_fsio(fsio->css.cgroup->parent);
+	u64 nr_dirty, nr_writeback, tmp;
+
+	nr_dirty = percpu_counter_sum_positive(&fsio->nr_dirty);
+	percpu_counter_add(&parent->nr_dirty, nr_dirty);
+
+	nr_writeback = percpu_counter_sum_positive(&fsio->nr_writeback);
+	percpu_counter_add(&parent->nr_writeback, nr_writeback);
+
+	iterate_supers(fsio_switch_sb, fsio);
+
+	tmp = percpu_counter_sum(&fsio->nr_dirty);
+	percpu_counter_add(&parent->nr_dirty, tmp - nr_dirty);
+
+	tmp = percpu_counter_sum(&fsio->nr_writeback);
+	percpu_counter_add(&parent->nr_writeback, tmp - nr_writeback);
+}
+
+static u64 fsio_get_read_bytes(struct cgroup *cgroup, struct cftype *cft)
+{
+	struct fsio_cgroup *fsio = cgroup_fsio(cgroup);
+
+	return percpu_counter_sum(&fsio->read_bytes);
+}
+
+static u64 fsio_get_write_bytes(struct cgroup *cgroup, struct cftype *cft)
+{
+	struct fsio_cgroup *fsio = cgroup_fsio(cgroup);
+
+	return percpu_counter_sum(&fsio->write_bytes);
+}
+
+static u64 fsio_get_dirty_bytes(struct cgroup *cgroup, struct cftype *cft)
+{
+	struct fsio_cgroup *fsio = cgroup_fsio(cgroup);
+
+	return percpu_counter_sum_positive(&fsio->nr_dirty) * PAGE_CACHE_SIZE;
+}
+
+static u64 fsio_get_writeback_bytes(struct cgroup *cgroup, struct cftype *cft)
+{
+	struct fsio_cgroup *fsio = cgroup_fsio(cgroup);
+
+	return percpu_counter_sum_positive(&fsio->nr_writeback) *
+		PAGE_CACHE_SIZE;
+}
+
+static struct cftype fsio_files[] = {
+	{
+		.name = "read_bytes",
+		.read_u64 = fsio_get_read_bytes,
+	},
+	{
+		.name = "write_bytes",
+		.read_u64 = fsio_get_write_bytes,
+	},
+	{
+		.name = "dirty_bytes",
+		.read_u64 = fsio_get_dirty_bytes,
+	},
+	{
+		.name = "writeback_bytes",
+		.read_u64 = fsio_get_writeback_bytes,
+	},
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys fsio_subsys = {
+	.name = "fsio",
+	.subsys_id = fsio_subsys_id,
+	.css_alloc = fsio_css_alloc,
+	.css_free = fsio_css_free,
+	.base_cftypes = fsio_files,
+};
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 0bd7a55..b8e61ee 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -46,6 +46,7 @@
 #include <linux/kref.h>
 #include <linux/slab.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/fsio_cgroup.h>
 #include <linux/module.h>
 
 #include <linux/nfs_fs.h>
@@ -987,6 +988,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
 		goto out;
 
 	task_io_account_write(count);
+	fsio_account_write(count);
 
 	retval = nfs_direct_write(iocb, iov, nr_segs, pos, count, uio);
 	if (retval > 0) {
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 6e7ec64..d16df6e 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -84,3 +84,9 @@ SUBSYS(bcache)
 #endif
 
 /* */
+
+#if IS_SUBSYS_ENABLED(CONFIG_FSIO_CGROUP)
+SUBSYS(fsio)
+#endif
+
+/* */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 99be011..8156d99 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -421,6 +421,9 @@ struct address_space {
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	void			*private_data;	/* ditto */
+#ifdef CONFIG_FSIO_CGROUP
+	struct fsio_cgroup	*i_fsio;	/* protected with tree_lock */
+#endif
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
diff --git a/include/linux/fsio_cgroup.h b/include/linux/fsio_cgroup.h
new file mode 100644
index 0000000..78aa53b
--- /dev/null
+++ b/include/linux/fsio_cgroup.h
@@ -0,0 +1,136 @@
+#ifndef _LINUX_FSIO_CGROUP_H
+#define _LINUX_FSIO_CGROUP_H
+
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/cgroup.h>
+#include <linux/workqueue.h>
+#include <linux/percpu_counter.h>
+
+struct fsio_cgroup {
+	struct cgroup_subsys_state css;
+	struct percpu_counter read_bytes;	/* bytes read: cached + direct-io */
+	struct percpu_counter write_bytes;	/* bytes written: writeback + direct-io */
+	struct percpu_counter nr_dirty;		/* pages currently dirty */
+	struct percpu_counter nr_writeback;	/* pages currently under writeback */
+};
+
+#ifdef CONFIG_FSIO_CGROUP
+
+static inline struct fsio_cgroup *task_fsio_cgroup(struct task_struct *task)
+{
+	return container_of(task_subsys_state(task, fsio_subsys_id),
+			    struct fsio_cgroup, css);
+}
+
+/*
+ * This accounts all reads, both cached and direct-io.
+ */
+static inline void fsio_account_read(unsigned long bytes)
+{
+	struct task_struct *task = current;
+	struct fsio_cgroup *fsio;
+
+	rcu_read_lock();
+	fsio = task_fsio_cgroup(task);
+	__percpu_counter_add(&fsio->read_bytes, bytes,
+			PAGE_CACHE_SIZE * percpu_counter_batch);
+	rcu_read_unlock();
+}
+
+/*
+ * This is used for accounting direct-io writes.
+ */
+static inline void fsio_account_write(unsigned long bytes)
+{
+	struct task_struct *task = current;
+	struct fsio_cgroup *fsio;
+
+	rcu_read_lock();
+	fsio = task_fsio_cgroup(task);
+	__percpu_counter_add(&fsio->write_bytes, bytes,
+			PAGE_CACHE_SIZE * percpu_counter_batch);
+	rcu_read_unlock();
+}
+
+/*
+ * This is called under mapping->tree_lock before setting the radix-tree tag;
+ * the page which caused this call is either locked (write) or mapped (unmap).
+ *
+ * If this is the first dirty page in the inode and there is no writeback, the
+ * current fsio cgroup becomes the owner of this inode. The following
+ * radix_tree_tag_set() pins this pointer until the end of writeback or
+ * truncate. Otherwise mapping->i_fsio is valid and points to the owning cgroup.
+ */
+static inline void fsio_account_page_dirtied(struct address_space *mapping)
+{
+	struct fsio_cgroup *fsio = mapping->i_fsio;
+
+	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) &&
+	    !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+		rcu_read_lock();
+		fsio = task_fsio_cgroup(current);
+		mapping->i_fsio = fsio;
+		rcu_read_unlock();
+	}
+
+	percpu_counter_inc(&fsio->nr_dirty);
+}
+
+/*
+ * This is called after clearing the dirty bit. The page here is locked and
+ * unmapped, so dirtying is complete and mapping->i_fsio is valid. This state
+ * is stable because the page is still in the mapping and tagged. We cannot
+ * reset mapping->i_fsio here even if that was the last dirty page, because
+ * this hook is called without tree_lock, before the page leaves page cache.
+ */
+static inline void fsio_cancel_dirty_page(struct address_space *mapping)
+{
+	percpu_counter_dec(&mapping->i_fsio->nr_dirty);
+}
+
+/*
+ * This is called after redirtying a page, so nr_dirty will not fall to zero.
+ * No tree_lock here; the page is locked and still tagged in the mapping.
+ */
+static inline void fsio_account_page_redirty(struct address_space *mapping)
+{
+	percpu_counter_dec(&mapping->i_fsio->nr_dirty);
+}
+
+/*
+ * This switches the page's accounting from dirty to writeback.
+ */
+static inline void fsio_set_page_writeback(struct address_space *mapping)
+{
+	struct fsio_cgroup *fsio = mapping->i_fsio;
+
+	percpu_counter_inc(&fsio->nr_writeback);
+	percpu_counter_dec(&fsio->nr_dirty);
+}
+
+/*
+ * Writeback is done; the mapping->i_fsio pointer becomes invalid after this.
+ */
+static inline void fsio_clear_page_writeback(struct address_space *mapping)
+{
+	struct fsio_cgroup *fsio = mapping->i_fsio;
+
+	__percpu_counter_add(&fsio->write_bytes, PAGE_CACHE_SIZE,
+			PAGE_CACHE_SIZE * percpu_counter_batch);
+	percpu_counter_dec(&fsio->nr_writeback);
+}
+
+#else /* CONFIG_FSIO_CGROUP */
+
+static inline void fsio_account_read(unsigned long bytes) {}
+static inline void fsio_account_write(unsigned long bytes) {}
+static inline void fsio_account_page_dirtied(struct address_space *mapping) {}
+static inline void fsio_cancel_dirty_page(struct address_space *mapping) {}
+static inline void fsio_account_page_redirty(struct address_space *mapping) {}
+static inline void fsio_set_page_writeback(struct address_space *mapping) {}
+static inline void fsio_clear_page_writeback(struct address_space *mapping) {}
+
+#endif /* CONFIG_FSIO_CGROUP */
+
+#endif /* _LINUX_FSIO_CGROUP_H */
diff --git a/init/Kconfig b/init/Kconfig
index ea1be00..689b29a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1050,6 +1050,9 @@ config DEBUG_BLK_CGROUP
 	Enable some debugging help. Currently it exports additional stat
 	files in a cgroup which can be useful for debugging.
 
+config FSIO_CGROUP
+	bool "Filesystem IO controller"
+
 endif # CGROUPS
 
 config CHECKPOINT_RESTORE
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a599f38..bc39a36 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/fsio_cgroup.h>
 #include <linux/blkdev.h>
 #include <linux/mpage.h>
 #include <linux/rmap.h>
@@ -1994,6 +1995,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
 		task_io_account_write(PAGE_CACHE_SIZE);
+		fsio_account_page_dirtied(mapping);
 		current->nr_dirtied++;
 		this_cpu_inc(bdp_ratelimits);
 	}
@@ -2071,6 +2073,7 @@ void account_page_redirty(struct page *page)
 		current->nr_dirtied--;
 		dec_zone_page_state(page, NR_DIRTIED);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
+		fsio_account_page_redirty(mapping);
 	}
 }
 EXPORT_SYMBOL(account_page_redirty);
@@ -2242,6 +2245,7 @@ int test_clear_page_writeback(struct page *page)
 			if (bdi_cap_account_writeback(bdi)) {
 				__dec_bdi_stat(bdi, BDI_WRITEBACK);
 				__bdi_writeout_inc(bdi);
+				fsio_clear_page_writeback(mapping);
 			}
 		}
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -2270,8 +2274,10 @@ int test_set_page_writeback(struct page *page)
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-			if (bdi_cap_account_writeback(bdi))
+			if (bdi_cap_account_writeback(bdi)) {
 				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+				fsio_set_page_writeback(mapping);
+			}
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
diff --git a/mm/readahead.c b/mm/readahead.c
index 829a77c..530599c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -15,6 +15,7 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/fsio_cgroup.h>
 #include <linux/pagevec.h>
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
@@ -102,6 +103,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 			break;
 		}
 		task_io_account_read(PAGE_CACHE_SIZE);
+		fsio_account_read(PAGE_CACHE_SIZE);
 	}
 	return ret;
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index e212252..e84668e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -17,6 +17,7 @@
 #include <linux/highmem.h>
 #include <linux/pagevec.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/fsio_cgroup.h>
 #include <linux/buffer_head.h>	/* grr. try_to_release_page,
 				   do_invalidatepage */
 #include <linux/cleancache.h>
@@ -75,6 +76,7 @@ void cancel_dirty_page(struct page *page)
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
 			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
+			fsio_cancel_dirty_page(mapping);
 		}
 	}
 }


^ permalink raw reply related	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2013-07-10  8:37 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-08  9:59 [PATCH RFC] fsio: filesystem io accounting cgroup Konstantin Khlebnikov
2013-07-10  4:43 ` Sha Zhengju
2013-07-10  6:03   ` Konstantin Khlebnikov
2013-07-10  8:37     ` Sha Zhengju
2013-07-08 10:01 Konstantin Khlebnikov
2013-07-08 17:00 ` Tejun Heo
2013-07-08 17:52   ` Vivek Goyal
2013-07-08 17:56     ` Tejun Heo
2013-07-09  8:28       ` Konstantin Khlebnikov
2013-07-09 12:57         ` Tejun Heo
2013-07-09 13:15           ` Konstantin Khlebnikov
2013-07-09 13:16             ` Tejun Heo
2013-07-09 13:16               ` Tejun Heo
2013-07-09 13:43                 ` Konstantin Khlebnikov
2013-07-09 13:45                   ` Tejun Heo
2013-07-09 14:18                     ` Vivek Goyal
2013-07-09 14:29                       ` Tejun Heo
2013-07-09 14:54                         ` Vivek Goyal
2013-07-09 15:08                           ` Tejun Heo
     [not found]                             ` <20130710030955.GA3569@redhat.com>
2013-07-10  3:50                               ` Tejun Heo
2013-07-09 14:35                     ` Konstantin Khlebnikov
2013-07-09 14:42                       ` Tejun Heo
2013-07-09 15:06                       ` Vivek Goyal
2013-07-09 17:42                         ` Konstantin Khlebnikov
2013-07-09 18:35                           ` Vivek Goyal
2013-07-09 20:54                             ` Konstantin Khlebnikov
2013-07-08 18:11 ` Vivek Goyal
2013-07-09 15:39 ` Theodore Ts'o
2013-07-09 17:12   ` Konstantin Khlebnikov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).